[BEAM-6812]: Convert keys to ByteArray in Combine.perKey to make sure hashCode is consistent #8042
Conversation
Force-pushed from b0f74d0 to 86e9bad
@iemejia - Please review.
Force-pushed from 9094205 to a5dcf2c
return (bundleSize > 0)
-    ? null
-    : new HashPartitioner(context.getSparkContext().defaultParallelism());
+    ? CustomSparkHashPartitioner.of(context.getSparkContext().defaultParallelism())
This logic seems to be accidentally flipped?
Agreed, this logic is odd. I don't think EvaluationContext is enough to properly resolve parallelism for the shuffle operation.
In case bundleSize > 0, we should reuse the parallelism of the input RDD. This way we would get rid of the nullable case and we could simplify the GBK and CPK translations.
Spark's default partitioner does the same thing: org.apache.spark.Partitioner#defaultPartitioner
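A minimal sketch of that suggestion, under a simplified setup (the helper name `partitionerFor` and its parameters are illustrative, not the runner's actual API):

```java
import org.apache.spark.HashPartitioner;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaRDD;

class PartitionerSelection {
  // Illustrative helper: derive the shuffle partitioner from the input RDD when
  // bundleSize > 0, instead of returning null, so the GBK/CPK translations never
  // have to handle a nullable partitioner.
  static Partitioner partitionerFor(JavaRDD<?> input, long bundleSize, int defaultParallelism) {
    if (bundleSize > 0) {
      // Reuse the parallelism of the input RDD, similar in spirit to
      // org.apache.spark.Partitioner#defaultPartitioner.
      return new HashPartitioner(input.getNumPartitions());
    }
    // bundleSize == 0: keep the pre-configured default parallelism.
    return new HashPartitioner(defaultParallelism);
  }
}
```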
...ers/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java (outdated comment, resolved)
Force-pushed from a5dcf2c to 9254352
@iemejia: Please take a look at this when you get a chance. Thanks.
Sorry for the delay, I have not forgotten about this one, just swamped with other PRs. @dmvk, could you help me review this one, please?
Run Portable_Python PreCommit
Nice catch! This problem goes way deeper: we should never shuffle raw user data in the first place (we should always use a Beam coder to serialize data before partitioning even happens). If you take a look at the GBK implementation, it uses ByteArray instead of byte[] (which has a correct hashCode implementation). I guess tests did not catch this because they use primitive types (which Spark can serialize on its own) as keys. Also, tests are run in a single JVM, so hashCode is stable. The correct fix would be:
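To illustrate the hashCode point above (not the elided fix): a small standalone sketch, where `ByteArrayKey` is just an illustrative stand-in for a content-based wrapper like the runner's ByteArray, not the actual class.

```java
import java.util.Arrays;

class ByteArrayHashDemo {
  // Minimal stand-in for a value-based byte[] wrapper; illustrative only.
  static final class ByteArrayKey {
    private final byte[] value;
    ByteArrayKey(byte[] value) { this.value = value; }
    @Override public boolean equals(Object o) {
      return o instanceof ByteArrayKey && Arrays.equals(value, ((ByteArrayKey) o).value);
    }
    @Override public int hashCode() { return Arrays.hashCode(value); }
  }

  public static void main(String[] args) {
    byte[] a = {1, 2, 3};
    byte[] b = {1, 2, 3};
    // Raw arrays use identity hashCode: equal contents, almost certainly different
    // hash codes, so a HashPartitioner may send them to different partitions.
    System.out.println(a.hashCode() == b.hashCode()); // typically false
    // A wrapper derives hashCode from the contents, so grouping is stable across JVMs.
    System.out.println(new ByteArrayKey(a).hashCode() == new ByteArrayKey(b).hashCode()); // true
  }
}
```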
dmvk left a comment:
This is a great start! It would be awesome if we could cover all cases by properly using Beam's coders for all shuffle operations as expected.
Thank you for the contribution! ;)
Force-pushed from 9254352 to ace220a
@dmvk - Thank you for looking at the PR. I removed the custom partitioner logic for now. I made the CPK similar to GBK, where we convert the key to a ByteArray.
dmvk left a comment:
I think we should rather keep the getPartition mechanism the way it was for now. Otherwise it is good to merge! Thanks ;)
-    ? null
-    : new HashPartitioner(context.getSparkContext().defaultParallelism());
+    ? new HashPartitioner(context.getSparkContext().defaultParallelism())
+    : null;
On second thought, it was correct.
https://github.com/apache/beam/pull/6884/files#r246077919
This was in order to maintain old functionality (bundleSize == 0, which basically means to use predefined parallelism).
I think the old functionality doesn't make much sense as it doesn't scale with the input data. I guess someone may use this in order to re-scale a "downstream" stage, but there should be a better mechanism to do this.
Any thoughts? @timrobertson100 @kyle-winkelman
I see. I reverted the changes made to getPartition.
I agree that the old functionality seems strange, but I remember (when I had the logic backwards) that the performance tests for the spark runner were impacted. I think the impact was in streaming mode, because if you don't use this HashPartitioner it actually does a double shuffle of the data. I tried to clean this up, but I never finished getting PR #6511 merged.
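A small sketch of that double-shuffle point, assuming a `JavaSparkContext jsc` is available (the class and method names are illustrative): when the upstream RDD is already partitioned with the same HashPartitioner, Spark can skip the second shuffle in groupByKey.

```java
import java.util.Arrays;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

class DoubleShuffleDemo {
  static void demo(JavaSparkContext jsc) {
    HashPartitioner partitioner = new HashPartitioner(4);
    JavaPairRDD<String, Integer> pairs =
        jsc.parallelizePairs(Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)));
    // First shuffle: materialize the partitioning.
    JavaPairRDD<String, Integer> partitioned = pairs.partitionBy(partitioner);
    // Reusing the same partitioner here lets Spark reuse the existing partitioning
    // instead of shuffling the data a second time.
    partitioned.groupByKey(partitioner).count();
  }
}
```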
…s when used in spark
Force-pushed from ace220a to 18820bb
Run Spark ValidatesRunner
The PR makes CPK consistent with GBK, where we convert the keys to a ByteArray before calling the group/combine functions.
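A rough sketch of that idea, assuming `CoderUtils` from the Beam SDK; `EncodedKey` and `encodeKeys` are illustrative stand-ins for the runner's ByteArray and helper code, not the actual implementation:

```java
import java.io.Serializable;
import java.util.Arrays;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

class KeyEncoding {
  // Illustrative value-based wrapper over the coder-encoded key bytes; equals and
  // hashCode use the byte contents, so partitioning is stable across JVMs.
  static final class EncodedKey implements Serializable {
    private final byte[] bytes;
    EncodedKey(byte[] bytes) { this.bytes = bytes; }
    @Override public boolean equals(Object o) {
      return o instanceof EncodedKey && Arrays.equals(bytes, ((EncodedKey) o).bytes);
    }
    @Override public int hashCode() { return Arrays.hashCode(bytes); }
  }

  // Encode every key with its Beam coder before any grouping/combining shuffle.
  static <K, V> JavaPairRDD<EncodedKey, V> encodeKeys(JavaPairRDD<K, V> input, Coder<K> keyCoder) {
    return input.mapToPair(
        kv -> new Tuple2<EncodedKey, V>(
            new EncodedKey(CoderUtils.encodeToByteArray(keyCoder, kv._1())), kv._2()));
  }
}
```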