[SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2 #18990
Conversation
You definitely can't do this. Evaluating the RDD becomes nondeterministic.
Sorry, I edited the pull request body. @srowen's comment above was referring to the initial version, where I proposed using the default, non-deterministic constructor for Random.
Test build #3891 has finished for PR 18990 at commit
I'll leave it open a bit for more comments, but in theory this change is fine, as nobody should depend on the exact output. In practice it might change the exact output of a shuffle stage. But no tests failed, which is evidence that it has very little, if any, practical impact.
LGTM, cc @yanboliang @cloud-fan
LGTM. I agree that in theory there is no reason we should depend on the exact shuffle distribution here. It should be beneficial to have a more even distribution.
@atronchi what is "df" here? I couldn't reproduce that with a DF of 200K simple rows.
It should have been fixed in 3.2+: #37855
I was using 3.3.2 |
Problem
When an RDD (particularly one with a low item-per-partition ratio) is repartitioned to numPartitions = a power of 2, the resulting partitions are very unevenly sized. This happens because a fixed seed is used to initialize the PRNG, and the PRNG is drawn from only once per partition. See details in https://issues.apache.org/jira/browse/SPARK-21782
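A minimal sketch of how the skew can be observed, assuming a running SparkContext named `sc`; the row count and partition numbers are illustrative, not taken from the JIRA:

```scala
import org.apache.spark.rdd.RDD

// Spread a modest number of rows over many input partitions, then
// repartition to a power of 2 and count records per output partition.
val rdd: RDD[Int] = sc.parallelize(1 to 200000, numSlices = 1000)

val sizes = rdd
  .repartition(64) // power of 2
  .mapPartitions(iter => Iterator(iter.size))
  .collect()

// With the old fixed-seed behaviour, min and max can differ dramatically;
// after the fix the sizes should be close to uniform.
println(s"min=${sizes.min} max=${sizes.max} avg=${sizes.sum / sizes.length}")
```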
What changes were proposed in this pull request?
Instead of directly using 0, 1, 2, ... as seeds to initialize Random, hash them with scala.util.hashing.byteswap32(). A simplified sketch of the idea follows.
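This is a sketch of the seeding strategy only, not the exact diff; the helper names are illustrative:

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

// Old behaviour (sketch): the partition index seeds the PRNG directly.
// For small consecutive seeds, the first nextInt() of a power-of-2 bound
// is strongly correlated, so many partitions pick the same starting offset.
def oldStart(index: Int, numPartitions: Int): Int =
  new Random(index).nextInt(numPartitions)

// Proposed behaviour (sketch): scramble the seed with byteswap32 first,
// decorrelating the starting offsets across partitions.
def newStart(index: Int, numPartitions: Int): Int =
  new Random(byteswap32(index)).nextInt(numPartitions)
```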
How was this patch tested?
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test