
[SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2 #18990

Closed
wants to merge 2 commits

Conversation

Contributor

@megaserg megaserg commented Aug 18, 2017

Problem

When an RDD (particularly one with a low items-per-partition ratio) is repartitioned to a numPartitions that is a power of 2, the resulting partitions are very unevenly sized, because the PRNG is initialized with a fixed seed (the partition index) and is drawn from only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
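
For illustration, a standalone sketch (not the Spark source) of the underlying PRNG behavior: when java.util.Random is seeded with consecutive small integers and asked for a single nextInt with a power-of-2 bound, the first draws are heavily correlated.

```scala
import scala.util.Random

// Standalone demonstration (not Spark source): count how many distinct
// "first draws" you get when Random is seeded with consecutive indices
// and asked for one value below a power-of-2 bound.
object SeedCorrelationDemo {
  def main(args: Array[String]): Unit = {
    val numPartitions = 1024 // power of 2
    val firstDraws = (0 until numPartitions).map { index =>
      new Random(index).nextInt(numPartitions)
    }
    println(s"${firstDraws.distinct.size} distinct values out of ${firstDraws.size} seeds")
    // Prints far fewer distinct values than uniform draws would give,
    // so many source partitions start writing to the same target offsets.
  }
}
```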

What changes were proposed in this pull request?

Instead of using the raw indices 0, 1, 2, ... as seeds to initialize Random, hash them first with scala.util.hashing.byteswap32().
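
A minimal sketch of the idea (simplified and paraphrased from the repartition/coalesce shuffle path, not the exact diff):

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

// Sketch of the proposed seeding change: hash the partition index before
// seeding Random, so nearby indices no longer produce correlated first draws.
def startPosition(index: Int, numPartitions: Int): Int = {
  // before: new Random(index).nextInt(numPartitions)
  new Random(byteswap32(index)).nextInt(numPartitions)
}

// e.g. startPosition(3, 1024)
```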

How was this patch tested?

build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test

Member

@srowen srowen left a comment


You definitely can't do this. Evaluating the RDD becomes nondeterministic

@megaserg
Contributor Author

Sorry, I edited the pull request body. @srowen's comment above refers to the initial version, where I proposed using the default, non-deterministic constructor for Random().


SparkQA commented Aug 18, 2017

Test build #3891 has finished for PR 18990 at commit bee7fca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


I'll leave it open a bit for more comments, but in theory this change is fine as nobody should depend on the exact output. In practice it might change the exact output of a shuffle stage. But no tests failed, which is evidence that it has very little if any practical impact.

Contributor

@jiangxb1987 jiangxb1987 left a comment


Member

viirya commented Aug 20, 2017

LGTM. I agree that in theory there is no reason we should depend on the exact shuffle distribution here. It should be beneficial to have a more even distribution.

@atronchi

This issue appears to persist in the DataFrame API, which is used more broadly than the RDD API. What would it take to extend the fix to the DataFrame API?

I verified this on Spark 3.2 using df.repartition(1024) on a DataFrame with ~200k rows, which resulted in almost 30% empty partitions and the skew shown below among the remaining non-empty ones.
[Screenshot, 2023-03-13: distribution of non-empty partition sizes showing the skew]
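
A rough way to check this locally is to count rows per partition after the repartition; the sketch below assumes an existing SparkSession named `spark` and uses a synthetic 200k-row DataFrame rather than the original cached table, so results will vary by Spark version and data.

```scala
// Rough repro sketch (assumes an existing SparkSession named `spark`).
val df = spark.range(200000).toDF("id").cache()

// Count rows in each of the 1024 target partitions.
val sizes = df.repartition(1024)
  .rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()

println(s"empty partitions: ${sizes.count(_ == 0)} of ${sizes.length}")
println(s"largest partition: ${sizes.max} rows, smallest non-empty: ${sizes.filter(_ > 0).min} rows")
```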

Member

srowen commented Mar 13, 2023

@atronchi what is "df" here? I couldn't reproduce that with a DF of 200K simple rows

@cloud-fan
Contributor

It should have been fixed in 3.2+: #37855

@atronchi

df is a cached dataframe containing ~200k rows. @srowen what Spark version did you test with?

Member

srowen commented Mar 15, 2023

I was using 3.3.2
