
[SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2 #18990

Closed
wants to merge 2 commits

Conversation

Contributor

@megaserg megaserg commented Aug 18, 2017

Problem

When an RDD (particularly one with a low items-per-partition ratio) is repartitioned to a numPartitions that is a power of 2, the resulting partitions are very unevenly sized, because the PRNG is initialized with a fixed seed (the partition index) and is drawn from only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
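
For illustration, a standalone sketch (not the Spark source) of the underlying PRNG behavior: when java.util.Random is seeded with consecutive small integers and asked for a single nextInt with a power-of-2 bound, the first draws are heavily correlated.

```scala
import scala.util.Random

// Standalone demonstration (not Spark source): count how many distinct
// "first draws" you get when Random is seeded with consecutive indices
// and asked for one value below a power-of-2 bound.
object SeedCorrelationDemo {
  def main(args: Array[String]): Unit = {
    val numPartitions = 1024 // power of 2
    val firstDraws = (0 until numPartitions).map { index =>
      new Random(index).nextInt(numPartitions)
    }
    println(s"${firstDraws.distinct.size} distinct values out of ${firstDraws.size} seeds")
    // Prints far fewer distinct values than uniform draws would give,
    // so many source partitions start writing to the same target offsets.
  }
}
```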

What changes were proposed in this pull request?

Instead of using the raw indices 0, 1, 2, ... as seeds to initialize Random, hash them first with scala.util.hashing.byteswap32().
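
A minimal sketch of the idea (simplified and paraphrased from the repartition/coalesce shuffle path, not the exact diff):

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

// Sketch of the proposed seeding change: hash the partition index before
// seeding Random, so nearby indices no longer produce correlated first draws.
def startPosition(index: Int, numPartitions: Int): Int = {
  // before: new Random(index).nextInt(numPartitions)
  new Random(byteswap32(index)).nextInt(numPartitions)
}

// e.g. startPosition(3, 1024)
```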

How was this patch tested?

build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test

Member

@srowen srowen left a comment


You definitely can't do this. Evaluating the RDD becomes nondeterministic

@megaserg
Contributor Author

Sorry, I edited the pull request body. @srowen's comment above refers to the initial version, where I proposed using the default, non-deterministic constructor for Random().


SparkQA commented Aug 18, 2017

Test build #3891 has finished for PR 18990 at commit bee7fca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


I'll leave it open a bit for more comments, but in theory this change is fine as nobody should depend on the exact output. In practice it might change the exact output of a shuffle stage. But no tests failed, which is evidence that it has very little if any practical impact.

Contributor

@jiangxb1987 jiangxb1987 left a comment


Member

viirya commented Aug 20, 2017

LGTM. I agree that in theory there is no reason we should depend on the exact shuffle distribution here. It should be beneficial to have a more even distribution.

@atronchi

This issue appears to persist in the DataFrame API, which is used more broadly than the RDD API. What would it take to extend the fix to the DataFrame API?

I verified this on Spark 3.2 using df.repartition(1024) on a DataFrame with ~200k rows, which resulted in almost 30% empty partitions and the skew shown below among the remaining non-empty ones.
[Screenshot, 2023-03-13: distribution of non-empty partition sizes showing the skew]
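
A rough way to check this locally is to count rows per partition after the repartition; the sketch below assumes an existing SparkSession named `spark` and uses a synthetic 200k-row DataFrame rather than the original cached table, so results will vary by Spark version and data.

```scala
// Rough repro sketch (assumes an existing SparkSession named `spark`).
val df = spark.range(200000).toDF("id").cache()

// Count rows in each of the 1024 target partitions.
val sizes = df.repartition(1024)
  .rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()

println(s"empty partitions: ${sizes.count(_ == 0)} of ${sizes.length}")
println(s"largest partition: ${sizes.max} rows, smallest non-empty: ${sizes.filter(_ > 0).min} rows")
```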

Member

srowen commented Mar 13, 2023

@atronchi what is "df" here? I couldn't reproduce that with a DF of 200K simple rows

@cloud-fan
Contributor

It should have been fixed in 3.2+: #37855

@atronchi

df is a cached dataframe containing ~200k rows. @srowen what Spark version did you test with?

Member

srowen commented Mar 15, 2023

I was using 3.3.2
