
[SPARK-4630][Core] Dynamically determine optimal number of partitions #4070

Closed
wants to merge 4 commits

Conversation

lianhuiwang
Contributor

Stages in an application process different amounts of data. If the user does not set numPartitions for a stage, Spark uses the same defaultParallelism for its number of partitions. In the DAGScheduler, the number of a stage's running tasks equals its number of partitions, so this value is usually the same for every stage. If a stage has too few partitions, each task has to process a large amount of data and slows down due to spilling or GC; if it has too many partitions, scheduling overhead becomes significant. To improve application performance, we need to determine the optimal number of partitions from the stage's input data size. There are two steps:

  1. Estimate the number of the stage's partitions.
    Given the input data size of its parent stages and the spark.reduce.per.partition.bytes configuration, we can determine the stage's number of partitions (see the sketch after this list).
    How do we get the parent stages' input data size?
    If the stage has no parents, we get its input size by summing the lengths of its input paths.
    Else, if it has parents but they are not yet available, we use the parents' input data size.
    Else, if its parents are available, we compute the parents' shuffle data size and use it as the stage's input size.
  2. Update the stage's Partitioner.
    First, update the number of partitions in the parents' shuffle dependencies so that each ShuffleMapTask writes the desired number of partition files.
    Then, update the stage's information, particularly the stage's RDD, so that the stage's ShuffledRDD can correctly pull data from the map tasks.
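
A minimal sketch of the estimation in step 1, assuming a hypothetical helper: the configuration key spark.reduce.per.partition.bytes is taken from this description, but the method name, parameter names, and the 64 MB default below are illustrative only, not the actual code in this patch.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper (not the patch itself): estimate a stage's partition
// count from its parents' input size and a per-partition byte target.
def estimateNumPartitions(conf: SparkConf, parentInputBytes: Long): Int = {
  // Target bytes per reduce partition; the 64 MB default here is only an assumption.
  val bytesPerPartition =
    conf.getLong("spark.reduce.per.partition.bytes", 64L * 1024 * 1024)
  // Round up so no partition exceeds the target, and keep at least one partition.
  math.max(1, math.ceil(parentInputBytes.toDouble / bytesPerPartition.toDouble).toInt)
}
```

For example, under these assumptions a stage whose parents produced about 10 GB of shuffle data would get roughly 10 GB / 64 MB ≈ 160 partitions.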

Finally, this feature can be turned on/off with a configuration option.
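
For illustration only: the exact key name of the on/off switch is not given in this description, so the flag below is a placeholder.

```scala
import org.apache.spark.SparkConf

// Placeholder key names; the real toggle introduced by this patch may differ.
val conf = new SparkConf()
  .set("spark.reduce.dynamic.partitions.enabled", "true")                   // turn the feature on
  .set("spark.reduce.per.partition.bytes", (128L * 1024 * 1024).toString)   // target ~128 MB per partition
```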

TODO:

  1. Consider spark.shuffle.memoryFraction when determining the spark.reduce.per.partition.bytes configuration.
  2. When the stage is the final stage, the resultHandler's value cannot be returned to the SparkContext because the partitions have been changed.
  3. When the number of a stage's tasks has changed, report the stage's new information to the UI. Before submitStage, the SparkListenerJobStart event that includes all of the job's stage infos has already been posted to the listenerBus.

@ksakellis @sryza @JoshRosen @rxin

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25658 has started for PR 4070 at commit fc652a5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25658 has finished for PR 4070 at commit fc652a5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HashPartitioner(var partitions: Int) extends Partitioner

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25658/
Test FAILed.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25659 has started for PR 4070 at commit 668926c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25659 has finished for PR 4070 at commit 668926c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HashPartitioner(var partitions: Int) extends Partitioner

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25659/
Test FAILed.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25660 has started for PR 4070 at commit 8b7216f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25660 has finished for PR 4070 at commit 8b7216f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HashPartitioner(var partitions: Int) extends Partitioner

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25660/
Test FAILed.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25662 has started for PR 4070 at commit 622e45c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25662 has finished for PR 4070 at commit 622e45c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HashPartitioner(var partitions: Int) extends Partitioner

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25662/
Test FAILed.

@rxin
Contributor

rxin commented Jan 19, 2015

@lianhuiwang have you tried this in production on any jobs?

@lianhuiwang
Contributor Author

@rxin Yes, some ETL jobs with groupBy and join operators have been tried with this feature, and most of the time it determines the number of partitions very well. Why does Jenkins report "java.lang.RuntimeException: spark-core: Binary compatibility check failed!"? Can you tell me the reason for the failure? Thanks.
@ksakellis I took a look at your work on SPARK-4630. Can you give some suggestions about this PR? Thanks.

@srowen
Member

srowen commented May 18, 2015

I don't think this is the direction that the discussion in SPARK-4630 is leading. This is trying to use output size as a heuristic, and it isn't ideal. I'm also not sure of the implications of making the number of partitions mutable in RDDs. Do you mind closing this, as it hasn't been active in a while?
