SPARK-2203: PySpark defaults to use same num reduce partitions as map side #1138

aarondav · 2014-06-19T19:52:18Z

For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark always uses ctx.defaultParallelism as the default parallelism for the reduce side. This value is a constant typically determined by the number of cores in the cluster.

In contrast, Spark's Partitioner#defaultPartitioner uses the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default because it helps avoid OOMs, and it should also be the behavior of PySpark.

JIRA: https://issues.apache.org/jira/browse/SPARK-2203
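
The rule described above can be sketched in plain Python. This is only an illustration of the decision, not the code in this patch: the function name `default_reduce_partitions` and the dict standing in for the Spark configuration are hypothetical, and the sketch assumes that "the defaultParallelism config" refers to the `spark.default.parallelism` setting.

```python
def default_reduce_partitions(conf, default_parallelism, num_map_partitions):
    """Pick the reduce-side partition count for a shuffle.

    conf                -- plain dict standing in for the Spark configuration (assumption)
    default_parallelism -- the value ctx.defaultParallelism would report
    num_map_partitions  -- number of partitions of the RDD being shuffled
    """
    # Honor the configured parallelism only when the user set it explicitly.
    if "spark.default.parallelism" in conf:
        return default_parallelism
    # Otherwise keep the reduce side the same size as the map side.
    return num_map_partitions


# Not set: a 200-partition RDD on an 8-core cluster keeps 200 reduce partitions.
unset = default_reduce_partitions({}, 8, 200)

# Explicitly set: the configured parallelism is used instead.
explicit = default_reduce_partitions({"spark.default.parallelism": "16"}, 16, 200)
```

With the old behavior, the first case would have produced 8 reduce partitions regardless of input size, so each reduce task would have to hold far more data, which is where the OOM risk comes from.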

mateiz · 2014-06-19T19:54:35Z

Good catch. Change looks good to me.

AmplabJenkins · 2014-06-19T19:54:56Z

Merged build triggered.

AmplabJenkins · 2014-06-19T19:55:06Z

Merged build started.

AmplabJenkins · 2014-06-19T20:34:05Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-19T20:34:05Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15920/

rxin · 2014-06-20T07:06:48Z

Merging this in master

… side

For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster.

In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark.

JIRA: https://issues.apache.org/jira/browse/SPARK-2203

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#1138 from aarondav/pyfix and squashes the following commits:

1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions

Co-authored-by: Egor Krivokon <>

asfgit closed this in f46e02f Jun 20, 2014

mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025

MapR [SPARK-1199] [Spark 3.4]Can't open Spark Driver UI (apache#1138)

6f38451

Co-authored-by: Egor Krivokon <>
