
[SPARK-2568] RangePartitioner should run only one job if data is balanced #1562

Closed
wants to merge 13 commits

Conversation

@mengxr mengxr commented Jul 24, 2014

As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort).

RangePartitioner should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.

The downside is that we need to collect more items from each partition in the first pass. An alternative solution is to cache the intermediate result and decide afterwards whether to fetch the full data.
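The single-pass idea described above can be sketched as follows: one scan of each partition produces both an exact element count and a fixed-size random sample (reservoir sampling), so a single job yields the total count and the candidate keys for the range bounds. This is an illustrative simplification, not Spark's actual implementation; the function name and signature are assumptions.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Illustrative single-pass sketch: one scan of a partition produces both
// an exact element count and a fixed-size random sample.
def reservoirSample[T](input: Iterator[T], k: Int, seed: Long): (Seq[T], Long) = {
  val rand = new Random(seed)
  val reservoir = new ArrayBuffer[T](k)
  var count = 0L
  while (input.hasNext) {
    val item = input.next()
    count += 1
    if (reservoir.length < k) {
      reservoir += item                      // fill the reservoir first
    } else {
      val j = (rand.nextDouble() * count).toLong
      if (j < k) reservoir(j.toInt) = item   // replace with probability k/count
    }
  }
  (reservoir.toSeq, count)
}

val (sample, total) = reservoirSample((1 to 100000).iterator, 20, seed = 42L)
// total is the exact count; sample holds 20 elements drawn uniformly at random.
```

On top of such per-partition (sample, count) pairs, the driver can estimate each partition's size and re-sample only the partitions whose counts show they were under-sampled, as the description proposes.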

SparkQA commented Jul 24, 2014

QA tests have started for PR 1562. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17089/consoleFull

SparkQA commented Jul 24, 2014

QA results for PR 1562:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17089/consoleFull

SparkQA commented Jul 24, 2014

QA tests have started for PR 1562. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17092/consoleFull

SparkQA commented Jul 24, 2014

QA tests have started for PR 1562. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17093/consoleFull

SparkQA commented Jul 24, 2014

QA results for PR 1562:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17092/consoleFull

SparkQA commented Jul 24, 2014

QA results for PR 1562:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17093/consoleFull

    val rddSample = rdd.sample(false, frac, 1).map(_._1).collect().sorted
    if (rddSample.length == 0) {
      Array()
    // This is the sample size we need to have roughly balanced output partitions.
Contributor:
It would be great to break this down into a couple of different functions that we can unit test.

Contributor Author:
Let me break it down.
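One way to split the logic into unit-testable pieces, as suggested: a sampling function that returns per-partition (count, sample) pairs, and a separate function that turns weighted candidate keys into range bounds. The sketch below of the bounds step is a simplified illustration under assumed semantics (each candidate key carries a weight for roughly how many input items it represents), not necessarily the final API.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified illustration: place a partition bound at every 1/partitions
// of the total candidate weight, skipping duplicate keys.
def determineBounds[K: Ordering](candidates: Seq[(K, Float)], partitions: Int): Seq[K] = {
  val ord = implicitly[Ordering[K]]
  val ordered = candidates.sortBy(_._1)
  val sumWeights = ordered.map(_._2.toDouble).sum
  val step = sumWeights / partitions
  var cumWeight = 0.0
  var target = step
  var previousBound = Option.empty[K]
  val bounds = ArrayBuffer.empty[K]
  var i = 0
  var j = 0
  while (i < ordered.length && j < partitions - 1) {
    val (key, weight) = ordered(i)
    cumWeight += weight
    if (cumWeight >= target) {
      // Skip duplicate keys so the bounds stay strictly increasing.
      if (previousBound.isEmpty || ord.gt(key, previousBound.get)) {
        bounds += key
        previousBound = Some(key)
        target += step
        j += 1
      }
    }
    i += 1
  }
  bounds.toSeq
}

// 100 equally weighted keys split into 4 ranges → bounds at 25, 50, 75.
val bounds = determineBounds((1 to 100).map(i => (i, 1.0f)), partitions = 4)
```

Keeping this step free of any RDD dependency is what makes it easy to unit test in isolation.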

asfgit pushed a commit that referenced this pull request Jul 24, 2014
Allow small errors in comparison.

@dbtsai , this unit test blocks #1562 . I may need to merge this one first. We can change it to use the tools in #1425 after that PR gets merged.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

5076a7f [Xiangrui Meng] fix binary metrics unit tests
    var numItems = 0L
    sketch.foreach { case (_, n, _) =>
      numItems += n
    }
Contributor:
You can replace this with val numItems = sketch.map(_._2).sum

Contributor:
(It would probably also be more efficient than doing a pattern match here)

Contributor Author:
.sum will return an Int instead of a Long. I will remove the pattern matching.

Contributor:
val numItems = sketch.map(_._2.toLong).sum

Contributor Author:
Done. My previous concern was sketch.map generates a temp array. But since the number of partitions is small, it is not a big deal and this reads better.
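The .toLong in the accepted version matters because the per-partition counts are Ints, and summing Ints stays in Int arithmetic and can overflow; mapping to Long first keeps the sum exact. A small self-contained illustration (the sketch contents here are made up):

```scala
// Hypothetical sketch entries: (partitionId, count, sample). Counts are Ints.
val sketch = Seq(
  (0, 2000000000, Seq.empty[Int]),
  (1, 2000000000, Seq.empty[Int])
)

val overflowed = sketch.map(_._2).sum         // Int arithmetic wraps around
val numItems   = sketch.map(_._2.toLong).sum  // Long arithmetic is exact

// overflowed is negative (-294967296), while numItems is 4000000000L.
```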

mengxr commented Jul 27, 2014

@rxin @mateiz I have one question about using rdd.id as a random seed shift to avoid sampling the same sequence in each partition. It is constant within a session, but it makes it harder for a user to reproduce the result. Is that a big deal?

mateiz commented Jul 27, 2014

I think it's fine to make it random. Actually it would be better to do something like idx | (rdd.id << 16) to have them overlap in fewer bits, since both idx and rdd.id are small numbers.

mateiz commented Jul 27, 2014

Actually I meant ^, not |
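The suggestion can be illustrated directly: shifting rdd.id before XOR-ing keeps the two small numbers in mostly disjoint bits, so every (rdd, partition) pair gets a distinct seed. A minimal sketch with assumed values:

```scala
// Assumed small ids, as in the discussion: both idx and rdd.id are small numbers.
val rddId = 7
val partitionSeeds = (0 until 8).map(idx => (idx ^ (rddId << 16)).toLong)

// All eight partitions of this RDD get distinct seeds, and a different
// rddId moves the whole block of seeds elsewhere in the seed space.
```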

SparkQA commented Jul 27, 2014

QA tests have started for PR 1562. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17248/consoleFull

SparkQA commented Jul 27, 2014

QA tests have started for PR 1562. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17249/consoleFull

SparkQA commented Jul 27, 2014

QA results for PR 1562:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17248/consoleFull

SparkQA commented Jul 27, 2014

QA results for PR 1562:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17249/consoleFull

@@ -103,26 +107,49 @@ class RangePartitioner[K : Ordering : ClassTag, V](
private var ascending: Boolean = true)
Contributor:
It'd be great to update the documentation on when this results in two passes vs one pass. We should probably update the documentation for sortByKey and various other sorts that use this too. Let's do that in another PR.

rxin commented Jul 30, 2014

LGTM. Merging in master.

@asfgit asfgit closed this in 2e6efca Jul 30, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Allow small errors in comparison.

@dbtsai , this unit test blocks apache#1562 . I may need to merge this one first. We can change it to use the tools in apache#1425 after that PR gets merged.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:

5076a7f [Xiangrui Meng] fix binary metrics unit tests
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
[SPARK-2568] RangePartitioner should run only one job if data is balanced

As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort).

`RangePartitioner` should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.

The downside is that we need to collect more items from each partition in the first pass. An alternative solution is to cache the intermediate result and decide afterwards whether to fetch the full data.

Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes apache#1562 from mengxr/range-partitioner and squashes the following commits:

6cc2551 [Xiangrui Meng] change foreach to for
eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner
eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl
c436d30 [Xiangrui Meng] fix binary metrics unit tests
db58a55 [Xiangrui Meng] add unit tests
a6e35d6 [Xiangrui Meng] minor update
60be09e [Xiangrui Meng] remove importance sampler
9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data
cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
17bcbf3 [Reynold Xin] Added seed.
badf20d [Reynold Xin] Renamed the method.
6940010 [Reynold Xin] Reservoir sampling implementation.