[SPARK-2617] Correct doc and usages of preservesPartitioning #1526

mengxr · 2014-07-22T09:33:57Z

The name preservesPartitioning is ambiguous: it could mean 1) preserves the indices of partitions, or 2) preserves the partitioner. The latter is correct, and preservesPartitioning should really be called preservesPartitioner to avoid confusion. Unfortunately, it is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in mapPartitions*,
2. makes RDD.sample preserve partitioner,
3. changes preservesPartitioning to false in RDD.zip because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.
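
The two readings of the flag can be told apart with a minimal sketch in plain Scala. This is a toy model under stated assumptions, not Spark's implementation: `ToyRDD`, `Partitioner`, and `byParity` are hypothetical names invented for the illustration, and the toy `mapPartitions` only accepts functions over key-value pairs, whereas the real method accepts any element type.

```scala
object PreservesPartitioningSketch {
  // Hypothetical stand-in for a partitioner: maps a key to a partition index.
  type Partitioner[K] = K => Int

  // Toy model of a keyed RDD: a list of partitions plus an optional partitioner.
  final case class ToyRDD[K, V](
      partitions: Vector[Vector[(K, V)]],
      partitioner: Option[Partitioner[K]]) {

    // Partition i of the input always becomes partition i of the output, so the
    // indices are preserved regardless of the flag. The flag only decides
    // whether the partitioner is carried over to the result.
    def mapPartitions[W](
        f: Iterator[(K, V)] => Iterator[(K, W)],
        preservesPartitioning: Boolean = false): ToyRDD[K, W] =
      ToyRDD(
        partitions.map(p => f(p.iterator).toVector),
        if (preservesPartitioning) partitioner else None)
  }

  // Even keys go to partition 0, odd keys to partition 1.
  val byParity: Partitioner[Int] = k => math.abs(k % 2)

  val rdd: ToyRDD[Int, String] =
    ToyRDD[Int, String](Vector(Vector(0 -> "a", 2 -> "b"), Vector(1 -> "c")), Some(byParity))

  // The function leaves the keys untouched, so declaring the flag is truthful.
  val kept: ToyRDD[Int, String] =
    rdd.mapPartitions[String](_.map { case (k, v) => (k, v.toUpperCase) }, preservesPartitioning = true)

  // Same function with the default flag: the partitioner is dropped.
  val dropped: ToyRDD[Int, String] =
    rdd.mapPartitions[String](_.map { case (k, v) => (k, v.toUpperCase) })

  def main(args: Array[String]): Unit = {
    println(s"kept.partitioner.isDefined = ${kept.partitioner.isDefined}")
    println(s"dropped.partitioner.isDefined = ${dropped.partitioner.isDefined}")
  }
}
```

In both calls the records stay in the partitions they started in; only the partitioner metadata differs. Under this reading, setting the flag to true is a claim by the caller that the function does not change the keys.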

fix wrong usage of preservesPartitioning; make sample preserve partitioning

SparkQA · 2014-07-22T09:38:19Z

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull

SparkQA · 2014-07-22T10:59:34Z

QA results for PR 1526:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull

SparkQA · 2014-07-22T17:03:14Z

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16973/consoleFull

SparkQA · 2014-07-22T18:24:25Z

QA results for PR 1526:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16973/consoleFull

SparkQA · 2014-07-22T19:43:33Z

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16984/consoleFull

SparkQA · 2014-07-22T21:23:57Z

QA results for PR 1526:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16984/consoleFull

pwendell · 2014-07-22T22:24:51Z

I think the reason you find it confusing is that you are interpreting it as an imperative (please preserve X) instead of as a description (this preserves X). I always interpreted it as descriptive, i.e. you are doing something that logically preserves the partitioning of the RDD. E.g. you are answering the question "Does this map function preserve the partitioning of the underlying data?".

pwendell · 2014-07-22T22:26:33Z

I think that's why it's called preservesPartitioning (descriptive) instead of preservePartitioning (imperative).

mengxr · 2014-07-23T01:20:53Z

@pwendell The confusing part is the definition of "partitioning". It could be the indexing of partitions or the partitioner. The first time I found that mapPartitions has a parameter called preservesPartitioning, I thought it was meant to preserve the indexing of partitions, so that partition 0 maps to partition 0, partition 1 maps to partition 1, etc.

This causes problems. For example, we set preservesPartitioning to true in RDD.zip and the following code won't run correctly:

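import org.apache.spark.HashPartitioner
val a = sc.makeRDD(Seq(0, 1, 2, 3, 4)).map(x => (x, 1)).partitionBy(new HashPartitioner(2))
val b = a.map(x => 1)
// The zipped RDD is keyed by the (Int, Int) pairs of a, so a's partitioner no longer applies to it.
a.zip(b).join(a.map(x => (x, 1))).collect()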

Btw, preservePartitioning is used in streaming instead of preservesPartitioning. @tdas
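
The placement mismatch behind the zip example above can be sketched in plain Scala, without a cluster. The sketch assumes the rule HashPartitioner applies to non-null keys (key.hashCode modulo the number of partitions, shifted into the non-negative range); `hashPartition` and `placement` are names made up for this illustration. For each record of `a` it compares the partition the record was actually placed in (chosen by its Int key) with the partition that the same partitioner assigns to the zipped key, which is the whole (Int, Int) pair.

```scala
object ZipPartitionerSketch {
  // Assumed to match HashPartitioner for non-null keys: hashCode modulo the
  // partition count, shifted into the non-negative range.
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  val numPartitions = 2
  val keys = Seq(0, 1, 2, 3, 4)

  // (key, partition the record lives in, partition the zipped key hashes to)
  val placement: Seq[(Int, Int, Int)] = keys.map { k =>
    (k, hashPartition(k, numPartitions), hashPartition((k, 1), numPartitions))
  }

  def main(args: Array[String]): Unit =
    placement.foreach { case (k, actual, expected) =>
      println(s"key=$k lives in partition $actual; zipped key ($k,1) hashes to partition $expected")
    }
}
```

Whenever the two partition numbers differ for a record, a join that trusts the stale partitioner would look for that record in the wrong partition and miss it, which is why the zipped RDD should not report the first RDD's partitioner.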

SparkQA · 2014-07-23T05:53:28Z

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17017/consoleFull

pwendell · 2014-07-23T06:18:26Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -585,7 +585,9 @@ abstract class RDD[T: ClassTag](
  }

  /**
-   * Return a new RDD by applying a function to each partition of this RDD.
+   * Return a new RDD by applying a function to each partition of this RDD. Note that


Nit: "Note that preservesPartitioning means whether" --> "preservesPartitioning indicates whether" (I think it's implicit that we are asking them to note everything, there are only two sentences here).

Also, it might be nice to add a blank line after the first sentence.

With a blank line now, it is much easier to copy & paste :)

pwendell · 2014-07-23T06:19:42Z

Only one pedantic comment. LGTM even if you ignore my comment.

SparkQA · 2014-07-23T06:28:31Z

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17018/consoleFull

SparkQA · 2014-07-23T07:30:35Z

QA results for PR 1526:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17017/consoleFull

rxin · 2014-07-23T07:58:45Z

Merging in master. Thanks!

SparkQA · 2014-07-23T08:07:38Z

QA results for PR 1526:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17018/consoleFull

The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `mapPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning

fix wrong usage of preservesPartitioning; make sample preserve partitioning

add doc to explain preservesPartitioning

d1caa65

fix wrong usage of preservesPartitioning; make sample preserve partitioning

Merge branch 'master' into preserve-partitioner

20b4816

fix unit test

357575c

update doc

3b1ba19

pwendell reviewed Jul 23, 2014
update doc based on pwendell's comments

b361e65

asfgit closed this in 4c7243e Jul 23, 2014
