Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2617] Correct doc and usages of preservesPartitioning #1526

Closed
wants to merge 5 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Jul 22, 2014

The name preservesPartitioning is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and preservesPartitioning should really be called preservesPartitioner to avoid confusion. Unfortunately, this is already part of the API and we cannot change. We should be clear in the doc and fix wrong usages.

This PR

  1. adds notes in maPartitions*,
  2. makes RDD.sample preserve partitioner,
  3. changes preservesPartitioning to false in RDD.zip because the keys of the first RDD are no longer the keys of the zipped RDD,
  4. fixes some wrong usages in MLlib.

fix wrong usage of preservesPartitioning
make sample preserse partitioning
@SparkQA
Copy link

SparkQA commented Jul 22, 2014

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 22, 2014

QA results for PR 1526:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 22, 2014

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16973/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 22, 2014

QA results for PR 1526:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16973/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 22, 2014

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16984/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 22, 2014

QA results for PR 1526:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16984/consoleFull

@pwendell
Copy link
Contributor

I think the reason you find it confusing is because you are interpreting it an imperative (please preserve X) instead of descriptive (this preserves X). I always interpreted it as descriptive. I.e. you are doing something that logically preserves the partitioning of the RDD. E.g. you are answering the question "Does this map function preserve the partitioning of the underlying data?".

@pwendell
Copy link
Contributor

I think that's why it's called preservesPartitioning (descriptive) instead of preservePartitioning (imperative).

@mengxr
Copy link
Contributor Author

mengxr commented Jul 23, 2014

@pwendell The confusing part is the definition of "partitioning". It could be the indexing of partitions or the partitioner. The first time I found that mapPartitions has a parameter called perservesPartitioning, I thought this is to preserve the indexing of partitions --- so partition 0 maps to partition 0 and partition 1 maps to partition 1, etc.

This causes problems. For example, we set preservesPartitioning to true in RDD.zip and the following code won't run correctly:

val a = sc.makeRDD(Seq(0, 1, 2, 3, 4)).map(x => (x, 1)).partitionBy(new HashPartitioner(2))
val b = a.map(x => 1)
a.zip(b).join(a.map(x => (x, 1)).collect()

Btw, preservePartitioning is used in streaming instead of preservesPartitioning. @tdas

@SparkQA
Copy link

SparkQA commented Jul 23, 2014

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17017/consoleFull

@@ -585,7 +585,9 @@ abstract class RDD[T: ClassTag](
}

/**
* Return a new RDD by applying a function to each partition of this RDD.
* Return a new RDD by applying a function to each partition of this RDD. Note that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "Note that preservesPartitioning means whether" --> "preservesPartitioning indicates whether" (I think it's implicit that we are asking them to note everything, there are only two sentences here).

Also, it might be nice to add a blank line after the first sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a blank line now, it is much easier to copy & paste :)

@pwendell
Copy link
Contributor

Only one pedantic comment. LGTM even if you ignore my comment.

@SparkQA
Copy link

SparkQA commented Jul 23, 2014

QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17018/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 23, 2014

QA results for PR 1526:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17017/consoleFull

@rxin
Copy link
Contributor

rxin commented Jul 23, 2014

Merging in master. THanks!

@asfgit asfgit closed this in 4c7243e Jul 23, 2014
@SparkQA
Copy link

SparkQA commented Jul 23, 2014

QA results for PR 1526:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17018/consoleFull

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `maPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in  `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants