[SPARK-2617] Correct doc and usages of preservesPartitioning #1526
fix wrong usage of preservesPartitioning; make sample preserve partitioning
QA tests have started for PR 1526. This patch merges cleanly.
QA results for PR 1526:
QA tests have started for PR 1526. This patch merges cleanly.
QA results for PR 1526:
QA tests have started for PR 1526. This patch merges cleanly.
QA results for PR 1526:
I think the reason you find it confusing is that you are interpreting it as imperative (please preserve X) instead of descriptive (this preserves X). I always interpreted it as descriptive, i.e. you are doing something that logically preserves the partitioning of the RDD. E.g. you are answering the question "Does this map function preserve the partitioning of the underlying data?".
I think that's why it's called
@pwendell The confusing part is the definition of "partitioning". It could be the indexing of partitions or the partitioner. The first time I found that This causes problems. For example, we set
Btw,
QA tests have started for PR 1526. This patch merges cleanly.
@@ -585,7 +585,9 @@ abstract class RDD[T: ClassTag](
   }

   /**
-   * Return a new RDD by applying a function to each partition of this RDD.
+   * Return a new RDD by applying a function to each partition of this RDD. Note that
Nit: "Note that `preservesPartitioning` means whether" --> "`preservesPartitioning` indicates whether" (I think it's implicit that we are asking them to note everything, there are only two sentences here).
Also, it might be nice to add a blank line after the first sentence.
With a blank line now, it is much easier to copy & paste :)
Only one pedantic comment. LGTM even if you ignore my comment.
QA tests have started for PR 1526. This patch merges cleanly.
QA results for PR 1526:
Merging in master. Thanks!
QA results for PR 1526:
The name `preservesPartitioning` is ambiguous: it could mean that the operation 1) preserves the indices of partitions, or 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.

This PR
1. adds notes in `mapPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning; fix wrong usage of preservesPartitioning; make sample preserve partitioning
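To make the "preserves the partitioner" reading concrete, here is a minimal sketch rather than code from this PR. It assumes a local Spark installation recent enough that the pair-RDD implicits are in scope without `import org.apache.spark.SparkContext._`; the object name `PreservesPartitioningSketch` and the sample data are made up for illustration. It checks `RDD.partitioner` after two `mapPartitions` calls, one that leaves the keys alone and one that rewrites them.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical driver object, for illustration only.
object PreservesPartitioningSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master is available.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("preservesPartitioning-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hash-partition a small key-value RDD by key.
      val pairs = sc
        .parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
        .partitionBy(new HashPartitioner(2))
      assert(pairs.partitioner.isDefined)

      // Keys are untouched, so declaring preservesPartitioning = true is
      // accurate and the result keeps the parent's partitioner.
      val upper = pairs.mapPartitions(
        iter => iter.map { case (k, v) => (k, v.toUpperCase) },
        preservesPartitioning = true)
      assert(upper.partitioner == pairs.partitioner)

      // Keys change, so the flag is left at its default (false) and the
      // result carries no partitioner, even though the partition indices
      // are still the same as in the parent.
      val rekeyed = pairs.mapPartitions(
        iter => iter.map { case (k, v) => (k + 1, v) })
      assert(rekeyed.partitioner.isEmpty)
    } finally {
      sc.stop()
    }
  }
}
```

The flag is a declaration by the caller, not a request to Spark. Passing `true` for a function that changes keys would leave a partitioner attached that no longer describes where keys live, which is the kind of wrong usage this PR sets out to fix.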