[SPARK-1701] Clarify slice vs partition in the programming guide #2305
Conversation
This is a partial solution to SPARK-1701, addressing only the documentation confusion. Follow-up work could rename the numSlices parameter itself across languages, with care required in Scala and Python to maintain backward compatibility for named parameters.
QA tests have started for PR 2305 at commit

QA tests have finished for PR 2305 at commit
@JoshRosen will you take a look at this?

Sorry for not reviewing this until now; it sort of fell off my radar.
@@ -286,7 +286,7 @@ We describe operations on distributed datasets later on.

</div>

- One important parameter for parallel collections is the number of *slices* to cut the dataset into. Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`).
+ One important parameter for parallel collections is the number of *partitions* to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: the parameter is called numSlices (not numPartitions) to maintain backward compatibility.
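To make the slice/partition idea concrete, here is a minimal sketch of what cutting a dataset into numSlices roughly equal chunks looks like. This is a hypothetical illustration in plain Python, not Spark's actual implementation of `parallelize`; the function name `split_into_partitions` is invented for this example.

```python
def split_into_partitions(data, num_slices):
    """Split `data` into `num_slices` contiguous, roughly equal chunks.

    In the Spark model, each chunk would correspond to one partition,
    and Spark would run one task per partition.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    n = len(data)
    # Integer arithmetic distributes any remainder across the chunks.
    return [
        data[(i * n) // num_slices : ((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

# Analogous in spirit to sc.parallelize(range(10), 3):
parts = split_into_partitions(list(range(10)), 3)
# parts == [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

The takeaway: whether the docs say "slices" or "partitions", both refer to these chunks, and the degree of parallelism is bounded by how many of them you create.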
Maybe the "Note:" should mention that in some places we still say numSlices (for backwards compatibility with earlier versions of Spark) and that "slices" should be considered a synonym for "partitions"; there are a lot of places that use numPartitions, etc., so we may want to emphasize that this discrepancy only occurs in a few places.
thanks for the feedback. i've changed the language to be more in line with your suggestion.
QA tests have started for PR 2305 at commit

QA tests have finished for PR 2305 at commit
i'm getting HTTP 503 from jenkins, but i'm gonna go out on a limb and say this doc change didn't break the unit tests.

I think that Jenkins might have crashed or restarted overnight, but it seems to be working now. This looks good to me, so I'm going to merge it. Feel free to open similar PRs for other documentation improvements / clarifications, since these types of edits are really helpful.
…nd code?)

## What changes were proposed in this pull request?

Came across the term "slice" when running some Spark Scala code. A Google search indicated that "slices" and "partitions" refer to the same thing; see:

- [This issue](https://issues.apache.org/jira/browse/SPARK-1701)
- [This pull request](apache#2305)
- [This StackOverflow answer](http://stackoverflow.com/questions/23436640/what-is-the-difference-between-an-rdd-partition-and-a-slice) and [this one](http://stackoverflow.com/questions/24269495/what-are-the-differences-between-slices-and-partitions-of-rdds)

Thus this pull request fixes the occurrence of "slice" I came across. Nonetheless, [it would appear](https://github.com/apache/spark/search?utf8=%E2%9C%93&q=slice&type=) there are still many references to "slice/slices", so I thought I'd raise this pull request to address the issue (sorry if this is the wrong place; I'm not too familiar with raising Apache issues).

## How was this patch tested?

(Not tested locally; only a minor exception message change.)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: asmith26 <asmith26@users.noreply.github.com>

Closes apache#17565 from asmith26/master.