
[SPARK-1701] Clarify slice vs partition in the programming guide #2305

Closed
wants to merge 3 commits

Conversation

@mattf (Contributor) commented Sep 6, 2014

This is a partial solution to SPARK-1701, addressing only the documentation confusion.

Additional work would be to rename the numSlices parameter across languages, with care required in Scala and Python to preserve backward compatibility for callers that pass it as a named parameter.
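For illustration (not part of this patch), here is a minimal Scala sketch of why the rename needs care: the parameter can be passed by name, so call sites like the one below would stop compiling if numSlices became numPartitions. The app name and master URL are assumptions made only to keep the example runnable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NamedParamDemo {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val sc = new SparkContext(
      new SparkConf().setAppName("named-param-demo").setMaster("local[2]"))

    val data = Seq(1, 2, 3, 4, 5)

    // Positional form: unaffected by any parameter rename.
    val byPosition = sc.parallelize(data, 10)

    // Named form: valid against the current signature,
    //   def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
    // but it would break if numSlices were renamed to numPartitions.
    val byName = sc.parallelize(data, numSlices = 10)

    println(byPosition.partitions.length) // 10
    println(byName.partitions.length)     // 10
    sc.stop()
  }
}
```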

@SparkQA commented Sep 6, 2014

QA tests have started for PR 2305 at commit 7b045e0.

  • This patch merges cleanly.

@SparkQA commented Sep 6, 2014

QA tests have finished for PR 2305 at commit 7b045e0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mattf (Contributor, Author) commented Sep 11, 2014

@JoshRosen will you take a look at this?

@JoshRosen (Contributor) commented

Sorry for not reviewing this until now; it sort of fell off my radar.

```diff
@@ -286,7 +286,7 @@ We describe operations on distributed datasets later on.
 
 </div>
 
-One important parameter for parallel collections is the number of *slices* to cut the dataset into. Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`).
+One important parameter for parallel collections is the number of *partitions* to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: the parameter is called numSlices (not numPartitions) to maintain backward compatibility.
```

Maybe the "Note:" should mention that in some places we still say numSlices (for backwards compatibility with earlier versions of Spark) and that "slices" should be considered a synonym for "partitions"; there are a lot of places that use numPartitions, etc., so we may want to emphasize that this discrepancy only occurs in a few places.
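To make the guide's point concrete, a small spark-shell-style sketch (the master URL and sizes are assumptions, not taken from the PR) showing that numSlices simply sets the number of partitions:

```scala
// In spark-shell, sc is provided; a local[4] master is an assumption here.
val data = 1 to 100

// Let Spark choose the partition count; on local[4] the default
// parallelism is typically the core count, i.e. 4.
val auto = sc.parallelize(data)
println(auto.partitions.length)

// Set it explicitly via the second parameter; it is still named
// numSlices, but "slices" is just a synonym for "partitions".
val manual = sc.parallelize(data, 10)
println(manual.partitions.length) // 10
```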

@mattf (Contributor, Author) commented Sep 19, 2014

Thanks for the feedback. I've changed the language to be more in line with your suggestion.

@SparkQA commented Sep 19, 2014

QA tests have started for PR 2305 at commit c0af05d.

  • This patch merges cleanly.

@SparkQA commented Sep 19, 2014

QA tests have finished for PR 2305 at commit c0af05d.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mattf (Contributor, Author) commented Sep 19, 2014

> This patch fails unit tests.

I'm getting HTTP 503 from Jenkins, but I'm going to go out on a limb and say this doc change didn't break the unit tests.

@JoshRosen (Contributor) commented

I think that Jenkins might have crashed or restarted overnight, but it seems to be working now.

This looks good to me, so I'm going to merge it. Feel free to open similar PRs for other documentation improvements / clarifications, since these types of edits are really helpful.

asfgit closed this in be0c756 on Sep 19, 2014
mattf deleted the SPARK-1701 branch on Sep 19, 2014
ghost pushed a commit to dbtsai/spark that referenced this pull request on Apr 9, 2017 (title truncated: "…nd code?)")

## What changes were proposed in this pull request?

I came across the term "slice" when running some Spark Scala code. A Google search indicated that "slices" and "partitions" refer to the same thing; see:

- [This issue](https://issues.apache.org/jira/browse/SPARK-1701)
- [This pull request](apache#2305)
- [This StackOverflow answer](http://stackoverflow.com/questions/23436640/what-is-the-difference-between-an-rdd-partition-and-a-slice) and [this one](http://stackoverflow.com/questions/24269495/what-are-the-differences-between-slices-and-partitions-of-rdds)

This pull request fixes the occurrence of "slice" I came across. Nonetheless, [it would appear](https://github.com/apache/spark/search?utf8=%E2%9C%93&q=slice&type=) there are still many references to "slice/slices", so I thought I'd raise this pull request to address the issue (sorry if this is the wrong place; I'm not too familiar with raising Apache issues).

## How was this patch tested?

(Not tested locally - only a minor exception message change.)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: asmith26 <asmith26@users.noreply.github.com>

Closes apache#17565 from asmith26/master.