[SPARK-18723][DOC] Expanded programming guide information on wholeTextFiles #16157
Conversation
## What changes were proposed in this pull request?

Add additional information to `wholeTextFiles` in the Programming Guide. Also explain the partitioning policy difference in relation to `textFile` and its impact on performance. Also added a reference to the underlying `CombineFileInputFormat`.

## How was this patch tested?

Manual build of documentation and inspection in browser:

```
cd docs
SKIP_API=1 jekyll serve --watch
```

Author: Michal Senkyr <mike.senkyr@gmail.com>
docs/programming-guide.md (outdated diff):

```diff
 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same node into a single split. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
```
(What is a 'node' here in the context of Spark -- executor? I'm also not sure this behavior is guaranteed, from reading the code and docs)
I don't know that the implementation detail matters here as much as what problem the end user might solve by setting this. You might instead say that this can lead to many small files in relatively few partitions, and this is why you might wish to set a minimum. That's kind of what the docs suggest already; maybe this is better as a tiny improvement to the scaladoc?
Yes, you are right. A 'node' should probably be an 'executor'. Also 'split' should probably be replaced by 'partition' to prevent confusion. I used Hadoop terminology because it is used in the linked Hadoop documentation, but Spark terminology may be more appropriate here.
As I understand it, this behavior is guaranteed as setting minPartitions directly sets maxSplitSize by dividing the length of the file (code). Please note that minSplitSize is only used for leftover data.
As the programming guide already contains quite a detailed description of partitioning on textFile, I thought it would make sense to mention it in wholeTextFiles as well. Especially when the behavior differs that much.
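The mechanism described above (minPartitions dividing the total input length to derive maxSplitSize, with whole files then packed into splits up to that budget) can be sketched in plain Scala. This is an illustrative simplification with made-up sizes and hypothetical names, not Spark's actual implementation:

```scala
// Illustrative sketch (not Spark's actual code): minPartitions bounds the
// split size by dividing the total input length, and whole files are then
// packed into splits up to that budget.
object SplitSketch {
  def groupFiles(fileSizes: Seq[Long], minPartitions: Int): Seq[Seq[Long]] = {
    val totalLength = fileSizes.sum
    // Larger minPartitions => smaller maxSplitSize => more splits.
    val maxSplitSize = math.ceil(totalLength.toDouble / minPartitions).toLong
    fileSizes.foldLeft(List(List.empty[Long])) { (splits, size) =>
      val current = splits.head
      if (current.nonEmpty && current.sum + size > maxSplitSize)
        List(size) :: splits          // budget exceeded: start a new split
      else
        (size :: current) :: splits   // pack the file into the current split
    }.reverse.map(_.reverse)
  }

  def main(args: Array[String]): Unit = {
    val files = Seq.fill(5)(10L)              // five files of equal size
    println(groupFiles(files, 2).length)      // prints 3: few, large splits
    println(groupFiles(files, 5).length)      // prints 5: one file per split
  }
}
```

This also shows why the default can surprise: with a low minPartitions, even several sizable files end up grouped into very few splits.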
A few more sentences in the docs here, and possibly the scaladoc, can't hurt. It would call attention to the fact that you may wish to set a minimum, and why you would do that.
docs/programming-guide.md (outdated diff):

```diff
 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
```
I'm still not sure this is accurate, that you "should" set this. It would be important to set in some contexts but not all, like lots of parameters. I still think promising a particular implementation here serves no purpose. This should just call attention to the parameter and when one would use it.
I added a few more sentences describing the cases in which the user might want to use the argument. However, I am afraid this might be a little too descriptive.
docs/programming-guide.md (outdated diff):

```diff
 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. This can lead to sub-optimal partitioning when the file sets would benefit from residing in multiple partitions (e.g., larger partitions would not fit in memory, files are replicated but a large subset is locally reachable from a single executor, subsequent transformations would benefit from multi-core processing). In those cases, set the `minPartitions` argument to enforce splitting.
```
I don't see value in the @note about the implementation - really it's an implementation detail. Adding additional info to doc about the parameter makes some sense, but it can just echo whatever note goes here.
I think this is a bit complicated. How about: "In some cases, the result may have relatively few partitions containing relatively many files. Consider increasing minPartitions in this case to prefer more, smaller partitions, which may allow for more efficient processing of the result."
When I and my colleagues encountered this, we didn't think that the implementation would be so different to textFile. I would at least like to add the bit about merging multiple files into a single partition because textFile doesn't do that.
How about just: "Additionally merges multiple files to reduce the number of partitions, which may lead to relatively few partitions. Consider increasing minPartitions if more, smaller partitions are needed."
Also, the number of files in this case is irrelevant. Even with, say, 5 large files we got only 2 partitions by default.
Every element of the result is a file; it's fundamentally different from textFile, and can't be that each file therefore ends up in a partition. I don't think this merges files, right?
Yes, it is different in what the elements are. However, there is no indication that the partitioning policy differs that much. I always understood textFile's partitioning policy as "we will split it if we can" up to individual blocks. wholeTextFiles' partitioning seems to be more like "we will merge it if we can" up to the executor boundary. The polar opposite.
This manifests in the way the developer has to think about handling partitions. In textFile it is generally safe to let Spark figure it all out without sacrificing much in performance, whereas in wholeTextFiles you may frequently run into performance problems due to having too few partitions.
An alternate solution in this case would be to unite the partitioning policies of the textFile and wholeTextFiles methods by having the latter also "split if we can". In this case up to the individual files (presently achievable by setting minPartitions to an arbitrary large number). Therefore each file would, by default, have its own partition. However, this approach would mean a significant change which might break existing applications.
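The "split if we can" versus "merge if we can" contrast can be illustrated with a plain Scala sketch. The sizes and function names here are made up for illustration; this is not Spark code:

```scala
// Hypothetical sketch contrasting the two partitioning policies discussed
// above. Sizes are arbitrary "units"; this is illustrative only.
object PolicySketch {
  // textFile-style "split if we can": each file is cut into block-sized
  // pieces, so the partition count grows with data size.
  def splitIfWeCan(fileSizes: Seq[Long], blockSize: Long): Int =
    fileSizes.map(s => ((s + blockSize - 1) / blockSize).toInt).sum

  // wholeTextFiles-style "merge if we can": whole files are packed together
  // up to a per-partition budget, so the partition count stays small.
  def mergeIfWeCan(fileSizes: Seq[Long], maxSplitSize: Long): Int =
    fileSizes.foldLeft((0, 0L)) { case ((parts, acc), size) =>
      if (acc > 0 && acc + size > maxSplitSize) (parts + 1, size)
      else (if (parts == 0) 1 else parts, acc + size)
    }._1

  def main(args: Array[String]): Unit = {
    val files = Seq(200L, 200L, 200L)    // three files of 200 units each
    println(splitIfWeCan(files, 64))     // prints 12: 4 block-sized pieces per file
    println(mergeIfWeCan(files, 600L))   // prints 1: all files fit one budget
  }
}
```

The same input yields many partitions under the first policy and very few under the second, which is the behavioral difference the doc change tries to call out.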
No, the difference is more fundamental. textFile returns lines of a file and wholeTextFiles returns entire contents of files. It is not a difference of partitioning. It's a different operation altogether.
I think you'd generally let data locality define partitions, so I don't agree that the partitioning determined by CombineFileInputFormat is "usually" inappropriate. You can of course change the partitioning if it happens to be.
As such if you want to update docs, I would stick to giving information that is actionable to the caller: if you end up with too few big partitions, increase this parameter. That's fairly well understood anyway but no harm in noting. Everything else doesn't seem like the right guidance.
Test build #3483 has started for PR 16157 at commit
OK, net-net I am OK merging this doc change even if it's not exactly what I might write. It does contain useful additional detail that's accurate at the moment.
Sorry for the delay. You are probably right that the partitioning is primarily determined by data locality and that it is therefore appropriate in some cases and shouldn't be worded such as to imply it isn't usually appropriate. I will think about how to word it more appropriately and remove the implementation specifics.
Test build #3505 has finished for PR 16157 at commit
Merged to master |
Closes apache#16157 from michalsenkyr/wholeTextFilesExpandedDocs.