[SPARK-18723][DOC] Expanded programming guide information on wholeTextFiles #16157
Conversation
## What changes were proposed in this pull request?

Add additional information to `wholeTextFiles` in the Programming Guide. Also explain the partitioning policy difference in relation to `textFile` and its impact on performance. Also added a reference to the underlying `CombineFileInputFormat`.

## How was this patch tested?

Manual build of documentation and inspection in browser:

```
cd docs
SKIP_API=1 jekyll serve --watch
```

Author: Michal Senkyr <mike.senkyr@gmail.com>
docs/programming-guide.md (outdated diff):

```diff
 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same node into a single split. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
```
(What is a 'node' here in the context of Spark -- executor? I'm also not sure this behavior is guaranteed, from reading the code and docs)
I don't know that the implementation detail matters here as much as what problem the end user might solve by setting this. You might instead say that this can lead to many small files in relatively few partitions, and this is why you might wish to set a minimum. That's kind of what the docs suggest already; maybe this is better as a tiny improvement to the scaladoc?
Yes, you are right. A 'node' should probably be an 'executor'. Also 'split' should probably be replaced by 'partition' to prevent confusion. I used Hadoop terminology because it is used in the linked Hadoop documentation, but Spark terminology may be more appropriate here.
As I understand it, this behavior is guaranteed as setting minPartitions directly sets maxSplitSize by dividing the length of the file (code). Please note that minSplitSize is only used for leftover data.
As the programming guide already contains quite a detailed description of partitioning on textFile, I thought it would make sense to mention it in wholeTextFiles as well. Especially when the behavior differs that much.
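The mechanism described above (minPartitions dividing the total input length to derive maxSplitSize, with whole files then packed into splits up to that budget) can be sketched in plain Scala. This is an illustrative simplification with made-up sizes and hypothetical names, not Spark's actual implementation:

```scala
// Illustrative sketch (not Spark's actual code): minPartitions bounds the
// split size by dividing the total input length, and whole files are then
// packed into splits up to that budget.
object SplitSketch {
  def groupFiles(fileSizes: Seq[Long], minPartitions: Int): Seq[Seq[Long]] = {
    val totalLength = fileSizes.sum
    // Larger minPartitions => smaller maxSplitSize => more splits.
    val maxSplitSize = math.ceil(totalLength.toDouble / minPartitions).toLong
    fileSizes.foldLeft(List(List.empty[Long])) { (splits, size) =>
      val current = splits.head
      if (current.nonEmpty && current.sum + size > maxSplitSize)
        List(size) :: splits          // budget exceeded: start a new split
      else
        (size :: current) :: splits   // pack the file into the current split
    }.reverse.map(_.reverse)
  }

  def main(args: Array[String]): Unit = {
    val files = Seq.fill(5)(10L)              // five files of equal size
    println(groupFiles(files, 2).length)      // prints 3: few, large splits
    println(groupFiles(files, 5).length)      // prints 5: one file per split
  }
}
```

This also shows why the default can surprise: with a low minPartitions, even several sizable files end up grouped into very few splits.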
A few more sentences in the docs here, and possibly the scaladoc, can't hurt. It would call attention to the fact that you may wish to set a minimum, and why you would do that.
docs/programming-guide.md (outdated diff):

```diff
 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
```
I'm still not sure this is accurate, that you "should" set this. It would be important to set in some contexts but not all, like lots of parameters. I still think promising a particular implementation here serves no purpose. This should just call attention to the parameter and when one would use it.
I added a few more sentences describing the cases in which the user might want to use the argument. However, I am afraid this might be a little too descriptive.
docs/programming-guide.md (outdated diff):

```diff
 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. This can lead to sub-optimal partitioning when the file sets would benefit from residing in multiple partitions (e.g., larger partitions would not fit in memory, files are replicated but a large subset is locally reachable from a single executor, subsequent transformations would benefit from multi-core processing). In those cases, set the `minPartitions` argument to enforce splitting.
```
I don't see value in the @note about the implementation - really it's an implementation detail. Adding additional info to doc about the parameter makes some sense, but it can just echo whatever note goes here.
I think this is a bit complicated. How about: "In some cases, the result may have relatively few partitions containing relatively many files. Consider increasing minPartitions in this case to prefer more, smaller partitions, which may allow for more efficient processing of the result."
When I and my colleagues encountered this, we didn't think that the implementation would be so different to textFile. I would at least like to add the bit about merging multiple files into a single partition because textFile doesn't do that.
How about just: "Additionally merges multiple files to reduce the number of partitions, which may lead to relatively few partitions. Consider increasing minPartitions if more, smaller partitions are needed."
Also, the number of files in this case is irrelevant. Even with, say, 5 large files we got only 2 partitions by default.
Every element of the result is a file; it's fundamentally different from textFile, and can't be that each file therefore ends up in a partition. I don't think this merges files, right?
Yes, it is different in what the elements are. However, there is no indication that the partitioning policy differs that much. I always understood textFile's partitioning policy as "we will split it if we can" up to individual blocks. wholeTextFiles' partitioning seems to be more like "we will merge it if we can" up to the executor boundary. The polar opposite.
This manifests in the way the developer has to think about handling partitions. In textFile it is generally safe to let Spark figure it all out without sacrificing much in performance, whereas in wholeTextFiles you may frequently run into performance problems due to having too few partitions.
An alternate solution in this case would be to unite the partitioning policies of the textFile and wholeTextFiles methods by having the latter also "split if we can". In this case up to the individual files (presently achievable by setting minPartitions to an arbitrary large number). Therefore each file would, by default, have its own partition. However, this approach would mean a significant change which might break existing applications.
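The "split if we can" versus "merge if we can" contrast can be illustrated with a plain Scala sketch. The sizes and function names here are made up for illustration; this is not Spark code:

```scala
// Hypothetical sketch contrasting the two partitioning policies discussed
// above. Sizes are arbitrary "units"; this is illustrative only.
object PolicySketch {
  // textFile-style "split if we can": each file is cut into block-sized
  // pieces, so the partition count grows with data size.
  def splitIfWeCan(fileSizes: Seq[Long], blockSize: Long): Int =
    fileSizes.map(s => ((s + blockSize - 1) / blockSize).toInt).sum

  // wholeTextFiles-style "merge if we can": whole files are packed together
  // up to a per-partition budget, so the partition count stays small.
  def mergeIfWeCan(fileSizes: Seq[Long], maxSplitSize: Long): Int =
    fileSizes.foldLeft((0, 0L)) { case ((parts, acc), size) =>
      if (acc > 0 && acc + size > maxSplitSize) (parts + 1, size)
      else (if (parts == 0) 1 else parts, acc + size)
    }._1

  def main(args: Array[String]): Unit = {
    val files = Seq(200L, 200L, 200L)    // three files of 200 units each
    println(splitIfWeCan(files, 64))     // prints 12: 4 block-sized pieces per file
    println(mergeIfWeCan(files, 600L))   // prints 1: all files fit one budget
  }
}
```

The same input yields many partitions under the first policy and very few under the second, which is the behavioral difference the doc change tries to call out.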
No, the difference is more fundamental. textFile returns lines of a file and wholeTextFiles returns entire contents of files. It is not a difference of partitioning. It's a different operation altogether.
I think you'd generally let data locality define partitions, so I don't agree that the partitioning determined by CombineFileInputFormat is "usually" inappropriate. You can of course change the partitioning if it happens to be.
As such if you want to update docs, I would stick to giving information that is actionable to the caller: if you end up with too few big partitions, increase this parameter. That's fairly well understood anyway but no harm in noting. Everything else doesn't seem like the right guidance.
Test build #3483 has started for PR 16157 at commit
OK, net-net I am OK merging this doc change even if it's not exactly what I might write. It does contain useful additional detail that's accurate at the moment.
Sorry for the delay. You are probably right that the partitioning is primarily determined by data locality and that it is therefore appropriate in some cases and shouldn't be worded such as to imply it isn't usually appropriate. I will think about how to word it more appropriately and remove the implementation specifics.
Test build #3505 has finished for PR 16157 at commit
Merged to master |
Closes apache#16157 from michalsenkyr/wholeTextFilesExpandedDocs.