ARROW-15150: [Doc] Add guidance on partitioning datasets #11970
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Could you also rename the pull request title in the following format?
docs/source/cpp/dataset.rst
Outdated
Partitioning performance considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Partitioning datasets can improve performance when reading them, but it has several
potential costs when reading and writing:

#. It can significantly increase the number of files to write. The number of partitions
   is a floor for the number of files in a dataset. If you partition a dataset by date
   with a year of data, you will have at least 365 files. If you further partition by
   another dimension with 1,000 unique values, you will have 365,000 files. This can make
   writing slower and increase the size of the overall dataset, because each file has
   some fixed overhead. For example, each file in a Parquet dataset contains the schema.
#. Multiple partitioning columns can produce deeply nested folder structures which are
   slow to navigate, because they require many recursive "list directory" calls to
   discover files. These operations may be particularly expensive if you are using an
   object store filesystem such as S3. One workaround is to combine multiple columns
   into one for partitioning. For example, instead of a layout like /year/month/day/,
   use /YYYY-MM-DD/.
Both of these are in the "cons" section. It might be worth adding a bit more body to "can improve the performance when reading datasets".
There are two advantages (but really only one):
- We need multiple files to read in parallel.
- Smaller partitions allow for more selective queries. E.g. we can load less data from the disk.
We should also mention (here or elsewhere) that everything that applies here to the number of files also applies to the number of record batches (or the number of row groups in Parquet). It's possible to have one file with way too many row groups and hit similar performance issues.
Probably a reasonable "rule of thumb" might be to structure files so that each column holds at least 4MB of data. This is somewhat arbitrary when it comes to the data/metadata ratio, but 4MB is also around the point where an HDD's sequential-vs-random-read tradeoff starts to fall off. Although for boolean bitmaps the implied 32 million rows can be a bit extreme / difficult to satisfy.
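The 32-million-row figure follows from simple arithmetic over uncompressed in-memory sizes (a sketch; the 4MB target is the rule of thumb under discussion, not a documented limit):

```python
TARGET = 4 * 1024 * 1024  # 4 MiB of data per column, per the rule of thumb

rows_int64 = TARGET // 8  # an int64 value takes 8 bytes
rows_bool = TARGET * 8    # a boolean bitmap takes 1 bit per value

print(rows_int64)  # 524288 rows
print(rows_bool)   # 33554432 rows -- the "32 million rows" for bitmaps
```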
I suppose compression complicates things too 🤷
Good point on those two advantages. I will add that to discussion.
Row group settings are (somewhat strangely) in the generic FileSystemDatasetWriteOptions, so it actually makes sense to discuss them around here.
I feel somewhat reluctant to specify overly exact rules of thumb for file size: (1) they might change for Arrow over time as performance improvements are made; (2) they may be very different depending on the use case (and compression, as you point out); and (3) they may vary depending on who the reader is (it might be Spark or something else rather than Arrow C++). What do you think about just pointing out cases that are pathological? For example, partitioning to file sizes of less than a few MB means the overhead of the filesystem and the metadata outweighs any filtering speedups, and partitioning to file sizes of 2GB+ means not enough parallelism or OOM errors in many cases.
@westonpace rewrote based on your feedback.
Yes, the dataset writer has a lot of control now over row group sizing. This is because a batch with 10,000 rows might arrive at the dataset writer and get partitioned into 100 batches of 100 rows based on the partitioning keys. If those 100-row batches were delivered to the file writers, then the file writers would write tiny row groups. Putting the "queue in memory until we have enough data" logic in the dataset writer instead of the file writers allowed us to solve that in one spot.
I agree on your thoughts for a specific limit and really like the idea of pointing out pathological cases.
Looking good. One annoying thing: the datasets documentation for C++ (which you have modified here) is very similar to the datasets documentation for Python (docs/source/python/dataset.rst) and R (r/vignettes/dataset.Rmd), so once you've finalized the wording here you will probably want to copy it into the other docs.
docs/source/cpp/dataset.rst
Outdated
Partitioned datasets create nested folder structures, and those allow us to prune which
files are loaded in a scan. However, this adds overhead to discovering files in the
dataset, as we'll need to recursively "list directory" to find the data files. These
operations may be particularly expensive if you are using an object store filesystem
such as S3. Overly fine partitioning can cause problems here: partitioning a dataset by
date for a year's worth of data will require 365 list calls to find all the files;
adding another column with cardinality 1,000 will make that 365,365 calls.
Ironically I think this is actually worse on local filesystems than it is on S3. S3 supports a recursive query (and we use it, I'm pretty sure) so we only actually do a single list directory call. Maybe just drop the "These operations may be...such as S3" line.
Yeah, I'm probably wrong about that then. I've heard this idea in the Spark context, but it might not be the recursion so much as pagination through a web API that makes this slow. I will remove this.
docs/source/cpp/dataset.rst
Outdated
range of file sizes and partitioning layouts, but there are extremes you should avoid. To
avoid pathological behavior, keep to these guidelines:
I like this idea but the wording seems like you will always avoid bad behavior if you follow these rules. It's entirely possible to design a 50MB file with poor row groupings (although you do discuss that in the next paragraph). Maybe we could word it as "These guidelines can help avoid some known worst case situations" or something like that.
That's a fair point. I can soften the language.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
ca8f8a8 to c8c1bac
Looks good to me, thanks for adding this
I'm going to make a JIRA real quick since this is technically too large for the rules on minor PRs now that we are putting it in three places: https://github.com/apache/arrow/blob/5cabd31c90dbb32d87074928f68bf5d6e97e37c6/CONTRIBUTING.md#minor-fixes
Benchmark runs are scheduled for baseline = 670af33 and contender = 7cf7442. 7cf7442 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Thanks for this! I opened https://issues.apache.org/jira/browse/ARROW-15069 while struggling with this issue. The response could have been (and would have been in other projects) "you're doing it wrong, not an issue", but instead you took it seriously and expanded the documentation. Thanks for your commitment to your community.
This guidance is here to help users avoid creating datasets that have poor partitioning structure. I've duplicated the same language in C++, R, and Python docs. Closes apache#11970 from wjones127/docs/partition-guidance Authored-by: Will Jones <willjones127@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Any recommendations for a couple of scenarios that I run into often:
@ldacey Thanks for the question. Would you mind moving it to either: (1) our GitHub issues section, (2) our user mailing list (user@arrow.apache.org), or (3) our Jira? We like to keep usage discussion in one of those three places so it's easier for others to find.