
ARROW-15150: [Doc] Add guidance on partitioning datasets #11970

Closed
wants to merge 6 commits

Conversation

wjones127
Member

@wjones127 wjones127 commented Dec 15, 2021

This guidance is here to help users avoid creating datasets that have poor partitioning structure. I've duplicated the same language in C++, R, and Python docs.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


Comment on lines 337 to 355
Partitioning performance considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Partitioning datasets can improve performance when reading them, but it has several
potential costs when reading and writing:

#. It can significantly increase the number of files to write. The number of partitions is a
   floor for the number of files in a dataset. If you partition a dataset by date with a
   year of data, you will have at least 365 files. If you further partition by another
   dimension with 1,000 unique values, you will have 365,000 files. This can make writing
   slower and increase the size of the overall dataset, because each file has some fixed
   overhead. For example, each file in a Parquet dataset contains the schema.
#. Multiple partitioning columns can produce deeply nested folder structures which are slow
   to navigate because they require many recursive "list directory" calls to discover files.
   These operations may be particularly expensive if you are using an object store
   filesystem such as S3. One workaround is to combine multiple columns into one for
   partitioning. For example, instead of a scheme like /year/month/day/ use /YYYY-MM-DD/.
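A minimal sketch of that combined-column workaround with the Python dataset API (the table, column, and path names are hypothetical; the same idea applies in C++ and R):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical table; the "date" strings could also be derived from a
# timestamp column (e.g. with pyarrow.compute.strftime) before writing.
table = pa.table({
    "date": ["2021-12-15", "2021-12-15", "2021-12-16"],
    "value": [1.0, 2.0, 3.0],
})

# One directory level (/date=YYYY-MM-DD/) instead of /year=.../month=.../day=.../
ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
)
```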


Member

Both of these are in the "cons" section. It might be worth adding a bit more body to "can improve the performance when reading datasets".

There are two advantages (but really only one):

  • We need multiple files to read in parallel.
  • Smaller partitions allow for more selective queries. E.g. we can load less data from the disk.

We should also mention (here or elsewhere) that everything that applies here for # of files also applies for # of record batches (or # of row groups in parquet). It's possible to have 1 file with way too many row groups and get similar performance issues.
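To make the read-side advantage concrete, here is a hedged Python sketch of a partition-pruned scan (the dataset path and column name are illustrative, assuming a hive-partitioned layout):

```python
import pyarrow.dataset as ds

# Hypothetical layout: my_dataset/date=YYYY-MM-DD/part-0.parquet
dataset = ds.dataset("my_dataset", format="parquet", partitioning="hive")

# Only files under date=2021-12-15 are scanned; the other partitions are
# pruned using their partition expressions, so less data is read from disk.
# Files that do match are still read in parallel.
table = dataset.to_table(filter=ds.field("date") == "2021-12-15")
```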

Member

Probably a reasonable "rule of thumb" might be to structure files so that each column of data is at least 4MB large. This is somewhat arbitrary when it comes to the data/metadata ratio, but 4MB is also around the point where an HDD's sequential vs. random read tradeoff starts to fall off. Although for bitmaps the implied requirement of 32 million rows can be a bit extreme / difficult to satisfy.
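Rough back-of-the-envelope arithmetic behind that rule of thumb (uncompressed, fixed-width types; illustrative numbers, not benchmarks):

```python
target_bytes = 4 * 1024 * 1024          # ~4 MB per column

rows_int64 = target_bytes // 8          # ~524k rows for an 8-byte column
rows_float32 = target_bytes // 4        # ~1M rows for a 4-byte column
rows_boolean = target_bytes * 8         # ~33.5M rows for a 1-bit boolean bitmap
```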

Member

I suppose compression complicates things too 🤷

Member Author

Good point on those two advantages. I will add that to the discussion.

Row group settings are (somewhat strangely) in the generic FileSystemDatasetWriteOptions, so it actually makes sense to discuss them around here.

I feel somewhat reluctant to specify overly exact rules of thumb for file size: (1) they might change for Arrow over time as performance improvements are made; (2) they may be very different depending on the use case (and compression, as you point out); and (3) they may vary depending on who the reader is (it might be Spark or something else rather than Arrow C++). What do you think about just pointing out cases that are pathological? For example, partitioning to file sizes less than a few MB means the overhead of the filesystem and the metadata outweighs any filtering speedups, and partitioning to file sizes of 2GB+ means not enough parallelism or OOM errors in many cases.

Member Author

@westonpace rewrote based on your feedback.

Member

Yes, the dataset writer has a lot of control now over row group sizing. This is because a batch with 10,000 rows might arrive at the dataset writer and get partitioned into 100 batches with 100 rows each, based on the partitioning keys. If those 100-row batches were delivered to the file writers, then the file writers would write tiny batches. Putting the "queue in memory until we have enough data" logic in the dataset writer instead of the file writers allowed us to solve that in one spot.

I agree on your thoughts for a specific limit and really like the idea of pointing out pathological cases.
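For anyone reading along, a hedged sketch of the sizing knobs this discussion is about, as exposed through the Python dataset writer (parameter availability depends on the pyarrow version, and the values are illustrative, not recommendations):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"date": ["2021-12-16"], "value": [1.0]})  # stand-in data

ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    partitioning=["date"],              # hypothetical partition column
    partitioning_flavor="hive",
    min_rows_per_group=64 * 1024,       # buffer rows so row groups aren't tiny
    max_rows_per_group=1024 * 1024,     # cap row group size
    max_rows_per_file=8 * 1024 * 1024,  # split very large partitions across files
)
```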

Member

@westonpace westonpace left a comment

Looking good. One annoying thing: the datasets documentation for C++ (which you have modified here) is very similar to the datasets documentation for Python (docs/source/python/dataset.rst) and R (r/vignettes/dataset.Rmd), so once you've finalized the wording here you will probably want to copy-paste it into the other docs.

Comment on lines 355 to 361
Partitioned datasets create nested folder structures, and those allow us to prune which
files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
as we'll need to recursively "list directory" to find the data files. These operations may
be particularly expensive if you are using an object store filesystem such as S3. Partitions
that are too fine can cause problems here: partitioning a dataset by date for a year's worth
of data will require 365 list calls to find all the files; adding another column with
cardinality 1,000 will make that 365,365 calls.
Member

Ironically, I think this is actually worse on local filesystems than it is on S3. S3 supports a recursive query (and we use it, I'm pretty sure) so we only actually do a single list directory call. Maybe just drop the "These operations may be...such as S3." line.

Member Author

Yeah, I'm probably wrong about that then. I've heard this idea in the Spark context, but it might not be the recursive part so much as pagination through a web API that makes this slow. I will remove this.

docs/source/cpp/dataset.rst (resolved comment on an outdated diff)
Comment on lines 365 to 366
range of file sizes and partitioning layouts, but there are extremes you should avoid. To
avoid pathological behavior, keep to these guidelines:
Member

I like this idea, but the wording makes it seem like you will always avoid bad behavior if you follow these rules. It's entirely possible to design a 50MB file with poor row groupings (although you do discuss that in the next paragraph). Maybe we could word it as "These guidelines can help avoid some known worst case situations" or something like that.

Member Author

That's a fair point. I can soften the language.

docs/source/cpp/dataset.rst (resolved comment on an outdated diff)
wjones127 and others added 2 commits December 16, 2021 13:54
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@wjones127 wjones127 changed the title [Docs][Minor] Add guidance on partitioning datasets [WIP] MINIOR: [Docs] Add guidance on partitioning datasets Dec 16, 2021
@wjones127 wjones127 marked this pull request as ready for review December 16, 2021 22:09
Member

@westonpace westonpace left a comment

Looks good to me, thanks for adding this

@westonpace
Member

I'm going to make a JIRA real quick since this is technically too large for the rules on minor PRs now that we are putting it in three places: https://github.com/apache/arrow/blob/5cabd31c90dbb32d87074928f68bf5d6e97e37c6/CONTRIBUTING.md#minor-fixes

@westonpace westonpace changed the title MINIOR: [Docs] Add guidance on partitioning datasets ARROW-15150: [Doc] Add guidance on partitioning datasets Dec 17, 2021
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@ursabot

ursabot commented Dec 17, 2021

Benchmark runs are scheduled for baseline = 670af33 and contender = 7cf7442. 7cf7442 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7cf74426 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/122
[Failed] 7cf74426 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/110
[Finished] 7cf74426 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/113
[Finished] 670af338 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/121
[Finished] 670af338 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/109
[Finished] 670af338 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/112
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ateucher
Contributor

Thanks for this! I opened https://issues.apache.org/jira/browse/ARROW-15069 after struggling with this issue. The response could have been (and would have been in other projects) "you're doing it wrong, not an issue", but instead you took it seriously and expanded the documentation. Thanks for your commitment to your community.

rafael-telles pushed a commit to rafael-telles/arrow that referenced this pull request Dec 20, 2021
This guidance is here to help users avoid creating datasets that have poor partitioning structure. I've duplicated the same language in C++, R, and Python docs.

Closes apache#11970 from wjones127/docs/partition-guidance

Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@ldacey

ldacey commented Feb 8, 2022

Any recommendations for a couple of scenarios that I run into often:

  1. Data gets downloaded often (hourly, every 30 minutes) and saved to a dataset. My current approach is to get the list of fragments which were written, then figure out the partitions (ds._get_partition_keys(frag.partition_expression)), read all of the data in those partitions, and then resave the dataset using "delete_matching". I suppose this question would be about data consolidation / small file cleanup best practices in general.

  2. Any thoughts on partitions based on ETL schedules? I have used an old Airflow article as a reference when considering how I partition data. This normally creates small files for a lot of sources, but one benefit is that I can always clear an Airflow task and it will overwrite any data which existed for that schedule run.

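A hedged sketch of the consolidation approach described in point 1, assuming a hive-partitioned Parquet dataset and a pyarrow version that supports existing_data_behavior (the paths, column name, and partition value are illustrative, and the rewrite is not atomic for concurrent readers):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("my_dataset", format="parquet", partitioning="hive")

# Read back everything in one partition (possibly many small files) as one table.
one_day = dataset.to_table(filter=ds.field("date") == "2022-02-08")

# Rewrite that partition as fewer, larger files; "delete_matching" removes the
# existing files in the partitions being written.
ds.write_dataset(
    one_day,
    "my_dataset",
    format="parquet",
    partitioning=["date"],
    partitioning_flavor="hive",
    basename_template="compacted-{i}.parquet",
    existing_data_behavior="delete_matching",
)
```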

@wjones127
Member Author

@ldacey Thanks for the question. Would you mind moving it to one of: (1) our GitHub issues section, (2) our user mailing list (user@arrow.apache.org), or (3) our Jira? We like to keep usage discussion in one of those three places so it's easier for others to find.
