
ARROW-15150: [Doc] Add guidance on partitioning datasets #11970

Closed
wants to merge 6 commits

Conversation

wjones127
Member

@wjones127 wjones127 commented Dec 15, 2021

This guidance is here to help users avoid creating datasets that have poor partitioning structure. I've duplicated the same language in C++, R, and Python docs.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


Comment on lines 337 to 355
Partitioning performance considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Partitioning datasets can improve performance when reading them, but it has several
potential costs when reading and writing:

#. It can significantly increase the number of files to write. The number of partitions is a
   floor for the number of files in a dataset. If you partition a dataset by date with a
   year of data, you will have at least 365 files. If you further partition by another
   dimension with 1,000 unique values, you will have 365,000 files. This can make writing
   slower and increase the size of the overall dataset, because each file has some fixed
   overhead. For example, each file in a Parquet dataset contains the schema.
#. Multiple partitioning columns can produce deeply nested folder structures which are slow
   to navigate because they require many recursive "list directory" calls to discover files.
   These operations may be particularly expensive if you are using an object store
   filesystem such as S3. One workaround is to combine multiple columns into one for
   partitioning. For example, instead of a scheme like /year/month/day/ use /YYYY-MM-DD/.
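A minimal sketch of that combined-column workaround with the Python dataset API (the table, column, and path names are hypothetical; the same idea applies in C++ and R):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical table; the "date" strings could also be derived from a
# timestamp column (e.g. with pyarrow.compute.strftime) before writing.
table = pa.table({
    "date": ["2021-12-15", "2021-12-15", "2021-12-16"],
    "value": [1.0, 2.0, 3.0],
})

# One directory level (/date=YYYY-MM-DD/) instead of /year=.../month=.../day=.../
ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
)
```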


Member

Both of these are in the "cons" section. It might be worth adding a bit more body to "can improve the performance when reading datasets".

There are two advantages (but really only one):

  • We need multiple files to read in parallel.
  • Smaller partitions allow for more selective queries. E.g. we can load less data from the disk.

We should also mention (here or elsewhere) that everything that applies here for # of files also applies for # of record batches (or # of row groups in parquet). It's possible to have 1 file with way too many row groups and get similar performance issues.
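To make the read-side advantage concrete, here is a hedged Python sketch of a partition-pruned scan (the dataset path and column name are illustrative, assuming a hive-partitioned layout):

```python
import pyarrow.dataset as ds

# Hypothetical layout: my_dataset/date=YYYY-MM-DD/part-0.parquet
dataset = ds.dataset("my_dataset", format="parquet", partitioning="hive")

# Only files under date=2021-12-15 are scanned; the other partitions are
# pruned using their partition expressions, so less data is read from disk.
# Files that do match are still read in parallel.
table = dataset.to_table(filter=ds.field("date") == "2021-12-15")
```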

Member

Probably a reasonable "rule of thumb" might be to structure files so that each column of data is at least 4MB large. This is somewhat arbitrary when it comes to the data/metadata ratio, but 4MB is also around the point where an HDD's sequential vs. random read tradeoff starts to fall off. Although for bitmaps the implied requirement of 32 million rows can be a bit extreme / difficult to satisfy.
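Rough back-of-the-envelope arithmetic behind that rule of thumb (uncompressed, fixed-width types; illustrative numbers, not benchmarks):

```python
target_bytes = 4 * 1024 * 1024          # ~4 MB per column

rows_int64 = target_bytes // 8          # ~524k rows for an 8-byte column
rows_float32 = target_bytes // 4        # ~1M rows for a 4-byte column
rows_boolean = target_bytes * 8         # ~33.5M rows for a 1-bit boolean bitmap
```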

Member

I suppose compression complicates things too 🤷

Member Author

Good point on those two advantages. I will add that to the discussion.

Row group settings are (somewhat strangely) in the generic FileSystemDatasetWriteOptions, so it actually makes sense to discuss them around here.

I feel somewhat reluctant to specify overly exact rules of thumb for file size: (1) they might change for Arrow over time as performance improvements are made; (2) they may be very different depending on the use case (and compression, as you point out); and (3) they may vary depending on who the reader is (it might be Spark or something else rather than Arrow C++). What do you think about just pointing out cases that are pathological? For example, partitioning to file sizes less than a few MB means the overhead of the filesystem and the metadata outweighs any filtering speedups, and partitioning to file sizes of 2GB+ means not enough parallelism or OOM errors in many cases.

Member Author

@westonpace rewrote based on your feedback.

Member

Yes, the dataset writer has a lot of control now over row group sizing. This is because a batch with 10,000 rows might arrive at the dataset writer and get partitioned into 100 batches with 100 rows each, based on the partitioning keys. If those 100-row batches were delivered to the file writers, then the file writers would write tiny batches. Putting the "queue in memory until we have enough data" logic in the dataset writer instead of the file writers allowed us to solve that in one spot.

I agree on your thoughts for a specific limit and really like the idea of pointing out pathological cases.
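For anyone reading along, a hedged sketch of the sizing knobs this discussion is about, as exposed through the Python dataset writer (parameter availability depends on the pyarrow version, and the values are illustrative, not recommendations):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"date": ["2021-12-16"], "value": [1.0]})  # stand-in data

ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    partitioning=["date"],              # hypothetical partition column
    partitioning_flavor="hive",
    min_rows_per_group=64 * 1024,       # buffer rows so row groups aren't tiny
    max_rows_per_group=1024 * 1024,     # cap row group size
    max_rows_per_file=8 * 1024 * 1024,  # split very large partitions across files
)
```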

Member

@westonpace westonpace left a comment

Looking good. One annoying thing: the datasets documentation for C++ (which you have modified here) is very similar to the datasets documentation for Python (docs/source/python/dataset.rst) and R (r/vignettes/dataset.Rmd), so once you've finalized the wording here you will probably want to copy-paste it into the other docs.

Comment on lines 355 to 361
Partitioned datasets create nested folder structures, and those allow us to prune which
files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
as we'll need to recursively "list directory" to find the data files. These operations may
be particularly expensive if you are using an object store filesystem such as S3. Partitions
that are too fine can cause problems here: partitioning a dataset by date for a year's worth
of data will require 365 list calls to find all the files; adding another column with
cardinality 1,000 will make that 365,365 calls.
Member

Ironically, I think this is actually worse on local filesystems than it is on S3. S3 supports a recursive query (and we use it, I'm pretty sure) so we only actually do a single list directory call. Maybe just drop the "These operations may be...such as S3." line.

Member Author

Yeah, I'm probably wrong about that then. I've heard this idea in the Spark context, but it might not be the recursive part so much as pagination through a web API that makes this slow. I will remove this.

docs/source/cpp/dataset.rst (resolved comment on an outdated diff)
Comment on lines 365 to 366
range of file sizes and partitioning layouts, but there are extremes you should avoid. To
avoid pathological behavior, keep to these guidelines:
Member

I like this idea, but the wording makes it seem like you will always avoid bad behavior if you follow these rules. It's entirely possible to design a 50MB file with poor row groupings (although you do discuss that in the next paragraph). Maybe we could word it as "These guidelines can help avoid some known worst case situations" or something like that.

Member Author

That's a fair point. I can soften the language.

docs/source/cpp/dataset.rst (resolved comment on an outdated diff)
wjones127 and others added 2 commits December 16, 2021 13:54
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@wjones127 wjones127 changed the title [Docs][Minor] Add guidance on partitioning datasets [WIP] MINIOR: [Docs] Add guidance on partitioning datasets Dec 16, 2021
@wjones127 wjones127 marked this pull request as ready for review December 16, 2021 22:09
Member

@westonpace westonpace left a comment

Looks good to me, thanks for adding this

@westonpace
Member

I'm going to make a JIRA real quick since this is technically too large for the rules on minor PRs now that we are putting it in three places: https://github.com/apache/arrow/blob/5cabd31c90dbb32d87074928f68bf5d6e97e37c6/CONTRIBUTING.md#minor-fixes

@westonpace westonpace changed the title MINIOR: [Docs] Add guidance on partitioning datasets ARROW-15150: [Doc] Add guidance on partitioning datasets Dec 17, 2021
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@ursabot

ursabot commented Dec 17, 2021

Benchmark runs are scheduled for baseline = 670af33 and contender = 7cf7442. 7cf7442 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7cf74426 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/122
[Failed] 7cf74426 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/110
[Finished] 7cf74426 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/113
[Finished] 670af338 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/121
[Finished] 670af338 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/109
[Finished] 670af338 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/112
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ateucher
Contributor

Thanks for this! I opened https://issues.apache.org/jira/browse/ARROW-15069 after struggling with this issue. The response could have been (and would have been in other projects) "you're doing it wrong, not an issue", but instead you took it seriously and expanded the documentation. Thanks for your commitment to your community.

rafael-telles pushed a commit to rafael-telles/arrow that referenced this pull request Dec 20, 2021
This guidance is here to help users avoid creating datasets that have poor partitioning structure. I've duplicated the same language in C++, R, and Python docs.

Closes apache#11970 from wjones127/docs/partition-guidance

Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@ldacey

ldacey commented Feb 8, 2022

Any recommendations for a couple of scenarios that I run into often:

  1. Data gets downloaded often (hourly, every 30 minutes) and saved to a dataset. My current approach is to get the list of fragments which were written, then figure out the partitions (ds._get_partition_keys(frag.partition_expression)), read all of the data in those partitions, and then resave the dataset using "delete_matching". I suppose this question would be about data consolidation / small file cleanup best practices in general.

  2. Any thoughts on partitions based on ETL schedules? I have used an old Airflow article as a reference when considering how I partition data. This normally creates small files for a lot of sources, but one benefit is that I can always clear an Airflow task and it will overwrite any data which existed for that schedule run.

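A hedged sketch of the consolidation approach described in point 1, assuming a hive-partitioned Parquet dataset and a pyarrow version that supports existing_data_behavior (the paths, column name, and partition value are illustrative, and the rewrite is not atomic for concurrent readers):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("my_dataset", format="parquet", partitioning="hive")

# Read back everything in one partition (possibly many small files) as one table.
one_day = dataset.to_table(filter=ds.field("date") == "2022-02-08")

# Rewrite that partition as fewer, larger files; "delete_matching" removes the
# existing files in the partitions being written.
ds.write_dataset(
    one_day,
    "my_dataset",
    format="parquet",
    partitioning=["date"],
    partitioning_flavor="hive",
    basename_template="compacted-{i}.parquet",
    existing_data_behavior="delete_matching",
)
```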

@wjones127
Member Author

@ldacey Thanks for the question. Would you mind moving it to one of: (1) our GitHub issues section, (2) our user mailing list (user@arrow.apache.org), or (3) our Jira? We like to keep usage discussion in one of those three places so it's easier for others to find.
