
ARROW-15183: [Python][Docs] Add Missing Dataset Write Options #12112

Closed

Conversation

vibhatha (Collaborator)

This PR includes a minor documentation update showing how the max_open_files, min_rows_per_group and max_rows_per_group parameters can be used in the Python dataset API.

The discussion on the issue: https://issues.apache.org/jira/browse/ARROW-15183
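
For context, here is a minimal sketch of the kind of usage the updated docs describe (the table, output directory, and parameter values below are placeholders, assuming a pyarrow version where write_dataset exposes these keyword arguments):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Toy table; any Table, Dataset, or batch source works here.
table = pa.table({"year": [2020, 2021, 2021], "value": [1.0, 2.0, 3.0]})

ds.write_dataset(
    table,
    "example_dataset",            # placeholder output directory
    format="parquet",
    partitioning=["year"],        # directory-partition on the year column
    max_open_files=512,           # cap the number of files held open while writing
    min_rows_per_group=10_000,    # accumulate at least this many rows per row group
    max_rows_per_group=100_000,   # never put more than this many rows in one group
)
```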

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127 (Member) left a comment

Thanks for adding these docs! 😄

I don't think code blocks are strictly necessary since you are just describing a function argument. It's sufficient to just say something like

Set the maximum number of files opened with the ``max_open_files`` parameter of
:meth:`write_dataset`.

I think the important thing here is to explain the consequences of these configurations and give guidance on how to decide what the optimal settings are for a given use case. For example, this is the information that is in the C++ doc string for max_open_files:

If greater than 0 then this will limit the maximum number of files that can be left open. If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.

The default is 900 which also allows some # of files to be open by the scanner before hitting the default Linux limit of 1024

So it's probably worth explaining that if you get a "too many open files" error, you either need to increase the number of allowed open file handles (commonly done on Linux) or reduce the max_open_files setting.

The C++ docs in that header file look pretty good so I would pull content from them as a starting point for the guidance.
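
To illustrate the two remedies mentioned above, a rough sketch (the resource module is Unix-only, and the table, directory, and limits chosen here are arbitrary):

```python
import resource

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

# Remedy 1: raise this process's soft open-file limit (up to the hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))

# Remedy 2: keep fewer files open at once while writing. Partitions whose file
# was closed early get a new file when more of their data arrives, which is
# why setting this too low can fragment data into many small files.
ds.write_dataset(
    table, "limited_out", format="parquet",
    partitioning=["year"],
    max_open_files=256,
)
```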

@vibhatha (Collaborator Author)

@wjones127 thanks for the review, I will update the PR.

@wjones127 (Member) left a comment

This is getting closer. I've added some suggestions to the guide. Ideally, a user who reads this will know what they need to set for these options for optimal performance on their workload.

The default value is 900, which also allows some number of files to be open
by the scanner before hitting the default Linux limit of 1024. Modify this value
depending on the nature of the write operations in your workload.

Member

@westonpace does my understanding below sound correct? I know it's a little complicated with multi-threading

Suggested change
To mitigate the many-small-files problem caused by this limit, you can
also sort your data by the partition columns (assuming it is not already
sorted). This ensures that files are usually closed after all data for
their respective partition has been written.

Member

Sorry I missed this. This should help. Multi-threading does cause the write_dataset call to be "jittery", but not completely random, so this would help with the small-files problem, though you might still get one or two here and there.
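
For reference, a small sketch of the sorting approach described in the suggestion (assuming a pyarrow version where Table.sort_by is available; the column names and output directory are made up):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "year":  [2021, 2020, 2021, 2020],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Sorting by the partition column keeps each partition's rows together, so the
# writer can usually finish and close a partition's file before moving on.
sorted_table = table.sort_by([("year", "ascending")])

ds.write_dataset(sorted_table, "sorted_out", format="parquet", partitioning=["year"])
```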

(in a mini-batch setting, where records are obtained batch by batch)
the volume of data written to disk per group can be configured.
This can be controlled using a minimum and a maximum parameter.

Member

A few points worth discussing:

  • Row groups matter for Parquet and Feather/IPC; they affect how data is seen by readers and, because of row-group statistics, can affect file size.
  • Row groups are just a batch size for CSV / JSON; the readers aren't affected.

My impression is that we have reasonable defaults for these values, and users generally won't want to set them. Can you think of examples where we would recommend users adjust these values?

Collaborator Author

I guess we can think of logging activities where online activity is monitored in windows (window aggregations) and summaries are logged by computing on those aggregated values. In such a scenario, depending on the accuracy required for the computation (if it is a learning task) and the required performance optimizations (execution time and memory), users should be able to tune these parameters. This could be an interesting blog article if we can demonstrate it.
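
As an illustration of that scenario, a sketch of batch-by-batch writing with tuned row-group sizes (the generator, column names, directory, and sizes are hypothetical; this assumes write_dataset accepts an iterable of record batches together with a schema, as in recent pyarrow releases):

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("window_id", pa.int64()), ("count", pa.int64())])

def window_summaries():
    # Stand-in for window-aggregation summaries arriving in small batches.
    for window_id in range(100):
        yield pa.record_batch(
            [pa.array([window_id] * 50), pa.array(list(range(50)))],
            schema=schema,
        )

ds.write_dataset(
    window_summaries(), "log_summaries", format="parquet", schema=schema,
    min_rows_per_group=2_000,   # coalesce many tiny batches into one row group
    max_rows_per_group=50_000,  # keep groups small enough for selective reads
)
```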

Member

Those are good examples.

Could you add a paragraph discussing how row_groups affect later reads for Parquet and Feather/IPC, but not CSV or JSON?

@vibhatha (Collaborator Author)

@wjones127 Nice points. I will work on these ideas.

@wjones127 (Member) left a comment

It looks like there are some changes in the testing submodule. Those shouldn't be there, right?

I have a few small suggestions on the docs, but otherwise this looks good.


@vibhatha (Collaborator Author)

@wjones127 I wasn't exactly sure about not committing the changes to the test submodule. I will check this.

@vibhatha (Collaborator Author)

@wjones127 I think this was a mistake from my end. Sorry about the confusion on committing the submodule.
I corrected it.

Comment on lines 681 to 685
In addition, row_groups are a factor which impacts the write/read of the Parquet, Feather and IPC
formats. The main purpose of these formats is to provide high-performance data structures
for I/O operations on larger datasets. The row_group concept allows the write/read operations
to be optimized to gather a defined number of rows at once and execute the I/O operation.
Row_groups are not integrated to support the JSON or CSV formats.
Collaborator Author

@wjones127 I added a small para on row-groups. Is this helpful?

Member

I think it could use a little more direct advice to help users see the symptoms of when they've done something wrong. Here's my suggestion:

Row groups are built into the Parquet and IPC/Feather formats, but don't affect JSON or CSV. When reading back Parquet and IPC formats in Arrow, the row group boundaries become the record batch boundaries, determining the default batch size of downstream readers. Additionally, row groups in Parquet files have column statistics which can help readers skip irrelevant data but can add size to the file. As an extreme example, if one sets max_rows_per_group=1 in Parquet, they will have large files because most of the file will be row group statistics.

Collaborator Author

This one is much better. I replaced my content with this. @westonpace should we enhance further about CSV and JSON?

Member

No, I think this is probably ok. Thinking on it further, my guess is the user would assume these properties are just plain ignored if writing CSV or JSON, which is (more or less) what happens. So I think this is clear enough.
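
As a quick way to see the row-group effect described above when reading back Parquet, here is a sketch (output directory, row counts, and group sizes are arbitrary):

```python
import pathlib

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10_000))})
ds.write_dataset(table, "rg_demo", format="parquet",
                 min_rows_per_group=1_000, max_rows_per_group=1_000)

# Each Parquet row group becomes a record batch boundary when read back.
path = next(pathlib.Path("rg_demo").rglob("*.parquet"))
meta = pq.ParquetFile(path).metadata
print(meta.num_row_groups,
      [meta.row_group(i).num_rows for i in range(meta.num_row_groups)])
```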

@westonpace self-requested a review March 21, 2022 17:17
@westonpace (Member) left a comment

Thanks for writing this up. This is good information to get to the users.

Comment on lines 651 to 655
For workloads writing a lot of data, files can get very large without a
row count cap, leading to out-of-memory errors in downstream readers. The
relationship between row count and file size depends on the dataset schema
and how well compressed (if at all) the data is. For most applications,
it's best to keep file sizes below 1GB.
Member

As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large. Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?

@wjones127 (Member) Mar 22, 2022

As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.

Are we assuming downstream readers are necessarily Arrow? I suggested that based on my experience with Spark, which, as I recall, reads whole files.

Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?

In retrospect, that guidance is a bit low. My previous heuristic target was between 50 MB per file at a minimum and 2 GB as a maximum. That might be more specific to a Spark / S3 context, so maybe not as appropriate here.

@westonpace (Member) Mar 28, 2022

Ah, I have no experience with Spark so that could be entirely true. Maybe we could just change that sentence into "leading to out-of-memory errors in downstream readers that don't support partial-file reads"
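
For illustration, a sketch of capping file sizes by row count (directory name and numbers are arbitrary; this assumes a pyarrow version exposing max_rows_per_file on write_dataset):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(100_000))})

# Cap rows per file so no single output file grows unboundedly; readers that
# load whole files (no partial-file reads) then see a bounded amount of data.
ds.write_dataset(
    table, "capped_out", format="parquet",
    max_rows_per_file=25_000,
    max_rows_per_group=10_000,  # keep row groups comfortably below the file cap
)
```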

Comment on lines 671 to 672
less than this value and other options such as ``max_open_files`` or
``max_rows_per_file`` lead to smaller row group sizes.
Member

Suggested change
less than this value and other options such as ``max_open_files`` or
``max_rows_per_file`` lead to smaller row group sizes.
less than this value if other options such as ``max_open_files`` or
``max_rows_per_file`` force smaller row group sizes.

I think it is an error if max_rows_per_file is less than min_rows_per_group.

Comment on lines 682 to 685
formats. The main purpose of these formats is to provide high-performance data structures
for I/O operations on larger datasets. The row_group concept allows the write/read operations
to be optimized to gather a defined number of rows at once and execute the I/O operation.
Row_groups are not integrated to support the JSON or CSV formats.
Member

What happens if the dataset is JSON or CSV and this is set? Is it an error or is this property ignored?

Collaborator Author

I think this is a good explanation...

#12112 (comment)

@wjones127 (Member) left a comment

Let's delete that file size guidance for now. Otherwise I approve.

Co-authored-by: Will Jones <willjones127@gmail.com>
@vibhatha (Collaborator Author)

Let's delete that file size guidance for now. Otherwise I approve.

@wjones127 updated.

@ursabot commented Apr 15, 2022

Benchmark runs are scheduled for baseline = fc9af3c and contender = 931907e. 931907e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.25% ⬆️0.04%] test-mac-arm
[Failed ⬇️3.93% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/508| 931907e9 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/495| 931907e9 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/494| 931907e9 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/505| 931907e9 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/507| fc9af3cd ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/494| fc9af3cd test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/493| fc9af3cd ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/504| fc9af3cd ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
