ARROW-15183: [Python][Docs] Add Missing Dataset Write Options #12112
Conversation
Thanks for adding these docs! 😄
I don't think code blocks are strictly necessary since you are just describing a function argument. It's sufficient to say something like
Set the maximum number of files opened with the ``max_open_files`` parameter of
:meth:`write_dataset`.
I think the important thing here is to explain the consequences of these configurations and give guidance on how to decide the optimal settings for a given use case. For example, this is the information in the C++ doc string for max_open_files:
If greater than 0 then this will limit the maximum number of files that can be left open. If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.
The default is 900 which also allows some # of files to be open by the scanner before hitting the default Linux limit of 1024
So it's probably worth explaining that if you get a "too many open files" error, you either need to increase the number of allowed file handles (commonly done on Linux) or reduce the max_open_files setting.
The C++ docs in that header file look pretty good so I would pull content from them as a starting point for the guidance.
@wjones127 thanks for the review, I will update the PR.
This is getting closer. I've added some suggestions to the guide. Ideally, a user who reads this will know what to set for these options for optimal performance on their workload.
docs/source/python/dataset.rst
Outdated
The default value is 900 which also allows some number of files to be open
by the scanner before hitting the default Linux limit of 1024. Modify this value
depending on the nature of write operations associated with the usage.
@westonpace does my understanding below sound correct? I know it's a little complicated with multi-threading
To mitigate the many-small-files problem caused by this limit, you can
also sort your data by the partition columns (assuming it is not already
sorted). This ensures that files are usually closed after all data for
their respective partition has been written.
Sorry I missed this. This should help. Multi threading does cause the write_dataset call to be "jittery" but not completely random so this would help with the small files problem though you might still get one or two here and there.
(in a mini-batch setting, where records are obtained batch by batch)
the volume of data written to disk per group can be configured.
This can be configured using a minimum and maximum parameter.
A few points worth discussing:
- Row groups matter for Parquet and Feather/IPC; they affect how data is seen by the reader, and because of row group statistics they can affect file size.
- Row groups are just batch size for CSV / JSON; the readers aren't affected.
My impression is that we have reasonable defaults for these values, and users generally won't want to set these. Can you think of examples where we would recommend users adjust these values?
I guess we can think of logging activities where online activities are monitored in windows (window aggregations) and summaries are logged by computing on those aggregated values. In such a scenario, depending on the accuracy required for the computation (if it is a learning task) and the required performance optimizations (execution time and memory), the user should be able to tune the parameter. This could be an interesting blog article if we can demonstrate it.
Those are good examples.
Could you add a paragraph discussing how row_groups affect later reads for Parquet and Feather/IPC, but not CSV or JSON?
@wjones127 Nice points. I will work on these ideas.
It looks like there are some changes in the testing submodule. Those shouldn't be there, right?
I have a few small suggestions on the docs, but otherwise this looks good.
@wjones127 I wasn't exactly sure about not committing the changes to the test submodule. I will check this.
@wjones127 I think this was a mistake from my end. Sorry about the confusion on committing the submodule.
docs/source/python/dataset.rst
Outdated
In addition, row_groups are a factor which impacts writes and reads of the Parquet, Feather and IPC
formats. The main purpose of these formats is to provide high-performance data structures
for I/O operations on larger datasets. The row_group concept allows the write/read operations
to be optimized to gather a defined number of rows at once and execute the I/O operation.
But row_groups are not integrated to support the JSON or CSV formats.
@wjones127 I added a small para on row-groups. Is this helpful?
I think it could use a little more direct advice to help users see the symptoms of when they've done something wrong. Here's my suggestion:
Row groups are built into the Parquet and IPC/Feather formats, but don't affect JSON or CSV. When reading back Parquet and IPC formats in Arrow, the row group boundaries become the record batch boundaries, determining the default batch size of downstream readers. Additionally, row groups in Parquet files have column statistics which can help readers skip irrelevant data but can add size to the file. As an extreme example, if one sets max_rows_per_group=1 in Parquet, they will have large files because most of the file will be row group statistics.
This one is much better. I replaced my content with this. @westonpace should we enhance further about CSV and JSON?
No, I think this is probably ok. Thinking on it further my guess is the user would assume these properties are just plain ignored if writing CSV or JSON which is (more or less) what happens. So I think this is clear enough.
Thanks for writing this up. This is good information to get to the users.
docs/source/python/dataset.rst
Outdated
For workloads writing a lot of data, files can get very large without a
row count cap, leading to out-of-memory errors in downstream readers. The
relationship between row count and file size depends on the dataset schema
and how well compressed (if at all) the data is. For most applications,
it's best to keep file sizes below 1GB.
As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large. Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.
Are we assuming downstream readers are necessarily Arrow? I suggested that based on my experience with Spark, which as I recall, read whole files.
Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
In retrospect, that guidance is a bit low. My previous heuristic target was between 50 MB per file at a minimum and 2 GB as a maximum. That might be more specific to a Spark / S3 context, so maybe it's not as appropriate here.
Ah, I have no experience with Spark so that could be entirely true. Maybe we could just change that sentence into "leading to out-of-memory errors in downstream readers that don't support partial-file reads"
docs/source/python/dataset.rst
Outdated
less than this value and other options such as ``max_open_files`` or
``max_rows_per_file`` lead to smaller row group sizes.
Suggested change:
- less than this value and other options such as ``max_open_files`` or
- ``max_rows_per_file`` lead to smaller row group sizes.
+ less than this value if other options such as ``max_open_files`` or
+ ``max_rows_per_file`` force smaller row group sizes.
I think it is an error if max_rows_per_file is less than min_rows_per_group.
docs/source/python/dataset.rst
Outdated
formats. The main purpose of these formats is to provide high-performance data structures
for I/O operations on larger datasets. The row_group concept allows the write/read operations
to be optimized to gather a defined number of rows at once and execute the I/O operation.
But row_groups are not integrated to support the JSON or CSV formats.
What happens if the dataset is JSON or CSV and this is set? Is it an error or is this property ignored?
I think this is a good explanation...
Let's delete that file size guidance for now. Otherwise I approve.
Co-authored-by: Will Jones <willjones127@gmail.com>
@wjones127 updated.
Benchmark runs are scheduled for baseline = fc9af3c and contender = 931907e. 931907e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR includes a minor documentation update showing how the max_open_files, min_rows_per_group and max_rows_per_group parameters can be used in the Python dataset API. The discussion on the issue: https://issues.apache.org/jira/browse/ARROW-15183