ARROW-15183: [Python][Docs] Add Missing Dataset Write Options #12112
Conversation
Thanks for adding these docs! 😄
I don't think code blocks are strictly necessary since you are just describing a function argument. It's sufficient to say something like
Set the maximum number of files opened with the ``max_open_files`` parameter of
:meth:`write_dataset`.
I think the important thing here is to explain the consequences of these configurations and give guidance on how to decide the optimal settings for a given use case. For example, this is the information in the C++ doc string for max_open_files:
If greater than 0 then this will limit the maximum number of files that can be left open. If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.
The default is 900 which also allows some # of files to be open by the scanner before hitting the default Linux limit of 1024
So it's probably worth explaining that if you get a "too many open files" error, you either need to increase the number of allowed file handles (commonly done on Linux) or reduce the max_open_files setting.
The C++ docs in that header file look pretty good so I would pull content from them as a starting point for the guidance.
@wjones127 thanks for the review, I will update the PR.
This is getting closer. I've added some suggestions to the guide. Ideally, a user who reads this will know what to set for these options for optimal performance on their workload.
docs/source/python/dataset.rst
Outdated
The default value is 900 which also allows some number of files to be open
by the scanner before hitting the default Linux limit of 1024. Modify this value
depending on the nature of write operations associated with the usage.
@westonpace does my understanding below sound correct? I know it's a little complicated with multi-threading
To mitigate the many-small-files problem caused by this limit, you can
also sort your data by the partition columns (assuming it is not already
sorted). This ensures that files are usually closed after all data for
their respective partition has been written.
Sorry I missed this. This should help. Multi threading does cause the write_dataset call to be "jittery" but not completely random so this would help with the small files problem though you might still get one or two here and there.
(in a mini-batch setting, where records are obtained batch by batch)
the volume of data written to disk per group can be configured.
This can be configured using a minimum and maximum parameter.
A few points worth discussing:
- Row groups matter for Parquet and Feather/IPC; they affect how data is seen by the reader, and because of row group statistics they can affect file size.
- Row groups are just batch size for CSV / JSON; the readers aren't affected.
My impression is that we have reasonable defaults for these values, and users generally won't want to set these. Can you think of examples where we would recommend users adjust these values?
I guess we can think of logging activities where online activities are monitored in windows (window aggregations) and summaries are logged by computing on those aggregated values. In such a scenario, depending on the accuracy required for the computation (if it is a learning task) and the required performance optimizations (execution time and memory), the user should be able to tune the parameter. This could be an interesting blog article if we can demonstrate it.
Those are good examples.
Could you add a paragraph discussing how row_groups affect later reads for Parquet and Feather/IPC, but not CSV or JSON?
@wjones127 Nice points. I will work on these ideas.
It looks like there are some changes in the testing submodule. Those shouldn't be there, right?
I have a few small suggestions on the docs, but otherwise this looks good.
@wjones127 I wasn't exactly sure about not committing the changes to the test submodule. I will check this.
@wjones127 I think this was a mistake from my end. Sorry about the confusion on committing the submodule.
docs/source/python/dataset.rst
Outdated
In addition, row_groups are a factor which impacts writes and reads of the Parquet, Feather and IPC
formats. The main purpose of these formats is to provide high-performance data structures
for I/O operations on larger datasets. The row_group concept allows the write/read operations
to be optimized to gather a defined number of rows at once and execute the I/O operation.
But row_groups are not integrated to support the JSON or CSV formats.
@wjones127 I added a small para on row-groups. Is this helpful?
I think it could use a little more direct advice to help users see the symptoms of when they've done something wrong. Here's my suggestion:
Row groups are built into the Parquet and IPC/Feather formats, but don't affect JSON or CSV. When reading back Parquet and IPC formats in Arrow, the row group boundaries become the record batch boundaries, determining the default batch size of downstream readers. Additionally, row groups in Parquet files have column statistics which can help readers skip irrelevant data but can add size to the file. As an extreme example, if one sets max_rows_per_group=1 in Parquet, they will have large files because most of the file will be row group statistics.
This one is much better. I replaced my content with this. @westonpace should we enhance further about CSV and JSON?
No, I think this is probably ok. Thinking on it further my guess is the user would assume these properties are just plain ignored if writing CSV or JSON which is (more or less) what happens. So I think this is clear enough.
Thanks for writing this up. This is good information to get to the users.
docs/source/python/dataset.rst
Outdated
For workloads writing a lot of data, files can get very large without a
row count cap, leading to out-of-memory errors in downstream readers. The
relationship between row count and file size depends on the dataset schema
and how well compressed (if at all) the data is. For most applications,
it's best to keep file sizes below 1GB.
As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large. Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.
Are we assuming downstream readers are necessarily Arrow? I suggested that based on my experience with Spark, which as I recall, read whole files.
Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
In retrospect, that guidance is a bit low. My previous heuristic target was between 50 MB per file at a minimum and 2 GB as a maximum. That might be more specific to a Spark / S3 context, so maybe it's not as appropriate here.
Ah, I have no experience with Spark so that could be entirely true. Maybe we could just change that sentence into "leading to out-of-memory errors in downstream readers that don't support partial-file reads"
docs/source/python/dataset.rst
Outdated
less than this value and other options such as ``max_open_files`` or
``max_rows_per_file`` lead to smaller row group sizes.
Suggested change:
- less than this value and other options such as ``max_open_files`` or
- ``max_rows_per_file`` lead to smaller row group sizes.
+ less than this value if other options such as ``max_open_files`` or
+ ``max_rows_per_file`` force smaller row group sizes.
I think it is an error if max_rows_per_file is less than min_rows_per_group.
docs/source/python/dataset.rst
Outdated
formats. The main purpose of these formats is to provide high-performance data structures
for I/O operations on larger datasets. The row_group concept allows the write/read operations
to be optimized to gather a defined number of rows at once and execute the I/O operation.
But row_groups are not integrated to support the JSON or CSV formats.
What happens if the dataset is JSON or CSV and this is set? Is it an error or is this property ignored?
I think this is a good explanation...
Let's delete that file size guidance for now. Otherwise I approve.
Co-authored-by: Will Jones <willjones127@gmail.com>
@wjones127 updated.
Benchmark runs are scheduled for baseline = fc9af3c and contender = 931907e. 931907e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR includes a minor documentation update showing how the max_open_files, min_rows_per_group and max_rows_per_group parameters can be used in the Python dataset API. The discussion on the issue: https://issues.apache.org/jira/browse/ARROW-15183