[C++] Add a minimum_row_group_size to dataset writing #29990

asfimport · 2021-10-21T19:21:07Z

Right now we write whatever chunks we get, but if those chunks are exceptionally small, we should bundle them up and write them out once they reach a configurable minimum row group size.

Reporter: Jonathan Keane / @jonkeane
Assignee: Weston Pace / @westonpace

PRs and other links:

GitHub Pull Request #11556

Note: This issue was originally created as ARROW-14426. Please see the migration documentation for further details.

asfimport · 2021-10-22T01:48:45Z

Weston Pace / @westonpace:
So things can get a little tricky in certain situations. For example, if min_row_group_size is 1M and max_rows_queued is 64M, and you just so happen to have 900k rows per file and are creating 100 files, then you would end up in deadlock: no file ever reaches min_row_group_size, so nothing gets written, and the roughly 90M accumulated rows hit the max_rows_queued limit.

Even if you were only creating 50 files it would still be non-ideal because none of the writes would happen until the entire dataset had accumulated in memory.
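The arithmetic behind those two cases can be checked with a small model. This is only a sketch under stated assumptions, not Arrow code: `Outcome` and `Classify` are hypothetical names, "1M" and "64M" are read as `1 << 20` and `64 << 20`, and the model assumes a file's rows are held back until that file alone reaches the minimum.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not Arrow code: every file receives `rows_per_file`
// rows in total, and a file's rows stay queued until that file has
// accumulated at least `min_row_group_size` of them.
enum class Outcome { kWritesProceed, kBuffersWholeDataset, kStalls };

Outcome Classify(int64_t rows_per_file, int64_t num_files,
                 int64_t min_row_group_size, int64_t max_rows_queued) {
  assert(rows_per_file > 0 && num_files > 0);
  // A file that reaches the minimum can be flushed, which frees queue space.
  if (rows_per_file >= min_row_group_size) return Outcome::kWritesProceed;
  // Otherwise nothing is flushed before the input ends, so every row of the
  // dataset has to sit in the queue at the same time.
  const int64_t total_rows = rows_per_file * num_files;
  return total_rows > max_rows_queued ? Outcome::kStalls
                                      : Outcome::kBuffersWholeDataset;
}
```

With 900k rows per file, 100 files need 90M queued rows, which is over the 64M limit, so `Classify` reports `kStalls`. With 50 files the 45M rows fit, but only by holding the whole dataset in memory (`kBuffersWholeDataset`).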

To work around this I think I will create a soft limit (defaulting to 8M rows because I like nice round powers of two) of batchable rows. Once there are more than 8M batchable rows I will start evicting batches, even though they are smaller than min_row_group_size.

I'm fairly certain this will go unnoticed in 99% of scenarios until some point in the future when I've forgotten all of this and I'm debugging why a small batch got created.
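The soft limit described in this comment could look roughly like the sketch below. It is an illustration only and makes no claim about how pull request 11556 implements it: `StagingArea`, `Push`, and the "evict the largest staged file first" policy are assumptions, and the sketch tracks row counts rather than real record batches.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not Arrow's dataset writer: tracks row counts only.
class StagingArea {
 public:
  using Write = std::pair<std::string, int64_t>;  // (file, rows written)

  StagingArea(int64_t min_row_group_size, int64_t soft_limit)
      : min_row_group_size_(min_row_group_size), soft_limit_(soft_limit) {
    assert(min_row_group_size > 0 && soft_limit > 0);
  }

  // Stage `rows` (assumed positive) for `file`; return the writes triggered.
  std::vector<Write> Push(const std::string& file, int64_t rows) {
    std::vector<Write> writes;
    staged_[file] += rows;
    total_staged_ += rows;
    // Normal path: the file now holds a full row group, so write it out.
    if (staged_[file] >= min_row_group_size_) Flush(file, &writes);
    // Soft limit: too many rows are waiting, so evict undersized groups
    // (largest first, an assumed policy) until the total is within the limit.
    while (total_staged_ > soft_limit_) Flush(Largest(), &writes);
    return writes;
  }

  int64_t total_staged() const { return total_staged_; }

 private:
  // Record a write for `file` and drop its rows from the staging area.
  void Flush(const std::string& file, std::vector<Write>* writes) {
    auto it = staged_.find(file);
    writes->emplace_back(file, it->second);
    total_staged_ -= it->second;
    staged_.erase(it);
  }

  // Name of the file with the most staged rows; ties go to the first key.
  std::string Largest() const {
    auto best = staged_.begin();
    for (auto it = staged_.begin(); it != staged_.end(); ++it) {
      if (it->second > best->second) best = it;
    }
    return best->first;
  }

  int64_t min_row_group_size_;
  int64_t soft_limit_;
  int64_t total_staged_ = 0;
  std::map<std::string, int64_t> staged_;
};
```

In the 900k-rows-per-file scenario with a `1 << 20` minimum and an `8 << 20` soft limit, the first nine files stage 8.1M rows without any write. The tenth push crosses the soft limit and evicts one 900k-row group, which is exactly the kind of undersized row group the comment warns about.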

asfimport · 2021-12-09T13:04:57Z

David Li / @lidavidm:
Issue resolved by pull request #11556.

asfimport closed this as completed Dec 9, 2021

asfimport assigned westonpace Jan 11, 2023
