[C++] Add a minimum_row_group_size to dataset writing #29990

Closed
asfimport opened this issue Oct 21, 2021 · 2 comments

@asfimport
Collaborator

Right now we write whatever chunks we get, but if those chunks are exceptionally small, we should bundle them up and write them out once they reach a configurable minimum row group size.

Reporter: Jonathan Keane / @jonkeane
Assignee: Weston Pace / @westonpace

PRs and other links:

Note: This issue was originally created as ARROW-14426. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
So things can get a little tricky in certain situations. For example, suppose min_row_group_size is 1M, max_rows_queued is 64M, and you just so happen to be creating 100 files with 900k rows each. You would end up in deadlock: no file ever reaches min_row_group_size, so nothing gets written, and the 90M queued rows (100 x 900k) hit the max_rows_queued limit.

Even if you were only creating 50 files it would still be non-ideal because none of the writes would happen until the entire dataset had accumulated in memory.
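
The arithmetic behind these two scenarios can be sketched as a small Python model. This is only an illustration under simplifying assumptions, not Arrow's actual writer logic: the function name `queued_rows_outcome` and the returned labels are hypothetical, every file is assumed to receive the same number of rows, and a file is assumed to write nothing until it has buffered `min_row_group_size` rows.

```python
def queued_rows_outcome(rows_per_file, num_files, min_row_group_size, max_rows_queued):
    """Classify a write in which every file receives `rows_per_file` rows in total.

    Simplified model (an assumption, not Arrow's code): a file only writes once
    it has buffered at least `min_row_group_size` rows, and input stalls once
    `max_rows_queued` rows are waiting.
    """
    if rows_per_file >= min_row_group_size:
        # Each file can emit at least one full row group, so the queue drains.
        return "writes proceed"
    total_queued = rows_per_file * num_files
    if total_queued >= max_rows_queued:
        # Nothing can be written and the queue is full: no progress is possible.
        return "deadlock"
    # Everything fits in the queue, but only gets written at the very end.
    return "buffers entire dataset"


# The two cases from the comment above (1M and 64M taken as decimal millions).
print(queued_rows_outcome(900_000, 100, 1_000_000, 64_000_000))
print(queued_rows_outcome(900_000, 50, 1_000_000, 64_000_000))
```

With 100 files the model reports a deadlock (90M rows queued against a 64M limit); with 50 files it reports that the whole dataset (45M rows) sits in memory before any write happens.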

To work around this I think I will create a soft limit (defaulting to 8M rows because I like nice round powers of two) of batchable rows. Once there are more than 8M batchable rows I will start evicting batches, even though they are smaller than min_row_group_size.
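
A rough Python sketch of that soft-limit idea follows. It is not the code from the eventual pull request: the class `RowGroupBatcher`, its method names, and the default `min_row_group_size` are hypothetical, and the choice to evict the largest pending batch first is an assumption, since the comment does not say which batches get evicted.

```python
from collections import defaultdict


class RowGroupBatcher:
    """Toy model of per-file row batching with a soft limit on batchable rows."""

    def __init__(self, min_row_group_size=1 << 20, soft_limit=8 * (1 << 20)):
        self.min_row_group_size = min_row_group_size  # illustrative default
        self.soft_limit = soft_limit  # 8M rows, as proposed above
        self.pending = defaultdict(int)  # file name -> buffered row count
        self.written = []  # (file name, rows) in write order

    def _flush(self, name):
        # Write out whatever is buffered for this file as one row group.
        rows = self.pending.pop(name)
        self.written.append((name, rows))

    def add(self, name, rows):
        self.pending[name] += rows
        # Normal path: the file has accumulated a full-sized row group.
        if self.pending[name] >= self.min_row_group_size:
            self._flush(name)
        # Soft limit: too many batchable rows overall, so evict undersized
        # batches (largest first, an assumption) until back under the limit.
        while sum(self.pending.values()) > self.soft_limit:
            largest = max(self.pending, key=self.pending.get)
            self._flush(largest)

    def finish(self):
        # End of input: flush everything that is still buffered.
        for name in list(self.pending):
            self._flush(name)


batcher = RowGroupBatcher(min_row_group_size=10, soft_limit=25)
for name in ("a", "b", "c"):
    batcher.add(name, 9)  # the third add pushes the total to 27 > 25
batcher.finish()
print(batcher.written)
```

In this small run, no file reaches the minimum of 10 rows, yet one 9-row batch is written early because the soft limit was exceeded. That is exactly the kind of undersized row group the workaround trades for guaranteed progress.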

I'm fairly certain this will go unnoticed in 99% of scenarios until some point in the future when I've forgotten all of this and I'm debugging why a small batch got created.

@asfimport
Collaborator Author

David Li / @lidavidm:
Issue resolved by pull request 11556
#11556
