Right now we write whatever chunks we get, but if those chunks are exceptionally small, we should bundle them up and write them out according to a configurable minimum row group size.
Weston Pace / @westonpace:
So things can get a little tricky in certain situations. For example, if min_row_group_size is 1M and max_rows_queued is 64M, and you happen to have 900k rows per file and are creating 100 files, then you would end up in deadlock: no file ever reaches the minimum, so nothing gets written, and the accumulated 90M rows hit the max_rows_queued limit.
Even if you were only creating 50 files it would still be non-ideal, because none of the writes would happen until the entire dataset had accumulated in memory.
To work around this I think I will create a soft limit (defaulting to 8M rows because I like nice round powers of two) of batchable rows. Once there are more than 8M batchable rows I will start evicting batches, even though they are smaller than min_row_group_size.
I'm fairly certain this will go unnoticed in 99% of scenarios until some point in the future when I've forgotten all of this and I'm debugging why a small batch got created.
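The accumulate-then-evict behavior described above can be sketched roughly as follows. This is an illustrative Python model, not the actual Arrow implementation; the class and parameter names (`BatchAccumulator`, `min_row_group_size`, `max_batchable_rows`) are hypothetical, and the eviction policy (flush the largest pending batch first) is one plausible choice.

```python
class BatchAccumulator:
    """Sketch: accumulate small chunks per file until they reach
    min_row_group_size, but enforce a soft limit on total queued
    ("batchable") rows to avoid the deadlock described above."""

    def __init__(self, min_row_group_size=1_000_000,
                 max_batchable_rows=8_000_000):
        self.min_row_group_size = min_row_group_size
        self.max_batchable_rows = max_batchable_rows  # soft limit
        self.pending = {}       # file key -> rows queued but not yet written
        self.total_pending = 0

    def push(self, file_key, num_rows):
        """Queue num_rows for file_key; return a list of
        (file_key, rows) groups that should be written now."""
        self.pending[file_key] = self.pending.get(file_key, 0) + num_rows
        self.total_pending += num_rows
        writes = []
        # Normal case: this file has accumulated a full row group.
        if self.pending[file_key] >= self.min_row_group_size:
            writes.append(self._flush(file_key))
        # Soft limit hit: start evicting pending batches even though
        # they are smaller than min_row_group_size.
        while self.total_pending > self.max_batchable_rows:
            largest = max(self.pending, key=self.pending.get)
            writes.append(self._flush(largest))
        return writes

    def _flush(self, file_key):
        rows = self.pending.pop(file_key)
        self.total_pending -= rows
        return (file_key, rows)
```

With the defaults from the comment (8M soft limit), the 100-files-of-900k scenario no longer deadlocks: once total queued rows pass 8M, sub-minimum batches start getting written, which is exactly the "small batch got created" case noted above.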
Reporter: Jonathan Keane / @jonkeane
Assignee: Weston Pace / @westonpace
PRs and other links:
Note: This issue was originally created as ARROW-14426. Please see the migration documentation for further details.