Skip to content

define minimum row group #9756

Answered by devinjdangelo
djouallah asked this question in Q&A
Discussion options

You must be logged in to vote

Hm... perhaps the name "max_row_group_size" is confusing. It sort of means the same thing as "minimum row group size" depending on your perspective. The parquet writer will continue to write data into a row group until it reaches "max_row_group_size" rows. Then, it will open a new row group and start writing to that. As a result, all but the very last row group will have exactly "max_row_group_size" rows.

So, if you set max_row_group_size to 8M the row groups will have 8M rows. You could have the very last row group be smaller if the total number of rows is not divisible by 8M. I am not aware of a mechanism in arrow-rs or datafusion to force even the last row group to be above a minimum v…

Replies: 3 comments 1 reply

Comment options

You must be logged in to vote
1 reply
@alamb
Comment options

Comment options

You must be logged in to vote
0 replies
Comment options

You must be logged in to vote
0 replies
Answer selected by djouallah
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Category
Q&A
Labels
None yet
3 participants