The row_group_size property in pyarrow.parquet.write_table is described as:
Maximum size of each written row group. If None, the row group size will be the minimum of the Table size and 64 * 1024 * 1024.
This limit is a count of rows, but that is not obvious from the description. Furthermore, 64Mi rows is an extremely high limit for a row group. I suspect the number was borrowed from the Java implementation; however, the Java implementation treats it as bytes, not rows (64MiB row groups are very reasonable).
Perhaps the best solution would be to add support for a limit in terms of bytes. In the meantime, I think we should lower the default limit to 1Mi rows.
Component(s)
C++
…default to 1Mi (#34281)
BREAKING CHANGE: Changes the default row group size when writing parquet files.
* Closes: #34280
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
…36012)
### Rationale for this change
In #34280 the default row group size was changed to 1Mi. However, this was accidentally reverted (for Python, but not C++) in #34435.
The problem is that there are two settings: an "absolute max row group size for the writer" and a "row group size to use for this table". The pyarrow user is unable to set the former property.
The behavior in pyarrow was previously "If no value is given in the call to write_table then don't specify anything and let the absolute max apply"
The first fix changed the absolute max to 1Mi. However, this made it impossible for the user to use a larger row group size. The second fix changed the absolute max back to 64Mi. However, this meant the default didn't change.
### What changes are included in this PR?
This change leaves the absolute max at 64Mi. However, if the user does not specify a row group size, we no longer "just use the table size" and instead use 1Mi.
### Are these changes tested?
Yes, a unit test was added.
### Are there any user-facing changes?
Yes, the default row group size now truly changes to 1Mi. This change was already announced as part of #34280
* Closes: #35859
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>