[Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing #34410
Comments
Ah, I think the problem is that |
I don't know what the typical usage is from C++? For that, it might be more useful to actually change the I misinterpreted Naively, I would expect that specifying
That's also not that simple, since it has similar logic as in C++: the ParquetWriter class is created with properties, and then afterwards the |
For Python, could we just set |
That seems like a reasonable compromise. I think python will always use |
I created #34435 with @wjones127's suggestion.
@wjones127 good idea!
…ed (#34435)

### Rationale for this change

We changed the default chunk size from 64Mi rows to 1Mi rows. However, it turns out that this property was being treated not just as the default but also as the absolute max. So it was no longer possible to specify chunk sizes larger than 1Mi rows. This change separates those two things and restores the max to 64Mi rows.

### What changes are included in this PR?

Pyarrow will now set the `ParquetWriterProperties::max_row_group_length` to 64Mi when constructing a parquet writer.

### Are these changes tested?

Yes. Unit tests are added.

### Are there any user-facing changes?

No. The previous change #34281 changed two defaults (absolute max and default). This PR restores the absolute max back to what it was before. So it is removing a user-facing change.

* Closes: #34410

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.
See #34374 (comment) for context
#34281 changed the default row group size (`chunksize` in the C++ WriteTable API). However, that PR changed `DEFAULT_MAX_ROW_GROUP_LENGTH`, which doesn't set the default `chunksize`, but actually caps the max chunk (row group) size regardless of the user-specified `chunksize`.

It seems this constant is used both for setting this max upper cap and as the default value for `chunk_size` in `WriteTable`. I assume we will have to distinguish those two meanings.

Component(s)
C++, Parquet, Python
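
To make the two meanings of the constant concrete, here is a minimal pure-Python sketch of the clamping behavior described above. It is a toy model, not Arrow's actual code: the function `split_rows` and its parameter names are hypothetical, and only the 1Mi and 64Mi values come from this thread.

```python
from typing import List, Optional

Mi = 1024 * 1024


def split_rows(
    num_rows: int,
    requested: Optional[int],
    default_chunk: int,
    max_row_group_length: int,
) -> List[int]:
    """Toy model (hypothetical helper): return the row group sizes a writer
    would produce when the chunk size is clamped to a maximum."""
    # Fall back to the default when the caller did not ask for a size.
    chunk = default_chunk if requested is None else requested
    # The cap always wins, regardless of what the caller requested.
    chunk = min(chunk, max_row_group_length)

    sizes = []
    remaining = num_rows
    while remaining > 0:
        n = min(chunk, remaining)
        sizes.append(n)
        remaining -= n
    return sizes


# One constant plays both roles (the situation reported in this issue):
# lowering the default to 1Mi also lowers the cap to 1Mi.
shared = 1 * Mi
buggy = split_rows(3 * Mi, 2 * Mi, default_chunk=shared, max_row_group_length=shared)

# Default and cap are separate values (the approach taken in #34435).
fixed = split_rows(3 * Mi, 2 * Mi, default_chunk=1 * Mi, max_row_group_length=64 * Mi)

# No explicit request: the 1Mi default still applies.
default_only = split_rows(3 * Mi, None, default_chunk=1 * Mi, max_row_group_length=64 * Mi)

print([s // Mi for s in buggy])         # sizes in Mi rows
print([s // Mi for s in fixed])
print([s // Mi for s in default_only])
```

In this model, a request for 2Mi-row chunks is silently reduced to 1Mi when the constant is shared, while separating the default from the cap honors the request and keeps 1Mi as the default for callers who pass nothing.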