[Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing #34410

Closed
jorisvandenbossche opened this issue Mar 2, 2023 · 6 comments · Fixed by #34435

@jorisvandenbossche (Member)

Describe the bug, including details regarding any error messages, version, and platform.

See #34374 (comment) for context

#34281 changed the default row group size (chunksize in the C++ WriteTable API). However, that PR changed DEFAULT_MAX_ROW_GROUP_LENGTH, which doesn't set the default chunksize but actually caps the maximum chunk (row group) size regardless of the user-specified chunksize.

It seems this constant is used both for setting this max upper cap and as the default value for chunk_size in WriteTable. I assume we will have to distinguish those two meanings.
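
A minimal repro sketch of the capping behavior described above (the file name is arbitrary, and exact behavior depends on the pyarrow version: only builds with the changed 1Mi default are affected):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Table with more rows than the new 1Mi (1_048_576) default row group length.
table = pa.table({"x": pa.array(range(2_000_000), type=pa.int32())})

# Request a single 2Mi-row row group. On affected versions the cap kicks in
# and the file ends up with two row groups instead of one.
pq.write_table(table, "capped.parquet", row_group_size=2_000_000)

meta = pq.ParquetFile("capped.parquet").metadata
print(meta.num_row_groups)         # expected 1, observed 2 on affected versions
print(meta.row_group(0).num_rows)  # observed 1_048_576 instead of 2_000_000
```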

Component(s)

C++, Parquet, Python

@westonpace (Member)

Ah, I think the problem is that parquet::arrow::FileWriter::WriteTable has a chunk_size argument (which pyarrow sets), and it also has a parquet::WriterProperties with a max_row_group_length property (which pyarrow does not set), and this latter property overrides the former. So either we need to change pyarrow to set parquet::WriterProperties::max_row_group_length, or we only change the default for chunk_size (this might be preferred).
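
A plain-Python illustration of that interplay (not the actual C++ code): the effective row group length behaves like the minimum of the two knobs, so the properties value silently wins whenever it is smaller than what the caller asked for.

```python
# Sketch of the observed behavior, not the C++ implementation:
# the writer closes a row group at min(chunk_size, max_row_group_length).
def effective_row_group_length(chunk_size: int, max_row_group_length: int) -> int:
    return min(chunk_size, max_row_group_length)

# With the 1Mi default cap, a 2Mi chunk_size request is silently reduced.
assert effective_row_group_length(2_000_000, 1024 * 1024) == 1024 * 1024
```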

@jorisvandenbossche (Member, Author)

> or only change the default for chunk_size (this might be preferred).

I don't know what the typical usage is from C++. For that, it might be more useful to actually change max_row_group_length (since not everyone will write through WriteTable).

I misinterpreted max_row_group_length, thinking it was meant as some global max size that is enforced regardless of what the user provides, but through WriterProperties it's basically the chunk size argument users can set using the properties interface. It's also only WriteTable that offers the additional chunk_size keyword; other write methods like WriteRecordBatch only use the properties' max_row_group_length.

Naively, I would expect that specifying chunk_size in WriteTable would overrule max_row_group_length, instead of being overwritten by it.
Changing that to give priority to chunk_size would also solve the issue, but that would be a breaking change?

> So either we need to change pyarrow to set parquet::WriterProperties::max_row_group_length

That's also not that simple, since the Python side has logic similar to the C++ side: the ParquetWriter class is created with properties, and only afterwards does the write_table method take its chunk_size keyword, at which point the writer object has already been created (and in theory you can also call that method multiple times with different chunk_size values while writing to the same file).
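
For reference, a small sketch of the Python pattern Joris describes (file name arbitrary): the writer, and thus its WriterProperties, exists before any chunk size is known, and each write_table call can pass its own row_group_size.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10))})

# The writer object (and its WriterProperties) is created once, up front,
# but each write_table call may use a different row_group_size.
with pq.ParquetWriter("multi.parquet", table.schema) as writer:
    writer.write_table(table, row_group_size=5)   # two 5-row groups
    writer.write_table(table, row_group_size=10)  # one 10-row group
```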

@wjones127 (Member)

For Python, could we just set max_row_group_length to a very high value (knowing that it will always be overridden with chunk_size)? That seems to be the status quo right now, right? (That is, from Python you can't write more than 64 million rows per group.)

@westonpace (Member)

> For Python, could we just set max_row_group_length to a very high value (knowing that it will always be overridden with chunk_size)? That seems to be the status quo right now, right? (That is, from Python you can't write more than 64 million rows per group.)

That seems like a reasonable compromise. I think Python will always use write_table, so it should work.

@westonpace (Member)

I created #34435 with @wjones127's suggestion.

@jorisvandenbossche (Member, Author)

@wjones127 good idea!

wjones127 added a commit that referenced this issue on Mar 6, 2023: …ed (#34435)

### Rationale for this change

We changed the default chunk size from 64Mi rows to 1Mi rows.  However, it turns out that this property was being treated not just as the default but also as the absolute max.  So it was no longer possible to specify chunk sizes larger than 1Mi rows.  This change separates those two things and restores the max to 64Mi rows.

### What changes are included in this PR?

Pyarrow will now set `parquet::WriterProperties::max_row_group_length` to 64Mi when constructing a Parquet writer.
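
A sketch of what this restores from the user's perspective (file name arbitrary; the restored 64Mi cap still applies above that):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# After the fix, a row_group_size above the 1Mi default is honored again,
# up to the restored 64Mi maximum.
table = pa.table({"x": pa.array(range(2_000_000), type=pa.int32())})
pq.write_table(table, "restored.parquet", row_group_size=2_000_000)

meta = pq.ParquetFile("restored.parquet").metadata
assert meta.num_row_groups == 1
assert meta.row_group(0).num_rows == 2_000_000
```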

### Are these changes tested?

Yes.  Unit tests are added.

### Are there any user-facing changes?

No.  The previous change #34281 changed two defaults (absolute max and default).  This PR restores the absolute max back to what it was before.  So it is removing a user-facing change.
* Closes: #34410

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>