[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset
#39965
Comments
Thank you for opening up an issue @ion-elgreco! I agree this is probably a bug in datasets; it would be in Arrow C++, I believe. @mapleFU would you mind sharing your view on this?
@ion-elgreco What's the version of Arrow you're using now? I guess that's from bad handling of …
@mapleFU it occurs with the latest one, so v15. We figured this out during this issue; a user mentioned they tried it on versions from v15 back to v9: delta-io/delta-rs#2169 (comment)
I think that's a problem in the dataset writer, and not related to the core Parquet lib. I'll try to reproduce and fix it tomorrow.
Update: dataset writer
@ion-elgreco Would you mind trying #39995? I guess this might fix the problem.
@mapleFU Is there some guide I can follow to compile from source? I've never touched any C++ codebase before.
https://arrow.apache.org/docs/developers/python.html
@mapleFU OK, I need to find some time to set this up then; I'll try it out over the weekend.
…when `max_rows_per_file` enabled (#39995)

### Rationale for this change
`DatasetWriter` might create an empty `RecordBatch` when `max_rows_per_file` is enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file already contains exactly the destination data.

### What changes are included in this PR?
Check whether the batch size is 0 when appending to the file queue.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Users can avoid zero-sized row groups/batches.

* Closes: #39965

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
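The core of the change, paraphrased as a hedged Python sketch rather than the actual Arrow C++ (the function and queue names below are illustrative, not the real `DatasetWriter` internals):

```python
# Illustrative paraphrase of the fix, not the real Arrow C++ DatasetWriter code:
# a chunk is only queued for the current file if it actually contains rows, so a
# zero-sized batch returned when the file is already exactly full never becomes
# an empty row group on disk.
def queue_writable_chunk(file_queue, batch):
    if batch.num_rows == 0:
        return  # skip: writing this would leave an empty row group behind
    file_queue.append(batch)
```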
Describe the bug, including details regarding any error messages, version, and platform.
The `pyarrow.dataset.write_dataset` function leaves an empty row group behind in the Parquet file after the writer hits the `max_rows_per_file` limit. See the reproducible example below.
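The original reproducible example is not preserved in this excerpt; a minimal sketch of the kind of script that triggers the behaviour, with assumed row counts and limits, might look like this:

```python
# Minimal sketch with assumed values: the table's row count is an exact multiple
# of max_rows_per_file, which is the situation where a trailing empty row group
# shows up. Inspect each output file's row-group sizes to see it.
import glob
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100))})  # 100 rows (assumed)
ds.write_dataset(
    table,
    "out_dataset",
    format="parquet",
    max_rows_per_file=10,   # 100 rows / 10 per file -> files filled exactly
    max_rows_per_group=10,
    existing_data_behavior="overwrite_or_ignore",
)

for path in sorted(glob.glob("out_dataset/*.parquet")):
    meta = pq.ParquetFile(path).metadata
    sizes = [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]
    print(path, sizes)  # before the fix, a trailing 0 could appear here
```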
Component(s)
Python