[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in `ds.write_dataset` #39965

ion-elgreco · 2024-02-06T18:32:27Z

Describe the bug, including details regarding any error messages, version, and platform.

The pyarrow.dataset.write_dataset function leaves an empty row_group behind in the the parquet file after the writer hits the limit of max_rows_per_file. See reproducible example below:

import pyarrow.dataset as ds
import pyarrow.parquet as pq
ds.write_dataset(data, "test_dataset", max_rows_per_file=1024*32, max_rows_per_group=1024 * 16, min_rows_per_group=8*1024, format='parquet')
metadata = pq.read_metadata("test_dataset/part-0.parquet")

for i in range(metadata.num_row_groups):
    print(metadata.row_group(i))

<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5168180>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 0
  total_byte_size: 14

Component(s)

Python

The text was updated successfully, but these errors were encountered:

AlenkaF · 2024-02-07T16:13:17Z

Thank you for opening up an issue @ion-elgreco!

I agree this is probably a bug in datasets. It would be in Arrow C++ I believe. @mapleFU would you mind sharing your view on this?

mapleFU · 2024-02-07T16:20:44Z

@ion-elgreco What's the version of arrow you're now using? I guess that's from bad handling of max_rows_per_group 🤔

ion-elgreco · 2024-02-07T16:46:50Z

@mapleFU it occurs with the latest one, so v15

We figured this out during this issue, a user mentioned he tried it on v15 up to v9: delta-io/delta-rs#2169 (comment)

mapleFU · 2024-02-07T17:36:33Z

I think that's a problem in dataset writer, and the problem is not related to core parquet lib, I'll try to reproduce and fix it tomorrow

mapleFU · 2024-02-07T18:01:46Z

Update: dataset writer DoWriteRecordBatch could cause this problem currently, I can fix it tomorrow, but currently I don't know if other place would causing this 🤔

mapleFU · 2024-02-08T05:35:17Z

@ion-elgreco Would you mind try: #39995 ? I guess this might fix the problem

ion-elgreco · 2024-02-08T18:13:58Z

@mapleFU is there some guide I can follow to compile from source? Never touched any C++ codebase before

mapleFU · 2024-02-08T18:15:46Z

https://arrow.apache.org/docs/developers/python.html
It's a bit complex because it need to build that

ion-elgreco · 2024-02-08T18:22:19Z

@mapleFU ok I need to find some time to set this up then, I'll try it out in the weekend

…ax_rows_per_file` enabled (#39995) ### Rationale for this change `DatasetWriter` might create empty `RecordBatch` when `max_rows_per_file` enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file exactly contains the dest data. ### What changes are included in this PR? Check batch-size == 0 when append to file queue ### Are these changes tested? Yes ### Are there any user-facing changes? User can avoid zero-sized row-group/batch. * Closes: #39965 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>

…hen `max_rows_per_file` enabled (apache#39995) ### Rationale for this change `DatasetWriter` might create empty `RecordBatch` when `max_rows_per_file` enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file exactly contains the dest data. ### What changes are included in this PR? Check batch-size == 0 when append to file queue ### Are these changes tested? Yes ### Are there any user-facing changes? User can avoid zero-sized row-group/batch. * Closes: apache#39965 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>

ion-elgreco added the Type: bug label Feb 6, 2024

github-actions bot added the Component: Python label Feb 6, 2024

ion-elgreco changed the title ~~Empty row groups left behind after hitting max_rows_per_file~~ [Python][parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset Feb 6, 2024

kou changed the title ~~[Python][parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset~~ [Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset Feb 8, 2024

github-actions bot mentioned this issue Feb 8, 2024

GH-39965: [C++] DatasetWriter avoid creating zero-sized batch when max_rows_per_file enabled #39995

Merged

github-actions bot assigned mapleFU Feb 8, 2024

mapleFU closed this as completed in #39995 Feb 23, 2024

mapleFU added this to the 16.0.0 milestone Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in `ds.write_dataset` #39965

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in `ds.write_dataset` #39965

ion-elgreco commented Feb 6, 2024

AlenkaF commented Feb 7, 2024

mapleFU commented Feb 7, 2024

ion-elgreco commented Feb 7, 2024 •

edited

Loading

mapleFU commented Feb 7, 2024 •

edited

Loading

mapleFU commented Feb 7, 2024

mapleFU commented Feb 8, 2024

ion-elgreco commented Feb 8, 2024

mapleFU commented Feb 8, 2024

ion-elgreco commented Feb 8, 2024

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset #39965

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset #39965

Comments

ion-elgreco commented Feb 6, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

AlenkaF commented Feb 7, 2024

mapleFU commented Feb 7, 2024

ion-elgreco commented Feb 7, 2024 • edited Loading

mapleFU commented Feb 7, 2024 • edited Loading

mapleFU commented Feb 7, 2024

mapleFU commented Feb 8, 2024

ion-elgreco commented Feb 8, 2024

mapleFU commented Feb 8, 2024

ion-elgreco commented Feb 8, 2024

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in `ds.write_dataset` #39965

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in `ds.write_dataset` #39965

ion-elgreco commented Feb 7, 2024 •

edited

Loading

mapleFU commented Feb 7, 2024 •

edited

Loading