Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Should we prevent from writing empty rowgroup in WriteRecordBatch #39210

Closed
mapleFU opened this issue Dec 13, 2023 · 2 comments · Fixed by #39211
Closed

[C++][Parquet] Should we prevent from writing empty rowgroup in WriteRecordBatch #39210

mapleFU opened this issue Dec 13, 2023 · 2 comments · Fixed by #39211

Comments

@mapleFU
Copy link
Member

mapleFU commented Dec 13, 2023

Describe the enhancement requested

Arrow Parquet Writer uses AppendRowGroup to produce the RowGroup during writing a Parquet file. The code path is listed in FileWriterImpl::WriteRecordBatch as below:

    while (offset < batch.num_rows()) {
      const int64_t batch_size =
          std::min(max_row_group_length - row_group_writer_->num_rows(),
                   batch.num_rows() - offset);
      RETURN_NOT_OK(WriteBatch(offset, batch_size));
      offset += batch_size;

      // Flush current row group if it is full.
      if (row_group_writer_->num_rows() >= max_row_group_length) {
        RETURN_NOT_OK(NewBufferedRowGroup());
      }
    }

Assume the max_row_group_length == k, if input recordBatch.num_rows() % k == 0, we would append an empty row-group.

The behavior of empty row-group is a bit tricky. It looks like below. The empty row-group only have metadata, and data does not exists.

     {
       "Id": "2",  "TotalBytes": "0",  "TotalCompressedBytes": "0",  "Rows": "0",
       "ColumnChunks": [
          {"Id": "0", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "1", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "2", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "3", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "4", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "5", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "6", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "7", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "8", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "9", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "10", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" },
          {"Id": "11", "Values": "0", "StatsSet": "False",
           "Compression": "UNCOMPRESSED", "Encodings": "", "UncompressedSize": "0", "CompressedSize": "0" }
        ]
     }

Although it's hard to prevent all cases from "empty row-group", maybe we can prevent the case for the last write to the File

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Dec 13, 2023

cc @pitrou @wgtmac

@mapleFU
Copy link
Member Author

mapleFU commented Dec 13, 2023

The fixing can be as simple as:

diff --git a/cpp/src/parquet/arrow/writer.cc b/cpp/src/parquet/arrow/writer.cc
index 07c627d5e..dd0f5be07 100644
--- a/cpp/src/parquet/arrow/writer.cc
+++ b/cpp/src/parquet/arrow/writer.cc
@@ -462,7 +462,7 @@ class FileWriterImpl : public FileWriter {
       offset += batch_size;
 
       // Flush current row group if it is full.
-      if (row_group_writer_->num_rows() >= max_row_group_length) {
+      if (row_group_writer_->num_rows() >= max_row_group_length && offset < batch.num_rows()) {
         RETURN_NOT_OK(NewBufferedRowGroup());
       }
     }

pitrou added a commit that referenced this issue Dec 13, 2023
…ed RowGroup (#39211)

### Rationale for this change

`WriteRecordBatch` might produce zero-sized row-group, which is mentioned in #39210 . This patch avoid WriteRecordBatch from produce zero-sized RowGroup.

### What changes are included in this PR?

adding a check for zero-sized row-group

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* Closes: #39210

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 15.0.0 milestone Dec 13, 2023
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…ro-sized RowGroup (apache#39211)

### Rationale for this change

`WriteRecordBatch` might produce zero-sized row-group, which is mentioned in apache#39210 . This patch avoid WriteRecordBatch from produce zero-sized RowGroup.

### What changes are included in this PR?

adding a check for zero-sized row-group

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* Closes: apache#39210

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ro-sized RowGroup (apache#39211)

### Rationale for this change

`WriteRecordBatch` might produce zero-sized row-group, which is mentioned in apache#39210 . This patch avoid WriteRecordBatch from produce zero-sized RowGroup.

### What changes are included in this PR?

adding a check for zero-sized row-group

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* Closes: apache#39210

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants