PARQUET-1381: Support merging of rowgroups during file rewrite by MaheshGPai · Pull Request #1121 · apache/parquet-java

MaheshGPai · 2023-07-15T12:11:10Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title.
- https://issues.apache.org/jira/browse/PARQUET-1381
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

MaheshGPai · 2023-07-15T12:19:43Z

Taking forward a PR that had remained inactive. Original PR - #775

wgtmac

I simply did an initial review.

wgtmac · 2023-07-19T05:20:57Z


+  @Parameter(
+    names = {"-m", "--merge-rowgroups"},
+    description = "<true/false>",


Could you please add a brief description?

wgtmac · 2023-07-19T05:21:37Z

+
+  @Parameter(
+    names = {"-s", "--max-rowgroup-size"},
+    description = "<max size of the merged rowgroups>",


It would be good to say it is used together with --merge-rowgroups=true in the description.

wgtmac · 2023-07-19T05:23:16Z

+      builder.enableRowGroupMerge();
+      builder.maxRowGroupSize(maxRowGroupSize);


What about using a single function, like builder.mergeRowGroups(maxRowGroupSize)?

I have made changes as per the comment. I'm fine either way.
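
For illustration, the single-call shape suggested above could look roughly like the following. This is a minimal sketch, not the actual builder in parquet-java: the class name `RewriteOptionsSketch`, its fields, and the validation rule are assumptions made for the example.

```java
// Hypothetical stand-in for a rewrite-options class; NOT the parquet-java API.
final class RewriteOptionsSketch {
  private final boolean mergeRowGroups;
  private final long maxRowGroupSize;

  private RewriteOptionsSketch(boolean mergeRowGroups, long maxRowGroupSize) {
    this.mergeRowGroups = mergeRowGroups;
    this.maxRowGroupSize = maxRowGroupSize;
  }

  boolean isMergeRowGroups() {
    return mergeRowGroups;
  }

  long getMaxRowGroupSize() {
    return maxRowGroupSize;
  }

  static final class Builder {
    private boolean mergeRowGroups = false;
    private long maxRowGroupSize = 0L;

    // Enables merging and sets the size limit in one call, so the flag and
    // the limit cannot get out of sync (assumed rule: the limit must be > 0).
    Builder mergeRowGroups(long maxRowGroupSize) {
      if (maxRowGroupSize <= 0) {
        throw new IllegalArgumentException("maxRowGroupSize must be positive");
      }
      this.mergeRowGroups = true;
      this.maxRowGroupSize = maxRowGroupSize;
      return this;
    }

    RewriteOptionsSketch build() {
      return new RewriteOptionsSketch(mergeRowGroups, maxRowGroupSize);
    }
  }
}
```

With this shape a caller cannot enable merging without also supplying a size limit, which is the main argument for folding the two setters into one.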

wgtmac · 2023-07-19T05:38:41Z

+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.*;


Please do not use import star.

wgtmac · 2023-07-19T05:41:15Z

+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+
+public class RowGroupMerger {


Suggested change

public class RowGroupMerger {

class RowGroupMerger {

It would be good not to make it public for now.

Probably you need to relocate it into the rewrite package.

wgtmac · 2023-07-19T06:03:16Z

+      initNextReader();
+    }
+    while(reader != null);
+    new RowGroupMerger(schema, newCodecName, v2EncodingHint).merge(readers, maxRowGroupSize, writer);


I didn't review it in depth. Does it handle encryption or masking properties internally?

Yes. Underneath, it uses the same instance of ParquetFileWriter which handles these operations.

advancedxy

This is a nice feature, @MaheshGPai. I have been wanting something similar, so thanks for your work.

By the way, do you have any performance numbers comparing this with rewriting through query engines such as Spark/Hive?

advancedxy · 2023-07-20T08:53:06Z

+    List<ParquetFileReader> readers = new ArrayList<>();
+    do {
+      readers.add(reader);
+      initNextReader();


It looks like v2EncodingHint only checks the first Parquet file.

Should all the files be checked?

advancedxy · 2023-07-20T09:04:24Z

+        DictionaryPage dictPage = columnReader.readDictionaryPage();
+        Dictionary decodedDictionary = null;
+        if (dictPage != null) {
+          decodedDictionary = dictPage.getEncoding().initDictionary(column.getColumnDesc(), dictPage);
+        }


If I understand the page-encoding process correctly: Parquet tries dictionary encoding by default. If the dictionary grows too big, whether in size or in number of distinct values, the encoding falls back to plain encoding. The check and fallback happen when the first page is emitted.

So when we merge multiple column chunks from different row groups, if the first column chunk is dictionary encoded and the others are not because they fell back to plain encoding, we should deliberately disable dictionary encoding for that column to avoid introducing overhead.

The current logic doesn't handle that: it uses dictionary encoding whenever the column chunk in the first row group to be merged uses dictionary encoding.
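
One possible way to express the check described above is sketched below. It assumes a hypothetical `ChunkInfo` summary per source column chunk (not a parquet-java type) and one conservative policy: use dictionary encoding for the merged column only if every source chunk stayed dictionary encoded. Other policies are equally conceivable.

```java
import java.util.List;

final class DictionaryDecisionSketch {

  // Hypothetical per-chunk summary; the field names are assumptions.
  static final class ChunkInfo {
    final boolean hasDictionaryPage;  // the chunk starts with a dictionary page
    final boolean hasPlainDataPages;  // the writer fell back to plain encoding

    ChunkInfo(boolean hasDictionaryPage, boolean hasPlainDataPages) {
      this.hasDictionaryPage = hasDictionaryPage;
      this.hasPlainDataPages = hasPlainDataPages;
    }
  }

  // Returns true only when ALL chunks of the column are dictionary encoded
  // and none of them fell back, instead of looking at the first chunk only.
  static boolean useDictionaryForMergedColumn(List<ChunkInfo> chunks) {
    if (chunks.isEmpty()) {
      return false;
    }
    for (ChunkInfo chunk : chunks) {
      if (!chunk.hasDictionaryPage || chunk.hasPlainDataPages) {
        return false;
      }
    }
    return true;
  }
}
```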

advancedxy · 2023-07-20T09:08:20Z

+
+        if (mergedBlock == null && estimator.estimate(blockMeta) > maxRowGroupSize) {
+          //save it directly without re encoding it
+          saveBlockTo(ReadOnlyMergedBlock.of(blockMeta, group, schema, compressor), writer);


I checked the related code; it seems that startColumn and endColumn don't maintain bloom filters.

It might be hard to maintain bloom filters when merging multiple row groups, but it should be possible and easy to maintain the bloom filter when only one row group is involved. See ParquetWriter#L337 for related code.


wgtmac

Thanks for the quick update!

I know this PR comes from another PR that was created long before ParquetRewriter was implemented. However, my main concern is that the current implementation of RowGroupMerger diverges from ParquetRewriter, which will make it difficult to maintain in the future. For example, RowGroupMerger does not seem to support column masking (nullifying column values) when RewriterOptions requests it. It also duplicates functionality (i.e. ReadOnlyMergedBlock) for the case where a row group does not need to be merged, which ParquetRewriter already supports. Could you consider consolidating these implementations? Otherwise it will not be easy to add more features to the rewriter.

cc @shangxinli @gszadovszky

wgtmac · 2023-07-22T15:01:35Z

+
+        if (mergedBlock == null && estimator.estimate(blockMeta) > maxRowGroupSize) {
+          //save it directly without re encoding it
+          saveBlockTo(ReadOnlyMergedBlock.of(blockMeta, group, schema, compressor), writer);


I agree, so it might be good to integrate this with ParquetRewriter if one row group does not need to be merged.
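
To make the split between "merge" and "pass through" concrete, here is a simplified planning sketch. It is not the PR's logic: it assumes row groups are described only by their estimated sizes, and the class and method names are hypothetical. Consecutive row groups are batched greedily up to the limit, and a row group that alone exceeds the limit ends up in a batch of its own, which a rewriter could copy without re-encoding.

```java
import java.util.ArrayList;
import java.util.List;

final class MergePlanSketch {

  // Returns batches of row-group indices. Each batch's total estimated size
  // stays within maxRowGroupSize, except for single oversized row groups,
  // which form singleton batches.
  static List<List<Integer>> plan(long[] estimatedSizes, long maxRowGroupSize) {
    List<List<Integer>> batches = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long currentSize = 0L;
    for (int i = 0; i < estimatedSizes.length; i++) {
      long size = estimatedSizes[i];
      // Close the running batch if adding this row group would exceed the limit.
      if (!current.isEmpty() && currentSize + size > maxRowGroupSize) {
        batches.add(current);
        current = new ArrayList<>();
        currentSize = 0L;
      }
      current.add(i);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
```

Singleton batches are the natural candidates for the existing copy path in ParquetRewriter, while multi-element batches are the ones that actually need merging.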

wgtmac · 2023-07-22T15:26:36Z

+            @Override
+            public DataPage visit(DataPageV1 pageV1) {
+
+              return new DataPageV1(compress(pageV1.getBytes(), compressor), pageV1.getValueCount(),


Why does DataPageV1 need to be compressed again here while DataPageV2 does not (line 384 below)?

wgtmac · 2023-07-22T15:58:51Z

+
+              newValuesWriter.reset();
+
+              long firstRowIndex = pageV1.getFirstRowIndex().orElse(-1L);


We cannot simply copy firstRowIndex if pages are not from the 1st row group in this MutableMergedBlock.
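
The adjustment implied here can be sketched as follows, assuming the merged row group is built from source row groups in a fixed order and that each page's first-row index is relative to its own source row group. The helper name and its parameters are hypothetical.

```java
final class FirstRowIndexSketch {

  // rowCounts[i] is the number of rows in source row group i, in merge order.
  // Returns the page's first-row index relative to the merged row group,
  // or -1 when the source index is unknown (mirroring orElse(-1L) in the diff).
  static long mergedFirstRowIndex(long[] rowCounts, int sourceGroup, long firstRowIndexInSource) {
    if (firstRowIndexInSource < 0) {
      return -1L;
    }
    long offset = 0L;
    // Rows of all earlier source row groups precede this page in the merged block.
    for (int i = 0; i < sourceGroup; i++) {
      offset += rowCounts[i];
    }
    return offset + firstRowIndexInSource;
  }
}
```

For pages from the first source row group the offset is zero, which is why copying the index happens to work there and only there.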

shangxinli · 2023-09-22T02:55:23Z

This is a great initiative. Do you still plan to address the feedback, @MaheshGPai?

MaheshGPai · 2023-09-23T07:16:17Z

> This is a great initiative. Do you still plan to address the feedback, @MaheshGPai?

@shangxinli I do plan to work on it. But I have not had time to get to this.

ConeyLiu · 2023-09-28T12:53:50Z

Hi @MaheshGPai, thanks for the contribution. If you don't have time to work on this, I can continue with it.

MaheshGPai · 2023-09-28T15:29:50Z

@ConeyLiu Please feel free to continue. I'll not be able to look at this for another week or so.

ConeyLiu · 2023-09-29T09:01:02Z

OK, I will dig into it.

github-actions · 2026-06-01T00:22:55Z

This pull request has been automatically marked as stale because it has had no activity for at least 2 months. If you are still working on this change or plan to move it forward, please leave a comment or push a new commit so we know to keep it open. Otherwise, this PR will be closed automatically in about one month. Thank you for your contribution to Apache Parquet!

MaheshGPai changed the title ~~Support merging of rowgroups during file rewrite~~ PARQUET-1381: Support merging of rowgroups during file rewrite Jul 15, 2023

MaheshGPai mentioned this pull request Jul 15, 2023

PARQUET-1381: add parquet block merging feature #775

Open

MaheshGPai force-pushed the PR branch 2 times, most recently from 1a84d5c to 1f511a6 Compare July 15, 2023 17:28

wgtmac reviewed Jul 19, 2023


MaheshGPai requested a review from wgtmac July 19, 2023 12:22

advancedxy reviewed Jul 20, 2023


wgtmac requested changes Jul 22, 2023


MaheshGPai added 2 commits September 23, 2023 12:32

Support merging of rowgroups during file rewrite

324a669

Review comments

c3d5f12

Merge statistics

1f585e5

MaheshGPai force-pushed the PR branch from a07e39f to 1f585e5 Compare September 23, 2023 10:19

github-actions Bot added the stale label Jun 1, 2026
