
PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation #1014

Merged: 7 commits merged into apache:master from the PARQUET-2075 branch on Jan 29, 2023

Conversation

@wgtmac (Member) commented Dec 14, 2022

Jira

Tests

  • Make sure all tests pass, especially CompressionConverterTest, ColumnPrunerTest, ColumnMaskerTest, and ColumnEncryptorTest.

Commits

  • A new ParquetRewriter is introduced to unify rewriting logic.
  • RewriteOptions is defined to provide essential settings.
  • CompressionConverter, ColumnPruner, ColumnMasker, and ColumnEncryptor have been refactored and marked as deprecated.
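To make the new flow concrete, here is a pseudocode-level sketch of how the unified rewriter described above might be driven. The builder method names follow the settings discussed in this PR (pruning, trans-compression), but the exact signatures, paths, and codec choice are assumptions, not verified against the merged API:

```java
// Pseudocode sketch only; names and signatures are assumed from the PR
// description, not verified against the merged parquet-mr API.
Path inputFile = new Path("/tmp/input.parquet");    // hypothetical paths
Path outputFile = new Path("/tmp/output.parquet");

RewriteOptions options = new RewriteOptions.Builder(conf, inputFile, outputFile)
    .prune(Arrays.asList("col_to_drop"))        // column pruning
    .transform(CompressionCodecName.ZSTD)       // trans-compression
    .build();

ParquetRewriter rewriter = new ParquetRewriter(options);
rewriter.processBlocks();   // rewrite all row groups into the output file
rewriter.close();
```

The point of the refactor is exactly this shape: one options object and one rewriter replacing four separate tools.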

@wgtmac wgtmac changed the title PARQUET-2075: Implement ParquetRewriter PARQUET-2075: Implement unified file rewriter Dec 14, 2022
@wgtmac (Member, Author) commented Dec 14, 2022

Can you please take a look when you have time? @shangxinli @gszadovszky @ggershinsky

@wgtmac wgtmac force-pushed the PARQUET-2075 branch 2 times, most recently from c377d1d to 6b10e9c Compare December 16, 2022 02:29
 - A new ParquetRewriter is introduced to unify rewriting logic.
 - RewriteOptions is defined to provide essential settings.
 - CompressionConverter, ColumnPruner, ColumnMasker, and ColumnEncryptor
   have been refactored.
// Get file metadata and full schema from the input file
meta = ParquetFileReader.readFooter(conf, inPath, NO_FILTER);
schema = meta.getFileMetaData().getSchema();
createdBy = meta.getFileMetaData().getCreatedBy();
Contributor:

There was some discussion about whether we should carry over the old author or replace it with the rewriter's author. Alternatively, we could append a 'reCreatedBy' field, but that would need a specification (parquet-format) change. Any thoughts on it?

@wgtmac (Member, Author):

IMHO, reusing the old createdBy is a bad idea because it makes it difficult to reason about the origin of the file. What about concatenating the old createdBy with the writer version of ParquetRewriter, so we can keep both the old and new creator in the same field?

@wgtmac (Member, Author):

I'd rather keep the same behavior for now and fix it in a separate JIRA.

Contributor:

Since we have only one field for created-by and its content is more or less specified, we should write the current version of parquet-mr there. It is also a good idea to keep the original created-by value, though. What do you think about adding it to the key_value_metadata with a specific key? Even though this field would not be specified, at least we won't lose the info.

@shangxinli (Contributor):

I just left some comments initially. I will spend more time on it.

@ggershinsky If you have time, can you have a look too?

@shangxinli (Contributor) commented Dec 24, 2022

It would be better if you could add more unit tests, particularly for combinations of prune, mask, trans-compression, etc.

@ggershinsky (Contributor):

> I just left some comments initially. I will spend more time on it.
>
> @ggershinsky If you have time, can you have a look too?

sure, will do

 - Rename EncryptorRunTime to ColumnChunkEncryptorRunTime.
 - Avoid redundant check in the ColumnChunkEncryptorRunTime.
 - Simplify MaskMode enum.
@wgtmac (Member, Author) commented Jan 5, 2023

> It would be better if you could add more unit tests, particularly for combinations of prune, mask, trans-compression, etc.

I have added some test cases to ParquetRewriterTest that mix configurations of prune, nullify, and trans-compression.

There are two outstanding issues:

  • Masking and encrypting the same column is not supported yet.
  • Fix the createdBy of the rewritten file to keep both the original info and the new rewriter info.

I am inclined to fix them in separate JIRAs because this patch simply refactors the different rewriters into a unified one without changing any behavior, and it is already large enough.

@ggershinsky @shangxinli Please review again when you have time. Thanks!

if (encryptColumns != null) {
for (String pruneColumn : pruneColumns) {
Preconditions.checkArgument(!encryptColumns.contains(pruneColumn),
"Cannot prune and mask same column");
Contributor:

Cannot prune and encrypt same column?

@wgtmac (Member, Author), Jan 10, 2023:

A pruned column does not exist in the rewritten file, so it no longer makes sense to encrypt the missing column.

Contributor:

Yep. I meant the exception text on line 149; it is identical to the masking check (line 142). I guess the encryption check should print "Cannot prune and encrypt same column".

@wgtmac (Member, Author):

Sorry, I misunderstood your meaning. Fixed the error message. Please take a look again. Thanks!
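The rule under discussion, that a column cannot be both pruned and encrypted, can be illustrated with a self-contained sketch. This is a hypothetical standalone version; the real check lives in the rewriter's option validation and uses Parquet's Preconditions helper:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PruneEncryptCheck {

  // Standalone illustration of the rule: a pruned column is removed from the
  // output file, so asking to encrypt it as well is a contradiction.
  static void checkNoOverlap(List<String> pruneColumns, List<String> encryptColumns) {
    if (encryptColumns == null || pruneColumns == null) {
      return; // nothing to cross-check
    }
    Set<String> encrypted = new HashSet<>(encryptColumns);
    for (String pruneColumn : pruneColumns) {
      if (encrypted.contains(pruneColumn)) {
        throw new IllegalArgumentException(
            "Cannot prune and encrypt same column: " + pruneColumn);
      }
    }
  }

  public static void main(String[] args) {
    checkNoOverlap(Arrays.asList("a"), Arrays.asList("b")); // disjoint: fine
    try {
      checkNoOverlap(Arrays.asList("a"), Arrays.asList("a"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints "Cannot prune and encrypt same column: a"
    }
  }
}
```

The corrected exception text matches the reviewer's suggestion above.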

private ParquetRewriter rewriter = null;

@Test
public void testPruneSingleColumnAndTranslateCodec() throws Exception {
Contributor:

Do we also have an encryption rewrite unit test?

@wgtmac (Member, Author):

Added a test case to prune, transcode and encrypt columns.

@shangxinli (Contributor) commented Jan 10, 2023

@gszadovszky, I just want to check whether you have time to have a look. @wgtmac, it is just nice of you to take over the work that we discussed earlier to have an aggregated rewriter.

@gszadovszky (Contributor):

> @gszadovszky, I just want to check whether you have time to have a look. @wgtmac, it is just nice of you to take over the work that we discussed earlier to have an aggregated rewriter.

@shangxinli, I'll try to take a look this week.

@shangxinli (Contributor):

Thanks a lot @gszadovszky

@wgtmac wgtmac requested review from ggershinsky and shangxinli and removed request for ggershinsky January 11, 2023 14:50
@wgtmac wgtmac requested review from ggershinsky and removed request for shangxinli January 11, 2023 14:50
@gszadovszky (Contributor) left a review:

I think it is a great refactor. Thanks a lot for working on it, @wgtmac!
On the other hand, I've thought about PARQUET-2075 as a request for a new feature in parquet-cli that can be used to convert one parquet file to another with specific configurations. (Later on we might extend it to allow multiple parquet files to be merged/rewritten into one, and the tool would decide which level of deserialization/serialization is required.)
I am fine with handling it in a separate jira, but let's make it clear: either create another jira for this refactor as a prerequisite of PARQUET-2075, or rephrase PARQUET-2075 and create a new one for parquet-cli.
@shangxinli, what do you think?

@wgtmac (Member, Author) commented Jan 14, 2023

> I think it is a great refactor. Thanks a lot for working on it, @wgtmac! On the other hand, I've thought about PARQUET-2075 as a request for a new feature in parquet-cli that can be used to convert one parquet file to another with specific configurations. (Later on we might extend it to allow multiple parquet files to be merged/rewritten into one, and the tool would decide which level of deserialization/serialization is required.) I am fine with handling it in a separate jira, but let's make it clear: either create another jira for this refactor as a prerequisite of PARQUET-2075, or rephrase PARQUET-2075 and create a new one for parquet-cli. @shangxinli, what do you think?

Thanks for your review @gszadovszky

  • I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.
  • Putting the original created_by into key_value_metadata is a good idea. However, it is tricky if a file has been rewritten several times. What about adding a key named original_created_by to key_value_metadata and concatenating all of the old created_by values into it?

@gszadovszky (Contributor):

> I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.

Perfect! :)

> Putting the original created_by into key_value_metadata is a good idea. However, it is tricky if a file has been rewritten several times. What about adding a key named original_created_by to key_value_metadata and concatenating all of the old created_by values into it?

It sounds good to me. Maybe have the latest one at the beginning and use the separator '\n'?

@wgtmac (Member, Author) commented Jan 14, 2023

> I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.
>
> Perfect! :)
>
> Putting the original created_by into key_value_metadata is a good idea. However, it is tricky if a file has been rewritten several times. What about adding a key named original_created_by to key_value_metadata and concatenating all of the old created_by values into it?
>
> It sounds good to me. Maybe have the latest one at the beginning and use the separator '\n'?

I am afraid some implementations may drop characters after '\n' when displaying the string content. Let me do some investigation.

@wgtmac wgtmac changed the title PARQUET-2075: Implement unified file rewriter PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation Jan 14, 2023
@gszadovszky (Contributor):

> I am afraid some implementations may drop characters after '\n' when displaying the string content. Let me do some investigation.

I do not have a strong opinion on '\n'; we just need a character that probably won't be used by any system writing parquet files.

@wgtmac (Member, Author) commented Jan 15, 2023

> I am afraid some implementations may drop characters after '\n' when displaying the string content. Let me do some investigation.
>
> I do not have a strong opinion on '\n'; we just need a character that probably won't be used by any system writing parquet files.

As we are discussing a new entry (original.created.by) in the key-value metadata, I need to raise two related issues for once we support rewriting (merging) several files into one:

  • We would need to merge original.created.by from all input files, making it difficult to tell which created_by comes from which input file. Therefore, original.created.by should be dropped in this case.
  • Is there any key-value metadata that may conflict across different input files and should be dealt with by the rewriter? For now we simply keep all the old key-value metadata from the old file.

@gszadovszky @ggershinsky @shangxinli Thoughts?

If this behavior requires further discussion, I'd suggest keeping the current created_by behavior unchanged in this pull request, which is large enough already. All rewriters (ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncryptor) have dropped the original created_by and store the current writer version in the footer.

@gszadovszky (Contributor):

I agree that merging the key-value metadata is not an easy question. Let's discuss it separately as it is not related to this PR.

I also agree to store the current writer (parquet-mr) in created_by in case of rewriting. It is not easy to decide what the proper solution would be anyway. created_by is usually used for handling potentially erroneous writes. Let's say there was an issue in parquet-mr version 1.2.3 that wrote a specific encoding of integers wrongly (not according to the spec). What if we rewrite the file but do not re-encode the pages? Can we still handle the original issue? What if the rewriter re-encodes the related pages?
Let's store the original writer in original.created.by for now. Let's discuss this topic separately; however, I am not sure we can find a proper solution.

@wgtmac (Member, Author) commented Jan 15, 2023

> I agree that merging the key-value metadata is not an easy question. Let's discuss it separately as it is not related to this PR.
>
> I also agree to store the current writer (parquet-mr) in created_by in case of rewriting. It is not easy to decide what the proper solution would be anyway. created_by is usually used for handling potentially erroneous writes. Let's say there was an issue in parquet-mr version 1.2.3 that wrote a specific encoding of integers wrongly (not according to the spec). What if we rewrite the file but do not re-encode the pages? Can we still handle the original issue? What if the rewriter re-encodes the related pages? Let's store the original writer in original.created.by for now. Let's discuss this topic separately; however, I am not sure we can find a proper solution.

I agree. I have now updated this PR to preserve the old writer version in original.created.by and added a test to make sure it is preserved. Please take a look. Thanks!
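The scheme agreed above, accumulating writer strings under a single original.created.by key with the latest one first, can be sketched with a self-contained example. The '\n' separator and the helper below are illustrative assumptions from the discussion, not the merged implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OriginalCreatedBy {

  static final String KEY = "original.created.by";
  static final String SEPARATOR = "\n"; // assumed separator; the discussion left this open

  // Before rewriting, prepend the input file's created_by to any previously
  // accumulated original.created.by value, so repeated rewrites keep the
  // full writer history with the latest writer first.
  static Map<String, String> preserve(Map<String, String> keyValueMetadata, String oldCreatedBy) {
    Map<String, String> merged = new LinkedHashMap<>(keyValueMetadata);
    String previous = merged.get(KEY);
    merged.put(KEY, previous == null ? oldCreatedBy : oldCreatedBy + SEPARATOR + previous);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> kv = new LinkedHashMap<>();
    kv = preserve(kv, "parquet-mr version 1.12.0");  // first rewrite
    kv = preserve(kv, "parquet-mr version 1.13.0");  // second rewrite
    System.out.println(kv.get(KEY)); // prints the latest writer first, then the older one on the next line
  }
}
```

Note the concern raised earlier in the thread still applies: a display tool that truncates at '\n' would show only the latest writer.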

@wgtmac (Member, Author) commented Jan 29, 2023

Gentle ping. @gszadovszky @ggershinsky @shangxinli

Any chance to take another look?

@gszadovszky (Contributor):

Thank you, @wgtmac, for working on this! It looks good to me.
Since there were other reviewers, I would wait a bit to give them a chance to add feedback. I'll merge it after a couple of days. (Feel free to ping me if I forget.)

@shangxinli (Contributor) commented Jan 29, 2023

Thanks @wgtmac for working on this, and thanks @gszadovszky and @ggershinsky for reviewing it! I am a little late to the comment discussion, but I see we are heading in the right direction. Let's address it in a separate discussion. If it turns out that changing parquet-format is the right way to solve it, we can make the proposal, and I can help with the approval process.

My comments are all addressed. I don't have further comments.

@ggershinsky (Contributor) left a review:

My comments are addressed too. Thanks for working on this PR.

@gszadovszky gszadovszky merged commit d6417df into apache:master Jan 29, 2023
@wgtmac (Member, Author) commented Jan 30, 2023

Thank you @gszadovszky @ggershinsky @shangxinli
