
PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation #1014

Merged: 7 commits merged into apache:master from the PARQUET-2075 branch on Jan 29, 2023

Conversation

@wgtmac (Member) commented Dec 14, 2022

Jira

Tests

  • Make sure all tests pass, especially CompressionConverterTest, ColumnPrunerTest, ColumnMaskerTest, and ColumnEncryptorTest.

Commits

  • A new ParquetRewriter is introduced to unify rewriting logic.
  • RewriteOptions is defined to provide essential settings.
  • CompressionConverter, ColumnPruner, ColumnMasker, and ColumnEncryptor have been refactored and marked as deprecated.
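To make the new flow concrete, here is a pseudocode-level sketch of how the unified rewriter described above might be driven. The builder method names follow the settings discussed in this PR (pruning, trans-compression), but the exact signatures, paths, and codec choice are assumptions, not verified against the merged API:

```java
// Pseudocode sketch only; names and signatures are assumed from the PR
// description, not verified against the merged parquet-mr API.
Path inputFile = new Path("/tmp/input.parquet");    // hypothetical paths
Path outputFile = new Path("/tmp/output.parquet");

RewriteOptions options = new RewriteOptions.Builder(conf, inputFile, outputFile)
    .prune(Arrays.asList("col_to_drop"))        // column pruning
    .transform(CompressionCodecName.ZSTD)       // trans-compression
    .build();

ParquetRewriter rewriter = new ParquetRewriter(options);
rewriter.processBlocks();   // rewrite all row groups into the output file
rewriter.close();
```

The point of the refactor is exactly this shape: one options object and one rewriter replacing four separate tools.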

@wgtmac wgtmac changed the title PARQUET-2075: Implement ParquetRewriter PARQUET-2075: Implement unified file rewriter Dec 14, 2022
@wgtmac (Member, Author) commented Dec 14, 2022

Can you please take a look when you have time? @shangxinli @gszadovszky @ggershinsky

@wgtmac wgtmac force-pushed the PARQUET-2075 branch 2 times, most recently from c377d1d to 6b10e9c Compare December 16, 2022 02:29
 - A new ParquetRewriter is introduced to unify rewriting logic.
 - RewriteOptions is defined to provide essential settings.
 - CompressionConverter, ColumnPruner, ColumnMasker, and ColumnEncryptor
   have been refactored.
// Get file metadata and full schema from the input file
meta = ParquetFileReader.readFooter(conf, inPath, NO_FILTER);
schema = meta.getFileMetaData().getSchema();
createdBy = meta.getFileMetaData().getCreatedBy();
Contributor:

There was some discussion about whether we should carry over the old author or replace it with the rewriter's author. Alternatively, we could append a 'reCreatedBy' field, but that would need a specification (parquet-format) change. Any thoughts on it?

@wgtmac (Member, Author):

IMHO, reusing the old createdBy is a bad idea because it makes it difficult to reason about the origin of the file. What about concatenating the old createdBy with the writer version of ParquetRewriter, so we can keep both the old and new creator in the same field?

@wgtmac (Member, Author):

I'd rather keep the same behavior for now and fix it in a separate JIRA.

Contributor:

Since we have only one field for created-by and its content is more or less specified, we should write the current version of parquet-mr there. It is also a good idea to keep the original created-by value, though. What do you think about adding it to the key_value_metadata with a specific key? Even though this field would not be specified, at least we won't lose the info.

@shangxinli (Contributor):

I just left some comments initially. I will spend more time on it.

@ggershinsky If you have time, can you have a look too?

@shangxinli (Contributor) commented Dec 24, 2022

It would be better if you could add more unit tests, particularly for combinations of prune, mask, trans-compression, etc.

@ggershinsky (Contributor):

> I just left some comments initially. I will spend more time on it.
>
> @ggershinsky If you have time, can you have a look too?

sure, will do

 - Rename EncryptorRunTime to ColumnChunkEncryptorRunTime.
 - Avoid redundant check in the ColumnChunkEncryptorRunTime.
 - Simplify MaskMode enum.
@wgtmac (Member, Author) commented Jan 5, 2023

> It would be better if you could add more unit tests, particularly for combinations of prune, mask, trans-compression, etc.

I have added some test cases to ParquetRewriterTest that mix configurations of prune, nullify, and trans-compression.

There are two outstanding issues:

  • Masking and encrypting the same column is not supported yet.
  • Fix the createdBy of the rewritten file to keep both the original info and the new rewriter info.

I am inclined to fix them in separate JIRAs because this patch simply refactors the different rewriters into a unified one without changing any behavior, and it is already large enough.

@ggershinsky @shangxinli Please review again when you have time. Thanks!

if (encryptColumns != null) {
for (String pruneColumn : pruneColumns) {
Preconditions.checkArgument(!encryptColumns.contains(pruneColumn),
"Cannot prune and mask same column");
Contributor:

Cannot prune and encrypt same column?

@wgtmac (Member, Author), Jan 10, 2023:

A pruned column does not exist in the rewritten file, so it no longer makes sense to encrypt the missing column.

Contributor:

Yep. I meant the exception text on line 149; it is identical to the masking check (line 142). I guess the encryption check should print "Cannot prune and encrypt same column".

@wgtmac (Member, Author):

Sorry, I misunderstood your meaning. Fixed the error message. Please take a look again. Thanks!
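The rule under discussion, that a column cannot be both pruned and encrypted, can be illustrated with a self-contained sketch. This is a hypothetical standalone version; the real check lives in the rewriter's option validation and uses Parquet's Preconditions helper:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PruneEncryptCheck {

  // Standalone illustration of the rule: a pruned column is removed from the
  // output file, so asking to encrypt it as well is a contradiction.
  static void checkNoOverlap(List<String> pruneColumns, List<String> encryptColumns) {
    if (encryptColumns == null || pruneColumns == null) {
      return; // nothing to cross-check
    }
    Set<String> encrypted = new HashSet<>(encryptColumns);
    for (String pruneColumn : pruneColumns) {
      if (encrypted.contains(pruneColumn)) {
        throw new IllegalArgumentException(
            "Cannot prune and encrypt same column: " + pruneColumn);
      }
    }
  }

  public static void main(String[] args) {
    checkNoOverlap(Arrays.asList("a"), Arrays.asList("b")); // disjoint: fine
    try {
      checkNoOverlap(Arrays.asList("a"), Arrays.asList("a"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints "Cannot prune and encrypt same column: a"
    }
  }
}
```

The corrected exception text matches the reviewer's suggestion above.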

private ParquetRewriter rewriter = null;

@Test
public void testPruneSingleColumnAndTranslateCodec() throws Exception {
Contributor:

Do we also have an encryption rewrite unit test?

@wgtmac (Member, Author):

Added a test case to prune, transcode and encrypt columns.

@shangxinli (Contributor) commented Jan 10, 2023

@gszadovszky, I just want to check whether you have time to have a look. @wgtmac, it is just nice of you to take over the work that we discussed earlier to have an aggregated rewriter.

@gszadovszky (Contributor):

> @gszadovszky, I just want to check whether you have time to have a look. @wgtmac, it is just nice of you to take over the work that we discussed earlier to have an aggregated rewriter.

@shangxinli, I'll try to take a look this week.

@shangxinli (Contributor):

Thanks a lot @gszadovszky

@wgtmac wgtmac requested review from ggershinsky and shangxinli and removed request for ggershinsky January 11, 2023 14:50
@wgtmac wgtmac requested review from ggershinsky and removed request for shangxinli January 11, 2023 14:50
@gszadovszky (Contributor) left a review:

I think it is a great refactor. Thanks a lot for working on it, @wgtmac!
On the other hand, I've thought about PARQUET-2075 as a request for a new feature in parquet-cli that can be used to convert one parquet file to another with specific configurations. (Later on we might extend it to allow multiple parquet files to be merged/rewritten into one, and the tool would decide which level of deserialization/serialization is required.)
I am fine with handling it in a separate jira, but let's make it clear: either create another jira for this refactor as a prerequisite of PARQUET-2075, or rephrase PARQUET-2075 and create a new one for parquet-cli.
@shangxinli, what do you think?

@wgtmac (Member, Author) commented Jan 14, 2023

> I think it is a great refactor. Thanks a lot for working on it, @wgtmac! On the other hand, I've thought about PARQUET-2075 as a request for a new feature in parquet-cli that can be used to convert one parquet file to another with specific configurations. (Later on we might extend it to allow multiple parquet files to be merged/rewritten into one, and the tool would decide which level of deserialization/serialization is required.) I am fine with handling it in a separate jira, but let's make it clear: either create another jira for this refactor as a prerequisite of PARQUET-2075, or rephrase PARQUET-2075 and create a new one for parquet-cli. @shangxinli, what do you think?

Thanks for your review @gszadovszky

  • I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.
  • Putting the original created_by into key_value_metadata is a good idea. However, it is tricky if a file has been rewritten several times. What about adding a key named original_created_by to key_value_metadata and concatenating all of the old created_by values into it?

@gszadovszky (Contributor):

> I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.

Perfect! :)

> Putting the original created_by into key_value_metadata is a good idea. However, it is tricky if a file has been rewritten several times. What about adding a key named original_created_by to key_value_metadata and concatenating all of the old created_by values into it?

It sounds good to me. Maybe have the latest one at the beginning and use the separator '\n'?

@wgtmac (Member, Author) commented Jan 14, 2023

> I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.
>
> Perfect! :)
>
> Putting the original created_by into key_value_metadata is a good idea. However, it is tricky if a file has been rewritten several times. What about adding a key named original_created_by to key_value_metadata and concatenating all of the old created_by values into it?
>
> It sounds good to me. Maybe have the latest one at the beginning and use the separator '\n'?

I am afraid some implementations may drop characters after '\n' when displaying the string content. Let me do some investigation.

@wgtmac wgtmac changed the title PARQUET-2075: Implement unified file rewriter PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation Jan 14, 2023
@gszadovszky (Contributor):

> I am afraid some implementations may drop characters after '\n' when displaying the string content. Let me do some investigation.

I do not have a strong opinion on '\n'; we just need a character that probably won't be used by any system writing parquet files.

@wgtmac (Member, Author) commented Jan 15, 2023

> I am afraid some implementations may drop characters after '\n' when displaying the string content. Let me do some investigation.
>
> I do not have a strong opinion on '\n'; we just need a character that probably won't be used by any system writing parquet files.

As we are discussing a new entry (original.created.by) in the key-value metadata, I need to raise two related issues for once we support rewriting (merging) several files into one:

  • We would need to merge original.created.by from all input files, making it difficult to tell which created_by comes from which input file. Therefore, original.created.by should be dropped in this case.
  • Is there any key-value metadata that may conflict across different input files and should be dealt with by the rewriter? For now we simply keep all the old key-value metadata from the old file.

@gszadovszky @ggershinsky @shangxinli Thoughts?

If this behavior requires further discussion, I'd suggest keeping the current created_by behavior unchanged in this pull request, which is large enough already. All rewriters (ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncryptor) have dropped the original created_by and store the current writer version in the footer.

@gszadovszky (Contributor):

I agree that merging the key-value metadata is not an easy question. Let's discuss it separately as it is not related to this PR.

I also agree to store the current writer (parquet-mr) in created_by in case of rewriting. It is not easy to decide what the proper solution would be anyway. created_by is usually used for handling potentially erroneous writes. Let's say there was an issue in parquet-mr version 1.2.3 that wrote a specific encoding of integers wrongly (not according to the spec). What if we rewrite the file but do not re-encode the pages? Can we still handle the original issue? What if the rewriter re-encodes the related pages?
Let's store the original writer in original.created.by for now. Let's discuss this topic separately; however, I am not sure we can find a proper solution.

@wgtmac (Member, Author) commented Jan 15, 2023

> I agree that merging the key-value metadata is not an easy question. Let's discuss it separately as it is not related to this PR.
>
> I also agree to store the current writer (parquet-mr) in created_by in case of rewriting. It is not easy to decide what the proper solution would be anyway. created_by is usually used for handling potentially erroneous writes. Let's say there was an issue in parquet-mr version 1.2.3 that wrote a specific encoding of integers wrongly (not according to the spec). What if we rewrite the file but do not re-encode the pages? Can we still handle the original issue? What if the rewriter re-encodes the related pages? Let's store the original writer in original.created.by for now. Let's discuss this topic separately; however, I am not sure we can find a proper solution.

I agree. I have now updated this PR to preserve the old writer version in original.created.by and added a test to make sure it is preserved. Please take a look. Thanks!
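The scheme agreed above, accumulating writer strings under a single original.created.by key with the latest one first, can be sketched with a self-contained example. The '\n' separator and the helper below are illustrative assumptions from the discussion, not the merged implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OriginalCreatedBy {

  static final String KEY = "original.created.by";
  static final String SEPARATOR = "\n"; // assumed separator; the discussion left this open

  // Before rewriting, prepend the input file's created_by to any previously
  // accumulated original.created.by value, so repeated rewrites keep the
  // full writer history with the latest writer first.
  static Map<String, String> preserve(Map<String, String> keyValueMetadata, String oldCreatedBy) {
    Map<String, String> merged = new LinkedHashMap<>(keyValueMetadata);
    String previous = merged.get(KEY);
    merged.put(KEY, previous == null ? oldCreatedBy : oldCreatedBy + SEPARATOR + previous);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> kv = new LinkedHashMap<>();
    kv = preserve(kv, "parquet-mr version 1.12.0");  // first rewrite
    kv = preserve(kv, "parquet-mr version 1.13.0");  // second rewrite
    System.out.println(kv.get(KEY)); // prints the latest writer first, then the older one on the next line
  }
}
```

Note the concern raised earlier in the thread still applies: a display tool that truncates at '\n' would show only the latest writer.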

@wgtmac (Member, Author) commented Jan 29, 2023

Gentle ping. @gszadovszky @ggershinsky @shangxinli

Any chance to take another look?

@gszadovszky (Contributor):

Thank you, @wgtmac, for working on this! It looks good to me.
Since there were other reviewers, I would wait a bit to give them a chance to add feedback. I'll merge it after a couple of days. (Feel free to ping me if I forget.)

@shangxinli (Contributor) commented Jan 29, 2023

Thanks @wgtmac for working on this, and thanks @gszadovszky and @ggershinsky for reviewing it! I am a little late to the comment discussion, but I see we are heading in the right direction. Let's address it in a separate discussion. If it turns out that changing parquet-format is the right way to solve it, we can make the proposal, and I can help with the approval process.

My comments are all addressed. I don't have further comments.

@ggershinsky (Contributor) left a review:

My comments are addressed too. Thanks for working on this PR.

@gszadovszky gszadovszky merged commit d6417df into apache:master Jan 29, 2023
@wgtmac (Member, Author) commented Jan 30, 2023

Thank you @gszadovszky @ggershinsky @shangxinli
