PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation #1014
Conversation
Can you please take a look when you have time? @shangxinli @gszadovszky @ggershinsky
Force-pushed c377d1d to 6b10e9c
- A new ParquetRewriter is introduced to unify rewriting logic.
- RewriteOptions is defined to provide essential settings.
- CompressionConverter, ColumnPruner, ColumnMasker, and ColumnEncryptor have been refactored.
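To illustrate the role RewriteOptions plays in the commit above, here is a simplified, hypothetical sketch of a builder-style options class. The class, method, and field names below are illustrative only and do not match the actual parquet-mr API.

```java
// Hypothetical sketch: a builder-style options carrier in the spirit of the
// RewriteOptions introduced by this PR. Names are assumptions, not the real API.
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RewriteOptionsSketch {
  private final String inputFile;
  private final String outputFile;
  private final List<String> pruneColumns;
  private final String transcodeCodec;

  private RewriteOptionsSketch(Builder b) {
    this.inputFile = b.inputFile;
    this.outputFile = b.outputFile;
    this.pruneColumns = Collections.unmodifiableList(b.pruneColumns);
    this.transcodeCodec = b.transcodeCodec;
  }

  String getInputFile() { return inputFile; }
  String getOutputFile() { return outputFile; }
  List<String> getPruneColumns() { return pruneColumns; }
  String getTranscodeCodec() { return transcodeCodec; }

  static class Builder {
    private final String inputFile;
    private final String outputFile;
    private List<String> pruneColumns = Collections.emptyList();
    private String transcodeCodec = null; // null means: keep the input codec

    Builder(String inputFile, String outputFile) {
      this.inputFile = inputFile;
      this.outputFile = outputFile;
    }

    // Columns listed here are dropped from the rewritten file.
    Builder prune(String... columns) {
      this.pruneColumns = Arrays.asList(columns);
      return this;
    }

    // Target compression codec for trans-compression.
    Builder transcode(String codec) {
      this.transcodeCodec = codec;
      return this;
    }

    RewriteOptionsSketch build() {
      return new RewriteOptionsSketch(this);
    }
  }

  public static void main(String[] args) {
    RewriteOptionsSketch opts = new Builder("in.parquet", "out.parquet")
        .prune("ssn")
        .transcode("ZSTD")
        .build();
    System.out.println(opts.getPruneColumns()); // [ssn]
  }
}
```

A single immutable options object like this lets one ParquetRewriter entry point replace the four separate rewriters, since each legacy tool becomes just a particular combination of settings.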
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java
// Get file metadata and full schema from the input file
meta = ParquetFileReader.readFooter(conf, inPath, NO_FILTER);
schema = meta.getFileMetaData().getSchema();
createdBy = meta.getFileMetaData().getCreatedBy();
There was some discussion about whether we should carry over the old author or replace it with the rewriter's author. Or we could append a 'reCreatedBy' field, but that needs a specification (parquet-format) change. Any thoughts on it?
IMHO, reusing the old createdBy is a bad idea: it makes it difficult to reason about the origin of the file. What about concatenating the old createdBy with the writer version of ParquetRewriter, so we can keep both the old and the new creator in the same field?
I'd rather keep the same behavior for now and fix it in a separate JIRA.
Since we have only one field for created-by and its content is more or less specified, we should write the current version of parquet-mr there. It is also a good idea to keep the original created-by value, though. What do you think about adding it to the key_value_metadata with a specific key? Even though this field would not be specified, at least we won't lose the info.
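The suggestion above can be sketched as plain map manipulation. Note that the key name below is an assumption: the comment only proposes "a specific key" in key_value_metadata, and the exact string the PR ended up using is not shown here.

```java
// Hedged sketch of the "preserve the original created_by" idea, using plain
// Java collections rather than the real parquet-mr footer classes.
import java.util.HashMap;
import java.util.Map;

class CreatedByHandling {
  // Hypothetical key name; the actual key chosen by the PR may differ.
  static final String ORIGINAL_CREATED_BY_KEY = "original.created.by";

  // Keep the rewriter's own version in created_by, and stash the input
  // file's original created_by value in the key-value metadata so the
  // provenance information is not lost.
  static Map<String, String> preserveOriginalCreatedBy(
      String originalCreatedBy, Map<String, String> keyValueMetadata) {
    Map<String, String> merged = new HashMap<>(keyValueMetadata);
    if (originalCreatedBy != null) {
      merged.putIfAbsent(ORIGINAL_CREATED_BY_KEY, originalCreatedBy);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Illustrative created_by string only.
    Map<String, String> kv = preserveOriginalCreatedBy(
        "parquet-mr version 1.12.0", new HashMap<>());
    System.out.println(kv.get(ORIGINAL_CREATED_BY_KEY));
  }
}
```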
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
I just left some initial comments. I will spend more time on it. @ggershinsky If you have time, can you have a look too?
If you can add more unit tests, particularly for the combinations of prune, mask, trans-compression, etc., that would be better.
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
Sure, will do.
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/MaskMode.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
- Rename EncryptorRunTime to ColumnChunkEncryptorRunTime.
- Avoid redundant check in the ColumnChunkEncryptorRunTime.
- Simplify MaskMode enum.
I have added some test cases. There are two outstanding issues, as below:
I am inclined to fix them in separate JIRAs, because this patch simply refactors to unify the different rewriters without changing any behavior, and it is large enough already. @ggershinsky @shangxinli Please review again when you have time. Thanks!
if (encryptColumns != null) {
  for (String pruneColumn : pruneColumns) {
    Preconditions.checkArgument(!encryptColumns.contains(pruneColumn),
        "Cannot prune and mask same column");
Cannot prune and encrypt same column?
A pruned column does not exist in the rewritten file, so it does not make sense to encrypt the missing column any more.
Yep. I meant the exception text on line 149; it is identical to the masking check (line 142). I guess the encryption check should print "Cannot prune and encrypt same column".
Sorry, I misunderstood your meaning. Fixed the error message. Please take a look again. Thanks!
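For clarity, the fix discussed above can be sketched as a self-contained validation routine with a distinct message per conflict. The helper below stands in for the Guava `Preconditions.checkArgument` call used in the actual code, and the method name `validate` is illustrative.

```java
// Sketch of the corrected prune/mask/encrypt validation. Pruned columns are
// removed from the rewritten file, so they can be neither masked nor
// encrypted, and each violation should report its own message.
import java.util.Arrays;
import java.util.List;

class RewriteOptionChecks {
  // Minimal stand-in for com.google.common.base.Preconditions.checkArgument.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  static void validate(List<String> pruneColumns,
                       List<String> maskColumns,
                       List<String> encryptColumns) {
    for (String pruneColumn : pruneColumns) {
      if (maskColumns != null) {
        checkArgument(!maskColumns.contains(pruneColumn),
            "Cannot prune and mask same column");
      }
      if (encryptColumns != null) {
        checkArgument(!encryptColumns.contains(pruneColumn),
            "Cannot prune and encrypt same column");
      }
    }
  }

  public static void main(String[] args) {
    // Disjoint column sets pass silently.
    validate(Arrays.asList("a"), Arrays.asList("b"), Arrays.asList("c"));
    // Pruning and encrypting the same column fails with the encrypt message.
    try {
      validate(Arrays.asList("a"), null, Arrays.asList("a"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // Cannot prune and encrypt same column
    }
  }
}
```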
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java
private ParquetRewriter rewriter = null;

@Test
public void testPruneSingleColumnAndTranslateCodec() throws Exception {
Do we also have an encryption rewrite unit test?
Added a test case to prune, transcode and encrypt columns.
@gszadovszky I just want to check if you have time to have a look. @wgtmac It is really nice of you to take over the work that we discussed earlier to have an aggregated rewriter.
@shangxinli, I'll try to take a look this week.
Thanks a lot @gszadovszky
I think it is a great refactor. Thanks a lot for working on it, @wgtmac!
On the other hand, I've thought about PARQUET-2075 as a request for a new feature in parquet-cli that can be used to convert one parquet file to another with specific configurations. (Later on we might extend it to allow multiple parquet files to be merged/rewritten into one, and the tool would decide which level of deserialization/serialization is required.)
I am fine with handling it in a separate jira, but let's make it clear: either create another jira for this refactor as a prerequisite of PARQUET-2075, or rephrase PARQUET-2075 and create a new one for parquet-cli.
@shangxinli, what do you think?
Thanks for your review @gszadovszky
Perfect! :)
It sounds good to me. Maybe have the latest one at the beginning and use the separator …
I am afraid some implementations may drop characters after …
I do not have a strong opinion for …
As we are discussing a new entry (…
@gszadovszky @ggershinsky @shangxinli Thoughts? If this behavior requires further discussion, I'd suggest keeping the current state of …
I agree that merging the key-value metadata is not an easy question. Let's discuss it separately, as it is not related to this PR. I also agree to store the current writer (parquet-mr) in …
I agree. Now I have updated this PR to preserve the old writer version into …
Gentle ping. @gszadovszky @ggershinsky @shangxinli Any chance to take another look?
Thank you, @wgtmac for working on this! It looks good to me.
Thanks @wgtmac for working on this, and thanks @gszadovszky and @ggershinsky for reviewing it! I am a little late to the comments discussion, but I see we are in the right direction. Let's address it in a separate discussion. If it turns out that changing the parquet-format is the right way to solve it, we can make the proposal and I can help with the approval process. My comments are all addressed. I don't have further comments.
My comments are addressed too. Thanks for working on this PR.
Thank you @gszadovszky @ggershinsky @shangxinli