PARQUET-2078: Failed to read parquet file after writing with the same … #925
Conversation
This patch fixes both the write path and the read path. Another solution for the read path:
Thanks a lot for working on this, @loudongfeng!
I think it is fine not to use the file offset at all but to calculate it from the offsets of the first column chunk. @ggershinsky, what do you think?
However, you also need to handle the case of an invalid dictionary offset in getOffset(ColumnChunk). Please check my comment in the Jira.
Also, please check the whole code for potential usage of the dictionary offset and the file offset. It would also be great if you could validate the new code against the original invalid files.
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileWriter.java
Excellent work. Thanks a lot for your efforts!
@ggershinsky Do you want to have a look?
Sure. This won't work if the first column is encrypted and the reader doesn't have its key. Can the "write" part be fixed instead, so the RowGroup offset is set correctly?
@ggershinsky, even though this PR fixes the write path as well, we have already released 1.12.0, so we have to prepare for the case of files that were already written with invalid offsets.
Yep, but the current fix perpetuates the situation where some readers can't process encrypted files even if they have keys for all projected columns; it doesn't look like an optimal long-term solution. I'm just back online after a vacation; I will go over the details in this thread, maybe something else can be done here.
@ggershinsky, sorry, I've completely missed the fact that the first column chunk's metadata may not be readable in encrypted files.
FYI, maybe we can make use of this information: RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
…parquet version: read path fix that makes use of this information: RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
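The read-path formula above can be sketched as follows. This is a minimal, self-contained illustration, not the parquet-mr code: the method and the plain arrays stand in for the Thrift RowGroup struct's file_offset and total_compressed_size fields.

```java
public class OffsetReconstruction {

    // file_offset of later row groups may be wrong (PARQUET-2078), but the
    // first one and every total_compressed_size are reliable, so:
    //   offset[n] = offset[n-1] + compressedSize[n-1]
    static long[] reconstructOffsets(long firstFileOffset, long[] totalCompressedSizes) {
        long[] offsets = new long[totalCompressedSizes.length];
        long running = firstFileOffset; // the first block always holds the truth
        for (int n = 0; n < totalCompressedSizes.length; n++) {
            offsets[n] = running;
            running += totalCompressedSizes[n];
        }
        return offsets;
    }
}
```

Note this reconstruction is only exact when there is no padding between row groups; the padding caveat comes up later in the thread.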
@gszadovszky No problem at all, thank you for helping with this!
Yep, exactly! If there are no hidden surprises and this works as expected, it would certainly be the optimal solution. While at it, maybe you can also add a check on the write side to verify the RG offset values in a similar manner (each must be equal to the sum of the previous RG sizes plus the first RG offset; this also runs in a loop). Thanks @loudongfeng!
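The suggested write-side check might look like this minimal sketch. The method and variable names are assumptions, not the actual ParquetFileWriter code; `>=` rather than `==` allows for alignment padding between row groups.

```java
public class RowGroupOffsetCheck {

    // Sketch of the write-side validation suggested above: each row group must
    // start at or after the end of the previous one (">=" because the writer
    // may insert alignment padding between row groups, see PARQUET-306).
    static void validateOffsets(long[] startPositions, long[] compressedSizes) {
        for (int i = 1; i < startPositions.length; i++) {
            long prevEnd = startPositions[i - 1] + compressedSizes[i - 1];
            if (startPositions[i] < prevEnd) {
                throw new IllegalStateException(
                    "Row group " + i + " starts at " + startPositions[i]
                        + " before the previous group ends at " + prevEnd);
            }
        }
    }
}
```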
//the first block always holds the truth
startIndex = rowGroup.getFile_offset();
} else {
//calculate offset for other blocks
Could you expand the comment a bit to briefly explain the problem (maybe mentioning the Jira number)? This would help ensure future changes don't revert this to the more intuitive getFile_offset().
…parquet version: addressing review comments: more checks on the writer side
…parquet version: taking alignment padding and summary file into account
for (BlockMetaData block : blocks) {
  numRows += block.getRowCount();
  long blockStartPos = block.getStartingPos();
  // first block
  if (blockStartPos == 4) {
Why is this check necessary? Doesn't the first block always start at 4? Or does this address a file-merging use case?
It addresses the _common_metadata file case, which merges multiple file footers into a single metadata file.
  preBlockCompressedSize = 0;
}
if (preBlockStartPos != 0) {
  assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
Why >= instead of ==?
There can be alignment padding between row groups; see PARQUET-306 and https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L411
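As a rough illustration of why the end of one row group may not be the start of the next: the writer can pad up to a maximum so that the next row group starts on an HDFS block boundary. The helper name and numbers below are hypothetical; the real logic lives in ParquetFileWriter.

```java
public class AlignmentPadding {

    // Hypothetical sketch: pad to the next block boundary only when the gap is
    // small enough (maxPadding); otherwise start the row group where we are.
    static long paddingToNextBoundary(long currentPos, long blockSize, long maxPadding) {
        long gap = blockSize - (currentPos % blockSize);
        if (gap == blockSize) {
            return 0; // already aligned
        }
        return gap <= maxPadding ? gap : 0;
    }
}
```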
if (startIndex < minStartIndex) {
  // a bad offset detected, try first column's offset
  // can not use minStartIndex in case of padding
  startIndex = getOffset(rowGroup.getColumns().get(0));
this will throw an exception in encrypted files.
In the case of encrypted files with a wrongly set file_offset, I have no idea how to fix it when alignment padding takes place.
If there is no padding, then we can just use the calculated index.
I didn't find any footer metadata about the padding position or anything else indicating whether padding occurred.
got it, will check a few things
@loudongfeng @gszadovszky @shangxinli @sunchao. Looks like the problem is only in one mode of encryption ("encrypted footer", where the ColumnMetaData is not set). Let me propose the following:
long minStartIndex = preStartIndex + preCompressedSize;
if (startIndex < minStartIndex) {
  // a bad offset detected, try first column's offset if available
  ColumnChunk columnChunk = rowGroup.getColumns().get(0);
  if (columnChunk.isSetMeta_data()) {
    startIndex = getOffset(columnChunk);
  } else { // EncryptedFooter mode. Plaintext ColumnMetaData is not available.
    // use minStartIndex (imprecise in case of padding, but good enough for filtering)
    startIndex = minStartIndex;
  }
}
what do you think?
That might work for range filtering. As for explicit offset filtering, the precision might be important in systems that use padding (HDFS, WEBHDFS, VIEWFS). Instead of startIndex = minStartIndex, we might have to throw an exception when this is called from filterFileMetaDataByStart, saying that this file can't be split by offsets and should be processed without splitting. Again, given the number of conditions required for this situation to occur, and the fact that 1.12.0 has not yet been picked up by any framework, such exceptions might never be thrown in practice.
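Putting the proposal and this refinement together, the decision logic might be sketched as follows. The method name and the boolean flag are illustrative, not the actual ParquetMetadataConverter API.

```java
public class StartIndexResolution {

    // Sketch of the agreed behavior: trust a plausible file_offset; otherwise
    // fall back to the first column chunk's offset; and if even that is
    // unavailable (encrypted-footer mode), either refuse offset-based splitting
    // or accept the imprecise lower bound for range filtering.
    static long resolveStartIndex(long startIndex, long minStartIndex,
                                  boolean firstColumnMetaAvailable, long firstColumnOffset,
                                  boolean calledFromOffsetFilter) {
        if (startIndex >= minStartIndex) {
            return startIndex; // recorded offset is plausible
        }
        if (firstColumnMetaAvailable) {
            return firstColumnOffset; // recover from the first column chunk
        }
        if (calledFromOffsetFilter) {
            throw new IllegalStateException(
                "Invalid offset in an encrypted-footer file: cannot split by offsets");
        }
        return minStartIndex; // imprecise with padding, but fine for range filtering
    }
}
```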
@ggershinsky, your proposal sounds perfect to me. Looking forward to your patch, or shall I update the commit following your proposal?
Another choice: supposing columnChunk.isSetMeta_data() is the same across different row groups, how about using the first column's offset by default, and only using the file offset with "encrypted footer"?
(And only throwing an exception when a bad file offset is detected and we are called from filterFileMetaDataByStart, as you suggested.)
@ggershinsky
Thanks @loudongfeng, sure, please update the PR with these changes. Regarding columnChunk.isSetMeta_data(): the result will be the same across different row groups, since the whole file is written in one mode (either with the ColumnMetaData or without it). If they are unavailable, we can't use them for splits, so we will have to throw an exception in the case of filterFileMetaDataByStart (only for bad columns where startIndex < minStartIndex).
Update: there are situations where ColumnMetaData is available even in "encrypted footer" mode, namely if the column itself is unencrypted. So this is good news, the affected area becomes even smaller. But this means you need to check firstColumnChunk.isSetMeta_data() for the first block.
Commit updated following ggershinsky's suggestions. Thanks.
@shangxinli, @ggershinsky, please note that I'll be on vacation from today till the end of next week, so I won't have time for this PR. Since this seems to be quite urgent, do not hesitate to push it in and initiate a release for 1.12.1. I think this fix would be worth a separate release as quickly as possible.
…parquet version: only throw an exception when 1. the footer (first column of block meta) is encrypted and 2. file_offset is corrupted
Force-pushed from a85ba92 to 3a2757d.
…parquet version: only check firstColumnChunk.isSetMeta_data() for the first block
Force-pushed from 4cba7e2 to cd13468.
Thanks @loudongfeng, looks good. I'll run the last round of checks with a number of encryption modes early next week.
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Regarding "the last round of checks with a number of encryption modes early next week": everything was OK.
…parquet version: address review comments: empty lines
@loudongfeng @gszadovszky @ggershinsky it seems there is also a bug in parquet-cpp which causes an incorrect file offset to be written, see https://issues.apache.org/jira/browse/SPARK-36696, so we'll want to make sure the solution here works for that case as well.
  preBlockCompressedSize = 0;
}
if (preBlockStartPos != 0) {
  assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
I'm not sure we should use assert here since it's not always turned on in production. Perhaps use Preconditions.checkState?
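The difference matters because a Java `assert` only executes when the JVM runs with `-ea`, so in production a corrupt footer would slip through silently, while a Preconditions-style check always runs. A self-contained sketch (checkState here is a minimal stand-in for the Preconditions utility, not the project's actual class):

```java
public class BlockPositionCheck {

    // Minimal stand-in for Preconditions.checkState so the sketch is
    // self-contained; unlike `assert`, this always runs in production.
    static void checkState(boolean condition, String message) {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }

    static void verifyBlockStart(long blockStartPos, long preBlockStartPos,
                                 long preBlockCompressedSize) {
        checkState(blockStartPos >= preBlockStartPos + preBlockCompressedSize,
            "invalid block starting position: " + blockStartPos);
    }
}
```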
if (rowGroup.isSetFile_offset()) {
  //the file_offset of first block always holds the truth, while other blocks don't :
I think this is no longer true with the issue we found in parquet-cpp.
@sunchao Thanks for the information. Maybe the first row group check is enough for this situation (firstFileOffset == 4)? I will submit a new commit.
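The idea rests on the file layout: a Parquet file begins with the 4-byte "PAR1" magic, so in a normal single (non-merged) file the first row group must start at offset 4. A trivial sketch of that guard (the method name is hypothetical):

```java
public class FirstRowGroupCheck {

    // A Parquet file starts with the 4-byte "PAR1" magic, so the first row
    // group of a normal (non-merged) file begins at offset 4. A first
    // file_offset that differs (as with the parquet-cpp bug in SPARK-36696)
    // means the recorded offsets cannot be trusted.
    static boolean firstFileOffsetLooksValid(long firstFileOffset) {
        return firstFileOffset == 4;
    }
}
```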
…parquet version: check the first row group's file_offset too (SPARK-36696)
…parquet version: use Preconditions.checkState instead of assert in the write path; remove the summary file footers case check in the read path (which will never happen)
@ggershinsky Can you have another look at the new commit?
…parquet version: more special-case handling for the first row group
Sure, looks good. The first row group always starts at offset 4.
Yep, it does. I've taken the file that was posted in that Jira and read it with Spark using Parquet 1.12.0; this indeed fails. After adding this fix to Parquet, the reading worked OK. This happens because for regular files (and most encrypted files), this fix ignores the recorded file_offset of the later row groups and calculates the offsets instead.
Thanks, I will merge it soon.
PARQUET-2078: Failed to read parquet file after writing with the same parquet version (#925)
* Read path fix that makes use of this information: RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
* Addressing review comments: more checks on the writer side
* Taking alignment padding and summary file into account
* Only throw an exception when 1. the footer (first column of block meta) is encrypted and 2. file_offset is corrupted
* Only check firstColumnChunk.isSetMeta_data() for the first block
* Address review comments: empty lines
* Check the first row group's file_offset too (SPARK-36696)
* Use Preconditions.checkState instead of assert in the write path; remove the summary file footers case check in the read path (which will never happen)
* More special-case handling for the first row group
Cherry-picked from commit 615d769 (apache#925).
Jira: PARQUET-2078
Tests: TestParquetFileWriter.testWriteReadWithRecordReader