
PARQUET-2078: Failed to read parquet file after writing with the same parquet version #925

Merged: 10 commits merged into apache:master on Sep 9, 2021

Conversation

@loudongfeng (Contributor)

Jira

Tests

  • My PR adds the following unit tests:
    TestParquetFileWriter.testWriteReadWithRecordReader

Commits

  • My commits all reference Jira issues in their subject lines.

@loudongfeng (Contributor, Author)

This patch fixes both the write path and the read path.
Write path: fix the currentDictionaryPageOffset reuse issue, so that RowGroup.file_offset in the parquet file is written correctly.
Read path: support reading parquet files with a wrong RowGroup.file_offset (by ignoring it).
I'm not sure how the read-path changes will affect encrypted files written by parquet 1.12.0, but encrypted files' RowGroup.file_offset is already set incorrectly.
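
A minimal sketch of the shape of the write-path fix; the field and method names below are illustrative stand-ins, not the actual parquet-mr internals:

  // Illustrative only: currentDictionaryPageOffset is per-chunk state that was
  // being reused across column chunks. Resetting it at the start of each chunk
  // keeps a dictionary-less column from inheriting the previous chunk's
  // dictionary offset, which is what produced the bad RowGroup.file_offset.
  private long currentDictionaryPageOffset = -1;

  void startColumnChunk() {
    currentDictionaryPageOffset = -1; // drop stale state from the previous chunk
  }

  long chunkStartOffset(long firstDataPageOffset) {
    // a chunk starts at its dictionary page if it has one, else at its first data page
    return currentDictionaryPageOffset >= 0 ? currentDictionaryPageOffset : firstDataPageOffset;
  }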

Another solution for the read path:
only ignore file_offset when

  1. the file version is parquet 1.12.0, and
  2. the file is not encrypted

@gszadovszky (Contributor) left a comment

Thanks a lot for working on this, @loudongfeng!

I think it is fine not to use the file offset at all but to calculate it from the offsets of the first column chunk. @ggershinsky, what do you think?

But you also need to handle the case of an invalid dictionary offset in getOffset(ColumnChunk). Please check my comment in the Jira.

Also, please check the whole code for potential usages of the dictionary offset and the file offset. It would also be great if you could validate the new code against the original invalid files.

@gszadovszky (Contributor) left a comment

Excellent work. Thanks a lot for your efforts!

@shangxinli (Contributor)

@ggershinsky Do you want to have a look?

@ggershinsky (Contributor)

Sure. This won't work if the first column is encrypted and the reader doesn't have its key. Can the "write" part be fixed instead, so the RowGroup offset is set correctly?

@gszadovszky (Contributor)

@ggershinsky, even though this PR fixes the write path as well, we have already released 1.12.0, so we have to prepare for the case where RowGroup.file_offset is incorrect.

@ggershinsky (Contributor)

Yep, but the current fix perpetuates the situation where some readers can't process encrypted files even if they have keys for all projected columns; that doesn't look like an optimal long-term solution. I'm just back online after a vacation; I will go over the details in this thread, maybe something else can be done here.

@gszadovszky (Contributor)

@ggershinsky, sorry, I completely missed the fact that RowGroup.file_offset was introduced for the encryption feature and is actually required by it. Somehow we should check whether the file_offset points into the previous row group. In the worst case we should at least check whether the parquet file was written by 1.12.0, but then we only know that the file_offset might be wrong.

@loudongfeng (Contributor, Author)

FYI, maybe we can make use of this invariant:
RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
total_compressed_size always holds the truth, while file_offset doesn't. total_compressed_size was also introduced for the encryption feature.
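
A minimal sketch (using the parquet-format Thrift getters; the surrounding loop is simplified) of how a reader can apply this invariant to detect and repair bad offsets:

  // Recompute the expected start of each row group from the previous group's
  // offset and compressed size, trusting file_offset only where it is sane.
  long expectedOffset = rowGroups.get(0).getFile_offset(); // first group is trusted
  for (RowGroup rg : rowGroups) {
    long declared = rg.getFile_offset();
    if (declared < expectedOffset) {
      // the declared offset points back into the previous row group: ignore it
      declared = expectedOffset; // imprecise if alignment padding was inserted
    }
    // padding may push the real start past expectedOffset, hence ">=" rather
    // than "==" in the write-side assertion discussed later in this thread
    expectedOffset = declared + rg.getTotal_compressed_size();
  }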

Commit pushed: read-path fix that makes use of this invariant:
RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
@ggershinsky (Contributor)

@gszadovszky No problem at all, thank you for helping with this!

@ggershinsky (Contributor)

> FYI, maybe we can make use of this invariant:
> RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
> total_compressed_size always holds the truth, while file_offset doesn't. total_compressed_size was also introduced for the encryption feature.

Yep, exactly! If there are no hidden surprises and this works as expected, it would certainly be the optimal solution. While at it, maybe you can also add a check on the write side to verify the RG offset values in a similar manner (each must equal the sum of the previous RG sizes plus the first RG offset; this also runs in a loop). Thanks @loudongfeng!

  // the first block always holds the truth
  startIndex = rowGroup.getFile_offset();
} else {
  // calculate offset for other blocks
Contributor:

Could you expand the comment a bit to briefly explain the problem (maybe mentioning the Jira number)? That would help ensure future changes don't revert this to the more intuitive getFile_offset().
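
For instance, a comment along these lines (the wording is only a suggestion) would make the intent explicit:

  // PARQUET-2078: RowGroup.file_offset written by parquet-mr 1.12.0 can point
  // into the previous row group, so for blocks after the first we derive the
  // offset instead of trusting getFile_offset().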

Commit pushed: addressing review comments (more checks on the writer side).
Commit pushed: taking alignment padding and summary files into account.
for (BlockMetaData block : blocks) {
  numRows += block.getRowCount();
  long blockStartPos = block.getStartingPos();
  // first block
  if (blockStartPos == 4) {
Contributor:

Why is this check necessary? Doesn't the first block always start at 4? Or does this address a file-merging use case?

@loudongfeng (Author):

To address the _common_metadata file case, which merges multiple file footers into a single metadata file.

  preBlockCompressedSize = 0;
}
if (preBlockStartPos != 0) {
  assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
Contributor:

Why >= instead of ==?

if (startIndex < minStartIndex) {
  // a bad offset detected, try the first column's offset
  // cannot use minStartIndex in case of padding
  startIndex = getOffset(rowGroup.getColumns().get(0));
Contributor:

This will throw an exception for encrypted files.

@loudongfeng (Author):

In the case of encrypted files with a wrongly set file_offset, I have no idea how to fix it when alignment padding takes place.
If there is no padding, then we can just use the calculated index.
I didn't find any footer metadata about the padding position, or anything else indicating whether padding occurred.

Contributor:

Got it, will check a few things.

Contributor:

@loudongfeng @gszadovszky @shangxinli @sunchao: it looks like the problem exists only in one mode of encryption ("encrypted footer", where ColumnMetaData is not set). Let me propose the following:

  long minStartIndex = preStartIndex + preCompressedSize;
  if (startIndex < minStartIndex) {
    // a bad offset detected, try the first column's offset if available
    ColumnChunk columnChunk = rowGroup.getColumns().get(0);
    if (columnChunk.isSetMeta_data()) {
      startIndex = getOffset(columnChunk);
    } else { // EncryptedFooter mode: plaintext ColumnMetaData is not available.
      // use minStartIndex (imprecise in case of padding, but good enough for filtering)
      startIndex = minStartIndex;
    }
  }

What do you think?

Contributor:

That might work for range filtering. As for explicit offset filtering, precision might be important: in systems that use padding (HDFS, WEBHDFS, VIEWFS), instead of startIndex = minStartIndex we might have to throw an exception when this is called from filterFileMetaDataByStart, saying that this file can't be split by offsets and should be processed without splitting. Again, given the number of conditions required for this situation to occur, and the fact that 1.12.0 has not yet been released in any framework, such exceptions might never be thrown in practice.

@loudongfeng (Author):

@ggershinsky, your proposal sounds perfect to me. Looking forward to your patch, or shall I update the commit following your proposal?

@loudongfeng (Author):

Another choice: supposing columnChunk.isSetMeta_data() is the same across different row groups, how about using the first column's offset by default, and only using the file offset in "encrypted footer" mode?
(And only throwing an exception when a bad file offset is detected and we are called from filterFileMetaDataByStart, as you suggested.)
@ggershinsky

Contributor:

Thanks @loudongfeng, sure, please update the PR with these changes. Regarding columnChunk.isSetMeta_data(): the result will be the same across different row groups, since the whole file is written in one mode (either with ColumnMetaData or without it). If it is unavailable, we can't use it for splits, so we will have to throw an exception in the filterFileMetaDataByStart case (only for bad columns where startIndex < minStartIndex).
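
A sketch of this agreed behavior; calledFromFilterByStart and the exception wording are illustrative, not the actual implementation:

  // Prefer the first column's plaintext offset; fall back to file_offset only
  // in encrypted-footer mode, and refuse to split by offsets when it is corrupt.
  ColumnChunk firstColumnChunk = rowGroup.getColumns().get(0);
  if (firstColumnChunk.isSetMeta_data()) {
    startIndex = getOffset(firstColumnChunk); // plaintext metadata available
  } else {
    startIndex = rowGroup.getFile_offset(); // encrypted-footer mode
    if (startIndex < minStartIndex && calledFromFilterByStart) {
      // corrupt offset and no plaintext metadata: we cannot split this file
      // safely by offsets, so fail instead of guessing (PARQUET-2078)
      throw new RuntimeException("Corrupt RowGroup.file_offset in encrypted footer; file cannot be filtered by offsets");
    }
  }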

Contributor:

Update: there are situations where ColumnMetaData is available even in "encrypted footer" mode, namely when the column itself is unencrypted. So this is good news: the affected area becomes even smaller. But it means you need to check firstColumnChunk.isSetMeta_data() for the first block.

@loudongfeng (Author):

Commit updated following @ggershinsky's suggestions. Thanks.

@gszadovszky (Contributor)

@shangxinli, @ggershinsky, please note that I'll be on vacation from today until the end of next week, so I won't have time for this PR. It seems quite urgent, so do not hesitate to push it in and initiate a release for 1.12.1; I think this fix would be worth a separate release as quickly as possible.
Also, do not forget to approve the unit-test execution for every new commit, since @loudongfeng is not a member (or whatever GitHub Actions requires to run automatically).
-- Thanks

Commit pushed: only throw an exception when (1) the footer (first column of the block metadata) is encrypted and (2) file_offset is corrupted.
Commit pushed: only check firstColumnChunk.isSetMeta_data() for the first block.
@ggershinsky (Contributor)

Thanks @loudongfeng , looks good. I'll run the last round of checks with a number of encryption modes early next week.

@ggershinsky (Contributor) left a comment

> the last round of checks with a number of encryption modes early next week.

Everything was OK.

Commit pushed: address review comments (empty lines).
@sunchao (Member) left a comment

@loudongfeng @gszadovszky @ggershinsky it seems there is also a bug in parquet-cpp that causes an incorrect file offset to be written, see https://issues.apache.org/jira/browse/SPARK-36696, so we'll want to make sure the solution here works for that case as well.

  preBlockCompressedSize = 0;
}
if (preBlockStartPos != 0) {
  assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
Member:

I'm not sure we should use assert here, since assertions are not always enabled in production. Perhaps use Preconditions.checkState?
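
For example, with a Guava-style Preconditions helper (assuming a checkState(boolean, String, Object...) overload; whichever helper the codebase uses), the check always runs regardless of whether -ea is set:

  import com.google.common.base.Preconditions;

  // replaces: assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
  static void validateBlockStart(long blockStartPos, long preBlockStartPos, long preBlockCompressedSize) {
    Preconditions.checkState(
        blockStartPos >= preBlockStartPos + preBlockCompressedSize,
        "Invalid block starting position: %s", blockStartPos);
  }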


if (rowGroup.isSetFile_offset()) {
  // the file_offset of the first block always holds the truth, while other blocks' don't:
Member:

I think this is no longer true with the issue we found in parquet-cpp.

@loudongfeng (Author):

@sunchao Thanks for the information. Maybe checking the first row group is enough for this situation (firstFileOffset == 4)? I will submit a new commit.
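
A sketch of that check, with illustrative names; 4 is the length of the "PAR1" magic at the start of every parquet file:

  // The first row group must start right after the 4-byte "PAR1" magic, so a
  // first file_offset other than 4 is itself evidence of a corrupt offset.
  long firstFileOffset = rowGroups.get(0).getFile_offset();
  boolean firstOffsetTrusted = (firstFileOffset == 4);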

Commit pushed: check the first row group's file_offset too (SPARK-36696).
Commit pushed: use Preconditions.checkState instead of assert in the write path; remove the summary-file footers case check in the read path (which will never happen).
@shangxinli (Contributor)

@ggershinsky Can you have another look at the new commit?

Commit pushed: more special-casing for the first row group.
@ggershinsky (Contributor) commented Sep 9, 2021

Sure, looks good. The first row group always starts at offset 4.
@loudongfeng Maybe the hardcoded 4 should be replaced with, e.g., ParquetFileWriter.MAGIC.length? Or, even better, a new constant added to ParquetFileWriter, something like FIRST_ROW_OFFSET = MAGIC.length.
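
A sketch of that suggestion; the FIRST_ROW_OFFSET constant does not exist in ParquetFileWriter, this is just the proposed shape:

  public class ParquetFileWriter {
    // the "PAR1" magic bytes that open (and close) every parquet file
    public static final byte[] MAGIC = "PAR1".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    // proposed constant: the first row group starts right after the magic
    public static final int FIRST_ROW_OFFSET = MAGIC.length;
  }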

@ggershinsky (Contributor)

> it seems there is also a bug in parquet-cpp which causes an incorrect file offset to be written, see https://issues.apache.org/jira/browse/SPARK-36696, so we'll want to make sure the solution here works for that case as well.

Yep, it does. I took the file that was posted on that Jira and read it with Spark using parquet 1.12.0; this indeed fails. After adding this fix to parquet, reading worked fine. This happens because, for regular files (and most encrypted files), this fix ignores the RowGroup.file_offset field and reverts the offset computation to the pre-1.12 behavior.

@shangxinli (Contributor)

Thanks, I will merge it soon.

@shangxinli shangxinli merged commit 5f40350 into apache:master Sep 9, 2021
shangxinli pushed a commit that referenced this pull request Sep 9, 2021
PARQUET-2078: Failed to read parquet file after writing with the same parquet version (#925)

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

Read-path fix that makes use of this invariant:
RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

addressing review comments: more checks on the writer side.

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

taking alignment padding and summary files into account

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

only throw an exception when: 1. the footer (first column of block meta) is encrypted and 2. file_offset is corrupted

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

only check firstColumnChunk.isSetMeta_data() for the first block

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

address review comments: empty lines

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

check first row group's file_offset too (SPARK-36696)

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

Using Preconditions.checkState instead of assert in the write path
remove summary-file footers case check in the read path (which will never happen)

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

more special case for first row group
sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Mar 3, 2022
PARQUET-2078: Failed to read parquet file after writing with the same parquet version (apache#925)

(cherry picked from commit 615d769)