
PARQUET-2078: Failed to read parquet file after writing with the same parquet version #925

Merged: 10 commits merged into apache:master on Sep 9, 2021

Conversation

@loudongfeng (Contributor)

Jira

Tests

  • My PR adds the following unit tests:
    TestParquetFileWriter.testWriteReadWithRecordReader

Commits

  • My commits all reference Jira issues in their subject lines.

@loudongfeng (Contributor, Author)

This patch fixes both the write path and the read path.
Write path: fix the currentDictionaryPageOffset reuse issue, so that RowGroup.file_offset in the parquet file is written correctly.
Read path: support reading parquet files with a wrong RowGroup.file_offset (by ignoring it).
I'm not sure how the read-path changes will affect encrypted files written by parquet 1.12.0, but encrypted files' RowGroup.file_offset is already set incorrectly.
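
A minimal sketch of the shape of the write-path fix; the field and method names below are illustrative stand-ins, not the actual parquet-mr internals:

  // Illustrative only: currentDictionaryPageOffset is per-chunk state that was
  // being reused across column chunks. Resetting it at the start of each chunk
  // keeps a dictionary-less column from inheriting the previous chunk's
  // dictionary offset, which is what produced the bad RowGroup.file_offset.
  private long currentDictionaryPageOffset = -1;

  void startColumnChunk() {
    currentDictionaryPageOffset = -1; // drop stale state from the previous chunk
  }

  long chunkStartOffset(long firstDataPageOffset) {
    // a chunk starts at its dictionary page if it has one, else at its first data page
    return currentDictionaryPageOffset >= 0 ? currentDictionaryPageOffset : firstDataPageOffset;
  }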

Another solution for the read path:
only ignore file_offset when

  1. the file version is parquet 1.12.0, and
  2. the file is not encrypted

@gszadovszky (Contributor) left a comment

Thanks a lot for working on this, @loudongfeng!

I think it is fine not to use the file offset at all but to calculate it from the offsets of the first column chunk. @ggershinsky, what do you think?

But you also need to handle the case of an invalid dictionary offset in getOffset(ColumnChunk). Please check my comment in the Jira.

Also, please check the whole code for potential usages of the dictionary offset and the file offset. It would also be great if you could validate the new code against the original invalid files.

@gszadovszky (Contributor) left a comment

Excellent work. Thanks a lot for your efforts!

@shangxinli (Contributor)

@ggershinsky Do you want to have a look?

@ggershinsky (Contributor)

Sure. This won't work if the first column is encrypted and the reader doesn't have its key. Can the "write" part be fixed instead, so the RowGroup offset is set correctly?

@gszadovszky (Contributor)

@ggershinsky, even though this PR fixes the write path as well, we have already released 1.12.0, so we have to prepare for the case where RowGroup.file_offset is incorrect.

@ggershinsky (Contributor)

Yep, but the current fix perpetuates the situation where some readers can't process encrypted files even if they have keys for all projected columns; that doesn't look like an optimal long-term solution. I'm just back online after a vacation; I will go over the details in this thread, maybe something else can be done here.

@gszadovszky (Contributor)

@ggershinsky, sorry, I completely missed the fact that RowGroup.file_offset was introduced for the encryption feature and is actually required by it. Somehow we should check whether the file_offset points into the previous row group. In the worst case we should at least check whether the parquet file was written by 1.12.0, but then we only know that the file_offset might be wrong.

@loudongfeng (Contributor, Author)

FYI, maybe we can make use of this invariant:
RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
total_compressed_size always holds the truth, while file_offset doesn't. total_compressed_size was also introduced for the encryption feature.
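
A minimal sketch (using the parquet-format Thrift getters; the surrounding loop is simplified) of how a reader can apply this invariant to detect and repair bad offsets:

  // Recompute the expected start of each row group from the previous group's
  // offset and compressed size, trusting file_offset only where it is sane.
  long expectedOffset = rowGroups.get(0).getFile_offset(); // first group is trusted
  for (RowGroup rg : rowGroups) {
    long declared = rg.getFile_offset();
    if (declared < expectedOffset) {
      // the declared offset points back into the previous row group: ignore it
      declared = expectedOffset; // imprecise if alignment padding was inserted
    }
    // padding may push the real start past expectedOffset, hence ">=" rather
    // than "==" in the write-side assertion discussed later in this thread
    expectedOffset = declared + rg.getTotal_compressed_size();
  }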

Commit pushed: read-path fix that makes use of this invariant:
RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
@ggershinsky (Contributor)

@gszadovszky No problem at all, thank you for helping with this!

@ggershinsky (Contributor)

> FYI, maybe we can make use of this invariant:
> RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size
> total_compressed_size always holds the truth, while file_offset doesn't. total_compressed_size was also introduced for the encryption feature.

Yep, exactly! If there are no hidden surprises and this works as expected, it would certainly be the optimal solution. While at it, maybe you can also add a check on the write side to verify the RG offset values in a similar manner (each must equal the sum of the previous RG sizes plus the first RG offset; this also runs in a loop). Thanks @loudongfeng!

  // the first block always holds the truth
  startIndex = rowGroup.getFile_offset();
} else {
  // calculate offset for other blocks
Contributor:

Could you expand the comment a bit to briefly explain the problem (maybe mentioning the Jira number)? That would help ensure future changes don't revert this to the more intuitive getFile_offset().
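
For instance, a comment along these lines (the wording is only a suggestion) would make the intent explicit:

  // PARQUET-2078: RowGroup.file_offset written by parquet-mr 1.12.0 can point
  // into the previous row group, so for blocks after the first we derive the
  // offset instead of trusting getFile_offset().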

Commit pushed: addressing review comments (more checks on the writer side).
Commit pushed: taking alignment padding and summary files into account.
for (BlockMetaData block : blocks) {
  numRows += block.getRowCount();
  long blockStartPos = block.getStartingPos();
  // first block
  if (blockStartPos == 4) {
Contributor:

Why is this check necessary? Doesn't the first block always start at 4? Or does this address a file-merging use case?

@loudongfeng (Author):

To address the _common_metadata file case, which merges multiple file footers into a single metadata file.

  preBlockCompressedSize = 0;
}
if (preBlockStartPos != 0) {
  assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
Contributor:

Why >= instead of ==?

if (startIndex < minStartIndex) {
  // a bad offset detected, try the first column's offset
  // cannot use minStartIndex in case of padding
  startIndex = getOffset(rowGroup.getColumns().get(0));
Contributor:

This will throw an exception for encrypted files.

@loudongfeng (Author):

In the case of encrypted files with a wrongly set file_offset, I have no idea how to fix it when alignment padding takes place.
If there is no padding, then we can just use the calculated index.
I didn't find any footer metadata about the padding position, or anything else indicating whether padding occurred.

Contributor:

Got it, will check a few things.

Contributor:

@loudongfeng @gszadovszky @shangxinli @sunchao: it looks like the problem exists only in one mode of encryption ("encrypted footer", where ColumnMetaData is not set). Let me propose the following:

  long minStartIndex = preStartIndex + preCompressedSize;
  if (startIndex < minStartIndex) {
    // a bad offset detected, try the first column's offset if available
    ColumnChunk columnChunk = rowGroup.getColumns().get(0);
    if (columnChunk.isSetMeta_data()) {
      startIndex = getOffset(columnChunk);
    } else { // EncryptedFooter mode: plaintext ColumnMetaData is not available.
      // use minStartIndex (imprecise in case of padding, but good enough for filtering)
      startIndex = minStartIndex;
    }
  }

What do you think?

Contributor:

That might work for range filtering. As for explicit offset filtering, precision might be important: in systems that use padding (HDFS, WEBHDFS, VIEWFS), instead of startIndex = minStartIndex we might have to throw an exception when this is called from filterFileMetaDataByStart, saying that this file can't be split by offsets and should be processed without splitting. Again, given the number of conditions required for this situation to occur, and the fact that 1.12.0 has not yet been released in any framework, such exceptions might never be thrown in practice.

@loudongfeng (Author):

@ggershinsky, your proposal sounds perfect to me. Looking forward to your patch, or shall I update the commit following your proposal?

@loudongfeng (Author):

Another choice: supposing columnChunk.isSetMeta_data() is the same across different row groups, how about using the first column's offset by default, and only using the file offset in "encrypted footer" mode?
(And only throwing an exception when a bad file offset is detected and we are called from filterFileMetaDataByStart, as you suggested.)
@ggershinsky

Contributor:

Thanks @loudongfeng, sure, please update the PR with these changes. Regarding columnChunk.isSetMeta_data(): the result will be the same across different row groups, since the whole file is written in one mode (either with ColumnMetaData or without it). If it is unavailable, we can't use it for splits, so we will have to throw an exception in the filterFileMetaDataByStart case (only for bad columns where startIndex < minStartIndex).
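
A sketch of this agreed behavior; calledFromFilterByStart and the exception wording are illustrative, not the actual implementation:

  // Prefer the first column's plaintext offset; fall back to file_offset only
  // in encrypted-footer mode, and refuse to split by offsets when it is corrupt.
  ColumnChunk firstColumnChunk = rowGroup.getColumns().get(0);
  if (firstColumnChunk.isSetMeta_data()) {
    startIndex = getOffset(firstColumnChunk); // plaintext metadata available
  } else {
    startIndex = rowGroup.getFile_offset(); // encrypted-footer mode
    if (startIndex < minStartIndex && calledFromFilterByStart) {
      // corrupt offset and no plaintext metadata: we cannot split this file
      // safely by offsets, so fail instead of guessing (PARQUET-2078)
      throw new RuntimeException("Corrupt RowGroup.file_offset in encrypted footer; file cannot be filtered by offsets");
    }
  }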

Contributor:

Update: there are situations where ColumnMetaData is available even in "encrypted footer" mode, namely when the column itself is unencrypted. So this is good news: the affected area becomes even smaller. But it means you need to check firstColumnChunk.isSetMeta_data() for the first block.

@loudongfeng (Author):

Commit updated following @ggershinsky's suggestions. Thanks.

@gszadovszky (Contributor)

@shangxinli, @ggershinsky, please note that I'll be on vacation from today until the end of next week, so I won't have time for this PR. It seems quite urgent, so do not hesitate to push it in and initiate a release for 1.12.1; I think this fix would be worth a separate release as quickly as possible.
Also, do not forget to approve the unit-test execution for every new commit, since @loudongfeng is not a member (or whatever GitHub Actions requires to run automatically).
-- Thanks

Commit pushed: only throw an exception when (1) the footer (first column of the block metadata) is encrypted and (2) file_offset is corrupted.
Commit pushed: only check firstColumnChunk.isSetMeta_data() for the first block.
@ggershinsky (Contributor)

Thanks @loudongfeng , looks good. I'll run the last round of checks with a number of encryption modes early next week.

@ggershinsky (Contributor) left a comment

> the last round of checks with a number of encryption modes early next week.

Everything was OK.

Commit pushed: address review comments (empty lines).
@sunchao (Member) left a comment

@loudongfeng @gszadovszky @ggershinsky it seems there is also a bug in parquet-cpp that causes an incorrect file offset to be written, see https://issues.apache.org/jira/browse/SPARK-36696, so we'll want to make sure the solution here works for that case as well.

  preBlockCompressedSize = 0;
}
if (preBlockStartPos != 0) {
  assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
Member:

I'm not sure we should use assert here, since assertions are not always enabled in production. Perhaps use Preconditions.checkState?
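
For example, with a Guava-style Preconditions helper (assuming a checkState(boolean, String, Object...) overload; whichever helper the codebase uses), the check always runs regardless of whether -ea is set:

  import com.google.common.base.Preconditions;

  // replaces: assert blockStartPos >= preBlockStartPos + preBlockCompressedSize;
  static void validateBlockStart(long blockStartPos, long preBlockStartPos, long preBlockCompressedSize) {
    Preconditions.checkState(
        blockStartPos >= preBlockStartPos + preBlockCompressedSize,
        "Invalid block starting position: %s", blockStartPos);
  }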


if (rowGroup.isSetFile_offset()) {
  // the file_offset of the first block always holds the truth, while other blocks' don't:
Member:

I think this is no longer true with the issue we found in parquet-cpp.

@loudongfeng (Author):

@sunchao Thanks for the information. Maybe checking the first row group is enough for this situation (firstFileOffset == 4)? I will submit a new commit.
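
A sketch of that check, with illustrative names; 4 is the length of the "PAR1" magic at the start of every parquet file:

  // The first row group must start right after the 4-byte "PAR1" magic, so a
  // first file_offset other than 4 is itself evidence of a corrupt offset.
  long firstFileOffset = rowGroups.get(0).getFile_offset();
  boolean firstOffsetTrusted = (firstFileOffset == 4);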

Commit pushed: check the first row group's file_offset too (SPARK-36696).
Commit pushed: use Preconditions.checkState instead of assert in the write path; remove the summary-file footers case check in the read path (which will never happen).
@shangxinli (Contributor)

@ggershinsky Can you have another look at the new commit?

Commit pushed: more special-casing for the first row group.
@ggershinsky (Contributor) commented Sep 9, 2021

Sure, looks good. The first row group always starts at offset 4.
@loudongfeng Maybe the hardcoded 4 should be replaced with, e.g., ParquetFileWriter.MAGIC.length? Or, even better, a new constant added to ParquetFileWriter, something like FIRST_ROW_OFFSET = MAGIC.length.
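
A sketch of that suggestion; the FIRST_ROW_OFFSET constant does not exist in ParquetFileWriter, this is just the proposed shape:

  public class ParquetFileWriter {
    // the "PAR1" magic bytes that open (and close) every parquet file
    public static final byte[] MAGIC = "PAR1".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    // proposed constant: the first row group starts right after the magic
    public static final int FIRST_ROW_OFFSET = MAGIC.length;
  }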

@ggershinsky (Contributor)

> it seems there is also a bug in parquet-cpp which causes an incorrect file offset to be written, see https://issues.apache.org/jira/browse/SPARK-36696, so we'll want to make sure the solution here works for that case as well.

Yep, it does. I took the file that was posted on that Jira and read it with Spark using parquet 1.12.0; this indeed fails. After adding this fix to parquet, reading worked fine. This happens because, for regular files (and most encrypted files), this fix ignores the RowGroup.file_offset field and reverts the offset computation to the pre-1.12 behavior.

@shangxinli (Contributor)

Thanks, I will merge it soon.

@shangxinli shangxinli merged commit 5f40350 into apache:master Sep 9, 2021
shangxinli pushed a commit that referenced this pull request Sep 9, 2021
PARQUET-2078: Failed to read parquet file after writing with the same parquet version (#925)

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

Read-path fix that makes use of this invariant:
RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

addressing review comments: more checks on the writer side.

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

taking alignment padding and summary files into account

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

only throw an exception when: 1. the footer (first column of block meta) is encrypted and 2. file_offset is corrupted

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

only check firstColumnChunk.isSetMeta_data() for the first block

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

address review comments: empty lines

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

check first row group's file_offset too (SPARK-36696)

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

Using Preconditions.checkState instead of assert in the write path
remove summary-file footers case check in the read path (which will never happen)

* PARQUET-2078 Failed to read parquet file after writing with the same parquet version

more special case for first row group
sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Mar 3, 2022
PARQUET-2078: Failed to read parquet file after writing with the same parquet version (apache#925)

(cherry picked from commit 615d769)