
PARQUET-1580: Page-level CRC checksum verification for DataPageV1 #647

Merged · 16 commits into apache:master · Jul 24, 2019

Conversation

@bbraams (Contributor) commented Jun 11, 2019

This PR implements page-level CRC checksums for DataPageV1. It is the implementation follow-up to the clarification of checksums in parquet-format (Jira, PR). Key points:

  • Checksums are calculated using the CRC32 implementation in java.util.zip (see the sketch after this list)
  • Checksums are implemented for DataPageV1 and DictionaryPage
  • Writing out checksums is enabled by default (impact is minimal, see Benchmark results below)
  • Checksum verification on the read path is disabled by default (benchmarks indicate minimal impact, but requires more thorough benchmarking)
  • Backwards compatibility with Parquet files that do not have the CRC field set, even if checksum verification is turned on
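
As context for the first point, here is a minimal sketch of how a page checksum can be computed with java.util.zip.CRC32. The compressedPageBytes placeholder stands in for the page payload as laid out in the file; this is an illustration, not the PR's actual writer code:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PageCrcSketch {
  public static void main(String[] args) {
    // Placeholder payload; per the spec, the CRC covers the page bytes
    // exactly as they are written to the file (i.e. after compression).
    byte[] compressedPageBytes = "example page payload".getBytes(StandardCharsets.UTF_8);

    CRC32 crc = new CRC32();
    crc.update(compressedPageBytes);

    // CRC32.getValue() returns the unsigned 32-bit checksum in a long;
    // the page header stores it as a signed 32-bit int.
    int headerCrc = (int) crc.getValue();
    System.out.println("crc field for the page header: " + headerCrc);
  }
}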

Jira

Tests

This PR adds the following tests in a new test suite TestDataPageV1Checksums:

  • Test writing out checksums without verification on read, verifies that the calculated checksums are correct
  • Test not writing out checksums without verification on read, tests that the feature can be turned off entirely
  • Test not writing out checksums with verification on read, tests backward compatibility of the read path with files without checksums
  • Test writing out checksums with verification on read, tests that the checksum is correctly set and that we can read what we wrote
  • Test corruption detection by corrupting bytes in the data section of pages, tests that these corruptions go undetected with the feature disabled and are detected with the feature enabled (illustrated in the sketch after this list)
  • Test checksums with compression enabled, tests that the checksum is calculated on the compressed version of the data (as per the specification)
  • Test checksums with compression and a nested schema, tests that the checksum is calculated on the compressed concatenation of the repetition levels, definition levels and the actual data (as per the specification)
  • Test checksums with dictionary encoded data, tests that the dictionary page is also checksummed
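
To illustrate the corruption-detection scenario, here is a standalone demonstration of the idea (not the actual test code from TestDataPageV1Checksums): flipping a single bit in the checksummed bytes changes the CRC, so verification catches it.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CorruptionDetectionSketch {
  private static long crcOf(byte[] bytes) {
    CRC32 crc = new CRC32();
    crc.update(bytes);
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] page = "some page bytes".getBytes(StandardCharsets.UTF_8);
    long written = crcOf(page);   // checksum stored in the page header at write time

    page[3] ^= 0x01;              // simulate a single-bit corruption in the data section
    long recomputed = crcOf(page);

    // With verification enabled a mismatch like this fails the read;
    // with it disabled the corruption goes unnoticed.
    System.out.println("corruption detected: " + (written != recomputed));
  }
}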

Documentation

The feature is flag-guarded: writing checksums is enabled by default, while verifying them on the read path is disabled by default. Each side can be toggled individually via the following two new config flags (see the example after the list):

  • parquet.page.write-checksum.enabled (default: true)
  • parquet.page.verify-checksum.enabled (default: false)
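
For example, with a Hadoop Configuration these are plain boolean properties (the flag names are the ones introduced by this PR; the surrounding setup is illustrative):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Writing checksums is already on by default:
conf.setBoolean("parquet.page.write-checksum.enabled", true);
// Verification on the read path must be opted into:
conf.setBoolean("parquet.page.verify-checksum.enabled", true);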

Benchmark results

This PR adds 2 new benchmark suites, benchmarking the penalty of both writing out checksums and verifying them. I've run the suites on the following setup:

  • AWS m4.4xlarge instance
  • Single shot mode
  • 3 warmup iterations
  • 5 measured iterations (3 forked JVMs)

PageChecksumWriteBenchmarks

Benchmark                                  Mode  Cnt   Score   Error  Units
write100KRowsGzipWithChecksums               ss   15   0.500 ± 0.012   s/op
write100KRowsGzipWithoutChecksums            ss   15   0.497 ± 0.013   s/op
write100KRowsSnappyWithChecksums             ss   15   0.235 ± 0.015   s/op
write100KRowsSnappyWithoutChecksums          ss   15   0.239 ± 0.018   s/op
write100KRowsUncompressedWithChecksums       ss   15   0.213 ± 0.012   s/op
write100KRowsUncompressedWithoutChecksums    ss   15   0.210 ± 0.013   s/op
write1MRowsGzipWithChecksums                 ss   15   4.720 ± 0.018   s/op
write1MRowsGzipWithoutChecksums              ss   15   4.718 ± 0.013   s/op
write1MRowsSnappyWithChecksums               ss   15   2.288 ± 0.182   s/op
write1MRowsSnappyWithoutChecksums            ss   15   2.122 ± 0.122   s/op
write1MRowsUncompressedWithChecksums         ss   15   1.945 ± 0.137   s/op
write1MRowsUncompressedWithoutChecksums      ss   15   1.913 ± 0.124   s/op
write10MRowsGzipWithChecksums                ss   15  47.226 ± 0.127   s/op
write10MRowsGzipWithoutChecksums             ss   15  47.430 ± 0.306   s/op
write10MRowsSnappyWithChecksums              ss   15  21.164 ± 0.628   s/op
write10MRowsSnappyWithoutChecksums           ss   15  20.905 ± 0.332   s/op
write10MRowsUncompressedWithChecksums        ss   15  18.806 ± 0.390   s/op
write10MRowsUncompressedWithoutChecksums     ss   15  18.637 ± 0.201   s/op

PageChecksumReadBenchmarks

Benchmark                                    Mode  Cnt  Score   Error  Units
read100KRowsGzipWithVerification               ss   15  0.105 ± 0.005   s/op
read100KRowsGzipWithoutVerification            ss   15  0.104 ± 0.004   s/op
read100KRowsSnappyWithVerification             ss   15  0.080 ± 0.003   s/op
read100KRowsSnappyWithoutVerification          ss   15  0.081 ± 0.006   s/op
read100KRowsUncompressedWithVerification       ss   15  0.074 ± 0.002   s/op
read100KRowsUncompressedWithoutVerification    ss   15  0.071 ± 0.005   s/op
read1MRowsGzipWithVerification                 ss   15  0.956 ± 0.012   s/op
read1MRowsGzipWithoutVerification              ss   15  0.952 ± 0.025   s/op
read1MRowsSnappyWithVerification               ss   15  0.707 ± 0.025   s/op
read1MRowsSnappyWithoutVerification            ss   15  0.699 ± 0.013   s/op
read1MRowsUncompressedWithVerification         ss   15  0.676 ± 0.013   s/op
read1MRowsUncompressedWithoutVerification      ss   15  0.651 ± 0.025   s/op
read10MRowsGzipWithVerification                ss   15  9.717 ± 0.588   s/op
read10MRowsGzipWithoutVerification             ss   15  9.593 ± 0.310   s/op
read10MRowsSnappyWithVerification              ss   15  7.038 ± 0.113   s/op
read10MRowsSnappyWithoutVerification           ss   15  6.854 ± 0.078   s/op
read10MRowsUncompressedWithVerification        ss   15  6.544 ± 0.071   s/op
read10MRowsUncompressedWithoutVerification     ss   15  6.463 ± 0.121   s/op

@gszadovszky (Contributor) left a comment

Thanks a lot for working on this feature.
I've written some comments in the code, then realized that I would implement this feature in a different way.
The API user does not need to see the CRC value, and exposing it is not useful either. I think it would be nicer (and maybe the implementation would also be easier) to implement the crc check and calculation "under the hood", at the points where we are reading/writing the pages.
The following code parts are used for reading/writing the V1 pages.

Please let me know what you think.

@bbraams (Contributor, Author) commented Jun 17, 2019

@gszadovszky Thanks for the quick review and comments! I agree that your proposed approach would simplify the implementation, and that there is little use in exposing access to the crc in the DataPage class. There are, however, a few drawbacks to doing it this way that I can think of:

  1. Less robust testing. We lose the reference to the CRC up the call stack if we do it under the hood, so for testing we will have to rely on the 'side effects' of the implementation to verify correct behavior (whether or not an exception is thrown on failed verification). Maybe we could implement some helper functions, just for use in testing, to expose the crc. But I suspect we would end up with an implementation very similar to the current one (with the crc in Page instead).
  2. No easy way for users to see if their files have the CRC set. In the current implementation, in parquet-tools, I just hook into the loop over all pages in the dump command. Again, I think we could solve this by implementing a boolean flag at the PageReadStore level (or similar) that is set to true whenever we encounter a checksum in readAllPages.

Regarding point 1, given that the checksum calculation and verification are really only a handful of lines of code, we might get away with less robust testing. Regarding point 2, I think this is acceptable.

Given the points above, I'm on board with going for the 'under the hood' implementation. Your thoughts?

@gszadovszky (Contributor) commented

As the crc is listed in the PageHeader structure, we should add it to Page instead of DataPageV1 or DataPage. What I don't really like here is that we already have a couple of constructors which we should not modify (for backward compatibility); we would have to add new ones and deprecate the old ones. However, we can solve this by implementing builders. If we are going in the direction of propagating the crc to the "user objects", I would vote for having the crc in Page and creating the proper builders for all the supported page types.

If we prefer the "under the hood" solution, then I agree that testing, and extending the tools to deal with the crc, would be much harder.

After thinking a bit more, I am not that confident that the "under the hood" solution is better than the other one, but I would still vote for it because the public API stays a bit cleaner and most users are not interested in the crc value. From a testing point of view, I think we can still validate all the possible cases (crc verification on/off + valid/invalid/missing crc values), only we would have to implement the tests at a higher level (writing actual parquet files with "hacked" crc values and reading them back with different config setups).

Whether we really want a tool that lists the crc values, or just the fact that there are valid crc values saved for the pages, is another issue. It is a separate topic, but I think a tool that could list the original values of the thrift objects, without translating them to "user objects", would help a lot in debugging low-level issues. Such a tool would easily list the crc values of the page headers even if the check is implemented "under the hood". For this current issue I can live without such a tool. :)

@bbraams (Contributor, Author) commented Jun 18, 2019

Alright, I agree that we should move forward with the 'under the hood' solution. Some points to note:

  • Even though we won't be adding the crc to the implementations of the Page class, we should still respect the specification and thus also set/verify the crc for DictionaryPages.
  • Previously we would perform the verification in ColumnChunkPageReader.readPage, which I believe is called on demand. We will now move it to readAllPages, which makes it less lazy in a sense, since the checksum verification will be performed before the pages are actually delivered. I don't believe this is a big problem though.

@gszadovszky (Contributor) commented

The specification (parquet.thrift) contains 5 different page types. One was never used in parquet-mr (IndexPageHeader), another is not implemented / not in master yet (BloomFilterPageHeader). The other 3 are used: page v1, page v2 and dictionary. Since this JIRA is about implementing the CRC for V1 pages only, I am OK with implementing it for those only. However, it does not seem to be a big deal to implement it for all 3.

The current idea is to throw an exception (and therefore fail the whole reading process) if the verify flag is on and a CRC check fails. I think it is better to throw that exception as early as possible. All the pages processed in readAllPages will be required later anyway.
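
A rough sketch of what this early verification in readAllPages could look like, using java.util.zip.CRC32; the names pageHeader, pageBytes and verifyChecksums are illustrative, and isSetCrc/getCrc are the accessors thrift typically generates for an optional field — not necessarily the exact code in this PR:

// Inside the loop over page headers in readAllPages (illustrative):
if (verifyChecksums && pageHeader.isSetCrc()) {
  CRC32 crc = new CRC32();
  crc.update(pageBytes);  // the compressed page payload as read from the file
  if ((int) crc.getValue() != pageHeader.getCrc()) {
    throw new IOException("CRC checksum verification failed for page");
  }
}
// Pages without a crc field are accepted as-is, preserving backward compatibility.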

@bbraams (Contributor, Author) commented Jun 19, 2019

I’ve addressed some of the code review comments and made some changes:

  • The writing and verification of the checksums are now done 'under the hood', removing the explicit crc from DataPageV1
  • Added crc support for dictionary pages, as per the specification
  • Added a getter/setter for a copy of the crc value to Page, solely for the purpose of testing

With respect to the last change, I believe this is a reasonable compromise between what we've been discussing (having the crc in the public API for robust testing vs. handling crcs only under the hood and having to rely on pre-generated files with baked-in crcs for testing). Note that the crc we store in the Page object is merely a copy of the value in the thrift object; it is not actually used to write out the crc or during validation (for those we use the low-level thrift values accessed in readAllPages). The presence of the crc in Page is not reflected in any (subclass) constructor, and is therefore relatively hidden. Ideally we'd make the setter/getter package private, but this is unfortunately not possible (parquet.column vs parquet.hadoop). What we do get is more robust testing (see the TestDataPageV1Checksums suite) and the ability to display the crc in parquet-tools (I've added it to the dump command).

Is this an agreeable solution for you?
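
For reference, one plausible shape for such a testing-only accessor on Page (a hypothetical sketch; the committed code may differ):

import java.util.OptionalLong;

public abstract class Page {
  // ... existing fields and constructors unchanged ...

  // Note: only used for testing purposes; not part of any constructor,
  // and not consulted when writing or verifying checksums.
  private OptionalLong crc = OptionalLong.empty();

  public void setCrc(long crc) {
    this.crc = OptionalLong.of(crc);
  }

  public OptionalLong getCrc() {
    return crc;
  }
}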

I’m currently writing benchmarks for parquet-benchmarks to see the performance penalty on both the write and read path. These numbers should give us confidence (or not) in enabling writing checksums by default.

@gszadovszky (Contributor) left a comment

Thanks a lot for your additional efforts. I agree this is a fine compromise between the two approaches.

I have some additional comments on the code, but in general I am OK with the solution.
One additional note: you've added checksum writing/verification for the dictionary page, so now only page v2 is missing. I am fine with postponing that implementation to a later jira; however, I don't think it would require much additional effort compared to what has already been added. :)

@bbraams (Contributor, Author) commented Jul 19, 2019

I've addressed your comments and included benchmarks. I've created a new Jira ticket for adding support for DataPageV2 (https://jira.apache.org/jira/browse/PARQUET-1629).

I'm running the benchmarks as we speak and will report the results once they're in :).

@Fokko (Contributor) commented Jul 20, 2019

There are still failing tests:

Failed tests:   testAlignmentWithNoPaddingNeeded(org.apache.parquet.hadoop.TestParquetFileWriter): Second row group should start after no padding expected:<109> but was:<139>
  testAlignmentWithPadding(org.apache.parquet.hadoop.TestParquetFileWriter): First row group should end before the block size (120)

@Fokko (Contributor) left a comment

Some small comments on the code

@@ -43,4 +45,18 @@ public int getUncompressedSize() {
return uncompressedSize;
}

// Note: the following fields are only used for testing purposes and are NOT used in checksum

nit — suggested change:
- // Note: the following fields are only used for testing purposes and are NOT used in checksum
+ // Note: the following field is only used for testing purposes and is NOT used in checksum

@gszadovszky self-requested a review — July 22, 2019 11:59
@gszadovszky (Contributor) left a comment

The benchmarks might be improved by adding with/without checksum as a parameter instead of encoding it in the method names. That way it is easier to visualize the results and differences (e.g. with https://jmh.morethan.io). But I'm also good with the current implementation. Thanks a lot again for working on this.
Let's wait for an approval from @Fokko.
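
For reference, the parameterization suggested above might look like this in JMH (a hypothetical benchmark shape, not the suite as committed):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class PageChecksumWriteBenchmarkSketch {

  // One benchmark method, two runs: the with/without axis moves out of
  // the method name and into a parameter, which visualizers such as
  // jmh.morethan.io can plot side by side.
  @Param({"true", "false"})
  public boolean writeChecksums;

  @Benchmark
  public void write1MRows() {
    // Placeholder for the existing benchmark body, configured with
    // parquet.page.write-checksum.enabled = writeChecksums.
  }
}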

@bbraams (Contributor, Author) commented Jul 22, 2019

Thanks for the feedback. I've added the benchmark results to the PR description. As expected, the impact on the write path is minimal, given the default page size and the various compression schemes. I also ran the benchmarks on the read path, which likewise show minimal impact; however, I'm not 100% sure the code path taken is representative of normal use. If we want to enable verification by default in the future, we will need some more thorough testing.

@Fokko (Contributor) left a comment

Small nits, apart from that LGTM

@Fokko (Contributor) left a comment
Great work, @bbraams. Thanks!

@Fokko merged commit fcc5d1a into apache:master on Jul 24, 2019
(int)uncompressedSize,
(int)compressedSize,
valueCount,
rlEncoding,
dlEncoding,
valuesEncoding,
(int) crc.getValue(),
A Member commented on this code after the merge:

@gszadovszky Do we need to consider int overflow? For example:

crc.getValue() = 4169965210
(int) crc.getValue() = -125002086
(int) (crc.getValue() & 0x7FFFFFFF) = 2022481562
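
For reference, the narrowing cast keeps all 32 bits of the checksum: the stored int may print as negative, but the original unsigned value is recoverable, so comparisons work as long as both sides go through the same cast. Masking with 0x7FFFFFFF, by contrast, discards a bit. A small demonstration:

long value = 4169965210L;               // e.g. crc.getValue()
int stored = (int) value;               // -125002086: negative, but bit pattern intact
long restored = stored & 0xFFFFFFFFL;   // back to 4169965210
System.out.println(restored == value);  // true: the cast round-trips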
