
PARQUET-1580: Page-level CRC checksum verification for DataPageV1 #647

Merged · 16 commits into apache:master · Jul 24, 2019

Conversation

@bbraams (Contributor) commented Jun 11, 2019

This PR implements page-level CRC checksums for DataPageV1. It is the implementation follow-up to the clarification of checksums in parquet-format (Jira, PR). Key points:

  • Checksums are calculated using the CRC32 implementation in java.util.zip (see the sketch after this list)
  • Checksums are implemented for DataPageV1 and DictionaryPage
  • Writing out checksums is enabled by default (impact is minimal, see Benchmark results below)
  • Checksum verification on the read path is disabled by default (benchmarks indicate minimal impact, but requires more thorough benchmarking)
  • Backwards compatibility with Parquet files that do not have the CRC field set, even if checksum verification is turned on
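
As context for the first point, here is a minimal sketch of how a page checksum can be computed with java.util.zip.CRC32. The compressedPageBytes placeholder stands in for the page payload as laid out in the file; this is an illustration, not the PR's actual writer code:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PageCrcSketch {
  public static void main(String[] args) {
    // Placeholder payload; per the spec, the CRC covers the page bytes
    // exactly as they are written to the file (i.e. after compression).
    byte[] compressedPageBytes = "example page payload".getBytes(StandardCharsets.UTF_8);

    CRC32 crc = new CRC32();
    crc.update(compressedPageBytes);

    // CRC32.getValue() returns the unsigned 32-bit checksum in a long;
    // the page header stores it as a signed 32-bit int.
    int headerCrc = (int) crc.getValue();
    System.out.println("crc field for the page header: " + headerCrc);
  }
}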

Jira

Tests

This PR adds the following tests in a new test suite TestDataPageV1Checksums:

  • Test writing out checksums without verification on read, verifies that the calculated checksums are correct
  • Test not writing out checksums without verification on read, tests that the feature can be turned off entirely
  • Test not writing out checksums with verification on read, tests backward compatibility of the read path with files without checksums
  • Test writing out checksums with verification on read, tests that the checksum is correctly set and that we can read what we wrote
  • Test corruption detection by corrupting bytes in the data section of pages, tests that these corruptions go undetected with the feature disabled and are detected with the feature enabled (illustrated in the sketch after this list)
  • Test checksums with compression enabled, tests that the checksum is calculated on the compressed version of the data (as per the specification)
  • Test checksums with compression and a nested schema, tests that the checksum is calculated on the compressed concatenation of the repetition levels, definition levels and the actual data (as per the specification)
  • Test checksums with dictionary encoded data, tests that the dictionary page is also checksummed
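
To illustrate the corruption-detection scenario, here is a standalone demonstration of the idea (not the actual test code from TestDataPageV1Checksums): flipping a single bit in the checksummed bytes changes the CRC, so verification catches it.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CorruptionDetectionSketch {
  private static long crcOf(byte[] bytes) {
    CRC32 crc = new CRC32();
    crc.update(bytes);
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] page = "some page bytes".getBytes(StandardCharsets.UTF_8);
    long written = crcOf(page);   // checksum stored in the page header at write time

    page[3] ^= 0x01;              // simulate a single-bit corruption in the data section
    long recomputed = crcOf(page);

    // With verification enabled a mismatch like this fails the read;
    // with it disabled the corruption goes unnoticed.
    System.out.println("corruption detected: " + (written != recomputed));
  }
}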

Documentation

The feature is flag-guarded: writing checksums is enabled by default, while verifying them on the read path is disabled by default. Each side can be toggled individually via the following two new config flags (see the example after the list):

  • parquet.page.write-checksum.enabled (default: true)
  • parquet.page.verify-checksum.enabled (default: false)
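
For example, with a Hadoop Configuration these are plain boolean properties (the flag names are the ones introduced by this PR; the surrounding setup is illustrative):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Writing checksums is already on by default:
conf.setBoolean("parquet.page.write-checksum.enabled", true);
// Verification on the read path must be opted into:
conf.setBoolean("parquet.page.verify-checksum.enabled", true);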

Benchmark results

This PR adds 2 new benchmark suites, benchmarking the penalty of both writing out checksums and verifying them. I've run the suites on the following setup:

  • AWS m4.4xlarge instance
  • Single shot mode
  • 3 warmup iterations
  • 5 measured iterations (3 forked JVMs)

PageChecksumWriteBenchmarks

Benchmark                                  Mode  Cnt   Score   Error  Units
write100KRowsGzipWithChecksums               ss   15   0.500 ± 0.012   s/op
write100KRowsGzipWithoutChecksums            ss   15   0.497 ± 0.013   s/op
write100KRowsSnappyWithChecksums             ss   15   0.235 ± 0.015   s/op
write100KRowsSnappyWithoutChecksums          ss   15   0.239 ± 0.018   s/op
write100KRowsUncompressedWithChecksums       ss   15   0.213 ± 0.012   s/op
write100KRowsUncompressedWithoutChecksums    ss   15   0.210 ± 0.013   s/op
write1MRowsGzipWithChecksums                 ss   15   4.720 ± 0.018   s/op
write1MRowsGzipWithoutChecksums              ss   15   4.718 ± 0.013   s/op
write1MRowsSnappyWithChecksums               ss   15   2.288 ± 0.182   s/op
write1MRowsSnappyWithoutChecksums            ss   15   2.122 ± 0.122   s/op
write1MRowsUncompressedWithChecksums         ss   15   1.945 ± 0.137   s/op
write1MRowsUncompressedWithoutChecksums      ss   15   1.913 ± 0.124   s/op
write10MRowsGzipWithChecksums                ss   15  47.226 ± 0.127   s/op
write10MRowsGzipWithoutChecksums             ss   15  47.430 ± 0.306   s/op
write10MRowsSnappyWithChecksums              ss   15  21.164 ± 0.628   s/op
write10MRowsSnappyWithoutChecksums           ss   15  20.905 ± 0.332   s/op
write10MRowsUncompressedWithChecksums        ss   15  18.806 ± 0.390   s/op
write10MRowsUncompressedWithoutChecksums     ss   15  18.637 ± 0.201   s/op

PageChecksumReadBenchmarks

Benchmark                                    Mode  Cnt  Score   Error  Units
read100KRowsGzipWithVerification               ss   15  0.105 ± 0.005   s/op
read100KRowsGzipWithoutVerification            ss   15  0.104 ± 0.004   s/op
read100KRowsSnappyWithVerification             ss   15  0.080 ± 0.003   s/op
read100KRowsSnappyWithoutVerification          ss   15  0.081 ± 0.006   s/op
read100KRowsUncompressedWithVerification       ss   15  0.074 ± 0.002   s/op
read100KRowsUncompressedWithoutVerification    ss   15  0.071 ± 0.005   s/op
read1MRowsGzipWithVerification                 ss   15  0.956 ± 0.012   s/op
read1MRowsGzipWithoutVerification              ss   15  0.952 ± 0.025   s/op
read1MRowsSnappyWithVerification               ss   15  0.707 ± 0.025   s/op
read1MRowsSnappyWithoutVerification            ss   15  0.699 ± 0.013   s/op
read1MRowsUncompressedWithVerification         ss   15  0.676 ± 0.013   s/op
read1MRowsUncompressedWithoutVerification      ss   15  0.651 ± 0.025   s/op
read10MRowsGzipWithVerification                ss   15  9.717 ± 0.588   s/op
read10MRowsGzipWithoutVerification             ss   15  9.593 ± 0.310   s/op
read10MRowsSnappyWithVerification              ss   15  7.038 ± 0.113   s/op
read10MRowsSnappyWithoutVerification           ss   15  6.854 ± 0.078   s/op
read10MRowsUncompressedWithVerification        ss   15  6.544 ± 0.071   s/op
read10MRowsUncompressedWithoutVerification     ss   15  6.463 ± 0.121   s/op

@gszadovszky (Contributor) left a comment

Thanks a lot for working on this feature.
I've written some comments in the code, then realized that I would implement this feature in a different way.
The API user does not need to see the CRC value, and exposing it is not useful either. I think it would be nicer (and maybe the implementation would also be easier) to implement the crc check and calculation "under the hood", at the points where we are reading/writing the pages.
The following code parts are used for reading/writing the V1 pages.

Please let me know what you think.

@bbraams (Contributor, Author) commented Jun 17, 2019

@gszadovszky Thanks for the quick review and comments! I agree that your proposed approach would simplify the implementation, and that there is little use in exposing access to the crc in the DataPage class. There are, however, a few drawbacks to doing it this way that I can think of:

  1. Less robust testing. We lose the reference to the CRC up the call stack if we do it under the hood, so for testing we will have to rely on the 'side effects' of the implementation to verify correct behavior (whether or not an exception is thrown on failed verification). Maybe we could implement some helper functions, just for use in testing, to expose the crc. But I suspect we would end up with an implementation very similar to the current one (with the crc in Page instead).
  2. No easy way for users to see if their files have the CRC set. In the current implementation, in parquet-tools, I just hook into the loop over all pages in the dump command. Again, I think we could solve this by implementing a boolean flag at the PageReadStore level (or similar) that is set to true whenever we encounter a checksum in readAllPages.

Regarding point 1, given that the checksum calculation and verification are really only a handful of lines of code, we might get away with less robust testing. Regarding point 2, I think this is acceptable.

Given the points above, I'm on board with going for the 'under the hood' implementation. Your thoughts?

@gszadovszky (Contributor) commented

As the crc is listed in the PageHeader structure, we should add it to Page instead of DataPageV1 or DataPage. What I don't really like here is that we already have a couple of constructors which we should not modify (for backward compatibility); we would have to add new ones and deprecate the old ones. However, we can solve this by implementing builders. If we are going in the direction of propagating the crc to the "user objects", I would vote for having the crc in Page and creating the proper builders for all the supported page types.

If we prefer the "under the hood" solution, then I agree that testing, and extending the tools to deal with the crc, would be much harder.

After thinking a bit more, I am not that confident that the "under the hood" solution is better than the other one, but I would still vote for it because the public API stays a bit cleaner and most users are not interested in the crc value. From a testing point of view, I think we can still validate all the possible cases (crc verification on/off + valid/invalid/missing crc values), only we would have to implement the tests at a higher level (writing actual parquet files with "hacked" crc values and reading them back with different config setups).

Whether we really want a tool that lists the crc values, or just the fact that there are valid crc values saved for the pages, is another issue. It is a separate topic, but I think a tool that could list the original values of the thrift objects, without translating them to "user objects", would help a lot in debugging low-level issues. Such a tool would easily list the crc values of the page headers even if the check is implemented "under the hood". For this current issue I can live without such a tool. :)

@bbraams (Contributor, Author) commented Jun 18, 2019

Alright, I agree that we should move forward with the 'under the hood' solution. Some points to note:

  • Even though we won't be adding the crc to the implementations of the Page class, we should still respect the specification and thus also set/verify the crc for DictionaryPages.
  • Previously we would perform the verification in ColumnChunkPageReader.readPage, which I believe is called on demand. We will now move it to readAllPages, which makes it less lazy in a sense, since the checksum verification will be performed before the pages are actually delivered. I don't believe this is a big problem though.

@gszadovszky (Contributor) commented

The specification (parquet.thrift) contains 5 different page types. One was never used in parquet-mr (IndexPageHeader), another is not implemented / not in master yet (BloomFilterPageHeader). The other 3 are used: page v1, page v2 and dictionary. Since this JIRA is about implementing the CRC for V1 pages only, I am OK with implementing it for those only. However, it does not seem to be a big deal to implement it for all 3.

The current idea is to throw an exception (and therefore fail the whole reading process) if the verify flag is on and a CRC check fails. I think it is better to throw that exception as early as possible. All the pages processed in readAllPages will be required later anyway.
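
A rough sketch of what this early verification in readAllPages could look like, using java.util.zip.CRC32; the names pageHeader, pageBytes and verifyChecksums are illustrative, and isSetCrc/getCrc are the accessors thrift typically generates for an optional field — not necessarily the exact code in this PR:

// Inside the loop over page headers in readAllPages (illustrative):
if (verifyChecksums && pageHeader.isSetCrc()) {
  CRC32 crc = new CRC32();
  crc.update(pageBytes);  // the compressed page payload as read from the file
  if ((int) crc.getValue() != pageHeader.getCrc()) {
    throw new IOException("CRC checksum verification failed for page");
  }
}
// Pages without a crc field are accepted as-is, preserving backward compatibility.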

@bbraams (Contributor, Author) commented Jun 19, 2019

I’ve addressed some of the code review comments and made some changes:

  • The writing and verification of the checksums are now done 'under the hood', removing the explicit crc from DataPageV1
  • Added crc support for dictionary pages, as per the specification
  • Added a getter/setter for a copy of the crc value to Page, solely for the purpose of testing

With respect to the last change, I believe this is a reasonable compromise between what we've been discussing (having the crc in the public API for robust testing vs. handling crcs only under the hood and having to rely on pre-generated files with baked-in crcs for testing). Note that the crc we store in the Page object is merely a copy of the value in the thrift object; it is not actually used to write out the crc or during validation (for those we use the low-level thrift values accessed in readAllPages). The presence of the crc in Page is not reflected in any (subclass) constructor, and is therefore relatively hidden. Ideally we'd make the setter/getter package private, but this is unfortunately not possible (parquet.column vs parquet.hadoop). What we do get is more robust testing (see the TestDataPageV1Checksums suite) and the ability to display the crc in parquet-tools (I've added it to the dump command).

Is this an agreeable solution for you?
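
For reference, one plausible shape for such a testing-only accessor on Page (a hypothetical sketch; the committed code may differ):

import java.util.OptionalLong;

public abstract class Page {
  // ... existing fields and constructors unchanged ...

  // Note: only used for testing purposes; not part of any constructor,
  // and not consulted when writing or verifying checksums.
  private OptionalLong crc = OptionalLong.empty();

  public void setCrc(long crc) {
    this.crc = OptionalLong.of(crc);
  }

  public OptionalLong getCrc() {
    return crc;
  }
}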

I’m currently writing benchmarks for parquet-benchmarks to see the performance penalty on both the write and read path. These numbers should give us confidence (or not) in enabling writing checksums by default.

@gszadovszky (Contributor) left a comment

Thanks a lot for your additional efforts. I agree this is a fine compromise between the two approaches.

I have some additional comments on the code, but in general I am OK with the solution.
One additional note: you've added checksum writing/verification for the dictionary page, so now only page v2 is missing. I am fine with postponing that implementation to a later jira; however, I don't think it would require much additional effort compared to what has already been added. :)

@bbraams (Contributor, Author) commented Jul 19, 2019

I've addressed your comments and included benchmarks. I've created a new Jira ticket for adding support for DataPageV2 (https://jira.apache.org/jira/browse/PARQUET-1629).

I'm running the benchmarks as we speak and will report the results once they're in :).

@Fokko (Contributor) commented Jul 20, 2019

There are still failing tests:

Failed tests:   testAlignmentWithNoPaddingNeeded(org.apache.parquet.hadoop.TestParquetFileWriter): Second row group should start after no padding expected:<109> but was:<139>
  testAlignmentWithPadding(org.apache.parquet.hadoop.TestParquetFileWriter): First row group should end before the block size (120)

@Fokko (Contributor) left a comment

Some small comments on the code

@@ -43,4 +45,18 @@ public int getUncompressedSize() {
return uncompressedSize;
}

// Note: the following fields are only used for testing purposes and are NOT used in checksum

nit — suggested change:
- // Note: the following fields are only used for testing purposes and are NOT used in checksum
+ // Note: the following field is only used for testing purposes and is NOT used in checksum

@gszadovszky self-requested a review — July 22, 2019 11:59
@gszadovszky (Contributor) left a comment

The benchmarks might be improved by adding with/without checksum as a parameter instead of encoding it in the method names. That way it is easier to visualize the results and differences (e.g. with https://jmh.morethan.io). But I'm also good with the current implementation. Thanks a lot again for working on this.
Let's wait for an approval from @Fokko.
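
For reference, the parameterization suggested above might look like this in JMH (a hypothetical benchmark shape, not the suite as committed):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class PageChecksumWriteBenchmarkSketch {

  // One benchmark method, two runs: the with/without axis moves out of
  // the method name and into a parameter, which visualizers such as
  // jmh.morethan.io can plot side by side.
  @Param({"true", "false"})
  public boolean writeChecksums;

  @Benchmark
  public void write1MRows() {
    // Placeholder for the existing benchmark body, configured with
    // parquet.page.write-checksum.enabled = writeChecksums.
  }
}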

@bbraams (Contributor, Author) commented Jul 22, 2019

Thanks for the feedback. I've added the benchmark results to the PR description. As expected, the impact on the write path is minimal, given the default page size and the various compression schemes. I also ran the benchmarks on the read path, which likewise show minimal impact; however, I'm not 100% sure the code path taken is representative of normal use. If we want to enable verification by default in the future, we will need some more thorough testing.

@Fokko (Contributor) left a comment

Small nits, apart from that LGTM

@Fokko (Contributor) left a comment
Great work, @bbraams. Thanks!

@Fokko merged commit fcc5d1a into apache:master on Jul 24, 2019
(int)uncompressedSize,
(int)compressedSize,
valueCount,
rlEncoding,
dlEncoding,
valuesEncoding,
(int) crc.getValue(),
A Member commented on this code after the merge:

@gszadovszky Do we need to consider int overflow? For example:

crc.getValue() = 4169965210
(int) crc.getValue() = -125002086
(int) (crc.getValue() & 0x7FFFFFFF) = 2022481562
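
For reference, the narrowing cast keeps all 32 bits of the checksum: the stored int may print as negative, but the original unsigned value is recoverable, so comparisons work as long as both sides go through the same cast. Masking with 0x7FFFFFFF, by contrast, discards a bit. A small demonstration:

long value = 4169965210L;               // e.g. crc.getValue()
int stored = (int) value;               // -125002086: negative, but bit pattern intact
long restored = stored & 0xFFFFFFFFL;   // back to 4169965210
System.out.println(restored == value);  // true: the cast round-trips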
