PARQUET-1580: Page-level CRC checksum verification for DataPageV1 #647
Conversation
Thanks a lot for working on this feature.
I've written some comments in the code, then realized that I would implement this feature in a different way.
The API user is not required to see the CRC value, and it is not useful to them either. I think it would be nicer (and maybe the implementation would also be easier) to implement the crc calculation and check "under the hood", at the points where we are reading/writing the pages (a rough sketch of the idea follows this comment).
The following code parts are used for reading/writing the V1 pages:
- writing
- reading
Please, let me know what you think.
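To illustrate the idea (an editor's sketch, not code from this PR: `PageCrc` and its methods are hypothetical, and the exact bytes the checksum covers follow the parquet-format clarification referenced in this PR):

```java
import java.util.zip.CRC32;

// Hypothetical helper: the CRC is computed where pages are written and
// verified where they are read, so it never surfaces in the public page API.
final class PageCrc {
  // Assumes the checksum covers the compressed page bytes.
  static int compute(byte[] pageBytes) {
    CRC32 crc = new CRC32();
    crc.update(pageBytes, 0, pageBytes.length);
    return (int) crc.getValue(); // stored as an i32 in the page header
  }

  static void verify(byte[] pageBytes, int expectedCrc) {
    if (compute(pageBytes) != expectedCrc) {
      // failing as early as possible aborts the whole read
      throw new RuntimeException("CRC checksum verification failed for page");
    }
  }
}
```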
Resolved review threads:
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java (2 threads)
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java (2 threads)
- parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV1.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java (2 threads)
@gszadovszky Thanks for the quick review and comments! I agree that your proposed approach would simplify the implementation and that there is little use in exposing access to the crc in the DataPage class. There are, however, a few drawbacks of doing it this way that I can think of: …

Regarding point 1, given that the checksum calculation and verification are really only a handful of lines of code, we might get away with less robust testing. Regarding point 2, I think this can be acceptable. I'm on board with going for the "under the hood" implementation given my points above. Your thoughts?
As the crc in the footer structure is listed in …

If we prefer the "under the hood" solution then I agree that the testing, and extending the tools to deal with the crc, would be much harder. After thinking a bit more, I am not that confident that the "under the hood" solution is better than the other one, but I would still vote for it because the public API could stay a bit cleaner and most users are not interested in the crc value.

From a testing point of view, I think we can still validate all the possible cases (crc verification on/off + valid/invalid/missing crc values), only we would have to implement the tests at a higher level: writing actual parquet files with "hacked" crc values and reading them back with different config setups (see the sketch after this comment).

It is another question whether we really want a tool that lists the crc values, or the fact that there are valid crc values saved for the pages. It is a separate topic, but I think a tool that could list the original values of the thrift objects without translating them to "user objects" would help a lot in debugging low-level issues. Such a tool would easily list the crc values of the page headers even if the check is implemented "under the hood". For this current issue I can live without such a tool. :)
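One possible shape for such a higher-level test (an editor's sketch; the write/read helpers named in the comments are hypothetical placeholders, only the byte-flipping helper is concrete):

```java
import java.io.RandomAccessFile;

public class ChecksumCorruptionSketch {
  // Flip one byte inside the page data region of an already written file,
  // invalidating the stored CRC for that page.
  static void flipByteAt(String path, long offset) throws Exception {
    try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
      file.seek(offset);
      int original = file.read();
      file.seek(offset);
      file.write(original ^ 0xFF); // invert all bits of that byte
    }
  }

  // Intended usage (helpers hypothetical):
  //   writeParquetFile(path);               // with checksum writing enabled
  //   flipByteAt(path, somePageDataOffset); // corrupt a page
  //   readParquetFile(path, /* verify = */ true);  // expected to throw
  //   readParquetFile(path, /* verify = */ false); // expected to succeed
}
```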
Alright, I agree that we should move forward with the "under the hood" solution. Some points to note: …
The specification (…)

The current idea is to throw an exception (and therefore fail the whole reading process) if the verify flag is on and a CRC check fails. I think it is better to throw that exception as early as possible. All the pages are processed in …
I’ve addressed some of the code review comments and made some changes: …

With respect to the last change, I believe this is a reasonable compromise for what we’ve been discussing (having the crc in the public API for robust testing vs. handling crcs only under the hood and having to rely on pre-generated files with baked-in crcs for testing). Note that the crc that we store in the … Is this an agreeable solution for you? I’m currently writing benchmarks for …
Thanks a lot for your additional efforts. I agree this is a fine solution between the two.
I have some additional comments on the code, but in general I am OK with the solution.
One additional note: you've added checksum writing/verification for the dictionary page, so now only DataPageV2 is missing. I am fine with postponing that implementation to a later jira; however, I don't think it would require much additional effort compared to what has already been added. :)
Resolved review threads:
- parquet-column/src/main/java/org/apache/parquet/column/page/Page.java
- parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
- parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java
I've addressed your comments and included benchmarks. I've created a new Jira ticket for adding support for DataPageV2 (https://jira.apache.org/jira/browse/PARQUET-1629). I'm running the benchmarks as we speak and will report the results once they're in. :)
There are still failing tests: …
Some small comments on the code
Resolved review threads:
- parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java (2 threads)
- parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java (6 threads)
```java
@@ -43,4 +45,18 @@ public int getUncompressedSize() {
    return uncompressedSize;
  }

  // Note: the following fields are only used for testing purposes and are NOT used in checksum
```
nit:
```diff
- // Note: the following fields are only used for testing purposes and are NOT used in checksum
+ // Note: the following field is only used for testing purposes and is NOT used in checksum
```
The benchmarks might be improved by passing with/without checksums as a parameter instead of encoding it in the method names (see the sketch below). This way it is easier to visualize the results and differences (e.g. with https://jmh.morethan.io). But I'm also good with the current implementation. Thanks a lot again for working on this.
Let's wait for an approval from @Fokko.
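A minimal sketch of the suggested parameterization (assuming JMH, which parquet-benchmarks already uses; the class and field names here are illustrative):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class PageChecksumBenchmarkSketch {
  // JMH runs the benchmark once per listed value, so the on/off results show
  // up as two rows of the same method instead of two differently named methods.
  @Param({"true", "false"})
  public boolean checksumsEnabled;

  @Benchmark
  public void writePages() {
    // ... write pages with checksums enabled/disabled per the parameter ...
  }
}
```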
Thanks for the feedback. I've added the benchmark results to the PR description. As expected, the impact on the write path is minimal given the default page size and the various compression schemes. I also ran the benchmarks on the read path, which likewise show minimal impact; however, I'm not 100% sure the code path taken is representative of normal use. If we want to enable verification by default in the future, we will need some more thorough testing.
Small nits, apart from that LGTM
Resolved review threads:
- parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/PageChecksumReadBenchmarks.java
- parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/PageChecksumWriteBenchmarks.java (3 threads)
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java
Great work @bbraams, thanks!
```java
  (int) uncompressedSize,
  (int) compressedSize,
  valueCount,
  rlEncoding,
  dlEncoding,
  valuesEncoding,
  (int) crc.getValue(),
```
@gszadovszky Do we need to consider int overflow? For example:

```java
crc.getValue()                      // 4169965210
(int) crc.getValue()                // -125002086
(int) (crc.getValue() & 0x7FFFFFFF) // 2022481562
```
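For context (an editor's note, not from the thread): in Java the narrowing cast reinterprets the same 32 bits, so no information is lost as long as the reader widens the stored `int` back with an unsigned mask. A minimal, self-contained demonstration:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcCastDemo {
  public static void main(String[] args) {
    CRC32 crc = new CRC32();
    crc.update("some page bytes".getBytes(StandardCharsets.UTF_8));

    long unsigned = crc.getValue();        // unsigned 32-bit CRC held in a long
    int stored = (int) unsigned;           // narrowing keeps the bit pattern (may print negative)
    long recovered = stored & 0xFFFFFFFFL; // widen back without sign extension

    System.out.println(unsigned == recovered); // true: the round trip is lossless
  }
}
```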
This PR implements page-level CRC checksums for DataPageV1. It is the implementation follow-up to the clarification of the checksums in parquet-format (Jira, PR). Key points:

- Checksums are calculated with the CRC32 implementation from `java.util.zip`
- Checksums are written and verified for `DataPageV1` and `DictionaryPage`
- `DataPageV2` support is deferred to a separate Jira (PARQUET-1629)

Tests

This PR adds the following tests in a new test suite `TestDataPageV1Checksums`: …

Documentation

The feature is feature-flagged and is disabled by default. Both writing out checksums and verifying them on the read path can be turned on individually, via the following two new config flags (see the usage sketch below):

- `parquet.page.write-checksum.enabled` (default: `true`)
- `parquet.page.verify-checksum.enabled` (default: `false`)
This PR adds 2 new benchmark suites, benchmarking the penalty of both writing out checksums and verifying them. I've run the suites on the following setup:
PageChecksumWriteBenchmarks
PageChecksumReadBenchmarks