New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2119: [C++] Fix DeltaBitPackDecoder fuzzer found issue #12365
Conversation
cpp/src/parquet/encoding.cc
Outdated
int total_value_count = static_cast<int>(std::min( | ||
total_value_count_, static_cast<uint32_t>(std::numeric_limits<int>::max()))); | ||
max_values = std::min(max_values, total_value_count); | ||
DCHECK_GE(max_values, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if this can result from a malformed file we should throw an exception (same in general for any checks in the parquet code per your point in the CL description.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that this is the wrong place to put it though. The basic contract of these decoders seems to be, 'give me how many values you want, I will give you min(what you want, what is left) and then return how many were actually given'. Changing this one to throw an exception if 'what you want > what is left' would make it unlike the others, even if I only expect the caller to request more than is what is available if the caller has a bug or the file is malformed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On this subject, I noticed that DeltaByteArrayDecoder was only looking at the number of valid values in its prefix sub-decoder but not its suffix decoder; I added an exception if the suffix sub-decoder does not return the correct number of values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will probably second a followup PR to this one to change some of those DCHECKs in the calling code to exceptions
Is there a fuzzer file to go along with this? |
55fa75c
to
d86fde1
Compare
Yes, apache/arrow-testing#75 . I didn't include the submodule change with this PR to avoid last time's weirdness. Can send another PR with just the submodule update once both are merged. |
d86fde1
to
5f5cb6a
Compare
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
6d8b133
to
5b55ec1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Thanks a lot for this @tachyonwill.
Benchmark runs are scheduled for baseline = 43efadb and contender = 09c8554. 09c8554 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
DeltaBitPackDecoder was using num_values_(which includes) null to compute batch size instead of total_value_count_. This lead to a failed check when comparing those counts. Changed to just use total_value_count_ and get rid of the check. Alternatively, we could throw an exception instead of the check; I think we only enter this state if the file is malformed (assuming no bugs elsewhere). Also modified decode arrow to check the return value of DecodeInternal; this might be pointless as its callers only compare the returned value count as a DCHECK. It might be a good idea to standardize at what stage of decoding values should be compared to expected values and the behavior when they are not equal (preferably an exception and not a check). Closes apache#12365 from tachyonwill/delta_check Lead-authored-by: William Butler <tachyonwill@gmail.com> Co-authored-by: William Butler <wab@google.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
DeltaBitPackDecoder was using num_values_(which includes) null to compute batch size instead of total_value_count_. This lead to a failed check when comparing those counts. Changed to just use total_value_count_ and get rid of the check. Alternatively, we could throw an exception instead of the check; I think we only enter this state if the file is malformed (assuming no bugs elsewhere). Also modified decode arrow to check the return value of DecodeInternal; this might be pointless as its callers only compare the returned value count as a DCHECK. It might be a good idea to standardize at what stage of decoding values should be compared to expected values and the behavior when they are not equal (preferably an exception and not a check). Closes apache#12365 from tachyonwill/delta_check Lead-authored-by: William Butler <tachyonwill@gmail.com> Co-authored-by: William Butler <wab@google.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
DeltaBitPackDecoder was using num_values_(which includes) null to compute batch size instead of total_value_count_. This lead to a failed check when comparing those counts. Changed to just use total_value_count_ and get rid of the check. Alternatively, we could throw an exception instead of the check; I think we only enter this state if the file is malformed (assuming no bugs elsewhere). Also modified decode arrow to check the return value of DecodeInternal; this might be pointless as its callers only compare the returned value count as a DCHECK. It might be a good idea to standardize at what stage of decoding values should be compared to expected values and the behavior when they are not equal (preferably an exception and not a check). Closes apache#12365 from tachyonwill/delta_check Lead-authored-by: William Butler <tachyonwill@gmail.com> Co-authored-by: William Butler <wab@google.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
DeltaBitPackDecoder was using num_values_(which includes) null to compute batch size instead of total_value_count_. This lead to a failed check when comparing those counts. Changed to just use total_value_count_ and get rid of the check. Alternatively, we could throw an exception instead of the check; I think we only enter this state if the file is malformed (assuming no bugs elsewhere).
Also modified decode arrow to check the return value of DecodeInternal; this might be pointless as its callers only compare the returned value count as a DCHECK. It might be a good idea to standardize at what stage of decoding values should be compared to expected values and the behavior when they are not equal (preferably an exception and not a check).