PARQUET-2119: [C++] Fix DeltaBitPackDecoder fuzzer found issue #12365

Closed
tachyonwill wants to merge 4 commits

tachyonwill
Contributor

DeltaBitPackDecoder was using num_values_ (which includes nulls) instead of total_value_count_ to compute the batch size. This led to a failed check when comparing those counts. Changed it to use only total_value_count_ and removed the check. Alternatively, we could throw an exception instead of the check; I think we only enter this state if the file is malformed (assuming no bugs elsewhere).

Also modified DecodeArrow to check the return value of DecodeInternal; this might be pointless, as its callers only compare the returned value count in a DCHECK. It might be a good idea to standardize at what stage of decoding the value counts should be compared to the expected counts, and the behavior when they are not equal (preferably an exception and not a check).
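
For readers skimming the thread, here is a minimal sketch of the batch-size computation described above. It is not the code from encoding.cc: `ClampBatchSize` is a hypothetical free function, and it assumes the header count is a `uint32_t` that excludes nulls while the caller's request is a plain `int`.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical helper, not the Arrow/Parquet API. It returns how many values
// one decode call may produce, given the caller's request and the number of
// values the delta header reports as encoded (nulls are not encoded).
int ClampBatchSize(int max_values, uint32_t total_value_count) {
  // Saturate the unsigned header count so the cast to int cannot overflow.
  const uint32_t int_max =
      static_cast<uint32_t>(std::numeric_limits<int>::max());
  const int available = static_cast<int>(std::min(total_value_count, int_max));
  // Never hand back more than the stream actually holds.
  return std::min(max_values, available);
}
```

With this shape, a page whose num_values_ is 10 but whose delta header encodes only 4 values yields a batch of 4, so there is no second count left to compare against in a check.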

int total_value_count = static_cast<int>(std::min(
    total_value_count_, static_cast<uint32_t>(std::numeric_limits<int>::max())));
max_values = std::min(max_values, total_value_count);
DCHECK_GE(max_values, 0);
Contributor

I think if this can result from a malformed file, we should throw an exception (the same goes in general for any checks in the Parquet code, per your point in the CL description).

Contributor Author

I worry that this is the wrong place to put it, though. The basic contract of these decoders seems to be: 'tell me how many values you want, I will give you min(what you want, what is left) and return how many were actually given'. Changing this one to throw an exception if 'what you want > what is left' would make it unlike the others, even though I only expect the caller to request more than what is available if the caller has a bug or the file is malformed.
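
To make the contract described in this comment concrete, here is a toy sketch. `ToyDecoder` is hypothetical and far simpler than the real Parquet decoders; it only shows the "return min(requested, remaining)" behavior, with mismatches left for the caller to detect.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy decoder, not part of the Parquet class hierarchy.
class ToyDecoder {
 public:
  explicit ToyDecoder(std::vector<int> values) : values_(std::move(values)) {}

  // Copies up to max_values values into out and returns how many were copied.
  // Asking for more than is left is not an error here: the caller sees a
  // short count and decides what to do about it.
  int Decode(int* out, int max_values) {
    const int remaining = static_cast<int>(values_.size() - pos_);
    const int n = std::max(0, std::min(max_values, remaining));
    std::copy_n(values_.begin() + pos_, n, out);
    pos_ += static_cast<std::size_t>(n);
    return n;
  }

 private:
  std::vector<int> values_;
  std::size_t pos_ = 0;
};
```

Under this contract, a caller that requested 10 values and got 2 back is the one in a position to raise an error, which is why the exception arguably belongs in the calling code rather than in a single decoder.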

Contributor

Makes sense.

Contributor Author

On this subject, I noticed that DeltaByteArrayDecoder was only looking at the number of valid values in its prefix sub-decoder but not its suffix decoder; I added an exception if the suffix sub-decoder does not return the correct number of values.
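
A rough sketch of the kind of check described here, under stated assumptions: the two vectors stand in for the already-decoded outputs of the prefix and suffix sub-decoders, `RebuildFromPrefixesAndSuffixes` is a hypothetical name, and `std::runtime_error` stands in for whatever exception type the Parquet code actually throws.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical illustration of delta byte array reconstruction: each value is
// the first prefix_lengths[i] bytes of the previous value plus suffixes[i].
std::vector<std::string> RebuildFromPrefixesAndSuffixes(
    const std::vector<std::size_t>& prefix_lengths,
    const std::vector<std::string>& suffixes) {
  // Both sub-decoders must yield the same number of values; a mismatch
  // suggests a malformed file, so fail loudly instead of using a debug check.
  if (suffixes.size() != prefix_lengths.size()) {
    throw std::runtime_error("suffix count does not match prefix count");
  }
  std::vector<std::string> out;
  out.reserve(prefix_lengths.size());
  std::string previous;
  for (std::size_t i = 0; i < prefix_lengths.size(); ++i) {
    // A prefix longer than the previous value is also malformed input.
    if (prefix_lengths[i] > previous.size()) {
      throw std::runtime_error("prefix length exceeds previous value");
    }
    previous = previous.substr(0, prefix_lengths[i]) + suffixes[i];
    out.push_back(previous);
  }
  return out;
}
```

The point of throwing rather than asserting is that the check still fires in release builds, where a DCHECK is compiled out.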

Contributor Author

I will probably send a follow-up PR to this one to change some of those DCHECKs in the calling code to exceptions.

@emkornfield
Contributor

Is there a fuzzer file to go along with this?

@tachyonwill
Contributor Author

Is there a fuzzer file to go along with this?

Yes, apache/arrow-testing#75. I didn't include the submodule change with this PR to avoid last time's weirdness. I can send another PR with just the submodule update once both are merged.

@pitrou pitrou changed the title PARQUET-2119: Fix DeltaBitPackDecoder fuzzer found issue. PARQUET-2119: [C++] Fix DeltaBitPackDecoder fuzzer found issue Feb 8, 2022
tachyonwill and others added 3 commits February 8, 2022 17:00
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Member

@pitrou pitrou left a comment

+1. Thanks a lot for this @tachyonwill.

@pitrou pitrou closed this in 09c8554 Feb 8, 2022
@ursabot

ursabot commented Feb 8, 2022

Benchmark runs are scheduled for baseline = 43efadb and contender = 09c8554. 09c8554 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.13% ⬆️0.04%] test-mac-arm
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.43% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

vfraga pushed a commit to rafael-telles/arrow that referenced this pull request Mar 21, 2022
Closes apache#12365 from tachyonwill/delta_check

Lead-authored-by: William Butler <tachyonwill@gmail.com>
Co-authored-by: William Butler <wab@google.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vfraga pushed a commit to rafael-telles/arrow that referenced this pull request Mar 24, 2022
vfraga pushed a commit to rafael-telles/arrow that referenced this pull request Mar 25, 2022