GH-47973: [C++][Parquet] Fix invalid Parquet files written when dictionary encoded pages are large #47998
Conversation
I'm not sure this is the best approach for fixing this, because it does slow down the RLE encoding benchmarks on my machine (although it also seems to speed up some test cases). I tried some alternative approaches that avoid the cast to int64 and operate on an int number of bytes instead of bits, but those were similarly slower and could also potentially still overflow if the max buffer size was close to int32 max. E.g.:

int new_bit_offset = bit_offset_ + num_bits;
if (ARROW_PREDICT_FALSE(byte_offset_ +
                        (new_bit_offset == 0 ? 0 : (1 + (new_bit_offset - 1) / 8)) >
                        max_bytes_))

Another alternative solution could be to limit the maximum buffer size to something like … I also noticed that the return value from …
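For concreteness: the byte-based variant drops the multiplication by 8, but the addition on the left-hand side can still overflow, since byte_offset_ can legitimately be close to INT32_MAX when the buffer is close to 2 GiB, and adding the handful of bytes needed for the next value (at most 9 bytes for a 64-bit value) then wraps.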
if (ARROW_PREDICT_FALSE(static_cast<int64_t>(byte_offset_) * 8 + bit_offset_ + num_bits >
                        static_cast<int64_t>(max_bytes_) * 8))
This is the main bug fix. Previously max_bytes_ * 8 could overflow int, resulting in a negative value on the RHS so that this comparison always returned true and the function returned without writing anything to the buffer.
Thanks for catching it!
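To make the failure mode concrete, here is a small standalone sketch contrasting the old all-int comparison with the widened one. The values and local variable names (mirroring the byte_offset_, bit_offset_, num_bits, and max_bytes_ referenced in the diff) are illustrative, not taken from the PR or its tests:

```cpp
#include <cstdint>
#include <iostream>

// Illustration (made-up values) of why the old all-int comparison in
// BitWriter::PutValue misfires for large buffers.
int main() {
  int byte_offset_ = 0;
  int bit_offset_ = 0;
  int num_bits = 16;
  int max_bytes_ = 300'000'000;  // 300 MB buffer, larger than INT32_MAX / 8 bytes

  // The old right-hand side: max_bytes_ * 8 does not fit in a 32-bit int.
  // Signed overflow is undefined behaviour; in practice it wraps negative,
  // emulated here by truncating the 64-bit product to 32 bits.
  int32_t old_rhs = static_cast<int32_t>(static_cast<int64_t>(max_bytes_) * 8);
  bool old_out_of_space = byte_offset_ * 8 + bit_offset_ + num_bits > old_rhs;

  // The fixed check from this PR: widen to int64_t before multiplying.
  bool new_out_of_space =
      static_cast<int64_t>(byte_offset_) * 8 + bit_offset_ + num_bits >
      static_cast<int64_t>(max_bytes_) * 8;

  std::cout << old_rhs << "\n";           // a large negative number
  std::cout << old_out_of_space << "\n";  // 1: PutValue would bail out early
  std::cout << new_out_of_space << "\n";  // 0: plenty of room left
  return 0;
}
```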
- int max_literal_run_size = 1 + static_cast<int>(::arrow::bit_util::BytesForBits(
-     MAX_VALUES_PER_LITERAL_RUN * bit_width));
+ int64_t max_literal_run_size =
+     1 + ::arrow::bit_util::BytesForBits(MAX_VALUES_PER_LITERAL_RUN * bit_width);
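For a sense of scale (rough arithmetic, assuming MAX_VALUES_PER_LITERAL_RUN is 512 as noted in the next comment): a single literal run covers at most 512 values, so with bit_width = 32 its body is 512 × 32 / 8 = 2048 bytes plus one indicator byte, i.e. max_literal_run_size is 2049 bytes. Switching it to int64_t keeps the downstream buffer-size arithmetic, which can exceed INT32_MAX for large pages, in 64 bits.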
Wow, I was not aware that our RLE-bit-packed encoder did not generate literal runs of more than 512 values at a time. This might pessimize decoding performance quite a bit...
@AntoinePrv This might be interesting to you.
I see that Parquet Java is doing the same thing...
Yeah, that's a bit of a shame.
LGTM, some comments below
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 055c2f4. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
Rationale for this change
Prevents silently writing invalid data when using dictionary encoding and the number of bits in the estimated max buffer size is greater than the max int32 value.
Also fixes an overflow resulting in a "Negative buffer resize" error when the buffer size in bytes is greater than max int32, throwing a more helpful exception instead.
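For scale: int32 max is 2^31 − 1, so the first issue kicks in once the estimated index buffer needs more than roughly 2^31 bits ≈ 256 MiB, and the second once the buffer itself exceeds roughly 2 GiB.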
What changes are included in this PR?
Fixes an integer overflow in the buffer-capacity check in BitWriter::PutValue. This overflow would cause the method to return without writing data, and the return value is only checked in debug builds.
Are these changes tested?
Yes, I've added unit tests for both issues. These require enabling ARROW_LARGE_MEMORY_TESTS as they allocate a lot of memory.
Are there any user-facing changes?
This PR contains a "Critical Fix".
This fixes a bug where invalid Parquet files can be silently written when the buffer size for dictionary indices is large.
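For illustration only, here is a rough sketch of the kind of workload that reaches the size regime described above. This is not the unit test added in this PR; the file path, row count, cardinality, and writer settings are made-up values chosen so that a single page's dictionary indices need more than int32-max bits:

```cpp
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Hypothetical sketch (not the PR's test): write one very large, low-cardinality
// column so the dictionary indices for a single page exceed INT32_MAX bits.
arrow::Status WriteLargeDictionaryColumn(const std::string& path) {
  constexpr int64_t kNumRows = 300'000'000;  // needs several GiB of RAM
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(kNumRows));
  for (int64_t i = 0; i < kNumRows; ++i) {
    // ~4096 distinct values => 12-bit indices; 300M * 12 bits > INT32_MAX bits.
    builder.UnsafeAppend(i % 4096);
  }
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(builder.Finish(&values));

  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  std::vector<std::shared_ptr<arrow::Array>> columns = {values};
  auto table = arrow::Table::Make(schema, columns);

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));

  // Relax the default page and row-group limits so the indices are not split
  // across many small pages (dictionary encoding is the writer default).
  parquet::WriterProperties::Builder props_builder;
  props_builder.data_pagesize(4LL * 1024 * 1024 * 1024);
  props_builder.max_row_group_length(kNumRows);

  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/kNumRows, props_builder.build());
}
```

Before this fix, a write in this regime could silently produce a file with missing index data; with the fix it should either write correctly or fail with a clearer error.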