Skip to content

Conversation

@adamreeve
Copy link
Contributor

@adamreeve adamreeve commented Oct 30, 2025

Rationale for this change

Prevents silently writing invalid data when using dictionary encoding and the number of bits in the estimated max buffer size is greater than the max int32 value.

Also fixes an overflow resulting in a "Negative buffer resize" error if the buffer size in bytes is greater than max int32, and instead throw a more helpful exception.

What changes are included in this PR?

  • Fix overflow when computing the bit position in BitWriter::PutValue. This overflow would cause the method to return without writing data, and the return value is only checked in debug builds.
  • Change buffer size calculations to use int64 and check for overflow before casting to int

Are these changes tested?

Yes, I've added unit tests for both issues. These require enabling ARROW_LARGE_MEMORY_TESTS as they allocate a lot of memory.

Are there any user-facing changes?

This PR contains a "Critical Fix".

This fixes a bug where invalid Parquet files can be silently written when the buffer size for dictionary indices is large.

@adamreeve
Copy link
Contributor Author

I'm not sure this is the best approach for fixing this because it does slow down the RLE encoding benchmarks on my machine. Although it also seems to speed up some test cases:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (9)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              benchmark        baseline       contender  change %                                                                                                                                                                                                        counters
                 BM_RleEncoding/32768/1   2.866 GiB/sec   3.402 GiB/sec    18.680                                           {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncoding/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33348}
                  BM_RleEncoding/4096/1   2.857 GiB/sec   3.363 GiB/sec    17.740                                           {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncoding/4096/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 263675}
                 BM_RleEncoding/65536/1   2.895 GiB/sec   3.381 GiB/sec    16.763                                           {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncoding/65536/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16491}
                  BM_RleEncoding/1024/1   2.775 GiB/sec   3.114 GiB/sec    12.186                                          {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncoding/1024/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1047638}
BM_RleEncodingSpacedBoolean/32768/10000  62.305 GiB/sec  62.721 GiB/sec     0.668 {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1427195, 'null_percent': 100.0}
             BM_RleEncodingBoolean/1024 410.025 MiB/sec 406.502 MiB/sec    -0.859                                      {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncodingBoolean/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 294385}
             BM_RleEncodingBoolean/4096 426.461 MiB/sec 421.442 MiB/sec    -1.177                                       {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncodingBoolean/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 76216}
            BM_RleEncodingBoolean/32768 435.241 MiB/sec 425.014 MiB/sec    -2.350                                       {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncodingBoolean/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9715}
            BM_RleEncodingBoolean/65536 435.624 MiB/sec 425.030 MiB/sec    -2.432                                       {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncodingBoolean/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4871}

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (12)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                             benchmark        baseline       contender  change %                                                                                                                                                                                                   counters
                BM_RleEncoding/4096/16 644.740 MiB/sec 604.668 MiB/sec    -6.215                                      {'family_index': 0, 'per_family_instance_index': 9, 'run_name': 'BM_RleEncoding/4096/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 57555}
               BM_RleEncoding/65536/16 646.133 MiB/sec 601.672 MiB/sec    -6.881                                     {'family_index': 0, 'per_family_instance_index': 11, 'run_name': 'BM_RleEncoding/65536/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3644}
               BM_RleEncoding/32768/16 649.870 MiB/sec 601.706 MiB/sec    -7.411                                     {'family_index': 0, 'per_family_instance_index': 10, 'run_name': 'BM_RleEncoding/32768/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7281}
                BM_RleEncoding/1024/16 650.748 MiB/sec 601.229 MiB/sec    -7.610                                     {'family_index': 0, 'per_family_instance_index': 8, 'run_name': 'BM_RleEncoding/1024/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 230323}
BM_RleEncodingSpacedBoolean/32768/5000 228.351 MiB/sec 207.755 MiB/sec    -9.020 {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5085, 'null_percent': 50.0}
BM_RleEncodingSpacedBoolean/32768/1000 172.863 MiB/sec 154.948 MiB/sec   -10.364 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3833, 'null_percent': 10.0}
   BM_RleEncodingSpacedBoolean/32768/1 175.361 MiB/sec 155.387 MiB/sec   -11.390    {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3953, 'null_percent': 0.01}
 BM_RleEncodingSpacedBoolean/32768/100 173.656 MiB/sec 153.307 MiB/sec   -11.718   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3910, 'null_percent': 1.0}
                BM_RleEncoding/32768/8 650.268 MiB/sec 573.097 MiB/sec   -11.868                                       {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'BM_RleEncoding/32768/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7267}
                 BM_RleEncoding/1024/8 644.999 MiB/sec 566.453 MiB/sec   -12.178                                      {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BM_RleEncoding/1024/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 232475}
                 BM_RleEncoding/4096/8 653.119 MiB/sec 570.380 MiB/sec   -12.668                                       {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BM_RleEncoding/4096/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 56561}
                BM_RleEncoding/65536/8 651.448 MiB/sec 567.123 MiB/sec   -12.944                                       {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'BM_RleEncoding/65536/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3611}

I tried some alternative approaches that avoid the cast to int64 and operate on int number of bytes instead of bits, but those were similarly slower and could also potentially still overflow if the max buffer size was close to int32 max. Eg:

  int new_bit_offset = bit_offset_ + num_bits;
  if (ARROW_PREDICT_FALSE(byte_offset_ +
                              (new_bit_offset == 0 ? 0 : (1 + (new_bit_offset - 1) / 8)) >
                          max_bytes_))

Another alternative solution could be to limit the maximum buffer size to something like int32_max / 8 - 8 which I think should also prevent any overflow without needing to change this if condition.

I also noticed that the return value from BitWriter::PutValue only appears to be used in debug builds, so it could possibly make sense to make this check only enabled in debug builds too. But removing this check for release builds would mean that rather than silently failing to write out some data, there could be invalid memory writes. And the encoders are public so could be used outside of this codebase by consumers that do check that the writes succeed, and that would be a breaking change.

@adamreeve adamreeve requested a review from pitrou October 30, 2025 03:15
Comment on lines +201 to +203
if (ARROW_PREDICT_FALSE(static_cast<int64_t>(byte_offset_) * 8 + bit_offset_ +
num_bits >
static_cast<int64_t>(max_bytes_) * 8))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main bug fix. Previously max_bytes_ * 8 could overflow int, resulting in a negative value on the RHS so that this comparison always returned true and the function returned without writing anything to the buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it!

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 30, 2025
int max_literal_run_size = 1 + static_cast<int>(::arrow::bit_util::BytesForBits(
MAX_VALUES_PER_LITERAL_RUN * bit_width));
int64_t max_literal_run_size =
1 + ::arrow::bit_util::BytesForBits(MAX_VALUES_PER_LITERAL_RUN * bit_width);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I was not aware that our RLE-bit-packed encoder did not generate literal runs of more than 512 values at a time. This might pessimize decoding performance quite a bit...

@AntoinePrv This might be interesting to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that Parquet Java is doing the same thing...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yhea that's a bit of a shame

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some comments below

@pitrou pitrou merged commit 055c2f4 into apache:main Oct 31, 2025
44 of 46 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Oct 31, 2025
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 31, 2025
@adamreeve adamreeve deleted the fix-invalid-dict-write branch October 31, 2025 08:59
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 055c2f4.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants