GH-47973: [C++][Parquet] Fix invalid Parquet files written when dictionary encoded pages are large #47998

adamreeve · 2025-10-30T02:54:03Z

Rationale for this change

Prevents silently writing invalid data when using dictionary encoding and the number of bits in the estimated max buffer size is greater than the max int32 value.

Also fixes an overflow that resulted in a "Negative buffer resize" error when the buffer size in bytes was greater than max int32; a more helpful exception is now thrown instead.

What changes are included in this PR?

Fix overflow when computing the bit position in BitWriter::PutValue. This overflow would cause the method to return without writing data, and the return value is only checked in debug builds.
Change buffer size calculations to use int64 and check for overflow before casting to int.
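
As a rough illustration of the second change, here is a minimal sketch of doing the size arithmetic in int64 and only narrowing to int after a range check. The names CheckedBufferSizeToInt and EstimatedIndexBytes are hypothetical and are not taken from the PR; the exception type is also just an assumption for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not Arrow's API): narrow a 64-bit buffer size to int,
// throwing instead of silently wrapping when the value does not fit.
inline int CheckedBufferSizeToInt(int64_t size_in_bytes) {
  assert(size_in_bytes >= 0);
  if (size_in_bytes > std::numeric_limits<int>::max()) {
    throw std::overflow_error("Buffer size of " + std::to_string(size_in_bytes) +
                              " bytes exceeds the maximum int value");
  }
  return static_cast<int>(size_in_bytes);
}

// Hypothetical size estimate computed entirely in 64-bit arithmetic:
// num_values * bit_width bits, rounded up to whole bytes.
inline int64_t EstimatedIndexBytes(int64_t num_values, int bit_width) {
  return (num_values * bit_width + 7) / 8;
}
```

With this shape, a call such as CheckedBufferSizeToInt(EstimatedIndexBytes(num_values, bit_width)) either returns a valid int or fails loudly, rather than producing a negative size further down the line.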

Are these changes tested?

Yes, I've added unit tests for both issues. These require enabling ARROW_LARGE_MEMORY_TESTS as they allocate a lot of memory.

Are there any user-facing changes?

This PR contains a "Critical Fix".

This fixes a bug where invalid Parquet files can be silently written when the buffer size for dictionary indices is large.

GitHub Issue: [C++][Parquet] Invalid files written when using large dictionary encoded pages #47973

adamreeve · 2025-10-30T03:11:00Z

I'm not sure this is the best approach for fixing this because it does slow down the RLE encoding benchmarks on my machine, although it also seems to speed up some test cases:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (9)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              benchmark        baseline       contender  change %                                                                                                                                                                                                        counters
                 BM_RleEncoding/32768/1   2.866 GiB/sec   3.402 GiB/sec    18.680                                           {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncoding/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33348}
                  BM_RleEncoding/4096/1   2.857 GiB/sec   3.363 GiB/sec    17.740                                           {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncoding/4096/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 263675}
                 BM_RleEncoding/65536/1   2.895 GiB/sec   3.381 GiB/sec    16.763                                           {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncoding/65536/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16491}
                  BM_RleEncoding/1024/1   2.775 GiB/sec   3.114 GiB/sec    12.186                                          {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncoding/1024/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1047638}
BM_RleEncodingSpacedBoolean/32768/10000  62.305 GiB/sec  62.721 GiB/sec     0.668 {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1427195, 'null_percent': 100.0}
             BM_RleEncodingBoolean/1024 410.025 MiB/sec 406.502 MiB/sec    -0.859                                      {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncodingBoolean/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 294385}
             BM_RleEncodingBoolean/4096 426.461 MiB/sec 421.442 MiB/sec    -1.177                                       {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncodingBoolean/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 76216}
            BM_RleEncodingBoolean/32768 435.241 MiB/sec 425.014 MiB/sec    -2.350                                       {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncodingBoolean/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9715}
            BM_RleEncodingBoolean/65536 435.624 MiB/sec 425.030 MiB/sec    -2.432                                       {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncodingBoolean/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4871}

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (12)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                             benchmark        baseline       contender  change %                                                                                                                                                                                                   counters
                BM_RleEncoding/4096/16 644.740 MiB/sec 604.668 MiB/sec    -6.215                                      {'family_index': 0, 'per_family_instance_index': 9, 'run_name': 'BM_RleEncoding/4096/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 57555}
               BM_RleEncoding/65536/16 646.133 MiB/sec 601.672 MiB/sec    -6.881                                     {'family_index': 0, 'per_family_instance_index': 11, 'run_name': 'BM_RleEncoding/65536/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3644}
               BM_RleEncoding/32768/16 649.870 MiB/sec 601.706 MiB/sec    -7.411                                     {'family_index': 0, 'per_family_instance_index': 10, 'run_name': 'BM_RleEncoding/32768/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7281}
                BM_RleEncoding/1024/16 650.748 MiB/sec 601.229 MiB/sec    -7.610                                     {'family_index': 0, 'per_family_instance_index': 8, 'run_name': 'BM_RleEncoding/1024/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 230323}
BM_RleEncodingSpacedBoolean/32768/5000 228.351 MiB/sec 207.755 MiB/sec    -9.020 {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5085, 'null_percent': 50.0}
BM_RleEncodingSpacedBoolean/32768/1000 172.863 MiB/sec 154.948 MiB/sec   -10.364 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3833, 'null_percent': 10.0}
   BM_RleEncodingSpacedBoolean/32768/1 175.361 MiB/sec 155.387 MiB/sec   -11.390    {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3953, 'null_percent': 0.01}
 BM_RleEncodingSpacedBoolean/32768/100 173.656 MiB/sec 153.307 MiB/sec   -11.718   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3910, 'null_percent': 1.0}
                BM_RleEncoding/32768/8 650.268 MiB/sec 573.097 MiB/sec   -11.868                                       {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'BM_RleEncoding/32768/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7267}
                 BM_RleEncoding/1024/8 644.999 MiB/sec 566.453 MiB/sec   -12.178                                      {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BM_RleEncoding/1024/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 232475}
                 BM_RleEncoding/4096/8 653.119 MiB/sec 570.380 MiB/sec   -12.668                                       {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BM_RleEncoding/4096/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 56561}
                BM_RleEncoding/65536/8 651.448 MiB/sec 567.123 MiB/sec   -12.944                                       {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'BM_RleEncoding/65536/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3611}

I tried some alternative approaches that avoid the cast to int64 and operate on an int number of bytes instead of bits, but those were similarly slower and could still overflow if the max buffer size was close to int32 max. E.g.:

  int new_bit_offset = bit_offset_ + num_bits;
  if (ARROW_PREDICT_FALSE(byte_offset_ +
                              (new_bit_offset == 0 ? 0 : (1 + (new_bit_offset - 1) / 8)) >
                          max_bytes_))

Another alternative solution could be to limit the maximum buffer size to something like int32_max / 8 - 8 which I think should also prevent any overflow without needing to change this if condition.
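
To sanity-check that cap, here is a small compile-time sketch of the arithmetic. It assumes that a single write adds at most 64 bits and that the current position never exceeds the end of the buffer before the check runs; both are assumptions of this sketch rather than facts stated in the PR, and the constant and function names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max();

// Cap proposed in the comment above: int32_max / 8 - 8.
constexpr int64_t kMaxBufferBytes = kInt32Max / 8 - 8;

// Assumption: one write adds at most 64 bits.
constexpr int64_t kMaxBitsPerWrite = 64;

// Right-hand side of the original check: max_bytes_ * 8.
constexpr int64_t kMaxRhsBits = kMaxBufferBytes * 8;

// Left-hand side, assuming the position before the write is at most the
// end of the buffer (byte_offset_ * 8 + bit_offset_ <= max_bytes_ * 8).
constexpr int64_t kMaxLhsBits = kMaxRhsBits + kMaxBitsPerWrite;

static_assert(kMaxRhsBits <= kInt32Max, "max_bytes_ * 8 fits in int32");
static_assert(kMaxLhsBits <= kInt32Max, "position + num_bits fits in int32");

// Hypothetical guard that a caller could apply when sizing the buffer.
constexpr bool FitsBufferCap(int64_t requested_bytes) {
  return requested_bytes >= 0 && requested_bytes <= kMaxBufferBytes;
}
```

Under those assumptions both sides of the 32-bit comparison stay in range, so the existing condition could be left unchanged at the cost of a smaller maximum buffer.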

I also noticed that the return value from BitWriter::PutValue only appears to be used in debug builds, so it could make sense to enable this check only in debug builds too. But removing the check from release builds would mean that, rather than silently failing to write out some data, there could be invalid memory writes. The encoders are also public, so they could be used outside of this codebase by consumers that do check that the writes succeed, and removing the check would be a breaking change for them.

adamreeve · 2025-10-30T03:19:48Z

cpp/src/arrow/util/bit_stream_utils_internal.h

+  if (ARROW_PREDICT_FALSE(static_cast<int64_t>(byte_offset_) * 8 + bit_offset_ +
+                              num_bits >
+                          static_cast<int64_t>(max_bytes_) * 8))


This is the main bug fix. Previously max_bytes_ * 8 could overflow int, resulting in a negative value on the RHS so that this comparison always returned true and the function returned without writing anything to the buffer.
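
A standalone sketch of the failure mode, for illustration only. WriterState is a simplified stand-in whose field names mirror the diff; it is not Arrow's BitWriter. Because signed overflow is undefined behaviour in C++, the old 32-bit wraparound is modelled explicitly with unsigned arithmetic instead of being triggered directly.

```cpp
#include <cstdint>

// Simplified stand-in for the writer state (not Arrow's BitWriter class).
struct WriterState {
  int byte_offset_;
  int bit_offset_;
  int max_bytes_;
};

// Reinterpret a 32-bit unsigned value as two's complement, with fully
// defined behaviour on any conforming compiler.
inline int32_t WrapToInt32(uint32_t v) {
  return v <= 0x7FFFFFFFu
             ? static_cast<int32_t>(v)
             : static_cast<int32_t>(static_cast<int64_t>(v) - (int64_t{1} << 32));
}

// Models the old check, where both sides were computed in 32-bit int.
inline bool WouldRejectWrite32(const WriterState& s, int num_bits) {
  int32_t lhs = WrapToInt32(static_cast<uint32_t>(s.byte_offset_) * 8u +
                            static_cast<uint32_t>(s.bit_offset_) +
                            static_cast<uint32_t>(num_bits));
  int32_t rhs = WrapToInt32(static_cast<uint32_t>(s.max_bytes_) * 8u);
  return lhs > rhs;
}

// The fixed check from the diff: widen to int64 before multiplying.
inline bool WouldRejectWrite64(const WriterState& s, int num_bits) {
  return static_cast<int64_t>(s.byte_offset_) * 8 + s.bit_offset_ + num_bits >
         static_cast<int64_t>(s.max_bytes_) * 8;
}
```

For example, with an empty writer and max_bytes_ = 300000000, the 32-bit model wraps the right-hand side to a negative number and rejects an 8-bit write, while the 64-bit check accepts it.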

Thanks for catching it!

pitrou · 2025-10-30T09:29:43Z

cpp/src/arrow/util/rle_encoding_internal.h

-    int max_literal_run_size = 1 + static_cast<int>(::arrow::bit_util::BytesForBits(
-                                       MAX_VALUES_PER_LITERAL_RUN * bit_width));
+    int64_t max_literal_run_size =
+        1 + ::arrow::bit_util::BytesForBits(MAX_VALUES_PER_LITERAL_RUN * bit_width);


Wow, I was not aware that our RLE-bit-packed encoder did not generate literal runs of more than 512 values at a time. This might pessimize decoding performance quite a bit...

@AntoinePrv This might be interesting to you.

I see that Parquet Java is doing the same thing...

Yeah, that's a bit of a shame.
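
For reference, the bound computed by the expression in the diff above can be sketched as follows. The value 512 is taken from the comment in this thread; BytesForBits here is a local stand-in for ::arrow::bit_util::BytesForBits, and reading the leading 1 as a run header byte is an assumption of this sketch.

```cpp
#include <cstdint>

// Assumed value, taken from the "512 values" figure mentioned in the thread.
constexpr int64_t kMaxValuesPerLiteralRun = 512;

// Local stand-in for ::arrow::bit_util::BytesForBits: round bits up to bytes.
constexpr int64_t BytesForBits(int64_t bits) { return (bits + 7) / 8; }

// Mirrors the expression in the diff: one extra byte (presumably the run
// header) plus the bit-packed values of a maximal literal run.
constexpr int64_t MaxLiteralRunSize(int bit_width) {
  return 1 + BytesForBits(kMaxValuesPerLiteralRun * bit_width);
}

static_assert(MaxLiteralRunSize(1) == 65, "512 one-bit values plus one byte");
static_assert(MaxLiteralRunSize(32) == 2049, "512 32-bit values plus one byte");
```

Keeping the result in int64_t, as the diff does, means the multiplication by bit_width no longer depends on an early cast to int.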

cpp/src/arrow/util/rle_encoding_internal.h

pitrou

LGTM, some comments below

conbench-apache-arrow · 2025-10-31T16:48:58Z

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 055c2f4.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

adamreeve added 11 commits October 28, 2025 16:06

Add test to repro invalid file write

5ed8a62

Simplify repro

0d63276

Simplify repro

6307ccd

Simplify further

b3016dd

Fix overflow when indices buffer exceeds max int32

236e4dd

Update test

8449398

Fix overflow in PutValue

ce48dd4

Add another overflow check

310af07

Fix comment

990a869

Add test for exception on overflow

194e328

More casts

ca2724a

adamreeve requested a review from wgtmac as a code owner October 30, 2025 02:54

github-actions bot added the Component: Parquet, Component: C++, and awaiting review labels Oct 30, 2025

adamreeve requested a review from pitrou October 30, 2025 03:15

github-actions bot added the awaiting committer review label and removed the awaiting review label Oct 30, 2025

pitrou reviewed Oct 30, 2025

pitrou approved these changes Oct 30, 2025

Prevent overflow in CheckBufferFull

d5bf9fb

pitrou approved these changes Oct 31, 2025

pitrou merged commit 055c2f4 into apache:main Oct 31, 2025
44 of 46 checks passed

pitrou removed the awaiting committer review label Oct 31, 2025

github-actions bot added the awaiting committer review label Oct 31, 2025

pitrou mentioned this pull request Oct 31, 2025

[C++][Parquet] Invalid files written when using large dictionary encoded pages #47973

Closed

pitrou added the backport-candidate label Oct 31, 2025

adamreeve deleted the fix-invalid-dict-write branch October 31, 2025 08:59
