
feat(parquet): selective null padding for list child readers #9848

Draft
HippoBaro wants to merge 5 commits into apache:main from HippoBaro:unpadded_child_mode

Conversation

@HippoBaro
Contributor

Which issue does this PR close?

Rationale for this change

Parquet list decoding currently materializes padding for null or empty parent lists and then copies the child array to filter that padding back out. This is expensive for nested list columns, especially sparse lists and fixed-width children, where memory can scale with decoded levels instead of actual emitted child values.

What changes are included in this PR?

This PR makes list child readers emit compact child arrays directly by pushing selective null padding into the leaf RecordReader. It also builds definition-level validity bitmaps word-at-a-time, sizes child buffers after levels are decoded, and adds list runtime and peak-memory benchmarks across element types and null densities.

Are these changes tested?

Yes. The new logic gets extensive coverage from the existing list reader tests, which now exercise the production PrimitiveArrayReader path with in-memory Parquet pages (see #9846), and BooleanBufferBuilder::append_word has targeted unit coverage.

Benchmark results:

  Name                                           Before       After        Delta
  ListArray/StringList/no NULLs                 7.3395 ms    5.8001 ms    (-21.0%)
  ListArray/StringList/half NULLs               4.1255 ms    3.4274 ms    (-16.9%)
  ListArray/Int32List/90pct NULLs               1.0366 ms    975.63 us     (-5.9%)
  ListArray/Fixed32List/no NULLs                4.9298 ms    3.3137 ms    (-32.8%)
  ListArray/Fixed32List/half NULLs              2.8998 ms    2.7258 ms     (-6.0%)
  ListArray/Fixed32List/90pct NULLs             1.0556 ms    995.82 us     (-5.7%)
  ListArray/Fixed32List/99pct NULLs             508.65 us    467.16 us     (-8.2%)

  Name                                           Before       After        Delta
  ListArray_peak_memory/Int32List/no NULLs      836.51 KiB   574.79 KiB   (-31.3%)
  ListArray_peak_memory/Int32List/half NULLs    482.01 KiB   336.29 KiB   (-30.2%)
  ListArray_peak_memory/Int32List/90pct NULLs   271.95 KiB   175.32 KiB   (-35.5%)
  ListArray_peak_memory/Int32List/99pct NULLs   217.96 KiB   120.04 KiB   (-44.9%)
  ListArray_peak_memory/DoubleList/no NULLs     1.2399 MiB   715.31 KiB   (-43.7%)
  ListArray_peak_memory/DoubleList/half NULLs   753.66 KiB   400.39 KiB   (-46.9%)
  ListArray_peak_memory/DoubleList/90pct NULLs  380.62 KiB   190.89 KiB   (-49.8%)
  ListArray_peak_memory/DoubleList/99pct NULLs  315.21 KiB   121.61 KiB   (-61.4%)
  ListArray_peak_memory/Fixed32List/no NULLs    3.8031 MiB   1.5760 MiB   (-58.6%)
  ListArray_peak_memory/Fixed32List/half NULLs  2.1710 MiB   849.94 KiB   (-61.8%)
  ListArray_peak_memory/Fixed32List/90pct NULLs 1.0017 MiB   277.35 KiB   (-73.0%)
  ListArray_peak_memory/Fixed32List/99pct NULLs 898.69 KiB   130.93 KiB   (-85.4%)
  ListArray_peak_memory/StringList/no NULLs     3.7925 MiB   2.4715 MiB   (-34.8%)
  ListArray_peak_memory/StringList/half NULLs   1.2541 MiB   772.94 KiB   (-39.8%)
  ListArray_peak_memory/StringList/90pct NULLs  296.63 KiB   188.96 KiB   (-36.3%)
  ListArray_peak_memory/StringList/99pct NULLs  226.75 KiB   120.37 KiB   (-46.9%)

  Name                                               Before      After       Delta
  ListArray_allocated_bytes/Int32List/no NULLs      10.458 MiB  6.8018 MiB  (-35.0%)
  ListArray_allocated_bytes/Int32List/half NULLs    5.9797 MiB  4.0127 MiB  (-32.9%)
  ListArray_allocated_bytes/Int32List/90pct NULLs   2.9985 MiB  1.8210 MiB  (-39.3%)
  ListArray_allocated_bytes/Int32List/99pct NULLs   2.5579 MiB  1.3733 MiB  (-46.3%)
  ListArray_allocated_bytes/DoubleList/no NULLs     16.083 MiB  8.6546 MiB  (-46.2%)
  ListArray_allocated_bytes/DoubleList/half NULLs   8.9134 MiB  4.8497 MiB  (-45.6%)
  ListArray_allocated_bytes/DoubleList/90pct NULLs  4.3656 MiB  2.0179 MiB  (-53.8%)
  ListArray_allocated_bytes/DoubleList/99pct NULLs  3.7482 MiB  1.3903 MiB  (-62.9%)
  ListArray_allocated_bytes/Fixed32List/no NULLs    49.441 MiB  19.505 MiB  (-60.5%)
  ListArray_allocated_bytes/Fixed32List/half NULLs  26.846 MiB  10.459 MiB  (-61.0%)
  ListArray_allocated_bytes/Fixed32List/90pct NULLs 12.483 MiB  3.1127 MiB  (-75.1%)
  ListArray_allocated_bytes/Fixed32List/99pct NULLs 10.895 MiB  1.4980 MiB  (-86.3%)
  ListArray_allocated_bytes/StringList/no NULLs     47.519 MiB  21.743 MiB  (-54.2%)
  ListArray_allocated_bytes/StringList/half NULLs   19.097 MiB  10.478 MiB  (-45.1%)
  ListArray_allocated_bytes/StringList/90pct NULLs  3.4203 MiB  2.1165 MiB  (-38.1%)
  ListArray_allocated_bytes/StringList/99pct NULLs  2.6424 MiB  1.3777 MiB  (-47.9%)

Are there any user-facing changes?

None.

Extend the existing `arrow_reader` runtime benchmarks with `Int32` and
`FixedBinary32` list columns alongside the existing `StringList`, with
parameterized null density (0%, 50%, 90%, 99%). The prior benchmarks
only covered string lists, which didn't surface costs specific to
fixed-width and primitive element types.

Add a new `arrow_reader_peak_memory` benchmark that measures peak heap
usage during `ListArrayReader::consume_batch` using a thread-local
tracking allocator. It captures how RSS-efficient we are when
materializing a column into its final Arrow in-memory representation.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
`InMemoryArrayReader` coupled a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) by
assuming a 1:1 mapping between array elements and levels. This held
when list readers consumed fully-padded child arrays — one element per
level, nulls included.

Upcoming work pushes null filtering from `ListArrayReader` down into the
child reader at the storage level, breaking that 1:1 assumption: the
child returns fewer array elements than levels, and the mapping between
them depends on the filtering logic itself. Keeping the mock would mean
reimplementing that logic: testing filtered output against a second,
hand-rolled filter.

Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Apr 29, 2026
When reading nested list columns, the leaf `RecordReader` previously
padded the values buffer at every null definition level, including
positions corresponding to null or empty parent lists. The parent
`ListArrayReader` then had to scan the child array with
`MutableArrayData` to filter out those list-level padding entries before
building offsets, copying the entire child array in the process.

This commit pushes the filtering decision down into the `RecordReader`
via a new `padding_threshold` parameter. When set (to the parent list's
`def_level`), the `RecordReader` only pads item-level nulls (def >=
threshold) and skips list-level entries, producing a compact child
array. A compact null bitmap is accumulated alongside the values buffer
so leaf readers can consume it directly.
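The per-level decision described above can be sketched like this. The function name and shape are illustrative (the real logic lives inside `RecordReader`'s decoding loop), but the three-way classification follows the commit message:

```rust
/// For each definition level, decide whether the leaf reader materializes a
/// child slot. Returns the emitted slot count and a compact validity bitmap
/// with one bit per *emitted* slot, not one per level.
fn selective_pad(def_levels: &[i16], max_def: i16, padding_threshold: i16) -> (usize, Vec<bool>) {
    let mut validity = Vec::new();
    for &def in def_levels {
        if def >= max_def {
            validity.push(true); // real value: emit a valid slot
        } else if def >= padding_threshold {
            validity.push(false); // item-level null: emit a padded slot
        }
        // def < padding_threshold: null or empty parent list — no child slot
    }
    (validity.len(), validity)
}

fn main() {
    // For list<int32 nullable>: def 3 = value, 2 = null item,
    // 1 = empty list, 0 = null list; padding_threshold = parent def level = 2.
    let defs = [3, 2, 0, 1, 3];
    let (n, validity) = selective_pad(&defs, 3, 2);
    assert_eq!(n, 3); // two values plus one null item; list-level entries skipped
    assert_eq!(validity, vec![true, false, true]);
    println!("emitted {n} child slots");
}
```

Without a threshold, every level would produce a slot (five here); with it, the child array stays proportional to actual list contents.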

A `max_def_level()` method is added to the `ArrayReader` trait so that
`MapArrayReader` can derive the struct definition level from the key
reader's schema rather than a hardcoded formula.

This reduces peak memory usage and eliminates a full-array copy for list
columns with any null density. Perf improves most for fully dense
columns, because the `MutableArrayData` overhead dominated. For sparse
columns, the bitmap construction that drives the filtering becomes
load-bearing.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro force-pushed the unpadded_child_mode branch from 626fc9b to 54b90c7 on April 29, 2026 02:42
The previous commit introduced selective null padding for list child
readers, which makes bitmap construction from definition levels a
load-bearing hot path: the leaf `RecordReader` builds a compact bitmap
on every batch, and the `StructArrayReader` builds a validity bitmap
from the same level data. Both were calling
`BooleanBufferBuilder::append` per bit, which dominated CPU profiles at
~30% of list-read time.

Add `BooleanBufferBuilder::append_word(word, count)` for bulk-appending
pre-packed bits, and optimise `append()` to avoid the per-bit
`advance(1)` overhead. Introduce `build_level_bitmap()` as a shared,
word-at-a-time implementation that processes 64 definition levels per
iteration: two comparison passes produce filter/value masks, a portable
compress (PEXT equivalent) extracts the relevant bits, and append_word
writes them in one operation. The comparison loops are straightforward
enough for the compiler to auto-vectorise, benefiting from wider SIMD
execution units where available.
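A minimal sketch of the word-at-a-time scheme, under assumed names (`compress` standing in for the portable PEXT equivalent, `build_bitmap` for `build_level_bitmap`). For simplicity this sketch emits one word per 64-level chunk; the real `append_word` packs bits contiguously across word boundaries:

```rust
/// Portable PEXT equivalent: gather the bits of `value` at the set positions
/// of `mask`, packed into the low bits of the result.
fn compress(value: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut bit = 0;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set bit
        if value & lowest != 0 {
            out |= 1u64 << bit;
        }
        bit += 1;
        mask &= mask - 1; // clear lowest set bit
    }
    out
}

/// Process 64 definition levels per iteration: two comparison passes build
/// filter/value masks, compress extracts the validity bits of emitted slots.
fn build_bitmap(defs: &[i16], max_def: i16, threshold: i16) -> (Vec<u64>, usize) {
    let mut words = Vec::new();
    let mut len = 0;
    for chunk in defs.chunks(64) {
        let mut filter = 0u64; // bit i set => slot i is emitted (def >= threshold)
        let mut value = 0u64;  // bit i set => slot i is valid   (def >= max_def)
        for (i, &d) in chunk.iter().enumerate() {
            filter |= ((d >= threshold) as u64) << i;
            value |= ((d >= max_def) as u64) << i;
        }
        let word = compress(value, filter);
        let count = filter.count_ones() as usize;
        words.push(word); // real code: BooleanBufferBuilder::append_word(word, count)
        len += count;
    }
    (words, len)
}

fn main() {
    // Same levels as before: two values, one null item, two list-level skips.
    let (words, len) = build_bitmap(&[3, 2, 0, 1, 3], 3, 2);
    assert_eq!((words, len), (vec![0b101], 3));
    println!("compact bitmap built");
}
```

The inner comparison loop has no data dependencies between iterations, which is what lets the compiler auto-vectorise it.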

Recovers the runtime regressions from the selective-padding commit
and adds further gains across all list types.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Selective null padding for list child readers avoids materializing child
slots for list-level NULLs, but the `ArrayReaderBuilder` batch-size hint
made `RecordReader` preallocate value buffers from the parent batch size
before decoding definition levels. This reintroduced a large fixed
memory cost for sparse list columns, especially fixed-width children.

Initialize value buffers lazily and reserve capacity after repetition
and definition levels have been decoded. For selective padding, reserve
from the compact item count (def >= padding threshold); otherwise
reserve from the decoded level count. Add a
`ValuesBuffer::reserve_exact` hook so primitive, offset, view,
dictionary, and fixed-length byte-array buffers can reserve the logical
output capacity for the next read.
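The capacity rule reduces to a count over the decoded definition levels. A sketch, with an illustrative `output_capacity` helper (not the crate's API):

```rust
/// Capacity to reserve for the values buffer, decided *after* levels are
/// decoded. With selective padding, only slots with def >= threshold are
/// emitted; without it, every level produces a slot.
fn output_capacity(def_levels: &[i16], padding_threshold: Option<i16>) -> usize {
    match padding_threshold {
        Some(t) => def_levels.iter().filter(|&&d| d >= t).count(),
        None => def_levels.len(), // full padding: one slot per level
    }
}

fn main() {
    let defs = [3, 2, 0, 1, 3];
    assert_eq!(output_capacity(&defs, Some(2)), 3); // compact item count
    assert_eq!(output_capacity(&defs, None), 5);    // decoded level count

    // Analogue of the ValuesBuffer::reserve_exact hook: reserve the logical
    // output capacity rather than preallocating from the parent batch size.
    let mut values: Vec<i32> = Vec::new();
    values.reserve_exact(output_capacity(&defs, Some(2)));
    assert!(values.capacity() >= 3);
}
```

For a 99%-null fixed-width column this is the difference between reserving for the whole batch up front and reserving only for the handful of values that will actually be written.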

Remove the internal `GenericRecordReader` capacity plumbing now that
capacity is derived at read time. Keep the public batch-size setting as
an allocation hint, not an exact internal capacity contract, and
preserve the existing full-padding reservation behavior for readers
without a padding threshold.

This keeps sparse list memory proportional to the child values that will
actually be emitted without changing the public batch-size API
semantics.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro force-pushed the unpadded_child_mode branch from 54b90c7 to 0879144 on April 29, 2026 02:52
@HippoBaro
Contributor, Author

Although the code is ready for review, this is a draft PR because it depends on #9847 and #9848. The contents of both dependencies are included here as well so reviewers can better understand the direction of these changes, which may be hard to appreciate in isolation. I'll rebase and undraft once those have been merged 🙇

Feedback is appreciated!

alamb pushed a commit that referenced this pull request May 5, 2026
…er` in tests (#9847)

# Which issue does this PR close?


- Contributes to #9731
- Dependency of #9848

# Rationale for this change


`InMemoryArrayReader` couples a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) and
assumes a 1:1 mapping between array elements and levels. This holds when
list readers consume fully-padded child arrays — one element per level,
nulls included.

Upcoming work (see #9848) pushes null filtering from `ListArrayReader`
down into the child reader at the storage level, breaking that 1:1
assumption: the child returns fewer array elements than levels, and the
mapping between them depends on the filtering logic itself. Keeping the
mock would mean reimplementing that logic: testing filtered output
against a second, hand-rolled filter.

# What changes are included in this PR?


Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

# Are these changes tested?


This is a net positive in test coverage. The existing tests now exercise real readers instead of mock, already-dense in-memory representations.

# Are there any user-facing changes?


None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>