
feat(parquet): selective null padding for list child readers #9848

Draft
HippoBaro wants to merge 5 commits into apache:main from HippoBaro:unpadded_child_mode

Conversation

@HippoBaro
Contributor

Which issue does this PR close?

Rationale for this change

Parquet list decoding currently materializes padding for null or empty parent lists and then copies the child array to filter that padding back out. This is expensive for nested list columns, especially sparse lists and fixed-width children, where memory can scale with decoded levels instead of actual emitted child values.

What changes are included in this PR?

This PR makes list child readers emit compact child arrays directly by pushing selective null padding into the leaf RecordReader. It also builds definition-level validity bitmaps word-at-a-time, sizes child buffers after levels are decoded, and adds list runtime and peak-memory benchmarks across element types and null densities.

Are these changes tested?

Yes. The new logic gets extensive coverage from the existing list reader tests, which now exercise the production PrimitiveArrayReader path with in-memory Parquet pages (see #9846), and BooleanBufferBuilder::append_word has targeted unit coverage.

Benchmark results:

  Name                                           Before       After        Delta
  ListArray/StringList/no NULLs                 7.3395 ms    5.8001 ms    (-21.0%)
  ListArray/StringList/half NULLs               4.1255 ms    3.4274 ms    (-16.9%)
  ListArray/Int32List/90pct NULLs               1.0366 ms    975.63 us     (-5.9%)
  ListArray/Fixed32List/no NULLs                4.9298 ms    3.3137 ms    (-32.8%)
  ListArray/Fixed32List/half NULLs              2.8998 ms    2.7258 ms     (-6.0%)
  ListArray/Fixed32List/90pct NULLs             1.0556 ms    995.82 us     (-5.7%)
  ListArray/Fixed32List/99pct NULLs             508.65 us    467.16 us     (-8.2%)

  Name                                           Before       After        Delta
  ListArray_peak_memory/Int32List/no NULLs      836.51 KiB   574.79 KiB   (-31.3%)
  ListArray_peak_memory/Int32List/half NULLs    482.01 KiB   336.29 KiB   (-30.2%)
  ListArray_peak_memory/Int32List/90pct NULLs   271.95 KiB   175.32 KiB   (-35.5%)
  ListArray_peak_memory/Int32List/99pct NULLs   217.96 KiB   120.04 KiB   (-44.9%)
  ListArray_peak_memory/DoubleList/no NULLs     1.2399 MiB   715.31 KiB   (-43.7%)
  ListArray_peak_memory/DoubleList/half NULLs   753.66 KiB   400.39 KiB   (-46.9%)
  ListArray_peak_memory/DoubleList/90pct NULLs  380.62 KiB   190.89 KiB   (-49.8%)
  ListArray_peak_memory/DoubleList/99pct NULLs  315.21 KiB   121.61 KiB   (-61.4%)
  ListArray_peak_memory/Fixed32List/no NULLs    3.8031 MiB   1.5760 MiB   (-58.6%)
  ListArray_peak_memory/Fixed32List/half NULLs  2.1710 MiB   849.94 KiB   (-61.8%)
  ListArray_peak_memory/Fixed32List/90pct NULLs 1.0017 MiB   277.35 KiB   (-73.0%)
  ListArray_peak_memory/Fixed32List/99pct NULLs 898.69 KiB   130.93 KiB   (-85.4%)
  ListArray_peak_memory/StringList/no NULLs     3.7925 MiB   2.4715 MiB   (-34.8%)
  ListArray_peak_memory/StringList/half NULLs   1.2541 MiB   772.94 KiB   (-39.8%)
  ListArray_peak_memory/StringList/90pct NULLs  296.63 KiB   188.96 KiB   (-36.3%)
  ListArray_peak_memory/StringList/99pct NULLs  226.75 KiB   120.37 KiB   (-46.9%)

  Name                                               Before      After       Delta
  ListArray_allocated_bytes/Int32List/no NULLs      10.458 MiB  6.8018 MiB  (-35.0%)
  ListArray_allocated_bytes/Int32List/half NULLs    5.9797 MiB  4.0127 MiB  (-32.9%)
  ListArray_allocated_bytes/Int32List/90pct NULLs   2.9985 MiB  1.8210 MiB  (-39.3%)
  ListArray_allocated_bytes/Int32List/99pct NULLs   2.5579 MiB  1.3733 MiB  (-46.3%)
  ListArray_allocated_bytes/DoubleList/no NULLs     16.083 MiB  8.6546 MiB  (-46.2%)
  ListArray_allocated_bytes/DoubleList/half NULLs   8.9134 MiB  4.8497 MiB  (-45.6%)
  ListArray_allocated_bytes/DoubleList/90pct NULLs  4.3656 MiB  2.0179 MiB  (-53.8%)
  ListArray_allocated_bytes/DoubleList/99pct NULLs  3.7482 MiB  1.3903 MiB  (-62.9%)
  ListArray_allocated_bytes/Fixed32List/no NULLs    49.441 MiB  19.505 MiB  (-60.5%)
  ListArray_allocated_bytes/Fixed32List/half NULLs  26.846 MiB  10.459 MiB  (-61.0%)
  ListArray_allocated_bytes/Fixed32List/90pct NULLs 12.483 MiB  3.1127 MiB  (-75.1%)
  ListArray_allocated_bytes/Fixed32List/99pct NULLs 10.895 MiB  1.4980 MiB  (-86.3%)
  ListArray_allocated_bytes/StringList/no NULLs     47.519 MiB  21.743 MiB  (-54.2%)
  ListArray_allocated_bytes/StringList/half NULLs   19.097 MiB  10.478 MiB  (-45.1%)
  ListArray_allocated_bytes/StringList/90pct NULLs  3.4203 MiB  2.1165 MiB  (-38.1%)
  ListArray_allocated_bytes/StringList/99pct NULLs  2.6424 MiB  1.3777 MiB  (-47.9%)

Are there any user-facing changes?

None.

Extend the existing `arrow_reader` runtime benchmarks with `Int32` and
`FixedBinary32` list columns alongside the existing `StringList`, with
parameterized null density (0%, 50%, 90%, 99%). The prior benchmarks
only covered string lists, which didn't surface costs specific to
fixed-width and primitive element types.

Add a new `arrow_reader_peak_memory` benchmark that measures peak heap
usage during `ListArrayReader::consume_batch` using a thread-local
tracking allocator. It captures how RSS-efficient we are when
materializing a column into its final Arrow in-memory representation.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
`InMemoryArrayReader` coupled a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) by
assuming a 1:1 mapping between array elements and levels. This held
when list readers consumed fully-padded child arrays — one element per
level, nulls included.

Upcoming work pushes null filtering from `ListArrayReader` down into the
child reader at the storage level, breaking that 1:1 assumption: the
child returns fewer array elements than levels, and the mapping between
them depends on the filtering logic itself. Keeping the mock would mean
reimplementing that logic: testing filtered output against a second,
hand-rolled filter.

Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Apr 29, 2026
When reading nested list columns, the leaf `RecordReader` previously
padded the values buffer at every null definition level, including
positions corresponding to null or empty parent lists. The parent
`ListArrayReader` then had to scan the child array with
`MutableArrayData` to filter out those list-level padding entries before
building offsets, copying the entire child array in the process.

This commit pushes the filtering decision down into the `RecordReader`
via a new `padding_threshold` parameter. When set (to the parent list's
`def_level`), the `RecordReader` only pads item-level nulls (def >=
threshold) and skips list-level entries, producing a compact child
array. A compact null bitmap is accumulated alongside the values buffer
so leaf readers can consume it directly.
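The per-level decision described above can be sketched like this. The function name and shape are illustrative (the real logic lives inside `RecordReader`'s decoding loop), but the three-way classification follows the commit message:

```rust
/// For each definition level, decide whether the leaf reader materializes a
/// child slot. Returns the emitted slot count and a compact validity bitmap
/// with one bit per *emitted* slot, not one per level.
fn selective_pad(def_levels: &[i16], max_def: i16, padding_threshold: i16) -> (usize, Vec<bool>) {
    let mut validity = Vec::new();
    for &def in def_levels {
        if def >= max_def {
            validity.push(true); // real value: emit a valid slot
        } else if def >= padding_threshold {
            validity.push(false); // item-level null: emit a padded slot
        }
        // def < padding_threshold: null or empty parent list — no child slot
    }
    (validity.len(), validity)
}

fn main() {
    // For list<int32 nullable>: def 3 = value, 2 = null item,
    // 1 = empty list, 0 = null list; padding_threshold = parent def level = 2.
    let defs = [3, 2, 0, 1, 3];
    let (n, validity) = selective_pad(&defs, 3, 2);
    assert_eq!(n, 3); // two values plus one null item; list-level entries skipped
    assert_eq!(validity, vec![true, false, true]);
    println!("emitted {n} child slots");
}
```

Without a threshold, every level would produce a slot (five here); with it, the child array stays proportional to actual list contents.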

A `max_def_level()` method is added to the `ArrayReader` trait so that
`MapArrayReader` can derive the struct definition level from the key
reader's schema rather than a hardcoded formula.

This reduces peak memory usage and eliminates a full-array copy for list
columns with any null density. Perf improves most for fully dense
columns, because the `MutableArrayData` overhead dominated. For sparse
columns, the bitmap construction that drives the filtering becomes
load-bearing.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro force-pushed the unpadded_child_mode branch from 626fc9b to 54b90c7 on April 29, 2026 02:42
The previous commit introduced selective null padding for list child
readers, which makes bitmap construction from definition levels a
load-bearing hot path: the leaf `RecordReader` builds a compact bitmap
on every batch, and the `StructArrayReader` builds a validity bitmap
from the same level data. Both were calling
`BooleanBufferBuilder::append` per bit, which dominated CPU profiles at
~30% of list-read time.

Add `BooleanBufferBuilder::append_word(word, count)` for bulk-appending
pre-packed bits, and optimise `append()` to avoid the per-bit
`advance(1)` overhead. Introduce `build_level_bitmap()` as a shared,
word-at-a-time implementation that processes 64 definition levels per
iteration: two comparison passes produce filter/value masks, a portable
compress (PEXT equivalent) extracts the relevant bits, and append_word
writes them in one operation. The comparison loops are straightforward
enough for the compiler to auto-vectorise, benefiting from wider SIMD
execution units where available.
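A minimal sketch of the word-at-a-time scheme, under assumed names (`compress` standing in for the portable PEXT equivalent, `build_bitmap` for `build_level_bitmap`). For simplicity this sketch emits one word per 64-level chunk; the real `append_word` packs bits contiguously across word boundaries:

```rust
/// Portable PEXT equivalent: gather the bits of `value` at the set positions
/// of `mask`, packed into the low bits of the result.
fn compress(value: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut bit = 0;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set bit
        if value & lowest != 0 {
            out |= 1u64 << bit;
        }
        bit += 1;
        mask &= mask - 1; // clear lowest set bit
    }
    out
}

/// Process 64 definition levels per iteration: two comparison passes build
/// filter/value masks, compress extracts the validity bits of emitted slots.
fn build_bitmap(defs: &[i16], max_def: i16, threshold: i16) -> (Vec<u64>, usize) {
    let mut words = Vec::new();
    let mut len = 0;
    for chunk in defs.chunks(64) {
        let mut filter = 0u64; // bit i set => slot i is emitted (def >= threshold)
        let mut value = 0u64;  // bit i set => slot i is valid   (def >= max_def)
        for (i, &d) in chunk.iter().enumerate() {
            filter |= ((d >= threshold) as u64) << i;
            value |= ((d >= max_def) as u64) << i;
        }
        let word = compress(value, filter);
        let count = filter.count_ones() as usize;
        words.push(word); // real code: BooleanBufferBuilder::append_word(word, count)
        len += count;
    }
    (words, len)
}

fn main() {
    // Same levels as before: two values, one null item, two list-level skips.
    let (words, len) = build_bitmap(&[3, 2, 0, 1, 3], 3, 2);
    assert_eq!((words, len), (vec![0b101], 3));
    println!("compact bitmap built");
}
```

The inner comparison loop has no data dependencies between iterations, which is what lets the compiler auto-vectorise it.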

Recovers the runtime regressions from the selective-padding commit
and adds further gains across all list types.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Selective null padding for list child readers avoids materializing child
slots for list-level NULLs, but the `ArrayReaderBuilder` batch-size hint
made `RecordReader` preallocate value buffers from the parent batch size
before decoding definition levels. This reintroduced a large fixed
memory cost for sparse list columns, especially fixed-width children.

Initialize value buffers lazily and reserve capacity after repetition
and definition levels have been decoded. For selective padding, reserve
from the compact item count (def >= padding threshold); otherwise
reserve from the decoded level count. Add a
`ValuesBuffer::reserve_exact` hook so primitive, offset, view,
dictionary, and fixed-length byte-array buffers can reserve the logical
output capacity for the next read.
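The capacity rule reduces to a count over the decoded definition levels. A sketch, with an illustrative `output_capacity` helper (not the crate's API):

```rust
/// Capacity to reserve for the values buffer, decided *after* levels are
/// decoded. With selective padding, only slots with def >= threshold are
/// emitted; without it, every level produces a slot.
fn output_capacity(def_levels: &[i16], padding_threshold: Option<i16>) -> usize {
    match padding_threshold {
        Some(t) => def_levels.iter().filter(|&&d| d >= t).count(),
        None => def_levels.len(), // full padding: one slot per level
    }
}

fn main() {
    let defs = [3, 2, 0, 1, 3];
    assert_eq!(output_capacity(&defs, Some(2)), 3); // compact item count
    assert_eq!(output_capacity(&defs, None), 5);    // decoded level count

    // Analogue of the ValuesBuffer::reserve_exact hook: reserve the logical
    // output capacity rather than preallocating from the parent batch size.
    let mut values: Vec<i32> = Vec::new();
    values.reserve_exact(output_capacity(&defs, Some(2)));
    assert!(values.capacity() >= 3);
}
```

For a 99%-null fixed-width column this is the difference between reserving for the whole batch up front and reserving only for the handful of values that will actually be written.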

Remove the internal `GenericRecordReader` capacity plumbing now that
capacity is derived at read time. Keep the public batch-size setting as
an allocation hint, not an exact internal capacity contract, and
preserve the existing full-padding reservation behavior for readers
without a padding threshold.

This keeps sparse list memory proportional to the child values that will
actually be emitted without changing the public batch-size API
semantics.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro force-pushed the unpadded_child_mode branch from 54b90c7 to 0879144 on April 29, 2026 02:52
@HippoBaro
Contributor, Author

Although the code is ready for review, this is a draft PR because it depends on #9847 and #9848. The contents of both dependencies are included here as well so reviewers can better understand the direction of these changes, which may be hard to appreciate in isolation. I'll rebase and undraft once those have been merged 🙇

Feedback is appreciated!

alamb pushed a commit that referenced this pull request May 5, 2026
…er` in tests (#9847)

# Which issue does this PR close?


- Contributes to #9731
- Dependency of #9848

# Rationale for this change


`InMemoryArrayReader` couples a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) and
assumes a 1:1 mapping between array elements and levels. This holds when
list readers consume fully-padded child arrays — one element per level,
nulls included.

Upcoming work (see #9848) pushes null filtering from `ListArrayReader`
down into the child reader at the storage level, breaking that 1:1
assumption: the child returns fewer array elements than levels, and the
mapping between them depends on the filtering logic itself. Keeping the
mock would mean reimplementing that logic: testing filtered output
against a second, hand-rolled filter.

# What changes are included in this PR?


Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

# Are these changes tested?


This is a net positive in test coverage. The existing tests now exercise real readers instead of mock, already-dense in-memory representations.

# Are there any user-facing changes?


None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>