
parquet: speed up ByteView dictionary decoder with chunks_exact gather (~28%) #9745

Draft
Dandandan wants to merge 4 commits into apache:main from Dandandan:optimize-byte-view-dict-decoder

Conversation

Contributor

@Dandandan Dandandan commented Apr 16, 2026

Which issue does this PR close?

None — targeted optimisation surfaced by profiling profile_clickbench locally.

Rationale for this change

ByteViewArrayDecoderDictionary::read is the inner loop for reading dictionary-encoded StringView / BinaryView columns. Its previous shape was:

```rust
output.views.extend(keys.iter().map(|k| match dict.views.get(*k as usize) {
    Some(&view) => view,
    None => {
        if error.is_none() {
            error = Some(general_err!("invalid key={} for dictionary", *k));
        }
        0
    }
}));
```

Per element this pays a bounds-checked get, a Some/None branch, an error.is_none() branch on the happy path, and Vec::extend's per-push capacity check (Map<_, closure> doesn't get the TrustedLen specialisation).

What changes are included in this PR?

Two commits:

1. chunks_exact(8) gather. Rewrite the inner loop to bulk-validate each 8-key chunk, then use get_unchecked and raw-pointer writes. Same idiom as RleDecoder::get_batch_with_dict. Invalid keys now return an error eagerly via a #[cold] helper instead of zero-filling and deferring.

2. Branchless helpers driven by asm inspection.

  • adjust_buffer_index rewritten as view.wrapping_add((is_long * base as u128) << 64) so LLVM emits csel in the chunked loop (previously a b.hs to an out-of-line adjustment block per view).
  • .all(|&k| cond) replaced with a u32 max-reduction; .all() short-circuits and blocked vectorisation. On aarch64 the check now compiles to ldp q1,q0 + umax.4s + umaxv.4s + cmp + b.hs — one SIMD load, one branch, reusing the NEON registers for the gather.
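Taken together, the two changes can be sketched as follows. This is a minimal standalone sketch, not the PR's exact code: the function name `gather_views`, the bare `u128` views, the `String` error type, and the checked remainder loop are illustrative assumptions; only the chunk width of 8, the max-reduction validity check, and the `get_unchecked` gather mirror the description above.

```rust
/// Hypothetical sketch: validate each 8-key chunk with a branchless
/// max-reduction, then gather with unchecked indexing.
fn gather_views(dict: &[u128], keys: &[i32], out: &mut Vec<u128>) -> Result<(), String> {
    let len = dict.len() as u32;
    let mut chunks = keys.chunks_exact(8);
    for chunk in &mut chunks {
        // Max-reduction instead of `.all()`: no short-circuit, so the
        // compiler can vectorise the 8 comparisons into one SIMD reduce.
        // Negative i32 keys become large u32 values and fail the check.
        let max = chunk.iter().map(|&k| k as u32).fold(0u32, u32::max);
        if max >= len {
            return Err(format!("invalid key={} for dictionary", max));
        }
        // All 8 keys are proven in-bounds above, so skipping the
        // per-element bounds check here is sound.
        for &k in chunk {
            out.push(unsafe { *dict.get_unchecked(k as usize) });
        }
    }
    // Checked fallback for the sub-8 remainder.
    for &k in chunks.remainder() {
        match dict.get(k as usize) {
            Some(&v) => out.push(v),
            None => return Err(format!("invalid key={} for dictionary", k)),
        }
    }
    Ok(())
}
```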

Casting keys via k as u32 correctly rejects negative i32 (corrupt data) because the value becomes a large u32.
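A minimal illustration of why the single unsigned comparison suffices (the helper name `key_in_bounds` is hypothetical, not from the PR):

```rust
// `as u32` reinterprets the two's-complement bits, so any negative i32
// becomes a value >= 2^31 and fails a `< dict_len` test for any
// realistic dictionary size — one compare covers both bounds.
fn key_in_bounds(k: i32, dict_len: u32) -> bool {
    (k as u32) < dict_len
}
```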

Are these changes tested?

Existing unit tests in byte_view_array pass: test_byte_array_string_view_decoder, test_byte_view_array_plain_decoder_reuse_buffer.

Microbenchmarks (parquet/benches/arrow_reader.rs, arrow_array_reader/(String|Binary)ViewArray/dictionary *, aarch64 / Apple Silicon):

  Bench                              Before       After        Δ
  BinaryView mandatory, no NULLs     102.91 µs    72.96 µs     −29.2%
  BinaryView optional, no NULLs      104.63 µs    75.01 µs     −28.4%
  BinaryView optional, half NULLs    143.25 µs    133.06 µs    −7.4%
  StringView mandatory, no NULLs     105.98 µs    72.27 µs     −30.7%
  StringView optional, no NULLs      104.62 µs    75.41 µs     −29.2%
  StringView optional, half NULLs    141.86 µs    132.20 µs    −6.8%

Half-NULL cases gain less because roughly half the views are null padding rather than gather output.

Are there any user-facing changes?

None — same public API, same semantics (invalid dictionary indices still surface as ParquetError::General).

🤖 Generated with Claude Code

Replace the `extend(keys.iter().map(...))` loop in
`ByteViewArrayDecoderDictionary::read` with a `chunks_exact(8)` loop
that bulk-validates each chunk's keys, then uses `get_unchecked`
gather plus raw-pointer writes. Matches the pattern in
`RleDecoder::get_batch_with_dict`.

Drops per-element bounds check, per-element `error.is_none()` branch,
and `Vec::extend`'s per-push capacity check. Invalid keys now return
an error eagerly via a cold helper instead of zero-filling and
deferring.

Dictionary-decode microbenchmarks (parquet/benches/arrow_reader.rs):

  BinaryView mandatory, no NULLs    102.91 µs -> 74.29 µs  -27.8%
  BinaryView optional, no NULLs     104.63 µs -> 76.65 µs  -26.9%
  BinaryView optional, half NULLs   143.25 µs -> 132.46 µs  -7.3%
  StringView mandatory, no NULLs    105.98 µs -> 73.87 µs  -28.8%
  StringView optional, no NULLs     104.62 µs -> 76.34 µs  -27.4%
  StringView optional, half NULLs   141.86 µs -> 131.85 µs  -7.1%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 16, 2026
@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

Two small follow-ups to the chunked-gather rewrite, both driven by
inspecting the aarch64 asm:

1) Rewrite `adjust_buffer_index` without an `if/else` so LLVM emits a
   `csel` in the hot chunked loop. Previously the main 8-key gather
   went through an out-of-line block with a conditional branch per
   view; now each view is 5 branchless instructions (ldp/cmp/csel/
   add/stp).

2) Replace `chunk.iter().all(|&k| cond)` with a max-reduction over
   `u32` keys. `.all()` short-circuits, which blocks vectorisation —
   LLVM emitted 8 sequential `ldrsw+cmp+b.ls`. The max-reduction
   compiles on aarch64 NEON to:

      ldp  q1, q0, [x1]         ; one load, 8 keys
      umax.4s  v2, v1, v0       ; pairwise lane max
      umaxv.4s s2, v2           ; horizontal reduce
      cmp  w13, w22             ; one compare
      b.hs <cold error path>    ; one branch

   The NEON registers are then reused for the gather (`fmov`/`mov.s
   v[i]`) so keys are loaded exactly once.

Casting keys via `k as u32` correctly rejects any negative i32
(corrupt data) because a negative value becomes a large u32.

Microbenchmark deltas over the previous commit (criterion, aarch64):

  BinaryView mandatory, no NULLs     74.29 µs -> 72.96 µs   -1.8%
  BinaryView optional,  no NULLs     76.65 µs -> 75.01 µs   -2.1%
  StringView mandatory, no NULLs     73.87 µs -> 72.27 µs   -2.2%
  StringView optional,  no NULLs     76.34 µs -> 75.41 µs   -1.2%

Cumulative vs. main HEAD (89b1497):

  BinaryView mandatory, no NULLs    102.91 µs -> 72.96 µs  -29.2%
  BinaryView optional,  no NULLs    104.63 µs -> 75.01 µs  -28.4%
  BinaryView optional, half NULLs   143.25 µs -> 133.06 µs  -7.4%
  StringView mandatory, no NULLs    105.98 µs -> 72.27 µs  -30.7%
  StringView optional,  no NULLs    104.62 µs -> 75.41 µs  -29.2%
  StringView optional, half NULLs   141.86 µs -> 132.20 µs  -6.8%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4262428096-1396-tjkfl 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing optimize-byte-view-dict-decoder (fe1728d) to 89b1497 (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                             main                                   optimize-byte-view-dict-decoder
-----                                             ----                                   -------------------------------
arrow_reader_clickbench/async/Q1                  1.01   1101.7±5.83µs        ? ?/sec    1.00   1094.3±7.81µs        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.04      6.7±0.08ms        ? ?/sec    1.00      6.4±0.05ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.03      7.7±0.10ms        ? ?/sec    1.00      7.4±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.02     14.7±0.12ms        ? ?/sec    1.00     14.5±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.01     17.4±0.12ms        ? ?/sec    1.00     17.2±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.01     16.2±0.10ms        ? ?/sec    1.00     16.0±0.09ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.02      3.1±0.03ms        ? ?/sec    1.00      3.1±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.15     95.1±2.13ms        ? ?/sec    1.00     82.5±9.33ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.07    108.7±4.81ms        ? ?/sec    1.00    101.4±5.42ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.00   131.3±10.79ms        ? ?/sec    1.01    132.4±7.64ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.05    254.2±2.16ms        ? ?/sec    1.00    242.0±1.92ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.04     20.2±0.20ms        ? ?/sec    1.00     19.4±0.09ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.05     59.7±0.55ms        ? ?/sec    1.00     57.0±0.17ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.04     60.2±0.61ms        ? ?/sec    1.00     57.7±0.19ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.03     18.8±0.12ms        ? ?/sec    1.00     18.2±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 1.04     15.8±0.24ms        ? ?/sec    1.00     15.2±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 1.02      5.4±0.04ms        ? ?/sec    1.00      5.3±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.05     14.0±0.27ms        ? ?/sec    1.00     13.4±0.13ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.06     25.5±0.54ms        ? ?/sec    1.00     24.0±0.18ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 1.03      5.9±0.06ms        ? ?/sec    1.00      5.7±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 1.01      5.0±0.04ms        ? ?/sec    1.00      5.0±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.02      3.6±0.03ms        ? ?/sec    1.00      3.5±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.01   1076.9±6.54µs        ? ?/sec    1.00   1061.2±4.90µs        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.06      6.6±0.07ms        ? ?/sec    1.00      6.3±0.07ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.04      7.5±0.07ms        ? ?/sec    1.00      7.3±0.07ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.02     14.7±0.12ms        ? ?/sec    1.00     14.4±0.06ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.03     17.5±0.15ms        ? ?/sec    1.00     17.0±0.07ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.02     16.3±0.12ms        ? ?/sec    1.00     15.9±0.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.03      3.0±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.03     73.6±0.82ms        ? ?/sec    1.00     71.4±0.17ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.03     82.0±0.67ms        ? ?/sec    1.00     79.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.04    100.8±0.80ms        ? ?/sec    1.00     97.4±0.42ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.10    242.1±6.28ms        ? ?/sec    1.00    220.7±7.28ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.03     19.7±0.24ms        ? ?/sec    1.00     19.2±0.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.03     58.3±0.55ms        ? ?/sec    1.00     56.7±0.19ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.02     58.5±0.69ms        ? ?/sec    1.00     57.4±0.24ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.02     18.5±0.15ms        ? ?/sec    1.00     18.0±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    1.03     15.2±0.23ms        ? ?/sec    1.00     14.9±0.11ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    1.01      5.4±0.03ms        ? ?/sec    1.00      5.3±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.02     13.5±0.23ms        ? ?/sec    1.00     13.3±0.10ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.04     24.4±0.52ms        ? ?/sec    1.00     23.6±0.16ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    1.02      5.6±0.06ms        ? ?/sec    1.00      5.5±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    1.02      4.9±0.05ms        ? ?/sec    1.00      4.8±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.01      3.4±0.02ms        ? ?/sec    1.00      3.4±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00    873.9±2.03µs        ? ?/sec    1.00    871.5±3.76µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.06      5.1±0.02ms        ? ?/sec    1.00      4.8±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.06      6.1±0.02ms        ? ?/sec    1.00      5.7±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.02     21.9±0.06ms        ? ?/sec    1.00     21.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.03     30.8±0.18ms        ? ?/sec    1.00     30.0±0.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.03     23.4±0.14ms        ? ?/sec    1.00     22.8±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.03      2.7±0.02ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.04    125.2±3.83ms        ? ?/sec    1.00    120.5±0.23ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.04     95.2±0.35ms        ? ?/sec    1.00     91.7±0.35ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  1.01    140.2±0.36ms        ? ?/sec    1.00    139.5±3.42ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.07   286.8±14.73ms        ? ?/sec    1.00   267.8±12.67ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.03     27.5±0.07ms        ? ?/sec    1.00     26.7±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.05    111.1±0.26ms        ? ?/sec    1.00    105.9±0.16ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.05    109.2±0.21ms        ? ?/sec    1.00    104.4±0.18ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.03     18.9±0.06ms        ? ?/sec    1.00     18.5±0.11ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  1.01     22.7±0.05ms        ? ?/sec    1.00     22.5±0.09ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  1.01      6.9±0.02ms        ? ?/sec    1.00      6.8±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.00     11.6±0.03ms        ? ?/sec    1.00     11.5±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.02     21.4±0.07ms        ? ?/sec    1.00     20.9±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  1.02      5.3±0.05ms        ? ?/sec    1.00      5.2±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  1.01      5.7±0.04ms        ? ?/sec    1.00      5.6±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.01      4.4±0.03ms        ? ?/sec    1.00      4.3±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 792.1s
Peak memory 3.1 GiB
Avg memory 3.0 GiB
CPU user 701.4s
CPU sys 89.0s
Peak spill 0 B

branch

Metric Value
Wall time 784.3s
Peak memory 3.2 GiB
Avg memory 3.1 GiB
CPU user 713.7s
CPU sys 70.7s
Peak spill 0 B

File an issue against this benchmark runner

Dandandan and others added 2 commits April 17, 2026 06:22
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

parquet Changes to the parquet crate


2 participants