parquet: speed up ByteView dictionary decoder with chunks_exact gather (~28%) #9745
Draft
Dandandan wants to merge 4 commits into apache:main from optimize-byte-view-dict-decoder
Conversation
Replace the `extend(keys.iter().map(...))` loop in `ByteViewArrayDecoderDictionary::read` with a `chunks_exact(8)` loop that bulk-validates each chunk's keys, then uses `get_unchecked` gather plus raw-pointer writes. Matches the pattern in `RleDecoder::get_batch_with_dict`. Drops the per-element bounds check, the per-element `error.is_none()` branch, and `Vec::extend`'s per-push capacity check. Invalid keys now return an error eagerly via a cold helper instead of zero-filling and deferring.

Dictionary-decode microbenchmarks (parquet/benches/arrow_reader.rs):

BinaryView mandatory, no NULLs    102.91 µs -> 74.29 µs   -27.8%
BinaryView optional, no NULLs     104.63 µs -> 76.65 µs   -26.9%
BinaryView optional, half NULLs   143.25 µs -> 132.46 µs   -7.3%
StringView mandatory, no NULLs    105.98 µs -> 73.87 µs   -28.8%
StringView optional, no NULLs     104.62 µs -> 76.34 µs   -27.4%
StringView optional, half NULLs   141.86 µs -> 131.85 µs   -7.1%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
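The chunked-gather shape described above can be sketched roughly as follows. This is an illustrative sketch, not the actual arrow-rs code: the types are simplified (`u128` views, a `String` error), `gather_views` is a made-up name, and the real decoder writes through raw pointers rather than `push`.

```rust
/// Illustrative sketch of the chunked dictionary gather: bulk-validate each
/// 8-key chunk once, then gather with `get_unchecked`; handle the tail with
/// checked indexing.
fn gather_views(keys: &[i32], dict: &[u128], out: &mut Vec<u128>) -> Result<(), String> {
    out.reserve(keys.len());
    let mut chunks = keys.chunks_exact(8);
    for chunk in &mut chunks {
        // One validation per chunk replaces eight per-element bounds checks.
        // `k as u32` turns any negative (corrupt) key into a huge value.
        let max = chunk.iter().fold(0u32, |m, &k| m.max(k as u32));
        if max as usize >= dict.len() {
            return Err(invalid_key());
        }
        for &k in chunk {
            // SAFETY: every key in this chunk was validated above.
            out.push(unsafe { *dict.get_unchecked(k as usize) });
        }
    }
    for &k in chunks.remainder() {
        // Tail of fewer than 8 keys: plain checked indexing is fine here.
        out.push(*dict.get(k as usize).ok_or_else(invalid_key)?);
    }
    Ok(())
}

/// Keep the error path out of the hot loop, as the commit describes.
#[cold]
fn invalid_key() -> String {
    "dictionary index out of bounds".to_string()
}
```

Invalid keys fail eagerly through the `#[cold]` helper instead of being zero-filled and checked after the loop.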
Contributor
Author
run benchmark arrow_reader_clickbench
Two small follow-ups to the chunked-gather rewrite, both driven by
inspecting the aarch64 asm:
1) Rewrite `adjust_buffer_index` without an `if/else` so LLVM emits a
`csel` in the hot chunked loop. Previously the main 8-key gather
went through an out-of-line block with a conditional branch per
view; now each view is 5 branchless instructions (ldp/cmp/csel/
add/stp).
2) Replace `chunk.iter().all(|&k| cond)` with a max-reduction over
`u32` keys. `.all()` short-circuits, which blocks vectorisation —
LLVM emitted 8 sequential `ldrsw+cmp+b.ls`. The max-reduction
compiles on aarch64 NEON to:
ldp q1, q0, [x1] ; one load, 8 keys
umax.4s v2, v1, v0 ; pairwise lane max
umaxv.4s s2, v2 ; horizontal reduce
cmp w13, w22 ; one compare
b.hs <cold error path> ; one branch
The NEON registers are then reused for the gather (`fmov`/`mov.s
v[i]`) so keys are loaded exactly once.
Casting keys via `k as u32` correctly rejects any negative i32
(corrupt data) because a negative value becomes a large u32.
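The two follow-ups can be sketched as below. The 16-byte view layout assumed here (bits 0..32 = length, bits 64..96 = buffer index for views longer than 12 bytes) is an illustration rather than the exact arrow-rs layout, and `chunk_valid_all`/`chunk_valid_max` are made-up names; the branchless select and the non-short-circuiting max-reduction are the techniques this commit describes.

```rust
/// Add `base` to the buffer-index field of a long view; leave short views
/// untouched. The 0/1 multiply lets LLVM lower this to a conditional select
/// (`csel` on aarch64) instead of a branch per view.
#[inline]
fn adjust_buffer_index(view: u128, base: u32) -> u128 {
    let len = view as u32;
    let is_long = (len > 12) as u128; // 1 iff the view references a buffer
    view.wrapping_add((is_long * base as u128) << 64)
}

/// Old-style check: short-circuits on the first invalid key, which blocks
/// vectorisation (eight sequential load/compare/branch sequences).
fn chunk_valid_all(chunk: &[i32; 8], dict_len: usize) -> bool {
    chunk.iter().all(|&k| (k as u32 as usize) < dict_len)
}

/// New-style check: reduce all 8 keys to one max, then compare once. The
/// straight-line reduction is vectorisable (umax/umaxv on NEON); casting
/// `k as u32` maps negative (corrupt) keys to huge values that fail.
fn chunk_valid_max(chunk: &[i32; 8], dict_len: usize) -> bool {
    let max = chunk.iter().fold(0u32, |m, &k| m.max(k as u32));
    (max as usize) < dict_len
}
```

Both checks accept and reject the same chunks; only the generated code differs.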
Microbenchmark deltas over the previous commit (criterion, aarch64):
BinaryView mandatory, no NULLs 74.29 µs -> 72.96 µs -1.8%
BinaryView optional, no NULLs 76.65 µs -> 75.01 µs -2.1%
StringView mandatory, no NULLs 73.87 µs -> 72.27 µs -2.2%
StringView optional, no NULLs 76.34 µs -> 75.41 µs -1.2%
Cumulative vs. main HEAD (89b1497):
BinaryView mandatory, no NULLs 102.91 µs -> 72.96 µs -29.2%
BinaryView optional, no NULLs 104.63 µs -> 75.01 µs -28.4%
BinaryView optional, half NULLs 143.25 µs -> 133.06 µs -7.4%
StringView mandatory, no NULLs 105.98 µs -> 72.27 µs -30.7%
StringView optional, no NULLs 104.62 µs -> 75.41 µs -29.2%
StringView optional, half NULLs 141.86 µs -> 132.20 µs -6.8%
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 Arrow criterion benchmark running (GKE): comparing optimize-byte-view-dict-decoder (fe1728d) to 89b1497 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
Which issue does this PR close?

None: a targeted optimisation surfaced by profiling `profile_clickbench` locally.

Rationale for this change
`ByteViewArrayDecoderDictionary::read` is the inner loop for reading dictionary-encoded `StringView`/`BinaryView` columns. Per element, the previous implementation paid a bounds-checked `get`, a `Some`/`None` branch, an `error.is_none()` branch on the happy path, and `Vec::extend`'s per-push capacity check (`Map<_, closure>` doesn't get the `TrustedLen` specialisation).

What changes are included in this PR?
Two commits:

1. `chunks_exact(8)` gather. Rewrite the inner loop to bulk-validate each 8-key chunk, then use `get_unchecked` and raw-pointer writes. Same idiom as `RleDecoder::get_batch_with_dict`. Invalid keys now return an error eagerly via a `#[cold]` helper instead of zero-filling and deferring.
2. Branchless helpers driven by asm inspection. `adjust_buffer_index` is rewritten as `view.wrapping_add((is_long * base as u128) << 64)` so LLVM emits `csel` in the chunked loop (previously a `b.hs` to an out-of-line adjustment block per view). `.all(|&k| cond)` is replaced with a `u32` max-reduction, since `.all()` short-circuits and blocked vectorisation. On aarch64 the check now compiles to `ldp q1,q0 + umax.4s + umaxv.4s + cmp + b.hs`: one SIMD load and one branch, reusing the NEON registers for the gather.

Casting keys via `k as u32` correctly rejects any negative i32 (corrupt data) because the value becomes a large u32.

Are these changes tested?
Existing unit tests in `byte_view_array` pass: `test_byte_array_string_view_decoder`, `test_byte_view_array_plain_decoder_reuse_buffer`.

Microbenchmarks (`parquet/benches/arrow_reader.rs`, `arrow_array_reader/(String|Binary)ViewArray/dictionary *`, aarch64 / Apple Silicon): results are in the commit messages above. The half-NULL cases gain less because roughly half the views are null padding rather than gather output.
Are there any user-facing changes?

None: same public API, same semantics (invalid dictionary indices still surface as `ParquetError::General`).

🤖 Generated with Claude Code