parquet: speed up ByteView dictionary decoder with chunks_exact gather (~28%) #9745
Draft
Dandandan wants to merge 4 commits into apache:main from optimize-byte-view-dict-decoder
Conversation
Replace the `extend(keys.iter().map(...))` loop in `ByteViewArrayDecoderDictionary::read` with a `chunks_exact(8)` loop that bulk-validates each chunk's keys, then uses `get_unchecked` gather plus raw-pointer writes. Matches the pattern in `RleDecoder::get_batch_with_dict`. Drops the per-element bounds check, the per-element `error.is_none()` branch, and `Vec::extend`'s per-push capacity check. Invalid keys now return an error eagerly via a cold helper instead of zero-filling and deferring.

Dictionary-decode microbenchmarks (parquet/benches/arrow_reader.rs):

BinaryView mandatory, no NULLs    102.91 µs -> 74.29 µs   -27.8%
BinaryView optional, no NULLs     104.63 µs -> 76.65 µs   -26.9%
BinaryView optional, half NULLs   143.25 µs -> 132.46 µs   -7.3%
StringView mandatory, no NULLs    105.98 µs -> 73.87 µs   -28.8%
StringView optional, no NULLs     104.62 µs -> 76.34 µs   -27.4%
StringView optional, half NULLs   141.86 µs -> 131.85 µs   -7.1%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
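The chunked-gather shape described above can be sketched roughly as follows. This is an illustrative sketch, not the actual arrow-rs code: the types are simplified (`u128` views, a `String` error), `gather_views` is a made-up name, and the real decoder writes through raw pointers rather than `push`.

```rust
/// Illustrative sketch of the chunked dictionary gather: bulk-validate each
/// 8-key chunk once, then gather with `get_unchecked`; handle the tail with
/// checked indexing.
fn gather_views(keys: &[i32], dict: &[u128], out: &mut Vec<u128>) -> Result<(), String> {
    out.reserve(keys.len());
    let mut chunks = keys.chunks_exact(8);
    for chunk in &mut chunks {
        // One validation per chunk replaces eight per-element bounds checks.
        // `k as u32` turns any negative (corrupt) key into a huge value.
        let max = chunk.iter().fold(0u32, |m, &k| m.max(k as u32));
        if max as usize >= dict.len() {
            return Err(invalid_key());
        }
        for &k in chunk {
            // SAFETY: every key in this chunk was validated above.
            out.push(unsafe { *dict.get_unchecked(k as usize) });
        }
    }
    for &k in chunks.remainder() {
        // Tail of fewer than 8 keys: plain checked indexing is fine here.
        out.push(*dict.get(k as usize).ok_or_else(invalid_key)?);
    }
    Ok(())
}

/// Keep the error path out of the hot loop, as the commit describes.
#[cold]
fn invalid_key() -> String {
    "dictionary index out of bounds".to_string()
}
```

Invalid keys fail eagerly through the `#[cold]` helper instead of being zero-filled and checked after the loop.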
Contributor
Author
run benchmark arrow_reader_clickbench
Two small follow-ups to the chunked-gather rewrite, both driven by
inspecting the aarch64 asm:
1) Rewrite `adjust_buffer_index` without an `if/else` so LLVM emits a
`csel` in the hot chunked loop. Previously the main 8-key gather
went through an out-of-line block with a conditional branch per
view; now each view is 5 branchless instructions (ldp/cmp/csel/
add/stp).
2) Replace `chunk.iter().all(|&k| cond)` with a max-reduction over
`u32` keys. `.all()` short-circuits, which blocks vectorisation —
LLVM emitted 8 sequential `ldrsw+cmp+b.ls`. The max-reduction
compiles on aarch64 NEON to:
ldp q1, q0, [x1] ; one load, 8 keys
umax.4s v2, v1, v0 ; pairwise lane max
umaxv.4s s2, v2 ; horizontal reduce
cmp w13, w22 ; one compare
b.hs <cold error path> ; one branch
The NEON registers are then reused for the gather (`fmov`/`mov.s
v[i]`) so keys are loaded exactly once.
Casting keys via `k as u32` correctly rejects any negative i32
(corrupt data) because a negative value becomes a large u32.
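The two follow-ups can be sketched as below. The 16-byte view layout assumed here (bits 0..32 = length, bits 64..96 = buffer index for views longer than 12 bytes) is an illustration rather than the exact arrow-rs layout, and `chunk_valid_all`/`chunk_valid_max` are made-up names; the branchless select and the non-short-circuiting max-reduction are the techniques this commit describes.

```rust
/// Add `base` to the buffer-index field of a long view; leave short views
/// untouched. The 0/1 multiply lets LLVM lower this to a conditional select
/// (`csel` on aarch64) instead of a branch per view.
#[inline]
fn adjust_buffer_index(view: u128, base: u32) -> u128 {
    let len = view as u32;
    let is_long = (len > 12) as u128; // 1 iff the view references a buffer
    view.wrapping_add((is_long * base as u128) << 64)
}

/// Old-style check: short-circuits on the first invalid key, which blocks
/// vectorisation (eight sequential load/compare/branch sequences).
fn chunk_valid_all(chunk: &[i32; 8], dict_len: usize) -> bool {
    chunk.iter().all(|&k| (k as u32 as usize) < dict_len)
}

/// New-style check: reduce all 8 keys to one max, then compare once. The
/// straight-line reduction is vectorisable (umax/umaxv on NEON); casting
/// `k as u32` maps negative (corrupt) keys to huge values that fail.
fn chunk_valid_max(chunk: &[i32; 8], dict_len: usize) -> bool {
    let max = chunk.iter().fold(0u32, |m, &k| m.max(k as u32));
    (max as usize) < dict_len
}
```

Both checks accept and reject the same chunks; only the generated code differs.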
Microbenchmark deltas over the previous commit (criterion, aarch64):
BinaryView mandatory, no NULLs 74.29 µs -> 72.96 µs -1.8%
BinaryView optional, no NULLs 76.65 µs -> 75.01 µs -2.1%
StringView mandatory, no NULLs 73.87 µs -> 72.27 µs -2.2%
StringView optional, no NULLs 76.34 µs -> 75.41 µs -1.2%
Cumulative vs. main HEAD (89b1497):
BinaryView mandatory, no NULLs 102.91 µs -> 72.96 µs -29.2%
BinaryView optional, no NULLs 104.63 µs -> 75.01 µs -28.4%
BinaryView optional, half NULLs 143.25 µs -> 133.06 µs -7.4%
StringView mandatory, no NULLs 105.98 µs -> 72.27 µs -30.7%
StringView optional, no NULLs 104.62 µs -> 75.41 µs -29.2%
StringView optional, half NULLs 141.86 µs -> 132.20 µs -6.8%
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 Arrow criterion benchmark running (GKE): comparing optimize-byte-view-dict-decoder (fe1728d) to 89b1497 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
Which issue does this PR close?

None: a targeted optimisation surfaced by profiling `profile_clickbench` locally.

Rationale for this change
`ByteViewArrayDecoderDictionary::read` is the inner loop for reading dictionary-encoded `StringView`/`BinaryView` columns. Per element, the previous implementation paid a bounds-checked `get`, a `Some`/`None` branch, an `error.is_none()` branch on the happy path, and `Vec::extend`'s per-push capacity check (`Map<_, closure>` doesn't get the `TrustedLen` specialisation).

What changes are included in this PR?
Two commits:

1. `chunks_exact(8)` gather. Rewrite the inner loop to bulk-validate each 8-key chunk, then use `get_unchecked` and raw-pointer writes. Same idiom as `RleDecoder::get_batch_with_dict`. Invalid keys now return an error eagerly via a `#[cold]` helper instead of zero-filling and deferring.
2. Branchless helpers driven by asm inspection. `adjust_buffer_index` is rewritten as `view.wrapping_add((is_long * base as u128) << 64)` so LLVM emits `csel` in the chunked loop (previously a `b.hs` to an out-of-line adjustment block per view). `.all(|&k| cond)` is replaced with a `u32` max-reduction, since `.all()` short-circuits and blocked vectorisation. On aarch64 the check now compiles to `ldp q1,q0 + umax.4s + umaxv.4s + cmp + b.hs`: one SIMD load and one branch, reusing the NEON registers for the gather.

Casting keys via `k as u32` correctly rejects any negative i32 (corrupt data) because the value becomes a large u32.

Are these changes tested?
Existing unit tests in `byte_view_array` pass: `test_byte_array_string_view_decoder`, `test_byte_view_array_plain_decoder_reuse_buffer`.

Microbenchmarks (`parquet/benches/arrow_reader.rs`, `arrow_array_reader/(String|Binary)ViewArray/dictionary *`, aarch64 / Apple Silicon): results are in the commit messages above. The half-NULL cases gain less because roughly half the views are null padding rather than gather output.
Are there any user-facing changes?

None: same public API, same semantics (invalid dictionary indices still surface as `ParquetError::General`).

🤖 Generated with Claude Code