perf(parquet): Vectorize dict-index bounds check in RleDecoder::get_batch_with_dict (up to -7.9%)#9746
Dandandan wants to merge 2 commits into `apache:main` from `rle-dict-max-reduction`
Conversation
Force-pushed b3b75e2 to b429fc1
run benchmark arrow_reader_clickbench
Force-pushed 0f17b76 to b429fc1
🤖 Arrow criterion benchmark running (GKE). Comparing rle-dict-max-reduction (b429fc1) to 89b1497 (merge-base).
perf(parquet): Vectorize dict-index bounds check in `RleDecoder::get_batch_with_dict`

Replace `idx_chunk.iter().all(|&i| (i as usize) < dict_len)` with a u32 max-reduction (`fold(0u32, |acc, &i| acc.max(i as u32))`). `.all` short-circuits and so blocks autovectorisation; on aarch64 the old form compiled to eight serialised `ldrsw` + `cmp` + `b.ls` pairs per 8-index chunk, followed by eight separate scalar gather loads. The max-reduction has no early exit, so LLVM now lowers the check to a single `ldp q1, q0` + `umax.4s` + `umaxv.4s` + one `cmp` + `b.ls`, then reuses the loaded NEON registers for the gather that follows.

Negative `i32` values cast to `u32` become large, so the bounds check still rejects them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
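The transformation described above can be sketched as a pair of standalone functions (hypothetical names; the real code is the check inside `RleDecoder::get_batch_with_dict`):

```rust
// Hedged sketch: the old short-circuiting check vs the new max-reduction.
// `.all` exits on the first failing index, which blocks autovectorisation;
// the fold has no early exit, so LLVM can lower it to vector max + one compare.

fn chunk_in_bounds_old(idx_chunk: &[i32], dict_len: usize) -> bool {
    idx_chunk.iter().all(|&i| (i as usize) < dict_len)
}

fn chunk_in_bounds_new(idx_chunk: &[i32], dict_len: usize) -> bool {
    // Negative i32 values cast to u32 become large, so they still fail the check.
    let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
    (max_idx as usize) < dict_len
}
```

Both forms agree on every input, including negative indices and the empty chunk (where both return `true`); only the generated code differs.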
Force-pushed b429fc1 to ab6e019
🤖 Arrow criterion benchmark completed (GKE).
Follow-up to ab6e019. Two changes on the same RLE dict-gather loop:

- `CHUNK = 16` (was 8): lets LLVM widen the `umax.4s` + `umaxv.4s` reduction to cover the whole chunk with one vector max per pair of loads, keeping the ratio of (cheap) bounds check to (cheap) gather in the loop's favour.
- `#[cold] #[inline(never)] fn oob` replaces `assert!`: `max_idx` no longer has to stay live on the hot path for the panic format string, removing a per-chunk stack spill. Panic behaviour on out-of-bounds indices is preserved.

Measured on aarch64 (Apple Silicon) with `cargo bench -p parquet --bench arrow_reader -- "dictionary encoded, mandatory, no NULLs"`:

| bench | chunk 8 | chunk 16 + cold oob | Δ |
| ------------------------ | ------- | ------------------- | ------- |
| BinaryViewArray | 101 µs | 88 µs | -13.0% |
| StringViewArray | 101 µs | 88 µs | -12.6% |
| INT64 Decimal128 | 98 µs | 94 µs | -3.5% |
| INT32 Decimal128 | 87 µs | 84 µs | -2.9% |
| UInt8Array | 52 µs | 51 µs | -1.8% |

View arrays see the biggest win because the inner copy is cheaper, so the bounds check is a larger share of the loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
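The cold-path change can be sketched as follows (standalone, hypothetical function names; the real code lives in `parquet/src/encodings/rle.rs`):

```rust
// Hedged sketch of the #[cold] #[inline(never)] out-of-bounds path. Moving the
// panic formatting into a never-inlined cold function means the hot loop only
// keeps `max_idx` in a register long enough for one compare and branch; the
// format machinery (and its register pressure) lives entirely off the hot path.

const CHUNK: usize = 16;

#[cold]
#[inline(never)]
fn oob(max_idx: u32, dict_len: usize) -> ! {
    panic!("dictionary index {max_idx} out of bounds for dictionary of length {dict_len}");
}

fn check_chunk(idx_chunk: &[i32], dict_len: usize) {
    let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
    if (max_idx as usize) >= dict_len {
        oob(max_idx, dict_len); // cold: taken only on invalid input
    }
}

fn check_all(indices: &[i32], dict_len: usize) {
    for idx_chunk in indices.chunks(CHUNK) {
        check_chunk(idx_chunk, dict_len);
    }
}
```

The `-> !` return type tells the compiler the call never returns, so the fall-through path stays straight-line code.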
run benchmarks
🤖 Benchmark running (GKE). Comparing rle-dict-max-reduction (0ca4c50) to 89b1497 (merge-base).
Benchmark for this request failed.
```rust
    idx_chunk.iter().all(|&i| (i as usize) < dict_len),
    "dictionary index out of bounds"
);
// u32 max-reduction instead of `.all(|&i| ..)`: `.all`
```
Hm, this sounds familiar, I opened an issue about this missed optimization a while ago (rust-lang/rust#113789). The issue seems to still exist.
Oof :)
I compared it now as well to `.fold(true, |a, &i| a & ((i as u32) < dict_len_u32))`; it looks like computing max-then-compare also generates better code (and benchmarks better) than compare-against-max:
max-reduce:

```asm
ldp      q1, q0, [x13, #32]
ldp      q3, q2, [x13]
umax.4s  v4, v2, v0
umax.4s  v3, v3, v1
umax.4s  v3, v3, v4
umaxv.4s s3, v3
fmov     w8, s3
cmp      x21, x8
b.ls     LBB2735_48
```
AND-reduce:

```asm
ldp       q1, q0, [x8, #32]
ldp       q3, q2, [x8]
cmhs.4s   v4, v2, v6
cmhs.4s   v3, v3, v6
uzp1.8h   v3, v3, v4
cmhs.4s   v4, v1, v6
cmhs.4s   v5, v0, v6
uzp1.8h   v4, v4, v5
uzp1.16b  v3, v3, v4
umaxv.16b b3, v3
fmov      w14, s3
tbnz      w14, #0, LBB2735_48
```
```rust
// check still rejects it.
let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
if (max_idx as usize) >= dict_len {
    oob(max_idx, dict_len);
```
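Put together, the pattern under review can be sketched as a standalone function (hypothetical name and signature; the real loop is in `RleDecoder::get_batch_with_dict` and copies into a caller-provided buffer):

```rust
// Hedged sketch: once the max-reduction proves every index in the chunk is in
// bounds, the gather can use get_unchecked without per-element bounds checks.
fn gather_dict(dict: &[u64], indices: &[i32], out: &mut Vec<u64>) {
    const CHUNK: usize = 16;
    let dict_len = dict.len();
    for idx_chunk in indices.chunks(CHUNK) {
        // Vectorisable bounds check: one max over the chunk, one compare.
        let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
        if (max_idx as usize) >= dict_len {
            panic!("dictionary index {max_idx} out of bounds");
        }
        // SAFETY: max_idx < dict_len, and negative i32 indices became large
        // u32 values and were rejected above, so every index is in bounds.
        out.extend(
            idx_chunk
                .iter()
                .map(|&i| unsafe { *dict.get_unchecked(i as usize) }),
        );
    }
}
```

This is where the error-vs-panic question above applies: the `panic!` could become a returned `Err` without touching the unsafe justification.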
We should probably also return an error here instead of panicking, since the panic can be triggered by input provided by a user.
Now I remember there was a draft PR about improving the error handling (#9365), so this could also be done separately.
Ah yeah we could just as well return an error, I agree :)
If you do, please also check the RLE `dict_value` around line 487 and link #9434; otherwise I can do that afterwards.
Which issue does this PR close?
Closes #9747.
Rationale for this change
Rewrite the hot loop so the compiler can generate SIMD instructions and amortize loop/branching overhead.
What changes are included in this PR?
One-file change in `parquet/src/encodings/rle.rs`:

- `CHUNK = 16` indices
- `#[cold] #[inline(never)] fn oob` for the panic path
- `get_unchecked` gather after the vectorised check

Are these changes tested?
Existing tests
Are there any user-facing changes?
No: there is no API change, decoded output is identical, and panic behaviour on out-of-bounds indices is preserved.
🤖 Generated with Claude Code