perf: Use AArch64 SVE gather to speed up RLE dictionary decoding #10036

@wuleiwuleiwulei

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When reading dictionary-encoded columns from Parquet, RleDecoder::get_batch_with_dict (in parquet/src/encodings/rle.rs) is on a very hot path. In the bit-packed branch, the decoder unpacks the indices into a scratch buffer and then materializes the output with a scalar, per-element dictionary lookup:

buffer[values_read..values_read + num_values]
    .iter_mut()
    .zip(index_buf[..num_values].iter())
    .for_each(|(b, i)| b.clone_from(&dict[*i as usize]));

Each load address depends on a decoded index, so this loop is a gather, and it dominates decode time for dictionary columns with primitive value types. On AArch64 CPUs that implement SVE (e.g. Kunpeng 920B or SVE-capable Neoverse server cores), the scalar loop leaves the hardware gather instructions unused, so dictionary decode is slower than it needs to be on this architecture.
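To make the access pattern concrete, here is a minimal standalone sketch of the same scalar gather written against plain slices instead of the decoder's internal state. The name `gather_scalar` is hypothetical and is not part of the parquet crate.

```rust
/// Scalar dictionary gather: out[k] = dict[indices[k]] for every k.
///
/// Hypothetical helper for illustration only; it mirrors the `clone_from`
/// loop quoted above. Like that loop, it panics on an out-of-range index
/// (a negative `i32` becomes a huge `usize` and fails the bounds check).
pub fn gather_scalar<T: Clone>(dict: &[T], indices: &[i32], out: &mut [T]) {
    assert_eq!(indices.len(), out.len(), "one output slot per index");
    for (slot, &idx) in out.iter_mut().zip(indices) {
        // The address of each load is only known once `idx` has been read,
        // so these are not contiguous loads that widen into plain vector loads.
        slot.clone_from(&dict[idx as usize]);
    }
}
```

Calling it with `dict = [10, 20, 30, 40]` and `indices = [3, 0, 0, 2, 1]` fills the output with `[40, 10, 10, 30, 20]`.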

perf profiling of a TPC-H workload on AArch64 SVE hardware shows this gather as one of the top hotspots in the Parquet read path for dictionary-encoded primitive columns.

Describe the solution you'd like

Add an AArch64-only SVE fast path for the dictionary gather in get_batch_with_dict, keeping the scalar implementation as the fallback:

  • A small #[cfg(target_arch = "aarch64")] module that gathers 4-byte (i32/f32) and 8-byte (i64/f64) dictionary values using SVE indexed loads (ld1w / ld1d with a vector index), processing one vector-length of elements per iteration via whilelt predication (vector-length agnostic).
  • Runtime SVE detection via std::arch::is_aarch64_feature_detected!("sve"), cached in an AtomicU8 so the check amortizes to a single relaxed load on the hot path.
  • The fast path engages only when size_of::<T>() is 4 or 8; all other types, and all non-AArch64 or non-SVE targets, fall back to the existing scalar clone_from loop. Results are bit-for-bit identical to the scalar path; only the gather is accelerated.
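The detection and dispatch rules in the list above could look roughly like the following sketch. It is not the proposed patch: `GatherPath`, `has_sve`, and `select_path` are hypothetical names, and the three-state encoding of the `AtomicU8` is an assumption about how the cache might be laid out.

```rust
use std::mem::size_of;
use std::sync::atomic::{AtomicU8, Ordering};

/// Which gather implementation to run (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatherPath {
    Sve32,
    Sve64,
    Scalar,
}

// Assumed cache encoding: 0 = not probed yet, 1 = no SVE, 2 = SVE present.
const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

static SVE_STATE: AtomicU8 = AtomicU8::new(UNKNOWN);

#[cfg(target_arch = "aarch64")]
fn detect_sve() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

#[cfg(not(target_arch = "aarch64"))]
fn detect_sve() -> bool {
    false // the macro only exists on AArch64 targets
}

/// Returns whether SVE is usable; after the first call this is one relaxed load.
pub fn has_sve() -> bool {
    match SVE_STATE.load(Ordering::Relaxed) {
        PRESENT => true,
        ABSENT => false,
        _ => {
            // Racing threads all compute the same answer, so Relaxed is enough.
            let found = detect_sve();
            SVE_STATE.store(if found { PRESENT } else { ABSENT }, Ordering::Relaxed);
            found
        }
    }
}

/// Picks the SVE path only for 4- and 8-byte element types.
pub fn select_path<T>() -> GatherPath {
    match (has_sve(), size_of::<T>()) {
        (true, 4) => GatherPath::Sve32,
        (true, 8) => GatherPath::Sve64,
        _ => GatherPath::Scalar,
    }
}
```

Note that the sketch checks only the element size. A size check alone does not prove that a bitwise copy is equivalent to `clone`, so this assumes the caller has already restricted `T` to plain primitive value types.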

This is purely additive: no public API change, and no behaviour change on any existing platform.

Measured improvement. Benchmarked on Kunpeng 920B (SVE, 256-bit) over the full TPC-H query set against ~140 GB of data. Build flags were identical for the baseline and the patched build; the only difference is this SVE fast path. Per-function times were measured with perf, aggregated by symbol; each value is the mean of 3 runs. The SVE path was confirmed active at runtime via is_aarch64_feature_detected!("sve").

  • Target function (get_batch_with_dict), summed over all 22 queries: 2875 ms → 1622 ms — a 43.6% reduction (1.77× faster) on the optimized kernel.
  • End-to-end TPC-H (22 queries): total runtime drops by 1.83% (table below). 20 of 22 queries are faster; the two regressions (Q3 at −1.59%, Q10 at −1.01%) are within run-to-run noise. The end-to-end figure is smaller because dictionary decode is only a fraction of total query time; the kernel-level number above isolates the actual win.
Query  before (s)  after (s)  Δ (positive = faster)
Q1          6.037      5.987  +0.83%
Q2          1.208      1.190  +1.45%
Q3          5.584      5.673  −1.59%
Q4          3.190      3.127  +1.98%
Q5          4.432      4.407  +0.57%
Q6          1.473      1.398  +5.06%
Q7          6.441      6.265  +2.73%
Q8          6.057      5.983  +1.23%
Q9         23.494     22.921  +2.44%
Q10         6.302      6.366  −1.01%
Q11         2.465      2.436  +1.18%
Q12         3.070      2.953  +3.82%
Q13         9.954      9.708  +2.46%
Q14         3.676      3.631  +1.24%
Q15         2.862      2.798  +2.25%
Q16         3.458      3.402  +1.62%
Q17         3.349      3.327  +0.66%
Q18        10.431     10.278  +1.46%
Q19         4.756      4.610  +3.07%
Q20         4.888      4.845  +0.87%
Q21        50.797     49.641  +2.28%
Q22         3.937      3.837  +2.55%
Total      167.86     164.78  +1.83%
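For readers who want to re-derive the headline figures, the two metrics used above can be written as a pair of small functions. This is only a restatement of the arithmetic; the function names are hypothetical.

```rust
/// Percentage by which `after` is lower than `before` (positive = faster).
pub fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Speedup factor: how many times faster the `after` measurement is.
pub fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

With the kernel times, `percent_reduction(2875.0, 1622.0)` is about 43.6 and `speedup(2875.0, 1622.0)` is about 1.77. With the table totals, `percent_reduction(167.86, 164.78)` is about 1.83. The two metrics are not interchangeable: a 43.6% reduction in time corresponds to a 1.77× speedup, not 1.436×.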

Describe alternatives you've considered

  • Rely on autovectorization — the compiler does not turn this arbitrary-index gather into SVE gather instructions.
  • std::simd / portable SIMD — gather with arbitrary indices is not available on stable, and portable fixed-width SIMD cannot express SVE's vector-length-agnostic (VLA) gather.
  • Stable std::arch SVE intrinsics — SVE intrinsics are still unstable in Rust, which is why a small, audited asm! block is used; it can be swapped for intrinsics once they stabilize. This is the main difference from existing SIMD in the repo — e.g. arrow-arith's AVX paths are target_feature-gated at compile time, and parquet's simdutf8 path is feature-gated — here runtime detection is needed because SVE availability/width isn't known at compile time for portable binaries.
  • NEON — fixed-width NEON has no true gather instruction, so it offers little benefit for this access pattern.
  • Leave as-is — simplest, but forfeits a meaningful win on a growing class of AArch64 SVE server CPUs.

Additional context

  • Scope is limited to RleDecoder::get_batch_with_dict; the encoder, get, get_batch, and skip are untouched.
  • Prior art in the repo for arch-specific SIMD acceleration: arrow-arith/src/aggregate.rs (AVX512/AVX dispatch) and parquet/src/util/utf8.rs (simdutf8). This proposal follows the same spirit, adding runtime-detected SVE for AArch64.
  • The SVE path uses unsafe inline assembly. Safety contract for each helper: dict must be valid for reads at every index in indices (each index non-negative and below the dictionary length), indices must point to count valid i32s, and output must have count writable slots. The public entry point dispatches into a helper only after confirming SVE availability and checking size_of::<T>().
  • I'm happy to open a PR with the implementation, an SVE-specific test plus a Criterion benchmark, and CI notes for exercising the AArch64 path. I've implemented this with runtime detection (zero cost on other targets, automatic on SVE hardware); happy to gate it behind a Cargo feature instead if you'd prefer a more conservative default.
  • This is my first contribution to arrow-rs, so apologies in advance if I've missed any conventions — happy to adjust the issue/PR format, benchmarks, or anything else per your guidance. Just let me know.
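As a rough illustration of the inline-assembly helper and its safety contract described in the list above, here is a sketch of a 4-byte gather with a safe wrapper. It is not the implementation offered in this issue and has not been validated on the hardware described here; `gather_u32`, `gather_u32_sve`, and `gather_u32_scalar` are hypothetical names, and validating every index up front is an assumption about how the contract might be enforced.

```rust
/// Scalar reference gather: out[k] = dict[indices[k]].
pub fn gather_u32_scalar(dict: &[u32], indices: &[i32], out: &mut [u32]) {
    for (slot, &idx) in out.iter_mut().zip(indices) {
        *slot = dict[idx as usize]; // bounds-checked; panics on a bad index
    }
}

/// SVE gather of 4-byte values (sketch).
///
/// # Safety
/// - The CPU must support SVE.
/// - `indices` must point to `count` readable `i32`s, each non-negative and
///   smaller than the number of elements behind `dict`.
/// - `out` must point to `count` writable `u32` slots.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "sve")]
unsafe fn gather_u32_sve(dict: *const u32, indices: *const i32, out: *mut u32, count: usize) {
    unsafe {
        std::arch::asm!(
            // p0 marks the lanes whose element number i + lane is below count.
            "whilelt p0.s, {i}, {n}",
            "2:",
            // Contiguous load of one vector of indices.
            "ld1w {{z0.s}}, p0/z, [{idx}, {i}, lsl #2]",
            // Gather: lane k loads dict[z0[k]] (index zero-extended, scaled by 4).
            "ld1w {{z1.s}}, p0/z, [{dict}, z0.s, uxtw #2]",
            // Contiguous store of the gathered values.
            "st1w {{z1.s}}, p0, [{dst}, {i}, lsl #2]",
            // Advance by the number of 32-bit lanes in one vector.
            "incw {i}",
            "whilelt p0.s, {i}, {n}",
            // N flag is set while the first lane is still active.
            "b.mi 2b",
            i = inout(reg) 0usize => _,
            n = in(reg) count,
            idx = in(reg) indices,
            dict = in(reg) dict,
            dst = in(reg) out,
            out("v0") _, out("v1") _, out("p0") _,
            options(nostack),
        );
    }
}

/// Safe entry point: uses SVE when available, otherwise the scalar loop.
pub fn gather_u32(dict: &[u32], indices: &[i32], out: &mut [u32]) {
    assert_eq!(indices.len(), out.len(), "one output slot per index");
    if out.is_empty() {
        return;
    }
    #[cfg(target_arch = "aarch64")]
    {
        let in_bounds = indices.iter().all(|&i| i >= 0 && (i as usize) < dict.len());
        if in_bounds && std::arch::is_aarch64_feature_detected!("sve") {
            // SAFETY: SVE was detected, every index was just bounds-checked,
            // and both pointers cover `out.len()` elements.
            unsafe { gather_u32_sve(dict.as_ptr(), indices.as_ptr(), out.as_mut_ptr(), out.len()) };
            return;
        }
    }
    // Fallback; also preserves the scalar panic when an index is out of range.
    gather_u32_scalar(dict, indices, out);
}
```

The up-front index scan keeps the wrapper safe at the cost of an extra pass over the indices. The 8-byte case is not shown: it would additionally need the 32-bit indices widened into 64-bit lanes before an `ld1d` gather.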

Metadata

Labels: enhancement
