[Parquet] Improve dictionary decoder by unrolling loops #9662
Dandandan merged 9 commits into apache:main
Conversation
Add `from_u64` method to `FromBytes` trait to convert directly from the u64 bit buffer, eliminating the intermediate `as_bytes()` → `try_from_le_slice()` round-trip that copied through a byte slice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
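The change described in this commit can be sketched as follows. The trait and method names mirror the commit message, but the surrounding code is a simplified standalone illustration under our own names, not the actual arrow-rs implementation:

```rust
// Sketch: convert the low bits of the u64 bit buffer directly into the
// target type, instead of round-tripping through a little-endian byte slice.
trait FromU64 {
    fn from_u64(v: u64) -> Self;
}

impl FromU64 for i32 {
    fn from_u64(v: u64) -> Self {
        v as i32
    }
}

impl FromU64 for f32 {
    fn from_u64(v: u64) -> Self {
        // Floats are reinterpreted from their bit pattern.
        f32::from_bits(v as u32)
    }
}

// Old path (what the commit removes): u64 -> LE bytes -> slice -> T.
fn via_bytes(v: u64) -> i32 {
    let bytes = v.to_le_bytes();
    i32::from_le_bytes(bytes[..4].try_into().unwrap())
}

fn main() {
    let buffered: u64 = 0x2A;
    // Both paths agree; the direct one avoids the intermediate slice.
    assert_eq!(i32::from_u64(buffered), via_bytes(buffered));
    assert_eq!(i32::from_u64(buffered), 42);
    println!("ok");
}
```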
run benchmark arrow_reader_clickbench
run benchmark arrow_reader
Pull request overview
This PR aims to speed up bit-unpacking by avoiding intermediate byte-slice/array conversions in `BitReader::get_value`, introducing a direct conversion path from the internal u64 bit buffer into the requested output type.
Changes:
- Add `FromBytes::from_u64(u64) -> Self` to convert directly from the `u64` bit-buffer value.
- Implement `from_u64` for integer, float, and bool types (floats via `from_bits`).
- Update `BitReader::get_value` to use `Some(T::from_u64(v))` instead of converting via byte slices.
I would have expected the compiler to optimize these copies away. Still might be a good idea, since it also simplifies the logic a bit. I don't quite like how the method is intentionally unsupported for some of the types.
These methods are quite hot in DataFusion profiles, so I think they are likely not optimized away (I'll run some benchmarks and/or check the generated code).
Agreed. I'll check!
Replace unreachable!() stubs for from_u64 on Int96/ByteArray/FixedLenByteArray with a separate FromBitpacked trait that is only implemented for types that can actually be converted from u64 (primitives, floats, bool). Also convert assert! to debug_assert! for num_bits bounds checks in get_value/get_batch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
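The separate-trait approach from this commit can be illustrated like so. The trait name matches the commit message, but the rest is a simplified sketch, not the exact arrow-rs code:

```rust
// Instead of a from_u64 stub that panics with unreachable!() for
// byte-array types, implement a dedicated trait only for types that can
// actually be produced from the u64 bit buffer.
trait FromBitpacked: Sized {
    fn from_u64(v: u64) -> Self;
}

// Blanket-style impls for the integer widths via a small macro.
macro_rules! impl_from_bitpacked_int {
    ($($t:ty),*) => {
        $(impl FromBitpacked for $t {
            fn from_u64(v: u64) -> Self { v as $t }
        })*
    };
}
impl_from_bitpacked_int!(u8, u16, u32, u64, i8, i16, i32, i64);

impl FromBitpacked for bool {
    fn from_u64(v: u64) -> Self { v & 1 == 1 }
}

impl FromBitpacked for f64 {
    fn from_u64(v: u64) -> Self { f64::from_bits(v) }
}

// A type like ByteArray simply never implements FromBitpacked, so misuse
// becomes a compile-time error instead of a runtime unreachable!() panic.

fn main() {
    assert_eq!(i16::from_u64(300), 300);
    assert!(bool::from_u64(1));
    println!("ok");
}
```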
Currently this doesn't look like a performance change.
Use chunks_exact(8) to process dictionary index lookups in groups of 8, allowing the compiler to unroll the inner loop and pipeline dependent memory accesses. This gives ~12% improvement on dictionary-encoded reads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
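A minimal sketch of this chunked lookup, assuming a flat `i32` dictionary; function and variable names are ours, and the real decoder is more involved:

```rust
// Process dictionary indices in chunks of 8 so the compiler can unroll
// the inner loop and overlap the dependent memory accesses.
fn dict_lookup_unrolled(indices: &[u32], dict: &[i32], out: &mut Vec<i32>) {
    let mut chunks = indices.chunks_exact(8);
    for chunk in &mut chunks {
        // Fixed-length inner loop: a candidate for full unrolling.
        for &idx in chunk {
            out.push(dict[idx as usize]);
        }
    }
    // Handle the tail of fewer than 8 indices.
    for &idx in chunks.remainder() {
        out.push(dict[idx as usize]);
    }
}

fn main() {
    let dict = vec![10, 20, 30, 40];
    let indices = vec![0, 1, 2, 3, 3, 2, 1, 0, 1, 2];
    let mut out = Vec::new();
    dict_lookup_unrolled(&indices, &dict, &mut out);
    assert_eq!(out, vec![10, 20, 30, 40, 40, 30, 20, 10, 20, 30]);
}
```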
Instead of bounds-checking each dictionary index individually, compute the max index across the 8-element chunk and check once. This allows using get_unchecked in the inner loop, improving dictionary-encoded reads by ~17% (up from ~12% with per-element checks). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run benchmark arrow_reader_clickbench
Replace max()+assert with all(|i| i < len)+assert for the per-chunk dictionary bounds check. This avoids the reduction and allows the compiler to vectorize 8 independent comparisons, improving dictionary reads by ~19% over baseline (vs ~17% with max). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
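The final form of the bounds check can be sketched as below, combining the per-chunk `all` check with unchecked indexing in the inner loop. This is an illustration under our own names, not the exact arrow-rs code:

```rust
// Validate all 8 indices of a chunk up front with `all` (eight
// independent comparisons the compiler can vectorize), then index
// without per-element bounds checks.
fn dict_lookup_checked_per_chunk(indices: &[u32], dict: &[i32], out: &mut Vec<i32>) {
    let mut chunks = indices.chunks_exact(8);
    for chunk in &mut chunks {
        // One branch per chunk instead of one per element.
        assert!(chunk.iter().all(|&i| (i as usize) < dict.len()));
        for &idx in chunk {
            // SAFETY: every index in this chunk was checked just above.
            out.push(unsafe { *dict.get_unchecked(idx as usize) });
        }
    }
    // Tail of fewer than 8 indices: plain checked indexing is fine here.
    for &idx in chunks.remainder() {
        out.push(dict[idx as usize]);
    }
}

fn main() {
    let dict = vec![1, 2, 3];
    let indices = vec![2, 1, 0, 2, 2, 1, 0, 0, 1];
    let mut out = Vec::new();
    dict_lookup_checked_per_chunk(&indices, &dict, &mut out);
    assert_eq!(out, vec![3, 2, 1, 3, 3, 2, 1, 1, 2]);
}
```

Compared with the earlier `max()`-based variant, the `all` form avoids a sequential reduction across the chunk, which is what lets the comparisons vectorize.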
This now yields -20% on my machine for dictionary decoding i32.
alamb
left a comment
I went over this carefully and it makes sense to me. Thank you @Dandandan and @jhorstmann
run benchmark arrow_reader_clickbench
run benchmark arrow_reader
run benchmark arrow_reader_clickbench
Change assert! to debug_assert! in read_num_bytes, RleEncoder::new_from_buf, and RleDecoder::get_batch_with_dict where the checks are either redundant (subsequent indexing already panics) or in cold constructor code. Verified via cargo-asm that this removes instructions from hot paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
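The effect of this change can be illustrated with a toy version of a little-endian byte reader (name and body are ours, not the arrow-rs code). `debug_assert!` compiles to nothing in release builds, so a check that duplicates what the subsequent indexing enforces anyway costs nothing on the hot path while still firing in debug and test builds:

```rust
// Toy sketch of an assert! -> debug_assert! conversion on a hot path.
fn read_num_bytes_sketch(size: usize, src: &[u8]) -> u64 {
    // Checked in debug builds only; in release, the slice indexing below
    // still panics on out-of-bounds, so safety is unchanged.
    debug_assert!(size <= src.len());
    let mut v: u64 = 0;
    for (i, &b) in src[..size].iter().enumerate() {
        v |= (b as u64) << (8 * i); // little-endian accumulation
    }
    v
}

fn main() {
    assert_eq!(read_num_bytes_sketch(2, &[0x34, 0x12, 0xFF]), 0x1234);
    println!("ok");
}
```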
Thanks @alamb for improving Parquet decoding perf piece by piece 🚀
Indeed -- step by step we'll have this thing screaming |
Which issue does this PR close?
#9670
Rationale
Improve dictionary decoding by unrolling loads / OOB checks (largest win: -15% for i32) and optimizing the `try_from_le_slice` usage to load from u64 instead (small win, mostly readability).
What changes are included in this PR?
Are these changes tested?
Existing tests cover this path thoroughly (21 bit_util tests pass).
🤖 Generated with Claude Code