GH-50026: [C++][Parquet] SIMD-accelerate SBBF probe via branchless autovec#50030
dmatth1 wants to merge 2 commits into
…ess autovec

Rewrite BlockSplitBloomFilter::FindHash from a short-circuit early-exit loop to a branchless OR-accumulator reduction. The early `return false` blocked compilers from collapsing the 8-lane probe to a horizontal block test; the reduction autovectorizes to a single SSE/NEON block test on clang, gcc, and MSVC.

Wire the probe through CpuInfo runtime dispatch, mirroring the existing level_comparison_avx2 pattern. The shared body in bloom_filter_block_inc.h is built once at the baseline (SSE on x86, NEON on aarch64) and once in bloom_filter_avx2.cc compiled with `-mavx2`. The AVX2 TU spells the reduction in xsimd rather than relying on autovec: clang lowers the autovec body to a single vptest, but gcc/MSVC emit a longer horizontal vpor reduction that costs ~20% out-of-L3. xsimd is guaranteed available under ARROW_HAVE_RUNTIME_AVX2.

A new cross-target diff test calls both probe bodies directly across 20K random + 200 production-populated blocks per CI run, so neither path can silently drift. A static_assert ties the 8-lane assumption to BlockSplitBloomFilter::kBitsSetPerBlock.

On-disk format unchanged. SALT, XXH64, bucket index unchanged. Bit-identical to the scalar reference.

End-to-end FindHash perf via parquet/benches/bloom_filter_benchmark.cc.

M1 (Apple clang -O3, NEON via autovec, 10 reps, CV<=0.4%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|---|---|---|---|
| BM_FindExistingHash (hit-heavy) | 3.85 ns/probe (259.6 M/s) | 2.41 ns/probe (415.1 M/s) | 1.60x |
| BM_FindNonExistingHash (miss-heavy) | 9.04 ns/probe (110.6 M/s) | 2.41 ns/probe (415.4 M/s) | 3.75x |

x86-64 (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV<=0.6%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|---|---|---|---|
| BM_FindExistingHash (hit-heavy) | 8.62 ns/probe (116.0 M/s) | 4.32 ns/probe (231.6 M/s) | 2.00x |
| BM_FindNonExistingHash (miss-heavy) | 15.29 ns/probe (65.4 M/s) | 4.33 ns/probe (230.8 M/s) | 3.53x |

The scalar miss path stalls on the data-dependent early-exit (slower than its own hit path on both archs); the branchless reduction is constant-time across hit and miss. Miss-heavy is the common case for Parquet row-group skipping.

Insert/ComputeHash/batch paths unchanged (16 benches within +/-0.6%). Cache-regime sweep in the PR description. Insert path uses the same loop shape and follows in a separate PR.
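For readers who want to see the two loop shapes side by side, here is a minimal sketch of a per-block probe in both forms. It is not the code from this PR: the function names and the `LaneMask` and `InsertIntoBlock` helpers are hypothetical, and only the salt constants and the `(key * salt) >> 27` bit selection follow the Parquet split-block Bloom filter specification. Selecting the block from the upper hash bits is left out.

```cpp
#include <cstdint>

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr uint32_t kSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Hypothetical helper: the single bit that lane `lane` must have set for `key`.
inline uint32_t LaneMask(uint32_t key, int lane) {
  return uint32_t{1} << ((key * kSalt[lane]) >> 27);
}

// Short-circuit shape: returns as soon as one lane misses, which makes the
// loop exit data-dependent.
inline bool ProbeShortCircuit(const uint32_t* block, uint32_t key) {
  for (int lane = 0; lane < 8; ++lane) {
    if ((block[lane] & LaneMask(key, lane)) == 0) return false;
  }
  return true;
}

// Branchless shape: OR together every required bit that is absent from the
// block and compare once at the end. All 8 lanes are always evaluated.
inline bool ProbeBranchless(const uint32_t* block, uint32_t key) {
  uint32_t missing = 0;
  for (int lane = 0; lane < 8; ++lane) {
    missing |= ~block[lane] & LaneMask(key, lane);
  }
  return missing == 0;
}

// Hypothetical helper, used only to populate a block when trying the probes.
inline void InsertIntoBlock(uint32_t* block, uint32_t key) {
  for (int lane = 0; lane < 8; ++lane) block[lane] |= LaneMask(key, lane);
}
```

Both functions return the same answer for every block and key; the second form simply has no exit inside the loop, which is the property the commit message relies on for autovectorization.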
Pull request overview
This PR accelerates Parquet’s BlockSplitBloomFilter::FindHash probe by reshaping the scalar short-circuit loop into a branchless reduction that autovectorizes, and by adding an AVX2 runtime-dispatched probe kernel for x86 targets.
Changes:
- Rework `BlockSplitBloomFilter::FindHash` to call a dispatchable per-block probe (`FindHashBlockImpl`) implemented as a branchless OR-accumulator reduction.
- Add an AVX2-specific probe implementation in a separate translation unit (`bloom_filter_avx2.cc`) using xsimd, wired through `DynamicDispatch`.
- Add a kernel agreement test that compares baseline vs AVX2 implementations on AVX2-capable hosts.
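The runtime-dispatch idea in the list above can be illustrated with a small sketch. It does not use Arrow's `DynamicDispatch` or `CpuInfo`; every name below is hypothetical, the "AVX2" kernel is a stand-in that reuses the baseline body, and CPU detection assumes the GCC/clang `__builtin_cpu_supports` builtin on x86. Unlike the PR, which resolves at static init, this sketch resolves on first call.

```cpp
#include <cstdint>

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr uint32_t kSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

using ProbeFn = bool (*)(const uint32_t* block, uint32_t key);

// Baseline kernel: branchless OR-accumulator over the 8 lanes.
bool ProbeBaseline(const uint32_t* block, uint32_t key) {
  uint32_t missing = 0;
  for (int lane = 0; lane < 8; ++lane) {
    missing |= ~block[lane] & (uint32_t{1} << ((key * kSalt[lane]) >> 27));
  }
  return missing == 0;
}

// Stand-in for a wider kernel. In a real build this would live in its own
// translation unit compiled with -mavx2; here it reuses the baseline body.
bool ProbeAvx2StandIn(const uint32_t* block, uint32_t key) {
  return ProbeBaseline(block, key);
}

// Assumption: GCC/clang builtins on x86; every other target reports false.
bool HostHasAvx2() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

// Pick a kernel once; later calls go straight through the cached pointer.
bool Probe(const uint32_t* block, uint32_t key) {
  static const ProbeFn fn = HostHasAvx2() ? &ProbeAvx2StandIn : &ProbeBaseline;
  return fn(block, key);
}
```

Because only one kernel is ever selected on a given host, the other one is never reached through `Probe`, which is why a test that calls both kernels directly is useful.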
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cpp/src/parquet/CMakeLists.txt | Adds bloom_filter_avx2.cc to Parquet sources under runtime-AVX2 builds and applies AVX2 compile flags. |
| cpp/src/parquet/bloom_filter.cc | Introduces DynamicDispatch plumbing and routes FindHash through the new per-block probe kernels. |
| cpp/src/parquet/bloom_filter_test.cc | Adds an AVX2-only cross-kernel agreement test and includes the baseline/AVX2 probe entrypoints. |
| cpp/src/parquet/bloom_filter_block_inc.h | New header containing the baseline branchless per-block probe implementation. |
| cpp/src/parquet/bloom_filter_avx2.cc | New AVX2 probe kernel implementation using xsimd. |
| cpp/src/parquet/bloom_filter_avx2_internal.h | New internal header declaring the AVX2 probe entrypoint (exported for Windows/MinGW test usage). |
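To make the "explicit reduction" in the AVX2 kernel concrete, here is a rough sketch written with raw AVX2 intrinsics. The PR itself uses xsimd, so this is a swapped-in technique for illustration only; the function names are hypothetical, and the `target("avx2")` attribute and `__builtin_cpu_supports` check assume GCC or clang on x86-64. On any other toolchain the sketch falls back to the scalar body.

```cpp
#include <cstdint>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX2_TARGET 1
#endif

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr uint32_t kSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Scalar branchless reference.
bool ProbeScalar(const uint32_t* block, uint32_t key) {
  uint32_t missing = 0;
  for (int lane = 0; lane < 8; ++lane) {
    missing |= ~block[lane] & (uint32_t{1} << ((key * kSalt[lane]) >> 27));
  }
  return missing == 0;
}

#ifdef SKETCH_HAVE_AVX2_TARGET
// All 8 lanes in one 256-bit register; vptest performs the final reduction.
__attribute__((target("avx2")))
bool ProbeAvx2(const uint32_t* block, uint32_t key) {
  const __m256i salt = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(kSalt));
  const __m256i hash = _mm256_mullo_epi32(_mm256_set1_epi32(static_cast<int>(key)), salt);
  const __m256i shift = _mm256_srli_epi32(hash, 27);  // top 5 bits pick the bit index
  const __m256i mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), shift);
  const __m256i words = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
  // testc returns 1 exactly when (~words & mask) is all zero, i.e. every
  // required bit is present in the block.
  return _mm256_testc_si256(words, mask) != 0;
}
#endif

// Uses the AVX2 body when the host supports it, otherwise the scalar one.
bool ProbeBest(const uint32_t* block, uint32_t key) {
#ifdef SKETCH_HAVE_AVX2_TARGET
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return ProbeAvx2(block, key);
#endif
  return ProbeScalar(block, key);
}
```

Spelling the reduction as a single `testc` avoids depending on how a given compiler lowers the autovectorized horizontal OR.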
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
I wonder how the AVX2 path would be faster than the scalar path 🤔
Branchless body alone (no xsimd kernel) on AVX2:
Cache regime sweep: scalar vs xsimd, post-hash probe latency:
These numbers are with the … Can re-bench in-tree with the commit if you want directly comparable numbers.
Rationale for this change
`BlockSplitBloomFilter::FindHash` ships the scalar reference probe: an 8-iteration short-circuit loop. The short-circuit blocks autovectorization, and on miss-heavy workloads (Parquet row-group skipping) the per-lane branch mispredict dominates probe latency.

Closes #50026. Dev list discussion: https://lists.apache.org/thread/omof0fq47tndfd80g5hwp2bvjmzvpb40. Sibling change in Rust: apache/arrow-rs#10011.
What changes are included in this PR?
- Rewrite `FindHash` as a branchless OR-accumulator reduction. The new shape autovectorizes to SSE on x86 and NEON on aarch64 at the baseline.
- Add `bloom_filter_avx2.cc` (xsimd kernel built with `-mavx2`) behind `CpuInfo`-based `DynamicDispatch`, mirroring the existing `level_comparison_avx2` pattern. xsimd was a requirement from the dev thread; the AVX2 target spells the reduction explicitly because gcc/MSVC don't lower the autovec body to a single `vptest`.

Performance

End-to-end `FindHash` via `parquet/benches/bloom_filter_benchmark.cc`.

M1 (Apple clang -O3, NEON via autovec, 10 reps, CV ≤ 0.4%):

- `BM_FindExistingHash` (hit-heavy)
- `BM_FindNonExistingHash` (miss-heavy)

x86-64 (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV ≤ 0.6%):

- `BM_FindExistingHash` (hit-heavy)
- `BM_FindNonExistingHash` (miss-heavy)

The scalar miss path stalls on the data-dependent early-exit (slower than its own hit path on both archs); the branchless reduction is constant-time across hit/miss.

`InsertHash`, `BatchInsertHash`, `ComputeHash`, `BatchComputeHash` all unchanged (16 benches within ±0.6%, inside CV).
Are these changes tested?
Yes. New `BloomFilterProbeKernel` test calls both dispatch targets directly across 20K random blocks + 200 production-populated blocks per CI run, asserting bit-identical output. `DynamicDispatch` resolves once at static init, so without this test the un-picked target would never be exercised in CI.
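As an illustration of what such an agreement check can look like, here is a self-contained sketch that compares two probe bodies over random blocks and over blocks populated by inserts. It is not the `BloomFilterProbeKernel` test from this PR: the harness, the two kernels, the number of keys inserted per block, and all names are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr uint32_t kSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

using ProbeFn = bool (*)(const uint32_t* block, uint32_t key);

inline uint32_t LaneMask(uint32_t key, int lane) {
  return uint32_t{1} << ((key * kSalt[lane]) >> 27);
}

// Reference kernel: short-circuit loop.
bool ProbeReference(const uint32_t* block, uint32_t key) {
  for (int lane = 0; lane < 8; ++lane) {
    if ((block[lane] & LaneMask(key, lane)) == 0) return false;
  }
  return true;
}

// Candidate kernel: branchless OR-accumulator.
bool ProbeCandidate(const uint32_t* block, uint32_t key) {
  uint32_t missing = 0;
  for (int lane = 0; lane < 8; ++lane) missing |= ~block[lane] & LaneMask(key, lane);
  return missing == 0;
}

// Returns how many probes the two kernels answered differently.
int CountDisagreements(ProbeFn a, ProbeFn b, int random_blocks, int populated_blocks,
                       uint32_t seed) {
  std::mt19937 rng(seed);  // fixed seed keeps the run reproducible
  int disagreements = 0;
  uint32_t block[8];
  // Regime 1: uniformly random block words probed with a random key.
  for (int n = 0; n < random_blocks; ++n) {
    for (uint32_t& word : block) word = static_cast<uint32_t>(rng());
    const uint32_t key = static_cast<uint32_t>(rng());
    disagreements += a(block, key) != b(block, key);
  }
  // Regime 2: blocks filled by inserting a few keys, probed with the last
  // inserted key (a guaranteed hit) and a fresh key (usually a miss).
  for (int n = 0; n < populated_blocks; ++n) {
    for (uint32_t& word : block) word = 0;
    uint32_t inserted = 0;
    for (int k = 0; k < 4; ++k) {
      inserted = static_cast<uint32_t>(rng());
      for (int lane = 0; lane < 8; ++lane) block[lane] |= LaneMask(inserted, lane);
    }
    const uint32_t fresh = static_cast<uint32_t>(rng());
    disagreements += a(block, inserted) != b(block, inserted);
    disagreements += a(block, fresh) != b(block, fresh);
  }
  return disagreements;
}
```

A call such as `CountDisagreements(&ProbeReference, &ProbeCandidate, 20000, 200, 12345u)` should return zero when the two bodies are bit-identical; passing function pointers keeps the harness independent of which kernel the dispatcher picked.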
Existing `BasicTest`, `FPPTest`, and `CompatibilityTest` continue to pass on both the scalar baseline and the AVX2 dispatch path.

Are there any user-facing changes?
No. Read-path implementation change only.