faiss: parallelize post-BLAS reduction loop and end_multiple() in result handlers#5185
Closed
alibeklfc wants to merge 1 commit into
Conversation
…ult handlers

Summary:

Three sequential post-BLAS / `end_multiple` loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

**Changes**

1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]`/`ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations.
2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`.
3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (a partial sort, O(capacity)) was sequential despite `add_results` being parallelized. The `// maybe parallel for` comment was removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`.

The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches; above it, parallelization yields 9-14× speedups at 16 threads. Data independence was verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`.

**Benchmark results**

A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions.
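The gated pattern from change 1 can be sketched as follows. This is a minimal illustration, not the faiss code: `reduce_block` is a hypothetical name, and the inner minimum scan stands in for the real cmax accumulation. What it shows is the structural point the summary makes — each query `i` owns a disjoint slice of the output arrays, so the loop is safe to parallelize, and the `if` clause keeps the pragma a no-op on small batches.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the gated post-BLAS reduction (names follow the PR; the
// distance math is a placeholder). Query i reads row (i - i0) of
// ip_block and writes only dis_tab[i]/ids_tab[i], so iterations are
// independent. The `if` clause disables thread fan-out when the batch
// is too small for parallelism to pay off.
void reduce_block(
        size_t i0,
        size_t i1,
        size_t nb,
        const float* ip_block, // (i1 - i0) x nb values from sgemm_
        float* dis_tab,
        long* ids_tab) {
#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)
    for (long i = (long)i0; i < (long)i1; i++) {
        const float* ip_line = ip_block + (i - i0) * nb;
        float best = ip_line[0];
        long best_j = 0;
        for (size_t j = 1; j < nb; j++) {
            if (ip_line[j] < best) { // keep the minimum distance
                best = ip_line[j];
                best_j = (long)j;
            }
        }
        dis_tab[i] = best;
        ids_tab[i] = best_j;
    }
}
```

Compiled without `-fopenmp` the pragma is ignored and the loop runs serially, which is why the gate adds no cost for serial callers.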
`HeapBlockResultHandler::end_multiple` (us, lower better):

| nq | k | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16 |
|------:|-----:|-----------:|---------:|------------:|----------:|--------------:|
| 64 | 10 | 9.2 | 7.2 | 8.1 | 8.3 | 0.98× (gated) |
| 64 | 100 | 340 | 345 | 318 | 67 | 4.79× |
| 64 | 1000 | 5,796 | 5,700 | 5,886 | 501 | 11.76× |
| 512 | 100 | 2,811 | 2,769 | 2,677 | 312 | 8.59× |
| 512 | 1000 | 46,109 | 46,070 | 45,758 | 3,778 | 12.11× |
| 4096 | 100 | 22,041 | 21,588 | 21,672 | 1,869 | 11.60× |
| 4096 | 1000 | 369,069 | 376,541 | 372,481 | 25,442 | 14.64× |

`ReservoirBlockResultHandler::end_multiple` (us):

| nq | k | parent t=16 | this t=16 | speedup |
|------:|-----:|------------:|----------:|--------------:|
| 64 | 10 | 18.0 | 18.1 | 0.99× (gated) |
| 64 | 100 | 659 | 96 | 6.86× |
| 64 | 1000 | 7,592 | 553 | 13.73× |
| 512 | 100 | 5,498 | 490 | 11.21× |
| 512 | 1000 | 59,548 | 4,677 | 12.73× |
| 4096 | 100 | 44,064 | 3,230 | 13.64× |
| 4096 | 1000 | 476,388 | 32,237 | 14.78× |

`IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms):

| nb | nq | k | parent t=16 | this t=16 | speedup |
|------:|------:|----:|------------:|----------:|--------:|
| 1024 | 1024 | 10 | 1.71 | 1.45 | 1.18× |
| 1024 | 4096 | 100 | 58.5 | 15.5 | 3.78× |
| 4096 | 4096 | 100 | 76.9 | 39.4 | 1.95× |

Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers.

Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (the heap handler, also called on this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE was not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy.
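The `end_multiple` changes (2 and 3) reduce to the same shape. Below is a minimal stand-in, not the faiss implementation: `end_multiple_sketch` is a hypothetical name, and `std::sort_heap` is used in place of faiss's `heap_reorder` — both finalize a per-query max-heap into ascending order at O(k log k) cost, and each query's heap is a disjoint slab, which is what makes the parallel-for legal. The gate matches the PR's `(i1 - i0) * k >= 1024` work threshold.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified model of end_multiple: queries i0..i1 each own a max-heap
// of their k best distances, stored contiguously. Finalization sorts
// each heap independently; the `if` clause skips thread fan-out when
// total work (queries x k) is below the measured break-even point.
void end_multiple_sketch(size_t i0, size_t i1, size_t k, float* heap_dis) {
#pragma omp parallel for schedule(static) if ((i1 - i0) * k >= 1024)
    for (long i = (long)i0; i < (long)i1; i++) {
        float* line = heap_dis + (i - i0) * k;
        // std::sort_heap plays the role of faiss's heap_reorder here:
        // same O(k log k) per query, same per-query independence.
        std::sort_heap(line, line + k);
    }
}
```

Because no iteration touches another query's slab, the result is identical whether the loop runs on one thread or sixteen.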
Differential Revision: D103830810
Contributor
@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103830810.
Contributor
Noticed this earlier. Glad it's addressed now.
Contributor
This pull request has been merged in 2322afd. |