faiss: parallelize post-BLAS reduction loop and end_multiple() in result handlers#5185
Closed
alibeklfc wants to merge 1 commit into
Conversation
…ult handlers

Summary:

Three sequential post-BLAS / `end_multiple` loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

**Changes**

1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]`/`ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations.
2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`.
3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (a partial sort, O(capacity)) was sequential despite `add_results` being parallelized. The `// maybe parallel for` comment was removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`.

The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches; above it, parallelization yields 9-14× speedups at 16 threads. Data independence was verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`.

**Benchmark results**

A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions.
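The gated pattern from change 1 can be sketched as follows. This is a minimal illustration, not the faiss code: `reduce_block` is a hypothetical name, and the inner minimum scan stands in for the real cmax accumulation. What it shows is the structural point the summary makes — each query `i` owns a disjoint slice of the output arrays, so the loop is safe to parallelize, and the `if` clause keeps the pragma a no-op on small batches.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the gated post-BLAS reduction (names follow the PR; the
// distance math is a placeholder). Query i reads row (i - i0) of
// ip_block and writes only dis_tab[i]/ids_tab[i], so iterations are
// independent. The `if` clause disables thread fan-out when the batch
// is too small for parallelism to pay off.
void reduce_block(
        size_t i0,
        size_t i1,
        size_t nb,
        const float* ip_block, // (i1 - i0) x nb values from sgemm_
        float* dis_tab,
        long* ids_tab) {
#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)
    for (long i = (long)i0; i < (long)i1; i++) {
        const float* ip_line = ip_block + (i - i0) * nb;
        float best = ip_line[0];
        long best_j = 0;
        for (size_t j = 1; j < nb; j++) {
            if (ip_line[j] < best) { // keep the minimum distance
                best = ip_line[j];
                best_j = (long)j;
            }
        }
        dis_tab[i] = best;
        ids_tab[i] = best_j;
    }
}
```

Compiled without `-fopenmp` the pragma is ignored and the loop runs serially, which is why the gate adds no cost for serial callers.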
`HeapBlockResultHandler::end_multiple` (us, lower better):

| nq | k | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16 |
|------:|-----:|-----------:|---------:|------------:|----------:|--------------:|
| 64 | 10 | 9.2 | 7.2 | 8.1 | 8.3 | 0.98× (gated) |
| 64 | 100 | 340 | 345 | 318 | 67 | 4.79× |
| 64 | 1000 | 5,796 | 5,700 | 5,886 | 501 | 11.76× |
| 512 | 100 | 2,811 | 2,769 | 2,677 | 312 | 8.59× |
| 512 | 1000 | 46,109 | 46,070 | 45,758 | 3,778 | 12.11× |
| 4096 | 100 | 22,041 | 21,588 | 21,672 | 1,869 | 11.60× |
| 4096 | 1000 | 369,069 | 376,541 | 372,481 | 25,442 | 14.64× |

`ReservoirBlockResultHandler::end_multiple` (us):

| nq | k | parent t=16 | this t=16 | speedup |
|------:|-----:|------------:|----------:|--------------:|
| 64 | 10 | 18.0 | 18.1 | 0.99× (gated) |
| 64 | 100 | 659 | 96 | 6.86× |
| 64 | 1000 | 7,592 | 553 | 13.73× |
| 512 | 100 | 5,498 | 490 | 11.21× |
| 512 | 1000 | 59,548 | 4,677 | 12.73× |
| 4096 | 100 | 44,064 | 3,230 | 13.64× |
| 4096 | 1000 | 476,388 | 32,237 | 14.78× |

`IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms):

| nb | nq | k | parent t=16 | this t=16 | speedup |
|------:|------:|----:|------------:|----------:|--------:|
| 1024 | 1024 | 10 | 1.71 | 1.45 | 1.18× |
| 1024 | 4096 | 100 | 58.5 | 15.5 | 3.78× |
| 4096 | 4096 | 100 | 76.9 | 39.4 | 1.95× |

Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers.

Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (the heap handler, also called on this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE was not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy.
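The `end_multiple` changes (2 and 3) reduce to the same shape. Below is a minimal stand-in, not the faiss implementation: `end_multiple_sketch` is a hypothetical name, and `std::sort_heap` is used in place of faiss's `heap_reorder` — both finalize a per-query max-heap into ascending order at O(k log k) cost, and each query's heap is a disjoint slab, which is what makes the parallel-for legal. The gate matches the PR's `(i1 - i0) * k >= 1024` work threshold.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified model of end_multiple: queries i0..i1 each own a max-heap
// of their k best distances, stored contiguously. Finalization sorts
// each heap independently; the `if` clause skips thread fan-out when
// total work (queries x k) is below the measured break-even point.
void end_multiple_sketch(size_t i0, size_t i1, size_t k, float* heap_dis) {
#pragma omp parallel for schedule(static) if ((i1 - i0) * k >= 1024)
    for (long i = (long)i0; i < (long)i1; i++) {
        float* line = heap_dis + (i - i0) * k;
        // std::sort_heap plays the role of faiss's heap_reorder here:
        // same O(k log k) per query, same per-query independence.
        std::sort_heap(line, line + k);
    }
}
```

Because no iteration touches another query's slab, the result is identical whether the loop runs on one thread or sixteen.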
Differential Revision: D103830810
Contributor
@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103830810.
Contributor
Noticed this earlier. Glad it's addressed now.
Contributor
This pull request has been merged in 2322afd. |