
faiss: parallelize post-BLAS reduction loop and end_multiple() in result handlers #5185

Closed

alibeklfc wants to merge 1 commit into facebookresearch:main from alibeklfc:export-D103830810

Conversation

@alibeklfc
Contributor

Summary:
Three sequential post-BLAS / `end_multiple` loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

**Changes**

1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]`/`ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations.

2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`.

3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (partial sort, O(capacity)) was sequential despite `add_results` being parallelized. The `// maybe parallel for` comment is removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`.
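The shape of change 1 can be sketched as follows. This is a hypothetical simplification, not the faiss source: `reduce_block` and its parameters are illustrative names, and the top-1 reduction stands in for the cmax result handling. It shows why the iterations are independent — each query `i` reads only its own row of `ip_block` and writes only `dis_tab[i]`/`ids_tab[i]` — and where the gated pragma goes:

```cpp
#include <cstdint>
#include <limits>

// Sketch of the post-sgemm reduction pattern in
// exhaustive_L2sqr_blas_cmax (simplified to a top-1 search).
// ip_block holds inner products for queries i0..i1 against nj
// database vectors. Each query i touches a disjoint row of ip_block
// and a disjoint slot of dis_tab/ids_tab, so the loop can be
// parallelized over queries; the if(...) clause keeps it serial
// for small batches where thread fan-out would dominate.
void reduce_block(
        int64_t i0, int64_t i1, int64_t nj,
        const float* ip_block,  // (i1 - i0) x nj inner products
        const float* x_norms,   // ||x_i||^2 per query
        const float* y_norms,   // ||y_j||^2 per database vector
        float* dis_tab, int64_t* ids_tab) {
#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)
    for (int64_t i = i0; i < i1; i++) {
        const float* ip = ip_block + (i - i0) * nj;
        float best = std::numeric_limits<float>::max();
        int64_t best_j = -1;
        for (int64_t j = 0; j < nj; j++) {
            // recover the squared L2 distance from the BLAS inner product
            float d = x_norms[i] + y_norms[j] - 2 * ip[j];
            if (d < best) {
                best = d;
                best_j = j;
            }
        }
        dis_tab[i] = best;
        ids_tab[i] = best_j;
    }
}
```

Compiled without `-fopenmp` the pragma is simply ignored and the loop runs serially, which is why the change is safe for non-OpenMP builds.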

The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches. Above the threshold, parallelization yields 9-14× speedups at 16 threads. Data independence verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`.
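For changes 2 and 3, the gated per-query finalization loop looks roughly like this. Again a hypothetical sketch under stated assumptions: `reorder_results` is an illustrative name, and a `std::sort` over (distance, id) pairs stands in for faiss's `heap_reorder`; only the gating structure mirrors the PR:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of the end_multiple pattern: each query owns a disjoint
// k-wide slice of dis/ids, so the per-query reorder loop has no
// cross-iteration dependencies. The if(...) clause mirrors the PR's
// (i1 - i0) * k >= 1024 gate: below it, the loop stays serial and
// the pragma costs nothing.
void reorder_results(
        int64_t i0, int64_t i1, size_t k, float* dis, int64_t* ids) {
#pragma omp parallel for schedule(static) \
        if ((i1 - i0) * (int64_t)k >= 1024)
    for (int64_t i = i0; i < i1; i++) {
        float* d = dis + (i - i0) * k;
        int64_t* id = ids + (i - i0) * k;
        // stand-in for heap_reorder: sort this query's slice ascending
        std::vector<std::pair<float, int64_t>> tmp(k);
        for (size_t j = 0; j < k; j++)
            tmp[j] = {d[j], id[j]};
        std::sort(tmp.begin(), tmp.end());
        for (size_t j = 0; j < k; j++) {
            d[j] = tmp[j].first;
            id[j] = tmp[j].second;
        }
    }
}
```

`schedule(static)` is a reasonable fit here because every query does the same O(k log k) amount of work, so equal-sized chunks balance well without dynamic-scheduling overhead.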

**Benchmark results**

A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions.

`HeapBlockResultHandler::end_multiple` (µs, lower is better):

| nq    | k    | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16  |
|------:|-----:|-----------:|---------:|------------:|----------:|--------------:|
| 64    | 10   | 9.2        | 7.2      | 8.1         | 8.3       | 0.98× (gated) |
| 64    | 100  | 340        | 345      | 318         | 67        | 4.79×         |
| 64    | 1000 | 5,796      | 5,700    | 5,886       | 501       | 11.76×        |
| 512   | 100  | 2,811      | 2,769    | 2,677       | 312       | 8.59×         |
| 512   | 1000 | 46,109     | 46,070   | 45,758      | 3,778     | 12.11×        |
| 4096  | 100  | 22,041     | 21,588   | 21,672      | 1,869     | 11.60×        |
| 4096  | 1000 | 369,069    | 376,541  | 372,481     | 25,442    | 14.64×        |

`ReservoirBlockResultHandler::end_multiple` (µs):

| nq    | k    | parent t=16 | this t=16 | speedup       |
|------:|-----:|------------:|----------:|--------------:|
| 64    | 10   | 18.0        | 18.1      | 0.99× (gated) |
| 64    | 100  | 659         | 96        | 6.86×         |
| 64    | 1000 | 7,592       | 553       | 13.73×        |
| 512   | 100  | 5,498       | 490       | 11.21×        |
| 512   | 1000 | 59,548      | 4,677     | 12.73×        |
| 4096  | 100  | 44,064      | 3,230     | 13.64×        |
| 4096  | 1000 | 476,388     | 32,237    | 14.78×        |

`IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms):

| nb    | nq    | k   | parent t=16 | this t=16 | speedup |
|------:|------:|----:|------------:|----------:|--------:|
| 1024  | 1024  | 10  | 1.71        | 1.45      | 1.18×   |
| 1024  | 4096  | 100 | 58.5        | 15.5      | 3.78×   |
| 4096  | 4096  | 100 | 76.9        | 39.4      | 1.95×   |

Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers.
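That no-op behavior can be demonstrated with a small standalone sketch (hypothetical example, not faiss code): when the `if` clause evaluates false, OpenMP executes the loop on a team of one thread, and the same code also builds serially without OpenMP at all, since the pragma is then ignored.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// fallback so the sketch also compiles without -fopenmp,
// where the pragma below is ignored entirely
static int omp_get_num_threads() { return 1; }
#endif

// Reports the team size observed inside a gated parallel-for.
// Below the (illustrative) threshold of 1024 iterations, the
// if(...) clause forces a team of one thread, so the loop runs
// exactly like its serial counterpart.
int threads_used(long work) {
    int nt = 1;
#pragma omp parallel for schedule(static) if (work >= 1024)
    for (long i = 0; i < work; i++) {
        if (i == 0) {
            // only iteration 0 writes, so there is no data race
            nt = omp_get_num_threads();
        }
    }
    return nt;
}
```

For any `work` below the threshold, `threads_used` returns 1 regardless of `OMP_NUM_THREADS`, which is what keeps small-batch and serial callers unaffected.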

Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (the heap handler, also called on this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE was not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy.

Differential Revision: D103830810

@meta-cla Bot added the CLA Signed label May 6, 2026
@meta-codesync
Contributor

meta-codesync Bot commented May 6, 2026

@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103830810.

@mulugetam
Contributor

> Summary: Three sequential post-BLAS / `end_multiple` loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

Noticed this earlier. Glad it's addressed now.

@meta-codesync
Contributor

meta-codesync Bot commented May 6, 2026

This pull request has been merged in 2322afd.
