Fix OpenMP critical section contention in IndexBinaryHNSW search by sharm235 · Pull Request #4909 · facebookresearch/faiss

sharm235 · 2026-03-11T20:46:05Z

Summary:

Problem

We recently found that 87% of CPU time was being spent on OpenMP lock contention in FlatHammingDis::~FlatHammingDis, not on useful computation.

The flame graph breakdown:

- 88.2% CPU in openmp_worker threads
- 87.2% in FlatHammingDis::~FlatHammingDis → __kmpc_critical_with_hint → __kmp_acquire_queuing_lock → __sched_yield (84% CPU spinning/yielding on lock)

Root Cause

The FlatHammingDis destructor used #pragma omp critical to accumulate a single size_t counter (hnsw_stats.ndis += ndis). Unnamed #pragma omp critical sections share a global serialization lock — when all OpenMP threads exit the #pragma omp parallel block in IndexBinaryHNSW::search() simultaneously, they ALL enter the destructor at the same time, serializing on that single lock.

With N threads, this means N sequential lock acquisitions where each thread spins/yields waiting for its turn. This is O(N) serialization at the end of every search call.

In IndexBinaryHNSWCagra::search() with base_level_only=true, the situation is even worse: FlatHammingDis is created and destroyed per query iteration inside #pragma omp parallel for, causing n × num_threads critical section entries.
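The contended pattern described above can be sketched roughly as follows. This is a simplified illustration, not the faiss source: `CriticalStatsSketch`, `CriticalDisSketch`, and `run_queries_critical` are hypothetical names, and the real distance computer does far more work than incrementing a counter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the global statistics object (hnsw_stats).
struct CriticalStatsSketch {
    std::size_t ndis = 0;
};

CriticalStatsSketch g_critical_stats;

// Hypothetical stand-in for FlatHammingDis: counts distance computations
// locally and flushes the count to the global object in its destructor.
struct CriticalDisSketch {
    std::size_t ndis = 0;

    void compute_distance() { ++ndis; }

    ~CriticalDisSketch() {
        // Unnamed critical sections all share one lock, so every
        // destructor call in every thread queues up here.
#pragma omp critical
        { g_critical_stats.ndis += ndis; }
    }
};

// One distance computer per query, created inside the parallel loop.
// The destructor, and with it the critical section, runs at the end
// of every iteration.
std::size_t run_queries_critical(int n_queries, int dis_per_query) {
    assert(n_queries >= 0 && dis_per_query >= 0);
    g_critical_stats.ndis = 0;
#pragma omp parallel for
    for (int q = 0; q < n_queries; ++q) {
        CriticalDisSketch dis;
        for (int i = 0; i < dis_per_query; ++i) {
            dis.compute_distance();
        }
    }
    return g_critical_stats.ndis;
}
```

The result is correct; the cost is that threads finishing at about the same time wait on the shared lock instead of doing search work.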

Fix

Replace #pragma omp critical with #pragma omp atomic. Since hnsw_stats.ndis += ndis is a simple size_t addition, #pragma omp atomic typically compiles to a single lock-prefixed hardware instruction (such as lock add or lock xadd on x86-64). That is far cheaper than a lock-based critical section: threads still serialize briefly on the counter's cache line, but they no longer spin or yield waiting for a lock.
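A minimal sketch of the atomic variant, under the assumption that the destructor only needs to add a local counter to a shared one. `AtomicStatsSketch`, `AtomicDisSketch`, and `run_queries_atomic` are hypothetical names used for illustration and do not appear in faiss.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the global statistics object (hnsw_stats).
struct AtomicStatsSketch {
    std::size_t ndis = 0;
};

AtomicStatsSketch g_atomic_stats;

// Hypothetical stand-in for FlatHammingDis with the flush made atomic.
struct AtomicDisSketch {
    std::size_t ndis = 0;

    void compute_distance() { ++ndis; }

    ~AtomicDisSketch() {
        // The update of the shared counter is performed atomically;
        // no lock is acquired, so no thread spins or yields here.
#pragma omp atomic
        g_atomic_stats.ndis += ndis;
    }
};

// Same loop shape as before: one distance computer per query.
std::size_t run_queries_atomic(int n_queries, int dis_per_query) {
    assert(n_queries >= 0 && dis_per_query >= 0);
    g_atomic_stats.ndis = 0;
#pragma omp parallel for
    for (int q = 0; q < n_queries; ++q) {
        AtomicDisSketch dis;
        for (int i = 0; i < dis_per_query; ++i) {
            dis.compute_distance();
        }
    }
    return g_atomic_stats.ndis;
}
```

Only the single `+=` statement is protected, which is all this counter needs. A destructor that updated several related fields together would need a different approach.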

For reference, the float HNSW path in IndexHNSW.cpp already uses the correct pattern: #pragma omp for reduction(+: n1, n2, ndis, nhops) with a single-threaded hnsw_stats.combine() call outside the parallel region.
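The reduction pattern mentioned here can be sketched as below. This assumes a simplified statistics struct with only two counters. `ReductionStatsSketch` and `search_with_reduction` are hypothetical names, and the loop body is a placeholder for the actual per-query search.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, reduced version of a statistics struct with a combine step.
struct ReductionStatsSketch {
    std::size_t ndis = 0;
    std::size_t nhops = 0;

    void combine(const ReductionStatsSketch& other) {
        ndis += other.ndis;
        nhops += other.nhops;
    }
};

ReductionStatsSketch search_with_reduction(int n_queries, int dis_per_query) {
    assert(n_queries >= 0 && dis_per_query >= 0);
    std::size_t ndis = 0;
    std::size_t nhops = 0;

#pragma omp parallel
    {
        // Each thread accumulates into private copies of ndis and nhops;
        // OpenMP sums them into the shared variables when the loop ends.
#pragma omp for reduction(+ : ndis, nhops)
        for (int q = 0; q < n_queries; ++q) {
            ndis += static_cast<std::size_t>(dis_per_query); // placeholder work
            nhops += 1;
        }
    }

    // Outside the parallel region only one thread runs, so the shared
    // statistics object can be updated without any synchronization.
    ReductionStatsSketch local;
    local.ndis = ndis;
    local.nhops = nhops;

    ReductionStatsSketch total;
    total.combine(local);
    return total;
}
```

With this shape the shared statistics object is touched once per search call rather than once per thread or per query.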

Impact

- Eliminates ~87% CPU waste from lock contention in binary HNSW search
- Affects all users of IndexBinaryHNSW::search() and IndexBinaryHNSWCagra::search()
- No change to search results or statistics accuracy: #pragma omp atomic provides the same correctness guarantees as #pragma omp critical for a single += operation

Differential Revision: D95910991

meta-codesync · 2026-03-11T20:46:14Z

@sharm235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95910991.

…ebookresearch#4909)

Reviewed By: mnorris11

Differential Revision: D95910991

meta-codesync · 2026-03-12T06:04:06Z

This pull request has been merged in 5b83ec6.

meta-cla Bot added the CLA Signed label Mar 11, 2026

meta-codesync Bot added fb-exported meta-exported labels Mar 11, 2026

sharm235 force-pushed the export-D95910991 branch from d745c8b to 8903e72 on March 11, 2026 21:47

meta-codesync Bot closed this in 5b83ec6 Mar 12, 2026

facebook-github-bot added the Merged label Mar 12, 2026

sharm235 wants to merge 1 commit into facebookresearch:main from sharm235:export-D95910991
