Fix OpenMP critical section contention in IndexBinaryHNSW search#4909

Closed
sharm235 wants to merge 1 commit into facebookresearch:main from sharm235:export-D95910991

@sharm235

Summary:

Problem

We recently found 87% of CPU was being wasted on OpenMP lock contention in FlatHammingDis::~FlatHammingDis, not on useful computation.

The flame graph breakdown:

  • 88.2% CPU in openmp_worker threads
  • 87.2% in FlatHammingDis::~FlatHammingDis → __kmpc_critical_with_hint → __kmp_acquire_queuing_lock → __sched_yield (84% CPU spinning/yielding on lock)

Root Cause

The FlatHammingDis destructor used #pragma omp critical to accumulate a single size_t counter (hnsw_stats.ndis += ndis). Unnamed #pragma omp critical sections share a global serialization lock — when all OpenMP threads exit the #pragma omp parallel block in IndexBinaryHNSW::search() simultaneously, they ALL enter the destructor at the same time, serializing on that single lock.

With N threads, this means N sequential lock acquisitions where each thread spins/yields waiting for its turn. This is O(N) serialization at the end of every search call.
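The contended pattern described above can be sketched roughly as follows. This is a minimal illustration, not the faiss source: SketchStats, sketch_stats, SketchDistanceComputer, and search_with_critical are hypothetical names, and the inner loop stands in for real distance computations.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the global statistics object (not faiss's real type).
struct SketchStats {
    std::size_t ndis = 0;
};
SketchStats sketch_stats;

// Hypothetical stand-in for FlatHammingDis: counts distance evaluations
// locally and flushes the count into the global counter on destruction.
struct SketchDistanceComputer {
    std::size_t ndis = 0;

    void compute() { ++ndis; }

    ~SketchDistanceComputer() {
        // Unnamed critical section: all unnamed critical sections in the
        // program share one lock, so threads pass through here one at a time.
#pragma omp critical
        { sketch_stats.ndis += ndis; }
    }
};

// One computer per thread; every thread destroys its computer when it
// leaves the parallel region, right after the loop's implicit barrier.
std::size_t search_with_critical(long long n_queries, std::size_t evals_per_query) {
    assert(n_queries >= 0);
    sketch_stats.ndis = 0;
#pragma omp parallel
    {
        SketchDistanceComputer dis;
#pragma omp for
        for (long long i = 0; i < n_queries; ++i) {
            for (std::size_t j = 0; j < evals_per_query; ++j) {
                dis.compute();
            }
        }
    } // destructors run here, serialized on the shared critical lock
    return sketch_stats.ndis;
}
```

The total is correct either way; the cost is that all threads reach the destructor at nearly the same moment and queue on one runtime lock.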

In IndexBinaryHNSWCagra::search() with base_level_only=true, the situation is even worse: FlatHammingDis is created and destroyed per query iteration inside #pragma omp parallel for, so the critical section is entered once per query (n entries per search call, contended across num_threads threads) rather than once per thread.

Fix

Replace #pragma omp critical with #pragma omp atomic. Since hnsw_stats.ndis += ndis is a simple size_t addition, #pragma omp atomic typically compiles to a single lock-prefixed hardware instruction (lock add or lock xadd on x86-64), which is far cheaper than a lock-based critical section. Threads still serialize briefly on the counter's cache line, but they no longer spin or yield on a runtime lock.
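A minimal sketch of the atomic flush, assuming the same kind of destructor-based accumulation. The names (SketchStats, sketch_stats, AtomicFlushComputer, search_per_query_with_atomic) are hypothetical and this is not the actual diff. The computer is constructed inside the loop body to mirror the per-query lifetime described for IndexBinaryHNSWCagra::search() with base_level_only=true.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the global statistics object (not faiss's real type).
struct SketchStats {
    std::size_t ndis = 0;
};
SketchStats sketch_stats;

// Hypothetical distance computer that flushes its local count with an
// atomic update instead of a critical section.
struct AtomicFlushComputer {
    std::size_t ndis = 0;

    void compute() { ++ndis; }

    ~AtomicFlushComputer() {
        // Only the read-modify-write of sketch_stats.ndis is atomic;
        // no runtime lock is acquired.
#pragma omp atomic
        sketch_stats.ndis += ndis;
    }
};

// Per-query lifetime: the computer is created and destroyed in every iteration.
std::size_t search_per_query_with_atomic(long long n_queries, std::size_t evals_per_query) {
    assert(n_queries >= 0);
    sketch_stats.ndis = 0;
#pragma omp parallel for
    for (long long i = 0; i < n_queries; ++i) {
        AtomicFlushComputer dis;
        for (std::size_t j = 0; j < evals_per_query; ++j) {
            dis.compute();
        }
    } // each iteration's destructor performs one atomic add
    return sketch_stats.ndis;
}
```

The update statement has the form x += expr on a scalar, which is one of the forms the atomic construct accepts, so the change is a one-line swap of the pragma.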

For reference, the float HNSW path in IndexHNSW.cpp already uses the correct pattern: #pragma omp for reduction(+: n1, n2, ndis, nhops) with a single-threaded hnsw_stats.combine() call outside the parallel region.
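The reduction pattern mentioned here might look roughly like this. It is a sketch under assumptions: SketchStats, its combine() signature, and search_with_reduction are hypothetical and simplified to two counters, and the loop body is a placeholder for a real search.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical statistics type with a combine() step; simplified to two counters.
struct SketchStats {
    std::size_t ndis = 0;
    std::size_t nhops = 0;

    void combine(const SketchStats& other) {
        ndis += other.ndis;
        nhops += other.nhops;
    }
};

// Accumulates per-thread private counters via an OpenMP reduction, then
// merges them into the shared stats once, outside the parallel region.
void search_with_reduction(SketchStats& global_stats,
                           long long n_queries,
                           std::size_t evals_per_query) {
    assert(n_queries >= 0);
    std::size_t ndis = 0;
    std::size_t nhops = 0;
#pragma omp parallel
    {
#pragma omp for reduction(+ : ndis, nhops)
        for (long long i = 0; i < n_queries; ++i) {
            ndis += evals_per_query; // placeholder for distance evaluations
            nhops += 1;              // placeholder for graph hops
        }
    }
    // Single-threaded merge: no lock or atomic needed here.
    SketchStats local;
    local.ndis = ndis;
    local.nhops = nhops;
    global_stats.combine(local);
}
```

With a reduction, each thread works on a private copy and the runtime combines the copies at the end of the loop, so the shared counter is touched once per search call.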

Impact

  • Eliminates the lock contention that accounted for ~87% of CPU time in the profiled binary HNSW search
  • Affects all users of IndexBinaryHNSW::search() and IndexBinaryHNSWCagra::search()
  • No change to search results or statistics accuracy — #pragma omp atomic provides the same correctness guarantees as #pragma omp critical for a single += operation
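The equivalence claim for a single += can be checked with a small sketch like the one below. The function names and the partials vector are hypothetical; each element stands in for one thread's or one query's local ndis count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums per-worker counts into a shared total, guarded by a critical section.
std::size_t flush_with_critical(const std::vector<std::size_t>& partials) {
    std::size_t total = 0;
    const long long n = static_cast<long long>(partials.size());
#pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
#pragma omp critical
        { total += partials[static_cast<std::size_t>(i)]; }
    }
    return total;
}

// Same accumulation, guarded by an atomic update instead.
std::size_t flush_with_atomic(const std::vector<std::size_t>& partials) {
    std::size_t total = 0;
    const long long n = static_cast<long long>(partials.size());
#pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
#pragma omp atomic
        total += partials[static_cast<std::size_t>(i)];
    }
    return total;
}
```

Both variants give the same total regardless of thread count, because integer addition is associative and each update is indivisible in either form.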

Differential Revision: D95910991

@meta-cla meta-cla Bot added the CLA Signed label Mar 11, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Mar 11, 2026

@sharm235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95910991.

…ebookresearch#4909)

Reviewed By: mnorris11

Differential Revision: D95910991
@meta-codesync
Contributor

meta-codesync Bot commented Mar 12, 2026

This pull request has been merged in 5b83ec6.