Fix OpenMP critical section contention in IndexBinaryHNSW search#4909
Closed
sharm235 wants to merge 1 commit into
Closed
Fix OpenMP critical section contention in IndexBinaryHNSW search#4909sharm235 wants to merge 1 commit into
sharm235 wants to merge 1 commit into
Conversation
Contributor
…ebookresearch#4909) Summary: ## Problem We recently found **87% of CPU** was being wasted on OpenMP lock contention in `FlatHammingDis::~FlatHammingDis`, not on useful computation. The flame graph breakdown: - 88.2% CPU in `openmp_worker` threads - 87.2% in `FlatHammingDis::~FlatHammingDis` → `__kmpc_critical_with_hint` → `__kmp_acquire_queuing_lock` → `__sched_yield` (84% CPU spinning/yielding on lock) ## Root Cause The `FlatHammingDis` destructor used `#pragma omp critical` to accumulate a single `size_t` counter (`hnsw_stats.ndis += ndis`). Unnamed `#pragma omp critical` sections share a **global serialization lock** — when all OpenMP threads exit the `#pragma omp parallel` block in `IndexBinaryHNSW::search()` simultaneously, they ALL enter the destructor at the same time, serializing on that single lock. With N threads, this means N sequential lock acquisitions where each thread spins/yields waiting for its turn. This is O(N) serialization at the end of every search call. In `IndexBinaryHNSWCagra::search()` with `base_level_only=true`, the situation is even worse: `FlatHammingDis` is created and destroyed **per query iteration** inside `#pragma omp parallel for`, causing `n × num_threads` critical section entries. ## Fix Replace `#pragma omp critical` with `#pragma omp atomic`. Since `hnsw_stats.ndis += ndis` is a simple `size_t` addition, `#pragma omp atomic` compiles to a single hardware atomic instruction (`lock xadd` on x86-64) — orders of magnitude faster than a mutex-based critical section, with effectively zero contention. For reference, the float HNSW path in `IndexHNSW.cpp` already uses the correct pattern: `#pragma omp for reduction(+: n1, n2, ndis, nhops)` with a single-threaded `hnsw_stats.combine()` call outside the parallel region. ## Impact - Eliminates ~87% CPU waste from lock contention in binary HNSW search - Affects all users of `IndexBinaryHNSW::search()` and `IndexBinaryHNSWCagra::search()` - No change to search results or statistics accuracy — `#pragma omp atomic` provides the same correctness guarantees as `#pragma omp critical` for a single `+=` operation Reviewed By: mnorris11 Differential Revision: D95910991
d745c8b to
8903e72
Compare
Contributor
|
This pull request has been merged in 5b83ec6. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Problem
We recently found 87% of CPU was being wasted on OpenMP lock contention in
FlatHammingDis::~FlatHammingDis, not on useful computation.The flame graph breakdown:
openmp_workerthreadsFlatHammingDis::~FlatHammingDis→__kmpc_critical_with_hint→__kmp_acquire_queuing_lock→__sched_yield(84% CPU spinning/yielding on lock)Root Cause
The
FlatHammingDisdestructor used#pragma omp criticalto accumulate a singlesize_tcounter (hnsw_stats.ndis += ndis). Unnamed#pragma omp criticalsections share a global serialization lock — when all OpenMP threads exit the#pragma omp parallelblock inIndexBinaryHNSW::search()simultaneously, they ALL enter the destructor at the same time, serializing on that single lock.With N threads, this means N sequential lock acquisitions where each thread spins/yields waiting for its turn. This is O(N) serialization at the end of every search call.
In
IndexBinaryHNSWCagra::search()withbase_level_only=true, the situation is even worse:FlatHammingDisis created and destroyed per query iteration inside#pragma omp parallel for, causingn × num_threadscritical section entries.Fix
Replace
#pragma omp criticalwith#pragma omp atomic. Sincehnsw_stats.ndis += ndisis a simplesize_taddition,#pragma omp atomiccompiles to a single hardware atomic instruction (lock xaddon x86-64) — orders of magnitude faster than a mutex-based critical section, with effectively zero contention.For reference, the float HNSW path in
IndexHNSW.cppalready uses the correct pattern:#pragma omp for reduction(+: n1, n2, ndis, nhops)with a single-threadedhnsw_stats.combine()call outside the parallel region.Impact
IndexBinaryHNSW::search()andIndexBinaryHNSWCagra::search()#pragma omp atomicprovides the same correctness guarantees as#pragma omp criticalfor a single+=operationDifferential Revision: D95910991