From 8903e72698bbdc52d8ea9c4b833e8f6b98327588 Mon Sep 17 00:00:00 2001 From: Pratyaksh Sharma Date: Wed, 11 Mar 2026 14:47:24 -0700 Subject: [PATCH] Fix OpenMP critical section contention in IndexBinaryHNSW search (#4909) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Summary: ## Problem We recently found **87% of CPU** was being wasted on OpenMP lock contention in `FlatHammingDis::~FlatHammingDis`, not on useful computation. The flame graph breakdown: - 88.2% CPU in `openmp_worker` threads - 87.2% in `FlatHammingDis::~FlatHammingDis` → `__kmpc_critical_with_hint` → `__kmp_acquire_queuing_lock` → `__sched_yield` (84% CPU spinning/yielding on lock) ## Root Cause The `FlatHammingDis` destructor used `#pragma omp critical` to accumulate a single `size_t` counter (`hnsw_stats.ndis += ndis`). Unnamed `#pragma omp critical` sections share a **global serialization lock** — when all OpenMP threads exit the `#pragma omp parallel` block in `IndexBinaryHNSW::search()` simultaneously, they ALL enter the destructor at the same time, serializing on that single lock. With N threads, this means N sequential lock acquisitions where each thread spins/yields waiting for its turn. This is O(N) serialization at the end of every search call. In `IndexBinaryHNSWCagra::search()` with `base_level_only=true`, the situation is even worse: `FlatHammingDis` is created and destroyed **per query iteration** inside `#pragma omp parallel for`, causing `n × num_threads` critical section entries. ## Fix Replace `#pragma omp critical` with `#pragma omp atomic`. Since `hnsw_stats.ndis += ndis` is a simple `size_t` addition, `#pragma omp atomic` compiles to a single hardware atomic instruction (`lock xadd` on x86-64) — orders of magnitude faster than a mutex-based critical section, with effectively zero contention. For reference, the float HNSW path in `IndexHNSW.cpp` already uses the correct pattern: `#pragma omp for reduction(+: n1, n2, ndis, nhops)` with a single-threaded `hnsw_stats.combine()` call outside the parallel region. ## Impact - Eliminates ~87% CPU waste from lock contention in binary HNSW search - Affects all users of `IndexBinaryHNSW::search()` and `IndexBinaryHNSWCagra::search()` - No change to search results or statistics accuracy — `#pragma omp atomic` provides the same correctness guarantees as `#pragma omp critical` for a single `+=` operation Reviewed By: mnorris11 Differential Revision: D95910991 --- faiss/IndexBinaryHNSW.cpp | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/faiss/IndexBinaryHNSW.cpp b/faiss/IndexBinaryHNSW.cpp index 73b19be334..641d3136e7 100644 --- a/faiss/IndexBinaryHNSW.cpp +++ b/faiss/IndexBinaryHNSW.cpp @@ -293,10 +293,8 @@ struct FlatHammingDis : DistanceComputer { } ~FlatHammingDis() override { -#pragma omp critical - { - hnsw_stats.ndis += ndis; - } +#pragma omp atomic + hnsw_stats.ndis += ndis; } };