From 8903e72698bbdc52d8ea9c4b833e8f6b98327588 Mon Sep 17 00:00:00 2001
From: Pratyaksh Sharma <pratsh@meta.com>
Date: Wed, 11 Mar 2026 14:47:24 -0700
Subject: [PATCH] Fix OpenMP critical section contention in IndexBinaryHNSW
 search (#4909)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Summary:

## Problem

We recently found **87% of CPU** was being wasted on OpenMP lock contention in `FlatHammingDis::~FlatHammingDis`, not on useful computation.

The flame graph breakdown:
- 88.2% CPU in `openmp_worker` threads
- 87.2% in `FlatHammingDis::~FlatHammingDis` → `__kmpc_critical_with_hint` → `__kmp_acquire_queuing_lock` → `__sched_yield` (84% CPU spinning/yielding on lock)

## Root Cause

The `FlatHammingDis` destructor used `#pragma omp critical` to accumulate a single `size_t` counter (`hnsw_stats.ndis += ndis`). Unnamed `#pragma omp critical` sections share a **global serialization lock** — when all OpenMP threads exit the `#pragma omp parallel` block in `IndexBinaryHNSW::search()` simultaneously, they ALL enter the destructor at the same time, serializing on that single lock.

With N threads, this means N sequential lock acquisitions where each thread spins/yields waiting for its turn. This is O(N) serialization at the end of every search call.

In `IndexBinaryHNSWCagra::search()` with `base_level_only=true`, the situation is even worse: `FlatHammingDis` is created and destroyed **per query iteration** inside `#pragma omp parallel for`, causing `n × num_threads` critical section entries.

## Fix

Replace `#pragma omp critical` with `#pragma omp atomic`. Since `hnsw_stats.ndis += ndis` is a simple `size_t` addition, `#pragma omp atomic` compiles to a single hardware atomic instruction (`lock xadd` on x86-64) — orders of magnitude faster than a mutex-based critical section, with effectively zero contention.

For reference, the float HNSW path in `IndexHNSW.cpp` already uses the correct pattern: `#pragma omp for reduction(+: n1, n2, ndis, nhops)` with a single-threaded `hnsw_stats.combine()` call outside the parallel region.

## Impact

- Eliminates ~87% CPU waste from lock contention in binary HNSW search
- Affects all users of `IndexBinaryHNSW::search()` and `IndexBinaryHNSWCagra::search()`
- No change to search results or statistics accuracy — `#pragma omp atomic` provides the same correctness guarantees as `#pragma omp critical` for a single `+=` operation

Reviewed By: mnorris11

Differential Revision: D95910991
---
 faiss/IndexBinaryHNSW.cpp | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/faiss/IndexBinaryHNSW.cpp b/faiss/IndexBinaryHNSW.cpp
index 73b19be334..641d3136e7 100644
--- a/faiss/IndexBinaryHNSW.cpp
+++ b/faiss/IndexBinaryHNSW.cpp
@@ -293,10 +293,8 @@ struct FlatHammingDis : DistanceComputer {
     }
 
     ~FlatHammingDis() override {
-#pragma omp critical
-        {
-            hnsw_stats.ndis += ndis;
-        }
+#pragma omp atomic
+        hnsw_stats.ndis += ndis;
     }
 };