Remove lock contention from DenseIntMap on concurrent graph build#2
Conversation
A herddb indexing lock profile showed ~92% of lock-wait time inside DenseIntMap under ConcurrentNeighborMap.insertDiverse / backlink / addNode: the read lock of the internal ReentrantReadWriteLock was being parked by waiting writers (ensureCapacity grows of the backing AtomicReferenceArray) under the non-fair AQS. Changes: - Rewrite DenseIntMap with a two-level spine-of-segments layout. Segments are fixed-size (1024) and never reallocated, so compareAndPut / get / remove are fully lock-free on the steady-state path. Only spine grow and segment install share a synchronized block, and those happen O(log N) and O(N/1024) times respectively across the map's lifetime. - Expose an initialCapacity hint through GraphIndexBuilder -> OnHeapGraphIndex -> ConcurrentNeighborMap -> DenseIntMap. Callers with a known node count (e.g. herddb with a fixed shard size) can pre-size the base-layer map so the spine is wide enough from the start and every segment is pre-allocated — the hot insert phase then makes zero allocations and never touches spineLock. - Add TestDenseIntMapSegmented covering cross-segment-boundary access, concurrent inserts that force spine growth, same-key CAS races, and concurrent insert+remove cycles. - Add GraphIndexBuilderTest.testInitialCapacityHintProducesEquivalentGraph to guard the new constructor overload. - Add benchmarks-jmh/DenseIntMapConcurrentBenchmark to measure throughput of insert / CAS update / get / mixed read-write workloads against both the default (1024) and pre-sized capacities. Public API: additive only. New overloads on GraphIndexBuilder (11-arg variant with int initialCapacity) and ConcurrentNeighborMap (4-arg variant). All existing constructors preserve their previous behaviour by delegating with initialCapacity=1024. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JMH results — legacy vs. segmentedRan the new Headline: the profiled contention is goneTwo benchmarks directly mirror the lock-profile hotspot. At 8 threads with the default
With a pre-sized hint (
Full resultsTrade-off to be transparent aboutThe pure-read path pays for the extra level of indirection (spine load + segment load vs. a single array load):
The mixed-workload regression is fully explained by the lower If that read regression matters for a specific workload, easy follow-ups:
Also pushed 🤖 Generated with Claude Code |
Keep the old RW-lock + single-array implementation in the benchmarks module as LegacyDenseIntMap so DenseIntMapConcurrentBenchmark can run legacy vs. segmented side-by-side in the same JVM under identical conditions. Eliminates the need for a separate checkout to produce apples-to-apples comparisons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier segmented rewrite unlocked the write path (8-20x faster under contention) but regressed pure-read throughput by ~22% because every get() went through an extra spine-load indirection. Replace the uniform segmented layout with a two-tier structure: - base: an AtomicReferenceArray sized from the constructor's initialCapacity. Immortal — allocated once, never resized, never copied. get() for keys < initialCapacity is a single volatile load + slot read, identical to the legacy implementation. compareAndPut is a single CAS + AtomicInteger.inc with NO lock (the legacy RW-lock was only there to serialise against resize — base never resizes, so no lock is needed). - overflow: a lazily-allocated segmented tier, only touched for keys at or beyond initialCapacity. Segments are immortal once installed; the spine grows under a lock that the hot path never takes. For callers who pass an accurate initialCapacity (e.g. herddb with a known shard size) every operation stays on the base path: - Reads: equivalent to legacy (volatile + slot load). - Writes: strictly faster than legacy (no lock traversal). Benchmark (pre-sized, initialCapacity = totalKeys = 1M): Benchmark legacy new change getHot1 37.1M 37.3M +0.6% (within noise) getHot8 333.9M 345.5M +3.5% casUpdate1 10.4M 13.2M +27% casUpdate8 3.1M 110.8M ~35x insertDense1 72.8M 142.8M +96% insertDense8 3.2M 22.5M ~7x mixed 7R:1W 165.4M 239.4M +45% Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read-path regression eliminated — new design: immortal base + lazy segmented overflowPer review feedback, redesigned to remove the read-path regression. Design
For callers who pass an accurate Benchmark — pre-sized case (initialCapacity == totalKeys == 1M)This is the production workload. No regressions; wins everywhere.
Benchmark — default-capacity case (initialCapacity = 1024, totalKeys = 1M)Keys beyond 1024 go through the overflow tier, which is lock-free but has one extra
The overflow-path read regression is the price of not re-introducing the lock. Production All 238 tests in 🤖 Generated with Claude Code |
## Summary - Adopt the new `initialCapacity` hint on `GraphIndexBuilder` introduced by jvector branch [`reduce-denseintmap-lock-contention`](eolivelli/jvector#2) (commit `87e3bfff`), which rewrites `DenseIntMap` as a lock-free spine-of-segments. A herddb lock-profile showed ~92% of lock-wait time inside that map during concurrent graph build. - `PersistentVectorStore.createEmptyLiveShard` — pass `cap = computeEffectiveMaxLiveGraphSize()` as the hint. This is the same bound already used to pre-size the two `ConcurrentHashMap`s next to the builder. - `PersistentVectorStore.writeFusedPQGraphToTempFile` — pass `totalVectors = allNodeToPk.size()`, the exact node count about to be inserted in the compaction/merge path. - CI (`ci.yml` + `kubernetes-tests.yml`) now checks out the new jvector branch so the 11-arg constructor resolves at compile time. Artifact version (`4.0.0-rc.9-herddb-SNAPSHOT`) is unchanged, so no pom bump is required. Closes #223. ## Test plan - [x] `mvn -B checkstyle:check apache-rat:check spotbugs:check install -DskipTests -Pci` (green locally) - [ ] CI (`ci.yml` + `kubernetes-tests.yml`) runs against the new jvector branch - [ ] `DirectMultipleConcurrentUpdatesSuite{NoIndexes,WithNonUniqueIndexes,WithUniqueIndexes}Test` (hammer gate for index/checkpoint/concurrency changes) - [ ] Vector indexing smoke on k3s-local / GKE confirms no lock-profile regression 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
An async-profiler lock profile from a concurrent herddb indexing workload
showed that ~92% of lock-wait time lives inside
io.github.jbellis.jvector.util.DenseIntMap, on the graph-build pathsConcurrentNeighborMap.insertDiverse/backlink/addNode→DenseIntMap.compareAndPut. Those waits are on the read lock of theinternal
ReentrantReadWriteLock: writers runningensureCapacity(resizingthe backing
AtomicReferenceArray) park every concurrent updater underthe non-fair AQS.
This PR removes that hotspot with three coordinated changes, delivered in a
single commit so reviewers can see the full picture. Existing public API is
preserved — only additive overloads are introduced.
1. Segmented, lock-free
DenseIntMapReplaces the single
volatile AtomicReferenceArray<T>+ RW-lock with atwo-level spine-of-segments layout:
volatile AtomicReferenceArray<AtomicReferenceArray<T>> spineoffixed-size (1024) segments.
reads/CAS-writes are lock-free.
synchronizedblock (
spineLock). Everything that finds an already-installed segmentbypasses the lock completely.
get()remains fully lock-free.size()/compareAndPut/remove/forEach/containsKeypreserve their current semantics.Correctness argument (also in the code comment):
spineLock, so every thread agrees on the one-and-only segment object fora given
key >>> 10.same slot of the same
AtomicReferenceArray, regardless of which spinesnapshot they observed.
values — so no lost writes are possible during resize.
2.
initialCapacityhint through the public APICallers that know their final node count (herddb has a fixed shard size) can
now pre-size the base layer so the spine is wide enough and every segment is
pre-allocated. The hot insert phase then makes zero allocations and never
even touches
spineLock.New additive overloads:
GraphIndexBuilder(..., ForkJoinPool, ForkJoinPool, int initialCapacity)ConcurrentNeighborMap(DiversityProvider, int maxDegree, int maxOverflowDegree, int initialCapacity)OnHeapGraphIndex(..., int baseLayerInitialCapacity)(package-private)Existing constructors delegate with
initialCapacity=1024, matching theprevious default.
3. JMH benchmark
benchmarks-jmh/.../DenseIntMapConcurrentBenchmark— parameterised on{initialCapacity ∈ {1024, totalKeys}, totalKeys}and exercises:insertEdge/insertDiverse)get()throughput@Group/@GroupThreadsExpected: the pre-sized case removes all allocation traffic; the
default-capacity case removes the old RW-lock overhead on every operation.
Test plan
TestIntMappasses (4/4) — DenseIntMap + SparseIntMap coverage.TestDenseIntMapSegmented(8 tests): cross-segment-boundary reads/writes,concurrent inserts forcing spine growth from initial capacity 1, same-key CAS
races (exactly one winner), concurrent insert + remove cycles, capacity-hint
coverage, null/invalid argument rejection.
GraphIndexBuilderTest.testInitialCapacityHintProducesEquivalentGraph—end-to-end equivalence between default and hinted builds.
GraphIndexBuilderTest,OnHeapGraphIndexTest,TestNeighbors,TestConcurrentReadWriteDeletespass with the new impl.jvector-testssuite: 238 tests, 0 failures, 0 errors, 2 skipped.DenseIntMapsubtreesshrink to <5%.
DenseIntMapConcurrentBenchmark).🤖 Generated with Claude Code