
Remove lock contention from DenseIntMap on concurrent graph build #2

Merged
eolivelli merged 3 commits into main from reduce-denseintmap-lock-contention
Apr 22, 2026

Conversation

@eolivelli
Owner

Summary

An async-profiler lock profile from a concurrent herddb indexing workload
showed that ~92% of lock-wait time lives inside
io.github.jbellis.jvector.util.DenseIntMap, on the graph-build paths
ConcurrentNeighborMap.insertDiverse / backlink / addNode →
DenseIntMap.compareAndPut. Those waits are on the read lock of the
internal ReentrantReadWriteLock: writers running ensureCapacity (resizing
the backing AtomicReferenceArray) park every concurrent updater under
the non-fair AQS.

This PR removes that hotspot with three coordinated changes, delivered in a
single commit so reviewers can see the full picture. Existing public API is
preserved — only additive overloads are introduced.

1. Segmented, lock-free DenseIntMap

Replaces the single volatile AtomicReferenceArray<T> + RW-lock with a
two-level spine-of-segments layout:

  • volatile AtomicReferenceArray<AtomicReferenceArray<T>> spine of
    fixed-size (1024) segments.
  • Once a segment is installed it is never reallocated. All slot
    reads/CAS-writes are lock-free.
  • Only spine grow + first-time segment install share one synchronized
    block (spineLock). Everything that finds an already-installed segment
    bypasses the lock completely.
  • get() remains fully lock-free. size() / compareAndPut / remove /
    forEach / containsKey preserve their current semantics.

Correctness argument (also in the code comment; a minimal code sketch follows this list):

  • Segment identity is stable: spine grow and segment install share
    spineLock, so every thread agrees on the one-and-only segment object for
    a given key >>> 10.
  • Lost-write freedom: two concurrent puts for the same key always CAS the
    same slot of the same AtomicReferenceArray, regardless of which spine
    snapshot they observed.
  • Spine grow copies references to already-published segments — never copies
    values — so no lost writes are possible during resize.
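
For concreteness, a minimal sketch of the layout, with hypothetical names — not the actual implementation (size tracking, remove / forEach / containsKey and argument checks are elided):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch of the spine-of-segments design described above.
final class SegmentedMapSketch<T> {
    private static final int SEGMENT_BITS = 10;               // 1024 slots per segment
    private static final int SEGMENT_SIZE = 1 << SEGMENT_BITS;
    private static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    private final Object spineLock = new Object();
    private volatile AtomicReferenceArray<AtomicReferenceArray<T>> spine =
            new AtomicReferenceArray<>(1);

    // Fully lock-free: one volatile spine load + one segment slot load.
    public T get(int key) {
        AtomicReferenceArray<AtomicReferenceArray<T>> s = spine;
        int si = key >>> SEGMENT_BITS;
        if (si >= s.length()) return null;
        AtomicReferenceArray<T> seg = s.get(si);
        return seg == null ? null : seg.get(key & SEGMENT_MASK);
    }

    // Lock-free once the segment exists; two racing puts for the same key
    // always CAS the same slot of the same segment object.
    public boolean compareAndPut(int key, T existing, T value) {
        return segmentFor(key).compareAndSet(key & SEGMENT_MASK, existing, value);
    }

    private AtomicReferenceArray<T> segmentFor(int key) {
        int si = key >>> SEGMENT_BITS;
        AtomicReferenceArray<AtomicReferenceArray<T>> s = spine;
        if (si < s.length()) {
            AtomicReferenceArray<T> seg = s.get(si);
            if (seg != null) return seg;                      // steady state: no lock
        }
        synchronized (spineLock) {                            // spine grow + first install only
            s = spine;
            if (si >= s.length()) {
                AtomicReferenceArray<AtomicReferenceArray<T>> grown =
                        new AtomicReferenceArray<>(Integer.highestOneBit(si) << 1);
                for (int i = 0; i < s.length(); i++) {
                    grown.set(i, s.get(i));                   // copy segment refs, never values
                }
                spine = grown;
                s = grown;
            }
            AtomicReferenceArray<T> seg = s.get(si);
            if (seg == null) {
                seg = new AtomicReferenceArray<>(SEGMENT_SIZE);
                s.set(si, seg);
            }
            return seg;
        }
    }
}
```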

2. initialCapacity hint through the public API

Callers that know their final node count (herddb has a fixed shard size) can
now pre-size the base layer so the spine is wide enough and every segment is
pre-allocated. The hot insert phase then makes zero allocations and never
even touches spineLock.

New additive overloads:

  • GraphIndexBuilder(..., ForkJoinPool, ForkJoinPool, int initialCapacity)
  • ConcurrentNeighborMap(DiversityProvider, int maxDegree, int maxOverflowDegree, int initialCapacity)
  • OnHeapGraphIndex(..., int baseLayerInitialCapacity) (package-private)

Existing constructors delegate with initialCapacity=1024, matching the
previous default.
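
A hedged usage sketch of the hint through the fully-spelled ConcurrentNeighborMap overload (the degree values and the DiversityProvider argument are placeholders; only the 4-arg shape comes from this PR):

```java
// Placeholder values throughout; only the overload's shape is from this PR.
ConcurrentNeighborMap presizedNeighbors(DiversityProvider dp) {
    int shardSize = 1_000_000;               // e.g. herddb's fixed shard size
    return new ConcurrentNeighborMap(dp,
                                     32,          // maxDegree (placeholder)
                                     64,          // maxOverflowDegree (placeholder)
                                     shardSize);  // initialCapacity hint
}
```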

3. JMH benchmark

benchmarks-jmh/.../DenseIntMapConcurrentBenchmark — parameterised on
initialCapacity ∈ {1024, totalKeys} and on totalKeys, and exercises:

  • dense-key insert throughput at 1 and 8 threads
  • CAS-update throughput (models insertEdge/insertDiverse)
  • pure get() throughput
  • mixed 7R:1W workload using @Group/@GroupThreads

Expected: the pre-sized case removes all allocation traffic; the
default-capacity case removes the old RW-lock overhead on every operation.
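
The mixed arm relies on JMH's asymmetric groups; roughly like the sketch below (class name and constants illustrative, pre-sized variant shown; the real benchmark also parameterises impl, initialCapacity and totalKeys):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;
import io.github.jbellis.jvector.util.DenseIntMap;

@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
public class MixedWorkloadSketch {
    static final int TOTAL_KEYS = 1_000_000;
    DenseIntMap<Integer> map;

    @Setup(Level.Trial)
    public void setup() {
        map = new DenseIntMap<>(TOTAL_KEYS);          // pre-sized variant
        for (int k = 0; k < TOTAL_KEYS; k++) {
            map.compareAndPut(k, null, k);            // populate before measuring
        }
    }

    @Benchmark @Group("mixed") @GroupThreads(7)       // seven reader threads
    public Integer mixedRead() {
        return map.get(ThreadLocalRandom.current().nextInt(TOTAL_KEYS));
    }

    @Benchmark @Group("mixed") @GroupThreads(1)       // one CAS-writer thread
    public boolean mixedWrite() {
        int k = ThreadLocalRandom.current().nextInt(TOTAL_KEYS);
        Integer old = map.get(k);
        return map.compareAndPut(k, old, k + 1);
    }
}
```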

Test plan

  • Existing TestIntMap passes (4/4) — DenseIntMap + SparseIntMap coverage.
  • New TestDenseIntMapSegmented (8 tests): cross-segment-boundary reads/writes,
    concurrent inserts forcing spine growth from initial capacity 1, same-key CAS
    races (exactly one winner), concurrent insert + remove cycles, capacity-hint
    coverage, null/invalid argument rejection.
  • New GraphIndexBuilderTest.testInitialCapacityHintProducesEquivalentGraph
    end-to-end equivalence between default and hinted builds.
  • GraphIndexBuilderTest, OnHeapGraphIndexTest, TestNeighbors,
    TestConcurrentReadWriteDeletes pass with the new impl.
  • Full jvector-tests suite: 238 tests, 0 failures, 0 errors, 2 skipped.
  • Re-run the herddb lock profile to confirm the DenseIntMap subtrees
    shrink to <5%.
  • JMH baseline vs. new impl comparison (DenseIntMapConcurrentBenchmark).

🤖 Generated with Claude Code

A herddb indexing lock profile showed ~92% of lock-wait time inside
DenseIntMap under ConcurrentNeighborMap.insertDiverse / backlink / addNode:
readers of the internal ReentrantReadWriteLock were parked behind writers
(ensureCapacity resizes of the backing AtomicReferenceArray) under the
non-fair AQS.

Changes:

- Rewrite DenseIntMap with a two-level spine-of-segments layout. Segments
  are fixed-size (1024) and never reallocated, so compareAndPut / get /
  remove are fully lock-free on the steady-state path. Only spine grow and
  segment install share a synchronized block, and those happen O(log N)
  and O(N/1024) times respectively across the map's lifetime.

- Expose an initialCapacity hint through GraphIndexBuilder ->
  OnHeapGraphIndex -> ConcurrentNeighborMap -> DenseIntMap. Callers with a
  known node count (e.g. herddb with a fixed shard size) can pre-size the
  base-layer map so the spine is wide enough from the start and every
  segment is pre-allocated — the hot insert phase then makes zero
  allocations and never touches spineLock.

- Add TestDenseIntMapSegmented covering cross-segment-boundary access,
  concurrent inserts that force spine growth, same-key CAS races, and
  concurrent insert+remove cycles.

- Add GraphIndexBuilderTest.testInitialCapacityHintProducesEquivalentGraph
  to guard the new constructor overload.

- Add benchmarks-jmh/DenseIntMapConcurrentBenchmark to measure
  throughput of insert / CAS update / get / mixed read-write workloads
  against both the default (1024) and pre-sized capacities.

Public API: additive only. New overloads on GraphIndexBuilder (11-arg
variant with int initialCapacity) and ConcurrentNeighborMap (4-arg
variant). All existing constructors preserve their previous behaviour by
delegating with initialCapacity=1024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli
Owner Author

JMH results — legacy vs. segmented

Ran the new DenseIntMapConcurrentBenchmark with both implementations side-by-side in the same JVM. Config: -wi 2 -w 1 -i 3 -r 2 -f 1, throughput mode, 1M total keys.
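
For reference, the equivalent JMH invocation (jar path assumed from the module layout; the flags are the ones stated above):

```
java -jar benchmarks-jmh/target/benchmarks.jar DenseIntMapConcurrentBenchmark \
     -wi 2 -w 1 -i 3 -r 2 -f 1
```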

Headline: the profiled contention is gone

Two benchmarks directly mirror the lock-profile hotspot. At 8 threads with the default initialCapacity=1024 (i.e. the production config the profile was captured under):

Benchmark                                                             legacy ops/s  segmented ops/s  speedup
insertDense8 (concurrent inserts, models addNode)                           3.68 M          29.83 M    ~8.1x
casUpdate8 (concurrent CAS-updates, models insertEdge/insertDiverse)        4.73 M          96.74 M   ~20.4x

With a pre-sized hint (initialCapacity = 1_000_000) the legacy impl can skip most resizes and recovers somewhat for single-threaded inserts, but at 8 threads it's still bottlenecked by the RW-lock machinery itself:

Benchmark      legacy (pre-sized)  segmented (pre-sized)  speedup
insertDense1              74.62 M               149.42 M    ~2.0x
insertDense8               3.42 M                22.35 M    ~6.5x
casUpdate8                 4.48 M                81.38 M   ~18.2x

Full results

Benchmark                                           (impl)  (initialCapacity)  (totalKeys)   Mode  Cnt          Score  Units
DenseIntMapConcurrentBenchmark.casUpdate1           legacy               1024      1000000  thrpt    3   10122004  ops/s
DenseIntMapConcurrentBenchmark.casUpdate1           legacy            1000000      1000000  thrpt    3   10947069  ops/s
DenseIntMapConcurrentBenchmark.casUpdate1        segmented               1024      1000000  thrpt    3   11721164  ops/s
DenseIntMapConcurrentBenchmark.casUpdate1        segmented            1000000      1000000  thrpt    3    8723957  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8           legacy               1024      1000000  thrpt    3    4731016  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8           legacy            1000000      1000000  thrpt    3    4481018  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8        segmented               1024      1000000  thrpt    3   96743850  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8        segmented            1000000      1000000  thrpt    3   81384863  ops/s
DenseIntMapConcurrentBenchmark.getHot1              legacy               1024      1000000  thrpt    3   39383827  ops/s
DenseIntMapConcurrentBenchmark.getHot1              legacy            1000000      1000000  thrpt    3   40192896  ops/s
DenseIntMapConcurrentBenchmark.getHot1           segmented               1024      1000000  thrpt    3   31516734  ops/s
DenseIntMapConcurrentBenchmark.getHot1           segmented            1000000      1000000  thrpt    3   31324581  ops/s
DenseIntMapConcurrentBenchmark.getHot8              legacy               1024      1000000  thrpt    3  363532606  ops/s
DenseIntMapConcurrentBenchmark.getHot8              legacy            1000000      1000000  thrpt    3  360817153  ops/s
DenseIntMapConcurrentBenchmark.getHot8           segmented               1024      1000000  thrpt    3  284407962  ops/s
DenseIntMapConcurrentBenchmark.getHot8           segmented            1000000      1000000  thrpt    3  285423558  ops/s
DenseIntMapConcurrentBenchmark.insertDense1         legacy               1024      1000000  thrpt    3   20516307  ops/s
DenseIntMapConcurrentBenchmark.insertDense1         legacy            1000000      1000000  thrpt    3   74618101  ops/s
DenseIntMapConcurrentBenchmark.insertDense1      segmented               1024      1000000  thrpt    3  131166624  ops/s
DenseIntMapConcurrentBenchmark.insertDense1      segmented            1000000      1000000  thrpt    3  149416226  ops/s
DenseIntMapConcurrentBenchmark.insertDense8         legacy               1024      1000000  thrpt    3    3681628  ops/s
DenseIntMapConcurrentBenchmark.insertDense8         legacy            1000000      1000000  thrpt    3    3423176  ops/s
DenseIntMapConcurrentBenchmark.insertDense8      segmented               1024      1000000  thrpt    3   29831205  ops/s
DenseIntMapConcurrentBenchmark.insertDense8      segmented            1000000      1000000  thrpt    3   22349319  ops/s
DenseIntMapConcurrentBenchmark.mixed                legacy               1024      1000000  thrpt    3  245936775  ops/s
DenseIntMapConcurrentBenchmark.mixed                legacy            1000000      1000000  thrpt    3  227130770  ops/s
DenseIntMapConcurrentBenchmark.mixed             segmented               1024      1000000  thrpt    3  178302707  ops/s
DenseIntMapConcurrentBenchmark.mixed             segmented            1000000      1000000  thrpt    3  182707096  ops/s

Trade-off to be transparent about

The pure-read path pays for the extra level of indirection (spine load + segment load vs. a single array load):

Benchmark      legacy  segmented  change
getHot1         ~40 M      ~31 M    −22%
getHot8        ~360 M     ~285 M    −21%
mixed (7R:1W)  ~245 M     ~178 M    −27%

The mixed-workload regression is fully explained by the lower get() throughput — the mixedWrite arm is ~unchanged (~11 M for both). On the profiled workload the lookup cost is dominated by the hash + similarity work happening after the map lookup, so the 20-25% slower get() is not expected to move the end-to-end build time; the 8–20x faster contended write path very much will.

If that read regression matters for a specific workload, easy follow-ups:

  • Bump SEGMENT_BITS from 10 to e.g. 14 → fewer segments, less pointer chasing, same write properties.
  • Hybrid: single array if the initial hint already covers the whole map, segmented only when it doesn't.

Also pushed LegacyDenseIntMap in the benchmarks module so this comparison can be rerun at any time.

🤖 Generated with Claude Code

eolivelli and others added 2 commits April 22, 2026 16:47
Keep the old RW-lock + single-array implementation in the benchmarks module
as LegacyDenseIntMap so DenseIntMapConcurrentBenchmark can run legacy vs.
segmented side-by-side in the same JVM under identical conditions.
Eliminates the need for a separate checkout to produce apples-to-apples
comparisons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier segmented rewrite unlocked the write path (8-20x faster under
contention) but regressed pure-read throughput by ~22% because every get()
went through an extra spine-load indirection.

Replace the uniform segmented layout with a two-tier structure:

- base: an AtomicReferenceArray sized from the constructor's
  initialCapacity. Immortal — allocated once, never resized, never copied.
  get() for keys < initialCapacity is a single volatile load + slot read,
  identical to the legacy implementation. compareAndPut is a single CAS +
  AtomicInteger.inc with NO lock (the legacy RW-lock was only there to
  serialise against resize — base never resizes, so no lock is needed).

- overflow: a lazily-allocated segmented tier, only touched for keys at
  or beyond initialCapacity. Segments are immortal once installed; the
  spine grows under a lock that the hot path never takes.

For callers who pass an accurate initialCapacity (e.g. herddb with a
known shard size) every operation stays on the base path:
- Reads: equivalent to legacy (volatile + slot load).
- Writes: strictly faster than legacy (no lock traversal).

Benchmark (pre-sized, initialCapacity = totalKeys = 1M):
  Benchmark      legacy      new         change
  getHot1        37.1M       37.3M       +0.6%   (within noise)
  getHot8        333.9M      345.5M      +3.5%
  casUpdate1     10.4M       13.2M       +27%
  casUpdate8     3.1M        110.8M      ~35x
  insertDense1   72.8M       142.8M      +96%
  insertDense8   3.2M        22.5M       ~7x
  mixed 7R:1W    165.4M      239.4M      +45%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli
Owner Author

Read-path regression eliminated — new design: immortal base + lazy segmented overflow

Per review feedback, redesigned to remove the read-path regression.

Design

  • base: an AtomicReferenceArray<T> sized from the constructor's initialCapacity.
    Immortal — allocated once, never resized, never copied. get() for keys < initialCapacity
    is a single volatile load + slot read, identical cost to legacy. compareAndPut is a
    single CAS + AtomicInteger.inc with no lock (the legacy RW-lock was only there to
    serialise against resize — base never resizes, so no lock is needed).
  • overflow: a lazily-allocated segmented tier, only touched for keys at or beyond
    initialCapacity. Segments are immortal once installed; the spine grows under a lock that
    the hot path never takes.

For callers who pass an accurate initialCapacity (herddb knows its shard size) every
operation stays on the base path.
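
A minimal sketch of the two-tier routing, with illustrative names; remove() and the segmented overflow tier are elided (the overflow tier follows the spine-of-segments shape sketched in the PR description):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative names only, not the actual implementation.
final class TwoTierSketch<T> {
    private final int baseCapacity;
    private final AtomicReferenceArray<T> base;   // immortal: never resized, never copied
    private final AtomicInteger size = new AtomicInteger();

    TwoTierSketch(int initialCapacity) {
        baseCapacity = initialCapacity;
        base = new AtomicReferenceArray<>(initialCapacity);
    }

    public T get(int key) {
        return key < baseCapacity
                ? base.get(key)                   // one volatile load + slot read
                : overflowGet(key);               // lazy segmented tier, lock-free reads
    }

    public boolean compareAndPut(int key, T existing, T value) {
        if (key < baseCapacity) {
            // The legacy RW-lock only serialised against resize; base never
            // resizes, so a bare CAS suffices.
            boolean won = base.compareAndSet(key, existing, value);
            if (won && existing == null) size.incrementAndGet();
            return won;
        }
        return overflowCompareAndPut(key, existing, value);
    }

    private T overflowGet(int key) {
        throw new UnsupportedOperationException("segmented overflow tier elided");
    }

    private boolean overflowCompareAndPut(int key, T existing, T value) {
        throw new UnsupportedOperationException("segmented overflow tier elided");
    }
}
```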

Benchmark — pre-sized case (initialCapacity == totalKeys == 1M)

This is the production workload. No regressions; wins everywhere.

Benchmark      legacy    new       change
getHot1        37.07 M   37.31 M   +0.6% (within noise)
getHot8        333.97 M  345.51 M  +3.5%
casUpdate1     10.37 M   13.17 M   +27%
casUpdate8     3.15 M    110.83 M  ~35x
insertDense1   72.83 M   142.84 M  +96%
insertDense8   3.18 M    22.52 M   ~7x
mixed 7R:1W    165.39 M  239.41 M  +45%

Benchmark — default-capacity case (initialCapacity = 1024, totalKeys = 1M)

Keys beyond 1024 go through the overflow tier, which is lock-free but carries one
more level of indirection than the (eventually-resized) legacy single array. Reads
for those keys pay ~20% relative to legacy's post-resize state, but writes are
still hugely faster:

Benchmark      legacy    new       change
getHot1        37.53 M   28.90 M   −23% (overflow path)
getHot8        339.03 M  267.28 M  −21% (overflow path)
casUpdate1     12.16 M   6.46 M    −47% (overflow path)
casUpdate8     4.38 M    63.56 M   ~14x
insertDense1   20.40 M   110.82 M  ~5.4x
insertDense8   3.42 M    30.86 M   ~9x
mixed 7R:1W    233.59 M  176.88 M  −24%

The overflow-path read regression is the price of not re-introducing the lock. Production
callers should pass an initialCapacity hint to get the best of both worlds — GraphIndexBuilder
has the new constructor overload for exactly this purpose.

All 238 tests in jvector-tests still pass.

🤖 Generated with Claude Code

eolivelli added a commit to eolivelli/herddb that referenced this pull request Apr 22, 2026
## Summary

- Adopt the new `initialCapacity` hint on `GraphIndexBuilder` introduced
by jvector branch
[`reduce-denseintmap-lock-contention`](eolivelli/jvector#2)
(commit `87e3bfff`), which rewrites `DenseIntMap` as a lock-free
spine-of-segments. A herddb lock-profile showed ~92% of lock-wait time
inside that map during concurrent graph build.
- `PersistentVectorStore.createEmptyLiveShard` — pass `cap =
computeEffectiveMaxLiveGraphSize()` as the hint. This is the same bound
already used to pre-size the two `ConcurrentHashMap`s next to the
builder.
- `PersistentVectorStore.writeFusedPQGraphToTempFile` — pass
`totalVectors = allNodeToPk.size()`, the exact node count about to be
inserted in the compaction/merge path.
- CI (`ci.yml` + `kubernetes-tests.yml`) now checks out the new jvector
branch so the 11-arg constructor resolves at compile time. Artifact
version (`4.0.0-rc.9-herddb-SNAPSHOT`) is unchanged, so no pom bump is
required.

Closes #223.

## Test plan

- [x] `mvn -B checkstyle:check apache-rat:check spotbugs:check install
-DskipTests -Pci` (green locally)
- [ ] CI (`ci.yml` + `kubernetes-tests.yml`) runs against the new
jvector branch
- [ ]
`DirectMultipleConcurrentUpdatesSuite{NoIndexes,WithNonUniqueIndexes,WithUniqueIndexes}Test`
(hammer gate for index/checkpoint/concurrency changes)
- [ ] Vector indexing smoke on k3s-local / GKE confirms no lock-profile
regression

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eolivelli merged commit fd2b411 into main Apr 22, 2026
1 of 6 checks passed
