
Remove lock contention from DenseIntMap on concurrent graph build #2

Merged
eolivelli merged 3 commits into main from reduce-denseintmap-lock-contention
Apr 22, 2026

Conversation

@eolivelli
Owner

Summary

An async-profiler lock profile from a concurrent herddb indexing workload
showed that ~92% of lock-wait time lives inside
io.github.jbellis.jvector.util.DenseIntMap, on the graph-build paths
ConcurrentNeighborMap.insertDiverse / backlink / addNode →
DenseIntMap.compareAndPut. Those waits are on the read lock of the
internal ReentrantReadWriteLock: writers running ensureCapacity (resizing
the backing AtomicReferenceArray) park every concurrent updater under
the non-fair AQS.

This PR removes that hotspot with three coordinated changes, delivered in a
single commit so reviewers can see the full picture. Existing public API is
preserved — only additive overloads are introduced.

1. Segmented, lock-free DenseIntMap

Replaces the single volatile AtomicReferenceArray<T> + RW-lock with a
two-level spine-of-segments layout:

  • volatile AtomicReferenceArray<AtomicReferenceArray<T>> spine of
    fixed-size (1024) segments.
  • Once a segment is installed it is never reallocated. All slot
    reads/CAS-writes are lock-free.
  • Only spine grow + first-time segment install share one synchronized
    block (spineLock). Everything that finds an already-installed segment
    bypasses the lock completely.
  • get() remains fully lock-free. size() / compareAndPut / remove /
    forEach / containsKey preserve their current semantics.

Correctness argument (also in the code comment; a minimal code sketch follows this list):

  • Segment identity is stable: spine grow and segment install share
    spineLock, so every thread agrees on the one-and-only segment object for
    a given key >>> 10.
  • Lost-write freedom: two concurrent puts for the same key always CAS the
    same slot of the same AtomicReferenceArray, regardless of which spine
    snapshot they observed.
  • Spine grow copies references to already-published segments — never copies
    values — so no lost writes are possible during resize.
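
For concreteness, a minimal sketch of the layout, with hypothetical names — not the actual implementation (size tracking, remove / forEach / containsKey and argument checks are elided):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch of the spine-of-segments design described above.
final class SegmentedMapSketch<T> {
    private static final int SEGMENT_BITS = 10;               // 1024 slots per segment
    private static final int SEGMENT_SIZE = 1 << SEGMENT_BITS;
    private static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    private final Object spineLock = new Object();
    private volatile AtomicReferenceArray<AtomicReferenceArray<T>> spine =
            new AtomicReferenceArray<>(1);

    // Fully lock-free: one volatile spine load + one segment slot load.
    public T get(int key) {
        AtomicReferenceArray<AtomicReferenceArray<T>> s = spine;
        int si = key >>> SEGMENT_BITS;
        if (si >= s.length()) return null;
        AtomicReferenceArray<T> seg = s.get(si);
        return seg == null ? null : seg.get(key & SEGMENT_MASK);
    }

    // Lock-free once the segment exists; two racing puts for the same key
    // always CAS the same slot of the same segment object.
    public boolean compareAndPut(int key, T existing, T value) {
        return segmentFor(key).compareAndSet(key & SEGMENT_MASK, existing, value);
    }

    private AtomicReferenceArray<T> segmentFor(int key) {
        int si = key >>> SEGMENT_BITS;
        AtomicReferenceArray<AtomicReferenceArray<T>> s = spine;
        if (si < s.length()) {
            AtomicReferenceArray<T> seg = s.get(si);
            if (seg != null) return seg;                      // steady state: no lock
        }
        synchronized (spineLock) {                            // spine grow + first install only
            s = spine;
            if (si >= s.length()) {
                AtomicReferenceArray<AtomicReferenceArray<T>> grown =
                        new AtomicReferenceArray<>(Integer.highestOneBit(si) << 1);
                for (int i = 0; i < s.length(); i++) {
                    grown.set(i, s.get(i));                   // copy segment refs, never values
                }
                spine = grown;
                s = grown;
            }
            AtomicReferenceArray<T> seg = s.get(si);
            if (seg == null) {
                seg = new AtomicReferenceArray<>(SEGMENT_SIZE);
                s.set(si, seg);
            }
            return seg;
        }
    }
}
```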

2. initialCapacity hint through the public API

Callers that know their final node count (herddb has a fixed shard size) can
now pre-size the base layer so the spine is wide enough and every segment is
pre-allocated. The hot insert phase then makes zero allocations and never
even touches spineLock.

New additive overloads:

  • GraphIndexBuilder(..., ForkJoinPool, ForkJoinPool, int initialCapacity)
  • ConcurrentNeighborMap(DiversityProvider, int maxDegree, int maxOverflowDegree, int initialCapacity)
  • OnHeapGraphIndex(..., int baseLayerInitialCapacity) (package-private)

Existing constructors delegate with initialCapacity=1024, matching the
previous default.
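
A hedged usage sketch of the hint through the fully-spelled ConcurrentNeighborMap overload (the degree values and the DiversityProvider argument are placeholders; only the 4-arg shape comes from this PR):

```java
// Placeholder values throughout; only the overload's shape is from this PR.
ConcurrentNeighborMap presizedNeighbors(DiversityProvider dp) {
    int shardSize = 1_000_000;               // e.g. herddb's fixed shard size
    return new ConcurrentNeighborMap(dp,
                                     32,          // maxDegree (placeholder)
                                     64,          // maxOverflowDegree (placeholder)
                                     shardSize);  // initialCapacity hint
}
```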

3. JMH benchmark

benchmarks-jmh/.../DenseIntMapConcurrentBenchmark — parameterised on
initialCapacity ∈ {1024, totalKeys} and on totalKeys, and exercises:

  • dense-key insert throughput at 1 and 8 threads
  • CAS-update throughput (models insertEdge/insertDiverse)
  • pure get() throughput
  • mixed 7R:1W workload using @Group/@GroupThreads

Expected: the pre-sized case removes all allocation traffic; the
default-capacity case removes the old RW-lock overhead on every operation.
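
The mixed arm relies on JMH's asymmetric groups; roughly like the sketch below (class name and constants illustrative, pre-sized variant shown; the real benchmark also parameterises impl, initialCapacity and totalKeys):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;
import io.github.jbellis.jvector.util.DenseIntMap;

@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
public class MixedWorkloadSketch {
    static final int TOTAL_KEYS = 1_000_000;
    DenseIntMap<Integer> map;

    @Setup(Level.Trial)
    public void setup() {
        map = new DenseIntMap<>(TOTAL_KEYS);          // pre-sized variant
        for (int k = 0; k < TOTAL_KEYS; k++) {
            map.compareAndPut(k, null, k);            // populate before measuring
        }
    }

    @Benchmark @Group("mixed") @GroupThreads(7)       // seven reader threads
    public Integer mixedRead() {
        return map.get(ThreadLocalRandom.current().nextInt(TOTAL_KEYS));
    }

    @Benchmark @Group("mixed") @GroupThreads(1)       // one CAS-writer thread
    public boolean mixedWrite() {
        int k = ThreadLocalRandom.current().nextInt(TOTAL_KEYS);
        Integer old = map.get(k);
        return map.compareAndPut(k, old, k + 1);
    }
}
```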

Test plan

  • Existing TestIntMap passes (4/4) — DenseIntMap + SparseIntMap coverage.
  • New TestDenseIntMapSegmented (8 tests): cross-segment-boundary reads/writes,
    concurrent inserts forcing spine growth from initial capacity 1, same-key CAS
    races (exactly one winner), concurrent insert + remove cycles, capacity-hint
    coverage, null/invalid argument rejection.
  • New GraphIndexBuilderTest.testInitialCapacityHintProducesEquivalentGraph
    end-to-end equivalence between default and hinted builds.
  • GraphIndexBuilderTest, OnHeapGraphIndexTest, TestNeighbors,
    TestConcurrentReadWriteDeletes pass with the new impl.
  • Full jvector-tests suite: 238 tests, 0 failures, 0 errors, 2 skipped.
  • Re-run the herddb lock profile to confirm the DenseIntMap subtrees
    shrink to <5%.
  • JMH baseline vs. new impl comparison (DenseIntMapConcurrentBenchmark).

🤖 Generated with Claude Code

A herddb indexing lock profile showed ~92% of lock-wait time inside
DenseIntMap under ConcurrentNeighborMap.insertDiverse / backlink / addNode:
readers of the internal ReentrantReadWriteLock were parked behind writers
(ensureCapacity resizes of the backing AtomicReferenceArray) under the
non-fair AQS.

Changes:

- Rewrite DenseIntMap with a two-level spine-of-segments layout. Segments
  are fixed-size (1024) and never reallocated, so compareAndPut / get /
  remove are fully lock-free on the steady-state path. Only spine grow and
  segment install share a synchronized block, and those happen O(log N)
  and O(N/1024) times respectively across the map's lifetime.

- Expose an initialCapacity hint through GraphIndexBuilder ->
  OnHeapGraphIndex -> ConcurrentNeighborMap -> DenseIntMap. Callers with a
  known node count (e.g. herddb with a fixed shard size) can pre-size the
  base-layer map so the spine is wide enough from the start and every
  segment is pre-allocated — the hot insert phase then makes zero
  allocations and never touches spineLock.

- Add TestDenseIntMapSegmented covering cross-segment-boundary access,
  concurrent inserts that force spine growth, same-key CAS races, and
  concurrent insert+remove cycles.

- Add GraphIndexBuilderTest.testInitialCapacityHintProducesEquivalentGraph
  to guard the new constructor overload.

- Add benchmarks-jmh/DenseIntMapConcurrentBenchmark to measure
  throughput of insert / CAS update / get / mixed read-write workloads
  against both the default (1024) and pre-sized capacities.

Public API: additive only. New overloads on GraphIndexBuilder (11-arg
variant with int initialCapacity) and ConcurrentNeighborMap (4-arg
variant). All existing constructors preserve their previous behaviour by
delegating with initialCapacity=1024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli
Owner Author

JMH results — legacy vs. segmented

Ran the new DenseIntMapConcurrentBenchmark with both implementations side-by-side in the same JVM. Config: -wi 2 -w 1 -i 3 -r 2 -f 1, throughput mode, 1M total keys.
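
For reference, the equivalent JMH invocation (jar path assumed from the module layout; the flags are the ones stated above):

```
java -jar benchmarks-jmh/target/benchmarks.jar DenseIntMapConcurrentBenchmark \
     -wi 2 -w 1 -i 3 -r 2 -f 1
```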

Headline: the profiled contention is gone

Two benchmarks directly mirror the lock-profile hotspot. At 8 threads with the default initialCapacity=1024 (i.e. the production config the profile was captured under):

Benchmark                                                             legacy ops/s  segmented ops/s  speedup
insertDense8 (concurrent inserts, models addNode)                           3.68 M          29.83 M    ~8.1x
casUpdate8 (concurrent CAS-updates, models insertEdge/insertDiverse)        4.73 M          96.74 M   ~20.4x

With a pre-sized hint (initialCapacity = 1_000_000) the legacy impl can skip most resizes and recovers somewhat for single-threaded inserts, but at 8 threads it's still bottlenecked by the RW-lock machinery itself:

Benchmark      legacy (pre-sized)  segmented (pre-sized)  speedup
insertDense1              74.62 M               149.42 M    ~2.0x
insertDense8               3.42 M                22.35 M    ~6.5x
casUpdate8                 4.48 M                81.38 M   ~18.2x

Full results

Benchmark                                           (impl)  (initialCapacity)  (totalKeys)   Mode  Cnt          Score  Units
DenseIntMapConcurrentBenchmark.casUpdate1           legacy               1024      1000000  thrpt    3   10122004  ops/s
DenseIntMapConcurrentBenchmark.casUpdate1           legacy            1000000      1000000  thrpt    3   10947069  ops/s
DenseIntMapConcurrentBenchmark.casUpdate1        segmented               1024      1000000  thrpt    3   11721164  ops/s
DenseIntMapConcurrentBenchmark.casUpdate1        segmented            1000000      1000000  thrpt    3    8723957  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8           legacy               1024      1000000  thrpt    3    4731016  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8           legacy            1000000      1000000  thrpt    3    4481018  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8        segmented               1024      1000000  thrpt    3   96743850  ops/s
DenseIntMapConcurrentBenchmark.casUpdate8        segmented            1000000      1000000  thrpt    3   81384863  ops/s
DenseIntMapConcurrentBenchmark.getHot1              legacy               1024      1000000  thrpt    3   39383827  ops/s
DenseIntMapConcurrentBenchmark.getHot1              legacy            1000000      1000000  thrpt    3   40192896  ops/s
DenseIntMapConcurrentBenchmark.getHot1           segmented               1024      1000000  thrpt    3   31516734  ops/s
DenseIntMapConcurrentBenchmark.getHot1           segmented            1000000      1000000  thrpt    3   31324581  ops/s
DenseIntMapConcurrentBenchmark.getHot8              legacy               1024      1000000  thrpt    3  363532606  ops/s
DenseIntMapConcurrentBenchmark.getHot8              legacy            1000000      1000000  thrpt    3  360817153  ops/s
DenseIntMapConcurrentBenchmark.getHot8           segmented               1024      1000000  thrpt    3  284407962  ops/s
DenseIntMapConcurrentBenchmark.getHot8           segmented            1000000      1000000  thrpt    3  285423558  ops/s
DenseIntMapConcurrentBenchmark.insertDense1         legacy               1024      1000000  thrpt    3   20516307  ops/s
DenseIntMapConcurrentBenchmark.insertDense1         legacy            1000000      1000000  thrpt    3   74618101  ops/s
DenseIntMapConcurrentBenchmark.insertDense1      segmented               1024      1000000  thrpt    3  131166624  ops/s
DenseIntMapConcurrentBenchmark.insertDense1      segmented            1000000      1000000  thrpt    3  149416226  ops/s
DenseIntMapConcurrentBenchmark.insertDense8         legacy               1024      1000000  thrpt    3    3681628  ops/s
DenseIntMapConcurrentBenchmark.insertDense8         legacy            1000000      1000000  thrpt    3    3423176  ops/s
DenseIntMapConcurrentBenchmark.insertDense8      segmented               1024      1000000  thrpt    3   29831205  ops/s
DenseIntMapConcurrentBenchmark.insertDense8      segmented            1000000      1000000  thrpt    3   22349319  ops/s
DenseIntMapConcurrentBenchmark.mixed                legacy               1024      1000000  thrpt    3  245936775  ops/s
DenseIntMapConcurrentBenchmark.mixed                legacy            1000000      1000000  thrpt    3  227130770  ops/s
DenseIntMapConcurrentBenchmark.mixed             segmented               1024      1000000  thrpt    3  178302707  ops/s
DenseIntMapConcurrentBenchmark.mixed             segmented            1000000      1000000  thrpt    3  182707096  ops/s

Trade-off to be transparent about

The pure-read path pays for the extra level of indirection (spine load + segment load vs. a single array load):

Benchmark      legacy  segmented  change
getHot1         ~40 M      ~31 M    −22%
getHot8        ~360 M     ~285 M    −21%
mixed (7R:1W)  ~245 M     ~178 M    −27%

The mixed-workload regression is fully explained by the lower get() throughput — the mixedWrite arm is ~unchanged (~11 M for both). On the profiled workload the lookup cost is dominated by the hash + similarity work happening after the map lookup, so the 20-25% slower get() is not expected to move the end-to-end build time; the 8–20x faster contended write path very much will.

If that read regression matters for a specific workload, easy follow-ups:

  • Bump SEGMENT_BITS from 10 to e.g. 14 → fewer segments, less pointer chasing, same write properties.
  • Hybrid: single array if the initial hint already covers the whole map, segmented only when it doesn't.

Also pushed LegacyDenseIntMap in the benchmarks module so this comparison can be rerun at any time.

🤖 Generated with Claude Code

eolivelli and others added 2 commits April 22, 2026 16:47
Keep the old RW-lock + single-array implementation in the benchmarks module
as LegacyDenseIntMap so DenseIntMapConcurrentBenchmark can run legacy vs.
segmented side-by-side in the same JVM under identical conditions.
Eliminates the need for a separate checkout to produce apples-to-apples
comparisons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier segmented rewrite unlocked the write path (8-20x faster under
contention) but regressed pure-read throughput by ~22% because every get()
went through an extra spine-load indirection.

Replace the uniform segmented layout with a two-tier structure:

- base: an AtomicReferenceArray sized from the constructor's
  initialCapacity. Immortal — allocated once, never resized, never copied.
  get() for keys < initialCapacity is a single volatile load + slot read,
  identical to the legacy implementation. compareAndPut is a single CAS +
  AtomicInteger.inc with NO lock (the legacy RW-lock was only there to
  serialise against resize — base never resizes, so no lock is needed).

- overflow: a lazily-allocated segmented tier, only touched for keys at
  or beyond initialCapacity. Segments are immortal once installed; the
  spine grows under a lock that the hot path never takes.

For callers who pass an accurate initialCapacity (e.g. herddb with a
known shard size) every operation stays on the base path:
- Reads: equivalent to legacy (volatile + slot load).
- Writes: strictly faster than legacy (no lock traversal).

Benchmark (pre-sized, initialCapacity = totalKeys = 1M):
  Benchmark      legacy      new         change
  getHot1        37.1M       37.3M       +0.6%   (within noise)
  getHot8        333.9M      345.5M      +3.5%
  casUpdate1     10.4M       13.2M       +27%
  casUpdate8     3.1M        110.8M      ~35x
  insertDense1   72.8M       142.8M      +96%
  insertDense8   3.2M        22.5M       ~7x
  mixed 7R:1W    165.4M      239.4M      +45%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli
Owner Author

Read-path regression eliminated — new design: immortal base + lazy segmented overflow

Per review feedback, redesigned to remove the read-path regression.

Design

  • base: an AtomicReferenceArray<T> sized from the constructor's initialCapacity.
    Immortal — allocated once, never resized, never copied. get() for keys < initialCapacity
    is a single volatile load + slot read, identical cost to legacy. compareAndPut is a
    single CAS + AtomicInteger.inc with no lock (the legacy RW-lock was only there to
    serialise against resize — base never resizes, so no lock is needed).
  • overflow: a lazily-allocated segmented tier, only touched for keys at or beyond
    initialCapacity. Segments are immortal once installed; the spine grows under a lock that
    the hot path never takes.

For callers who pass an accurate initialCapacity (herddb knows its shard size) every
operation stays on the base path.
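
A minimal sketch of the two-tier routing, with illustrative names; remove() and the segmented overflow tier are elided (the overflow tier follows the spine-of-segments shape sketched in the PR description):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative names only, not the actual implementation.
final class TwoTierSketch<T> {
    private final int baseCapacity;
    private final AtomicReferenceArray<T> base;   // immortal: never resized, never copied
    private final AtomicInteger size = new AtomicInteger();

    TwoTierSketch(int initialCapacity) {
        baseCapacity = initialCapacity;
        base = new AtomicReferenceArray<>(initialCapacity);
    }

    public T get(int key) {
        return key < baseCapacity
                ? base.get(key)                   // one volatile load + slot read
                : overflowGet(key);               // lazy segmented tier, lock-free reads
    }

    public boolean compareAndPut(int key, T existing, T value) {
        if (key < baseCapacity) {
            // The legacy RW-lock only serialised against resize; base never
            // resizes, so a bare CAS suffices.
            boolean won = base.compareAndSet(key, existing, value);
            if (won && existing == null) size.incrementAndGet();
            return won;
        }
        return overflowCompareAndPut(key, existing, value);
    }

    private T overflowGet(int key) {
        throw new UnsupportedOperationException("segmented overflow tier elided");
    }

    private boolean overflowCompareAndPut(int key, T existing, T value) {
        throw new UnsupportedOperationException("segmented overflow tier elided");
    }
}
```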

Benchmark — pre-sized case (initialCapacity == totalKeys == 1M)

This is the production workload. No regressions; wins everywhere.

Benchmark      legacy    new       change
getHot1        37.07 M   37.31 M   +0.6% (within noise)
getHot8        333.97 M  345.51 M  +3.5%
casUpdate1     10.37 M   13.17 M   +27%
casUpdate8     3.15 M    110.83 M  ~35x
insertDense1   72.83 M   142.84 M  +96%
insertDense8   3.18 M    22.52 M   ~7x
mixed 7R:1W    165.39 M  239.41 M  +45%

Benchmark — default-capacity case (initialCapacity = 1024, totalKeys = 1M)

Keys beyond 1024 go through the overflow tier, which is lock-free but carries one
more level of indirection than the (eventually-resized) legacy single array. Reads
for those keys pay ~20% relative to legacy's post-resize state, but writes are
still hugely faster:

Benchmark      legacy    new       change
getHot1        37.53 M   28.90 M   −23% (overflow path)
getHot8        339.03 M  267.28 M  −21% (overflow path)
casUpdate1     12.16 M   6.46 M    −47% (overflow path)
casUpdate8     4.38 M    63.56 M   ~14x
insertDense1   20.40 M   110.82 M  ~5.4x
insertDense8   3.42 M    30.86 M   ~9x
mixed 7R:1W    233.59 M  176.88 M  −24%

The overflow-path read regression is the price of not re-introducing the lock. Production
callers should pass an initialCapacity hint to get the best of both worlds — GraphIndexBuilder
has the new constructor overload for exactly this purpose.

All 238 tests in jvector-tests still pass.

🤖 Generated with Claude Code

eolivelli added a commit to eolivelli/herddb that referenced this pull request Apr 22, 2026
## Summary

- Adopt the new `initialCapacity` hint on `GraphIndexBuilder` introduced
by jvector branch
[`reduce-denseintmap-lock-contention`](eolivelli/jvector#2)
(commit `87e3bfff`), which rewrites `DenseIntMap` as a lock-free
spine-of-segments. A herddb lock-profile showed ~92% of lock-wait time
inside that map during concurrent graph build.
- `PersistentVectorStore.createEmptyLiveShard` — pass `cap =
computeEffectiveMaxLiveGraphSize()` as the hint. This is the same bound
already used to pre-size the two `ConcurrentHashMap`s next to the
builder.
- `PersistentVectorStore.writeFusedPQGraphToTempFile` — pass
`totalVectors = allNodeToPk.size()`, the exact node count about to be
inserted in the compaction/merge path.
- CI (`ci.yml` + `kubernetes-tests.yml`) now checks out the new jvector
branch so the 11-arg constructor resolves at compile time. Artifact
version (`4.0.0-rc.9-herddb-SNAPSHOT`) is unchanged, so no pom bump is
required.

Closes #223.

## Test plan

- [x] `mvn -B checkstyle:check apache-rat:check spotbugs:check install
-DskipTests -Pci` (green locally)
- [ ] CI (`ci.yml` + `kubernetes-tests.yml`) runs against the new
jvector branch
- [ ]
`DirectMultipleConcurrentUpdatesSuite{NoIndexes,WithNonUniqueIndexes,WithUniqueIndexes}Test`
(hammer gate for index/checkpoint/concurrency changes)
- [ ] Vector indexing smoke on k3s-local / GKE confirms no lock-profile
regression

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eolivelli merged commit fd2b411 into main Apr 22, 2026
1 of 6 checks passed
