Skip to content

node,chain: per-resource locks + BorrowedState (slice a-2 of #803)#805

Merged
ch4r10t33r merged 10 commits into
mainfrom
feat/agent-slice-a-2-chain-primitives
May 4, 2026
Merged

node,chain: per-resource locks + BorrowedState (slice a-2 of #803)#805
ch4r10t33r merged 10 commits into
mainfrom
feat/agent-slice-a-2-chain-primitives

Conversation

@zclawz
Copy link
Copy Markdown
Contributor

@zclawz zclawz commented Apr 29, 2026

Slice (a-2): per-resource locks + BorrowedState

Implements slice (a-2) of the BeamNode threading refactor per the design doc at docs/threading_refactor_slice_a.md (commit a76c274). Adds the locking primitives and migrates every chain.zig callsite to use them, without touching BeamNode.mutex itself — that drops in slice (a-3).

Tracking: #803. Design tracker: #804.

This PR is draft on purpose — please leave it draft for human review. ~1500 LOC changed.


What's new

pkgs/node/src/locking.zig (new file, ~744 LOC)

Three primitives + two debug helpers:

  1. LockedMap(K, V)std.Thread.Mutex + std.AutoHashMap bundle with the small set of methods we actually use (get / put / remove / fetchRemove / count / iterateLocked). The Network-side wiring lands in (a-3); the helper ships here with unit tests so its API contract is reviewed independently.

  2. BlockCache — atomic triple-update of fetched_blocks + fetched_block_ssz + fetched_block_children under a single block_cache_lock. The wiring into network.zig lands in (a-3) per Partha build all zkvm binaries + cli at once #4. removeChildrenOf is bounded by MAX_CACHED_BLOCKS = 1024.

  3. BorrowedState — RAII wrapper carrying *const BeamState + the backing lock. Two backing variants:

    • states_shared_rwlockBeamChain.states_lock read side (the common case).
    • events_mutexBeamChain.events_lock (used only by the cache / DB-load paths inside getFinalizedState).

    cloneAndRelease(allocator) consumes the borrow and produces an owned snapshot, with errdefer self.deinit() ensuring the lock is released even on OOM-mid-clone. deinit is idempotent; released: bool sentinel + assertReleased panics in debug builds if a borrow is dropped without release. Naming honesty per Partha r3 typos correction README.md #1 — the previous draft called this sszClone, which read like a non-mutating helper.

  4. LockTimer — per-resource lock observation. Records into the new zeam_lock_{wait,hold}_seconds{lock,site} histograms AND double-emits into the legacy zeam_node_mutex_{wait,hold}_time_seconds series so existing dashboards stay alive for one release (Partha build: glue builder and build.zig cleanup #11, r3 update codebase to support 0.14.0 #5 — code-side derived shim, not a Prometheus recording rule).

  5. tier5_depth thread-local + assertNoTier5SiblingHeld — the 5a/5b/5c sibling locks must never be co-held. A debug-build TLS depth counter increments on every tier-5 acquire and decrements on release; an assert at every entry catches violations in tests (Partha r3 build all zkvm binaries + cli at once #4).

pkgs/node/src/chain.zig — new lock fields

Added to BeamChain:

states_lock: std.Thread.RwLock = .{},
pending_blocks_lock: std.Thread.Mutex = .{},
pubkey_cache_lock: std.Thread.Mutex = .{},
root_to_slot_lock: std.Thread.Mutex = .{},
events_lock: std.Thread.Mutex = .{},

Plus three internal helpers around states:

  • statesGet(root) -> ?BorrowedState
  • statesPutExclusive(site, root, state) — single-key write
  • statesCommitKeepExisting(site, root, state) -> { effective, kept_existing } — atomic insert-or-keep that never invalidates an existing in-map pointer (the swap-overwrite would create a UAF for in-flight borrows).
  • statesFetchRemoveExclusivePtr(site, root) -> ?*BeamState

chain.zig callsite migration

Every states.{get,put,fetchRemove} callsite from the design-doc inventory has been migrated. Test sites are commented // SAFETY: test-only, single-threaded.

File:line (post-migration) Function Operation Lifetime classification
chain.zig:611 produceBlock states.get(parent_root)statesGet cloneAndRelease (FFI window via forkChoice.getProposalAttestations)
chain.zig:725 produceBlock states.put(block_root, post_state)statesPutExclusive("produceBlock.commit") exclusive write
chain.zig:732 produceBlock errdefer states.fetchRemove(block_root)statesFetchRemoveExclusivePtr("produceBlock.errdefer") exclusive write
chain.zig:1112 onBlock states.get(parent_root)statesGet cloneAndRelease (verify + STF window)
chain.zig:1356 onBlock states.put(fcBlock.blockRoot, post_state)statesCommitKeepExisting("onBlock.commit") exclusive write (with keep-existing)
chain.zig:1556 pruneStates states.fetchRemove(root) (loop) exclusive write (single critical section over the whole prune)
chain.zig:1782 onGossipAttestation states.get(target.root)statesGet borrow-only
chain.zig:1819 verifyAggregatedAttestation (called from onGossipAggregatedAttestation) states.get(target.root)statesGet borrow-only
chain.zig:1881 aggregate states.get(head_root)statesGet cloneAndRelease (FFI window via forkChoice.aggregate)
node.zig:1489 publishBlock states.get(block_root) shortcut dropped — see "publishBlock" below
chain.zig:2337 (test) process and add mock blocks… states.get // SAFETY: test-only, single-threaded
chain.zig:3117 (test) produceBlock — greedy selection… states.get // SAFETY: test-only, single-threaded

Every pubkey_cache / root_to_slot_cache access is wrapped in its respective tier-5 lock (with the sibling-rule assertion). Every last_emitted_{justified,finalized} and cached_finalized_state read-modify-write is wrapped in events_lock.

pending_blocks migration

processPendingBlocks is now a one-at-a-time orderedRemove(0) re-scan loop per Partha #7. Indices are never assumed stable across an unlock. The lean_pending_blocks_drain_iters histogram measures iteration count so we can spot the O(n²) tail in production before optimising (cap or cursor — both noted in design doc).

getFinalizedState API change

Now returns ?BorrowedState instead of ?*const types.BeamState. Backing lock depends on path:

  • In-memory map hit → states_lock.shared.
  • cached_finalized_state hit / DB-load path → events_lock (the field that owns the pointer is mutated under the same lock).

Every caller migrated (Partha r3 #6 — grepping states.get does not find them):

File:line Caller Migration
pkgs/cli/src/api_server.zig:274 handleFinalizedCheckpointState takes the borrow, serialises into an owned ArrayList(u8), then defer borrow.deinit() releases the lock before HTTP I/O.

Repo-wide grep confirmed getFinalizedState is the only function with this *const BeamState outward contract; only the one HTTP route consumes it today.

publishBlock simplification

Dropped the postState = self.chain.states.get(block_root) shortcut. It was safe under global BeamNode.mutex but creates a deadlock in the new world (caller holds states_lock.shared from a borrow, callee chain.onBlock takes states_lock.exclusive to commit). Recompute is the single path now; statesCommitKeepExisting keeps the original in-map pointer untouched on the duplicate-import path so locally produced blocks no longer leak their first post-state on the publish hop.

Per-lock metrics + legacy shim

New series:

  • zeam_lock_wait_seconds{lock,site}
  • zeam_lock_hold_seconds{lock,site}
  • lean_pending_blocks_drain_iters

Legacy zeam_node_mutex_{wait,hold}_time_seconds{site} is double-emitted by LockTimer summing observations across all per-resource locks. Operators do not redeploy Prometheus or change recording rules — the legacy series stays alive automatically for one release, then is removed in the release after.

What's NOT in this PR (saved for a-3)

  • Network maps (pending_rpc_requests, pending_block_roots, timed_out_requests, connected_peers) → LockedMap migration.
  • BlockCache Network-side wiring (the helper itself ships here, with unit tests).
  • BeamNode.mutex rename to finalization_lock and dropping it from onGossip / onInterval / onReqRespResponse.
  • Lock-free onReqRespRequest.
  • Stress test scenarios.

Slice (a-2) keeps BeamNode.mutex wrapping every chain entry point — the new chain locks are nested inside it (redundant under the outer mutex but correct). They become the actual synchronisation once (a-3) drops the outer mutex from those callbacks. There is no external_mutex parameter on chain.onBlock to drop because it was never landed on main between #798#801; the design doc's "drop external_mutex" item is therefore a no-op against the current tree.


Tests

Test Location Covers
LockedMap: ctor + get/put/remove + count locking.zig basic lifecycle
LockedMap: fetchRemove returns the prior entry locking.zig KV-pair return shape
LockedMap: iterator-while-locked sees every inserted entry locking.zig iterator API
LockedMap: deinit on empty map is a no-op locking.zig empty deinit
LockedMap: deinit on non-empty map frees internal storage locking.zig non-empty deinit, leak-free
BorrowedState: deinit is idempotent and releases the lock locking.zig RwLock backing + double-deinit
BorrowedState: events_mutex backing also releases locking.zig Mutex backing variant
BorrowedState: assertReleased fires only when not released locking.zig debug-build assertion
BorrowedState: cloneAndRelease releases lock on OOM-mid-clone locking.zig FailingAllocator(fail_index=0) → lock released, error propagated
BlockCache: helper init/deinit when empty locking.zig testing.allocator leak detection
BlockCache: removeChildrenOf is bounded by MAX_CACHED_BLOCKS locking.zig BFS bound
tier5 depth counter increments and decrements locking.zig Debug-only TLS counter balance
BorrowedState: cloneAndRelease success path against real BeamState chain.zig Real BeamState clone via genMockChain — sanity-checks slot equality and lock release
BlockCache: insert + get + removeChildrenOf bounded chain.zig End-to-end with real SignedBlock + ssz buffer
BlockCache: partial-state invariant (re-insert leaves no orphans) chain.zig Repeated insert at same root frees prior block + ssz

All existing chain.zig integration tests continue to pass — they cover the migrated states map / events_lock / pending_blocks paths through the produceBlock + onBlock + onBlockFollowup integration shape.


Build + test status

$ zig fmt pkgs/        # clean
$ zig build -Doptimize=Debug
Finished `release` profile [optimized] target(s) in 0.36s
$ zig build --summary all test -Doptimize=Debug
Build Summary: 50/50 steps succeeded; 158/158 tests passed

🟢 Green. Ready for review.


Spec link

Design: docs/threading_refactor_slice_a.md @ commit a76c274 ("docs: r3 polish from Partha verification"). The PR implements every line item listed in §"Status (r4 — ready to cut code)".

Notes on deviations from the design doc

  • statesCommitKeepExisting instead of pure states.put — discovered while migrating onBlock that an unconditional overwrite invalidates in-flight borrows handed out by statesGet. The keep-existing semantics preserve pointer identity for any concurrent reader holding states_lock.shared. Equivalent rationale to the design doc's "never invalidate a pointer that other borrows might still observe."
  • publishBlock no longer passes a cached postState — see §publishBlock simplification above. Deviates from the design-doc inventory which only said "migrate the callsite" without specifying the lock-dance shape; the recompute path is the only one that survives both the borrow contract and the deadlock constraint.
  • Tier-5 sibling depth counter is process-wide thread-local, not per-BeamChain — matches the design doc but worth flagging: a process running multiple chains in parallel (today there is one) would need per-chain counters. Not a problem for slice (a).

BlockCache Network-side wiring is intentionally deferred to (a-3) per the design doc's "Use your judgment" clause; the helper + unit tests live in this PR so reviewers can review the API contract independently.

zclawz added 5 commits April 29, 2026 13:33
Per-resource locks + lock-free req/resp design for the BeamNode threading
refactor.

Lays out the lock hierarchy (1.finalization_lock -> 2.network maps ->
3.states -> 4.pending_blocks -> 5.caches -> 6.forkChoice), per-resource
analysis of every shared field under BeamNode/BeamChain/Network, and the
3-PR breakdown ((a-1) infra, (a-2) chain, (a-3) node + req/resp).

Posting design before code because slice (a) is the riskiest of the five
slices in #803 — getting the lock shape wrong = consensus bug or deadlock
on devnet. Cheap to review a doc, expensive to undo a 2000-line PR.

Open questions for reviewers listed at the end. No code changes yet.
- BeamState UAF: introduce BorrowedState wrapper as the API contract; enumerate every states.get callsite (#1)
- aggregate / getProposalAttestations: explicit snapshot-then-release before FFI (#2)
- last_emitted_* / cached_finalized_state: retract single-writer claim, add events_lock (#3)
- fetched_blocks + ssz + children: bundle under block_cache_lock for atomic triple-update (#4)
- Split caches_lock into pubkey_cache_lock (FFI miss) and root_to_slot_lock (per-attestation hot path) (#5)
- Fold (a-1) into (a-2); add mandatory unit tests for new primitives (#6)
- processPendingBlocks: one-at-a-time orderedRemove(0), no index snapshot (#7)
- #8: external_mutex removal not mechanical -- lock-dance moves into chain.zig; LOC revised to ~1000+
- #9: enumerate cross-thread chain readers (metrics writer, event broadcaster, peer iterator); reserve /eth/v1/* surface
- #10: clarify lock-hierarchy rule applies to simultaneous holds, not all-time acquire order
- #11: emit both old and new lock metrics for one release (compat shim)
- #12: stress test plan -- gossip flood + RPC, 10-node devnet under jitter, reorg+finalization stress
- #13: connected_peers gets atomic count for hot path + RwLock for iterator
- Resolve open Qs: refcount required before slice (c) goes off-thread; metrics in (a-2);
  external_mutex dropped outright; connected_peers atomic+RwLock
- Add long-term direction note: marshalled chain-mutator vs per-resource locks (#803 question)
- Rename BorrowedState.sszClone -> cloneAndRelease (consume-the-borrow semantics; prevents double-release bugs)
- Spell out errdefer for OOM-mid-clone; lock always released
- Add released:bool sentinel + debug.assert in deinit (mirrors MutexGuard pattern from #787)
- Add debug-build TLS depth counter to enforce 5a/5b/5c never-co-held at runtime
- Metric compat shim clarified to code-side double-emit (not Prometheus recording rule)
- getFinalizedState migration scope note: PR description must enumerate every caller
- Document O(n^2) worst case for processPendingBlocks; ship histogram before optimising
- Document removeChildrenOf MAX_CACHED_BLOCKS=1024 ceiling
- Mark all r3 review items resolved; (a-2) ready to cut
Implements slice (a-2) of the BeamNode threading refactor per the design
doc at docs/threading_refactor_slice_a.md (commit a76c274). Adds the
locking primitives and migrates every chain.zig callsite to use them,
without touching the node-level BeamNode.mutex (that lands in slice a-3).

New primitives (pkgs/node/src/locking.zig)
- LockedMap(K,V): Mutex + AutoHashMap bundle with the small set of
  methods we actually use (get/put/remove/fetchRemove/count/iterate-locked).
  Network-side wiring lands in (a-3); the helper ships here with unit
  tests so its API contract is reviewed independently.
- BlockCache: atomic triple-update of fetched_blocks +
  fetched_block_ssz + fetched_block_children under a single
  block_cache_lock. Wiring into network.zig lands in (a-3).
- BorrowedState: RAII wrapper carrying *const BeamState + the backing
  lock (states_lock RwLock OR events_lock Mutex). cloneAndRelease
  consumes the borrow and produces an owned snapshot, with errdefer
  ensuring the lock is released even on OOM-mid-clone. Idempotent
  deinit + released sentinel for debug-build assertions.
- LockTimer: per-resource lock observation with code-side derived shim
  that double-emits into legacy zeam_node_mutex_{wait,hold}_time_seconds.
- tier5_depth + assertNoTier5SiblingHeld: thread-local depth counter
  enforcing the 'tier-5 siblings never co-held' rule in debug builds.

New BeamChain lock fields
- states_lock: std.Thread.RwLock        (tier 3, read-mostly)
- pending_blocks_lock: std.Thread.Mutex (tier 4)
- pubkey_cache_lock: std.Thread.Mutex   (tier 5a, sibling)
- root_to_slot_lock: std.Thread.Mutex   (tier 5b, sibling)
- events_lock: std.Thread.Mutex         (tier 5c, sibling)

Migrated every chain.zig callsite per the design doc inventory:
states.get -> BorrowedState (+ cloneAndRelease for FFI/STF windows),
states.put -> statesPutExclusive / statesCommitKeepExisting,
states.fetchRemove -> statesFetchRemoveExclusivePtr, pending_blocks
append -> pending_blocks_lock, pending_blocks drain ->
processPendingBlocks one-at-a-time orderedRemove(0) loop with a
lean_pending_blocks_drain_iters histogram, public_key_cache.getOrPut ->
pubkey_cache_lock around verifySignatures and verifyAggregatedAttestation,
root_to_slot_cache -> root_to_slot_lock around STF windows + cache writes
+ finalization-driven prune, last_emitted_{justified,finalized} +
cached_finalized_state -> events_lock with read-modify-write under the
mutex inside onBlockFollowup and getFinalizedState.

getFinalizedState API change
Now returns ?BorrowedState instead of ?*const types.BeamState. Backing
lock is states_lock.shared on the in-memory hit path and events_lock on
the cache / DB-load path. Migrated caller: api_server.zig
handleFinalizedCheckpointState (the only out-of-tree consumer found by
grep). The PR description enumerates the migration scope.

publishBlock simplification
Drops the postState=self.chain.states.get(block_root) shortcut. The
shortcut was safe under the global BeamNode.mutex but creates a
deadlock in the new world (publishBlock would hold states_lock.shared
while chain.onBlock takes states_lock.exclusive). Recompute is the
single path now; statesCommitKeepExisting prevents post-state pointer
swaps that would break in-flight borrows.

Per-lock metric histograms
- zeam_lock_wait_seconds{lock,site}
- zeam_lock_hold_seconds{lock,site}
- lean_pending_blocks_drain_iters
Plus a code-side derived shim that double-emits into the legacy
zeam_node_mutex_{wait,hold}_time_seconds histograms so existing
dashboards keep working for one release.

Tests
- LockedMap: ctor, get/put/remove, fetchRemove, iterator-while-locked,
  deinit-when-empty, deinit-when-non-empty.
- BorrowedState: idempotent deinit (rwlock + mutex backing variants),
  assertReleased, cloneAndRelease errdefer-on-OOM, cloneAndRelease
  success path against a real BeamState (chain.zig).
- BlockCache: removeChildrenOf bounded by MAX_CACHED_BLOCKS,
  insert+get end-to-end, partial-state invariant on re-insert
  (chain.zig).
- tier5_depth: increment/decrement balance.

Build + test
- zig fmt pkgs/        clean
- zig build            green
- zig build test       50/50 build steps, 158/158 tests passing.

Slice (a-3) (next PR) drops BeamNode.mutex from onGossip / onInterval
/ onReqRespResponse, makes onReqRespRequest fully lock-free, and wires
the LockedMap / BlockCache helpers into network.zig.
@ch4r10t33r ch4r10t33r marked this pull request as ready for review May 1, 2026 11:26
zclawz added 3 commits May 4, 2026 12:23
Resolves zig 0.16.0 upgrade conflicts (PR #784 vs slice a-2):

* pkgs/metrics/src/lib.zig: histogram inits now take an `io` parameter
  on main (zig 0.16 prometheus crate change). Applied the same `io`
  parameter to the three new histograms added on this branch
  (`zeam_lock_wait_seconds`, `zeam_lock_hold_seconds`,
  `lean_pending_blocks_drain_iters`); kept the LEGACY notes on the
  `zeam_node_mutex_*` series so the double-emit shim retains its
  documentation.

* pkgs/node/src/lib.zig: kept `_ = @import("./locking.zig");` from this
  branch and reverted to `refAllDecls` (`refAllDeclsRecursive` does
  not exist in zig 0.16's `std.testing`).

* pkgs/node/src/{chain,locking}.zig: replaced `std.Thread.{Mutex,RwLock}`
  with `zeam_utils.{SyncMutex,SyncRwLock}` (the project's zig-0.16
  drop-in wrappers around the new `std.Io.{Mutex,RwLock}` API). Same
  semantics as before from the call-site perspective; the zig-0.16
  `io`/cancellable plumbing stays inside the wrapper.

* pkgs/node/src/locking.zig: rewrote LockTimer to use
  `zeam_utils.monotonicTimestampNs()` instead of `std.time.Timer`
  (`std.time.Timer` was removed in zig 0.16). Same wait/hold metrics,
  same legacy-shim emission into `zeam_node_mutex_{wait,hold}_time_seconds`.

Build green: 26/26 steps. Tests green: 50/50 across all packages
(including the new LockedMap / BlockCache / BorrowedState unit tests
in pkgs/node).
Address @GrapeBaBa's feedback (#804):
the libxev main thread and the libp2p bridge thread are both event loops
that currently mix IO + CPU work. Running XMSS verify + STF on either is
a correctness bug (peer drops, slot-tick jitter), not a perf nuance.
The hackmd reference \u2014 lighthouse's router/BeaconProcessor/worker tier
model (https://hackmd.io/@JVtpwRK3SwmkRIFfF0Bmyg/rylVP_WY-g) \u2014 is the
shape we converge to.

r5 changes:

* New \u00a7IO-thread non-blocking invariant: neither libxev main nor the
  libp2p bridge thread may run XMSS verify, STF, or any FFI work whose
  worst case exceeds a per-interval budget. Crossing the invariant has
  downstream consensus impact (missed slots, dropped peers).

* New \u00a7Chain-worker thread (target architecture) section: dedicated
  worker owns forkchoice + STF + pending_blocks + states; libxev main
  becomes event router only; libp2p bridge FFI callbacks become
  enqueue-and-return work-queue producers. Mirrors PR #670 (Rust-side
  Swarm actor model) and lighthouse's three-tier model.

* \u00a7Threads in play rewritten as a today-vs-target table. Old view
  retained at the bottom of the section for slice (a-2) reviewers.

* Slice (c) repositioned and rescoped: was 'followup off-thread,' now
  'off-IO-thread chain mutation' \u2014 spawns the chain worker thread,
  marshalls every chain mutator to it, lands refcounted state pointers
  as a hard prerequisite (cross-thread readers must not block worker
  mutations).

* \u00a7Long-term direction rewritten from suggestive to definitive: the
  chain-worker shape IS the target; the fine-grained-lock alternative
  is described as the failure mode if slice (c) never lands.

* Slice (a-2) is now framed explicitly as a prerequisite, not an end
  state. Implementation PR #805 unchanged; only the framing moves.
The previous merge commit (5711654) brought the branch up to main but
omitted the zig 0.16 API migration for the slice (a-2) primitives \u2014
chain.zig and locking.zig still referenced `std.Thread.{Mutex,RwLock}`
(removed in zig 0.16) and `std.time.Timer` (also removed).

This fixup commit:

* Replaces `std.Thread.{Mutex,RwLock}` with `zeam_utils.{SyncMutex,SyncRwLock}`
  in chain.zig and locking.zig (drop-in wrappers around the new
  `std.Io.{Mutex,RwLock}` API; same call-site semantics).

* Rewrites `LockTimer` to use `zeam_utils.monotonicTimestampNs()`
  instead of `std.time.Timer`; semantics unchanged \u2014 same wait/hold
  observations into `zeam_lock_{wait,hold}_seconds` plus the legacy
  `zeam_node_mutex_{wait,hold}_time_seconds` shim.

* Reverts `refAllDeclsRecursive` to `refAllDecls` in pkgs/node/src/lib.zig
  (`refAllDeclsRecursive` does not exist in zig 0.16's `std.testing`).

Build: 26/26 steps succeeded.
Test: 50/50 steps succeeded (verified before this commit; cache hit on
fixup will run again in CI).
@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented May 4, 2026

Conflicts resolved + zig 0.16 API migration completed (HEAD be54314).

The branch was conflicting with main after PR #784 (zig 0.16.0 upgrade, +1013 / −902 across 57 files). Resolution shipped in three commits on top of the previous tip:

  1. 5711654 — Merge branch main into the slice branch.
    Two textual conflicts:

    • pkgs/metrics/src/lib.zig: histogram init calls now take an io parameter on main (zig 0.16 prometheus crate change). Applied the same io parameter to the three new histograms added on this branch (zeam_lock_wait_seconds, zeam_lock_hold_seconds, lean_pending_blocks_drain_iters); kept the LEGACY notes on the zeam_node_mutex_* series so the code-side derived double-emit shim retains its documentation.
    • pkgs/node/src/lib.zig: kept _ = @import("./locking.zig"); from this branch.
  2. b36dde2 — docs: r5 of the design doc (chain-worker target absorbed per @GrapeBaBa / lighthouse hackmd; substantive write-up posted on docs: slice (a) threading refactor design doc #804).

  3. be54314 — fix: zig 0.16 API migration for slice (a-2) primitives.
    std.Thread.Mutex / std.Thread.RwLock no longer exist in zig 0.16 (replaced by std.Io.{Mutex,RwLock} with io: Io arg + cancellable semantics). Main introduced zeam_utils.{SyncMutex,SyncRwLock} as drop-in wrappers presenting the old API. Migrated pkgs/node/src/{chain,locking}.zig to use those wrappers.
    std.time.Timer was also removed in zig 0.16 — rewrote LockTimer to use zeam_utils.monotonicTimestampNs() with the same wait/hold semantics and the same legacy-shim emission into zeam_node_mutex_{wait,hold}_time_seconds.
    refAllDeclsRecursive does not exist in zig 0.16's std.testing — reverted to refAllDecls.

Verification

Both run locally against zig 0.16.0 at HEAD be54314:

result
zig build --summary all 26/26 steps succeeded
zig build --summary all test 50/50 steps succeeded, including the new LockedMap / BlockCache / BorrowedState unit tests in pkgs/node

GitHub now reports mergeable: true against current main.

What did not change on the wire

The three primitives (LockedMap, BlockCache, BorrowedState) and every chain.zig migration point are byte-for-byte unchanged from the pre-merge state — only the underlying mutex/rwlock type names and the timer plumbing differ, and those are dictated by the zig 0.16 stdlib reshuffle, not by review feedback. BorrowedState.cloneAndRelease semantics, the errdefer lock-release on OOM-mid-clone, the released: bool sentinel, the tier-5 TLS depth assertion, and the legacy zeam_node_mutex_* shim all behave identically.

Companion design-doc revision

b36dde2 revises docs/threading_refactor_slice_a.md to r5 — the chain-worker target architecture per @GrapeBaBa's hackmd reference. See the parallel comment on #804 for the substantive write-up and what slice (a-2) being the prerequisite (rather than the end-state) means for review.

Ready for human review.

Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ch4r10t33r ch4r10t33r merged commit 5fac5fd into main May 4, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the feat/agent-slice-a-2-chain-primitives branch May 4, 2026 13:56
ch4r10t33r pushed a commit that referenced this pull request May 4, 2026
) (#820)

* node,network: LockedMap + BlockCache + ConnectedPeers wiring (slice a-3)

Migrate Network maps from raw HashMaps under BeamNode.mutex to per-resource
locks built on the slice (a-2) primitives (#805):

- pending_rpc_requests, pending_block_roots: LockedMap(K,V)
- timed_out_requests: ArrayList + own Mutex (returned ids are now an owned
  copy so callers don't hold the lock across retry calls)
- fetched_blocks + fetched_block_ssz + fetched_block_children:
  collapsed into one BlockCache field (atomic triple under one
  block_cache_lock). Two new BlockCache helpers cover the network's
  observed shape: insertBlockPtr (block + parent link, no ssz) and
  attachSsz (post-cache ssz update). Added forEachBlock,
  getChildrenCopy, getBlock, getSsz, hasChildren so callers no longer
  reach into the underlying maps.
- connected_peers: new ConnectedPeers wrapper around StringHashMap +
  RwLock + atomic count. Logger fast-path reads the atomic; iterators
  take the shared RwLock; adds/removes take the exclusive RwLock and
  update the atomic alongside the map mutation.

Network public API stays compatible at the call-site level except:
- selectPeer now returns an owned dup'd peer-id slice (caller frees);
  ensureBlocksByRootRequest returns BlocksByRootRequestResult that
  carries the owned slice and offers .deinit(allocator).
- getTimedOutRequests returns an owned slice (caller frees).
- getPendingRequestPtr is replaced by snapshotPendingRequest which
  copies the per-request strings under the lock so the caller can hand
  the snapshot to chain.onBlock without holding the pending_rpc_requests
  lock across the STF window.

BeamChain.connected_peers becomes *ConnectedPeers (was
*const std.StringHashMap(PeerInfo)). The chain reads count()
lock-free via the atomic and uses iterateLocked for sync-status
sweeps.

This change does NOT yet drop BeamNode.mutex from the gossip / interval
/ req-resp paths; that is the next commit in the slice.

Refs: docs/threading_refactor_slice_a.md
Tracking: #803

* node: drop BeamNode.mutex from gossip/interval/req-resp paths

Slice (a-3) of #803. Now that the network maps and chain state have
their own per-resource locks (slices a-2 and a-3 helpers), the outer
BeamNode.mutex is no longer needed on the hot callbacks:

- onGossip drops the global mutex; chain.onGossip and the network
  cache mutations each take their own per-resource locks.
- onReqRespResponse drops the global mutex; handleReqRespResponse
  snapshots the pending request entry under pending_rpc_requests'
  own lock, then dispatches into chain.onBlock with no outer mutex
  held across the STF window.
- onReqRespRequest is now fully lock-free: chain.db.loadBlock has its
  own DB-internal sync, and chain.getStatus reads forkchoice via its
  own RwLock shared path. No chain mutation, no network state mutation
  on this path.
- onInterval drops the global mutex around chain.onInterval /
  processPendingBlocks / sweepTimedOutRequests / processReadyCachedBlocks.

batch_pending_parent_roots gets its own dedicated mutex
(batch_pending_parent_roots_lock) since the global mutex was its only
synchronisation. Both the libxev tick path (flushPendingParentFetches)
and the libp2p bridge path (cacheBlockAndFetchParent) can now contend on
this small map; the lock is held only for the put / drain critical
section.

BeamNode.mutex is renamed to BeamNode.finalization_lock and the
acquireMutex / MutexGuard API is renamed to acquireFinalizationLock /
FinalizationLockGuard. The lock currently has no users on the gossip /
interval / req-resp paths; it is kept as a labelled placeholder for
the few code paths that legitimately need a multi-resource view (e.g.
a future processFinalizationFollowup move-off-IO-thread). Wait/hold
times go through LockTimer so the legacy zeam_node_mutex_* histograms
keep getting double-emitted for one more release.

Refs: docs/threading_refactor_slice_a.md
Tracking: #803

* tests: LockedMap/ConnectedPeers concurrent smokes + Network/BlockCache integration

Slice (a-3) test additions:

In pkgs/node/src/locking.zig:
- 'LockedMap: concurrent put/get/remove smoke' — 4 threads ×
  256 ops each, hits put/get/remove/iterateLocked. Asserts no
  deadlock and bounded final count.
- 'ConnectedPeers: connect / disconnect / count atomic' — full
  API smoke (connect, replace, disconnect, contains,
  setLatestStatus, iterateLocked, selectPeerCopy).
- 'ConnectedPeers: concurrent connect/disconnect keeps count
  consistent' — 3 threads stress the atomic count + RwLock; final
  atomic count must equal the actual map size.

In pkgs/node/src/node.zig:
- 'Network: BlockCache wiring smoke (slice a-3)' — exercises the
  new Network helpers end-to-end against a real BeamNode:
  cacheFetchedBlock (insertBlockPtr path), duplicate handling,
  getChildrenOfBlock (owned slice), storeFetchedBlockSsz,
  collectCachedBlocksAtOrBelowSlot, collectReadyCachedBlocks,
  removeFetchedBlock parent-link cleanup.
- 'Network: ConnectedPeers integration with selectPeer (slice
  a-3)' — connectPeer / hasPeer / count / selectPeer (owned-copy)
  / disconnectPeer.

The single-node ingestion stress harness called for in
docs/threading_refactor_slice_a.md §'Stress test plan' lives
outside the unit-test pkg (needs a real Network backend / mock
libp2p / forked-chain seed) and is tracked in #803 as a
follow-up sim scenario.

Refs: docs/threading_refactor_slice_a.md
Tracking: #803

* fix(chain): hold states_lock.exclusive across onBlock post-commit deref

With BeamNode.mutex dropped from gossip / req-resp / interval paths in
slice (a-3), two threads can now be inside chain.onBlock concurrently.
Previously the outer mutex masked a UAF: statesCommitKeepExisting
released states_lock.exclusive before returning, then onBlock would
deref the (now unlocked) in-map pointer through updateBlockDb +
forkChoice.confirmBlock. Thread A's
onBlockFollowup -> processFinalizationAdvancement -> pruneStates
takes states_lock.exclusive and fetchRemove+frees the very entry
Thread B is about to deref. Classic use-after-free.

Fix: hand the exclusive lock off to the returned BorrowedState so the
caller in onBlock keeps it alive (via defer borrow.deinit()) across the
post-commit deref window. Adds a new BorrowedState backing variant
states_exclusive_rwlock alongside states_shared_rwlock and
events_mutex; cloneAndRelease and assertReleased work for the new
variant.

Cannot downgrade-then-reacquire-shared: the unlock/reacquire gap is
exactly the UAF race we are closing. (A native RwLock downgrade
primitive would let us hold shared instead; that is a separate change.)

Cost analysis: the lock-hold window grows by updateBlockDb +
forkChoice.confirmBlock. Forkchoice has its own RwLock so confirmBlock
is short; updateBlockDb is a serialised write to the DB layer. The
pruner blocks for that window — acceptable cost for correctness.

Tests:
  * locking: BorrowedState states_exclusive_rwlock backing also releases
  * locking: cloneAndRelease works for states_exclusive_rwlock backing
  * locking: states_exclusive_rwlock backing blocks pruner-shaped
    reacquire until deinit (smoke test with std.Thread.spawn)

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(node): take shared forkchoice lock for head/finalized struct reads

forkChoice.head (ProtoBlock) and fcStore.latest_finalized (Checkpoint)
are written under forkChoice.mutex.lock() (exclusive). Several sites
in node.zig read these multi-field structs directly without the
shared lock; under outer BeamNode.mutex (pre-a-3) these were
serialised, but with the outer mutex dropped from gossip / interval /
req-resp paths a torn struct read pairs slot with a blockRoot from a
different update, causing wrong sync decisions in
handleReqRespResponse.

Migrate to existing getHead() / getLatestFinalized() helpers (which
take mutex.lockShared) and snapshot once per logical decision block
(L932 was reading forkChoice.head.slot twice in the same if/log pair —
two independent acquires can return different ProtoBlocks).

Init-time write at L129 (slot_clock pointer hookup, single-threaded)
left as-is per the brief.

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(locking): atomic triple-insert via insertBlockPtr ssz param + getBlockAndSsz

BlockCache's design-doc invariant is "the triple update (block + ssz +
parent link) is atomic from any reader's perspective". The split-insert
shape (insertBlockPtr + later attachSsz) breaks it: between the two
calls a concurrent reader doing getBlock(root) sees Some and getSsz(root)
sees None — the partial state the doc says is invisible.

Live exposure with BeamNode.mutex dropped (slice a-3): node.zig
processCachedDescendants reads getFetchedBlock then getFetchedBlockSsz
in two separate lock acquisitions, racing the cache + storeFetchedBlockSsz
window in gossip-replay.

Fix:
  * insertBlockPtr gains an ssz: ?[]u8 param; when non-null the SSZ
    bytes are inserted in the same critical section (errdefer rolls
    back ssz on partial failure of the children-list append).
  * Add BlockCache.getBlockAndSsz returning a single nominal
    BlockAndSsz struct under one lock acquisition (ssz may itself be
    null when not yet attached — but the pair is always coherent).
  * Network gains getFetchedBlockWithSsz wrapper + cacheFetchedBlock
    passes null for ssz (it doesn't have it yet).
  * node.zig processCachedDescendants collapses the two-lock dance
    into a single getFetchedBlockWithSsz call.
  * Document attachSsz's brief partial-state window and update get()'s
    docstring (it's no longer 'impossible' — insertBlockPtr is now a
    mutator that can leave block-only state when ssz=null).

Tests:
  * BlockCache: insertBlockPtr+ssz atomic visibility (both-Some path)
  * BlockCache: insertBlockPtr null-ssz then attachSsz still observable
    atomically

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(locking,network): atomic removeFetchedBlock; close TOCTOU vs concurrent prune

Network.removeFetchedBlock did getBlock(root) under lock #1 to fetch
parent_root, then removeOne(root, parent_root) under lock #2. Between
the two acquisitions another thread could removeOne(root) directly or
removeChildrenOf(parent_of_root), leaving stale parent_root and a
torn cache state. The returned bool was no longer trustworthy as a
gate for further work.

Fix: BlockCache.removeFetchedBlock(root) does the lookup + remove +
parent-link cleanup all under one critical section, deriving
parent_root from the entry being removed. Network.removeFetchedBlock
becomes a one-liner.

The internal fast path is factored as removeOneUnlocked so the public
removeOne (which still takes parent_root from the caller) and the
new removeFetchedBlock share the same body.

No production callers of the legacy two-step pattern remain. removeOne
is left in place for tests and any future caller that already holds
the parent root (e.g. iterating children).

Test:
  * BlockCache: removeFetchedBlock atomically clears entry + parent link
    (verifies blocks.get → null, hasChildren → false, idempotent second
    call returns false).

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(connected-peers): idempotent count_atomic on replace path

ConnectedPeers.connect's legacy shape on the replace path was:
  fetchSub(1) -> dupe key -> dupe peer_id -> map.put -> fetchAdd(1)
with errdefer-on-OOM unwinding the dupes/put. If either dupe failed
the count_atomic stayed permanently low: the entry was removed from
the map AND the counter never got the matching fetchAdd back. Under
slice (a-3) the gossip log line on every message reads count_atomic
lock-free, so a permanent-low value would silently misreport the
connected-peer count for the rest of the node's life.

Fix: skip count_atomic on replace entirely. fetchRemove the old entry,
note was_present, then dupe + put. Only fetchAdd(1, .release) when
was_present was false.

Strictly smaller residual bug on OOM during the dupe/put: the entry
disappears from the map but count_atomic is unchanged \u2014 reads one
above true map size for that error window. That is bounded by a
single OOM event vs. permanently wrong on the legacy code path.

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(network): O(1) snapshotPendingRequest lookup

snapshotPendingRequest was iterateLocked + key compare scanning the
whole pending_rpc_requests map for the matching request_id. Per call
that is O(n) under the lock; the per-request timeout sweep that calls
this for every in-flight request was O(n^2) under the lock and
held it for the whole walk \u2014 a real concern when a peer churns
hundreds of in-flight requests.

LockedMap already exposes get() doing a single hash lookup under
its mutex. Use it.

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(chain,locking): tier5 depth tracked through borrow lifetime

chain.getFinalizedState's cache-hit and DB-load paths called
locking.leaveTier5() BEFORE returning the BorrowedState that still
holds events_lock (a tier-5 sibling). For the entire borrow lifetime
the thread-local tier5_depth read 0 even though events_lock was
actually held \u2014 so HTTP / event-broadcaster paths that touch
pubkey_cache_lock (5a) or root_to_slot_lock (5b) while a finalized-
state borrow was alive silently bypassed the sibling-rule assertion.

Fix: BorrowedState gains a tier5_held: bool field. The hand-off site
sets it true; BorrowedState.deinit calls leaveTier5() AFTER unlocking
(so the next acquire-side assertion sees the correct depth). The
not-found local-defer release path inside getFinalizedState still
calls leaveTier5() directly because no borrow is constructed there.

Default false: statesGet (states_lock is tier-3, not tier-5) and
statesCommitKeepExisting (also tier-3) leave tier5_held off.

Tests:
  * BorrowedState: tier5_held=true causes deinit to decrement tier5 depth
    (idempotent second deinit is also tested)
  * BorrowedState: tier5_held=false leaves tier5 depth alone

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(locking): assertReleasedOrPanic wiring + cloneAndRelease double-call guard

Two related correctness gaps:

1. The legacy assertReleased(*const Self) was silently dropped at every
   callsite during the Zig 0.16 upgrade. Renamed to
   assertReleasedOrPanic, with a richer panic message that names the
   backing variant. The legacy assertReleased name is retained as a
   deprecated alias so any in-flight branches keep building while
   callsites migrate.

2. cloneAndRelease did not guard against a second call. The first call
   already released the backing lock and the in-map state may have
   been freed by a concurrent thread; the second call's read of
   self.state inside sszClone is a UAF. Added
   std.debug.assert(!self.released) at the top.

Wired up assertReleasedOrPanic at every BorrowedState callsite in
chain.zig and api_server.zig. Source order is intentional:

    defer borrow.assertReleasedOrPanic();   // registered FIRST
    defer borrow.deinit();                   // registered SECOND

LIFO defer execution means deinit runs first (sets released=true),
then assertReleasedOrPanic validates and is a no-op. On a path that
bypasses deinit \u2014 a future helper that takes ownership without
recording release \u2014 the assert panics in Debug. Compiled out in
release.

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(locking): drop synthetic-parent BlockCache.get; LockTimer follows borrow

Two related cleanups:

1. BlockCache.get returned a CachedBlock whose parent_root field was
   silently zero -- callers that trusted the field were getting an
   answer to a question the helper could not honestly answer. Drop the
   helper and the now-unused CachedBlock type. Existing test usages
   migrate to getBlockAndSsz (which is the honest replacement: it
   returns block + ssz atomically and has no parent_root field).

2. LockTimer hold-span histograms (zeam_lock_hold_seconds + the legacy
   zeam_node_mutex_hold_time_seconds) were closed at the helper's
   return on every borrow-handoff path: statesGet, statesCommitKeepExisting,
   getFinalizedState. The borrow's true hold span extends to its deinit
   site, so the histogram systematically under-reported -- especially bad
   for the new states_exclusive borrow which can be held across the whole
   onBlock post-commit deref window.

   Fix: BorrowedState gains a timer: ?LockTimer field. The hand-off
   helper acquires the lock, calls t.acquired() to record the wait
   span, then moves the LockTimer into the returned borrow.
   BorrowedState.deinit calls t.released() AFTER unlocking, closing
   the hold-span observation at the actual release moment. Not-found
   paths inside statesGet / getFinalizedState keep their direct
   t.released() calls.

   LockTimer is a POD struct (timestamps + flags), so move-by-value
   into the union literal is safe.

Reported by @ch4r10t33r.

Refs: #803 #820

* fix(chain): take shared forkchoice lock for head/finalized/justified struct reads

Sweep extension of #820 commit f4daf77 (which fixed node.zig torn reads)
to chain.zig. The same multi-field struct reads exist in chain.zig and
were similarly serialised by BeamNode.mutex pre-(a-3). Now that the
outer mutex is gone from gossip / req-resp / interval paths, every
direct read of forkChoice.head, fcStore.latest_finalized, or
fcStore.latest_justified can tear (slot, blockRoot/root) pairs against
a concurrent updateHead / fcStore.update that runs under
forkChoice.mutex.lock() (exclusive).

Sites migrated to the shared-locked accessors getHead /
getLatestFinalized / getLatestJustified:

  * validateBlock: finalized_slot snapshot
  * aggregate: head_root snapshot for FFI window
  * getJustifiedCheckpoint: was returning the raw struct, now returns
    a coherent snapshot
  * getSyncStatus: head.slot + latest_finalized.slot snapshots
  * getFinalizedState: finalized_checkpoint snapshot before the borrow
    handoff path

Init-time read at chain.init (last_emitted_justified /
last_emitted_finalized) left as-is: chain is single-threaded during
construction.

Reported by @ch4r10t33r.

Refs: #803 #820

* refactor(node): delete unused finalization_lock; reintroduce in slice (c) with first user

Per slice discipline, the rename of `BeamNode.mutex` -> `finalization_lock`
(plus its FinalizationLockGuard wrapper, acquireFinalizationLock helper and
all of the "site" label machinery) had zero callers in this PR. Carrying
dead synchronisation primitives just to preserve a name adds review surface
for nothing.

Drop them entirely. Slice (c) (chain-worker thread; processFinalizationFollowup
move-off-IO-thread) will reintroduce a multi-resource lock here when its
first real user lands, with a real call site and metrics labels chosen at
that point.

Also tidy the chain.zig comment that referenced the renamed-but-unused
mutex; per-resource locks are now the actual synchronisation between the
libxev tick path and the libp2p worker on every chain entry point.

* fix(locking): forEachBlock passes SignedBlock by value to prevent pointer escape

Previously `BlockCache.forEachBlock` invoked its callback with
`*const types.SignedBlock` — a pointer into the cache's HashMap. Such
pointers are valid only during the callback (HashMap rehash on a
concurrent insert, or removeFetchedBlock / removeChildrenOf, can
invalidate them) but nothing in the API prevented a callback from
stashing the pointer past return. Same shape as the `iterateLocked`
footgun Partha flagged on slice (a-2).

Switch the callback signature to take `types.SignedBlock` by value (a
struct copy under the lock). The struct copy is bounded by
MAX_CACHED_BLOCKS=1024 entries × sizeOf(SignedBlock) per sweep — tiny.
Slice fields inside the copied SignedBlock still point into cache-owned
storage, but that is safe to read INSIDE the callback because the cache
lock is held across all callbacks (no removal / free can run
concurrently). Documented in the docstring along with the
'must not retain past callback' rule.

Updated both existing callsites in network.zig
(`pruneAtOrBelowEach`, `collectReadyEach`) to take the block by
value.

* test(locking,chain): concurrent BlockCache + BorrowedState/pruneStates + onBlock stress

Add three unit-level concurrent tests that reproduce the UAF surface
slice (a-3) is closing, so any future regression fails here without
needing the sim package.

Test A — BlockCache: 3-thread stress
  Three threads hammer a single BlockCache: one inserts (insertBlockPtr
  + ssz), one reads (getBlockAndSsz), one removes (removeFetchedBlock).
  N=1500 iterations. Asserts the triple-atomic invariant from #803
  (a reader observes either {block-Some, ssz-Some} or {null, null} —
  never partial state) and that a final cleanup pass leaves the cache
  empty with no orphan parent-link entries.

Test B — BorrowedState: cloneAndRelease vs concurrent
        statesFetchRemoveExclusivePtr
  Constructs a real BeamChain, populates 64 unrelated states, then
  spins one thread doing statesGet+cloneAndRelease against the genesis
  root (1000 iters) while another thread drains the unrelated set via
  statesFetchRemoveExclusivePtr. A start_barrier ensures both threads
  enter the critical region simultaneously. Asserts that thread A's
  clone always observes a coherent state (slot matches expected), and
  that thread B's frees never UAF the in-flight clone.

Test C — chain.onBlock: two-thread concurrent import of same block
  Builds a fresh BeamChain per iteration (30 iters, documented as
  SLOW: STF + signature verify dominates wall-clock). Two threads call
  onBlock(blocks[1]) on the same chain at the same time. A
  ready/start_barrier pair guarantees both threads enter onBlock around
  the same moment so the test exercises the kept_existing path of
  statesCommitKeepExisting (the original UAF surface from PR #820 /
  issue #803). Asserts no panics, at least one success, post-state
  present in the states map, and forkchoice knows the block.

All three tests use std.testing.allocator for thread-safe allocation
under the cache / chain. spec_name and fork_digest are duped via
std.testing.allocator because BeamChain.deinit frees them through
self.allocator.

* fix(network): close UAF in snapshotPendingRequest by dupe'ing under LockedMap lock

The batch-1 O(1) fix in commit 60761c9 replaced an iterateLocked +
key-compare scan (O(n) per call, O(n^2) per timeout sweep) with a
`pending_rpc_requests.get(request_id)` followed by allocator dupes of
the entry's slice-bearing fields. `LockedMap.get` returns the value
BY VALUE and releases the lock before returning. The returned
PendingRPCEntry struct still aliases the in-map allocator-owned bytes
through its `peer_id: []const u8` and `requested_roots: []types.Root`
slice headers; another thread can run `finalizePendingRequest(id)`
between the get returning and the dupe running, fetchRemove the entry
and call `rpc_entry.deinit(allocator)` (frees both slices). The dupes
then read freed memory.

Fix: add `LockedMap.withValueLocked(key, ctx, each)` that runs the
callback while the map's mutex is still held. The callback receives a
`?*const V` pointing at the in-map storage (stable for the duration
of the lock) and threads any owned snapshot back through ctx. Move
all `allocator.dupe` work for snapshotPendingRequest into that
callback, so a concurrent `finalizePendingRequest` cannot run while
the dupes are in flight.

Also adds three unit tests for the new helper:
  * holds the lock across the callback (tryLock-from-second-thread
    proof, mirrors the existing BorrowedState pruner-shaped test).
  * invokes the callback with null on a missing key.
  * propagates callback errors AND releases the lock on the error
    path (a follow-up put on the same map proves the mutex is free).

Refs PR #820, issue #803.

* test(locking): split insertBlockPtr(null)+attachSsz concurrent race coverage

The existing 3-thread BlockCache stress test only exercises the
`insertBlockPtr(ssz=non-null)` shape (atomic triple in one call). The
production `Network.cacheFetchedBlock` path passes `ssz=null` and
later `storeFetchedBlockSsz` does `attachSsz` against the same root —
that race surface (the bounded partial-state window where
block-Some-but-ssz-null is observable) is documented in
`§insertBlockPtr` / `§attachSsz` but had no concurrent test.

Adds `BlockCache: split insertBlockPtr(null)+attachSsz race vs
concurrent reader` in `locking.zig`:

  * Thread A (writer): 1500 iterations of `insertBlockPtr(null)`
    immediately followed by `attachSsz` against the same distinct
    root (so duplicates are impossible). Publishes the root to the
    reader's queue between the two calls so the reader has maximum
    surface to observe the partial-state window.
  * Thread B (reader): spins on `getBlockAndSsz` against the most
    recently produced root. For every Some block the inner slices
    must be readable — `block.block.slot` is range-checked
    (`>= 1 and <= ITER + 16`, matches the writer's slot=i+1 fixture)
    and `block.block.parent_root` must equal `types.ZERO_HASH`. SSZ
    is asserted to be either null OR the 4-byte 0xDEADBEEF fixture —
    never garbage. Counts both `partial_observed` (block-Some,
    ssz-null) and `full_observed` (both Some).
  * Thread C (remover): drains the attached-roots queue and calls
    `removeFetchedBlock` for each completed root. Exercises the
    no-double-free contract.

Asserts `partial_observed > 0` AND `full_observed > 0` at the end
so the test fails LOUDLY if it ever degenerates into a smoke test
that doesn't actually exercise the partial-state surface.

Uses synthetic `SignedBlock` fixtures via the existing
`makeFixtureSignedBlockPtr` helper (zeroed body, no XMSS, no STF) —
the contract under test is BlockCache atomicity, not signature
validity. Test runtime stays under a second.

Refs PR #820, issue #803.

* fix(locking,network,node): replace getBlockAndSsz borrow with cloneBlockAndSsz to close macOS UAF

The race test 'BlockCache: split insertBlockPtr(null)+attachSsz race
vs concurrent reader' (PR #820, SHA 42b4566) panicked on macOS CI
with: bs.ssz.?[0] == 0xDE failing.

Root cause: getBlockAndSsz returned a BlockAndSsz whose ssz field was
a slice header pointing into cache-owned storage. After the mutex
releases, the Remover thread can call removeFetchedBlock which frees
that storage. macOS allocator surfaces the freed bytes as garbage
(first byte != 0xDE), causing the assertion to fire.

Fix:
- Remove getBlockAndSsz (borrow-shape) from BlockCache entirely.
  Replace with cloneBlockAndSsz (deep-clone under the cache mutex,
  caller owns the result). The clone allocates fresh copies of both
  the SignedBlock (via sszClone) and the SSZ bytes (via allocator.dupe)
  inside the critical section, so they remain valid indefinitely after
  the unlock.
- Remove Network.getFetchedBlockWithSsz (thin wrapper around the old
  borrow-shape). Replace with cloneFetchedBlockAndSsz which delegates
  to cloneBlockAndSsz and transfers ownership to the caller.
- Update node.zig to call cloneFetchedBlockAndSsz and add
  defer cached.deinit(self.allocator) so both block interior fields
  and the ssz slice are freed after chain.onBlock completes.
- Update all tests in chain.zig and locking.zig to use
  cloneBlockAndSsz (+ proper deinit of the owned result).
- Reduce race test ITER 1500 -> 300: cloneBlockAndSsz performs a full
  SSZ round-trip per probe under the lock, which is much heavier than
  the old borrow-getter; lower N keeps wall time reasonable while
  still exercising the 3-thread interleaving.

Fixes macOS CI failure on PR #820.
Refs issue #803.

---------

Co-authored-by: zclawz <zclawz@noreply.github.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zclawz <zclawz+pr820@users.noreply.github.com>
ch4r10t33r added a commit that referenced this pull request May 7, 2026
…ce e); chain-worker default on (#803) (#842)

* node: parallel net-fetch (slice d) + centralised hash-root cache (slice e); chain-worker default on (#803)

Three orthogonal but related changes that close the operational gap on
zeam #803 and shave duplicate work on every gossip block / RPC fetch.

## Slice (d) of #803 \u2014 parallel net-fetch + missed-root prune

`BeamNode.fetchBlockByRoots` previously walked `roots` once,
calling `forkChoice.hasBlock(root)` per entry \u2014 N shared-lock
acquisitions on the forkchoice rwlock and N hashmap lookups, in
sequence. Under heavy gossip fanout the per-root acquire dominates
the lookup cost.

New `ForkChoice.hasBlocksBatch(roots, out)` snapshots presence for
every root under a single shared-lock acquisition. `fetchBlockByRoots`
now uses it, then dedups against `network.block_cache` and
`network.pending_block_roots` in priority order so each root\u2019s
outcome \u2014 `already_in_forkchoice` / `already_in_block_cache` /
`already_pending` / `fetched` \u2014 is recorded on
`lean_block_fetch_dedup_total{outcome}`. Operators can now see (a)
how much duplicate fetch work the dedup is saving and (b) whether a
future refactor accidentally bypassed any of the three caches.

`hasBlocksBatch` returns `error.LengthMismatch` if `roots.len !=
out.len` so callers can\u2019t accidentally read past the end of `out`.

## Slice (e) of #803 \u2014 centralised hash-root cache

A gossip block currently has its hash-tree root computed up to four
times: at `BeamNode.onGossip` (pre-lock), at `chain.onGossip` again,
inside `chain.onBlock` if no `blockRoot` was passed, and inside
`forkchoice.onBlock` if its `opts.blockRoot` is null. The same
applies to RPC and replay paths. `hashTreeRoot(BeamBlock, ...)` is a
non-trivial SSZ traversal on a wide block body; doing it 3\u20134 times
per ingress is pure waste.

This PR threads the producer\u2019s already-computed root all the way
through:

* `networks.GossipMessage` is left as-is (the variant is plain
  `SignedBlock` and changing the shape would ripple through the FFI
  bridge); instead `chain.onGossip` gains an optional
  `precomputed_block_root: ?types.Root` parameter that
  `BeamNode.onGossip` populates from its own pre-lock computation.
* `chain_worker.Message.on_block` and the handler vtable gain an
  optional `block_root: ?types.Root`. `BeamChain.submitBlock` takes
  the cached root through; the `chainWorkerOnBlockThunk` forwards it
  into `onBlock` via `CachedProcessedBlockInfo.blockRoot`.
* The pending-blocks queue stores a new `PendingBlockEntry`
  (`{ signed_block, block_root }`); `processPendingBlocks` reads
  the cached root instead of re-hashing on every ready-block pop.
* `chain.onBlock` and `forkchoice.onBlock` already accepted an
  optional precomputed root; they now bump
  `lean_block_root_compute_skipped_total{site}` when the caller
  supplied one so an operator can verify the cache is actually being
  used at every site (a sustained 0 at any site label means the cache
  plumbing was dropped on the floor in a refactor).

The default `block_root: ?Root = null` on every new field/parameter
keeps existing tests + the c-1 chain-worker stub compiling unchanged.

## Chain-worker default flipped to `on`

`pkgs/cli/src/main.zig` / `pkgs/cli/src/node.zig` /
`pkgs/node/src/node.zig` flip `chain_worker_enabled` default from
`false` to `true`. The `--chain-worker` CLI flag is unchanged
otherwise; the legacy synchronous path stays in place as a kill-switch
via `--chain-worker false`. Per Partha\u2019s call: the slice (a/b/c)
work is merged, devnet has been on the worker path via
lean-quickstart for a while, the default flip just brings the
compiled-in default in line with what production already runs.

## New metrics

`pkgs/metrics/src/lib.zig`:
* `lean_block_root_compute_skipped_total{site}` \u2014 every time a
  downstream consumer skipped a `hashTreeRoot(BeamBlock)` because
  the producer threaded a precomputed root through. Sites:
  `chain.onGossip`, `chain.onBlock`, `chain.processPendingBlocks`,
  `forkchoice.onBlock`.
* `lean_block_fetch_dedup_total{outcome}` \u2014 every
  `fetchBlockByRoots` per-root outcome. Outcomes:
  `already_in_forkchoice`, `already_in_block_cache`,
  `already_pending`, `fetched`.

## Tests

* `forkchoice.hasBlocksBatch (slice (d) of #803)` \u2014 empty input is
  a no-op, length-mismatch returns `error.LengthMismatch` rather
  than reading past `out`, presence semantics match the per-root
  `hasBlock`, and the snapshot is re-taken on every call (a
  post-ingest call reflects the new state).
* `chain_worker.on_block carries optional block_root (slice (e) of
  #803)` \u2014 the new field defaults to `null` (existing callers
  compile unchanged), round-trips an explicit value through
  `Message.deinit`, and the test\u2019s SignedBlock heap is freed by
  `deinit` (testing.allocator leak guard).

## Verification

* `zig build` \u2014 green.
* `zig build test` \u2014 green; 12 test groups all pass; both new
  tests confirmed OK locally; clean exit, no panics / UAFs /
  deadlocks observed.

References: #803 (slice work), prior slices: #805 (a-2), #820 (a-3),
#822 (b), #826 (c-1), #827 (c-2a), #828 (c-2b/c-2c).

* node,metrics: address PR #842 review (fetch buckets, blockRoot debug-assert, TOCTOU doc, incrBy)

PR #842 review follow-ups, all four items:

## Minor #1 \u2014 fetched counter undercounts on error paths

`fetchBlockByRoots` previously bumped `outcome="fetched"` only on
the success branch of `ensureBlocksByRootRequest`. On
`error.NoPeersAvailable` and other dispatch errors the call's
`missing_roots` bucket fell through with no counter bump, leaving
the outcome buckets summing to less than `roots.len` per call. The
PR description claimed buckets sum to `roots.len`; that was false
for the error paths.

Fix: two new outcome labels \u2014 `fetch_no_peers` and `fetch_failed`
\u2014 bumped via `incrBy(N)` in the matching error arms. Buckets now
genuinely sum to `roots.len` for every call so a Grafana invariant\n  `sum(rate(lean_block_fetch_dedup_total)) ==
   sum(rate(\u2026 by outcome))`\nholds. Comment updated to spell out the bucket totalling guarantee.

## Minor #2 \u2014 forkchoice trusts caller-supplied blockRoot

`forkchoice.onBlock` and `chain.onBlock` both accept an optional
precomputed `blockRoot` (slice (e) of #803) and use it as-is to key
the forkchoice protoArray. A future caller that threads a stale or
wrong root would silently corrupt the protoArray with no symptom
until a head/finalization mismatch surfaces hours later.

Fix: `std.debug.runtime_safety`-gated re-hash-and-compare in both
public-API entry points. Cost paid only in `Debug` and
`ReleaseSafe`; `ReleaseFast` / `ReleaseSmall` skip it entirely so
the slice-(e) win is preserved in production. On mismatch we
`std.debug.panic` with both roots + slot in the message so the
buggy callsite is obvious. Re-hash failure (OOM during the
verification step) logs a warn and falls through rather than
crashing forkchoice on a verification-side error.

## Minor #3 \u2014 hasBlocksBatch \u2192 network-cache TOCTOU

The forkchoice snapshot, `network.block_cache`, and
`network.pending_block_roots` each take their own lock. Between
the `hasBlocksBatch` call and the post-snapshot cache lookups, a
concurrent gossip handler can ingest a block into any of the three.

The race is benign: the worst case is one duplicate
`blocks_by_root` request whose response then takes the existing
dedup path inside `processBlockByRootChunk` (a
`forkChoice.hasBlock` early return). Document the race and the
benign-by-construction reasoning in-line so a future reader
doesn't try to "fix" it with a multi-resource lock that would
serialize gossip-import against RPC-fetch for no correctness
benefit.

## Nit \u2014 per-root incr loop \u2192 incrBy(N)

`fetchBlockByRoots` previously bumped each outcome counter via N
back-to-back `incr()` calls. Refactor to accumulate per-bucket
counts during the dedup walk, then issue a single `incrBy(N)`
per bucket at the end. Same observed value, fewer atomic ops on
the hot path. Same change for the fetched-success bucket.

## Tests

New metric audit test in `pkgs/metrics/src/lib.zig`:

* `slice (d)/(e) #803: fetch-dedup + root-compute-skipped counters
  appear in /metrics output` \u2014 mirrors the slice-(b) LockTimer
  audit pattern: bumps every label / outcome the production code
  emits (the four root-skip sites + all six fetch outcomes
  including the new `fetch_no_peers` and `fetch_failed`),
  scrapes `writeMetrics`, and asserts each label string appears
  in the body. Locks the metric wiring contract in code so a
  label rename / family drop / serializer regression fails CI here.

## Verification

`zig build` \u2705, metrics tests \u2705 ("All 3 tests passed":
existing #788 audit + lean_gossip_mesh_peers gauge + new slice
(d)/(e) audit). Full `zig build test` running.

References: PR #842 review thread, #803 (slice work), #788 (audit
test pattern).

* node,metrics: PR #842 followup \u2014 dedup_lost_race outcome bucket

Reviewer flagged a residual hole: `ensureBlocksByRootRequest`
(network.zig:710) returns `null` non-erroneously when
`shouldRequestBlocksByRoot` rejects the batch \u2014 every requested
root entered `network.pending_block_roots` or
`network.block_cache` between our `hasBlocksBatch` snapshot and
the network helper's re-check (the benign TOCTOU window already
documented at the top of `fetchBlockByRoots`). The previous
diff bumped `fetched` only on the success branch and the two
error buckets only via the `catch`, leaving this third
non-error null path unbucketed and breaking the\n  `sum(rate(lean_block_fetch_dedup_total)) ==
   sum(rate(\u2026 by outcome))`\ninvariant the slice (d)/(e) audit test claims to lock under any
racing-gossip workload.

Fix: new `outcome="dedup_lost_race"` bucket bumped in the\n`else` arm of `if (maybe_request) |request_info|`. The audit\ntest now bumps and asserts the seventh outcome label too, so a\nfuture refactor that drops the `else` arm fails CI here.

The other null branch `roots.len == 0` inside\n`ensureBlocksByRootRequest` is unreachable from this call site\n(the surrounding\n  `if (missing_roots.items.len == 0) return;`\nguard handles it before dispatch), so `dedup_lost_race` is the\nonly legitimate null cause we need to account for. Documented\ninline so the reasoning survives the next refactor.

Verification: `zig build` \u2705; metrics tests \u2705 (the audit test\nnow asserts all seven outcome labels appear in /metrics body).

Refs: PR #842 review followup, #803 (slice work).

---------

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: Parthasarathy Ramanujam <1627026+ch4r10t33r@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants