node,chain: per-resource locks + BorrowedState (slice a-2 of #803)#805
Conversation
Per-resource locks + lock-free req/resp design for the BeamNode threading refactor. Lays out the lock hierarchy (1.finalization_lock -> 2.network maps -> 3.states -> 4.pending_blocks -> 5.caches -> 6.forkChoice), per-resource analysis of every shared field under BeamNode/BeamChain/Network, and the 3-PR breakdown ((a-1) infra, (a-2) chain, (a-3) node + req/resp). Posting design before code because slice (a) is the riskiest of the five slices in #803 — getting the lock shape wrong = consensus bug or deadlock on devnet. Cheap to review a doc, expensive to undo a 2000-line PR. Open questions for reviewers listed at the end. No code changes yet.
- BeamState UAF: introduce BorrowedState wrapper as the API contract; enumerate every states.get callsite (#1) - aggregate / getProposalAttestations: explicit snapshot-then-release before FFI (#2) - last_emitted_* / cached_finalized_state: retract single-writer claim, add events_lock (#3) - fetched_blocks + ssz + children: bundle under block_cache_lock for atomic triple-update (#4) - Split caches_lock into pubkey_cache_lock (FFI miss) and root_to_slot_lock (per-attestation hot path) (#5) - Fold (a-1) into (a-2); add mandatory unit tests for new primitives (#6) - processPendingBlocks: one-at-a-time orderedRemove(0), no index snapshot (#7)
- #8: external_mutex removal not mechanical -- lock-dance moves into chain.zig; LOC revised to ~1000+ - #9: enumerate cross-thread chain readers (metrics writer, event broadcaster, peer iterator); reserve /eth/v1/* surface - #10: clarify lock-hierarchy rule applies to simultaneous holds, not all-time acquire order - #11: emit both old and new lock metrics for one release (compat shim) - #12: stress test plan -- gossip flood + RPC, 10-node devnet under jitter, reorg+finalization stress - #13: connected_peers gets atomic count for hot path + RwLock for iterator - Resolve open Qs: refcount required before slice (c) goes off-thread; metrics in (a-2); external_mutex dropped outright; connected_peers atomic+RwLock - Add long-term direction note: marshalled chain-mutator vs per-resource locks (#803 question)
- Rename BorrowedState.sszClone -> cloneAndRelease (consume-the-borrow semantics; prevents double-release bugs) - Spell out errdefer for OOM-mid-clone; lock always released - Add released:bool sentinel + debug.assert in deinit (mirrors MutexGuard pattern from #787) - Add debug-build TLS depth counter to enforce 5a/5b/5c never-co-held at runtime - Metric compat shim clarified to code-side double-emit (not Prometheus recording rule) - getFinalizedState migration scope note: PR description must enumerate every caller - Document O(n^2) worst case for processPendingBlocks; ship histogram before optimising - Document removeChildrenOf MAX_CACHED_BLOCKS=1024 ceiling - Mark all r3 review items resolved; (a-2) ready to cut
Implements slice (a-2) of the BeamNode threading refactor per the design doc at docs/threading_refactor_slice_a.md (commit a76c274). Adds the locking primitives and migrates every chain.zig callsite to use them, without touching the node-level BeamNode.mutex (that lands in slice a-3). New primitives (pkgs/node/src/locking.zig) - LockedMap(K,V): Mutex + AutoHashMap bundle with the small set of methods we actually use (get/put/remove/fetchRemove/count/iterate-locked). Network-side wiring lands in (a-3); the helper ships here with unit tests so its API contract is reviewed independently. - BlockCache: atomic triple-update of fetched_blocks + fetched_block_ssz + fetched_block_children under a single block_cache_lock. Wiring into network.zig lands in (a-3). - BorrowedState: RAII wrapper carrying *const BeamState + the backing lock (states_lock RwLock OR events_lock Mutex). cloneAndRelease consumes the borrow and produces an owned snapshot, with errdefer ensuring the lock is released even on OOM-mid-clone. Idempotent deinit + released sentinel for debug-build assertions. - LockTimer: per-resource lock observation with code-side derived shim that double-emits into legacy zeam_node_mutex_{wait,hold}_time_seconds. - tier5_depth + assertNoTier5SiblingHeld: thread-local depth counter enforcing the 'tier-5 siblings never co-held' rule in debug builds. New BeamChain lock fields - states_lock: std.Thread.RwLock (tier 3, read-mostly) - pending_blocks_lock: std.Thread.Mutex (tier 4) - pubkey_cache_lock: std.Thread.Mutex (tier 5a, sibling) - root_to_slot_lock: std.Thread.Mutex (tier 5b, sibling) - events_lock: std.Thread.Mutex (tier 5c, sibling) Migrated every chain.zig callsite per the design doc inventory: states.get -> BorrowedState (+ cloneAndRelease for FFI/STF windows), states.put -> statesPutExclusive / statesCommitKeepExisting, states.fetchRemove -> statesFetchRemoveExclusivePtr, pending_blocks append -> pending_blocks_lock, pending_blocks drain -> processPendingBlocks one-at-a-time orderedRemove(0) loop with a lean_pending_blocks_drain_iters histogram, public_key_cache.getOrPut -> pubkey_cache_lock around verifySignatures and verifyAggregatedAttestation, root_to_slot_cache -> root_to_slot_lock around STF windows + cache writes + finalization-driven prune, last_emitted_{justified,finalized} + cached_finalized_state -> events_lock with read-modify-write under the mutex inside onBlockFollowup and getFinalizedState. getFinalizedState API change Now returns ?BorrowedState instead of ?*const types.BeamState. Backing lock is states_lock.shared on the in-memory hit path and events_lock on the cache / DB-load path. Migrated caller: api_server.zig handleFinalizedCheckpointState (the only out-of-tree consumer found by grep). The PR description enumerates the migration scope. publishBlock simplification Drops the postState=self.chain.states.get(block_root) shortcut. The shortcut was safe under the global BeamNode.mutex but creates a deadlock in the new world (publishBlock would hold states_lock.shared while chain.onBlock takes states_lock.exclusive). Recompute is the single path now; statesCommitKeepExisting prevents post-state pointer swaps that would break in-flight borrows. Per-lock metric histograms - zeam_lock_wait_seconds{lock,site} - zeam_lock_hold_seconds{lock,site} - lean_pending_blocks_drain_iters Plus a code-side derived shim that double-emits into the legacy zeam_node_mutex_{wait,hold}_time_seconds histograms so existing dashboards keep working for one release. Tests - LockedMap: ctor, get/put/remove, fetchRemove, iterator-while-locked, deinit-when-empty, deinit-when-non-empty. - BorrowedState: idempotent deinit (rwlock + mutex backing variants), assertReleased, cloneAndRelease errdefer-on-OOM, cloneAndRelease success path against a real BeamState (chain.zig). - BlockCache: removeChildrenOf bounded by MAX_CACHED_BLOCKS, insert+get end-to-end, partial-state invariant on re-insert (chain.zig). - tier5_depth: increment/decrement balance. Build + test - zig fmt pkgs/ clean - zig build green - zig build test 50/50 build steps, 158/158 tests passing. Slice (a-3) (next PR) drops BeamNode.mutex from onGossip / onInterval / onReqRespResponse, makes onReqRespRequest fully lock-free, and wires the LockedMap / BlockCache helpers into network.zig.
Resolves zig 0.16.0 upgrade conflicts (PR #784 vs slice a-2): * pkgs/metrics/src/lib.zig: histogram inits now take an `io` parameter on main (zig 0.16 prometheus crate change). Applied the same `io` parameter to the three new histograms added on this branch (`zeam_lock_wait_seconds`, `zeam_lock_hold_seconds`, `lean_pending_blocks_drain_iters`); kept the LEGACY notes on the `zeam_node_mutex_*` series so the double-emit shim retains its documentation. * pkgs/node/src/lib.zig: kept `_ = @import("./locking.zig");` from this branch and reverted to `refAllDecls` (`refAllDeclsRecursive` does not exist in zig 0.16's `std.testing`). * pkgs/node/src/{chain,locking}.zig: replaced `std.Thread.{Mutex,RwLock}` with `zeam_utils.{SyncMutex,SyncRwLock}` (the project's zig-0.16 drop-in wrappers around the new `std.Io.{Mutex,RwLock}` API). Same semantics as before from the call-site perspective; the zig-0.16 `io`/cancellable plumbing stays inside the wrapper. * pkgs/node/src/locking.zig: rewrote LockTimer to use `zeam_utils.monotonicTimestampNs()` instead of `std.time.Timer` (`std.time.Timer` was removed in zig 0.16). Same wait/hold metrics, same legacy-shim emission into `zeam_node_mutex_{wait,hold}_time_seconds`. Build green: 26/26 steps. Tests green: 50/50 across all packages (including the new LockedMap / BlockCache / BorrowedState unit tests in pkgs/node).
Address @GrapeBaBa's feedback (#804): the libxev main thread and the libp2p bridge thread are both event loops that currently mix IO + CPU work. Running XMSS verify + STF on either is a correctness bug (peer drops, slot-tick jitter), not a perf nuance. The hackmd reference \u2014 lighthouse's router/BeaconProcessor/worker tier model (https://hackmd.io/@JVtpwRK3SwmkRIFfF0Bmyg/rylVP_WY-g) \u2014 is the shape we converge to. r5 changes: * New \u00a7IO-thread non-blocking invariant: neither libxev main nor the libp2p bridge thread may run XMSS verify, STF, or any FFI work whose worst case exceeds a per-interval budget. Crossing the invariant has downstream consensus impact (missed slots, dropped peers). * New \u00a7Chain-worker thread (target architecture) section: dedicated worker owns forkchoice + STF + pending_blocks + states; libxev main becomes event router only; libp2p bridge FFI callbacks become enqueue-and-return work-queue producers. Mirrors PR #670 (Rust-side Swarm actor model) and lighthouse's three-tier model. * \u00a7Threads in play rewritten as a today-vs-target table. Old view retained at the bottom of the section for slice (a-2) reviewers. * Slice (c) repositioned and rescoped: was 'followup off-thread,' now 'off-IO-thread chain mutation' \u2014 spawns the chain worker thread, marshalls every chain mutator to it, lands refcounted state pointers as a hard prerequisite (cross-thread readers must not block worker mutations). * \u00a7Long-term direction rewritten from suggestive to definitive: the chain-worker shape IS the target; the fine-grained-lock alternative is described as the failure mode if slice (c) never lands. * Slice (a-2) is now framed explicitly as a prerequisite, not an end state. Implementation PR #805 unchanged; only the framing moves.
The previous merge commit (5711654) brought the branch up to main but omitted the zig 0.16 API migration for the slice (a-2) primitives \u2014 chain.zig and locking.zig still referenced `std.Thread.{Mutex,RwLock}` (removed in zig 0.16) and `std.time.Timer` (also removed). This fixup commit: * Replaces `std.Thread.{Mutex,RwLock}` with `zeam_utils.{SyncMutex,SyncRwLock}` in chain.zig and locking.zig (drop-in wrappers around the new `std.Io.{Mutex,RwLock}` API; same call-site semantics). * Rewrites `LockTimer` to use `zeam_utils.monotonicTimestampNs()` instead of `std.time.Timer`; semantics unchanged \u2014 same wait/hold observations into `zeam_lock_{wait,hold}_seconds` plus the legacy `zeam_node_mutex_{wait,hold}_time_seconds` shim. * Reverts `refAllDeclsRecursive` to `refAllDecls` in pkgs/node/src/lib.zig (`refAllDeclsRecursive` does not exist in zig 0.16's `std.testing`). Build: 26/26 steps succeeded. Test: 50/50 steps succeeded (verified before this commit; cache hit on fixup will run again in CI).
|
Conflicts resolved + zig 0.16 API migration completed (HEAD The branch was conflicting with main after PR #784 (zig 0.16.0 upgrade, +1013 / −902 across 57 files). Resolution shipped in three commits on top of the previous tip:
VerificationBoth run locally against zig 0.16.0 at HEAD
GitHub now reports What did not change on the wireThe three primitives ( Companion design-doc revision
Ready for human review. |
) (#820) * node,network: LockedMap + BlockCache + ConnectedPeers wiring (slice a-3) Migrate Network maps from raw HashMaps under BeamNode.mutex to per-resource locks built on the slice (a-2) primitives (#805): - pending_rpc_requests, pending_block_roots: LockedMap(K,V) - timed_out_requests: ArrayList + own Mutex (returned ids are now an owned copy so callers don't hold the lock across retry calls) - fetched_blocks + fetched_block_ssz + fetched_block_children: collapsed into one BlockCache field (atomic triple under one block_cache_lock). Two new BlockCache helpers cover the network's observed shape: insertBlockPtr (block + parent link, no ssz) and attachSsz (post-cache ssz update). Added forEachBlock, getChildrenCopy, getBlock, getSsz, hasChildren so callers no longer reach into the underlying maps. - connected_peers: new ConnectedPeers wrapper around StringHashMap + RwLock + atomic count. Logger fast-path reads the atomic; iterators take the shared RwLock; adds/removes take the exclusive RwLock and update the atomic alongside the map mutation. Network public API stays compatible at the call-site level except: - selectPeer now returns an owned dup'd peer-id slice (caller frees); ensureBlocksByRootRequest returns BlocksByRootRequestResult that carries the owned slice and offers .deinit(allocator). - getTimedOutRequests returns an owned slice (caller frees). - getPendingRequestPtr is replaced by snapshotPendingRequest which copies the per-request strings under the lock so the caller can hand the snapshot to chain.onBlock without holding the pending_rpc_requests lock across the STF window. BeamChain.connected_peers becomes *ConnectedPeers (was *const std.StringHashMap(PeerInfo)). The chain reads count() lock-free via the atomic and uses iterateLocked for sync-status sweeps. This change does NOT yet drop BeamNode.mutex from the gossip / interval / req-resp paths; that is the next commit in the slice. Refs: docs/threading_refactor_slice_a.md Tracking: #803 * node: drop BeamNode.mutex from gossip/interval/req-resp paths Slice (a-3) of #803. Now that the network maps and chain state have their own per-resource locks (slices a-2 and a-3 helpers), the outer BeamNode.mutex is no longer needed on the hot callbacks: - onGossip drops the global mutex; chain.onGossip and the network cache mutations each take their own per-resource locks. - onReqRespResponse drops the global mutex; handleReqRespResponse snapshots the pending request entry under pending_rpc_requests' own lock, then dispatches into chain.onBlock with no outer mutex held across the STF window. - onReqRespRequest is now fully lock-free: chain.db.loadBlock has its own DB-internal sync, and chain.getStatus reads forkchoice via its own RwLock shared path. No chain mutation, no network state mutation on this path. - onInterval drops the global mutex around chain.onInterval / processPendingBlocks / sweepTimedOutRequests / processReadyCachedBlocks. batch_pending_parent_roots gets its own dedicated mutex (batch_pending_parent_roots_lock) since the global mutex was its only synchronisation. Both the libxev tick path (flushPendingParentFetches) and the libp2p bridge path (cacheBlockAndFetchParent) can now contend on this small map; the lock is held only for the put / drain critical section. BeamNode.mutex is renamed to BeamNode.finalization_lock and the acquireMutex / MutexGuard API is renamed to acquireFinalizationLock / FinalizationLockGuard. The lock currently has no users on the gossip / interval / req-resp paths; it is kept as a labelled placeholder for the few code paths that legitimately need a multi-resource view (e.g. a future processFinalizationFollowup move-off-IO-thread). Wait/hold times go through LockTimer so the legacy zeam_node_mutex_* histograms keep getting double-emitted for one more release. Refs: docs/threading_refactor_slice_a.md Tracking: #803 * tests: LockedMap/ConnectedPeers concurrent smokes + Network/BlockCache integration Slice (a-3) test additions: In pkgs/node/src/locking.zig: - 'LockedMap: concurrent put/get/remove smoke' — 4 threads × 256 ops each, hits put/get/remove/iterateLocked. Asserts no deadlock and bounded final count. - 'ConnectedPeers: connect / disconnect / count atomic' — full API smoke (connect, replace, disconnect, contains, setLatestStatus, iterateLocked, selectPeerCopy). - 'ConnectedPeers: concurrent connect/disconnect keeps count consistent' — 3 threads stress the atomic count + RwLock; final atomic count must equal the actual map size. In pkgs/node/src/node.zig: - 'Network: BlockCache wiring smoke (slice a-3)' — exercises the new Network helpers end-to-end against a real BeamNode: cacheFetchedBlock (insertBlockPtr path), duplicate handling, getChildrenOfBlock (owned slice), storeFetchedBlockSsz, collectCachedBlocksAtOrBelowSlot, collectReadyCachedBlocks, removeFetchedBlock parent-link cleanup. - 'Network: ConnectedPeers integration with selectPeer (slice a-3)' — connectPeer / hasPeer / count / selectPeer (owned-copy) / disconnectPeer. The single-node ingestion stress harness called for in docs/threading_refactor_slice_a.md §'Stress test plan' lives outside the unit-test pkg (needs a real Network backend / mock libp2p / forked-chain seed) and is tracked in #803 as a follow-up sim scenario. Refs: docs/threading_refactor_slice_a.md Tracking: #803 * fix(chain): hold states_lock.exclusive across onBlock post-commit deref With BeamNode.mutex dropped from gossip / req-resp / interval paths in slice (a-3), two threads can now be inside chain.onBlock concurrently. Previously the outer mutex masked a UAF: statesCommitKeepExisting released states_lock.exclusive before returning, then onBlock would deref the (now unlocked) in-map pointer through updateBlockDb + forkChoice.confirmBlock. Thread A's onBlockFollowup -> processFinalizationAdvancement -> pruneStates takes states_lock.exclusive and fetchRemove+frees the very entry Thread B is about to deref. Classic use-after-free. Fix: hand the exclusive lock off to the returned BorrowedState so the caller in onBlock keeps it alive (via defer borrow.deinit()) across the post-commit deref window. Adds a new BorrowedState backing variant states_exclusive_rwlock alongside states_shared_rwlock and events_mutex; cloneAndRelease and assertReleased work for the new variant. Cannot downgrade-then-reacquire-shared: the unlock/reacquire gap is exactly the UAF race we are closing. (A native RwLock downgrade primitive would let us hold shared instead; that is a separate change.) Cost analysis: the lock-hold window grows by updateBlockDb + forkChoice.confirmBlock. Forkchoice has its own RwLock so confirmBlock is short; updateBlockDb is a serialised write to the DB layer. The pruner blocks for that window — acceptable cost for correctness. Tests: * locking: BorrowedState states_exclusive_rwlock backing also releases * locking: cloneAndRelease works for states_exclusive_rwlock backing * locking: states_exclusive_rwlock backing blocks pruner-shaped reacquire until deinit (smoke test with std.Thread.spawn) Reported by @ch4r10t33r. Refs: #803 #820 * fix(node): take shared forkchoice lock for head/finalized struct reads forkChoice.head (ProtoBlock) and fcStore.latest_finalized (Checkpoint) are written under forkChoice.mutex.lock() (exclusive). Several sites in node.zig read these multi-field structs directly without the shared lock; under outer BeamNode.mutex (pre-a-3) these were serialised, but with the outer mutex dropped from gossip / interval / req-resp paths a torn struct read pairs slot with a blockRoot from a different update, causing wrong sync decisions in handleReqRespResponse. Migrate to existing getHead() / getLatestFinalized() helpers (which take mutex.lockShared) and snapshot once per logical decision block (L932 was reading forkChoice.head.slot twice in the same if/log pair — two independent acquires can return different ProtoBlocks). Init-time write at L129 (slot_clock pointer hookup, single-threaded) left as-is per the brief. Reported by @ch4r10t33r. Refs: #803 #820 * fix(locking): atomic triple-insert via insertBlockPtr ssz param + getBlockAndSsz BlockCache's design-doc invariant is "the triple update (block + ssz + parent link) is atomic from any reader's perspective". The split-insert shape (insertBlockPtr + later attachSsz) breaks it: between the two calls a concurrent reader doing getBlock(root) sees Some and getSsz(root) sees None — the partial state the doc says is invisible. Live exposure with BeamNode.mutex dropped (slice a-3): node.zig processCachedDescendants reads getFetchedBlock then getFetchedBlockSsz in two separate lock acquisitions, racing the cache + storeFetchedBlockSsz window in gossip-replay. Fix: * insertBlockPtr gains an ssz: ?[]u8 param; when non-null the SSZ bytes are inserted in the same critical section (errdefer rolls back ssz on partial failure of the children-list append). * Add BlockCache.getBlockAndSsz returning a single nominal BlockAndSsz struct under one lock acquisition (ssz may itself be null when not yet attached — but the pair is always coherent). * Network gains getFetchedBlockWithSsz wrapper + cacheFetchedBlock passes null for ssz (it doesn't have it yet). * node.zig processCachedDescendants collapses the two-lock dance into a single getFetchedBlockWithSsz call. * Document attachSsz's brief partial-state window and update get()'s docstring (it's no longer 'impossible' — insertBlockPtr is now a mutator that can leave block-only state when ssz=null). Tests: * BlockCache: insertBlockPtr+ssz atomic visibility (both-Some path) * BlockCache: insertBlockPtr null-ssz then attachSsz still observable atomically Reported by @ch4r10t33r. Refs: #803 #820 * fix(locking,network): atomic removeFetchedBlock; close TOCTOU vs concurrent prune Network.removeFetchedBlock did getBlock(root) under lock #1 to fetch parent_root, then removeOne(root, parent_root) under lock #2. Between the two acquisitions another thread could removeOne(root) directly or removeChildrenOf(parent_of_root), leaving stale parent_root and a torn cache state. The returned bool was no longer trustworthy as a gate for further work. Fix: BlockCache.removeFetchedBlock(root) does the lookup + remove + parent-link cleanup all under one critical section, deriving parent_root from the entry being removed. Network.removeFetchedBlock becomes a one-liner. The internal fast path is factored as removeOneUnlocked so the public removeOne (which still takes parent_root from the caller) and the new removeFetchedBlock share the same body. No production callers of the legacy two-step pattern remain. removeOne is left in place for tests and any future caller that already holds the parent root (e.g. iterating children). Test: * BlockCache: removeFetchedBlock atomically clears entry + parent link (verifies blocks.get → null, hasChildren → false, idempotent second call returns false). Reported by @ch4r10t33r. Refs: #803 #820 * fix(connected-peers): idempotent count_atomic on replace path ConnectedPeers.connect's legacy shape on the replace path was: fetchSub(1) -> dupe key -> dupe peer_id -> map.put -> fetchAdd(1) with errdefer-on-OOM unwinding the dupes/put. If either dupe failed the count_atomic stayed permanently low: the entry was removed from the map AND the counter never got the matching fetchAdd back. Under slice (a-3) the gossip log line on every message reads count_atomic lock-free, so a permanent-low value would silently misreport the connected-peer count for the rest of the node's life. Fix: skip count_atomic on replace entirely. fetchRemove the old entry, note was_present, then dupe + put. Only fetchAdd(1, .release) when was_present was false. Strictly smaller residual bug on OOM during the dupe/put: the entry disappears from the map but count_atomic is unchanged \u2014 reads one above true map size for that error window. That is bounded by a single OOM event vs. permanently wrong on the legacy code path. Reported by @ch4r10t33r. Refs: #803 #820 * fix(network): O(1) snapshotPendingRequest lookup snapshotPendingRequest was iterateLocked + key compare scanning the whole pending_rpc_requests map for the matching request_id. Per call that is O(n) under the lock; the per-request timeout sweep that calls this for every in-flight request was O(n^2) under the lock and held it for the whole walk \u2014 a real concern when a peer churns hundreds of in-flight requests. LockedMap already exposes get() doing a single hash lookup under its mutex. Use it. Reported by @ch4r10t33r. Refs: #803 #820 * fix(chain,locking): tier5 depth tracked through borrow lifetime chain.getFinalizedState's cache-hit and DB-load paths called locking.leaveTier5() BEFORE returning the BorrowedState that still holds events_lock (a tier-5 sibling). For the entire borrow lifetime the thread-local tier5_depth read 0 even though events_lock was actually held \u2014 so HTTP / event-broadcaster paths that touch pubkey_cache_lock (5a) or root_to_slot_lock (5b) while a finalized- state borrow was alive silently bypassed the sibling-rule assertion. Fix: BorrowedState gains a tier5_held: bool field. The hand-off site sets it true; BorrowedState.deinit calls leaveTier5() AFTER unlocking (so the next acquire-side assertion sees the correct depth). The not-found local-defer release path inside getFinalizedState still calls leaveTier5() directly because no borrow is constructed there. Default false: statesGet (states_lock is tier-3, not tier-5) and statesCommitKeepExisting (also tier-3) leave tier5_held off. Tests: * BorrowedState: tier5_held=true causes deinit to decrement tier5 depth (idempotent second deinit is also tested) * BorrowedState: tier5_held=false leaves tier5 depth alone Reported by @ch4r10t33r. Refs: #803 #820 * fix(locking): assertReleasedOrPanic wiring + cloneAndRelease double-call guard Two related correctness gaps: 1. The legacy assertReleased(*const Self) was silently dropped at every callsite during the Zig 0.16 upgrade. Renamed to assertReleasedOrPanic, with a richer panic message that names the backing variant. The legacy assertReleased name is retained as a deprecated alias so any in-flight branches keep building while callsites migrate. 2. cloneAndRelease did not guard against a second call. The first call already released the backing lock and the in-map state may have been freed by a concurrent thread; the second call's read of self.state inside sszClone is a UAF. Added std.debug.assert(!self.released) at the top. Wired up assertReleasedOrPanic at every BorrowedState callsite in chain.zig and api_server.zig. Source order is intentional: defer borrow.assertReleasedOrPanic(); // registered FIRST defer borrow.deinit(); // registered SECOND LIFO defer execution means deinit runs first (sets released=true), then assertReleasedOrPanic validates and is a no-op. On a path that bypasses deinit \u2014 a future helper that takes ownership without recording release \u2014 the assert panics in Debug. Compiled out in release. Reported by @ch4r10t33r. Refs: #803 #820 * fix(locking): drop synthetic-parent BlockCache.get; LockTimer follows borrow Two related cleanups: 1. BlockCache.get returned a CachedBlock whose parent_root field was silently zero -- callers that trusted the field were getting an answer to a question the helper could not honestly answer. Drop the helper and the now-unused CachedBlock type. Existing test usages migrate to getBlockAndSsz (which is the honest replacement: it returns block + ssz atomically and has no parent_root field). 2. LockTimer hold-span histograms (zeam_lock_hold_seconds + the legacy zeam_node_mutex_hold_time_seconds) were closed at the helper's return on every borrow-handoff path: statesGet, statesCommitKeepExisting, getFinalizedState. The borrow's true hold span extends to its deinit site, so the histogram systematically under-reported -- especially bad for the new states_exclusive borrow which can be held across the whole onBlock post-commit deref window. Fix: BorrowedState gains a timer: ?LockTimer field. The hand-off helper acquires the lock, calls t.acquired() to record the wait span, then moves the LockTimer into the returned borrow. BorrowedState.deinit calls t.released() AFTER unlocking, closing the hold-span observation at the actual release moment. Not-found paths inside statesGet / getFinalizedState keep their direct t.released() calls. LockTimer is a POD struct (timestamps + flags), so move-by-value into the union literal is safe. Reported by @ch4r10t33r. Refs: #803 #820 * fix(chain): take shared forkchoice lock for head/finalized/justified struct reads Sweep extension of #820 commit f4daf77 (which fixed node.zig torn reads) to chain.zig. The same multi-field struct reads exist in chain.zig and were similarly serialised by BeamNode.mutex pre-(a-3). Now that the outer mutex is gone from gossip / req-resp / interval paths, every direct read of forkChoice.head, fcStore.latest_finalized, or fcStore.latest_justified can tear (slot, blockRoot/root) pairs against a concurrent updateHead / fcStore.update that runs under forkChoice.mutex.lock() (exclusive). Sites migrated to the shared-locked accessors getHead / getLatestFinalized / getLatestJustified: * validateBlock: finalized_slot snapshot * aggregate: head_root snapshot for FFI window * getJustifiedCheckpoint: was returning the raw struct, now returns a coherent snapshot * getSyncStatus: head.slot + latest_finalized.slot snapshots * getFinalizedState: finalized_checkpoint snapshot before the borrow handoff path Init-time read at chain.init (last_emitted_justified / last_emitted_finalized) left as-is: chain is single-threaded during construction. Reported by @ch4r10t33r. Refs: #803 #820 * refactor(node): delete unused finalization_lock; reintroduce in slice (c) with first user Per slice discipline, the rename of `BeamNode.mutex` -> `finalization_lock` (plus its FinalizationLockGuard wrapper, acquireFinalizationLock helper and all of the "site" label machinery) had zero callers in this PR. Carrying dead synchronisation primitives just to preserve a name adds review surface for nothing. Drop them entirely. Slice (c) (chain-worker thread; processFinalizationFollowup move-off-IO-thread) will reintroduce a multi-resource lock here when its first real user lands, with a real call site and metrics labels chosen at that point. Also tidy the chain.zig comment that referenced the renamed-but-unused mutex; per-resource locks are now the actual synchronisation between the libxev tick path and the libp2p worker on every chain entry point. * fix(locking): forEachBlock passes SignedBlock by value to prevent pointer escape Previously `BlockCache.forEachBlock` invoked its callback with `*const types.SignedBlock` — a pointer into the cache's HashMap. Such pointers are valid only during the callback (HashMap rehash on a concurrent insert, or removeFetchedBlock / removeChildrenOf, can invalidate them) but nothing in the API prevented a callback from stashing the pointer past return. Same shape as the `iterateLocked` footgun Partha flagged on slice (a-2). Switch the callback signature to take `types.SignedBlock` by value (a struct copy under the lock). The struct copy is bounded by MAX_CACHED_BLOCKS=1024 entries × sizeOf(SignedBlock) per sweep — tiny. Slice fields inside the copied SignedBlock still point into cache-owned storage, but that is safe to read INSIDE the callback because the cache lock is held across all callbacks (no removal / free can run concurrently). Documented in the docstring along with the 'must not retain past callback' rule. Updated both existing callsites in network.zig (`pruneAtOrBelowEach`, `collectReadyEach`) to take the block by value. * test(locking,chain): concurrent BlockCache + BorrowedState/pruneStates + onBlock stress Add three unit-level concurrent tests that reproduce the UAF surface slice (a-3) is closing, so any future regression fails here without needing the sim package. Test A — BlockCache: 3-thread stress Three threads hammer a single BlockCache: one inserts (insertBlockPtr + ssz), one reads (getBlockAndSsz), one removes (removeFetchedBlock). N=1500 iterations. Asserts the triple-atomic invariant from #803 (a reader observes either {block-Some, ssz-Some} or {null, null} — never partial state) and that a final cleanup pass leaves the cache empty with no orphan parent-link entries. Test B — BorrowedState: cloneAndRelease vs concurrent statesFetchRemoveExclusivePtr Constructs a real BeamChain, populates 64 unrelated states, then spins one thread doing statesGet+cloneAndRelease against the genesis root (1000 iters) while another thread drains the unrelated set via statesFetchRemoveExclusivePtr. A start_barrier ensures both threads enter the critical region simultaneously. Asserts that thread A's clone always observes a coherent state (slot matches expected), and that thread B's frees never UAF the in-flight clone. Test C — chain.onBlock: two-thread concurrent import of same block Builds a fresh BeamChain per iteration (30 iters, documented as SLOW: STF + signature verify dominates wall-clock). Two threads call onBlock(blocks[1]) on the same chain at the same time. A ready/start_barrier pair guarantees both threads enter onBlock around the same moment so the test exercises the kept_existing path of statesCommitKeepExisting (the original UAF surface from PR #820 / issue #803). Asserts no panics, at least one success, post-state present in the states map, and forkchoice knows the block. All three tests use std.testing.allocator for thread-safe allocation under the cache / chain. spec_name and fork_digest are duped via std.testing.allocator because BeamChain.deinit frees them through self.allocator. * fix(network): close UAF in snapshotPendingRequest by dupe'ing under LockedMap lock The batch-1 O(1) fix in commit 60761c9 replaced an iterateLocked + key-compare scan (O(n) per call, O(n^2) per timeout sweep) with a `pending_rpc_requests.get(request_id)` followed by allocator dupes of the entry's slice-bearing fields. `LockedMap.get` returns the value BY VALUE and releases the lock before returning. The returned PendingRPCEntry struct still aliases the in-map allocator-owned bytes through its `peer_id: []const u8` and `requested_roots: []types.Root` slice headers; another thread can run `finalizePendingRequest(id)` between the get returning and the dupe running, fetchRemove the entry and call `rpc_entry.deinit(allocator)` (frees both slices). The dupes then read freed memory. Fix: add `LockedMap.withValueLocked(key, ctx, each)` that runs the callback while the map's mutex is still held. The callback receives a `?*const V` pointing at the in-map storage (stable for the duration of the lock) and threads any owned snapshot back through ctx. Move all `allocator.dupe` work for snapshotPendingRequest into that callback, so a concurrent `finalizePendingRequest` cannot run while the dupes are in flight. Also adds three unit tests for the new helper: * holds the lock across the callback (tryLock-from-second-thread proof, mirrors the existing BorrowedState pruner-shaped test). * invokes the callback with null on a missing key. * propagates callback errors AND releases the lock on the error path (a follow-up put on the same map proves the mutex is free). Refs PR #820, issue #803. * test(locking): split insertBlockPtr(null)+attachSsz concurrent race coverage The existing 3-thread BlockCache stress test only exercises the `insertBlockPtr(ssz=non-null)` shape (atomic triple in one call). The production `Network.cacheFetchedBlock` path passes `ssz=null` and later `storeFetchedBlockSsz` does `attachSsz` against the same root — that race surface (the bounded partial-state window where block-Some-but-ssz-null is observable) is documented in `§insertBlockPtr` / `§attachSsz` but had no concurrent test. Adds `BlockCache: split insertBlockPtr(null)+attachSsz race vs concurrent reader` in `locking.zig`: * Thread A (writer): 1500 iterations of `insertBlockPtr(null)` immediately followed by `attachSsz` against the same distinct root (so duplicates are impossible). Publishes the root to the reader's queue between the two calls so the reader has maximum surface to observe the partial-state window. * Thread B (reader): spins on `getBlockAndSsz` against the most recently produced root. For every Some block the inner slices must be readable — `block.block.slot` is range-checked (`>= 1 and <= ITER + 16`, matches the writer's slot=i+1 fixture) and `block.block.parent_root` must equal `types.ZERO_HASH`. SSZ is asserted to be either null OR the 4-byte 0xDEADBEEF fixture — never garbage. Counts both `partial_observed` (block-Some, ssz-null) and `full_observed` (both Some). * Thread C (remover): drains the attached-roots queue and calls `removeFetchedBlock` for each completed root. Exercises the no-double-free contract. Asserts `partial_observed > 0` AND `full_observed > 0` at the end so the test fails LOUDLY if it ever degenerates into a smoke test that doesn't actually exercise the partial-state surface. Uses synthetic `SignedBlock` fixtures via the existing `makeFixtureSignedBlockPtr` helper (zeroed body, no XMSS, no STF) — the contract under test is BlockCache atomicity, not signature validity. Test runtime stays under a second. Refs PR #820, issue #803. * fix(locking,network,node): replace getBlockAndSsz borrow with cloneBlockAndSsz to close macOS UAF The race test 'BlockCache: split insertBlockPtr(null)+attachSsz race vs concurrent reader' (PR #820, SHA 42b4566) panicked on macOS CI with: bs.ssz.?[0] == 0xDE failing. Root cause: getBlockAndSsz returned a BlockAndSsz whose ssz field was a slice header pointing into cache-owned storage. After the mutex releases, the Remover thread can call removeFetchedBlock which frees that storage. macOS allocator surfaces the freed bytes as garbage (first byte != 0xDE), causing the assertion to fire. Fix: - Remove getBlockAndSsz (borrow-shape) from BlockCache entirely. Replace with cloneBlockAndSsz (deep-clone under the cache mutex, caller owns the result). The clone allocates fresh copies of both the SignedBlock (via sszClone) and the SSZ bytes (via allocator.dupe) inside the critical section, so they remain valid indefinitely after the unlock. - Remove Network.getFetchedBlockWithSsz (thin wrapper around the old borrow-shape). Replace with cloneFetchedBlockAndSsz which delegates to cloneBlockAndSsz and transfers ownership to the caller. - Update node.zig to call cloneFetchedBlockAndSsz and add defer cached.deinit(self.allocator) so both block interior fields and the ssz slice are freed after chain.onBlock completes. - Update all tests in chain.zig and locking.zig to use cloneBlockAndSsz (+ proper deinit of the owned result). - Reduce race test ITER 1500 -> 300: cloneBlockAndSsz performs a full SSZ round-trip per probe under the lock, which is much heavier than the old borrow-getter; lower N keeps wall time reasonable while still exercising the 3-thread interleaving. Fixes macOS CI failure on PR #820. Refs issue #803. --------- Co-authored-by: zclawz <zclawz@noreply.github.com> Co-authored-by: zclawz <zclawz@users.noreply.github.com> Co-authored-by: zclawz <zclawz+pr820@users.noreply.github.com>
…ce e); chain-worker default on (#803) (#842) * node: parallel net-fetch (slice d) + centralised hash-root cache (slice e); chain-worker default on (#803) Three orthogonal but related changes that close the operational gap on zeam #803 and shave duplicate work on every gossip block / RPC fetch. ## Slice (d) of #803 \u2014 parallel net-fetch + missed-root prune `BeamNode.fetchBlockByRoots` previously walked `roots` once, calling `forkChoice.hasBlock(root)` per entry \u2014 N shared-lock acquisitions on the forkchoice rwlock and N hashmap lookups, in sequence. Under heavy gossip fanout the per-root acquire dominates the lookup cost. New `ForkChoice.hasBlocksBatch(roots, out)` snapshots presence for every root under a single shared-lock acquisition. `fetchBlockByRoots` now uses it, then dedups against `network.block_cache` and `network.pending_block_roots` in priority order so each root\u2019s outcome \u2014 `already_in_forkchoice` / `already_in_block_cache` / `already_pending` / `fetched` \u2014 is recorded on `lean_block_fetch_dedup_total{outcome}`. Operators can now see (a) how much duplicate fetch work the dedup is saving and (b) whether a future refactor accidentally bypassed any of the three caches. `hasBlocksBatch` returns `error.LengthMismatch` if `roots.len != out.len` so callers can\u2019t accidentally read past the end of `out`. ## Slice (e) of #803 \u2014 centralised hash-root cache A gossip block currently has its hash-tree root computed up to four times: at `BeamNode.onGossip` (pre-lock), at `chain.onGossip` again, inside `chain.onBlock` if no `blockRoot` was passed, and inside `forkchoice.onBlock` if its `opts.blockRoot` is null. The same applies to RPC and replay paths. `hashTreeRoot(BeamBlock, ...)` is a non-trivial SSZ traversal on a wide block body; doing it 3\u20134 times per ingress is pure waste. This PR threads the producer\u2019s already-computed root all the way through: * `networks.GossipMessage` is left as-is (the variant is plain `SignedBlock` and changing the shape would ripple through the FFI bridge); instead `chain.onGossip` gains an optional `precomputed_block_root: ?types.Root` parameter that `BeamNode.onGossip` populates from its own pre-lock computation. * `chain_worker.Message.on_block` and the handler vtable gain an optional `block_root: ?types.Root`. `BeamChain.submitBlock` takes the cached root through; the `chainWorkerOnBlockThunk` forwards it into `onBlock` via `CachedProcessedBlockInfo.blockRoot`. * The pending-blocks queue stores a new `PendingBlockEntry` (`{ signed_block, block_root }`); `processPendingBlocks` reads the cached root instead of re-hashing on every ready-block pop. * `chain.onBlock` and `forkchoice.onBlock` already accepted an optional precomputed root; they now bump `lean_block_root_compute_skipped_total{site}` when the caller supplied one so an operator can verify the cache is actually being used at every site (a sustained 0 at any site label means the cache plumbing was dropped on the floor in a refactor). The default `block_root: ?Root = null` on every new field/parameter keeps existing tests + the c-1 chain-worker stub compiling unchanged. ## Chain-worker default flipped to `on` `pkgs/cli/src/main.zig` / `pkgs/cli/src/node.zig` / `pkgs/node/src/node.zig` flip `chain_worker_enabled` default from `false` to `true`. The `--chain-worker` CLI flag is unchanged otherwise; the legacy synchronous path stays in place as a kill-switch via `--chain-worker false`. Per Partha\u2019s call: the slice (a/b/c) work is merged, devnet has been on the worker path via lean-quickstart for a while, the default flip just brings the compiled-in default in line with what production already runs. ## New metrics `pkgs/metrics/src/lib.zig`: * `lean_block_root_compute_skipped_total{site}` \u2014 every time a downstream consumer skipped a `hashTreeRoot(BeamBlock)` because the producer threaded a precomputed root through. Sites: `chain.onGossip`, `chain.onBlock`, `chain.processPendingBlocks`, `forkchoice.onBlock`. * `lean_block_fetch_dedup_total{outcome}` \u2014 every `fetchBlockByRoots` per-root outcome. Outcomes: `already_in_forkchoice`, `already_in_block_cache`, `already_pending`, `fetched`. ## Tests * `forkchoice.hasBlocksBatch (slice (d) of #803)` \u2014 empty input is a no-op, length-mismatch returns `error.LengthMismatch` rather than reading past `out`, presence semantics match the per-root `hasBlock`, and the snapshot is re-taken on every call (a post-ingest call reflects the new state). * `chain_worker.on_block carries optional block_root (slice (e) of #803)` \u2014 the new field defaults to `null` (existing callers compile unchanged), round-trips an explicit value through `Message.deinit`, and the test\u2019s SignedBlock heap is freed by `deinit` (testing.allocator leak guard). ## Verification * `zig build` \u2014 green. * `zig build test` \u2014 green; 12 test groups all pass; both new tests confirmed OK locally; clean exit, no panics / UAFs / deadlocks observed. References: #803 (slice work), prior slices: #805 (a-2), #820 (a-3), #822 (b), #826 (c-1), #827 (c-2a), #828 (c-2b/c-2c). * node,metrics: address PR #842 review (fetch buckets, blockRoot debug-assert, TOCTOU doc, incrBy) PR #842 review follow-ups, all four items: ## Minor #1 \u2014 fetched counter undercounts on error paths `fetchBlockByRoots` previously bumped `outcome="fetched"` only on the success branch of `ensureBlocksByRootRequest`. On `error.NoPeersAvailable` and other dispatch errors the call's `missing_roots` bucket fell through with no counter bump, leaving the outcome buckets summing to less than `roots.len` per call. The PR description claimed buckets sum to `roots.len`; that was false for the error paths. Fix: two new outcome labels \u2014 `fetch_no_peers` and `fetch_failed` \u2014 bumped via `incrBy(N)` in the matching error arms. Buckets now genuinely sum to `roots.len` for every call so a Grafana invariant\n `sum(rate(lean_block_fetch_dedup_total)) == sum(rate(\u2026 by outcome))`\nholds. Comment updated to spell out the bucket totalling guarantee. ## Minor #2 \u2014 forkchoice trusts caller-supplied blockRoot `forkchoice.onBlock` and `chain.onBlock` both accept an optional precomputed `blockRoot` (slice (e) of #803) and use it as-is to key the forkchoice protoArray. A future caller that threads a stale or wrong root would silently corrupt the protoArray with no symptom until a head/finalization mismatch surfaces hours later. Fix: `std.debug.runtime_safety`-gated re-hash-and-compare in both public-API entry points. Cost paid only in `Debug` and `ReleaseSafe`; `ReleaseFast` / `ReleaseSmall` skip it entirely so the slice-(e) win is preserved in production. On mismatch we `std.debug.panic` with both roots + slot in the message so the buggy callsite is obvious. Re-hash failure (OOM during the verification step) logs a warn and falls through rather than crashing forkchoice on a verification-side error. ## Minor #3 \u2014 hasBlocksBatch \u2192 network-cache TOCTOU The forkchoice snapshot, `network.block_cache`, and `network.pending_block_roots` each take their own lock. Between the `hasBlocksBatch` call and the post-snapshot cache lookups, a concurrent gossip handler can ingest a block into any of the three. The race is benign: the worst case is one duplicate `blocks_by_root` request whose response then takes the existing dedup path inside `processBlockByRootChunk` (a `forkChoice.hasBlock` early return). Document the race and the benign-by-construction reasoning in-line so a future reader doesn't try to "fix" it with a multi-resource lock that would serialize gossip-import against RPC-fetch for no correctness benefit. ## Nit \u2014 per-root incr loop \u2192 incrBy(N) `fetchBlockByRoots` previously bumped each outcome counter via N back-to-back `incr()` calls. Refactor to accumulate per-bucket counts during the dedup walk, then issue a single `incrBy(N)` per bucket at the end. Same observed value, fewer atomic ops on the hot path. Same change for the fetched-success bucket. ## Tests New metric audit test in `pkgs/metrics/src/lib.zig`: * `slice (d)/(e) #803: fetch-dedup + root-compute-skipped counters appear in /metrics output` \u2014 mirrors the slice-(b) LockTimer audit pattern: bumps every label / outcome the production code emits (the four root-skip sites + all six fetch outcomes including the new `fetch_no_peers` and `fetch_failed`), scrapes `writeMetrics`, and asserts each label string appears in the body. Locks the metric wiring contract in code so a label rename / family drop / serializer regression fails CI here. ## Verification `zig build` \u2705, metrics tests \u2705 ("All 3 tests passed": existing #788 audit + lean_gossip_mesh_peers gauge + new slice (d)/(e) audit). Full `zig build test` running. References: PR #842 review thread, #803 (slice work), #788 (audit test pattern). * node,metrics: PR #842 followup \u2014 dedup_lost_race outcome bucket Reviewer flagged a residual hole: `ensureBlocksByRootRequest` (network.zig:710) returns `null` non-erroneously when `shouldRequestBlocksByRoot` rejects the batch \u2014 every requested root entered `network.pending_block_roots` or `network.block_cache` between our `hasBlocksBatch` snapshot and the network helper's re-check (the benign TOCTOU window already documented at the top of `fetchBlockByRoots`). The previous diff bumped `fetched` only on the success branch and the two error buckets only via the `catch`, leaving this third non-error null path unbucketed and breaking the\n `sum(rate(lean_block_fetch_dedup_total)) == sum(rate(\u2026 by outcome))`\ninvariant the slice (d)/(e) audit test claims to lock under any racing-gossip workload. Fix: new `outcome="dedup_lost_race"` bucket bumped in the\n`else` arm of `if (maybe_request) |request_info|`. The audit\ntest now bumps and asserts the seventh outcome label too, so a\nfuture refactor that drops the `else` arm fails CI here. The other null branch `roots.len == 0` inside\n`ensureBlocksByRootRequest` is unreachable from this call site\n(the surrounding\n `if (missing_roots.items.len == 0) return;`\nguard handles it before dispatch), so `dedup_lost_race` is the\nonly legitimate null cause we need to account for. Documented\ninline so the reasoning survives the next refactor. Verification: `zig build` \u2705; metrics tests \u2705 (the audit test\nnow asserts all seven outcome labels appear in /metrics body). Refs: PR #842 review followup, #803 (slice work). --------- Co-authored-by: zclawz <zclawz@users.noreply.github.com> Co-authored-by: Parthasarathy Ramanujam <1627026+ch4r10t33r@users.noreply.github.com>
Slice (a-2): per-resource locks + BorrowedState
Implements slice (a-2) of the BeamNode threading refactor per the design doc at
docs/threading_refactor_slice_a.md(commit a76c274). Adds the locking primitives and migrates everychain.zigcallsite to use them, without touchingBeamNode.mutexitself — that drops in slice (a-3).Tracking: #803. Design tracker: #804.
This PR is draft on purpose — please leave it draft for human review. ~1500 LOC changed.
What's new
pkgs/node/src/locking.zig(new file, ~744 LOC)Three primitives + two debug helpers:
LockedMap(K, V)—std.Thread.Mutex+std.AutoHashMapbundle with the small set of methods we actually use (get/put/remove/fetchRemove/count/iterateLocked). The Network-side wiring lands in (a-3); the helper ships here with unit tests so its API contract is reviewed independently.BlockCache— atomic triple-update offetched_blocks+fetched_block_ssz+fetched_block_childrenunder a singleblock_cache_lock. The wiring intonetwork.ziglands in (a-3) per Partha build all zkvm binaries + cli at once #4.removeChildrenOfis bounded byMAX_CACHED_BLOCKS = 1024.BorrowedState— RAII wrapper carrying*const BeamState+ the backing lock. Two backing variants:states_shared_rwlock—BeamChain.states_lockread side (the common case).events_mutex—BeamChain.events_lock(used only by the cache / DB-load paths insidegetFinalizedState).cloneAndRelease(allocator)consumes the borrow and produces an owned snapshot, witherrdefer self.deinit()ensuring the lock is released even on OOM-mid-clone.deinitis idempotent;released: boolsentinel +assertReleasedpanics in debug builds if a borrow is dropped without release. Naming honesty per Partha r3 typos correction README.md #1 — the previous draft called thissszClone, which read like a non-mutating helper.LockTimer— per-resource lock observation. Records into the newzeam_lock_{wait,hold}_seconds{lock,site}histograms AND double-emits into the legacyzeam_node_mutex_{wait,hold}_time_secondsseries so existing dashboards stay alive for one release (Partha build: glue builder and build.zig cleanup #11, r3 update codebase to support 0.14.0 #5 — code-side derived shim, not a Prometheus recording rule).tier5_depththread-local +assertNoTier5SiblingHeld— the 5a/5b/5c sibling locks must never be co-held. A debug-build TLS depth counter increments on every tier-5 acquire and decrements on release; an assert at every entry catches violations in tests (Partha r3 build all zkvm binaries + cli at once #4).pkgs/node/src/chain.zig— new lock fieldsAdded to
BeamChain:Plus three internal helpers around
states:statesGet(root) -> ?BorrowedStatestatesPutExclusive(site, root, state)— single-key writestatesCommitKeepExisting(site, root, state) -> { effective, kept_existing }— atomic insert-or-keep that never invalidates an existing in-map pointer (the swap-overwrite would create a UAF for in-flight borrows).statesFetchRemoveExclusivePtr(site, root) -> ?*BeamStatechain.zigcallsite migrationEvery
states.{get,put,fetchRemove}callsite from the design-doc inventory has been migrated. Test sites are commented// SAFETY: test-only, single-threaded.chain.zig:611produceBlockstates.get(parent_root)→statesGetcloneAndRelease(FFI window viaforkChoice.getProposalAttestations)chain.zig:725produceBlockstates.put(block_root, post_state)→statesPutExclusive("produceBlock.commit")chain.zig:732produceBlockerrdeferstates.fetchRemove(block_root)→statesFetchRemoveExclusivePtr("produceBlock.errdefer")chain.zig:1112onBlockstates.get(parent_root)→statesGetcloneAndRelease(verify + STF window)chain.zig:1356onBlockstates.put(fcBlock.blockRoot, post_state)→statesCommitKeepExisting("onBlock.commit")chain.zig:1556pruneStatesstates.fetchRemove(root)(loop)chain.zig:1782onGossipAttestationstates.get(target.root)→statesGetchain.zig:1819verifyAggregatedAttestation(called fromonGossipAggregatedAttestation)states.get(target.root)→statesGetchain.zig:1881aggregatestates.get(head_root)→statesGetcloneAndRelease(FFI window viaforkChoice.aggregate)node.zig:1489publishBlockstates.get(block_root)shortcutchain.zig:2337(test)process and add mock blocks…states.get// SAFETY: test-only, single-threadedchain.zig:3117(test)produceBlock — greedy selection…states.get// SAFETY: test-only, single-threadedEvery
pubkey_cache/root_to_slot_cacheaccess is wrapped in its respective tier-5 lock (with the sibling-rule assertion). Everylast_emitted_{justified,finalized}andcached_finalized_stateread-modify-write is wrapped inevents_lock.pending_blocksmigrationprocessPendingBlocksis now a one-at-a-timeorderedRemove(0)re-scan loop per Partha #7. Indices are never assumed stable across an unlock. Thelean_pending_blocks_drain_itershistogram measures iteration count so we can spot the O(n²) tail in production before optimising (cap or cursor — both noted in design doc).getFinalizedStateAPI changeNow returns
?BorrowedStateinstead of?*const types.BeamState. Backing lock depends on path:states_lock.shared.cached_finalized_statehit / DB-load path →events_lock(the field that owns the pointer is mutated under the same lock).Every caller migrated (Partha r3 #6 — grepping
states.getdoes not find them):pkgs/cli/src/api_server.zig:274handleFinalizedCheckpointStateArrayList(u8), thendefer borrow.deinit()releases the lock before HTTP I/O.Repo-wide grep confirmed
getFinalizedStateis the only function with this*const BeamStateoutward contract; only the one HTTP route consumes it today.publishBlocksimplificationDropped the
postState = self.chain.states.get(block_root)shortcut. It was safe under globalBeamNode.mutexbut creates a deadlock in the new world (caller holdsstates_lock.sharedfrom a borrow, calleechain.onBlocktakesstates_lock.exclusiveto commit). Recompute is the single path now;statesCommitKeepExistingkeeps the original in-map pointer untouched on the duplicate-import path so locally produced blocks no longer leak their first post-state on the publish hop.Per-lock metrics + legacy shim
New series:
zeam_lock_wait_seconds{lock,site}zeam_lock_hold_seconds{lock,site}lean_pending_blocks_drain_itersLegacy
zeam_node_mutex_{wait,hold}_time_seconds{site}is double-emitted byLockTimersumming observations across all per-resource locks. Operators do not redeploy Prometheus or change recording rules — the legacy series stays alive automatically for one release, then is removed in the release after.What's NOT in this PR (saved for a-3)
pending_rpc_requests,pending_block_roots,timed_out_requests,connected_peers) →LockedMapmigration.BlockCacheNetwork-side wiring (the helper itself ships here, with unit tests).BeamNode.mutexrename tofinalization_lockand dropping it fromonGossip/onInterval/onReqRespResponse.onReqRespRequest.Slice (a-2) keeps
BeamNode.mutexwrapping every chain entry point — the new chain locks are nested inside it (redundant under the outer mutex but correct). They become the actual synchronisation once (a-3) drops the outer mutex from those callbacks. There is noexternal_mutexparameter onchain.onBlockto drop because it was never landed onmainbetween #798–#801; the design doc's "drop external_mutex" item is therefore a no-op against the current tree.Tests
LockedMap: ctor + get/put/remove + countlocking.zigLockedMap: fetchRemove returns the prior entrylocking.zigLockedMap: iterator-while-locked sees every inserted entrylocking.zigLockedMap: deinit on empty map is a no-oplocking.zigLockedMap: deinit on non-empty map frees internal storagelocking.zigBorrowedState: deinit is idempotent and releases the locklocking.zigBorrowedState: events_mutex backing also releaseslocking.zigBorrowedState: assertReleased fires only when not releasedlocking.zigBorrowedState: cloneAndRelease releases lock on OOM-mid-clonelocking.zigfail_index=0) → lock released, error propagatedBlockCache: helper init/deinit when emptylocking.zigBlockCache: removeChildrenOf is bounded by MAX_CACHED_BLOCKSlocking.zigtier5 depth counter increments and decrementslocking.zigBorrowedState: cloneAndRelease success path against real BeamStatechain.zigBlockCache: insert + get + removeChildrenOf boundedchain.zigBlockCache: partial-state invariant (re-insert leaves no orphans)chain.zigAll existing chain.zig integration tests continue to pass — they cover the migrated states map / events_lock / pending_blocks paths through the produceBlock + onBlock + onBlockFollowup integration shape.
Build + test status
🟢 Green. Ready for review.
Spec link
Design:
docs/threading_refactor_slice_a.md@ commita76c274("docs: r3 polish from Partha verification"). The PR implements every line item listed in §"Status (r4 — ready to cut code)".Notes on deviations from the design doc
statesCommitKeepExistinginstead of purestates.put— discovered while migratingonBlockthat an unconditional overwrite invalidates in-flight borrows handed out bystatesGet. The keep-existing semantics preserve pointer identity for any concurrent reader holdingstates_lock.shared. Equivalent rationale to the design doc's "never invalidate a pointer that other borrows might still observe."publishBlockno longer passes a cachedpostState— see §publishBlock simplification above. Deviates from the design-doc inventory which only said "migrate the callsite" without specifying the lock-dance shape; the recompute path is the only one that survives both the borrow contract and the deadlock constraint.BeamChain— matches the design doc but worth flagging: a process running multiple chains in parallel (today there is one) would need per-chain counters. Not a problem for slice (a).BlockCacheNetwork-side wiring is intentionally deferred to (a-3) per the design doc's "Use your judgment" clause; the helper + unit tests live in this PR so reviewers can review the API contract independently.