perf: slot-driver stall watchdog + chain.onBlock substep histogram (P0 of #863)#883
Conversation
Two diagnostic instruments that close the attribution gap surfaced by the devnet-4 aggregator investigation in blockblaz#863: 1. SlotDriverWatchdog (pkgs/node/src/slot_driver_watchdog.zig) - background OS thread, polls Clock.last_tick_time_ms every 1s - logs ERROR + bumps zeam_slot_driver_stall_fired_total and records the duration into zeam_slot_driver_stall_seconds when wall clock drifts >= 5s past the last tick - hysteresis-suppressed: a single 600s freeze fires once, not 600x - per-thread stack dump intentionally deferred (Zig 0.16 signal- handler API still in flux) — landed as a follow-up once the stall metric tells us we even need it on a given deployment - spawned from cli/src/node.zig::Node.run, joined in deinit 2. zeam_chain_onblock_step_duration_seconds{step="..."} - HistogramVec, same buckets as zeam_chain_onblock_duration_seconds so dashboards can stack the per-step series next to the total - emitted at every substep boundary in chain.onBlock: block_root_compute, parent_state_clone, verify_signatures, state_transition, ssz_serialize_fallback, forkchoice_onblock, block_attestations, db_persist - implementation is a tiny stopwatch struct (OnblockStepWatch); instrumentation failures are silently swallowed so onBlock never fails because metrics can't write Together these let an operator look at a single zeam_0-style stall window and answer "where did the time go?" without invasive logging or external profilers — closes the analysis gap explicitly called out in the issue ("Add per-substep timing inside chain.onBlock to attribute the multi-second tail"). No behaviour change on the hot path beyond a handful of monotonic timestamp reads. Refs blockblaz#863.
|
Adversarial review finding for #883: High — The watchdog thread reads This is especially risky because the watchdog is enabled by default for every node in Suggested fix: make the timestamp explicitly atomic. For example, replace the nullable field with an Non-blocking notes:
Validation/evidence: inspected the PR diff and surrounding |
Addresses adversarial review by @zclawz on PR blockblaz#883. Data race (High): Clock.last_tick_time_ms was a plain ?isize, written by the libxev thread in tickInterval() and read by the watchdog thread. As an optional + non-atomic, the read was technically UB and on 32-bit hosts the i64-sized representation could even tear into a value worse than just an old timestamp. Watchdog runs by default on every node, so the diagnostic path was injecting UB into production. Fix: replace with std.atomic.Value(i64) using std.math.minInt(i64) as the "never ticked" sentinel. - tickInterval writes via .release store after computing the new duration (load uses .monotonic since the libxev thread is the single writer). - Watchdog reads via the new public Clock.lastTickMs() accessor, which does an atomic .acquire load and translates the sentinel back to ?i64. No struct field is touched directly. EINTR (Note): watchdog sleepMs now loops on EINTR (continues with the remaining duration) so a stray signal cannot shorten a probe interval. Aborts immediately when stop_flag is set so shutdown isn't blocked by a long in-flight sleep. No behaviour change on the libxev thread other than one .release atomic store per tick (replacing a plain field write). Refs blockblaz#863, blockblaz#883.
|
Fixed in 82825fb: Data race (High) — addressed.
The libxev thread itself reads back the same atomic via EINTR (Note) — addressed.
|
|
@zclawz can you verify and approve this PR? |
PR #883 Review: slot-driver stall watchdog + onBlock substep histogramSummary: Adds two diagnostic layers to attribute the multi-second slot-driver stalls from #863: a background watchdog thread that fires when the libxev tick clock falls behind wall-clock by >5s, and per-substep lap timing inside Correctness / Logic ✅The SUPPRESS_REFIRE hysteresis in
Minor label note: the The Concurrency / Memory ordering ✅
Resource management ✅
Nits
Verdict: LGTM ✅ — the label naming nit on |
On some GitHub macOS runner images, ~/.cargo/bin/cargo is rustup-init rather than the rustup proxy shim, so `cargo +nightly --version` fails with "error: unexpected argument '+nightly' found" and aborts the workflow. Two PRs against zeam in the same hour hit this on different runners (PR blockblaz#884 passed, PR blockblaz#883 failed) — purely environmental. `rustup run nightly cargo --version` invokes the nightly toolchain via rustup directly, bypassing the proxy-shim path entirely. setup-rust- toolchain@v1 already installs nightly so the toolchain is guaranteed present. Refs blockblaz#863, blockblaz#883.
… (P1 of #863) (#884) * perf(p1): lock-free PublicKeyCache; drop pubkey_cache_lock from hot paths Replaces the std.AutoHashMap-backed PublicKeyCache + pubkey_cache_lock mutex with a per-slot atomic CAS design. Targets the ~78ms mean lock hold reported in #863 ("zeam_lock_hold_seconds pubkey_cache / onBlock.verifySignatures: 195 holds, sum ~15.3s"). xmss/src/lib.zig - PublicKeyCache.slots: []std.atomic.Value(usize) sized at init from the genesis validator count. Each slot stores @intFromPtr(handle) or 0. - get(idx): single atomic acquire load. No mutex. - getOrPut on miss: deserialise via PublicKey.fromBytes, CAS-install @intFromPtr(handle) under release; on lost race free our handle and adopt the winner's. No allocator contention either. - Out-of-range index (validator_index >= capacity) falls through to a non-cached deserialise — safe but slower. Validator-set growth will need a fork-boundary cache rebuild; documented inline. - PublicKeyCache.init signature changed to (allocator, capacity). Both call sites updated; tests updated. node/src/chain.zig - chain.onBlock verify-signatures block: removed the pubkey_cache_lock acquire/release dance and the tier-5 enter/leave. The wrapped verifySignaturesParallel call is the dominant onBlock cost on aggregator nodes; single-threaded mutex was negating the rayon parallelism inside Rust. - chain.verifyAggregatedAttestation: same removal. - PublicKeyCache.init now takes a capacity argument; passed via `opts.config.genesis.numValidators()` from BeamChain.init. The pubkey_cache_lock field on BeamChain remains in place (zero acquirers) to avoid touching the lock-hierarchy invariants in locking.zig in this PR. Removal lands as a follow-up cleanup. Refs #863. Builds on #883 (P0 instrumentation). * fix(p1): fail loud on out-of-range cache index; refresh stale doc; add CAS tests Addresses adversarial review by @zclawz on PR #884. Out-of-range path leak (Medium): PublicKeyCache.getOrPut previously deserialised + returned a handle on validator_index >= capacity without transferring ownership, leaking one Rust pubkey handle per lookup. lean spec does not currently grow the validator set after genesis, but if/when it does, valid attestations from new validators would have leaked. Fix: return new error HashSigError.ValidatorIndexOutOfRange instead of falling back to leaky uncached deserialise. Doc updated to point at fork-boundary cache rebuild as the intended growth path. Stale comment (Note): pkgs/node/src/chain.zig:213-215 still said xmss.PublicKeyCache was not thread-safe and required a serial pre-phase. After the lock-free rewrite that's wrong. Updated to reflect the per-slot CAS protocol. Test gaps (Note): Existing tests only covered empty/basic cache metadata. Added: * getOrPut populates on miss + returns same handle on hit * out-of-range index returns ValidatorIndexOutOfRange * 8-thread concurrent getOrPut for the same slot — every thread observes the same winning handle, count() == 1 post-race, and cache.deinit() suffices to free everything (testing allocator surfaces leaks at process exit). Refs #863, #884.
What this PR does
Adds two diagnostic tools to help us figure out where time is going inside zeam when the slot driver stalls (the multi-second freezes seen on the devnet-4 aggregator in #863).
Nothing in this PR fixes the stalls themselves. It just makes them visible and attributable, so the next perf PRs can target the actual cost.
1. A watchdog thread that flags stalls
A small background thread wakes up every second and checks how long it's been since the main loop last ticked the clock. If the gap is more than 5 seconds, it:
slot-driver stall detected: last tick 604.389s ago (threshold=5000ms). See #863.zeam_slot_driver_stall_fired_totalzeam_slot_driver_stall_seconds(1, 2, 5, 10, 30, 60, 120, 300, 600 second buckets)If the freeze lasts 600 seconds, the watchdog only logs once. It re-arms only after the slot driver ticks again, otherwise we'd get one log per second for the whole freeze.
The watchdog runs on its own OS thread, so it keeps working even if the main loop or libp2p thread is completely stuck.
(Per-thread stack dumps would be nice — but Zig 0.16's signal-handler API is unstable. Easy to add later if the metric tells us we need it.)
2. A per-step timing breakdown for
chain.onBlockchain.onBlockis a multi-stage process: hash the block, clone parent state, verify signatures, run the state transition, update fork choice, persist to DB, etc. Today we only see the total time. With this PR each stage gets its own histogram entry under the labelstep:steplabelblock_root_computeparent_state_cloneverify_signaturesstate_transitionssz_serialize_fallbackforkchoice_onblockblock_attestationsdb_persistBuckets match the existing
zeam_chain_onblock_duration_secondsso dashboards can stack the per-step series next to the total.Why this PR first
#863 shows the aggregator (
zeam_0) had:Peers on the same host stayed healthy, so this isn't host pressure — it's per-process. Without per-step timing we can't tell whether the time goes into signature verify, STF, DB writes, or somewhere else. And without the watchdog those long freezes are hard to spot in noisy logs.
The next perf PRs (lock-free pubkey cache, etc.) need this attribution to actually claim wins.
Validation
zig buildclean (Debug + ReleaseFast)zig build test -Dprover=dummy --summary allpasseszig fmtcleanmonotonicTimestampNscalls and one HistogramVec observe per step. Negligible.Out of scope
Refs #863.