perf: slot-driver stall watchdog + chain.onBlock substep histogram (P0 of #863) by ch4r10t33r · Pull Request #883 · blockblaz/zeam

ch4r10t33r · 2026-05-14T12:21:59Z

What this PR does

Adds two diagnostic tools to help us figure out where time is going inside zeam when the slot driver stalls (the multi-second freezes seen on the devnet-4 aggregator in #863).

Nothing in this PR fixes the stalls themselves. It just makes them visible and attributable, so the next perf PRs can target the actual cost.

1. A watchdog thread that flags stalls

A small background thread wakes up every second and checks how long it's been since the main loop last ticked the clock. If the gap is more than 5 seconds, it:

logs an error: slot-driver stall detected: last tick 604.389s ago (threshold=5000ms). See #863.
bumps a Prometheus counter: zeam_slot_driver_stall_fired_total
records the stall length in a histogram: zeam_slot_driver_stall_seconds (1, 2, 5, 10, 30, 60, 120, 300, 600 second buckets)

If the freeze lasts 600 seconds, the watchdog only logs once. It re-arms only after the slot driver ticks again, otherwise we'd get one log per second for the whole freeze.

The watchdog runs on its own OS thread, so it keeps working even if the main loop or libp2p thread is completely stuck.

(Per-thread stack dumps would be nice — but Zig 0.16's signal-handler API is unstable. Easy to add later if the metric tells us we need it.)

2. A per-step timing breakdown for `chain.onBlock`

chain.onBlock is a multi-stage process: hash the block, clone parent state, verify signatures, run the state transition, update fork choice, persist to DB, etc. Today we only see the total time. With this PR each stage gets its own histogram entry under the label step:

`step` label	what it measures
`block_root_compute`	hashing the block (when the caller didn't pre-compute the root)
`parent_state_clone`	snapshot + clone of the parent state
`verify_signatures`	XMSS signature verification
`state_transition`	the actual STF (process_slots + process_block)
`ssz_serialize_fallback`	re-serialising the block when no precomputed bytes were supplied
`forkchoice_onblock`	adding the block to fork choice
`block_attestations`	processing attestations carried by the block
`db_persist`	writing the block + state to RocksDB / LMDB

Buckets match the existing zeam_chain_onblock_duration_seconds so dashboards can stack the per-step series next to the total.

Why this PR first

#863 shows the aggregator (zeam_0) had:

5 onBlock calls > 1s, 2 > 5s, 1 > 10s
12 xev drains > 10s, 3 > 60s
one window where the process froze for 604 seconds

Peers on the same host stayed healthy, so this isn't host pressure — it's per-process. Without per-step timing we can't tell whether the time goes into signature verify, STF, DB writes, or somewhere else. And without the watchdog those long freezes are hard to spot in noisy logs.

The next perf PRs (lock-free pubkey cache, etc.) need this attribution to actually claim wins.

Validation

zig build clean (Debug + ReleaseFast)
zig build test -Dprover=dummy --summary all passes
zig fmt clean
Hot path overhead: a few monotonicTimestampNs calls and one HistogramVec observe per step. Negligible.

Out of scope

Per-thread stack dump on stall (waiting on stable Zig signal API).
The actual perf fixes — those come in P1 and beyond.

Refs #863.

Two diagnostic instruments that close the attribution gap surfaced by the devnet-4 aggregator investigation in blockblaz#863: 1. SlotDriverWatchdog (pkgs/node/src/slot_driver_watchdog.zig) - background OS thread, polls Clock.last_tick_time_ms every 1s - logs ERROR + bumps zeam_slot_driver_stall_fired_total and records the duration into zeam_slot_driver_stall_seconds when wall clock drifts >= 5s past the last tick - hysteresis-suppressed: a single 600s freeze fires once, not 600x - per-thread stack dump intentionally deferred (Zig 0.16 signal- handler API still in flux) — landed as a follow-up once the stall metric tells us we even need it on a given deployment - spawned from cli/src/node.zig::Node.run, joined in deinit 2. zeam_chain_onblock_step_duration_seconds{step="..."} - HistogramVec, same buckets as zeam_chain_onblock_duration_seconds so dashboards can stack the per-step series next to the total - emitted at every substep boundary in chain.onBlock: block_root_compute, parent_state_clone, verify_signatures, state_transition, ssz_serialize_fallback, forkchoice_onblock, block_attestations, db_persist - implementation is a tiny stopwatch struct (OnblockStepWatch); instrumentation failures are silently swallowed so onBlock never fails because metrics can't write Together these let an operator look at a single zeam_0-style stall window and answer "where did the time go?" without invasive logging or external profilers — closes the analysis gap explicitly called out in the issue ("Add per-substep timing inside chain.onBlock to attribute the multi-second tail"). No behaviour change on the hot path beyond a handful of monotonic timestamp reads. Refs blockblaz#863.

zclawz · 2026-05-14T13:10:24Z

Adversarial review finding for #883:

High — pkgs/node/src/slot_driver_watchdog.zig:112-117 / pkgs/node/src/clock.zig:19,70-75 — watchdog introduces an unsynchronised cross-thread read of Clock.last_tick_time_ms

The watchdog thread reads self.clock.last_tick_time_ms while the libxev clock thread writes it in tickInterval(). The PR comment says torn reads are acceptable because the worst case is a spurious stall log, but this is a plain shared mutable Zig field, not an atomic. That makes the program data-racy once the watchdog is enabled. Also, because the field is ?isize, the representation may include a tag/discriminant rather than a single naturally atomic word, so a torn read can be worse than just an old timestamp.

This is especially risky because the watchdog is enabled by default for every node in Node.run(), so the diagnostic path can introduce undefined behaviour into the production slot driver.

Suggested fix: make the timestamp explicitly atomic. For example, replace the nullable field with an std.atomic.Value(i64)/isize using a sentinel such as -1 for “no tick yet”, then use release/store in tickInterval() and acquire/load in the watchdog. Alternatively add a tiny accessor on Clock that hides the atomic/sentinel details.

Non-blocking notes:

sleepMs() ignores interrupted nanosleep; harmless for diagnostics, but it can shorten probe intervals under signals.
I like the onBlock substep histogram direction; I did not find a consensus-path mutation there beyond metric observations.

Validation/evidence: inspected the PR diff and surrounding Clock.tickInterval() / Node.run() lifecycle. A full zig build test -Dprover=dummy --summary all is still running locally through the long XMSS/forkchoice tests; no compile failure had appeared at the time of this review comment.

@zclawz

Addresses adversarial review by @zclawz on PR blockblaz#883. Data race (High): Clock.last_tick_time_ms was a plain ?isize, written by the libxev thread in tickInterval() and read by the watchdog thread. As an optional + non-atomic, the read was technically UB and on 32-bit hosts the i64-sized representation could even tear into a value worse than just an old timestamp. Watchdog runs by default on every node, so the diagnostic path was injecting UB into production. Fix: replace with std.atomic.Value(i64) using std.math.minInt(i64) as the "never ticked" sentinel. - tickInterval writes via .release store after computing the new duration (load uses .monotonic since the libxev thread is the single writer). - Watchdog reads via the new public Clock.lastTickMs() accessor, which does an atomic .acquire load and translates the sentinel back to ?i64. No struct field is touched directly. EINTR (Note): watchdog sleepMs now loops on EINTR (continues with the remaining duration) so a stray signal cannot shorten a probe interval. Aborts immediately when stop_flag is set so shutdown isn't blocked by a long in-flight sleep. No behaviour change on the libxev thread other than one .release atomic store per tick (replacing a plain field write). Refs blockblaz#863, blockblaz#883.

ch4r10t33r · 2026-05-14T14:15:35Z

Fixed in 82825fb:

Data race (High) — addressed.

Clock.last_tick_time_ms is now std.atomic.Value(i64) using std.math.minInt(i64) as the "never ticked" sentinel. The libxev thread writes via .release store in tickInterval(); the watchdog reads via a new public accessor Clock.lastTickMs() ?i64 that does an .acquire load and translates the sentinel. No struct field is poked directly from outside the Clock.

The libxev thread itself reads back the same atomic via .monotonic (it's the single writer, so it doesn't need to synchronise with itself). Behaviour change on the hot path is limited to one .release atomic store per tick, replacing a plain field write.

EINTR (Note) — addressed.

SlotDriverWatchdog.sleepMs now loops on EINTR, continuing with the remaining duration from nanosleep's rem slot. That way a stray signal (e.g. a future SIGUSR1-based stack-dump probe) cannot shorten the probe interval. Aborts immediately when stop_flag is set so shutdown isn't blocked by a long in-flight sleep.

zig build test -Dprover=dummy --summary all passes locally.

ch4r10t33r · 2026-05-14T14:16:24Z

@zclawz can you verify and approve this PR?

zclawz · 2026-05-14T14:34:19Z

PR #883 Review: slot-driver stall watchdog + onBlock substep histogram

Summary: Adds two diagnostic layers to attribute the multi-second slot-driver stalls from #863: a background watchdog thread that fires when the libxev tick clock falls behind wall-clock by >5s, and per-substep lap timing inside onBlock exported as a labeled Prometheus histogram.

Correctness / Logic ✅

The SUPPRESS_REFIRE hysteresis in runLoop is correct: firing_for_current_stall resets when last_tick changes (i.e. the driver ticked again), so each distinct stall event fires exactly once rather than once per probe interval. The ordering — reset check, then last_observed_tick_ms update, then threshold compare, then refire guard — is the right sequence.

OnblockStepWatch.lap() silently swallows metric write errors (catch {}), so instrumentation failures never bubble into block processing. Correct and intentional.

Minor label note: the "block_attestations" lap point is placed after forkChoice.updateHead() (step 5), so it actually measures the attestation loop plus updateHead. Consider renaming to "block_attestations_and_head_update" or splitting the lap — otherwise dashboard consumers may undercount updateHead time. Not a bug, but a readability issue.

The "ssz_serialize_fallback" lap only fires on the fallback code path (when no precomputed SSZ bytes were supplied). Worth noting in the metric help text that this observation is absent on the fast path.

Concurrency / Memory ordering ✅

last_tick_time_ms_atomic uses .release on writes and .acquire on reads — correct for producer-consumer visibility. The sentinel NEVER_TICKED_MS = std.math.minInt(i64) is unambiguous since real unix-epoch ms are positive and large.

stop_flag uses .release on store and .acquire on loads — correct for a thread-exit signal. sleepMs re-issues nanosleep on EINTR so stop-flag checks are responsive.

Resource management ✅

Node.deinit calls wd.stop() (which joins the watchdog thread) before stopping the API server — correct ordering. Watchdog spawn failure is non-fatal and handled cleanly with a warn log.

Nits

"block_attestations" label mismatch noted above — worth fixing before merge so dashboards are accurate.
The metric help comment mentions a "total_excluding_io_wait" label but I don't see a corresponding lap call in the diff — intended for a follow-up PR?

Verdict: LGTM ✅ — the label naming nit on "block_attestations" is worth fixing for dashboard accuracy, but not a blocker.

On some GitHub macOS runner images, ~/.cargo/bin/cargo is rustup-init rather than the rustup proxy shim, so `cargo +nightly --version` fails with "error: unexpected argument '+nightly' found" and aborts the workflow. Two PRs against zeam in the same hour hit this on different runners (PR blockblaz#884 passed, PR blockblaz#883 failed) — purely environmental. `rustup run nightly cargo --version` invokes the nightly toolchain via rustup directly, bypassing the proxy-shim path entirely. setup-rust- toolchain@v1 already installs nightly so the toolchain is guaranteed present. Refs blockblaz#863, blockblaz#883.

@zclawz

… (P1 of #863) (#884) * perf(p1): lock-free PublicKeyCache; drop pubkey_cache_lock from hot paths Replaces the std.AutoHashMap-backed PublicKeyCache + pubkey_cache_lock mutex with a per-slot atomic CAS design. Targets the ~78ms mean lock hold reported in #863 ("zeam_lock_hold_seconds pubkey_cache / onBlock.verifySignatures: 195 holds, sum ~15.3s"). xmss/src/lib.zig - PublicKeyCache.slots: []std.atomic.Value(usize) sized at init from the genesis validator count. Each slot stores @intFromPtr(handle) or 0. - get(idx): single atomic acquire load. No mutex. - getOrPut on miss: deserialise via PublicKey.fromBytes, CAS-install @intFromPtr(handle) under release; on lost race free our handle and adopt the winner's. No allocator contention either. - Out-of-range index (validator_index >= capacity) falls through to a non-cached deserialise — safe but slower. Validator-set growth will need a fork-boundary cache rebuild; documented inline. - PublicKeyCache.init signature changed to (allocator, capacity). Both call sites updated; tests updated. node/src/chain.zig - chain.onBlock verify-signatures block: removed the pubkey_cache_lock acquire/release dance and the tier-5 enter/leave. The wrapped verifySignaturesParallel call is the dominant onBlock cost on aggregator nodes; single-threaded mutex was negating the rayon parallelism inside Rust. - chain.verifyAggregatedAttestation: same removal. - PublicKeyCache.init now takes a capacity argument; passed via `opts.config.genesis.numValidators()` from BeamChain.init. The pubkey_cache_lock field on BeamChain remains in place (zero acquirers) to avoid touching the lock-hierarchy invariants in locking.zig in this PR. Removal lands as a follow-up cleanup. Refs #863. Builds on #883 (P0 instrumentation). * fix(p1): fail loud on out-of-range cache index; refresh stale doc; add CAS tests Addresses adversarial review by @zclawz on PR #884. Out-of-range path leak (Medium): PublicKeyCache.getOrPut previously deserialised + returned a handle on validator_index >= capacity without transferring ownership, leaking one Rust pubkey handle per lookup. lean spec does not currently grow the validator set after genesis, but if/when it does, valid attestations from new validators would have leaked. Fix: return new error HashSigError.ValidatorIndexOutOfRange instead of falling back to leaky uncached deserialise. Doc updated to point at fork-boundary cache rebuild as the intended growth path. Stale comment (Note): pkgs/node/src/chain.zig:213-215 still said xmss.PublicKeyCache was not thread-safe and required a serial pre-phase. After the lock-free rewrite that's wrong. Updated to reflect the per-slot CAS protocol. Test gaps (Note): Existing tests only covered empty/basic cache metadata. Added: * getOrPut populates on miss + returns same handle on hit * out-of-range index returns ValidatorIndexOutOfRange * 8-thread concurrent getOrPut for the same slot — every thread observes the same winning handle, count() == 1 post-race, and cache.deinit() suffices to free everything (testing allocator surfaces leaks at process exit). Refs #863, #884.

ch4r10t33r mentioned this pull request May 14, 2026

perf: lock-free PublicKeyCache; drop pubkey_cache_lock from hot paths (P1 of #863) #884

Merged

ch4r10t33r requested a review from anshalshukla May 14, 2026 14:16

noopur23 approved these changes May 14, 2026

View reviewed changes

Merge branch 'main' into perf/p0-watchdog-onblock-steps-fast-reject

53e9334

ch4r10t33r merged commit 86d842a into blockblaz:main May 14, 2026
9 checks passed

ch4r10t33r mentioned this pull request May 14, 2026

Investigate slow slot_interval / tick duration (event-loop starvation vs nominal 0.8s) #863

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: slot-driver stall watchdog + chain.onBlock substep histogram (P0 of #863)#883

perf: slot-driver stall watchdog + chain.onBlock substep histogram (P0 of #863)#883
ch4r10t33r merged 4 commits into
blockblaz:mainfrom
ch4r10t33r:perf/p0-watchdog-onblock-steps-fast-reject

ch4r10t33r commented May 14, 2026 •

edited

Loading

Uh oh!

zclawz commented May 14, 2026

Uh oh!

ch4r10t33r commented May 14, 2026

Uh oh!

ch4r10t33r commented May 14, 2026

Uh oh!

zclawz commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ch4r10t33r commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

1. A watchdog thread that flags stalls

2. A per-step timing breakdown for chain.onBlock

Why this PR first

Validation

Out of scope

Uh oh!

zclawz commented May 14, 2026

Uh oh!

ch4r10t33r commented May 14, 2026

Uh oh!

ch4r10t33r commented May 14, 2026

Uh oh!

zclawz commented May 14, 2026

PR #883 Review: slot-driver stall watchdog + onBlock substep histogram

Correctness / Logic ✅

Concurrency / Memory ordering ✅

Resource management ✅

Nits

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ch4r10t33r commented May 14, 2026 •

edited

Loading

2. A per-step timing breakdown for `chain.onBlock`