Skip to content

perf: slot-driver stall watchdog + chain.onBlock substep histogram (P0 of #863)#883

Merged
ch4r10t33r merged 4 commits into
blockblaz:mainfrom
ch4r10t33r:perf/p0-watchdog-onblock-steps-fast-reject
May 14, 2026
Merged

perf: slot-driver stall watchdog + chain.onBlock substep histogram (P0 of #863)#883
ch4r10t33r merged 4 commits into
blockblaz:mainfrom
ch4r10t33r:perf/p0-watchdog-onblock-steps-fast-reject

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r commented May 14, 2026

What this PR does

Adds two diagnostic tools to help us figure out where time is going inside zeam when the slot driver stalls (the multi-second freezes seen on the devnet-4 aggregator in #863).

Nothing in this PR fixes the stalls themselves. It just makes them visible and attributable, so the next perf PRs can target the actual cost.

1. A watchdog thread that flags stalls

A small background thread wakes up every second and checks how long it's been since the main loop last ticked the clock. If the gap is more than 5 seconds, it:

  • logs an error: slot-driver stall detected: last tick 604.389s ago (threshold=5000ms). See #863.
  • bumps a Prometheus counter: zeam_slot_driver_stall_fired_total
  • records the stall length in a histogram: zeam_slot_driver_stall_seconds (1, 2, 5, 10, 30, 60, 120, 300, 600 second buckets)

If the freeze lasts 600 seconds, the watchdog only logs once. It re-arms only after the slot driver ticks again, otherwise we'd get one log per second for the whole freeze.

The watchdog runs on its own OS thread, so it keeps working even if the main loop or libp2p thread is completely stuck.

(Per-thread stack dumps would be nice — but Zig 0.16's signal-handler API is unstable. Easy to add later if the metric tells us we need it.)

2. A per-step timing breakdown for chain.onBlock

chain.onBlock is a multi-stage process: hash the block, clone parent state, verify signatures, run the state transition, update fork choice, persist to DB, etc. Today we only see the total time. With this PR each stage gets its own histogram entry under the label step:

step label what it measures
block_root_compute hashing the block (when the caller didn't pre-compute the root)
parent_state_clone snapshot + clone of the parent state
verify_signatures XMSS signature verification
state_transition the actual STF (process_slots + process_block)
ssz_serialize_fallback re-serialising the block when no precomputed bytes were supplied
forkchoice_onblock adding the block to fork choice
block_attestations processing attestations carried by the block
db_persist writing the block + state to RocksDB / LMDB

Buckets match the existing zeam_chain_onblock_duration_seconds so dashboards can stack the per-step series next to the total.

Why this PR first

#863 shows the aggregator (zeam_0) had:

  • 5 onBlock calls > 1s, 2 > 5s, 1 > 10s
  • 12 xev drains > 10s, 3 > 60s
  • one window where the process froze for 604 seconds

Peers on the same host stayed healthy, so this isn't host pressure — it's per-process. Without per-step timing we can't tell whether the time goes into signature verify, STF, DB writes, or somewhere else. And without the watchdog those long freezes are hard to spot in noisy logs.

The next perf PRs (lock-free pubkey cache, etc.) need this attribution to actually claim wins.

Validation

  • zig build clean (Debug + ReleaseFast)
  • zig build test -Dprover=dummy --summary all passes
  • zig fmt clean
  • Hot path overhead: a few monotonicTimestampNs calls and one HistogramVec observe per step. Negligible.

Out of scope

  • Per-thread stack dump on stall (waiting on stable Zig signal API).
  • The actual perf fixes — those come in P1 and beyond.

Refs #863.

Two diagnostic instruments that close the attribution gap surfaced by
the devnet-4 aggregator investigation in blockblaz#863:

1. SlotDriverWatchdog (pkgs/node/src/slot_driver_watchdog.zig)
   - background OS thread, polls Clock.last_tick_time_ms every 1s
   - logs ERROR + bumps zeam_slot_driver_stall_fired_total and records
     the duration into zeam_slot_driver_stall_seconds when wall clock
     drifts >= 5s past the last tick
   - hysteresis-suppressed: a single 600s freeze fires once, not 600x
   - per-thread stack dump intentionally deferred (Zig 0.16 signal-
     handler API still in flux) — landed as a follow-up once the
     stall metric tells us we even need it on a given deployment
   - spawned from cli/src/node.zig::Node.run, joined in deinit

2. zeam_chain_onblock_step_duration_seconds{step="..."}
   - HistogramVec, same buckets as zeam_chain_onblock_duration_seconds
     so dashboards can stack the per-step series next to the total
   - emitted at every substep boundary in chain.onBlock:
     block_root_compute, parent_state_clone, verify_signatures,
     state_transition, ssz_serialize_fallback, forkchoice_onblock,
     block_attestations, db_persist
   - implementation is a tiny stopwatch struct (OnblockStepWatch);
     instrumentation failures are silently swallowed so onBlock
     never fails because metrics can't write

Together these let an operator look at a single zeam_0-style stall
window and answer "where did the time go?" without invasive logging
or external profilers — closes the analysis gap explicitly called out
in the issue ("Add per-substep timing inside chain.onBlock to attribute
the multi-second tail").

No behaviour change on the hot path beyond a handful of monotonic
timestamp reads.

Refs blockblaz#863.
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 14, 2026

Adversarial review finding for #883:

High — pkgs/node/src/slot_driver_watchdog.zig:112-117 / pkgs/node/src/clock.zig:19,70-75 — watchdog introduces an unsynchronised cross-thread read of Clock.last_tick_time_ms

The watchdog thread reads self.clock.last_tick_time_ms while the libxev clock thread writes it in tickInterval(). The PR comment says torn reads are acceptable because the worst case is a spurious stall log, but this is a plain shared mutable Zig field, not an atomic. That makes the program data-racy once the watchdog is enabled. Also, because the field is ?isize, the representation may include a tag/discriminant rather than a single naturally atomic word, so a torn read can be worse than just an old timestamp.

This is especially risky because the watchdog is enabled by default for every node in Node.run(), so the diagnostic path can introduce undefined behaviour into the production slot driver.

Suggested fix: make the timestamp explicitly atomic. For example, replace the nullable field with an std.atomic.Value(i64)/isize using a sentinel such as -1 for “no tick yet”, then use release/store in tickInterval() and acquire/load in the watchdog. Alternatively add a tiny accessor on Clock that hides the atomic/sentinel details.

Non-blocking notes:

  • sleepMs() ignores interrupted nanosleep; harmless for diagnostics, but it can shorten probe intervals under signals.
  • I like the onBlock substep histogram direction; I did not find a consensus-path mutation there beyond metric observations.

Validation/evidence: inspected the PR diff and surrounding Clock.tickInterval() / Node.run() lifecycle. A full zig build test -Dprover=dummy --summary all is still running locally through the long XMSS/forkchoice tests; no compile failure had appeared at the time of this review comment.

Addresses adversarial review by @zclawz on PR blockblaz#883.

Data race (High):
  Clock.last_tick_time_ms was a plain ?isize, written by the libxev
  thread in tickInterval() and read by the watchdog thread. As an
  optional + non-atomic, the read was technically UB and on 32-bit
  hosts the i64-sized representation could even tear into a value
  worse than just an old timestamp. Watchdog runs by default on
  every node, so the diagnostic path was injecting UB into production.

  Fix: replace with std.atomic.Value(i64) using std.math.minInt(i64)
  as the "never ticked" sentinel.
    - tickInterval writes via .release store after computing the new
      duration (load uses .monotonic since the libxev thread is the
      single writer).
    - Watchdog reads via the new public Clock.lastTickMs() accessor,
      which does an atomic .acquire load and translates the sentinel
      back to ?i64. No struct field is touched directly.

EINTR (Note):
  watchdog sleepMs now loops on EINTR (continues with the remaining
  duration) so a stray signal cannot shorten a probe interval. Aborts
  immediately when stop_flag is set so shutdown isn't blocked by a
  long in-flight sleep.

No behaviour change on the libxev thread other than one .release
atomic store per tick (replacing a plain field write).

Refs blockblaz#863, blockblaz#883.
@ch4r10t33r
Copy link
Copy Markdown
Contributor Author

Fixed in 82825fb:

Data race (High) — addressed.

Clock.last_tick_time_ms is now std.atomic.Value(i64) using std.math.minInt(i64) as the "never ticked" sentinel. The libxev thread writes via .release store in tickInterval(); the watchdog reads via a new public accessor Clock.lastTickMs() ?i64 that does an .acquire load and translates the sentinel. No struct field is poked directly from outside the Clock.

The libxev thread itself reads back the same atomic via .monotonic (it's the single writer, so it doesn't need to synchronise with itself). Behaviour change on the hot path is limited to one .release atomic store per tick, replacing a plain field write.

EINTR (Note) — addressed.

SlotDriverWatchdog.sleepMs now loops on EINTR, continuing with the remaining duration from nanosleep's rem slot. That way a stray signal (e.g. a future SIGUSR1-based stack-dump probe) cannot shorten the probe interval. Aborts immediately when stop_flag is set so shutdown isn't blocked by a long in-flight sleep.

zig build test -Dprover=dummy --summary all passes locally.

@ch4r10t33r ch4r10t33r requested a review from anshalshukla May 14, 2026 14:16
@ch4r10t33r
Copy link
Copy Markdown
Contributor Author

@zclawz can you verify and approve this PR?

@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 14, 2026

PR #883 Review: slot-driver stall watchdog + onBlock substep histogram

Summary: Adds two diagnostic layers to attribute the multi-second slot-driver stalls from #863: a background watchdog thread that fires when the libxev tick clock falls behind wall-clock by >5s, and per-substep lap timing inside onBlock exported as a labeled Prometheus histogram.


Correctness / Logic ✅

The SUPPRESS_REFIRE hysteresis in runLoop is correct: firing_for_current_stall resets when last_tick changes (i.e. the driver ticked again), so each distinct stall event fires exactly once rather than once per probe interval. The ordering — reset check, then last_observed_tick_ms update, then threshold compare, then refire guard — is the right sequence.

OnblockStepWatch.lap() silently swallows metric write errors (catch {}), so instrumentation failures never bubble into block processing. Correct and intentional.

Minor label note: the "block_attestations" lap point is placed after forkChoice.updateHead() (step 5), so it actually measures the attestation loop plus updateHead. Consider renaming to "block_attestations_and_head_update" or splitting the lap — otherwise dashboard consumers may undercount updateHead time. Not a bug, but a readability issue.

The "ssz_serialize_fallback" lap only fires on the fallback code path (when no precomputed SSZ bytes were supplied). Worth noting in the metric help text that this observation is absent on the fast path.


Concurrency / Memory ordering ✅

last_tick_time_ms_atomic uses .release on writes and .acquire on reads — correct for producer-consumer visibility. The sentinel NEVER_TICKED_MS = std.math.minInt(i64) is unambiguous since real unix-epoch ms are positive and large.

stop_flag uses .release on store and .acquire on loads — correct for a thread-exit signal. sleepMs re-issues nanosleep on EINTR so stop-flag checks are responsive.


Resource management ✅

Node.deinit calls wd.stop() (which joins the watchdog thread) before stopping the API server — correct ordering. Watchdog spawn failure is non-fatal and handled cleanly with a warn log.


Nits

  • "block_attestations" label mismatch noted above — worth fixing before merge so dashboards are accurate.
  • The metric help comment mentions a "total_excluding_io_wait" label but I don't see a corresponding lap call in the diff — intended for a follow-up PR?

Verdict: LGTM ✅ — the label naming nit on "block_attestations" is worth fixing for dashboard accuracy, but not a blocker.

On some GitHub macOS runner images, ~/.cargo/bin/cargo is rustup-init
rather than the rustup proxy shim, so `cargo +nightly --version` fails
with "error: unexpected argument '+nightly' found" and aborts the
workflow. Two PRs against zeam in the same hour hit this on different
runners (PR blockblaz#884 passed, PR blockblaz#883 failed) — purely environmental.

`rustup run nightly cargo --version` invokes the nightly toolchain via
rustup directly, bypassing the proxy-shim path entirely. setup-rust-
toolchain@v1 already installs nightly so the toolchain is guaranteed
present.

Refs blockblaz#863, blockblaz#883.
ch4r10t33r added a commit that referenced this pull request May 14, 2026
… (P1 of #863) (#884)

* perf(p1): lock-free PublicKeyCache; drop pubkey_cache_lock from hot paths

Replaces the std.AutoHashMap-backed PublicKeyCache + pubkey_cache_lock
mutex with a per-slot atomic CAS design. Targets the ~78ms mean lock
hold reported in #863 ("zeam_lock_hold_seconds pubkey_cache /
onBlock.verifySignatures: 195 holds, sum ~15.3s").

xmss/src/lib.zig
  - PublicKeyCache.slots: []std.atomic.Value(usize) sized at init from
    the genesis validator count. Each slot stores @intFromPtr(handle)
    or 0.
  - get(idx): single atomic acquire load. No mutex.
  - getOrPut on miss: deserialise via PublicKey.fromBytes, CAS-install
    @intFromPtr(handle) under release; on lost race free our handle
    and adopt the winner's. No allocator contention either.
  - Out-of-range index (validator_index >= capacity) falls through to
    a non-cached deserialise — safe but slower. Validator-set growth
    will need a fork-boundary cache rebuild; documented inline.
  - PublicKeyCache.init signature changed to (allocator, capacity).
    Both call sites updated; tests updated.

node/src/chain.zig
  - chain.onBlock verify-signatures block: removed the
    pubkey_cache_lock acquire/release dance and the tier-5 enter/leave.
    The wrapped verifySignaturesParallel call is the dominant onBlock
    cost on aggregator nodes; single-threaded mutex was negating the
    rayon parallelism inside Rust.
  - chain.verifyAggregatedAttestation: same removal.
  - PublicKeyCache.init now takes a capacity argument; passed via
    `opts.config.genesis.numValidators()` from BeamChain.init.

The pubkey_cache_lock field on BeamChain remains in place (zero
acquirers) to avoid touching the lock-hierarchy invariants in
locking.zig in this PR. Removal lands as a follow-up cleanup.

Refs #863. Builds on #883 (P0 instrumentation).

* fix(p1): fail loud on out-of-range cache index; refresh stale doc; add CAS tests

Addresses adversarial review by @zclawz on PR #884.

Out-of-range path leak (Medium):
  PublicKeyCache.getOrPut previously deserialised + returned a handle
  on validator_index >= capacity without transferring ownership,
  leaking one Rust pubkey handle per lookup. lean spec does not
  currently grow the validator set after genesis, but if/when it
  does, valid attestations from new validators would have leaked.

  Fix: return new error HashSigError.ValidatorIndexOutOfRange instead
  of falling back to leaky uncached deserialise. Doc updated to
  point at fork-boundary cache rebuild as the intended growth path.

Stale comment (Note):
  pkgs/node/src/chain.zig:213-215 still said xmss.PublicKeyCache was
  not thread-safe and required a serial pre-phase. After the lock-free
  rewrite that's wrong. Updated to reflect the per-slot CAS protocol.

Test gaps (Note):
  Existing tests only covered empty/basic cache metadata. Added:
    * getOrPut populates on miss + returns same handle on hit
    * out-of-range index returns ValidatorIndexOutOfRange
    * 8-thread concurrent getOrPut for the same slot — every thread
      observes the same winning handle, count() == 1 post-race, and
      cache.deinit() suffices to free everything (testing allocator
      surfaces leaks at process exit).

Refs #863, #884.
@ch4r10t33r ch4r10t33r merged commit 86d842a into blockblaz:main May 14, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants