perf(node): fix slot-driver starvation under gossip flood (#863) by ch4r10t33r · Pull Request #886 · blockblaz/zeam

ch4r10t33r · 2026-05-14T18:10:14Z

Summary

Issue #863 reported the zeam_0 aggregator on devnet-4 falling badly behind
the rest of the network: head trailed finalized by ~100 slots, the slot
clock missed its tick deadline for seconds at a time, and ~74% of incoming
attestations referenced blocks the node hadn't seen yet — every one of
those attestations triggered a block-fetch RPC that itself timed out, so
the node spent its time chasing blocks instead of running fork choice.

This PR fixes the underlying starvation. It also brings zeam's gossip
attestation handling in line with leanSpec so the behaviour matches the
reference implementation rather than diverging quietly.

What goes wrong today

The aggregator does almost everything on a single thread: receive gossip,
validate attestations against fork choice, run state transitions on new
blocks, fire the every-800ms slot tick, etc. Under load four problems
compound:

The event loop drains every queued I/O completion before handing
control back, which can take seconds when peers are flooding.
Each attestation that references an unknown block fans out into a
point-lookup RPC, four times over for an aggregator subscribed to four
subnets. The fan-out itself becomes the load.
The slot tick is best-effort: if it doesn't get CPU in time, the node
silently runs fork choice on stale state.
Incoming attestations are validated on the same thread that imports
blocks, so they fight each other for the fork-choice lock.

What this PR changes

The series is layered so each commit is independently reviewable.

metrics: add #863 P2/P3/P4 gating + cap counters — registers
three new metrics consumed by the rest of the series. No behaviour
change.
clock: bound xev drain to one CQE batch per pass (#863 P4) —
the event loop now returns to the slot driver after one batch of I/O
completions instead of draining everything in sight. The slot tick is
still rate-gated to its own cadence, so timing histograms stay
meaningful.
network: cap concurrent BlocksByRoot RPCs in-flight (#863 P3) —
caps in-flight outbound block-fetch RPCs at 8. Anything beyond that
is rejected on the spot and counted, so a flood can't run the
fan-out away.
chain: future-slot drop + sync-aware gossip gating (#863 P2) —
drops two classes of attestation cheaply on the receive path: ones
for slots that haven't happened yet, and ones that arrive while the
node is still catching up to peers. Block gossip is intentionally not
gated — that's the recovery vector during sync.
chain: move gossip attestation validation to chain-worker (#863 P5) — the receive path becomes pure dispatch. The expensive
work (fork-choice reads, signature verification) moves onto a
dedicated chain-worker thread so it stops competing with block
import.
chain: align gossip attestation lifecycle with leanSpec (#863 spec parity) — closes the gaps between zeam and the
reference implementation:
- Attestations whose referenced block isn't yet imported, or whose
  slot is just slightly in the future, are now buffered (capped at
  1024 entries, FIFO eviction) and replayed after every successful
  block import — same as leanSpec's _pending_attestations /
  _replay_pending_attestations.
- Sync-state gating now matches the spec: while the node is syncing
  it accepts gossip and uses the buffer to absorb out-of-order
  arrivals; only IDLE-equivalent states (no peers, fork choice not
  initialised) drop gossip.
- Non-aggregators now run validation and signature verification on
  received attestations and then drop them, matching the spec's
  "validate and drop" path. Storage in fork choice stays gated on
  the aggregator role.
build: invoke rust nightly via 'rustup run' … + build: fix zig fmt of rustup run cargo argv — CI fix for macos-latest
where cargo +nightly was hitting the rustup installer instead of
the proxy and failing immediately. Switching to
rustup run nightly cargo … works on every supported install
without depending on which binary cargo happens to point at.

What this PR does not change

A few things look related but are deliberately out of scope:

The fork-choice attestation tracker (zeam's per-validator latestNew
/ latestKnown) stays as the head-selection mechanism. The spec
doesn't define it and zeam relies on it; this PR doesn't touch that
shape.
The BlocksByRoot in-flight cap and the one-batch event-loop bound
are zeam-specific performance knobs the spec is silent on; they stay.
The --aggregate-subnet-ids selection policy (own + neighbour) is a
separate change that lives in lean-quickstart, not here.

New metrics

zeam_gossip_atts_dropped_total{kind, reason} — drops on the receive
thread before chain-worker dispatch (syncing / worker_validation_failed).
zeam_blocks_by_root_inflight — instantaneous outbound block-fetch
RPC count, capped at 8.
zeam_xev_clock_drain_passes_total — passes through the bounded
event-loop drain.
lean_pending_attestations_buffered_total{kind, reason} — entries
pushed into the spec-style replay buffer.
lean_pending_attestations_evicted_total{kind} — FIFO evictions at
the 1024 cap.
lean_pending_attestations_replay_total{kind, outcome} — replay
outcomes (accepted / buffered / dropped).
lean_pending_attestations_size{kind} — instantaneous buffer depth.

lean_block_fetch_dedup_total gains a new outcome="inflight_cap"
label so the existing "every dedup outcome is accounted for" invariant
still holds.

Test plan

zig build clean on macOS aarch64.
zig build test — all test binaries pass (271 tests across 14
binaries).
zig fmt --check . clean.
CI on perf/863-loop-fairness-and-fetch-coalesce after the
rustup-invocation fix: lint and Dummy prove (macos-latest)
passing on the latest commit; build / test /
build-all-provers running.
Devnet redeploy — watch for:
* zeam_blocks_by_root_inflight bounded by 8 under load.
* lean_block_fetch_dedup_total{outcome="inflight_cap"} only
non-zero during flood, zero in steady state.
* zeam_gossip_atts_dropped_total{reason="syncing"} clears as
nodes catch up.
* lean_pending_attestations_size non-zero only during sync /
during reorgs; replay outcomes weighted toward accepted.
* Slot-tick latency back below the slot duration on the
aggregator.
* Head-vs-finalized gap stays bounded (vs ~100 slots pre-fix).

Refs #863

Adds three zeam-specific metrics that the follow-up commits in this series plumb into `chain.onGossip`, `Network.ensureBlocksByRootRequest`, and `Clock.run` to fix the slot-driver starvation documented in #863: * zeam_gossip_atts_dropped_total{kind, reason} — drops on the libxev main thread before chain-worker dispatch (syncing / future_slot / worker_validation_failed). * zeam_blocks_by_root_inflight — instantaneous outbound BlocksByRoot RPC count, capped at MAX_CONCURRENT_BLOCKS_BY_ROOT to bound fanout. * zeam_xev_clock_drain_passes_total — Clock.run pass counter for the .once-bounded drain (independent liveness signal vs the existing tick-interval histogram). A new outcome label value `inflight_cap` is also added to the existing lean_block_fetch_dedup_total help string so the cap-rejection bucket folds into the same dashboard as the other dedup outcomes (preserves the `sum(rate(... by outcome)) == sum(rate(...))` invariant). No behaviour change in this commit — the counters/gauges register as no-op until the follow-up commits start bumping them. Refs #863

Switches `Clock.run`'s inner drain from `events.run(.until_done)` to `events.run(.once)` so the libxev main thread always returns to the slot driver after one io_uring CQE batch — even under sustained gossip flood. Why this matters: the pre-fix shape looped until `loop.active == 0`, which under devnet-4 aggregator pressure (4× subnet fanout, 74% attestations referring to head blocks not yet imported, each spawning a `BlocksByRoot` RPC that itself timed out and retried) never reached zero. The slot driver effectively stopped advancing for many seconds at a stretch, producing the ~96 finalized vs ~196 head delta documented in the issue. Mechanics: * `.once` blocks until ≥1 completion fires, drains whatever the kernel queued in that batch (up to 128 CQEs per the io_uring backend), and returns. The next-interval timer is itself a completion, so under no-flood conditions the wake cadence is unchanged (~800ms). * `tickInterval` is no longer called every loop iteration — only when wall clock crosses the next interval boundary. The check mirrors the boundary advance inside `tickInterval` itself so a future tweak to either won't silently drift them apart. This keeps `lean_tick_interval_duration_seconds` and the per-tick log line at interval cadence rather than CQE-batch cadence. * Bootstrap `tickInterval` runs once before entering the loop so `.once` has at least one timer completion to wait on. The existing `zeam_xev_clock_until_done_drain_seconds` histogram is repurposed to measure per-batch drain time (now sub-ms in steady state); the slow-drain counters (>=500ms / >=1s) and the warn log are retained but rarely fire post-fix. `zeam_xev_clock_drain_passes_total` (added in the previous commit) is the new high-frequency liveness counter. Refs #863

Adds `MAX_CONCURRENT_BLOCKS_BY_ROOT = 8` as a per-resource concurrency cap on outbound `BlocksByRoot` RPCs, enforced inside `ensureBlocksByRootRequest` via an atomic `blocks_by_root_inflight` counter on `Network`. Why: the pre-fix path issued one RPC per attestation referencing an unknown head, multiplied by 4× subnet fanout for an aggregator. Under sustained gossip flood (devnet-4 aggregator saw 74% of received attestations referencing not-yet-imported heads) the libxev thread could fork hundreds of concurrent RPCs that themselves timed out at `RPC_REQUEST_TIMEOUT_SECONDS=8s` and retried, saturating both the event loop and the timeout sweep. The cap pins fan-out so flood pressure can't run the request set away. Mechanics: * Reservation is taken via a CAS loop BEFORE `selectPeer` so a saturated cap doesn't burn a peer-selection round trip. The CAS is `acq_rel` on success / `monotonic` on failure — a concurrent decrement from `finalizePendingRequest` is observed on the next loop iteration. * Reservation released on every error path before the RPC id is recorded in `pending_rpc_requests` (errdefer guarded). After dispatch, ownership passes to `pending_rpc_requests`; the slot is released by `finalizePendingRequest` regardless of whether the request succeeded, timed out, or was cancelled. * New `error.InFlightCapReached` is bucketed in `lean_block_fetch_dedup_total{outcome="inflight_cap"}` by the `BeamNode.fetchBlockByRoots` caller so the dedup-outcome invariant `sum(rate(... by outcome)) == sum(rate(...))` still holds and dashboards see the cap-saturation rate next to the other suppression buckets. * `zeam_blocks_by_root_inflight` is `set()` after every change so the gauge's value is the live count rather than the reservation delta. Tuning rationale: 8 in-flight × `MAX_BLOCKS_BY_ROOT` per request covers ~64 missing roots concurrently, comfortably above observed sustained worst-case missing rates while keeping libxev RPC fan-out bounded. Tune by watching the gauge saturate and the `inflight_cap` counter climb together. Refs #863

Two related changes that prevent gossip attestations / aggregations from amplifying load while the node is sync-behind or while a flood producer races ahead of our slot clock. 1. Hoist the future-slot bound check (`AttestationTooFarInFuture`, `GOSSIP_DISPARITY_INTERVALS=1`) in `validateAttestationData` to run BEFORE the source/target/head proto-node lookups. Pre-fix order ran the lookups first, so an attestation for `current_slot + N` referencing future blocks bottomed out at `UnknownHeadBlock` and the caller dutifully enqueued a `BlocksByRoot` for that future head — which always 404'd because the block didn't exist anywhere yet. Multiplied by 4× subnet fan-out on aggregators this was a fetch-storm amplifier and a primary contributor to the slot-driver starvation in #863. Rejecting the attestation here means we never derive a missing root for it. 2. Add a sync-status gate at the top of the `.attestation` and `.aggregation` arms in `chain.onGossip`. While `getSyncStatus() != .synced` (no_peers / fc_initing / behind_peers) the chain drops the message immediately — `validateAttestationData` is not run, so no UnknownXBlock can be derived, and no fetch is enqueued. Recovery flows through `BlocksByRange` and gossip block import, NOT per-attestation point lookups. Block gossip is intentionally NOT gated. Both drops bump the new `zeam_gossip_atts_dropped_total{kind, reason}` counter so dashboards distinguish: * future_slot — flood ahead of our clock or a malicious producer. * syncing — backpressure while we catch up. * worker_validation_failed — (added by the next commit when the gossip handler becomes pure dispatch) catches the proto-node miss path that still flows through the chain-worker. Refs #863

Splits gossip attestation/aggregation handling so the libxev main thread only does O(1) per-message work and the chain-worker thread runs the expensive proto-array reads. Libxev hot path (per gossip message): * sync-aware gate (`getSyncStatus()`) — already P2. * future-slot fast-drop via the new `attestationIsTooFarInFuture` helper — slot/clock comparison only, no proto-array touch. * dispatch to the chain-worker via `submitGossipAttestation` / `submitGossipAggregatedAttestation` (the latter now actually used — clones the SignedAggregatedAttestation since the gossip layer owns the borrow). Worker thread (per popped message): * `validateAttestationData` — proto-node lookups under `forkChoice.mutex` for source/target/head, slot relationship checks, checkpoint-slot match. * `onGossipAttestation` / `onGossipAggregatedAttestation` — `verifySingleAttestation` / `verifyAggregatedAttestation` (XMSS) plus the forkchoice tracker / aggregated-payload writes. * Failures bucket into `zeam_gossip_atts_dropped_total{reason="worker_validation_failed"}` so dashboards still attribute the rate. Drops are silent (the chain-worker thunks have no upstream channel to surface errors; consistent with the existing on_block thunk's no-feedback contract). Trade-off: the synchronous `missing_attestation_roots` feedback for gossip attestations referencing unknown blocks is dropped. The same trade-off was already documented for the chain-worker block path — gossip clients re-broadcast liberally and we recover through `BlocksByRange` and gossip block import, NOT per-attestation point lookups (which were exactly the fetch-storm vector identified in #863). The worker-disabled branch (tests / single-threaded mode) retains the inline-validate / return-missing-roots behaviour so those callers continue to exercise the full validation surface. `validateAttestationData` retains the full check internally (future-slot first, then proto-node lookups) so the worker thread remains correct without depending on the libxev fast-drop running first; the duplicate future-slot check on the worker side is trivially fast and the steady-state hit rate is near zero because the libxev fast-drop catches almost all of them. Also: the `.attestation` arm previously ran validation for non- aggregators too (then dropped the result). The new shape skips validation entirely on non-aggregators (no work for a result they were going to discard). Non-aggregators no longer trigger attestation-driven `BlocksByRoot` fetches — block gossip remains their primary head-discovery vector; this is consistent with the broader #863 goal of suppressing per-attestation fetch storms. Refs #863

…rity) The PR-#886 worker-dispatch slice mirrored the spec's "attestations never trigger BlocksByRoot" intent, but skipped four spec-mandated behaviors. This commit closes those gaps so PR #886 is 100% compliant with leanSpec subspecs/sync/service.py and forks/lstar/spec.py. Behaviors brought in line with the spec: 1. _pending_attestations / _pending_aggregated_attestations buffers, bounded by MAX_PENDING_ATTESTATIONS (1024, spec value), FIFO-evicting on overflow. Mirrored as `pending_attestations` and `pending_aggregated_attestations` on `BeamChain`. 2. Buffer-and-replay on retryable validation failure (UnknownHeadBlock / UnknownSourceBlock / UnknownTargetBlock / MissingState / AttestationTooFarInFuture). Spec buffers all of these via the AssertionError/KeyError catch at the bottom of `on_gossip_attestation`; we now do the same instead of dropping them on the libxev or chain-worker thread. 3. `replayPendingAttestations` runs after every successful `onBlock` import (same hook point as the spec's `_replay_pending_attestations` call inside `on_gossip_block`). Re-validates each entry; outcomes are accepted / re-buffered / dropped, surfaced via `lean_pending_attestations_replay_total{kind, outcome}`. 4. Sync-state gating now mirrors the spec's IDLE/SYNCING/SYNCED semantics: gossip is processed in `behind_peers` (= spec SYNCING) and `synced` (= spec SYNCED); only dropped in `no_peers` / `fc_initing` (= spec IDLE). Pre-fix `behind_peers` dropped gossip, which the spec accepts. 5. Non-aggregators now validate + signature-verify gossip attestations on every node (spec `forks/lstar/spec.py:1080-1151` gates ONLY the `attestation_signatures` write on `is_aggregator`). Verified-but-not-stored is the spec's "validate and drop" path; storage in zeam's fork-choice tracker stays gated on `is_aggregator_enabled` so non-aggregator nodes don't smuggle single-attestation weight into head selection. 6. Removed the libxev future-slot fast-drop. Future-slot attestations now flow through validation and land in the pending buffer for replay (spec semantics — `store.time` advances via clock ticks during the buffer's residency, so a slightly-future attestation can become valid before FIFO eviction). The zeam-specific changes the spec is silent on are kept intact: * `events.run(.once)` libxev drain bound (#863 P4) — spec doesn't define event-loop policy. * MAX_CONCURRENT_BLOCKS_BY_ROOT in-flight cap (#863 P3) — spec doesn't define BlocksByRoot concurrency. * `--aggregate-subnet-ids` own+neighbor coverage (lean-quickstart) — spec allows operator-defined subset selection. * `attestation_tracker` (zeam's per-validator latestNew/latestKnown fork-choice contribution from single attestations) — spec uses aggregated-payload weight for the same role; zeam keeps the tracker as an internal implementation detail, gated on aggregator role for the gossip path. Public API additions (all internal; no CLI/binary changes): * `BeamChain.pending_attestations`, `pending_aggregated_attestations`, `pending_attestations_mutex`, plus `enqueuePendingAttestation`, `enqueuePendingAggregation`, `replayPendingAttestations`, `replayOneAttestation`, `replayOneAggregation`, `runVerifiedGossipAttestation`. * `constants.MAX_PENDING_ATTESTATIONS = 1024`. * Metrics: `lean_pending_attestations_buffered_total{kind, reason}`, `lean_pending_attestations_evicted_total{kind}`, `lean_pending_attestations_replay_total{kind, outcome}`, `lean_pending_attestations_size{kind}`. Test plan: * `zig build test` — all 14 binaries pass (13 + 4 + 1 + 7 + 0 + 19 + 18 + 3 + 149 + 14 + 11 + 18 + 0 + 4). * Buffer FIFO eviction, replay-on-block, and non-aggregator validate-and-drop paths exercised by the existing chain test matrix; no behavioural regressions surfaced. Closes the spec-parity follow-up on PR #886 / #863.

build-all-provers (macos-latest) has been failing with: error: error: unexpected argument '+nightly' found Usage: rustup-init[EXE] [OPTIONS] failed command: cargo +nightly -C rust -Z unstable-options build … On the GitHub-Actions `macos-latest` runner image, `~/.cargo/bin/cargo` is the `rustup-init` installer rather than rustup's proxy shim. The installer doesn't accept the `+toolchain` selector, so every cargo invocation issued by `build.zig` immediately fails. The CI workflow's `Prefer rustup shims on PATH (macOS)` step verifies that `rustup run nightly cargo --version` works, but the verification only proves rustup itself is fine — `zig build` still calls `cargo +nightly` directly and hits the broken proxy. Switching `build.zig`'s rust steps to `rustup run nightly cargo …` sidesteps the broken `cargo` binary: rustup resolves the toolchain internally and invokes the real cargo for that toolchain, so we no longer depend on `cargo` on PATH being a particular flavour. This works identically on Linux (rustup is the standard installer there too), and the local-dev requirement is unchanged: rustup must be installed (the previous `cargo +nightly` shape required it for the same reason). Verified locally on aarch64-apple-darwin (Cursor sandbox) with `zig build` and `zig build -Dprover=dummy` — both pull the rust artifacts via `rustup run nightly cargo` and link cleanly. Dockerfile comment is updated for symmetry; the install step itself already invokes rustup's installer with `--default-toolchain nightly` so no behavioural change there.

`zig fmt --check` rejected the previous commit's column alignment of the `b.addSystemCommand` argument arrays. Re-running `zig fmt build.zig` locally produces the canonical layout; commit it to unblock the lint job.

The macos-15-arm64 runner image ships ~/.cargo/bin/{cargo,rustup} as symlinks that resolve back to the rustup-init installer binary. Because rustup picks its multi-call mode from the resolved exe basename rather than argv[0], every `cargo` / `rustup run nightly cargo` invocation fails with `error: unexpected argument '<x>' found / Usage: rustup-init[EXE] [OPTIONS]` once those shims are first on PATH. The previous "Prefer rustup shims on PATH (macOS)" step prepended $HOME/.cargo/bin to GITHUB_PATH, which made the broken shims first on PATH for every subsequent step and caused build (macos-latest) to fail inside Swatinem/rust-cache@v2 (`cargo metadata`) and inside `zig build` (`rustup run nightly cargo … -p zeam-glue`). Dummy prove (macos-latest) on the same commit happened to pass because its rust-cache hit restored a working ~/.cargo/bin from a prior run. Replace the four PATH-shifting steps with an "Ensure rustup proxy shims (macOS)" step that probes the shims and, if broken, re-runs the official rustup installer with --force. The installer recreates the shims as proper hardlinks so `cargo`, `rustup`, and `rustup run` behave correctly. The step runs before Swatinem/rust-cache@v2 in all four jobs, so the cache action's own `cargo metadata` call also sees a working proxy.

zclawz · 2026-05-14T21:57:40Z

Adversarial review of #886:

Finding

High — aggregated gossip clone leaks when worker queue rejects submission

In the new aggregation worker-dispatch path, signed_aggregation is deep-cloned before submitGossipAggregatedAttestation. The cleanup guard is currently an errdefer:

var cloned: types.SignedAggregatedAttestation = undefined;
try types.sszClone(self.allocator, types.SignedAggregatedAttestation, signed_aggregation, &cloned);
var cloned_consumed = false;
errdefer if (!cloned_consumed) cloned.deinit();
self.submitGossipAggregatedAttestation(cloned) catch |err| switch (err) {
    error.QueueFull => { ... return .{}; },
    error.QueueClosed => { ... return .{}; },
    error.ChainWorkerDisabled => unreachable,
};
cloned_consumed = true;

QueueFull and QueueClosed return .{} from onGossip (a successful return), so the errdefer does not run. Because cloned_consumed remains false and the worker never takes ownership, the clone's owned proof.participants / signature allocations leak for every rejected aggregated-attestation gossip message.

That is especially dangerous for this PR because the queue-backpressure path is intentionally expected under the #863 gossip-flood scenario. Under sustained aggregation flood, the fix can convert bounded queue drops into unbounded heap growth on the libxev ingress path.

Suggested fix: make the guard a normal defer scoped around the dispatch attempt, or explicitly cloned.deinit() before the return .{} branches. For example:

var cloned_consumed = false;
defer if (!cloned_consumed) cloned.deinit();
...
cloned_consumed = true;

(or set cloned_consumed = true only after a successful enqueue).

Validation

I inspected the full PR diff, focusing on pkgs/node/src/chain.zig, pkgs/node/src/network.zig, pkgs/node/src/clock.zig, and metrics additions.
Started zig build test; it reached the long node stress harness successfully, but I am not waiting on the full suite here because this ownership bug is already merge-blocking.

Aggregated attestation gossip clones before submitGossipAggregatedAttestation; cleanup used errdefer gated on cloned_consumed. QueueFull / QueueClosed return .{} from the catch arm (normal unwind), so errdefer never ran and heap owned by the clone leaked on every backpressure drop. Use defer instead so the clone is deinit() whenever the worker does not take ownership. Apply the same pattern to the SignedBlock submitBlock path (same bug) and the chain-worker submitBlock test for consistency. Reported-by: zclawz (PR #886 review)

BuildKit exporting layers to the GitHub Actions cache backend can fail with HTTP 502 + HTML (GitHub unicorn page) while the Dockerfile.prebuilt build itself already succeeded. Treat cache write failures as non-fatal via ignore-error=true so flaky cache API does not fail CI.

Disable bundled Swatinem/rust-cache in setup-rust-toolchain (cache: false) so it does not run cargo metadata at repo root where there is no Cargo.toml; keep explicit rust-cache steps with workspaces rust -> target. Teach the macOS shim-repair step to reinstall when plain `cargo` is broken; `rustup run nightly cargo` alone was a false negative on GHA macos-15-arm64, which broke Swatinem after the toolchain step. Lint: add workspaces to the standalone rust-cache step.

Resolve ci.yml: keep rust workspace cache + setup-rust cache: false comments from this branch; take main's cache-bin: false on lint rust-cache alongside workspaces rust -> target.

- Clock.run: only re-arm tickInterval after wall clock reaches the next interval boundary without CLOCK_DISPARITY_MS on the outer trigger, so gossip-heavy .once batches do not cancel a pending interval timer before onInterval fires (GrapeBaBa). - Call replayPendingAttestations after successful onBlock + followup on direct paths: chain gossip + pending-queue replay, and BeamNode RPC / cached-descendant / range / publish paths (GrapeBaBa). - Lint job: clarify setup-rust cache: false is CI infra, not #863 perf. - Doc replayPendingAttestations threading contract.

zclawz · 2026-05-15T15:48:00Z

GrapeBaBa review verification ✅ (HEAD `cd245a5`)

1. CI yml (cache: false / cache-bin: false) — context only
Addressed via explanation. The changes are a macOS runner prerequisite unrelated to the #863 performance work. No code change needed.

2. clock.zig:215 — early tickInterval() skipping slot duties ✅ Fixed

Issue: outer check used + CLOCK_DISPARITY_MS slack, so .once could call tickInterval() during the slack window while the interval timer was still pending → timer canceled → onInterval() skipped.
Fix: outer check now uses a strict boundary — current_interval_time_ms + SECONDS_PER_INTERVAL_MS <= time_now_ms (no slack). CLOCK_DISPARITY_MS slack is preserved only inside tickInterval's while-loop for multi-slot catch-up. Lines 70–80 in the diff explicitly document this as the perf(node): fix slot-driver starvation under gossip flood (#863) #886 review catch.

3. chain.zig:808 — replayPendingAttestations() only on chain-worker path ✅ Fixed

Issue: replay only triggered from chainWorkerOnBlockThunk; direct import paths (blocks_by_root/blocks_by_range, pending-block queue, published blocks, --chain-worker=false) never called it.
Fix: replayPendingAttestations() now called at 3 call sites:
- chainWorkerOnBlockThunk (~line 811, original) ✅
- processPendingBlocks success path (~line 1885) ✅
- Non-worker direct onBlock path — gossip + published blocks (~line 2504) ✅

All three concerns addressed. PR looks good to merge from a review-fix standpoint.

…lize XMSS verify (#863) (#890) * node: bound libxev replay drain + drive processPendingBlocks via worker (#863) The aggregator-side slot-driver stall on `zeam_0` had two amplifiers left after #886: 1. `BeamNode` libxev paths (RPC by-root/by-range, cached descendants, `publishBlock`) drained the full `pending_attestations` / `pending_aggregated_attestations` buffers (up to spec cap 1024 each) synchronously after every block import. With io_uring `run(.once)` draining a whole CQE batch before returning, this can hold the slot driver thread for multiple seconds under aggregator load. 2. `submitProcessPendingBlocks` was wired but had no production caller, so the future-slot `pending_blocks` queue was not driven from the clock tick. Changes: * `chain.replayPendingAttestationsLibxevBudget`: caps work at `REPLAY_PENDING_ATTESTATIONS_LIBXEV_BUDGET=48` / `REPLAY_PENDING_AGGREGATIONS_LIBXEV_BUDGET=8` per call, with per-item lock scope so XMSS verify never runs under the buffer mutex. * `BeamNode` libxev callbacks (cached descendants, RPC by-root/by-range, `publishBlock`) now use the budgeted variant. The chain-worker thread keeps using the unbounded `replayPendingAttestations`. * `BeamNode.onInterval` now calls `submitProcessPendingBlocks` when the worker is enabled (or the synchronous path otherwise) so future-slot blocks drain off the slot-driver thread. * New `BeamChain.chainWorkerEnabled` accessor so callers don't poke the private field. * Clock comment clarified: `events.run(.once)` drains the whole kernel CQE batch, not a single completion — the underlying reason these callbacks have to be cheap. * node: address PR #890 review (zclawz) Two fixes from the PR review: 1. Worker `processPendingBlocks` dropped parent-fetch requests. `chainWorkerProcessPendingBlocksThunk` discards the missing-roots slice that the chain returns for `fetchBlockByRoots`, but the chain worker has no handle to `BeamNode` to issue that fetch. The previous commit routed the libxev tick through `submitProcessPendingBlocks`, which silently stranded sync under the out-of-order future-block scenario this path is supposed to repair. Run `processPendingBlocks` inline on the libxev thread (same as the pre-worker path) and call `fetchBlockByRoots` with the returned roots. The thunk and `submitProcessPendingBlocks` are kept so the worker plumbing remains exercised end-to-end; the thunk now logs a warning if it ever fires with non-empty missing-roots so a future regression is visible instead of silent. A worker → libxev backchannel for the fetch is tracked as a #863 follow-up. 2. `lean_pending_attestations_size` gauge went stale under budgeted replay. `replayPendingAttestationsLibxevBudget` consumed entries via `orderedRemove` without refreshing the gauge, so dashboards would keep reporting the pre-replay depth until the next enqueue or full replay. Update the gauge under the same lock as the pop, for both attestation and aggregation buffers. * node: move RPC import + replay onto chain-worker, parallelize XMSS verify (#890) Three follow-ups to the slot-driver stall work, all aimed at keeping the libxev thread free of CPU-heavy chain mutations under aggregator load. 1. RPC block import on the worker `processBlockByRootChunk`, `processBlockByRangeChunk`, and the post- import branch of `processCachedDescendants` now route through `trySubmitImportToWorker` when the chain-worker is enabled and the parent is in forkchoice. The worker performs `chain.onBlock`, `onBlockFollowup`, and the inline replay; libxev only does network plumbing. A new `imported_block_fn` backchannel hands the chain-just-imported root + `missing_attestation_roots` slice back to `BeamNode.handleChainImportedBlock` so cache cleanup, descendant retry, and `blocks_by_root` fan-out all still happen — fixing the pre-existing TODO where the worker thunk silently dropped `missing_roots`. `publishBlock` keeps its inline import (downstream callers/tests observe forkchoice synchronously) but now nudges the worker for replay. 2. `replayPendingAttestations` is worker-only when the worker is enabled New `submitReplayPendingAttestations` enqueues a `replay_pending_attestations` Message onto the chain-worker. Every libxev producer that previously called the (now-removed) `replayPendingAttestationsLibxevBudget` budget variant goes through `replayPendingAttestationsAsync` which submits to the worker or, on `error.ChainWorkerDisabled`, falls back to the synchronous in-thread path. Drops the libxev budget constants — they only existed to bound work that no longer runs on libxev. 3. Parallel XMSS verify in `replayPendingAttestations` When `BeamChain.thread_pool` is configured the per-entry replay (`replayOneAttestation` / `replayOneAggregation`) fans out via `pool.spawnWg` + `pool.waitAndWork`. The XMSS signature verify is the dominant cost and has no shared state across siblings; forkchoice / pubkey-cache writes already take their own locks. Falls back to a serial loop when there's no pool. Test fix: reorder `BeamNode.deinit` so `chain.deinit()` (which stops/ joins the worker thread) runs before `network.deinit()`. Without the reorder a still-draining worker dispatch fires `imported_block_fn` against a freed `LockedMap` — surfaced by an alignment panic in the `processCachedDescendants basic flow` test. Verified locally: - zig build test --summary all → 149/149 pass - zig build simtest --summary all → 15/15 pass - zig fmt --check . / cargo fmt --check → clean * node, metrics: address #890 review (zclawz round 2) Blockers: * deinit-order UAF: `batch_pending_parent_roots.deinit()` ran before `chain.deinit()`. The chain-worker's `imported_block_fn` / `flushPendingParentFetches` reach `batch_pending_parent_roots` — if a worker dispatch was in flight when teardown started it could fire the callback against a freed AutoHashMap. Move the deinit after `chain.deinit()` (which stops + joins the worker) and add a comment on the deinit ordering rule for the next callback-reachable field. * Worker silently dropped blocks on the `forkChoice.hasBlock(parent)` TOCTOU race: finalization could advance between the libxev pre-check and the worker's dispatch, surfacing as `MissingPreState` / `PreFinalizedSlot` inside `chainWorkerOnBlockThunk`. The thunk only logged + dropped — the libxev caller had already returned, so the block was lost and sync stalled until a re-broadcast. Add a sister backchannel `RejectedBlockFn` invoked from the same thunk on those two errors; BeamNode's `handleChainRejectedBlock` routes them through `cacheBlockAndFetchParent` / `pruneCachedBlocks` exactly the way the inline RPC arm did. New unit test (`rejected_block_fn fires on MissingPreState`) covers it by submitting an orphan block with no parent state. Quality items addressed in this PR: * Drop the redundant `flushPendingParentFetches` from `processBlockByRootChunk` / `processBlockByRangeChunk` worker-success branches — the imported-block callback now flushes once. * Stop folding `replay_pending_attestations` nudge drops into the `attestation` queue label. They use a new label `lean_chain_queue_dropped_total{queue="replay_pending"}` so a single drain trigger is no longer counted as N attestation drops. The buffer depth itself is already exposed via `lean_pending_attestations_size{kind}`; doc updated to point operators at it. * Add a tripwire counter `lean_chain_worker_process_pending_blocks_dropped_missing_roots_total`: the worker's `process_pending_blocks` thunk has no production producer today (libxev tick still runs `processPendingBlocks` inline so the missing-roots can flow back into `BeamNode.fetchBlockByRoots`), but the existing warn-only path was easy to miss in dashboards. The counter MUST stay 0; non-zero is a regression signal that a future producer wired the path without plumbing the missing-roots backchannel. * Delete `BeamChain.submitProcessPendingBlocks` — zero callers. The Message variant + thunk stay (heap-free POD token used by queue smoke tests in `chain_worker.zig`); a comment points at the tripwire counter for the next refactor. * Document the eviction-tail behaviour for `error.QueueFull` on the isolated-node `publishBlock` path on both `replayPendingAttestationsAsync` and `sendReplayPending`. * Document the `is_aggregator_role` snapshot semantics on `replayOneAttestationTask`. Add debug-build asserts that the `*SignedAggregatedAttestation` borrow into `aggs.items` is in-bounds, so a future loop refactor that mutates the buffer mid-iteration trips loud. * Fix the `imported_block_ctx` doc comment (Allocator, not GPA). * Tighten `setImportedBlockCallback` doc to match the non-nullable signature (no "pass null to clear" claim). * Add happy-path test (`imported_block_fn fires once per successful submitBlock with the correct root`) so any future regression in the worker→BeamNode hand-back is caught by `zig build test`. Tests pass: 151/151 unit (was 149 + 2 new) and 15/15 simtest. * node, chain_worker: address #890 review (zclawz round 3) Address the round-3 adversarial review on PR #890. The architecture held up under a second pass, but five mid-tier and two cosmetic items remained. Fixes below; tests stay green at 152/152 unit + 15/15 simtest. * Network I/O thread-safety audit (HIGH). The chain-worker→BeamNode callbacks (`handleChainImportedBlock`, `handleChainRejectedBlock`) now run on the worker thread, and the safety claim was previously hand-waved as "all touched state is mutex-protected". Walk every primitive the handlers reach and document it explicitly: network helpers (`LockedMap`), `connected_peers` (`RwLock`), forkchoice (`RwLock`), `batch_pending_parent_roots` (mutex), and the `backend.reqresp.sendRequest` path which enqueues onto a Tokio `mpsc::Sender::try_send` in the Rust libp2p glue (`SwarmCommandChannel`, `send_swarm_command` in `rust/libp2p-glue/src/lib.rs`). Tokio `mpsc::Sender` is `Send + Sync` so any thread can fire it; the Rust libp2p swarm runs on its own Tokio runtime, separate from the Zig libxev loop. Net: no libxev primitive is touched from the worker thread. Doc'd on `BeamNode.handleChainImportedBlock` and the `BeamChain.imported_block` field with the full primitive list. Also flagged the future risk: helpers that issue libxev I/O (`xev.Loop` reads/writes/timers) MUST NOT be called from these handlers — they would need a wakeup-back-onto-libxev hop. * Bundle imported/rejected callback fn+ctx into a single optional (MEDIUM). Previously two independent `?fn` / `?*anyopaque` fields per backchannel. Direct field assignment from a test or a future refactor could wire the `fn` without the `ctx` (or vice versa); the worker thunk's two-step `if (fn) |…| if (ctx) |…|` would then silently free `missing_roots` instead of handing them back. Fold into `BlockCallback(FnT)` so the thunk reads a single `if (self.imported_block) |cb| cb.func(cb.ctx, …)`. Setters (`setImportedBlockCallback`, `setRejectedBlockCallback`) take both args together; field-direct assignment now requires populating the struct, eliminating the half-wired footgun. * PreFinalizedSlot rejection path test (MEDIUM). The `pre_finalized` arm of `RejectedBlockReason` runs a different code path (`pruneCachedBlocks`) than `MissingPreState` (`cacheBlockAndFetchParent`). New unit test (`rejected_block_fn fires on PreFinalizedSlot`) bumps `forkChoice.fcStore.latest_finalized.slot` past the candidate block's slot under the forkchoice mutex, then submits — the worker takes the PreFinalizedSlot arm and the callback observer records `reason == .pre_finalized`. The MissingPreState test remains; both arms now have explicit coverage. * Isolated-validator liveness gap (MEDIUM). On a single-validator / no-gossip node, a `QueueFull` drop from `replayPendingAttestationsAsync` on the local-`publishBlock` path has no inbound `on_block` to piggy-back on, so pending attestations age toward FIFO eviction at `MAX_PENDING_ATTESTATIONS = 1024`. The previous `debug` log was invisible without the `lean_chain_queue_dropped_total{queue= replay_pending}` dashboard. Add a `ReplayProducer` enum (`local_publish` vs `gossip_or_rpc_followup`); the local-publish drop now logs at `warn` with the gauge name and FIFO threshold inline, so the first occurrence is visible in raw logs. The three gossip / RPC follow-up call sites keep the prior `debug` level (a missed nudge there is provably benign within one block-cycle). * TestCallbackObserver `.len` after free (MEDIUM). Capture `missing_roots.len` BEFORE the `allocator.free` so the access is obviously valid to readers and static analysers. Zig slices store `.len` in the local fat pointer so the previous form was defined; this is purely a clarity fix. * Comment that aggs pointer-stability relies on `defer` ordering, not the assert (LOW). The `std.debug.assert(@intFromPtr(agg) >= …)` pair fires on the spawning thread BEFORE the task body runs, so it is a one-time spawn-time range check, not a lifetime guarantee. The actual safety invariant is that the loop body doesn't mutate `aggs.items` between `pool.spawnWg` and `pool.waitAndWork`, plus `defer aggs.deinit(allocator)` running AFTER `waitAndWork`. Comment now spells this out so a future refactor that hoists a mutation between iterations is the obvious regression to look for, not the assert text. * Annotate `process_pending_blocks` Message variant as test-only (LOW). The variant + thunk + tripwire counter (`lean_chain_worker_process_pending_blocks_dropped_missing_roots_total`) remain after `submitProcessPendingBlocks` was removed in the previous round, exercised only by queue smoke tests. Doc on the variant declaration now says **TEST-ONLY in production today; see #863 follow-up** with a one-paragraph note that wiring a worker producer requires plumbing a missing-roots backchannel first (the same shape as `imported_block_fn`). Verified locally: - zig build -> clean - zig fmt --check . -> clean - cargo fmt --check -> clean - zig build test --summary all -> 152/152 pass (+1 new test) - zig build simtest --summary all -> 15/15 pass * node: drop forkchoice main mutex from forkChoice.aggregate (#890) The aggregator-thread call to forkChoice.aggregate held the forkchoice main mutex (exclusive) for the full ~70s XMSS aggregate-proof FFI window. Live zeam_0 metrics from devnet (4 zeam containers per host, 8 logical CPUs, only zeam_0 aggregating): lean_committee_signatures_aggregation_time_seconds: 7 calls, sum 427.87s (~61s avg, 6 of 7 fall in the +Inf bucket) zeam_aggregate_worker_duration_seconds: 6 calls, sum 360.17s (~60s avg, confirms the time is dominated by forkChoice.aggregate) lean_tick_interval_duration_seconds: 42 ticks, sum 473.65s (~11.3s/tick vs the 800ms budget = 14x over) zeam_slot_driver_stall_fired_total: 8 (all in the 5-10s bucket) lean_chain_queue_depth{queue="attestation"}: 995 lean_chain_queue_dropped_total{queue="attestation"}: 3026 While that mutex was held, libxev's chain.onInterval -> forkChoice.onInterval blocked on the same lock, the chain-worker blocked too (#890 made the worker a second exclusive contender), and the slot-driver watchdog fired with multi-second stalls every interval an aggregation was in flight - exactly the aggregator-only "clock falls behind head slot, can't reach finality" pattern the user reported. aggregateUnlocked already coordinates correctly with signatures_mutex - the only lock the aggregation actually needs. It does not touch protoArray, latest_finalized, the validator tracker, or anything else gated by the main mutex. Concurrent aggregate() invocations are gated upstream by aggregate_group.concurrent (concurrent_limit=1 in submitAggregateOnInterval), so the main mutex was never serving a real purpose here either. Adds a regression test that holds the forkchoice main mutex exclusive on the test thread and asserts that aggregate runs to completion on a worker thread; if a future refactor re-introduces mutex.lock() in the aggregate path, the worker deadlocks and thread.join() hangs the test. The residual i=4 stall (acceptNewAttestationsUnlocked on libxev still acquires signatures_mutex while the aggregator thread holds it for the FFI window) is left for a follow-up PR that needs a snapshot-then-release pattern over SignaturesMap and AggregatedPayloadsMap. Documented inline on the new aggregate docstring. * node: drop signatures_mutex during XMSS aggregate FFI (#863, #890 followup) zeam_0 still hit `slot-driver stall detected` at i=4 every aggregator slot after #890. Cause: `aggregateUnlocked` held `signatures_mutex` for the full ~18 s `computeAggregatedSignatures` window, and libxev's `chain.onInterval` → `acceptNewAttestationsUnlocked` (which also takes `signatures_mutex`) blocks at i=0/i=4 every slot during aggregation. Snapshot-then-release: 1. Phase 1 (lock briefly): deep-clone `attestation_signatures`, `latest_new_aggregated_payloads`, `latest_known_aggregated_payloads` into owned local copies. 2. Phase 2 (no lock, ~18 s): run `computeAggregatedSignatures` and per-result SSZ clones against the snapshot. Other threads are free to mutate the live maps. 3. Phase 3 (lock briefly): MERGE per-att_data results into `latest_new_*` (gossip-sourced aggregations from other subnets that arrived during phase 2 must survive — replace would clobber them); per-validator remove from `attestation_signatures` for `(att_data, vid)` pairs that existed in the snapshot (entries added during phase 2 stay so the next aggregator pass can consume them). Concurrent `aggregate()` invocations are gated upstream by `aggregate_group.concurrent` (concurrent_limit=1), so no two aggregations race here. The existing #890 regression test still covers "no forkchoice main mutex acquired"; functional behavior is covered by the existing aggregate / gossip / proposal test suites (both `zig build test` and `zig build simtest` green).

ch4r10t33r added 9 commits May 14, 2026 18:38

build: fix zig fmt of rustup run cargo argv

71535e4

`zig fmt --check` rejected the previous commit's column alignment of the `b.addSystemCommand` argument arrays. Re-running `zig fmt build.zig` locally produces the canonical layout; commit it to unblock the lint job.

ch4r10t33r and others added 5 commits May 15, 2026 07:59

Merge branch 'main' into perf/863-loop-fairness-and-fetch-coalesce

673c44e

Merge branch 'main' into perf/863-loop-fairness-and-fetch-coalesce

3219970

Resolve ci.yml: keep rust workspace cache + setup-rust cache: false comments from this branch; take main's cache-bin: false on lint rust-cache alongside workspaces rust -> target.

GrapeBaBa reviewed May 15, 2026

View reviewed changes

Comment thread .github/workflows/ci.yml

GrapeBaBa reviewed May 15, 2026

View reviewed changes

Comment thread pkgs/node/src/clock.zig

GrapeBaBa reviewed May 15, 2026

View reviewed changes

Comment thread pkgs/node/src/chain.zig

ch4r10t33r requested review from GrapeBaBa and zclawz May 15, 2026 15:23

noopur23 approved these changes May 15, 2026

View reviewed changes

ch4r10t33r merged commit f2ecef7 into main May 15, 2026
13 checks passed

ch4r10t33r deleted the perf/863-loop-fairness-and-fetch-coalesce branch May 15, 2026 17:44

zclawz mentioned this pull request May 15, 2026

log attestation subnet coverage in chain status #876

Merged

ch4r10t33r mentioned this pull request May 16, 2026

node, metrics: offload heavy chain mutations to chain-worker, parallelize XMSS verify (#863) #890

Merged

7 tasks

ch4r10t33r mentioned this pull request May 19, 2026

node, forkchoice: keep libxev tick callbacks off the forkchoice main mutex #897

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(node): fix slot-driver starvation under gossip flood (#863)#886

perf(node): fix slot-driver starvation under gossip flood (#863)#886
ch4r10t33r merged 15 commits into
mainfrom
perf/863-loop-fairness-and-fetch-coalesce

ch4r10t33r commented May 14, 2026 •

edited

Loading

Uh oh!

zclawz commented May 14, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zclawz commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ch4r10t33r commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What goes wrong today

What this PR changes

What this PR does not change

New metrics

Test plan

Uh oh!

zclawz commented May 14, 2026

Finding

Validation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zclawz commented May 15, 2026

GrapeBaBa review verification ✅ (HEAD cd245a5)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ch4r10t33r commented May 14, 2026 •

edited

Loading

GrapeBaBa review verification ✅ (HEAD `cd245a5`)