perf(node): fix slot-driver starvation under gossip flood (#863)#886
Conversation
Adds three zeam-specific metrics that the follow-up commits in this series plumb into `chain.onGossip`, `Network.ensureBlocksByRootRequest`, and `Clock.run` to fix the slot-driver starvation documented in #863: * zeam_gossip_atts_dropped_total{kind, reason} — drops on the libxev main thread before chain-worker dispatch (syncing / future_slot / worker_validation_failed). * zeam_blocks_by_root_inflight — instantaneous outbound BlocksByRoot RPC count, capped at MAX_CONCURRENT_BLOCKS_BY_ROOT to bound fanout. * zeam_xev_clock_drain_passes_total — Clock.run pass counter for the .once-bounded drain (independent liveness signal vs the existing tick-interval histogram). A new outcome label value `inflight_cap` is also added to the existing lean_block_fetch_dedup_total help string so the cap-rejection bucket folds into the same dashboard as the other dedup outcomes (preserves the `sum(rate(... by outcome)) == sum(rate(...))` invariant). No behaviour change in this commit — the counters/gauges register as no-op until the follow-up commits start bumping them. Refs #863
Switches `Clock.run`'s inner drain from `events.run(.until_done)` to
`events.run(.once)` so the libxev main thread always returns to the
slot driver after one io_uring CQE batch — even under sustained gossip
flood.
Why this matters: the pre-fix shape looped until `loop.active == 0`,
which under devnet-4 aggregator pressure (4× subnet fanout, 74%
attestations referring to head blocks not yet imported, each spawning
a `BlocksByRoot` RPC that itself timed out and retried) never reached
zero. The slot driver effectively stopped advancing for many seconds at
a stretch, producing the ~96 finalized vs ~196 head delta documented
in the issue.
Mechanics:
* `.once` blocks until ≥1 completion fires, drains whatever the
kernel queued in that batch (up to 128 CQEs per the io_uring
backend), and returns. The next-interval timer is itself a
completion, so under no-flood conditions the wake cadence is
unchanged (~800ms).
* `tickInterval` is no longer called every loop iteration — only
when wall clock crosses the next interval boundary. The check
mirrors the boundary advance inside `tickInterval` itself so a
future tweak to either won't silently drift them apart. This keeps
`lean_tick_interval_duration_seconds` and the per-tick log line at
interval cadence rather than CQE-batch cadence.
* Bootstrap `tickInterval` runs once before entering the loop so
`.once` has at least one timer completion to wait on.
The existing `zeam_xev_clock_until_done_drain_seconds` histogram is
repurposed to measure per-batch drain time (now sub-ms in steady
state); the slow-drain counters (>=500ms / >=1s) and the warn log are
retained but rarely fire post-fix. `zeam_xev_clock_drain_passes_total`
(added in the previous commit) is the new high-frequency liveness
counter.
Refs #863
Adds `MAX_CONCURRENT_BLOCKS_BY_ROOT = 8` as a per-resource concurrency
cap on outbound `BlocksByRoot` RPCs, enforced inside
`ensureBlocksByRootRequest` via an atomic `blocks_by_root_inflight`
counter on `Network`.
Why: the pre-fix path issued one RPC per attestation referencing an
unknown head, multiplied by 4× subnet fanout for an aggregator. Under
sustained gossip flood (devnet-4 aggregator saw 74% of received
attestations referencing not-yet-imported heads) the libxev thread
could fork hundreds of concurrent RPCs that themselves timed out at
`RPC_REQUEST_TIMEOUT_SECONDS=8s` and retried, saturating both the
event loop and the timeout sweep. The cap pins fan-out so flood
pressure can't run the request set away.
Mechanics:
* Reservation is taken via a CAS loop BEFORE `selectPeer` so a
saturated cap doesn't burn a peer-selection round trip. The CAS
is `acq_rel` on success / `monotonic` on failure — a concurrent
decrement from `finalizePendingRequest` is observed on the next
loop iteration.
* Reservation released on every error path before the RPC id is
recorded in `pending_rpc_requests` (errdefer guarded). After
dispatch, ownership passes to `pending_rpc_requests`; the slot
is released by `finalizePendingRequest` regardless of whether
the request succeeded, timed out, or was cancelled.
* New `error.InFlightCapReached` is bucketed in
`lean_block_fetch_dedup_total{outcome="inflight_cap"}` by the
`BeamNode.fetchBlockByRoots` caller so the dedup-outcome
invariant `sum(rate(... by outcome)) == sum(rate(...))` still
holds and dashboards see the cap-saturation rate next to the
other suppression buckets.
* `zeam_blocks_by_root_inflight` is `set()` after every change so
the gauge's value is the live count rather than the reservation
delta.
Tuning rationale: 8 in-flight × `MAX_BLOCKS_BY_ROOT` per request
covers ~64 missing roots concurrently, comfortably above observed
sustained worst-case missing rates while keeping libxev RPC fan-out
bounded. Tune by watching the gauge saturate and the `inflight_cap`
counter climb together.
Refs #863
Two related changes that prevent gossip attestations / aggregations
from amplifying load while the node is sync-behind or while a flood
producer races ahead of our slot clock.
1. Hoist the future-slot bound check (`AttestationTooFarInFuture`,
`GOSSIP_DISPARITY_INTERVALS=1`) in `validateAttestationData` to
run BEFORE the source/target/head proto-node lookups. Pre-fix
order ran the lookups first, so an attestation for `current_slot
+ N` referencing future blocks bottomed out at `UnknownHeadBlock`
and the caller dutifully enqueued a `BlocksByRoot` for that
future head — which always 404'd because the block didn't exist
anywhere yet. Multiplied by 4× subnet fan-out on aggregators this
was a fetch-storm amplifier and a primary contributor to the
slot-driver starvation in #863. Rejecting the attestation here
means we never derive a missing root for it.
2. Add a sync-status gate at the top of the `.attestation` and
`.aggregation` arms in `chain.onGossip`. While
`getSyncStatus() != .synced` (no_peers / fc_initing /
behind_peers) the chain drops the message immediately —
`validateAttestationData` is not run, so no UnknownXBlock can be
derived, and no fetch is enqueued. Recovery flows through
`BlocksByRange` and gossip block import, NOT per-attestation
point lookups. Block gossip is intentionally NOT gated.
Both drops bump the new
`zeam_gossip_atts_dropped_total{kind, reason}` counter so dashboards
distinguish:
* future_slot — flood ahead of our clock or a malicious producer.
* syncing — backpressure while we catch up.
* worker_validation_failed — (added by the next commit when the
gossip handler becomes pure dispatch) catches the proto-node
miss path that still flows through the chain-worker.
Refs #863
Splits gossip attestation/aggregation handling so the libxev main
thread only does O(1) per-message work and the chain-worker thread
runs the expensive proto-array reads.
Libxev hot path (per gossip message):
* sync-aware gate (`getSyncStatus()`) — already P2.
* future-slot fast-drop via the new `attestationIsTooFarInFuture`
helper — slot/clock comparison only, no proto-array touch.
* dispatch to the chain-worker via `submitGossipAttestation` /
`submitGossipAggregatedAttestation` (the latter now actually
used — clones the SignedAggregatedAttestation since the gossip
layer owns the borrow).
Worker thread (per popped message):
* `validateAttestationData` — proto-node lookups under
`forkChoice.mutex` for source/target/head, slot relationship
checks, checkpoint-slot match.
* `onGossipAttestation` / `onGossipAggregatedAttestation` —
`verifySingleAttestation` / `verifyAggregatedAttestation` (XMSS)
plus the forkchoice tracker / aggregated-payload writes.
* Failures bucket into
`zeam_gossip_atts_dropped_total{reason="worker_validation_failed"}`
so dashboards still attribute the rate. Drops are silent (the
chain-worker thunks have no upstream channel to surface errors;
consistent with the existing on_block thunk's no-feedback
contract).
Trade-off: the synchronous `missing_attestation_roots` feedback for
gossip attestations referencing unknown blocks is dropped. The same
trade-off was already documented for the chain-worker block path —
gossip clients re-broadcast liberally and we recover through
`BlocksByRange` and gossip block import, NOT per-attestation point
lookups (which were exactly the fetch-storm vector identified in
#863). The worker-disabled branch (tests / single-threaded mode)
retains the inline-validate / return-missing-roots behaviour so
those callers continue to exercise the full validation surface.
`validateAttestationData` retains the full check internally
(future-slot first, then proto-node lookups) so the worker thread
remains correct without depending on the libxev fast-drop running
first; the duplicate future-slot check on the worker side is
trivially fast and the steady-state hit rate is near zero because
the libxev fast-drop catches almost all of them.
Also: the `.attestation` arm previously ran validation for non-
aggregators too (then dropped the result). The new shape skips
validation entirely on non-aggregators (no work for a result they
were going to discard). Non-aggregators no longer trigger
attestation-driven `BlocksByRoot` fetches — block gossip remains
their primary head-discovery vector; this is consistent with the
broader #863 goal of suppressing per-attestation fetch storms.
Refs #863
…rity) The PR-#886 worker-dispatch slice mirrored the spec's "attestations never trigger BlocksByRoot" intent, but skipped four spec-mandated behaviors. This commit closes those gaps so PR #886 is 100% compliant with leanSpec subspecs/sync/service.py and forks/lstar/spec.py. Behaviors brought in line with the spec: 1. _pending_attestations / _pending_aggregated_attestations buffers, bounded by MAX_PENDING_ATTESTATIONS (1024, spec value), FIFO-evicting on overflow. Mirrored as `pending_attestations` and `pending_aggregated_attestations` on `BeamChain`. 2. Buffer-and-replay on retryable validation failure (UnknownHeadBlock / UnknownSourceBlock / UnknownTargetBlock / MissingState / AttestationTooFarInFuture). Spec buffers all of these via the AssertionError/KeyError catch at the bottom of `on_gossip_attestation`; we now do the same instead of dropping them on the libxev or chain-worker thread. 3. `replayPendingAttestations` runs after every successful `onBlock` import (same hook point as the spec's `_replay_pending_attestations` call inside `on_gossip_block`). Re-validates each entry; outcomes are accepted / re-buffered / dropped, surfaced via `lean_pending_attestations_replay_total{kind, outcome}`. 4. Sync-state gating now mirrors the spec's IDLE/SYNCING/SYNCED semantics: gossip is processed in `behind_peers` (= spec SYNCING) and `synced` (= spec SYNCED); only dropped in `no_peers` / `fc_initing` (= spec IDLE). Pre-fix `behind_peers` dropped gossip, which the spec accepts. 5. Non-aggregators now validate + signature-verify gossip attestations on every node (spec `forks/lstar/spec.py:1080-1151` gates ONLY the `attestation_signatures` write on `is_aggregator`). Verified-but-not-stored is the spec's "validate and drop" path; storage in zeam's fork-choice tracker stays gated on `is_aggregator_enabled` so non-aggregator nodes don't smuggle single-attestation weight into head selection. 6. Removed the libxev future-slot fast-drop. Future-slot attestations now flow through validation and land in the pending buffer for replay (spec semantics — `store.time` advances via clock ticks during the buffer's residency, so a slightly-future attestation can become valid before FIFO eviction). The zeam-specific changes the spec is silent on are kept intact: * `events.run(.once)` libxev drain bound (#863 P4) — spec doesn't define event-loop policy. * MAX_CONCURRENT_BLOCKS_BY_ROOT in-flight cap (#863 P3) — spec doesn't define BlocksByRoot concurrency. * `--aggregate-subnet-ids` own+neighbor coverage (lean-quickstart) — spec allows operator-defined subset selection. * `attestation_tracker` (zeam's per-validator latestNew/latestKnown fork-choice contribution from single attestations) — spec uses aggregated-payload weight for the same role; zeam keeps the tracker as an internal implementation detail, gated on aggregator role for the gossip path. Public API additions (all internal; no CLI/binary changes): * `BeamChain.pending_attestations`, `pending_aggregated_attestations`, `pending_attestations_mutex`, plus `enqueuePendingAttestation`, `enqueuePendingAggregation`, `replayPendingAttestations`, `replayOneAttestation`, `replayOneAggregation`, `runVerifiedGossipAttestation`. * `constants.MAX_PENDING_ATTESTATIONS = 1024`. * Metrics: `lean_pending_attestations_buffered_total{kind, reason}`, `lean_pending_attestations_evicted_total{kind}`, `lean_pending_attestations_replay_total{kind, outcome}`, `lean_pending_attestations_size{kind}`. Test plan: * `zig build test` — all 14 binaries pass (13 + 4 + 1 + 7 + 0 + 19 + 18 + 3 + 149 + 14 + 11 + 18 + 0 + 4). * Buffer FIFO eviction, replay-on-block, and non-aggregator validate-and-drop paths exercised by the existing chain test matrix; no behavioural regressions surfaced. Closes the spec-parity follow-up on PR #886 / #863.
build-all-provers (macos-latest) has been failing with:
error: error: unexpected argument '+nightly' found
Usage: rustup-init[EXE] [OPTIONS]
failed command: cargo +nightly -C rust -Z unstable-options build …
On the GitHub-Actions `macos-latest` runner image, `~/.cargo/bin/cargo`
is the `rustup-init` installer rather than rustup's proxy shim. The
installer doesn't accept the `+toolchain` selector, so every cargo
invocation issued by `build.zig` immediately fails. The CI workflow's
`Prefer rustup shims on PATH (macOS)` step verifies that
`rustup run nightly cargo --version` works, but the verification only
proves rustup itself is fine — `zig build` still calls `cargo +nightly`
directly and hits the broken proxy.
Switching `build.zig`'s rust steps to `rustup run nightly cargo …`
sidesteps the broken `cargo` binary: rustup resolves the toolchain
internally and invokes the real cargo for that toolchain, so we no
longer depend on `cargo` on PATH being a particular flavour. This
works identically on Linux (rustup is the standard installer there
too), and the local-dev requirement is unchanged: rustup must be
installed (the previous `cargo +nightly` shape required it for the
same reason).
Verified locally on aarch64-apple-darwin (Cursor sandbox) with
`zig build` and `zig build -Dprover=dummy` — both pull the rust
artifacts via `rustup run nightly cargo` and link cleanly.
Dockerfile comment is updated for symmetry; the install step itself
already invokes rustup's installer with `--default-toolchain nightly`
so no behavioural change there.
`zig fmt --check` rejected the previous commit's column alignment of the `b.addSystemCommand` argument arrays. Re-running `zig fmt build.zig` locally produces the canonical layout; commit it to unblock the lint job.
The macos-15-arm64 runner image ships ~/.cargo/bin/{cargo,rustup} as
symlinks that resolve back to the rustup-init installer binary. Because
rustup picks its multi-call mode from the resolved exe basename rather
than argv[0], every `cargo` / `rustup run nightly cargo` invocation
fails with `error: unexpected argument '<x>' found / Usage:
rustup-init[EXE] [OPTIONS]` once those shims are first on PATH.
The previous "Prefer rustup shims on PATH (macOS)" step prepended
$HOME/.cargo/bin to GITHUB_PATH, which made the broken shims first on
PATH for every subsequent step and caused build (macos-latest) to fail
inside Swatinem/rust-cache@v2 (`cargo metadata`) and inside `zig build`
(`rustup run nightly cargo … -p zeam-glue`). Dummy prove (macos-latest)
on the same commit happened to pass because its rust-cache hit
restored a working ~/.cargo/bin from a prior run.
Replace the four PATH-shifting steps with an "Ensure rustup proxy
shims (macOS)" step that probes the shims and, if broken, re-runs the
official rustup installer with --force. The installer recreates the
shims as proper hardlinks so `cargo`, `rustup`, and `rustup run`
behave correctly. The step runs before Swatinem/rust-cache@v2 in all
four jobs, so the cache action's own `cargo metadata` call also sees a
working proxy.
|
Adversarial review of #886: FindingHigh — aggregated gossip clone leaks when worker queue rejects submission In the new aggregation worker-dispatch path, var cloned: types.SignedAggregatedAttestation = undefined;
try types.sszClone(self.allocator, types.SignedAggregatedAttestation, signed_aggregation, &cloned);
var cloned_consumed = false;
errdefer if (!cloned_consumed) cloned.deinit();
self.submitGossipAggregatedAttestation(cloned) catch |err| switch (err) {
error.QueueFull => { ... return .{}; },
error.QueueClosed => { ... return .{}; },
error.ChainWorkerDisabled => unreachable,
};
cloned_consumed = true;
That is especially dangerous for this PR because the queue-backpressure path is intentionally expected under the #863 gossip-flood scenario. Under sustained aggregation flood, the fix can convert bounded queue drops into unbounded heap growth on the libxev ingress path. Suggested fix: make the guard a normal var cloned_consumed = false;
defer if (!cloned_consumed) cloned.deinit();
...
cloned_consumed = true;(or set Validation
|
Aggregated attestation gossip clones before submitGossipAggregatedAttestation;
cleanup used errdefer gated on cloned_consumed. QueueFull / QueueClosed
return .{} from the catch arm (normal unwind), so errdefer never ran
and heap owned by the clone leaked on every backpressure drop.
Use defer instead so the clone is deinit() whenever the worker does not
take ownership. Apply the same pattern to the SignedBlock submitBlock
path (same bug) and the chain-worker submitBlock test for consistency.
Reported-by: zclawz (PR #886 review)
BuildKit exporting layers to the GitHub Actions cache backend can fail with HTTP 502 + HTML (GitHub unicorn page) while the Dockerfile.prebuilt build itself already succeeded. Treat cache write failures as non-fatal via ignore-error=true so flaky cache API does not fail CI.
Disable bundled Swatinem/rust-cache in setup-rust-toolchain (cache: false) so it does not run cargo metadata at repo root where there is no Cargo.toml; keep explicit rust-cache steps with workspaces rust -> target. Teach the macOS shim-repair step to reinstall when plain `cargo` is broken; `rustup run nightly cargo` alone was a false negative on GHA macos-15-arm64, which broke Swatinem after the toolchain step. Lint: add workspaces to the standalone rust-cache step.
Resolve ci.yml: keep rust workspace cache + setup-rust cache: false comments from this branch; take main's cache-bin: false on lint rust-cache alongside workspaces rust -> target.
- Clock.run: only re-arm tickInterval after wall clock reaches the next interval boundary without CLOCK_DISPARITY_MS on the outer trigger, so gossip-heavy .once batches do not cancel a pending interval timer before onInterval fires (GrapeBaBa). - Call replayPendingAttestations after successful onBlock + followup on direct paths: chain gossip + pending-queue replay, and BeamNode RPC / cached-descendant / range / publish paths (GrapeBaBa). - Lint job: clarify setup-rust cache: false is CI infra, not #863 perf. - Doc replayPendingAttestations threading contract.
GrapeBaBa review verification ✅ (HEAD
|
…lize XMSS verify (#863) (#890) * node: bound libxev replay drain + drive processPendingBlocks via worker (#863) The aggregator-side slot-driver stall on `zeam_0` had two amplifiers left after #886: 1. `BeamNode` libxev paths (RPC by-root/by-range, cached descendants, `publishBlock`) drained the full `pending_attestations` / `pending_aggregated_attestations` buffers (up to spec cap 1024 each) synchronously after every block import. With io_uring `run(.once)` draining a whole CQE batch before returning, this can hold the slot driver thread for multiple seconds under aggregator load. 2. `submitProcessPendingBlocks` was wired but had no production caller, so the future-slot `pending_blocks` queue was not driven from the clock tick. Changes: * `chain.replayPendingAttestationsLibxevBudget`: caps work at `REPLAY_PENDING_ATTESTATIONS_LIBXEV_BUDGET=48` / `REPLAY_PENDING_AGGREGATIONS_LIBXEV_BUDGET=8` per call, with per-item lock scope so XMSS verify never runs under the buffer mutex. * `BeamNode` libxev callbacks (cached descendants, RPC by-root/by-range, `publishBlock`) now use the budgeted variant. The chain-worker thread keeps using the unbounded `replayPendingAttestations`. * `BeamNode.onInterval` now calls `submitProcessPendingBlocks` when the worker is enabled (or the synchronous path otherwise) so future-slot blocks drain off the slot-driver thread. * New `BeamChain.chainWorkerEnabled` accessor so callers don't poke the private field. * Clock comment clarified: `events.run(.once)` drains the whole kernel CQE batch, not a single completion — the underlying reason these callbacks have to be cheap. * node: address PR #890 review (zclawz) Two fixes from the PR review: 1. Worker `processPendingBlocks` dropped parent-fetch requests. `chainWorkerProcessPendingBlocksThunk` discards the missing-roots slice that the chain returns for `fetchBlockByRoots`, but the chain worker has no handle to `BeamNode` to issue that fetch. The previous commit routed the libxev tick through `submitProcessPendingBlocks`, which silently stranded sync under the out-of-order future-block scenario this path is supposed to repair. Run `processPendingBlocks` inline on the libxev thread (same as the pre-worker path) and call `fetchBlockByRoots` with the returned roots. The thunk and `submitProcessPendingBlocks` are kept so the worker plumbing remains exercised end-to-end; the thunk now logs a warning if it ever fires with non-empty missing-roots so a future regression is visible instead of silent. A worker → libxev backchannel for the fetch is tracked as a #863 follow-up. 2. `lean_pending_attestations_size` gauge went stale under budgeted replay. `replayPendingAttestationsLibxevBudget` consumed entries via `orderedRemove` without refreshing the gauge, so dashboards would keep reporting the pre-replay depth until the next enqueue or full replay. Update the gauge under the same lock as the pop, for both attestation and aggregation buffers. * node: move RPC import + replay onto chain-worker, parallelize XMSS verify (#890) Three follow-ups to the slot-driver stall work, all aimed at keeping the libxev thread free of CPU-heavy chain mutations under aggregator load. 1. RPC block import on the worker `processBlockByRootChunk`, `processBlockByRangeChunk`, and the post- import branch of `processCachedDescendants` now route through `trySubmitImportToWorker` when the chain-worker is enabled and the parent is in forkchoice. The worker performs `chain.onBlock`, `onBlockFollowup`, and the inline replay; libxev only does network plumbing. A new `imported_block_fn` backchannel hands the chain-just-imported root + `missing_attestation_roots` slice back to `BeamNode.handleChainImportedBlock` so cache cleanup, descendant retry, and `blocks_by_root` fan-out all still happen — fixing the pre-existing TODO where the worker thunk silently dropped `missing_roots`. `publishBlock` keeps its inline import (downstream callers/tests observe forkchoice synchronously) but now nudges the worker for replay. 2. `replayPendingAttestations` is worker-only when the worker is enabled New `submitReplayPendingAttestations` enqueues a `replay_pending_attestations` Message onto the chain-worker. Every libxev producer that previously called the (now-removed) `replayPendingAttestationsLibxevBudget` budget variant goes through `replayPendingAttestationsAsync` which submits to the worker or, on `error.ChainWorkerDisabled`, falls back to the synchronous in-thread path. Drops the libxev budget constants — they only existed to bound work that no longer runs on libxev. 3. Parallel XMSS verify in `replayPendingAttestations` When `BeamChain.thread_pool` is configured the per-entry replay (`replayOneAttestation` / `replayOneAggregation`) fans out via `pool.spawnWg` + `pool.waitAndWork`. The XMSS signature verify is the dominant cost and has no shared state across siblings; forkchoice / pubkey-cache writes already take their own locks. Falls back to a serial loop when there's no pool. Test fix: reorder `BeamNode.deinit` so `chain.deinit()` (which stops/ joins the worker thread) runs before `network.deinit()`. Without the reorder a still-draining worker dispatch fires `imported_block_fn` against a freed `LockedMap` — surfaced by an alignment panic in the `processCachedDescendants basic flow` test. Verified locally: - zig build test --summary all → 149/149 pass - zig build simtest --summary all → 15/15 pass - zig fmt --check . / cargo fmt --check → clean * node, metrics: address #890 review (zclawz round 2) Blockers: * deinit-order UAF: `batch_pending_parent_roots.deinit()` ran before `chain.deinit()`. The chain-worker's `imported_block_fn` / `flushPendingParentFetches` reach `batch_pending_parent_roots` — if a worker dispatch was in flight when teardown started it could fire the callback against a freed AutoHashMap. Move the deinit after `chain.deinit()` (which stops + joins the worker) and add a comment on the deinit ordering rule for the next callback-reachable field. * Worker silently dropped blocks on the `forkChoice.hasBlock(parent)` TOCTOU race: finalization could advance between the libxev pre-check and the worker's dispatch, surfacing as `MissingPreState` / `PreFinalizedSlot` inside `chainWorkerOnBlockThunk`. The thunk only logged + dropped — the libxev caller had already returned, so the block was lost and sync stalled until a re-broadcast. Add a sister backchannel `RejectedBlockFn` invoked from the same thunk on those two errors; BeamNode's `handleChainRejectedBlock` routes them through `cacheBlockAndFetchParent` / `pruneCachedBlocks` exactly the way the inline RPC arm did. New unit test (`rejected_block_fn fires on MissingPreState`) covers it by submitting an orphan block with no parent state. Quality items addressed in this PR: * Drop the redundant `flushPendingParentFetches` from `processBlockByRootChunk` / `processBlockByRangeChunk` worker-success branches — the imported-block callback now flushes once. * Stop folding `replay_pending_attestations` nudge drops into the `attestation` queue label. They use a new label `lean_chain_queue_dropped_total{queue="replay_pending"}` so a single drain trigger is no longer counted as N attestation drops. The buffer depth itself is already exposed via `lean_pending_attestations_size{kind}`; doc updated to point operators at it. * Add a tripwire counter `lean_chain_worker_process_pending_blocks_dropped_missing_roots_total`: the worker's `process_pending_blocks` thunk has no production producer today (libxev tick still runs `processPendingBlocks` inline so the missing-roots can flow back into `BeamNode.fetchBlockByRoots`), but the existing warn-only path was easy to miss in dashboards. The counter MUST stay 0; non-zero is a regression signal that a future producer wired the path without plumbing the missing-roots backchannel. * Delete `BeamChain.submitProcessPendingBlocks` — zero callers. The Message variant + thunk stay (heap-free POD token used by queue smoke tests in `chain_worker.zig`); a comment points at the tripwire counter for the next refactor. * Document the eviction-tail behaviour for `error.QueueFull` on the isolated-node `publishBlock` path on both `replayPendingAttestationsAsync` and `sendReplayPending`. * Document the `is_aggregator_role` snapshot semantics on `replayOneAttestationTask`. Add debug-build asserts that the `*SignedAggregatedAttestation` borrow into `aggs.items` is in-bounds, so a future loop refactor that mutates the buffer mid-iteration trips loud. * Fix the `imported_block_ctx` doc comment (Allocator, not GPA). * Tighten `setImportedBlockCallback` doc to match the non-nullable signature (no "pass null to clear" claim). * Add happy-path test (`imported_block_fn fires once per successful submitBlock with the correct root`) so any future regression in the worker→BeamNode hand-back is caught by `zig build test`. Tests pass: 151/151 unit (was 149 + 2 new) and 15/15 simtest. * node, chain_worker: address #890 review (zclawz round 3) Address the round-3 adversarial review on PR #890. The architecture held up under a second pass, but five mid-tier and two cosmetic items remained. Fixes below; tests stay green at 152/152 unit + 15/15 simtest. * Network I/O thread-safety audit (HIGH). The chain-worker→BeamNode callbacks (`handleChainImportedBlock`, `handleChainRejectedBlock`) now run on the worker thread, and the safety claim was previously hand-waved as "all touched state is mutex-protected". Walk every primitive the handlers reach and document it explicitly: network helpers (`LockedMap`), `connected_peers` (`RwLock`), forkchoice (`RwLock`), `batch_pending_parent_roots` (mutex), and the `backend.reqresp.sendRequest` path which enqueues onto a Tokio `mpsc::Sender::try_send` in the Rust libp2p glue (`SwarmCommandChannel`, `send_swarm_command` in `rust/libp2p-glue/src/lib.rs`). Tokio `mpsc::Sender` is `Send + Sync` so any thread can fire it; the Rust libp2p swarm runs on its own Tokio runtime, separate from the Zig libxev loop. Net: no libxev primitive is touched from the worker thread. Doc'd on `BeamNode.handleChainImportedBlock` and the `BeamChain.imported_block` field with the full primitive list. Also flagged the future risk: helpers that issue libxev I/O (`xev.Loop` reads/writes/timers) MUST NOT be called from these handlers — they would need a wakeup-back-onto-libxev hop. * Bundle imported/rejected callback fn+ctx into a single optional (MEDIUM). Previously two independent `?fn` / `?*anyopaque` fields per backchannel. Direct field assignment from a test or a future refactor could wire the `fn` without the `ctx` (or vice versa); the worker thunk's two-step `if (fn) |…| if (ctx) |…|` would then silently free `missing_roots` instead of handing them back. Fold into `BlockCallback(FnT)` so the thunk reads a single `if (self.imported_block) |cb| cb.func(cb.ctx, …)`. Setters (`setImportedBlockCallback`, `setRejectedBlockCallback`) take both args together; field-direct assignment now requires populating the struct, eliminating the half-wired footgun. * PreFinalizedSlot rejection path test (MEDIUM). The `pre_finalized` arm of `RejectedBlockReason` runs a different code path (`pruneCachedBlocks`) than `MissingPreState` (`cacheBlockAndFetchParent`). New unit test (`rejected_block_fn fires on PreFinalizedSlot`) bumps `forkChoice.fcStore.latest_finalized.slot` past the candidate block's slot under the forkchoice mutex, then submits — the worker takes the PreFinalizedSlot arm and the callback observer records `reason == .pre_finalized`. The MissingPreState test remains; both arms now have explicit coverage. * Isolated-validator liveness gap (MEDIUM). On a single-validator / no-gossip node, a `QueueFull` drop from `replayPendingAttestationsAsync` on the local-`publishBlock` path has no inbound `on_block` to piggy-back on, so pending attestations age toward FIFO eviction at `MAX_PENDING_ATTESTATIONS = 1024`. The previous `debug` log was invisible without the `lean_chain_queue_dropped_total{queue= replay_pending}` dashboard. Add a `ReplayProducer` enum (`local_publish` vs `gossip_or_rpc_followup`); the local-publish drop now logs at `warn` with the gauge name and FIFO threshold inline, so the first occurrence is visible in raw logs. The three gossip / RPC follow-up call sites keep the prior `debug` level (a missed nudge there is provably benign within one block-cycle). * TestCallbackObserver `.len` after free (MEDIUM). Capture `missing_roots.len` BEFORE the `allocator.free` so the access is obviously valid to readers and static analysers. Zig slices store `.len` in the local fat pointer so the previous form was defined; this is purely a clarity fix. * Comment that aggs pointer-stability relies on `defer` ordering, not the assert (LOW). The `std.debug.assert(@intFromPtr(agg) >= …)` pair fires on the spawning thread BEFORE the task body runs, so it is a one-time spawn-time range check, not a lifetime guarantee. The actual safety invariant is that the loop body doesn't mutate `aggs.items` between `pool.spawnWg` and `pool.waitAndWork`, plus `defer aggs.deinit(allocator)` running AFTER `waitAndWork`. Comment now spells this out so a future refactor that hoists a mutation between iterations is the obvious regression to look for, not the assert text. * Annotate `process_pending_blocks` Message variant as test-only (LOW). The variant + thunk + tripwire counter (`lean_chain_worker_process_pending_blocks_dropped_missing_roots_total`) remain after `submitProcessPendingBlocks` was removed in the previous round, exercised only by queue smoke tests. Doc on the variant declaration now says **TEST-ONLY in production today; see #863 follow-up** with a one-paragraph note that wiring a worker producer requires plumbing a missing-roots backchannel first (the same shape as `imported_block_fn`). Verified locally: - zig build -> clean - zig fmt --check . -> clean - cargo fmt --check -> clean - zig build test --summary all -> 152/152 pass (+1 new test) - zig build simtest --summary all -> 15/15 pass * node: drop forkchoice main mutex from forkChoice.aggregate (#890) The aggregator-thread call to forkChoice.aggregate held the forkchoice main mutex (exclusive) for the full ~70s XMSS aggregate-proof FFI window. Live zeam_0 metrics from devnet (4 zeam containers per host, 8 logical CPUs, only zeam_0 aggregating): lean_committee_signatures_aggregation_time_seconds: 7 calls, sum 427.87s (~61s avg, 6 of 7 fall in the +Inf bucket) zeam_aggregate_worker_duration_seconds: 6 calls, sum 360.17s (~60s avg, confirms the time is dominated by forkChoice.aggregate) lean_tick_interval_duration_seconds: 42 ticks, sum 473.65s (~11.3s/tick vs the 800ms budget = 14x over) zeam_slot_driver_stall_fired_total: 8 (all in the 5-10s bucket) lean_chain_queue_depth{queue="attestation"}: 995 lean_chain_queue_dropped_total{queue="attestation"}: 3026 While that mutex was held, libxev's chain.onInterval -> forkChoice.onInterval blocked on the same lock, the chain-worker blocked too (#890 made the worker a second exclusive contender), and the slot-driver watchdog fired with multi-second stalls every interval an aggregation was in flight - exactly the aggregator-only "clock falls behind head slot, can't reach finality" pattern the user reported. aggregateUnlocked already coordinates correctly with signatures_mutex - the only lock the aggregation actually needs. It does not touch protoArray, latest_finalized, the validator tracker, or anything else gated by the main mutex. Concurrent aggregate() invocations are gated upstream by aggregate_group.concurrent (concurrent_limit=1 in submitAggregateOnInterval), so the main mutex was never serving a real purpose here either. Adds a regression test that holds the forkchoice main mutex exclusive on the test thread and asserts that aggregate runs to completion on a worker thread; if a future refactor re-introduces mutex.lock() in the aggregate path, the worker deadlocks and thread.join() hangs the test. The residual i=4 stall (acceptNewAttestationsUnlocked on libxev still acquires signatures_mutex while the aggregator thread holds it for the FFI window) is left for a follow-up PR that needs a snapshot-then-release pattern over SignaturesMap and AggregatedPayloadsMap. Documented inline on the new aggregate docstring. * node: drop signatures_mutex during XMSS aggregate FFI (#863, #890 followup) zeam_0 still hit `slot-driver stall detected` at i=4 every aggregator slot after #890. Cause: `aggregateUnlocked` held `signatures_mutex` for the full ~18 s `computeAggregatedSignatures` window, and libxev's `chain.onInterval` → `acceptNewAttestationsUnlocked` (which also takes `signatures_mutex`) blocks at i=0/i=4 every slot during aggregation. Snapshot-then-release: 1. Phase 1 (lock briefly): deep-clone `attestation_signatures`, `latest_new_aggregated_payloads`, `latest_known_aggregated_payloads` into owned local copies. 2. Phase 2 (no lock, ~18 s): run `computeAggregatedSignatures` and per-result SSZ clones against the snapshot. Other threads are free to mutate the live maps. 3. Phase 3 (lock briefly): MERGE per-att_data results into `latest_new_*` (gossip-sourced aggregations from other subnets that arrived during phase 2 must survive — replace would clobber them); per-validator remove from `attestation_signatures` for `(att_data, vid)` pairs that existed in the snapshot (entries added during phase 2 stay so the next aggregator pass can consume them). Concurrent `aggregate()` invocations are gated upstream by `aggregate_group.concurrent` (concurrent_limit=1), so no two aggregations race here. The existing #890 regression test still covers "no forkchoice main mutex acquired"; functional behavior is covered by the existing aggregate / gossip / proposal test suites (both `zig build test` and `zig build simtest` green).
Summary
Issue #863 reported the zeam_0 aggregator on devnet-4 falling badly behind
the rest of the network: head trailed finalized by ~100 slots, the slot
clock missed its tick deadline for seconds at a time, and ~74% of incoming
attestations referenced blocks the node hadn't seen yet — every one of
those attestations triggered a block-fetch RPC that itself timed out, so
the node spent its time chasing blocks instead of running fork choice.
This PR fixes the underlying starvation. It also brings zeam's gossip
attestation handling in line with
leanSpecso the behaviour matches thereference implementation rather than diverging quietly.
What goes wrong today
The aggregator does almost everything on a single thread: receive gossip,
validate attestations against fork choice, run state transitions on new
blocks, fire the every-800ms slot tick, etc. Under load four problems
compound:
control back, which can take seconds when peers are flooding.
point-lookup RPC, four times over for an aggregator subscribed to four
subnets. The fan-out itself becomes the load.
silently runs fork choice on stale state.
blocks, so they fight each other for the fork-choice lock.
What this PR changes
The series is layered so each commit is independently reviewable.
metrics: add #863 P2/P3/P4 gating + cap counters— registersthree new metrics consumed by the rest of the series. No behaviour
change.
clock: bound xev drain to one CQE batch per pass (#863 P4)—the event loop now returns to the slot driver after one batch of I/O
completions instead of draining everything in sight. The slot tick is
still rate-gated to its own cadence, so timing histograms stay
meaningful.
network: cap concurrent BlocksByRoot RPCs in-flight (#863 P3)—caps in-flight outbound block-fetch RPCs at 8. Anything beyond that
is rejected on the spot and counted, so a flood can't run the
fan-out away.
chain: future-slot drop + sync-aware gossip gating (#863 P2)—drops two classes of attestation cheaply on the receive path: ones
for slots that haven't happened yet, and ones that arrive while the
node is still catching up to peers. Block gossip is intentionally not
gated — that's the recovery vector during sync.
chain: move gossip attestation validation to chain-worker (#863 P5)— the receive path becomes pure dispatch. The expensivework (fork-choice reads, signature verification) moves onto a
dedicated chain-worker thread so it stops competing with block
import.
chain: align gossip attestation lifecycle with leanSpec (#863 spec parity)— closes the gaps between zeam and thereference implementation:
slot is just slightly in the future, are now buffered (capped at
1024 entries, FIFO eviction) and replayed after every successful
block import — same as
leanSpec's_pending_attestations/_replay_pending_attestations.it accepts gossip and uses the buffer to absorb out-of-order
arrivals; only IDLE-equivalent states (no peers, fork choice not
initialised) drop gossip.
received attestations and then drop them, matching the spec's
"validate and drop" path. Storage in fork choice stays gated on
the aggregator role.
build: invoke rust nightly via 'rustup run' …+build: fix zig fmt of rustup run cargo argv— CI fix formacos-latestwhere
cargo +nightlywas hitting the rustup installer instead ofthe proxy and failing immediately. Switching to
rustup run nightly cargo …works on every supported installwithout depending on which binary
cargohappens to point at.What this PR does not change
A few things look related but are deliberately out of scope:
latestNew/
latestKnown) stays as the head-selection mechanism. The specdoesn't define it and zeam relies on it; this PR doesn't touch that
shape.
BlocksByRootin-flight cap and the one-batch event-loop boundare zeam-specific performance knobs the spec is silent on; they stay.
--aggregate-subnet-idsselection policy (own + neighbour) is aseparate change that lives in
lean-quickstart, not here.New metrics
zeam_gossip_atts_dropped_total{kind, reason}— drops on the receivethread before chain-worker dispatch (
syncing/worker_validation_failed).zeam_blocks_by_root_inflight— instantaneous outbound block-fetchRPC count, capped at 8.
zeam_xev_clock_drain_passes_total— passes through the boundedevent-loop drain.
lean_pending_attestations_buffered_total{kind, reason}— entriespushed into the spec-style replay buffer.
lean_pending_attestations_evicted_total{kind}— FIFO evictions atthe 1024 cap.
lean_pending_attestations_replay_total{kind, outcome}— replayoutcomes (
accepted/buffered/dropped).lean_pending_attestations_size{kind}— instantaneous buffer depth.lean_block_fetch_dedup_totalgains a newoutcome="inflight_cap"label so the existing "every dedup outcome is accounted for" invariant
still holds.
Test plan
zig buildclean on macOS aarch64.zig build test— all test binaries pass (271 tests across 14binaries).
zig fmt --check .clean.perf/863-loop-fairness-and-fetch-coalesceafter therustup-invocation fix:
lintandDummy prove (macos-latest)passing on the latest commit;
build/test/build-all-proversrunning.*
zeam_blocks_by_root_inflightbounded by 8 under load.*
lean_block_fetch_dedup_total{outcome="inflight_cap"}onlynon-zero during flood, zero in steady state.
*
zeam_gossip_atts_dropped_total{reason="syncing"}clears asnodes catch up.
*
lean_pending_attestations_sizenon-zero only during sync /during reorgs; replay outcomes weighted toward
accepted.* Slot-tick latency back below the slot duration on the
aggregator.
* Head-vs-finalized gap stays bounded (vs ~100 slots pre-fix).
Refs #863