Skip to content

perf(node): fix slot-driver starvation under gossip flood (#863)#886

Merged
ch4r10t33r merged 15 commits into
mainfrom
perf/863-loop-fairness-and-fetch-coalesce
May 15, 2026
Merged

perf(node): fix slot-driver starvation under gossip flood (#863)#886
ch4r10t33r merged 15 commits into
mainfrom
perf/863-loop-fairness-and-fetch-coalesce

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r commented May 14, 2026

Summary

Issue #863 reported the zeam_0 aggregator on devnet-4 falling badly behind
the rest of the network: head trailed finalized by ~100 slots, the slot
clock missed its tick deadline for seconds at a time, and ~74% of incoming
attestations referenced blocks the node hadn't seen yet — every one of
those attestations triggered a block-fetch RPC that itself timed out, so
the node spent its time chasing blocks instead of running fork choice.

This PR fixes the underlying starvation. It also brings zeam's gossip
attestation handling in line with leanSpec so the behaviour matches the
reference implementation rather than diverging quietly.

What goes wrong today

The aggregator does almost everything on a single thread: receive gossip,
validate attestations against fork choice, run state transitions on new
blocks, fire the every-800ms slot tick, etc. Under load four problems
compound:

  1. The event loop drains every queued I/O completion before handing
    control back, which can take seconds when peers are flooding.
  2. Each attestation that references an unknown block fans out into a
    point-lookup RPC, four times over for an aggregator subscribed to four
    subnets. The fan-out itself becomes the load.
  3. The slot tick is best-effort: if it doesn't get CPU in time, the node
    silently runs fork choice on stale state.
  4. Incoming attestations are validated on the same thread that imports
    blocks, so they fight each other for the fork-choice lock.

What this PR changes

The series is layered so each commit is independently reviewable.

  1. metrics: add #863 P2/P3/P4 gating + cap counters — registers
    three new metrics consumed by the rest of the series. No behaviour
    change.
  2. clock: bound xev drain to one CQE batch per pass (#863 P4)
    the event loop now returns to the slot driver after one batch of I/O
    completions instead of draining everything in sight. The slot tick is
    still rate-gated to its own cadence, so timing histograms stay
    meaningful.
  3. network: cap concurrent BlocksByRoot RPCs in-flight (#863 P3)
    caps in-flight outbound block-fetch RPCs at 8. Anything beyond that
    is rejected on the spot and counted, so a flood can't run the
    fan-out away.
  4. chain: future-slot drop + sync-aware gossip gating (#863 P2)
    drops two classes of attestation cheaply on the receive path: ones
    for slots that haven't happened yet, and ones that arrive while the
    node is still catching up to peers. Block gossip is intentionally not
    gated — that's the recovery vector during sync.
  5. chain: move gossip attestation validation to chain-worker (#863 P5) — the receive path becomes pure dispatch. The expensive
    work (fork-choice reads, signature verification) moves onto a
    dedicated chain-worker thread so it stops competing with block
    import.
  6. chain: align gossip attestation lifecycle with leanSpec (#863 spec parity) — closes the gaps between zeam and the
    reference implementation:
    • Attestations whose referenced block isn't yet imported, or whose
      slot is just slightly in the future, are now buffered (capped at
      1024 entries, FIFO eviction) and replayed after every successful
      block import — same as leanSpec's _pending_attestations /
      _replay_pending_attestations.
    • Sync-state gating now matches the spec: while the node is syncing
      it accepts gossip and uses the buffer to absorb out-of-order
      arrivals; only IDLE-equivalent states (no peers, fork choice not
      initialised) drop gossip.
    • Non-aggregators now run validation and signature verification on
      received attestations and then drop them, matching the spec's
      "validate and drop" path. Storage in fork choice stays gated on
      the aggregator role.
  7. build: invoke rust nightly via 'rustup run' … + build: fix zig fmt of rustup run cargo argv — CI fix for macos-latest
    where cargo +nightly was hitting the rustup installer instead of
    the proxy and failing immediately. Switching to
    rustup run nightly cargo … works on every supported install
    without depending on which binary cargo happens to point at.

What this PR does not change

A few things look related but are deliberately out of scope:

  • The fork-choice attestation tracker (zeam's per-validator latestNew
    / latestKnown) stays as the head-selection mechanism. The spec
    doesn't define it and zeam relies on it; this PR doesn't touch that
    shape.
  • The BlocksByRoot in-flight cap and the one-batch event-loop bound
    are zeam-specific performance knobs the spec is silent on; they stay.
  • The --aggregate-subnet-ids selection policy (own + neighbour) is a
    separate change that lives in lean-quickstart, not here.

New metrics

  • zeam_gossip_atts_dropped_total{kind, reason} — drops on the receive
    thread before chain-worker dispatch (syncing / worker_validation_failed).
  • zeam_blocks_by_root_inflight — instantaneous outbound block-fetch
    RPC count, capped at 8.
  • zeam_xev_clock_drain_passes_total — passes through the bounded
    event-loop drain.
  • lean_pending_attestations_buffered_total{kind, reason} — entries
    pushed into the spec-style replay buffer.
  • lean_pending_attestations_evicted_total{kind} — FIFO evictions at
    the 1024 cap.
  • lean_pending_attestations_replay_total{kind, outcome} — replay
    outcomes (accepted / buffered / dropped).
  • lean_pending_attestations_size{kind} — instantaneous buffer depth.

lean_block_fetch_dedup_total gains a new outcome="inflight_cap"
label so the existing "every dedup outcome is accounted for" invariant
still holds.

Test plan

  • zig build clean on macOS aarch64.
  • zig build test — all test binaries pass (271 tests across 14
    binaries).
  • zig fmt --check . clean.
  • CI on perf/863-loop-fairness-and-fetch-coalesce after the
    rustup-invocation fix: lint and Dummy prove (macos-latest)
    passing on the latest commit; build / test /
    build-all-provers running.
  • Devnet redeploy — watch for:
    * zeam_blocks_by_root_inflight bounded by 8 under load.
    * lean_block_fetch_dedup_total{outcome="inflight_cap"} only
    non-zero during flood, zero in steady state.
    * zeam_gossip_atts_dropped_total{reason="syncing"} clears as
    nodes catch up.
    * lean_pending_attestations_size non-zero only during sync /
    during reorgs; replay outcomes weighted toward accepted.
    * Slot-tick latency back below the slot duration on the
    aggregator.
    * Head-vs-finalized gap stays bounded (vs ~100 slots pre-fix).

Refs #863

Adds three zeam-specific metrics that the follow-up commits in this
series plumb into `chain.onGossip`, `Network.ensureBlocksByRootRequest`,
and `Clock.run` to fix the slot-driver starvation documented in #863:

  * zeam_gossip_atts_dropped_total{kind, reason} — drops on the libxev
    main thread before chain-worker dispatch (syncing / future_slot /
    worker_validation_failed).
  * zeam_blocks_by_root_inflight — instantaneous outbound BlocksByRoot
    RPC count, capped at MAX_CONCURRENT_BLOCKS_BY_ROOT to bound fanout.
  * zeam_xev_clock_drain_passes_total — Clock.run pass counter for the
    .once-bounded drain (independent liveness signal vs the existing
    tick-interval histogram).

A new outcome label value `inflight_cap` is also added to the existing
lean_block_fetch_dedup_total help string so the cap-rejection bucket
folds into the same dashboard as the other dedup outcomes (preserves
the `sum(rate(... by outcome)) == sum(rate(...))` invariant).

No behaviour change in this commit — the counters/gauges register as
no-op until the follow-up commits start bumping them.

Refs #863
Switches `Clock.run`'s inner drain from `events.run(.until_done)` to
`events.run(.once)` so the libxev main thread always returns to the
slot driver after one io_uring CQE batch — even under sustained gossip
flood.

Why this matters: the pre-fix shape looped until `loop.active == 0`,
which under devnet-4 aggregator pressure (4× subnet fanout, 74%
attestations referring to head blocks not yet imported, each spawning
a `BlocksByRoot` RPC that itself timed out and retried) never reached
zero. The slot driver effectively stopped advancing for many seconds at
a stretch, producing the ~96 finalized vs ~196 head delta documented
in the issue.

Mechanics:
  * `.once` blocks until ≥1 completion fires, drains whatever the
    kernel queued in that batch (up to 128 CQEs per the io_uring
    backend), and returns. The next-interval timer is itself a
    completion, so under no-flood conditions the wake cadence is
    unchanged (~800ms).
  * `tickInterval` is no longer called every loop iteration — only
    when wall clock crosses the next interval boundary. The check
    mirrors the boundary advance inside `tickInterval` itself so a
    future tweak to either won't silently drift them apart. This keeps
    `lean_tick_interval_duration_seconds` and the per-tick log line at
    interval cadence rather than CQE-batch cadence.
  * Bootstrap `tickInterval` runs once before entering the loop so
    `.once` has at least one timer completion to wait on.

The existing `zeam_xev_clock_until_done_drain_seconds` histogram is
repurposed to measure per-batch drain time (now sub-ms in steady
state); the slow-drain counters (>=500ms / >=1s) and the warn log are
retained but rarely fire post-fix. `zeam_xev_clock_drain_passes_total`
(added in the previous commit) is the new high-frequency liveness
counter.

Refs #863
Adds `MAX_CONCURRENT_BLOCKS_BY_ROOT = 8` as a per-resource concurrency
cap on outbound `BlocksByRoot` RPCs, enforced inside
`ensureBlocksByRootRequest` via an atomic `blocks_by_root_inflight`
counter on `Network`.

Why: the pre-fix path issued one RPC per attestation referencing an
unknown head, multiplied by 4× subnet fanout for an aggregator. Under
sustained gossip flood (devnet-4 aggregator saw 74% of received
attestations referencing not-yet-imported heads) the libxev thread
could fork hundreds of concurrent RPCs that themselves timed out at
`RPC_REQUEST_TIMEOUT_SECONDS=8s` and retried, saturating both the
event loop and the timeout sweep. The cap pins fan-out so flood
pressure can't run the request set away.

Mechanics:
  * Reservation is taken via a CAS loop BEFORE `selectPeer` so a
    saturated cap doesn't burn a peer-selection round trip. The CAS
    is `acq_rel` on success / `monotonic` on failure — a concurrent
    decrement from `finalizePendingRequest` is observed on the next
    loop iteration.
  * Reservation released on every error path before the RPC id is
    recorded in `pending_rpc_requests` (errdefer guarded). After
    dispatch, ownership passes to `pending_rpc_requests`; the slot
    is released by `finalizePendingRequest` regardless of whether
    the request succeeded, timed out, or was cancelled.
  * New `error.InFlightCapReached` is bucketed in
    `lean_block_fetch_dedup_total{outcome="inflight_cap"}` by the
    `BeamNode.fetchBlockByRoots` caller so the dedup-outcome
    invariant `sum(rate(... by outcome)) == sum(rate(...))` still
    holds and dashboards see the cap-saturation rate next to the
    other suppression buckets.
  * `zeam_blocks_by_root_inflight` is `set()` after every change so
    the gauge's value is the live count rather than the reservation
    delta.

Tuning rationale: 8 in-flight × `MAX_BLOCKS_BY_ROOT` per request
covers ~64 missing roots concurrently, comfortably above observed
sustained worst-case missing rates while keeping libxev RPC fan-out
bounded. Tune by watching the gauge saturate and the `inflight_cap`
counter climb together.

Refs #863
Two related changes that prevent gossip attestations / aggregations
from amplifying load while the node is sync-behind or while a flood
producer races ahead of our slot clock.

  1. Hoist the future-slot bound check (`AttestationTooFarInFuture`,
     `GOSSIP_DISPARITY_INTERVALS=1`) in `validateAttestationData` to
     run BEFORE the source/target/head proto-node lookups. Pre-fix
     order ran the lookups first, so an attestation for `current_slot
     + N` referencing future blocks bottomed out at `UnknownHeadBlock`
     and the caller dutifully enqueued a `BlocksByRoot` for that
     future head — which always 404'd because the block didn't exist
     anywhere yet. Multiplied by 4× subnet fan-out on aggregators this
     was a fetch-storm amplifier and a primary contributor to the
     slot-driver starvation in #863. Rejecting the attestation here
     means we never derive a missing root for it.

  2. Add a sync-status gate at the top of the `.attestation` and
     `.aggregation` arms in `chain.onGossip`. While
     `getSyncStatus() != .synced` (no_peers / fc_initing /
     behind_peers) the chain drops the message immediately —
     `validateAttestationData` is not run, so no UnknownXBlock can be
     derived, and no fetch is enqueued. Recovery flows through
     `BlocksByRange` and gossip block import, NOT per-attestation
     point lookups. Block gossip is intentionally NOT gated.

Both drops bump the new
`zeam_gossip_atts_dropped_total{kind, reason}` counter so dashboards
distinguish:

  * future_slot — flood ahead of our clock or a malicious producer.
  * syncing — backpressure while we catch up.
  * worker_validation_failed — (added by the next commit when the
    gossip handler becomes pure dispatch) catches the proto-node
    miss path that still flows through the chain-worker.

Refs #863
Splits gossip attestation/aggregation handling so the libxev main
thread only does O(1) per-message work and the chain-worker thread
runs the expensive proto-array reads.

Libxev hot path (per gossip message):
  * sync-aware gate (`getSyncStatus()`) — already P2.
  * future-slot fast-drop via the new `attestationIsTooFarInFuture`
    helper — slot/clock comparison only, no proto-array touch.
  * dispatch to the chain-worker via `submitGossipAttestation` /
    `submitGossipAggregatedAttestation` (the latter now actually
    used — clones the SignedAggregatedAttestation since the gossip
    layer owns the borrow).

Worker thread (per popped message):
  * `validateAttestationData` — proto-node lookups under
    `forkChoice.mutex` for source/target/head, slot relationship
    checks, checkpoint-slot match.
  * `onGossipAttestation` / `onGossipAggregatedAttestation` —
    `verifySingleAttestation` / `verifyAggregatedAttestation` (XMSS)
    plus the forkchoice tracker / aggregated-payload writes.
  * Failures bucket into
    `zeam_gossip_atts_dropped_total{reason="worker_validation_failed"}`
    so dashboards still attribute the rate. Drops are silent (the
    chain-worker thunks have no upstream channel to surface errors;
    consistent with the existing on_block thunk's no-feedback
    contract).

Trade-off: the synchronous `missing_attestation_roots` feedback for
gossip attestations referencing unknown blocks is dropped. The same
trade-off was already documented for the chain-worker block path —
gossip clients re-broadcast liberally and we recover through
`BlocksByRange` and gossip block import, NOT per-attestation point
lookups (which were exactly the fetch-storm vector identified in
#863). The worker-disabled branch (tests / single-threaded mode)
retains the inline-validate / return-missing-roots behaviour so
those callers continue to exercise the full validation surface.

`validateAttestationData` retains the full check internally
(future-slot first, then proto-node lookups) so the worker thread
remains correct without depending on the libxev fast-drop running
first; the duplicate future-slot check on the worker side is
trivially fast and the steady-state hit rate is near zero because
the libxev fast-drop catches almost all of them.

Also: the `.attestation` arm previously ran validation for non-
aggregators too (then dropped the result). The new shape skips
validation entirely on non-aggregators (no work for a result they
were going to discard). Non-aggregators no longer trigger
attestation-driven `BlocksByRoot` fetches — block gossip remains
their primary head-discovery vector; this is consistent with the
broader #863 goal of suppressing per-attestation fetch storms.

Refs #863
…rity)

The PR-#886 worker-dispatch slice mirrored the spec's "attestations
never trigger BlocksByRoot" intent, but skipped four spec-mandated
behaviors. This commit closes those gaps so PR #886 is 100% compliant
with leanSpec subspecs/sync/service.py and forks/lstar/spec.py.

Behaviors brought in line with the spec:

  1. _pending_attestations / _pending_aggregated_attestations buffers,
     bounded by MAX_PENDING_ATTESTATIONS (1024, spec value), FIFO-evicting
     on overflow. Mirrored as `pending_attestations` and
     `pending_aggregated_attestations` on `BeamChain`.

  2. Buffer-and-replay on retryable validation failure
     (UnknownHeadBlock / UnknownSourceBlock / UnknownTargetBlock /
     MissingState / AttestationTooFarInFuture). Spec buffers all of
     these via the AssertionError/KeyError catch at the bottom of
     `on_gossip_attestation`; we now do the same instead of dropping
     them on the libxev or chain-worker thread.

  3. `replayPendingAttestations` runs after every successful `onBlock`
     import (same hook point as the spec's `_replay_pending_attestations`
     call inside `on_gossip_block`). Re-validates each entry; outcomes
     are accepted / re-buffered / dropped, surfaced via
     `lean_pending_attestations_replay_total{kind, outcome}`.

  4. Sync-state gating now mirrors the spec's IDLE/SYNCING/SYNCED
     semantics: gossip is processed in `behind_peers` (= spec SYNCING)
     and `synced` (= spec SYNCED); only dropped in `no_peers` /
     `fc_initing` (= spec IDLE). Pre-fix `behind_peers` dropped gossip,
     which the spec accepts.

  5. Non-aggregators now validate + signature-verify gossip
     attestations on every node (spec `forks/lstar/spec.py:1080-1151`
     gates ONLY the `attestation_signatures` write on
     `is_aggregator`). Verified-but-not-stored is the spec's
     "validate and drop" path; storage in zeam's fork-choice tracker
     stays gated on `is_aggregator_enabled` so non-aggregator nodes
     don't smuggle single-attestation weight into head selection.

  6. Removed the libxev future-slot fast-drop. Future-slot
     attestations now flow through validation and land in the pending
     buffer for replay (spec semantics — `store.time` advances via
     clock ticks during the buffer's residency, so a slightly-future
     attestation can become valid before FIFO eviction).

The zeam-specific changes the spec is silent on are kept intact:

  * `events.run(.once)` libxev drain bound (#863 P4) — spec doesn't
    define event-loop policy.
  * MAX_CONCURRENT_BLOCKS_BY_ROOT in-flight cap (#863 P3) — spec
    doesn't define BlocksByRoot concurrency.
  * `--aggregate-subnet-ids` own+neighbor coverage (lean-quickstart) —
    spec allows operator-defined subset selection.
  * `attestation_tracker` (zeam's per-validator latestNew/latestKnown
    fork-choice contribution from single attestations) — spec uses
    aggregated-payload weight for the same role; zeam keeps the
    tracker as an internal implementation detail, gated on aggregator
    role for the gossip path.

Public API additions (all internal; no CLI/binary changes):

  * `BeamChain.pending_attestations`, `pending_aggregated_attestations`,
    `pending_attestations_mutex`, plus `enqueuePendingAttestation`,
    `enqueuePendingAggregation`, `replayPendingAttestations`,
    `replayOneAttestation`, `replayOneAggregation`,
    `runVerifiedGossipAttestation`.
  * `constants.MAX_PENDING_ATTESTATIONS = 1024`.
  * Metrics: `lean_pending_attestations_buffered_total{kind, reason}`,
    `lean_pending_attestations_evicted_total{kind}`,
    `lean_pending_attestations_replay_total{kind, outcome}`,
    `lean_pending_attestations_size{kind}`.

Test plan:

  * `zig build test` — all 14 binaries pass (13 + 4 + 1 + 7 + 0 + 19 +
    18 + 3 + 149 + 14 + 11 + 18 + 0 + 4).
  * Buffer FIFO eviction, replay-on-block, and non-aggregator
    validate-and-drop paths exercised by the existing chain test
    matrix; no behavioural regressions surfaced.

Closes the spec-parity follow-up on PR #886 / #863.
build-all-provers (macos-latest) has been failing with:

    error: error: unexpected argument '+nightly' found
    Usage: rustup-init[EXE] [OPTIONS]
    failed command: cargo +nightly -C rust -Z unstable-options build …

On the GitHub-Actions `macos-latest` runner image, `~/.cargo/bin/cargo`
is the `rustup-init` installer rather than rustup's proxy shim. The
installer doesn't accept the `+toolchain` selector, so every cargo
invocation issued by `build.zig` immediately fails. The CI workflow's
`Prefer rustup shims on PATH (macOS)` step verifies that
`rustup run nightly cargo --version` works, but the verification only
proves rustup itself is fine — `zig build` still calls `cargo +nightly`
directly and hits the broken proxy.

Switching `build.zig`'s rust steps to `rustup run nightly cargo …`
sidesteps the broken `cargo` binary: rustup resolves the toolchain
internally and invokes the real cargo for that toolchain, so we no
longer depend on `cargo` on PATH being a particular flavour. This
works identically on Linux (rustup is the standard installer there
too), and the local-dev requirement is unchanged: rustup must be
installed (the previous `cargo +nightly` shape required it for the
same reason).

Verified locally on aarch64-apple-darwin (Cursor sandbox) with
`zig build` and `zig build -Dprover=dummy` — both pull the rust
artifacts via `rustup run nightly cargo` and link cleanly.

Dockerfile comment is updated for symmetry; the install step itself
already invokes rustup's installer with `--default-toolchain nightly`
so no behavioural change there.
`zig fmt --check` rejected the previous commit's column alignment of
the `b.addSystemCommand` argument arrays. Re-running `zig fmt build.zig`
locally produces the canonical layout; commit it to unblock the lint
job.
The macos-15-arm64 runner image ships ~/.cargo/bin/{cargo,rustup} as
symlinks that resolve back to the rustup-init installer binary. Because
rustup picks its multi-call mode from the resolved exe basename rather
than argv[0], every `cargo` / `rustup run nightly cargo` invocation
fails with `error: unexpected argument '<x>' found / Usage:
rustup-init[EXE] [OPTIONS]` once those shims are first on PATH.

The previous "Prefer rustup shims on PATH (macOS)" step prepended
$HOME/.cargo/bin to GITHUB_PATH, which made the broken shims first on
PATH for every subsequent step and caused build (macos-latest) to fail
inside Swatinem/rust-cache@v2 (`cargo metadata`) and inside `zig build`
(`rustup run nightly cargo … -p zeam-glue`). Dummy prove (macos-latest)
on the same commit happened to pass because its rust-cache hit
restored a working ~/.cargo/bin from a prior run.

Replace the four PATH-shifting steps with an "Ensure rustup proxy
shims (macOS)" step that probes the shims and, if broken, re-runs the
official rustup installer with --force. The installer recreates the
shims as proper hardlinks so `cargo`, `rustup`, and `rustup run`
behave correctly. The step runs before Swatinem/rust-cache@v2 in all
four jobs, so the cache action's own `cargo metadata` call also sees a
working proxy.
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 14, 2026

Adversarial review of #886:

Finding

High — aggregated gossip clone leaks when worker queue rejects submission

In the new aggregation worker-dispatch path, signed_aggregation is deep-cloned before submitGossipAggregatedAttestation. The cleanup guard is currently an errdefer:

var cloned: types.SignedAggregatedAttestation = undefined;
try types.sszClone(self.allocator, types.SignedAggregatedAttestation, signed_aggregation, &cloned);
var cloned_consumed = false;
errdefer if (!cloned_consumed) cloned.deinit();
self.submitGossipAggregatedAttestation(cloned) catch |err| switch (err) {
    error.QueueFull => { ... return .{}; },
    error.QueueClosed => { ... return .{}; },
    error.ChainWorkerDisabled => unreachable,
};
cloned_consumed = true;

QueueFull and QueueClosed return .{} from onGossip (a successful return), so the errdefer does not run. Because cloned_consumed remains false and the worker never takes ownership, the clone's owned proof.participants / signature allocations leak for every rejected aggregated-attestation gossip message.

That is especially dangerous for this PR because the queue-backpressure path is intentionally expected under the #863 gossip-flood scenario. Under sustained aggregation flood, the fix can convert bounded queue drops into unbounded heap growth on the libxev ingress path.

Suggested fix: make the guard a normal defer scoped around the dispatch attempt, or explicitly cloned.deinit() before the return .{} branches. For example:

var cloned_consumed = false;
defer if (!cloned_consumed) cloned.deinit();
...
cloned_consumed = true;

(or set cloned_consumed = true only after a successful enqueue).

Validation

  • I inspected the full PR diff, focusing on pkgs/node/src/chain.zig, pkgs/node/src/network.zig, pkgs/node/src/clock.zig, and metrics additions.
  • Started zig build test; it reached the long node stress harness successfully, but I am not waiting on the full suite here because this ownership bug is already merge-blocking.

ch4r10t33r and others added 5 commits May 15, 2026 07:59
Aggregated attestation gossip clones before submitGossipAggregatedAttestation;
cleanup used errdefer gated on cloned_consumed. QueueFull / QueueClosed
return .{} from the catch arm (normal unwind), so errdefer never ran
and heap owned by the clone leaked on every backpressure drop.

Use defer instead so the clone is deinit() whenever the worker does not
take ownership. Apply the same pattern to the SignedBlock submitBlock
path (same bug) and the chain-worker submitBlock test for consistency.

Reported-by: zclawz (PR #886 review)
BuildKit exporting layers to the GitHub Actions cache backend can fail
with HTTP 502 + HTML (GitHub unicorn page) while the Dockerfile.prebuilt
build itself already succeeded. Treat cache write failures as
non-fatal via ignore-error=true so flaky cache API does not fail CI.
Disable bundled Swatinem/rust-cache in setup-rust-toolchain (cache: false)
so it does not run cargo metadata at repo root where there is no
Cargo.toml; keep explicit rust-cache steps with workspaces rust -> target.

Teach the macOS shim-repair step to reinstall when plain `cargo` is
broken; `rustup run nightly cargo` alone was a false negative on GHA
macos-15-arm64, which broke Swatinem after the toolchain step.

Lint: add workspaces to the standalone rust-cache step.
Resolve ci.yml: keep rust workspace cache + setup-rust cache: false
comments from this branch; take main's cache-bin: false on lint
rust-cache alongside workspaces rust -> target.
Comment thread .github/workflows/ci.yml
Comment thread pkgs/node/src/clock.zig
Comment thread pkgs/node/src/chain.zig
- Clock.run: only re-arm tickInterval after wall clock reaches the next
  interval boundary without CLOCK_DISPARITY_MS on the outer trigger, so
  gossip-heavy .once batches do not cancel a pending interval timer
  before onInterval fires (GrapeBaBa).
- Call replayPendingAttestations after successful onBlock + followup on
  direct paths: chain gossip + pending-queue replay, and BeamNode RPC /
  cached-descendant / range / publish paths (GrapeBaBa).
- Lint job: clarify setup-rust cache: false is CI infra, not #863 perf.
- Doc replayPendingAttestations threading contract.
@ch4r10t33r ch4r10t33r requested review from GrapeBaBa and zclawz May 15, 2026 15:23
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 15, 2026

GrapeBaBa review verification ✅ (HEAD cd245a5)

1. CI yml (cache: false / cache-bin: false) — context only
Addressed via explanation. The changes are a macOS runner prerequisite unrelated to the #863 performance work. No code change needed.


2. clock.zig:215 — early tickInterval() skipping slot duties ✅ Fixed

  • Issue: outer check used + CLOCK_DISPARITY_MS slack, so .once could call tickInterval() during the slack window while the interval timer was still pending → timer canceled → onInterval() skipped.
  • Fix: outer check now uses a strict boundary — current_interval_time_ms + SECONDS_PER_INTERVAL_MS <= time_now_ms (no slack). CLOCK_DISPARITY_MS slack is preserved only inside tickInterval's while-loop for multi-slot catch-up. Lines 70–80 in the diff explicitly document this as the perf(node): fix slot-driver starvation under gossip flood (#863) #886 review catch.

3. chain.zig:808replayPendingAttestations() only on chain-worker path ✅ Fixed

  • Issue: replay only triggered from chainWorkerOnBlockThunk; direct import paths (blocks_by_root/blocks_by_range, pending-block queue, published blocks, --chain-worker=false) never called it.
  • Fix: replayPendingAttestations() now called at 3 call sites:
    • chainWorkerOnBlockThunk (~line 811, original) ✅
    • processPendingBlocks success path (~line 1885) ✅
    • Non-worker direct onBlock path — gossip + published blocks (~line 2504) ✅

All three concerns addressed. PR looks good to merge from a review-fix standpoint.

@ch4r10t33r ch4r10t33r merged commit f2ecef7 into main May 15, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the perf/863-loop-fairness-and-fetch-coalesce branch May 15, 2026 17:44
ch4r10t33r added a commit that referenced this pull request May 17, 2026
…lize XMSS verify (#863) (#890)

* node: bound libxev replay drain + drive processPendingBlocks via worker (#863)

The aggregator-side slot-driver stall on `zeam_0` had two amplifiers
left after #886:

1. `BeamNode` libxev paths (RPC by-root/by-range, cached descendants,
   `publishBlock`) drained the full `pending_attestations` /
   `pending_aggregated_attestations` buffers (up to spec cap 1024 each)
   synchronously after every block import. With io_uring `run(.once)`
   draining a whole CQE batch before returning, this can hold the slot
   driver thread for multiple seconds under aggregator load.

2. `submitProcessPendingBlocks` was wired but had no production caller,
   so the future-slot `pending_blocks` queue was not driven from the
   clock tick.

Changes:

* `chain.replayPendingAttestationsLibxevBudget`: caps work at
  `REPLAY_PENDING_ATTESTATIONS_LIBXEV_BUDGET=48` /
  `REPLAY_PENDING_AGGREGATIONS_LIBXEV_BUDGET=8` per call, with
  per-item lock scope so XMSS verify never runs under the buffer mutex.
* `BeamNode` libxev callbacks (cached descendants, RPC by-root/by-range,
  `publishBlock`) now use the budgeted variant. The chain-worker thread
  keeps using the unbounded `replayPendingAttestations`.
* `BeamNode.onInterval` now calls `submitProcessPendingBlocks` when the
  worker is enabled (or the synchronous path otherwise) so future-slot
  blocks drain off the slot-driver thread.
* New `BeamChain.chainWorkerEnabled` accessor so callers don't poke
  the private field.
* Clock comment clarified: `events.run(.once)` drains the whole kernel
  CQE batch, not a single completion — the underlying reason these
  callbacks have to be cheap.

* node: address PR #890 review (zclawz)

Two fixes from the PR review:

1. Worker `processPendingBlocks` dropped parent-fetch requests.
   `chainWorkerProcessPendingBlocksThunk` discards the missing-roots
   slice that the chain returns for `fetchBlockByRoots`, but the chain
   worker has no handle to `BeamNode` to issue that fetch. The previous
   commit routed the libxev tick through `submitProcessPendingBlocks`,
   which silently stranded sync under the out-of-order future-block
   scenario this path is supposed to repair.

   Run `processPendingBlocks` inline on the libxev thread (same as the
   pre-worker path) and call `fetchBlockByRoots` with the returned
   roots. The thunk and `submitProcessPendingBlocks` are kept so the
   worker plumbing remains exercised end-to-end; the thunk now logs a
   warning if it ever fires with non-empty missing-roots so a future
   regression is visible instead of silent. A worker → libxev
   backchannel for the fetch is tracked as a #863 follow-up.

2. `lean_pending_attestations_size` gauge went stale under budgeted
   replay. `replayPendingAttestationsLibxevBudget` consumed entries via
   `orderedRemove` without refreshing the gauge, so dashboards would
   keep reporting the pre-replay depth until the next enqueue or full
   replay. Update the gauge under the same lock as the pop, for both
   attestation and aggregation buffers.

* node: move RPC import + replay onto chain-worker, parallelize XMSS verify (#890)

Three follow-ups to the slot-driver stall work, all aimed at keeping the
libxev thread free of CPU-heavy chain mutations under aggregator load.

1. RPC block import on the worker
   `processBlockByRootChunk`, `processBlockByRangeChunk`, and the post-
   import branch of `processCachedDescendants` now route through
   `trySubmitImportToWorker` when the chain-worker is enabled and the
   parent is in forkchoice. The worker performs `chain.onBlock`,
   `onBlockFollowup`, and the inline replay; libxev only does network
   plumbing. A new `imported_block_fn` backchannel hands the
   chain-just-imported root + `missing_attestation_roots` slice back to
   `BeamNode.handleChainImportedBlock` so cache cleanup, descendant
   retry, and `blocks_by_root` fan-out all still happen — fixing the
   pre-existing TODO where the worker thunk silently dropped
   `missing_roots`. `publishBlock` keeps its inline import (downstream
   callers/tests observe forkchoice synchronously) but now nudges the
   worker for replay.

2. `replayPendingAttestations` is worker-only when the worker is enabled
   New `submitReplayPendingAttestations` enqueues a
   `replay_pending_attestations` Message onto the chain-worker. Every
   libxev producer that previously called the (now-removed)
   `replayPendingAttestationsLibxevBudget` budget variant goes through
   `replayPendingAttestationsAsync` which submits to the worker or, on
   `error.ChainWorkerDisabled`, falls back to the synchronous in-thread
   path. Drops the libxev budget constants — they only existed to bound
   work that no longer runs on libxev.

3. Parallel XMSS verify in `replayPendingAttestations`
   When `BeamChain.thread_pool` is configured the per-entry replay
   (`replayOneAttestation` / `replayOneAggregation`) fans out via
   `pool.spawnWg` + `pool.waitAndWork`. The XMSS signature verify is
   the dominant cost and has no shared state across siblings;
   forkchoice / pubkey-cache writes already take their own locks.
   Falls back to a serial loop when there's no pool.

Test fix: reorder `BeamNode.deinit` so `chain.deinit()` (which stops/
joins the worker thread) runs before `network.deinit()`. Without the
reorder a still-draining worker dispatch fires `imported_block_fn`
against a freed `LockedMap` — surfaced by an alignment panic in the
`processCachedDescendants basic flow` test.

Verified locally:
- zig build test --summary all  →  149/149 pass
- zig build simtest --summary all  →  15/15 pass
- zig fmt --check . / cargo fmt --check  →  clean

* node, metrics: address #890 review (zclawz round 2)

Blockers:

* deinit-order UAF: `batch_pending_parent_roots.deinit()` ran before
  `chain.deinit()`. The chain-worker's `imported_block_fn` /
  `flushPendingParentFetches` reach `batch_pending_parent_roots` —
  if a worker dispatch was in flight when teardown started it could
  fire the callback against a freed AutoHashMap. Move the deinit
  after `chain.deinit()` (which stops + joins the worker) and add a
  comment on the deinit ordering rule for the next callback-reachable
  field.
* Worker silently dropped blocks on the `forkChoice.hasBlock(parent)`
  TOCTOU race: finalization could advance between the libxev
  pre-check and the worker's dispatch, surfacing as
  `MissingPreState` / `PreFinalizedSlot` inside `chainWorkerOnBlockThunk`.
  The thunk only logged + dropped — the libxev caller had already
  returned, so the block was lost and sync stalled until a re-broadcast.
  Add a sister backchannel `RejectedBlockFn` invoked from the same
  thunk on those two errors; BeamNode's `handleChainRejectedBlock`
  routes them through `cacheBlockAndFetchParent` /
  `pruneCachedBlocks` exactly the way the inline RPC arm did. New
  unit test (`rejected_block_fn fires on MissingPreState`) covers it
  by submitting an orphan block with no parent state.

Quality items addressed in this PR:

* Drop the redundant `flushPendingParentFetches` from
  `processBlockByRootChunk` / `processBlockByRangeChunk` worker-success
  branches — the imported-block callback now flushes once.
* Stop folding `replay_pending_attestations` nudge drops into the
  `attestation` queue label. They use a new label
  `lean_chain_queue_dropped_total{queue="replay_pending"}` so a
  single drain trigger is no longer counted as N attestation drops.
  The buffer depth itself is already exposed via
  `lean_pending_attestations_size{kind}`; doc updated to point
  operators at it.
* Add a tripwire counter
  `lean_chain_worker_process_pending_blocks_dropped_missing_roots_total`:
  the worker's `process_pending_blocks` thunk has no production
  producer today (libxev tick still runs `processPendingBlocks`
  inline so the missing-roots can flow back into
  `BeamNode.fetchBlockByRoots`), but the existing warn-only path
  was easy to miss in dashboards. The counter MUST stay 0; non-zero
  is a regression signal that a future producer wired the path
  without plumbing the missing-roots backchannel.
* Delete `BeamChain.submitProcessPendingBlocks` — zero callers. The
  Message variant + thunk stay (heap-free POD token used by queue
  smoke tests in `chain_worker.zig`); a comment points at the
  tripwire counter for the next refactor.
* Document the eviction-tail behaviour for `error.QueueFull` on the
  isolated-node `publishBlock` path on both
  `replayPendingAttestationsAsync` and `sendReplayPending`.
* Document the `is_aggregator_role` snapshot semantics on
  `replayOneAttestationTask`. Add debug-build asserts that the
  `*SignedAggregatedAttestation` borrow into `aggs.items` is
  in-bounds, so a future loop refactor that mutates the buffer
  mid-iteration trips loud.
* Fix the `imported_block_ctx` doc comment (Allocator, not GPA).
* Tighten `setImportedBlockCallback` doc to match the non-nullable
  signature (no "pass null to clear" claim).
* Add happy-path test
  (`imported_block_fn fires once per successful submitBlock with the
  correct root`) so any future regression in the
  worker→BeamNode hand-back is caught by `zig build test`.

Tests pass: 151/151 unit (was 149 + 2 new) and 15/15 simtest.

* node, chain_worker: address #890 review (zclawz round 3)

Address the round-3 adversarial review on PR #890. The architecture
held up under a second pass, but five mid-tier and two cosmetic items
remained. Fixes below; tests stay green at 152/152 unit + 15/15 simtest.

* Network I/O thread-safety audit (HIGH). The chain-worker→BeamNode
  callbacks (`handleChainImportedBlock`, `handleChainRejectedBlock`)
  now run on the worker thread, and the safety claim was previously
  hand-waved as "all touched state is mutex-protected". Walk every
  primitive the handlers reach and document it explicitly: network
  helpers (`LockedMap`), `connected_peers` (`RwLock`), forkchoice
  (`RwLock`), `batch_pending_parent_roots` (mutex), and the
  `backend.reqresp.sendRequest` path which enqueues onto a Tokio
  `mpsc::Sender::try_send` in the Rust libp2p glue
  (`SwarmCommandChannel`, `send_swarm_command` in
  `rust/libp2p-glue/src/lib.rs`). Tokio `mpsc::Sender` is
  `Send + Sync` so any thread can fire it; the Rust libp2p swarm
  runs on its own Tokio runtime, separate from the Zig libxev loop.
  Net: no libxev primitive is touched from the worker thread.
  Doc'd on `BeamNode.handleChainImportedBlock` and the
  `BeamChain.imported_block` field with the full primitive list.
  Also flagged the future risk: helpers that issue libxev I/O
  (`xev.Loop` reads/writes/timers) MUST NOT be called from these
  handlers — they would need a wakeup-back-onto-libxev hop.

* Bundle imported/rejected callback fn+ctx into a single optional
  (MEDIUM). Previously two independent `?fn` / `?*anyopaque` fields
  per backchannel. Direct field assignment from a test or a future
  refactor could wire the `fn` without the `ctx` (or vice versa);
  the worker thunk's two-step `if (fn) |…| if (ctx) |…|` would then
  silently free `missing_roots` instead of handing them back. Fold
  into `BlockCallback(FnT)` so the thunk reads a single
  `if (self.imported_block) |cb| cb.func(cb.ctx, …)`. Setters
  (`setImportedBlockCallback`, `setRejectedBlockCallback`) take both
  args together; field-direct assignment now requires populating
  the struct, eliminating the half-wired footgun.

* PreFinalizedSlot rejection path test (MEDIUM). The
  `pre_finalized` arm of `RejectedBlockReason` runs a different
  code path (`pruneCachedBlocks`) than `MissingPreState`
  (`cacheBlockAndFetchParent`). New unit test
  (`rejected_block_fn fires on PreFinalizedSlot`) bumps
  `forkChoice.fcStore.latest_finalized.slot` past the candidate
  block's slot under the forkchoice mutex, then submits — the
  worker takes the PreFinalizedSlot arm and the callback observer
  records `reason == .pre_finalized`. The MissingPreState test
  remains; both arms now have explicit coverage.

* Isolated-validator liveness gap (MEDIUM). On a single-validator
  / no-gossip node, a `QueueFull` drop from
  `replayPendingAttestationsAsync` on the local-`publishBlock` path
  has no inbound `on_block` to piggy-back on, so pending
  attestations age toward FIFO eviction at
  `MAX_PENDING_ATTESTATIONS = 1024`. The previous `debug` log was
  invisible without the `lean_chain_queue_dropped_total{queue=
  replay_pending}` dashboard. Add a `ReplayProducer` enum
  (`local_publish` vs `gossip_or_rpc_followup`); the local-publish
  drop now logs at `warn` with the gauge name and FIFO threshold
  inline, so the first occurrence is visible in raw logs. The
  three gossip / RPC follow-up call sites keep the prior `debug`
  level (a missed nudge there is provably benign within one
  block-cycle).

* TestCallbackObserver `.len` after free (MEDIUM). Capture
  `missing_roots.len` BEFORE the `allocator.free` so the access is
  obviously valid to readers and static analysers. Zig slices
  store `.len` in the local fat pointer so the previous form was
  defined; this is purely a clarity fix.

* Comment that aggs pointer-stability relies on `defer` ordering,
  not the assert (LOW). The `std.debug.assert(@intFromPtr(agg) >= …)`
  pair fires on the spawning thread BEFORE the task body runs, so
  it is a one-time spawn-time range check, not a lifetime
  guarantee. The actual safety invariant is that the loop body
  doesn't mutate `aggs.items` between `pool.spawnWg` and
  `pool.waitAndWork`, plus `defer aggs.deinit(allocator)` running
  AFTER `waitAndWork`. Comment now spells this out so a future
  refactor that hoists a mutation between iterations is the
  obvious regression to look for, not the assert text.

* Annotate `process_pending_blocks` Message variant as test-only
  (LOW). The variant + thunk + tripwire counter
  (`lean_chain_worker_process_pending_blocks_dropped_missing_roots_total`)
  remain after `submitProcessPendingBlocks` was removed in the
  previous round, exercised only by queue smoke tests. Doc on the
  variant declaration now says **TEST-ONLY in production today;
  see #863 follow-up** with a one-paragraph note that wiring a
  worker producer requires plumbing a missing-roots backchannel
  first (the same shape as `imported_block_fn`).

Verified locally:
- zig build  ->  clean
- zig fmt --check .  ->  clean
- cargo fmt --check  ->  clean
- zig build test --summary all  ->  152/152 pass (+1 new test)
- zig build simtest --summary all  ->  15/15 pass

* node: drop forkchoice main mutex from forkChoice.aggregate (#890)

The aggregator-thread call to forkChoice.aggregate held the
forkchoice main mutex (exclusive) for the full ~70s XMSS
aggregate-proof FFI window. Live zeam_0 metrics from devnet (4
zeam containers per host, 8 logical CPUs, only zeam_0 aggregating):

  lean_committee_signatures_aggregation_time_seconds: 7 calls,
    sum 427.87s (~61s avg, 6 of 7 fall in the +Inf bucket)
  zeam_aggregate_worker_duration_seconds: 6 calls, sum 360.17s
    (~60s avg, confirms the time is dominated by forkChoice.aggregate)
  lean_tick_interval_duration_seconds: 42 ticks, sum 473.65s
    (~11.3s/tick vs the 800ms budget = 14x over)
  zeam_slot_driver_stall_fired_total: 8 (all in the 5-10s bucket)
  lean_chain_queue_depth{queue="attestation"}: 995
  lean_chain_queue_dropped_total{queue="attestation"}: 3026

While that mutex was held, libxev's chain.onInterval ->
forkChoice.onInterval blocked on the same lock, the chain-worker
blocked too (#890 made the worker a second exclusive contender),
and the slot-driver watchdog fired with multi-second stalls every
interval an aggregation was in flight - exactly the aggregator-only
"clock falls behind head slot, can't reach finality" pattern the
user reported.

aggregateUnlocked already coordinates correctly with
signatures_mutex - the only lock the aggregation actually needs.
It does not touch protoArray, latest_finalized, the validator
tracker, or anything else gated by the main mutex. Concurrent
aggregate() invocations are gated upstream by
aggregate_group.concurrent (concurrent_limit=1 in
submitAggregateOnInterval), so the main mutex was never serving a
real purpose here either.

Adds a regression test that holds the forkchoice main mutex
exclusive on the test thread and asserts that aggregate runs to
completion on a worker thread; if a future refactor re-introduces
mutex.lock() in the aggregate path, the worker deadlocks and
thread.join() hangs the test.

The residual i=4 stall (acceptNewAttestationsUnlocked on libxev
still acquires signatures_mutex while the aggregator thread holds
it for the FFI window) is left for a follow-up PR that needs a
snapshot-then-release pattern over SignaturesMap and
AggregatedPayloadsMap. Documented inline on the new aggregate
docstring.

* node: drop signatures_mutex during XMSS aggregate FFI (#863, #890 followup)

zeam_0 still hit `slot-driver stall detected` at i=4 every aggregator
slot after #890. Cause: `aggregateUnlocked` held `signatures_mutex` for
the full ~18 s `computeAggregatedSignatures` window, and libxev's
`chain.onInterval` → `acceptNewAttestationsUnlocked` (which also takes
`signatures_mutex`) blocks at i=0/i=4 every slot during aggregation.

Snapshot-then-release:

  1. Phase 1 (lock briefly): deep-clone `attestation_signatures`,
     `latest_new_aggregated_payloads`, `latest_known_aggregated_payloads`
     into owned local copies.
  2. Phase 2 (no lock, ~18 s): run `computeAggregatedSignatures` and
     per-result SSZ clones against the snapshot. Other threads are
     free to mutate the live maps.
  3. Phase 3 (lock briefly): MERGE per-att_data results into
     `latest_new_*` (gossip-sourced aggregations from other subnets
     that arrived during phase 2 must survive — replace would clobber
     them); per-validator remove from `attestation_signatures` for
     `(att_data, vid)` pairs that existed in the snapshot (entries
     added during phase 2 stay so the next aggregator pass can
     consume them).

Concurrent `aggregate()` invocations are gated upstream by
`aggregate_group.concurrent` (concurrent_limit=1), so no two
aggregations race here. The existing #890 regression test still
covers "no forkchoice main mutex acquired"; functional behavior is
covered by the existing aggregate / gossip / proposal test suites
(both `zig build test` and `zig build simtest` green).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants