Skip to content

node: harden blocks_by_range catch-up retry and fork recovery#894

Merged
ch4r10t33r merged 13 commits into
mainfrom
node/893-blocks-by-range-catchup
May 19, 2026
Merged

node: harden blocks_by_range catch-up retry and fork recovery#894
ch4r10t33r merged 13 commits into
mainfrom
node/893-blocks-by-range-catchup

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r commented May 17, 2026

Summary

Fixes #893. Hardens blocks_by_range bulk catch-up and the chain-worker backpressure path so a fresh-genesis or aggregator node cannot get stuck with a frozen head while the slot clock keeps advancing. Three follow-up commits were folded into this PR after the initial review to address regressions found on devnet.

blocks_by_range catch-up (#893)

  • Fork mismatch: abort range batch when the first chunk at start_slot does not parent-link to our head; fall back to blocks_by_root from peer head.
  • Retry/fallback: on range failure, timeout, or zero imports — retry on another range-capable peer (up to 3 attempts), then blocks_by_root from peer head.
  • blocks_by_range unavailable: if a peer returns unsupported / invalid-request (or dispatch fails), mark the peer and fall back to blocks_by_root immediately instead of retrying range on that peer (per Anshal).
  • Gap sanity: cap sync gap with min(peer_head - our_head, wall_slot - our_head).
  • Early devnet: trigger catch-up on head gap even when both sides report finalized_slot == 0.
  • Pagination: chain another range after a successful batch if still far behind peer head.
  • Concurrency: one in-flight blocks_by_range per (start_slot, count) window; disjoint ranges may run in parallel.
  • Async import accounting: count chunks_imported only after chain-worker onBlock succeeds; defer sync-end until async imports drain.
  • Helpers: pure gap/retry logic in blocks_by_range_sync.zig (orchestration remains in `node.zig` for now).

Aggregated-attestation queue backpressure

Under devnet restart load the chain-worker aggregated-attestation queue (512) could fill while block STF and catch-up saturated the single worker thread, causing permanent gossip drops (`aggregated attestation queue full, dropping slot=…`) and starving fork-choice vote tracking.

Post-#894 follow-ups (regressions found on devnet)

After the original PR landed, two regression patterns were observed on devnet (most visibly on `zeam_4`/`zeam_8` aggregators) and addressed in three follow-up commits:

1. Sync gated on host wall clock + slot-driver watchdog (`77d90292`)

Symptom: `shouldCatchUpFromPeerStatus` was reading `forkchoice.fcStore.slot_clock.timeSlots` to decide whether to catch up, but on aggregators libxev could stall and freeze that clock; once frozen, the node could no longer detect that it was behind and never re-armed catch-up.

  • New `Clock.wallSlotNow()` derives the current slot from host `std.time` + genesis time, decoupled from the libxev-driven slot clock.
  • `shouldCatchUpFromPeerStatus` now uses `wallSlotNow()` so that wall-clock gap drives catch-up even when the slot driver is hung.
  • Added `SlotDriverWatchdog` (background `std.Thread`): observes the libxev slot driver's last-tick timestamp via an atomic, fires when it exceeds the threshold (default 5 s, see Investigate slow slot_interval / tick duration (event-loop starvation vs nominal 0.8s) #863), and emits `zeam_slot_driver_stall_fired_total` / `zeam_slot_driver_stall_seconds`. When fired, a registered callback in `BeamNode` forces `refreshSyncFromPeers()` so a stalled main loop can recover instead of silently lagging.

2. Reverted `getSyncStatus` head-gap check (`6d8a3920`)

A short-lived attempt to mark a node `.behind_peers` when `our_head_slot < max_peer_head_slot` caused validators to skip their duties when other clients on the network briefly ran ahead. Reverted; sync status now keeps the original spec-aligned semantics and only signals `.behind_peers` for finality / sync-distance signals, not transient head jitter.

3. Drop RPC catch-up chunks on chain-worker `QueueFull` instead of inline-importing (`b2654679`)

This is the regression that PR #894 itself introduced and that the user reported as "all zeam nodes head slot are far behind … clear regression". Diagnosed from `zeam_8` logs as a 9.7-second libxev event-loop stall: a 3.4 MB `blocks_by_root` response was being processed inline on the libxev thread because the chain-worker block queue was already saturated by attestation traffic on the aggregator subnet, and the RPC chunk handler had no backpressure path — it just fell through to a synchronous `chain.onBlock` (~0.5 s of XMSS verification per block). The gossip path already drops on `QueueFull`; the RPC path was asymmetric.

Fix:

  • New `ImportSubmitOutcome` enum (`submitted` / `queue_full` / `worker_disabled` / `failed`) returned by `trySubmitImportToWorker` in `pkgs/node/src/node.zig`.
  • New `ChunkImportDisposition` enum + `classifyChunkImport` helper in `pkgs/node/src/blocks_by_range_sync.zig`.
  • `processBlockByRootChunk` and `processBlockByRangeChunk` now consult `classifyChunkImport`:
    • `submitted` → `handled` (track range-async import as before)
    • `queue_full` → `drop_backpressure` — log + bump `lean_chain_queue_dropped_total{queue="block"}` (already done by `chain_worker.sendBlock`) and return without inline-importing. For range chunks, deliberately do not advance the by-range request progress, so the chunk-timeout / status-driven retry refetches once the worker queue drains.
    • `worker_disabled` / `failed` → `fallback_inline` (only path that still runs `chain.onBlock` on the caller, as before).
  • Aggregators are now bounded by chain-worker capacity instead of pinning the libxev thread when the worker is full.
  • Regression tests in `blocks_by_range_sync.zig` assert that every `ImportSubmitOutcome` variant has a defined disposition and that `queue_full` always maps to `drop_backpressure` (never `fallback_inline`).

Test plan

  • `zig fmt --check .`
  • `cargo fmt --check` / `cargo clippy --workspace -- -D warnings` (rust/Cargo.toml)
  • `zig build test --summary all`
  • `zig build simtest --summary all`
  • Unit test: `cappedSyncGapSlots limits range catch-up to wall-clock head`
  • Unit test: `classifyChunkImport: queue_full drops, never falls back to inline (node: harden blocks_by_range catch-up retry and fork recovery #894 regression guard)`
  • Unit test: `classifyChunkImport` exhaustive switch over every `ImportSubmitOutcome` variant
  • Local-devnet smoke (5 distinct clients × 4 subnets, `0xpartha/zeam:local` built from `b2654679`):
    • 0 `chain-worker block queue full for RPC chunk` warnings
    • 0 `sszClone … falling back to inline import` warnings
    • `lean_chain_queue_dropped_total` never emitted (zero increments)
    • `lean_pending_attestations_size{kind="attestation"} = 0` throughout
    • `verify_signatures` p99 < 1 s (all imports went through chain-worker, none inline on libxev)
    • Local mesh did not stay up long enough on macOS Docker QUIC to drive finality (known `xev` / Docker limitation; see Dockerfile note + Investigate slow slot_interval / tick duration (event-loop starvation vs nominal 0.8s) #863) — the Linux devnet remains the canonical environment for the regression test below.
  • Deploy on ansible devnet `zeam_0` (fresh genesis, no checkpoint) and confirm head advances without manual restart
  • Confirm `lean_chain_queue_dropped_total{queue="aggregated_attestation"}` stays near zero and `lean_pending_attestations_buffered_total{reason="queue_full"}` is bounded under gossip burst
  • Deploy `b2654679` on aggregator `zeam_8` and confirm:
    • no recurrence of multi-second libxev stalls coinciding with large `blocks_by_root` responses
    • `zeam_slot_driver_stall_fired_total` does not increase
    • `lean_chain_queue_dropped_total{queue="block"}` may briefly increment under sustained load — that is the intended backpressure path, not a regression

When local head is on a divergent fork, bulk range sync from our_head+1
could fill the block cache with orphans and stall sync. Detect fork
mismatch on the first range chunk, retry failed or empty range RPCs on
another peer (up to 3 attempts), fall back to head-by-root, cap gap by
wall-clock slot, paginate successful ranges, and allow head-gap catch-up
when finalized is still zero.

Fixes #893
Count chunks_imported only after chain-worker onBlock succeeds; defer
sync-end until async imports drain. Coalesce range RPC bookkeeping under
one lock, require an alternate peer for retries, treat all-pre-finalized
batches as no-op, and add zeam_blocks_by_range_sync_total plus unit tests.
On aggregated-attestation QueueFull, enqueue to the leanSpec pending
buffer and nudge replay instead of dropping. Drain up to 32 aggs per
worker loop when queue depth exceeds 16. Avoid duplicate participant
bit decoding and validator_id allocation in onGossipAggregatedAttestation.
Comment thread pkgs/node/src/chain_worker.zig
Comment thread pkgs/node/src/network.zig Outdated
Comment thread pkgs/node/src/node.zig
Allow concurrent blocks_by_range catch-up for disjoint slot windows (one
in-flight per start_slot/count), extract pure sync helpers to
blocks_by_range_sync.zig, and document that agg queue batching does not
call forkChoice.aggregate (snapshot path unchanged).
Mark peers that reject range RPCs (unsupported / invalid request) and
skip range retries on them. Large-gap catch-up uses blocks_by_root from
peer head when range is unavailable; range retries prefer capable peers.
@ch4r10t33r ch4r10t33r requested a review from g11tech May 18, 2026 16:17
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 18, 2026

Reviewed PR #894. I’m not approving yet — I found two catch-up correctness issues that look realistic enough to block this until fixed.

Blocking findings

  1. Overlapping blocks_by_range windows can strand sync completion

    blocks_by_range_active is keyed by exactly (start_slot, count) (pkgs/node/src/network.zig:769-786), so two requests with the same start_slot but different count — or otherwise overlapping slot windows — can run concurrently.

    Async import tracking is block_root -> request_id (pkgs/node/src/node.zig:276-280) and increments chunks_async_pending before inserting into that map. If two overlapping ranges submit the same block to the worker before import completes, the later put overwrites the earlier request id. The earlier request’s async-pending count is then never decremented, so sync_end_pending && chunks_async_pending == 0 never becomes true and its finalize/retry/fallback path can hang indefinitely.

    Suggested fixes: either reject overlapping range windows, track block_root -> list/set(request_id), or only increment async-pending when the root was newly tracked for that request.

  2. Fork-mismatch detection misses skipped-slot first chunks

    The range fork guard only checks parent linkage when signed_block.block.slot == view.start_slot (pkgs/node/src/node.zig:1573-1584). But blocks_by_range responses legitimately skip empty slots. If the first returned block is at start_slot + N, a sibling-fork first chunk bypasses this guard and falls into missing-parent/cache/fetch handling instead of immediately aborting the range and falling back to blocks_by_root.

    Suggested fix: track whether a chunk is the first received chunk for the request and apply the parent_root == our_head_root_at_start check to that first chunk regardless of its slot.

Checks

  • zig fmt --check .
  • zig build test --summary all could not complete in this environment because the build invokes rustup, which is not installed here:
    error: failed to spawn and capture stdio from rustup: FileNotFound
    Some non-Rust test targets did run and passed before that failure, but this is not a full test pass.

Once the two range-sync issues are addressed, I’m happy to re-review quickly.

Reject overlapping blocks_by_range windows so async import accounting
cannot hang; apply head-at-start parent check on the first returned
chunk when peers skip empty slots.
@zclawz
Copy link
Copy Markdown
Contributor

zclawz commented May 18, 2026

Re-reviewed at 1620e637f53180cf8e630abc2e10d9a9cee51f91 after the follow-up fix. The two blockers from my earlier review are addressed:

  • overlapping range windows are now rejected via overlap-aware active range tracking (rangesOverlap + reserveBlocksByRangeActive), and duplicate async chunk tracking no longer increments pending accounting for an already-tracked root;
  • the fork-mismatch guard now applies to the first returned chunk regardless of skipped slots via is_first_chunk.

I don’t see remaining blockers from my review pass. Approved from my side / LGTM.

Validation I could run here:

  • zig fmt --check .
  • zig build test --summary all still cannot fully complete in this environment because rustup is not installed (failed to spawn ... rustup: FileNotFound). The non-Rust/cached tests that ran passed, but this is not a substitute for CI/devnet validation.

@ch4r10t33r ch4r10t33r requested a review from anshalshukla May 18, 2026 18:30
getSyncStatus now reports behind_peers when peer heads are ahead while
finalized is still zero. shouldCatchUpFromPeerStatus triggers on any
positive capped head gap; the 64-slot threshold only picks range vs
blocks_by_root inside initiateCatchUpFromPeerStatus.
Resolve node.zig conflicts: keep PR async sync-end and retry/fallback;
wire main's finalized-slot pagination through initiateBlocksByRangeCatchUp.
Dedupe RPC_ERR_INVALID_REQUEST in constants.zig after auto-merge.
Two fixes for the recurring "node believes it is synced while peers
report a much higher head" symptom on aggregator nodes (#863, #893).

1. cappedSyncGap / shouldCatchUpFromPeerStatus now derive the wall slot
   directly from unixTimestampMillis() and genesis_time_ms via a new
   Clock.wallSlotNow accessor, instead of reading
   forkchoice.fcStore.slot_clock.timeSlots. Reading the forkchoice
   counter self-reinforced slot-driver stalls: a starved counter capped
   the gap to zero, status-driven catch-up was skipped, and the node
   stayed stuck. Using the host wall clock breaks that loop so catch-up
   triggers on real lag independent of tick liveness.

2. SlotDriverWatchdog gains an on_stall callback (StallCallback). The
   CLI wires it to BeamNode.onSlotDriverStall which only flips an
   atomic flag; the next libxev tick observes it and forces a
   refreshSyncFromPeers outside the normal 8-slot cadence. The watchdog
   thread never emits RPCs itself — sendStatusToPeer mutates request
   map state shared with the libp2p bridge and assumes a single
   producer per tick.

Tests: pure wallSlotNowImpl helper covers genesis/skipped/zero edges;
watchdog test confirms the callback signature.
Earlier 7ff865f added a "Check 3" that returned `.behind_peers` whenever
peer head was ahead of ours. Two problems:

1. validator_client.maybeDoProposal / mayBeDoAttestation skip proposer
   and attestation duties on `.behind_peers`. Pre-#894 that only fired
   on a finalization gap (deep sync). Check 3 made it trip on a 1-slot
   head delta from normal gossip latency, silently disabling validators
   near the head.

2. Surrounding chain.zig comments explicitly map `.behind_peers` to
   leanSpec SYNCING ("deep sync"). Inflating SYNCING into "any positive
   head gap" deviates from that mapping.

Status-driven catch-up for the head-only-gap case is preserved: the
`.synced` arm of `handleReqRespResponse` already calls
`shouldCatchUpFromPeerStatus` directly so a peer reporting a higher
head triggers catch-up without changing the node's sync state.
Devnet aggregator zeam_8 was seeing 9.7s libxev tick stalls under
sustained load and missing block-production duty (slot 64 proposed
~4.4s late). Trace evidence (devnet-logs-20260519T111903Z):

  s=63 i=4 11:18:25.200  tick duration=0.799s            (healthy)
  s=63 i=4 11:18:25.670  received blocks-by-root chunk   (3.4 MB resp)
  s=63 i=0 11:18:22       attestation queue full, dropping slot=1 ...
  s=64 i=0 11:18:31.186  slot-driver stall detected: last tick 5.986s
  s=64 i=0 11:18:34.966  tick duration=9.766s            (recovered)

Root cause: `processBlockByRootChunk` and `processBlockByRangeChunk`
fell back to inline `chain.onBlock` on the libxev thread when
`trySubmitImportToWorker` returned false. Inline import costs ~0.5s
of XMSS verification per block; a multi-chunk catch-up burst on a
loaded aggregator (block queue saturated by attestation flood)
hijacked libxev for the duration of the burst. The gossip path
(chain.zig::onGossip) already drops on QueueFull — the asymmetry
was the regression.

Fix: refactor `trySubmitImportToWorker` to return an enum
`ImportSubmitOutcome` (submitted | queue_full | worker_disabled |
failed). Both RPC chunk handlers now classify via a new pure helper
`blocks_by_range_sync.classifyChunkImport`:

  submitted        -> handled (return)
  queue_full       -> drop_backpressure (NO inline fallback;
                      catch-up RPC will refetch next status round)
  worker_disabled  -> fallback_inline (legitimate test path)
  failed           -> fallback_inline (last-resort sszClone failure)

Regression tests in `blocks_by_range_sync.zig`:
  * `queue_full drops, never falls back to inline` — the explicit
    contract guarding against re-introducing the inline fallback.
  * `submitted is handled`, `worker_disabled and failed fall back
    to inline`, exhaustiveness guard over `ImportSubmitOutcome`.

`zig build test` and `zig build simtest` pass.
@ch4r10t33r ch4r10t33r merged commit 8ab0757 into main May 19, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the node/893-blocks-by-range-catchup branch May 19, 2026 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

node: blocks_by_range catch-up fails on fork mismatch; no retry/fallback

3 participants