node: harden blocks_by_range catch-up retry and fork recovery #894
When local head is on a divergent fork, bulk range sync from our_head+1 could fill the block cache with orphans and stall sync. Detect fork mismatch on the first range chunk, retry failed or empty range RPCs on another peer (up to 3 attempts), fall back to head-by-root, cap gap by wall-clock slot, paginate successful ranges, and allow head-gap catch-up when finalized is still zero. Fixes #893
Count chunks_imported only after chain-worker onBlock succeeds; defer sync-end until async imports drain. Coalesce range RPC bookkeeping under one lock, require an alternate peer for retries, treat all-pre-finalized batches as no-op, and add zeam_blocks_by_range_sync_total plus unit tests.
On aggregated-attestation QueueFull, enqueue to the leanSpec pending buffer and nudge replay instead of dropping. Drain up to 32 aggs per worker loop when queue depth exceeds 16. Avoid duplicate participant bit decoding and validator_id allocation in onGossipAggregatedAttestation.
Allow concurrent blocks_by_range catch-up for disjoint slot windows (one in-flight per start_slot/count), extract pure sync helpers to blocks_by_range_sync.zig, and document that agg queue batching does not call forkChoice.aggregate (snapshot path unchanged).
Mark peers that reject range RPCs (unsupported / invalid request) and skip range retries on them. Large-gap catch-up uses blocks_by_root from peer head when range is unavailable; range retries prefer capable peers.
Reviewed PR #894. I’m not approving yet: I found two catch-up correctness issues that look realistic enough to block this until fixed.
Blocking findings
Checks
Once the two range-sync issues are addressed, I’m happy to re-review quickly.
Reject overlapping blocks_by_range windows so async import accounting cannot hang; apply head-at-start parent check on the first returned chunk when peers skip empty slots.
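A minimal sketch of the overlap check this commit describes, assuming each request covers the half-open window `[start_slot, start_slot + count)`. The `SlotWindow` type and `windowsOverlap` name are hypothetical and are not taken from `node.zig`.

```zig
const std = @import("std");

/// Hypothetical type: a requested slot window [start_slot, start_slot + count).
const SlotWindow = struct {
    start_slot: u64,
    count: u64,

    /// Exclusive end slot; saturating add so a huge count cannot overflow.
    fn end(self: SlotWindow) u64 {
        return self.start_slot +| self.count;
    }
};

/// Two windows overlap when each one starts before the other ends.
/// Empty windows (count == 0) never overlap anything.
fn windowsOverlap(a: SlotWindow, b: SlotWindow) bool {
    if (a.count == 0 or b.count == 0) return false;
    return a.start_slot < b.end() and b.start_slot < a.end();
}

test "disjoint windows are allowed, overlapping windows are rejected" {
    const in_flight = SlotWindow{ .start_slot = 10, .count = 5 }; // slots 10..14
    try std.testing.expect(!windowsOverlap(in_flight, .{ .start_slot = 15, .count = 5 }));
    try std.testing.expect(windowsOverlap(in_flight, .{ .start_slot = 14, .count = 5 }));
    try std.testing.expect(windowsOverlap(in_flight, .{ .start_slot = 12, .count = 1 }));
}
```

A new request would be refused if it overlaps any in-flight window, so each imported block is accounted against exactly one request.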
Re-reviewed at
I don’t see remaining blockers from my review pass. Approved from my side / LGTM. Validation I could run here:
getSyncStatus now reports behind_peers when peer heads are ahead while finalized is still zero. shouldCatchUpFromPeerStatus triggers on any positive capped head gap; the 64-slot threshold only picks range vs blocks_by_root inside initiateCatchUpFromPeerStatus.
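The trigger and the strategy choice described above can be sketched as two small pure functions. This is an illustration only: the signature of `cappedSyncGap`, the constant name, the `CatchUpMode` enum, and the choice of `>=` at the 64-slot boundary are assumptions, not the code in `node.zig`.

```zig
const std = @import("std");

/// Hypothetical constant name; the commit only states a 64-slot threshold.
const RANGE_SYNC_MIN_GAP: u64 = 64;

/// Hypothetical enum naming the three possible decisions.
const CatchUpMode = enum { none, blocks_by_root, blocks_by_range };

/// min(peer_head - our_head, wall_slot - our_head), using saturating
/// subtraction so a peer or clock behind our head yields a gap of zero.
fn cappedSyncGap(our_head: u64, peer_head: u64, wall_slot: u64) u64 {
    return @min(peer_head -| our_head, wall_slot -| our_head);
}

/// Any positive capped gap triggers catch-up; the threshold only picks the RPC.
fn chooseCatchUpMode(gap: u64) CatchUpMode {
    if (gap == 0) return .none;
    // Assumption: gaps at or above the threshold use range sync.
    return if (gap >= RANGE_SYNC_MIN_GAP) .blocks_by_range else .blocks_by_root;
}

test "wall clock caps an inflated peer head" {
    // Peer claims head 300, but the wall clock only allows slot 150.
    const gap = cappedSyncGap(100, 300, 150);
    try std.testing.expectEqual(@as(u64, 50), gap);
    try std.testing.expectEqual(CatchUpMode.blocks_by_root, chooseCatchUpMode(gap));
}
```

Capping by the wall slot keeps a peer that advertises an implausibly high head from inflating the requested range.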
Resolve node.zig conflicts: keep PR async sync-end and retry/fallback; wire main's finalized-slot pagination through initiateBlocksByRangeCatchUp. Dedupe RPC_ERR_INVALID_REQUEST in constants.zig after auto-merge.
Two fixes for the recurring "node believes it is synced while peers report a much higher head" symptom on aggregator nodes (#863, #893).
1. cappedSyncGap / shouldCatchUpFromPeerStatus now derive the wall slot directly from unixTimestampMillis() and genesis_time_ms via a new Clock.wallSlotNow accessor, instead of reading forkchoice.fcStore.slot_clock.timeSlots. Reading the forkchoice counter self-reinforced slot-driver stalls: a starved counter capped the gap to zero, status-driven catch-up was skipped, and the node stayed stuck. Using the host wall clock breaks that loop so catch-up triggers on real lag independent of tick liveness.
2. SlotDriverWatchdog gains an on_stall callback (StallCallback). The CLI wires it to BeamNode.onSlotDriverStall which only flips an atomic flag; the next libxev tick observes it and forces a refreshSyncFromPeers outside the normal 8-slot cadence. The watchdog thread never emits RPCs itself: sendStatusToPeer mutates request map state shared with the libp2p bridge and assumes a single producer per tick.
Tests: pure wallSlotNowImpl helper covers genesis/skipped/zero edges; watchdog test confirms the callback signature.
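For illustration, a pure wall-slot helper in the spirit of the one described above might look like the following. The name `wallSlotAt`, its parameters, and the 4000 ms slot length used in the test are assumptions made for this sketch; the actual `wallSlotNowImpl` signature is not shown in this PR text.

```zig
const std = @import("std");

/// Slot implied by host time alone, independent of tick liveness.
/// `ms_per_slot` is a hypothetical parameter standing in for the
/// network's configured slot duration.
fn wallSlotAt(now_ms: u64, genesis_time_ms: u64, ms_per_slot: u64) u64 {
    if (ms_per_slot == 0) return 0; // zero edge: avoid division by zero
    if (now_ms <= genesis_time_ms) return 0; // at or before genesis
    return (now_ms - genesis_time_ms) / ms_per_slot;
}

test "genesis, skipped, and zero edges" {
    // Before genesis the wall slot is pinned to zero.
    try std.testing.expectEqual(@as(u64, 0), wallSlotAt(1_000, 2_000, 4_000));
    // The result depends only on elapsed time, so slots that were never
    // ticked (a stalled driver) are still counted.
    try std.testing.expectEqual(@as(u64, 7), wallSlotAt(2_000 + 7 * 4_000 + 3_999, 2_000, 4_000));
    // A zero slot duration is treated as slot zero instead of trapping.
    try std.testing.expectEqual(@as(u64, 0), wallSlotAt(10_000, 2_000, 0));
}
```

Because the function takes the current time as an argument, it can be tested without touching the real clock.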
Earlier 7ff865f added a "Check 3" that returned `.behind_peers` whenever peer head was ahead of ours. Two problems:
1. validator_client.maybeDoProposal / mayBeDoAttestation skip proposer and attestation duties on `.behind_peers`. Pre-#894 that only fired on a finalization gap (deep sync). Check 3 made it trip on a 1-slot head delta from normal gossip latency, silently disabling validators near the head.
2. Surrounding chain.zig comments explicitly map `.behind_peers` to leanSpec SYNCING ("deep sync"). Inflating SYNCING into "any positive head gap" deviates from that mapping.
Status-driven catch-up for the head-only-gap case is preserved: the `.synced` arm of `handleReqRespResponse` already calls `shouldCatchUpFromPeerStatus` directly so a peer reporting a higher head triggers catch-up without changing the node's sync state.
Devnet aggregator zeam_8 was seeing 9.7s libxev tick stalls under
sustained load and missing block-production duty (slot 64 proposed
~4.4s late). Trace evidence (devnet-logs-20260519T111903Z):
s=63 i=4 11:18:25.200 tick duration=0.799s (healthy)
s=63 i=4 11:18:25.670 received blocks-by-root chunk (3.4 MB resp)
s=63 i=0 11:18:22 attestation queue full, dropping slot=1 ...
s=64 i=0 11:18:31.186 slot-driver stall detected: last tick 5.986s
s=64 i=0 11:18:34.966 tick duration=9.766s (recovered)
Root cause: `processBlockByRootChunk` and `processBlockByRangeChunk`
fell back to inline `chain.onBlock` on the libxev thread when
`trySubmitImportToWorker` returned false. Inline import costs ~0.5s
of XMSS verification per block; a multi-chunk catch-up burst on a
loaded aggregator (block queue saturated by attestation flood)
hijacked libxev for the duration of the burst. The gossip path
(chain.zig::onGossip) already drops on QueueFull — the asymmetry
was the regression.
Fix: refactor `trySubmitImportToWorker` to return an enum
`ImportSubmitOutcome` (submitted | queue_full | worker_disabled |
failed). Both RPC chunk handlers now classify via a new pure helper
`blocks_by_range_sync.classifyChunkImport`:
  submitted       -> handled (return)
  queue_full      -> drop_backpressure (NO inline fallback; catch-up RPC will refetch next status round)
  worker_disabled -> fallback_inline (legitimate test path)
  failed          -> fallback_inline (last-resort sszClone failure)
Regression tests in `blocks_by_range_sync.zig`:
* `queue_full drops, never falls back to inline` — the explicit
contract guarding against re-introducing the inline fallback.
* `submitted is handled`, `worker_disabled and failed fall back
to inline`, exhaustiveness guard over `ImportSubmitOutcome`.
`zig build test` and `zig build simtest` pass.
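The mapping in this commit message can be sketched as an exhaustive `switch`. The `ImportSubmitOutcome` variants and the three action names come from the text above; the `ChunkImportAction` type name and the exact function signature are assumptions for illustration.

```zig
const std = @import("std");

/// Outcome of handing a block to the chain worker (variants from the commit message).
const ImportSubmitOutcome = enum { submitted, queue_full, worker_disabled, failed };

/// Hypothetical type name for the handler's decision.
const ChunkImportAction = enum { handled, drop_backpressure, fallback_inline };

/// Pure classifier. The switch has no `else` arm, so adding a variant to
/// ImportSubmitOutcome is a compile error until it is classified here.
fn classifyChunkImport(outcome: ImportSubmitOutcome) ChunkImportAction {
    return switch (outcome) {
        .submitted => .handled,
        .queue_full => .drop_backpressure,
        .worker_disabled, .failed => .fallback_inline,
    };
}

test "queue_full drops, never falls back to inline" {
    try std.testing.expectEqual(
        ChunkImportAction.drop_backpressure,
        classifyChunkImport(.queue_full),
    );
}
```

Keeping the classifier free of I/O is what lets the `queue_full` contract be pinned by a unit test without starting a node.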
Summary
Fixes #893. Hardens `blocks_by_range` bulk catch-up and the chain-worker backpressure path so a fresh-genesis or aggregator node cannot get stuck with a frozen head while the slot clock keeps advancing. Three follow-up commits were folded into this PR after the initial review to address regressions found on devnet.
blocks_by_range catch-up (#893)
* Detect fork mismatch on the first range chunk when the block at `start_slot` does not parent-link to our head; fall back to `blocks_by_root` from peer head.
* Retry failed or empty range RPCs on another peer (up to 3 attempts), then fall back to `blocks_by_root` from peer head.
* `blocks_by_range` unavailable: if a peer returns unsupported / invalid-request (or dispatch fails), mark the peer and fall back to `blocks_by_root` immediately instead of retrying range on that peer (per Anshal).
* Cap the sync gap by wall-clock slot: `min(peer_head - our_head, wall_slot - our_head)`.
* Allow head-gap catch-up when `finalized_slot == 0`.
* One in-flight `blocks_by_range` per `(start_slot, count)` window; disjoint ranges may run in parallel.
* Count `chunks_imported` only after chain-worker `onBlock` succeeds; defer sync-end until async imports drain.
* Extract pure sync helpers to `blocks_by_range_sync.zig` (orchestration remains in `node.zig` for now).
Aggregated-attestation queue backpressure
Under devnet restart load the chain-worker aggregated-attestation queue (512) could fill while block STF and catch-up saturated the single worker thread, causing permanent gossip drops (`aggregated attestation queue full, dropping slot=…`) and starving fork-choice vote tracking.
slot_interval/ tick duration (event-loop starvation vs nominal 0.8s) #863 / node, metrics: offload heavy chain mutations to chain-worker, parallelize XMSS verify (#863) #890).
Post-#894 follow-ups (regressions found on devnet)
After the original PR landed, two regression patterns were observed on devnet (most visibly on `zeam_4`/`zeam_8` aggregators) and addressed in three follow-up commits:
1. Sync gated on host wall clock + slot-driver watchdog (`77d90292`)
Symptom: `shouldCatchUpFromPeerStatus` was reading `forkchoice.fcStore.slot_clock.timeSlots` to decide whether to catch up, but on aggregators libxev could stall and freeze that clock; once frozen, the node could no longer detect that it was behind and never re-armed catch-up.
slot_interval/ tick duration (event-loop starvation vs nominal 0.8s) #863), and emits `zeam_slot_driver_stall_fired_total` / `zeam_slot_driver_stall_seconds`. When fired, a registered callback in `BeamNode` forces `refreshSyncFromPeers()` so a stalled main loop can recover instead of silently lagging.
2. Reverted `getSyncStatus` head-gap check (`6d8a3920`)
A short-lived attempt to mark a node `.behind_peers` when `our_head_slot < max_peer_head_slot` caused validators to skip their duties when other clients on the network briefly ran ahead. Reverted; sync status now keeps the original spec-aligned semantics and only signals `.behind_peers` for finality / sync-distance signals, not transient head jitter.
3. Drop RPC catch-up chunks on chain-worker `QueueFull` instead of inline-importing (`b2654679`)
This is the regression that PR #894 itself introduced and that the user reported as "all zeam nodes head slot are far behind … clear regression". Diagnosed from `zeam_8` logs as a 9.7-second libxev event-loop stall: a 3.4 MB `blocks_by_root` response was being processed inline on the libxev thread because the chain-worker block queue was already saturated by attestation traffic on the aggregator subnet, and the RPC chunk handler had no backpressure path — it just fell through to a synchronous `chain.onBlock` (~0.5 s of XMSS verification per block). The gossip path already drops on `QueueFull`; the RPC path was asymmetric.
Fix:
Test plan
slot_interval/ tick duration (event-loop starvation vs nominal 0.8s) #863) — the Linux devnet remains the canonical environment for the regression test below.