fix silent crashes during cached-block catch-up sync#681
Conversation
1b3d17c to
cd9a8d3
Compare
zclawz
left a comment
There was a problem hiding this comment.
Good root-cause analysis and clear description of the crash chain. The three fixes are individually correct. A few observations:
1. onBlockFollowup(true, cached_block) — signedBlock param is _ = signedBlock
The comment says "Must happen before removeFetchedBlock frees the block memory" — which implies cached_block is read inside onBlockFollowup. But the function body starts with _ = signedBlock;, meaning the pointer is never dereferenced. The ordering constraint is therefore not a memory-safety requirement today, though keeping the call before removeFetchedBlock is still good practice. Worth either removing the comment or wiring signedBlock properly if it will be used in future.
2. pruneForkchoice = true on every cached block in processCachedDescendants
processCachedDescendants is recursive (each successful import calls itself again). With onBlockFollowup(true, ...) added, processFinalizationAdvancement can now fire on each iteration if finalization advances. For a deep cached-block chain (dozens of blocks during checkpoint sync burst) this means many sequential canonical-view analyses and DB batch writes. Correct semantically, but could be slow. A possible optimization is to pass false for pruneForkchoice during catch-up and prune once at the end — but that is a follow-up concern, not a blocker.
3. Arithmetic guard in chain.zig
The if (slot_gap >= newly_finalized_count) guard is correct and safe. One minor note: if newly_finalized_count > slot_gap the orphaned count is silently zeroed rather than logged. A debug-level log here would make the edge case visible in traces, though not critical.
4. clock.zig — .disarm on Canceled
Returning .disarm is the right call — the canceled completion should not re-arm itself. The std.debug.panic for unexpected errors (rather than unreachable) is a clear improvement; at least a useful message is printed before the process exits.
Overall the root cause chain (missing onBlockFollowup → stale finalization pointer → underflow in processFinalizationAdvancement → @trap()) is well-documented and the fixes are correct. Happy to approve with the note that items 1 and 2 are worth revisiting in a follow-up.
Three bugs caused the node to crash silently (Zig ReleaseFast trap) when processing a burst of cached blocks while syncing from a checkpoint: 1. clock: `_ = r catch unreachable` triggered a trap on the expected `error.Canceled` that xev emits when a timer fires after being re-armed. The error is now handled explicitly: `Canceled` is silently ignored, any other value panics with a message. 2. chain: `processFinalizationAdvancement` assumed `finalized_roots` is non-empty before performing `finalized_roots[1..finalized_roots.len]` slice arithmetic. If the fork choice had already been rebased past the requested root `getCanonicalViewAndAnalysis` can return an empty slice, causing an out-of-bounds panic. The function now returns early with a warning instead. The `usize` subtraction for the orphan-count log line was also guarded against underflow. 3. node: `processCachedDescendants` did not call `chain.onBlockFollowup` after successfully integrating a cached block, unlike the sibling path `processBlockByRootChunk`. This meant head/justification/finalization events were not emitted and `last_emitted_finalized` was not advanced as the node processed a chain of cached blocks, leading to a large deferred finalization jump that triggered the panics above.
cd9a8d3 to
e52031f
Compare
ReleaseFast compiles unreachable statements and failed bounds/overflow checks to @trap() — a silent illegal instruction with no output. ReleaseSafe keeps safety checks active (bounds, overflow, unreachable → panics with a stack trace) while still applying most optimizations. This makes crash sites visible in production and CI instead of silently killing the process, which was the root cause of the hard-to-diagnose crashes fixed in PR #681. Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml (x86_64 and aarch64 release binaries). The ReleaseFast in build.zig is for the risc0/zkvm targets and is left unchanged.
* fix: improve state recovery logging and add genesis time validation (#481) - Add explicit success log when state is loaded from DB on restart - Validate genesis time of loaded state matches chain config; fall back to genesis on mismatch with a clear warning - Improve DB recovery logging: log block root and slot at each step so failures are easier to diagnose - Root cause: lean-quickstart spin-node.sh unconditionally wiped the data directory on every restart (fixed separately in lean-quickstart) * chore: update lean-quickstart submodule to fix data dir wipe on restart * fix: wipe stale database on genesis time mismatch When the DB contains state from a different genesis (genesis_time mismatch), close and delete the RocksDB directory before reopening so the node starts with a fresh DB instance rather than accumulating stale data. Requested by @g11tech in #637 (comment) * fix: add post-wipe genesis time re-check, log and return error if still mismatches * refactor: remove redundant post-wipe genesis check Per g11tech's review: the downstream loadLatestFinalizedState call will handle any inconsistency if the wipe somehow failed. No need to re-probe immediately after wiping. * fix: resolve CI failure - apply zig fmt to node.zig (remove trailing newline) * fix: error out if db wipe fails on genesis time mismatch * refactor the anchor setup on startup * fix: wipe db even when no local finalized state found Per @anshalshukla review: if loadLatestFinalizedState fails (no finalized state in db), we should still wipe the db for a clean slate rather than risk leftover data. NotFound errors are ignored since the db directory may not exist yet on first run. * build: switch production builds from ReleaseFast to ReleaseSafe ReleaseFast compiles unreachable statements and failed bounds/overflow checks to @trap() — a silent illegal instruction with no output. ReleaseSafe keeps safety checks active (bounds, overflow, unreachable → panics with a stack trace) while still applying most optimizations. This makes crash sites visible in production and CI instead of silently killing the process, which was the root cause of the hard-to-diagnose crashes fixed in PR #681. Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml (x86_64 and aarch64 release binaries). The ReleaseFast in build.zig is for the risc0/zkvm targets and is left unchanged. * fix: change ReleaseFast to ReleaseSafe in zkvm build step * fix: revert zkvm optimize to ReleaseFast (ReleaseSafe breaks riscv32 inline asm) --------- Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com> Co-authored-by: zclawz <zclawz@users.noreply.github.com> Co-authored-by: zeam-bot <zeam-bot@openclaw> Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com> Co-authored-by: zclawz <zclawz@openclaw.ai> Co-authored-by: harkamal <gajinder@zeam.in> Co-authored-by: zclawz <zclawz@blockblaz.io> Co-authored-by: zclawz <zclawz@blockblaz.com>
Root cause
When syncing from a checkpoint, the node fetches a chain of blocks that
arrive out-of-order and parks them in the
fetched_blockscache.processCachedDescendantsthen replays them in parent → child order.During this burst, the node crashed silently (no log output) every few
sessions, with the last line always being:
Why the crash was silent
All production builds use
-Doptimize=ReleaseFast(Dockerfile, CI,auto-release workflow). In
ReleaseFast, Zig compilesunreachablestatements and failed bounds/overflow checks to
@trap()— an illegalinstruction that kills the process immediately with no output to stdout
or stderr. Because the crash happened after the last log line was flushed,
there is no stack trace to pinpoint the exact site. The three fixes below
are defensive guards for the most likely candidates identified by code
inspection.
Fixes
1.
clock.zig—unreachableon expected timer cancellationxevfires a completion witherror.Canceledwhen a timer is re-armedbefore the previous fire is delivered. The callback was:
In
ReleaseFastthis compiles to an illegal instruction, killing theprocess silently.
Canceledis now handled explicitly and ignored;any other error panics with a message.
2.
chain.zig— out-of-bounds slice inprocessFinalizationAdvancementgetCanonicalViewAndAnalysiscan return an empty slice when the forkchoice was already rebased past the requested root (e.g. the previous
finalized checkpoint). This caused a
usizeunderflow / OOB trap inReleaseFast. The fix adds an early-return guard with a warning log.3.
node.zig— missingonBlockFollowupinprocessCachedDescendantsprocessBlockByRootChunk(the normal inbound-gossip / RPC path) callschain.onBlockFollowupafter each successful block import. This emitshead/justification/finalization events and advances
last_emitted_finalized.processCachedDescendantsdid not call it,so while replaying dozens of cached blocks the finalization pointer was
never updated. When the normal path finally ran
onBlockFollowupon thenext gossip block, it saw a large finalization gap and called
processFinalizationAdvancementwith stale bookmarks, triggering bug 2.The fix adds
self.chain.onBlockFollowup(true, cached_block)immediatelyafter each successful block import, before
removeFetchedBlockfrees theblock memory.
Testing
zig build— cleanzig build test --summary all— all tests passzig fmt --check .— no formatting issues