fix silent crashes during cached-block catch-up sync by ch4r10t33r · Pull Request #681 · blockblaz/zeam

ch4r10t33r · 2026-03-19T15:29:26Z

Root cause

When syncing from a checkpoint, the node fetches a chain of blocks that
arrive out-of-order and parks them in the fetched_blocks cache.
processCachedDescendants then replays them in parent → child order.
During this burst, the node crashed silently (no log output) every few
sessions, with the last line always being:

[node] Successfully processed cached block 0x…

Why the crash was silent

All production builds use -Doptimize=ReleaseFast (Dockerfile, CI,
auto-release workflow). ReleaseFast disables Zig's runtime safety checks,
so reaching unreachable or failing a bounds/overflow check is undefined
behavior rather than a panic with a message. The observed symptom matches
a trap (an illegal instruction), which kills the process immediately with
no output to stdout or stderr. Because nothing is printed after the last
flushed log line, there is no stack trace to pinpoint the exact site. The
three fixes below are defensive guards for the most likely candidates
identified by code inspection.
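
To make the mode dependence concrete, here is a minimal sketch of a helper that reports whether the current build keeps runtime safety checks. The helper name `safetyChecksEnabled` is hypothetical and not part of this PR; it only mirrors the documented split between optimize modes.

```zig
const std = @import("std");
const builtin = @import("builtin");

/// Hypothetical helper (not in the PR): reports whether the current
/// optimize mode keeps runtime safety checks.
/// Debug and ReleaseSafe keep them; ReleaseFast and ReleaseSmall drop them.
fn safetyChecksEnabled() bool {
    return switch (builtin.mode) {
        .Debug, .ReleaseSafe => true,
        .ReleaseFast, .ReleaseSmall => false,
    };
}
```

Logging `@tagName(builtin.mode)` together with this value once at startup would record in the logs how a crashing binary was built.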

Fixes

1. `clock.zig` — `unreachable` on expected timer cancellation

xev fires a completion with error.Canceled when a timer is re-armed
before the previous fire is delivered. The callback was:

_ = r catch unreachable;

In ReleaseFast this compiles to an illegal instruction, killing the
process silently. Canceled is now handled explicitly and ignored;
any other error panics with a message.
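
A minimal sketch of the explicit handling described above. `TimerError`, `CallbackAction` and `onTimerResult` are hypothetical stand-ins so the snippet compiles on its own; the real callback uses the xev types, and returning `.rearm` on success is an assumption made only for illustration.

```zig
const std = @import("std");

// Hypothetical stand-ins for the event-loop types; the real xev types differ.
const TimerError = error{ Canceled, Unexpected };
const CallbackAction = enum { disarm, rearm };

/// Handles a timer completion result without relying on `unreachable`.
fn onTimerResult(r: TimerError!void) CallbackAction {
    r catch |err| switch (err) {
        // Expected when the timer was re-armed before this completion was delivered.
        error.Canceled => return .disarm,
        // Anything else is a real bug: fail loudly with a message.
        else => std.debug.panic("unexpected timer error: {s}", .{@errorName(err)}),
    };
    // Assumption for the sketch: a successful fire keeps the timer armed.
    return .rearm;
}
```

Unlike `catch unreachable`, the panic branch prints the error name in every optimize mode before the process exits.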

2. `chain.zig` — out-of-bounds slice in `processFinalizationAdvancement`

const newly_finalized_count = finalized_roots.len - 1;  // usize underflow if 0
…
_ = self.pruneStates(finalized_roots[1..finalized_roots.len], …);  // OOB if 0

getCanonicalViewAndAnalysis can return an empty slice when the fork
choice was already rebased past the requested root (e.g. the previous
finalized checkpoint). This caused a usize underflow / OOB trap in
ReleaseFast. The fix adds an early-return guard with a warning log.
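
A minimal sketch of the early-return guard, assuming a hypothetical `Root` type and a hypothetical helper `rootsToPrune`; the real function does considerably more, and only the empty-slice check is shown.

```zig
const std = @import("std");

// Hypothetical root type; the real code uses the chain's own types.
const Root = [32]u8;

/// Returns the roots after the first entry, or null when the
/// canonical view came back empty.
fn rootsToPrune(finalized_roots: []const Root) ?[]const Root {
    if (finalized_roots.len == 0) {
        // Without this guard, `len - 1` underflows and `[1..]` is out of bounds.
        // The real fix logs a warning; a plain print stands in for it here.
        std.debug.print("warn: empty canonical view, skipping finalization advancement\n", .{});
        return null;
    }
    return finalized_roots[1..];
}
```

Returning an optional forces the caller to handle the empty case before any `len - 1` arithmetic happens.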

3. `node.zig` — missing `onBlockFollowup` in `processCachedDescendants`

processBlockByRootChunk (the normal inbound-gossip / RPC path) calls
chain.onBlockFollowup after each successful block import. This emits
head/justification/finalization events and advances
last_emitted_finalized. processCachedDescendants did not call it,
so while replaying dozens of cached blocks the finalization pointer was
never updated. When the normal path finally ran onBlockFollowup on the
next gossip block, it saw a large finalization gap and called
processFinalizationAdvancement with stale bookmarks, triggering bug 2.

The fix adds self.chain.onBlockFollowup(true, cached_block) immediately
after each successful block import, before removeFetchedBlock frees the
block memory.
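
To illustrate why the per-block follow-up matters, here is a heavily simplified sketch. `Chain`, `CachedBlock`, `replayCached` and the `largest_gap_seen` counter are hypothetical stand-ins, and a loop replaces the real recursion; the point is only that the finalization bookmark moves one small step per replayed block instead of one large jump at the end.

```zig
const std = @import("std");

// Hypothetical, heavily simplified stand-ins; only the call order is the point.
const CachedBlock = struct { slot: u64, finalized_slot: u64 };

const Chain = struct {
    last_emitted_finalized: u64 = 0,
    largest_gap_seen: u64 = 0,

    /// Stand-in for onBlockFollowup: advances the emitted-finalized bookmark
    /// and records how far it had to jump in a single step.
    fn onBlockFollowup(self: *Chain, block: *const CachedBlock) void {
        if (block.finalized_slot <= self.last_emitted_finalized) return;
        const gap = block.finalized_slot - self.last_emitted_finalized;
        self.largest_gap_seen = @max(self.largest_gap_seen, gap);
        self.last_emitted_finalized = block.finalized_slot;
    }
};

/// Replays cached blocks in parent-to-child order, following up after each one.
fn replayCached(chain: *Chain, blocks: []const CachedBlock) void {
    for (blocks) |*block| {
        // (the block import itself would happen here)
        // Follow up per block, while the cached block is still alive.
        chain.onBlockFollowup(block);
    }
}
```

With the follow-up inside the loop, the bookmark never falls more than one block's worth of finalization behind during catch-up.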

Testing

zig build — clean
zig build test --summary all — all tests pass
zig fmt --check . — no formatting issues

zclawz

Good root-cause analysis and clear description of the crash chain. The three fixes are individually correct. A few observations:

1. onBlockFollowup(true, cached_block) — signedBlock param is _ = signedBlock

The comment says "Must happen before removeFetchedBlock frees the block memory" — which implies cached_block is read inside onBlockFollowup. But the function body starts with _ = signedBlock;, meaning the pointer is never dereferenced. The ordering constraint is therefore not a memory-safety requirement today, though keeping the call before removeFetchedBlock is still good practice. Worth either removing the comment or wiring signedBlock properly if it will be used in future.

2. pruneForkchoice = true on every cached block in processCachedDescendants

processCachedDescendants is recursive (each successful import calls itself again). With onBlockFollowup(true, ...) added, processFinalizationAdvancement can now fire on each iteration if finalization advances. For a deep cached-block chain (dozens of blocks during checkpoint sync burst) this means many sequential canonical-view analyses and DB batch writes. Correct semantically, but could be slow. A possible optimization is to pass false for pruneForkchoice during catch-up and prune once at the end — but that is a follow-up concern, not a blocker.

3. Arithmetic guard in chain.zig

The if (slot_gap >= newly_finalized_count) guard is correct and safe. One minor note: if newly_finalized_count > slot_gap the orphaned count is silently zeroed rather than logged. A debug-level log here would make the edge case visible in traces, though not critical.

4. clock.zig — .disarm on Canceled

Returning .disarm is the right call — the canceled completion should not re-arm itself. The std.debug.panic for unexpected errors (rather than unreachable) is a clear improvement; at least a useful message is printed before the process exits.

Overall the root cause chain (missing onBlockFollowup → stale finalization pointer → underflow in processFinalizationAdvancement → @trap()) is well-documented and the fixes are correct. Happy to approve with the note that items 1 and 2 are worth revisiting in a follow-up.
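
For the arithmetic guard in item 3, a minimal sketch of how the edge case could be made visible. It assumes the orphaned count is `slot_gap - newly_finalized_count`; the helper name `orphanedCount` is hypothetical, and a plain print stands in for a debug-level log.

```zig
const std = @import("std");

/// Orphaned count for the log line, clamped at zero instead of underflowing.
fn orphanedCount(slot_gap: usize, newly_finalized_count: usize) usize {
    if (slot_gap >= newly_finalized_count) return slot_gap - newly_finalized_count;
    // Edge case flagged in review: report it instead of silently zeroing.
    std.debug.print(
        "debug: newly_finalized_count={d} exceeds slot_gap={d}\n",
        .{ newly_finalized_count, slot_gap },
    );
    return 0;
}
```

If the extra log is not wanted, Zig's saturating subtraction `slot_gap -| newly_finalized_count` yields the same clamped value in a single expression.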

Three bugs caused the node to crash silently (Zig ReleaseFast trap) when processing a burst of cached blocks while syncing from a checkpoint:

1. clock: `_ = r catch unreachable` triggered a trap on the expected `error.Canceled` that xev emits when a timer fires after being re-armed. The error is now handled explicitly: `Canceled` is silently ignored, any other value panics with a message.

2. chain: `processFinalizationAdvancement` assumed `finalized_roots` is non-empty before performing `finalized_roots[1..finalized_roots.len]` slice arithmetic. If the fork choice had already been rebased past the requested root, `getCanonicalViewAndAnalysis` can return an empty slice, causing an out-of-bounds panic. The function now returns early with a warning instead. The `usize` subtraction for the orphan-count log line was also guarded against underflow.

3. node: `processCachedDescendants` did not call `chain.onBlockFollowup` after successfully integrating a cached block, unlike the sibling path `processBlockByRootChunk`. This meant head/justification/finalization events were not emitted and `last_emitted_finalized` was not advanced as the node processed a chain of cached blocks, leading to a large deferred finalization jump that triggered the panics above.

ReleaseFast compiles unreachable statements and failed bounds/overflow checks to @trap() — a silent illegal instruction with no output. ReleaseSafe keeps safety checks active (bounds, overflow, unreachable → panics with a stack trace) while still applying most optimizations. This makes crash sites visible in production and CI instead of silently killing the process, which was the root cause of the hard-to-diagnose crashes fixed in PR #681. Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml (x86_64 and aarch64 release binaries). The ReleaseFast in build.zig is for the risc0/zkvm targets and is left unchanged.

* fix: improve state recovery logging and add genesis time validation (#481)
  - Add explicit success log when state is loaded from DB on restart
  - Validate genesis time of loaded state matches chain config; fall back to genesis on mismatch with a clear warning
  - Improve DB recovery logging: log block root and slot at each step so failures are easier to diagnose
  - Root cause: lean-quickstart spin-node.sh unconditionally wiped the data directory on every restart (fixed separately in lean-quickstart)
* chore: update lean-quickstart submodule to fix data dir wipe on restart
* fix: wipe stale database on genesis time mismatch
  When the DB contains state from a different genesis (genesis_time mismatch), close and delete the RocksDB directory before reopening so the node starts with a fresh DB instance rather than accumulating stale data. Requested by @g11tech in #637 (comment)
* fix: add post-wipe genesis time re-check, log and return error if still mismatches
* refactor: remove redundant post-wipe genesis check
  Per g11tech's review: the downstream loadLatestFinalizedState call will handle any inconsistency if the wipe somehow failed. No need to re-probe immediately after wiping.
* fix: resolve CI failure - apply zig fmt to node.zig (remove trailing newline)
* fix: error out if db wipe fails on genesis time mismatch
* refactor the anchor setup on startup
* fix: wipe db even when no local finalized state found
  Per @anshalshukla review: if loadLatestFinalizedState fails (no finalized state in db), we should still wipe the db for a clean slate rather than risk leftover data. NotFound errors are ignored since the db directory may not exist yet on first run.
* build: switch production builds from ReleaseFast to ReleaseSafe
  ReleaseFast compiles unreachable statements and failed bounds/overflow checks to @trap(), a silent illegal instruction with no output. ReleaseSafe keeps safety checks active (bounds, overflow, unreachable → panics with a stack trace) while still applying most optimizations. This makes crash sites visible in production and CI instead of silently killing the process, which was the root cause of the hard-to-diagnose crashes fixed in PR #681. Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml (x86_64 and aarch64 release binaries). The ReleaseFast in build.zig is for the risc0/zkvm targets and is left unchanged.
* fix: change ReleaseFast to ReleaseSafe in zkvm build step
* fix: revert zkvm optimize to ReleaseFast (ReleaseSafe breaks riscv32 inline asm)

---------

Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zeam-bot <zeam-bot@openclaw>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: harkamal <gajinder@zeam.in>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@blockblaz.com>

ch4r10t33r changed the title ~~clock, chain, node: fix silent crashes during cached-block catch-up sync~~ fix silent crashes during cached-block catch-up sync Mar 19, 2026

ch4r10t33r force-pushed the fix/sync-crash-catchup branch from 1b3d17c to cd9a8d3 Compare March 19, 2026 15:38

ch4r10t33r changed the base branch from fix/attestation-committee-count to main March 19, 2026 15:38

ch4r10t33r marked this pull request as ready for review March 19, 2026 15:53

ch4r10t33r requested review from anshalshukla and g11tech March 19, 2026 15:58

zclawz previously approved these changes Mar 19, 2026

ch4r10t33r dismissed zclawz’s stale review via e52031f March 19, 2026 16:38

ch4r10t33r force-pushed the fix/sync-crash-catchup branch from cd9a8d3 to e52031f Compare March 19, 2026 16:38

ch4r10t33r requested a review from zclawz March 19, 2026 18:39

Merge branch 'main' into fix/sync-crash-catchup

e780cf3

g11tech approved these changes Mar 23, 2026

g11tech merged commit 9239160 into main Mar 23, 2026
10 checks passed

g11tech deleted the fix/sync-crash-catchup branch March 23, 2026 11:22

This was referenced Mar 24, 2026

Debug logs not produced after upgrade to Zig 0.15.2 #690

Open

Debug logs not produced after upgrade to Zig 0.15.2 #691

Closed

g11tech merged 2 commits into main from fix/sync-crash-catchup
