Skip to content

fix silent crashes during cached-block catch-up sync#681

Merged
g11tech merged 2 commits intomainfrom
fix/sync-crash-catchup
Mar 23, 2026
Merged

fix silent crashes during cached-block catch-up sync#681
g11tech merged 2 commits intomainfrom
fix/sync-crash-catchup

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r commented Mar 19, 2026

Root cause

When syncing from a checkpoint, the node fetches a chain of blocks that
arrive out-of-order and parks them in the fetched_blocks cache.
processCachedDescendants then replays them in parent → child order.
During this burst, the node crashed silently (no log output) every few
sessions, with the last line always being:

[node] Successfully processed cached block 0x…

Why the crash was silent

All production builds use -Doptimize=ReleaseFast (Dockerfile, CI,
auto-release workflow). In ReleaseFast, Zig compiles unreachable
statements and failed bounds/overflow checks to @trap() — an illegal
instruction that kills the process immediately with no output to stdout
or stderr. Because the crash happened after the last log line was flushed,
there is no stack trace to pinpoint the exact site. The three fixes below
are defensive guards for the most likely candidates identified by code
inspection.

Fixes

1. clock.zigunreachable on expected timer cancellation

xev fires a completion with error.Canceled when a timer is re-armed
before the previous fire is delivered. The callback was:

_ = r catch unreachable;

In ReleaseFast this compiles to an illegal instruction, killing the
process silently. Canceled is now handled explicitly and ignored;
any other error panics with a message.

2. chain.zig — out-of-bounds slice in processFinalizationAdvancement

const newly_finalized_count = finalized_roots.len - 1;  // usize underflow if 0_ = self.pruneStates(finalized_roots[1..finalized_roots.len], …);  // OOB if 0

getCanonicalViewAndAnalysis can return an empty slice when the fork
choice was already rebased past the requested root (e.g. the previous
finalized checkpoint). This caused a usize underflow / OOB trap in
ReleaseFast. The fix adds an early-return guard with a warning log.

3. node.zig — missing onBlockFollowup in processCachedDescendants

processBlockByRootChunk (the normal inbound-gossip / RPC path) calls
chain.onBlockFollowup after each successful block import. This emits
head/justification/finalization events and advances
last_emitted_finalized. processCachedDescendants did not call it,
so while replaying dozens of cached blocks the finalization pointer was
never updated. When the normal path finally ran onBlockFollowup on the
next gossip block, it saw a large finalization gap and called
processFinalizationAdvancement with stale bookmarks, triggering bug 2.

The fix adds self.chain.onBlockFollowup(true, cached_block) immediately
after each successful block import, before removeFetchedBlock frees the
block memory.

Testing

  • zig build — clean
  • zig build test --summary all — all tests pass
  • zig fmt --check . — no formatting issues

@ch4r10t33r ch4r10t33r changed the title clock, chain, node: fix silent crashes during cached-block catch-up sync fix silent crashes during cached-block catch-up sync Mar 19, 2026
@ch4r10t33r ch4r10t33r force-pushed the fix/sync-crash-catchup branch from 1b3d17c to cd9a8d3 Compare March 19, 2026 15:38
@ch4r10t33r ch4r10t33r changed the base branch from fix/attestation-committee-count to main March 19, 2026 15:38
@ch4r10t33r ch4r10t33r marked this pull request as ready for review March 19, 2026 15:53
zclawz
zclawz previously approved these changes Mar 19, 2026
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good root-cause analysis and clear description of the crash chain. The three fixes are individually correct. A few observations:

1. onBlockFollowup(true, cached_block)signedBlock param is _ = signedBlock

The comment says "Must happen before removeFetchedBlock frees the block memory" — which implies cached_block is read inside onBlockFollowup. But the function body starts with _ = signedBlock;, meaning the pointer is never dereferenced. The ordering constraint is therefore not a memory-safety requirement today, though keeping the call before removeFetchedBlock is still good practice. Worth either removing the comment or wiring signedBlock properly if it will be used in future.

2. pruneForkchoice = true on every cached block in processCachedDescendants

processCachedDescendants is recursive (each successful import calls itself again). With onBlockFollowup(true, ...) added, processFinalizationAdvancement can now fire on each iteration if finalization advances. For a deep cached-block chain (dozens of blocks during checkpoint sync burst) this means many sequential canonical-view analyses and DB batch writes. Correct semantically, but could be slow. A possible optimization is to pass false for pruneForkchoice during catch-up and prune once at the end — but that is a follow-up concern, not a blocker.

3. Arithmetic guard in chain.zig

The if (slot_gap >= newly_finalized_count) guard is correct and safe. One minor note: if newly_finalized_count > slot_gap the orphaned count is silently zeroed rather than logged. A debug-level log here would make the edge case visible in traces, though not critical.

4. clock.zig.disarm on Canceled

Returning .disarm is the right call — the canceled completion should not re-arm itself. The std.debug.panic for unexpected errors (rather than unreachable) is a clear improvement; at least a useful message is printed before the process exits.

Overall the root cause chain (missing onBlockFollowup → stale finalization pointer → underflow in processFinalizationAdvancement@trap()) is well-documented and the fixes are correct. Happy to approve with the note that items 1 and 2 are worth revisiting in a follow-up.

Three bugs caused the node to crash silently (Zig ReleaseFast trap) when
processing a burst of cached blocks while syncing from a checkpoint:

1. clock: `_ = r catch unreachable` triggered a trap on the expected
   `error.Canceled` that xev emits when a timer fires after being
   re-armed. The error is now handled explicitly: `Canceled` is silently
   ignored, any other value panics with a message.

2. chain: `processFinalizationAdvancement` assumed `finalized_roots` is
   non-empty before performing `finalized_roots[1..finalized_roots.len]`
   slice arithmetic. If the fork choice had already been rebased past the
   requested root `getCanonicalViewAndAnalysis` can return an empty slice,
   causing an out-of-bounds panic. The function now returns early with a
   warning instead. The `usize` subtraction for the orphan-count log line
   was also guarded against underflow.

3. node: `processCachedDescendants` did not call `chain.onBlockFollowup`
   after successfully integrating a cached block, unlike the sibling path
   `processBlockByRootChunk`. This meant head/justification/finalization
   events were not emitted and `last_emitted_finalized` was not advanced
   as the node processed a chain of cached blocks, leading to a large
   deferred finalization jump that triggered the panics above.
@ch4r10t33r ch4r10t33r force-pushed the fix/sync-crash-catchup branch from cd9a8d3 to e52031f Compare March 19, 2026 16:38
@ch4r10t33r ch4r10t33r requested a review from zclawz March 19, 2026 18:39
zclawz pushed a commit that referenced this pull request Mar 20, 2026
ReleaseFast compiles unreachable statements and failed
bounds/overflow checks to @trap() — a silent illegal instruction
with no output. ReleaseSafe keeps safety checks active (bounds,
overflow, unreachable → panics with a stack trace) while still
applying most optimizations.

This makes crash sites visible in production and CI instead of
silently killing the process, which was the root cause of the
hard-to-diagnose crashes fixed in PR #681.

Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml
(x86_64 and aarch64 release binaries).

The ReleaseFast in build.zig is for the risc0/zkvm targets and
is left unchanged.
anshalshukla added a commit that referenced this pull request Mar 20, 2026
* fix: improve state recovery logging and add genesis time validation (#481)

- Add explicit success log when state is loaded from DB on restart
- Validate genesis time of loaded state matches chain config; fall back
  to genesis on mismatch with a clear warning
- Improve DB recovery logging: log block root and slot at each step so
  failures are easier to diagnose
- Root cause: lean-quickstart spin-node.sh unconditionally wiped the data
  directory on every restart (fixed separately in lean-quickstart)

* chore: update lean-quickstart submodule to fix data dir wipe on restart

* fix: wipe stale database on genesis time mismatch

When the DB contains state from a different genesis (genesis_time mismatch),
close and delete the RocksDB directory before reopening so the node starts
with a fresh DB instance rather than accumulating stale data.

Requested by @g11tech in #637 (comment)

* fix: add post-wipe genesis time re-check, log and return error if still mismatches

* refactor: remove redundant post-wipe genesis check

Per g11tech's review: the downstream loadLatestFinalizedState call
will handle any inconsistency if the wipe somehow failed. No need
to re-probe immediately after wiping.

* fix: resolve CI failure - apply zig fmt to node.zig (remove trailing newline)

* fix: error out if db wipe fails on genesis time mismatch

* refactor the anchor setup on startup

* fix: wipe db even when no local finalized state found

Per @anshalshukla review: if loadLatestFinalizedState fails (no
finalized state in db), we should still wipe the db for a clean
slate rather than risk leftover data. NotFound errors are ignored
since the db directory may not exist yet on first run.

* build: switch production builds from ReleaseFast to ReleaseSafe

ReleaseFast compiles unreachable statements and failed
bounds/overflow checks to @trap() — a silent illegal instruction
with no output. ReleaseSafe keeps safety checks active (bounds,
overflow, unreachable → panics with a stack trace) while still
applying most optimizations.

This makes crash sites visible in production and CI instead of
silently killing the process, which was the root cause of the
hard-to-diagnose crashes fixed in PR #681.

Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml
(x86_64 and aarch64 release binaries).

The ReleaseFast in build.zig is for the risc0/zkvm targets and
is left unchanged.

* fix: change ReleaseFast to ReleaseSafe in zkvm build step

* fix: revert zkvm optimize to ReleaseFast (ReleaseSafe breaks riscv32 inline asm)

---------

Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zeam-bot <zeam-bot@openclaw>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: harkamal <gajinder@zeam.in>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@blockblaz.com>
@g11tech g11tech merged commit 9239160 into main Mar 23, 2026
10 checks passed
@g11tech g11tech deleted the fix/sync-crash-catchup branch March 23, 2026 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants