chain: prune root_to_slot_cache on previous finalized slot #776

Merged: ch4r10t33r merged 1 commit into main from fix/prune-cache-on-previous-finalized on Apr 22, 2026

ch4r10t33r commented Apr 22, 2026

Fixes #771.

What this changes

One-line semantic fix in pkgs/node/src/chain.zig inside processFinalizationAdvancement:

-try self.root_to_slot_cache.prune(latestFinalized.slot);
+try self.root_to_slot_cache.prune(previousFinalized.slot);

Why

The chain-owned root_to_slot_cache is the only source of slot info for roots still sitting in BeamState.justifications_roots. After pruneStates at this same call site, retained cached states have block.slot > latestFinalized.slot, but their per-state latest_finalized.slot may still equal previousFinalized.slot — it was frozen at import time and the state itself didn't drive the current advance. Those states' justifications_roots can therefore reference roots in (previousFinalized.slot, latestFinalized.slot].

Pruning on latestFinalized.slot drops exactly those roots from the cache. The next block imported on top of such a state then fails the processAttestations cleanup lookup at state.zig:519, and the STF returns InvalidJustificationRoot. On devnet-4 this wedged zeam_0 permanently after a cross-fork reorg at the finality boundary: slot 267 triggered a 171→252 jumbo advance, forkchoice swung back to slot 268, and the next block on top missed on a root in (171, 252]. Full Loki timeline in the #771 follow-up comment.

Pruning on previousFinalized.slot keeps the slot window any surviving cached state can still reference. The cache stays coherent with the set of states the chain is holding, so the miss simply does not occur in normal operation.
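The difference between the two cutoffs can be illustrated with a toy model. This is only a sketch: `Entry`, `prune`, and `lookup` are hypothetical stand-ins rather than the real `RootToSlotCache` API, roots are small integers instead of 32-byte hashes, and the inclusive cutoff (`slot <= cutoff` is dropped) is an assumption inferred from the description above. The slots 171 and 252 are the devnet-4 values; the three cached entries are made up.

```zig
const std = @import("std");

/// Toy stand-in for one cache entry (hypothetical, not the real type).
const Entry = struct {
    root_id: u8,
    slot: u64,
    live: bool = true,
};

/// Hypothetical prune: drops every entry whose slot is at or below `cutoff`.
/// The inclusive bound is an assumption inferred from the PR description.
fn prune(entries: []Entry, cutoff: u64) void {
    for (entries) |*e| {
        if (e.slot <= cutoff) e.live = false;
    }
}

/// Returns the cached slot for `root_id`, or null on a miss.
fn lookup(entries: []const Entry, root_id: u8) ?u64 {
    for (entries) |e| {
        if (e.live and e.root_id == root_id) return e.slot;
    }
    return null;
}

test "pruning on the previous finalized slot keeps the (171, 252] window" {
    const previous_finalized: u64 = 171;
    const latest_finalized: u64 = 252;

    // Made-up entries: below the old floor, inside the window, above the new floor.
    const seed = [_]Entry{
        .{ .root_id = 1, .slot = 150 },
        .{ .root_id = 2, .slot = 200 },
        .{ .root_id = 3, .slot = 260 },
    };

    // Old call site: root 2 at slot 200 is dropped, so a later lookup misses.
    var old_policy = seed;
    prune(&old_policy, latest_finalized);
    try std.testing.expect(lookup(&old_policy, 2) == null);

    // New call site: root 2 stays reachable, root 1 is still pruned.
    var new_policy = seed;
    prune(&new_policy, previous_finalized);
    try std.testing.expect(lookup(&new_policy, 2).? == 200);
    try std.testing.expect(lookup(&new_policy, 1) == null);
}
```

Saved to a file, this should run with `zig test <file>.zig`. In the toy model, the miss on root 2 under the old policy corresponds to the lookup failure that surfaces as InvalidJustificationRoot.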

Residual edge case

A cached state B imported when chain finalized was F_old can survive two successive advances F_old → F_mid → F_new if its block slot stays above both floors. Its justifications_roots can then reference slots in (F_old, F_mid], which this PR drops at the second advance. This requires a minority fork staying above the canonical chain across two finality boundaries, which has not been observed in the wild. The airtight fix (evict cached states whose latest_finalized.slot < chain latestFinalized.slot on advance, Option A from the issue comment) is captured as a follow-up and kept out of this PR to preserve minimal scope.
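The two-hop case can be traced with the same kind of toy model. Again a sketch under stated assumptions: `Entry` and `prune` are hypothetical, the cutoff is assumed inclusive, and the slot values for F_old and F_mid and for the referenced root are invented for illustration.

```zig
const std = @import("std");

/// Toy cache entry (hypothetical); `slot` becomes null once the entry is pruned.
const Entry = struct { root_id: u8, slot: ?u64 };

/// Hypothetical prune with an inclusive cutoff (an assumption, not the real API).
fn prune(entries: []Entry, cutoff: u64) void {
    for (entries) |*e| {
        if (e.slot) |s| {
            if (s <= cutoff) e.slot = null;
        }
    }
}

test "a state surviving two advances can outlive the roots it references" {
    // Invented finalized slots for the first two points of F_old -> F_mid -> F_new.
    const f_old: u64 = 100;
    const f_mid: u64 = 171;

    // State B was imported while the chain finalized slot was F_old and
    // references root 7 at slot 150, which lies in (F_old, F_mid].
    var cache = [_]Entry{.{ .root_id = 7, .slot = 150 }};

    // First advance (F_old -> F_mid) prunes on F_old: the root survives.
    prune(&cache, f_old);
    try std.testing.expect(cache[0].slot != null);

    // Second advance (F_mid -> F_new) prunes on F_mid: the root is dropped,
    // although B may still be cached if its block slot is above F_new.
    prune(&cache, f_mid);
    try std.testing.expect(cache[0].slot == null);
}
```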

Scope

Deliberately narrow, per AGENTS.md:

  • single call-site change in chain.zig
  • no changes to RootToSlotCache, pruneStates, state shape, SSZ types, or STF
  • no dependency changes

Test plan

  • zig fmt --check .
  • zig build test --summary all
  • zig build simtest --summary all
  • cargo fmt --manifest-path rust/Cargo.toml --all -- --check
  • cargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings

Local test run was interrupted; CI will confirm. The change is a single variable swap at one call site. No test case's semantics rely on the exact prune cutoff value at this site, only on pruning advancing monotonically, which it still does, just one advance behind.
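The monotonicity claim can be checked on a small example. This is a sketch only: `cutoffsTrailByOneAdvance` is a hypothetical helper, and the sequence of finalized slots is invented apart from 171 and 252.

```zig
const std = @import("std");

/// Hypothetical check: with the prune cutoff set to the previous finalized slot,
/// cutoffs never decrease and always stay below the newly finalized slot.
fn cutoffsTrailByOneAdvance(finalized: []const u64) bool {
    var last_cutoff: u64 = 0;
    var i: usize = 1;
    while (i < finalized.len) : (i += 1) {
        const cutoff = finalized[i - 1]; // previous finalized slot at advance i
        if (cutoff < last_cutoff or cutoff >= finalized[i]) return false;
        last_cutoff = cutoff;
    }
    return true;
}

test "cutoffs stay monotonic, one advance behind" {
    // Invented history of finalized slots; each step is one finality advance.
    const finalized = [_]u64{ 0, 100, 171, 252 };
    try std.testing.expect(cutoffsTrailByOneAdvance(&finalized));
}
```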

Supersedes #772.

Cached post-states retained in `self.states` after `pruneStates`
(block.slot > latestFinalized.slot) can still hold justifications_roots
referencing slots in (state.latest_finalized.slot, state.slot]. The
post-finalization cleanup loop in `BeamState.processAttestations` looks
those roots up in the chain-owned `root_to_slot_cache`, so the cache
must keep them reachable across at least one finalization boundary.

Previously we pruned on `latestFinalized.slot`, which dropped exactly
the roots in (previousFinalized.slot, latestFinalized.slot] that such
cached states can still reference. On devnet-4 a late-arriving
competing block at slot 267 triggered a 171→252 jumbo finality jump
followed by a forkchoice swing back to the pre-jump head at slot 268;
the very next block on top of it then missed on a justification root
in (171, 252] in the STF cleanup loop and wedged zeam_0 with
`InvalidJustificationRoot` until a checkpoint-sync restart.

Pruning on `previousFinalized.slot` keeps (previousFinalized.slot,
latestFinalized.slot] alive in the cache for exactly the window any
surviving cached state can reference, closing the coherence gap for
the common single-advance case. Paired with #772 (which hardens the
STF to drop cache misses instead of wedging), this gives defense in
depth: the cache no longer drops roots any reachable state can still
need, and any residual miss from the rare two-hop stale-state case is
handled gracefully.

Refs: #771, #772

Development

Successfully merging this pull request may close these issues.

types/state: STF aborts with InvalidJustificationRoot when a justifications root was just pruned from block_cache, wedging the node permanently