fix: treat block_cache miss as stale in justifications cleanup#772
fix: treat block_cache miss as stale in justifications cleanup#772ch4r10t33r wants to merge 3 commits into
Conversation
The post-finalization prune loop in `processAttestations` returned `InvalidJustificationRoot` on any miss in `block_cache`, which wedged the node permanently: the same pre-state re-failed every retry and every slot-tick-driven STF run. A miss is the *expected* outcome for any root at `slot <= finalized_slot` — `RootToSlotCache.prune` removes exactly those entries — so the cleanup branch and the cache prune policy agree on what's stale. Drop the root in that case instead of aborting the whole STF. Fixes #771.
zclawz
left a comment
There was a problem hiding this comment.
Adversarial Review
The core logic is correct and the regression test is thorough. One change request before merge.
🔴 Silent drop — add a log line
Pre-fix: the miss propagated as InvalidJustificationRoot — always visible in logs. Post-fix: the miss is silently dropped with no trace. If upstream code ever incorrectly inserts a root into state.justifications (e.g. a future regression), the symptom will be invisible — finalization quietly stalls, no error, no log. A single debug/warn line makes future debugging tractable:
const is_stale = if (block_cache.get(root)) |slot|
slot <= finalized_slot
else blk: {
logger.debug(
"justification root {x} not in block_cache (pruned past finalized_slot={d}), treating as stale",
.{ root, finalized_slot },
);
break :blk true;
};⚠️ Undocumented assumption: miss ≠ only from pruning
The comment says:
a miss always implies
slot <= finalized_slotnow
This is only guaranteed for roots pruned by RootToSlotCache.prune. There's a second miss scenario: a re-orged block whose root was never persisted, or was evicted before finalization. In that case slot could be > finalized_slot, but the drop is still correct — you can't finalize a block you don't have. The comment should document this:
// A miss has two causes:
// 1. `RootToSlotCache.prune` evicted it (slot <= finalized_slot, monotonic).
// 2. The block was re-orged or never fully imported (slot may be >
// finalized_slot). We can't finalize an unknown root regardless,
// so dropping it is still safe.
✅ Logic — correct
block_cache.get(root) orelse true is sound in both miss scenarios above. Dropping an unjustifiable root is always the right move — we lose nothing except a root we couldn't have processed anyway. The explicit slot <= finalized_slot branch and the new else-true branch are semantically equivalent in outcome. ✅
✅ Test — solid regression coverage
- Seeds
state.justifications_rootswith genesis root (slot 0), whichinitRootToSlotCacheexplicitly skips (only loadsslot > latest_finalized.slot) ✅ - Drives finalization to slot 1, triggering the cleanup loop ✅
- Asserts
latest_finalized==1,latest_justified==2, and genesis_root absent fromjustifications_rootspost-call ✅ att_transferredbool pattern correctly avoids double-free on the error path ✅errdeferonpending_roots/pending_validatorsbefore ownership transfer ✅
Add the log line (and optionally the extended comment) and this is good to merge.
Address review: silent drops make future regressions invisible — the pre-fix path always surfaced `InvalidJustificationRoot` in logs, so follow it with a debug line when we treat a miss as stale. Also extend the comment to spell out the second miss cause (re-orged / never fully imported block) where slot info may not be present at all; the drop is still correct because we can't finalize a root we don't have.
zclawz
left a comment
There was a problem hiding this comment.
Both review points addressed ✅
- Log line added:
logger.debug("dropping stale justification root=0x{x}: not in block_cache (finalized_slot={d})", ...)in the miss branch — future regressions are now observable in node logs ✅ - Comment updated: both miss causes documented (pruned entry vs re-orged/never-imported block), with explanation of why drop is correct in both cases ✅
Approved. 🚢
| const slot = block_cache.get(root) orelse return StateTransitionError.InvalidJustificationRoot; | ||
| if (slot <= finalized_slot) { | ||
| // A miss in `block_cache` for a root that's still in | ||
| // `state.justifications` has two possible causes: |
There was a problem hiding this comment.
this is incorrect, only parent roots till finalized can be in justifications and can't be evicted
pls find the real bug fix for this why we dont have correct RootToSlotCache
Cached post-states retained in `self.states` after `pruneStates` (block.slot > latestFinalized.slot) can still hold justifications_roots referencing slots in (state.latest_finalized.slot, state.slot]. The post-finalization cleanup loop in `BeamState.processAttestations` looks those roots up in the chain-owned `root_to_slot_cache`, so the cache must keep them reachable across at least one finalization boundary. Previously we pruned on `latestFinalized.slot`, which dropped exactly the roots in (previousFinalized.slot, latestFinalized.slot] that such cached states can still reference. On devnet-4 a late-arriving competing block at slot 267 triggered a 171→252 jumbo finality jump followed by a forkchoice swing back to the pre-jump head at slot 268; the very next block on top of it then missed on a justification root in (171, 252] in the STF cleanup loop and wedged zeam_0 with `InvalidJustificationRoot` until a checkpoint-sync restart. Pruning on `previousFinalized.slot` keeps (previousFinalized.slot, latestFinalized.slot] alive in the cache for exactly the window any surviving cached state can reference, closing the coherence gap for the common single-advance case. Paired with #772 (which hardens the STF to drop cache misses instead of wedging), this gives defense in depth: the cache no longer drops roots any reachable state can still need, and any residual miss from the rare two-hop stale-state case is handled gracefully. Refs: #771, #772
…ixture fill leanSpec #772 set the prover's rayon thread count to 1 and re-enabled parallel prod fill, so `fill-ci` is now `-n auto --dist=worksteal` (per-process safe). Bump the leanSpec pin to 352b558 and match ci.yml's fixture-gen step + comment. Addresses anshil's ci.yml:440 review comment.
…ixture fill leanSpec #772 set the prover's rayon thread count to 1 and re-enabled parallel prod fill, so `fill-ci` is now `-n auto --dist=worksteal` (per-process safe). Bump the leanSpec pin to 352b558 and match ci.yml's fixture-gen step + comment. Addresses anshil's ci.yml:440 review comment.
Fixes #771.
The post-finalization prune loop in
processAttestationslooked each pending-justification root up inblock_cacheand hard-errored withStateTransitionError.InvalidJustificationRooton a miss. Once the chain-owned cache was pruned past any root still sitting in a cached per-statejustifications_rootsmap, the miss became permanent: the same pre-state re-failed on every block retry and every slot-tick-driven STF run — wedging the node until checkpoint-sync restart.Why the miss is actually reachable
See #771 follow-up comment for the full timeline. Short version: the global
root_to_slot_cacheis pruned against chain-widelatestFinalized.slot, but cachedBeamState.justifications_rootsare pinned to the per-statelatest_finalized.slot, which is frozen at import time. When a cross-fork reorg advances chain-wide finality past a reachable cached state's own finalized slot, that state'sjustifications_rootscan reference roots the global cache has just evicted.On devnet-4 this happened at
06:56:29.535: importing a late-arriving competing block at slot 267 advancedlatestFinalized171→252 and triggeredprune(252). Fork-choice then swung back to92a7(268)(whose post-state still hadlatest_finalized=171), and the next block on top of it (b4bd9e5f(269)) hit this loop with a root in(171, 252]that was no longer in the cache.Fix (this PR — narrow hotfix)
RootToSlotCache.prune(F)removes exactly the entries withslot <= F. So a miss either means the root was once in the cache and has since been pruned (⇒ pre-finalization ⇒ same outcome as the explicit<=branch — drop it) or it was never populated for a root that nonetheless lives instate.historical_block_hashes(no slot info to finalize against — drop is still the only safe action). In both cases: treat the miss identically to the<=branch, drop the root, continue.This is option (1) from the issue — minimal, stays inside
state.zig, no changes to state shape, SSZ types, or the chain-layer prune policy. Sufficient to unwedge this class of stall.Out of scope (follow-ups)
Captured in the #771 follow-up comment, not addressed here:
latest_finalized.slot < chain latestFinalized.slotwhenonFinalizationfires. Makes the missed-lookup path unreachable in practice and tightens memory. Cleaner than widening the prune floor.logger.debugin the drop path; anerr-level log withroot/state_finalized/new_finalizedon any future hard-failure branches would make repeats trivial to diagnose.Scope
Deliberately narrow:
state.justifications_roots/justifications_validatorsshapepruneCachedBlocksCallbackorRootToSlotCacheTest plan
zig fmt --check .cargo fmt --manifest-path rust/Cargo.toml --all -- --checkcargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warningszig build test --summary all— 19 tests pass inpkgs/types/src/lib.zig(was 18, +1 regression)zig build simtest --summary allprocess_attestations drops pending justifications missing from block_cachefails against the unpatched cleanup loop withInvalidJustificationRoot(verified by locally reverting just the src change), passes with the fix.