Skip to content

fix: treat block_cache miss as stale in justifications cleanup#772

Closed
ch4r10t33r wants to merge 3 commits into
mainfrom
fix/issue-771-stale-justification-cleanup
Closed

fix: treat block_cache miss as stale in justifications cleanup#772
ch4r10t33r wants to merge 3 commits into
mainfrom
fix/issue-771-stale-justification-cleanup

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r commented Apr 22, 2026

Fixes #771.

The post-finalization prune loop in processAttestations looked each pending-justification root up in block_cache and hard-errored with StateTransitionError.InvalidJustificationRoot on a miss. Once the chain-owned cache was pruned past any root still sitting in a cached per-state justifications_roots map, the miss became permanent: the same pre-state re-failed on every block retry and every slot-tick-driven STF run — wedging the node until checkpoint-sync restart.

Why the miss is actually reachable

See #771 follow-up comment for the full timeline. Short version: the global root_to_slot_cache is pruned against chain-wide latestFinalized.slot, but cached BeamState.justifications_roots are pinned to the per-state latest_finalized.slot, which is frozen at import time. When a cross-fork reorg advances chain-wide finality past a reachable cached state's own finalized slot, that state's justifications_roots can reference roots the global cache has just evicted.

On devnet-4 this happened at 06:56:29.535: importing a late-arriving competing block at slot 267 advanced latestFinalized 171→252 and triggered prune(252). Fork-choice then swung back to 92a7(268) (whose post-state still had latest_finalized=171), and the next block on top of it (b4bd9e5f(269)) hit this loop with a root in (171, 252] that was no longer in the cache.

Fix (this PR — narrow hotfix)

RootToSlotCache.prune(F) removes exactly the entries with slot <= F. So a miss either means the root was once in the cache and has since been pruned (⇒ pre-finalization ⇒ same outcome as the explicit <= branch — drop it) or it was never populated for a root that nonetheless lives in state.historical_block_hashes (no slot info to finalize against — drop is still the only safe action). In both cases: treat the miss identically to the <= branch, drop the root, continue.

This is option (1) from the issue — minimal, stays inside state.zig, no changes to state shape, SSZ types, or the chain-layer prune policy. Sufficient to unwedge this class of stall.

Out of scope (follow-ups)

Captured in the #771 follow-up comment, not addressed here:

  • Structural coherence fix (preferred follow-up). Evict cached states whose latest_finalized.slot < chain latestFinalized.slot when onFinalization fires. Makes the missed-lookup path unreachable in practice and tightens memory. Cleaner than widening the prune floor.
  • Slot-tick resilience. A hard STF error on block import should reject the block, not freeze forkchoice/time (option 4 in the issue, same "STF error wedges tick" pattern as node: fetched-block cache fills during finality stall and never drains — node permanently stuck after ~1024 slots without finalization #749).
  • Observability. The pre-fix error site didn't log the offending root or either finalized_slot. This PR adds a logger.debug in the drop path; an err-level log with root/state_finalized/new_finalized on any future hard-failure branches would make repeats trivial to diagnose.

Scope

Deliberately narrow:

  • no changes to state.justifications_roots / justifications_validators shape
  • no changes to pruneCachedBlocksCallback or RootToSlotCache
  • no changes to the forkchoice/tick loop

Test plan

  • zig fmt --check .
  • cargo fmt --manifest-path rust/Cargo.toml --all -- --check
  • cargo clippy --manifest-path rust/Cargo.toml --workspace -- -D warnings
  • zig build test --summary all — 19 tests pass in pkgs/types/src/lib.zig (was 18, +1 regression)
  • zig build simtest --summary all
  • New regression test process_attestations drops pending justifications missing from block_cache fails against the unpatched cleanup loop with InvalidJustificationRoot (verified by locally reverting just the src change), passes with the fix.

The post-finalization prune loop in `processAttestations` returned
`InvalidJustificationRoot` on any miss in `block_cache`, which wedged
the node permanently: the same pre-state re-failed every retry and every
slot-tick-driven STF run.

A miss is the *expected* outcome for any root at `slot <= finalized_slot`
— `RootToSlotCache.prune` removes exactly those entries — so the cleanup
branch and the cache prune policy agree on what's stale. Drop the root
in that case instead of aborting the whole STF.

Fixes #771.
@ch4r10t33r ch4r10t33r changed the title types: treat block_cache miss as stale in justifications cleanup fix: treat block_cache miss as stale in justifications cleanup Apr 22, 2026
@ch4r10t33r ch4r10t33r marked this pull request as ready for review April 22, 2026 08:41
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

The core logic is correct and the regression test is thorough. One change request before merge.


🔴 Silent drop — add a log line

Pre-fix: the miss propagated as InvalidJustificationRoot — always visible in logs. Post-fix: the miss is silently dropped with no trace. If upstream code ever incorrectly inserts a root into state.justifications (e.g. a future regression), the symptom will be invisible — finalization quietly stalls, no error, no log. A single debug/warn line makes future debugging tractable:

const is_stale = if (block_cache.get(root)) |slot|
    slot <= finalized_slot
else blk: {
    logger.debug(
        "justification root {x} not in block_cache (pruned past finalized_slot={d}), treating as stale",
        .{ root, finalized_slot },
    );
    break :blk true;
};

⚠️ Undocumented assumption: miss ≠ only from pruning

The comment says:

a miss always implies slot <= finalized_slot now

This is only guaranteed for roots pruned by RootToSlotCache.prune. There's a second miss scenario: a re-orged block whose root was never persisted, or was evicted before finalization. In that case slot could be > finalized_slot, but the drop is still correct — you can't finalize a block you don't have. The comment should document this:

// A miss has two causes:
// 1. `RootToSlotCache.prune` evicted it (slot <= finalized_slot, monotonic).
// 2. The block was re-orged or never fully imported (slot may be >
//    finalized_slot). We can't finalize an unknown root regardless,
//    so dropping it is still safe.

✅ Logic — correct

block_cache.get(root) orelse true is sound in both miss scenarios above. Dropping an unjustifiable root is always the right move — we lose nothing except a root we couldn't have processed anyway. The explicit slot <= finalized_slot branch and the new else-true branch are semantically equivalent in outcome. ✅

✅ Test — solid regression coverage

  • Seeds state.justifications_roots with genesis root (slot 0), which initRootToSlotCache explicitly skips (only loads slot > latest_finalized.slot) ✅
  • Drives finalization to slot 1, triggering the cleanup loop ✅
  • Asserts latest_finalized==1, latest_justified==2, and genesis_root absent from justifications_roots post-call ✅
  • att_transferred bool pattern correctly avoids double-free on the error path ✅
  • errdefer on pending_roots / pending_validators before ownership transfer ✅

Add the log line (and optionally the extended comment) and this is good to merge.

Address review: silent drops make future regressions invisible — the
pre-fix path always surfaced `InvalidJustificationRoot` in logs, so
follow it with a debug line when we treat a miss as stale. Also extend
the comment to spell out the second miss cause (re-orged / never fully
imported block) where slot info may not be present at all; the drop is
still correct because we can't finalize a root we don't have.
@ch4r10t33r ch4r10t33r requested a review from zclawz April 22, 2026 09:02
Copy link
Copy Markdown
Contributor

@zclawz zclawz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both review points addressed ✅

  • Log line added: logger.debug("dropping stale justification root=0x{x}: not in block_cache (finalized_slot={d})", ...) in the miss branch — future regressions are now observable in node logs ✅
  • Comment updated: both miss causes documented (pruned entry vs re-orged/never-imported block), with explanation of why drop is correct in both cases ✅

Approved. 🚢

Comment thread pkgs/types/src/state.zig
const slot = block_cache.get(root) orelse return StateTransitionError.InvalidJustificationRoot;
if (slot <= finalized_slot) {
// A miss in `block_cache` for a root that's still in
// `state.justifications` has two possible causes:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is incorrect, only parent roots till finalized can be in justifications and can't be evicted

pls find the real bug fix for this why we dont have correct RootToSlotCache

@ch4r10t33r
Copy link
Copy Markdown
Contributor Author

Superseded by #776, which addresses the root cause in the chain-level prune logic rather than masking cache misses in the STF. With #776 the miss can't happen in the single-advance case, so the STF-level drop-on-miss added here is no longer needed.

@ch4r10t33r ch4r10t33r closed this Apr 22, 2026
ch4r10t33r added a commit that referenced this pull request Apr 22, 2026
Cached post-states retained in `self.states` after `pruneStates`
(block.slot > latestFinalized.slot) can still hold justifications_roots
referencing slots in (state.latest_finalized.slot, state.slot]. The
post-finalization cleanup loop in `BeamState.processAttestations` looks
those roots up in the chain-owned `root_to_slot_cache`, so the cache
must keep them reachable across at least one finalization boundary.

Previously we pruned on `latestFinalized.slot`, which dropped exactly
the roots in (previousFinalized.slot, latestFinalized.slot] that such
cached states can still reference. On devnet-4 a late-arriving
competing block at slot 267 triggered a 171→252 jumbo finality jump
followed by a forkchoice swing back to the pre-jump head at slot 268;
the very next block on top of it then missed on a justification root
in (171, 252] in the STF cleanup loop and wedged zeam_0 with
`InvalidJustificationRoot` until a checkpoint-sync restart.

Pruning on `previousFinalized.slot` keeps (previousFinalized.slot,
latestFinalized.slot] alive in the cache for exactly the window any
surviving cached state can reference, closing the coherence gap for
the common single-advance case. Paired with #772 (which hardens the
STF to drop cache misses instead of wedging), this gives defense in
depth: the cache no longer drops roots any reachable state can still
need, and any residual miss from the rare two-hop stale-state case is
handled gracefully.

Refs: #771, #772
GrapeBaBa added a commit that referenced this pull request May 26, 2026
…ixture fill

leanSpec #772 set the prover's rayon thread count to 1 and re-enabled parallel
prod fill, so `fill-ci` is now `-n auto --dist=worksteal` (per-process safe).
Bump the leanSpec pin to 352b558 and match ci.yml's fixture-gen step + comment.
Addresses anshil's ci.yml:440 review comment.
GrapeBaBa added a commit that referenced this pull request May 26, 2026
…ixture fill

leanSpec #772 set the prover's rayon thread count to 1 and re-enabled parallel
prod fill, so `fill-ci` is now `-n auto --dist=worksteal` (per-process safe).
Bump the leanSpec pin to 352b558 and match ci.yml's fixture-gen step + comment.
Addresses anshil's ci.yml:440 review comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

types/state: STF aborts with InvalidJustificationRoot when a justifications root was just pruned from block_cache, wedging the node permanently

3 participants