Skip to content

[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind#21157

Merged
AskAlexSharov merged 2 commits into
release/3.4from
cp/r34-create2-unwind-collision
May 14, 2026
Merged

[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind#21157
AskAlexSharov merged 2 commits into
release/3.4from
cp/r34-create2-unwind-collision

Conversation

@JkLondon
Copy link
Copy Markdown
Member

Summary

Fixes a state-leak bug in unwindExec3 that surfaces as a gas used mismatch and Cannot update chain head after any reorg at the tip whose unwound block deployed a contract via CREATE/CREATE2 to a counterfactual address (safe-wallets, EIP-1167 clone factories, ERC-4337 accounts, deterministic deployers). Reproduced on hoodi at block 2 789 993 on a node running release/3.4 — local sidechain head 0x706b073b…, canonical 0x72c42d17… / 0xc4c75456…, divergence +72 524 gas in tx 0x73652205af0b85e7… (idx 7) which goes from status=1, gasUsed=171 004 on canonical to status=0, gasUsed=243 528 (consume-all-gas) locally.

This is not a cherry-pick of a merged PR. The fix originated on release/3.4 after diagnosing the hoodi snapshotter incident on 2026-05-11. The corresponding bug is also present on main (see "Cherry-pick to main" below).

Root cause

unwindExec3 looks up the per-block DomainEntryDiff set via doms.GetDiffset(rwTx, br.CanonicalHash(currentBlock), currentBlock). The diffset, however, is keyed by the hash of the block that was actually executed, not by whatever the current canonical hash happens to be at unwind time.

The headers stage processes a reorg in two steps inside a single transaction (execution/stagedsync/stage_headers.go):

  1. u.UnwindTo(unwindTo, …) — schedules an unwind for subsequent stages.
  2. fixCanonicalChain(…) — overwrites kv.HeaderCanonical to point at the new chain.

The canonical pointer flip happens before the execution-stage unwind actually runs. When unwindExec3 then asks for br.CanonicalHash(currentBlock) it gets the new canonical hash, but the diffset for that height is sitting in ChangeSets3 under the previous (now sidechain) hash that was actually applied.

The pre-existing fallback at stage_execute.go:191-204 only handled the case where CanonicalHash returns !ok (no canonical pointer at this height — meaning we executed a block that was never marked canonical). It did not handle the case where CanonicalHash returns the new canonical and the diffset is missing under that new hash. In that case the code treated the miss as the "first block in traversal, nothing to unwind yet" edge case and silently continued.

The downstream effect is severe:

  • changeSet stays nilunwindExec3State gets nil → the loop that collects state changes into stateChanges is skipped → sd.Unwind(txUnwindTo, nil)sd.unwindChangesetRaw = nil.
  • At TemporalMemBatch.Flush the guard if sd.unwindChangesetRaw != nil (temporal_mem_batch.go:596-606) skips AggregatorRoTx.Unwind entirely.
  • No empty-tombstones are written into the latest AccountsDomain / CodeDomain rows for the contracts deployed in the unwound sidechain block. The previous tombstone-restore fix #20627 / e1526bf is therefore bypassed entirely — its []byte{} tombstone write never gets a chance to run because the diffset that drives it is "not found".

The execution stage progress is then rolled back through the staged-sync machinery (Headers, Bodies, Senders reset), but the domain (latest state) tables in MDBX still reflect the sidechain. On the next forward execution the canonical chain re-attempts the same CREATE2 against the now-phantom contract. EIP-684 / EIP-1014 collision rules consume all remaining gas of that tx — exactly what shows up in logs:

[WARN] gas used mismatch  block=2789993  header=4279886  execution=4352410  diff=72524  txCount=34  receiptCount=34
  tx gas detail  block=2789993 txIdx=7 txHash=0x73652205af0b85e7 gasUsed=243528 ... status=0

243528 is the tx gas limit; 171004 is what canonical reference RPC (https://rpc.hoodi.ethpandaops.io) reports for the same tx; 243528 − 171004 = 72524, matching the entire block-level mismatch.

Fix

Localised to one function. Does not require any of the parallel-commitment / SD-recreate infrastructure that landed in main (bfa03df625, 95d96627ce, etc.) — the diffset lookup is the only place that was making an incorrect assumption about canonical-hash stability across stages.

New helper findExecutedDiffsetAtHeight in execution/stagedsync/stage_execute.go:

  1. Tries the diffset under the current canonical hash (previous behaviour for the common case).
  2. If !ok, walks every header stored at this height via rawdb.ReadHeadersByNumber and tries each one. The header whose diffset is present is by construction the block that was actually executed.
  3. Returns the recovered diffset together with the executedHash for the caller.

unwindExec3 is updated to:

  • Use the helper instead of the inline lookup.
  • Track lastExecHash from the helper-returned executedHash for the topmost unwound block (u.CurrentBlockNumber), rather than re-reading br.CanonicalHash(u.CurrentBlockNumber) afterwards (which would suffer the same canonical-flip miscompare).
  • Pass that hash to unwindExec3State, which forwards it to stateCache.RevertWithDiffset(changeset, lastExecutedBlockHash, unwindToHash). The cache's RevertWithDiffset compares its own stored blockHash against revertFromHash — with the corrected hash, surgical eviction works and the cache no longer degrades to a full clear after every tip reorg.

The fallback search is bounded by the number of headers stored at a given height (effectively 1 in the common case, 2 in a 1-deep reorg, and so on). Order of headers returned by ReadHeadersByNumber is by storage prefix, but only one of them has a stored diffset, so the search is deterministic.

Regression test

execution/stagedsync/stage_execute_unwind_test.go covers three phases:

  1. Canonical hash points to the actually-executed block — direct hit, executedHash = hOld.
  2. Headers stage re-canonicalises the height to a different hash (hNew) — the test first asserts that the pre-fix direct doms.GetDiffset(tx, hNew, height) returns !ok (documenting the bug), then asserts that findExecutedDiffsetAtHeight recovers via fallback and still returns executedHash = hOld with the original diff list intact.
  3. A height with no stored diffset under any known header — found=false, no spurious matches.

Test runs in ~100 ms and exercises the helper against a real mdbx + Aggregator + temporal.New stack with a freezeblocks.BlockReader (no snapshots required, the unwind range is at the tip).

Limitations / what this does NOT cover

  • Multiple distinct executed sidechains at the same height. If a node reorg'd back-and-forth and ChangeSets3 ends up holding diffsets for more than one block at the same height, the fallback returns the first non-canonical hash whose diffset is present. In practice the staged-sync unwind path immediately invalidates the stored diffset of any unwound block (via the subsequent forward execution), so this is rare; but a fully robust fix would persist a separate execution-applied head hash per height. Out of scope here.
  • The race itself (headers stage rewriting kv.HeaderCanonical before execution stage unwind runs) is unchanged. This PR makes the execution-stage unwind robust against that ordering rather than restructuring stage interaction.

How to verify on a stuck node

The runbook for an already-affected datadir (no rebuild required, snapshots are not contaminated because the phantom lives only in the MDBX latest domain tables at the tip) is:

sudo systemctl stop erigon
./build/bin/integration reset_state --datadir=<DD> --chain=hoodi
sudo systemctl start erigon

With the fix in place, future reorgs at the tip that involve CREATE2-to-counterfactual transactions will unwind cleanly and the same scenario will not recur.

Test plan

  • go test -count=1 -short ./execution/stagedsync/... ./db/state/... — green.
  • New TestFindExecutedDiffsetAtHeight_FallsBackAfterCanonicalReorg — green; embeds a sanity-assert that the pre-fix direct lookup misses (so the test fails if the fallback is reverted).
  • Manual: run on a node holding the hoodi-2789993 datadir state; after reset_state and restart, block 2 789 993 must be accepted with hash 0xc4c754565911f13b96a841782095a21a32806f927307ffa71695ecd99801c236.
  • CI: full unit-test + lint pipeline.

🤖 Generated with Claude Code

Root cause of the CREATE2-collision-after-reorg bug surfaced on hoodi at
block 2789993 (release/3.4):

When the headers stage re-canonicalises a height before the execution
stage unwind runs, br.CanonicalHash(blockN) returns the new canonical
hash, but the diffset for that height was stored under the previously-
canonical (now sidechain) hash. doms.GetDiffset under the new hash
returns \!ok. The pre-fix code treated this as "first block in traversal,
nothing to unwind yet" and just `continue`d. Result: changeSet stayed
nil → sd.unwindChangesetRaw stayed nil → at Flush, the
`if sd.unwindChangesetRaw \!= nil` guard skipped AggregatorRoTx.Unwind
entirely → no tombstones written to the AccountsDomain/CodeDomain
latest tables → phantom CREATE2 contracts created on the sidechain
remained.

On the next forward execution the canonical chain hits the same
counterfactual CREATE2 address, collides with the phantom (EIP-684 /
EIP-1014: existing code or nonce\!=0 ⇒ consume-all-gas), and the block
is rejected with `gas used mismatch`.

Fix: when looking up the diffset for an unwound block, fall back from
the (now-canonical) hash to every other header hash stored at that
height. The header whose diffset is present is the block we actually
executed. Use that hash for both the diff merge and the
RevertWithDiffset cache-invalidation revertFromHash, so cache eviction
also stays surgical instead of degrading to a full clear.

This is a localised fix that does NOT depend on the parallel-commitment
infrastructure in main; the lookup is the only place that was making
the wrong assumption about canonical-hash stability across stages.

Regression test in stage_execute_unwind_test.go covers:
  - canonical-hash points to executed block (direct hit)
  - canonical-hash flipped after execution (fallback hit, this is the
    bug being fixed; includes a sanity assertion that the pre-fix
    direct GetDiffset still misses)
  - no stored diffset under any known header (not-found, no spurious
    matches)

Refs hoodi snapshotter incident 2026-05-11, block 2789993,
local sidechain head 0x706b073b…, canonical 0x72c42d17…/0xc4c75456…
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes execution-stage unwind robustness during tip reorgs by ensuring the per-block diffset is located using the hash of the actually executed block, not the possibly-updated canonical hash, preventing missed unwinds that can leave “phantom” state and trigger downstream gas used mismatch / head update failures.

Changes:

  • Added findExecutedDiffsetAtHeight to recover the correct ChangeSets3 diffset even after headers-stage canonical pointer flips.
  • Updated unwindExec3 to use the recovered diffset and pass the executed hash through to state-cache revert logic.
  • Added a regression test covering canonical reorg fallback behavior and genuine-not-found behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
execution/stagedsync/stage_execute.go Adds executed-diffset discovery helper and integrates it into unwindExec3 to unwind the correct applied state.
execution/stagedsync/stage_execute_unwind_test.go Adds regression coverage for diffset lookup after canonical reorg and not-found cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/stagedsync/stage_execute.go Outdated
Comment on lines 248 to 272
@@ -200,10 +265,10 @@ func unwindExec3(u *UnwindState, s *StageState, doms *execctx.SharedDomains, rwT
// all previous blocks
continue
}
return fmt.Errorf("domains.GetDiffset(%d, %s): not found", currentBlock, currentHash)
return fmt.Errorf("domains.GetDiffset(%d): not found under any known header hash", currentBlock)
}
if err != nil {
return err
if currentBlock == u.CurrentBlockNumber {
lastExecHash = executedHash
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkLondon is it important?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, will fix

Comment thread execution/stagedsync/stage_execute_unwind_test.go Outdated
…t loop iter

Address Copilot review on #21157.

unwindExec3 only set lastExecHash when currentBlock == u.CurrentBlockNumber.
When that block has no diffset yet (the "has not been processed" edge case
that triggers `continue`), lastExecHash stayed zero even after a later
iteration recovered the topmost executed block's diffset. RevertWithDiffset
then saw a hash mismatch and fell back to a full state-cache clear,
defeating the surgical eviction the PR is meant to enable.

Move the assignment into the `changeSet == nil` branch so it captures the
hash of the first actually-found diffset (= topmost executed block in the
unwound range).

Also drop dead `stepBytes`/`diffKey` from the regression test — the slice
was sliced back to `addr.Bytes()` and the comment about "inverted
big-endian" was wrong (zero bytes are not inverted-BE step 0).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@yperbasis yperbasis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Fixes a state-leak bug where unwindExec3 fails to locate the per-block DomainEntryDiff set after a tip reorg, because the diffset is keyed by the actually-executed block hash while br.CanonicalHash() returns whatever canonical now points to. The miss caused sd.unwindChangesetRaw to stay nil, skipping AggregatorRoTx.Unwind and leaving "phantom" deployed contracts in the latest-state domain tables. The next forward execution of the canonical chain re-attempted the same CREATE2 and consumed the entire gas limit (EIP-684 collision), surfacing as the gas used mismatch on hoodi block 2,789,993.

The fix introduces a helper findExecutedDiffsetAtHeight that tries the canonical hash first and falls back to walking every header stored at the height.

What works well

  • Surgical and localised. Only changes unwindExec3; no scope-creep into the broader staged-sync interaction.
  • Preserves existing edge-case logic. The "first iteration: nothing executed yet → continue" branch (if changeSet == nil { continue }) is retained, and the helper's not-found return cooperates with it correctly.
  • Good regression test. Three phases cover (1) direct hit, (2) canonical-flipped fallback — with an explicit sanity assert that the pre-fix code path (doms.GetDiffset directly under new canonical) misses, so the test fails if the fallback is reverted, and (3) genuinely-not-found.
  • Test runs in ~1s and uses a real mdbx + Aggregator + temporal.New stack — no mocks.
  • The Copilot finding on lastExecHash capture has already been fixed in the follow-up commit 37a846c30, by moving the assignment into the changeSet == nil branch so it captures the first found diffset rather than the first loop iteration. Important — otherwise a deeper unwind whose top blocks have no diffset would have silently degraded to a full cache clear.

Issues / suggestions

1. PR-description root cause is mis-attributed (docs only, not code)

The description claims the canonical pointer flip happens in stage_headers.go via fixCanonicalChain after u.UnwindTo. Looking at the actual code (stage_headers.go:280-291):

if headerInserter.GetHighest() != 0 {
    if !headerInserter.Unwind() {                 // ← only NON-unwind path
        if err = fixCanonicalChain(...); err != nil { ... }
    }
    ...
}

fixCanonicalChain is gated by !headerInserter.Unwind(), so it does not run when an unwind is scheduled. The mechanism that actually flips canonical-before-unwind on hoodi (PoS) is addAndVerifyBlockStep in stageloop.go:560-605, which unconditionally WriteCanonicalHash(currentHash, currentHeight) every time NewPayload arrives — so a second payload at the same height as a previously-executed block flips the canonical pointer before the eventual forkchoice-triggered unwind runs.

The diagnosis of the actual bug is correct; only the explanation of how it gets triggered is off. Consider updating the PR body so the runbook references the right code path.

2. Doc comment in helper slightly overstates the loop's edge case (stage_execute.go:175-178)

"…the surrounding loop handles by continuing past the first iteration only."

The continue actually fires on every iteration where changeSet == nil && !found, not just the very first one. With a deep enough unwind range whose top N blocks were never executed, several iterations may be skipped before the first diffset is encountered. Re-word to "while no diffset has been found yet" or similar.

3. Minor: ReadHeadersByNumber is called twice on the rare !ok + single-header + miss path

When br.CanonicalHash() returns !ok and exactly one header is stored, the helper sets currentHash = nonCanonicalHeaders[0].Hash(), then falls through to doms.GetDiffset. If that misses, the post-fallback rawdb.ReadHeadersByNumber(rwTx, currentBlock) re-reads the same single header and skips it. Functionally correct, just two cursor scans where one would do. Low priority.

4. Cache-eviction call in unwindExec3State still uses raw canonical (stage_execute.go:359)

unwindToHash, err := rawdb.ReadCanonicalHash(tx, blockUnwindTo)

This is for the target hash (the unwind point), and at unwind time the unwind point's canonical hash is still the original (new canonicals are only written for blocks above the unwind point), so this is safe today. Worth a one-line comment noting why this lookup is robust by construction, since the surrounding diffset lookup needed the elaborate fallback.

5. Test coverage gap (nice-to-have)

The regression test doesn't exercise the !ok && len(headers) > 1 ambiguity-error path or the multi-sidechain-at-same-height limitation that the PR explicitly calls out. Not blocking — but a 3rd phase that confirms found=false plus a specific error string when two non-canonical headers both have diffsets would lock in the documented behaviour.

Correctness / safety

  • The fallback is deterministic in the common case (≤2 headers per height, only one has a stored diffset). With multiple stored diffsets at one height the helper returns the first match by ReadHeadersByNumber's storage-prefix order — the PR explicitly accepts this in "Limitations" and the practical rarity makes it a reasonable trade-off for a backport.
  • The error-message change in unwindExec3 ("domains.GetDiffset(%d, %s): not found""…not found under any known header hash") is a small behaviour change to observability but more accurate (the old message printed the hash we tried, which is now misleading after the fallback).
  • No new locking or transaction-scope changes; the helper takes the same rwTx as the caller.

Local verification

$ go test -count=1 -short -run TestFindExecutedDiffsetAtHeight ./execution/stagedsync/...
ok  	github.com/erigontech/erigon/execution/stagedsync	1.020s
$ go test -count=1 -short ./execution/stagedsync/
ok  	github.com/erigontech/erigon/execution/stagedsync	1.257s

Recommendation

Approving. Please tighten the PR description's root-cause section (point 1 — the stage_headers.go story doesn't match the gated code path; the PoS addAndVerifyBlockStep flow is the real trigger). The other suggestions (comment wording, double-read, additional test phase) are non-blocking polish.

The "Cherry-pick to main" section referenced in the body appears to be missing — worth confirming a tracking PR exists, since the same bug pattern is present on main (verified by reading git show main:execution/stagedsync/stage_execute.go).

@JkLondon JkLondon requested a review from AskAlexSharov May 13, 2026 21:06
@AskAlexSharov AskAlexSharov merged commit 2b9b9ce into release/3.4 May 14, 2026
23 checks passed
@AskAlexSharov AskAlexSharov deleted the cp/r34-create2-unwind-collision branch May 14, 2026 00:56
AskAlexSharov pushed a commit that referenced this pull request May 21, 2026
…ove snapshot tip on reset_state (#21246)

## Summary

Fixes a stale-canonical-pointer leak in `integration reset_state` that
was reintroducing phantom state on hoodi snapshotters running
`release/3.4` even after a clean rebuild. Observed at block **2 818
468** on `snapshotter-bm-v34-hoodi-n1`: local `kv.HeaderCanonical[2 818
468] = 0xac2ee57a…` (a sidechain block) while the real canonical was
`0x27db29e4…`. The next canonical block, **2 818 476**, then died with
`insufficient funds for gas * price + value` on
`0xdD11751cdD3f6EFf01B1f6151B640685bfa5dB4a` — a fee-recipient sweep-tx
that was short by exactly `17 100 × 1 265 000 000 = 21 631 500 000 000
wei`, the EIP-2929 SSTORE-SET-vs-RESET delta for a single storage slot
on `0x8684adde…` that had been first-written by the surviving sidechain
block 8 blocks earlier.

This is **not** a cherry-pick of a merged PR. The fix originates on
`release/3.4` and pairs with #21157 (already on `release/3.4`).

## Root cause

`reset_state` historically called `ResetExec`, which clears the MDBX
state-domain tables (`AccountsDomain`, `StorageDomain`, `CodeDomain`,
`CommitmentDomain`, `ReceiptDomain`, `RCacheDomain`), the inverted
indexes and `ChangeSets3`, plus the
`Execution`/`Finish`/`CustomTrace`/`TxLookup`/`Witnesses` stage
progress. It deliberately left `kv.HeaderCanonical`, `kv.HeaderTD` and
the `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress alone, on
the assumption that those tables faithfully reflect the consensus-layer
view of the chain.

That assumption breaks once a forkchoice update has ever committed a
canonical hash whose chain later turned out to be the wrong side of a
reorg. Concretely, on the affected snapshotter:

1. An older forkchoice update for chain *A* succeeded;
`WriteCanonicalHash(A, 2 818 468)` was committed.
2. CL switched to chain *B*. The forkchoice handler walked back, built
`newCanonicals` including `B@2 818 468`, ran the staged unwind and
reached the forward `executionPipeline.Run` step.
3. Forward execution failed on a downstream block because the unwind on
a pre-#21157 binary failed to revert `CodeDomain` / `StorageDomain`
writes made by *A* (see
`execution/stagedsync/stage_execute.go:findExecutedDiffsetAtHeight`
introduced in #21157). The forkchoice handler returned `BadBlock` and
the whole tx — including the `WriteCanonicalHash(B, 2 818 468)` issued
at `execution/execmodule/forkchoice.go:381` — rolled back.
`kv.HeaderCanonical[2 818 468]` remained at *A*.
4. The operator upgraded to a binary containing #21157 and ran
`integration reset_state`. `ResetExec` cleared the state domains, but
the stale `kv.HeaderCanonical[2 818 468] = A` survived.
5. On restart the staged sync caught up from the snapshot tip (`block 2
810 162`, txNum 100 781 249, end of step 258). The execution stage read
each block's body using its canonical hash, hit the stale pointer at 2
818 468, re-applied chain *A*'s block as canonical and re-installed the
phantom code+slot writes on `0x8684adde…`. The subsequent canonical tx 3
of block 2 818 476 then took the SSTORE-RESET path instead of SSTORE-SET
— saving exactly 17 100 gas and starving the fee-recipient sweep at tx
14.

`seg integrity --check=CommitmentRoot` on the same datadir is green
(both `v2.0-commitment.0-256.kv` and `v2.0-commitment.256-258.kv`
recompute matching roots), confirming that no published `.kv` is
contaminated. The phantom lived purely in MDBX, fed by
`kv.HeaderCanonical` whose stale entries `reset_state` was failing to
evict.

## Fix

`ResetCanonicalAboveTip(ctx, db, frozenBlocks)` truncates
`kv.HeaderCanonical` and `kv.HeaderTD` from `frozenBlocks+1` forward,
caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at
`frozenBlocks`, and re-anchors `HeadHeaderHash` to the canonical hash
recorded at the tip (when one is present). `ResetState` takes a new
`frozenBlocks` parameter and invokes the helper after `ResetExec`.
`cmd/integration/commands/reset_state.go` passes `br.FrozenBlocks()`.

Why this is safe to roll into `ResetState`:

- The snapshot tip is by construction immutable — the canonical hashes
for heights `≤ frozenBlocks` live in the frozen blocks themselves, so
leaving the table contents below the tip untouched is correct.
- Stage-progress writes only run when an individual stage is observably
above the tip; the helper is a no-op on a clean db.
- After the truncation the next forkchoice update from the consensus
layer drives canonical assignments for the post-tip range fresh, with no
remnant of any pre-fix sidechain commit able to mislead a forward
catchup.

The forkchoice-handler logic itself
(`execution/execmodule/forkchoice.go`) and the
`findExecutedDiffsetAtHeight` unwind path from #21157 are untouched.
This PR addresses the persistent-leftover side of the problem; future
reorgs on a node carrying #21157 will unwind cleanly and won't
accumulate the leftover.

## Test plan

`execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases:

- `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` seeds
`kv.HeaderCanonical` entries for heights 100..110 (including an explicit
non-zero `0x99…` at height 105 to make the assertion distinguish
"missing" from "zero"), sets the four stage progresses to 110 and writes
a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, the
test asserts that canonical hashes at 101..110 are gone, the tip's
canonical hash at 100 is preserved, `HeadHeaderHash` has been
re-anchored, and each of the four stages has been capped at 100.
- `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` calls the helper on
a db whose only canonical entry sits exactly at the tip and whose
`Headers` stage already equals the tip; the test asserts the tip is
preserved and stage progress is not regressed.

Both tests use `temporaltest.NewTestDB` for a real MDBX/temporal
backend, run in well under one second, and exercise the public helper
directly without needing a full ExecModule.

- [x] `go test -count=1 -timeout=120s
./execution/stagedsync/rawdbreset/...` — green.
- [x] `go test -count=1 -timeout=300s ./db/test/...` — green (covers
existing callers of `ResetExec`).
- [x] `go build ./cmd/integration/...` — green (only production caller
of `ResetState`).
- [ ] CI full unit-test + lint pipeline.
- [ ] Manual: run `integration reset_state --datadir=<DD> --chain=hoodi`
on the affected snapshotter; verify `kv.HeaderCanonical[2 818 468]` is
cleared, restart, watch catchup pass `2 818 476` cleanly, run `seg
integrity --check=CommitmentRoot` before re-enabling snapshot
publication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: JkLondon <me@ilyamikheev.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
yperbasis added a commit that referenced this pull request May 21, 2026
## Summary

- Backfills `ChangeLog.md` with entries for v3.2.3, v3.3.1–v3.3.10,
v3.4.0, and v3.4.1 (previously the file stopped at v3.3.0).
- Drafts the v3.4.2 entry (release date TBD): two bugfixes (#21157,
#21262) and two improvements (#21148, #21135).
- Corrects the v3.3.0 release date to 2025-11-27 to match the GitHub
publish date.

## Test plan

- [ ] Skim the rendered changelog on GitHub for formatting.
- [ ] Confirm v3.4.2 draft contents before tagging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 21, 2026
…pshot tip on reset_state (erigontech#21247)

## Summary

Defense-in-depth cleanup of `integration reset_state` so it also evicts
stale `kv.HeaderCanonical` / `kv.HeaderTD` entries above the snapshot
tip and caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress
at the tip. Without this, a stale canonical-hash pointer survives
`reset_state` and steers the next forward catchup back onto a
non-canonical block.

Mirrors the `release/3.4` backport at erigontech#21246; the file structure of
`execution/stagedsync/rawdbreset/reset_stages.go` and
`cmd/integration/commands/reset_state.go` is identical on `main` and
`release/3.4`, so the patch applies verbatim.

## Why it matters on main even after the recent unwind fixes

`main` already carries the two reorg-correctness fixes that prevent
**new** stale canonical pointers from accumulating:

- `bfa03df625` ("parallel-commitment correctness for reorg/unwind + SD
recreate")
- erigontech#21157 ("execution/stagedsync: find diffset by actually-executed hash
on unwind")

After those changes, a forkchoice update whose forward execution fails
no longer leaks state because the unwind correctly reverts the
previously-applied sidechain. So in steady state on a fresh `main`
build, `kv.HeaderCanonical` should only ever hold canonical hashes.

However, datadirs created on older Erigon versions (including any
`release/3.4` build older than 2026-05-14) can carry leftover sidechain
entries in `kv.HeaderCanonical` from the time when the unwind bug was
active. Today's `reset_state` clears MDBX domain state but leaves those
entries alone — a forward catchup then reads them as authoritative and
re-introduces the very phantom state that `reset_state` was meant to
clear. Operators upgrading such a datadir to a fresh `main` binary will
hit the same symptom that bit the `release/3.4` snapshotter on hoodi at
block 2 818 476 (`insufficient funds` on a fee-recipient sweep tx, off
by exactly the EIP-2929 SSTORE-SET-vs-RESET delta `17 100 × 1 265 000
000 wei`).

This PR is therefore a no-cost forward-compat improvement: it harmlessly
no-ops on a fresh `main` datadir and fully recovers older /
cross-version datadirs without requiring `integration stage_exec
--unwind=N` or any other manual incantation.

## Fix

`ResetCanonicalAboveTip(ctx, db, frozenBlocks)`:

- `rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)`
- `rawdb.TruncateTd(tx, frozenBlocks+1)`
- `WriteHeadHeaderHash` to canonical hash at the tip (when one is
recorded)
- Caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at
`frozenBlocks`

`ResetState` takes a new `frozenBlocks uint64` parameter and invokes the
helper after `ResetExec`. The only production caller,
`cmd/integration/commands/reset_state.go`, passes `br.FrozenBlocks()`.

Snapshot-tip data is by construction immutable, so leaving the
canonical-hash table contents at-or-below the tip untouched is correct;
stage-progress writes only run when an individual stage is observably
above the tip; the routine is a no-op when nothing is above the tip.

## Test plan

`execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases
against a real `temporaltest.NewTestDB` backend:

- `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` — seeds
canonical entries for heights 100..110 (including an explicit non-zero
`0x99…` at 105 to distinguish "missing" from "zero"), sets all four
stage progresses to 110 and writes a stale `HeadHeaderHash`. After the
call with `snapshotTip = 100`, asserts canonical hashes at 101..110 are
gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` is
re-anchored to the tip's canonical hash, and each stage has been capped
at 100.
- `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` — calls the helper
on a db whose only canonical entry sits exactly at the tip and whose
`Headers` stage already equals the tip; asserts the tip is preserved and
stage progress is not regressed.

- [x] `go test -count=1 -timeout=120s
./execution/stagedsync/rawdbreset/...` — green.
- [x] `go build ./cmd/integration/...` — green (only production caller
of `ResetState`).
- [ ] CI full unit-test + lint pipeline.

## Related

- `release/3.4` backport: erigontech#21246
- Reorg-unwind correctness on r3.4: erigontech#21157
- Reorg-unwind correctness on main: `bfa03df625`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: JkLondon <me@ilyamikheev.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants