[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind by JkLondon · Pull Request #21157 · erigontech/erigon

JkLondon · 2026-05-13T00:23:48Z

Summary

Fixes a state-leak bug in unwindExec3 that surfaces as a gas used mismatch and Cannot update chain head after any reorg at the tip whose unwound block deployed a contract via CREATE/CREATE2 to a counterfactual address (safe-wallets, EIP-1167 clone factories, ERC-4337 accounts, deterministic deployers). Reproduced on hoodi at block 2 789 993 on a node running release/3.4 — local sidechain head 0x706b073b…, canonical 0x72c42d17… / 0xc4c75456…, divergence +72 524 gas in tx 0x73652205af0b85e7… (idx 7) which goes from status=1, gasUsed=171 004 on canonical to status=0, gasUsed=243 528 (consume-all-gas) locally.

This is not a cherry-pick of a merged PR. The fix originated on release/3.4 after diagnosing the hoodi snapshotter incident on 2026-05-11. The corresponding bug is also present on main (see "Cherry-pick to main" below).

Root cause

unwindExec3 looks up the per-block DomainEntryDiff set via doms.GetDiffset(rwTx, br.CanonicalHash(currentBlock), currentBlock). The diffset, however, is keyed by the hash of the block that was actually executed, not by whatever the current canonical hash happens to be at unwind time.

The headers stage processes a reorg in two steps inside a single transaction (execution/stagedsync/stage_headers.go):

u.UnwindTo(unwindTo, …) — schedules an unwind for subsequent stages.
fixCanonicalChain(…) — overwrites kv.HeaderCanonical to point at the new chain.

The canonical pointer flip happens before the execution-stage unwind actually runs. When unwindExec3 then asks for br.CanonicalHash(currentBlock) it gets the new canonical hash, but the diffset for that height is sitting in ChangeSets3 under the previous (now sidechain) hash that was actually applied.

The pre-existing fallback at stage_execute.go:191-204 only handled the case where CanonicalHash returns !ok (no canonical pointer at this height — meaning we executed a block that was never marked canonical). It did not handle the case where CanonicalHash returns the new canonical and the diffset is missing under that new hash. In that case the code treated the miss as the "first block in traversal, nothing to unwind yet" edge case and silently continued.

The downstream effect is severe:

changeSet stays nil → unwindExec3State gets nil → the loop that collects state changes into stateChanges is skipped → sd.Unwind(txUnwindTo, nil) → sd.unwindChangesetRaw = nil.
At TemporalMemBatch.Flush the guard if sd.unwindChangesetRaw != nil (temporal_mem_batch.go:596-606) skips AggregatorRoTx.Unwind entirely.
No empty-tombstones are written into the latest AccountsDomain / CodeDomain rows for the contracts deployed in the unwound sidechain block. The previous tombstone-restore fix #20627 / e1526bf is therefore bypassed entirely — its []byte{} tombstone write never gets a chance to run because the diffset that drives it is "not found".

The execution stage progress is then rolled back through the staged-sync machinery (Headers, Bodies, Senders reset), but the domain (latest state) tables in MDBX still reflect the sidechain. On the next forward execution the canonical chain re-attempts the same CREATE2 against the now-phantom contract. EIP-684 / EIP-1014 collision rules consume all remaining gas of that tx — exactly what shows up in logs:

[WARN] gas used mismatch  block=2789993  header=4279886  execution=4352410  diff=72524  txCount=34  receiptCount=34
  tx gas detail  block=2789993 txIdx=7 txHash=0x73652205af0b85e7 gasUsed=243528 ... status=0

243528 is the tx gas limit; 171004 is what canonical reference RPC (https://rpc.hoodi.ethpandaops.io) reports for the same tx; 243528 − 171004 = 72524, matching the entire block-level mismatch.

Fix

Localised to one function. Does not require any of the parallel-commitment / SD-recreate infrastructure that landed in main (bfa03df625, 95d96627ce, etc.) — the diffset lookup is the only place that was making an incorrect assumption about canonical-hash stability across stages.

New helper findExecutedDiffsetAtHeight in execution/stagedsync/stage_execute.go:

Tries the diffset under the current canonical hash (previous behaviour for the common case).
If !ok, walks every header stored at this height via rawdb.ReadHeadersByNumber and tries each one. The header whose diffset is present is by construction the block that was actually executed.
Returns the recovered diffset together with the executedHash for the caller.

unwindExec3 is updated to:

Use the helper instead of the inline lookup.
Track lastExecHash from the helper-returned executedHash for the topmost unwound block (u.CurrentBlockNumber), rather than re-reading br.CanonicalHash(u.CurrentBlockNumber) afterwards (which would suffer the same canonical-flip miscompare).
Pass that hash to unwindExec3State, which forwards it to stateCache.RevertWithDiffset(changeset, lastExecutedBlockHash, unwindToHash). The cache's RevertWithDiffset compares its own stored blockHash against revertFromHash — with the corrected hash, surgical eviction works and the cache no longer degrades to a full clear after every tip reorg.

The fallback search is bounded by the number of headers stored at a given height (effectively 1 in the common case, 2 in a 1-deep reorg, and so on). Order of headers returned by ReadHeadersByNumber is by storage prefix, but only one of them has a stored diffset, so the search is deterministic.

Regression test

execution/stagedsync/stage_execute_unwind_test.go covers three phases:

Canonical hash points to the actually-executed block — direct hit, executedHash = hOld.
Headers stage re-canonicalises the height to a different hash (hNew) — the test first asserts that the pre-fix direct doms.GetDiffset(tx, hNew, height) returns !ok (documenting the bug), then asserts that findExecutedDiffsetAtHeight recovers via fallback and still returns executedHash = hOld with the original diff list intact.
A height with no stored diffset under any known header — found=false, no spurious matches.

Test runs in ~100 ms and exercises the helper against a real mdbx + Aggregator + temporal.New stack with a freezeblocks.BlockReader (no snapshots required, the unwind range is at the tip).

Limitations / what this does NOT cover

Multiple distinct executed sidechains at the same height. If a node reorg'd back-and-forth and ChangeSets3 ends up holding diffsets for more than one block at the same height, the fallback returns the first non-canonical hash whose diffset is present. In practice the staged-sync unwind path immediately invalidates the stored diffset of any unwound block (via the subsequent forward execution), so this is rare; but a fully robust fix would persist a separate execution-applied head hash per height. Out of scope here.
The race itself (headers stage rewriting kv.HeaderCanonical before execution stage unwind runs) is unchanged. This PR makes the execution-stage unwind robust against that ordering rather than restructuring stage interaction.

How to verify on a stuck node

The runbook for an already-affected datadir (no rebuild required, snapshots are not contaminated because the phantom lives only in the MDBX latest domain tables at the tip) is:

sudo systemctl stop erigon
./build/bin/integration reset_state --datadir=<DD> --chain=hoodi
sudo systemctl start erigon

With the fix in place, future reorgs at the tip that involve CREATE2-to-counterfactual transactions will unwind cleanly and the same scenario will not recur.

Test plan

go test -count=1 -short ./execution/stagedsync/... ./db/state/... — green.
New TestFindExecutedDiffsetAtHeight_FallsBackAfterCanonicalReorg — green; embeds a sanity-assert that the pre-fix direct lookup misses (so the test fails if the fallback is reverted).
Manual: run on a node holding the hoodi-2789993 datadir state; after reset_state and restart, block 2 789 993 must be accepted with hash 0xc4c754565911f13b96a841782095a21a32806f927307ffa71695ecd99801c236.
CI: full unit-test + lint pipeline.

🤖 Generated with Claude Code

Root cause of the CREATE2-collision-after-reorg bug surfaced on hoodi at block 2789993 (release/3.4): When the headers stage re-canonicalises a height before the execution stage unwind runs, br.CanonicalHash(blockN) returns the new canonical hash, but the diffset for that height was stored under the previously- canonical (now sidechain) hash. doms.GetDiffset under the new hash returns \!ok. The pre-fix code treated this as "first block in traversal, nothing to unwind yet" and just `continue`d. Result: changeSet stayed nil → sd.unwindChangesetRaw stayed nil → at Flush, the `if sd.unwindChangesetRaw \!= nil` guard skipped AggregatorRoTx.Unwind entirely → no tombstones written to the AccountsDomain/CodeDomain latest tables → phantom CREATE2 contracts created on the sidechain remained. On the next forward execution the canonical chain hits the same counterfactual CREATE2 address, collides with the phantom (EIP-684 / EIP-1014: existing code or nonce\!=0 ⇒ consume-all-gas), and the block is rejected with `gas used mismatch`. Fix: when looking up the diffset for an unwound block, fall back from the (now-canonical) hash to every other header hash stored at that height. The header whose diffset is present is the block we actually executed. Use that hash for both the diff merge and the RevertWithDiffset cache-invalidation revertFromHash, so cache eviction also stays surgical instead of degrading to a full clear. This is a localised fix that does NOT depend on the parallel-commitment infrastructure in main; the lookup is the only place that was making the wrong assumption about canonical-hash stability across stages. Regression test in stage_execute_unwind_test.go covers: - canonical-hash points to executed block (direct hit) - canonical-hash flipped after execution (fallback hit, this is the bug being fixed; includes a sanity assertion that the pre-fix direct GetDiffset still misses) - no stored diffset under any known header (not-found, no spurious matches) Refs hoodi snapshotter incident 2026-05-11, block 2789993, local sidechain head 0x706b073b…, canonical 0x72c42d17…/0xc4c75456…

Copilot

Pull request overview

Fixes execution-stage unwind robustness during tip reorgs by ensuring the per-block diffset is located using the hash of the actually executed block, not the possibly-updated canonical hash, preventing missed unwinds that can leave “phantom” state and trigger downstream gas used mismatch / head update failures.

Changes:

Added findExecutedDiffsetAtHeight to recover the correct ChangeSets3 diffset even after headers-stage canonical pointer flips.
Updated unwindExec3 to use the recovered diffset and pass the executed hash through to state-cache revert logic.
Added a regression test covering canonical reorg fallback behavior and genuine-not-found behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`execution/stagedsync/stage_execute.go`	Adds executed-diffset discovery helper and integrates it into `unwindExec3` to unwind the correct applied state.
`execution/stagedsync/stage_execute_unwind_test.go`	Adds regression coverage for diffset lookup after canonical reorg and not-found cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

AskAlexSharov · 2026-05-13T02:52:40Z

@@ -200,10 +265,10 @@ func unwindExec3(u *UnwindState, s *StageState, doms *execctx.SharedDomains, rwT
 				// all previous blocks
 				continue
 			}
-			return fmt.Errorf("domains.GetDiffset(%d, %s): not found", currentBlock, currentHash)
+			return fmt.Errorf("domains.GetDiffset(%d): not found under any known header hash", currentBlock)
 		}
-		if err != nil {
-			return err
+		if currentBlock == u.CurrentBlockNumber {
+			lastExecHash = executedHash
 		}


@JkLondon is it important?

yep, will fix

…t loop iter Address Copilot review on #21157. unwindExec3 only set lastExecHash when currentBlock == u.CurrentBlockNumber. When that block has no diffset yet (the "has not been processed" edge case that triggers `continue`), lastExecHash stayed zero even after a later iteration recovered the topmost executed block's diffset. RevertWithDiffset then saw a hash mismatch and fell back to a full state-cache clear, defeating the surgical eviction the PR is meant to enable. Move the assignment into the `changeSet == nil` branch so it captures the hash of the first actually-found diffset (= topmost executed block in the unwound range). Also drop dead `stepBytes`/`diffKey` from the regression test — the slice was sliced back to `addr.Bytes()` and the comment about "inverted big-endian" was wrong (zero bytes are not inverted-BE step 0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

yperbasis

Overview

Fixes a state-leak bug where unwindExec3 fails to locate the per-block DomainEntryDiff set after a tip reorg, because the diffset is keyed by the actually-executed block hash while br.CanonicalHash() returns whatever canonical now points to. The miss caused sd.unwindChangesetRaw to stay nil, skipping AggregatorRoTx.Unwind and leaving "phantom" deployed contracts in the latest-state domain tables. The next forward execution of the canonical chain re-attempted the same CREATE2 and consumed the entire gas limit (EIP-684 collision), surfacing as the gas used mismatch on hoodi block 2,789,993.

The fix introduces a helper findExecutedDiffsetAtHeight that tries the canonical hash first and falls back to walking every header stored at the height.

What works well

Surgical and localised. Only changes unwindExec3; no scope-creep into the broader staged-sync interaction.
Preserves existing edge-case logic. The "first iteration: nothing executed yet → continue" branch (if changeSet == nil { continue }) is retained, and the helper's not-found return cooperates with it correctly.
Good regression test. Three phases cover (1) direct hit, (2) canonical-flipped fallback — with an explicit sanity assert that the pre-fix code path (doms.GetDiffset directly under new canonical) misses, so the test fails if the fallback is reverted, and (3) genuinely-not-found.
Test runs in ~1s and uses a real mdbx + Aggregator + temporal.New stack — no mocks.
The Copilot finding on lastExecHash capture has already been fixed in the follow-up commit 37a846c30, by moving the assignment into the changeSet == nil branch so it captures the first found diffset rather than the first loop iteration. Important — otherwise a deeper unwind whose top blocks have no diffset would have silently degraded to a full cache clear.

Issues / suggestions

1. PR-description root cause is mis-attributed (docs only, not code)

The description claims the canonical pointer flip happens in stage_headers.go via fixCanonicalChain after u.UnwindTo. Looking at the actual code (stage_headers.go:280-291):

if headerInserter.GetHighest() != 0 {
    if !headerInserter.Unwind() {                 // ← only NON-unwind path
        if err = fixCanonicalChain(...); err != nil { ... }
    }
    ...
}

fixCanonicalChain is gated by !headerInserter.Unwind(), so it does not run when an unwind is scheduled. The mechanism that actually flips canonical-before-unwind on hoodi (PoS) is addAndVerifyBlockStep in stageloop.go:560-605, which unconditionally WriteCanonicalHash(currentHash, currentHeight) every time NewPayload arrives — so a second payload at the same height as a previously-executed block flips the canonical pointer before the eventual forkchoice-triggered unwind runs.

The diagnosis of the actual bug is correct; only the explanation of how it gets triggered is off. Consider updating the PR body so the runbook references the right code path.

2. Doc comment in helper slightly overstates the loop's edge case (`stage_execute.go:175-178`)

"…the surrounding loop handles by continuing past the first iteration only."

The continue actually fires on every iteration where changeSet == nil && !found, not just the very first one. With a deep enough unwind range whose top N blocks were never executed, several iterations may be skipped before the first diffset is encountered. Re-word to "while no diffset has been found yet" or similar.

3. Minor: `ReadHeadersByNumber` is called twice on the rare `!ok` + single-header + miss path

When br.CanonicalHash() returns !ok and exactly one header is stored, the helper sets currentHash = nonCanonicalHeaders[0].Hash(), then falls through to doms.GetDiffset. If that misses, the post-fallback rawdb.ReadHeadersByNumber(rwTx, currentBlock) re-reads the same single header and skips it. Functionally correct, just two cursor scans where one would do. Low priority.

4. Cache-eviction call in `unwindExec3State` still uses raw canonical (`stage_execute.go:359`)

unwindToHash, err := rawdb.ReadCanonicalHash(tx, blockUnwindTo)

This is for the target hash (the unwind point), and at unwind time the unwind point's canonical hash is still the original (new canonicals are only written for blocks above the unwind point), so this is safe today. Worth a one-line comment noting why this lookup is robust by construction, since the surrounding diffset lookup needed the elaborate fallback.

5. Test coverage gap (nice-to-have)

The regression test doesn't exercise the !ok && len(headers) > 1 ambiguity-error path or the multi-sidechain-at-same-height limitation that the PR explicitly calls out. Not blocking — but a 3rd phase that confirms found=false plus a specific error string when two non-canonical headers both have diffsets would lock in the documented behaviour.

Correctness / safety

The fallback is deterministic in the common case (≤2 headers per height, only one has a stored diffset). With multiple stored diffsets at one height the helper returns the first match by ReadHeadersByNumber's storage-prefix order — the PR explicitly accepts this in "Limitations" and the practical rarity makes it a reasonable trade-off for a backport.
The error-message change in unwindExec3 ("domains.GetDiffset(%d, %s): not found" → "…not found under any known header hash") is a small behaviour change to observability but more accurate (the old message printed the hash we tried, which is now misleading after the fallback).
No new locking or transaction-scope changes; the helper takes the same rwTx as the caller.

Local verification

$ go test -count=1 -short -run TestFindExecutedDiffsetAtHeight ./execution/stagedsync/...
ok  	github.com/erigontech/erigon/execution/stagedsync	1.020s
$ go test -count=1 -short ./execution/stagedsync/
ok  	github.com/erigontech/erigon/execution/stagedsync	1.257s

Recommendation

Approving. Please tighten the PR description's root-cause section (point 1 — the stage_headers.go story doesn't match the gated code path; the PoS addAndVerifyBlockStep flow is the real trigger). The other suggestions (comment wording, double-read, additional test phase) are non-blocking polish.

The "Cherry-pick to main" section referenced in the body appears to be missing — worth confirming a tracking PR exists, since the same bug pattern is present on main (verified by reading git show main:execution/stagedsync/stage_execute.go).

…ove snapshot tip on reset_state (#21246) ## Summary Fixes a stale-canonical-pointer leak in `integration reset_state` that was reintroducing phantom state on hoodi snapshotters running `release/3.4` even after a clean rebuild. Observed at block **2 818 468** on `snapshotter-bm-v34-hoodi-n1`: local `kv.HeaderCanonical[2 818 468] = 0xac2ee57a…` (a sidechain block) while the real canonical was `0x27db29e4…`. The next canonical block, **2 818 476**, then died with `insufficient funds for gas * price + value` on `0xdD11751cdD3f6EFf01B1f6151B640685bfa5dB4a` — a fee-recipient sweep-tx that was short by exactly `17 100 × 1 265 000 000 = 21 631 500 000 000 wei`, the EIP-2929 SSTORE-SET-vs-RESET delta for a single storage slot on `0x8684adde…` that had been first-written by the surviving sidechain block 8 blocks earlier. This is **not** a cherry-pick of a merged PR. The fix originates on `release/3.4` and pairs with #21157 (already on `release/3.4`). ## Root cause `reset_state` historically called `ResetExec`, which clears the MDBX state-domain tables (`AccountsDomain`, `StorageDomain`, `CodeDomain`, `CommitmentDomain`, `ReceiptDomain`, `RCacheDomain`), the inverted indexes and `ChangeSets3`, plus the `Execution`/`Finish`/`CustomTrace`/`TxLookup`/`Witnesses` stage progress. It deliberately left `kv.HeaderCanonical`, `kv.HeaderTD` and the `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress alone, on the assumption that those tables faithfully reflect the consensus-layer view of the chain. That assumption breaks once a forkchoice update has ever committed a canonical hash whose chain later turned out to be the wrong side of a reorg. Concretely, on the affected snapshotter: 1. An older forkchoice update for chain *A* succeeded; `WriteCanonicalHash(A, 2 818 468)` was committed. 2. CL switched to chain *B*. The forkchoice handler walked back, built `newCanonicals` including `B@2 818 468`, ran the staged unwind and reached the forward `executionPipeline.Run` step. 3. Forward execution failed on a downstream block because the unwind on a pre-#21157 binary failed to revert `CodeDomain` / `StorageDomain` writes made by *A* (see `execution/stagedsync/stage_execute.go:findExecutedDiffsetAtHeight` introduced in #21157). The forkchoice handler returned `BadBlock` and the whole tx — including the `WriteCanonicalHash(B, 2 818 468)` issued at `execution/execmodule/forkchoice.go:381` — rolled back. `kv.HeaderCanonical[2 818 468]` remained at *A*. 4. The operator upgraded to a binary containing #21157 and ran `integration reset_state`. `ResetExec` cleared the state domains, but the stale `kv.HeaderCanonical[2 818 468] = A` survived. 5. On restart the staged sync caught up from the snapshot tip (`block 2 810 162`, txNum 100 781 249, end of step 258). The execution stage read each block's body using its canonical hash, hit the stale pointer at 2 818 468, re-applied chain *A*'s block as canonical and re-installed the phantom code+slot writes on `0x8684adde…`. The subsequent canonical tx 3 of block 2 818 476 then took the SSTORE-RESET path instead of SSTORE-SET — saving exactly 17 100 gas and starving the fee-recipient sweep at tx 14. `seg integrity --check=CommitmentRoot` on the same datadir is green (both `v2.0-commitment.0-256.kv` and `v2.0-commitment.256-258.kv` recompute matching roots), confirming that no published `.kv` is contaminated. The phantom lived purely in MDBX, fed by `kv.HeaderCanonical` whose stale entries `reset_state` was failing to evict. ## Fix `ResetCanonicalAboveTip(ctx, db, frozenBlocks)` truncates `kv.HeaderCanonical` and `kv.HeaderTD` from `frozenBlocks+1` forward, caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at `frozenBlocks`, and re-anchors `HeadHeaderHash` to the canonical hash recorded at the tip (when one is present). `ResetState` takes a new `frozenBlocks` parameter and invokes the helper after `ResetExec`. `cmd/integration/commands/reset_state.go` passes `br.FrozenBlocks()`. Why this is safe to roll into `ResetState`: - The snapshot tip is by construction immutable — the canonical hashes for heights `≤ frozenBlocks` live in the frozen blocks themselves, so leaving the table contents below the tip untouched is correct. - Stage-progress writes only run when an individual stage is observably above the tip; the helper is a no-op on a clean db. - After the truncation the next forkchoice update from the consensus layer drives canonical assignments for the post-tip range fresh, with no remnant of any pre-fix sidechain commit able to mislead a forward catchup. The forkchoice-handler logic itself (`execution/execmodule/forkchoice.go`) and the `findExecutedDiffsetAtHeight` unwind path from #21157 are untouched. This PR addresses the persistent-leftover side of the problem; future reorgs on a node carrying #21157 will unwind cleanly and won't accumulate the leftover. ## Test plan `execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases: - `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` seeds `kv.HeaderCanonical` entries for heights 100..110 (including an explicit non-zero `0x99…` at height 105 to make the assertion distinguish "missing" from "zero"), sets the four stage progresses to 110 and writes a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, the test asserts that canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` has been re-anchored, and each of the four stages has been capped at 100. - `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` calls the helper on a db whose only canonical entry sits exactly at the tip and whose `Headers` stage already equals the tip; the test asserts the tip is preserved and stage progress is not regressed. Both tests use `temporaltest.NewTestDB` for a real MDBX/temporal backend, run in well under one second, and exercise the public helper directly without needing a full ExecModule. - [x] `go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...` — green. - [x] `go test -count=1 -timeout=300s ./db/test/...` — green (covers existing callers of `ResetExec`). - [x] `go build ./cmd/integration/...` — green (only production caller of `ResetState`). - [ ] CI full unit-test + lint pipeline. - [ ] Manual: run `integration reset_state --datadir=<DD> --chain=hoodi` on the affected snapshotter; verify `kv.HeaderCanonical[2 818 468]` is cleared, restart, watch catchup pass `2 818 476` cleanly, run `seg integrity --check=CommitmentRoot` before re-enabling snapshot publication. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: JkLondon <me@ilyamikheev.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

## Summary - Backfills `ChangeLog.md` with entries for v3.2.3, v3.3.1–v3.3.10, v3.4.0, and v3.4.1 (previously the file stopped at v3.3.0). - Drafts the v3.4.2 entry (release date TBD): two bugfixes (#21157, #21262) and two improvements (#21148, #21135). - Corrects the v3.3.0 release date to 2025-11-27 to match the GitHub publish date. ## Test plan - [ ] Skim the rendered changelog on GitHub for formatting. - [ ] Confirm v3.4.2 draft contents before tagging. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pshot tip on reset_state (erigontech#21247) ## Summary Defense-in-depth cleanup of `integration reset_state` so it also evicts stale `kv.HeaderCanonical` / `kv.HeaderTD` entries above the snapshot tip and caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at the tip. Without this, a stale canonical-hash pointer survives `reset_state` and steers the next forward catchup back onto a non-canonical block. Mirrors the `release/3.4` backport at erigontech#21246; the file structure of `execution/stagedsync/rawdbreset/reset_stages.go` and `cmd/integration/commands/reset_state.go` is identical on `main` and `release/3.4`, so the patch applies verbatim. ## Why it matters on main even after the recent unwind fixes `main` already carries the two reorg-correctness fixes that prevent **new** stale canonical pointers from accumulating: - `bfa03df625` ("parallel-commitment correctness for reorg/unwind + SD recreate") - erigontech#21157 ("execution/stagedsync: find diffset by actually-executed hash on unwind") After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh `main` build, `kv.HeaderCanonical` should only ever hold canonical hashes. However, datadirs created on older Erigon versions (including any `release/3.4` build older than 2026-05-14) can carry leftover sidechain entries in `kv.HeaderCanonical` from the time when the unwind bug was active. Today's `reset_state` clears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state that `reset_state` was meant to clear. Operators upgrading such a datadir to a fresh `main` binary will hit the same symptom that bit the `release/3.4` snapshotter on hoodi at block 2 818 476 (`insufficient funds` on a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta `17 100 × 1 265 000 000 wei`). This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh `main` datadir and fully recovers older / cross-version datadirs without requiring `integration stage_exec --unwind=N` or any other manual incantation. ## Fix `ResetCanonicalAboveTip(ctx, db, frozenBlocks)`: - `rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)` - `rawdb.TruncateTd(tx, frozenBlocks+1)` - `WriteHeadHeaderHash` to canonical hash at the tip (when one is recorded) - Caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at `frozenBlocks` `ResetState` takes a new `frozenBlocks uint64` parameter and invokes the helper after `ResetExec`. The only production caller, `cmd/integration/commands/reset_state.go`, passes `br.FrozenBlocks()`. Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip. ## Test plan `execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases against a real `temporaltest.NewTestDB` backend: - `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` — seeds canonical entries for heights 100..110 (including an explicit non-zero `0x99…` at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` is re-anchored to the tip's canonical hash, and each stage has been capped at 100. - `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` — calls the helper on a db whose only canonical entry sits exactly at the tip and whose `Headers` stage already equals the tip; asserts the tip is preserved and stage progress is not regressed. - [x] `go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...` — green. - [x] `go build ./cmd/integration/...` — green (only production caller of `ResetState`). - [ ] CI full unit-test + lint pipeline. ## Related - `release/3.4` backport: erigontech#21246 - Reorg-unwind correctness on r3.4: erigontech#21157 - Reorg-unwind correctness on main: `bfa03df625` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: JkLondon <me@ilyamikheev.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

JkLondon requested review from mh0lt and yperbasis as code owners May 13, 2026 00:23

AskAlexSharov requested a review from Copilot May 13, 2026 02:27

Copilot started reviewing on behalf of AskAlexSharov May 13, 2026 02:27 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

yperbasis approved these changes May 13, 2026

View reviewed changes

JkLondon requested a review from AskAlexSharov May 13, 2026 21:06

AskAlexSharov merged commit 2b9b9ce into release/3.4 May 14, 2026
23 checks passed

AskAlexSharov deleted the cp/r34-create2-unwind-collision branch May 14, 2026 00:56

This was referenced May 18, 2026

[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state #21246

Merged

execution/stagedsync, cmd/integration: clear canonical hash above snapshot tip on reset_state #21247

Merged

yperbasis mentioned this pull request May 21, 2026

[r3.4] ChangeLog: catch up on 3.3.x/3.4.x and draft v3.4.2 #21337

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind#21157

[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind#21157
AskAlexSharov merged 2 commits into
release/3.4from
cp/r34-create2-unwind-collision

JkLondon commented May 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

AskAlexSharov May 13, 2026

Uh oh!

JkLondon May 13, 2026

Uh oh!

Uh oh!

yperbasis left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

JkLondon commented May 13, 2026

Summary

Root cause

Fix

Regression test

Limitations / what this does NOT cover

How to verify on a stuck node

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

AskAlexSharov May 13, 2026

Choose a reason for hiding this comment

Uh oh!

JkLondon May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yperbasis left a comment

Choose a reason for hiding this comment

Overview

What works well

Issues / suggestions

1. PR-description root cause is mis-attributed (docs only, not code)

2. Doc comment in helper slightly overstates the loop's edge case (stage_execute.go:175-178)

3. Minor: ReadHeadersByNumber is called twice on the rare !ok + single-header + miss path

4. Cache-eviction call in unwindExec3State still uses raw canonical (stage_execute.go:359)

5. Test coverage gap (nice-to-have)

Correctness / safety

Local verification

Recommendation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

2. Doc comment in helper slightly overstates the loop's edge case (`stage_execute.go:175-178`)

3. Minor: `ReadHeadersByNumber` is called twice on the rare `!ok` + single-header + miss path

4. Cache-eviction call in `unwindExec3State` still uses raw canonical (`stage_execute.go:359`)