[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind#21157
Conversation
Root cause of the CREATE2-collision-after-reorg bug surfaced on hoodi at
block 2789993 (release/3.4):
When the headers stage re-canonicalises a height before the execution
stage unwind runs, br.CanonicalHash(blockN) returns the new canonical
hash, but the diffset for that height was stored under the previously-
canonical (now sidechain) hash. doms.GetDiffset under the new hash
returns \!ok. The pre-fix code treated this as "first block in traversal,
nothing to unwind yet" and just `continue`d. Result: changeSet stayed
nil → sd.unwindChangesetRaw stayed nil → at Flush, the
`if sd.unwindChangesetRaw \!= nil` guard skipped AggregatorRoTx.Unwind
entirely → no tombstones written to the AccountsDomain/CodeDomain
latest tables → phantom CREATE2 contracts created on the sidechain
remained.
On the next forward execution the canonical chain hits the same
counterfactual CREATE2 address, collides with the phantom (EIP-684 /
EIP-1014: existing code or nonce\!=0 ⇒ consume-all-gas), and the block
is rejected with `gas used mismatch`.
Fix: when looking up the diffset for an unwound block, fall back from
the (now-canonical) hash to every other header hash stored at that
height. The header whose diffset is present is the block we actually
executed. Use that hash for both the diff merge and the
RevertWithDiffset cache-invalidation revertFromHash, so cache eviction
also stays surgical instead of degrading to a full clear.
This is a localised fix that does NOT depend on the parallel-commitment
infrastructure in main; the lookup is the only place that was making
the wrong assumption about canonical-hash stability across stages.
Regression test in stage_execute_unwind_test.go covers:
- canonical-hash points to executed block (direct hit)
- canonical-hash flipped after execution (fallback hit, this is the
bug being fixed; includes a sanity assertion that the pre-fix
direct GetDiffset still misses)
- no stored diffset under any known header (not-found, no spurious
matches)
Refs hoodi snapshotter incident 2026-05-11, block 2789993,
local sidechain head 0x706b073b…, canonical 0x72c42d17…/0xc4c75456…
There was a problem hiding this comment.
Pull request overview
Fixes execution-stage unwind robustness during tip reorgs by ensuring the per-block diffset is located using the hash of the actually executed block, not the possibly-updated canonical hash, preventing missed unwinds that can leave “phantom” state and trigger downstream gas used mismatch / head update failures.
Changes:
- Added
findExecutedDiffsetAtHeightto recover the correctChangeSets3diffset even after headers-stage canonical pointer flips. - Updated
unwindExec3to use the recovered diffset and pass the executed hash through to state-cache revert logic. - Added a regression test covering canonical reorg fallback behavior and genuine-not-found behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
execution/stagedsync/stage_execute.go |
Adds executed-diffset discovery helper and integrates it into unwindExec3 to unwind the correct applied state. |
execution/stagedsync/stage_execute_unwind_test.go |
Adds regression coverage for diffset lookup after canonical reorg and not-found cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -200,10 +265,10 @@ func unwindExec3(u *UnwindState, s *StageState, doms *execctx.SharedDomains, rwT | |||
| // all previous blocks | |||
| continue | |||
| } | |||
| return fmt.Errorf("domains.GetDiffset(%d, %s): not found", currentBlock, currentHash) | |||
| return fmt.Errorf("domains.GetDiffset(%d): not found under any known header hash", currentBlock) | |||
| } | |||
| if err != nil { | |||
| return err | |||
| if currentBlock == u.CurrentBlockNumber { | |||
| lastExecHash = executedHash | |||
| } | |||
…t loop iter Address Copilot review on #21157. unwindExec3 only set lastExecHash when currentBlock == u.CurrentBlockNumber. When that block has no diffset yet (the "has not been processed" edge case that triggers `continue`), lastExecHash stayed zero even after a later iteration recovered the topmost executed block's diffset. RevertWithDiffset then saw a hash mismatch and fell back to a full state-cache clear, defeating the surgical eviction the PR is meant to enable. Move the assignment into the `changeSet == nil` branch so it captures the hash of the first actually-found diffset (= topmost executed block in the unwound range). Also drop dead `stepBytes`/`diffKey` from the regression test — the slice was sliced back to `addr.Bytes()` and the comment about "inverted big-endian" was wrong (zero bytes are not inverted-BE step 0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
yperbasis
left a comment
There was a problem hiding this comment.
Overview
Fixes a state-leak bug where unwindExec3 fails to locate the per-block DomainEntryDiff set after a tip reorg, because the diffset is keyed by the actually-executed block hash while br.CanonicalHash() returns whatever canonical now points to. The miss caused sd.unwindChangesetRaw to stay nil, skipping AggregatorRoTx.Unwind and leaving "phantom" deployed contracts in the latest-state domain tables. The next forward execution of the canonical chain re-attempted the same CREATE2 and consumed the entire gas limit (EIP-684 collision), surfacing as the gas used mismatch on hoodi block 2,789,993.
The fix introduces a helper findExecutedDiffsetAtHeight that tries the canonical hash first and falls back to walking every header stored at the height.
What works well
- Surgical and localised. Only changes
unwindExec3; no scope-creep into the broader staged-sync interaction. - Preserves existing edge-case logic. The "first iteration: nothing executed yet → continue" branch (
if changeSet == nil { continue }) is retained, and the helper's not-found return cooperates with it correctly. - Good regression test. Three phases cover (1) direct hit, (2) canonical-flipped fallback — with an explicit sanity assert that the pre-fix code path (
doms.GetDiffsetdirectly under new canonical) misses, so the test fails if the fallback is reverted, and (3) genuinely-not-found. - Test runs in ~1s and uses a real
mdbx + Aggregator + temporal.Newstack — no mocks. - The Copilot finding on
lastExecHashcapture has already been fixed in the follow-up commit37a846c30, by moving the assignment into thechangeSet == nilbranch so it captures the first found diffset rather than the first loop iteration. Important — otherwise a deeper unwind whose top blocks have no diffset would have silently degraded to a full cache clear.
Issues / suggestions
1. PR-description root cause is mis-attributed (docs only, not code)
The description claims the canonical pointer flip happens in stage_headers.go via fixCanonicalChain after u.UnwindTo. Looking at the actual code (stage_headers.go:280-291):
if headerInserter.GetHighest() != 0 {
if !headerInserter.Unwind() { // ← only NON-unwind path
if err = fixCanonicalChain(...); err != nil { ... }
}
...
}fixCanonicalChain is gated by !headerInserter.Unwind(), so it does not run when an unwind is scheduled. The mechanism that actually flips canonical-before-unwind on hoodi (PoS) is addAndVerifyBlockStep in stageloop.go:560-605, which unconditionally WriteCanonicalHash(currentHash, currentHeight) every time NewPayload arrives — so a second payload at the same height as a previously-executed block flips the canonical pointer before the eventual forkchoice-triggered unwind runs.
The diagnosis of the actual bug is correct; only the explanation of how it gets triggered is off. Consider updating the PR body so the runbook references the right code path.
2. Doc comment in helper slightly overstates the loop's edge case (stage_execute.go:175-178)
"…the surrounding loop handles by continuing past the first iteration only."
The continue actually fires on every iteration where changeSet == nil && !found, not just the very first one. With a deep enough unwind range whose top N blocks were never executed, several iterations may be skipped before the first diffset is encountered. Re-word to "while no diffset has been found yet" or similar.
3. Minor: ReadHeadersByNumber is called twice on the rare !ok + single-header + miss path
When br.CanonicalHash() returns !ok and exactly one header is stored, the helper sets currentHash = nonCanonicalHeaders[0].Hash(), then falls through to doms.GetDiffset. If that misses, the post-fallback rawdb.ReadHeadersByNumber(rwTx, currentBlock) re-reads the same single header and skips it. Functionally correct, just two cursor scans where one would do. Low priority.
4. Cache-eviction call in unwindExec3State still uses raw canonical (stage_execute.go:359)
unwindToHash, err := rawdb.ReadCanonicalHash(tx, blockUnwindTo)This is for the target hash (the unwind point), and at unwind time the unwind point's canonical hash is still the original (new canonicals are only written for blocks above the unwind point), so this is safe today. Worth a one-line comment noting why this lookup is robust by construction, since the surrounding diffset lookup needed the elaborate fallback.
5. Test coverage gap (nice-to-have)
The regression test doesn't exercise the !ok && len(headers) > 1 ambiguity-error path or the multi-sidechain-at-same-height limitation that the PR explicitly calls out. Not blocking — but a 3rd phase that confirms found=false plus a specific error string when two non-canonical headers both have diffsets would lock in the documented behaviour.
Correctness / safety
- The fallback is deterministic in the common case (≤2 headers per height, only one has a stored diffset). With multiple stored diffsets at one height the helper returns the first match by
ReadHeadersByNumber's storage-prefix order — the PR explicitly accepts this in "Limitations" and the practical rarity makes it a reasonable trade-off for a backport. - The error-message change in
unwindExec3("domains.GetDiffset(%d, %s): not found"→"…not found under any known header hash") is a small behaviour change to observability but more accurate (the old message printed the hash we tried, which is now misleading after the fallback). - No new locking or transaction-scope changes; the helper takes the same
rwTxas the caller.
Local verification
$ go test -count=1 -short -run TestFindExecutedDiffsetAtHeight ./execution/stagedsync/...
ok github.com/erigontech/erigon/execution/stagedsync 1.020s
$ go test -count=1 -short ./execution/stagedsync/
ok github.com/erigontech/erigon/execution/stagedsync 1.257s
Recommendation
Approving. Please tighten the PR description's root-cause section (point 1 — the stage_headers.go story doesn't match the gated code path; the PoS addAndVerifyBlockStep flow is the real trigger). The other suggestions (comment wording, double-read, additional test phase) are non-blocking polish.
The "Cherry-pick to main" section referenced in the body appears to be missing — worth confirming a tracking PR exists, since the same bug pattern is present on main (verified by reading git show main:execution/stagedsync/stage_execute.go).
…ove snapshot tip on reset_state (#21246) ## Summary Fixes a stale-canonical-pointer leak in `integration reset_state` that was reintroducing phantom state on hoodi snapshotters running `release/3.4` even after a clean rebuild. Observed at block **2 818 468** on `snapshotter-bm-v34-hoodi-n1`: local `kv.HeaderCanonical[2 818 468] = 0xac2ee57a…` (a sidechain block) while the real canonical was `0x27db29e4…`. The next canonical block, **2 818 476**, then died with `insufficient funds for gas * price + value` on `0xdD11751cdD3f6EFf01B1f6151B640685bfa5dB4a` — a fee-recipient sweep-tx that was short by exactly `17 100 × 1 265 000 000 = 21 631 500 000 000 wei`, the EIP-2929 SSTORE-SET-vs-RESET delta for a single storage slot on `0x8684adde…` that had been first-written by the surviving sidechain block 8 blocks earlier. This is **not** a cherry-pick of a merged PR. The fix originates on `release/3.4` and pairs with #21157 (already on `release/3.4`). ## Root cause `reset_state` historically called `ResetExec`, which clears the MDBX state-domain tables (`AccountsDomain`, `StorageDomain`, `CodeDomain`, `CommitmentDomain`, `ReceiptDomain`, `RCacheDomain`), the inverted indexes and `ChangeSets3`, plus the `Execution`/`Finish`/`CustomTrace`/`TxLookup`/`Witnesses` stage progress. It deliberately left `kv.HeaderCanonical`, `kv.HeaderTD` and the `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress alone, on the assumption that those tables faithfully reflect the consensus-layer view of the chain. That assumption breaks once a forkchoice update has ever committed a canonical hash whose chain later turned out to be the wrong side of a reorg. Concretely, on the affected snapshotter: 1. An older forkchoice update for chain *A* succeeded; `WriteCanonicalHash(A, 2 818 468)` was committed. 2. CL switched to chain *B*. The forkchoice handler walked back, built `newCanonicals` including `B@2 818 468`, ran the staged unwind and reached the forward `executionPipeline.Run` step. 3. Forward execution failed on a downstream block because the unwind on a pre-#21157 binary failed to revert `CodeDomain` / `StorageDomain` writes made by *A* (see `execution/stagedsync/stage_execute.go:findExecutedDiffsetAtHeight` introduced in #21157). The forkchoice handler returned `BadBlock` and the whole tx — including the `WriteCanonicalHash(B, 2 818 468)` issued at `execution/execmodule/forkchoice.go:381` — rolled back. `kv.HeaderCanonical[2 818 468]` remained at *A*. 4. The operator upgraded to a binary containing #21157 and ran `integration reset_state`. `ResetExec` cleared the state domains, but the stale `kv.HeaderCanonical[2 818 468] = A` survived. 5. On restart the staged sync caught up from the snapshot tip (`block 2 810 162`, txNum 100 781 249, end of step 258). The execution stage read each block's body using its canonical hash, hit the stale pointer at 2 818 468, re-applied chain *A*'s block as canonical and re-installed the phantom code+slot writes on `0x8684adde…`. The subsequent canonical tx 3 of block 2 818 476 then took the SSTORE-RESET path instead of SSTORE-SET — saving exactly 17 100 gas and starving the fee-recipient sweep at tx 14. `seg integrity --check=CommitmentRoot` on the same datadir is green (both `v2.0-commitment.0-256.kv` and `v2.0-commitment.256-258.kv` recompute matching roots), confirming that no published `.kv` is contaminated. The phantom lived purely in MDBX, fed by `kv.HeaderCanonical` whose stale entries `reset_state` was failing to evict. ## Fix `ResetCanonicalAboveTip(ctx, db, frozenBlocks)` truncates `kv.HeaderCanonical` and `kv.HeaderTD` from `frozenBlocks+1` forward, caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at `frozenBlocks`, and re-anchors `HeadHeaderHash` to the canonical hash recorded at the tip (when one is present). `ResetState` takes a new `frozenBlocks` parameter and invokes the helper after `ResetExec`. `cmd/integration/commands/reset_state.go` passes `br.FrozenBlocks()`. Why this is safe to roll into `ResetState`: - The snapshot tip is by construction immutable — the canonical hashes for heights `≤ frozenBlocks` live in the frozen blocks themselves, so leaving the table contents below the tip untouched is correct. - Stage-progress writes only run when an individual stage is observably above the tip; the helper is a no-op on a clean db. - After the truncation the next forkchoice update from the consensus layer drives canonical assignments for the post-tip range fresh, with no remnant of any pre-fix sidechain commit able to mislead a forward catchup. The forkchoice-handler logic itself (`execution/execmodule/forkchoice.go`) and the `findExecutedDiffsetAtHeight` unwind path from #21157 are untouched. This PR addresses the persistent-leftover side of the problem; future reorgs on a node carrying #21157 will unwind cleanly and won't accumulate the leftover. ## Test plan `execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases: - `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` seeds `kv.HeaderCanonical` entries for heights 100..110 (including an explicit non-zero `0x99…` at height 105 to make the assertion distinguish "missing" from "zero"), sets the four stage progresses to 110 and writes a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, the test asserts that canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` has been re-anchored, and each of the four stages has been capped at 100. - `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` calls the helper on a db whose only canonical entry sits exactly at the tip and whose `Headers` stage already equals the tip; the test asserts the tip is preserved and stage progress is not regressed. Both tests use `temporaltest.NewTestDB` for a real MDBX/temporal backend, run in well under one second, and exercise the public helper directly without needing a full ExecModule. - [x] `go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...` — green. - [x] `go test -count=1 -timeout=300s ./db/test/...` — green (covers existing callers of `ResetExec`). - [x] `go build ./cmd/integration/...` — green (only production caller of `ResetState`). - [ ] CI full unit-test + lint pipeline. - [ ] Manual: run `integration reset_state --datadir=<DD> --chain=hoodi` on the affected snapshotter; verify `kv.HeaderCanonical[2 818 468]` is cleared, restart, watch catchup pass `2 818 476` cleanly, run `seg integrity --check=CommitmentRoot` before re-enabling snapshot publication. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: JkLondon <me@ilyamikheev.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Summary - Backfills `ChangeLog.md` with entries for v3.2.3, v3.3.1–v3.3.10, v3.4.0, and v3.4.1 (previously the file stopped at v3.3.0). - Drafts the v3.4.2 entry (release date TBD): two bugfixes (#21157, #21262) and two improvements (#21148, #21135). - Corrects the v3.3.0 release date to 2025-11-27 to match the GitHub publish date. ## Test plan - [ ] Skim the rendered changelog on GitHub for formatting. - [ ] Confirm v3.4.2 draft contents before tagging. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pshot tip on reset_state (erigontech#21247) ## Summary Defense-in-depth cleanup of `integration reset_state` so it also evicts stale `kv.HeaderCanonical` / `kv.HeaderTD` entries above the snapshot tip and caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at the tip. Without this, a stale canonical-hash pointer survives `reset_state` and steers the next forward catchup back onto a non-canonical block. Mirrors the `release/3.4` backport at erigontech#21246; the file structure of `execution/stagedsync/rawdbreset/reset_stages.go` and `cmd/integration/commands/reset_state.go` is identical on `main` and `release/3.4`, so the patch applies verbatim. ## Why it matters on main even after the recent unwind fixes `main` already carries the two reorg-correctness fixes that prevent **new** stale canonical pointers from accumulating: - `bfa03df625` ("parallel-commitment correctness for reorg/unwind + SD recreate") - erigontech#21157 ("execution/stagedsync: find diffset by actually-executed hash on unwind") After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh `main` build, `kv.HeaderCanonical` should only ever hold canonical hashes. However, datadirs created on older Erigon versions (including any `release/3.4` build older than 2026-05-14) can carry leftover sidechain entries in `kv.HeaderCanonical` from the time when the unwind bug was active. Today's `reset_state` clears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state that `reset_state` was meant to clear. Operators upgrading such a datadir to a fresh `main` binary will hit the same symptom that bit the `release/3.4` snapshotter on hoodi at block 2 818 476 (`insufficient funds` on a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta `17 100 × 1 265 000 000 wei`). This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh `main` datadir and fully recovers older / cross-version datadirs without requiring `integration stage_exec --unwind=N` or any other manual incantation. ## Fix `ResetCanonicalAboveTip(ctx, db, frozenBlocks)`: - `rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)` - `rawdb.TruncateTd(tx, frozenBlocks+1)` - `WriteHeadHeaderHash` to canonical hash at the tip (when one is recorded) - Caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at `frozenBlocks` `ResetState` takes a new `frozenBlocks uint64` parameter and invokes the helper after `ResetExec`. The only production caller, `cmd/integration/commands/reset_state.go`, passes `br.FrozenBlocks()`. Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip. ## Test plan `execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases against a real `temporaltest.NewTestDB` backend: - `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` — seeds canonical entries for heights 100..110 (including an explicit non-zero `0x99…` at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` is re-anchored to the tip's canonical hash, and each stage has been capped at 100. - `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` — calls the helper on a db whose only canonical entry sits exactly at the tip and whose `Headers` stage already equals the tip; asserts the tip is preserved and stage progress is not regressed. - [x] `go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...` — green. - [x] `go build ./cmd/integration/...` — green (only production caller of `ResetState`). - [ ] CI full unit-test + lint pipeline. ## Related - `release/3.4` backport: erigontech#21246 - Reorg-unwind correctness on r3.4: erigontech#21157 - Reorg-unwind correctness on main: `bfa03df625` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: JkLondon <me@ilyamikheev.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Fixes a state-leak bug in
unwindExec3that surfaces as agas used mismatchandCannot update chain headafter any reorg at the tip whose unwound block deployed a contract viaCREATE/CREATE2to a counterfactual address (safe-wallets, EIP-1167 clone factories, ERC-4337 accounts, deterministic deployers). Reproduced on hoodi at block 2 789 993 on a node runningrelease/3.4— local sidechain head0x706b073b…, canonical0x72c42d17…/0xc4c75456…, divergence+72 524gas in tx0x73652205af0b85e7…(idx 7) which goes fromstatus=1, gasUsed=171 004on canonical tostatus=0, gasUsed=243 528(consume-all-gas) locally.This is not a cherry-pick of a merged PR. The fix originated on
release/3.4after diagnosing the hoodi snapshotter incident on 2026-05-11. The corresponding bug is also present onmain(see "Cherry-pick to main" below).Root cause
unwindExec3looks up the per-blockDomainEntryDiffset viadoms.GetDiffset(rwTx, br.CanonicalHash(currentBlock), currentBlock). The diffset, however, is keyed by the hash of the block that was actually executed, not by whatever the current canonical hash happens to be at unwind time.The headers stage processes a reorg in two steps inside a single transaction (
execution/stagedsync/stage_headers.go):u.UnwindTo(unwindTo, …)— schedules an unwind for subsequent stages.fixCanonicalChain(…)— overwriteskv.HeaderCanonicalto point at the new chain.The canonical pointer flip happens before the execution-stage unwind actually runs. When
unwindExec3then asks forbr.CanonicalHash(currentBlock)it gets the new canonical hash, but the diffset for that height is sitting inChangeSets3under the previous (now sidechain) hash that was actually applied.The pre-existing fallback at
stage_execute.go:191-204only handled the case whereCanonicalHashreturns!ok(no canonical pointer at this height — meaning we executed a block that was never marked canonical). It did not handle the case whereCanonicalHashreturns the new canonical and the diffset is missing under that new hash. In that case the code treated the miss as the "first block in traversal, nothing to unwind yet" edge case and silentlycontinued.The downstream effect is severe:
changeSetstaysnil→unwindExec3Stategetsnil→ the loop that collects state changes intostateChangesis skipped →sd.Unwind(txUnwindTo, nil)→sd.unwindChangesetRaw = nil.TemporalMemBatch.Flushthe guardif sd.unwindChangesetRaw != nil(temporal_mem_batch.go:596-606) skipsAggregatorRoTx.Unwindentirely.AccountsDomain/CodeDomainrows for the contracts deployed in the unwound sidechain block. The previous tombstone-restore fix #20627 / e1526bf is therefore bypassed entirely — its[]byte{}tombstone write never gets a chance to run because the diffset that drives it is "not found".The execution stage progress is then rolled back through the staged-sync machinery (
Headers,Bodies,Sendersreset), but the domain (latest state) tables in MDBX still reflect the sidechain. On the next forward execution the canonical chain re-attempts the same CREATE2 against the now-phantom contract. EIP-684 / EIP-1014 collision rules consume all remaining gas of that tx — exactly what shows up in logs:243528is the tx gas limit;171004is what canonical reference RPC (https://rpc.hoodi.ethpandaops.io) reports for the same tx;243528 − 171004 = 72524, matching the entire block-level mismatch.Fix
Localised to one function. Does not require any of the parallel-commitment / SD-recreate infrastructure that landed in main (
bfa03df625,95d96627ce, etc.) — the diffset lookup is the only place that was making an incorrect assumption about canonical-hash stability across stages.New helper
findExecutedDiffsetAtHeightinexecution/stagedsync/stage_execute.go:!ok, walks every header stored at this height viarawdb.ReadHeadersByNumberand tries each one. The header whose diffset is present is by construction the block that was actually executed.executedHashfor the caller.unwindExec3is updated to:lastExecHashfrom the helper-returnedexecutedHashfor the topmost unwound block (u.CurrentBlockNumber), rather than re-readingbr.CanonicalHash(u.CurrentBlockNumber)afterwards (which would suffer the same canonical-flip miscompare).unwindExec3State, which forwards it tostateCache.RevertWithDiffset(changeset, lastExecutedBlockHash, unwindToHash). The cache'sRevertWithDiffsetcompares its own storedblockHashagainstrevertFromHash— with the corrected hash, surgical eviction works and the cache no longer degrades to a full clear after every tip reorg.The fallback search is bounded by the number of headers stored at a given height (effectively 1 in the common case, 2 in a 1-deep reorg, and so on). Order of headers returned by
ReadHeadersByNumberis by storage prefix, but only one of them has a stored diffset, so the search is deterministic.Regression test
execution/stagedsync/stage_execute_unwind_test.gocovers three phases:executedHash = hOld.hNew) — the test first asserts that the pre-fix directdoms.GetDiffset(tx, hNew, height)returns!ok(documenting the bug), then asserts thatfindExecutedDiffsetAtHeightrecovers via fallback and still returnsexecutedHash = hOldwith the original diff list intact.found=false, no spurious matches.Test runs in ~100 ms and exercises the helper against a real
mdbx + Aggregator + temporal.Newstack with afreezeblocks.BlockReader(no snapshots required, the unwind range is at the tip).Limitations / what this does NOT cover
ChangeSets3ends up holding diffsets for more than one block at the same height, the fallback returns the first non-canonical hash whose diffset is present. In practice the staged-sync unwind path immediately invalidates the stored diffset of any unwound block (via the subsequent forward execution), so this is rare; but a fully robust fix would persist a separateexecution-applied head hashper height. Out of scope here.kv.HeaderCanonicalbefore execution stage unwind runs) is unchanged. This PR makes the execution-stage unwind robust against that ordering rather than restructuring stage interaction.How to verify on a stuck node
The runbook for an already-affected datadir (no rebuild required, snapshots are not contaminated because the phantom lives only in the MDBX
latestdomain tables at the tip) is:With the fix in place, future reorgs at the tip that involve CREATE2-to-counterfactual transactions will unwind cleanly and the same scenario will not recur.
Test plan
go test -count=1 -short ./execution/stagedsync/... ./db/state/...— green.TestFindExecutedDiffsetAtHeight_FallsBackAfterCanonicalReorg— green; embeds a sanity-assert that the pre-fix direct lookup misses (so the test fails if the fallback is reverted).reset_stateand restart, block 2 789 993 must be accepted with hash0xc4c754565911f13b96a841782095a21a32806f927307ffa71695ecd99801c236.🤖 Generated with Claude Code