Skip to content

execution/cache: fix StateCache dropping deleted storage keys, causing stale reads#20265

Merged
yperbasis merged 5 commits intomainfrom
fix/state-cache-delete-bug
Apr 1, 2026
Merged

execution/cache: fix StateCache dropping deleted storage keys, causing stale reads#20265
yperbasis merged 5 commits intomainfrom
fix/state-cache-delete-bug

Conversation

@yperbasis
Copy link
Copy Markdown
Member

Summary

Fix a bug in StateCache.Get()/Put() that caused deleted storage keys to be treated as cache misses, allowing stale pre-delete values to be read from the DB.

Root cause of #20169 (Erigon nodes stuck on Hoodi).

The bug

StateCache is an in-memory cache in SharedDomains that caches storage values between transactions in the same batch. When a storage slot is deleted (written to zero via DomainDel):

  1. DomainDelstateCache.Delete(key) — removes key from cache
  2. GetLatest reads mem batch → gets nilstateCache.Put(key, nil) — but Put sees len(nil)==0 → calls cache.Delete(key) again (removes from cache!)
  3. Next tx reads → stateCache.Get(key) → key not found → returns (nil, false)cache miss
  4. Falls through to DB → returns the old non-zero value that's still in MDBX

Impact

On Hoodi, block 2523978 tx[21] cleared 5 storage slots at contract 0x1d150609ee9edcc6143506ba55a4faaedd562cd9. The clearing was lost in the cache, so those slots retained stale non-zero values. 1,525 blocks later, block 2525503 tx[25] hit those slots: 5 SSTOREs cost 91,100 less gas than expected (warm/existing writes instead of cold/new slot creation), causing a gas mismatch that stuck the node.

Fix

  • Get: Remove the len(v)==0 → false short-circuit so that cached empty values are returned as ([]byte{}, true) (cache hit with empty value), not (nil, false) (cache miss).
  • Put: When len(value)==0, store []byte{} sentinel instead of deleting the key, so the "deleted" state is preserved in cache.

Why release/3.3 is not affected

release/3.3's SharedDomains does not have stateCache. All reads go directly to the mem batch → DB, bypassing the buggy cache layer.

Test plan

  • go test ./execution/cache/... passes
  • Re-execution from clean snapshots on Hoodi produces correct state (block 2525503 passes with diff=0)
  • Full Hoodi sync from scratch with this fix

🤖 Generated with Claude Code

StateCache.Get() treated empty/nil cached values as cache misses,
returning (nil, false). This caused reads after a DomainDel to fall
through to the DB and return the stale pre-delete value:

1. DomainDel → stateCache.Delete(key)
2. GetLatest reads mem batch → nil → stateCache.Put(key, nil)
3. Put sees len(nil)==0 → cache.Delete(key) — removes from cache!
4. Next tx: stateCache.Get(key) → not found → falls through to DB
5. DB returns old non-zero value → state corruption

This caused Erigon nodes on Hoodi to get stuck: block 2523978
cleared 5 storage slots, but the clearing was lost in the cache,
so block 2525503 saw stale non-zero values causing a 91,100 gas
mismatch.

Fix:
- Get: remove the len(v)==0 → false short-circuit
- Put: store []byte{} sentinel for empty values instead of deleting

The bug only affects main; release/3.3 does not have StateCache in
SharedDomains.

Closes #20169

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis linked an issue Apr 1, 2026 that may be closed by this pull request
@yperbasis yperbasis changed the title cache: fix StateCache dropping deleted storage keys, causing stale reads execution/cache: fix StateCache dropping deleted storage keys, causing stale reads Apr 1, 2026
@awskii
Copy link
Copy Markdown
Member

awskii commented Apr 1, 2026

There's a subtle issue with the sentinel approach.

In Put(), when value is empty, you store []byte{} as a sentinel:

cache.Put(key, []byte{})

But GenericCache.Get() filters out zero-size values:

if !ok || c.sizeFunc(value) == 0 {
    return zero, false  // sentinel is treated as "not found"!
}

Since sizeFunc([]byte{}) == len([]byte{}) == 0, the sentinel is stored but immediately dropped on read. StateCache.Get() will return (nil, false) for deleted keys — same as before — so the stale DB read still happens.

To fix this properly, either:

  1. Use a non-empty sentinel (e.g. a single-byte tombstone []byte{0}) and handle it in the callers, or
  2. Use a separate deleted-key set/map in DomainCache, or
  3. Change GenericCache.Get() to distinguish "absent" from "deleted" — e.g. return a special tombstone value or a third boolean.

The Get() simplification to return cache.Get(key) is correct in intent, but won't work until the underlying cache actually preserves the empty sentinel through the size check.

yperbasis and others added 4 commits April 1, 2026 12:44
…ession tests

GenericCache.Get had the same bug: `sizeFunc(value) == 0` caused it to
return false for empty []byte{} sentinels stored by the StateCache fix.

Add regression tests that reproduce the exact SharedDomains pattern:
Put(key, nil) followed by Get(key) must return (_, true) — a cache hit
with empty value, not a cache miss.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exercises the exact sequence that caused Hoodi nodes to get stuck:

  1. Put(key, value) — cache a storage value
  2. Delete(key) — simulate DomainDel
  3. Put(key, nil) — simulate GetLatest caching a deletion from mem buffer
  4. Get(key) — must return (nil, true), not (nil, false)

Step 4 is the critical assertion: with the bug, Get returns false
(cache miss), causing the caller to fall through to the DB and read
the stale pre-delete value. This caused SSTORE gas to be mispriced
(warm noop instead of cold new-slot creation), producing a 91,100 gas
mismatch at Hoodi block 2525503.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…yte{}

common.Copy (bytes.Clone) returns nil for nil input, so the
len(value)==0 branch is unnecessary. Storing nil directly in the
map works: GenericCache.Get returns (nil, true) for it, which is
a more natural return value for deleted keys.

Also fix import ordering in blockchain_test.go.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as ready for review April 1, 2026 11:12
@yperbasis yperbasis requested a review from mh0lt as a code owner April 1, 2026 11:12
@yperbasis yperbasis enabled auto-merge April 1, 2026 11:17
@yperbasis yperbasis requested a review from AskAlexSharov April 1, 2026 11:28
@yperbasis yperbasis added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit ec45be4 Apr 1, 2026
35 of 38 checks passed
@yperbasis yperbasis deleted the fix/state-cache-delete-bug branch April 1, 2026 14:09
yperbasis added a commit that referenced this pull request Apr 1, 2026
…g stale reads (#20265)

## Summary

Fix a bug in `StateCache.Get()`/`Put()` that caused deleted storage keys
to be treated as cache misses, allowing stale pre-delete values to be
read from the DB.

**Root cause of #20169** (Erigon nodes stuck on Hoodi).

## The bug

`StateCache` is an in-memory cache in `SharedDomains` that caches
storage values between transactions in the same batch. When a storage
slot is deleted (written to zero via `DomainDel`):

1. `DomainDel` → `stateCache.Delete(key)` — removes key from cache
2. `GetLatest` reads mem batch → gets `nil` → `stateCache.Put(key, nil)`
— but `Put` sees `len(nil)==0` → calls `cache.Delete(key)` again
(removes from cache!)
3. Next tx reads → `stateCache.Get(key)` → key not found → returns
`(nil, false)` — **cache miss**
4. Falls through to DB → returns the old non-zero value that's still in
MDBX

## Impact

On Hoodi, block 2523978 tx[21] cleared 5 storage slots at contract
`0x1d150609ee9edcc6143506ba55a4faaedd562cd9`. The clearing was lost in
the cache, so those slots retained stale non-zero values. 1,525 blocks
later, block 2525503 tx[25] hit those slots: 5 SSTOREs cost 91,100 less
gas than expected (warm/existing writes instead of cold/new slot
creation), causing a gas mismatch that stuck the node.

## Fix

- **`Get`**: Remove the `len(v)==0 → false` short-circuit so that cached
empty values are returned as `([]byte{}, true)` (cache hit with empty
value), not `(nil, false)` (cache miss).
- **`Put`**: When `len(value)==0`, store `[]byte{}` sentinel instead of
deleting the key, so the "deleted" state is preserved in cache.

## Why release/3.3 is not affected

release/3.3's `SharedDomains` does not have `stateCache`. All reads go
directly to the mem batch → DB, bypassing the buggy cache layer.

## Test plan

- [x] `go test ./execution/cache/...` passes
- [x] Re-execution from clean snapshots on Hoodi produces correct state
(block 2525503 passes with `diff=0`)
- [ ] Full Hoodi sync from scratch with this fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yperbasis added a commit that referenced this pull request Apr 1, 2026
… causing stale reads (#20271)

Cherry-pick of #20265 to `release/3.4`.

## Summary

Fix a bug in `StateCache.Get()`/`Put()` that caused deleted storage keys
to be treated as cache misses, allowing stale pre-delete values to be
read from the DB.

**Root cause of #20169** (Erigon nodes stuck on Hoodi).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yperbasis added a commit that referenced this pull request Apr 12, 2026
The StateCache fix (PR #20265) is a valid cache correctness improvement,
but it wasn't the root cause of Issue #20169 — the real fix was in the
domain/snapshot layer (PR #20483). A cache miss falling through to the
membatch/MDBX path should return the correct value.

- Remove TestStateCacheDeletedStorageSSTOREGas (duplicates
  TestStateCache_PutEmpty_ThenGet_IsCacheHit, with wrong attribution)
- Simplify TestStateCache_PutEmpty_ThenGet_IsCacheHit comment to frame
  it as a cache correctness property, not a regression test for #20169

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-merge-queue Bot pushed a commit that referenced this pull request Apr 12, 2026
…he tests (#20498)

## Summary

- Remove `TestStateCacheDeletedStorageSSTOREGas` — it duplicated
`TestStateCache_PutEmpty_ThenGet_IsCacheHit` and incorrectly attributed
the SSTORE gas mismatch (#20169) to the StateCache layer
- Simplify the comment on `TestStateCache_PutEmpty_ThenGet_IsCacheHit`
to frame it as a cache correctness property, not a regression test for
#20169

The StateCache fix (PR #20265) is a valid cache correctness improvement,
but the root cause of #20169 was in the domain/snapshot layer — a cache
miss falling through to the membatch/MDBX path should return the correct
value. The real fix was PR #20483.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
github-merge-queue Bot pushed a commit that referenced this pull request Apr 13, 2026
## Summary

- Wire the existing `RevertWithDiffset` into `unwindExec3` so the state
cache is surgically invalidated during unwinds — keys touched by the
unwound blocks are evicted, untouched keys remain cached
- Previously `RevertWithDiffset` was implemented but never called; the
cache survived unwinds only because `ValidateAndPrepare` cleared it on
the next block (hash mismatch fallback)
- The `lastExecHash` (already fetched with a comment anticipating this
wiring) detects if the cache was modified by a different execution path
(e.g. rolled-back `ValidatePayload`), falling back to a full clear

## Context

Follow-up to #20265 and #20483. Those PRs fixed correctness bugs in the
cache layer and domain unwind; this PR completes the picture by actually
invalidating the cache during unwinds rather than relying on the
next-block safety net.

## Test plan

- [x] `TestRevertWithDiffset_SurgicalEviction` — touched keys evicted,
untouched preserved, blockHash updated
- [x] `TestRevertWithDiffset_HashMismatch_ClearsAll` — mismatched
revertFromHash triggers full clear
- [x] `TestRevertWithDiffset_AccountChange_EvictsCode` — account diffs
also evict corresponding code entries
- [x] `TestRevertWithDiffset_ThenValidateAndPrepare_Continuity` —
end-to-end: surgical revert then re-execution preserves surviving cache
data
- [x] `make lint` clean
- [ ] Full sync test on a testnet with unwinds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gas mismatches on mainnet & Hoodi

3 participants