Skip to content

[performance] fix issue #20169: StateCache + collation/pruning race fixes#20771

Closed
AskAlexSharov wants to merge 8 commits intoperformancefrom
cherry-pick/performance-issue-20169
Closed

[performance] fix issue #20169: StateCache + collation/pruning race fixes#20771
AskAlexSharov wants to merge 8 commits intoperformancefrom
cherry-pick/performance-issue-20169

Conversation

@AskAlexSharov
Copy link
Copy Markdown
Collaborator

@AskAlexSharov AskAlexSharov commented Apr 24, 2026

Cherry-picks to performance branch for issue #20169 (gas mismatches / stale reads):

Hive rpc-compat fixes:

yperbasis and others added 3 commits April 24, 2026 11:17
…g stale reads (#20265)

## Summary

Fix a bug in `StateCache.Get()`/`Put()` that caused deleted storage keys
to be treated as cache misses, allowing stale pre-delete values to be
read from the DB.

**Root cause of #20169** (Erigon nodes stuck on Hoodi).

## The bug

`StateCache` is an in-memory cache in `SharedDomains` that caches
storage values between transactions in the same batch. When a storage
slot is deleted (written to zero via `DomainDel`):

1. `DomainDel` → `stateCache.Delete(key)` — removes key from cache
2. `GetLatest` reads mem batch → gets `nil` → `stateCache.Put(key, nil)`
— but `Put` sees `len(nil)==0` → calls `cache.Delete(key)` again
(removes from cache!)
3. Next tx reads → `stateCache.Get(key)` → key not found → returns
`(nil, false)` — **cache miss**
4. Falls through to DB → returns the old non-zero value that's still in
MDBX

## Impact

On Hoodi, block 2523978 tx[21] cleared 5 storage slots at contract
`0x1d150609ee9edcc6143506ba55a4faaedd562cd9`. The clearing was lost in
the cache, so those slots retained stale non-zero values. 1,525 blocks
later, block 2525503 tx[25] hit those slots: 5 SSTOREs cost 91,100 less
gas than expected (warm/existing writes instead of cold/new slot
creation), causing a gas mismatch that stuck the node.

## Fix

- **`Get`**: Remove the `len(v)==0 → false` short-circuit so that cached
empty values are returned as `([]byte{}, true)` (cache hit with empty
value), not `(nil, false)` (cache miss).
- **`Put`**: When `len(value)==0`, store `[]byte{}` sentinel instead of
deleting the key, so the "deleted" state is preserved in cache.

## Why release/3.3 is not affected

release/3.3's `SharedDomains` does not have `stateCache`. All reads go
directly to the mem batch → DB, bypassing the buggy cache layer.

## Test plan

- [x] `go test ./execution/cache/...` passes
- [x] Re-execution from clean snapshots on Hoodi produces correct state
(block 2525503 passes with `diff=0`)
- [ ] Full Hoodi sync from scratch with this fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…he tests (#20498)

## Summary

- Remove `TestStateCacheDeletedStorageSSTOREGas` — it duplicated
`TestStateCache_PutEmpty_ThenGet_IsCacheHit` and incorrectly attributed
the SSTORE gas mismatch (#20169) to the StateCache layer
- Simplify the comment on `TestStateCache_PutEmpty_ThenGet_IsCacheHit`
to frame it as a cache correctness property, not a regression test for
#20169

The StateCache fix (PR #20265) is a valid cache correctness improvement,
but the root cause of #20169 was in the domain/snapshot layer — a cache
miss falling through to the membatch/MDBX path should return the correct
value. The real fix was PR #20483.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
…lesInBackground (#20445)

Fixes two related bugs in the domain state layer that cause gas
mismatches during execution (#20169).

`BuildFilesInBackground` collates domain files by reading values from
the DB via per-worker read-transactions opened at collation time.
Between the moment a step is deemed ready and the collation reads,
execution can commit new batches that overwrite step S values with step
S+1 data. The collated file for step S then contains wrong values. After
pruning removes the DB entries, `GetLatest` returns the stale file
values, causing SSTORE gas mispricing (20,000 vs 2,900 = 17,100 gas per
affected slot).

**Fix:** Restructure `buildFiles` into two phases:
1. **Phase 1 (sequential, single read-tx):** Collate all domains and
indices using one MDBX read-transaction. All collations see the exact
same DB snapshot — zero race window.
2. **Phase 2 (parallel, no DB access):** Build files from collations.
This is the expensive part and remains fully parallel.

Additionally, add a committed txNum guard: don't collate step S until
`ComputeCommitment` has confirmed all data through the end of step S is
flushed.

After a reorg, the unwind restores domain entries tagged with a step
derived from the unwind-target txNum. If `BuildFilesInBackground` has
filed that step, `getLatestFromDb` discards the restored entry (step
covered by files) and falls through to `getLatestFromFiles`, returning
the stale end-of-step value instead of the changeset-restored value.

The fix must use the **current** (atomically published) file boundary
from `Aggregator.EndTxNumMinimax()`, not the tx-scoped
`DomainRoTx.FirstStepNotInFiles()` — because `BuildFilesInBackground`
may have filed new steps since the tx was opened, and `getLatestFromDb`
will see those via the current view.

**Fix:** Pass `Aggregator.EndTxNumMinimax()` into the unwind and tag
restored entries with `max(naturalStep, currentFilesEndStep)`.

- `db/state/aggregator.go`: single-tx collation + parallel file
building; committed txNum guard; `stepFullyCommitted` helper; pass
current file boundary to unwind
- `db/state/domain.go`: bump unwind step tag past current filed range
- `db/state/aggregator_test.go`: `TestAggregator_CommittedTxNumGuard`
- `db/state/domain_test.go`:
`TestDomain_CollationIsolatedFromLaterSteps`,
`TestDomain_UnwindRestoredEntryVisibility`
- `execution/commitment/commitmentdb/commitment_context.go`: export
`DecodeTxBlockNums`; fix `minUnwindale` typo; short-value length guard

- [x] `make lint` passes
- [x] `TestDomain_CollationIsolatedFromLaterSteps` — step S collation
ignores step S+1 data
- [x] `TestDomain_UnwindRestoredEntryVisibility` — unwind-restored
entries survive `getLatestFromDb` after filing (fails without fix)
- [x] `TestAggregator_CommittedTxNumGuard` — guard blocks premature file
building
- [x] Integration: sync mainnet to chain tip without gas mismatches,
including after reorgs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
AskAlexSharov and others added 4 commits April 24, 2026 11:49
This PR fixes 3 testing_ api after merge(today) of PR 783
execution_apis:

- engine_types/jsonrpc.go: add omitempty to SlotNumber JSON tag so nil
slot number is omitted from ExecutionPayloadV3 responses (Prague chain
does not include slotNumber, only Amsterdam/V4+)
- testing_api.go: return real blockValue from assembler instead of
hardcoding zero;
- nonce validation errors now use -32000 (execution error) instead of
-32602 (invalid params) to match execution-apis spec

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR fixes hive after merge (today) PR 783 on execution-apis:
- Remove baseFee bucket from txpool_content, txpool_contentFrom and
txpool_status responses: the field is not in the execution-apis spec and
causes hive rpc-compat failures

- computeGasPrice: compute effective gasPrice for EIP-1559 pending
transactions using current header baseFee, instead of returning nil when
blockHash is zero

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…rt, add explicit tx list (#20592)

Cherry-pick of PR #20592 adapted for the performance branch proto-based interface:

- Move testing_ namespace registration from engine_server.go (port 8551) to
  rpc/jsonrpc/daemon.go (port 8545) — hive rpc-compat tests use port 8545
- Add explicit transaction list support: null→mempool, []→empty block, [...]→exact list
  with per-sender nonce validation against current state
- Add CustomTxnProvider to builder.Parameters (forwarded via AssembleBlockWithParams
  on ExecModule, accessed via two-step type assertion to avoid import cycles)
- NewTestingImpl now takes (logger, db) instead of enabled bool; gating moved to
  API registration level (include "testing" in --http.api to enable)
- node/eth/backend.go: create testingEntry when "testing" is in http.api list
@AskAlexSharov
Copy link
Copy Markdown
Collaborator Author

Dispatched hive rpc-compat manually after identifying root cause: all 4 testing_buildBlockV1 tests were failing because the testing namespace was registered on the engine API port (8551) rather than the HTTP API port (8545) used by hive rpc-compat.

Root cause: PR #20592 (which moved registration to rpc/jsonrpc/daemon.go on port 8545 and added explicit tx list support) was not cherry-picked.

Fix: Added a performance-branch-compatible adaptation of #20592 — keeps the proto-based AssembleBlock/GetAssembledBlock calls while adding AssembleBlockWithParams to ExecModule for the explicit-transaction path, accessed via two-step type assertion to avoid circular imports.

Copy link
Copy Markdown
Member

@yperbasis yperbasis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cherry pick? We should simply merge the latest main into the performance branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants