[performance] fix issue #20169: StateCache + collation/pruning race fixes#20771
[performance] fix issue #20169: StateCache + collation/pruning race fixes#20771AskAlexSharov wants to merge 8 commits intoperformancefrom
Conversation
…g stale reads (#20265) ## Summary Fix a bug in `StateCache.Get()`/`Put()` that caused deleted storage keys to be treated as cache misses, allowing stale pre-delete values to be read from the DB. **Root cause of #20169** (Erigon nodes stuck on Hoodi). ## The bug `StateCache` is an in-memory cache in `SharedDomains` that caches storage values between transactions in the same batch. When a storage slot is deleted (written to zero via `DomainDel`): 1. `DomainDel` → `stateCache.Delete(key)` — removes key from cache 2. `GetLatest` reads mem batch → gets `nil` → `stateCache.Put(key, nil)` — but `Put` sees `len(nil)==0` → calls `cache.Delete(key)` again (removes from cache!) 3. Next tx reads → `stateCache.Get(key)` → key not found → returns `(nil, false)` — **cache miss** 4. Falls through to DB → returns the old non-zero value that's still in MDBX ## Impact On Hoodi, block 2523978 tx[21] cleared 5 storage slots at contract `0x1d150609ee9edcc6143506ba55a4faaedd562cd9`. The clearing was lost in the cache, so those slots retained stale non-zero values. 1,525 blocks later, block 2525503 tx[25] hit those slots: 5 SSTOREs cost 91,100 less gas than expected (warm/existing writes instead of cold/new slot creation), causing a gas mismatch that stuck the node. ## Fix - **`Get`**: Remove the `len(v)==0 → false` short-circuit so that cached empty values are returned as `([]byte{}, true)` (cache hit with empty value), not `(nil, false)` (cache miss). - **`Put`**: When `len(value)==0`, store `[]byte{}` sentinel instead of deleting the key, so the "deleted" state is preserved in cache. ## Why release/3.3 is not affected release/3.3's `SharedDomains` does not have `stateCache`. All reads go directly to the mem batch → DB, bypassing the buggy cache layer. ## Test plan - [x] `go test ./execution/cache/...` passes - [x] Re-execution from clean snapshots on Hoodi produces correct state (block 2525503 passes with `diff=0`) - [ ] Full Hoodi sync from scratch with this fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…he tests (#20498) ## Summary - Remove `TestStateCacheDeletedStorageSSTOREGas` — it duplicated `TestStateCache_PutEmpty_ThenGet_IsCacheHit` and incorrectly attributed the SSTORE gas mismatch (#20169) to the StateCache layer - Simplify the comment on `TestStateCache_PutEmpty_ThenGet_IsCacheHit` to frame it as a cache correctness property, not a regression test for #20169 The StateCache fix (PR #20265) is a valid cache correctness improvement, but the root cause of #20169 was in the domain/snapshot layer — a cache miss falling through to the membatch/MDBX path should return the correct value. The real fix was PR #20483. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
…lesInBackground (#20445) Fixes two related bugs in the domain state layer that cause gas mismatches during execution (#20169). `BuildFilesInBackground` collates domain files by reading values from the DB via per-worker read-transactions opened at collation time. Between the moment a step is deemed ready and the collation reads, execution can commit new batches that overwrite step S values with step S+1 data. The collated file for step S then contains wrong values. After pruning removes the DB entries, `GetLatest` returns the stale file values, causing SSTORE gas mispricing (20,000 vs 2,900 = 17,100 gas per affected slot). **Fix:** Restructure `buildFiles` into two phases: 1. **Phase 1 (sequential, single read-tx):** Collate all domains and indices using one MDBX read-transaction. All collations see the exact same DB snapshot — zero race window. 2. **Phase 2 (parallel, no DB access):** Build files from collations. This is the expensive part and remains fully parallel. Additionally, add a committed txNum guard: don't collate step S until `ComputeCommitment` has confirmed all data through the end of step S is flushed. After a reorg, the unwind restores domain entries tagged with a step derived from the unwind-target txNum. If `BuildFilesInBackground` has filed that step, `getLatestFromDb` discards the restored entry (step covered by files) and falls through to `getLatestFromFiles`, returning the stale end-of-step value instead of the changeset-restored value. The fix must use the **current** (atomically published) file boundary from `Aggregator.EndTxNumMinimax()`, not the tx-scoped `DomainRoTx.FirstStepNotInFiles()` — because `BuildFilesInBackground` may have filed new steps since the tx was opened, and `getLatestFromDb` will see those via the current view. **Fix:** Pass `Aggregator.EndTxNumMinimax()` into the unwind and tag restored entries with `max(naturalStep, currentFilesEndStep)`. - `db/state/aggregator.go`: single-tx collation + parallel file building; committed txNum guard; `stepFullyCommitted` helper; pass current file boundary to unwind - `db/state/domain.go`: bump unwind step tag past current filed range - `db/state/aggregator_test.go`: `TestAggregator_CommittedTxNumGuard` - `db/state/domain_test.go`: `TestDomain_CollationIsolatedFromLaterSteps`, `TestDomain_UnwindRestoredEntryVisibility` - `execution/commitment/commitmentdb/commitment_context.go`: export `DecodeTxBlockNums`; fix `minUnwindale` typo; short-value length guard - [x] `make lint` passes - [x] `TestDomain_CollationIsolatedFromLaterSteps` — step S collation ignores step S+1 data - [x] `TestDomain_UnwindRestoredEntryVisibility` — unwind-restored entries survive `getLatestFromDb` after filing (fails without fix) - [x] `TestAggregator_CommittedTxNumGuard` — guard blocks premature file building - [x] Integration: sync mainnet to chain tip without gas mismatches, including after reorgs 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
This PR fixes 3 testing_ api after merge(today) of PR 783 execution_apis: - engine_types/jsonrpc.go: add omitempty to SlotNumber JSON tag so nil slot number is omitted from ExecutionPayloadV3 responses (Prague chain does not include slotNumber, only Amsterdam/V4+) - testing_api.go: return real blockValue from assembler instead of hardcoding zero; - nonce validation errors now use -32000 (execution error) instead of -32602 (invalid params) to match execution-apis spec Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR fixes hive after merge (today) PR 783 on execution-apis: - Remove baseFee bucket from txpool_content, txpool_contentFrom and txpool_status responses: the field is not in the execution-apis spec and causes hive rpc-compat failures - computeGasPrice: compute effective gasPrice for EIP-1559 pending transactions using current header baseFee, instead of returning nil when blockHash is zero Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…rt, add explicit tx list (#20592) Cherry-pick of PR #20592 adapted for the performance branch proto-based interface: - Move testing_ namespace registration from engine_server.go (port 8551) to rpc/jsonrpc/daemon.go (port 8545) — hive rpc-compat tests use port 8545 - Add explicit transaction list support: null→mempool, []→empty block, [...]→exact list with per-sender nonce validation against current state - Add CustomTxnProvider to builder.Parameters (forwarded via AssembleBlockWithParams on ExecModule, accessed via two-step type assertion to avoid import cycles) - NewTestingImpl now takes (logger, db) instead of enabled bool; gating moved to API registration level (include "testing" in --http.api to enable) - node/eth/backend.go: create testingEntry when "testing" is in http.api list
|
Dispatched hive rpc-compat manually after identifying root cause: all 4 Root cause: PR #20592 (which moved registration to Fix: Added a performance-branch-compatible adaptation of #20592 — keeps the proto-based |
yperbasis
left a comment
There was a problem hiding this comment.
Why cherry pick? We should simply merge the latest main into the performance branch.
Cherry-picks to
performancebranch for issue #20169 (gas mismatches / stale reads):Hive rpc-compat fixes:
SlotNumberomitempty)