[SharovBot] fix(txpool): comprehensive stale tx eviction for pending and queued sub-pools#19393
[SharovBot] fix(txpool): comprehensive stale tx eviction for pending and queued sub-pools#19393Giulio2002 wants to merge 2 commits intomainfrom
Conversation
…ub-pools Bug #1 (stale pending txs): onSenderStateChange is only called for senders that appear in the EVM state-diff batch. On AuRa/Gnosis Chain, senders whose txns are mined via system transactions never appear in the state-diff, so their pending txns are never evicted. Fix: in addTxnsOnNewBlock, build a minedSenderMinNonce map from minedTxns (nonce N mined => on-chain nonce >= N+1), add all mined-tx senders to sendersWithChangedState, and floor the kvcache nonce with the mined floor when calling onSenderStateChange. Bug #2 (zombie queued txs): There was no eviction for queued transactions whose nonce is impossibly far ahead of the sender's on-chain nonce. Accounts on Gnosis Chain were seen with on-chain nonce=281 but queued nonces=16814+ (gap=16533). These txns can never become pending, yet sit in the pool indefinitely. Fix: add MaxNonceGap config (default 64) to txpoolcfg.Config. In onSenderStateChange, evict txns where tx.Nonce - noGapsNonce > MaxNonceGap with the new NonceTooDistant discard reason. noGapsNonce accounts for consecutive txns already in the pool, so valid long pending queues are not affected. Also improve reason tracking in onSenderStateChange: use a parallel toDelReasons slice so each evicted tx gets the correct DiscardReason (NonceTooLow vs NonceTooDistant). Tests added: - TestZombieQueuedEviction: verifies zombie txns are evicted, boundary tx kept, consecutive txns never zombie-evicted - TestStalePendingEvictionViaMineNonce: verifies stale pending tx evicted via mined-nonce floor even without EVM state-diff UPSERT event for sender
|
I tested on Before this fix: With this fix and after 4+ hours running node: |
yperbasis
left a comment
There was a problem hiding this comment.
I'm fine with with the fix for bug no 2 (MaxNonceGap), but not with the fix for bug 1 (missing nonce updates for Gnosis). A proper fix for it would be to ensure that such nonce changes are present in the state diffs.
More generally, it's better to tackle separate problems in separate PRs. Let's split this one into two for the two bugs and re-do the fix for bug no 1.
#2) Add MaxNonceGap config (default 64) and NonceTooDistant discard reason. In onSenderStateChange, evict queued transactions whose nonce exceeds the sender's on-chain nonce (accounting for consecutive txns already in pool) by more than MaxNonceGap. These 'zombie' txns can never become pending and cause unbounded queued pool bloat (e.g. nonce 144968 vs on-chain 6398 on Gnosis Chain Erigon 3.3.8). Also fix toDelReasons to track correct discard reason per evicted txn instead of always logging NonceTooLow. Extracted from #19393 per @yperbasis review: split separate bugs into separate PRs. Closes: related to #19393
|
[SharovBot] 📋 Addressing @yperbasis review — splitting into separate PRs as requested: Bug #2 (MaxNonceGap zombie eviction) → extracted to #19449. This contains only the Bug #1 (AuRa/Gnosis stale pending — proper state-diffs approach) → investigating the proper fix as suggested: ensuring nonce changes from system/AuRa transactions appear in the This PR (#19393) will be superseded by those two separate PRs. Closing in favour of #19449 (Bug #2) and a follow-up PR for Bug #1. |
…ap (#19449) **[SharovBot]** ## Split from #19393 per @yperbasis review This PR contains **Bug #2 only** (zombie queued transaction eviction), extracted from #19393 which was asked to be split into separate PRs. ## Problem Queued transactions with an impossibly large nonce gap (e.g. on-chain nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever. They can never become pending without filling thousands of nonce positions first, causing unbounded queued pool bloat (4,000+ txns observed on Gnosis Chain, Erigon 3.3.8). The existing blob-txn nonce-gap eviction only covered `BlobTxnType`. Regular transactions had no gap limit. ## Fix - Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64) - Add `NonceTooDistant` (DiscardReason 37) for observability - In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce` by more than `MaxNonceGap` - `noGapsNonce` accounts for consecutive txns already pooled, so consecutive txns are never zombie-evicted - Fix `toDelReasons` parallel slice to track correct discard reason per evicted tx (was always logging `NonceTooLow`) ## Tests - `TestZombieQueuedEviction` — 3 sub-tests: 1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant` 2. Tx at exactly MaxNonceGap boundary (gap=64) is kept 3. Consecutive txns beyond MaxNonceGap are never zombie-evicted ## Testing ``` go build ./txnprovider/txpool/... ✅ go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅ ``` ## Related - Bug #1 (stale pending / AuRa nonce) will be addressed separately per @yperbasis feedback - Backport to release/3.3 will follow once this is merged - Original combined PR: #19393
…ap (#19449) **[SharovBot]** ## Split from #19393 per @yperbasis review This PR contains **Bug #2 only** (zombie queued transaction eviction), extracted from #19393 which was asked to be split into separate PRs. ## Problem Queued transactions with an impossibly large nonce gap (e.g. on-chain nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever. They can never become pending without filling thousands of nonce positions first, causing unbounded queued pool bloat (4,000+ txns observed on Gnosis Chain, Erigon 3.3.8). The existing blob-txn nonce-gap eviction only covered `BlobTxnType`. Regular transactions had no gap limit. ## Fix - Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64) - Add `NonceTooDistant` (DiscardReason 37) for observability - In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce` by more than `MaxNonceGap` - `noGapsNonce` accounts for consecutive txns already pooled, so consecutive txns are never zombie-evicted - Fix `toDelReasons` parallel slice to track correct discard reason per evicted tx (was always logging `NonceTooLow`) ## Tests - `TestZombieQueuedEviction` — 3 sub-tests: 1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant` 2. Tx at exactly MaxNonceGap boundary (gap=64) is kept 3. Consecutive txns beyond MaxNonceGap are never zombie-evicted ## Testing ``` go build ./txnprovider/txpool/... ✅ go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅ ``` ## Related - Bug #1 (stale pending / AuRa nonce) will be addressed separately per @yperbasis feedback - Backport to release/3.3 will follow once this is merged - Original combined PR: #19393
…vict stale pending txns (AuRa/Gnosis fix)
On AuRa/Gnosis Chain, system transactions advance sender nonces without the
execution layer emitting state-diff UPSERT entries for those senders. This
causes onSenderStateChange to never fire for them, leaving their pooled
pending transactions stale forever. Block builders then attempt to include
these txns, receive 'nonce too low', and produce empty or near-empty blocks.
Fix: add ensureMinedSendersInStateDiff(), called at the top of OnNewBlock
before cache.OnNewBlock(). For each mined-tx sender absent from the state-diff
batch, it reads the authoritative post-block account state from coreTx and
injects a synthetic AccountChange{Action=UPSERT} entry. This causes:
1. kvcache to be updated with the correct nonce for that sender
2. addTxnsOnNewBlock to add the sender to sendersWithChangedState
3. onSenderStateChange to fire and evict any stale pending transactions
The DB read uses coreTx.GetLatest(AccountsDomain), which falls through to
the underlying MDBX store after block execution, so the correct post-block
nonce is always available. If the DB has no entry (zero-value), the function
floors the injected nonce at maxMinedNonce+1, guaranteeing at minimum the
mined nonce is reflected.
This is PR1 (Bug #1) of the split requested by @yperbasis on #19393.
Bug #2 (MaxNonceGap zombie queued eviction) was addressed in #19449.
|
[SharovBot] As requested by @yperbasis, this PR has been split into two separate PRs:
This PR (#19393) can be closed in favour of the above two. |
…ap (#19449) **[SharovBot]** ## Split from #19393 per @yperbasis review This PR contains **Bug #2 only** (zombie queued transaction eviction), extracted from #19393 which was asked to be split into separate PRs. ## Problem Queued transactions with an impossibly large nonce gap (e.g. on-chain nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever. They can never become pending without filling thousands of nonce positions first, causing unbounded queued pool bloat (4,000+ txns observed on Gnosis Chain, Erigon 3.3.8). The existing blob-txn nonce-gap eviction only covered `BlobTxnType`. Regular transactions had no gap limit. ## Fix - Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64) - Add `NonceTooDistant` (DiscardReason 37) for observability - In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce` by more than `MaxNonceGap` - `noGapsNonce` accounts for consecutive txns already pooled, so consecutive txns are never zombie-evicted - Fix `toDelReasons` parallel slice to track correct discard reason per evicted tx (was always logging `NonceTooLow`) ## Tests - `TestZombieQueuedEviction` — 3 sub-tests: 1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant` 2. Tx at exactly MaxNonceGap boundary (gap=64) is kept 3. Consecutive txns beyond MaxNonceGap are never zombie-evicted ## Testing ``` go build ./txnprovider/txpool/... ✅ go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅ ``` ## Related - Bug #1 (stale pending / AuRa nonce) will be addressed separately per @yperbasis feedback - Backport to release/3.3 will follow once this is merged - Original combined PR: #19393
…vict stale pending txns (AuRa/Gnosis fix)
On AuRa/Gnosis Chain, system transactions advance sender nonces without the
execution layer emitting state-diff UPSERT entries for those senders. This
causes onSenderStateChange to never fire for them, leaving their pooled
pending transactions stale forever. Block builders then attempt to include
these txns, receive 'nonce too low', and produce empty or near-empty blocks.
Fix: add ensureMinedSendersInStateDiff(), called at the top of OnNewBlock
before cache.OnNewBlock(). For each mined-tx sender absent from the state-diff
batch, it reads the authoritative post-block account state from coreTx and
injects a synthetic AccountChange{Action=UPSERT} entry. This causes:
1. kvcache to be updated with the correct nonce for that sender
2. addTxnsOnNewBlock to add the sender to sendersWithChangedState
3. onSenderStateChange to fire and evict any stale pending transactions
The DB read uses coreTx.GetLatest(AccountsDomain), which falls through to
the underlying MDBX store after block execution, so the correct post-block
nonce is always available. If the DB has no entry (zero-value), the function
floors the injected nonce at maxMinedNonce+1, guaranteeing at minimum the
mined nonce is reflected.
This is PR1 (Bug #1) of the split requested by @yperbasis on #19393.
Bug #2 (MaxNonceGap zombie queued eviction) was addressed in #19449.
… senders to evict stale pending txns
On AuRa/Gnosis Chain, block finalization (engine.Finalize) executes system
transactions (validator rewards, bridge calls, etc.) that advance sender
nonces. These state changes were NOT reaching the txpool state-diff batch
because the block-end stateWriter in exec3_serial.go was constructed with a
nil accumulator.
Fix: store the block's accumulator on serialExecutor (se.accumulator) and
pass it to the block-end stateWriter so that UpdateAccountData → ChangeAccount
calls during Finalize/FinalizeAndAssemble emit UPSERT entries into the batch.
The txpool then calls onSenderStateChange for those senders and evicts any
pending transactions whose nonces are now stale.
Changes:
- execution/stagedsync/exec3_serial.go:
- Add accumulator field to serialExecutor
- Set se.accumulator = accumulator alongside StartChange per block
- Pass se.accumulator to state.NewWriter in the block-end path
- txnprovider/txpool/pool.go:
- Remove txpool-level ensureMinedSendersInStateDiff workaround
- Restore original cache.OnNewBlock ordering
- txnprovider/txpool/pool_test.go:
- Update TestStalePendingEvictionViaMineNonce to test via correct
stateChanges from the EL (as now emitted after this fix)
This is the EL-level fix requested by @yperbasis on #19392/#19393.
… senders to evict stale pending txns
On AuRa/Gnosis Chain, block finalization (engine.Finalize) executes system
transactions (validator rewards, bridge calls, etc.) that advance sender
nonces. These state changes were NOT reaching the txpool state-diff batch
because the block-end stateWriter in exec3_serial.go was constructed with a
nil accumulator.
Fix: store the block's accumulator on serialExecutor (se.accumulator) and
pass it to the block-end stateWriter so that UpdateAccountData → ChangeAccount
calls during Finalize/FinalizeAndAssemble emit UPSERT entries into the batch.
The txpool then calls onSenderStateChange for those senders and evicts any
pending transactions whose nonces are now stale.
Changes:
- execution/stagedsync/exec3_serial.go:
- Add accumulator field to serialExecutor
- Set se.accumulator = accumulator alongside StartChange per block
- Pass se.accumulator to state.NewWriter in the block-end path
- txnprovider/txpool/pool.go:
- Remove txpool-level ensureMinedSendersInStateDiff workaround
- Restore original cache.OnNewBlock ordering
- txnprovider/txpool/pool_test.go:
- Update TestStalePendingEvictionViaMineNonce to test via correct
stateChanges from the EL (as now emitted after this fix)
This is the EL-level fix requested by @yperbasis on #19392/#19393.
…ap (#19449) **[SharovBot]** ## Split from #19393 per @yperbasis review This PR contains **Bug #2 only** (zombie queued transaction eviction), extracted from #19393 which was asked to be split into separate PRs. ## Problem Queued transactions with an impossibly large nonce gap (e.g. on-chain nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever. They can never become pending without filling thousands of nonce positions first, causing unbounded queued pool bloat (4,000+ txns observed on Gnosis Chain, Erigon 3.3.8). The existing blob-txn nonce-gap eviction only covered `BlobTxnType`. Regular transactions had no gap limit. ## Fix - Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64) - Add `NonceTooDistant` (DiscardReason 37) for observability - In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce` by more than `MaxNonceGap` - `noGapsNonce` accounts for consecutive txns already pooled, so consecutive txns are never zombie-evicted - Fix `toDelReasons` parallel slice to track correct discard reason per evicted tx (was always logging `NonceTooLow`) ## Tests - `TestZombieQueuedEviction` — 3 sub-tests: 1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant` 2. Tx at exactly MaxNonceGap boundary (gap=64) is kept 3. Consecutive txns beyond MaxNonceGap are never zombie-evicted ## Testing ``` go build ./txnprovider/txpool/... ✅ go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅ ``` ## Related - Bug #1 (stale pending / AuRa nonce) will be addressed separately per @yperbasis feedback - Backport to release/3.3 will follow once this is merged - Original combined PR: #19393
[SharovBot]
Problem
A Gnosis Chain validator operator running Erigon 3.3.8 discovered two distinct txpool eviction bugs that together cause empty block production and unbounded queued pool bloat. This PR addresses both bugs comprehensively, building on the initial analysis in PRs #19385 and #19386 (which were closed before merge).
Bug #1 — Stale Pending Transactions (AuRa/Gnosis-specific)
Root Cause
onSenderStateChangeevicts pending transactions withsenderNonce > tx.Nonce(stale). However,onSenderStateChangeis only called for senders that appear in the EVM state-diff batch (stateChanges.ChangeBatch) sent from the execution layer.On AuRa/Gnosis Chain, many transactions (system transactions, validator reward txns) advance nonces without producing a state-diff UPSERT entry for the sender in the batch. As a result:
sendersWithChangedStateonSenderStateChangeis never called for themnonce too low, and produce empty blocksFix
In
addTxnsOnNewBlock, we now build aminedSenderMinNoncemap from theminedTxnsargument:minedSenderMinNonce[senderID] = max(existing, N+1)sendersWithChangedState(ensuringonSenderStateChangeis always called)max(kvcache_nonce, minedSenderMinNonce[senderID])This is zero extra state queries, O(mined_txns_per_block) overhead, and works even when kvcache hasn't been updated for the sender.
Bug #2 — Zombie Queued Transactions (Universal)
Root Cause
There is no eviction for queued transactions whose nonce is astronomically higher than the sender's current on-chain nonce. Examples seen on Gnosis Chain:
0x0006E9...0x016f6C...These transactions can never become pending — they would require filling 16,000+ nonce positions first. Yet they sit in the queued sub-pool indefinitely, bloating it to 4,000+ transactions and wasting memory.
The existing blob-txn nonce-gap eviction only covers
BlobTxnTypeand only for any gap; regular transactions had no gap limit at all.Fix
Added
MaxNonceGap uint64totxpoolcfg.Config(default: 64). InonSenderStateChange, after checking for stale nonces:noGapsNoncetracks the next expected nonce accounting for consecutive txns already in the pool. A tx with nonce 100 won't be evicted if txns 0–99 are already pooled (gap fromnoGapsNonceis 0). A tx with nonce 16,814 whennoGapsNonce=282has gap 16,532 > 64 → evicted.A new
NonceTooDistant(37) discard reason is added for observability.Additional Fix — Reason Tracking
Previously, all evictions in
onSenderStateChangeused the single reasonNonceTooLowregardless of why the tx was removed. A paralleltoDelReasons []txpoolcfg.DiscardReasonslice now tracks the correct reason per evicted transaction.Tests Added
TestZombieQueuedEviction— 3 sub-tests:NonceTooDistantTestStalePendingEvictionViaMineNonce— Verifies stale pending tx is evicted when the sender's tx is mined but the sender does NOT appear in the EVM state-diff batch (exact AuRa scenario)Testing
References