Skip to content

[SharovBot] fix(txpool): comprehensive stale tx eviction for pending and queued sub-pools#19393

Closed
Giulio2002 wants to merge 2 commits intomainfrom
fix/txpool-eviction-comprehensive
Closed

[SharovBot] fix(txpool): comprehensive stale tx eviction for pending and queued sub-pools#19393
Giulio2002 wants to merge 2 commits intomainfrom
fix/txpool-eviction-comprehensive

Conversation

@Giulio2002
Copy link
Collaborator

[SharovBot]

Problem

A Gnosis Chain validator operator running Erigon 3.3.8 discovered two distinct txpool eviction bugs that together cause empty block production and unbounded queued pool bloat. This PR addresses both bugs comprehensively, building on the initial analysis in PRs #19385 and #19386 (which were closed before merge).


Bug #1 — Stale Pending Transactions (AuRa/Gnosis-specific)

Root Cause

onSenderStateChange evicts pending transactions with senderNonce > tx.Nonce (stale). However, onSenderStateChange is only called for senders that appear in the EVM state-diff batch (stateChanges.ChangeBatch) sent from the execution layer.

On AuRa/Gnosis Chain, many transactions (system transactions, validator reward txns) advance nonces without producing a state-diff UPSERT entry for the sender in the batch. As a result:

  1. The sender's nonce is mined to N+1 on-chain
  2. The sender does not appear in sendersWithChangedState
  3. onSenderStateChange is never called for them
  4. Their pooled pending txns with nonce ≤ N remain in the pending pool forever
  5. Block builders attempt to include these, get nonce too low, and produce empty blocks

Fix

In addTxnsOnNewBlock, we now build a minedSenderMinNonce map from the minedTxns argument:

  • For each mined tx at nonce N: minedSenderMinNonce[senderID] = max(existing, N+1)
  • Add all mined-tx senders to sendersWithChangedState (ensuring onSenderStateChange is always called)
  • When reading the nonce from kvcache, use max(kvcache_nonce, minedSenderMinNonce[senderID])

This is zero extra state queries, O(mined_txns_per_block) overhead, and works even when kvcache hasn't been updated for the sender.


Bug #2 — Zombie Queued Transactions (Universal)

Root Cause

There is no eviction for queued transactions whose nonce is astronomically higher than the sender's current on-chain nonce. Examples seen on Gnosis Chain:

Account On-chain nonce Queued nonce Gap
0x0006E9... 281 16,814–17,013 16,533+
0x016f6C... 6,398 144,968 138,570

These transactions can never become pending — they would require filling 16,000+ nonce positions first. Yet they sit in the queued sub-pool indefinitely, bloating it to 4,000+ transactions and wasting memory.

The existing blob-txn nonce-gap eviction only covers BlobTxnType and only for any gap; regular transactions had no gap limit at all.

Fix

Added MaxNonceGap uint64 to txpoolcfg.Config (default: 64). In onSenderStateChange, after checking for stale nonces:

} else if p.cfg.MaxNonceGap > 0 && mt.TxnSlot.Nonce > noGapsNonce && mt.TxnSlot.Nonce-noGapsNonce > p.cfg.MaxNonceGap {
    deleteAndContinueReasonLog = "nonce gap too large"
    discardReason = txpoolcfg.NonceTooDistant
}

noGapsNonce tracks the next expected nonce accounting for consecutive txns already in the pool. A tx with nonce 100 won't be evicted if txns 0–99 are already pooled (gap from noGapsNonce is 0). A tx with nonce 16,814 when noGapsNonce=282 has gap 16,532 > 64 → evicted.

A new NonceTooDistant (37) discard reason is added for observability.


Additional Fix — Reason Tracking

Previously, all evictions in onSenderStateChange used the single reason NonceTooLow regardless of why the tx was removed. A parallel toDelReasons []txpoolcfg.DiscardReason slice now tracks the correct reason per evicted transaction.


Tests Added

  • TestZombieQueuedEviction — 3 sub-tests:

    1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with NonceTooDistant
    2. Tx at exactly MaxNonceGap boundary (gap=64) is kept
    3. Consecutive txns are never zombie-evicted even when count > MaxNonceGap
  • TestStalePendingEvictionViaMineNonce — Verifies stale pending tx is evicted when the sender's tx is mined but the sender does NOT appear in the EVM state-diff batch (exact AuRa scenario)


Testing

go build ./txnprovider/txpool/... ✅
go test ./txnprovider/txpool/... -count=1 -timeout 5m ✅ (all tests pass including full fuzz corpus)

References

…ub-pools

Bug #1 (stale pending txs): onSenderStateChange is only called for senders
that appear in the EVM state-diff batch. On AuRa/Gnosis Chain, senders whose
txns are mined via system transactions never appear in the state-diff, so their
pending txns are never evicted. Fix: in addTxnsOnNewBlock, build a
minedSenderMinNonce map from minedTxns (nonce N mined => on-chain nonce >= N+1),
add all mined-tx senders to sendersWithChangedState, and floor the kvcache nonce
with the mined floor when calling onSenderStateChange.

Bug #2 (zombie queued txs): There was no eviction for queued transactions whose
nonce is impossibly far ahead of the sender's on-chain nonce. Accounts on Gnosis
Chain were seen with on-chain nonce=281 but queued nonces=16814+ (gap=16533).
These txns can never become pending, yet sit in the pool indefinitely.

Fix: add MaxNonceGap config (default 64) to txpoolcfg.Config. In
onSenderStateChange, evict txns where tx.Nonce - noGapsNonce > MaxNonceGap with
the new NonceTooDistant discard reason. noGapsNonce accounts for consecutive
txns already in the pool, so valid long pending queues are not affected.

Also improve reason tracking in onSenderStateChange: use a parallel
toDelReasons slice so each evicted tx gets the correct DiscardReason
(NonceTooLow vs NonceTooDistant).

Tests added:
- TestZombieQueuedEviction: verifies zombie txns are evicted, boundary tx kept,
  consecutive txns never zombie-evicted
- TestStalePendingEvictionViaMineNonce: verifies stale pending tx evicted via
  mined-nonce floor even without EVM state-diff UPSERT event for sender
@lystopad lystopad requested a review from mh0lt February 21, 2026 21:49
@lystopad
Copy link
Member

I tested on hoodi network and confirm better processing.

Before this fix:

[INFO] [02-22|13:18:55.328] [txpool] stat                            pending=42 baseFee=3 queued=1273
[INFO] [02-22|13:21:55.328] [txpool] stat                            pending=34 baseFee=3 queued=1272
[INFO] [02-22|13:24:55.329] [txpool] stat                            pending=39 baseFee=3 queued=1274

With this fix and after 4+ hours running node:

[INFO] [02-23|15:41:41.243] [txpool] stat                            pending=36 baseFee=1 queued=25
[INFO] [02-23|15:44:41.242] [txpool] stat                            pending=39 baseFee=1 queued=25
[INFO] [02-23|15:47:41.243] [txpool] stat                            pending=36 baseFee=1 queued=25

Copy link
Member

@yperbasis yperbasis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with with the fix for bug no 2 (MaxNonceGap), but not with the fix for bug 1 (missing nonce updates for Gnosis). A proper fix for it would be to ensure that such nonce changes are present in the state diffs.

More generally, it's better to tackle separate problems in separate PRs. Let's split this one into two for the two bugs and re-do the fix for bug no 1.

Giulio2002 added a commit that referenced this pull request Feb 24, 2026
#2)

Add MaxNonceGap config (default 64) and NonceTooDistant discard reason.
In onSenderStateChange, evict queued transactions whose nonce exceeds the
sender's on-chain nonce (accounting for consecutive txns already in pool)
by more than MaxNonceGap. These 'zombie' txns can never become pending and
cause unbounded queued pool bloat (e.g. nonce 144968 vs on-chain 6398 on
Gnosis Chain Erigon 3.3.8).

Also fix toDelReasons to track correct discard reason per evicted txn
instead of always logging NonceTooLow.

Extracted from #19393 per @yperbasis review: split separate bugs into
separate PRs.

Closes: related to #19393
@Giulio2002
Copy link
Collaborator Author

[SharovBot] 📋 Addressing @yperbasis review — splitting into separate PRs as requested:

Bug #2 (MaxNonceGap zombie eviction) → extracted to #19449. This contains only the MaxNonceGap config, NonceTooDistant discard reason, and the zombie eviction logic in onSenderStateChange which @yperbasis approved.

Bug #1 (AuRa/Gnosis stale pending — proper state-diffs approach) → investigating the proper fix as suggested: ensuring nonce changes from system/AuRa transactions appear in the StateChangeBatch accumulator. This requires changes in the execution layer's state-diff generation path. Will create a separate PR once the proper approach is confirmed.

This PR (#19393) will be superseded by those two separate PRs. Closing in favour of #19449 (Bug #2) and a follow-up PR for Bug #1.

AskAlexSharov pushed a commit that referenced this pull request Feb 28, 2026
…ap (#19449)

**[SharovBot]**

## Split from #19393 per @yperbasis review

This PR contains **Bug #2 only** (zombie queued transaction eviction),
extracted from #19393 which was asked to be split into separate PRs.

## Problem

Queued transactions with an impossibly large nonce gap (e.g. on-chain
nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever.
They can never become pending without filling thousands of nonce
positions first, causing unbounded queued pool bloat (4,000+ txns
observed on Gnosis Chain, Erigon 3.3.8).

The existing blob-txn nonce-gap eviction only covered `BlobTxnType`.
Regular transactions had no gap limit.

## Fix

- Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64)
- Add `NonceTooDistant` (DiscardReason 37) for observability
- In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce`
by more than `MaxNonceGap`
- `noGapsNonce` accounts for consecutive txns already pooled, so
consecutive txns are never zombie-evicted
- Fix `toDelReasons` parallel slice to track correct discard reason per
evicted tx (was always logging `NonceTooLow`)

## Tests

- `TestZombieQueuedEviction` — 3 sub-tests:
1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant`
  2. Tx at exactly MaxNonceGap boundary (gap=64) is kept
  3. Consecutive txns beyond MaxNonceGap are never zombie-evicted

## Testing

```
go build ./txnprovider/txpool/... ✅
go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅
```

## Related

- Bug #1 (stale pending / AuRa nonce) will be addressed separately per
@yperbasis feedback
- Backport to release/3.3 will follow once this is merged
- Original combined PR: #19393
Giulio2002 added a commit that referenced this pull request Mar 3, 2026
…ap (#19449)

**[SharovBot]**

## Split from #19393 per @yperbasis review

This PR contains **Bug #2 only** (zombie queued transaction eviction),
extracted from #19393 which was asked to be split into separate PRs.

## Problem

Queued transactions with an impossibly large nonce gap (e.g. on-chain
nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever.
They can never become pending without filling thousands of nonce
positions first, causing unbounded queued pool bloat (4,000+ txns
observed on Gnosis Chain, Erigon 3.3.8).

The existing blob-txn nonce-gap eviction only covered `BlobTxnType`.
Regular transactions had no gap limit.

## Fix

- Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64)
- Add `NonceTooDistant` (DiscardReason 37) for observability
- In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce`
by more than `MaxNonceGap`
- `noGapsNonce` accounts for consecutive txns already pooled, so
consecutive txns are never zombie-evicted
- Fix `toDelReasons` parallel slice to track correct discard reason per
evicted tx (was always logging `NonceTooLow`)

## Tests

- `TestZombieQueuedEviction` — 3 sub-tests:
1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant`
  2. Tx at exactly MaxNonceGap boundary (gap=64) is kept
  3. Consecutive txns beyond MaxNonceGap are never zombie-evicted

## Testing

```
go build ./txnprovider/txpool/... ✅
go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅
```

## Related

- Bug #1 (stale pending / AuRa nonce) will be addressed separately per
@yperbasis feedback
- Backport to release/3.3 will follow once this is merged
- Original combined PR: #19393
Giulio2002 pushed a commit that referenced this pull request Mar 3, 2026
…vict stale pending txns (AuRa/Gnosis fix)

On AuRa/Gnosis Chain, system transactions advance sender nonces without the
execution layer emitting state-diff UPSERT entries for those senders. This
causes onSenderStateChange to never fire for them, leaving their pooled
pending transactions stale forever. Block builders then attempt to include
these txns, receive 'nonce too low', and produce empty or near-empty blocks.

Fix: add ensureMinedSendersInStateDiff(), called at the top of OnNewBlock
before cache.OnNewBlock(). For each mined-tx sender absent from the state-diff
batch, it reads the authoritative post-block account state from coreTx and
injects a synthetic AccountChange{Action=UPSERT} entry. This causes:
  1. kvcache to be updated with the correct nonce for that sender
  2. addTxnsOnNewBlock to add the sender to sendersWithChangedState
  3. onSenderStateChange to fire and evict any stale pending transactions

The DB read uses coreTx.GetLatest(AccountsDomain), which falls through to
the underlying MDBX store after block execution, so the correct post-block
nonce is always available. If the DB has no entry (zero-value), the function
floors the injected nonce at maxMinedNonce+1, guaranteeing at minimum the
mined nonce is reflected.

This is PR1 (Bug #1) of the split requested by @yperbasis on #19393.
Bug #2 (MaxNonceGap zombie queued eviction) was addressed in #19449.
@Giulio2002
Copy link
Collaborator Author

[SharovBot]

As requested by @yperbasis, this PR has been split into two separate PRs:

This PR (#19393) can be closed in favour of the above two.

@yperbasis yperbasis closed this Mar 4, 2026
sudeepdino008 pushed a commit that referenced this pull request Mar 4, 2026
…ap (#19449)

**[SharovBot]**

## Split from #19393 per @yperbasis review

This PR contains **Bug #2 only** (zombie queued transaction eviction),
extracted from #19393 which was asked to be split into separate PRs.

## Problem

Queued transactions with an impossibly large nonce gap (e.g. on-chain
nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever.
They can never become pending without filling thousands of nonce
positions first, causing unbounded queued pool bloat (4,000+ txns
observed on Gnosis Chain, Erigon 3.3.8).

The existing blob-txn nonce-gap eviction only covered `BlobTxnType`.
Regular transactions had no gap limit.

## Fix

- Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64)
- Add `NonceTooDistant` (DiscardReason 37) for observability
- In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce`
by more than `MaxNonceGap`
- `noGapsNonce` accounts for consecutive txns already pooled, so
consecutive txns are never zombie-evicted
- Fix `toDelReasons` parallel slice to track correct discard reason per
evicted tx (was always logging `NonceTooLow`)

## Tests

- `TestZombieQueuedEviction` — 3 sub-tests:
1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant`
  2. Tx at exactly MaxNonceGap boundary (gap=64) is kept
  3. Consecutive txns beyond MaxNonceGap are never zombie-evicted

## Testing

```
go build ./txnprovider/txpool/... ✅
go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅
```

## Related

- Bug #1 (stale pending / AuRa nonce) will be addressed separately per
@yperbasis feedback
- Backport to release/3.3 will follow once this is merged
- Original combined PR: #19393
Giulio2002 pushed a commit that referenced this pull request Mar 4, 2026
…vict stale pending txns (AuRa/Gnosis fix)

On AuRa/Gnosis Chain, system transactions advance sender nonces without the
execution layer emitting state-diff UPSERT entries for those senders. This
causes onSenderStateChange to never fire for them, leaving their pooled
pending transactions stale forever. Block builders then attempt to include
these txns, receive 'nonce too low', and produce empty or near-empty blocks.

Fix: add ensureMinedSendersInStateDiff(), called at the top of OnNewBlock
before cache.OnNewBlock(). For each mined-tx sender absent from the state-diff
batch, it reads the authoritative post-block account state from coreTx and
injects a synthetic AccountChange{Action=UPSERT} entry. This causes:
  1. kvcache to be updated with the correct nonce for that sender
  2. addTxnsOnNewBlock to add the sender to sendersWithChangedState
  3. onSenderStateChange to fire and evict any stale pending transactions

The DB read uses coreTx.GetLatest(AccountsDomain), which falls through to
the underlying MDBX store after block execution, so the correct post-block
nonce is always available. If the DB has no entry (zero-value), the function
floors the injected nonce at maxMinedNonce+1, guaranteeing at minimum the
mined nonce is reflected.

This is PR1 (Bug #1) of the split requested by @yperbasis on #19393.
Bug #2 (MaxNonceGap zombie queued eviction) was addressed in #19449.
Giulio2002 pushed a commit that referenced this pull request Mar 4, 2026
… senders to evict stale pending txns

On AuRa/Gnosis Chain, block finalization (engine.Finalize) executes system
transactions (validator rewards, bridge calls, etc.) that advance sender
nonces. These state changes were NOT reaching the txpool state-diff batch
because the block-end stateWriter in exec3_serial.go was constructed with a
nil accumulator.

Fix: store the block's accumulator on serialExecutor (se.accumulator) and
pass it to the block-end stateWriter so that UpdateAccountData → ChangeAccount
calls during Finalize/FinalizeAndAssemble emit UPSERT entries into the batch.
The txpool then calls onSenderStateChange for those senders and evicts any
pending transactions whose nonces are now stale.

Changes:
- execution/stagedsync/exec3_serial.go:
  - Add accumulator field to serialExecutor
  - Set se.accumulator = accumulator alongside StartChange per block
  - Pass se.accumulator to state.NewWriter in the block-end path
- txnprovider/txpool/pool.go:
  - Remove txpool-level ensureMinedSendersInStateDiff workaround
  - Restore original cache.OnNewBlock ordering
- txnprovider/txpool/pool_test.go:
  - Update TestStalePendingEvictionViaMineNonce to test via correct
    stateChanges from the EL (as now emitted after this fix)

This is the EL-level fix requested by @yperbasis on #19392/#19393.
Giulio2002 pushed a commit that referenced this pull request Mar 4, 2026
… senders to evict stale pending txns

On AuRa/Gnosis Chain, block finalization (engine.Finalize) executes system
transactions (validator rewards, bridge calls, etc.) that advance sender
nonces. These state changes were NOT reaching the txpool state-diff batch
because the block-end stateWriter in exec3_serial.go was constructed with a
nil accumulator.

Fix: store the block's accumulator on serialExecutor (se.accumulator) and
pass it to the block-end stateWriter so that UpdateAccountData → ChangeAccount
calls during Finalize/FinalizeAndAssemble emit UPSERT entries into the batch.
The txpool then calls onSenderStateChange for those senders and evicts any
pending transactions whose nonces are now stale.

Changes:
- execution/stagedsync/exec3_serial.go:
  - Add accumulator field to serialExecutor
  - Set se.accumulator = accumulator alongside StartChange per block
  - Pass se.accumulator to state.NewWriter in the block-end path
- txnprovider/txpool/pool.go:
  - Remove txpool-level ensureMinedSendersInStateDiff workaround
  - Restore original cache.OnNewBlock ordering
- txnprovider/txpool/pool_test.go:
  - Update TestStalePendingEvictionViaMineNonce to test via correct
    stateChanges from the EL (as now emitted after this fix)

This is the EL-level fix requested by @yperbasis on #19392/#19393.
sudeepdino008 pushed a commit that referenced this pull request Mar 4, 2026
…ap (#19449)

**[SharovBot]**

## Split from #19393 per @yperbasis review

This PR contains **Bug #2 only** (zombie queued transaction eviction),
extracted from #19393 which was asked to be split into separate PRs.

## Problem

Queued transactions with an impossibly large nonce gap (e.g. on-chain
nonce=281, queued nonce=16,814 — gap of 16,533) sit in the pool forever.
They can never become pending without filling thousands of nonce
positions first, causing unbounded queued pool bloat (4,000+ txns
observed on Gnosis Chain, Erigon 3.3.8).

The existing blob-txn nonce-gap eviction only covered `BlobTxnType`.
Regular transactions had no gap limit.

## Fix

- Add `MaxNonceGap uint64` to `txpoolcfg.Config` (default: 64)
- Add `NonceTooDistant` (DiscardReason 37) for observability
- In `onSenderStateChange`, evict txns whose nonce exceeds `noGapsNonce`
by more than `MaxNonceGap`
- `noGapsNonce` accounts for consecutive txns already pooled, so
consecutive txns are never zombie-evicted
- Fix `toDelReasons` parallel slice to track correct discard reason per
evicted tx (was always logging `NonceTooLow`)

## Tests

- `TestZombieQueuedEviction` — 3 sub-tests:
1. Zombie tx (gap=65 > MaxNonceGap=64) is evicted with `NonceTooDistant`
  2. Tx at exactly MaxNonceGap boundary (gap=64) is kept
  3. Consecutive txns beyond MaxNonceGap are never zombie-evicted

## Testing

```
go build ./txnprovider/txpool/... ✅
go test ./txnprovider/txpool/... -run TestZombieQueuedEviction -count=1 ✅
```

## Related

- Bug #1 (stale pending / AuRa nonce) will be addressed separately per
@yperbasis feedback
- Backport to release/3.3 will follow once this is merged
- Original combined PR: #19393
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants