Skip to content

[PROBE-ONLY] EngineXTestRunner: force-evict tester per Run — investigates #21380 EEST regression (hypothesis: cache reuse across tests)#21383

Closed
mh0lt wants to merge 76 commits into
mainfrom
mh/issue21380-probe-evict-between-tests
Closed

[PROBE-ONLY] EngineXTestRunner: force-evict tester per Run — investigates #21380 EEST regression (hypothesis: cache reuse across tests)#21383
mh0lt wants to merge 76 commits into
mainfrom
mh/issue21380-probe-evict-between-tests

Conversation

@mh0lt
Copy link
Copy Markdown
Contributor

@mh0lt mh0lt commented May 23, 2026

Probe — do not merge

Diagnostic for the 22 EEST CI failures on #21380. Tests fail because the EngineXTestRunner caches testers per (fork, preAllocHash) and reuses them across tests in the same group; state/cache from test N leaks into test N+1's genesis→block-1 boundary, producing the deterministic wrong trie root of block 1: 1fba572808... expected ef5a9a5b... error.

CI evidence supports this:

This probe always-evicts the cached tester at the top of EngineXTestRunner.Run, so each test runs against a freshly built node. Not the final fix — that defeats the purpose of caching testers. If green, the proper fix on #21380 will add an in-place cache-reset between tests so caching is preserved for the SAME test's payload sequence (where it provides the perf win) while not poisoning the next test.

Diff

One change: execution/engineapi/engineapitester/engine_x_test_runner.go — add extr.Evict(test.Fork, test.PreAllocHash) at top of Run. Evict is no-op on cold cache, so first test still runs normally.

Expected outcomes

  • EEST passes → cache-isolation hypothesis confirmed; design + implement the proper in-place cache-reset on State Cache Consolidation (PR #1 of the perf stack) #21380.
  • EEST still fails → the issue is deeper than tester reuse; widen investigation (full reset isn't enough → some other shared state).

Mark Holt and others added 30 commits May 21, 2026 21:35
Pure refactor — behavior preserved. Splits the encoded-branch→cells
decode logic out of HexPatriciaHashed.unfoldBranchNode into a free
function DecodeBranchInto so the same code is consumed by:

  - unfoldBranchNode (existing trie unfold path)
  - future cache populators (decoded-payload BranchCache)
  - future parallel pre-unfold orchestrator (Stage E)

Today these would each have to re-derive the encoded-branch parsing
logic. Centralising it ensures one decoder, one set of edge cases, one
place to fix any bug in the on-disk format handling.

DecodeBranchInto is intentionally PURE — it does not call
deriveHashedKeys. Trie callers (which need hashed keys for the fold
state machine) follow the decode with their own keccak loop.
Cache callers can skip the derive step entirely until the cell is
consumed by the trie.

Tests:
- TestDecodeBranchInto_RoundTrip: BranchEncoder.EncodeBranch produces
  bytes that DecodeBranchInto recovers cell-for-cell. Property test
  that keeps the canonical decoder consistent with the canonical
  encoder.
- TestDecodeBranchInto_DeletedFlag: touchMap/afterMap convention with
  the deleted parameter.
- TestDecodeBranchInto_TruncatedInput: clean errors on truncated input
  (no panic).

Plus the existing commitment test suite (incl. trie-mismatch tests in
TestBranchData_*) all pass without modification, confirming the
refactor preserves unfoldBranchNode's behavior.

This is the foundation for the next refactors in the
representation-reduction track (see
agentspecs/trie-data-pipeline-complexity-tax.md): subsequent PRs will
introduce a decoded-payload cache that reads through this same
decoder, and will lift unfoldKeyPath as a per-key traversal primitive
that the warmer + future Stage E both consume.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces a new BranchCache type, distinct from WarmupCache and
designed for the longer-lived caching the cross-block persistence
work needs (step 7 of the representation-reduction sequence).

Distinguishing characteristics vs WarmupCache:

  - Bounded LRU tail with configurable capacity (vs WarmupCache's
    unbounded map). Suitable for caches that outlive a single Process
    without unbounded memory growth.
  - Single pinned slot for the root branch (compact prefix [0x00]).
    Root never evicts. Atomic-pointer load/store on the hot read
    path, no lock involved.
  - dirty-flag + PutIfClean invariants — same semantics as the
    invariants added to WarmupCache in the previous commit. Lets
    cross-block writers race safely with fold updates.
  - Lazy GetDecoded — same lazy-decode pattern as WarmupCache's
    GetBranchDecoded; cells populated on first decoded-read and
    cached for subsequent reads.

NOT yet wired into the trie's read or write paths. This commit just
adds the type, with tests. The trie integration (where this cache
plugs into branchFromCacheOrDB and the encoder's PutBranch) is the
discussion point at the step 6 boundary — see the conversation
captured at this point in the representation-reduction sequence.

Today the cache is intended to be ephemeral (per-Process,
constructed alongside the trie, dies with it). Step 7 lifts the
lifetime to the aggTx level for cross-block persistence; the cache
shape (bounded LRU + pinned root + dirty-flag) is in place ahead
of that.

Tests:
- TestBranchCache_RootPinning: root branch lands in pinned slot,
  deep branches land in LRU tail; per-tier hit counters update
  independently.
- TestBranchCache_RootSurvivesEvictionPressure: root persists when
  tail is overfilled past capacity.
- TestBranchCache_DirtyFlag: PutIfClean refuses dirty entry,
  unconditional Put replaces and clears dirty.
- TestBranchCache_GetDecoded: lazy-decode round-trip with
  BranchEncoder; cells pointer reused across reads.
- TestBranchCache_Invalidate: removes from both tiers.
- TestBranchCache_Clear: empties both tiers, resets stats.
- TestBranchCache_Stats: deterministic format with per-tier counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only addition to BranchCache's package-level comment, capturing
the caller invariants the cache assumes and the conditions under
which the existing concurrent trie satisfies them.

Three caller invariants:
  1. Single writer per prefix at any moment.
  2. Mark-dirty-then-Put discipline for racing writers.
  3. Decoded cells from GetDecoded are read-only (alias entry storage).

The current ConcurrentPatriciaHashed satisfies all three by
construction:
  - Mounts partition by first nibble (disjoint prefix spaces, no
    cross-mount writes to the same key).
  - Root branch written by single sequential fold post-Wait.
  - Mount→root grid roll-up is rootMu-protected (in-memory grid
    only, separate from cache writes).

Doc explicitly flags that any future parallel fold redesign (Stage F
in agentspecs/stage-e-pre-unfold-design.md) MUST preserve these
invariants — particularly the single-writer-per-prefix one, which
breaks if parent branches are written incrementally as children
complete in parallel. The required coordination layer goes at the
orchestrator (per-parent atomic counter; only the last-decrementer
writes the parent), NOT inside the cache. The cache's existing
primitives (atomic dirty flag, thread-safe LRU, atomic root pointer)
are sufficient for that orchestrator to build on.

Motivation: Stage F is likely deferred because the bench data shows
fold isn't the bottleneck for the canonical SSTORE-bloat workload.
But adding the constraint to the cache later (after caching is in
production) is much harder than documenting it now — correctness
regressions from a missed coordination layer can hide for many
blocks. Documenting the contract on the cache itself ensures any
engineer touching parallel fold sees it.

No code change. Doc-only. All tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 7a of the representation-reduction sequence: per-Process
integration of the BranchCache type added in the previous commit.

Plumbing:

  - Trie interface gains SetBranchCache(*BranchCache).
    HexPatriciaHashed implementation propagates to its branchEncoder.
    ConcurrentPatriciaHashed implementation propagates the SAME
    instance to root + all 16 mounts (sharing one cache is correct
    under the concurrency contract — mounts partition prefix space
    by first nibble, so cross-mount writes target distinct keys).

  - InitializeTrieAndUpdates constructs a new BranchCache(default)
    per trie instance and attaches it. Lifetime today = trie
    lifetime = per-Process. Future cross-block persistence work
    (step 7b) lifts this to aggTx scope by constructing the cache
    one layer up and passing it in.

Read path (HPH.branchFromCacheOrDB):
  L1 WarmupCache (existing) → L2 BranchCache (new) → L3 ctx.Branch.
  L3 hits with non-empty result populate L2 so subsequent reads hit
  L2 within the cache's lifetime. L1 stays first because warmup
  workers may have pre-fetched with prefix-walk-derived freshness.

Write path (BranchEncoder.CollectUpdate):
  - MarkDirty(prefix) BEFORE encode work — protects against
    concurrent warmup-style writers racing into PutIfClean during
    the encode (race documented in the cache's Concurrency Contract).
  - Put(prefixCopy, updateCopy) AFTER ctx.PutBranch succeeds —
    replaces the dirty entry with fresh canonical bytes. Single
    writer per prefix per fold step (current sequential fold +
    first-nibble mount partitioning) means no race on this Put.

Lifecycle:
  HPH.Reset clears the BranchCache when called from the root trie
  (gated by !hph.mounted). Mounted subtries share the root's cache,
  so a mount calling Clear would dump entries the root still wants.
  Carries the invariant from PR #19954 commit 1612d56.

Today's expected performance impact: minimal. Per-Process lifetime
means cache is empty at Process start, so first reads always miss.
The cache helps only branches that are read multiple times within
ONE Process — uncommon in current code paths. Step 7b is where
the real perf swing comes from (cross-block persistence so block
N reads hit branches written by block N-1).

This commit is the safe stepping stone: it validates the wire-up
end-to-end (read path + write path + concurrency contract +
lifecycle) without changing perf characteristics. Bench should
match Run I baseline (7.16 mgas/s on canonical SSTORE-bloat block).

All commitment tests pass, lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the BranchCache was constructed inside InitializeTrieAndUpdates,
giving it per-SharedDomains lifetime. SharedDomains is reconstructed for
every batch / many tx boundaries — verified with logs in the prototype
(921 fresh BranchCache constructions per bench run) — so the cache started
cold every batch and never delivered the cross-block hits the design is
about. Per-Process scope kept cache-cleanup correctness simple but defeated
the whole point of caching.

Place the cache on the commitment Domain struct (one cache per aggregator,
matching the pattern on add_execution_context_with_caches where each
domain owns its valueCache). ConfigureDomains attaches the cache once
after domains are initialised — idempotent, lifetime = aggregator
lifetime. AggregatorRoTx exposes BranchCache() returning the commitment
domain's cache, so SharedDomains' construction path can fetch it without
forcing db/state/execctx to import db/state (db/state already imports
execctx via squeeze.go, so the reverse import would create a cycle).

The placeholder commitment.BranchCacheProvider interface lets the SD
construction path use a duck-typed type assertion on tx.AggTx() any.

Plumb the cache through NewSharedDomainsCommitmentContext into
InitializeTrieAndUpdates as an explicit parameter; nil falls back to a
fresh per-init cache so test helpers without an aggregator still get a
valid cache.

Reset behaviour: HexPatriciaHashed.Reset no longer calls Clear on the
cache. Aggregator-scope persistence requires the cache to survive Reset
between commitment calculations. Callers that genuinely need to
invalidate the cache (unwind, fork validation) now call ClearBranchCache
explicitly. The bench is forward-only so this is a safe change for
measurement; an explicit unwind clear path can land in a follow-up
commit when needed.
Adds two debug-gated diagnostics for cross-block-cache-lifetime
investigations. Both off by default — meant to be flipped on when a
correctness regression surfaces and you need to localise *where in the
fold path* the cache started lying, instead of waiting for a downstream
trie-root-mismatch many blocks later.

1. BRANCH_CACHE_VERIFY: branchFromCacheOrDB cross-checks every L2
   (BranchCache) hit against ctx.Branch and increments a divergence
   counter when bytes disagree. Logs the prefix and both byte forms so
   the first divergent read shows up directly in the erigon log.
   BranchCache.VerifyDivergences() exposes the count for assertion in
   tests.

2. BRANCH_CACHE_FINGERPRINT: SharedDomainsCommitmentContext.ComputeCommitment
   emits a "[cache-fp]" log at end of every compute with (block, root,
   cache fingerprint, divergence count). Two builds running the same
   workload can be diffed offline ("first block at which their fp's
   differ") to nail down the first block where lifecycle invariants
   diverged. Fingerprint is an order-independent FNV-1a fold over
   (key-hash, data-hash) pairs across both root and tail tiers.

Adds maphash.LRU.Range so the Fingerprint can iterate the LRU tail
without touching recency (Peek under the hood). The wrapper otherwise
discards original byte-keys on insert; mixing by hash instead of key is
correctness-equivalent up to hash collision, which is acceptable at the
working-set sizes here.

Motivation: today we just chased a wrong-root regression to commit
d673052 (aggregator-scope cache lifetime). Took two full bench cycles
(~12 min) to localise. With the divergence detector running, the first
divergent read would surface in the erigon log within seconds. With
fingerprint logging in two builds (one good, one regressed), the
diverging block boundary would be a single grep across two log files.
The deferred-encoding path (CollectDeferredUpdate + ApplyDeferredUpdates)
parallelises EncodeBranch + merge work at apply time. The cache
correctness consequence: between collect and apply, sd.mem holds the
old state while the cache is also unchanged. When apply finally fires
(end of Process or duplicate-prefix flush mid-Process), sd.mem
advances but the cache isn't touched — CollectDeferredUpdate doesn't
have a cache hook (and wiring one breaks
TestSharedDomain_RepeatedUnwindAcrossStepBoundary +
TestCustomTraceReceiptDomain because it violates an implicit
trie-during-Process cache stability invariant).

The result on the FV bench: cache stays at the very-first-Process's
read-side L3 fallback bytes for prefixes the trie writes, while
ctx.Branch advances to each block's actual state. Divergence on every
hot prefix (root, root-zone children) starting from block 2.

Use CollectUpdate (inline) instead. CollectUpdate writes
sd.mem + WarmupCache + BranchCache atomically at fold time via
PutBranch — cache mirrors sd.mem at every write, the trie sees its own
writes consistently, and the cross-Process cache state matches what
post-FCU MDBX commit produced. Loses the encoder's parallel-encoding
optimisation, but bench profile is I/O-bound, not encode-CPU-bound, so
the trade is favourable.

History (ETL) writes are still inline via DomainPut. Splitting that
("sd.mem inline, history queued for flush at FCU") is the proper
architectural answer to defer the slow disk work without touching the
sd.mem invariant — tracked as a follow-up.
Bisection helper: force branchFromCacheOrDB to skip the L2 BranchCache
read path entirely so every read goes via ctx.Branch (sd.mem → MDBX).
Cache writes still fire so verify-mode can keep comparing cache vs
canonical. Flipping the env at runtime distinguishes "cache holds bad
data" (bench passes further with cache reads off) from "deeper compute
bug" (same failure regardless).

Used in the 2026-05-06 investigation to confirm the cache was actively
corrupting block 13's compute on the canonical SSTORE-bloat bench: with
cache reads on, wrong-trie-root at block 13. With reads off, blocks
13-16 produce the correct roots and the bench advances to a separate
failure at block 17 (unrelated pre-existing bug).

Default off; gate via env DISABLE_BRANCH_CACHE_READS=true.
When verifyBranchCache=true and a cache hit disagrees with ctx.Branch,
sample sd.mem, sd.parent.mem, and tx-direct (MDBX) for the same prefix
and dump all layers in the divergence log line. Comparing those four
byte sequences against the cached and canonical bytes pinpoints which
state layer holds the bytes the cache disagrees with — the rewriter we
need to identify before fixing the canonical-store-divergence bugs the
cache currently exposes.

Decision matrix (read off the log line):

- cache != tx, sd.mem == cache → in-memory writer is fresh, MDBX is
  stale (commit-timing issue).
- cache != tx, tx == ctx.Branch, sd.mem != cache, parent.mem != cache
  → MDBX has been rewritten by something outside the CollectUpdate
  write path (collation, file build, squeeze).
- cache != ctx.Branch, parent.mem matches ctx.Branch but sd.mem doesn't
  → parent merge is the source.
- cache != ctx.Branch, all of sd.mem / parent.mem / tx == ctx.Branch
  → cache itself was populated incorrectly (write-side bug).

Three changes:

1. SharedDomains.ProbeReadLayers (db/state/execctx/domain_shared.go):
   public method that samples sd.mem, sd.parent.mem (private field
   accessed from the same package), and tx.GetLatest. Read-only; copies
   bytes so callers can hold them past tx lifetime.

2. TrieContext (execution/commitment/commitmentdb/commitment_context.go):
   add probeSd + probeTx fields populated at trieContext()
   construction; expose ProbeStateLayers method that delegates to
   sd.ProbeReadLayers. The local `sd` interface gets the
   ProbeReadLayers method too so the duck-typed reference can call it
   without an import cycle to execctx.

3. branchFromCacheOrDB log site (execution/commitment/hex_patricia_hashed.go):
   on divergence with verifyBranchCache, type-assert ctx for the probe
   interface and append sd_mem / parent_mem / mdbx fields to the log
   line. Existing field shape preserved for log parsers; new fields
   are additive.

Pre-existing test failures
(TestSharedDomain_RepeatedUnwindAcrossStepBoundary,
TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight)
are unchanged — they were failing on the stack before this probe
landed and are part of what the divergence work is meant to localise.
Per-write provenance for divergence-detection diagnostics. When a
divergence fires (cache hit disagrees with ctx.Branch), we now log
which write site produced the cached bytes and when, so we can
correlate the bad write against the FCU / build / step timeline.

Tag fields added to branchCacheEntry:
- origin       short label of the write site (e.g. "CollectUpdate",
               "L3-fallback-read")
- writeSeq     monotonic counter per BranchCache instance
- writeTimeNanos unix nanos at write time

Put / PutIfClean signatures take an origin string. Two writers
updated:
- BranchEncoder.CollectUpdate → "CollectUpdate"
- branchFromCacheOrDB L3-fallback Put → "L3-fallback-read"

GetWithOrigin returns bytes plus the metadata; uses a non-counting
peek so it can be called alongside Get without double-counting hits.

The divergence-detection log site at branchFromCacheOrDB now appends
cache_origin / cache_seq / cache_t_ns fields. Combine with the
existing sd_mem / parent_mem / mdbx fields to localise both who wrote
the stale bytes and which layer the canonical value lives in.

Pre-existing test failures
(TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight)
are unchanged from previous commits.
BranchCache previously sat in front of the sd.mem -> parent.mem -> MDBX
read chain (consulted in branchFromCacheOrDB before ctx.Branch). The
shared aggregator-scope cache was written from CollectUpdate by every
SD running commitment compute, including fork-validator SDs whose
writes never reach MDBX. Origin-tagged probe (run-step7b-probe-sdid)
showed five distinct SD pointers writing the same prefix to a single
cache entry, so any reader whose lineage didn't match the most-recent
writer saw bytes that disagreed with MDBX -> wrong trie root from
block 13 onward in the canonical SSTORE-bloat fork bench.

Layering after this change:

  Read:  sd.mem -> sd.parent.mem -> branchCache -> aggTx (MDBX)
  Write: sd.mem only (DomainPut path)
  Flush: sd.mem -> MDBX, then branchCache.Clear()

The cache now mirrors MDBX-flushed bytes only. Writers' in-flight bytes
live in sd.mem above; cache hits below sd.mem are always equivalent to
reading MDBX, so cross-SD pollution is impossible by construction.
Cache fills lazily on the MDBX-read path inside sd.GetLatest, and
clear-on-flush prevents pre-flush bytes from coexisting with new MDBX
state. Per-key invalidation is a follow-up (PR2).

BranchCache entries gain a step field so Get returns (data, step, ok)
matching the aggTx contract. Without this, sd.GetLatest's cache hit
returned step=0 and CheckDataAvailable rejected the boot SeekCommitment
with "commitment state out of date".

Removed:
  - cache.Put from CollectUpdate (commitment.go)
  - cache.Put + divergence detection from branchFromCacheOrDB
    (hex_patricia_hashed.go); now just calls ctx.Branch
  - L3-fallback Put (cache fills via sd.GetLatest now)

Validated on canonical cold bench (run-step8b): first FCU VALID, all
payloads through end VALID, 0 cache divergences, 0 wrong-root errors.
Prior probe bench had 23 divergences and INVALID payloads from the
fail block onward.

Probe scaffolding (SiteIdentity, ProbeStateLayers, divergence counters)
left in place for now; can be stripped in a cleanup follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache state

Update the BranchCache type comment to reflect the architectural
state after the WarmupCache consolidation (steps 2a-2c, 3, 4):

- BranchCache is the single branch cache (WarmupCache deleted)
- Aggregator-scope lifetime, plumbed via BranchCacheProvider
- Passive store: cache itself never reaches into underlying state
- Branch warmer is branch-scoped; no leaf-data prefetch
- Block-processing trie walker takes Updates from executor +
  memoization for siblings — no prefetch needed
- Witness / proof generation walker drives its own state reads;
  if that path turns out to be cold-bound it indicates a need for
  separate account/storage caches (treat as separate concern with
  different scope/lifetime/invalidation; do not regrow the branch
  warmer to cover it)
- disk_sto / disk_acc counters on cache-fp log surface any
  unexpected fall-through to ctx.Account / ctx.Storage as a signal
  of memoization gap or missing walker-side prefetch

Doc-only; no behaviour change.
Adds a third cache tier between the root-pin slot and the LRU tail:
a per-prefix pinned map fed by PinEntry. Pinned entries:

- Never evict (no LRU pressure on this tier).
- Are checked in lookup before the LRU tail, after the root slot.
- Survive Put/SD.Flush updates: a Put for a pinned prefix updates
  the pinned entry in place rather than displacing it. Cross-block
  correctness via the existing dirty-flag invalidation discipline
  is preserved — the new bytes land in the same pinned slot.
- Carry the same metadata as Put-tier entries (step, origin,
  writeSeq, writeTimeNanos) so divergence-detection and Stats
  treat them uniformly.

Sized by the preload policy. Intended consumer is the storage
trunk preload for big contracts (the 'storage root trunk cache for
big accounts' direction): per-contract trunk branches at depth
65-70 get pinned at SD/cache creation, persist for the cache's
lifetime.

PinEntry is the public API; pinnedHits/pinnedMisses atomic
counters are the new stats. PinnedCount() exposes current size for
observability.

Step 2 of the storage-trunk pin prototype.
Adds a function to pre-pin commitment branches for a given contract's
storage subtree. Walks the trie depth-by-depth from depth 64
(storage subtree root) down to maxDepth: reads each branch via the
supplied CommitmentReader, pins it via BranchCache.PinEntry, decodes
the child bitmap, and recurses only into children that actually
exist (no blind 16-way probing).

contractHash is keccak256(address); the trunk lives at the prefix
corresponding to the first 64 nibbles of the storage path.

For a dense storage subtree (the SSTORE-bloat workload's bloat
contract), expected pin count is ~16 + 256 + 4096 ≈ 4.4 K branches
at depth 65/66/67, plus the root at depth 64. Sparse subtrees
produce fewer.

Loading strategy: per-prefix lookups via the reader. Simplest
correct implementation. A bulk seg.Getter range-scan over sorted
.kv would amortize disk seeks (per the parity-cluster observation
in the consolidation memo) but requires building a prefix-range
API on top of recsplit; defer until per-prefix lookup is shown
to bottleneck.

Step 3 of the storage-trunk pin prototype.
Adds a SharedDomains constructor hook that, when PIN_CONTRACT_TRUNKS is
set, fires a one-shot background goroutine to preload the
storage-subtree-trunk of each listed contract into BranchCache's
pinned tier. Format: comma-separated list of 64-hex-char contract
hashes (each is keccak256(addr)).

Mechanism:
- BranchCache.TryClaimPreload (atomic CAS) ensures the goroutine fires
  exactly once per cache lifetime, even though many SDs may be
  constructed (per-tx instances etc.).
- Goroutine wraps sd.GetLatest as a CommitmentReader and calls
  commitment.PreloadContractTrunk for each contract hash, depth 64-70.
- Logs progress per contract on completion.

Closure-over-(sd, tx) is the prototype shape — works for the bench
(both live for the whole process). Production deployment needs to
revisit the lifetime — sd's tx may not outlive the goroutine.

Step 4 of the storage-trunk pin prototype. Bench measurement is
the next step (commit 5).
Previous async-goroutine shape (d204c1b) shared the SD's MDBX
tx with the calling thread. Concurrent cursor use under the same
tx tripped Go's cgo-pointer-pinning runtime check:

  panic: runtime error: cgo argument has Go pointer to unpinned
  Go pointer

surfacing in an unrelated PruneBlocks goroutine during boot.

Make the preload synchronous in the SD constructor for now: same
TryClaimPreload guard (fires once per cache lifetime), but no
goroutine. Boot pays the per-contract preload time as a one-off.

Background-with-own-tx is the proper shape and remains a
follow-up; owning the SD's tx exclusively for the preload
duration is the safe shape until that lands.
… cap

The previous bench (run-pin-trunk-instrumented-cold-cgroup-191347)
hung at SD construction with no [trunk-preload] log lines for 5+
minutes. Erigon never reached "engine RPC ready" so all blocks
came in as SYNCING.

Two changes to localise + bound:

1. **Localisation**: add INFO logs at triggerTrunkPreload entry,
   per-contract starting/done with took, and a 500-prefix
   progress log inside PreloadContractTrunk. Whatever it does
   (or hangs on) is now observable.

2. **Bound**: cap PreloadContractTrunk at 10000 branches
   (vs ~4.4K expected for a saturated 4-level subtree at
   maxDepth=67). Drops maxDepth from 70 → 67 in the trigger
   (depth 64-67 = 16+256+4096 max branches) so we don't
   recurse into the per-slot tail where pinning has no value.
   Preload fails-fast on pathological subtrees rather than
   blocking SD construction indefinitely.
The previous shape (d204c1b) ran triggerTrunkPreload BEFORE
sd.SeekCommitment in NewSharedDomains. Bench result: when the
preload fired (PIN_CONTRACT_TRUNKS set), every subsequent block
came back SYNCING — engine kept attempting backward-download which
fails on this peerless setup, no block ever validated, no cache-fp
ever fired. Without the preload firing, the same binary works
fine (verify-bench PASS at 3.26s).

Hypothesis (untested but matches the symptom): preload's
sd.GetLatest reads ran before SeekCommitment had resolved the SD's
view of the chain head. Pinned values were therefore inconsistent
with the committed state, and the trie compute on the first block
got wrong root → SYNCING → backward-download → no peers → death
spiral with no Flush ever updating the (stale) pinned entries.

Fix is mechanical: move the preload call to after SeekCommitment.
The TryClaimPreload guard still ensures fire-once-per-cache
lifetime.

If subsequent bench shows pin_count > 0 + pin_hit > 0 + blocks
validating normally, the hypothesis is confirmed; if SYNCING
repeats, the bug is something else and we need to revert and
debug differently.
…ParaTrieDB

Previous prototype iterations both broke block validation:
  1. Async sharing the SD's MDBX tx (d204c1b) → cgo
     "unpinned Go pointer" panic from concurrent cursor use.
  2. Synchronous from NewSharedDomains (5a81976 / 4c9ead456d) →
     blocked the engine HTTP handler for ~3-4s during the preload
     window, causing the bench's first NewPayload to be dropped.
     Confirmed: the bench's height=24358001 is ABSENT from the
     erigon log; the next received block (24358002) then fails
     backward-download (no peers) → SYNCING forever.

Restructure:
  - Move trigger from NewSharedDomains to EnableParaTrieDB. The
    latter is called from the staged-sync exec-stage init, NOT
    from request handlers, AND has access to a kv.TemporalRoDB.
  - triggerTrunkPreload now takes the DB (not a tx) and spawns
    a goroutine that opens its OWN tx via db.BeginTemporalRo.
    No shared cursors with the main pipeline; no blocking the
    engine.
  - Reader uses tx.GetLatest directly (not sd.GetLatest) — the SD
    layering would re-introduce shared-state risk and isn't needed
    (pinned bytes don't depend on sd.mem state).

Same TryClaimPreload guard ensures the preload fires once per
BranchCache lifetime regardless of how many SDs construct.

If this works the bench should:
  - Show [trunk-preload] log lines firing once
  - Pin ~4369 branches
  - TEST block cache-fp shows pin_hit > 0 and files_comm < 1K
  - All blocks validate normally (no SYNCING failure)
Make the trunk-pin maxDepth configurable via env (default 67) so we can
sweep depths to find the memory/perf sweet spot without rebuilding.
Bump the per-contract maxBranches cap from 10K to 200K so deeper
saturated subtrees don't get truncated mid-walk.
The previous code disabled the Warmuper for the parallel commitment path
out of concern that it would interact with the calculator's SetUpdates
call. In practice the Warmuper's reads are independent of the
calculator's update buffer — they pre-fetch branch data while EVM
execution runs, and the calculator's SetUpdates only affects
ComputeCommitment's input set, not the warmup paths.

Re-enabling produces a measured 8× throughput improvement on the
perf-devnet-3 SSTORE-bloated benchmark (block 24358306, the canonical
fixture for #20920), restoring the win first observed in Run H/I of the
trie-perf investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds atomic counters (branch hit/miss/evict + bytes-served, account
hit/miss, storage hit/miss) to WarmupCache, plus a Stats() string
formatter and ResetStats() for per-Process accumulation reset. Counters
are updated on every Get/Evict path (existing Put paths were already
counted via cache size).

Useful for:
- Confirming warmup effectiveness in production logs
- Per-block diagnostics when investigating commitment perf
- Future per-pool dashboards once a coordinator/observability layer
  lands (tracked separately)

No behavior change beyond the counter updates themselves. Stats() format
is one line, suitable for embedding in the existing
LogCommitments output. ResetStats() zeros counters without touching
cached data — useful for per-Process windowed measurement. Clear() also
resets counters along with the data, since data and counters were
accumulated together.

Test: TestWarmupCache_Stats covers hit/miss/evict accounting across
branch/account/storage paths and verifies Stats() format + ResetStats()
preserves data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure refactor — behavior preserved. Lifts the unfold-loop from
HexPatriciaHashed.followAndUpdate into its own method, parameterized
over (hashedKey, plainKey) and intended as the per-key traversal
primitive that future orchestrators consume.

Today only followAndUpdate calls it, replacing the inline loop with a
one-line call. The extracted method preserves the existing metric
attribution (StartUnfolding) and trace-print behavior verbatim.

Why now: this is the second step in the representation-reduction
sequence (see agentspecs/trie-data-pipeline-complexity-tax.md). Future
PRs will introduce orchestrators that drive unfold-only walks of
touched-key paths to fill cell state without going through the full
fold/update cycle:

  - Cache populator (decoded-payload BranchCache) needs to walk a
    touched-key path and capture the cells encountered, without
    triggering fold or modifying the trie's update buffer.
  - Stage E parallel pre-unfold orchestrator drives unfoldKeyPath
    across multiple HexPatriciaHashed instances concurrently to
    pre-warm trie state before commit.

Both consume the same primitive. Centralising it now means each future
orchestrator is a thin wrapper rather than a duplicate of the
unfold-loop logic.

Tests: full commitment test suite passes without modification (all 8+
test files in execution/commitment/), confirming the refactor preserves
followAndUpdate's behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the carry-as-is correctness invariants from the PR #19954
investigation as scaffolding on the existing WarmupCache:

  - branchEntry.dirty atomic.Bool — signals "stale until cleared"
  - PutBranchIfClean(prefix, data) bool — skips write if entry dirty
  - MarkBranchDirty(prefix) — mark for later refusal of stale puts

These are scaffolding additions; no callsites use them yet. Existing
PutBranch unconditionally overwrites and clears any prior dirty flag
(creates a fresh entry), preserving today's semantic exactly for
existing callers.

Why now: the prototype investigation (see
agentspecs/commitment-cache-prototype-dev-context.md) found that
inline-invalidate-on-write is incompatible with deferred encoding —
update-in-place breaks correctness because there's a window between
fold (computes hash, holds new state) and encoder (writes encoded
bytes) where readers see stale cached bytes. The reth-research
(agentspecs/reth-1ggas-research.md §4) calls the dirty-flag pattern
out as the design that resolves this without forcing synchronous
encoding: the encoder marks the entry dirty BEFORE its own write
completes, so any racing read knows to bypass the cache for that key.

Today's WarmupCache lifecycle (per-Process, warmup completes before
fold begins) does NOT exhibit this race — these invariants are
infrastructure for the future cross-block persistence work where
warmup-style writers can outlive their parent Process.

Tests:
- TestWarmupCache_DirtyFlag: PutBranchIfClean refuses dirty entry,
  unconditional PutBranch clears dirty.
- TestWarmupCache_DirtyFlag_MarkAbsentKey: marking absent key is
  no-op (no panic, no entry created).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an additive read method that returns cached branches in already-
decoded form, lazy-decoding on first decoded-read per entry and caching
the parsed cells for subsequent reads. No existing callsite changes;
existing GetBranch / GetAndEvictBranch / PutBranch callers continue to
work with encoded bytes unchanged.

Why now: this is step 5 of the representation-reduction sequence (see
agentspecs/trie-data-pipeline-complexity-tax.md). The trie's read path
currently does GetBranch (encoded) + DecodeBranchInto on every cache
hit — paying decode CPU on every read. Switching that callsite to
GetBranchDecoded (in a separate later commit) eliminates the redundant
decode.

The ENCODED form remains the source of truth — the encoder needs it
for the merge-with-prev step, and it's what gets written to disk via
PutBranch. The decoded form is derived lazily and cached alongside
the entry. When PutBranch overwrites an entry, the new entry starts
fresh and the next decoded read re-derives from the new bytes.

API design notes:
- Returns (bitmap, *[16]cell, ok). Caller derives touchMap/afterMap
  from bitmap based on its own deleted-vs-present-after context — the
  cache stores cells independent of that context so the same entry
  serves both readers.
- The returned *[16]cell aliases entry-owned storage. Read-only
  consumption is safe across concurrent calls (decode runs at most
  once per entry via sync.Once); MUST NOT be modified in place.
- Decode error → ok=false (don't count as hit OR miss; caller falls
  through to canonical re-read).

Tests:
- TestWarmupCache_GetBranchDecoded: round-trip equality with direct
  DecodeBranchInto, plus same-pointer reuse on repeat reads.
- TestWarmupCache_GetBranchDecoded_Miss: behaves like GetBranch on
  absent keys.
- TestWarmupCache_GetBranchDecoded_TruncatedData: graceful failure
  on corrupt entry (no panic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng payloads

Closes a timing hole that surfaced once we wired the aggregator-scope
BranchCache: between an FCU completing and the next newPayload, MDBX
hasn't been committed yet (RunLoop's CommitCycle only fires under
memory pressure), but currentContext.mem holds the latest writes from
MergeExtendingFork. The fresh doms created in ValidateChain has no
parent and a fresh roTx, so its ctx.Branch reads stale-MDBX while the
aggregator-scope BranchCache (populated by the prior FV's
CollectUpdate writes) holds the fresh state. That's the cross-newPayload
divergence pattern observed in the bench (8-61 divergences and
wrong-trie-root errors at block 3-14 across runs).

Set doms.SetParent(currentContext) when the new payload extends the
current canonical head (header.ParentHash == ReadHeadBlockHash). For
fork payloads that don't extend head, leave parent unset:
unwindToCommonCanonical below reverts doms's view to the common
ancestor, and exposing currentContext.mem (post-divergence canonical
writes) via the parent chain would shadow the unwound base and break
fork validation. Verified by TestReorgsWithInsertChain — the
"head-only" predicate is what the current single-canonical-chain SD
topology supports.

A proper per-branch SD lineage (each fork's validation chains to the
last validated SD on its own branch, not always currentContext) is the
follow-up needed for concurrent multi-fork validation. The current
design supports a single canonical chain only; that's enough to close
the divergence we have today, with the lineage extension tracked
separately.
Foundation for the "Snapshot vs MDBX read-cost equivalence"
investigation (memory: snapshot-vs-mdbx-performance-equivalence.md).

This file produces the headline ratio that quantifies the gap the
investigation aims to close: warm-cache reads from snapshot .kv files
should cost the same as warm-cache reads from MDBX (same disk, same
page cache). H0 measures how far apart they are today.

Five sub-benches:
  - MDBX_path        full chain, key in MDBX
  - File_path        full chain, key in file
  - Forced_file_path file-only debug path, file-resident keys
  - Forced_db_path   DB-only debug path, MDBX-resident keys
  - Bloom_miss_path  file-only debug path, MDBX-resident keys
                     (file misses in xorfilter for every probe)

Two operating modes; only synthetic is wired in this commit:

  - Synthetic (testDbAndAggregatorBench fixture): writes 64 full
    16-tx steps, BuildFiles + repeated PruneSmallBatches drains all
    but the tip step into files. Phase 2 keys at txNums past the
    built-step boundary stay in MDBX. Partition by *actual* residency
    after setup so bench inputs match where keys really live.

  - Real-datadir (--snapdatadir flag): TODO. Opens an existing
    chaindata+snapshots datadir read-only and picks keys via cursor
    iteration / .kv decompressor walk. Required for production-
    relevant numbers since synthetic has tiny files and small values.

Initial synthetic results on AMD EPYC 4244P (Accounts domain):
  MDBX_path        173 ns/op
  File_path        226 ns/op   (1.31x MDBX)
  Forced_file_path  30 ns/op
  Forced_db_path   158 ns/op
  Bloom_miss_path   30 ns/op

Synthetic dataset is too small to surface the production gap that
pprof shows (xorfilter at 35% CPU on real bloat workload). H1-H4
benches and the real-datadir mode are the next steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the --snapdatadir flag path to the H0 bench. Opens an existing
chaindata + snapshots datadir read-only via the same recipe as
cmd/integration (mdbx Accede + state.New + temporal.New) and picks
keys by cursor-walking the per-domain values table.

Pragmatic adjustments:

  - On heavily-pruned production datadirs (perf-devnet-3-run was
    100% pruned), every MDBX values-table row is step-shadowed by a
    file, so getLatestFromDb returns ok=false. The MDBX-side
    sub-benches skip in this case; File_path numbers stand on their
    own and the synthetic MDBX_path baseline serves as the cross-
    mode comparator.

  - skipIfEmpty short-circuits per sub-bench rather than failing the
    whole run, so we can still get the file-path numbers.

First production numbers (AMD EPYC 4244P, AccountsDomain, 2012
file-resident keys from perf-devnet-3-run, fully pruned):

  File_path        211 ns/op   (synthetic was 226; essentially same)
  Forced_file_path  30 ns/op   (synthetic was 30; identical)

Surprising finding: real .kv file reads cost the same as synthetic.
This means production bloat-workload bottleneck is NOT in
getLatestFromFiles — it must be in HistorySeek (.ef history files
walked by HistoryStateReader.GetAsOf). The GetAsOf shortcut work
flagged in getasof-regression-suspect.md is the right lead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related additions to the snapshot-vs-MDBX perf-equivalence
investigation (memory: snapshot-vs-mdbx-performance-equivalence.md):

1. H_GetAsOf bench (db/state/snapshot_vs_mdbx_bench_test.go).
   New runHGetAsOf with four sub-benches for HistorySeek-via-GetAsOf
   on file-resident keys: GetLatest_baseline, GetAsOf_recent (asOf
   near endTxNum), GetAsOf_mid (asOf at endTxNum/2), GetAsOf_floor
   (asOf=1). Tests the path the calculator's HistoryStateReader.Read
   uses, which is distinct from getLatestFromFiles measured in H0.

   Real-datadir results on perf-devnet-3 (AccountsDomain, endTxNum=2.9B):
     GetLatest_baseline 202 ns/op   0 allocs
     GetAsOf_recent     570 ns/op   0 allocs   <- 2.8x baseline, no result
     GetAsOf_mid        235 ns/op   5 allocs
     GetAsOf_floor      196 ns/op   4 allocs

   GetAsOf_recent (the calculator's pattern after PR #21010) scans
   the .ef looking for a record at-or-after endTxNum-1, finds none
   (most keys haven't changed in the last txNum), falls through to
   GetLatest. The 370ns/op overhead vs GetLatest is wasted scan.
   Confirms the GetAsOf shortcut described in
   getasof-regression-suspect.md as a real lever, though small in
   absolute terms (~2ms/block on the bloat workload).

2. Surface "took" + "keys" on the existing [commitment][cache-fp]
   Info log line (commitmentdb). Was already computed in the
   debug-level "[commitment] processed" log, but the bench runs with
   --log.dir.disable so debug logs aren't captured.

   This made it possible to attribute the 4.3s gap inside
   newPayload(TEST block) directly: the calculator's ComputeCommitment
   takes 4220ms for the 5910-key bloat block — 91% of the entire
   block wall time. Per-key cost is ~700us, consistent across blocks
   of all sizes. The actual perf lever for the bloat workload is
   making per-branch ComputeCommitment cheaper, not file/state reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three groups of fields to the existing
[commitment][cache-fp] log line so the calculator's per-block
behaviour is observable at Info level (the bench runs with
--log.dir.disable, so debug logs aren't captured):

  - took, keys: ComputeCommitment wall + key-count for the block.
    Pre-existing internally, now surfaced.
  - load, skipped, reset: process-cumulative counts of
    computeCellHash decisions:
      * load    = had no memoized stateHash, fetched value from DB
      * skipped = had memoized stateHash, reused without fetch
      * reset   = had stateHash but had to invalidate it
    Surfaced via new commitment.SkipLoadResetCounters().
  - files_acc / files_sto / files_code / files_comm: per-domain
    file-read counts pulled from sd.Metrics().Domains[domain].
    Decomposes the aggregate `files=N` from the [domain reads]
    log line into its actual sources (e.g. on the SSTORE-bloated
    block the 32k file reads break down as 5.9k Storage value
    loads + 26.6k Commitment branch reads + a handful of others).

All counters are cumulative; per-block deltas are obtained by
subtracting consecutive cache-fp lines.

Pure observability — no behaviour change. Used as the measurement
framework for the snapshot-vs-MDBX perf-equivalence investigation
and the follow-on commits that target specific levers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mark Holt and others added 21 commits May 22, 2026 08:14
Three integration points:

1. Construction in NewSharedDomainsWithTrieVariant — alongside the
   branchCache assignment. Inert until Bind. Skipped when
   DISABLE_ADAPTIVE_PIN env is set (defensive escape hatch).

2. Bind in EnableParaTrieDB — installs the cache miss-callback so
   the controller starts observing per-contract pressure. Runs AFTER
   the env-driven PIN_CONTRACT_TRUNKS forced preload, so operator
   overrides land before the controller starts attribution.

3. OnBlockComplete from SD.Flush — fires after FlushWithCallback has
   updated the cache for this batch. Reader uses the in-flight Flush
   tx so initial-view preload of newly-promoted contracts sees the
   just-committed bytes. Uses sd.txNum as the controller's "tick" —
   per-batch granularity, so cooldown thresholds map to batches not
   blocks (tunable via controller config if finer granularity is
   needed later).

Operator override paths preserved:
- PIN_CONTRACT_TRUNKS — forces specific contracts to pin regardless
  of adaptive policy. Useful for known-hot mainnet contracts and
  diagnostic profiling.
- DISABLE_ADAPTIVE_PIN — disables the controller entirely. The
  branch-cache layer stays active; only the auto-promotion logic
  is skipped.

Bench-validated: with PIN_CONTRACT_TRUNKS in use, controller fires
but no promotions occur (the bloat contract's reads go to the
pre-pinned tier, not the miss callback). Baseline reproduces
byte-for-byte (took=1.22 s, pin_count=89365, divergences=0).
Surface the existing internal counters via the diagnostics/metrics
package so dashboards can reason about pin effectiveness without
log scraping. Six metrics:

  commitment_branchcache_pinned_hits_total       (counter)
  commitment_branchcache_pinned_misses_total     (counter)
  commitment_branchcache_pinned_entries          (gauge)
  commitment_adaptive_pin_promoted_total         (counter)
  commitment_adaptive_pin_extended_total         (counter)
  commitment_adaptive_pin_demoted_total          (counter)
  commitment_adaptive_pin_active_contracts       (gauge)

Per-contract labels intentionally omitted — cardinality risk is real
(mainnet could see hundreds of promoted contracts over a process
lifetime) and the structured [adaptive-pin] log line gives
per-contract detail when needed for debugging. Total counters are
sufficient for "is this lever working" dashboard signals.

Refresh cadence:
- pinned-entries gauge updates at SD.Flush via PublishMetrics — once
  per batch, avoids hot-path cost
- AdaptivePinController updates promote/extend/demote/active in
  OnBlockComplete (also per-batch, same call site)
New trunk_pin_test.go covers the additions:

  TestPinEntry_AndPinnedCount        — basic pin-tier mechanics
  TestInvalidate_RemovesFromPinned   — pin-aware Invalidate
  TestGetWithOrigin_ChecksPinnedTier — pin-aware GetWithOrigin
  TestClear_ResetsPinnedTier         — pin-aware Clear
  TestMissCallback_FiresOnTripleMiss — cache hook semantics
  TestContractHashFromPrefix_*       — storage-trunk decode
  TestContractTrunkPreload_PhasedRun — Initial+Extend split
  TestAdaptivePinController_PromoteOnThreshold
  TestAdaptivePinController_DemoteOnCold

Updated branch_cache_test.go's TestBranchCache_Stats to match the
new Stats string format (added pinned-tier columns; "tail entries"
moved past the pin block).

Deleted execution/commitment/warmup_cache_test.go — orphaned by the
WarmupCache type deletion in commit ded378f. Test referenced
NewWarmupCache which no longer exists; removing unblocks
`go test ./execution/commitment/...`.
Cleanup pass before PR. Removes scaffolding that supported the
SSTORE-bloat investigation but isn't operator-relevant for production:

Env knobs removed:
- TRUNK_PROBE — logged contract hashes of depth-64 commitment prefixes
  on first sight. Was used to identify the bloat contract; now obsolete.
- DISABLE_BRANCH_CACHE_READS — A/B switch for cache-vs-no-cache
  benchmarking. The cache is now a fundamental architecture
  component; DISABLE_ADAPTIVE_PIN already covers the operator-safety
  use case.
- BRANCH_CACHE_FINGERPRINT — gated the [commitment][cache-fp] log
  line that emitted ~30 fields of investigation-era counters per
  block. Replaced by Prometheus metrics and the [adaptive-pin] log.
- Dead variables in hex_patricia_hashed.go: verifyBranchCache and
  disableBranchCacheReads were declared but never referenced
  (callers were removed in the WarmupCache deletion).

Code removed:
- The 90-line [cache-fp] log block in commitmentdb/commitment_context.go
  that referenced ~10 bench-only counter helpers
  (SkipLoadResetCounters, ContainsHashStats, SstoreClassificationCounts,
  WarmerBranchOutcomeStats, FileReadCount, UniqueLenBuckets, etc.)
- The trunk-probe path in db/state/changeset/state_changeset.go
- Stale comment references to verifyBranchCache

The bench-only counter helpers themselves (SkipLoadResetCounters etc.)
are now unreferenced but left in place — a follow-up cleanup commit
can remove them once we confirm no other callers exist.
Apply gofmt to four files where the perf-stack work left formatting
differences from main's conventions. None are functionally meaningful;
all are whitespace.

Drop db/state/snapshot_vs_mdbx_bench_test.go — 714-line bench test
file from the snapshot-vs-MDBX equivalence investigation
(snapshot-vs-mdbx-performance-equivalence.md memo). Out of scope for
this PR; the bench harness can return as a standalone artefact when
that investigation resumes.
…+ tests)

Replaces the per-prefix GetLatest BFS with a sequential range scan: a
contract H's storage-trunk branches live in two CommitmentDomain key
ranges (split by HexToCompact's HP-flag parity), so contractTrunkKeyRanges
computes [0x00||H, nextSubtree(0x00||H)) for even-depth branches and
[HexToCompact(H_nibbles||0), nextSubtree(HexToCompact(H_nibbles||15)))
for odd-depth, and LoadBulk drains both via a CommitmentRangeReader,
collects every branch of the subtree, sorts shallowest-first (the
BFS/highest-reuse ordering — a lexicographic scan would be depth-first
within a subtree, which under a budget pins a deep narrow slice), and
pins from the front until the byte budget is hit. Re-scanning on a later
LoadBulk extends the pinned set (resumable/phased, like Run).

This commit is the algorithm + ContractTrunkPreload.LoadBulk +
PreloadContractTrunkBulk + unit tests (range computation, nextSubtree,
shallowest-first-within-budget, phased extend). The per-prefix Run/BFS
path is left intact; wiring triggerTrunkPreload to provide the real
CommitmentRangeReader (aggTx.DebugRangeLatestFromFiles) and call LoadBulk
is the next commit.

Refs perf-research-tracker.md R3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
triggerTrunkPreload's goroutine now builds a CommitmentRangeReader over
tx.RangeLatest(CommitmentDomain, from, to) and calls
PreloadContractTrunkBulk; PIN_TRUNK_BULK=false reverts to the per-prefix
GetLatest BFS (PreloadContractTrunk). step is passed as 0 (the pinned-tier
step is a coherency hint, corrected in-place by a later same-prefix write;
the bulk scan reads the latest committed value).

Refs perf-research-tracker.md R3. Adaptive-controller path (Run-based
extend) still uses the per-prefix reader — to be migrated to LoadBulk in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AdaptivePinController.OnBlockComplete now takes both a CommitmentReader
(per-prefix BFS) and an optional CommitmentRangeReader (bulk range scan);
initial-view promotes and per-block extends use LoadBulk when rr != nil,
else fall back to Run. domain_shared.go's Flush-time hook builds rr over
tx.Debug().RangeLatest(CommitmentDomain, ...) when PIN_TRUNK_BULK
(default true) and passes it. Adds ContractTrunkPreload.HasMore() (BFS
queue non-empty OR last bulk scan was budget-limited) so the extend gate
works for both paths; lastBulkHadMore tracked in LoadBulk. Test
OnBlockComplete calls pass nil for rr (exercise the controller policy on
the BFS path; LoadBulk itself is unit-tested separately).

Refs perf-research-tracker.md R3. Remaining: bench validation on the
SSTORE-bloat fixture (needs perf-devnet-3 datadir re-downloaded).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The R3 'bulk preload via two-parity range scan' (LoadBulk + Debug().RangeLatest
+ adaptive wiring) regressed the SSTORE-bloat docker bench 13.2 -> 5.6 mgas/s
(below the no-pin baseline). Root cause: the commitment domain .kv files are
hashmap-indexed (.kvi, no ordinal lookup) and total 515 GB, so RangeLatest
over them degrades to a linear decompress-and-skip from byte 0 -- orders of
magnitude slower than the 89k targeted point lookups it replaced; LoadBulk also
collects->sorts->pins (no incremental benefit, loses the preload-vs-block race);
and the debug raw iterator returns un-dereferenced (shortened-key) branch values.

Reverts domain_shared.go / adaptive_pin.go / preload.go / trunk_pin_test.go to
the trunk-pin BFS baseline; deletes preload_bulk.go + preload_bulk_test.go. The
replacement (two-phase parallel preload: bulk DB range-cursor + parallel
file-only wave-BFS) is designed in agentspecs/commitment-branch-preload-redesign.md
and lands in follow-up commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the per-prefix serial BFS-via-GetLatest preload (which crossed the
MDBX boundary ~89k times per contract and finished in ~31s, losing the
preload-vs-test-block race in the no-warmup harness) with a parallel,
breadth-first, file-layer-only wave walk:

  PreloadContractTrunkParallel walks the contract subtree one depth level
  (wave) at a time; each wave's branches are resolved via a single batched
  resolver (BatchBranchResolver) that the caller fans out across worker RoTxs in
  contiguous, key-sorted slices -> locally sequential reads that drive page-cache
  readahead. Reads go via DebugGetLatestFromFiles (file layer only: no MDBX
  cursor, no per-key cgo crossing; values already dereferenced via
  replaceShortenedKeysInBranch). Breadth-first so a budget-truncated run still
  pins the shallowest, highest-reuse branches.

triggerTrunkPreload wires it (PIN_TRUNK_PARALLEL, default on; =false falls back
to the per-prefix BFS): N=GOMAXPROCS worker RoTxs (each goroutine its own tx --
MDBX cursors aren't concurrent-safe under one tx), contiguous-slice fanout.

File-layer-only is correct for the from-cold-snapshot trigger path (MDBX holds
nothing for the subtree). The MDBX-resident set (steady-state) is handled by a
separate range-cursor prefetch -- see agentspecs/commitment-branch-preload-redesign.md;
not in this cut-down.

Unit tests: PreloadParallel_{FullBudget_BreadthFirst, BudgetCutoff,
NotFoundStopsDescent, ResolverError}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PreloadContractTrunkParallel resolved an entire depth wave before applying the
RAM budget. Depth 69 of the SSTORE-bloat trunk is ~1.3M keys but only ~20k fit
the remaining budget -- so it issued >1M file lookups to pin ~20k, and that
over-fetch ran concurrently with the test block (docker bench: 10.1 mgas/s,
preload 38.5s vs the BFS baseline's 13.2 / 31s -- depths 64-68 pinned in ~1-2s,
then 31s stalled on the over-fetched depth-69 wave).

Now: each wave's fetch is capped at remaining_budget/minEntryBytes + 1, where
minEntryBytes (= overhead + shortest depth>=64 compact key) is a true lower
bound on a pinned entry's cost -- so it never under-fetches and the budget is
guaranteed exhausted within the (truncated) wave; if a wave is truncated we stop
after it. Depth 69 goes from ~1.3M lookups -> ~tens of thousands -> ~seconds,
the preload finishes well before the block, and the over-fetch contention is
gone. Adds TestPreloadParallel_CapsWaveFetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orrectness)

The parallel wave-BFS preload was file-layer-only: a branch present in MDBX but
not yet flushed to a .kv file is invisible, and one present in both (DB fresh,
file stale) resolved to the stale value. Correct from a cold snapshot (MDBX
empty for the subtree), wrong on a live node.

Phase 1: triggerTrunkPreload now first scans the contract's two CommitmentDomain
key ranges (ContractTrunkKeyRanges, re-added with NextSubtree from the deleted
preload_bulk.go) over TblCommitmentVals via one DupSort cursor per range
(dups are invertedStep||value, latest-first, so Seek + NextNoDup walks the
latest value per key) into an in-memory dbBranches map, bounded by the RAM
budget. PreloadContractTrunkParallel now takes that map: per wave, DB-hits are
resolved from memory (the freshest values, pinned first), file-misses go through
the batched file resolver (capped by the budget remaining after the DB-hits).
On a from-cold-snapshot start dbBranches is empty -> behaviour unchanged.

Tests: TestPreloadParallel_DbHitsShadowFiles (DB value wins over a stale file
value and its bitmap drives descent), TestNextSubtree, TestContractTrunkKeyRanges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestCommitmentPreloadPhase1Cursor populates TblCommitmentVals with real
commitment-domain branch entries at multiple steps for two contracts, then walks
the contract's two parity ranges (commitment.ContractTrunkKeyRanges) via
CursorDupSort + Seek + NextNoDup and decodes v[8:] as the value — the same
mechanics as triggerTrunkPreload's dbPrefetch. Asserts:
  - every contract-A branch appears in the cursor walk with its LATEST value
    (the dup decoded by skipping the 8-byte invertedStep prefix is correct);
  - the foreign contract's keys do not leak into contract-A's parity ranges
    (ContractTrunkKeyRanges is tight).

This covers the cursor mechanics that the preload-side fake-resolver unit
test (commitment.TestPreloadParallel_DbHitsShadowFiles) does not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mode wiring

Adds the per-block adaptive extension path on top of the parallel wave-BFS
preload. The new ContractTrunkPreloadParallel.Run method can be called
incrementally with a step budget — each call advances zero or more complete
waves and preserves un-pinned items at the current depth + accumulated
children at depth+1 for the next call. This is the resumable counterpart
to ContractTrunkPreload.Run (the serial BFS), so the adaptive controller
can use the same wave-BFS file-only resolver for promote AND per-block
extension instead of falling back to the serial CommitmentReader BFS.

AdaptivePinController grows SetParallelMode(factory, provider) which
installs a per-call resolver factory + dbBranches provider. The factory
is invoked once per OnBlockComplete (returning a tx-scoped resolver and a
release callback). promoteLocked/runExtensionLocked dispatch to the
parallel path when the factory is set, falling back to the serial path
when it isn't (or when the factory returns an error). adaptiveContractState
holds either *ContractTrunkPreload (serial) or *ContractTrunkPreloadParallel
(parallel); pinnedTotal/usedBytes/queueRemaining/pinnedPrefixes are mode-
agnostic helpers used by the promote/extend/demote loop.

SharedDomains.Flush installs the per-call factory + provider before calling
OnBlockComplete, clears them after. The factory closes over the in-flight
Flush tx (ttx.Debug().GetLatestFromFiles for file-only batch resolution);
the provider walks TblCommitmentVals via CursorDupSort over the dual-parity
ranges to populate the MDBX-resident branch overlay per contract.

Tests:
  - execution/commitment/preload_parallel_test.go +9 tests
    covering ResumeAcrossSteps, RunAfterCompleteIsNoOp, StepBudgetCaps,
    ResumeAfterResolverError, DbBranchesPerStep, PinnedPrefixesAccumulate,
    NilCacheError, NilResolverError, BadHashLengthError
  - execution/commitment/trunk_pin_test.go +5 tests
    covering ParallelMode_PromoteUsesParallelResolver,
    ParallelMode_ExtendUsesResumableState,
    ParallelMode_FactoryErrorFallsBackToSerial,
    ParallelMode_DemoteInvalidates,
    ParallelMode_ReleaseCallbackInvoked
  Closes the "hypothesis C" gate from agentspecs/perf-regression-manifest.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KeyCommitmentState ("state") is a commitment-domain key, so the BranchCache
that sits in front of the CommitmentDomain was caching it like a trie
branch. But it is the trie checkpoint (txNum/blockNum/root), not a branch:
it changes every block, and serving a stale copy restores the trie to the
wrong state — producing a wrong commitment root (observed as engine-x
"Wrong trie root of block 1", and depended on an explicit
branchCache.Invalidate(KeyCommitmentState) workaround in squeeze).

Reject KeyCommitmentState in BranchCache.Put/Get/GetDecoded/PinEntry so the
cache cannot hold the checkpoint by construction — no caller can pollute it
and no external invalidation is required. The key's single definition moves
to the commitment package so the cache can reference it without an import
cycle; commitmentdb.KeyCommitmentState now aliases it.

Follow-up: a coordinated, version-keyed (txNum) rolling-buffer cache could
re-introduce checkpoint caching safely — tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reorg-path NewPayload validation (TestReceiptRootValidationAfterReorg)
computed a wrong commitment trie root, flakily, racing the gate-2
background commit.

Root cause: fork-validation in ValidateChain created its SharedDomains
with no parent. unwindToCommonCanonical then builds its unwind set from
GetDiffset, which on a fresh SD finds nothing locally and falls back to
ReadDiffSet on the tx — and the canonical chain's diffsets are not yet on
disk while a background commit is in flight. With no unwind set, the
unwind ran silently empty: SharedDomains.mem.unwindChangeset was never
populated, so the commitment BranchCache was left unmasked and served
pre-unwind (canonical-lineage) branches into the unwound read path.

Fix: set the parent on the fork-validation SD (previously head-extending
payloads only), and have SharedDomains.GetDiffset resolve through the
parent chain. The canonical generation's pastChangesAccumulator then
supplies the diffsets, the unwind builds a complete unwind set, and
TemporalMemBatch.getLatest resolves every unwound key from the unwind set
before the BranchCache is ever consulted — the unwind set masks the
cache by construction. The parent does not shadow the unwound base: the
unwind set covers exactly the keys the unwound canonical blocks touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lint (prealloc): the rest slice length is known from the source ranges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unwind

SetHead unwinds the commitment domain (via pipelineExecutor.RunUnwind)
and truncates the TxNums index back to targetBlock. The unwind removes
post-targetBlock commitment entries from sd.mem and MDBX, but the
aggregator-scope BranchCache still holds the pre-unwind value of
KeyCommitmentState (pointing at the original head's blockNum).

sd.Flush after the unwind calls FlushWithCallback, which only updates
cache entries for keys present in sd.mem — KeyCommitmentState is not
re-written by the unwind, so it stays unchanged in cache.

The next FCU's NewSharedDomains.SeekCommitment hits the cache, reads
the original-head blockNum, and trips ErrBehindCommitment when
compared against the truncated TxNums.Last() — which returns
targetBlock. updateForkChoice returns ExecutionStatusTooFarAway and
the FCU fails even though the chain state is fully consistent.

Add a SharedDomains.ClearBranchCache method that empties the
aggregator-scope BranchCache (all tiers). Call it from SetHead after
sd.Flush and before tx.Commit so subsequent FCUs see the post-unwind
MDBX state.

Bisect: TestSetHeadCanonicalCleanup first failed inside the
24c0edd9f5b817 window (WarmupCache deletion sequence + the
FlushWithCallback method introduction). The test has been failing
since the cache was integrated; this is the integration-gap fix
needed to enable cache use for perf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ResetExec wipes the commitment table but the aggregator's in-memory
branchCache was holding entries that referenced the just-deleted trie
nodes. A from-0 re-exec then served those stale entries when computing
block 0's commitment, producing a wrong trie root and dropping every
genesis-allocated balance that no subsequent block touched.

Serial exec happened to skirt the issue; parallel exec hit it directly
(test_account_access/EXTCODESIZE-contract behaviour on mainnet block
46147 against 0xA1E4380A3B1f749673E270229993eE55F35663b4, and the
TestFromZero_GenesisAllocPreservedAfterResetReExec parallel subtest).

Drop the cache as part of ResetExec so it repopulates from the freshly-
wiped commitment table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@taratorio taratorio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's the point of EngineX tests - to re-use nodes with same fork+preAlloc for tests that share it since init time is expensive

any issues found in these tests are related to caching and re-orgs

@mh0lt
Copy link
Copy Markdown
Contributor Author

mh0lt commented May 25, 2026

Closing: probe-only PR for the #21380 EEST investigation (engine-x test runner force-evict). The investigation concluded (root cause + fixes landed via the parallel-exec correctness arc into #21017 / #21387). The diagnostic confirmed the cache-pollution hypothesis but the production fix was implemented differently (in-place reset rather than full tester eviction). Closing as redundant.

@mh0lt mh0lt closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants