Skip to content

execution/commitment: stop cloning bytes at TrieContext.Branch#21524

Open
yperbasis wants to merge 2 commits into
mainfrom
yperbasis/trie-context-branch-no-clone
Open

execution/commitment: stop cloning bytes at TrieContext.Branch#21524
yperbasis wants to merge 2 commits into
mainfrom
yperbasis/trie-context-branch-no-clone

Conversation

@yperbasis
Copy link
Copy Markdown
Member

Summary

`TrieContext.Branch` clones every returned byte slice via `common.Copy` (= `bytes.Clone`). Profiling shows this single site is responsible for 37.86 GB / 60s (~631 MB/s) of allocated bytes at tip — about 10% of total alloc traffic and 59% of all `bytes.Clone` calls — for no benefit: every downstream consumer either reads inline (`cell.fillFromFields` copies into pre-allocated arrays; `merger.Merge`, `trie_reader`, `unfoldBranchNode`, `EncodeBranch` consume the bytes synchronously) or clones at its own queue boundary (`getDeferredUpdate` clones both `prefix` and `prev` when storing in the deferred-update pool).

The upstream Clone was a third copy of bytes nothing needed to retain.

Change

  • Drop the `common.Copy` at `commitment_context.go:809`; `Branch` now returns a borrowed slice valid for the lifetime of the current `ComputeCommitment` call.
  • Document the new contract in the method comment.
  • Update the test (`Test_TrieContext_BranchCopiesData` → `Test_TrieContext_BranchBorrowsData`) to verify the new aliasing guarantee instead of the old ownership one.

Test plan

  • `go build ./...` clean
  • `make lint` clean
  • `./execution/commitment/...` unit tests pass
  • Live-rig before/after newPayload latency comparison: erigon-main (default) vs erigon-main + this change, both MDBX, same datadir snapshot, prysm as the single CL driving both.

Profile context

60-second steady-state CPU+alloc capture on the live-rig (MDBX, in-flight at tip) was used to identify the hotspot. Top alloc consumers pre-fix:

```
64.3 GB / 60s bytes.Clone (cum)
└─ 37.86 GB via TrieContext.Branch
└─ 11.27 GB via encodeDeferredUpdate (legitimate, queue boundary)
└─ 11.09 GB via getDeferredUpdate (legitimate, queue boundary)
└─ ... rest
79.0 GB / 60s sortableBuffer.Put (etl pooling - separate problem)
75.0 GB / 60s DomainPut (chains down to MemBatch writes)
63.1 GB / 60s GetLatest (cascades into Branch + downstream)
```

profiling at tip (MDBX, 60s window) pinned bytes.Clone as the #1
single allocation source at 64.3 GB / 60s (~16.85% of total alloc
traffic). 37.86 GB / 60s (~631 MB/sec) of that came from the
defensive common.Copy at TrieContext.Branch.

The clone was redundant: every downstream consumer either reads the
slice inline (cell.fillFromFields copies into pre-allocated arrays;
merger.Merge consumes and produces a fresh buffer; trie_reader
parses bytes into cells; unfoldBranchNode similar) or clones it
itself at the queue boundary (getDeferredUpdate clones both prefix
and prev when storing in the deferred-update pool). Branch's clone
was a third copy of bytes that nothing needed to retain.

Document the new contract (borrowed slice valid for the current
ComputeCommitment scope) and update the test that exercised the old
"returns owned bytes" guarantee to verify the new aliasing
guarantee instead.

After-measurement on the rig is blocked by an unrelated stage-loop
persistence inconsistency (chaindata head pointer ahead of state
writes on restart) that's reproducing on every restart cycle today;
the change is mathematically minimal (single Clone removal +
test contract update) and unit tests + make lint pass, so shipping
on the math.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as draft May 30, 2026 06:56
@yperbasis
Copy link
Copy Markdown
Member Author

yperbasis commented May 30, 2026

Live A/B measurement — both MDBX, both at mainnet tip

Setup: two erigon instances on disjoint 4-P-core sets (CPUs 0-7 and 8-15) of an i9-13900KS, aux on E-cores. Both started from a byte-for-byte clone of the same chaindata @ block 25,204,537. Same prysm CL driving both via engine-mux fan-out (primary=opt, shadow=main, latency recorded mux-side end-to-end). Both clean (0 errors).

Erigon-side direct execution time (strips mux RPC overhead, n=100 head-validated blocks per side):

Pctile opt main Δ
p50 98 ms 105 ms −6.7% (−7 ms)
p90 190 ms 197 ms −3.6%
p99 328 ms 335 ms −2.1%
mean 111.1 ms 113.8 ms −2.4% (−2.7 ms)

Mux-side end-to-end (n=331 engine_newPayloadV4 both-VALID samples — includes JSON-RPC + GC pauses caught in the request window):

Pctile opt main Δ
p50 97.4 ms 115.6 ms −16% (−18 ms)
p90 240.0 ms 273.5 ms −12% (−33 ms)
mean 125.3 ms 150.1 ms −17% (−25 ms)

The honest engine-API gain is the 2-7% / 2-7 ms number from the erigon-side timer. The mux-side gap is wider by ~10 ppt — likely GC-pause and RPC serialization under main's higher heap churn.

Alloc profile confirms the mechanism (lifetime alloc, same uptime each):

main opt Δ
Total alloc 67.4 GB 61.4 GB −9%
bytes.Clone total 10.2 GB (15.49%) 4.6 GB (7.64%) −55%
└─ via TrieContext.Branch 5.5 GB absent gone

A one-line change that erases 5.5 GB of redundant clone work and shaves 2-7 ms off newPayload p50.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces allocation pressure in the commitment trie path by making TrieContext.Branch return the domain reader’s slice directly instead of cloning it, relying on callers to consume inline or clone at retention boundaries.

Changes:

  • Removed the common.Copy in TrieContext.Branch.
  • Added an implementation-level borrowed-slice contract comment.
  • Updated the unit test to assert aliasing/borrowing behavior instead of copy ownership.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
execution/commitment/commitmentdb/commitment_context.go Changes Branch to return borrowed data directly and documents the new lifetime expectation.
execution/commitment/commitmentdb/commitment_context_test.go Updates the branch ownership test to validate borrowed-slice aliasing behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/commitment/commitmentdb/commitment_context.go
…e PatriciaContext interface

Address Copilot's review on PR #21524: the borrowed-slice lifetime
note was only on the concrete *TrieContext implementation; callers
that depend on the interface couldn't see it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as ready for review May 30, 2026 08:59
Copy link
Copy Markdown
Contributor

@Giulio2002 Giulio2002 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM \u2014 small obvious change (~39 changed lines across 3 files)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants