Revert "logger: avoid mutex contention"#20653

Closed
AskAlexSharov wants to merge 178 commits into main from alex/gnosis_stuck_35

@AskAlexSharov
Collaborator

Reverts #20454

Reason: Gnosis has a regression.

AskAlexSharov and others added 30 commits March 6, 2026 07:47
…logic extraction (#19642)

- relax check when several empty blocks (same state root)
- sampling logic extraction
```
for blockNum := range sampler.BlockNums(from, to) {
```
or
```
for start := fromBlock; start <= toBlock; start += chunkSize {
	if sampler.CanSkip() {
		continue
	}
```

Also:
- enable CommitmentHistVal by default - with `--sample` support
## Summary

- Replace per-key heap allocations in HashSort (ModeDirect and
ModeUpdate) with grow-only `batchSlab` and `byteArena` fields on the
`Updates` struct
- Arena is pre-allocated per batch and reset between batches; slab
stores `KeyUpdate` values contiguously
- Add dedicated HashSort benchmarks for both modes at N=50, 5000, 50000
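
The grow-only arena idea can be sketched as follows. This is an illustration only: the `byteArena` name comes from the summary above, but its fields and the `put`/`reset` methods are assumptions, not the code in the `Updates` struct.

```go
package main

import "fmt"

// byteArena is a hypothetical grow-only buffer. Key copies are appended to
// one backing array instead of being allocated one by one.
type byteArena struct {
	buf []byte
}

// put copies b into the arena and returns the arena-backed copy.
// The three-index slice caps the result so a caller's append cannot
// overwrite the next key stored in the arena.
func (a *byteArena) put(b []byte) []byte {
	start := len(a.buf)
	a.buf = append(a.buf, b...)
	return a.buf[start:len(a.buf):len(a.buf)]
}

// reset drops the contents but keeps the capacity for the next batch.
// Slices returned by put before the reset must no longer be used.
func (a *byteArena) reset() {
	a.buf = a.buf[:0]
}

func main() {
	arena := &byteArena{buf: make([]byte, 0, 64)} // pre-allocated per batch
	k1 := arena.put([]byte("key-1"))
	k2 := arena.put([]byte("key-2"))
	fmt.Println(string(k1), string(k2), cap(arena.buf))
	arena.reset()
	fmt.Println(len(arena.buf), cap(arena.buf))
}
```

Once the arena has grown to the size of the largest batch, later batches reuse the same memory, which is consistent with the zero-allocation result reported for ModeUpdate.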

## Benchmark Results

**HexPatriciaHashed_Process** (main vs this branch, 5 runs, p=0.008):
| Metric | main | branch | Change |
|--------|------|--------|--------|
| Time | 21.3 µs/op | 17.8 µs/op | **-16% faster** |
| Memory | 91.5 KB/op | 10.4 KB/op | **-89% less** |
| Allocs | 128/op | 106/op | **-17% fewer** |

**HashSort-specific** (new benchmarks):
- ModeDirect: constant 18-19 allocs regardless of key count (50 to 50k)
- ModeUpdate: **zero allocations** with full arena reuse

## Test plan

- [x] `go test ./execution/commitment/...` passes
- [x] Benchmarks run with `-count=5 -benchmem` on both main and branch
- [x] `benchstat` comparison confirms statistically significant
improvements (p=0.008)

---------

Co-authored-by: bloxster <40316187+bloxster@users.noreply.github.com>
Co-authored-by: Bloxster <bloxster@proton.me>
Co-authored-by: Mark Holt <135143369+mh0lt@users.noreply.github.com>
Co-authored-by: Mark Holt <erigon@dev-bm-e3-ethmainnet-n4.erigon.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
## Summary

Adds a file-integrity-cache for `CommitmentKvi` and `CommitmentKvDeref`
integrity checks. Once a file passes integrity verification, its result
is cached using the torrent InfoHash as the fingerprint. Subsequent runs
skip re-checking files that have not changed.

## Changes

- **New CLI flag**: `--file-integrity-cache=<path>` for `erigon
snapshots integrity`
- **Cache implementation**: `db/integrity/deref_cache.go`
  - Uses SHA1 InfoHash from `.torrent` files (not content hashing)
  - Requires `.torrent` files to exist (no fallback)
- Cache format: `CheckName\tfile1:hash1\tfile2:hash2...` (e.g.
`CommitmentKvi\tv2.0-commitment.0-32.kv:a2de2d...\tv2.0-commitment.0-32.kvi:9158a0...`)
- **Integration**: Cache parameter added to `CommitmentKvi` and
`CommitmentKvDeref` functions
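
To make the cache line format concrete, here is a small round-trip sketch. The `fileHash` type and the `formatCacheLine`/`parseCacheLine` helpers are hypothetical names, not the functions in `db/integrity/deref_cache.go`, and the hashes are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// fileHash pairs a file name with its torrent InfoHash (hex string).
type fileHash struct {
	Name string
	Hash string
}

// formatCacheLine renders "CheckName\tfile1:hash1\tfile2:hash2...".
func formatCacheLine(check string, files []fileHash) string {
	parts := []string{check}
	for _, f := range files {
		parts = append(parts, f.Name+":"+f.Hash)
	}
	return strings.Join(parts, "\t")
}

// parseCacheLine is the inverse of formatCacheLine.
func parseCacheLine(line string) (string, []fileHash, error) {
	fields := strings.Split(line, "\t")
	if fields[0] == "" {
		return "", nil, fmt.Errorf("missing check name")
	}
	files := make([]fileHash, 0, len(fields)-1)
	for _, f := range fields[1:] {
		i := strings.LastIndex(f, ":")
		if i <= 0 {
			return "", nil, fmt.Errorf("malformed entry %q", f)
		}
		files = append(files, fileHash{Name: f[:i], Hash: f[i+1:]})
	}
	return fields[0], files, nil
}

func main() {
	line := formatCacheLine("CommitmentKvi", []fileHash{
		{Name: "example.kv", Hash: "hash1"},
		{Name: "example.kvi", Hash: "hash2"},
	})
	check, files, err := parseCacheLine(line)
	fmt.Println(check, files, err)
}
```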

## Performance

Tested on Sepolia (4 commitment file sets, 9.4G total):

| Phase | Description | Duration |
|-------|-------------|----------|
| Baseline (no cache) | Full integrity check | 2m 7s |
| Cache creation | Full check + write cache | 2m 7s |
| Cache hit | Skip verified files | **2s** |

**63x speedup** on subsequent runs.

## Test Commands

```bash
# Generate .torrent files if missing
./build/bin/downloader torrent_create --datadir=/path/to/datadir --chain=sepolia --all

# Run with cache
./build/bin/erigon snapshots integrity --datadir=/path/to/datadir \
  --check=CommitmentKvi,CommitmentKvDeref \
  --file-integrity-cache=/tmp/integrity-cache.txt
```

## Notes

- Cache is invalidated automatically when file content changes
(different torrent hash)
- Missing `.torrent` files will cause an error (use `torrent_create` to
generate them)

---------

Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com>
- Sampling support in `CommitmentKvi`
- Enable CheckCommitmentHistAtBlkRange as default check
- A bit of a hack: reduced the sample ratio for CheckCommitmentHistAtBlkRange in
code (to make the default `integrity` run fast enough).

```
INFO[03-06|05:43:31.354] [integrity] CommitmentKvi                kvi=v2.0-commitment.0-4096.kvi kv=v2.0-commitment.0-4096.kv
INFO[03-06|05:44:01.354] [integrity] CommitmentKvi                at=19718552/333930881 p=5.9% k/s=657269.458 eta=7m58s kvi=v2.0-commitment.0-4096.kvi
INFO[03-06|05:44:31.354] [integrity] CommitmentKvi                at=38533118/333930881 p=11.5% k/s=642211.170 eta=7m39s kvi=v2.0-commitment.0-4096.kvi
```
Example I caught:
```
 [integrity]                              err="[integrity] .ef file has foreign txNum: 100000000 < 114165593,          
  v3.0-logaddrs.192-256.ef, 0000000000000039" 
```
In the past I introduced a couple of primitive nil-pointer bugs there, and
tests didn't catch them.
Slot 21651456 Epoch 1353216 ts 1773653580 UTC Mon 16/03/2026, 09:33:00

Cherry-pick of #17485 to `release/3.4`
## Summary
- Cherry-pick of #19691 to `release/3.4`
- Replace `ChiadoBootstrapNodes` with the ones from Lighthouse's
built-in Chiado network config

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…retry in buildVI (#19697)

Cherry-pick of #19695 to release/3.4

---

The counter 'i' used to track page offsets in paged history files was
declared outside the retry loop. When a recsplit collision occurred and
the loop retried, 'i' retained its value from the previous iteration,
causing incorrect page offset tracking in the .vi index.

Fix: Move `i := 0` inside the retry loop so it's reset on each attempt.
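
The shape of the bug and its fix can be sketched with a simulated retry loop. Everything here is illustrative: `buildIndex` and `collideOnAttempt` are hypothetical names, and a map stands in for the `.vi` index.

```go
package main

import "fmt"

// buildIndex assigns each key its position, retrying the whole build when a
// (simulated) collision happens on attempt number collideOnAttempt.
// It returns the index and the number of the attempt that succeeded.
func buildIndex(keys []string, collideOnAttempt int) (map[string]int, int) {
	for attempt := 0; ; attempt++ {
		// The fix: the counter is declared inside the retry loop, so every
		// attempt starts from 0. Declared outside, it would keep the value
		// left over from the failed attempt and shift every offset.
		i := 0
		idx := make(map[string]int, len(keys))
		for _, k := range keys {
			idx[k] = i
			i++
		}
		if attempt == collideOnAttempt {
			continue // simulated recsplit collision: rebuild from scratch
		}
		return idx, attempt
	}
}

func main() {
	idx, attempt := buildIndex([]string{"a", "b", "c"}, 0)
	fmt.Println(idx["a"], idx["b"], idx["c"], attempt)
}
```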
…19680)

Story: while profiling InvertedIndex file merges, I noticed that
`deriveFields()` did many re-allocations. My first idea was to over-allocate
by 2x to amortize them, but then I realized that we merge the same key
multiple times (multiple chunks from multiple small files), and each such
merge produces a bigger sequence. So the higher-level file-merge logic is
wrong: it needs to merge all chunks at once.

---
Before: incremental pairwise merge — for a key in N files, reads C + 2C
+ 3C + ... + NC = O(N²·C) elements and calls EliasFano.ResetForWrite(→
deriveFields → make([]uint64, ...)) N times, allocating a new backing
array each time since the merged size always exceeds the previous
capacity.

After: single-pass accumulation — collects all N items for the current
key, computes totalCount and maxOff in one scan, initialises the builder
once with the correct size (so deriveFields only allocates once per key
at the right capacity), then iterates the N files in ascending txNum
order adding all values in a single pass. O(N·C) reads, O(1) allocations
per key.
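
The difference between the two strategies can be sketched with plain slices standing in for Elias-Fano sequences. The function names are hypothetical, and the read counts simply follow the C + 2C + ... + NC arithmetic above.

```go
package main

import "fmt"

// mergeAll is the single-pass strategy: size the output once, then append
// every chunk in ascending txNum order. One allocation per key.
func mergeAll(chunks [][]uint64) []uint64 {
	total := 0
	for _, c := range chunks {
		total += len(c)
	}
	out := make([]uint64, 0, total)
	for _, c := range chunks {
		out = append(out, c...)
	}
	return out
}

// pairwiseReads counts elements read when n chunks of c elements each are
// merged incrementally: c + 2c + ... + nc.
func pairwiseReads(n, c int) int {
	return c * n * (n + 1) / 2
}

// singlePassReads counts elements read when all chunks are merged at once.
func singlePassReads(n, c int) int {
	return n * c
}

func main() {
	merged := mergeAll([][]uint64{{1, 2}, {5}, {7, 9}})
	fmt.Println(merged, cap(merged))
	fmt.Println(pairwiseReads(32, 100), singlePassReads(32, 100))
}
```

The gap between the two read counts grows linearly with the number of files, which matches the trend in the benchmark table, where the speedup increases with the file count.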

`BenchmarkInvertedIndexMergeFiles`: 
```
  ┌───────┬─────────┬─────────┬─────────┐
  │ files │ before  │  after  │ speedup │
  ├───────┼─────────┼─────────┼─────────┤
  │ 4     │ 840 µs  │ 763 µs  │ 1.1x    │
  ├───────┼─────────┼─────────┼─────────┤
  │ 8     │ 1306 µs │ 819 µs  │ 1.6x    │
  ├───────┼─────────┼─────────┼─────────┤
  │ 16    │ 3116 µs │ 970 µs  │ 3.2x    │
  ├───────┼─────────┼─────────┼─────────┤
  │ 32    │ 9718 µs │ 1305 µs │ 7.4x    │
  └───────┴─────────┴─────────┴─────────┘
```
Unit test for #19697. It could be simplified in the future by passing an
external recsplit cfg.
…curren… (#19725)

cherry-pick of #19710 

close #19720 

Fix goroutine leak in GetReceipts loop: each tx spawned a goroutine
waiting on ctx.Done() to cancel the EVM, but ctx was shared across the
entire block execution, so all N goroutines stayed alive until
GetReceipts returned. Under concurrent requests for different blocks
this caused goroutines and EVMs to accumulate in memory, triggering OOM.
Fixed by adding a txDone channel closed immediately after
ApplyTransactionWithEVM - the goroutine now exits as soon as its tx
completes, keeping at most 1 goroutine alive at a time.

Add execSem semaphore (capacity max(1, GOMAXPROCS/2), env
R_EXEC_CONCURRENCY) to limit the number of blocks executing concurrently
in GetReceipts. Each parallel block execution holds its own
IntraBlockState which can be hundreds of MB for busy mainnet blocks;
without a bound, many concurrent requests for different blocks exhaust
memory. Requests already served from the LRU cache bypass the semaphore
entirely.

./run_perf_tests.py .... _eth_get_block_receipts_21M_20K.tar -t 500:10,5000:1

with current SW:
[3. 1] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=10.597s]
[3. 2] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=13.36s]
[3. 3] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=26.353s]
[3. 4] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=1m56s]
[3. 5] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=2m44s]

[4. 1] daemon: executes test qps: 5000 time: 1 -> [R=100.00% max=4m11s]
[4. 2] daemon: executes test qps: 5000 time: 1 -> test failed: server is Dead for OOM

with new SW:
[3. 1] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=11.591s]
[3. 2] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=5.202s]
[3. 3] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=4.947s]
[3. 4] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=5.061s]
[3. 5] daemon: executes test qps: 500 time: 10 -> [R=100.00% max=5.009s]

[4. 1] daemon: executes test qps: 5000 time: 1 -> [R=100.00% max=13.789s]
[4. 2] daemon: executes test qps: 5000 time: 1 -> [R=100.00% max=14.032s]
[4. 3] daemon: executes test qps: 5000 time: 1 -> [R=100.00% max=14.02s]
[4. 4] daemon: executes test qps: 5000 time: 1 -> [R=100.00% max=13.958s]
[4. 5] daemon: executes test qps: 5000 time: 1 -> [R=100.00% max=13.924s]

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…g step-rebase (#19730)

Cherry-pick of #19723 to `release/3.4`.

## Summary
- `erigon seg step-rebase` renames snapshot data files, invalidating
existing torrent metadata
- `.torrent` files in subdirectories (domain/, history/, accessor/,
idx/) were already deleted, but `erigondb.toml.torrent` in the snapshots
root was missed
- Add it to the deletion list
## Summary
- Cherry-pick of #19728 to `release/3.4`
- Documents the `[rX.Y]` prefix convention for PRs that cherry-pick
commits to `release/X.Y` branches

## Test plan
- [x] Read updated file and confirm the new line appears in the
Conventions section
…the chain in use (#19727)

Cherry-pick of #19722 (merged to main as a18eb9b) to `release/3.4`.

## Summary

- **Lazy-parse webseed TOML**: instead of parsing all 8 chains' webseed
TOML at init time, store raw bytes in `EmbeddedWebseedsRaw` and parse on
demand via `GetEmbeddedWebseeds(chain)` — only the chain actually in use
gets parsed.
- **Remove no-op re-assignment**: `LoadRemotePreverified` was
redundantly re-building the same `KnownWebseeds` map; removed.
- **Inline `webseedsParse`**: folded into its sole caller
`GetEmbeddedWebseeds`.
- **Rename `KnownWebseeds` → `EmbeddedWebseeds`**: clearer naming —
`EmbeddedWebseedsRaw` for the raw bytes map, `GetEmbeddedWebseeds()` for
parsed access.
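
The parse-on-demand pattern can be sketched like this. It is an illustration under assumptions: the lowercase names mirror `EmbeddedWebseedsRaw` and `GetEmbeddedWebseeds` but are not the real signatures, a whitespace split stands in for TOML parsing, and the URLs are placeholders.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// embeddedWebseedsRaw holds unparsed bytes per chain (placeholder data).
var embeddedWebseedsRaw = map[string][]byte{
	"mainnet": []byte("https://a.example\nhttps://b.example\n"),
	"sepolia": []byte("https://c.example\n"),
}

var (
	mu         sync.Mutex
	parsed     = map[string][]string{} // filled lazily, one chain at a time
	parseCalls int                     // counts real parses, for demonstration
)

// getEmbeddedWebseeds parses the raw bytes for one chain on first use and
// returns the cached result afterwards.
func getEmbeddedWebseeds(chain string) []string {
	mu.Lock()
	defer mu.Unlock()
	if v, ok := parsed[chain]; ok {
		return v
	}
	raw, ok := embeddedWebseedsRaw[chain]
	if !ok {
		return nil
	}
	parseCalls++
	v := strings.Fields(string(raw)) // stand-in for the real TOML parse
	parsed[chain] = v
	return v
}

func main() {
	fmt.Println(getEmbeddedWebseeds("mainnet"))
	fmt.Println(getEmbeddedWebseeds("mainnet"))
	fmt.Println(parseCalls)
}
```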
…s hash collisions (#19741) (#19753)

## Summary

Cherry-pick of #19741 from `performance` to `release/3.4`.

- Migrate bloatnet configuration from perf-devnet-2 (which is down) to
perf-devnet-3
- Fix genesis hash collisions in `registeredChainsByGenesisHash` caused
by multiple chains sharing the same genesis hash
- Replace the genesis-hash-based chain lookup with explicit `chainName`
parameter threading

## Changes

- **`execution/chain/spec/config.go`** — Remove
`registeredChainsByGenesisHash` map
- **`execution/state/genesiswrite/genesis_write.go`** — Add `chainName`
parameter to `WriteGenesisState`
- **`p2p/sentry/sentry_grpc_server.go`** — Accept
`bootnodes`/`dnsNetwork` as explicit params instead of looking them up
from genesis hash
- **`node/eth/backend.go`** — Resolve and pass chain-specific
bootnodes/DNS params
- **`cl/clparams/config.go`** — Update bloatnet ENR and fork
configuration for perf-devnet-3

## Test plan

- [x] Cherry-pick applies cleanly (no conflicts)
- [x] `make lint` passes
- [x] `make erigon integration` builds successfully
- [ ] CI passes
- db/state/domain.go: Move keyPos and valPos declarations inside the
retry loop
remove alert: `[experiment] enabling commitment history. this is an
experimental flag so run at your own risk!`
…nts (#19793)

## Summary
Cherry-pick of #19777 to `release/3.4`.

- Use `GetCodeHash` instead of `ResolveCodeHash` in the contract address
collision check so that an EIP-7702 delegation designator
(`0xef0100...`) is treated as non-empty code—matching geth's behavior
and preventing a consensus split.
- Includes tests for both CREATE and CREATE2 collision with delegated
accounts.

Fixes ethereum-bounty/erigon#2

## Test plan
- [x] `TestCreate2CollisionWithEIP7702Delegation` passes
- [x] `TestCreateCollisionWithEIP7702Delegation` passes
- [x] `go build ./execution/vm/...` compiles cleanly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Cherry-pick of #19819 to `release/3.4`.

## Summary
- Adds `mcp` to the `BINARIES` list in `.github/workflows/release.yml`
so it is built and included in release artifacts and Docker images.

Generated with Claude.
…build (#19853)

## Summary

Cherry-pick of #19825 and #19847 to `release/3.4`.

- Improve release workflow robustness (early release existence check,
artifact verification, non-fatal skopeo delete, GitHub App token for
publish step)
- Inline debian package build (removing separate reusable workflow file)
- Fix debian control file heredoc inside for loop
- Update docker actions to v4.0.0 (Node.js 24)
- Pin `actions/create-github-app-token` to v2.2.1 SHA (Node.js 24)
It's often useful for experiments, e.g. "how do collate+build impact
chaindata size". We often need to build files in a blocking way, but without
waiting for the merge to finish.
AskAlexSharov and others added 16 commits April 14, 2026 12:05
…0521)

Cherry-pick of #20393 from main.

Fixes double indirection on interface{} preventing correct unmarshalling
of JSON null responses. Mirrors go-ethereum PR #26723.
…ch-up (#20555)

Cherry-pick of #20546.

InsertBlocks on 3.4 uses `BeginRw` directly (not `SharedDomains`), so
the `inserters.go` portion of #20546 is a no-op here. This cherry-pick
keeps the general `SharedDomains` contract improvement: `SeekCommitment`
always fully restores the SD, and `NewSharedDomains` attaches
`ErrBehindCommitment` as an environmental signal probed via
`TxNums.Last` at the construction boundary.
…ng (#20540)

Cherry-pick of #20538

Co-authored-by: kewei-bot <kewei.train@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…rt own parallel-building (#20565)

Compression and accessors now support parallel building themselves, so we
usually don't need multiple `merge_workers`.

closing: #20560
```
[INFO] [04-15|05:15:50.145] [integrity] StateRootVerifyByHistory     blks/s=0.5 checked=3.72k/2.48k windows=1860/2488 blkRange=1-24.87M
```
The `checked` counter overflows its total (3.72k of 2.48k).
Needed for Gnosis (Fulu fork) and for security updates
…BuildFilesInBackground (#20594)

Port of #20445 to release/3.4.

## Summary

Fixes two related bugs in the domain state layer that cause gas
mismatches during execution (#20169).

### Bug 1: Collation/pruning race

`BuildFilesInBackground` collates domain files by reading values from
the DB via per-worker read-transactions opened at collation time.
Between the moment a step is deemed ready and the collation reads,
execution can commit new batches that overwrite step S values with step
S+1 data. The collated file for step S then contains wrong values. After
pruning removes the DB entries, `GetLatest` returns the stale file
values, causing SSTORE gas mispricing.

**Fix:** Restructure `buildFiles` into two phases:
1. **Phase 1 (sequential, single read-tx):** Collate all domains and
indices using one MDBX read-transaction. All collations see the exact
same DB snapshot — zero race window.
2. **Phase 2 (parallel, no DB access):** Build files from collations.
This is the expensive part and remains fully parallel.

Additionally, add a committed txNum guard: don't collate step S until
`ComputeCommitment` has confirmed all data through the end of step S is
flushed.
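
The two-phase structure can be sketched as follows. The types are hypothetical: a map stands in for the single MDBX read-transaction, and a string stands in for a built file.

```go
package main

import (
	"fmt"
	"sync"
)

// snapshot stands in for one consistent read-transaction view of the DB.
type snapshot map[string]string

// collation is the in-memory data gathered for one domain.
type collation struct {
	domain string
	value  string
}

// buildFiles collates sequentially from a single snapshot (phase 1), then
// builds files in parallel without touching the DB (phase 2).
func buildFiles(snap snapshot, domains []string) []string {
	// Phase 1: every domain reads from the same snapshot, so later commits
	// cannot leak into the collated data.
	colls := make([]collation, 0, len(domains))
	for _, d := range domains {
		colls = append(colls, collation{domain: d, value: snap[d]})
	}

	// Phase 2: the expensive part runs concurrently on collations only.
	files := make([]string, len(colls))
	var wg sync.WaitGroup
	for i, c := range colls {
		wg.Add(1)
		go func(i int, c collation) {
			defer wg.Done()
			files[i] = c.domain + "=" + c.value // each goroutine owns one slot
		}(i, c)
	}
	wg.Wait()
	return files
}

func main() {
	snap := snapshot{"accounts": "v1", "storage": "v2"}
	fmt.Println(buildFiles(snap, []string{"accounts", "storage"}))
}
```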

### Bug 2: Unwind entry visibility after filing

After a reorg, the unwind restores domain entries tagged with a step
derived from the unwind-target txNum. If `BuildFilesInBackground` has
filed that step, `getLatestFromDb` discards the restored entry (step
covered by files) and falls through to `getLatestFromFiles`, returning
the stale end-of-step value instead of the changeset-restored value.

**Fix:** Pass `Aggregator.EndTxNumMinimax()` into the unwind and tag
restored entries with `max(naturalStep, currentFilesEndStep)`.
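
As a worked example of the step tag, the sketch below assumes a fixed `stepSize` and that files cover steps below `currentFilesEndStep`. Both are assumptions for illustration, as is the `unwindStepTag` helper.

```go
package main

import "fmt"

// stepSize is a hypothetical number of txNums per step.
const stepSize = 100

// unwindStepTag returns the step used to tag an entry restored by an unwind.
// Tagging with at least the current files end step keeps the entry from
// being treated as already covered by files.
func unwindStepTag(unwindTxNum, filesEndTxNum uint64) uint64 {
	naturalStep := unwindTxNum / stepSize
	currentFilesEndStep := filesEndTxNum / stepSize
	return max(naturalStep, currentFilesEndStep)
}

func main() {
	// Unwind target inside an already filed range: the tag is bumped to 5.
	fmt.Println(unwindStepTag(250, 500))
	// Unwind target beyond the filed range: the natural step 7 is kept.
	fmt.Println(unwindStepTag(750, 500))
}
```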

### Changes

- `db/state/aggregator.go`: single-tx collation + parallel file
building; committed txNum guard; `stepFullyCommitted` helper; pass
current file boundary to unwind
- `db/state/domain.go`: bump unwind step tag past current filed range
- `db/state/aggregator_test.go`: `TestAggregator_CommittedTxNumGuard`
- `db/state/domain_test.go`:
`TestDomain_CollationIsolatedFromLaterSteps`,
`TestDomain_UnwindRestoredEntryVisibility`
- `execution/commitment/commitmentdb/commitment_context.go`: export
`DecodeTxBlockNums`; fix `minUnwindale` typo; short-value length guard
…ionData (#20600)

## Summary

Cherry-pick of #19783 from `main` to `release/3.4`.

Fixes a panic observed on `alex/collation_race_fix_34` (and
`release/3.4`) when a validator client polls
`/eth/v1/validator/attestation_data` before Caplin has synced to head.

- **`SyncedDataManager.CommitteeCount`** (`synced_data.go`): added
`accessLock.RLock()` + nil check on `headState`, consistent with every
other accessor in the same file.
- **Debug-log defer** (`block_production.go`): guard against nil
`committeeIndex` in the deferred log closure.

## Reproduction

Start Erigon on `release/3.4` while a validator client is actively
polling. The VC calls `GET /eth/v1/validator/attestation_data` before
Caplin reaches head → panic in HTTP handler goroutine:
```
panic: runtime error: invalid memory address or nil pointer dereference
github.com/erigontech/erigon/cl/phase1/core/state.(*CachingBeaconState).CommitteeCount(0x0, ...)
github.com/erigontech/erigon/cl/beacon/synced_data.(*SyncedDataManager).CommitteeCount(...)
github.com/erigontech/erigon/cl/beacon/handler.(*ApiHandler).GetEthV1ValidatorAttestationData.func1()
```

## Test plan

- [x] Clean cherry-pick from `main` (commit `0f3624a17b`)
- [x] `go test ./cl/beacon/synced_data/... ./cl/beacon/handler/...
-short` passes on main

Generated with Claude
Cherry-pick of #20609 to release/3.4

Co-authored-by: bendertherobert <bendertherobert@gmail.com>
…elease/3.4) (#20636)

Cherry-pick of #20627 to release/3.4.

---------

Co-authored-by: mh0lt <mark.holt@erigon.tech>
Adds FAQ entries for the MCP server to the Help Center.

## Changes
- `docs/gitbook-help/frequently-asked-questions-faqs.md` — FAQ #23: what
is the MCP server; FAQ #24: how to connect Claude Desktop

## Notes
- `mcp.md` already exists and is complete — no changes
- Port 8553 and MCP flags already in `default-ports.md` and
`configuring-erigon/README.md`
- Second PR targeting `main`: #20605