
Reduce impact of synchronized aggregation across fleet nodes#20391

Merged
AskAlexSharov merged 5 commits intoerigontech:mainfrom
lemenkov:aggregation_desync
Apr 10, 2026

Conversation

@lemenkov
Contributor

@lemenkov lemenkov commented Apr 7, 2026

Reduce impact of synchronized aggregation across fleet nodes

Problem

When running multiple Erigon nodes syncing the same chain, all nodes cross snapshot step boundaries at nearly the same time (within seconds of each other). This triggers BuildFilesInBackground simultaneously on every node, and the resulting aggregation I/O stalls block execution on all nodes at once.

In a load-balanced fleet this causes a total service outage — every backend falls behind the chain tip simultaneously, and the proxy has zero healthy backends to route traffic to.

Real-world incident (April 7 2026)

We operate a 3-node fleet. After ~2 months of stable operation, all nodes hit aggregation step 2193 within 20 seconds of each other:

| Node | BuildFilesInBackground start (step=2193) | Aggregation duration |
|--------|----------|----------------------------------|
| node-1 | 09:59:34 | 2m30s |
| node-2 | 09:59:28 | 2m29s |
| node-3 | 09:59:48 | still aggregating; was restarted |

During the aggregation, block execution throughput dropped from ~20 Mgas/s to ~1-5 Mgas/s. All nodes fell behind the chain tip. At 10:07:33 the fleet had 0 out of 3 healthy backends for 60 seconds.

The aggregation step itself evicted ~16GB of page cache (RSS dropped from 48GB to 32GB on one node), starving block execution of I/O bandwidth.

Each node recovered on its own within 10-15 minutes, but the synchronized nature of the stall meant there was no healthy node to absorb traffic during the event.

Root cause

BuildFilesInBackground is triggered when txNum crosses a step boundary. Since all nodes process the same chain in real time, they all cross the boundary on the same block. The trigger is deterministic — there is no jitter or per-node offset.
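The determinism can be illustrated with a small sketch. The step size and function names below are hypothetical stand-ins for Erigon's actual boundary check; the point is that the trigger is a pure function of txNum, so every node syncing the same chain fires it on the same block:

```go
package main

import "fmt"

// Hypothetical step size in transactions; the real value comes from the
// snapshot configuration, not this constant.
const stepSize uint64 = 1_562_500

// crossedStepBoundary is an illustrative stand-in for the condition that
// triggers BuildFilesInBackground. It depends only on txNum, so nodes
// processing the same chain all cross a boundary on the same block.
func crossedStepBoundary(prevTxNum, txNum uint64) bool {
	return prevTxNum/stepSize != txNum/stepSize
}

func main() {
	// Any two nodes at the same chain position compute the same answer:
	// no jitter, no per-node offset.
	fmt.Println(crossedStepBoundary(stepSize-1, stepSize+1)) // boundary crossed
	fmt.Println(crossedStepBoundary(stepSize+1, stepSize+2)) // no boundary
}
```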

Solution

Add a configurable delay (ERIGON_AGGREGATION_DELAY_MS, default 0) at the start of BuildFilesInBackground, before the build loop begins. This follows the same pattern as the existing COMPRESS_WORKERS env var in common/dbg/experiments.go.

Operators running multi-node fleets can set different values per node to desynchronize aggregation:

node-1: ERIGON_AGGREGATION_DELAY_MS=0
node-2: ERIGON_AGGREGATION_DELAY_MS=60000
node-3: ERIGON_AGGREGATION_DELAY_MS=120000

This guarantees at least 60 seconds between each node starting its aggregation, which would have completely prevented the 0/3 healthy window in the incident above. Single-node operators are unaffected (default is 0).

Notes

  • This is complementary to COMPRESS_WORKERS (PR Reduce impact of background merge/compress to ChainTip #18995) which reduces I/O pressure within each aggregation step. This PR addresses the timing of when aggregation starts across nodes.
  • No impact on single-node deployments or initial sync (default delay is 0).

Contributor

Copilot AI left a comment
Pull request overview

This PR adds a configurable per-node startup delay for snapshot aggregation so multi-node fleets can stagger BuildFilesInBackground and avoid synchronized I/O stalls that can temporarily take all nodes out of service at once.

Changes:

  • Introduce ERIGON_AGGREGATION_DELAY_MS (default 0) as a debug/experiment env var.
  • Apply the configured delay at the start of Aggregator.buildFilesInBackground before the build loop proceeds.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| db/state/aggregator.go | Adds an optional delay before starting background file building to desynchronize aggregation timing across nodes. |
| common/dbg/experiments.go | Adds AggregationDelayMs configuration sourced from ERIGON_AGGREGATION_DELAY_MS. |


@AskAlexSharov
Collaborator

Thank you for the investigation. I would just accept this PR, because it gives you a simple enough workaround.

But here is the whole picture:

  • "dropped from ~20 Mgas/s to ~1-5 Mgas/s": only for 3 minutes, or for much longer?
  • Besides "building files in background", there is one more I/O-intensive rare event: the background Merge of small files into bigger ones. We limit the number of goroutines for building and merging to reduce the chain-tip impact (1 goroutine for building, 1 for merging), and Merge actually produces more I/O than building (especially when merging commitment.kv files). One workaround for this is NO_DEEP_MERGE_HISTORY=true plus triggering the merge manually on an offline node with the erigon snapshots retire command.

We are working on it:

  • In version 3.4 of Erigon (releasing in a few days) we did quite a lot to mitigate the chain-tip performance impact. I would advise re-syncing on 3.4 or main (after we release files): it greatly reduces the chain-tip impact, file-building time, file-building I/O, and chaindata size (which drives I/O for building and pruning), and reduces the impact of RPC load on the chain tip.

@lemenkov
Contributor Author

lemenkov commented Apr 8, 2026

Give me a sec. I'll explain

@AskAlexSharov
Collaborator

Ah, if you are already on the release/3.4 or main branches, you can get the smaller-chaindata feature now, by:

```
/build/bin/erigon seg step-rebase --datadir=<your_path> --new-step-size=390625
```

@lemenkov lemenkov force-pushed the aggregation_desync branch from fd0646b to bd23866 Compare April 8, 2026 02:35
@AskAlexSharov
Collaborator

Randomization of merge and build start times is an interesting feature; we could just add it. Do you have any requirements here (like how many seconds of spread you would expect, etc.)?

An I/O rate limiter on background build/merge is also an interesting feature, but it could be hard to implement. Do you have requirements for that as well?

@AskAlexSharov
Collaborator

FYI: we also already have the following features:

  • prevent building/merging of state files and block files at the same time
  • skip compression during Collate+Build of small files, to speed up file building (the earlier building finishes, the earlier pruning can start -> smaller chaindata -> less I/O)

Also:

  • if you can reproduce the issue easily, try running Erigon with --db.pagesize=4kb (this requires removing datadir/chaindata first). The 4kb page size is known to lower read/write I/O.

When multiple nodes are syncing the same chain, they cross step
boundaries at nearly the same time, triggering BuildFilesInBackground
simultaneously. The resulting concurrent I/O from aggregation can cause
all nodes to fall behind the chain tip at once, leaving no healthy
backends in a load-balanced fleet.

Add a configurable delay (via AGGREGATION_DELAY_MS environment
variable, default 0) before the build loop starts. Operators can set
different values per node to desynchronize aggregation and avoid
fleet-wide stalls.

Example deployment:

  node-1: AGGREGATION_DELAY_MS=0
  node-2: AGGREGATION_DELAY_MS=60000
  node-3: AGGREGATION_DELAY_MS=120000

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Assisted-by: Claude (Anthropic) <https://claude.ai>
@lemenkov lemenkov force-pushed the aggregation_desync branch from 9c53287 to 914af75 Compare April 8, 2026 03:04
@AskAlexSharov AskAlexSharov added this pull request to the merge queue Apr 10, 2026
Merged via the queue into erigontech:main with commit 4a2fca3 Apr 10, 2026
33 checks passed
github-merge-queue Bot pushed a commit that referenced this pull request Apr 11, 2026
…ure (#20486)

### Problem
 
When Erigon is running at chain tip, `MergeLoop` executes merge steps
back-to-back with no pause between iterations. Each merge step involves
heavy disk I/O (reading, compressing, and writing state files). Running
these steps consecutively saturates the disk, starving block execution
of I/O bandwidth.
 
The result is periodic block processing stalls: the node's reported
block number freezes for minutes at a time while background merges
consume all available I/O, then bursts forward when a merge step
completes. During these stalls the node falls behind the chain tip and
is marked unhealthy by load balancers.
 
### Observed behavior
 
On a production fleet running Erigon v3.3.x on AWS Graviton instances
(64GB RAM, EBS gp3 volumes), we observed the following pattern during
MergeLoop activity on individual nodes:
 
- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of
continuous merge activity
 
The node always recovers eventually, but the stalls cause the node to be
removed from load balancer rotation, reducing fleet capacity.
 
### Solution
 
Add a configurable delay between `MergeLoop` iterations via the
`MERGE_THROTTLE_MS` environment variable (default 0, preserving current
behavior). The delay is inserted after each successful `mergeLoopStep`,
giving block execution a window to access the disk before the next merge
step begins.
 
```
Before (current):
  mergeLoopStep()  → heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
 
After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
```
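The loop change above can be sketched in Go as follows. Reading the env var directly and the function names `mergeThrottle`/`runMergeLoop` are simplifications of how the actual patch wires the value through `common/dbg`:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// mergeThrottle reads the pause inserted after each successful merge
// step; 0 (the default) preserves the old back-to-back behavior.
func mergeThrottle() time.Duration {
	ms, err := strconv.Atoi(os.Getenv("MERGE_THROTTLE_MS"))
	if err != nil || ms <= 0 {
		return 0
	}
	return time.Duration(ms) * time.Millisecond
}

// runMergeLoop sketches the throttled loop: step() returns true while
// there is more to merge; after each successful step we sleep so block
// execution gets an I/O window before the next step starts. It returns
// the number of steps executed.
func runMergeLoop(step func() bool, throttle time.Duration) int {
	steps := 0
	for step() {
		steps++
		if throttle > 0 {
			time.Sleep(throttle)
		}
	}
	return steps
}

func main() {
	remaining := 3
	n := runMergeLoop(func() bool {
		if remaining == 0 {
			return false
		}
		remaining--
		return true
	}, mergeThrottle())
	fmt.Println("merge steps executed:", n)
}
```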
 
### Production results
 
We have been running this patch on a 3-node production fleet since
December 2025. Results:
 
- Individual node availability during merge-heavy periods improved from
~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just
spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with
the remaining downtime caused by synchronized stalls that this patch and
`AGGREGATION_DELAY_MS` (PR #20391) address together
 
Recommended values based on our testing:
 
| Use case | Value | Effect |
|----------|-------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |
 
### Notes
 
- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces
I/O pressure *within* each merge step by limiting worker parallelism.
This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391,
merged) which staggers the *start time* of aggregation across fleet
nodes.
- No impact on single-node deployments or initial sync (default delay is
0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
