
Reduce impact of synchronized aggregation across fleet nodes#20391

Merged
AskAlexSharov merged 5 commits intoerigontech:mainfrom
lemenkov:aggregation_desync
Apr 10, 2026

Conversation

@lemenkov
Contributor

@lemenkov lemenkov commented Apr 7, 2026

Reduce impact of synchronized aggregation across fleet nodes

Problem

When running multiple Erigon nodes syncing the same chain, all nodes cross snapshot step boundaries at nearly the same time (within seconds of each other). This triggers BuildFilesInBackground simultaneously on every node, and the resulting aggregation I/O stalls block execution on all nodes at once.

In a load-balanced fleet this causes a total service outage — every backend falls behind the chain tip simultaneously, and the proxy has zero healthy backends to route traffic to.

Real-world incident (April 7 2026)

We operate a 3-node fleet. After ~2 months of stable operation, all nodes hit aggregation step 2193 within 20 seconds of each other:

| Node | BuildFilesInBackground start (step=2193) | Aggregation duration |
|--------|----------|----------------------------------|
| node-1 | 09:59:34 | 2m30s |
| node-2 | 09:59:28 | 2m29s |
| node-3 | 09:59:48 | still aggregating; was restarted |

During the aggregation, block execution throughput dropped from ~20 Mgas/s to ~1-5 Mgas/s. All nodes fell behind the chain tip. At 10:07:33 the fleet had 0 out of 3 healthy backends for 60 seconds.

The aggregation step itself evicted ~16GB of page cache (RSS dropped from 48GB to 32GB on one node), starving block execution of I/O bandwidth.

Each node recovered on its own within 10-15 minutes, but the synchronized nature of the stall meant there was no healthy node to absorb traffic during the event.

Root cause

BuildFilesInBackground is triggered when txNum crosses a step boundary. Since all nodes process the same chain in real time, they all cross the boundary on the same block. The trigger is deterministic — there is no jitter or per-node offset.
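The determinism can be illustrated with a small sketch. The step size and function names below are hypothetical stand-ins for Erigon's actual boundary check; the point is that the trigger is a pure function of txNum, so every node syncing the same chain fires it on the same block:

```go
package main

import "fmt"

// Hypothetical step size in transactions; the real value comes from the
// snapshot configuration, not this constant.
const stepSize uint64 = 1_562_500

// crossedStepBoundary is an illustrative stand-in for the condition that
// triggers BuildFilesInBackground. It depends only on txNum, so nodes
// processing the same chain all cross a boundary on the same block.
func crossedStepBoundary(prevTxNum, txNum uint64) bool {
	return prevTxNum/stepSize != txNum/stepSize
}

func main() {
	// Any two nodes at the same chain position compute the same answer:
	// no jitter, no per-node offset.
	fmt.Println(crossedStepBoundary(stepSize-1, stepSize+1)) // boundary crossed
	fmt.Println(crossedStepBoundary(stepSize+1, stepSize+2)) // no boundary
}
```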

Solution

Add a configurable delay (ERIGON_AGGREGATION_DELAY_MS, default 0) at the start of BuildFilesInBackground, before the build loop begins. This follows the same pattern as the existing COMPRESS_WORKERS env var in common/dbg/experiments.go.

Operators running multi-node fleets can set different values per node to desynchronize aggregation:

node-1: ERIGON_AGGREGATION_DELAY_MS=0
node-2: ERIGON_AGGREGATION_DELAY_MS=60000
node-3: ERIGON_AGGREGATION_DELAY_MS=120000

This guarantees at least 60 seconds between each node starting its aggregation, which would have completely prevented the 0/3 healthy window in the incident above. Single-node operators are unaffected (default is 0).

Notes

  • This is complementary to COMPRESS_WORKERS (PR Reduce impact of background merge/compress to ChainTip #18995) which reduces I/O pressure within each aggregation step. This PR addresses the timing of when aggregation starts across nodes.
  • No impact on single-node deployments or initial sync (default delay is 0).

Contributor

Copilot AI left a comment
Pull request overview

This PR adds a configurable per-node startup delay for snapshot aggregation so multi-node fleets can stagger BuildFilesInBackground and avoid synchronized I/O stalls that can temporarily take all nodes out of service at once.

Changes:

  • Introduce ERIGON_AGGREGATION_DELAY_MS (default 0) as a debug/experiment env var.
  • Apply the configured delay at the start of Aggregator.buildFilesInBackground before the build loop proceeds.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| db/state/aggregator.go | Adds an optional delay before starting background file building to desynchronize aggregation timing across nodes. |
| common/dbg/experiments.go | Adds AggregationDelayMs configuration sourced from ERIGON_AGGREGATION_DELAY_MS. |


@AskAlexSharov
Collaborator

Thank you for the investigation. I would just accept this PR, because it gives you a simple enough workaround.

But here is the whole picture:

  • "dropped from ~20 Mgas/s to ~1-5 Mgas/s": only for 3 minutes, or for much longer?
  • Besides "building files in background", there is one more I/O-intensive rare event: the background Merge of small files into bigger ones. We limit the number of goroutines for building and merging to reduce the chain-tip impact (1 goroutine for building, 1 for merging), and Merge actually produces more I/O than building (especially when merging commitment.kv files). One workaround for this is NO_DEEP_MERGE_HISTORY=true plus triggering the merge manually on an offline node with the erigon snapshots retire command.

We are working on it:

  • In version 3.4 of Erigon (releasing in a few days) we did quite a lot to mitigate the chain-tip performance impact. I would advise re-syncing on 3.4 or main (after we release files): it greatly reduces the chain-tip impact, file-building time, file-building I/O, and chaindata size (which drives I/O for building and pruning), and reduces the impact of RPC load on the chain tip.

@lemenkov
Contributor Author

lemenkov commented Apr 8, 2026

Give me a sec. I'll explain

@AskAlexSharov
Collaborator

Ah, if you are already on the release/3.4 or main branches, you can get the smaller-chaindata feature now, by:

```
/build/bin/erigon seg step-rebase --datadir=<your_path> --new-step-size=390625
```

@lemenkov lemenkov force-pushed the aggregation_desync branch from fd0646b to bd23866 Compare April 8, 2026 02:35
@AskAlexSharov
Collaborator

Randomization of merge and build start times is an interesting feature; we could just add it. Do you have any requirements here (like how many seconds of spread you would expect, etc.)?

An I/O rate limiter on background build/merge is also an interesting feature, but it could be hard to implement. Do you have requirements for that as well?

@AskAlexSharov
Collaborator

FYI: we also already have the following features:

  • prevent building/merging of state files and block files at the same time
  • skip compression during Collate+Build of small files, to speed up file building (the earlier building finishes, the earlier pruning can start -> smaller chaindata -> less I/O)

Also:

  • if you can reproduce the issue easily, try running Erigon with --db.pagesize=4kb (this requires removing datadir/chaindata first). The 4kb page size is known to lower read/write I/O.

When multiple nodes are syncing the same chain, they cross step
boundaries at nearly the same time, triggering BuildFilesInBackground
simultaneously. The resulting concurrent I/O from aggregation can cause
all nodes to fall behind the chain tip at once, leaving no healthy
backends in a load-balanced fleet.

Add a configurable delay (via AGGREGATION_DELAY_MS environment
variable, default 0) before the build loop starts. Operators can set
different values per node to desynchronize aggregation and avoid
fleet-wide stalls.

Example deployment:

  node-1: AGGREGATION_DELAY_MS=0
  node-2: AGGREGATION_DELAY_MS=60000
  node-3: AGGREGATION_DELAY_MS=120000

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Assisted-by: Claude (Anthropic) <https://claude.ai>
@lemenkov lemenkov force-pushed the aggregation_desync branch from 9c53287 to 914af75 Compare April 8, 2026 03:04
@AskAlexSharov AskAlexSharov added this pull request to the merge queue Apr 10, 2026
Merged via the queue into erigontech:main with commit 4a2fca3 Apr 10, 2026
33 checks passed
github-merge-queue Bot pushed a commit that referenced this pull request Apr 11, 2026
…ure (#20486)

### Problem
 
When Erigon is running at chain tip, `MergeLoop` executes merge steps
back-to-back with no pause between iterations. Each merge step involves
heavy disk I/O (reading, compressing, and writing state files). Running
these steps consecutively saturates the disk, starving block execution
of I/O bandwidth.
 
The result is periodic block processing stalls: the node's reported
block number freezes for minutes at a time while background merges
consume all available I/O, then bursts forward when a merge step
completes. During these stalls the node falls behind the chain tip and
is marked unhealthy by load balancers.
 
### Observed behavior
 
On a production fleet running Erigon v3.3.x on AWS Graviton instances
(64GB RAM, EBS gp3 volumes), we observed the following pattern during
MergeLoop activity on individual nodes:
 
- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of
continuous merge activity
 
The node always recovers eventually, but the stalls cause the node to be
removed from load balancer rotation, reducing fleet capacity.
 
### Solution
 
Add a configurable delay between `MergeLoop` iterations via the
`MERGE_THROTTLE_MS` environment variable (default 0, preserving current
behavior). The delay is inserted after each successful `mergeLoopStep`,
giving block execution a window to access the disk before the next merge
step begins.
 
```
Before (current):
  mergeLoopStep()  → heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
 
After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
```
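The loop change above can be sketched in Go as follows. Reading the env var directly and the function names `mergeThrottle`/`runMergeLoop` are simplifications of how the actual patch wires the value through `common/dbg`:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// mergeThrottle reads the pause inserted after each successful merge
// step; 0 (the default) preserves the old back-to-back behavior.
func mergeThrottle() time.Duration {
	ms, err := strconv.Atoi(os.Getenv("MERGE_THROTTLE_MS"))
	if err != nil || ms <= 0 {
		return 0
	}
	return time.Duration(ms) * time.Millisecond
}

// runMergeLoop sketches the throttled loop: step() returns true while
// there is more to merge; after each successful step we sleep so block
// execution gets an I/O window before the next step starts. It returns
// the number of steps executed.
func runMergeLoop(step func() bool, throttle time.Duration) int {
	steps := 0
	for step() {
		steps++
		if throttle > 0 {
			time.Sleep(throttle)
		}
	}
	return steps
}

func main() {
	remaining := 3
	n := runMergeLoop(func() bool {
		if remaining == 0 {
			return false
		}
		remaining--
		return true
	}, mergeThrottle())
	fmt.Println("merge steps executed:", n)
}
```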
 
### Production results
 
We have been running this patch on a 3-node production fleet since
December 2025. Results:
 
- Individual node availability during merge-heavy periods improved from
~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just
spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with
the remaining downtime caused by synchronized stalls that this patch and
`AGGREGATION_DELAY_MS` (PR #20391) address together
 
Recommended values based on our testing:
 
| Use case | Value | Effect |
|----------|-------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |
 
### Notes
 
- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces
I/O pressure *within* each merge step by limiting worker parallelism.
This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391,
merged) which staggers the *start time* of aggregation across fleet
nodes.
- No impact on single-node deployments or initial sync (default delay is
0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
