Reduce impact of synchronized aggregation across fleet nodes #20391
AskAlexSharov merged 5 commits into erigontech:main
Conversation
Pull request overview
This PR adds a configurable per-node startup delay for snapshot aggregation so multi-node fleets can stagger BuildFilesInBackground and avoid synchronized I/O stalls that can temporarily take all nodes out of service at once.
Changes:
- Introduce `ERIGON_AGGREGATION_DELAY_MS` (default 0) as a debug/experiment env var.
- Apply the configured delay at the start of `Aggregator.buildFilesInBackground` before the build loop proceeds.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| db/state/aggregator.go | Adds an optional delay before starting background file building to desynchronize aggregation timing across nodes. |
| common/dbg/experiments.go | Adds `AggregationDelayMs` configuration sourced from `ERIGON_AGGREGATION_DELAY_MS`. |
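As a minimal, runnable sketch of the combined pattern (assuming plain `os.Getenv` parsing; the helper names and loop body here are illustrative, not Erigon's actual code in `common/dbg/experiments.go` or `db/state/aggregator.go`):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// aggregationDelayMs mirrors the role of dbg.AggregationDelayMs: a per-node
// delay in milliseconds read from ERIGON_AGGREGATION_DELAY_MS, defaulting
// to 0 when unset or unparsable. Helper name is illustrative.
func aggregationDelayMs() int64 {
	ms, err := strconv.ParseInt(os.Getenv("ERIGON_AGGREGATION_DELAY_MS"), 10, 64)
	if err != nil || ms < 0 {
		return 0
	}
	return ms
}

// buildFilesInBackground shows where the delay sits: before the build
// loop, so each node's aggregation start is offset by its configured value.
func buildFilesInBackground() {
	if ms := aggregationDelayMs(); ms > 0 {
		time.Sleep(time.Duration(ms) * time.Millisecond)
	}
	fmt.Println("build loop proceeds here") // stand-in for the real loop
}

func main() {
	buildFilesInBackground()
}
```

If the delay runs on every invocation (as the placement suggests), nodes stay offset at each step boundary rather than only at process start.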
Thank you for the investigation. I would "just accept this PR" because it gives you a simple-enough workaround. But here is the whole picture:
We are working on it:
Give me a sec. I'll explain.
ah, if you are already on |
fd0646b to bd23866
FYI: we also already have the following features:
Also:
bd23866 to 9c53287
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
When multiple nodes are syncing the same chain, they cross step boundaries at nearly the same time, triggering BuildFilesInBackground simultaneously. The resulting concurrent I/O from aggregation can cause all nodes to fall behind the chain tip at once, leaving no healthy backends in a load-balanced fleet.

Add a configurable delay (via the AGGREGATION_DELAY_MS environment variable, default 0) before the build loop starts. Operators can set different values per node to desynchronize aggregation and avoid fleet-wide stalls.

Example deployment:

    node-1: AGGREGATION_DELAY_MS=0
    node-2: AGGREGATION_DELAY_MS=60000
    node-3: AGGREGATION_DELAY_MS=120000

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Assisted-by: Claude (Anthropic) <https://claude.ai>
9c53287 to 914af75
…ure (#20486)

### Problem

When Erigon is running at chain tip, `MergeLoop` executes merge steps back-to-back with no pause between iterations. Each merge step involves heavy disk I/O (reading, compressing, and writing state files). Running these steps consecutively saturates the disk, starving block execution of I/O bandwidth.

The result is periodic block processing stalls: the node's reported block number freezes for minutes at a time while background merges consume all available I/O, then bursts forward when a merge step completes. During these stalls the node falls behind the chain tip and is marked unhealthy by load balancers.

### Observed behavior

On a production fleet running Erigon v3.3.x on AWS Graviton instances (64GB RAM, EBS gp3 volumes), we observed the following pattern during MergeLoop activity on individual nodes:

- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of continuous merge activity

The node always recovers eventually, but the stalls cause the node to be removed from load balancer rotation, reducing fleet capacity.

### Solution

Add a configurable delay between `MergeLoop` iterations via the `MERGE_THROTTLE_MS` environment variable (default 0, preserving current behavior). The delay is inserted after each successful `mergeLoopStep`, giving block execution a window to access the disk before the next merge step begins.

```
Before (current):
  mergeLoopStep() → heavy I/O
  mergeLoopStep() → immediately, more heavy I/O
  mergeLoopStep() → immediately, more heavy I/O

After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep() → heavy I/O
  sleep(2s)       → block execution catches up
  mergeLoopStep() → heavy I/O
  sleep(2s)       → block execution catches up
```

### Production results

We have been running this patch on a 3-node production fleet since December 2025. Results:

- Individual node availability during merge-heavy periods improved from ~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with the remaining downtime caused by synchronized stalls that this patch and `AGGREGATION_DELAY_MS` (PR #20391) address together

Recommended values based on our testing:

| Use case | Value | Effect |
|----------|-------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |

### Notes

- This is complementary to `COMPRESS_WORKERS` (PR #18995), which reduces I/O pressure *within* each merge step by limiting worker parallelism. This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391, merged), which staggers the *start time* of aggregation across fleet nodes.
- No impact on single-node deployments or initial sync (default delay is 0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
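For illustration, a minimal runnable sketch of the throttle placement described above; the env var spelling (the text uses both `MERGE_THROTTLE_MS` and `ERIGON_MERGE_THROTTLE_MS`, the latter is assumed here), the `mergeLoopStep` stub, and the loop shape are assumptions, not the actual `MergeLoop` code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// mergeThrottle reads the throttle from the environment, defaulting to 0,
// i.e. the current back-to-back behavior.
func mergeThrottle() time.Duration {
	ms, err := strconv.ParseInt(os.Getenv("ERIGON_MERGE_THROTTLE_MS"), 10, 64)
	if err != nil || ms <= 0 {
		return 0
	}
	return time.Duration(ms) * time.Millisecond
}

// mergeLoop sketches where the delay goes: after each successful step,
// giving block execution a window to reach the disk before the next step.
func mergeLoop(steps int) {
	throttle := mergeThrottle()
	for i := 0; i < steps; i++ {
		mergeLoopStep(i) // heavy I/O happens here
		if throttle > 0 {
			time.Sleep(throttle) // block execution catches up
		}
	}
}

// mergeLoopStep stands in for the real merge work.
func mergeLoopStep(i int) { fmt.Println("merge step", i) }

func main() { mergeLoop(3) }
```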
Reduce impact of synchronized aggregation across fleet nodes
Problem
When running multiple Erigon nodes syncing the same chain, all nodes cross snapshot step boundaries at nearly the same time (within seconds of each other). This triggers `BuildFilesInBackground` simultaneously on every node, and the resulting aggregation I/O stalls block execution on all nodes at once.

In a load-balanced fleet this causes a total service outage: every backend falls behind the chain tip simultaneously, and the proxy has zero healthy backends to route traffic to.
Real-world incident (April 7 2026)
We operate a 3-node fleet. After ~2 months of stable operation, all nodes hit aggregation step 2193 within 20 seconds of each other:
`BuildFilesInBackground step=2193`

During the aggregation, block execution throughput dropped from ~20 Mgas/s to ~1-5 Mgas/s. All nodes fell behind the chain tip. At 10:07:33 the fleet had 0 out of 3 healthy backends for 60 seconds.
The aggregation step itself evicted ~16GB of page cache (RSS dropped from 48GB to 32GB on one node), starving block execution of I/O bandwidth.
Each node recovered on its own within 10-15 minutes, but the synchronized nature of the stall meant there was no healthy node to absorb traffic during the event.
Root cause
`BuildFilesInBackground` is triggered when `txNum` crosses a step boundary. Since all nodes process the same chain in real time, they all cross the boundary on the same block. The trigger is deterministic: there is no jitter or per-node offset.
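To make that determinism concrete, a toy sketch: the step number is a pure function of `txNum`, so every node crosses the same boundary on the same block. The `stepSize` value below is hypothetical, not Erigon's actual configuration.

```go
package main

import "fmt"

// stepSize is hypothetical; the real value lives in Erigon's config.
const stepSize = 1_562_500

// stepOf maps a transaction number to its aggregation step: a pure
// function, identical on every node, with no per-node jitter.
func stepOf(txNum uint64) uint64 { return txNum / stepSize }

func main() {
	before := stepOf(2193*stepSize - 1) // last tx of step 2192
	after := stepOf(2193 * stepSize)    // one more tx crosses the boundary
	fmt.Println(before, after)          // 2192 2193 -- on every node at once
}
```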
Solution

Add a configurable delay (`ERIGON_AGGREGATION_DELAY_MS`, default 0) at the start of `BuildFilesInBackground`, before the build loop begins. This follows the same pattern as the existing `COMPRESS_WORKERS` env var in `common/dbg/experiments.go`.

Operators running multi-node fleets can set different values per node to desynchronize aggregation:

    node-1: ERIGON_AGGREGATION_DELAY_MS=0
    node-2: ERIGON_AGGREGATION_DELAY_MS=60000
    node-3: ERIGON_AGGREGATION_DELAY_MS=120000

This staggers each node's aggregation start by roughly 60 seconds relative to its neighbor, which would have prevented the 0/3 healthy window in the incident above. Single-node operators are unaffected (default is 0).
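A quick back-of-the-envelope check of that claim, with hypothetical trigger times spread over the incident's 20-second window:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical boundary-crossing times (seconds): the incident showed
	// all nodes triggering within a 20-second window.
	trigger := []float64{0, 12, 20}
	// Per-node ERIGON_AGGREGATION_DELAY_MS values from above, in seconds.
	delay := []float64{0, 60, 120}

	starts := make([]float64, len(trigger))
	for i := range trigger {
		starts[i] = trigger[i] + delay[i]
	}
	sort.Float64s(starts)

	// Delays spaced 60s apart, minus a 20s trigger spread, still leave at
	// least 40s between consecutive aggregation starts, so at most one
	// node is entering aggregation at any moment.
	for i := 1; i < len(starts); i++ {
		fmt.Printf("gap %d: %.0fs\n", i, starts[i]-starts[i-1])
	}
}
```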
Notes
- Complementary to `COMPRESS_WORKERS` (PR #18995, "Reduce impact of background merge/compress to ChainTip"), which reduces I/O pressure within each aggregation step. This PR addresses the timing of when aggregation starts across nodes.