[r3.4] cl/caplin: fix blob and data column pruning (never ran, wrong range, configurable keep window) #20380

Merged
AskAlexSharov merged 4 commits into release/3.4 from feature/lystopad/fix-blob-column-pruning-r34 on Apr 8, 2026

@lystopad lystopad commented Apr 7, 2026

Cherry-pick of #20379 to release/3.4.

Summary

Three bugs caused caplin blob and PeerDAS data column storage to grow unboundedly, filling disks on long-running nodes (observed: 1.6TB in caplin/ on a Hoodi node).

Bug 1: CleanupAndPruning stage was never reached

ForkChoice transitioned directly to SleepForSlot, making CleanupAndPruning a dead stage. Pruning never ran on any node, ever.
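
The dead-stage problem can be illustrated with a minimal sketch of a stage transition table. This is not Erigon's actual stage loop: the `Stage` type, the two maps, and `reachable` are hypothetical names chosen for illustration; only the three stage names come from the description above.

```go
package main

// Stage names a step in a simplified, hypothetical per-slot stage loop.
type Stage string

const (
	ForkChoice        Stage = "ForkChoice"
	CleanupAndPruning Stage = "CleanupAndPruning"
	SleepForSlot      Stage = "SleepForSlot"
)

// buggyNext models the pre-fix graph: ForkChoice jumps straight to
// SleepForSlot, so no edge ever leads into CleanupAndPruning.
var buggyNext = map[Stage]Stage{
	ForkChoice:        SleepForSlot,
	CleanupAndPruning: SleepForSlot,
}

// fixedNext models the intended graph:
// ForkChoice -> CleanupAndPruning -> SleepForSlot.
var fixedNext = map[Stage]Stage{
	ForkChoice:        CleanupAndPruning,
	CleanupAndPruning: SleepForSlot,
}

// reachable follows the transition table from start and returns the set of
// stages visited. It stops at a stage with no successor or on a repeat.
func reachable(next map[Stage]Stage, start Stage) map[Stage]bool {
	seen := map[Stage]bool{}
	s := start
	for !seen[s] {
		seen[s] = true
		n, ok := next[s]
		if !ok {
			break
		}
		s = n
	}
	return seen
}
```

Calling `reachable(buggyNext, ForkChoice)` yields a set without `CleanupAndPruning`, while `reachable(fixedNext, ForkChoice)` contains all three stages. A reachability check of this kind is one way such a dead stage could be caught in a unit test.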

Bug 2: Prune loop only covered a narrow window

Even if pruning had run, both BlobStore.Prune() and dataColumnStorageImpl.Prune() started the delete loop from currentSlot - minSlotsForBlobSidecarRequest instead of 0. Data older than that window was never deleted.
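
A rough sketch of the difference between the two loop ranges, using an in-memory map in place of the real on-disk stores. The function names and the map-based store are assumptions for illustration, not the signatures of `BlobStore.Prune()` or `dataColumnStorageImpl.Prune()`. The underflow guard reflects the one mentioned in the commit message for this fix.

```go
package main

// pruneCutoff returns the first slot that must be kept. Everything strictly
// below it is eligible for deletion. The guard avoids unsigned underflow when
// currentSlot is smaller than keepDistance (for example early in sync).
func pruneCutoff(currentSlot, keepDistance uint64) uint64 {
	if currentSlot < keepDistance {
		return 0
	}
	return currentSlot - keepDistance
}

// buggyStart models the pre-fix loop start: one window below the cutoff
// instead of slot 0, so older slots are never visited.
func buggyStart(cutoff, window uint64) uint64 {
	if cutoff < window {
		return 0
	}
	return cutoff - window
}

// prune deletes every stored slot in the half-open range [start, cutoff)
// and returns how many entries were removed.
func prune(store map[uint64][]byte, start, cutoff uint64) int {
	deleted := 0
	for slot := start; slot < cutoff; slot++ {
		if _, ok := store[slot]; ok {
			delete(store, slot)
			deleted++
		}
	}
	return deleted
}
```

With 100 stored slots, a keep distance of 10, and a window of 20, the buggy start deletes only slots 70 through 89 and leaves slots 0 through 69 behind, whereas starting from 0 leaves just the 10 most recent slots.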

Bug 3: Column keep window was 1M slots (~138 days), which is too large

The hardcoded pruneDistance = 1_000_000 for data columns meant that on networks where PeerDAS activated recently (e.g. Hoodi ~50 days ago), all column data fell within the keep window and nothing was pruned even after fixing bugs 1 and 2.
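
The arithmetic behind these figures can be checked with a short sketch. It assumes 12-second slots (the value that makes 1M slots come out at roughly 138 days); the helper names are hypothetical.

```go
package main

// secondsPerDay is used to convert a slot count into days.
const secondsPerDay = 24 * 60 * 60

// slotsToDays converts a number of slots into days for a given slot duration.
func slotsToDays(slots, secondsPerSlot uint64) float64 {
	return float64(slots*secondsPerSlot) / secondsPerDay
}

// daysToSlots converts whole days into slots for a given slot duration.
func daysToSlots(days, secondsPerSlot uint64) uint64 {
	return days * secondsPerDay / secondsPerSlot
}

// anythingToPrune reports whether data of the given age (in slots) extends
// past the keep window, i.e. whether a prune pass has anything to delete.
func anythingToPrune(oldestDataAgeSlots, keepSlots uint64) bool {
	return oldestDataAgeSlots > keepSlots
}
```

Under these assumptions, 50 days of column data is 360,000 slots. That is below the old 1,000,000-slot window, so `anythingToPrune` returns false, but above a 131,072-slot window (about 18.2 days), so it returns true.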

Test plan

  • Deploy on a Hoodi node and verify that the sizes reported by `du -sh /erigon-data/caplin/*` decrease after the node reaches head
  • Verify that `ls /erigon-data/caplin/blobs/ | sort -n | head -5` shows only recent subdirs after a prune
  • Verify that the `--caplin.columns-keep-slots` flag is accepted and overrides the default
  • Verify that `go test ./cl/persistence/blob_storage/...` passes

Co-Authored-By: Claude

lystopad added 3 commits April 7, 2026 11:30
The Prune() functions for both BlobStore and dataColumnStorageImpl had
a bug where the loop started from (cutoff - minSlotsForBlobSidecarRequest)
instead of 0. This meant only a narrow ~131K-slot window just before the
pruning cutoff was ever iterated, leaving all data older than that window
on disk indefinitely.

On a long-running Hoodi node this caused 444GB of blob accumulation and
1.1TB of PeerDAS data column accumulation (1.6TB total in caplin/).

Fix: start the delete loop from slot 0. Also add an underflow guard for
the case where currentSlot < keepDistance (e.g. very early in sync).

Co-Authored-By: Claude

Replace hardcoded 1_000_000 slot column keep distance with a configurable
flag --caplin.columns-keep-slots (default: 131072 = MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS
* SLOTS_PER_EPOCH = 4096 * 32, ~18 days).

The previous default of 1M slots (~138 days) caused unbounded disk growth
on nodes where PeerDAS activated recently: all column data fell within the
keep window so nothing was ever pruned.

Operators running DA oracle or rollup nodes that need longer column history
can increase the value via the flag.

Co-Authored-By: Claude

ForkChoice was transitioning directly to SleepForSlot, leaving
CleanupAndPruning as a dead stage that was never executed. Blob and
data column pruning therefore never ran on any node.

Fix ForkChoice to transition to CleanupAndPruning, which then
transitions to SleepForSlot, matching the intended graph:
  ForkChoice -> CleanupAndPruning -> SleepForSlot

Co-Authored-By: Claude

@lystopad lystopad requested a review from yperbasis April 7, 2026 09:34
@lystopad lystopad self-assigned this Apr 7, 2026
…value

- Derive the column keep distance from beaconCfg when ColumnKeepSlots is 0
  (i.e. not set via CLI), using MinEpochsForDataColumnSidecarsRequests *
  SlotsPerEpoch. This gives the correct spec minimum per chain:
  mainnet/Hoodi/Sepolia = 131072 slots (~18 days),
  Gnosis/Chiado = 65536 slots (~3.8 days).
- Guards against ColumnKeepSlots being zero (struct zero-value when config
  is constructed without CLI), which would otherwise delete all column data.
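
The fallback described in the two points above might look roughly like the following sketch. The `beaconConfig` struct and `columnKeepSlots` function are hypothetical stand-ins, not Erigon's actual types; the field names follow the identifiers quoted in this commit message.

```go
package main

// beaconConfig is a hypothetical stand-in holding only the two chain
// parameters needed to derive the spec-minimum keep distance.
type beaconConfig struct {
	MinEpochsForDataColumnSidecarsRequests uint64
	SlotsPerEpoch                          uint64
}

// columnKeepSlots returns the keep distance to use for data column pruning.
// A configured value of 0 is treated as "not set" and replaced by the
// per-chain minimum, so a zero-value config never yields a keep distance
// of 0 (which would delete all column data).
func columnKeepSlots(configured uint64, cfg beaconConfig) uint64 {
	if configured == 0 {
		return cfg.MinEpochsForDataColumnSidecarsRequests * cfg.SlotsPerEpoch
	}
	return configured
}
```

With `{4096, 32}` the function returns 131072, matching the mainnet/Hoodi/Sepolia figure. The pair `{4096, 16}` is an assumption chosen so that the product matches the 65536 slots quoted for Gnosis/Chiado. Any non-zero configured value is returned unchanged.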

Co-Authored-By: Claude
@AskAlexSharov AskAlexSharov merged commit 70013ce into release/3.4 Apr 8, 2026
22 checks passed
@AskAlexSharov AskAlexSharov deleted the feature/lystopad/fix-blob-column-pruning-r34 branch April 8, 2026 08:08
AskAlexSharov pushed a commit that referenced this pull request Apr 23, 2026
…tion) (#20729)

## Summary

Documents the `--caplin.columns-keep-slots` flag introduced in #20380.

- Adds a new **PeerDAS Data Column Retention** subsection to `caplin.md`
- Flag: `--caplin.columns-keep-slots` (default: 131072, ~18 days)
- Explains use case for DA oracle / rollup nodes needing longer column
history

## Test plan
- [ ] Verify flag appears in `erigon --help` on `release/3.4`
- [ ] Verify default value matches source:
`MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS * SLOTS_PER_EPOCH =
131072`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: bloxster <gianni.morselli@erigon.tech>