
Conversation

@julienrbrt
Member

Overview

Add cache pruning to prevent OOM.

@claude
Contributor

claude bot commented Oct 9, 2025

Claude finished @julienrbrt's task.


Comprehensive Code Review

I've completed a thorough analysis of the cache pruning implementation. This review covers code quality, architecture, error handling, security, performance, testing, and documentation.


Summary

This PR introduces cache pruning to prevent unbounded memory growth in long-running chains. The implementation adds height-based tracking and pruning mechanisms across header, data, and pending events caches. Overall, the approach is sound, but there are several important issues to address before merging.


Critical Issues

1. Unbounded Array Allocation in Pruning (Severity: High)

Location: block/internal/cache/generic_cache.go:142-169

The pruneOldEntries method collects all keys to delete in unbounded slices:

var itemsToDelete []uint64
var hashesToDelete []string

Problem: On a chain running for months with millions of blocks, a prune that runs after a long gap could collect millions of entries into these slices, causing large memory allocations and potential OOM.

Recommendation: Implement batch size limits as suggested by @tac0turtle:

const maxPruneBatchSize = 10_000

// Stop collecting once the batch is full; whatever remains below the
// cutoff is removed by the next prune call.
if len(itemsToDelete) >= maxPruneBatchSize {
    break // or `return false` from inside a sync.Map Range callback
}

Or use an iterative approach that deletes as it goes instead of collecting all keys first.
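
A minimal sketch of the delete-as-you-go variant, assuming the cache keeps a height-keyed items map alongside the hashes and heightByHash maps (field names here are assumptions drawn from this review, not the actual implementation):

// Sketch only: items, hashes and heightByHash are assumed sync.Map fields.
func (c *Cache[T]) pruneOldEntriesIterative(pruneHeight uint64) {
    // Delete height-keyed items while iterating; sync.Map.Range permits
    // deleting entries during iteration, so no key slice is required.
    c.items.Range(func(key, _ any) bool {
        if h, ok := key.(uint64); ok && h < pruneHeight {
            c.items.Delete(h)
        }
        return true
    })
    // Use the height recorded per hash to drop the matching hash entries.
    c.heightByHash.Range(func(key, value any) bool {
        if h, ok := value.(uint64); ok && h < pruneHeight {
            c.heightByHash.Delete(key)
            c.hashes.Delete(key)
        }
        return true
    })
}

This trades the up-front allocation for more time spent inside Range, so it pairs well with the batch cap above.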


2. P2P-Only Nodes Cache Buildup (Severity: Medium-High)

Location: block/internal/submitting/submitter.go:240

As noted by @tac0turtle, nodes syncing only via P2P never trigger cache clearing because they don't track DA inclusion:

// Only called when DA inclusion advances
s.cache.ClearBelowHeight(header.Height())

Problem: P2P-only nodes will accumulate cache entries indefinitely, defeating the purpose of this PR.

Recommendation: Implement a fallback pruning mechanism:

  1. Add periodic pruning based on time or cache size thresholds
  2. Or prune based on current sync height even without DA inclusion
  3. Document the trade-off: P2P-only nodes sacrifice DA inclusion checks for reduced memory

Example:

// In syncer, after a successful block sync (config field and retention
// window shown here are illustrative)
if !s.config.DA.Enable && syncedHeight > retentionWindow { // P2P-only node
    s.cache.ClearBelowHeight(syncedHeight - retentionWindow)
}

3. Return Value Redundancy (Severity: Low)

Location: block/internal/cache/generic_cache.go:83-91

The isSeen method has redundant logic:

func (c *Cache[T]) isSeen(hash string) bool {
    seen, ok := c.hashes.Load(hash)
    if !ok {
        return false
    }
    _, ok = seen.(bool)  // This is redundant
    return ok
}

Problem: Since setSeen only ever stores the value true, a successful type assertion is equivalent to the key being present; the stored value itself is never used.

Recommendation: As @tac0turtle suggested, simplify to:

func (c *Cache[T]) isSeen(hash string) bool {
    _, ok := c.hashes.Load(hash)
    return ok
}

This eliminates unnecessary type assertion and is clearer about intent.


Architecture & Design

Strengths

  1. Two-phase pruning approach (lines 136-180) properly avoids race conditions
  2. Height-based tracking via heightByHash enables efficient pruning
  3. Separation of concerns: Cache pruning logic is centralized in Manager.ClearBelowHeight

Concerns

Height Tracking Overhead

Every hash now stores an additional heightByHash entry. For chains with high transaction throughput, this substantially increases the hash-map memory footprint:

  • Before: 1 entry per hash
  • After: 2-3 entries per hash (hashes + heightByHash, potentially daIncluded)

Question: Have you profiled memory usage with this change under realistic load? Consider documenting expected memory characteristics.

Pruning Trigger Inconsistency

Pruning is triggered in multiple places with different semantics:

  • manager.go:227 - During GetNextPendingEvent (prunes pending events only)
  • submitter.go:240 - During DA inclusion advancement (prunes all caches)
  • No automatic trigger for P2P-only nodes

Recommendation: Document the pruning strategy clearly and consider a unified trigger mechanism.
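
As one possible unified trigger, a background loop could prune on a fixed interval, a retention window behind the current sync height, independent of DA inclusion. This is only a sketch; currentSyncHeight is an assumed helper:

// Sketch of a time-based fallback trigger; currentSyncHeight is assumed.
func (m *implementation) startPruneLoop(ctx context.Context, interval time.Duration, retention uint64) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if h := m.currentSyncHeight(); h > retention {
                m.ClearBelowHeight(h - retention)
            }
        }
    }
}

The caller would run this in its own goroutine next to the existing components, which also covers the P2P-only case discussed above.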


Concurrency Safety

Strengths

  1. Mutex protection on ClearBelowHeight prevents concurrent pruning
  2. Atomic operations via sync.Map throughout
  3. Two-phase deletion prevents orphaned references

Potential Race Condition

Location: manager.go:273-286

func (m *implementation) ClearBelowHeight(height uint64) {
    if height == 0 {
        return
    }
    m.pruneMu.Lock()
    defer m.pruneMu.Unlock()
    if height <= m.lastClearedHeight {  // Optimization check
        return
    }
    m.headerCache.pruneOldEntries(height)
    m.dataCache.pruneOldEntries(height)
    m.pendingEventsCache.pruneOldEntries(height)
    m.lastClearedHeight = height
}

Issue: While the mutex protects pruning, concurrent Set*Seen or SetDAIncluded calls can add entries at old heights between the check and the actual pruning. This could lead to:

  1. Newly-added old entries not being pruned
  2. Orphaned heightByHash entries

Assessment: This is likely acceptable since:

  • It's a transient issue that self-corrects on the next prune
  • The alternative (locking all cache operations) would harm performance significantly
  • The two-phase approach in pruneOldEntries minimizes this window

Recommendation: Document this behavior as "eventual consistency" in pruning.
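
For example, a doc comment on ClearBelowHeight could make the guarantee explicit (wording is only a suggestion):

// ClearBelowHeight prunes cache entries recorded below height.
//
// Pruning is eventually consistent: entries added concurrently at
// heights below the cutoff may survive this call and are removed by a
// later prune once the cleared height advances again.
func (m *implementation) ClearBelowHeight(height uint64) {
    // body unchanged from this PR
}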


Performance Considerations

Memory Growth Characteristics

Good test coverage in TestCache_HashMapPruning demonstrates memory savings. However, there are some concerns:

  1. Pruning cost scales with the accumulated backlog: if pruning runs infrequently, each prune operation iterates over many entries
  2. No metrics: Consider adding metrics for:
    • Cache sizes (headers/data/pending)
    • Pruned entries per operation
    • Time spent pruning

CPU Usage

The Range operations in pruning iterate over all entries. For large caches, this could cause latency spikes.

Recommendation:

  • Add timing metrics to monitor pruning duration (a minimal sketch follows this list)
  • Consider incremental/background pruning if pruning time becomes significant
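
A minimal timing sketch for the first point, leaving the actual metrics or logging sink abstract:

// Sketch: wrap ClearBelowHeight to measure how long pruning takes.
func (m *implementation) clearBelowHeightTimed(height uint64) time.Duration {
    start := time.Now()
    m.ClearBelowHeight(height)
    elapsed := time.Since(start)
    // export elapsed as a histogram observation or a debug log entry
    return elapsed
}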

Testing Quality

Strengths

  1. Comprehensive test coverage for pruning logic
  2. Large dataset tests (TestCache_LargeDataset, TestCache_HashMapPruning)
  3. Edge case coverage (empty cache, zero height, boundary conditions)
  4. Concurrent access test (TestManager_ClearBelowHeight_ConcurrentAccess)

Gaps

Missing Test: P2P-Only Scenario

No test validates behavior when DA is disabled. Add:

func TestManager_ClearBelowHeight_P2POnly(t *testing.T) {
    // Test that cache doesn't grow unbounded for P2P-only nodes
}

Missing Test: Pruning Performance

No test validates pruning doesn't block for too long:

func TestCache_PruneOldEntries_Performance(t *testing.T) {
    // Measure pruning time for large caches
}
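
A Go benchmark may express this more directly than a timed unit test; a rough shape, with NewCache and populateCache as assumed helpers:

func BenchmarkCache_PruneOldEntries(b *testing.B) {
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        c := NewCache[struct{}]()  // constructor name assumed
        populateCache(c, 100_000)  // assumed helper filling height-keyed entries
        b.StartTimer()
        c.pruneOldEntries(100_001) // prune everything in a single call
    }
}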

Coverage Concerns

The codecov report shows:

  • executor.go: 0% coverage on new lines (18 missing)
  • syncer.go: 10% coverage on new lines (18 missing)

These are integration points where cache clearing is called. While integration tests may cover these, unit tests for the specific pruning trigger logic would be valuable.


Error Handling

File I/O Operations

Location: generic_cache.go:190-223

The SaveToDisk and LoadFromDisk methods properly handle errors, but:

Issue: No cleanup on partial write failures. If saving one map fails, others may have been written, leaving inconsistent state on disk.

Recommendation:

func (c *Cache[T]) SaveToDisk(folderPath string) error {
    // Write every map into a temp directory first.
    tempDir := folderPath + ".tmp"
    if err := c.writeMaps(tempDir); err != nil { // writeMaps: hypothetical helper
        _ = os.RemoveAll(tempDir)
        return err
    }
    // Swap the fully written directory into place so readers never
    // observe a partially written cache on disk.
    if err := os.RemoveAll(folderPath); err != nil {
        return err
    }
    return os.Rename(tempDir, folderPath)
}

Logging

Good use of structured logging throughout. Consider adding:

  • Debug log when pruning completes with entry counts
  • Warn log when cache size exceeds thresholds (if implemented)

Security Considerations

No major security concerns identified. The code:

  • ✅ Doesn't expose internal state unsafely
  • ✅ Properly validates height parameters (checks for zero)
  • ✅ Doesn't log sensitive data
  • ✅ Uses safe concurrent primitives

Minor note: Ensure the gob-encoded cache files have appropriate permissions (currently 0o755 for directories). Consider 0o700 for more restrictive access.
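
For instance, wherever the cache directory is created (call site and variable name assumed), the mode could be tightened:

// Assumed call site: restrict the cache directory to the node's user.
if err := os.MkdirAll(cacheDir, 0o700); err != nil {
    return fmt.Errorf("create cache dir: %w", err)
}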


Code Quality & Style

Strengths

  1. Clear comments explaining two-phase approach
  2. Idiomatic Go with proper use of sync.Map
  3. Consistent naming conventions

Minor Issues

Magic Number

Location: syncer.go:105

heightInCh:  make(chan common.DAHeightEvent, 10_000),

Consider making this configurable or documenting why 10,000 is the right buffer size.
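
One low-effort option, shown here as an assumed refactor rather than existing code, is to name the value so its rationale has an obvious home:

// Buffer sized to absorb bursts of DA height events without blocking
// the producer; tune alongside block time and DA batch size.
const defaultDAHeightEventBuffer = 10_000

heightInCh: make(chan common.DAHeightEvent, defaultDAHeightEventBuffer),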

Unclear Comment

Location: submitter.go:239

// TODO: If heights are posted out of order, this may be an issue.
s.cache.ClearBelowHeight(header.Height())

This TODO is vague. Either address the issue or clarify what "out of order" means and when it could occur.


Documentation

Missing Documentation

  1. Cache retention policy: How long are entries kept? What's the window?
  2. Memory expectations: What's the expected memory footprint with pruning?
  3. P2P-only behavior: Document the limitation or solution
  4. Pruning guarantees: Document eventual consistency vs. immediate clearing

Recommended Additions

Add to CLAUDE.md or a CACHE.md:

## Cache Management

### Pruning Strategy
- Caches are pruned based on DA inclusion height
- P2P-only nodes: [document strategy]
- Retention window: [X blocks or Y time]
- Expected memory: [Z MB per 1000 blocks]

### Configuration
- MaxPendingHeadersAndData: Limits pending submissions
- BlockTime: Affects pruning frequency

Recommendations Summary

Must Fix Before Merge

  1. Batch size limits in pruning to prevent large allocations
  2. P2P-only node handling to prevent unbounded cache growth

Should Fix

  1. Simplify isSeen method (remove redundant type check)
  2. Add test for P2P-only scenario
  3. Clarify or resolve TODO in submitter.go:239

Nice to Have

  1. ⚠️ Add pruning metrics (cache size, prune duration, entries pruned)
  2. ⚠️ Document cache retention policy
  3. ⚠️ Add performance test for pruning
  4. ⚠️ Atomic file writes for cache persistence

Conclusion

This PR addresses a critical memory leak issue and the implementation is fundamentally sound. The two-phase pruning approach is correct, and test coverage is comprehensive. However, the unbounded array allocation and P2P-only node issues must be resolved before merging to ensure the solution actually prevents OOM in all scenarios.

The codebase follows Go best practices, and the architectural separation is clean. With the recommended fixes, this will be a solid improvement to the system's reliability.

Estimated Risk After Fixes: Low
Recommendation: Request changes for items 1-2; approve once they are addressed


@github-actions
Contributor

github-actions bot commented Oct 9, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build      Format      Lint       Breaking    Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed    Oct 14, 2025, 8:21 PM

@codecov

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 55.03876% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.96%. Comparing base (8c85f1a) to head (8b4eb98).

Files with missing lines                 Patch %   Lines
block/internal/cache/generic_cache.go    65.07%    15 Missing and 7 partials ⚠️
block/internal/executing/executor.go     0.00%     18 Missing ⚠️
block/internal/syncing/syncer.go         10.00%    18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2748      +/-   ##
==========================================
- Coverage   62.08%   61.96%   -0.13%     
==========================================
  Files          80       80              
  Lines        8459     8569     +110     
==========================================
+ Hits         5252     5310      +58     
- Misses       2715     2762      +47     
- Partials      492      497       +5     
Flag       Coverage Δ
combined   61.96% <55.03%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@julienrbrt julienrbrt marked this pull request as ready for review October 10, 2025 10:36
@julienrbrt julienrbrt marked this pull request as draft October 10, 2025 10:43
@julienrbrt julienrbrt marked this pull request as ready for review October 10, 2025 11:33
Comment on lines 85 to 91
if !ok {
    return false
}
return seen.(bool)
val, ok := seen.(bool)
if !ok {
    return false
}
Contributor


Since we only store a value when it has been seen, returning ok should be enough; ok will always tell us whether the hash is present, unless we start calling Store with false on line 90.

}

// Only prune if we have a valid DA included height
if daIncludedHeight > 0 {
Contributor


If a node is only using p2p, this would continuously build up. How should we handle that?

Maybe once the cache reaches a certain size we begin clearing anyway, since the assumption is that users who only sync via p2p are fine with not checking for inclusion.

Comment on lines +144 to +145
var itemsToDelete []uint64
var hashesToDelete []string
Contributor


We should have a limit on these, otherwise we could be allocating a large array; a batch size would cap massive allocations.

@julienrbrt julienrbrt marked this pull request as draft October 14, 2025 19:53
@julienrbrt
Member Author

Closing this as it isn't worth it and will make the system more unstable. I'll open a PR with a small optimization instead.

@julienrbrt
Member Author

julienrbrt commented Oct 14, 2025

Some pruning is still needed, so I re-opened a simpler version as #2761.

@julienrbrt julienrbrt deleted the julien/cache-pruning branch October 14, 2025 20:59
}

// Clear cache below height
// TODO: If heights are posted out of order, this may be an issue.
Member Author


This is something we need to watch out for.

github-merge-queue bot pushed a commit that referenced this pull request Oct 17, 2025

## Overview

Supersedes #2748, except it is fully contained in the cache.

We delete all the cache entries for the height that has just been marked as included.
