
Conversation

@julienrbrt
Member

Overview

Add cache pruning to prevent OOM.

@claude
Contributor

claude bot commented Oct 9, 2025

Claude finished @julienrbrt's task.


Comprehensive Code Review

I've completed a thorough analysis of the cache pruning implementation. This review covers code quality, architecture, error handling, security, performance, testing, and documentation.


Summary

This PR introduces cache pruning to prevent unbounded memory growth in long-running chains. The implementation adds height-based tracking and pruning mechanisms across header, data, and pending events caches. Overall, the approach is sound, but there are several important issues to address before merging.


Critical Issues

1. Unbounded Array Allocation in Pruning (Severity: High)

Location: block/internal/cache/generic_cache.go:142-169

The pruneOldEntries method collects all keys to delete in unbounded slices:

var itemsToDelete []uint64
var hashesToDelete []string

Problem: On a chain running for months with millions of blocks, a prune that runs after a long gap could collect millions of entries into these slices, causing large memory allocations and potential OOM.

Recommendation: Implement batch size limits as suggested by @tac0turtle:

const maxPruneBatchSize = 10_000

// Stop collecting once the batch is full; whatever remains below the
// cutoff is removed by the next prune call.
if len(itemsToDelete) >= maxPruneBatchSize {
    break // or `return false` from inside a sync.Map Range callback
}

Or use an iterative approach that deletes as it goes instead of collecting all keys first.
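
A minimal sketch of the delete-as-you-go variant, assuming the cache keeps a height-keyed items map alongside the hashes and heightByHash maps (field names here are assumptions drawn from this review, not the actual implementation):

// Sketch only: items, hashes and heightByHash are assumed sync.Map fields.
func (c *Cache[T]) pruneOldEntriesIterative(pruneHeight uint64) {
    // Delete height-keyed items while iterating; sync.Map.Range permits
    // deleting entries during iteration, so no key slice is required.
    c.items.Range(func(key, _ any) bool {
        if h, ok := key.(uint64); ok && h < pruneHeight {
            c.items.Delete(h)
        }
        return true
    })
    // Use the height recorded per hash to drop the matching hash entries.
    c.heightByHash.Range(func(key, value any) bool {
        if h, ok := value.(uint64); ok && h < pruneHeight {
            c.heightByHash.Delete(key)
            c.hashes.Delete(key)
        }
        return true
    })
}

This trades the up-front allocation for more time spent inside Range, so it pairs well with the batch cap above.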


2. P2P-Only Nodes Cache Buildup (Severity: Medium-High)

Location: block/internal/submitting/submitter.go:240

As noted by @tac0turtle, nodes syncing only via P2P never trigger cache clearing because they don't track DA inclusion:

// Only called when DA inclusion advances
s.cache.ClearBelowHeight(header.Height())

Problem: P2P-only nodes will accumulate cache entries indefinitely, defeating the purpose of this PR.

Recommendation: Implement a fallback pruning mechanism:

  1. Add periodic pruning based on time or cache size thresholds
  2. Or prune based on current sync height even without DA inclusion
  3. Document the trade-off: P2P-only nodes sacrifice DA inclusion checks for reduced memory

Example:

// In syncer, after a successful block sync (config field and retention
// window shown here are illustrative)
if !s.config.DA.Enable && syncedHeight > retentionWindow { // P2P-only node
    s.cache.ClearBelowHeight(syncedHeight - retentionWindow)
}

3. Return Value Redundancy (Severity: Low)

Location: block/internal/cache/generic_cache.go:83-91

The isSeen method has redundant logic:

func (c *Cache[T]) isSeen(hash string) bool {
    seen, ok := c.hashes.Load(hash)
    if !ok {
        return false
    }
    _, ok = seen.(bool)  // This is redundant
    return ok
}

Problem: Since setSeen only ever stores the value true, a successful type assertion is equivalent to the key being present; the stored value itself is never used.

Recommendation: As @tac0turtle suggested, simplify to:

func (c *Cache[T]) isSeen(hash string) bool {
    _, ok := c.hashes.Load(hash)
    return ok
}

This eliminates unnecessary type assertion and is clearer about intent.


Architecture & Design

Strengths

  1. Two-phase pruning approach (lines 136-180) properly avoids race conditions
  2. Height-based tracking via heightByHash enables efficient pruning
  3. Separation of concerns: Cache pruning logic is centralized in Manager.ClearBelowHeight

Concerns

Height Tracking Overhead

Every hash now stores an additional heightByHash entry. For chains with high transaction throughput, this substantially increases the hash-map memory footprint:

  • Before: 1 entry per hash
  • After: 2-3 entries per hash (hashes + heightByHash, potentially daIncluded)

Question: Have you profiled memory usage with this change under realistic load? Consider documenting expected memory characteristics.

Pruning Trigger Inconsistency

Pruning is triggered in multiple places with different semantics:

  • manager.go:227 - During GetNextPendingEvent (prunes pending events only)
  • submitter.go:240 - During DA inclusion advancement (prunes all caches)
  • No automatic trigger for P2P-only nodes

Recommendation: Document the pruning strategy clearly and consider a unified trigger mechanism.
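
As one possible unified trigger, a background loop could prune on a fixed interval, a retention window behind the current sync height, independent of DA inclusion. This is only a sketch; currentSyncHeight is an assumed helper:

// Sketch of a time-based fallback trigger; currentSyncHeight is assumed.
func (m *implementation) startPruneLoop(ctx context.Context, interval time.Duration, retention uint64) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if h := m.currentSyncHeight(); h > retention {
                m.ClearBelowHeight(h - retention)
            }
        }
    }
}

The caller would run this in its own goroutine next to the existing components, which also covers the P2P-only case discussed above.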


Concurrency Safety

Strengths

  1. Mutex protection on ClearBelowHeight prevents concurrent pruning
  2. Atomic operations via sync.Map throughout
  3. Two-phase deletion prevents orphaned references

Potential Race Condition

Location: manager.go:273-286

func (m *implementation) ClearBelowHeight(height uint64) {
    if height == 0 {
        return
    }
    m.pruneMu.Lock()
    defer m.pruneMu.Unlock()
    if height <= m.lastClearedHeight {  // Optimization check
        return
    }
    m.headerCache.pruneOldEntries(height)
    m.dataCache.pruneOldEntries(height)
    m.pendingEventsCache.pruneOldEntries(height)
    m.lastClearedHeight = height
}

Issue: While the mutex protects pruning, concurrent Set*Seen or SetDAIncluded calls can add entries at old heights between the check and the actual pruning. This could lead to:

  1. Newly-added old entries not being pruned
  2. Orphaned heightByHash entries

Assessment: This is likely acceptable since:

  • It's a transient issue that self-corrects on the next prune
  • The alternative (locking all cache operations) would harm performance significantly
  • The two-phase approach in pruneOldEntries minimizes this window

Recommendation: Document this behavior as "eventual consistency" in pruning.
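
For example, a doc comment on ClearBelowHeight could make the guarantee explicit (wording is only a suggestion):

// ClearBelowHeight prunes cache entries recorded below height.
//
// Pruning is eventually consistent: entries added concurrently at
// heights below the cutoff may survive this call and are removed by a
// later prune once the cleared height advances again.
func (m *implementation) ClearBelowHeight(height uint64) {
    // body unchanged from this PR
}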


Performance Considerations

Memory Growth Characteristics

Good test coverage in TestCache_HashMapPruning demonstrates memory savings. However, there are some concerns:

  1. Pruning cost scales with the accumulated backlog: if pruning runs infrequently, each prune operation iterates over many entries
  2. No metrics: Consider adding metrics for:
    • Cache sizes (headers/data/pending)
    • Pruned entries per operation
    • Time spent pruning

CPU Usage

The Range operations in pruning iterate over all entries. For large caches, this could cause latency spikes.

Recommendation:

  • Add timing metrics to monitor pruning duration (a minimal sketch follows this list)
  • Consider incremental/background pruning if pruning time becomes significant
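
A minimal timing sketch for the first point, leaving the actual metrics or logging sink abstract:

// Sketch: wrap ClearBelowHeight to measure how long pruning takes.
func (m *implementation) clearBelowHeightTimed(height uint64) time.Duration {
    start := time.Now()
    m.ClearBelowHeight(height)
    elapsed := time.Since(start)
    // export elapsed as a histogram observation or a debug log entry
    return elapsed
}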

Testing Quality

Strengths

  1. Comprehensive test coverage for pruning logic
  2. Large dataset tests (TestCache_LargeDataset, TestCache_HashMapPruning)
  3. Edge case coverage (empty cache, zero height, boundary conditions)
  4. Concurrent access test (TestManager_ClearBelowHeight_ConcurrentAccess)

Gaps

Missing Test: P2P-Only Scenario

No test validates behavior when DA is disabled. Add:

func TestManager_ClearBelowHeight_P2POnly(t *testing.T) {
    // Test that cache doesn't grow unbounded for P2P-only nodes
}

Missing Test: Pruning Performance

No test validates pruning doesn't block for too long:

func TestCache_PruneOldEntries_Performance(t *testing.T) {
    // Measure pruning time for large caches
}
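
A Go benchmark may express this more directly than a timed unit test; a rough shape, with NewCache and populateCache as assumed helpers:

func BenchmarkCache_PruneOldEntries(b *testing.B) {
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        c := NewCache[struct{}]()  // constructor name assumed
        populateCache(c, 100_000)  // assumed helper filling height-keyed entries
        b.StartTimer()
        c.pruneOldEntries(100_001) // prune everything in a single call
    }
}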

Coverage Concerns

The codecov report shows:

  • executor.go: 0% coverage on new lines (18 missing)
  • syncer.go: 10% coverage on new lines (18 missing)

These are integration points where cache clearing is called. While integration tests may cover these, unit tests for the specific pruning trigger logic would be valuable.


Error Handling

File I/O Operations

Location: generic_cache.go:190-223

The SaveToDisk and LoadFromDisk methods properly handle errors, but:

Issue: No cleanup on partial write failures. If saving one map fails, others may have been written, leaving inconsistent state on disk.

Recommendation:

func (c *Cache[T]) SaveToDisk(folderPath string) error {
    // Write every map into a temp directory first.
    tempDir := folderPath + ".tmp"
    if err := c.writeMaps(tempDir); err != nil { // writeMaps: hypothetical helper
        _ = os.RemoveAll(tempDir)
        return err
    }
    // Swap the fully written directory into place so readers never
    // observe a partially written cache on disk.
    if err := os.RemoveAll(folderPath); err != nil {
        return err
    }
    return os.Rename(tempDir, folderPath)
}

Logging

Good use of structured logging throughout. Consider adding:

  • Debug log when pruning completes with entry counts
  • Warn log when cache size exceeds thresholds (if implemented)

Security Considerations

No major security concerns identified. The code:

  • ✅ Doesn't expose internal state unsafely
  • ✅ Properly validates height parameters (checks for zero)
  • ✅ Doesn't log sensitive data
  • ✅ Uses safe concurrent primitives

Minor note: Ensure the gob-encoded cache files have appropriate permissions (currently 0o755 for directories). Consider 0o700 for more restrictive access.
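
For instance, wherever the cache directory is created (call site and variable name assumed), the mode could be tightened:

// Assumed call site: restrict the cache directory to the node's user.
if err := os.MkdirAll(cacheDir, 0o700); err != nil {
    return fmt.Errorf("create cache dir: %w", err)
}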


Code Quality & Style

Strengths

  1. Clear comments explaining two-phase approach
  2. Idiomatic Go with proper use of sync.Map
  3. Consistent naming conventions

Minor Issues

Magic Number

Location: syncer.go:105

heightInCh:  make(chan common.DAHeightEvent, 10_000),

Consider making this configurable or documenting why 10,000 is the right buffer size.
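
One low-effort option, shown here as an assumed refactor rather than existing code, is to name the value so its rationale has an obvious home:

// Buffer sized to absorb bursts of DA height events without blocking
// the producer; tune alongside block time and DA batch size.
const defaultDAHeightEventBuffer = 10_000

heightInCh: make(chan common.DAHeightEvent, defaultDAHeightEventBuffer),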

Unclear Comment

Location: submitter.go:239

// TODO: If heights are posted out of order, this may be an issue.
s.cache.ClearBelowHeight(header.Height())

This TODO is vague. Either address the issue or clarify what "out of order" means and when it could occur.


Documentation

Missing Documentation

  1. Cache retention policy: How long are entries kept? What's the window?
  2. Memory expectations: What's the expected memory footprint with pruning?
  3. P2P-only behavior: Document the limitation or solution
  4. Pruning guarantees: Document eventual consistency vs. immediate clearing

Recommended Additions

Add to CLAUDE.md or a CACHE.md:

## Cache Management

### Pruning Strategy
- Caches are pruned based on DA inclusion height
- P2P-only nodes: [document strategy]
- Retention window: [X blocks or Y time]
- Expected memory: [Z MB per 1000 blocks]

### Configuration
- MaxPendingHeadersAndData: Limits pending submissions
- BlockTime: Affects pruning frequency

Recommendations Summary

Must Fix Before Merge

  1. Batch size limits in pruning to prevent large allocations
  2. P2P-only node handling to prevent unbounded cache growth

Should Fix

  1. Simplify isSeen method (remove redundant type check)
  2. Add test for P2P-only scenario
  3. Clarify or resolve TODO in submitter.go:239

Nice to Have

  1. ⚠️ Add pruning metrics (cache size, prune duration, entries pruned)
  2. ⚠️ Document cache retention policy
  3. ⚠️ Add performance test for pruning
  4. ⚠️ Atomic file writes for cache persistence

Conclusion

This PR addresses a critical memory leak issue and the implementation is fundamentally sound. The two-phase pruning approach is correct, and test coverage is comprehensive. However, the unbounded array allocation and P2P-only node issues must be resolved before merging to ensure the solution actually prevents OOM in all scenarios.

The codebase follows Go best practices, and the architectural separation is clean. With the recommended fixes, this will be a solid improvement to the system's reliability.

Estimated Risk After Fixes: Low
Recommendation: Request changes for items 1-2; approve once they are addressed


@github-actions
Contributor

github-actions bot commented Oct 9, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build      Format      Lint       Breaking    Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed    Oct 14, 2025, 8:21 PM

@codecov

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 55.03876% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.96%. Comparing base (8c85f1a) to head (8b4eb98).

Files with missing lines                 Patch %   Lines
block/internal/cache/generic_cache.go    65.07%    15 Missing and 7 partials ⚠️
block/internal/executing/executor.go     0.00%     18 Missing ⚠️
block/internal/syncing/syncer.go         10.00%    18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2748      +/-   ##
==========================================
- Coverage   62.08%   61.96%   -0.13%     
==========================================
  Files          80       80              
  Lines        8459     8569     +110     
==========================================
+ Hits         5252     5310      +58     
- Misses       2715     2762      +47     
- Partials      492      497       +5     
Flag       Coverage Δ
combined   61.96% <55.03%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@julienrbrt julienrbrt marked this pull request as ready for review October 10, 2025 10:36
@julienrbrt julienrbrt marked this pull request as draft October 10, 2025 10:43
@julienrbrt julienrbrt marked this pull request as ready for review October 10, 2025 11:33
Comment on lines 85 to 91
if !ok {
    return false
}
return seen.(bool)
val, ok := seen.(bool)
if !ok {
    return false
}
Contributor


Since we only store a value when it has been seen, returning ok should be enough; ok will always tell us whether the hash is present, unless we start calling Store with false on line 90.

}

// Only prune if we have a valid DA included height
if daIncludedHeight > 0 {
Contributor


If a node is only using p2p, this would continuously build up. How should we handle that?

Maybe once the cache reaches a certain size we begin clearing anyway, since the assumption is that users who only sync via p2p are fine with not checking for inclusion.

Comment on lines +144 to +145
var itemsToDelete []uint64
var hashesToDelete []string
Contributor


We should have a limit on these, otherwise we could be allocating a large array; a batch size would cap massive allocations.

@julienrbrt julienrbrt marked this pull request as draft October 14, 2025 19:53
@julienrbrt
Member Author

Closing this as it isn't worth it and will make the system more unstable. I'll open a PR with a small optimization instead.

@julienrbrt
Member Author

julienrbrt commented Oct 14, 2025

Some pruning is still needed, so I re-opened a simpler version as #2761.

@julienrbrt julienrbrt deleted the julien/cache-pruning branch October 14, 2025 20:59
}

// Clear cache below height
// TODO: If heights are posted out of order, this may be an issue.
Member Author


This is something we need to watch out for.

github-merge-queue bot pushed a commit that referenced this pull request Oct 17, 2025

## Overview

Supersedes #2748, except it is fully contained in the cache.

We delete all the cache entries for the height that has just been marked as included.
