feat(cache): cache pruning #2748
Conversation
Claude finished @julienrbrt's task

## Comprehensive Code Review

I've completed a thorough analysis of the cache pruning implementation. This review covers code quality, architecture, error handling, security, performance, testing, and documentation.

## Summary

This PR introduces cache pruning to prevent unbounded memory growth in long-running chains. The implementation adds height-based tracking and pruning mechanisms across the header, data, and pending events caches. Overall, the approach is sound, but there are several important issues to address before merging.

## Critical Issues

### 1. Unbounded Array Allocation in Pruning (Severity: High)

Location: The pruning code that collects keys to delete:

```go
var itemsToDelete []uint64
var hashesToDelete []string
```

Problem: On a chain running for months with millions of blocks, if pruning is called infrequently or after a long period without pruning, these slices could grow to millions of entries, causing large memory allocations and potential OOM issues.

Recommendation: Implement batch size limits as suggested by @tac0turtle:

```go
const maxPruneBatchSize = 10000

// Prune in batches
for len(itemsToDelete) < maxPruneBatchSize {
    // collect keys...
}
```

Or use an iterative approach that deletes as it goes instead of collecting all keys first.

### 2. P2P-Only Nodes Cache Buildup (Severity: Medium-High)

Location: As noted by @tac0turtle, nodes syncing only via P2P never trigger cache clearing because they don't track DA inclusion:

```go
// Only called when DA inclusion advances
s.cache.ClearBelowHeight(header.Height())
```

Problem: P2P-only nodes will accumulate cache entries indefinitely, defeating the purpose of this PR.

Recommendation: Implement a fallback pruning mechanism. Example:

```go
// In syncer after successful block sync
if !s.config.DA.Enable { // P2P only
    s.cache.ClearBelowHeight(syncedHeight - retentionWindow)
}
```

### 3. Return Value Redundancy (Severity: Low)

Location: The `isSeen` function:

```go
func (c *Cache[T]) isSeen(hash string) bool {
    seen, ok := c.hashes.Load(hash)
    if !ok {
        return false
    }
    _, ok = seen.(bool) // This is redundant
    return ok
}
```

Problem: Since only `true` values are ever stored, the type assertion always succeeds and adds nothing.

Recommendation: As @tac0turtle suggested, simplify to:

```go
func (c *Cache[T]) isSeen(hash string) bool {
    _, ok := c.hashes.Load(hash)
    return ok
}
```

This eliminates the unnecessary type assertion and is clearer about intent.

## Architecture & Design

### Strengths
### Concerns

#### Height Tracking Overhead

Every hash now stores an additional height value. Question: Have you profiled memory usage with this change under realistic load? Consider documenting expected memory characteristics.

#### Pruning Trigger Inconsistency

Pruning is triggered in multiple places with different semantics. Recommendation: Document the pruning strategy clearly and consider a unified trigger mechanism.

## Concurrency Safety

### Strengths

### Potential Race Condition

Location:

```go
func (m *implementation) ClearBelowHeight(height uint64) {
    if height == 0 {
        return
    }
    m.pruneMu.Lock()
    defer m.pruneMu.Unlock()
    if height <= m.lastClearedHeight { // Optimization check
        return
    }
    m.headerCache.pruneOldEntries(height)
    m.dataCache.pruneOldEntries(height)
    m.pendingEventsCache.pruneOldEntries(height)
    m.lastClearedHeight = height
}
```

Issue: While the mutex protects pruning, concurrent writes can still add entries below the height being pruned, so a cleared height is not a hard guarantee.

Assessment: This is likely acceptable.

Recommendation: Document this behavior as "eventual consistency" in pruning.

## Performance Considerations

### Memory Growth Characteristics

Good test coverage in the cache tests.
### CPU Usage

The …

Recommendation: …

## Testing Quality

### Strengths

### Gaps

#### Missing Test: P2P-Only Scenario

No test validates behavior when DA is disabled. Add:

```go
func TestManager_ClearBelowHeight_P2POnly(t *testing.T) {
    // Test that cache doesn't grow unbounded for P2P-only nodes
}
```

#### Missing Test: Pruning Performance

No test validates that pruning doesn't block for too long:

```go
func TestCache_PruneOldEntries_Performance(t *testing.T) {
    // Measure pruning time for large caches
}
```

#### Coverage Concerns

The codecov report shows: …

These are integration points where cache clearing is called. While integration tests may cover these, unit tests for the specific pruning trigger logic would be valuable.

## Error Handling

### File I/O Operations

Location: The `SaveToDisk` function.

Issue: No cleanup on partial write failures. If saving one map fails, others may have been written, leaving inconsistent state on disk.

Recommendation:

```go
func (c *Cache[T]) SaveToDisk(folderPath string) error {
    // Create temp directory
    tempDir := folderPath + ".tmp"
    // Write to temp
    // Atomically rename on success
    // This ensures all-or-nothing semantics
}
```

### Logging

Good use of structured logging throughout. Consider adding: …

## Security Considerations

No major security concerns identified. The code: …

Minor note: Ensure the gob-encoded cache files have appropriate permissions (currently 0o755 for directories). Consider 0o700 for more restrictive access.

## Code Quality & Style

### Strengths
### Minor Issues

#### Magic Number

Location:

```go
heightInCh: make(chan common.DAHeightEvent, 10_000),
```

Consider making this configurable or documenting why 10,000 is the right buffer size.

#### Unclear Comment

Location:

```go
// TODO: If heights are posted out of order, this may be an issue.
s.cache.ClearBelowHeight(header.Height())
```

This TODO is vague. Either address the issue or clarify what "out of order" means and when it could occur.

## Documentation

### Missing Documentation

### Recommended Additions

Add to the documentation:

```markdown
## Cache Management

### Pruning Strategy
- Caches are pruned based on DA inclusion height
- P2P-only nodes: [document strategy]
- Retention window: [X blocks or Y time]
- Expected memory: [Z MB per 1000 blocks]

### Configuration
- MaxPendingHeadersAndData: Limits pending submissions
- BlockTime: Affects pruning frequency
```

## Recommendations Summary

### Must Fix Before Merge

### Should Fix

### Nice to Have

## Conclusion

This PR addresses a critical memory leak issue, and the implementation is fundamentally sound. The two-phase pruning approach is correct, and test coverage is comprehensive. However, the unbounded array allocation and P2P-only node issues must be resolved before merging to ensure the solution actually prevents OOM in all scenarios.

The codebase follows Go best practices, and the architectural separation is clean. With the recommended fixes, this will be a solid improvement to the system's reliability.

Estimated Risk After Fixes: Low
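For reference, a self-contained sketch of the atomic-write pattern recommended in the Error Handling section above; `saveAtomically` and its `write` callback are illustrative names for this sketch, not the PR's actual API:

```go
package cache

import (
    "fmt"
    "os"
)

// saveAtomically writes the cache state into a temporary directory and only
// renames it into place once every file has been written, so a failure while
// saving one map never leaves a half-written cache directory behind.
// Note: there is still a brief window after the old directory is removed and
// before the rename completes; this is a best-effort sketch, not crash-proof.
func saveAtomically(folderPath string, write func(dir string) error) error {
    tempDir := folderPath + ".tmp"
    if err := os.MkdirAll(tempDir, 0o700); err != nil {
        return fmt.Errorf("create temp dir: %w", err)
    }
    if err := write(tempDir); err != nil {
        _ = os.RemoveAll(tempDir) // discard the partial write
        return fmt.Errorf("write cache: %w", err)
    }
    if err := os.RemoveAll(folderPath); err != nil {
        return fmt.Errorf("remove old cache dir: %w", err)
    }
    return os.Rename(tempDir, folderPath)
}
```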
The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2748      +/-   ##
==========================================
- Coverage   62.08%   61.96%   -0.13%
==========================================
  Files          80       80
  Lines        8459     8569     +110
==========================================
+ Hits         5252     5310      +58
- Misses       2715     2762      +47
- Partials      492      497       +5
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
```diff
     if !ok {
         return false
     }
-    return seen.(bool)
+    val, ok := seen.(bool)
+    if !ok {
+        return false
+    }
```
since we only store if the value is seen, then returning ok should be enough; ok will always tell us if it's present or not, unless we start calling store with false on line 90
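To make that invariant concrete, here is a minimal sketch assuming the seen-set is a `sync.Map` and `setSeen` is its only writer (the names follow the review excerpts above and are not verified against the PR):

```go
package cache

import "sync"

// seenSet models the hash cache discussed above. Because setSeen only ever
// stores true, presence in the map is equivalent to "seen", so isSeen can
// rely on Load's second return value alone.
type seenSet struct {
    hashes sync.Map // map[string]bool; values are always true
}

func (s *seenSet) setSeen(hash string) {
    s.hashes.Store(hash, true)
}

func (s *seenSet) isSeen(hash string) bool {
    _, ok := s.hashes.Load(hash)
    return ok
}
```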
block/internal/cache/manager.go
Outdated
```go
    }

    // Only prune if we have a valid DA included height
    if daIncludedHeight > 0 {
```
if a node is only using p2p this would continuously build up, how should we handle this?
maybe after the cache reaches a certain size we begin to clear anyway, since the assumption is that users that only sync via p2p are fine with not checking for inclusion
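A rough sketch of that size-based fallback; the threshold, the retention window, and the `Size`/`ClearBelowHeight` surface are assumptions for illustration, not this PR's API:

```go
package cache

// pruner is the minimal interface this sketch assumes the cache exposes.
type pruner interface {
    Size() int
    ClearBelowHeight(height uint64)
}

const (
    maxEntries      = 100_000 // hypothetical size threshold
    retentionWindow = 1_000   // hypothetical number of recent heights to keep
)

// maybePruneP2POnly clears old entries once the cache grows past a size
// threshold, so p2p-only nodes (which never observe DA inclusion) still keep
// memory bounded.
func maybePruneP2POnly(c pruner, syncedHeight uint64) {
    if c.Size() <= maxEntries {
        return
    }
    if syncedHeight > retentionWindow {
        c.ClearBelowHeight(syncedHeight - retentionWindow)
    }
}
```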
```go
    var itemsToDelete []uint64
    var hashesToDelete []string
```
we should have a limit on these, otherwise we could be allocating a large array. We should have a batch size to limit massive allocations
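A minimal sketch of that batching, assuming the cache keeps height-keyed items in a `sync.Map`; `maxPruneBatchSize` and the function name are illustrative, not the PR's code:

```go
package cache

import "sync"

const maxPruneBatchSize = 10_000 // hypothetical cap per pruning pass

// pruneBatchBelow deletes at most maxPruneBatchSize height-keyed entries
// below the given height, so a single pass never builds an unbounded slice
// of keys. Callers can loop until it returns 0.
func pruneBatchBelow(items *sync.Map, height uint64) int {
    keys := make([]uint64, 0, maxPruneBatchSize)
    items.Range(func(k, _ any) bool {
        if h, ok := k.(uint64); ok && h < height {
            keys = append(keys, h)
        }
        return len(keys) < maxPruneBatchSize // stop collecting once the batch is full
    })
    for _, k := range keys {
        items.Delete(k)
    }
    return len(keys)
}
```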
Closing this as it isn't worth it and will make the system more unstable. I'll open a PR with a small optimization instead.
some pruning is still needed, so I re-opened a simpler #2761.
```go
    }

    // Clear cache below height
    // TODO: If heights are posted out of order, this may be an issue.
```
this is something we need to watch out for
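One piece of this is already handled by the `lastClearedHeight` check quoted in the review above, which keeps pruning monotonic; a standalone sketch of that guard is below (names are illustrative), though it does not help if a too-high height arrives early:

```go
package cache

import "sync"

// clearWatermark only lets the pruning height move forward, so DA-inclusion
// events that arrive out of order cannot trigger a clear below a height that
// was already handled.
type clearWatermark struct {
    mu          sync.Mutex
    lastCleared uint64
}

// advance returns the height to prune up to, or 0 if the event is stale
// (at or below the current watermark) and should be ignored.
func (w *clearWatermark) advance(height uint64) uint64 {
    w.mu.Lock()
    defer w.mu.Unlock()
    if height <= w.lastCleared {
        return 0
    }
    w.lastCleared = height
    return height
}
```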
## Overview

Supersedes #2748, except it is fully contained in the cache. We delete all the cache entries for the height that has just been marked as included.
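A minimal sketch of that per-height deletion, assuming the cache indexes items and hashes by height; the field and method names are illustrative, not #2761's actual code:

```go
package cache

import "sync"

// heightIndexedCache is a reduced model of a cache that tracks, per height,
// the cached item and the hashes recorded for that height.
type heightIndexedCache struct {
    itemsByHeight  sync.Map // map[uint64]any
    hashesByHeight sync.Map // map[uint64][]string
    seenHashes     sync.Map // map[string]bool
}

// deleteAllForHeight drops everything cached for a single height once that
// height has been marked as included, instead of scanning the whole cache
// for entries below a cutoff.
func (c *heightIndexedCache) deleteAllForHeight(height uint64) {
    c.itemsByHeight.Delete(height)
    if v, ok := c.hashesByHeight.LoadAndDelete(height); ok {
        hashes, _ := v.([]string)
        for _, h := range hashes {
            c.seenHashes.Delete(h)
        }
    }
}
```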
Overview
Add cache pruning to prevent OOM.