
Conversation

@julienrbrt
Member

@julienrbrt julienrbrt commented Feb 3, 2026

Overview

Make the pending events cache and the tx cache fully ephemeral; they are re-fetched on restart. The DA inclusion cache persists until it is cleaned up after DA inclusion has been processed, and it now persists across restarts using store metadata.

Once nodes have upgraded to this version, they can run `rm -rf $HOME/.appd/data/cache`.
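
Roughly, the split looks like this (a minimal sketch; the struct, method, and key prefix below are illustrative assumptions, not the exact API in the diff):

package cache

import (
	"context"
	"encoding/binary"
)

// metadataStore is the subset of the node store assumed by this sketch;
// the real interface lives in pkg/store.
type metadataStore interface {
	SetMetadata(ctx context.Context, key string, value []byte) error
	GetMetadata(ctx context.Context, key string) ([]byte, error)
}

// daInclusionSketch illustrates the intended split: pending events and txs
// stay in memory only (rebuilt after restart), while DA inclusion is written
// through to store metadata so it survives a restart.
type daInclusionSketch struct {
	pending map[string]struct{} // ephemeral: re-fetched on restart
	store   metadataStore       // persistent: backs DA inclusion across restarts
}

func (s *daInclusionSketch) markDAIncluded(ctx context.Context, hash string, daHeight, blockHeight uint64) error {
	delete(s.pending, hash) // no longer pending once DA-included

	// 16-byte little-endian value, mirroring the encodeDAInclusion helper
	// referenced in the review below. The "da-included/" prefix is made up.
	buf := make([]byte, 16)
	binary.LittleEndian.PutUint64(buf[0:8], daHeight)
	binary.LittleEndian.PutUint64(buf[8:16], blockHeight)
	return s.store.SetMetadata(ctx, "da-included/"+hash, buf)
}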

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refining the internal logic related to Data Availability (DA) submission and inclusion. It streamlines code by removing unnecessary intermediate variables, ensures more accurate and efficient metric reporting, and enhances the system's observability through additional debug logging. These changes contribute to a cleaner, more performant, and easier-to-debug DA process.

Highlights

  • Code Optimization: Removed redundant variable assignments for header hashes and data commitments, directly using method calls where appropriate.
  • Metric Reporting Fix: Corrected the DAInclusionHeight metric update to use an already computed value, preventing unnecessary re-computation.
  • Improved Observability: Added a debug log message to indicate when a height is not yet included in Data Availability (DA), aiding in debugging and monitoring.


Changelog
  • block/internal/submitting/da_submitter.go
    • Removed the headerHash local variable, directly using header.Hash().String() in the cache.SetHeaderDAIncluded call.
  • block/internal/submitting/submitter.go
    • Updated the s.metrics.DAInclusionHeight.Set call to use the currentDAIncluded variable instead of re-calling s.GetDAIncludedHeight().
    • Added a debug log message height not yet DA included within the processDAInclusionLoop when IsHeightDAIncluded returns false.
    • Refactored IsHeightDAIncluded to remove headerHash and dataHash local variables, directly passing header.Hash().String() and dataCommitment.String() to cache lookup functions.
Activity
  • The pull request was opened with the title 'chore: log non included height', indicating an intent to improve logging around DA inclusion.
  • The changes primarily involve minor code cleanups, a metric reporting correction, and the addition of a debug log statement, aligning with the 'chore' and 'log' aspects of the title.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Contributor

github-actions bot commented Feb 3, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Feb 4, 2026, 10:48 AM

@claude
Contributor

claude bot commented Feb 3, 2026

Claude finished @julienrbrt's task in 2m 40s —— View job


Code Review Complete

I've performed a comprehensive review of this cache refactoring PR. The changes represent a significant architectural improvement, moving from gob-based persistence to a hybrid ephemeral/metadata approach. Overall, the implementation is solid with good test coverage and clear intent.


Summary

Breaking Change: ✅ Well-documented and justified
Test Coverage: 66.08% patch coverage (some areas need attention)
Architecture: Strong improvement in memory safety and restart behavior
Documentation: Clear CHANGELOG entry, needs operational guidance


Strengths

  1. Memory Safety: Replacing unbounded maps with LRU caches prevents OOM issues
  2. Clean Separation: DA inclusion data persists, while ephemeral caches (tx, pending events) don't - this is architecturally sound
  3. Test Quality: Comprehensive tests for persistence, restoration, and edge cases
  4. Atomic Operations: Good use of atomic operations for concurrent DA height tracking
  5. Backward Compatibility: Clear migration path with ClearCache flag

Critical Issues

1. Race Condition in setMaxDAHeight (High Priority)

Location: block/internal/cache/generic_cache.go:183-192

func (c *Cache[T]) setMaxDAHeight(daHeight uint64) {
	for range 1_000 {
		current := c.maxDAHeight.Load()
		if daHeight <= current {
			return
		}
		if c.maxDAHeight.CompareAndSwap(current, daHeight) {
			return
		}
	}
}

Issue: The hard-coded 1,000 iteration limit is concerning. Under extreme concurrent load, this could fail silently, leading to incorrect DaHeight() reporting.

Impact: This could cause DA inclusion processing to stall or report incorrect heights, which is critical for consensus.

Recommendation:

  • Add a panic or error log if all 1,000 attempts fail
  • Consider using a simpler atomic max pattern or mutex if contention is high
  • Add metrics to track CAS failures
func (c *Cache[T]) setMaxDAHeight(daHeight uint64) {
	for i := 0; i < 1_000; i++ {
		current := c.maxDAHeight.Load()
		if daHeight <= current {
			return
		}
		if c.maxDAHeight.CompareAndSwap(current, daHeight) {
			return
		}
	}
	// This should never happen in practice, but if it does, we need to know
	panic(fmt.Sprintf("failed to set max DA height after 1000 attempts: attempted=%d, current=%d", 
		daHeight, c.maxDAHeight.Load()))
}

2. Store Error Handling in Hot Path (Medium Priority)

Location: block/internal/cache/generic_cache.go:157-160

if c.store != nil {
	_ = c.store.SetMetadata(context.Background(), c.storeKey(hash), encodeDAInclusion(daHeight, blockHeight))
}

Issue: Store errors are silently ignored in setDAIncluded, which is called on every DA inclusion. If the store fails (disk full, permissions, corruption), DA inclusion metadata won't persist, but the node continues operating normally.

Impact: On restart, DA inclusion state would be lost, requiring re-syncing from peers or manual intervention.

Recommendation:

  • Log errors at ERROR level with context
  • Consider metrics for store failures
  • Document that this is write-through cache and store failures are non-fatal
if c.store != nil {
	if err := c.store.SetMetadata(context.Background(), c.storeKey(hash), encodeDAInclusion(daHeight, blockHeight)); err != nil {
		// Log but don't fail - in-memory cache is source of truth during runtime
		// On restart, we'll re-fetch from DA layer if needed
		// TODO: Add metric for tracking store persistence failures
		log.Error().Err(err).Str("hash", hash).Uint64("da_height", daHeight).Msg("failed to persist DA inclusion to store")
	}
}

3. Context Handling in Background Operations (Medium Priority)

Location: block/internal/cache/generic_cache.go:159, 170

_ = c.store.SetMetadata(context.Background(), c.storeKey(hash), ...)
_ = c.store.DeleteMetadata(context.Background(), c.storeKey(hash))

Issue: Using context.Background() in operations that may be called during shutdown could lead to incomplete writes or hangs.

Recommendation:

  • Accept context as parameter or store a context with the cache
  • Respect cancellation during shutdown to avoid delays
  • Consider using context.WithTimeout for store operations
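
A minimal sketch of the last suggestion, assuming only a store exposing SetMetadata (the interface name and the two-second budget are illustrative):

package cache

import (
	"context"
	"time"
)

// storeWriter is the assumed write-side of the store used by this sketch.
type storeWriter interface {
	SetMetadata(ctx context.Context, key string, value []byte) error
}

// persistDAInclusion derives from the caller's context so cancellation during
// shutdown propagates, but caps the wait so background persistence never
// blocks indefinitely on a slow or failing disk.
func persistDAInclusion(ctx context.Context, st storeWriter, key string, value []byte) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return st.SetMetadata(ctx, key, value)
}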

Important Issues

4. Inefficient Hash Lookup in SaveToStore (Performance)

Location: block/internal/cache/generic_cache.go:263-272

var blockHeight uint64
heightKeys := c.hashByHeight.Keys()
for _, h := range heightKeys {
	if storedHash, ok := c.hashByHeight.Peek(h); ok && storedHash == hash {
		blockHeight = h
		break
	}
}

Issue: O(N) linear search through all heights for each hash during save. With 200k cache size, this could be 200k × 200k = 40B operations.

Impact: SaveToStore() on shutdown could take significant time (minutes on large caches).

Recommendation:

  • Add a reverse index hashToHeight *lru.Cache[string, uint64] maintained alongside hashByHeight (see the sketch after this list)
  • Or encode both hash→height and height→hash in the store key/value
  • Given this is only called on shutdown, document the potential delay
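
A sketch of the reverse-index option from the first bullet above, assuming the generic lru.Cache mentioned there is hashicorp/golang-lru/v2 (type and method names here are illustrative):

package cache

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

// heightIndex keeps hash→height alongside height→hash so SaveToStore can
// resolve a hash's height in O(1) instead of scanning every key.
type heightIndex struct {
	hashByHeight *lru.Cache[uint64, string]
	heightByHash *lru.Cache[string, uint64]
}

func newHeightIndex(size int) (*heightIndex, error) {
	byHeight, err := lru.New[uint64, string](size)
	if err != nil {
		return nil, err
	}
	byHash, err := lru.New[string, uint64](size)
	if err != nil {
		return nil, err
	}
	return &heightIndex{hashByHeight: byHeight, heightByHash: byHash}, nil
}

// set records the mapping in both directions so either lookup stays O(1).
func (i *heightIndex) set(height uint64, hash string) {
	i.hashByHeight.Add(height, hash)
	i.heightByHash.Add(hash, height)
}

// heightOf replaces the linear scan criticised above.
func (i *heightIndex) heightOf(hash string) (uint64, bool) {
	return i.heightByHash.Peek(hash)
}

If the two caches must stay strictly in sync under eviction, lru.NewWithEvict could be used to drop the paired entry whenever one side evicts.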

5. Missing Validation in decodeDAInclusion

Location: block/internal/cache/generic_cache.go:90-97

func decodeDAInclusion(value []byte) (daHeight, blockHeight uint64, ok bool) {
	if len(value) != 16 {
		return 0, 0, false
	}
	daHeight = binary.LittleEndian.Uint64(value[0:8])
	blockHeight = binary.LittleEndian.Uint64(value[8:16])
	return daHeight, blockHeight, true
}

Issue: No validation that decoded values are reasonable (non-zero, daHeight > 0, blockHeight > 0).

Recommendation:

func decodeDAInclusion(value []byte) (daHeight, blockHeight uint64, ok bool) {
	if len(value) != 16 {
		return 0, 0, false
	}
	daHeight = binary.LittleEndian.Uint64(value[0:8])
	blockHeight = binary.LittleEndian.Uint64(value[8:16])
	// Validate that we didn't decode garbage
	if daHeight == 0 || blockHeight == 0 {
		return 0, 0, false
	}
	return daHeight, blockHeight, true
}
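
A small round-trip test sketch for this recommendation; it assumes encodeDAInclusion (referenced in issue #2's snippet) lives in the same package and that the zero-value check above has been added:

package cache

import "testing"

func TestDAInclusionEncodingRoundTrip(t *testing.T) {
	t.Parallel()

	// Encode then decode and expect the same values back.
	value := encodeDAInclusion(42, 7)
	daHeight, blockHeight, ok := decodeDAInclusion(value)
	if !ok || daHeight != 42 || blockHeight != 7 {
		t.Fatalf("round trip mismatch: got (%d, %d, %v)", daHeight, blockHeight, ok)
	}

	// Truncated input must be rejected.
	if _, _, ok := decodeDAInclusion(value[:8]); ok {
		t.Fatal("expected truncated value to be rejected")
	}

	// All-zero input should be rejected once the validation above is added.
	if _, _, ok := decodeDAInclusion(make([]byte, 16)); ok {
		t.Fatal("expected all-zero value to be rejected")
	}
}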

6. Tracing Without Context Propagation

Location: pkg/store/tracing.go:159-175

func (t *tracedStore) GetMetadata(ctx context.Context, key string) ([]byte, error) {
	ctx, span := t.tracer.Start(ctx, "Store.GetMetadata",
		trace.WithAttributes(attribute.String("key", key)),
	)
	defer span.End()

	data, err := t.inner.GetMetadata(ctx, key)
	// ...
}

Observation: The wrapped context with span is used in t.inner.GetMetadata(ctx, key), which is correct. However, many cache operations use context.Background(), so these traces won't be connected to parent operations.

Recommendation:

  • Document that cache persistence happens in background context
  • Consider adding cache-level tracing for DA inclusion operations
  • Link to broader tracing strategy documentation

Moderate Issues

7. Unclear Behavior on LRU Eviction

Location: block/internal/cache/generic_cache.go:17-25

const (
	DefaultItemsCacheSize = 200_000
	DefaultHashesCacheSize = 200_000
	DefaultDAIncludedCacheSize = 200_000
)

Issue: When LRU evicts entries, what happens to:

  • DA inclusion data that was persisted to store but evicted from cache?
  • Height-to-hash mappings that are evicted?

Impact: If a recently-evicted DA inclusion is checked via GetHeaderDAIncluded, it returns false even though it's in the store.

Current Behavior: Based on code review, evicted DA inclusion entries are still in the store but not in cache. This seems intentional (store is source of truth), but it's not documented.

Recommendation:

  • Document eviction behavior in comments
  • Clarify that RestoreFromStore only restores known hashes, not all store entries
  • Consider whether DA included cache should be unbounded or have a size proportional to finalization depth

8. Potential Nil Pointer in initDAHeightFromStore

Location: block/internal/cache/manager.go:428-452

func (m *implementation) initDAHeightFromStore(ctx context.Context) {
	daIncludedHeightBytes, err := m.store.GetMetadata(ctx, store.DAIncludedHeightKey)
	if err != nil || len(daIncludedHeightBytes) != 8 {
		return
	}
	// ... uses daIncludedHeight without checking if it's 0
	headerKey := store.GetHeightToDAHeightHeaderKey(daIncludedHeight)
	if headerBytes, err := m.store.GetMetadata(ctx, headerKey); err == nil && len(headerBytes) == 8 {
		// ...
	}
}

Issue: If daIncludedHeight is nonzero (say 1) but the corresponding header/data keys don't exist in the store, initialization fails silently.

Impact: Low - m.headerCache.setMaxDAHeight(0) is a no-op, and the system will recover when DA inclusion resumes.

Recommendation: Add debug logging when initialization fails or succeeds.
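
A sketch of that recommendation applied to the snippet above (the zerolog-style m.logger field and the exact messages are assumptions; the data-cache lookup would be handled the same way):

func (m *implementation) initDAHeightFromStore(ctx context.Context) {
	daIncludedHeightBytes, err := m.store.GetMetadata(ctx, store.DAIncludedHeightKey)
	if err != nil || len(daIncludedHeightBytes) != 8 {
		m.logger.Debug().Err(err).Msg("no persisted DA included height; starting from zero")
		return
	}
	daIncludedHeight := binary.LittleEndian.Uint64(daIncludedHeightBytes)

	headerKey := store.GetHeightToDAHeightHeaderKey(daIncludedHeight)
	headerBytes, err := m.store.GetMetadata(ctx, headerKey)
	if err != nil || len(headerBytes) != 8 {
		m.logger.Debug().Err(err).Uint64("da_included_height", daIncludedHeight).
			Msg("header DA-height mapping missing; skipping DA height init")
		return
	}
	m.headerCache.setMaxDAHeight(binary.LittleEndian.Uint64(headerBytes))
	m.logger.Debug().Uint64("da_included_height", daIncludedHeight).Msg("initialized max DA height from store")
}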


Minor Issues / Suggestions

9. Test Coverage Gaps

Based on codecov report (66.08% patch coverage, 58 lines missing):

Missing coverage in:

  • pkg/store/tracing.go: 13 lines (likely error paths)
  • pkg/store/store.go: 5 lines
  • block/internal/cache/manager.go: 12 missing + 11 partials

Recommendation:

  • Add tests for store metadata failures during DA inclusion
  • Test LRU eviction scenarios
  • Test concurrent setMaxDAHeight calls

10. Transaction Cache Cleanup Not Automatic

Location: block/internal/cache/manager.go:204-243

The CleanupOldTxs method is defined but I don't see it being called anywhere in the PR.

Issue: Without automatic cleanup, the transaction cache will grow until it hits the 200k-entry LRU limit and then evict entries arbitrarily.

Recommendation:

  • Call CleanupOldTxs periodically (e.g., every hour) in a background goroutine (see the sketch after this list)
  • Or document that it's the application's responsibility to call this
  • Consider whether LRU eviction is sufficient for tx cache
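
A sketch of the first option, with the scheduling decoupled from the cache manager so the real CleanupOldTxs signature (whatever it is) can be wired into the closure; the interval and function name are illustrative:

package cache

import (
	"context"
	"time"
)

// startPeriodicCleanup runs the supplied cleanup function on a fixed interval
// until the context is cancelled. Wire m.CleanupOldTxs into the closure when
// adopting this; the interval choice is up to the operator.
func startPeriodicCleanup(ctx context.Context, interval time.Duration, cleanup func()) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				cleanup()
			}
		}
	}()
}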

11. Documentation: Operational Impact

Location: PR description

Once nodes have upgraded to this version, they can do a `rm -rf $HOME/.appd/data/cache`.

Suggestions:

  1. Clarify that this is optional cleanup (the new code doesn't use that directory)
  2. Document expected behavior on first restart (DA inclusion data will be re-fetched from store metadata)
  3. Add migration notes to a docs file or wiki
  4. Mention that pending tx cache is ephemeral - txs in mempool during restart are lost

Code Quality Observations

Positive Patterns ✅

  1. Excellent test coverage for critical paths (DA inclusion persistence, cache restoration)
  2. Good use of atomic operations for concurrent access
  3. Clear separation of concerns (ephemeral vs persistent caches)
  4. Bounded data structures (LRU caches prevent OOM)
  5. Comprehensive comments explaining non-obvious behavior (e.g., why tx cache isn't deleted in DeleteHeight)

Areas for Improvement

  1. Error handling: Many store errors are silently dropped (see issue #2)
  2. Context handling: Widespread use of context.Background() instead of request contexts
  3. Metrics: Could add metrics for cache hit/miss rates, store persistence failures
  4. Logging: Some operations could benefit from structured logging of success cases

Security Considerations

✅ No security issues identified. The changes actually improve security by:

  • Using bounded caches (prevents memory-based DoS)
  • Properly separating sensitive data (no cache of private keys, which wasn't an issue before either)
  • Maintaining data integrity through atomic operations

Performance Considerations

Positive:

  • LRU caches prevent unbounded memory growth
  • Atomic operations are lock-free for read-heavy workloads
  • Metadata persistence is write-through (doesn't block operations)

Concerns:

  • SaveToStore linear search (issue #4) could delay shutdown on large caches
  • RestoreFromStore iterates all heights in store (could be slow on startup with 1M+ blocks)

Recommendation: Add startup/shutdown metrics and logging to identify bottlenecks in production.


Concurrency Analysis

The cache implementation uses a mix of:

  • LRU caches (internally thread-safe with read-write locks)
  • Mutexes for compound operations (itemsByHeightMu, hashByHeightMu)
  • Atomic operations for maxDAHeight

Race conditions checked:

  • getNextItem: Properly mutex-protected
  • deleteAllForHeight: Properly mutex-protected
  • ⚠️ setMaxDAHeight: See critical issue #1
  • daIncluded cache: LRU internal locking sufficient

Breaking Change Assessment

Breaking: Yes, as marked in CHANGELOG
Justification: Removing gob-based cache persistence is a significant architectural change
Migration Path: Clear and documented
Rollback: Should be safe (old data format abandoned, not corrupted)

Recommendations:

  1. Add version check/migration code if possible
  2. Document minimum safe upgrade path (can you skip versions?)
  3. Consider a validation tool to check store metadata integrity post-upgrade

Test Quality Review

Strengths:

  • Good coverage of happy paths and edge cases
  • Tests verify persistence/restoration behavior
  • Tests use table-driven patterns where appropriate
  • Tests are parallelized with t.Parallel()

Improvements needed:

  • Add failure injection tests (store write failures, corrupt metadata)
  • Add concurrent access tests (parallel setDAIncluded + daHeight calls)
  • Add benchmark tests for RestoreFromStore with large datasets
  • Test LRU eviction scenarios explicitly

Example test to add:

func TestCache_ConcurrentSetMaxDAHeight(t *testing.T) {
	c := NewCache[testItem](nil)
	var wg sync.WaitGroup
	
	// Hammer setMaxDAHeight with concurrent calls
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(height uint64) {
			defer wg.Done()
			c.setDAIncluded(fmt.Sprintf("hash%d", height), height, height)
		}(uint64(i))
	}
	wg.Wait()
	
	// Should have max height
	assert.Equal(t, uint64(99), c.daHeight())
}

Recommendations Summary

Must Fix (Before Merge):

  1. ⚠️ Add error handling/panic to setMaxDAHeight if CAS fails 1000 times (issue #1)
  2. ⚠️ Add error logging for store persistence failures (issue #2)

Should Fix (This Release):

  3. Fix inefficient SaveToStore hash lookup (issue #4)
  4. Add context parameter to cache persistence operations (issue #3)
  5. Add validation to decodeDAInclusion (issue #5)

Nice to Have (Future):

  6. Automatic CleanupOldTxs scheduling (issue #10)
  7. Enhanced test coverage for failure scenarios (issue #9)
  8. Document LRU eviction behavior (issue #7)
  9. Add cache operation metrics

Documentation:

  10. Expand migration guide with operational details (issue #11)
  11. Document context usage patterns for store operations


Final Assessment

Code Quality: ⭐⭐⭐⭐☆ (4/5)
Architecture: ⭐⭐⭐⭐⭐ (5/5)
Test Coverage: ⭐⭐⭐⭐☆ (4/5)
Documentation: ⭐⭐⭐☆☆ (3/5)
Security: ⭐⭐⭐⭐⭐ (5/5)

Overall: This is a well-executed refactoring that significantly improves the cache architecture. The critical issues identified are edge cases that need attention but don't undermine the core design. With the recommended fixes (especially #1 and #2), this PR will be production-ready.

The move from gob persistence to metadata-based persistence for DA inclusion is architecturally sound and sets up better scalability for the future. The test coverage is comprehensive for happy paths, but could benefit from more failure scenario testing.

Recommendation: Approve after addressing critical issues #1 and #2.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a helpful debug log for heights not yet included in the DA layer, which will aid in troubleshooting. The accompanying refactorings simplify the code by removing unnecessary temporary variables, improving conciseness. The changes are well-implemented and look good.

@codecov

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 67.58242% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.35%. Comparing base (2510fe7) to head (910f332).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
block/internal/cache/manager.go 69.73% 12 Missing and 11 partials ⚠️
block/internal/cache/generic_cache.go 79.48% 8 Missing and 8 partials ⚠️
pkg/store/tracing.go 0.00% 13 Missing ⚠️
pkg/store/store.go 0.00% 5 Missing ⚠️
block/components.go 0.00% 0 Missing and 1 partial ⚠️
block/internal/syncing/syncer.go 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3047      +/-   ##
==========================================
+ Coverage   56.24%   56.35%   +0.11%     
==========================================
  Files         118      118              
  Lines       12075    12036      -39     
==========================================
- Hits         6791     6783       -8     
+ Misses       4537     4511      -26     
+ Partials      747      742       -5     
Flag Coverage Δ
combined 56.35% <67.58%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@julienrbrt julienrbrt changed the title from "chore: log non included height" to "refactor(cache)!: remove gob cache" Feb 3, 2026
@julienrbrt julienrbrt marked this pull request as draft February 3, 2026 16:13
@julienrbrt julienrbrt marked this pull request as ready for review February 3, 2026 17:13
@julienrbrt julienrbrt enabled auto-merge February 4, 2026 10:48
@julienrbrt julienrbrt added this pull request to the merge queue Feb 4, 2026
Merged via the queue into main with commit d20b1ac Feb 4, 2026
27 of 28 checks passed
@julienrbrt julienrbrt deleted the julien/logs branch February 4, 2026 11:20
