Conversation


@alpe alpe commented Nov 12, 2025

Implement failover via RAFT

  • Improve Cache startup/shutdown with parallelization
  • Publish to RAFT cluster in executor
  • Sync DB after each block created in executor
  • Add new RaftReceiver to sync when in aggregator follower mode
  • Introduce failoverState to switch between follower/leader mode
  • Provide RAFT node details via http endpoint


github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Nov 20, 2025, 4:40 PM


codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 42.33655% with 538 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.09%. Comparing base (a75b4bc) to head (b34d7a0).

Files with missing lines Patch % Lines
pkg/raft/node.go 12.50% 168 Missing ⚠️
pkg/raft/node_mock.go 45.08% 74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go 0.00% 60 Missing ⚠️
node/full.go 31.34% 39 Missing and 7 partials ⚠️
node/failover.go 72.44% 22 Missing and 13 partials ⚠️
block/internal/syncing/syncer.go 28.88% 30 Missing and 2 partials ⚠️
block/internal/executing/executor.go 6.25% 25 Missing and 5 partials ⚠️
pkg/raft/election.go 79.26% 12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go 6.66% 13 Missing and 1 partial ⚠️
block/internal/cache/generic_cache.go 82.53% 5 Missing and 6 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2836      +/-   ##
==========================================
- Coverage   64.67%   62.09%   -2.58%     
==========================================
  Files          81       88       +7     
  Lines        7173     7891     +718     
==========================================
+ Hits         4639     4900     +261     
- Misses       1995     2416     +421     
- Partials      539      575      +36     
Flag Coverage Δ
combined 62.09% <42.33%> (-2.58%) ⬇️

Flags with carried forward coverage won't be shown.


alpe added 4 commits November 17, 2025 10:55
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
@alpe alpe changed the title from "[WIP] feat: High availabilty via RAFT" to "feat: High availabilty via RAFT" on Nov 17, 2025
alpe added 4 commits November 18, 2025 11:00
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@alpe alpe marked this pull request as ready for review November 19, 2025 09:54

alpe commented Nov 19, 2025

@claude review again


claude bot commented Nov 19, 2025

Claude finished @alpe's task.


Code Review Complete

I've conducted a comprehensive review of the RAFT-based high availability implementation. Here are my findings organized by category:

🎯 High-Level Architecture Assessment

Strengths:

  • Clean separation of concerns with RAFT node, election logic, and failover state
  • Proper use of interfaces for testability (Runnable, sourceNode)
  • Good abstraction of leader/follower factories for dynamic role switching

Architecture Concerns:

  1. JSON serialization overhead (pkg/raft/node.go:211) - As noted in the TODO and review comments, using JSON for RAFT log entries will add significant overhead at 100ms block times. This should be migrated to protobuf for production use.
  2. Failover downtime - The failoverState rebuilds the entire P2P stack on each role transition, which could introduce downtime during failover.

🔒 Security Issues

Critical

  1. Proposer validation bypass (block/internal/syncing/raft_retriever.go:112-115)

    if err := assertExpectedProposer(r.genesis, header.ProposerAddress); err != nil {
        r.logger.Debug().Err(err).Msg("unexpected proposer")
        return nil  // Silently ignores invalid proposer!
    }

    Issue: Invalid proposers are logged but not rejected. This could allow unauthorized nodes to propose blocks.
    Recommendation: Return an error instead of nil to properly reject invalid proposers.
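    A minimal sketch of that fix, reusing the names from the snippet above (the exact signature of the surrounding method may differ):

    if err := assertExpectedProposer(r.genesis, header.ProposerAddress); err != nil {
        r.logger.Warn().Err(err).Msg("rejecting block from unexpected proposer")
        return fmt.Errorf("unexpected proposer: %w", err) // reject instead of silently ignoring
    }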

  2. No signature verification in RAFT retriever (block/internal/syncing/raft_retriever.go:99-128)

    • The consumeRaftBlock method unmarshals headers but doesn't verify signatures on SignedHeader
    • Compare with DA retriever which validates SignedData
    • This could allow malicious RAFT members to inject invalid blocks
      Recommendation: Add signature verification similar to assertValidSignedData in block/internal/syncing/assert.go
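      One possible shape of that check, sketched with a hypothetical assertValidSignedHeader helper that would mirror assertValidSignedData (the real verification primitive may differ):

      // hypothetical helper: verify the proposer's signature over the header
      // before accepting any block payload received via the RAFT log
      if err := assertValidSignedHeader(r.genesis, header); err != nil {
          return fmt.Errorf("invalid signed header received via raft: %w", err)
      }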

Medium

  1. Error channel overflow (pkg/raft/election.go:58, 78)

    errCh := make(chan error, 1)  // Buffer of 1
    select {
    case errCh <- fmt.Errorf(...):
    default: // do not block
    }

    Issue: If multiple worker errors occur rapidly, only the first is captured. Subsequent errors are silently dropped.
    Recommendation: Either increase buffer size or use atomic error storage for better error tracking.
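    A sketch of the atomic-error alternative; errors.Join keeps every worker error instead of all but the first (type and method names are illustrative):

    type workerErrs struct {
        mu  sync.Mutex
        err error
    }

    func (w *workerErrs) record(err error) {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.err = errors.Join(w.err, err) // accumulate instead of dropping
    }

    func (w *workerErrs) take() error {
        w.mu.Lock()
        defer w.mu.Unlock()
        err := w.err
        w.err = nil
        return err
    }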

  2. Context leakage risk (pkg/raft/election.go:66-82)

    • startWorker creates new contexts but relies on workerCancel being called
    • If panic occurs before reassignment, context might leak
      Recommendation: Add defer to cancel previous context before creating new one
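      A sketch of one way to do this (method and field names are hypothetical, not the actual election API):

      func (d *leaderElection) startWorker(parent context.Context) {
          if d.workerCancel != nil {
              d.workerCancel() // always release the previous worker's context first
              d.workerCancel = nil
          }
          ctx, cancel := context.WithCancel(parent)
          d.workerCancel = cancel
          go d.runWorker(ctx)
      }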

🐛 Correctness & Logic Issues

High Priority

  1. Race condition in leadership transition (pkg/raft/election.go:96-104)

    raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
    if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
        d.logger.Info().Msg("became leader, but not synced. Pass on leadership")
        if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
            return err  // Could cause double signing!
        }
        continue
    }

    Issue: Between checking sync status and transferring leadership, the node might start producing blocks if leadership transfer fails silently or takes time. The comment mentions "Better stop than double sign on old state" but the code doesn't actually stop.
    Recommendation:

    • Set a flag to block block production during leadership transfer
    • Add timeout to leadership transfer
    • Consider shutting down instead of continuing if sync check fails
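
    A sketch of the suggested guard (names are hypothetical): the election loop closes the gate while a leadership transfer is in flight, and the executor checks it before signing a block.

    type productionGate struct{ blocked atomic.Bool }

    func (g *productionGate) Close() { g.blocked.Store(true) }  // called before leadershipTransfer
    func (g *productionGate) Open()  { g.blocked.Store(false) } // called once sync is confirmed
    func (g *productionGate) IsOpen() bool { return !g.blocked.Load() }
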
  2. Duplicate log message (pkg/raft/election.go:105)

    • Lines 94 and 105 both log "became leader, stopping follower operations"
    • This is confusing in the logs and indicates a potential logic issue
  3. Store sync before broadcast (block/internal/executing/executor.go:461-463)

    if err := e.store.Sync(context.Background()); err != nil {
        return fmt.Errorf("failed to sync store: %w", err)
    }

    Good practice: This ensures data is flushed to disk before broadcasting. However, using context.Background() bypasses cancellation. Consider using e.ctx or a timeout context.

  4. Incomplete state rollback (block/internal/executing/executor.go:330-332)

    • If RAFT leadership check fails after retrieval but before production, batch state might be inconsistent
    • No cleanup of sequencer batch if production aborts early

Medium Priority

  1. Ticker stops but channel not nil (pkg/raft/election.go:132-133)

    ticker.Stop()
    ticker.C = nil

    Issue: Setting ticker.C = nil after Stop() is unusual. Stop() does not close the channel; a stopped ticker simply stops delivering ticks, so reads on it already block, exactly as they do on a nil channel. The assignment is redundant and obscures the intent.
    Recommendation: Either drop the nil assignment or restructure to avoid the ticker path after the first start.

  2. Missing height gap validation (pkg/raft/types.go:15-22)

    func (s RaftBlockState) assertValid(next RaftBlockState) error {
        if s.Height > next.Height { return fmt.Errorf(...) }
        if s.Timestamp > next.Timestamp { return fmt.Errorf(...) }
        return nil
    }

    Issue: Comment says "does not ensure that no gaps exist" but the FSM Apply uses this for validation. If Height jumps from 5 to 10, this passes validation but creates a gap.
    Recommendation: Add check: if s.Height > 0 && next.Height != s.Height && next.Height != s.Height+1
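    A sketch of that stricter check, mirroring the method above; it still tolerates a replay of the current height but rejects anything that skips ahead:

    func (s RaftBlockState) assertValid(next RaftBlockState) error {
        if next.Height < s.Height {
            return fmt.Errorf("height decreased: %v -> %v", s.Height, next.Height)
        }
        if s.Height > 0 && next.Height != s.Height && next.Height != s.Height+1 {
            return fmt.Errorf("height gap: %v -> %v", s.Height, next.Height)
        }
        if next.Timestamp < s.Timestamp {
            return fmt.Errorf("timestamp decreased: %v -> %v", s.Timestamp, next.Timestamp)
        }
        return nil
    }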

  3. Silent message drops (pkg/raft/node.go:291-297)

    select {
    case f.applyCh <- RaftApplyMsg{...}:
    default:
        f.logger.Warn().Msg("apply channel full, dropping message")
    }

    Issue: Dropped RAFT apply messages mean followers miss state updates. While the comment mentions they can sync from DA/peers, this could cause extended sync lag.
    Recommendation: Add metrics to track drops and consider backpressure mechanism.
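    A sketch of the metrics side, applied inside the existing select (the metric name and the promauto dependency are suggestions, not existing code):

    var raftApplyDropped = promauto.NewCounter(prometheus.CounterOpts{
        Name: "raft_apply_msgs_dropped_total",
        Help: "RAFT apply messages dropped because the apply channel was full.",
    })

    select {
    case f.applyCh <- msg:
    default:
        raftApplyDropped.Inc()
        f.logger.Warn().Msg("apply channel full, dropping message")
    }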


🔄 Concurrency & Race Conditions

  1. Atomic flag without synchronization (node/failover.go:252-272)

    • singleRoleElector.running uses atomic.Bool but runnable is accessed without synchronization
    • Potential race if Run() and state() are called concurrently
  2. Map overwrites without locking (block/internal/cache/generic_cache.go)

    • Uses sync.Map correctly, but parallel calls to setSeen with same height could race on hashByHeight
    • Not critical since sync.Map handles it, but could lead to unexpected pruning behavior
  3. Cancel function reassignment (block/internal/syncing/raft_retriever.go:54-57, 72-75)

    if r.cancel != nil {
        return errors.New("syncer already started")
    }
    ctx, r.cancel = context.WithCancel(ctx)

    Issue: Mutex released before goroutine starts, allowing Stop() to be called between unlock and goroutine launch
    Recommendation: Start goroutine before releasing mutex
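    A sketch of that ordering (r.loop stands in for the retriever's worker): the goroutine is launched while the mutex is still held, so Stop() cannot run between the check and the launch.

    r.mu.Lock()
    defer r.mu.Unlock()
    if r.cancel != nil {
        return errors.New("syncer already started")
    }
    ctx, cancel := context.WithCancel(ctx)
    r.cancel = cancel
    go r.loop(ctx)
    return nil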


⚡ Performance & Resource Management

  1. Sequential cache persistence (block/internal/cache/generic_cache.go:159-189)

    • PR description mentions "parallelization" improvement, which is good
    • New parallel loading using errgroup is excellent for startup performance
  2. Undersized apply channel buffer (block/internal/syncing/raft_retriever.go:58)

    applyCh := make(chan raft.RaftApplyMsg, 1)

    Issue: Buffer of 1 is very small. If processing a message takes longer than block time, messages will be dropped per pkg/raft/node.go:296
    Recommendation: Increase buffer size or make it configurable based on block time
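    A sketch of deriving the buffer from block time (config field names are hypothetical):

    // allow roughly two seconds of backlog before the drop path is hit
    bufSize := int(2 * time.Second / cfg.BlockTime) // ~20 slots at 100ms blocks
    if bufSize < 1 {
        bufSize = 1
    }
    applyCh := make(chan raft.RaftApplyMsg, bufSize)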

  3. HTTP server shutdown timeout (node/failover.go:221-223)

    • 3-second timeout is reasonable but should be configurable
    • Background context means parent cancellation is ignored
  4. Heartbeat timeout configuration (pkg/raft/node.go:57-58)

    raftConfig.HeartbeatTimeout = cfg.HeartbeatTimeout
    raftConfig.LeaderLeaseTimeout = cfg.HeartbeatTimeout / 2

    Concern: At 100ms block times, default RAFT timeouts might cause unnecessary leader elections. Ensure defaults are tuned for high-frequency consensus.


🧪 Testing & Reliability

  1. Low test coverage (42.33% patch coverage)

    • pkg/raft/node.go: Only 12.50% covered (168 lines missing)
    • block/internal/syncing/raft_retriever.go: 0% covered
    • These are critical consensus components that need extensive testing
  2. Mock coverage (pkg/raft/node_mock.go)

    • 45% coverage for mocks is concerning
    • Suggests integration tests aren't exercising all code paths
  3. E2E test exists (test/e2e/failover_e2e_test.go:44)

    • Good to see TestLeaseFailoverE2E
    • Should add tests for:
      • Double-sign prevention
      • Byzantine node behavior
      • Network partition scenarios
      • Rapid leader changes
  4. Election test coverage (pkg/raft/election_test.go)

    • Good coverage of basic scenarios
    • Missing: sync check failures, leadership transfer failures, worker panic recovery

📝 Code Quality & Maintainability

  1. Typo (pkg/raft/node.go:120)

    n.logger.Info().Msg("Boostrap raft cluster")  // "Boostrap" -> "Bootstrap"
  2. Inconsistent nil checks (pkg/raft/node.go:107-110, 148-151, 169-173)

    • Methods like Start(), waitForMsgsLanded(), Stop() check if n == nil
    • But IsLeader() checks both n == nil and n.raft == nil
    • Other methods like Broadcast() don't check for nil
      Recommendation: Be consistent - either all methods handle nil receivers or none do
  3. Magic numbers (pkg/raft/election.go:83)

    ticker := time.NewTicker(300 * time.Millisecond)

    Recommendation: Extract to constant with descriptive name
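    For example (the constant name is a suggestion):

    // how often the follower loop re-checks leadership and sync status
    const leaderCheckInterval = 300 * time.Millisecond

    ticker := time.NewTicker(leaderCheckInterval)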

  4. Error messages lack context (block/internal/syncing/raft_retriever.go:92)

    r.logger.Error().Err(err).Uint64("height", msg.State.Height).Msg("failed to apply raft block")
    • Good structured logging
    • Consider adding raft index from msg.Index for debugging
  5. Missing godoc (pkg/raft/election.go:25-33)

    • sourceNode interface lacks documentation
    • Method purposes unclear without reading implementation
  6. Shutdown error handling (node/failover.go:223)

    _ = f.rpcServer.Shutdown(shutdownCtx)
    • Ignores error, should at least log it

🎨 Configuration & Documentation

  1. RAFT config validation (pkg/raft/node.go:48-52)

    • No validation of config parameters
    • Should validate: NodeID not empty, RaftAddr valid, peers well-formed, timeouts > 0
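
    A sketch of such a Validate method; field names follow the ones referenced in this review (NodeID, RaftAddr, HeartbeatTimeout, SendTimeout) and may not match the actual struct exactly:

    func (c Config) Validate() error {
        if c.NodeID == "" {
            return errors.New("raft: node id must not be empty")
        }
        if _, err := net.ResolveTCPAddr("tcp", c.RaftAddr); err != nil {
            return fmt.Errorf("raft: invalid raft address %q: %w", c.RaftAddr, err)
        }
        if c.HeartbeatTimeout <= 0 {
            return errors.New("raft: heartbeat timeout must be positive")
        }
        if c.SendTimeout <= 0 {
            return errors.New("raft: send timeout must be positive")
        }
        return nil
    }
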
  2. Bootstrap-only mode (pkg/raft/node.go:111-113)

    if !n.config.Bootstrap {
        return fmt.Errorf("raft cluster requires bootstrap mode")
    }

    Issue: This means Start() always fails for non-bootstrap nodes. Unclear how follower nodes join.
    Recommendation: Document expected join flow or support non-bootstrap starts

  3. Config documentation (pkg/config/config.go:29-49)

    • RAFT config flags not visible in this excerpt
    • Ensure comprehensive documentation for HA setup

✅ Positive Highlights

  1. Excellent separation of concerns - RAFT logic cleanly separated from application logic
  2. Graceful failover handling - Leadership loss returns specific error for clean shutdown
  3. Parallel cache loading - Good performance improvement
  4. Comprehensive E2E test - 600+ line failover test shows commitment to testing
  5. Proper use of contexts - Cancellation handled throughout
  6. Structured logging - Consistent use of zerolog with context
  7. Wait for messages (pkg/raft/node.go:148-167) - Critical for preventing double-signing

🎯 Priority Recommendations

Must Fix Before Merge

  1. Fix proposer validation bypass (Security, Critical #1)
  2. Add signature verification to RAFT retriever (Security, Critical #2)
  3. Fix race condition in leadership transition (Correctness, High Priority #1)
  4. Increase test coverage to >70% for pkg/raft/* and raft_retriever.go

Should Fix Soon

  1. Migrate to protobuf for RAFT messages (performance)
  2. Add height gap validation (data integrity)
  3. Fix context usage in Sync call
  4. Add config validation

Nice to Have

  1. Improve error handling (no silent drops)
  2. Add metrics for RAFT operations
  3. Document bootstrap vs non-bootstrap node behavior
  4. Increase apply channel buffer size

📊 Coverage Analysis

The 42% patch coverage is concerning for a consensus-critical feature. Priority areas:

  • pkg/raft/node.go: 12.50% → target 80%+
  • block/internal/syncing/raft_retriever.go: 0% → target 80%+
  • node/failover.go: 72.44% → target 85%+

📚 Related Review Comments

I've reviewed the existing PR comments. Key items already discussed:

  • Protobuf migration (tac0turtle) - acknowledged
  • Sync state checking (julienrbrt) - good discussion on dependencies
  • Flush before broadcast (alpe) - correctly implemented
  • P2P peer setup (julienrbrt/alpe) - clarified aggregator behavior

Overall Assessment: This is a solid foundation for HA via RAFT, but has critical security gaps (proposer validation, signature verification) and needs significantly more test coverage before production use. The architecture is sound but needs hardening around edge cases and Byzantine scenarios.

@evstack evstack deleted a comment from claude bot Nov 19, 2025
@alpe alpe requested a review from julienrbrt November 19, 2025 11:35
Member

@julienrbrt julienrbrt left a comment


First walkthrough; I'll go read up on github.com/hashicorp/raft and come back to review election.go and node.go.

return nil
}

// Height returns the current height stored
Member

Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.

Contributor Author

When the node switches from sync to aggregator mode, the internal state is key to prevent double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership away in case the node is not fully synced yet.
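
For illustration only (method names here are assumptions, not the actual Syncer API), the check boils down to:

    // the node is still catching up if any store has seen a height beyond
    // what has been executed locally
    func (s *Syncer) isCatchingUpState() bool {
        current := s.lastExecutedHeight()
        return s.highestSeenHeaderHeight() > current || s.highestSeenDataHeight() > current
    }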

}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
Member

nit: what is this for? the go doc is very light

Contributor Author

The channel is passed by the syncer to receive first level state updates from within the raft cluster. This should be the fastest communication channel available.

}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
Member

@julienrbrt julienrbrt Nov 19, 2025

Unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.

Contributor Author

Yes, I was only preparing for the single sequencer. Based sequencing would not work with RAFT as there are no aggregators.

leaderFactory := func() (raftpkg.Runnable, error) {
logger.Info().Msg("Starting aggregator-MODE")
nodeConfig.Node.Aggregator = true
nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
Member

not sure I understand this. is the aggregator broadcasting to no one?

Contributor

The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.

Contributor Author

This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense though. When an HA cluster is set up, the RAFT leader takes the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.

node/full.go Outdated
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
raftDir := nodeConfig.Raft.RaftDir
if raftDir == "" {
raftDir = filepath.Join(nodeConfig.RootDir, "raft")
Member

nit: we should be using DefaultConfig() value if empty.

bc *block.Components
}

func newSyncMode(
Member

nit: I was a tiny bit confused that this was moved here instead of full.go

Contributor Author

These are the constructors. Naming could be better, I guess.

}
return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
}
func newAggregatorMode(
Member

ditto

return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo:use protobuf
Member

why the todo? size?

Contributor

We should migrate to protobuf here. JSON will cause overhead, and at 100ms block times we need to minimise it as much as possible.

* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
alpe added 3 commits November 20, 2025 17:24
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Speed up cache write/loads via parallel execution.  

Pulled from  #2836
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Minor updates to make it easier to trace errors

Extracted from #2836