Conversation


@alpe alpe commented Nov 12, 2025

Implement failover via RAFT

  • Improve Cache startup/shutdown with parallelization
  • Publish to RAFT cluster in executor
  • Sync DB after each block created in executor
  • Add new RaftReceiver to sync when in aggregator follower mode
  • Introduce failoverState to switch between follower/leader mode
  • Provide RAFT node details via http endpoint


github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Nov 20, 2025, 4:40 PM


codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 42.33655% with 538 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.09%. Comparing base (a75b4bc) to head (b34d7a0).

Files with missing lines Patch % Lines
pkg/raft/node.go 12.50% 168 Missing ⚠️
pkg/raft/node_mock.go 45.08% 74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go 0.00% 60 Missing ⚠️
node/full.go 31.34% 39 Missing and 7 partials ⚠️
node/failover.go 72.44% 22 Missing and 13 partials ⚠️
block/internal/syncing/syncer.go 28.88% 30 Missing and 2 partials ⚠️
block/internal/executing/executor.go 6.25% 25 Missing and 5 partials ⚠️
pkg/raft/election.go 79.26% 12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go 6.66% 13 Missing and 1 partial ⚠️
block/internal/cache/generic_cache.go 82.53% 5 Missing and 6 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2836      +/-   ##
==========================================
- Coverage   64.67%   62.09%   -2.58%     
==========================================
  Files          81       88       +7     
  Lines        7173     7891     +718     
==========================================
+ Hits         4639     4900     +261     
- Misses       1995     2416     +421     
- Partials      539      575      +36     
Flag Coverage Δ
combined 62.09% <42.33%> (-2.58%) ⬇️

Flags with carried forward coverage won't be shown.


alpe added 4 commits November 17, 2025 10:55
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
@alpe alpe changed the title from "[WIP] feat: High availabilty via RAFT" to "feat: High availabilty via RAFT" on Nov 17, 2025
alpe added 4 commits November 18, 2025 11:00
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@alpe alpe marked this pull request as ready for review November 19, 2025 09:54

alpe commented Nov 19, 2025

@claude review again


claude bot commented Nov 19, 2025

Claude finished @alpe's task.


Code Review Complete

I've conducted a comprehensive review of the RAFT-based high availability implementation. Here are my findings organized by category:

🎯 High-Level Architecture Assessment

Strengths:

  • Clean separation of concerns with RAFT node, election logic, and failover state
  • Proper use of interfaces for testability (Runnable, sourceNode)
  • Good abstraction of leader/follower factories for dynamic role switching

Architecture Concerns:

  1. JSON serialization overhead (pkg/raft/node.go:211) - As noted in the TODO and review comments, using JSON for RAFT log entries will add significant overhead at 100ms block times. This should be migrated to protobuf for production use.
  2. Failover downtime - The failoverState rebuilds the entire P2P stack on each role transition, which could introduce downtime during failover.

🔒 Security Issues

Critical

  1. Proposer validation bypass (block/internal/syncing/raft_retriever.go:112-115)

    if err := assertExpectedProposer(r.genesis, header.ProposerAddress); err != nil {
        r.logger.Debug().Err(err).Msg("unexpected proposer")
        return nil  // Silently ignores invalid proposer!
    }

    Issue: Invalid proposers are logged but not rejected. This could allow unauthorized nodes to propose blocks.
    Recommendation: Return an error instead of nil to properly reject invalid proposers.
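    A minimal sketch of that fix, reusing the names from the snippet above (the exact signature of the surrounding method may differ):

    if err := assertExpectedProposer(r.genesis, header.ProposerAddress); err != nil {
        r.logger.Warn().Err(err).Msg("rejecting block from unexpected proposer")
        return fmt.Errorf("unexpected proposer: %w", err) // reject instead of silently ignoring
    }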

  2. No signature verification in RAFT retriever (block/internal/syncing/raft_retriever.go:99-128)

    • The consumeRaftBlock method unmarshals headers but doesn't verify signatures on SignedHeader
    • Compare with DA retriever which validates SignedData
    • This could allow malicious RAFT members to inject invalid blocks
      Recommendation: Add signature verification similar to assertValidSignedData in block/internal/syncing/assert.go
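      One possible shape of that check, sketched with a hypothetical assertValidSignedHeader helper that would mirror assertValidSignedData (the real verification primitive may differ):

      // hypothetical helper: verify the proposer's signature over the header
      // before accepting any block payload received via the RAFT log
      if err := assertValidSignedHeader(r.genesis, header); err != nil {
          return fmt.Errorf("invalid signed header received via raft: %w", err)
      }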

Medium

  1. Error channel overflow (pkg/raft/election.go:58, 78)

    errCh := make(chan error, 1)  // Buffer of 1
    select {
    case errCh <- fmt.Errorf(...):
    default: // do not block
    }

    Issue: If multiple worker errors occur rapidly, only the first is captured. Subsequent errors are silently dropped.
    Recommendation: Either increase buffer size or use atomic error storage for better error tracking.
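    A sketch of the atomic-error alternative; errors.Join keeps every worker error instead of all but the first (type and method names are illustrative):

    type workerErrs struct {
        mu  sync.Mutex
        err error
    }

    func (w *workerErrs) record(err error) {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.err = errors.Join(w.err, err) // accumulate instead of dropping
    }

    func (w *workerErrs) take() error {
        w.mu.Lock()
        defer w.mu.Unlock()
        err := w.err
        w.err = nil
        return err
    }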

  2. Context leakage risk (pkg/raft/election.go:66-82)

    • startWorker creates new contexts but relies on workerCancel being called
    • If panic occurs before reassignment, context might leak
      Recommendation: Add defer to cancel previous context before creating new one
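      A sketch of one way to do this (method and field names are hypothetical, not the actual election API):

      func (d *leaderElection) startWorker(parent context.Context) {
          if d.workerCancel != nil {
              d.workerCancel() // always release the previous worker's context first
              d.workerCancel = nil
          }
          ctx, cancel := context.WithCancel(parent)
          d.workerCancel = cancel
          go d.runWorker(ctx)
      }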

🐛 Correctness & Logic Issues

High Priority

  1. Race condition in leadership transition (pkg/raft/election.go:96-104)

    raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
    if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
        d.logger.Info().Msg("became leader, but not synced. Pass on leadership")
        if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
            return err  // Could cause double signing!
        }
        continue
    }

    Issue: Between checking sync status and transferring leadership, the node might start producing blocks if leadership transfer fails silently or takes time. The comment mentions "Better stop than double sign on old state" but the code doesn't actually stop.
    Recommendation:

    • Set a flag to block block production during leadership transfer
    • Add timeout to leadership transfer
    • Consider shutting down instead of continuing if sync check fails
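
    A sketch of the suggested guard (names are hypothetical): the election loop closes the gate while a leadership transfer is in flight, and the executor checks it before signing a block.

    type productionGate struct{ blocked atomic.Bool }

    func (g *productionGate) Close() { g.blocked.Store(true) }  // called before leadershipTransfer
    func (g *productionGate) Open()  { g.blocked.Store(false) } // called once sync is confirmed
    func (g *productionGate) IsOpen() bool { return !g.blocked.Load() }
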
  2. Duplicate log message (pkg/raft/election.go:105)

    • Lines 94 and 105 both log "became leader, stopping follower operations"
    • This is confusing in the logs and indicates a potential logic issue
  3. Store sync before broadcast (block/internal/executing/executor.go:461-463)

    if err := e.store.Sync(context.Background()); err != nil {
        return fmt.Errorf("failed to sync store: %w", err)
    }

    Good practice: This ensures data is flushed to disk before broadcasting. However, using context.Background() bypasses cancellation. Consider using e.ctx or a timeout context.

  4. Incomplete state rollback (block/internal/executing/executor.go:330-332)

    • If RAFT leadership check fails after retrieval but before production, batch state might be inconsistent
    • No cleanup of sequencer batch if production aborts early

Medium Priority

  1. Ticker stops but channel not nil (pkg/raft/election.go:132-133)

    ticker.Stop()
    ticker.C = nil

    Issue: Setting ticker.C = nil after Stop() is unusual. Stop() does not close the channel; a stopped ticker simply stops delivering ticks, so reads on it already block, exactly as they do on a nil channel. The assignment is redundant and obscures the intent.
    Recommendation: Either drop the nil assignment or restructure to avoid the ticker path after the first start.

  2. Missing height gap validation (pkg/raft/types.go:15-22)

    func (s RaftBlockState) assertValid(next RaftBlockState) error {
        if s.Height > next.Height { return fmt.Errorf(...) }
        if s.Timestamp > next.Timestamp { return fmt.Errorf(...) }
        return nil
    }

    Issue: Comment says "does not ensure that no gaps exist" but the FSM Apply uses this for validation. If Height jumps from 5 to 10, this passes validation but creates a gap.
    Recommendation: Add check: if s.Height > 0 && next.Height != s.Height && next.Height != s.Height+1
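    A sketch of that stricter check, mirroring the method above; it still tolerates a replay of the current height but rejects anything that skips ahead:

    func (s RaftBlockState) assertValid(next RaftBlockState) error {
        if next.Height < s.Height {
            return fmt.Errorf("height decreased: %v -> %v", s.Height, next.Height)
        }
        if s.Height > 0 && next.Height != s.Height && next.Height != s.Height+1 {
            return fmt.Errorf("height gap: %v -> %v", s.Height, next.Height)
        }
        if next.Timestamp < s.Timestamp {
            return fmt.Errorf("timestamp decreased: %v -> %v", s.Timestamp, next.Timestamp)
        }
        return nil
    }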

  3. Silent message drops (pkg/raft/node.go:291-297)

    select {
    case f.applyCh <- RaftApplyMsg{...}:
    default:
        f.logger.Warn().Msg("apply channel full, dropping message")
    }

    Issue: Dropped RAFT apply messages mean followers miss state updates. While the comment mentions they can sync from DA/peers, this could cause extended sync lag.
    Recommendation: Add metrics to track drops and consider backpressure mechanism.
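    A sketch of the metrics side, applied inside the existing select (the metric name and the promauto dependency are suggestions, not existing code):

    var raftApplyDropped = promauto.NewCounter(prometheus.CounterOpts{
        Name: "raft_apply_msgs_dropped_total",
        Help: "RAFT apply messages dropped because the apply channel was full.",
    })

    select {
    case f.applyCh <- msg:
    default:
        raftApplyDropped.Inc()
        f.logger.Warn().Msg("apply channel full, dropping message")
    }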


🔄 Concurrency & Race Conditions

  1. Atomic flag without synchronization (node/failover.go:252-272)

    • singleRoleElector.running uses atomic.Bool but runnable is accessed without synchronization
    • Potential race if Run() and state() are called concurrently
  2. Map overwrites without locking (block/internal/cache/generic_cache.go)

    • Uses sync.Map correctly, but parallel calls to setSeen with same height could race on hashByHeight
    • Not critical since sync.Map handles it, but could lead to unexpected pruning behavior
  3. Cancel function reassignment (block/internal/syncing/raft_retriever.go:54-57, 72-75)

    if r.cancel != nil {
        return errors.New("syncer already started")
    }
    ctx, r.cancel = context.WithCancel(ctx)

    Issue: Mutex released before goroutine starts, allowing Stop() to be called between unlock and goroutine launch
    Recommendation: Start goroutine before releasing mutex
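    A sketch of that ordering (r.loop stands in for the retriever's worker): the goroutine is launched while the mutex is still held, so Stop() cannot run between the check and the launch.

    r.mu.Lock()
    defer r.mu.Unlock()
    if r.cancel != nil {
        return errors.New("syncer already started")
    }
    ctx, cancel := context.WithCancel(ctx)
    r.cancel = cancel
    go r.loop(ctx)
    return nil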


⚡ Performance & Resource Management

  1. Sequential cache persistence (block/internal/cache/generic_cache.go:159-189)

    • PR description mentions "parallelization" improvement, which is good
    • New parallel loading using errgroup is excellent for startup performance
  2. Undersized apply channel buffer (block/internal/syncing/raft_retriever.go:58)

    applyCh := make(chan raft.RaftApplyMsg, 1)

    Issue: Buffer of 1 is very small. If processing a message takes longer than block time, messages will be dropped per pkg/raft/node.go:296
    Recommendation: Increase buffer size or make it configurable based on block time
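    A sketch of deriving the buffer from block time (config field names are hypothetical):

    // allow roughly two seconds of backlog before the drop path is hit
    bufSize := int(2 * time.Second / cfg.BlockTime) // ~20 slots at 100ms blocks
    if bufSize < 1 {
        bufSize = 1
    }
    applyCh := make(chan raft.RaftApplyMsg, bufSize)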

  3. HTTP server shutdown timeout (node/failover.go:221-223)

    • 3-second timeout is reasonable but should be configurable
    • Background context means parent cancellation is ignored
  4. Heartbeat timeout configuration (pkg/raft/node.go:57-58)

    raftConfig.HeartbeatTimeout = cfg.HeartbeatTimeout
    raftConfig.LeaderLeaseTimeout = cfg.HeartbeatTimeout / 2

    Concern: At 100ms block times, default RAFT timeouts might cause unnecessary leader elections. Ensure defaults are tuned for high-frequency consensus.


🧪 Testing & Reliability

  1. Low test coverage (42.33% patch coverage)

    • pkg/raft/node.go: Only 12.50% covered (168 lines missing)
    • block/internal/syncing/raft_retriever.go: 0% covered
    • These are critical consensus components that need extensive testing
  2. Mock coverage (pkg/raft/node_mock.go)

    • 45% coverage for mocks is concerning
    • Suggests integration tests aren't exercising all code paths
  3. E2E test exists (test/e2e/failover_e2e_test.go:44)

    • Good to see TestLeaseFailoverE2E
    • Should add tests for:
      • Double-sign prevention
      • Byzantine node behavior
      • Network partition scenarios
      • Rapid leader changes
  4. Election test coverage (pkg/raft/election_test.go)

    • Good coverage of basic scenarios
    • Missing: sync check failures, leadership transfer failures, worker panic recovery

📝 Code Quality & Maintainability

  1. Typo (pkg/raft/node.go:120)

    n.logger.Info().Msg("Boostrap raft cluster")  // "Boostrap" -> "Bootstrap"
  2. Inconsistent nil checks (pkg/raft/node.go:107-110, 148-151, 169-173)

    • Methods like Start(), waitForMsgsLanded(), Stop() check if n == nil
    • But IsLeader() checks both n == nil and n.raft == nil
    • Other methods like Broadcast() don't check for nil
      Recommendation: Be consistent - either all methods handle nil receivers or none do
  3. Magic numbers (pkg/raft/election.go:83)

    ticker := time.NewTicker(300 * time.Millisecond)

    Recommendation: Extract to constant with descriptive name
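    For example (the constant name is a suggestion):

    // how often the follower loop re-checks leadership and sync status
    const leaderCheckInterval = 300 * time.Millisecond

    ticker := time.NewTicker(leaderCheckInterval)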

  4. Error messages lack context (block/internal/syncing/raft_retriever.go:92)

    r.logger.Error().Err(err).Uint64("height", msg.State.Height).Msg("failed to apply raft block")
    • Good structured logging
    • Consider adding raft index from msg.Index for debugging
  5. Missing godoc (pkg/raft/election.go:25-33)

    • sourceNode interface lacks documentation
    • Method purposes unclear without reading implementation
  6. Shutdown error handling (node/failover.go:223)

    _ = f.rpcServer.Shutdown(shutdownCtx)
    • Ignores error, should at least log it

🎨 Configuration & Documentation

  1. RAFT config validation (pkg/raft/node.go:48-52)

    • No validation of config parameters
    • Should validate: NodeID not empty, RaftAddr valid, peers well-formed, timeouts > 0
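
    A sketch of such a Validate method; field names follow the ones referenced in this review (NodeID, RaftAddr, HeartbeatTimeout, SendTimeout) and may not match the actual struct exactly:

    func (c Config) Validate() error {
        if c.NodeID == "" {
            return errors.New("raft: node id must not be empty")
        }
        if _, err := net.ResolveTCPAddr("tcp", c.RaftAddr); err != nil {
            return fmt.Errorf("raft: invalid raft address %q: %w", c.RaftAddr, err)
        }
        if c.HeartbeatTimeout <= 0 {
            return errors.New("raft: heartbeat timeout must be positive")
        }
        if c.SendTimeout <= 0 {
            return errors.New("raft: send timeout must be positive")
        }
        return nil
    }
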
  2. Bootstrap-only mode (pkg/raft/node.go:111-113)

    if !n.config.Bootstrap {
        return fmt.Errorf("raft cluster requires bootstrap mode")
    }

    Issue: This means Start() always fails for non-bootstrap nodes. Unclear how follower nodes join.
    Recommendation: Document expected join flow or support non-bootstrap starts

  3. Config documentation (pkg/config/config.go:29-49)

    • RAFT config flags not visible in this excerpt
    • Ensure comprehensive documentation for HA setup

✅ Positive Highlights

  1. Excellent separation of concerns - RAFT logic cleanly separated from application logic
  2. Graceful failover handling - Leadership loss returns specific error for clean shutdown
  3. Parallel cache loading - Good performance improvement
  4. Comprehensive E2E test - 600+ line failover test shows commitment to testing
  5. Proper use of contexts - Cancellation handled throughout
  6. Structured logging - Consistent use of zerolog with context
  7. Wait for messages (pkg/raft/node.go:148-167) - Critical for preventing double-signing

🎯 Priority Recommendations

Must Fix Before Merge

  1. Fix proposer validation bypass (Security, Critical #1)
  2. Add signature verification to RAFT retriever (Security, Critical #2)
  3. Fix race condition in leadership transition (Correctness, High Priority #1)
  4. Increase test coverage to >70% for pkg/raft/* and raft_retriever.go

Should Fix Soon

  1. Migrate to protobuf for RAFT messages (performance)
  2. Add height gap validation (data integrity)
  3. Fix context usage in Sync call
  4. Add config validation

Nice to Have

  1. Improve error handling (no silent drops)
  2. Add metrics for RAFT operations
  3. Document bootstrap vs non-bootstrap node behavior
  4. Increase apply channel buffer size

📊 Coverage Analysis

The 42% patch coverage is concerning for a consensus-critical feature. Priority areas:

  • pkg/raft/node.go: 12.50% → target 80%+
  • block/internal/syncing/raft_retriever.go: 0% → target 80%+
  • node/failover.go: 72.44% → target 85%+

📚 Related Review Comments

I've reviewed the existing PR comments. Key items already discussed:

  • Protobuf migration (tac0turtle) - acknowledged
  • Sync state checking (julienrbrt) - good discussion on dependencies
  • Flush before broadcast (alpe) - correctly implemented
  • P2P peer setup (julienrbrt/alpe) - clarified aggregator behavior

Overall Assessment: This is a solid foundation for HA via RAFT, but has critical security gaps (proposer validation, signature verification) and needs significantly more test coverage before production use. The architecture is sound but needs hardening around edge cases and Byzantine scenarios.

@evstack evstack deleted a comment from claude bot Nov 19, 2025
@alpe alpe requested a review from julienrbrt November 19, 2025 11:35
Member

@julienrbrt julienrbrt left a comment


First walkthrough; I'll go read up on github.com/hashicorp/raft and come back to review election.go and node.go.

return nil
}

// Height returns the current height stored
Member

Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.

Contributor Author

When the node switches from sync to aggregator mode, the internal state is key to prevent double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership away in case the node is not fully synced yet.
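
For illustration only (method names here are assumptions, not the actual Syncer API), the check boils down to:

    // the node is still catching up if any store has seen a height beyond
    // what has been executed locally
    func (s *Syncer) isCatchingUpState() bool {
        current := s.lastExecutedHeight()
        return s.highestSeenHeaderHeight() > current || s.highestSeenDataHeight() > current
    }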

}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
Member

nit: what is this for? the go doc is very light

Contributor Author

The channel is passed by the syncer to receive first level state updates from within the raft cluster. This should be the fastest communication channel available.

}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
Member

@julienrbrt julienrbrt Nov 19, 2025

Unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.

Contributor Author

Yes, I was only preparing for the single sequencer. Based sequencing would not work with RAFT as there are no aggregators.

leaderFactory := func() (raftpkg.Runnable, error) {
logger.Info().Msg("Starting aggregator-MODE")
nodeConfig.Node.Aggregator = true
nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
Member

not sure I understand this. is the aggregator broadcasting to no one?

Contributor

The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.

Contributor Author

This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense though. When an HA cluster is set up, the RAFT leader takes the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.

node/full.go Outdated
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
raftDir := nodeConfig.Raft.RaftDir
if raftDir == "" {
raftDir = filepath.Join(nodeConfig.RootDir, "raft")
Member

nit: we should be using DefaultConfig() value if empty.

bc *block.Components
}

func newSyncMode(
Member

nit: I was a tiny bit confused that this was moved here instead of full.go

Contributor Author

These are the constructors. Naming could be better, I guess.

}
return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
}
func newAggregatorMode(
Member

ditto

return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo:use protobuf
Member

why the todo? size?

Contributor

We should migrate to protobuf here. JSON will cause overhead, and at 100ms block times we need to minimise it as much as possible.

* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
alpe added 3 commits November 20, 2025 17:24
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Speed up cache write/loads via parallel execution.  

Pulled from  #2836
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Minor updates to make it easier to trace errors

Extracted from #2836