feat: High availability via RAFT #2836
base: main
Conversation
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Codecov Report

❌ Patch coverage is 42% on the changed lines (the figure is restated in the coverage analysis below). Additional details and impacted files:

@@ Coverage Diff @@
## main #2836 +/- ##
==========================================
- Coverage 64.67% 62.09% -2.58%
==========================================
Files 81 88 +7
Lines 7173 7891 +718
==========================================
+ Hits 4639 4900 +261
- Misses 1995 2416 +421
- Partials 539 575 +36
* main:
  - build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  - build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  - build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  - build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  - build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@claude review again
Claude finished @alpe's task — View job

Code Review Complete

I've conducted a comprehensive review of the RAFT-based high availability implementation. Here are my findings organized by category:

🎯 High-Level Architecture Assessment

Strengths:
Architecture Concerns:
🔒 Security Issues

Critical
Medium
🐛 Correctness & Logic Issues

High Priority
Medium Priority
🔄 Concurrency & Race Conditions
⚡ Performance & Resource Management
🧪 Testing & Reliability
📝 Code Quality & Maintainability
🎨 Configuration & Documentation
✅ Positive Highlights
🎯 Priority Recommendations

Must Fix Before Merge
Should Fix Soon
Nice to Have
📊 Coverage Analysis

The 42% patch coverage is concerning for a consensus-critical feature. Priority areas:
📚 Related Review Comments

I've reviewed the existing PR comments. Key items already discussed:
Overall Assessment: This is a solid foundation for HA via RAFT, but it has critical security gaps (proposer validation, signature verification) and needs significantly more test coverage before production use. The architecture is sound but needs hardening around edge cases and Byzantine scenarios.
julienrbrt
left a comment
first walkthrough, I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go
        return nil
    }

    // Height returns the current height stored
Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.
When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer has an isCatchingUpState method now that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership in case the node is not fully synced yet.
    }

    // SetApplyCallback sets a callback to be called when log entries are applied
    func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
nit: what is this for? the go doc is very light
The channel is passed in by the syncer to receive first-level state updates from within the raft cluster. This should be the fastest communication channel available.
    }()

    // Check raft leadership if raft is enabled
    if e.raftNode != nil && !e.raftNode.IsLeader() {
unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.
Yes, I was only preparing for the single sequencer. Based sequencing would not work with raft as there are no aggregators.
    leaderFactory := func() (raftpkg.Runnable, error) {
        logger.Info().Msg("Starting aggregator-MODE")
        nodeConfig.Node.Aggregator = true
        nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
not sure I understand this. is the aggregator broadcasting to no one?
The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.
This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense though. When an HA cluster is set up, the raft leader takes the aggregator role, and I clear the peers when the p2p stack is restarted.
An error is thrown somewhere when peers are not empty.
node/full.go (outdated)

    func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
        raftDir := nodeConfig.Raft.RaftDir
        if raftDir == "" {
            raftDir = filepath.Join(nodeConfig.RootDir, "raft")
nit: we should be using the DefaultConfig() value if empty.
        bc *block.Components
    }

    func newSyncMode(
nit: i was a tiny bit confused this was moved here instead of full.go
These are the constructors. Naming could be better, I guess.
    }
        return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
    }

    func newAggregatorMode(
ditto
        return fmt.Errorf("not leader")
    }

    data, err := json.Marshal(state) // todo: use protobuf
why the todo? size?
We should migrate to protobuf here. JSON will cause overhead; at a 100ms block time we need to minimise it as much as possible.
* main:
  - chore: fix some comments (#2874)
  - chore: bump node in evm-single (#2875)
  - refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  - refactor: use state da height as well (#2872)
  - refactor: retrieve highest da height in cache (#2870)
  - chore: change from event count to start and end height (#2871)
Overview: Speed up cache writes/loads via parallel execution. Pulled from #2836.
Overview: Minor updates to make it easier to trace errors. Extracted from #2836.
Implement failover via RAFT