chore(raft): drop hashicorp/raft backend, kv/ on raftengine natively (#590)
Conversation
- delete internal/raftengine/hashicorp and the etcd/hashicorp FSM adapter
- delete internal/raftstore (hashicorp-only Pebble log store)
- delete cmd/etcd-raft-migrate and cmd/etcd-raft-rollback (hashicorp migration tools)
- drop the hashicorp engine branch from the runtime factory and data-dir detection

kv/ now implements raftengine.StateMachine directly: kvFSM.Apply takes []byte, Snapshot returns raftengine.Snapshot, and Restore takes io.Reader. Legacy NewCoordinator/NewTransaction/NewLeaderProxy shims that took *raft.Raft are removed in favor of the engine-based constructors.

Hashicorp-typed identifiers are replaced across the tree: raft.ServerAddress and raft.ServerID become plain strings; raft.Server / raft.Voter become raftengine.Server with Suffrage "voter". adapter/ production code, main.go, shard_config.go and the demo server follow suit.

Test scaffolding in kv/ and adapter/ that built in-process hashicorp clusters is rewritten on top of the etcd factory with t.TempDir() data dirs. Hashicorp-specific tests that exercised the AddVoter flow, InmemTransport or the legacy LogStore (adapter/add_voter_join_test.go, adapter/dynamodb_failure_test.go, adapter/redis_list_raft_benchmark_test.go, multiraft_runtime_test.go bootstrap/mixed-artifact cases) are removed; equivalent behaviour is covered by the etcd engine and the bootstrap e2e harness.

go.mod is tidied to drop hashicorp/raft, Jille/raft-grpc-transport, hashicorp/go-hclog and transitive hashicorp deps.
/gemini review @codex review |
📝 Walkthrough

The pull request comprehensively migrates the codebase from HashiCorp Raft to an etcd-backed raft engine abstraction. It removes all HashiCorp Raft dependencies, deletes the corresponding raft engine wrapper/adapter code, refactors FSM interfaces, and updates leader address types from raft.ServerAddress to plain strings.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
- gci: re-format imports in adapter/redis_info_test.go, cmd/server/demo.go, kv/leader_routed_store_test.go.
- noctx: replace net.DialTimeout with (*net.Dialer).DialContext bound to a context in TestLeaderProxy_ForwardsWhenFollower so the wait honours both the dial timeout and the surrounding context.
- SA9004: annotate raftEngineMarkerFile with an explicit string type so the grouped const declaration does not silently downgrade it to untyped.
- unconvert: drop the redundant string(...) wrapper around lis.Addr().String() in kv/leader_routed_store_test.go.
- cyclop: extract openRaftEngine from demo.go run() so the raft engine setup (factory, data dir, bootstrap) lives in its own helper and run() drops below the cyclomatic-complexity threshold.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a35245a411
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
result, err := factory.Create(raftengine.FactoryConfig{
	LocalID:      cfg.raftID,
	LocalAddress: cfg.address,
	DataDir:      raftDir,
	Bootstrap:    cfg.raftBootstrap,
```
Pass peer config when creating demo Raft engines
This Create call never sets Peers, so in demo mode the non-bootstrap nodes (raftBootstrap=false for n2/n3 in the same file) start with an empty peer list and etcd rejects startup before joinCluster runs. I reproduced this by running ./server and seeing node startup fail with “no persisted peers and no peer list was supplied”. As written, the 3-node demo path cannot form a cluster.
```go
func detectRaftEngineFromDataDir(dir string) (raftEngineType, bool, error) {
	hashicorpArtifacts, err := hasRaftArtifacts(dir,
		"raft.db",
		"logs.dat",
		"stable.dat",
	)
	if err != nil {
		return "", false, err
	}
	etcdArtifacts, err := hasRaftArtifacts(dir,
		"wal",
		"snap",
```
Keep rejecting legacy Hashicorp raft artifacts
detectRaftEngineFromDataDir now only checks etcd markers, so a pre-marker data dir that still contains Hashicorp artifacts (raft.db, logs.dat, stable.dat) is treated as empty and gets an etcd marker written. On upgrade this can allow bootstrapping an incompatible engine over existing state instead of failing fast with ErrRaftEngineDataDir, which risks losing committed cluster history.
Existing grpc_test.go helpers (rawKVClient, transactionalKVClient) connect to hosts[0] and assume it is the cluster leader. The previous hashicorp setup enforced this with an electionTimeout=0 on node 0 while others waited 10s. With etcd/raft, election timing is randomised and any node can win, so the scan in Test_grpc_scan could race against a forwarded Put that had not yet applied on the contacted follower — the last Put's LastCommitTS hadn't propagated locally. ensureNodeZeroIsLeader waits for any leader to emerge, then asks the current leader to transfer leadership to node 0 via raftengine.Admin.TransferLeadershipToServer. The transfer is best-effort and the subsequent Eventually retries if a concurrent election races it.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5dc58a523
```go
result, err := factory.Create(raftengine.FactoryConfig{
	LocalID:      cfg.raftID,
	LocalAddress: cfg.address,
	DataDir:      raftDir,
	Bootstrap:    cfg.raftBootstrap,
```
Validate demo Raft data dir before creating engine
This code opens the etcd engine directly from raftDir without any compatibility check for legacy Hashicorp Raft artifacts; the previous setupStorage path rejected logs.dat/stable.dat before startup. In a restart/upgrade where --raftDataDir still contains legacy state, the node can initialize a fresh etcd Raft store instead of failing fast, which effectively ignores prior committed history and risks silent data loss. Add the same data-dir engine validation used in the main runtime before calling factory.Create.
Codex flagged two P1 issues on the initial drop-hashicorp commit:

(1) cmd/server/demo.go was calling factory.Create without seeding Peers. In demo mode the non-bootstrap nodes (n2, n3) would hit etcd's "no persisted peers and no peer list was supplied" guard and abort before joinCluster ever ran. Build the full 3-node peer list in main() and pass it as cfg.raftPeers so every node bootstraps with the same membership; raftBootstrap is forced true for all demo nodes so etcd forms the cluster directly.

(2) detectRaftEngineFromDataDir no longer noticed hashicorp artifacts (raft.db, logs.dat, stable.dat), so on upgrade an existing hashicorp data dir would silently be rewritten as an etcd cluster on top of the committed bolt state. Add hasHashicorpRaftArtifacts plus a new ErrLegacyHashicorpDataDir sentinel, and call the probe from ensureRaftEngineDataDir before any marker is written so production restarts fail fast with a migration-required error. demo.go's openRaftEngine runs the same check on its own data dir for the P2 follow-up codex raised there.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
multiraft_runtime_test.go (1)
160-197: ⚠️ Potential issue | 🟡 Minor

Add coverage for legacy HashiCorp artifact rejection. This test now covers etcd detection, but the PR's fail-fast behavior for legacy HashiCorp data dirs is not asserted here. Add at least one ErrLegacyHashicorpDataDir case so the no-silent-migration guard cannot regress.

🧪 Proposed test coverage

```diff
 func TestEnsureRaftEngineDataDir(t *testing.T) {
 	t.Run("writes marker for empty dir", func(t *testing.T) {
 		dir := t.TempDir()
 		require.NoError(t, ensureRaftEngineDataDir(dir, raftEngineEtcd))
@@
 		require.True(t, ok)
 		require.Equal(t, raftEngineEtcd, engineType)
 	})
+
+	t.Run("rejects legacy hashicorp raft db", func(t *testing.T) {
+		dir := t.TempDir()
+		require.NoError(t, os.WriteFile(filepath.Join(dir, "raft.db"), []byte("legacy"), 0o600))
+
+		err := ensureRaftEngineDataDir(dir, raftEngineEtcd)
+		require.ErrorIs(t, err, ErrLegacyHashicorpDataDir)
+	})
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiraft_runtime_test.go` around lines 160 - 197, Add a new subtest in TestEnsureRaftEngineDataDir that creates a legacy HashiCorp-style data layout (e.g., a directory/file pattern your code treats as legacy) and calls detectRaftEngineFromDataDir; assert that it returns the specific sentinel error ErrLegacyHashicorpDataDir (and appropriate ok==false) so the fail-fast guard cannot regress. Locate the test block around TestEnsureRaftEngineDataDir and add the new t.Run that sets up the legacy artifact, invokes detectRaftEngineFromDataDir, and checks for ErrLegacyHashicorpDataDir.cmd/server/demo.go (1)
584-591: ⚠️ Potential issue | 🟡 Minor

coordinator.Close() fires as soon as run() returns, not at process exit. run() does not block on eg.Wait() — it registers goroutines and returns nil once setup succeeds (cleanup.Release() at line 667, then return nil). As a result this defer deregisters the leader-loss callback on the engine while the server is still serving traffic. Functional impact is limited (the state == raftengine.StateLeader gate in LeaseRead still holds the safety invariant), but the fast-path "invalidate lease the instant we lose leadership" guarantee documented on NewCoordinatorWithEngine is lost for the whole demo session.

The doc comment ("before the process exits … matches the symmetric construction order") is also misleading because the symmetric teardown point here is eg.Wait() in main(), not this run() return.

Either hoist the Close() into the CleanupStack (so it only fires when the cleanup stack runs on a real teardown) or wire it into a shutdown task driven by ctx.Done().

🛠 Suggested approach

```diff
-	defer func() {
-		// Release the leader-loss callback slot on the engine before
-		// the process exits. The engine itself is closed elsewhere in
-		// the shutdown path; both orderings are safe, but releasing
-		// the closure here matches the symmetric construction order.
-		_ = coordinator.Close()
-	}()
+	// Release the leader-loss callback slot on the engine only on real
+	// teardown; run() returns after spawning goroutines, so a plain
+	// defer here would fire while the server is still serving.
+	cleanup.Add(func() { _ = coordinator.Close() })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/server/demo.go` around lines 584 - 591, The defer calling coordinator.Close() in run() unregisters the leader-loss callback too early because run() returns before process shutdown; move the Close() call into the real teardown path instead of a function-local defer — e.g. register coordinator.Close with the CleanupStack/Cleanup.Release sequence or add a shutdown goroutine that waits for ctx.Done()/eg.Wait() and then calls coordinator.Close(); update/remove the misleading doc comment that says “before the process exits … matches the symmetric construction order” to reference eg.Wait() as the true symmetric teardown point and ensure coordinator.Close() is invoked only during actual shutdown.main.go (2)
464-485: ⚠️ Potential issue | 🟡 Minor

Stale reference to the removed hashicorp backend in comment. "(hashicorp backend today)" no longer reflects reality since the hashicorp engine was removed in this PR; the type assertion can only skip future engines that don't implement monitoring.DispatchCounterSource.

📝 Proposed wording

```diff
-// dispatchMonitorSources extracts the raft engines that expose etcd
-// dispatch counters so monitoring can poll them for the hot-path
-// dashboard. Engines that do not satisfy the interface (hashicorp
-// backend today) are skipped silently; their groups simply won't
-// contribute to elastickv_raft_dispatch_* metrics.
+// dispatchMonitorSources extracts the raft engines that expose
+// dispatch counters so monitoring can poll them for the hot-path
+// dashboard. Engines that do not satisfy the interface are skipped
+// silently; their groups simply won't contribute to
+// elastickv_raft_dispatch_* metrics.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.go` around lines 464 - 485, The comment above dispatchMonitorSources contains a stale reference to the removed "hashicorp backend"; update the function comment to remove that specific backend mention and instead state that engines which do not implement monitoring.DispatchCounterSource are skipped (e.g., change "(hashicorp backend today)" to a generic note about non-implementing engines). Ensure the comment still explains that skipped groups won't contribute to elastickv_raft_dispatch_* metrics and references the dispatchMonitorSources function behavior and the monitoring.DispatchCounterSource type.
87-87: ⚠️ Potential issue | 🟡 Minor

Stale flag help text lists hashicorp as a valid value. parseRaftEngineType now returns ErrUnsupportedRaftEngine for anything other than "" / "etcd", so advertising hashicorp here will mislead operators reading --help.

📝 Proposed wording

```diff
-	raftEngineName = flag.String("raftEngine", string(raftEngineEtcd), "Raft engine implementation (etcd|hashicorp)")
+	raftEngineName = flag.String("raftEngine", string(raftEngineEtcd), "Raft engine implementation (etcd)")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.go` at line 87, The flag help text for raftEngineName incorrectly mentions "hashicorp" which is no longer supported by parseRaftEngineType (it returns ErrUnsupportedRaftEngine for anything other than "" or "etcd"); update the flag.String call for raftEngineName so the help string only advertises valid values (e.g., "Raft engine implementation (etcd)") or otherwise documents the empty/default behavior, ensuring the displayed --help is accurate and consistent with parseRaftEngineType and ErrUnsupportedRaftEngine.
🧹 Nitpick comments (5)
kv/sharded_coordinator_leader_test.go (1)
21-23: Drop the redundant re := r alias. Now that the hashicorp raftengine.New(r) wrapper is gone, re is just an alias for r. Inline it for clarity.

♻️ Proposed cleanup

```diff
-	re := r
 	groups := map[uint64]*ShardGroup{
-		1: {Engine: re, Store: st, Txn: NewLeaderProxyWithEngine(re)},
+		1: {Engine: r, Store: st, Txn: NewLeaderProxyWithEngine(r)},
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv/sharded_coordinator_leader_test.go` around lines 21 - 23, Remove the redundant alias "re := r" and use "r" directly when constructing the groups map (the ShardGroup literal and the NewLeaderProxyWithEngine call), i.e. eliminate the "re" variable and replace its occurrences with "r" in the ShardGroup{Engine: ..., Store: ..., Txn: NewLeaderProxyWithEngine(...)} initialization to simplify the code.main_bootstrap_test.go (1)
15-36: LGTM. Assertions match the new parseRaftBootstrapMembers output in shard_config.go:156-186 (raftengine.Server with Suffrage: "voter").

Minor: the bare "voter" literal is now duplicated between shard_config.go and this test. Consider exporting a named constant (e.g., raftengine.SuffrageVoter) in a follow-up to avoid typos drifting silently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main_bootstrap_test.go` around lines 15 - 36, The test and shard_config.go both hardcode the suffrage string "voter"; introduce a single exported constant in the raftengine package (e.g., SuffrageVoter) and replace the literal "voter" in parseRaftBootstrapMembers (shard_config.go) and in main_bootstrap_test.go assertions (where raftengine.Server{Suffrage: "voter"}) to use raftengine.SuffrageVoter so the value is centralized and avoids silent typos drifting between files.kv/coordinator_dispatch_test.go (1)
19-20: Close coordinators created through NewCoordinatorWithEngine. NewCoordinatorWithEngine registers an engine callback; add test cleanup so these short-lived coordinators release their callback slots before the raft engine is stopped.

♻️ Proposed test cleanup

```diff
 tm := NewTransactionWithProposer(r)
 c := NewCoordinatorWithEngine(tm, r)
+t.Cleanup(func() { c.Close() })
```

Apply the same cleanup after each NewCoordinatorWithEngine(...) construction in this file.

Also applies to: 45-46, 78-79
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv/coordinator_dispatch_test.go` around lines 19 - 20, The test creates coordinators via NewCoordinatorWithEngine (e.g., "c := NewCoordinatorWithEngine(tm, r)") but never closes them, leaving engine callbacks registered; after each NewCoordinatorWithEngine call add cleanup to close the coordinator (for example use t.Cleanup(func(){ c.Close() }) or defer c.Close() in the test scope) so the callback slots are released before the raft engine is stopped; apply the same cleanup for the other occurrences at the places creating coordinators (lines around the other NewCoordinatorWithEngine calls).cmd/server/demo.go (1)
533-539: Duplicates legacy-artifact detection already implemented in multiraft_runtime.go. hasHashicorpRaftArtifacts + ErrLegacyHashicorpDataDir in multiraft_runtime.go encode exactly this check and produce a uniform error. Inlining here means (a) the artifact list can drift between the two binaries, and (b) operators see a different error string from the demo server than from the production server.

Consider extracting the helper and sentinel into an internal package (e.g. internal/raftengine/legacy) so both main packages can share them. Scope note: moving the error type to a shared package also lets external tooling errors.Is against it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/server/demo.go` around lines 533 - 539, The demo server duplicates the legacy Hashicorp raft artifact check already implemented by hasHashicorpRaftArtifacts and ErrLegacyHashicorpDataDir in multiraft_runtime.go; extract that logic and sentinel error into a shared internal package (for example internal/raftengine/legacy) and update cmd/server/demo.go to call hasHashicorpRaftArtifacts (or equivalent exported function) and return ErrLegacyHashicorpDataDir when detected so both binaries share the same artifact list and error type usable with errors.Is.shard_config.go (1)
181-186: Extract the "voter" magic string into a local constant to reduce duplication. The hardcoded string "voter" appears in this file and also in cmd/server/demo.go. While raftengine does not export a suffrage constant, defining one locally (e.g., const suffrageVoter = "voter") would prevent drift and make the code more maintainable if the value changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shard_config.go` around lines 181 - 186, Introduce a local constant named suffrageVoter = "voter" and replace the hardcoded "voter" in the servers append call (the raftengine.Server literal where Suffrage is set) with suffrageVoter; also add the same local constant to the other file that uses the same literal (the demo server code) so both places reference the identical constant and avoid duplication/drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@adapter/distribution_milestone1_e2e_test.go`:
- Around line 230-260: The helper newSingleRaftForDistributionE2E currently
registers cleanup after require.Eventually and can leak result.Engine if the
leader wait times out; fix by making the stop closure idempotent (use a
sync.Once or a closed boolean) that safely checks result and result.Engine/Close
before calling Close, then register that stop immediately after factory.Create
(e.g. t.Cleanup(stop)) so cleanup runs even if require.Eventually fails;
reference result, stop, factory.Create, result.Engine.Close and result.Close
when making these changes.
In `@adapter/test_util.go`:
- Around line 235-311: Replace non-fatal assertions with fatal ones in the raft
readiness helpers: in waitForRaftReadiness and ensureNodeZeroIsLeader switch
assert.Eventually to require.Eventually so failures abort the test immediately;
update the three calls inside waitForRaftReadiness (the leader-consensus check)
and ensureNodeZeroIsLeader (the "some node became leader" and "node 0 did not
become leader" checks) and add the require import if missing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 179f3410-b62a-4058-9639-fd78a3b293ad
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (63)
- adapter/add_voter_join_test.go
- adapter/distribution_milestone1_e2e_test.go
- adapter/distribution_server_test.go
- adapter/dynamodb.go
- adapter/dynamodb_failure_test.go
- adapter/dynamodb_test.go
- adapter/internal.go
- adapter/redis.go
- adapter/redis_info_test.go
- adapter/redis_keys_pattern_test.go
- adapter/redis_list_raft_benchmark_test.go
- adapter/redis_retry_test.go
- adapter/s3.go
- adapter/s3_test.go
- adapter/test_util.go
- cmd/etcd-raft-migrate/main.go
- cmd/etcd-raft-rollback/main.go
- cmd/server/demo.go
- go.mod
- internal/raftengine/etcd/hashicorp_fsm_adapter.go
- internal/raftengine/etcd/migrate_test.go
- internal/raftengine/hashicorp/engine.go
- internal/raftengine/hashicorp/engine_test.go
- internal/raftengine/hashicorp/factory.go
- internal/raftengine/hashicorp/leadership_err_test.go
- internal/raftengine/hashicorp/migrate.go
- internal/raftengine/hashicorp/migrate_test.go
- internal/raftstore/pebble.go
- internal/raftstore/pebble_test.go
- kv/coordinator.go
- kv/coordinator_dispatch_test.go
- kv/coordinator_leader_test.go
- kv/fsm.go
- kv/fsm_occ_test.go
- kv/fsm_txn_test.go
- kv/grpc_conn_cache.go
- kv/grpc_conn_cache_test.go
- kv/leader_proxy.go
- kv/leader_proxy_test.go
- kv/leader_routed_store.go
- kv/leader_routed_store_test.go
- kv/lease_state_test.go
- kv/lock_resolver_test.go
- kv/raft_engine.go
- kv/shard_router_test.go
- kv/shard_store_txn_lock_test.go
- kv/sharded_coordinator.go
- kv/sharded_coordinator_abort_test.go
- kv/sharded_coordinator_del_prefix_test.go
- kv/sharded_coordinator_leader_test.go
- kv/sharded_integration_test.go
- kv/snapshot.go
- kv/snapshot_test.go
- kv/transaction.go
- kv/transaction_batch_test.go
- main.go
- main_bootstrap_e2e_test.go
- main_bootstrap_test.go
- main_s3.go
- multiraft_runtime.go
- multiraft_runtime_test.go
- shard_config.go
- shard_config_test.go
💤 Files with no reviewable changes (19)
- cmd/etcd-raft-migrate/main.go
- internal/raftengine/hashicorp/engine_test.go
- internal/raftengine/hashicorp/leadership_err_test.go
- kv/leader_proxy.go
- cmd/etcd-raft-rollback/main.go
- internal/raftengine/hashicorp/engine.go
- internal/raftstore/pebble.go
- adapter/internal.go
- kv/lease_state_test.go
- kv/transaction.go
- adapter/redis_list_raft_benchmark_test.go
- internal/raftengine/etcd/hashicorp_fsm_adapter.go
- internal/raftengine/hashicorp/migrate_test.go
- internal/raftstore/pebble_test.go
- go.mod
- adapter/dynamodb_failure_test.go
- internal/raftengine/hashicorp/factory.go
- adapter/add_voter_join_test.go
- internal/raftengine/hashicorp/migrate.go
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02782f0822
```go
n.raftPeers = peers
// etcd raft requires every member of a fresh cluster to
// bootstrap with the same peer list. Override the per-node
// raftBootstrap flag in demo mode so n2/n3 don't fail with
// "no persisted peers and no peer list was supplied".
n.raftBootstrap = true
```
Skip demo re-join after prebootstrapping all peers
Setting raftPeers and forcing raftBootstrap=true on every demo node means the 3-node cluster is already initialized at startup, but main still launches joinCluster, which always sends AddVoter to nodes[0]. With etcd's randomized elections, node 0 is often a follower, so those RPCs return "not leader" until joinNodeWithRetry exhausts retries and the demo process exits. This makes demo startup nondeterministically fail in the normal (no --raftId) path.
- `adapter/distribution_milestone1_e2e_test.go`: register the engine/factory cleanup with `t.Cleanup` BEFORE the `require.Eventually` leader-wait so a timeout no longer leaks the etcd engine, WAL, and transport. The stop closure is now `sync.Once`-protected so the explicit defer-stop in test bodies plus the `t.Cleanup` fallback are both safe.
- `adapter/test_util.go`: promote every `assert.Eventually` in `waitForRaftReadiness` / `ensureNodeZeroIsLeader` to `require.Eventually` so a missing leader or a failed leadership transfer fails the test immediately instead of letting subsequent RPCs panic on a nil leader. The lingering `assert.NoError` calls in background goroutines stay because `require.FailNow` there would leak the goroutine.
- `multiraft_runtime_test.go`: add three subtests (`raft.db`, `logs.dat`, `stable.dat`) that assert `ensureRaftEngineDataDir` returns `ErrLegacyHashicorpDataDir`, covering the no-silent-migration guard added in the previous commit.
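The `sync.Once`-protected stop closure described in the first bullet can be sketched in isolation. This is a minimal sketch, not the repo's actual code: `fakeEngine` and `newStopper` are hypothetical stand-ins for the real engine handle, assuming only that calling stop twice (explicit defer plus `t.Cleanup` fallback) must execute `Stop` exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeEngine stands in for the real etcd engine handle (hypothetical).
type fakeEngine struct{ stops int }

func (e *fakeEngine) Stop() { e.stops++ }

// newStopper wraps Stop in sync.Once so an explicit defer-stop in a
// test body and a t.Cleanup fallback can both call it safely.
func newStopper(e *fakeEngine) func() {
	var once sync.Once
	return func() { once.Do(e.Stop) }
}

func main() {
	eng := &fakeEngine{}
	stop := newStopper(eng)
	stop() // explicit defer-stop in the test body
	stop() // t.Cleanup fallback fires later — must be a no-op
	fmt.Println("stop calls applied:", eng.stops)
}
```

In the real test the closure would be registered with `t.Cleanup(stop)` before any `require.Eventually` wait, so a timeout path still tears the engine down.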
Codex P1: once every demo node starts with `raftPeers=[n1,n2,n3]` and `raftBootstrap=true` (the previous commit's fix for the "no persisted peers" abort), the 3-node cluster is already formed at startup. The lingering `joinCluster` goroutine then ran `AddVoter` against `nodes[0]`, which under etcd's randomised elections is often a follower — the RPC would be rejected, retried for the full 20-attempt window, and either leak cancellation errors into the errgroup or attempt to re-add members that were already voters.

Remove `joinCluster` plus its helpers (`joinNodeWithRetry`, `joinRetryCancelResult`, `tryJoinNode`, `waitForJoinRetry`, `joinClusterWaitError`) and the now-dead `joinRetries` / `joinWait` / `joinRetryInterval` / `joinRPCTimeout` constants. The unused `fmt` import is dropped with them.
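With `joinCluster` gone, cluster formation rests entirely on every demo node starting from the identical bootstrap peer list. A minimal sketch of building such a list, assuming the `Server{ID, Address, Suffrage}` shape this PR describes; the `demoPeers` helper name and the ports are invented for illustration, not taken from the repo.

```go
package main

import "fmt"

// Server mirrors the raftengine.Server shape described in the PR
// (ID, Address, Suffrage) — an assumption, not the actual definition.
type Server struct {
	ID       string
	Address  string
	Suffrage string
}

// demoPeers builds the identical bootstrap peer list every demo node
// must start with; with this in place no post-start AddVoter join is needed.
func demoPeers() []Server {
	var peers []Server
	for i := 1; i <= 3; i++ {
		peers = append(peers, Server{
			ID:       fmt.Sprintf("n%d", i),
			Address:  fmt.Sprintf("127.0.0.1:%d", 50050+i),
			Suffrage: "voter",
		})
	}
	return peers
}

func main() {
	for _, p := range demoPeers() {
		fmt.Printf("%s %s %s\n", p.ID, p.Address, p.Suffrage)
	}
}
```

The key invariant is that all three nodes pass the same slice to the engine factory, so etcd raft bootstraps one cluster rather than three singletons waiting for joins.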
Merged main (PR #590 dropped hashicorp/raft); migrated admin wiring to `raftengine.Server`. Addressed Codex P2 and CodeRabbit findings on commit ac8d6c3:

- Refcount cached gRPC clients (`cmd/elastickv-admin`): `clientFor` now returns `(client, release, err)`; `evictOneLocked` and `invalidateClient` retire entries but keep the `*grpc.ClientConn` alive until the last borrower releases. Prevents spurious cancellation of healthy in-flight `GetClusterOverview` calls when the cache hits `maxCachedClients` under fan-out load.
- New `--allowRemoteBind` flag; by default `--bindAddr` is rejected unless the host is loopback. The admin HTTP layer has no browser-facing auth while holding the node admin token, so remote bind is an explicit opt-in for operators behind an auth proxy.
- `AdminServer.GetRaftGroups` now populates `LastContactUnixMs` from `raftengine.Status.LastContact` (absolute ms = `nowFunc()` − `LastContact`). Now test-seamed via `nowFunc` so tests are deterministic.
- Design doc §5.1 flush path rewritten to use `atomic.SwapUint64` instead of pointer-swap, so no late `Observe` can race past the flush snapshot — no retired counters to harvest.
- Design doc §5.2 calibration: replace "benchmark at startup" with a checked-in `costPerObserveNs` constant plus a CI drift check.
- Regression tests added under `-race` for: eviction-while-in-flight, post-close `invalidateClient` no-op, `--bindAddr` loopback validation, `GetRaftGroups` `LastContactUnixMs`.
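The refcounted client cache in the first bullet can be sketched roughly as follows. This is a simplified stand-in, not the admin CLI's actual code: `conn` replaces `*grpc.ClientConn`, and the `cache`/`entry` names are invented; the only contract assumed from the description above is that eviction retires an entry while the last `release` closes it.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for *grpc.ClientConn (hypothetical).
type conn struct{ closed bool }

func (c *conn) Close() { c.closed = true }

// entry is a cached client with a borrow count; retired entries stay
// open until the last borrower releases them.
type entry struct {
	c       *conn
	refs    int
	retired bool
}

type cache struct {
	mu sync.Mutex
	m  map[string]*entry
}

// clientFor returns the cached conn plus a release func, mirroring the
// (client, release, err) shape described above (sketch, not the real API).
func (cc *cache) clientFor(addr string) (*conn, func()) {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	e, ok := cc.m[addr]
	if !ok {
		e = &entry{c: &conn{}}
		cc.m[addr] = e
	}
	e.refs++
	return e.c, func() {
		cc.mu.Lock()
		defer cc.mu.Unlock()
		e.refs--
		if e.retired && e.refs == 0 {
			e.c.Close() // last borrower closes the retired conn
		}
	}
}

// evict retires an entry but never closes a conn that is still borrowed.
func (cc *cache) evict(addr string) {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	if e, ok := cc.m[addr]; ok {
		e.retired = true
		delete(cc.m, addr)
		if e.refs == 0 {
			e.c.Close()
		}
	}
}

func main() {
	cc := &cache{m: map[string]*entry{}}
	c, release := cc.clientFor("n1:9000")
	cc.evict("n1:9000") // eviction while a call is in flight
	fmt.Println("closed after evict:", c.closed)
	release()
	fmt.Println("closed after release:", c.closed)
}
```

This is what keeps an in-flight `GetClusterOverview` alive when the cache hits its size cap: eviction only flips `retired`, and the conn closes when the borrow count drains to zero.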
Summary
- Delete the `internal/raftengine/hashicorp` backend, the etcd/hashicorp FSM adapter, `internal/raftstore`, and `cmd/etcd-raft-migrate` / `cmd/etcd-raft-rollback`.
- Rework `kv/` to implement `raftengine.StateMachine` directly — `kvFSM.Apply([]byte)`, `Snapshot() (raftengine.Snapshot, error)`, `Restore(io.Reader)`. Drop the legacy `NewCoordinator(*raft.Raft)` / `NewTransaction` / `NewLeaderProxy` shims.
- Replace `raft.ServerAddress` / `raft.ServerID` / `raft.Server` / `raft.Voter` with plain `string` and `raftengine.Server` throughout `kv/`, `adapter/`, `main.go`, `shard_config.go`, `cmd/server/demo.go`.
- Rebuild test scaffolding (`kv/sharded_integration_test.go`, `adapter/test_util.go`, `adapter/s3_test.go`, `adapter/distribution_milestone1_e2e_test.go`) on top of the etcd factory with `t.TempDir()` data dirs.
- Remove hashicorp-specific tests (`adapter/add_voter_join_test.go`, `adapter/dynamodb_failure_test.go`, `adapter/redis_list_raft_benchmark_test.go`, bootstrap and mixed-artifact cases in `multiraft_runtime_test.go`).
- `go mod tidy` drops `hashicorp/raft`, `Jille/raft-grpc-transport`, `hashicorp/go-hclog`, and transitive hashicorp deps.

Stats: 64 files, +487 / −3703.
Test plan
- `go build ./...`
- `go vet ./...`
- `go test ./kv/...` (passes in ~5s)
- `go test ./internal/...` (etcd engine conformance, raftadmin, s3keys)
- `go test ./adapter/...` — the heavy sequential test `Test_consistency_satisfy_write_after_read_sequence` (9999 × 3 sequential Puts on a 3-node loopback etcd cluster) hangs on `Propose(context.Background())` where it used to complete quickly with hashicorp's InmemTransport. The parallel variant (`Test_consistency_satisfy_write_after_read_for_parallel`, 1000 goroutines) passes in <1s. Other adapter tests pass individually. Needs follow-up on either test timing or etcd engine behavior under slow sequential back-pressure.
- Bootstrap e2e (`main_bootstrap_e2e_test.go`) — now only exercises the etcd engine variant.

Notes for reviewers
- `adapter/test_util.go` now uses the etcd factory with a temp dir per node. Leader election relies on etcd's randomized election timeouts rather than the previous "node 0 has electionTimeout=0, others wait 10s" trick; this means the leader is not deterministic, but `waitForRaftReadiness` converges on whichever node wins.
- `cmd/server/demo.go` now goes through the engine factory instead of constructing a raft instance directly. `setupStorage` / `bootstrapClusterIfNeeded` are gone.
- `multiraft_runtime.go` no longer tries to detect mixed hashicorp/etcd artifacts — only etcd markers remain.

Summary by CodeRabbit
New Features
Bug Fixes
Chores