
chore(raft): drop hashicorp/raft backend, kv/ on raftengine natively#590

Merged
bootjp merged 6 commits into main from chore/drop-hashicorp-raft
Apr 23, 2026
Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 22, 2026

Summary

  • Delete the internal/raftengine/hashicorp backend, the etcd/hashicorp FSM adapter, internal/raftstore, and cmd/etcd-raft-migrate / cmd/etcd-raft-rollback.
  • Flip kv/ to implement raftengine.StateMachine directly — kvFSM.Apply([]byte), Snapshot() (raftengine.Snapshot, error), Restore(io.Reader). Drop the legacy NewCoordinator(*raft.Raft) / NewTransaction / NewLeaderProxy shims.
  • Replace raft.ServerAddress / raft.ServerID / raft.Server / raft.Voter with plain string and raftengine.Server throughout kv/, adapter/, main.go, shard_config.go, cmd/server/demo.go.
  • Rewrite the in-process test clusters (kv/sharded_integration_test.go, adapter/test_util.go, adapter/s3_test.go, adapter/distribution_milestone1_e2e_test.go) on top of the etcd factory with t.TempDir() data dirs.
  • Remove hashicorp-specific tests that exercised AddVoter / InmemTransport / boltdb legacy detection (adapter/add_voter_join_test.go, adapter/dynamodb_failure_test.go, adapter/redis_list_raft_benchmark_test.go, bootstrap and mixed-artifact cases in multiraft_runtime_test.go).
  • go mod tidy drops hashicorp/raft, Jille/raft-grpc-transport, hashicorp/go-hclog and transitive hashicorp deps.

Stats: 64 files, +487 / -3703.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./kv/... (passes in ~5s)
  • go test ./internal/... (etcd engine conformance, raftadmin, s3keys)
  • go test ./adapter/... — heavy sequential test Test_consistency_satisfy_write_after_read_sequence (9999 × 3 sequential Puts on a 3-node loopback etcd cluster) hangs on Propose(context.Background()) where it used to complete quickly with hashicorp's InmemTransport. Parallel variant (Test_consistency_satisfy_write_after_read_for_parallel, 1000 goroutines) passes in <1s. Other adapter tests pass individually. Needs follow-up on either test timing or etcd engine behavior under slow sequential back-pressure.
  • bootstrap e2e (main_bootstrap_e2e_test.go) — now only exercises the etcd engine variant.

Notes for reviewers

  • adapter/test_util.go now uses the etcd factory with a temp dir per node. Leader election relies on etcd's randomized election timeouts rather than the previous "node 0 has electionTimeout=0, others wait 10s" trick; this means the leader is not deterministic, but waitForRaftReadiness converges on whichever node wins.
  • cmd/server/demo.go now goes through the engine factory instead of constructing a raft instance directly. setupStorage / bootstrapClusterIfNeeded are gone.
  • multiraft_runtime.go no longer tries to detect mixed hashicorp/etcd artifacts — only etcd markers remain.

Summary by CodeRabbit

  • New Features

    • Migrated to an etcd-backed Raft engine implementation as the primary consensus backend.
  • Bug Fixes

    • Removed legacy Hashicorp Raft data directory support; systems using legacy artifacts will now receive a clear error indicating migration is required.
  • Chores

    • Eliminated dependency on Hashicorp Raft library.
    • Updated internal Raft leadership APIs to use plain strings instead of custom address types.
    • Removed command-line migration and rollback utilities.
    • Removed multiple integration test suites and benchmarks.

…atively

- delete internal/raftengine/hashicorp and the etcd/hashicorp FSM adapter
- delete internal/raftstore (hashicorp-only Pebble log store)
- delete cmd/etcd-raft-migrate and cmd/etcd-raft-rollback (hashicorp migration tools)
- drop hashicorp engine branch from the runtime factory and data-dir detection

kv/ now implements raftengine.StateMachine directly: kvFSM.Apply takes []byte,
Snapshot returns raftengine.Snapshot, and Restore takes io.Reader. Legacy
NewCoordinator/NewTransaction/NewLeaderProxy shims that took *raft.Raft are
removed in favor of the engine-based constructors.

Replace hashicorp-typed identifiers across the tree: raft.ServerAddress and
raft.ServerID become plain strings, raft.Server / raft.Voter become
raftengine.Server with Suffrage "voter". adapter/ production code, main.go,
shard_config.go and the demo server follow suit.

Test scaffolding in kv/ and adapter/ that built in-process hashicorp
clusters is rewritten on top of the etcd factory with t.TempDir() data
dirs. hashicorp-specific tests that exercised AddVoter flow, InmemTransport
or the legacy LogStore (adapter/add_voter_join_test.go,
adapter/dynamodb_failure_test.go, adapter/redis_list_raft_benchmark_test.go,
multiraft_runtime_test.go bootstrap/mixed-artifact cases) are removed;
equivalent behaviour is covered by the etcd engine and the bootstrap e2e
harness.

go.mod is tidied to drop hashicorp/raft, Jille/raft-grpc-transport,
hashicorp/go-hclog and transitive hashicorp deps.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@bootjp
Owner Author

bootjp commented Apr 22, 2026

/gemini review @codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 44 minutes and 38 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 44 minutes and 38 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d732ac7-8b5b-4e47-8d04-4f2b44f3e99a

📥 Commits

Reviewing files that changed from the base of the PR and between 02782f0 and 83e5e8c.

📒 Files selected for processing (4)
  • adapter/distribution_milestone1_e2e_test.go
  • adapter/test_util.go
  • cmd/server/demo.go
  • multiraft_runtime_test.go
📝 Walkthrough

The pull request comprehensively migrates the codebase from HashiCorp Raft to an etcd-backed raft engine abstraction. It removes all HashiCorp Raft dependencies, deletes the corresponding raft engine wrapper/adapter code, refactors FSM interfaces, updates leader address types from raft.ServerAddress to string, and modifies test infrastructure to use the new engine factory pattern.

Changes

Cohort / File(s) / Summary

  • HashiCorp Raft Engine Removal (internal/raftengine/hashicorp/engine.go, internal/raftengine/hashicorp/engine_test.go, internal/raftengine/hashicorp/factory.go, internal/raftengine/hashicorp/leadership_err_test.go, internal/raftengine/hashicorp/migrate.go, internal/raftengine/hashicorp/migrate_test.go): Deleted the entire HashiCorp Raft engine implementation, factory, migration tools, and corresponding test suites; no longer provides raft engine wrapping or FSM adaptation.
  • Raft Store Removal (internal/raftstore/pebble.go, internal/raftstore/pebble_test.go): Removed the Pebble-backed raft log/stable store implementation previously used by HashiCorp Raft to persist log entries and stable state.
  • FSM Interface Migration (kv/fsm.go, kv/fsm_occ_test.go, kv/fsm_txn_test.go, kv/snapshot.go, kv/snapshot_test.go): Changed the FSM from raft.FSM to raftengine.StateMachine; updated Apply to accept raw bytes instead of *raft.Log, and replaced the Persist/Release snapshot methods with WriteTo(io.Writer).
  • Coordinator Interface Updates (kv/coordinator.go, kv/leader_routed_store.go, kv/sharded_coordinator.go, adapter/dynamodb.go, adapter/redis.go, adapter/s3.go): Changed the RaftLeader() and RaftLeaderForKey() return types from raft.ServerAddress to string; updated leader address map types from map[raft.ServerAddress]string to map[string]string.
  • Coordinator Constructor Deprecation (kv/coordinator.go, kv/transaction.go, kv/leader_proxy.go): Removed the NewCoordinator()/NewTransaction()/NewLeaderProxy() constructors that accepted *raft.Raft; callers must now use the engine-based variants (NewCoordinatorWithEngine, NewTransactionWithProposer, NewLeaderProxyWithEngine).
  • Test Coordinator Stubs (adapter/distribution_server_test.go, adapter/dynamodb_test.go, adapter/redis_info_test.go, adapter/redis_keys_pattern_test.go, adapter/redis_retry_test.go, kv/grpc_conn_cache_test.go, kv/leader_routed_store_test.go): Updated test stub methods to return string instead of raft.ServerAddress; removed the corresponding HashiCorp Raft imports from test files.
  • Test Harness Migration (adapter/test_util.go, kv/sharded_integration_test.go, kv/shard_router_test.go, kv/leader_proxy_test.go): Refactored raft test setup from in-memory HashiCorp clusters to the etcd-backed raftengine.Factory pattern; updated leader election detection, node readiness checks, and shutdown logic to use engine abstractions.
  • E2E Test Refactoring (adapter/distribution_milestone1_e2e_test.go, adapter/s3_test.go): Replaced raft.NewRaft + BootstrapCluster with etcdraftengine.NewFactory(...).Create(...) in test helpers; updated leader readiness and shutdown to use engine.State() == raftengine.StateLeader and engine.Close().
  • Test Removal (adapter/add_voter_join_test.go, adapter/dynamodb_failure_test.go, adapter/redis_list_raft_benchmark_test.go): Deleted integration tests and benchmarks that relied on complex multi-node HashiCorp Raft cluster setups; replaced with simpler abstraction-based test patterns elsewhere.
  • FSM Adapter Cleanup (internal/raftengine/etcd/hashicorp_fsm_adapter.go, internal/raftengine/etcd/migrate_test.go, kv/transaction_batch_test.go): Removed the HashiCorp FSM adapter layer (AdaptHashicorpFSM); tests now pass FSM implementations directly to engine factories without wrapping.
  • Bootstrap Configuration Updates (shard_config.go, shard_config_test.go, kv/coordinator_dispatch_test.go, kv/coordinator_leader_test.go, main_bootstrap_test.go): Changed the bootstrap member type from []raft.Server to []raftengine.Server; updated the suffrage representation from the raft.Voter constant to the string literal "voter".
  • Command-Line Tools Removal (cmd/etcd-raft-migrate/main.go, cmd/etcd-raft-rollback/main.go): Deleted the raft migration/rollback command entrypoints that were specific to HashiCorp Raft management.
  • Main Entry Point Updates (cmd/server/demo.go, main.go, main_s3.go, main_bootstrap_e2e_test.go, multiraft_runtime.go, multiraft_runtime_test.go): Removed the HashiCorp Raft engine factory logic; integrated etcd-backed engine creation via the openRaftEngine helper; updated leader address maps to use string keys; removed legacy data-dir detection for HashiCorp artifacts; changed the default raft engine type to etcd-only.
  • Module Dependencies (go.mod): Removed direct dependencies on HashiCorp Raft (github.com/hashicorp/raft), the raft-gRPC transport, and related transitive modules (github.com/hashicorp/go-hclog, github.com/armon/go-metrics, etc.).
  • Internal Service Registration (adapter/internal.go): Removed the NewInternal() constructor that wrapped HashiCorp Raft; callers must now use NewInternalWithEngine() with an explicit raftengine.LeaderView.
  • GRPC Connection Caching (kv/grpc_conn_cache.go, kv/grpc_conn_cache_test.go): Changed the connection cache key type from raft.ServerAddress to string; updated the ConnFor() signature and internal caching logic accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Poem

🐰 Farewell, old HashiCorp, your tenure was grand,
Etcd raft engines now join the band!
No more ServerAddress, just humble strings we use,
Cleaner abstractions—the rabbit's good news!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 12.60%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check ✅: The pull request title clearly and concisely describes the main objective: removing the HashiCorp Raft backend and converting the kv/ module to use raftengine natively. It accurately reflects the core change across the 64-file changeset.
  • Linked Issues check ✅: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: Check skipped because no linked issues were found for this pull request.



- gci: re-format imports in adapter/redis_info_test.go, cmd/server/demo.go,
  kv/leader_routed_store_test.go.
- noctx: replace net.DialTimeout with (*net.Dialer).DialContext bound to a
  context in TestLeaderProxy_ForwardsWhenFollower so the wait honours both
  the dial timeout and the surrounding context.
- SA9004: annotate raftEngineMarkerFile with an explicit string type so the
  grouped const declaration does not silently downgrade it to untyped.
- unconvert: drop the redundant string(...) wrapper around
  lis.Addr().String() in kv/leader_routed_store_test.go.
- cyclop: extract openRaftEngine from demo.go run() so the raft engine setup
  (factory, data dir, bootstrap) lives in its own helper and run() drops
  below the cyclomatic-complexity threshold.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a35245a411

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/server/demo.go
Comment on lines +520 to +524
result, err := factory.Create(raftengine.FactoryConfig{
	LocalID:      cfg.raftID,
	LocalAddress: cfg.address,
	DataDir:      raftDir,
	Bootstrap:    cfg.raftBootstrap,


P1: Pass peer config when creating demo Raft engines

This Create call never sets Peers, so in demo mode the non-bootstrap nodes (raftBootstrap=false for n2/n3 in the same file) start with an empty peer list and etcd rejects startup before joinCluster runs. I reproduced this by running ./server and seeing node startup fail with “no persisted peers and no peer list was supplied”. As written, the 3-node demo path cannot form a cluster.


Comment thread multiraft_runtime.go
Comment on lines 128 to 131
func detectRaftEngineFromDataDir(dir string) (raftEngineType, bool, error) {
	hashicorpArtifacts, err := hasRaftArtifacts(dir,
		"raft.db",
		"logs.dat",
		"stable.dat",
	)
	if err != nil {
		return "", false, err
	}
	etcdArtifacts, err := hasRaftArtifacts(dir,
		"wal",
		"snap",


P1: Keep rejecting legacy Hashicorp raft artifacts

detectRaftEngineFromDataDir now only checks etcd markers, so a pre-marker data dir that still contains Hashicorp artifacts (raft.db, logs.dat, stable.dat) is treated as empty and gets an etcd marker written. On upgrade this can allow bootstrapping an incompatible engine over existing state instead of failing fast with ErrRaftEngineDataDir, which risks losing committed cluster history.


Existing grpc_test.go helpers (rawKVClient, transactionalKVClient) connect
to hosts[0] and assume it is the cluster leader. The previous hashicorp
setup enforced this with an electionTimeout=0 on node 0 while others
waited 10s. With etcd/raft, election timing is randomised and any node
can win, so the scan in Test_grpc_scan could race against a forwarded
Put that had not yet applied on the contacted follower — the last Put's
LastCommitTS hadn't propagated locally.

ensureNodeZeroIsLeader waits for any leader to emerge, then asks the
current leader to transfer leadership to node 0 via
raftengine.Admin.TransferLeadershipToServer. The transfer is best-effort
and the subsequent Eventually retries if a concurrent election races it.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5dc58a523

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/server/demo.go
Comment on lines +511 to +515
result, err := factory.Create(raftengine.FactoryConfig{
	LocalID:      cfg.raftID,
	LocalAddress: cfg.address,
	DataDir:      raftDir,
	Bootstrap:    cfg.raftBootstrap,


P2: Validate demo Raft data dir before creating engine

This code opens the etcd engine directly from raftDir without any compatibility check for legacy Hashicorp Raft artifacts; the previous setupStorage path rejected logs.dat/stable.dat before startup. In a restart/upgrade where --raftDataDir still contains legacy state, the node can initialize a fresh etcd Raft store instead of failing fast, which effectively ignores prior committed history and risks silent data loss. Add the same data-dir engine validation used in the main runtime before calling factory.Create.


Codex flagged two P1 issues on the initial drop-hashicorp commit:

(1) cmd/server/demo.go was calling factory.Create without seeding Peers.
In demo mode the non-bootstrap nodes (n2, n3) would hit etcd's
"no persisted peers and no peer list was supplied" guard and abort
before joinCluster ever ran. Build the full 3-node peer list in main()
and pass it as cfg.raftPeers so every node bootstraps with the same
membership; raftBootstrap is forced true for all demo nodes so etcd
forms the cluster directly.

(2) detectRaftEngineFromDataDir no longer noticed hashicorp artifacts
(raft.db, logs.dat, stable.dat), so on upgrade an existing hashicorp
data dir would silently be rewritten as an etcd cluster on top of the
committed bolt state. Add hasHashicorpRaftArtifacts plus a new
ErrLegacyHashicorpDataDir sentinel, and call the probe from
ensureRaftEngineDataDir before any marker is written so production
restarts fail fast with a migration-required error. demo.go's
openRaftEngine runs the same check on its own data dir for the P2
follow-up codex raised there.
@bootjp
Owner Author

bootjp commented Apr 22, 2026

@codex address that feedback — commits 02782f0 (demo peers + legacy artifact rejection) and c5dc58a (test leader transfer).


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
multiraft_runtime_test.go (1)

160-197: ⚠️ Potential issue | 🟡 Minor

Add coverage for legacy HashiCorp artifact rejection.

This test now covers etcd detection, but the PR’s fail-fast behavior for legacy HashiCorp data dirs is not asserted here. Add at least one ErrLegacyHashicorpDataDir case so the no-silent-migration guard cannot regress.

🧪 Proposed test coverage
 func TestEnsureRaftEngineDataDir(t *testing.T) {
 	t.Run("writes marker for empty dir", func(t *testing.T) {
 		dir := t.TempDir()
 		require.NoError(t, ensureRaftEngineDataDir(dir, raftEngineEtcd))
@@
 		require.True(t, ok)
 		require.Equal(t, raftEngineEtcd, engineType)
 	})
+
+	t.Run("rejects legacy hashicorp raft db", func(t *testing.T) {
+		dir := t.TempDir()
+		require.NoError(t, os.WriteFile(filepath.Join(dir, "raft.db"), []byte("legacy"), 0o600))
+
+		err := ensureRaftEngineDataDir(dir, raftEngineEtcd)
+		require.ErrorIs(t, err, ErrLegacyHashicorpDataDir)
+	})
 
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiraft_runtime_test.go` around lines 160 - 197, Add a new subtest in
TestEnsureRaftEngineDataDir that creates a legacy HashiCorp-style data layout
(e.g., a directory/file pattern your code treats as legacy) and calls
detectRaftEngineFromDataDir; assert that it returns the specific sentinel error
ErrLegacyHashicorpDataDir (and appropriate ok==false) so the fail-fast guard
cannot regress. Locate the test block around TestEnsureRaftEngineDataDir and add
the new t.Run that sets up the legacy artifact, invokes
detectRaftEngineFromDataDir, and checks for ErrLegacyHashicorpDataDir.
cmd/server/demo.go (1)

584-591: ⚠️ Potential issue | 🟡 Minor

coordinator.Close() fires as soon as run() returns, not at process exit.

run() does not block on eg.Wait() — it registers goroutines and returns nil once setup succeeds (cleanup.Release() at line 667, then return nil). As a result this defer deregisters the leader-loss callback on the engine while the server is still serving traffic. Functional impact is limited (the state == raftengine.StateLeader gate in LeaseRead still holds the safety invariant), but the fast-path "invalidate lease the instant we lose leadership" guarantee documented on NewCoordinatorWithEngine is lost for the whole demo session.

The doc comment ("before the process exits … matches the symmetric construction order") is also misleading because the symmetric teardown point here is eg.Wait() in main(), not this run() return.

Either hoist the Close() into the CleanupStack (so it only fires when the cleanup stack runs on a real teardown) or wire it into a shutdown task driven by ctx.Done().

🛠 Suggested approach
-	defer func() {
-		// Release the leader-loss callback slot on the engine before
-		// the process exits. The engine itself is closed elsewhere in
-		// the shutdown path; both orderings are safe, but releasing
-		// the closure here matches the symmetric construction order.
-		_ = coordinator.Close()
-	}()
+	// Release the leader-loss callback slot on the engine only on real
+	// teardown; run() returns after spawning goroutines, so a plain
+	// defer here would fire while the server is still serving.
+	cleanup.Add(func() { _ = coordinator.Close() })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/server/demo.go` around lines 584 - 591, The defer calling
coordinator.Close() in run() unregisters the leader-loss callback too early
because run() returns before process shutdown; move the Close() call into the
real teardown path instead of a function-local defer — e.g. register
coordinator.Close with the CleanupStack/Cleanup.Release sequence or add a
shutdown goroutine that waits for ctx.Done()/eg.Wait() and then calls
coordinator.Close(); update/remove the misleading doc comment that says “before
the process exits … matches the symmetric construction order” to reference
eg.Wait() as the true symmetric teardown point and ensure coordinator.Close() is
invoked only during actual shutdown.
main.go (2)

464-485: ⚠️ Potential issue | 🟡 Minor

Stale reference to the removed hashicorp backend in comment.

"(hashicorp backend today)" no longer reflects reality since the hashicorp engine was removed in this PR; the type assertion can only skip future engines that don't implement monitoring.DispatchCounterSource.

📝 Proposed wording
-// dispatchMonitorSources extracts the raft engines that expose etcd
-// dispatch counters so monitoring can poll them for the hot-path
-// dashboard. Engines that do not satisfy the interface (hashicorp
-// backend today) are skipped silently; their groups simply won't
-// contribute to elastickv_raft_dispatch_* metrics.
+// dispatchMonitorSources extracts the raft engines that expose
+// dispatch counters so monitoring can poll them for the hot-path
+// dashboard. Engines that do not satisfy the interface are skipped
+// silently; their groups simply won't contribute to
+// elastickv_raft_dispatch_* metrics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.go` around lines 464 - 485, The comment above dispatchMonitorSources
contains a stale reference to the removed "hashicorp backend"; update the
function comment to remove that specific backend mention and instead state that
engines which do not implement monitoring.DispatchCounterSource are skipped
(e.g., change "(hashicorp backend today)" to a generic note about
non-implementing engines). Ensure the comment still explains that skipped groups
won't contribute to elastickv_raft_dispatch_* metrics and references the
dispatchMonitorSources function behavior and the
monitoring.DispatchCounterSource type.

87-87: ⚠️ Potential issue | 🟡 Minor

Stale flag help text lists hashicorp as a valid value.

parseRaftEngineType now returns ErrUnsupportedRaftEngine for anything other than "" / "etcd", so advertising hashicorp here will mislead operators reading --help.

📝 Proposed wording
-	raftEngineName       = flag.String("raftEngine", string(raftEngineEtcd), "Raft engine implementation (etcd|hashicorp)")
+	raftEngineName       = flag.String("raftEngine", string(raftEngineEtcd), "Raft engine implementation (etcd)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.go` at line 87, The flag help text for raftEngineName incorrectly
mentions "hashicorp" which is no longer supported by parseRaftEngineType (it
returns ErrUnsupportedRaftEngine for anything other than "" or "etcd"); update
the flag.String call for raftEngineName so the help string only advertises valid
values (e.g., "Raft engine implementation (etcd)") or otherwise documents the
empty/default behavior, ensuring the displayed --help is accurate and consistent
with parseRaftEngineType and ErrUnsupportedRaftEngine.
🧹 Nitpick comments (5)
kv/sharded_coordinator_leader_test.go (1)

21-23: Drop the redundant re := r alias.

Now that the hashicorpraftengine.New(r) wrapper is gone, re is just an alias for r. Inline it for clarity.

♻️ Proposed cleanup
-	re := r
 	groups := map[uint64]*ShardGroup{
-		1: {Engine: re, Store: st, Txn: NewLeaderProxyWithEngine(re)},
+		1: {Engine: r, Store: st, Txn: NewLeaderProxyWithEngine(r)},
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/sharded_coordinator_leader_test.go` around lines 21 - 23, Remove the
redundant alias "re := r" and use "r" directly when constructing the groups map
(the ShardGroup literal and the NewLeaderProxyWithEngine call), i.e. eliminate
the "re" variable and replace its occurrences with "r" in the ShardGroup{Engine:
..., Store: ..., Txn: NewLeaderProxyWithEngine(...)} initialization to simplify
the code.
main_bootstrap_test.go (1)

15-36: LGTM. Assertions match the new parseRaftBootstrapMembers output in shard_config.go:156-186 (raftengine.Server with Suffrage: "voter").

Minor: the bare "voter" literal is now duplicated between shard_config.go and this test. Consider exporting a named constant (e.g., raftengine.SuffrageVoter) in a follow-up to avoid typos drifting silently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main_bootstrap_test.go` around lines 15 - 36, The test and shard_config.go
both hardcode the suffrage string "voter"; introduce a single exported constant
in the raftengine package (e.g., SuffrageVoter) and replace the literal "voter"
in parseRaftBootstrapMembers (shard_config.go) and in main_bootstrap_test.go
assertions (where raftengine.Server{Suffrage: "voter"}) to use
raftengine.SuffrageVoter so the value is centralized and avoids silent typos
drifting between files.
kv/coordinator_dispatch_test.go (1)

19-20: Close coordinators created through NewCoordinatorWithEngine.

NewCoordinatorWithEngine registers an engine callback; add test cleanup so these short-lived coordinators release their callback slots before the raft engine is stopped.

♻️ Proposed test cleanup
 tm := NewTransactionWithProposer(r)
 c := NewCoordinatorWithEngine(tm, r)
+t.Cleanup(func() { c.Close() })

Apply the same cleanup after each NewCoordinatorWithEngine(...) construction in this file.

Also applies to: 45-46, 78-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/coordinator_dispatch_test.go` around lines 19 - 20, The test creates
coordinators via NewCoordinatorWithEngine (e.g., "c :=
NewCoordinatorWithEngine(tm, r)") but never closes them, leaving engine
callbacks registered; after each NewCoordinatorWithEngine call add cleanup to
close the coordinator (for example use t.Cleanup(func(){ c.Close() }) or defer
c.Close() in the test scope) so the callback slots are released before the raft
engine is stopped; apply the same cleanup for the other occurrences at the
places creating coordinators (lines around the other NewCoordinatorWithEngine
calls).
cmd/server/demo.go (1)

533-539: Duplicates legacy-artifact detection already implemented in multiraft_runtime.go.

hasHashicorpRaftArtifacts + ErrLegacyHashicorpDataDir in multiraft_runtime.go encode exactly this check and produce a uniform error. Inlining here means (a) the artifact list can drift between the two binaries, and (b) operators see a different error string from the demo server than from the production server.

Consider extracting the helper and sentinel into an internal package (e.g. internal/raftengine/legacy) so both main packages can share them. Scope note: moving the error type to a shared package also lets external tooling errors.Is against it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/server/demo.go` around lines 533 - 539, The demo server duplicates the
legacy Hashicorp raft artifact check already implemented by
hasHashicorpRaftArtifacts and ErrLegacyHashicorpDataDir in multiraft_runtime.go;
extract that logic and sentinel error into a shared internal package (for
example internal/raftengine/legacy) and update cmd/server/demo.go to call
hasHashicorpRaftArtifacts (or equivalent exported function) and return
ErrLegacyHashicorpDataDir when detected so both binaries share the same artifact
list and error type usable with errors.Is.
shard_config.go (1)

181-186: Extract the "voter" magic string into a local constant to reduce duplication.

The hardcoded string "voter" appears in this file and also in cmd/server/demo.go. While raftengine does not export a suffrage constant, defining one locally (e.g., const suffrageVoter = "voter") would prevent drift and make the code more maintainable if the value changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shard_config.go` around lines 181 - 186, Introduce a local constant named
suffrageVoter = "voter" and replace the hardcoded "voter" in the servers append
call (the raftengine.Server literal where Suffrage is set) with suffrageVoter;
also add the same local constant to the other file that uses the same literal
(the demo server code) so both places reference the identical constant and avoid
duplication/drift.
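A minimal sketch of the suggested constant. The name suffrageVoter is the reviewer's suggestion, not an existing identifier, and the server struct below is a stand-in for raftengine.Server (whose Suffrage field is a plain string after the hashicorp types were dropped).

```go
package main

import "fmt"

// suffrageVoter is the reviewer-suggested local constant; raftengine does
// not export a suffrage constant today, so both call sites would define
// (or eventually share) this.
const suffrageVoter = "voter"

// server is a stand-in for raftengine.Server.
type server struct {
	ID       string
	Address  string
	Suffrage string
}

func main() {
	s := server{ID: "node0", Address: "127.0.0.1:50051", Suffrage: suffrageVoter}
	fmt.Println(s.Suffrage) // prints "voter"
}
```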
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapter/distribution_milestone1_e2e_test.go`:
- Around line 230-260: The helper newSingleRaftForDistributionE2E currently
registers cleanup after require.Eventually and can leak result.Engine if the
leader wait times out; fix by making the stop closure idempotent (use a
sync.Once or a closed boolean) that safely checks result and result.Engine/Close
before calling Close, then register that stop immediately after factory.Create
(e.g. t.Cleanup(stop)) so cleanup runs even if require.Eventually fails;
reference result, stop, factory.Create, result.Engine.Close and result.Close
when making these changes.

In `@adapter/test_util.go`:
- Around line 235-311: Replace non-fatal assertions with fatal ones in the raft
readiness helpers: in waitForRaftReadiness and ensureNodeZeroIsLeader switch
assert.Eventually to require.Eventually so failures abort the test immediately;
update the three calls inside waitForRaftReadiness (the leader-consensus check)
and ensureNodeZeroIsLeader (the "some node became leader" and "node 0 did not
become leader" checks) and add the require import if missing.

---

Outside diff comments:
In `@cmd/server/demo.go`:
- Around line 584-591: The defer calling coordinator.Close() in run()
unregisters the leader-loss callback too early because run() returns before
process shutdown; move the Close() call into the real teardown path instead of a
function-local defer — e.g. register coordinator.Close with the
CleanupStack/Cleanup.Release sequence or add a shutdown goroutine that waits for
ctx.Done()/eg.Wait() and then calls coordinator.Close(); update/remove the
misleading doc comment that says “before the process exits … matches the
symmetric construction order” to reference eg.Wait() as the true symmetric
teardown point and ensure coordinator.Close() is invoked only during actual
shutdown.

In `@main.go`:
- Around line 464-485: The comment above dispatchMonitorSources contains a stale
reference to the removed "hashicorp backend"; update the function comment to
remove that specific backend mention and instead state that engines which do not
implement monitoring.DispatchCounterSource are skipped (e.g., change "(hashicorp
backend today)" to a generic note about non-implementing engines). Ensure the
comment still explains that skipped groups won't contribute to
elastickv_raft_dispatch_* metrics and references the dispatchMonitorSources
function behavior and the monitoring.DispatchCounterSource type.
- Line 87: The flag help text for raftEngineName incorrectly mentions
"hashicorp" which is no longer supported by parseRaftEngineType (it returns
ErrUnsupportedRaftEngine for anything other than "" or "etcd"); update the
flag.String call for raftEngineName so the help string only advertises valid
values (e.g., "Raft engine implementation (etcd)") or otherwise documents the
empty/default behavior, ensuring the displayed --help is accurate and consistent
with parseRaftEngineType and ErrUnsupportedRaftEngine.

In `@multiraft_runtime_test.go`:
- Around line 160-197: Add a new subtest in TestEnsureRaftEngineDataDir that
creates a legacy HashiCorp-style data layout (e.g., a directory/file pattern
your code treats as legacy) and calls detectRaftEngineFromDataDir; assert that
it returns the specific sentinel error ErrLegacyHashicorpDataDir (and
appropriate ok==false) so the fail-fast guard cannot regress. Locate the test
block around TestEnsureRaftEngineDataDir and add the new t.Run that sets up the
legacy artifact, invokes detectRaftEngineFromDataDir, and checks for
ErrLegacyHashicorpDataDir.

---

Nitpick comments:
In `@cmd/server/demo.go`:
- Around line 533-539: The demo server duplicates the legacy Hashicorp raft
artifact check already implemented by hasHashicorpRaftArtifacts and
ErrLegacyHashicorpDataDir in multiraft_runtime.go; extract that logic and
sentinel error into a shared internal package (for example
internal/raftengine/legacy) and update cmd/server/demo.go to call
hasHashicorpRaftArtifacts (or equivalent exported function) and return
ErrLegacyHashicorpDataDir when detected so both binaries share the same artifact
list and error type usable with errors.Is.

In `@kv/coordinator_dispatch_test.go`:
- Around line 19-20: The test creates coordinators via NewCoordinatorWithEngine
(e.g., "c := NewCoordinatorWithEngine(tm, r)") but never closes them, leaving
engine callbacks registered; after each NewCoordinatorWithEngine call add
cleanup to close the coordinator (for example use t.Cleanup(func(){ c.Close() })
or defer c.Close() in the test scope) so the callback slots are released before
the raft engine is stopped; apply the same cleanup for the other occurrences at
the places creating coordinators (lines around the other
NewCoordinatorWithEngine calls).

In `@kv/sharded_coordinator_leader_test.go`:
- Around line 21-23: Remove the redundant alias "re := r" and use "r" directly
when constructing the groups map (the ShardGroup literal and the
NewLeaderProxyWithEngine call), i.e. eliminate the "re" variable and replace its
occurrences with "r" in the ShardGroup{Engine: ..., Store: ..., Txn:
NewLeaderProxyWithEngine(...)} initialization to simplify the code.

In `@main_bootstrap_test.go`:
- Around line 15-36: The test and shard_config.go both hardcode the suffrage
string "voter"; introduce a single exported constant in the raftengine package
(e.g., SuffrageVoter) and replace the literal "voter" in
parseRaftBootstrapMembers (shard_config.go) and in main_bootstrap_test.go
assertions (where raftengine.Server{Suffrage: "voter"}) to use
raftengine.SuffrageVoter so the value is centralized and avoids silent typos
drifting between files.

In `@shard_config.go`:
- Around line 181-186: Introduce a local constant named suffrageVoter = "voter"
and replace the hardcoded "voter" in the servers append call (the
raftengine.Server literal where Suffrage is set) with suffrageVoter; also add
the same local constant to the other file that uses the same literal (the demo
server code) so both places reference the identical constant and avoid
duplication/drift.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 179f3410-b62a-4058-9639-fd78a3b293ad

📥 Commits

Reviewing files that changed from the base of the PR and between b95750d and 02782f0.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (63)
  • adapter/add_voter_join_test.go
  • adapter/distribution_milestone1_e2e_test.go
  • adapter/distribution_server_test.go
  • adapter/dynamodb.go
  • adapter/dynamodb_failure_test.go
  • adapter/dynamodb_test.go
  • adapter/internal.go
  • adapter/redis.go
  • adapter/redis_info_test.go
  • adapter/redis_keys_pattern_test.go
  • adapter/redis_list_raft_benchmark_test.go
  • adapter/redis_retry_test.go
  • adapter/s3.go
  • adapter/s3_test.go
  • adapter/test_util.go
  • cmd/etcd-raft-migrate/main.go
  • cmd/etcd-raft-rollback/main.go
  • cmd/server/demo.go
  • go.mod
  • internal/raftengine/etcd/hashicorp_fsm_adapter.go
  • internal/raftengine/etcd/migrate_test.go
  • internal/raftengine/hashicorp/engine.go
  • internal/raftengine/hashicorp/engine_test.go
  • internal/raftengine/hashicorp/factory.go
  • internal/raftengine/hashicorp/leadership_err_test.go
  • internal/raftengine/hashicorp/migrate.go
  • internal/raftengine/hashicorp/migrate_test.go
  • internal/raftstore/pebble.go
  • internal/raftstore/pebble_test.go
  • kv/coordinator.go
  • kv/coordinator_dispatch_test.go
  • kv/coordinator_leader_test.go
  • kv/fsm.go
  • kv/fsm_occ_test.go
  • kv/fsm_txn_test.go
  • kv/grpc_conn_cache.go
  • kv/grpc_conn_cache_test.go
  • kv/leader_proxy.go
  • kv/leader_proxy_test.go
  • kv/leader_routed_store.go
  • kv/leader_routed_store_test.go
  • kv/lease_state_test.go
  • kv/lock_resolver_test.go
  • kv/raft_engine.go
  • kv/shard_router_test.go
  • kv/shard_store_txn_lock_test.go
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_abort_test.go
  • kv/sharded_coordinator_del_prefix_test.go
  • kv/sharded_coordinator_leader_test.go
  • kv/sharded_integration_test.go
  • kv/snapshot.go
  • kv/snapshot_test.go
  • kv/transaction.go
  • kv/transaction_batch_test.go
  • main.go
  • main_bootstrap_e2e_test.go
  • main_bootstrap_test.go
  • main_s3.go
  • multiraft_runtime.go
  • multiraft_runtime_test.go
  • shard_config.go
  • shard_config_test.go
💤 Files with no reviewable changes (19)
  • cmd/etcd-raft-migrate/main.go
  • internal/raftengine/hashicorp/engine_test.go
  • internal/raftengine/hashicorp/leadership_err_test.go
  • kv/leader_proxy.go
  • cmd/etcd-raft-rollback/main.go
  • internal/raftengine/hashicorp/engine.go
  • internal/raftstore/pebble.go
  • adapter/internal.go
  • kv/lease_state_test.go
  • kv/transaction.go
  • adapter/redis_list_raft_benchmark_test.go
  • internal/raftengine/etcd/hashicorp_fsm_adapter.go
  • internal/raftengine/hashicorp/migrate_test.go
  • internal/raftstore/pebble_test.go
  • go.mod
  • adapter/dynamodb_failure_test.go
  • internal/raftengine/hashicorp/factory.go
  • adapter/add_voter_join_test.go
  • internal/raftengine/hashicorp/migrate.go

Comment thread adapter/distribution_milestone1_e2e_test.go
Comment thread adapter/test_util.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02782f0822


Comment thread cmd/server/demo.go
Comment on lines +204 to +209
n.raftPeers = peers
// etcd raft requires every member of a fresh cluster to
// bootstrap with the same peer list. Override the per-node
// raftBootstrap flag in demo mode so n2/n3 don't fail with
// "no persisted peers and no peer list was supplied".
n.raftBootstrap = true


P1: Skip demo re-join after prebootstrapping all peers

Setting raftPeers and forcing raftBootstrap=true on every demo node means the 3-node cluster is already initialized at startup, but main still launches joinCluster, which always sends AddVoter to nodes[0]. With etcd's randomized elections, node 0 is often a follower, so those RPCs return "not leader" until joinNodeWithRetry exhausts its retries and the demo process exits. This makes demo startup fail nondeterministically in the normal no-`--raftId` path.


- adapter/distribution_milestone1_e2e_test.go: register the engine/factory
  cleanup with t.Cleanup BEFORE the require.Eventually leader-wait so a
  timeout no longer leaks the etcd engine, WAL and transport. The stop
  closure is now sync.Once-protected so the explicit defer-stop in test
  bodies plus the t.Cleanup fallback are both safe.
- adapter/test_util.go: promote every assert.Eventually in
  waitForRaftReadiness / ensureNodeZeroIsLeader to require.Eventually
  so a missing leader or a failed leadership transfer fails the test
  immediately instead of letting subsequent RPCs panic on nil leader.
  The lingering assert.NoError calls in background goroutines stay
  because require.FailNow there would leak the goroutine.
- multiraft_runtime_test.go: add three subtests (raft.db, logs.dat,
  stable.dat) that assert ensureRaftEngineDataDir returns
  ErrLegacyHashicorpDataDir, covering the no-silent-migration guard
  added in the previous commit.
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Codex P1: once every demo node starts with raftPeers=[n1,n2,n3] and
raftBootstrap=true (the previous commit's fix for the 'no persisted
peers' abort), the 3-node cluster is already formed at startup. The
lingering joinCluster goroutine then ran AddVoter against nodes[0],
which under etcd's randomised elections is often a follower — the RPC
would be rejected, retried for the full 20-attempt window, and either
leak cancellation errors into the errgroup or attempt to re-add members
that were already voters.

Remove joinCluster plus its helpers (joinNodeWithRetry,
joinRetryCancelResult, tryJoinNode, waitForJoinRetry,
joinClusterWaitError) and the now-dead joinRetries / joinWait /
joinRetryInterval / joinRPCTimeout constants. The unused fmt import
is dropped with them.

@chatgpt-codex-connector

Summary

Testing

  • git log --oneline -n 5
  • go test ./cmd/server -run TestDoesNotExist -count=1
  • go test . -run TestEnsureRaftEngineDataDir -count=1
  • ⚠️ timeout 90s go test ./... -run TestDetectRaftEngineFromDataDirRejectsLegacyHashicorpArtifacts -count=1 (timed out in this environment before producing output)


@bootjp bootjp merged commit 0471a3a into main Apr 23, 2026
9 checks passed
@bootjp bootjp deleted the chore/drop-hashicorp-raft branch April 23, 2026 04:14
bootjp added a commit that referenced this pull request Apr 23, 2026
Merged main (PR #590 dropped hashicorp/raft); migrated admin wiring to raftengine.Server. Addressed Codex P2 and CodeRabbit findings on commit ac8d6c3:

- Refcount cached gRPC clients (cmd/elastickv-admin): clientFor now returns (client, release, err); evictOneLocked and invalidateClient retire entries but keep the *grpc.ClientConn alive until the last borrower releases. Prevents spurious cancellation of healthy in-flight GetClusterOverview calls when the cache hits maxCachedClients under fan-out load.
- New --allowRemoteBind flag; by default --bindAddr is rejected unless the host is loopback. The admin HTTP layer has no browser-facing auth while holding the node admin token, so remote bind is an explicit opt-in for operators behind an auth proxy.
- AdminServer.GetRaftGroups now populates LastContactUnixMs from raftengine.Status.LastContact (absolute ms = nowFunc() - LastContact). Test-seamed via nowFunc so tests are deterministic.
- Design doc §5.1 flush path rewritten to use atomic.SwapUint64 instead of pointer-swap, so no late Observe can race past the flush snapshot — no retired counters to harvest.
- Design doc §5.2 calibration: replace "benchmark at startup" with a checked-in costPerObserveNs constant with CI drift check.

Regression tests added under -race for: eviction-while-in-flight, post-close invalidateClient no-op, --bindAddr loopback validation, GetRaftGroups LastContactUnixMs.
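The refcounting scheme in the first bullet can be sketched as below. Conn, entry, and cache are stand-ins for *grpc.ClientConn and the real admin-CLI cache types; the point is only the invariant: an evicted or invalidated entry is retired but its connection closes only when the last borrower calls release.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn stands in for *grpc.ClientConn.
type Conn struct{ closed bool }

func (c *Conn) Close() { c.closed = true }

type entry struct {
	conn    *Conn
	refs    int
	retired bool
}

type cache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

// clientFor borrows a connection and returns a release func, mirroring the
// (client, release, err) shape described above.
func (c *cache) clientFor(addr string) (*Conn, func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[addr]
	if !ok {
		e = &entry{conn: &Conn{}}
		c.entries[addr] = e
	}
	e.refs++
	release := func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		e.refs--
		if e.retired && e.refs == 0 {
			e.conn.Close()
		}
	}
	return e.conn, release
}

// invalidate retires an entry (as evictOneLocked/invalidateClient would)
// but keeps the conn alive for in-flight borrowers.
func (c *cache) invalidate(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[addr]; ok {
		e.retired = true
		delete(c.entries, addr)
		if e.refs == 0 {
			e.conn.Close()
		}
	}
}

func main() {
	c := &cache{entries: map[string]*entry{}}
	conn, release := c.clientFor("n1:9000")
	c.invalidate("n1:9000") // eviction while a call is still in flight
	fmt.Println("closed during flight:", conn.closed)
	release()
	fmt.Println("closed after release:", conn.closed)
}
```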
