feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin#760
Conversation
Stage 5 PR-D of the data-at-rest encryption rollout (design doc docs/design/2026_04_29_partial_data_at_rest_encryption.md). Wires the §6.1 EncryptionAdmin gRPC service onto every shard's gRPC listener so the §7.1 Phase-0 GetCapability fan-out can poll any cluster member. ## main.go New `--encryptionSidecarPath` flag (default empty) points the read-only probes at the §5.1 keys.json. Empty path means GetCapability reports encryption_capable=false, matching the "node has not been started with --encryption-enabled" branch from §6.1 — Stage 6 will replace the empty-path proxy with a proper cluster flag. The startRaftServers per-shard registration loop now calls registerEncryptionAdminServer between the AdminServer registration and the AdminForwardServer registration. Each shard's raftengine.Engine is wired as BOTH proposer AND LeaderView so requireLeader's VerifyLeader quorum check fires against the right shard. The full_node_id reported in CapabilityReport is the shard id, matching the §5.6 step 1a batch contract. ## main_encryption_admin.go (new) - encryptionAdminEngine is a local interface intersection of raftengine.Proposer and raftengine.LeaderView. Lets the test substitute a stub without dragging in the full engine surface. - registerEncryptionAdminServer is a no-error-return helper; the Validate() check is load-bearing for a future refactor that splits proposer and LeaderView wiring — at the current callsite the two options come from the same engine, so Validate cannot fail in production. If it does (programming error in a future caller), panic at startup rather than silently letting followers mutate state. ## main_encryption_admin_test.go (new) TestRegisterEncryptionAdminServer_Registers verifies the service shows up in grpc.Server.GetServiceInfo() so a future refactor of the registration helper cannot silently drop the service. ## docs/design/2026_04_29_partial_data_at_rest_encryption.md Stage 5 row update — 5C shipped in PR #759, 5D in PR, 5E open (the §5.6 step 1a capability fan-out helper + the `encryption bootstrap --discover-from=...` auto-batch CLI mode is split out so PR-D stays scoped to the wiring). ## Out of scope (Stage 5E / Stage 6) - §5.6 step 1a capability fan-out helper (auto-build the writer batch by dialing every cluster member). Operator still passes the batch via repeated --writer flags. - Stage 6 `--encryption-enabled` cluster flag; the existing --encryptionSidecarPath flag only enables the read-only probes, not encryption itself. - §6.3 WithEncryption applier wiring — successfully-applied bootstrap / rotation entries are still no-ops storage-side. ## Self-review (CLAUDE.md 5 passes) 1. Data loss — registration is purely additive; existing handlers are unchanged. Validate() catches the proposer-without-leaderView programming error at startup so a follower cannot quietly accept mutator RPCs. 2. Concurrency — the helper allocates one EncryptionAdminServer per shard. Each server's only mutable state is the sidecar file, serialised through encryption.WriteSidecar's §5.1 crash-durable protocol. No new shared state. 3. Performance — the registration runs once per shard at startup. RPCs are operator-facing and off the data plane. 4. Consistency — every shard's EncryptionAdminServer uses ITS OWN engine as proposer + LeaderView, so a follower shard correctly rejects mutators with FailedPrecondition pointing at the shard's current leader. 5. Test coverage — the registration helper has a unit test that inspects grpc.Server.GetServiceInfo. The wider adapter and CLI tests carry the EncryptionAdmin contract. ## Verification - go test -race -timeout=60s -run TestRegisterEncryptionAdminServer .: PASS - golangci-lint run on root pkg: 0 issues - go build ./...: PASS
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR wires the EncryptionAdmin gRPC service into the Raft server startup. It introduces a ChangesEncryptionAdmin Sidecar Service Wiring
Possibly related PRs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1f022ff3e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if adminServer != nil { | ||
| pb.RegisterAdminServer(gs, adminServer) | ||
| } | ||
| registerEncryptionAdminServer(gs, rt.engine, rt.spec.id) |
There was a problem hiding this comment.
Report stable node IDs in capability responses
startRaftServers wires rt.spec.id (the Raft group ID) into registerEncryptionAdminServer as full_node_id, but group IDs are shared by all replicas of that group (for default config every node uses group 1 from parseRaftGroups). This makes different nodes return the same CapabilityReport.full_node_id, so the Phase-0 writer batch will contain duplicates and BootstrapEncryption will fail validation in validateWriterBatchUniqueness (adapter/encryption_admin.go) with duplicate full_node_id errors. full_node_id needs to be a per-node stable identifier (e.g., derived from --raftId), not a shard/group ID.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements Stage 5D of the data-at-rest encryption design by wiring the EncryptionAdmin gRPC service to each shard and introducing the --encryptionSidecarPath flag. The changes include the registration logic for the encryption admin server and corresponding unit tests. Feedback was provided regarding the use of panic for startup configuration errors, suggesting that log.Fatalf or returning an error to the caller would be more idiomatic in Go.
| } | ||
| srv := adapter.NewEncryptionAdminServer(opts...) | ||
| if err := srv.Validate(); err != nil { | ||
| panic(errors.Wrap(err, "encryption admin server validation")) |
There was a problem hiding this comment.
While panicking on a misconfiguration during startup is acceptable for this component, using panic directly with errors.Wrap is less idiomatic in Go than using log.Fatalf or returning an error to the caller to be handled in main. Since this is called deep within the server startup sequence, consider if a more graceful shutdown or a clear fatal log message would be preferred to a stack trace in production logs.
Codex P1: startRaftServers wired rt.spec.id (the Raft GROUP id) as full_node_id, but group ids are shared by every replica of a group (every node uses group 1 in the default config). All nodes therefore returned the same CapabilityReport.full_node_id, and BootstrapEncryption's validateWriterBatchUniqueness check (added in PR #759 for the gemini HIGH on writer batches) would reject the bootstrap with "duplicate full_node_id" — the very gate the batch validation was meant to enforce now turns into the failure mode that prevents the cluster from bootstrapping at all. Fix: derive a per-node-stable uint64 from --raftId via the canonical etcd.DeriveNodeID(id string) uint64 helper that raftengine itself uses for peer identification (FNV-1a hash, 0 sentinel collision mapped to 1). The call site now passes etcdraftengine.DeriveNodeID(*raftId) and the helper's parameter is renamed shardID → fullNodeID with a docstring spelling out the must-be-per-node-stable contract so a future refactor does not regress to the shard-id wiring. Caller audit: registerEncryptionAdminServer has exactly one production caller (main.go:1292), now passing the derived per-node id. The single test caller passes a literal uint64 (1) which is unaffected — its purpose is to verify the gRPC registration path, not the id-derivation correctness. Regression test TestEncryptionAdminFullNodeID_DistinctPerRaftId pins the canonical 3-node config (n1/n2/n3) produces three distinct non-zero ids so a future refactor breaking the derivation surfaces as a failing unit test rather than a cluster-wide bootstrap deadlock. go build, go vet, go test -race, golangci-lint all clean.
|
@claude review Round 1 — Codex P1 ( Fix: derive per-node-stable uint64 via Caller audit: Regression test
|
|
Claude finished @bootjp's task in 4m 1s —— View job PR 760 round-1 fix review
SummaryThe fix is correct and complete for the targeted failure mode. Findings✅ Core fix ( ✅ Caller audit ✅ Regression test ✅
Suggested diffs:
- // a stable, distinct value. Codex r1 P1 on PR #760.
+ // a stable, distinct value.
-// A wrong-shape value here makes BootstrapEncryption fail with
-// "duplicate full_node_id" because every node reports the same
-// number — Codex r1 P1 on PR #760 caught the original wiring
-// passing the shard id by mistake.
+// A wrong-shape value here makes BootstrapEncryption fail with
+// "duplicate full_node_id" because every node reports the same number.(Attempted to commit this cleanup locally; push was denied due to repo permissions — applying manually is fine.) ℹ️ Deferred: multi-shard This is not a current bug — Stage 5E is explicitly out of scope, the service is production-inert until Stage 6, and in practice the fan-out design must deduplicate responses per-node. Flagging it here so the Stage 5E design doc captures the dedup requirement explicitly before the fan-out helper is implemented. The fix addresses the Codex P1 correctly. The two comment nits above are the only actionable items remaining. |
|
@claude review |
|
Claude finished @bootjp's task in 3m 22s —— View job PR Review — Stage 5 PR-D: EncryptionAdmin gRPC WiringTasks
Overall AssessmentThe P1 bug identified by Codex (shard group ID used as Three issues to address before merge: Issues1. PR/review references in source code (violates CLAUDE.md conventions)
// A wrong-shape value here makes BootstrapEncryption fail with
// "duplicate full_node_id" because every node reports the same
// number — Codex r1 P1 on PR #760 caught the original wiring
// passing the shard id by mistake.Per CLAUDE.md: "Don't reference the current task, fix, or callers ('used by X', 'added for the Y flow', 'handles the case from issue #123'), since those belong in the PR description and rot as the codebase evolves." The WHY is important and should be kept — just strip the Codex/PR attribution. The 2.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 33d1db54c4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| adapter.WithEncryptionAdminProposer(engine), | ||
| adapter.WithEncryptionAdminLeaderView(engine), | ||
| adapter.WithEncryptionAdminFullNodeID(fullNodeID), |
There was a problem hiding this comment.
Gate mutating EncryptionAdmin wiring until applier exists
registerEncryptionAdminServer always wires WithEncryptionAdminProposer/WithEncryptionAdminLeaderView, so mutating RPCs are live even when --encryptionSidecarPath is empty. In main, the FSM is still built as kv.NewKvFSMWithHLC(st, clock) (no WithEncryption applier), so a successful BootstrapEncryption/RotateDEK/RegisterEncryptionWriter proposal will emit opcode 0x03/0x04/0x05 and hit the fail-closed halt path (kvFSM.applyEncryption with nil applier), causing apply to stop instead of being production-inert. This turns an RPC call into a cluster-halting failure before Stage 6 gating is in place.
Useful? React with 👍 / 👎.
Codex r2 P1 (main_encryption_admin.go:57): registerEncryptionAdminServer unconditionally wired WithEncryptionAdminProposer and WithEncryptionAdminLeaderView, so mutating RPCs (RotateDEK / BootstrapEncryption / RegisterEncryptionWriter) accepted Raft proposals even on nodes without --encryptionSidecarPath. The §6.3 WithEncryption applier seam is still unwired on the FSM side pre-Stage-6, so a successfully-applied bootstrap entry would hit the fail-closed halt path in kvFSM.applyEncryption (nil applier) and stop the apply loop cluster-wide. The PR description claimed "production-inert until Stage 6" but mutating RPCs were not inert — they would escalate a single accidental client call into a cluster-halt before Stage 6 wires the applier. Fix: pair Proposer + LeaderView with the sidecarPath gate. Both wired iff sidecarPath set, both omitted otherwise. With the options omitted, EncryptionAdminServer.BootstrapEncryption / RotateDEK / RegisterEncryptionWriter return FailedPrecondition "proposer is not configured on this node" at the RPC boundary (adapter/encryption_admin.go:347, :541, :599), refusing the proposal before any Raft round-trip. The Validate() invariant (proposer non-nil → leaderView non-nil) holds by construction because the two options are now paired in the same conditional. Caller audit: registerEncryptionAdminServer has one production caller (main.go:1292) which is unaffected — the new behavior is exactly what production wants for the pre-Stage-6 inert state. The single test caller (TestRegisterEncryptionAdminServer_Registers) only checks registration so it stays valid. Regression test TestEncryptionAdmin_MutatingRPCRefusedWithoutSidecarPath uses bufconn to mount the helper with sidecarPath empty, dials BootstrapEncryption, and asserts FailedPrecondition. The original wiring would have failed this test because the stub engine's Propose() would have returned a fake success instead. go build, go vet, go test -race, golangci-lint all clean.
|
@claude review Round 2 — Codex P1 ( Fix: pair Proposer + LeaderView with the sidecarPath gate. Both wired iff sidecarPath set, both omitted otherwise. Mutating RPCs then return Caller audit: 1 production caller ( Regression test
|
|
Claude finished @bootjp's task in 2m 38s —— View job PR 760 Round 2 review — sidecarPath gate for mutating RPCsTasks
Overall AssessmentThe Round 2 P1 fix is correct. Gating One actionable bug was introduced alongside the fix (stale flag description). Two prior nits from the Round 1 review remain unaddressed and are now compounded. Issues1. Stale flag description — now documents the opposite of actual behavior (main.go:153)encryptionSidecarPath = flag.String("encryptionSidecarPath", "",
"§5.1 keys.json path; empty disables the read-only EncryptionAdmin sidecar probes. Mutating RPCs are unaffected (they are leader-gated through raftengine).")"Mutating RPCs are unaffected" was true before Round 2. After the fix, an empty sidecarPath causes Suggested replacement: 2. CLAUDE.md comment convention violations — PR/task references not cleaned up (compounded from Round 1)
The WHY is valuable; the attribution is not. Both 3.
|
|
@claude review |
|
Claude finished @bootjp's task in 3m 34s —— View job PR 760 Round 2 Review — sidecarPath gate for mutating RPCs
Overall AssessmentThe Round 2 fix is correct and the regression test is legitimate. The proposer/leaderView gating on Issues1. Stale flag description contradicts the fix (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@main.go`:
- Around line 144-154: The flag help for encryptionSidecarPath is incorrect:
mutating RPCs are currently gated and refused when the sidecar path is empty.
Update the encryptionSidecarPath flag help string to state that a non-empty
--encryptionSidecarPath is required to enable mutating RPCs (leader/proposer
view) and that leaving it empty disables the read-only sidecar probes and
prevents mutating RPCs, making clear the current behavior of
leader-gating/proposer activation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 91377e62-9bc4-4582-8e74-961e797eb599
📒 Files selected for processing (4)
docs/design/2026_04_29_partial_data_at_rest_encryption.mdmain.gomain_encryption_admin.gomain_encryption_admin_test.go
… mutator gate coderabbit minor on PR #760 round 2: the --encryptionSidecarPath flag-help text and surrounding comment still said "Mutating RPCs are unaffected", but the round-2 fix gates the Proposer + LeaderView wiring on this flag. Empty path now ALSO leaves mutators short-circuited at the RPC boundary. The flag is the load-bearing safety boundary against the Codex r2 P1 — operators need to know that, not the stale "mutators are leader-gated through raftengine" line that came from PR-D's original production-inert claim. Help text now reads: §5.1 keys.json path; empty disables EncryptionAdmin sidecar capability probing AND leaves mutating EncryptionAdmin RPCs unwired on this node (proposer + LeaderView omitted → FailedPrecondition at the RPC boundary). The surrounding var-block comment is updated in parallel to document the same gate. ## Verification - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' .: PASS - golangci-lint run on root pkg: 0 issues
|
@claude review |
|
Claude finished @bootjp's task in 2m 17s —— View job PR 760 Round 3 ReviewTasks
Overall AssessmentReady to merge. Both P1 bugs are correctly fixed. Round 3's flag description update ( Round 3 commit (
|
| Item | Status |
|---|---|
DeriveNodeID(*raftId) for full_node_id (r1 P1) |
✅ Correct — FNV-1a hash, per-node-stable, matches raftengine peer ID derivation |
| Proposer + LeaderView gated on sidecarPath (r2 P1) | ✅ Correct — both options omitted together, Validate() invariant holds by construction |
| Flag description (r3 CodeRabbit) | ✅ Fixed in 6a7a27a — accurately describes sidecar probe AND mutating-RPC gating |
Validate() panic on misconfiguration |
✅ Unreachable at current callsite; load-bearing guard against future refactor splitting the two options — correct design |
Multi-shard full_node_id dedup (Stage 5E note) |
✅ Documented in PR description; 5E row in design doc is the right place to track the fan-out dedup requirement |
| Test coverage | ✅ Three tests cover: distinct ID derivation, mutating-RPC refusal without sidecarPath, and service registration |
…it param claude[bot] round-3 nit: registerEncryptionAdminServer reads the encryptionSidecarPath flag global from inside the helper. Stage 5E adds a section 5.6 step 1a fan-out helper that will need the same gating behaviour, and threading the global down a second call-tree perpetuates the implicit dependency. Round-3 guidance was to convert the global access to a parameter before further callers land. ## Change registerEncryptionAdminServer gains a trailing sidecarPath string parameter. The helper no longer references the encryptionSidecarPath flag variable; callers pass an explicit value. Production passes the deref at the one call-site; tests pass empty inline, which lets us drop the global save/restore dance from TestEncryptionAdmin_MutatingRPCRefusedWithoutSidecarPath. ## Caller audit (semantic change — signature) - main.go startRaftServers (1 prod site) — passes the deref of the flag variable, preserving the exact behaviour of the previous global read. - main_encryption_admin_test.go (2 test sites) — both pass empty string explicitly, which is the same value the flag default carried. The MutatingRPCRefused test no longer needs the prev/Cleanup save-restore dance because it no longer mutates the global. No behaviour change at any call-site; the refactor is a strict in-place parameterisation. ## Verification - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' .: PASS (1.07s) - go build ./...: PASS - golangci-lint run: 0 new issues on touched lines
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a7a27a8fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if *encryptionSidecarPath != "" { | ||
| opts = append(opts, | ||
| adapter.WithEncryptionAdminSidecarPath(*encryptionSidecarPath), | ||
| adapter.WithEncryptionAdminProposer(engine), | ||
| adapter.WithEncryptionAdminLeaderView(engine), | ||
| ) |
There was a problem hiding this comment.
Keep mutating EncryptionAdmin RPCs disabled before Stage 6
When --encryptionSidecarPath is set (the condition used to mark a node as capability-ready), this helper wires both WithEncryptionAdminProposer and WithEncryptionAdminLeaderView, so BootstrapEncryption/RotateDEK/RegisterEncryptionWriter can successfully propose encryption opcodes. In this same commit, the FSM is still constructed as kv.NewKvFSMWithHLC(st, clock) in main.go (no WithEncryption applier), so any committed opcode 0x03/0x04/0x05 will hit the fail-closed halt path (kvFSM.applyEncryption with nil applier) and stop apply. This means a node configured for capability probing can still be cluster-halted by a mutating admin RPC before Stage 6 gating exists.
Useful? React with 👍 / 👎.
|
@claude review Round-4: addressed the round-3 carry-forward nit — Caller audit: 3 sites, 1 prod + 2 test, all updated together with identical behaviour preservation (the global deref / default empty are the same values that the global would have returned). Verification: |
|
Claude finished @bootjp's task in 1m 33s —— View job PR 760 Round 4 Review
Overall AssessmentReady to merge. The Round-4 refactor is correct, minimal, and strictly behavior-preserving. All three P1 bugs found across rounds 1–2 are fixed, and all CodeRabbit and claude nits from rounds 1–3 are addressed. The one remaining attribution-string convention nit has been accepted as a pre-existing pattern in this file. Round-4 Change:
|
| Issue | Round fixed | Status |
|---|---|---|
full_node_id = shard group id (same on every replica) |
Round 1 | ✅ DeriveNodeID(*raftId) |
| Proposer + LeaderView wired unconditionally (cluster halt pre-Stage-6) | Round 2 | ✅ gated on sidecarPath != "" |
| Flag description contradicting fix ("mutating RPCs unaffected") | Round 3 | ✅ Updated to name FailedPrecondition behavior |
sidecarPath accessed as implicit global in helper |
Round 4 | ✅ Explicit parameter, this commit |
Nit: PR/Codex attribution strings (accepted, not blocking)
main_encryption_admin.go:37,49, main.go:1294, and test docstrings at :39,62 still contain "Codex r1/r2 P1 on PR #760" attribution. This was flagged in rounds 1, 2, and 3 without resolution. Since the same pattern appears pre-existing in main.go at multiple other lines (552, 847, 908, 1566, 1777) and has been accepted across three review passes, this is clearly the author's style — treating it as a pre-existing convention going forward rather than a blocker. A batch strip in a follow-up pass would be the cleanest resolution, but it's not PR-blocking.
Stage 5E note (unchanged, preserved as-is)
The multi-shard full_node_id deduplication requirement for the Phase-0 fan-out remains a design note for the Stage 5E implementer. The fan-out helper must deduplicate CapabilityReports by full_node_id before handing to validateWriterBatchUniqueness — each node's N shards all report the same DeriveNodeID(*raftId) value. This is by design (one physical node = one full_node_id); the fan-out is what needs to be dedup-aware.
No further issues. This PR is good to merge.
…til Stage 6 applier lands Codex round-3 P1 (main_encryption_admin.go:72) sharpens the analysis of PR760 r2's sidecarPath gate: When --encryptionSidecarPath is set (the condition used to mark a node as capability-ready), this helper wires both WithEncryptionAdminProposer and WithEncryptionAdminLeaderView, so BootstrapEncryption/RotateDEK/RegisterEncryptionWriter can successfully propose encryption opcodes. In this same commit, the FSM is still constructed as kv.NewKvFSMWithHLC(st, clock) in main.go (no WithEncryption applier), so any committed opcode 0x03/0x04/0x05 will hit the fail-closed halt path (kvFSM.applyEncryption with nil applier) and stop apply. The PR760 r2 fix (cf214fe) gated mutator wiring on sidecarPath != empty. That gate prevents the cluster-halt on sidecar-less nodes but leaves the footgun in place for nodes that set --encryptionSidecarPath for capability probing alone: calling a mutator on such a node would propose a 0x03/0x04/0x05 entry, every replica would HaltApply on the nil applier, and the cluster would stop. sidecarPath is the wrong gate — it signals "this node can answer the §7.1 capability probe", not "this cluster has the §6.3 applier wired". The principled gate is "the §6.3 applier exists on the FSM side". Stage 5D has no way to express that — the applier is plumbed by Stage 6. So Stage 5D must keep mutator wiring OFF unconditionally; Stage 6 will reintroduce the Proposer + LeaderView wiring in the same change that adds kv.NewKvFSMWithEncryption to the FSM construction site. ## Fix registerEncryptionAdminServer no longer wires Proposer or LeaderView under any condition. sidecarPath continues to gate ONLY the read-only WithEncryptionAdminSidecarPath option, which controls capability probing semantics. With Proposer + LeaderView both unset, EncryptionAdminServer.BootstrapEncryption / RotateDEK / RegisterEncryptionWriter return FailedPrecondition at the gRPC boundary regardless of sidecarPath — the proposal never reaches Raft. ResyncSidecar's leader guard (requireLeader) returns nil when LeaderView is unset (the "test escape hatch" path documented in requireLeader's godoc). This means ResyncSidecar serves the local sidecar on every node until Stage 6 wires the LeaderView back in. Pre-bootstrap the sidecar is either non-existent or empty so there is no §5.5 leak window — the sidecar carries no secrets until Stage 6. ## Caller audit (semantic change — signature + behavior) registerEncryptionAdminServer signature changes from (gs, engine, fullNodeID, sidecarPath) to (gs, fullNodeID, sidecarPath). The engine parameter and the local encryptionAdminEngine interface (Proposer + LeaderView intersection) are both removed — there is nothing in the Stage 5D helper that needs an engine handle anymore. - main.go startRaftServers (1 prod site, line 1295) — engine argument dropped. The new behavior (mutators always refused at the RPC boundary) is strictly safer than the previous behavior (mutators refused only when sidecarPath empty). Production today defaults --encryptionSidecarPath to empty, so observable behavior at the production default is unchanged; the change closes the operator-footgun path of setting sidecarPath alone. - main_encryption_admin_test.go (2 test sites) — stub engine dropped. The mutator-refusal test split into two sub-cases (sidecar_empty + sidecar_set) and both must see FailedPrecondition. The "sidecar_set" sub-case is the regression check for codex r3 P1: under the previous gated wiring this assertion would have failed because the stub Propose() would have returned a fake success. No external caller (outside this repo) depends on the helper — it is package-private. ## Stage 6 forward contract When Stage 6 re-wires Proposer + LeaderView, both options MUST be paired in the same conditional so the Validate() panic (proposer non-nil → leaderView non-nil) holds by construction. This commit's docstring documents the contract explicitly so the Stage 6 author cannot accidentally split the two options. ## Verification - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' .: PASS (1.08s, 4 cases: distinct-id + 2 mutator-refusal subcases + service registration) - go build ./...: PASS - golangci-lint run: 0 issues on touched files
|
@claude review Round-4 (commit Codex round-3 P1
PR760 r2 (
Fix
Stage 6 will reintroduce Proposer + LeaderView wiring in the same change that adds Caller audit (semantic change — signature + behavior)Signature:
Verification
|
|
Claude finished @bootjp's task in 1m 53s —— View job PR 760 Round-4 Review
Overall AssessmentThe Round-4 Codex-r3-P1 fix is correct. Removing Proposer + LeaderView unconditionally — not just Issues1. Block comment
|
…ment to round-4 wiring claude[bot] round-4 review on fc8bc1f flagged that the block comment at main.go:144-156 still describes the Round-2 sidecarPath-gates-mutator-wiring behavior: "This flag is the load-bearing gate for the mutating RPCs pre-Stage 6: setting it wires the proposer + LeaderView so RotateDEK / BootstrapEncryption / RegisterEncryptionWriter become reachable on the shard's leader..." After Round-4 (fc8bc1f), this is wrong. Stage 5D now refuses mutating RPCs with FailedPrecondition unconditionally — sidecar path gates ONLY the read-only capability surface. An operator reading the stale comment would expect --encryptionSidecarPath to enable Bootstrap/RotateDEK/RegisterEncryptionWriter when in fact it does not. ## Fix Re-sync both the block comment and the flag help string at main.go:144-156 with the round-4 wiring: - Block comment now describes the two distinct surfaces: * sidecarPath ON → capability_capable=true (read-only) * sidecarPath OFF → capability_capable=false (read-only) * Mutators → FailedPrecondition in BOTH cases until Stage 6 - Flag help string mirrors the same statement so operators passing --help see the correct behavior contract. The registerEncryptionAdminServer doc comment (main_encryption_admin.go:11-63) already reflects the round-4 wiring — this commit only brings the main.go side back in sync. ## Caller audit Documentation-only change. No code paths touched, no callers to audit. ## Verification - go build ./...: PASS - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' .: PASS (1.07s) - golangci-lint run ./...: 0 issues on main.go
|
@claude review Round-5 (commit FixBoth the block comment AND the flag help string at New contract documented in both locations:
The Caller auditDocumentation-only — no code paths touched, no callers to audit. Verification
The CLAUDE.md attribution-string nit ( |
|
Claude finished @bootjp's task in 1m 20s —— View job PR 760 Round-5 Review
Overall AssessmentReady to merge. The Round-5 commit correctly re-syncs the Round-5 change:
|
| Issue | Round | Status |
|---|---|---|
full_node_id = Raft group id (shared across replicas) |
1 | ✅ DeriveNodeID(*raftId) |
| Proposer + LeaderView wired unconditionally (cluster halt pre-Stage-6) | 2 | ✅ initially gated on sidecarPath |
| sidecarPath-gating insufficient — capable nodes still haltable | 3/4 | ✅ mutator wiring removed entirely |
| Flag help text contradicting fix | 3 (CodeRabbit) | ✅ updated in commit 6a7a27a |
sidecarPath global access in helper |
4 | ✅ explicit parameter |
Stale main.go block comment describing Round-2 behavior |
4 verdict | ✅ fixed in this Round-5 commit |
Nit: Attribution strings — author's decision accepted
main.go:1302, main_encryption_admin.go:22-24,33-34, and main_encryption_admin_test.go:17-18,41-43 still contain "Codex r1/r3 P1 on PR #760" and "PR760 r* Codex P1" markers. The author acknowledges this in the Round-5 message and has explicitly decided to keep them as regression anchors per the pre-existing convention in this file. That's the author's call — the rationale is documented in the PR description where it belongs. Not blocking.
No further actionable issues. This PR is good to merge.
Two issues on round-2 (commit 1b3cabf): ## codex P1: 6A re-wiring mutators creates the same cluster-halt class as PR #760 Round-2 wrote: "any operator who hand-triggers BootstrapEncryption or RotateDEK on a 6A-only binary with --encryption-enabled set hits a clean, named error in the gRPC error rather than a process crash." Codex correctly flagged this as wrong: - The FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply. - raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. - So a 6A-only mutator RPC: handler proposes, entry commits, every replica's FSM apply returns ErrKEKNotConfigured, HaltApply fires, apply loop halts. Restart replays the same entry → halts again. Cluster bricks on that index. This is the same class of failure as PR #760 round-3 codex P1. The fix for #760 was "don't wire mutators until applier exists"; the principled extension is "don't wire mutators until applier AND KEK are both present." ## Fix: revise 6A/6B split - **6A revised**: lands the §6.3 EncryptionApplier as INTERNAL scaffolding only. WithEncryption(applier) is wired at FSM construction so the dispatch path exists, but registerEncryptionAdminServer carries forward Stage 5D's posture — Proposer + LeaderView stay unwired, mutating RPCs continue to refuse with FailedPrecondition. No flag introduced. The applier is reachable only through direct FSM.Apply calls in unit tests. 6A's value is unblocking 6B/6D by providing a real applier to bind KEK and cutover work against. - **6B revised**: lands KEK plumbing AND introduces the --encryption-enabled flag AND re-enables the mutator wiring gated on (--encryption-enabled AND KEKConfigured()). Both conditions must hold, so a binary with the flag set but no KEK source loaded continues to refuse mutators at the RPC boundary — preventing the cluster-halt path codex identified. - **6C revised**: now extends the flag "introduced in 6B" (not 6A) since the introduction moved to 6B. - **6D rationale rewritten**: the previously-special "6D depends on 6B because §7.1 Phase 1 step 4 implicitly triggers ApplyBootstrap" dependency is now a special case of the general "6B gates all mutator wiring" rule. Both 6D and 6E depend on 6B unconditionally — 6B is what makes any EncryptionAdmin mutator reachable. The 6D row drops the long inline dependency note since the rationale section covers it. The defense-in-depth ErrKEKNotConfigured marker in the applier is retained — a future refactor that bypasses the 6B RPC gate should produce a named grep-able error in logs rather than a nil pointer dereference. The marker is no longer the production safety boundary (the RPC gate is); it is a belt-and-suspenders backstop. ## claude[bot] round-2 nit: "four" → "five" The 6F rationale bullet said "after the four load-bearing PRs (6A–6E)". The range 6A–6E is five PRs. Fixed. ## Caller audit Doc-only — no code paths touched, no callers. ## Verification - git diff scope: design doc only, 81 insertions / 53 deletions. - make lint: 0 issues (no code to lint).
…pped (#762) ## Summary - Record Stage 5D (PR #760, merged 2026-05-16T16:35:03Z) as shipped in the milestone table. - Mark Stage 5E (§5.6 step 1a capability fan-out helper) as **deferred** — its scope is narrowly the auto-batch CLI convenience and does not gate any Stage 6 work. - Decompose the original Stage 6 row (`3-phase rollout flags + applier wiring`) into six sub-milestones **6A–6F**, each independently reviewable, leaving the cluster in a production-safe state, unblocking the next. - Add a Stage 6 sub-milestone rationale section explaining why the decomposition follows this shape — why 6A is the load-bearing safety boundary (it lands the §6.3 applier whose absence was PR #760 round-3 codex P1), why 6D/6E must be separate cutovers (intermediate Phase-1 storage-only state is a valid operator dwell), and why 6F is the last ergonomics PR. ## Stage 6 breakdown | Sub-milestone | Scope | |---|---| | **6A** | §6.3 `EncryptionApplier` concrete implementation (writer-registry insert + sidecar mutate + keystore update) + `kv.NewKvFSMWithHLC(... WithEncryption(applier))` wiring + re-enable mutating-RPC wiring in `registerEncryptionAdminServer` (gated on `--encryption-enabled`) | | **6B** | §6.5 KEK plumbing — `--kekFile` / `--kekUri` flags + `internal/encryption/kek` selection at startup + applier KEK-unwrap path | | **6C** | §6.5 `--encryption-enabled` + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, fsync support, `local_epoch` rollback / exhaustion, `node_id` collision) | | **6D** | §6.6 `enable-storage-envelope` admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate | | **6E** | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook + `ErrRaftUnwrapFailed` HaltApply path | | **6F** | §6.5 `--encryption-rotate-on-startup` request flag + leader-elected rotation proposal | ## Why doc-first CLAUDE.md design-doc-first workflow: "For any change that goes beyond a single-file edit ... write a `*_proposed_*.md` design doc first and land it before the implementation." The existing partial doc already covers Stage 6 at the high level — this PR refines the milestone table so each of 6A–6F has a clear, independently-reviewable scope before implementation begins. 6A will land as the next PR. ## Five-lens self-review 1. **Data loss** — n/a (doc-only). 2. **Concurrency / distributed failures** — n/a (doc-only). The breakdown itself is concurrency-safe: 6A leaves the cluster in the same production-inert state PR #760 left it in (mutators OFF by default via the new `--encryption-enabled` gate). 3. **Performance** — n/a (doc-only). 4. **Data consistency** — n/a (doc-only). The narrative explicitly notes that the Phase-1-only intermediate state (storage encrypted, raft cleartext) is a valid operator dwell, which is the consistency invariant operators rely on for rollback. 5. **Test coverage** — n/a (doc-only). Each of 6A–6F will land its own unit + property + Jepsen coverage as appropriate. ## Test plan - [x] No code touched — `go build ./...` not required. - [x] No code touched — `go test ./...` not required. - [x] No code touched — `make lint` not required (the local pre-commit hook ran and reported 0 issues). - [x] Manually re-read the updated milestone table for internal consistency. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **Documentation** * Updated milestone progress tracking for data-at-rest encryption feature development * Extended stage breakdown with implementation details for stages 5D through 6F * Clarified stage 6 sub-milestone planning and dependencies <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/762?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Stage 5 PR-D of the data-at-rest encryption rollout (design doc:
docs/design/2026_04_29_partial_data_at_rest_encryption.md). Wiresthe §6.1
EncryptionAdmingRPC service onto every shard's gRPClistener so the §7.1 Phase-0
GetCapabilityfan-out can poll anycluster member.
Production-inert:
--encryptionSidecarPathdefaults to empty, soGetCapabilityreportsencryption_capable=falseuntil Stage 6introduces the proper cluster-flag gate.
What lands
--encryptionSidecarPathflag (main.go): points theread-only probes at the §5.1 keys.json. Empty path = node has
not been started with encryption — matches the §6.1
not-capable signal that the cutover refuses with
ErrCapabilityCheckFailed.registerEncryptionAdminServerhelper (main_encryption_admin.go):per-shard registration, plumbing the shard's
raftengine.Enginein as BOTH proposer AND
LeaderViewsorequireLeader'sVerifyLeaderquorum check fires against the right shard. Thefull_node_id reported in
CapabilityReportis the shard id,matching the §5.6 step 1a batch contract.
Validate()is load-bearing: panics at startup if a futurerefactor splits the proposer/LeaderView wiring apart. At the
current callsite both come from the same engine, so the panic
is unreachable — but the guard catches the failure mode where a
follower might silently accept mutator RPCs.
main_encryption_admin_test.go): asserts the serviceshows up in
grpc.Server.GetServiceInfo()so a futureregistration-helper refactor cannot silently drop the service.
Out of scope
encryption bootstrap --discover-from=...auto-batch CLI mode.--encryption-enabledcluster flag, §6.3WithEncryptionapplier wiring, §9.1 startup refusal guards.Test plan
go test -race -timeout=60s -run TestRegisterEncryptionAdminServer .— verifies the service is registered.
go build ./...— PASS.golangci-lint runon root pkg — 0 issues.Self-review (CLAUDE.md 5 passes)
handlers are unchanged.
Validate()catches theproposer-without-LeaderView programming error at startup so a
follower cannot quietly accept mutator RPCs.
EncryptionAdminServerper shard. Theonly mutable state is the sidecar file, serialised through
encryption.WriteSidecar's §5.1 crash-durable protocol.RPCs are off the data plane.
EncryptionAdminServerusesITS OWN engine, so a follower shard correctly rejects
mutators with
FailedPreconditionpointing at the shard'scurrent leader.
GetServiceInfo-based registration check;the wider adapter + CLI tests carry the
EncryptionAdmincontract.
Summary by CodeRabbit
New Features
--encryptionSidecarPathcommand-line flag for configuring encryption sidecar probesImprovements
Documentation