feat(encryption): Stage 5 PR-B — RotateDEK + RegisterEncryptionWriter mutators#756
bootjp wants to merge 8 commits into
… mutators

Stage 5 PR-B of the data-at-rest encryption rollout (design doc docs/design/2026_04_29_partial_data_at_rest_encryption.md, §5.2 rotation, §4.1 writer registry, §6.1 admin.go service, §6.6 admin commands). Wires two of the three §11.3 mutating opcodes through raftengine.Proposer on top of Stage 4's fsmwire body encoders.

Production-inert: main.go does not yet register the EncryptionAdminServer on its gRPC listener, and the WithEncryptionAdminProposer option is unset by default, so every mutating RPC short-circuits to FailedPrecondition with "proposer not configured". Stage 6 flips this on under the --encryption-enabled cluster flag; PR-C lands BootstrapEncryption + §5.6 step 1a capability fan-out.

## adapter/encryption_admin.go

- New options: WithEncryptionAdminProposer (raftengine.Proposer) and WithEncryptionAdminLeaderView (raftengine.LeaderView). Both are nil-safe; a nil leaderView treats the node as always-leader (test affordance) while a nil proposer rejects every mutating RPC at the FailedPrecondition boundary.
- RotateDEK validates purpose / new_dek_id != 0 / non-empty wrapped_new_dek / proposer_local_epoch <= 0xFFFF at the gRPC boundary, composes a fsmwire.RotationPayload with SubTag=RotateSubRotateDEK, prepends the §11.3 0x05 opcode tag, and Propose()s. Returns the engine's CommitIndex as applied_index.
- RegisterEncryptionWriter requires exactly one writer per call (multi-writer batches go through BootstrapEncryption per §5.6 step 1a) and applies the same local_epoch <= 0xFFFF bound. Encodes a fsmwire.RegistrationPayload under the 0x03 opcode tag.
- ResyncSidecar now requires leadership: §5.5 recovery only consults the leader's record, so a follower's sidecar should not be used as a peer's recovery source.
- proposeEncryptionEntry is the byte-level glue between the fsmwire-encoded body and Proposer.Propose, prepending the opcode tag in one place so the wire layout is locked.
- requireLeader embeds the leader's id + address in the FailedPrecondition status detail so the operator's CLI can re-target without parsing free-form error text.
- uint32ToLocalEpoch is a defence-in-depth narrowing helper for the §4.1 16-bit nonce; bound-checked at the handler boundary plus masked here.

## cmd/elastickv-admin/encryption_mutators.go (new)

- `encryption rotate-dek --purpose=storage|raft --new-dek-id= --wrapped-new-dek=<base64> --proposer-node-id= --proposer-local-epoch=`
- `encryption register-writer --dek-id= --full-node-id= --local-epoch=`
- Shared --endpoint / --timeout flags via the PR-A newEncryptionEndpointFlags helper. Help-flag (`-h` / `--help`) is handled via flag.ErrHelp per the PR-A convention.
- Wrapped DEK material accepted as base64 because operator-side KEK wrapping is done out-of-band (Stage 9 CLI); base64 is the lingua franca for pasting opaque bytes through a terminal.
- narrowUint32 is the CLI-side twin of uint32ToLocalEpoch: bound-checked + masked.

## cmd/elastickv-admin/encryption.go

- encryptionMain dispatches the two new subcommands and updates the usage / unknown-subcommand error text to list them.

## docs/design/2026_04_29_partial_data_at_rest_encryption.md

- Stage 5 milestone row split further: 5A (shipped, PR #754), 5B (this PR), 5C (Bootstrap + fan-out + main.go wiring).

## Tests

adapter/encryption_admin_test.go:

- RotateDEK happy path verifies the proposed entry byte layout by round-tripping through fsmwire.DecodeRotation.
- RotateDEK follower-rejection embeds the leader hint in the error string.
- RotateDEK input-validation table: zero new_dek_id, empty wrapped, unspecified purpose, local_epoch > 0xFFFF.
- RegisterEncryptionWriter happy path round-trips through fsmwire.DecodeRegistration.
- RegisterEncryptionWriter rejects batch (len(writers) != 1).
- ResyncSidecar follower-rejection.
- MutatingRPCs_RejectWithoutProposer pins the production-inert guarantee.

cmd/elastickv-admin/encryption_test.go:

- runEncryptionRotateDEK happy path against a stub gRPC server.
- runEncryptionRotateDEK rejects bad --purpose and epoch overflow at the CLI parse boundary (no round-trip needed).
- runEncryptionRegisterWriter happy path against a stub.

## Self-review (CLAUDE.md 5 passes)

1. Data loss: every mutating RPC is leader-gated; followers return FailedPrecondition before the proposer is touched. A nil proposer rejects with FailedPrecondition. Stage 4's HaltApply seam already covers the apply-side failure mode.
2. Concurrency: the server adds no mutable state; the proposer is set once at construction. raftengine.Proposer's contract serialises proposals through the engine's queue.
3. Performance: hot path is unchanged. Mutating RPCs are operator-facing and rare.
4. Consistency: fsmwire encoders are the single source of truth for the on-wire byte layout. Round-trip tests (Encode→DecodeRotation / Encode→DecodeRegistration) lock the server's view against fsmwire's, so a future change to either side cannot silently desync.
5. Test coverage: input-validation table, leader-rejection, wire-layout round-trip, and CLI happy path are all covered. No proposer wired = FailedPrecondition is pinned.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin (Stage 5 scope): PASS
- golangci-lint run on touched paths: 0 issues
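
The wire framing and epoch narrowing described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the PR's code: `narrowLocalEpoch` and `frameEntry` are hypothetical names standing in for `uint32ToLocalEpoch` and `proposeEncryptionEntry`, the `uint16` epoch type is an assumption, and the body bytes are placeholders rather than a real fsmwire encoding. Only the opcode values (0x03, 0x05) and the 0xFFFF bound come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Opcode tags as quoted in the PR description (§11.3).
const (
	opRegistration byte = 0x03 // RegisterEncryptionWriter body
	opRotation     byte = 0x05 // RotateDEK body
)

// maxLocalEpoch is the 16-bit bound applied to local_epoch.
const maxLocalEpoch = 0xFFFF

var errEpochOverflow = errors.New("local_epoch exceeds 0xFFFF")

// narrowLocalEpoch is a hypothetical helper: it rejects out-of-range
// values, then masks as defence in depth before narrowing to 16 bits.
// The uint16 result type is an assumption.
func narrowLocalEpoch(v uint32) (uint16, error) {
	if v > maxLocalEpoch {
		return 0, errEpochOverflow
	}
	return uint16(v & maxLocalEpoch), nil
}

// frameEntry is a hypothetical helper: it prepends the opcode tag to an
// already-encoded body, copying into a fresh slice so the caller's body
// is never aliased.
func frameEntry(opcode byte, body []byte) []byte {
	entry := make([]byte, 0, 1+len(body))
	entry = append(entry, opcode)
	return append(entry, body...)
}

func main() {
	epoch, err := narrowLocalEpoch(7)
	if err != nil {
		panic(err)
	}
	body := []byte{0xAA, 0xBB} // placeholder bytes, not a real fsmwire body
	fmt.Printf("epoch=%d entry=% x\n", epoch, frameEntry(opRotation, body))
}
```

Keeping the tag prepend in a single helper means a round-trip test only has to strip one byte before handing the rest to the decoder.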
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 748f537db3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    // surface here. We surface the engine's error rather
    // than rewriting it so the operator can grep against
    // the engine's own logs.
    return 0, pkgerrors.Wrapf(err, "encryption: propose 0x%02x", opcode)
Return gRPC status for proposer-side proposal errors
When Propose fails (e.g. leadership changes between requireLeader() and Propose), this handler returns a plain wrapped error instead of a status error, so gRPC reports codes.Unknown to clients. That drops structured retry semantics for known Raft conditions (ErrNotLeader, ErrLeadershipLost, transfer-in-progress) and makes mutating admin RPC behavior inconsistent with the rest of this service’s FailedPrecondition-based leader routing.
    payload := fsmwire.RegistrationPayload{
        DEKID: req.GetDekId(),
        FullNodeID: w.GetFullNodeId(),
        LocalEpoch: uint32ToLocalEpoch(w.GetLocalEpoch()),
Reject zero full_node_id in writer registration payloads
This path accepts writers[0].full_node_id == 0 and forwards it directly into the Raft registration payload. In this codebase full_node_id=0 is used as a sentinel for “not capable/unset”, so persisting it into writer-registry state can create an invalid writer identity and break nonce-uniqueness assumptions if multiple registrations collapse onto node id 0. The server should reject zero full_node_id (and similarly for rotate proposer id) at the RPC boundary.
…error mapping)

Addresses both codex P1 findings on PR #756. Per CLAUDE.md, the defect fixes are paired with failing regression tests added in the same commit.

## codex P1 #1: reject zero full_node_id at the RPC boundary

`adapter/encryption_admin.go` accepted `proposer_node_id == 0` (RotateDEK) and `writers[0].full_node_id == 0` (RegisterEncryptionWriter), forwarding the sentinel directly into the §11.3 0x03 / 0x05 Raft entry body. §6.1 reserves `full_node_id = 0` as the "node not encryption-capable" marker; persisting it into a writer-registry row would collide with every other un-bootstrapped node's sentinel and break the §4.1 nonce-uniqueness invariant.

Fix: reject with `codes.InvalidArgument` at the gRPC boundary in both RPCs.

Regression tests:

- TestEncryptionAdmin_RotateDEK_RejectsZeroProposerNodeID
- TestEncryptionAdmin_RegisterEncryptionWriter_RejectsZeroFullNodeID

Caller audit: ProposerNodeId / FullNodeId are read by tests (all use non-zero fixtures), the CLI flag parsers (operator-controlled, pass straight through), and the new validators here. No production caller depends on the old "accept zero" behaviour.

## codex P1 #2: map Propose errors to gRPC status codes

`proposeEncryptionEntry` returned a `pkgerrors.Wrapf` error on Propose failure, which gRPC surfaces as `codes.Unknown`. Clients then lose structured retry semantics for the known engine errors (`ErrNotLeader`, `ErrLeadershipLost`, `ErrLeadershipTransferInProgress`).

Fix: new `proposeErrorToStatus` helper maps:

- leadership errors → FailedPrecondition (same shape as the up-front requireLeader rejection so clients can re-target the current leader uniformly)
- context.Canceled → Canceled
- context.DeadlineExceeded → DeadlineExceeded
- everything else → Unavailable (transient retryable)

Regression tests:

- TestEncryptionAdmin_RotateDEK_MapsProposeLeaderErrorToFailedPrecondition (table-driven across all three leadership sentinels)
- TestEncryptionAdmin_RotateDEK_MapsProposeOtherErrorToUnavailable

Caller audit: proposeEncryptionEntry is called only from RotateDEK and RegisterEncryptionWriter, both of which `return err` directly without rewrapping, so the gRPC status reaches the client intact. CLI code wraps via `errors.Wrap(err, "RotateDEK")` but `status.Code(err)` walks the Unwrap chain and still recovers the status, so the CLI-side error rendering is unaffected.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin -run 'TestEncryptionAdmin|TestEncryptionMain|TestRunEncryption': PASS
- golangci-lint run on touched paths: 0 issues
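
The error mapping listed in this commit message might look roughly like the sketch below. It is stdlib-only, so `Code` is a local stand-in for gRPC's `codes.Code`, the three leadership sentinels are hypothetical stand-ins for the raftengine errors, and `proposeErrorCode` and `errQueueFull` are illustrative names rather than the PR's `proposeErrorToStatus`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Code is a local stand-in for google.golang.org/grpc/codes.Code so the
// sketch compiles with the standard library alone.
type Code string

const (
	FailedPrecondition Code = "FailedPrecondition"
	Canceled           Code = "Canceled"
	DeadlineExceeded   Code = "DeadlineExceeded"
	Unavailable        Code = "Unavailable"
)

// Hypothetical sentinels standing in for the raftengine errors named in
// the commit message.
var (
	ErrNotLeader                    = errors.New("not leader")
	ErrLeadershipLost               = errors.New("leadership lost")
	ErrLeadershipTransferInProgress = errors.New("leadership transfer in progress")

	// errQueueFull is an invented example of an "everything else" error.
	errQueueFull = errors.New("propose queue full")
)

// proposeErrorCode classifies a non-nil Propose error. errors.Is walks
// the wrap chain, so wrapped sentinels are still recognised.
func proposeErrorCode(err error) Code {
	switch {
	case errors.Is(err, ErrNotLeader),
		errors.Is(err, ErrLeadershipLost),
		errors.Is(err, ErrLeadershipTransferInProgress):
		return FailedPrecondition
	case errors.Is(err, context.Canceled):
		return Canceled
	case errors.Is(err, context.DeadlineExceeded):
		return DeadlineExceeded
	default:
		return Unavailable
	}
}

func main() {
	wrapped := fmt.Errorf("encryption: propose 0x05: %w", ErrLeadershipLost)
	fmt.Println(proposeErrorCode(wrapped))
	fmt.Println(proposeErrorCode(context.DeadlineExceeded))
	fmt.Println(proposeErrorCode(errQueueFull))
}
```

Mapping the leadership sentinels to the same code as the up-front leader check lets a client use one retry branch for both cases.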
@claude review
Claude finished @bootjp's task in 2m 47s
PR Review: feat(encryption): Stage 5 PR-B
Todo list:

Summary
This is a well-structured PR. The architecture is sound, the Codex P1 issues cited in the automated review have both been addressed in the follow-up commit

Codex P1 status: both resolved ✅
P1-A: gRPC status for proposer-side errors: resolved by
P1-B: zero

Issues found
P2:
Suggestion for Stage 6: add a validation step on
P2:
    if len(writers) != 1 {
        return nil, grpcStatusErrorf(codes.InvalidArgument,
            "encryption: RegisterEncryptionWriter requires exactly one writer in PR-B, got %d (use BootstrapEncryption for multi-writer batches)",
            len(writers))
    }

When

    switch len(writers) {
    case 0:
        return nil, grpcStatusError(codes.InvalidArgument,
            "encryption: RegisterEncryptionWriter requires exactly one writer, got 0")
    case 1:
        // ok
    default:
        return nil, grpcStatusErrorf(codes.InvalidArgument,
            "encryption: RegisterEncryptionWriter requires exactly one writer in PR-B, got %d (use BootstrapEncryption for multi-writer batches)",
            len(writers))
    }

P3: CLI missing zero-value guard for
P3:
Observations (non-blocking)
Verdict
Approve with minor notes. The two Codex P1s are fully addressed, input validation is thorough, and the wire-format round-trip tests lock the critical invariant. The P2 issues (nil leaderView bypass, empty-writers message) are worth addressing before Stage 6 flips the cluster flag; the P3 items can go in as follow-ups.
Code Review
This pull request implements the Stage 5B foundation for encryption administration, specifically wiring the RotateDEK and RegisterEncryptionWriter RPCs as leader-only proposers. It introduces the necessary raftengine.Proposer and LeaderView dependencies to the EncryptionAdminServer, adds CLI subcommands for these operations, and includes comprehensive unit tests. Feedback was provided regarding error handling in proposeEncryptionEntry, where engine errors like context cancellation or leadership changes should be mapped to specific gRPC status codes to improve client-side error handling.
    // proposeEncryptionEntry prepends the §11.3 opcode tag to a
    // fsmwire-encoded body and submits the resulting Raft entry. The
    // Stage 4 FSM apply path peels the tag, dispatches into
    // applyEncryption, and Halt-Applies on any decode failure — this
    // helper is just the byte-level glue between the server-side
    // encoder and raftengine.Proposer.
The current error handling in proposeEncryptionEntry wraps all engine errors using pkgerrors.Wrapf, which results in a codes.Unknown status for the gRPC client.
Specifically, context.Canceled and context.DeadlineExceeded should be mapped to their respective gRPC status codes (codes.Canceled and codes.DeadlineExceeded). Additionally, if the engine returns raftengine.ErrNotLeader, we should attempt to provide the leader hint by calling requireLeader() before returning the error.
    res, err := s.proposer.Propose(ctx, entry)
    if err != nil {
        if errors.Is(err, context.Canceled) {
            return 0, grpcStatusError(codes.Canceled, "encryption: propose canceled")
        }
        if errors.Is(err, context.DeadlineExceeded) {
            return 0, grpcStatusError(codes.DeadlineExceeded, "encryption: propose timeout")
        }
        if errors.Is(err, raftengine.ErrNotLeader) {
            if lErr := s.requireLeader(); lErr != nil {
                return 0, lErr
            }
        }
        // Raft-engine errors are operator-visible diagnostics:
        // not-leader / context-canceled / propose-queue full all
        // surface here. We surface the engine's error rather
        // than rewriting it so the operator can grep against
        // the engine's own logs.
        return 0, pkgerrors.Wrapf(err, "encryption: propose 0x%02x", opcode)
    }

…e + table test)

claude[bot] round-1 said "Approve with minor notes" after confirming both codex P1 fixes resolved. This commit picks up the two P2/P3 items that are cheap to land before Stage 6 flips the cluster flag; the third (nil-leaderView startup enforcement) is correctly Stage 6 territory and is intentionally deferred.

## adapter/encryption_admin.go

- RegisterEncryptionWriter switches the writers-length check from a single != 1 branch to a three-arm switch so the zero-length case gets a "got 0" message instead of misrouting the operator to BootstrapEncryption. The >1 case keeps its existing multi-writer hint.

## adapter/encryption_admin_test.go

- New TestEncryptionAdmin_RegisterEncryptionWriter_RejectsBadInputs is the table-driven twin of the existing RotateDEK table test. Covers zero dek_id, local_epoch > 0xFFFF, zero full_node_id, and empty writers, all returning InvalidArgument.
- New TestEncryptionAdmin_RegisterEncryptionWriter_EmptyWritersMessage pins the message-routing fix: an empty writers list must surface "got 0" and must NOT mention BootstrapEncryption.

## Deferred (Stage 6)

claude[bot]'s P2 #1 (`requireLeader` nil bypass) is a production-wiring concern. The current pattern (nil = always leader) is intentional for unit tests, and claude[bot] explicitly defers the startup-time assertion to Stage 6. The production wiring in main.go will gate the gRPC-server registration behind `--encryption-enabled`, at which point a nil leaderView is a config bug to fail closed on.

## Verification

- go test -race -timeout=60s on ./adapter and ./cmd/elastickv-admin (Stage 5 scope): PASS
- golangci-lint run on touched paths: 0 issues
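
The three-arm length check and the bad-input table described in this commit can be illustrated with a small stdlib-only sketch. The `writer` struct, its field types, the `uint64` dek_id, and `validateRegisterWriter` are assumptions made for illustration; `errInvalidArgument` stands in for a gRPC `codes.InvalidArgument` status, and the message prefix differs from the real one.

```go
package main

import (
	"errors"
	"fmt"
)

// writer mirrors the two request fields the PR validates; the field
// types are assumptions.
type writer struct {
	FullNodeID uint64
	LocalEpoch uint32
}

// errInvalidArgument stands in for a gRPC codes.InvalidArgument status.
var errInvalidArgument = errors.New("InvalidArgument")

// validateRegisterWriter is a hypothetical boundary check: exactly one
// writer, non-zero ids, and a local epoch that fits in 16 bits.
func validateRegisterWriter(dekID uint64, writers []writer) error {
	switch len(writers) {
	case 0:
		// No BootstrapEncryption hint here: an empty list is not a batch.
		return fmt.Errorf("%w: requires exactly one writer, got 0", errInvalidArgument)
	case 1:
		// ok
	default:
		return fmt.Errorf("%w: requires exactly one writer, got %d (use BootstrapEncryption for multi-writer batches)",
			errInvalidArgument, len(writers))
	}
	w := writers[0]
	switch {
	case dekID == 0:
		return fmt.Errorf("%w: dek_id must be non-zero", errInvalidArgument)
	case w.FullNodeID == 0:
		return fmt.Errorf("%w: full_node_id must be non-zero", errInvalidArgument)
	case w.LocalEpoch > 0xFFFF:
		return fmt.Errorf("%w: local_epoch exceeds 0xFFFF", errInvalidArgument)
	}
	return nil
}

func main() {
	cases := []struct {
		name    string
		dekID   uint64
		writers []writer
	}{
		{"valid", 1, []writer{{FullNodeID: 9, LocalEpoch: 1}}},
		{"empty writers", 1, nil},
		{"batch", 1, []writer{{FullNodeID: 9}, {FullNodeID: 10}}},
		{"zero dek_id", 0, []writer{{FullNodeID: 9}}},
		{"zero full_node_id", 1, []writer{{LocalEpoch: 1}}},
		{"epoch overflow", 1, []writer{{FullNodeID: 9, LocalEpoch: 0x10000}}},
	}
	for _, c := range cases {
		fmt.Printf("%-18s -> %v\n", c.name, validateRegisterWriter(c.dekID, c.writers))
	}
}
```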
@claude review
Claude finished @bootjp's task in 2m 7s
PR Review: feat(encryption): Stage 5 PR-B
Todo list:

Previous review findings: status
Codex P1-A (propose error mapping): ✅ fully resolved.
Codex P1-B (zero
Claude round-1 P2 (empty-writers message): ✅ resolved in
Claude round-1 P3 (

New findings
P2: Stale docstring on
The current docstring says:
But PR-B has already landed that guard (
P2: Stale code comment in
The comment says:
But
P2 (carry-forward from round-1):
This is intentionally deferred here (the server is production-inert until Stage 6), but it should be recorded as a Stage 6 TODO.

Observations (non-blocking)
No

Verdict
Approve with two P2 comment fixes before merge. The architecture is sound, all round-1 and round-2 findings are addressed, leader-only gating is correctly enforced, and the wire-format round-trip tests lock the critical
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 72c174a1c4
    if s.leaderView.State() == raftengine.StateLeader {
        return nil
Verify quorum before accepting leader-only encryption RPCs
requireLeader returns success solely from State()==StateLeader, but that state can remain true on a partitioned former leader until it observes step-down; during that window ResyncSidecar will serve sidecar data without any Raft proposal or linearizable-read barrier. A follower resyncing against that stale node can overwrite its local sidecar with an outdated DEK set and miss recent rotations. Please add a real leadership confirmation (LeaderView.VerifyLeader/linearizable check) on the request path before treating the node as authoritative.
… claude[bot] P2 docstrings

## codex P1: VerifyLeader quorum confirmation in requireLeader

Codex round-2 finding: requireLeader returned success purely from State() == StateLeader. A partitioned former leader's local State() can still report StateLeader until the engine observes step-down. During that window:

- RotateDEK / RegisterEncryptionWriter would Propose() against a Raft engine that no longer has a quorum; the proposal queues and either fails late or applies under a new leader's term.
- ResyncSidecar would ship the local sidecar's wrapped-DEK set to a recovering follower. The set may be missing recent rotations committed under a newer leader, silently overwriting the follower's freshly-recovered keys with a stale snapshot.

Fix: requireLeader now takes ctx and calls LeaderView.VerifyLeader after the fast-path State() check. VerifyLeader does a ReadIndex round-trip; a partitioned leader cannot complete it. The status detail surfaces the engine error so the operator can see the quorum-loss reason in the CLI output.

This is a fail-closed semantic change for requireLeader.

Caller audit: 3 sites (RotateDEK, RegisterEncryptionWriter, ResyncSidecar). All are gRPC handlers and all have a ctx in scope. All three updated to pass ctx. No external callers.

Regression tests:

- TestEncryptionAdmin_RotateDEK_RejectsStaleLeader: stubLeaderView with State=Leader but VerifyLeader=error; expects FailedPrecondition with "VerifyLeader" in the message.
- TestEncryptionAdmin_ResyncSidecar_RejectsStaleLeader: the read-only twin; ResyncSidecar has no Propose() to catch leadership loss, so VerifyLeader is the only defence.

stubLeaderView gains a verifyErr field to simulate the partitioned-leader case; existing happy-path tests use the zero value (verifyErr=nil) and continue to pass.

## claude[bot] P2: stale docstrings

claude[bot] round-2 flagged two outdated comments:

- adapter/encryption_admin.go ResyncSidecar docstring said "PR-A serves this from the local sidecar without leadership verification... PR-B will add a leader-only guard". PR-B has already added it; the docstring now documents the actual behaviour (VerifyLeader-based quorum confirmation + the §5.5 rationale for the strict guard).
- cmd/elastickv-admin/encryption.go encryptionMain comment listed "PR-B adds bootstrap / rotate-dek / register-writer / resync-sidecar". bootstrap is deferred to PR-C and ResyncSidecar has no CLI surface. Comment now reflects the actual subcommand matrix.

## Deferred

claude[bot]'s carry-forward P2 (nil leaderView startup enforcement) remains a Stage 6 concern; the production wiring will gate the server registration on --encryption-enabled and a nil leaderView at that point is a config bug to fail closed on.

## Verification

- go test -race -timeout=60s on ./adapter and ./cmd/elastickv-admin: PASS (includes 2 new stale-leader regression tests)
- golangci-lint run on touched paths: 0 issues
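
The two-step leader check described in this commit (fast-path `State()`, then a quorum-confirming `VerifyLeader`) can be sketched as below. The `LeaderView` interface here is a hypothetical two-method reduction of `raftengine.LeaderView`, `errFailedPrecondition` stands in for a gRPC status, and the stub mirrors the `stubLeaderView` with a `verifyErr` field that the commit message mentions. None of this is the PR's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// State is a stand-in for the engine's role enum.
type State int

const (
	StateFollower State = iota
	StateLeader
)

// LeaderView is a hypothetical reduction of raftengine.LeaderView to the
// two methods this check needs.
type LeaderView interface {
	State() State
	VerifyLeader(ctx context.Context) error
}

// errFailedPrecondition stands in for a gRPC codes.FailedPrecondition status.
var errFailedPrecondition = errors.New("FailedPrecondition")

// errQuorumUnreachable is an invented error for the partitioned-leader case.
var errQuorumUnreachable = errors.New("quorum unreachable")

// requireLeader fails closed: local state alone is not trusted, because a
// partitioned former leader can still report StateLeader.
func requireLeader(ctx context.Context, lv LeaderView) error {
	if lv == nil {
		return nil // test affordance from the PR: nil view means always-leader
	}
	if lv.State() != StateLeader {
		return fmt.Errorf("%w: not leader", errFailedPrecondition)
	}
	if err := lv.VerifyLeader(ctx); err != nil {
		return fmt.Errorf("%w: VerifyLeader failed, refusing to act on stale-leader state: %v",
			errFailedPrecondition, err)
	}
	return nil
}

// stubLeaderView simulates a node whose local state and quorum check can
// disagree.
type stubLeaderView struct {
	state     State
	verifyErr error
}

func (s stubLeaderView) State() State                       { return s.state }
func (s stubLeaderView) VerifyLeader(context.Context) error { return s.verifyErr }

func main() {
	ctx := context.Background()
	healthy := stubLeaderView{state: StateLeader}
	stale := stubLeaderView{state: StateLeader, verifyErr: errQuorumUnreachable}
	follower := stubLeaderView{state: StateFollower}

	fmt.Println("healthy: ", requireLeader(ctx, healthy))
	fmt.Println("stale:   ", requireLeader(ctx, stale))
	fmt.Println("follower:", requireLeader(ctx, follower))
}
```

The stale case is the one the read-only ResyncSidecar path depends on, since it has no later Propose call that could catch the leadership loss.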
@claude review
Claude finished @bootjp's task in 2m 47s
PR Review: feat(encryption): Stage 5 PR-B (Round 3)
Todo list:

Previous round findings: status
Claude round-2 P2 (stale ResyncSidecar docstring): ✅ resolved.
Claude round-2 P2 (stale dispatcher comment): ✅ resolved.
Codex round-2 P1 (VerifyLeader quorum guard): ✅ fully resolved.
Both mutating RPCs (
Regression tests:
Claude round-2 P2 carry-forward (nil leaderView startup enforcement): explicitly deferred to Stage 6 in the commit message. The docstring at

New findings
P3:
The docstring says:
PR-B landed without implementing the keystore fast-path; the keystore field is stored but never read anywhere in the current implementation. The comment is now a stale intent reference. Non-blocking; can be updated in a follow-up cleanup commit or at Stage 7 when the keystore is actually wired.
P3:
P3:
The comment says TLS / token auth for the CLI surface "is deferred to PR-B for the CLI surface so the initial PR stays scoped to read-only status." PR-B has now landed and TLS is still not implemented (it's a Stage 6 concern). Minor stale reference. Non-blocking.

Observations (unchanged from round-2, non-blocking)

Verdict
Approve. All round-2 and codex round-2 issues are fully addressed:
The three P3 items above (stale keystore docstring, happy-path test gap, stale TLS comment) can go in as follow-ups. The wire-format round-trip tests, the stale-leader regression tests, and
claude[bot] round-3 said "Approve." with three P3 follow-ups —
all stale-comment / coverage-gap items. None block correctness;
landing them now keeps the PR fully clean before merge.
## adapter/encryption_admin.go
- `WithEncryptionAdminKeystore` docstring updated. The previous
"PR-B will use it to fast-path the RotateDEK pre-check" was
never realised; the keystore is currently set-but-unread.
Comment now points to Stage 7 (writer registry) as the actual
consumer.
## cmd/elastickv-admin/encryption.go
- `newEncryptionEndpointFlags` docstring updated. The previous
"deferred to PR-B for the CLI surface" was misleading after
PR-B shipped without TLS — TLS is a Stage 6 concern. Comment
now reads "shares the existing flags on the HTTP surface and
is wired in Stage 6 alongside the --encryption-enabled gate".
## adapter/encryption_admin_test.go
- `TestEncryptionAdmin_ResyncSidecar_ShipsWrappedDEKs` now wires a
`stubLeaderView{state: StateLeader}` so the happy path
exercises `State()==StateLeader && VerifyLeader()==nil`
end-to-end. Previously it short-circuited through the
nil-leaderView test escape hatch and never exercised the
VerifyLeader call site on the success path.
## Skipped (claude[bot] style observations)
- `requireUint16Plus1` rename — non-blocking and cosmetic.
- `localEpochMask` const placement — non-blocking and cosmetic.
## Verification
- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS
- golangci-lint run on touched paths: 0 issues
@claude review
Claude finished @bootjp's task in 1m 48s
PR Review: feat(encryption): Stage 5 PR-B (Round 4)
Todo list:

Round-3 P3 items: all addressed ✅
Stale keystore docstring (
ResyncSidecar happy-path test gap (
Stale TLS comment in

New finding
P3:

    // dialEncryption opens a gRPC client. The context argument is
    // reserved for the PR-B auth path (TLS handshake + token attach);

This is the same pattern as the

Observations (carry-forward, non-blocking)

Verdict
Approve. All round-3 findings are cleanly addressed. The two explicitly-skipped cosmetic items are a deliberate decision correctly documented in the commit message. The one new P3 item (stale
The PR is in excellent shape: P1s resolved (proposer error mapping, zero node_id), P2s resolved (empty-writers message, VerifyLeader quorum guard, stale docstrings), wire-format round-trip tests lock the critical
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83e4c49726
    if err := s.leaderView.VerifyLeader(ctx); err != nil {
        return grpcStatusErrorf(codes.FailedPrecondition,
            "encryption: VerifyLeader failed, refusing to act on stale-leader state: %v", err)
Preserve context error codes from VerifyLeader
requireLeader currently converts every VerifyLeader failure into codes.FailedPrecondition, including context.Canceled and context.DeadlineExceeded. In slow or partitioned clusters, a client timeout during the ReadIndex check will therefore be reported as “not leader” instead of cancellation/timeout, which breaks retry semantics and makes callers treat transport timing issues as leadership errors.
…+ claude[bot] P3 stale-comment cleanup

## codex P1: preserve context codes from VerifyLeader

Codex round-4 finding: requireLeader mapped every VerifyLeader failure to codes.FailedPrecondition, including context.Canceled and context.DeadlineExceeded. A client whose ctx cancels or times out during the Raft ReadIndex round-trip would see "not leader", which collides with the genuine leadership-rejection branch and breaks transport-timing retry semantics in slow or partitioned clusters.

Fix: new verifyLeaderErrorToStatus helper mirrors the existing proposeErrorToStatus pattern. context.Canceled → codes.Canceled, context.DeadlineExceeded → codes.DeadlineExceeded, everything else → codes.FailedPrecondition (the stale-leader / partition case).

Caller audit: verifyLeaderErrorToStatus is called only from requireLeader. requireLeader has 3 callers (RotateDEK, RegisterEncryptionWriter, ResyncSidecar), all gRPC handlers, all propagate the status via `return err` without rewrap. CLI uses errors.Wrap; status.Code walks the Unwrap chain. No silent behaviour change downstream.

Regression test:

- TestEncryptionAdmin_RotateDEK_VerifyLeader_PreservesContextCodes is a two-case table (context.Canceled, context.DeadlineExceeded) that pins the new mapping.

## claude[bot] P3: two more stale comments

- dialEncryption (`cmd/elastickv-admin/encryption.go:78`): "PR-B auth path" updated to "Stage 6 auth path" so the comment matches the just-fixed newEncryptionEndpointFlags comment.
- TestEncryptionAdmin_GetCapability_Bootstrapped (`adapter/encryption_admin_test.go:116`): the "PR-A always reports local_epoch=0" label is generalised to "Stage 5 reports... pre-Stage-7" since both PR-A and PR-B observe the invariant.

## Skipped (claude[bot] non-blocking observations)

- requireUint16Plus1 rename: cosmetic only.
- localEpochMask const placement: cosmetic only.

Both items are explicitly carried forward and the commit message records the deferral rationale.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS (includes 1 new VerifyLeader context-code regression test)
- golangci-lint run on touched paths: 0 issues
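
To see why the distinction matters end to end, the sketch below runs a simulated leadership check under a real deadline and classifies the result. It is stdlib-only and illustrative: `Code` is a stand-in for gRPC's `codes.Code`, and `verifyLeaderErrorCode`, `probe`, `slowVerify`, `staleVerify`, and `canceledVerify` are hypothetical names, not the PR's `verifyLeaderErrorToStatus` or any raftengine API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Code is a local stand-in for google.golang.org/grpc/codes.Code.
type Code string

const (
	Canceled           Code = "Canceled"
	DeadlineExceeded   Code = "DeadlineExceeded"
	FailedPrecondition Code = "FailedPrecondition"
)

// verifyLeaderErrorCode keeps context errors distinct from genuine
// leadership rejections so clients can tell a timeout from "not leader".
func verifyLeaderErrorCode(err error) Code {
	switch {
	case errors.Is(err, context.Canceled):
		return Canceled
	case errors.Is(err, context.DeadlineExceeded):
		return DeadlineExceeded
	default:
		return FailedPrecondition
	}
}

var errQuorumLost = errors.New("quorum lost")

// slowVerify simulates a ReadIndex round-trip that never completes: it
// blocks until the caller's context ends and returns the wrapped cause.
func slowVerify(ctx context.Context) error {
	<-ctx.Done()
	return fmt.Errorf("read index: %w", ctx.Err())
}

// staleVerify simulates a partitioned former leader.
func staleVerify(context.Context) error { return errQuorumLost }

// canceledVerify simulates a client that hung up mid-check.
func canceledVerify(context.Context) error { return context.Canceled }

// probe runs verify under a deadline and classifies the outcome.
func probe(timeout time.Duration, verify func(context.Context) error) Code {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return verifyLeaderErrorCode(verify(ctx))
}

func main() {
	fmt.Println(probe(10*time.Millisecond, slowVerify))     // prints "DeadlineExceeded"
	fmt.Println(probe(10*time.Millisecond, staleVerify))    // prints "FailedPrecondition"
	fmt.Println(probe(10*time.Millisecond, canceledVerify)) // prints "Canceled"
}
```

Because the classifier uses `errors.Is`, the context error is recognised even when the engine wraps it with its own message, as `slowVerify` does here.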
@claude review
Claude finished @bootjp's task in 2m 24s
PR Review: feat(encryption): Stage 5 PR-B (Round 5)
Todo list:

Round-4 findings: all resolved ✅
Codex round-4 P1 (

    case errors.Is(err, context.Canceled):
        return grpcStatusErrorf(codes.Canceled, ...)
    case errors.Is(err, context.DeadlineExceeded):
        return grpcStatusErrorf(codes.DeadlineExceeded, ...)
    default:
        return grpcStatusErrorf(codes.FailedPrecondition, ...)

Context timeouts during the ReadIndex round-trip now surface as
Claude round-4 P3 (stale

New findings
P3: No
Adding
P3: PR-stage archaeological references in code docstrings

Observations (carry-forward, non-blocking, explicitly deferred in round-4 commit)

Verdict
Approve. All round-4 findings are resolved. The codex P1 (VerifyLeader context codes) is cleanly handled via
The wire-format round-trip tests, the stale-leader regression tests,
claude[bot] round-5 verdict was "Approve" with one new P3 noting that prior commits accumulated PR-stage / round-number references in code comments (e.g. "Codex P1 round-4 finding on PR #756"). CLAUDE.md says:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

This commit removes the PR/round provenance from every code and test comment that picked it up across rounds 1-5, while preserving the invariant-level WHY (which IS load-bearing for future readers).

Files touched:

- adapter/encryption_admin.go (2 sites: requireLeader, verifyLeaderErrorToStatus)
- adapter/encryption_admin_test.go (6 sites: regression-test docstrings)
- cmd/elastickv-admin/encryption_test.go (2 sites: help-flag test, non-precondition propagation test)

No behaviour change. No tests changed structurally, only docstrings.

## Skipped (claude[bot] non-blocking observations)

- ResyncSidecar context-code twin test: correctness guaranteed by the shared verifyLeaderErrorToStatus helper that the existing RotateDEK test exercises.
- requireUint16Plus1 rename: cosmetic carry-forward.
- localEpochMask const placement: cosmetic carry-forward.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS
- golangci-lint run on touched paths: 0 issues
@claude review
Claude finished @bootjp's task in 2m 48s —— View job PR Review — feat(encryption): Stage 5 PR-B (Round 6)Todo list:
Round-5 P3 items — status
No New findingsP3 — PR-stage labels in two operator-visible error strings The round-6 sweep correctly removed internal review-tracking labels from docstrings, but two user-facing error strings still contain "PR-B":
return errors.Errorf("encryption: unknown subcommand %q (PR-B supports: status, rotate-dek, register-writer)", sub)An operator who types an unknown subcommand sees the text "PR-B supports:…" — the PR-stage label is meaningless outside the delivery team. Suggest: return errors.Errorf("encryption: unknown subcommand %q (supported: status, rotate-dek, register-writer)", sub)
"encryption: RegisterEncryptionWriter requires exactly one writer in PR-B, got %d (use BootstrapEncryption for multi-writer batches)"The "in PR-B" is an internal delivery note embedded in a gRPC status error that operators see. Suggest dropping it: "encryption: RegisterEncryptionWriter requires exactly one writer, got %d (use BootstrapEncryption for multi-writer batches)"Observations (carry-forward, explicitly deferred)
Verdict

Approve. The round-6 commit cleanly strips the internal review-tracking artifacts from

All P1/P2 items from prior rounds remain closed:
…visible errors

claude[bot] round-6 verdict was "Approve" with two P3 follow-ups: operator-visible error strings still carried internal PR-stage labels that don't belong in user-facing output.

## adapter/encryption_admin.go

RegisterEncryptionWriter multi-writer error dropped "in PR-B"; the gRPC InvalidArgument status now reads "requires exactly one writer, got %d (use BootstrapEncryption for multi-writer batches)". Same content, no internal delivery note leaking out to the operator.

## cmd/elastickv-admin/encryption.go

Unknown-subcommand CLI error dropped "PR-B supports:" in favor of "supported:". Same list, no delivery-tracking label.

## Skipped

claude[bot] flagged the PR-A/PR-B/PR-C roadmap references in function-level docstrings as a separate concern from the review-tracking artifacts removed in round-6. These are load-bearing scope markers for current readers of the in-flight stack and will be cleaned up as a single sweep once Stage 5 fully ships (PR-C lands).

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS
- golangci-lint run on touched paths: 0 issues

@claude review

Claude finished @bootjp's task in 2m 31s.

PR Review: feat(encryption): Stage 5 PR-B (Round 7)

Round-6 P3 items — both resolved ✅
`return errors.Errorf("encryption: unknown subcommand %q (supported: status, rotate-dek, register-writer)", sub)`

"PR-B supports:" is gone; operators see "supported:" without the delivery-tracking label.

`"encryption: RegisterEncryptionWriter requires exactly one writer, got %d (use BootstrapEncryption for multi-writer batches)"`

The round-7 commit is a clean, minimal 2-line change (one per file) that precisely targets the two flagged strings and nothing else. No regressions visible.

Carry-forward status

PR-A/PR-B/PR-C roadmap refs in function docstrings: explicitly preserved per the round-7 commit message rationale ("load-bearing scope markers for current readers of the in-flight stack, cleaned up as a single sweep once Stage 5 fully ships"). Reasonable decision.

Verdict

Approve. All P1/P2 findings from every prior round are resolved. The two round-6 P3 items (operator-visible error strings with PR-stage labels) are cleanly addressed in

Complete P-level history for this PR:

This is a solid foundation for Stage 6.
Summary
Stage 5 PR-B of the data-at-rest encryption rollout (design doc: `docs/design/2026_04_29_partial_data_at_rest_encryption.md`, §5.2 rotation, §4.1 writer registry, §6.1 admin.go service, §6.6 admin commands). Wires two of the three §11.3 mutating opcodes through `raftengine.Proposer` on top of Stage 4's fsmwire body encoders.

Production-inert: `main.go` does not yet register the `EncryptionAdminServer` on its gRPC listener, and `WithEncryptionAdminProposer` is unset by default, so every mutating RPC short-circuits to `FailedPrecondition` with "proposer not configured". Stage 6 flips this on under the `--encryption-enabled` cluster flag; PR-C lands `BootstrapEncryption` + §5.6 step 1a capability fan-out.
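The production-inert guard and the opcode-tag prepend described in the commit message can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's code: the `Proposer` interface here is a hypothetical stand-in for `raftengine.Proposer` (whose real signature is not shown in this thread), a plain error stands in for the gRPC `FailedPrecondition` status, and `proposeEntry` and `recordingProposer` are names invented for the example. Only the `0x05` tag value comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// Proposer is a hypothetical stand-in for raftengine.Proposer; the real
// interface's method set is an assumption made for this sketch.
type Proposer interface {
	Propose(entry []byte) (commitIndex uint64, err error)
}

// errProposerNotConfigured stands in for the gRPC FailedPrecondition status
// carrying "proposer not configured".
var errProposerNotConfigured = errors.New("failed precondition: proposer not configured")

// opRotation is the opcode tag the commit message gives for rotation entries.
const opRotation byte = 0x05

// proposeEntry rejects a nil proposer before building any bytes, then
// prepends the opcode tag to the already-encoded body and proposes it.
func proposeEntry(p Proposer, opcode byte, body []byte) (uint64, error) {
	if p == nil {
		return 0, errProposerNotConfigured
	}
	entry := make([]byte, 0, len(body)+1)
	entry = append(entry, opcode)
	entry = append(entry, body...)
	return p.Propose(entry)
}

// recordingProposer is a test double that keeps a copy of the last entry
// and returns a monotonically increasing commit index.
type recordingProposer struct {
	last  []byte
	index uint64
}

func (r *recordingProposer) Propose(entry []byte) (uint64, error) {
	r.last = append([]byte(nil), entry...)
	r.index++
	return r.index, nil
}

func main() {
	// Default wiring: no proposer, so the mutating path short-circuits.
	_, err := proposeEntry(nil, opRotation, []byte{0xAA})
	fmt.Println("nil proposer:", err)

	// With a proposer configured, the entry is tag byte followed by body.
	rec := &recordingProposer{}
	idx, err := proposeEntry(rec, opRotation, []byte{0xAA, 0xBB})
	fmt.Printf("applied_index=%d entry=%x err=%v\n", idx, rec.last, err)
}
```

Keeping the tag prepend in a single helper means both mutators share one definition of the wire layout, which is the property the commit message attributes to `proposeEncryptionEntry`.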
What lands
- `RotateDEK` + `RegisterEncryptionWriter` server-side implementations on top of `raftengine.Proposer`. Leader-only, with the leader id+address embedded in `FailedPrecondition` status detail.
- `ResyncSidecar` is now leader-gated (it was always-on in PR-A); §5.5 recovery only consults the leader's record.
- `WithEncryptionAdminProposer` / `WithEncryptionAdminLeaderView` options.
- `elastickv-admin encryption rotate-dek` and `elastickv-admin encryption register-writer` subcommands. The wrapped DEK material is taken as base64.
- open.
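The input checks named in this PR (non-zero `new_dek_id`, non-empty wrapped DEK, `proposer_local_epoch <= 0xFFFF`, wrapped DEK supplied as base64) might look like the sketch below. Everything structural here is an assumption for illustration: `rotateArgs` and `validateRotate` are hypothetical names, the field types are guesses, the purpose check is omitted, and the standard padded base64 alphabet is assumed because the PR does not say which variant the CLI accepts.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// maxLocalEpoch is the 0xFFFF bound the PR applies to proposer_local_epoch.
const maxLocalEpoch = 0xFFFF

// rotateArgs is a hypothetical, simplified view of the rotate-dek inputs;
// field names and types are assumptions, not the PR's proto definition.
type rotateArgs struct {
	NewDEKID      uint32
	WrappedBase64 string
	LocalEpoch    uint64
}

// validateRotate applies the boundary checks and returns the decoded
// wrapped DEK bytes on success.
func validateRotate(a rotateArgs) ([]byte, error) {
	if a.NewDEKID == 0 {
		return nil, errors.New("new_dek_id must be non-zero")
	}
	if a.LocalEpoch > maxLocalEpoch {
		return nil, fmt.Errorf("proposer_local_epoch %d exceeds 0xFFFF", a.LocalEpoch)
	}
	// Assumption: standard padded base64; the real CLI may use another variant.
	wrapped, err := base64.StdEncoding.DecodeString(a.WrappedBase64)
	if err != nil {
		return nil, fmt.Errorf("wrapped_new_dek is not valid base64: %w", err)
	}
	if len(wrapped) == 0 {
		return nil, errors.New("wrapped_new_dek must be non-empty")
	}
	return wrapped, nil
}

func main() {
	encoded := base64.StdEncoding.EncodeToString([]byte{0xDE, 0xAD, 0xBE, 0xEF})

	// A well-formed request passes and yields the raw wrapped bytes.
	wrapped, err := validateRotate(rotateArgs{NewDEKID: 7, WrappedBase64: encoded, LocalEpoch: 1})
	fmt.Printf("wrapped=%x err=%v\n", wrapped, err)

	// 70000 is the out-of-range epoch the CLI test plan exercises.
	_, err = validateRotate(rotateArgs{NewDEKID: 7, WrappedBase64: encoded, LocalEpoch: 70000})
	fmt.Println("epoch 70000:", err)
}
```

Running the same checks at parse time in the CLI and again at the gRPC boundary would let a malformed request fail before any Raft proposal is attempted; whether the PR shares one helper for both is not stated.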
Out of scope (PR-C / Stage 6)
- `BootstrapEncryption` mutator + `bootstrap` CLI.
- `main.go` registers `EncryptionAdminServer` on the gRPC listener (Stage 6, under the cluster flag).
Test plan
- `go test -race -timeout=60s ./adapter -run TestEncryptionAdmin`: 14 cases including:
  - `fsmwire.DecodeRotation`).
  - `new_dek_id`, empty wrapped, unspecified purpose, `local_epoch > 0xFFFF`).
  - `fsmwire.DecodeRegistration`).
  - `MutatingRPCs_RejectWithoutProposer` pins the production-inert guarantee.
- `go test -race -timeout=60s ./cmd/elastickv-admin -run TestRunEncryption`: CLI happy path through a real gRPC loopback server, plus the parse-time validation (`--purpose=junk`, `--proposer-local-epoch=70000`).
- `golangci-lint run` on touched paths: 0 issues.

Self-review (CLAUDE.md 5 passes)
- return `FailedPrecondition` before the proposer is touched. A nil proposer rejects with `FailedPrecondition`. Stage 4's `HaltApply` seam already covers the apply-side failure mode.
- proposer is set once at construction. `raftengine.Proposer`'s contract serialises proposals through the engine's queue.
- operator-facing and rare.
- truth for the on-wire byte layout. Round-trip tests (`Encode` → `Decode`) lock the server's view against fsmwire's, so a future change to either side cannot silently desync.
- wire-layout round-trip, and CLI happy path are all covered. No-proposer = `FailedPrecondition` is pinned.

Notes

`libprotoc 29.6` (29.3 not installable on macOS via Homebrew); generated headers carry the v5.29.6 string vs the canonical v5.29.3, no wire-relevant difference. Same toolchain note as PR feat(encryption): Stage 5 PR-A — EncryptionAdmin proto + read-only RPCs + status CLI #754.