feat(encryption): Stage 6D-6a - EnableStorageEnvelope admin RPC + server method#812
Conversation
…ver method
Stage 6D-6 (the final 6D milestone) is sliced into 6D-6a/b/c so
each slice matches the per-PR review cadence of 6D-3/4/5. This
PR lands 6D-6a: the proto + server method + server-level unit
tests. 6D-6b adds the CLI; 6D-6c adds main.go production wiring
+ end-to-end integration test.
## Proto changes (§3.1)
- Add EnableStorageEnvelope RPC to the EncryptionAdmin service.
- Add EnableStorageEnvelopeRequest message (proposer_node_id,
proposer_local_epoch with the standard <= 0xFFFF bound).
- Add EnableStorageEnvelopeResponse message (applied_index,
capability_summary, cutover_index_unknown, was_already_active)
matching the §3.1 layout. Notably uses OK gRPC status + a
was_already_active boolean discriminator rather than
AlreadyExists — unary gRPC drops the response body on non-OK
status, so the §6.4 idempotency contract MUST live on the
success path.
- Add CapabilityVerdict message (full_node_id, encryption_capable,
build_sha, sidecar_present) for the response's per-member
membership-view summary.
- proto/Makefile gen target updated to include
encryption_admin.proto so future regenerations stay in sync.
## Server method (adapter/encryption_admin.go)
- New CapabilityFanoutFn closure type wired via the new
WithEncryptionAdminCapabilityFanout option. Nil leaves the
RPC short-circuited with FailedPrecondition — the 6D-6c
main.go wiring closes over the live route snapshot + DialFunc
+ timeout policy.
- EnableStorageEnvelope orchestrates the §3.2 server-side
sequence:
1. Leader gate (requireLeader, embeds leader hint in the
FailedPrecondition status detail).
2. Proposer + sidecar path configured.
3. Request validation (proposer_node_id != 0,
proposer_local_epoch <= 0xFFFF).
4. Sidecar read → bootstrap gate (Active.Storage != 0).
5. Idempotent-retry short-circuit (§6.4): if
StorageEnvelopeActive == true, return OK +
was_already_active=true + applied_index =
StorageEnvelopeCutoverIndex. The defensive
cutover_index_unknown branch fires when the cutover
index is 0 on an active sidecar (operationally
impossible under normal apply but hedged against
schema rollback / hand-edited sidecars).
6. Capability fan-out gate (s.capabilityFanout == nil →
refuse).
7. Run fan-out; any verdict with EncryptionCapable=false
or Reachable=false → FailedPrecondition naming the
refusing node.
8. Compose RotationPayload with SubTag =
RotateSubEnableStorageEnvelope, DEKID =
sidecar.Active.Storage, Purpose = PurposeStorage,
Wrapped = []byte{} (§2.1 constraint #2),
ProposerRegistration covering Active.Storage.
9. Propose through Raft via proposeEncryptionEntry.
10. Post-apply re-read discriminates:
- StorageEnvelopeActive=true (apply landed) → fresh
success.
- StorageEnvelopeActive=false (the 6D-4 applier
consumed the entry as a §2.1 #3 benign no-op
because a RotateDEK race advanced
sidecar.Active.Storage) → FailedPrecondition with
retry hint, NOT halt — the apply path deliberately
keeps the cluster up.
- Orchestration split into cutoverPrecheck / runCutoverFanout
/ proposeCutoverEntry / cutoverPostcheck helpers so the
outer method stays under the cyclop budget and so future
test refactors can pin each phase independently.
## Tests (adapter/encryption_admin_test.go)
12 new server-level tests:
- TestEncryptionAdmin_MutatingRPCs_RejectWithoutProposer
extended to cover EnableStorageEnvelope.
- _RejectsOnFollower - leader gate + hint embedding.
- _RejectsWithoutSidecarPath - sidecar dependency.
- _RejectsZeroProposerNodeID - §3.1 / §6.1 sentinel rule.
- _RejectsOversizedLocalEpoch - §3.1 / §4.1 16-bit bound.
- _RejectsNotBootstrapped - bootstrap gate with hint.
- _IdempotentRetry - sidecar already active → OK + was_already_active=true,
applied_index = original StorageEnvelopeCutoverIndex, NO fan-out,
NO propose, capability_summary empty.
- _DefensiveCutoverIndexUnknown - active sidecar with index=0 (hand-edit
hedge) → CutoverIndexUnknown=true, AppliedIndex falls back to
RaftAppliedIndex.
- _RejectsWithoutCapabilityFanout - the §4 pre-flight is mandatory.
- _RejectsOnCapabilityRefusal - any fan-out NO refuses + names the
refusing node.
- _RejectsOnFanoutError - fan-out helper itself erroring maps to
FailedPrecondition (not Internal).
- _HappyPath - end-to-end fresh cutover with simulated apply
effect; verifies the wire-level Raft entry decodes as
RotateSubEnableStorageEnvelope with the §2.1 constraints.
- _StaleDEKIDRace - the simulated apply does NOT flip the
sidecar field (because a RotateDEK raced); the post-apply
re-read takes the stale-DEKID branch and the RPC refuses
with the retry hint.
- _CapabilitySummaryProjection - per-row wire-shape pin for the
internal CapabilityVerdict → proto.CapabilityVerdict mapping.
New test helpers:
- fixedCapabilityFanout - deterministic fan-out fixture.
- applyingProposer - recordingProposer extension that simulates
the FSM apply effect on the sidecar so the post-Propose
re-read takes the right branch.
- applyCutover, applyStaleDEKIDRace - the two apply-effect
variants the §6.4 / §2.1 #3 paths require.
- cutoverReadySidecarFixture - the bootstrapped + pre-cutover
sidecar shape the propose-path tests share.
- allOKFanoutResult - the "fan-out approved" deterministic
fixture.
- assertFreshCutoverResponse, assertCutoverProposalShape,
assertProtoVerdict - shared assertion helpers.
## Caller audit
- The new RPC is purely additive: no existing method's signature
or semantics change.
- WithEncryptionAdminCapabilityFanout is a new option; existing
test fixtures that don't wire it leave s.capabilityFanout=nil,
which only affects EnableStorageEnvelope (refuses) - other
mutator RPCs are unaffected.
- proto.UnimplementedEncryptionAdminServer carries the
EnableStorageEnvelope stub returning codes.Unimplemented, so
test doubles that don't override stay unbroken.
## Self-review (5 lenses)
1. Data loss - no. The server method reads the sidecar, proposes
via Raft, then re-reads. The 6D-4 applier is the durable
writer; this PR only reads pre- and post-apply.
2. Concurrency / distributed failures - the propose-then-re-read
sequence relies on the apply being visible on the leader's
sidecar by the time Propose returns. The Raft proposer's
contract (proposeEncryptionEntry returns the commit index
after the FSM apply completes locally on the leader) is the
load-bearing invariant; if a future Proposer change made
Propose return before local apply, the re-read could see
stale state. The existing RotateDEK / BootstrapEncryption
paths rely on the same contract.
3. Performance - one extra sidecar read pre- and post-Propose
per cutover. Cutover is one-shot per cluster lifetime; the
read cost is negligible.
4. Data consistency - the §6.4 idempotency contract is preserved:
on retry, applied_index reports the ORIGINAL
StorageEnvelopeCutoverIndex (not the retry's Raft index).
The CapabilitySummary is intentionally empty on retries so a
caller cannot accidentally re-use the original cutover's
membership view. The §2.1 #3 stale-DEKID race is surfaced as
FailedPrecondition with a retry hint (not Aborted; the
design's §6.4 row pins FailedPrecondition).
5. Test coverage - 12 server-level tests across the §3.2
sequence + idempotency contract + race discrimination + wire
shape. The applyingProposer helper specifically pins the
§6.4 / §2.1 #3 discrimination logic, which is the central
correctness-critical branch.
## Verification
- go test -race -timeout 240s -run TestEncryptionAdmin_ ./adapter/... -
all green
- golangci-lint --new-from-rev=origin/main - 0 issues
Refs: design doc §3.1 / §3.2 / §6.4 (6D-6a slice of the §7
decomposition table).
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 41 minutes and 43 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughImplements the ChangesEnableStorageEnvelope RPC Cutover Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64a55a120b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err != nil { | ||
| return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, | ||
| "encryption: capability fan-out failed: %v", err) | ||
| } |
There was a problem hiding this comment.
Preserve context errors from capability fan-out
Map context.Canceled/context.DeadlineExceeded from s.capabilityFanout(ctx) to their matching gRPC codes instead of always wrapping as FailedPrecondition. As written, if the request deadline expires or the client cancels during the pre-flight probe, EnableStorageEnvelope reports a state/configuration failure, which misleads callers and breaks retry behavior that depends on Canceled/DeadlineExceeded semantics.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements the EnableStorageEnvelope RPC in the EncryptionAdmin service, facilitating the transition from cleartext to envelope encryption for storage writes. The implementation includes comprehensive server-side logic for leader verification, input validation, capability fan-out pre-flights, and idempotent retry handling. Additionally, it introduces new protobuf messages for the RPC, updates design documentation to reflect the progress of milestone 6D-6a, and adds extensive unit tests covering various success and failure scenarios, including race conditions with DEK rotation. I have no feedback to provide.
… capability fan-out)
Codex P2 finding on PR812:
Pre-fix: runCutoverFanout wrapped every err from
s.capabilityFanout(ctx) as FailedPrecondition. When the request
deadline expires or the client cancels mid-pre-flight, the
client received FailedPrecondition (a configuration-failure
signal) instead of Canceled / DeadlineExceeded (the transport-
layer cancellation signal). This breaks retry logic that
switches on the gRPC code.
Fix: new capabilityFanoutErrorToStatus helper maps
context.Canceled and context.DeadlineExceeded to their native
gRPC codes (codes.Canceled / codes.DeadlineExceeded) and
preserves the FailedPrecondition default for non-transport
errors (zero-member snapshot input validation, etc.). Symmetric
with the existing proposeErrorToStatus pattern in the same file.
Regression test:
TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellation
table-driven across {context.Canceled, context.DeadlineExceeded}
asserts the gRPC status code matches the native ctx error.
## Caller audit (semantic change)
The new error code mapping is semantically narrower than
before: errors that were previously FailedPrecondition now
surface as Canceled / DeadlineExceeded for transport errors,
and FailedPrecondition stays for everything else. Caller
impact:
- adapter/encryption_admin.go: runCutoverFanout is called only
from cutoverPostcheck which threads the returned error back
to gRPC. Now propagates the correct code.
- CLI (lands in 6D-6b) will see the correct code for retry
decisions.
- No other internal caller exists; the new helper is file-
scoped.
## Verification
- go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope
./adapter/... - all green
- golangci-lint --new-from-rev=origin/main - 0 issues
|
@claude review Round-1 (commit codex P2 — preserve context errors from capability fan-outPre-fix: Fix: new Regression test: Caller audit
Verification
|
|
@claude review |
|
Claude finished @bootjp's task in 4m 13s —— View job Code Review — PR #812: Stage 6D-6a EnableStorageEnvelope admin RPCScope: Overall assessmentThe implementation is clean, disciplined, and tightly coupled to the design doc. The §3.2 orchestration decomposition ( 1. Data lossNo concerns. The server is a pure propose-then-re-read; the 6D-4 applier is the only durable writer and is not touched here. 2. Concurrency / distributed failuresThe load-bearing invariant — Informational: 3. PerformanceThe cutover is one-shot per cluster lifetime; two sidecar reads per call is negligible. No hot-path impact. 4. Data consistencyThe
Informational — // Defensive: apply always sets cutover index alongside the active flag.
// A zero here means the post-apply read raced or a sidecar was hand-edited.
// proposedIdx is still the correct applied_index (this call's entry committed);
// CutoverIndexUnknown stays false because we do know the index — it is proposedIdx.
appliedIndex = proposedIdx5. Test coverage — P2 finding
Suggested table extension: Fix this → // In TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnCapabilityRefusal,
// add a second fixture that exercises the !Reachable branch:
unreachable := admin.CapabilityFanoutResult{
Verdicts: []admin.CapabilityVerdict{
{FullNodeID: 88, EncryptionCapable: true, Reachable: false},
},
OK: false,
}
// assert: status == FailedPrecondition, error contains "88" and "unreachable"Also informational: Round-1 fix (commit
|
| Lens | Result |
|---|---|
| Data loss | ✓ clean |
| Concurrency / distributed | ✓ clean (load-bearing propose-apply invariant noted, same as RotateDEK) |
| Performance | ✓ clean |
| Data consistency | ✓ clean |
| Test coverage | P2: Reachable=false branch in capabilityRefusalSummary untested |
The P2 test gap is straightforward to close with a single extra fixture in the existing refusal test. No blocking issues. 6D-6b (CLI) and 6D-6c (main.go wiring + integration test) can proceed.
…verage) + informational comment
Claude bot review identified one P2 test gap + two informational
points on PR812 round-1.
## claude P2 - capabilityRefusalSummary Reachable=false branch untested
Pre-fix: _RejectsOnCapabilityRefusal only exercised the
EncryptionCapable=false branch of capabilityRefusalSummary. The
Reachable=false branch (which produces 'unreachable member ...'
in the status detail) had no regression coverage. A future
refactor that broke that string would slip through silently.
Fix: convert the test to table-driven across {not_capable,
unreachable} fixtures. Each case asserts:
- status code is FailedPrecondition
- error message contains the refusing node's full_node_id
- error message contains the refusal reason label
('not-capable' or 'unreachable')
## informational - freshCutoverResponse defensive fallback comment
Claude bot noted that the defensive fallback to proposedIdx when
sc.StorageEnvelopeCutoverIndex == 0 doesn't set
CutoverIndexUnknown=true, in contrast with the idempotent-retry
path's same anomaly. The asymmetry is correct (proto §3.1 says
CutoverIndexUnknown is 'only meaningful when
was_already_active=true', and on the fresh-success path we DO
know the correct index: it's proposedIdx). Added a clarifying
comment so a future reader doesn't have to trace through the
proto doc.
## informational - catch-all in capabilityRefusalSummary
Claude bot noted that 'fan-out reported OK=false with no
per-member refusal' is structurally unreachable under the
current CapabilityFanout contract (OK=false ⇒ at least one
failing verdict). Marked informational, no code change — the
catch-all is a defensive fallback message, not a correctness
path; covering it would require an invalid input the helper
contract forbids.
## Caller audit (semantic change)
No semantic change. The test refactor only widens fixture
coverage; the production code change is a comment addition.
No caller impact.
## Verification
- go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope
./adapter/... - all green
- golangci-lint --new-from-rev=origin/main - 0 issues
|
@claude review Round-2 (commit claude P2 —
|
|
Claude finished @bootjp's task in 2m 7s —— View job Code Review — PR #812 Round-2Todo:
Overall assessmentRound-2 correctly addresses both the P2 test gap and the informational comment. The test conversion is sound, the production comment is accurate, and no new correctness issues are introduced. One P2 and one nit remain. P2 fix —
|
| Lens | Result |
|---|---|
| Data loss | ✓ clean (no production logic changed) |
| Concurrency / distributed | ✓ clean |
| Performance | ✓ clean |
| Data consistency | ✓ clean |
| Test coverage | ✓ both capabilityRefusalSummary branches now covered |
The P2 task-reference comment cleanup is a one-liner that can land in the same commit or as a follow-on before merge. No blocking issues.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
adapter/encryption_admin_test.go (1)
1233-1268: ⚡ Quick winProve the already-active branch really bypasses fan-out.
Both tests still wire a successful
WithEncryptionAdminCapabilityFanout, so they won't catch a regression that checks fan-out before the idempotent short-circuit. That would break the documented "already-active can opt out" behavior. Consider omitting fan-out here, or using a fail-fast stub and asserting it is never called.Also applies to: 1277-1305
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@adapter/encryption_admin_test.go` around lines 1233 - 1268, The test currently supplies a successful capability-fanout (WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil))) so it won't detect regressions where fan-out is invoked before the idempotent early-return; replace that successful fanout with a fail-fast stub (or omit fanout) that records or panics if called, and assert it was never invoked; do this in TestEncryptionAdmin_EnableStorageEnvelope_IdempotentRetry and the corresponding test around lines 1277-1305, referencing the WithEncryptionAdminCapabilityFanout/fixedCapabilityFanout usage so the idempotent branch is verified to bypass fan-out.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@adapter/encryption_admin.go`:
- Around line 622-698: Concurrent EnableStorageEnvelope RPCs can race between
cutoverPrecheck → runCutoverFanout → proposeCutoverEntry causing a stale caller
to be reported as a fresh success; serialize the whole sequence by acquiring a
dedicated cutover/mutator mutex around the end-to-end flow in
EnableStorageEnvelope (covering cutoverPrecheck, runCutoverFanout,
proposeCutoverEntry and cutoverPostcheck). Add or reuse a mutex on the server
struct (e.g. s.cutoverMu) and Lock() before running the
precheck/fanout/propose/postcheck sequence and defer Unlock() immediately after
to ensure the second caller re-reads the post-apply sidecar and returns the
idempotent retry shape. Ensure the same lock is applied in the other cutover
path noted (lines ~886-921) for symmetry.
In `@docs/design/2026_05_18_partial_6d_enable_storage_envelope.md`:
- Around line 70-71: Remove the obsolete combined milestone entry "6D-6 —
`EnableStorageEnvelope`" from the document so the open milestones section only
lists the split entries (6D-6a, 6D-6b, 6D-6c); locate the text matching the
literal "6D-6 — `EnableStorageEnvelope` admin RPC + CLI command + integration
test composing 6D-3 + 6D-4 + 6D-5" and delete that paragraph to avoid
duplication and confusion.
---
Nitpick comments:
In `@adapter/encryption_admin_test.go`:
- Around line 1233-1268: The test currently supplies a successful
capability-fanout
(WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(),
nil))) so it won't detect regressions where fan-out is invoked before the
idempotent early-return; replace that successful fanout with a fail-fast stub
(or omit fanout) that records or panics if called, and assert it was never
invoked; do this in TestEncryptionAdmin_EnableStorageEnvelope_IdempotentRetry
and the corresponding test around lines 1277-1305, referencing the
WithEncryptionAdminCapabilityFanout/fixedCapabilityFanout usage so the
idempotent branch is verified to bypass fan-out.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3d166b05-4ed5-4b85-bd8b-97025d0590ec
⛔ Files ignored due to path filters (2)
proto/encryption_admin.pb.gois excluded by!**/*.pb.goproto/encryption_admin_grpc.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (5)
adapter/encryption_admin.goadapter/encryption_admin_test.godocs/design/2026_05_18_partial_6d_enable_storage_envelope.mdproto/Makefileproto/encryption_admin.proto
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c408449560
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !result.OK { | ||
| return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, | ||
| "encryption: capability check refused cutover (%s)", capabilityRefusalSummary(result)) |
There was a problem hiding this comment.
Preserve cancellation codes when fanout returns not-OK
This branch turns every result.OK == false outcome into FailedPrecondition, but the production fanout helper (internal/admin.CapabilityFanout) can hit request timeout/cancellation and still return (result, nil) with OK=false (it synthesizes unreachable verdicts rather than returning an error). In that case, EnableStorageEnvelope misclassifies a transport cancellation/deadline event as a configuration refusal, which breaks client retry behavior keyed on Canceled/DeadlineExceeded. Check ctx.Err() (or plumb cancellation state through the fanout result) before forcing FailedPrecondition.
Useful? React with 👍 / 👎.
…dex P2 (ctx-err preservation) + cleanups Round-2 review surfaced one Major + two P2 + one Minor. ## coderabbit Major - serialize concurrent EnableStorageEnvelope calls Pre-fix: gRPC serves handlers concurrently. Two concurrent EnableStorageEnvelope calls could both observe StorageEnvelopeActive=false at their precheck, both propose, and the loser would assemble freshCutoverResponse with was_already_active=false but the FIRST cutover's StorageEnvelopeCutoverIndex - violating the §6.4 fresh-success contract documented on the proto response. Fix: add cutoverMu sync.Mutex to EncryptionAdminServer. EnableStorageEnvelope acquires it at the very top and releases on defer. The lock spans the entire precheck → fan-out → propose → postcheck sequence so a second overlapping call sees StorageEnvelopeActive=true at its precheck (because the first call already proposed and applied) and takes the §6.4 idempotent-retry short-circuit. Note on design §2.1 #4 'shared lock with RotateDEK': the design calls for a shared mutator lock so a RotateDEK cannot interleave between a cutover's propose and apply. That piece is left to a follow-up because extending serialization to RotateDEK changes a hot-path mutator's semantics; the §2.1 #3 stale-DEKID benign no-op handles the un-serialized RotateDEK / cutover interleave today. Doc comment in cutoverMu records the deferred work. Regression test: TestEncryptionAdmin_EnableStorageEnvelope_SerializesConcurrentCutovers fires N=4 concurrent callers via a sync.WaitGroup release barrier, asserts proposer.calls == 1 (cutover proposed exactly once) and that exactly one caller saw was_already_active=false and the rest hit the §6.4 idempotent-retry shape. ## codex P2 - preserve ctx codes when fan-out returns OK=false synthesized Pre-fix: when the production CapabilityFanout helper hits ctx.expired mid-probe, it synthesizes Reachable=false verdicts and returns (result, nil) with OK=false rather than erroring out. runCutoverFanout then classified the outcome as a 'capability check refused' FailedPrecondition, hiding the transport-layer Canceled / DeadlineExceeded that the caller should observe. Fix: in runCutoverFanout's OK=false branch, check ctx.Err() first. If set, route through capabilityFanoutErrorToStatus so codes.Canceled / codes.DeadlineExceeded surface to the client. Only fall through to FailedPrecondition for genuine configuration refusals. Regression test: TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellationOnFanoutNotOK table-driven across {canceled, deadline_exceeded} contexts; fan-out fixture returns synthesized (notOK, nil); the test asserts the gRPC status code matches the ctx error. ## coderabbit quick-win - idempotent-retry tests must not wire fan-out Pre-fix: _IdempotentRetry and _DefensiveCutoverIndexUnknown both wired a SUCCESSFUL capability fan-out fixture. A regression that re-ordered cutoverPrecheck so the fan-out fired BEFORE the §6.4 short-circuit would have silently passed both tests. Fix: new failOnCallCapabilityFanout(t) helper that fails the test if invoked. Both idempotent-retry tests now use it - any re-ordering regression trips the fixture. ## coderabbit minor - design doc obsolete entry Removed the original combined '6D-6' milestone line from the design doc's open-milestones section (it was superseded by the 6D-6a shipped + 6D-6b/c open structure). ## Caller audit (semantic change) EnableStorageEnvelope is the only mutator that acquires cutoverMu; RotateDEK, BootstrapEncryption, and RegisterEncryptionWriter remain un-serialized at the RPC layer (their FSM apply-path conflict detection is unchanged). No production caller of EnableStorageEnvelope exists yet (CLI lands in 6D-6b); the only callers are server-level unit tests, which exercise the new locked path. ## Verification - go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope ./adapter/... - all green (the -race build also exercises the serialization test's 4-goroutine fan-out) - golangci-lint --new-from-rev=origin/main - 0 issues
|
@claude review Round-3 (commit coderabbit Major — serialize concurrent EnableStorageEnvelope callsPre-fix: gRPC handlers run concurrently. Two cutover RPCs could both observe Fix: The §2.1 #4 'shared lock with RotateDEK' is left to a follow-up — extending serialization to RotateDEK changes a hot-path mutator's semantics, and §2.1 #3 stale-DEKID benign no-op handles the un-serialized RotateDEK/cutover interleave today. Documented in the Regression test: codex P2 — preserve ctx codes when fan-out returns OK=falsePre-fix: when production Fix: in the OK=false branch, check Regression test: coderabbit quick-win — idempotent-retry tests must not wire fan-outPre-fix: Fix: new coderabbit minor — design doc obsolete entryRemoved the original combined '6D-6' milestone line; it's been superseded by 6D-6a (shipped) + 6D-6b/c (open). Caller audit
Verification
|
|
Claude finished @bootjp's task in 4m 12s —— View job Code Review — PR #812 Round-3Todo:
Overall assessmentAll three substantive fixes in Round-3 are correct and the new tests are sound. One P2 is a carryover from Round-2 that grew rather than shrank, plus one nit on the concurrent test's channel-read order. 1. Mutex serialization (
|
| File | Line | Reference |
|---|---|---|
encryption_admin.go |
50 | (coderabbit Major on PR812) |
encryption_admin.go |
697 | (design §2.1 #4 + PR812 coderabbit Major) |
encryption_admin.go |
778 | (codex P2 on PR812) |
encryption_admin.go |
790 | Codex P2 round-2 on PR812: |
encryption_admin_test.go |
1053 | PR812 coderabbit quick-win round-2. |
encryption_admin_test.go |
1447, 1514, 1571 | various on PR812 references |
The surrounding reasoning is sound and worth keeping — only the PR-tracker attributions need to be stripped. Fix this →
Summary
| Lens | Result |
|---|---|
| Data loss | ✓ clean |
| Concurrency / distributed | ✓ cutoverMu fix is correct; concurrent test is well-structured |
| Performance | ✓ clean |
| Data consistency | ✓ clean |
| Test coverage | ✓ all three round-3 branches have regression tests |
The P2 task-reference cleanup (8 instances) is a straightforward find-and-delete. No blocking issues.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a45eeac798
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| s.cutoverMu.Lock() | ||
| defer s.cutoverMu.Unlock() |
There was a problem hiding this comment.
Honor context deadlines before blocking on cutover mutex
EnableStorageEnvelope takes cutoverMu before any context check and holds it across fan-out + Raft propose, so concurrent calls can block on s.cutoverMu.Lock() well past their own cancellation/deadline. In practice, if one call stalls in fan-out/propose, later callers with short timeouts will not return Canceled/DeadlineExceeded until the first call exits, which breaks expected RPC cancellation semantics and can tie up admin handlers unnecessarily.
Useful? React with 👍 / 👎.
…mutator wait)
Codex P2 round-3 on PR812:
Pre-fix: EnableStorageEnvelope took cutoverMu (sync.Mutex)
unconditionally at the top. A concurrent caller with a short
ctx deadline could block on Lock() well past its own
cancellation/deadline if the first caller stalled in fan-out
or propose - breaking gRPC cancellation semantics and tying
up admin handlers.
Fix: replace sync.Mutex with a capacity-1 channel semaphore
(cutoverSem chan struct{}). New acquireCutoverSemaphore /
releaseCutoverSemaphore helpers; acquire uses a select on
ctx.Done() so an expired/canceled context returns immediately
with the matching gRPC code (Canceled / DeadlineExceeded). New
cutoverSemaphoreErrorToStatus helper maps the ctx error -
named distinctly from capabilityFanoutErrorToStatus so the
status detail surfaces which stage of the RPC the wait
fired in.
Initialization: NewEncryptionAdminServer eagerly creates the
buffered chan (a nil chan would block forever on send,
defeating the ctx-aware acquire).
Regression test:
TestEncryptionAdmin_EnableStorageEnvelope_HonorsCtxDeadlineWaitingOnMutator
drives the first call into a fan-out fixture that blocks on a
never-firing channel (holds cutoverSem); the second call uses
an already-expired ctx and asserts:
- the RPC returns within 500ms (NOT blocking on the held
semaphore)
- the gRPC status code is DeadlineExceeded
## Caller audit (semantic change)
acquireCutoverSemaphore is called only from
EnableStorageEnvelope. The release path is symmetric (defer
releaseCutoverSemaphore). No other code path touches
cutoverSem - the channel is internal to the cutover RPC.
The SerializesConcurrentCutovers test from round-3 still
covers the mutual-exclusion invariant (the loser still hits
the §6.4 idempotent-retry shape); HonorsCtxDeadlineWaitingOnMutator
covers the new ctx-aware acquisition path.
## Verification
- go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope
./adapter/... - all green
- golangci-lint --new-from-rev=origin/main - 0 issues
|
@claude review Round-4 (commit codex P2 round-3 — honor ctx deadlines before blocking on cutover mutexPre-fix: Fix: replaced
Regression test: Caller audit
Verification
|
|
Claude finished @bootjp's task in 3m 41s —— View job Code Review — PR #812 Round-4Todo:
Overall assessmentThe channel-semaphore implementation is correct and symmetric with the rest of the error-mapping layer. One P2 is a carryover that has now accumulated across all four rounds without being addressed; the second P2 is a timing-dependent test the author acknowledged in the test comment itself. 1. Channel-semaphore implementation ✓The pattern is correct: capacity-1 channel, send to acquire, receive to release.
2. Performance / distributed failures ✓Holding the semaphore across the full precheck → fan-out → propose → postcheck sequence is correct per §2.1 #4. The fan-out + Raft-propose sequence can take O(hundreds of ms); callers with tighter deadlines now surface 3. P2 —
|
| File | Lines | Reference |
|---|---|---|
encryption_admin.go |
49 | (coderabbit Major on PR812) |
encryption_admin.go |
56 | (codex P2 round-3 on PR812) |
encryption_admin.go |
709 | design §2.1 #4 + PR812 coderabbit Major |
encryption_admin.go |
718 | codex P2 round-3 on PR812 |
encryption_admin.go |
749–750 | codex P2 round-3 on / PR812 |
encryption_admin.go |
838–839 | codex P2 on / PR812 |
encryption_admin.go |
851 | Codex P2 round-2 on PR812 |
encryption_admin.go |
1012 | Claude bot informational on PR812 |
encryption_admin_test.go |
1053 | PR812 coderabbit quick-win round-2 |
encryption_admin_test.go |
1361 | PR812 claude P2 |
encryption_admin_test.go |
1447 | Major on PR812 |
encryption_admin_test.go |
1514 | codex P2 round-3 on PR812 |
encryption_admin_test.go |
1575 | codex P2 round-2 on PR812 |
encryption_admin_test.go |
1632 | the codex P2 finding on PR812 |
CLAUDE.md: "Don't reference the current task, fix, or callers… since those belong in the PR description and rot as the codebase evolves." The surrounding reasoning in every case is sound and worth keeping — only the tracker attributions need to be stripped.
Summary
| Lens | Result |
|---|---|
| Data loss | ✓ clean |
| Concurrency / distributed | ✓ semaphore is correct; test has timing-dependent sleep (P2) |
| Performance | ✓ clean |
| Data consistency | ✓ clean |
| Test coverage | ✓ new test covers the ctx-deadline path |
Two P2 items before merge: the deterministic-semaphore-wait test fix (3 lines, sync.Once + fanoutEntered channel) and the PR-reference comment cleanup (14 instances, all trailing attributions). The core semaphore implementation is correct and can proceed.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f93bea7e59
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| select { | ||
| case s.cutoverSem <- struct{}{}: | ||
| return nil | ||
| case <-ctx.Done(): |
There was a problem hiding this comment.
Check context before acquiring cutover semaphore
When ctx is already canceled and cutoverSem has capacity, both select branches are ready and Go may choose the send case, so the RPC continues instead of immediately returning Canceled/DeadlineExceeded. That breaks the cancellation guarantee this semaphore path is meant to provide and can run precheck/fan-out/propose work for requests that were already aborted. Add an explicit ctx.Err() check (or otherwise prioritize <-ctx.Done()) before the blocking acquire to make cancellation deterministic.
Useful? React with 👍 / 👎.
…rcuit before semaphore acquire)
Codex P1 round-4 on PR812:
Pre-fix: acquireCutoverSemaphore entered a 2-case select
(send to cutoverSem vs. ctx.Done) without first checking
ctx.Err(). When ctx was already canceled AND the semaphore
had capacity, BOTH select cases were immediately ready, and
Go's select picks uniformly at random among ready cases - so
the acquire could non-deterministically take the send branch
and the RPC would proceed into precheck / fan-out / propose
against an aborted caller.
Fix: add an explicit ctx.Err() check before the select. The
select still handles the during-wait cancellation case;
the up-front check turns the pre-canceled case into a
deterministic short-circuit.
Regression test:
TestEncryptionAdmin_EnableStorageEnvelope_PreCanceledCtxDeterministicShortCircuit
table-driven across {canceled, deadline_exceeded}. Each case
constructs a freshly-created server (semaphore empty, send
slot available) and runs 100 iterations with a pre-canceled
ctx. The fan-out fixture is failOnCallCapabilityFanout, which
t.Errorf's if invoked - so any iteration that lost the
random select coin flip and advanced into precheck would
trip the test.
Without the explicit ctx.Err() check, the 100-iteration
loop would surface the bug as a flaky t.Errorf hit on roughly
half the iterations.
## Caller audit (semantic change)
acquireCutoverSemaphore is called only from
EnableStorageEnvelope. The change narrows behaviour: a
pre-canceled ctx now ALWAYS returns the matching gRPC code,
rather than sometimes (when the semaphore is taken) running
the cutover sequence to completion. Strictly safer; no other
caller of acquireCutoverSemaphore exists.
## Verification
- go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope
./adapter/... - all green (the new 100-iteration test
passes deterministically; would have flapped pre-fix)
- golangci-lint --new-from-rev=origin/main - 0 issues
|
@claude review Round-5 (commit codex P1 — pre-canceled ctx + free semaphore was non-deterministicPre-fix: Fix: add an explicit Regression test: Caller audit
Verification
|
|
Claude finished @bootjp's task in 2m 38s —— View job Code Review — PR #812 Round-5Todo:
Overall assessmentThe P1 fix is correct and the regression test is sound. One carryover P2 from Round-4 has been rendered moot by Round-5 itself (explained below). The PR-reference comment cleanup remains the only outstanding item across 5 rounds. 1. P1 fix —
|
| File | Lines | Reference |
|---|---|---|
encryption_admin.go |
49 | (coderabbit Major on PR812) |
encryption_admin.go |
56 | (codex P2 round-3 on PR812) |
encryption_admin.go |
709–710 | PR812 coderabbit Major |
encryption_admin.go |
718 | codex P2 round-3 on PR812 |
encryption_admin.go |
749–750 | codex P2 round-3 on PR812 |
encryption_admin.go |
759 | codex P1 round-4 on PR812 (new) |
encryption_admin.go |
850–851 | codex P2 on PR812 |
encryption_admin.go |
863 | Codex P2 round-2 on PR812 |
encryption_admin.go |
1024 | Claude bot informational on PR812 |
encryption_admin_test.go |
1053–1054 | PR812 coderabbit quick-win round-2 |
encryption_admin_test.go |
1361 | PR812 claude P2 |
encryption_admin_test.go |
1447 | Major on PR812 |
encryption_admin_test.go |
1514 | codex P1 round-4 on PR812 (new) |
encryption_admin_test.go |
1584 | codex P2 round-3 on PR812 |
encryption_admin_test.go |
1645 | codex P2 round-2 on PR812 |
encryption_admin_test.go |
1702 | codex P2 finding on PR812 |
CLAUDE.md: "Don't reference the current task, fix, or callers… since those belong in the PR description and rot as the codebase evolves." The substantive reasoning in every comment block is worth keeping — only the reviewer/round attributions need to be stripped. Fix this →
Summary
| Lens | Result |
|---|---|
| Data loss | ✓ clean |
| Concurrency / distributed | ✓ P1 fix is correct and complete |
| Performance | ✓ clean |
| Data consistency | ✓ clean |
| Test coverage | ✓ regression test is deterministic and sound |
The P2 sleep is now informational (Round-5 itself makes it moot). The only remaining action item before merge is the PR-reference comment cleanup (16 instances, 5 rounds).
Summary
EnableStorageEnvelopeRPC +EnableStorageEnvelopeRequest/Response+CapabilityVerdictmessages per design §3.1. Uses OK gRPC status withwas_already_activeboolean discriminator rather thanAlreadyExists— unary gRPC drops the response body on non-OK status, so the §6.4 idempotency contract MUST live on the success path.adapter/encryption_admin.goships the server method composing the §3.2 sequence: leader gate → input validation → sidecar read → bootstrap gate → idempotent-retry short-circuit (§6.4) → capability fan-out (6D-3) → proposeRotateSubEnableStorageEnvelopethrough Raft (6D-4 wire) → post-apply re-read discriminating fresh-success vs. stale-DEKID race.WithEncryptionAdminCapabilityFanoutoption wires the §4 pre-flight. Without it,EnableStorageEnveloperefuses withFailedPrecondition— matches the proposer / leaderView posture for other mutator RPCs. The 6D-6c main.go wiring closes over the live route snapshot + DialFunc + timeout.Test plan
go test -race -timeout 240s -run TestEncryptionAdmin_ ./adapter/...— all greengolangci-lint --new-from-rev=origin/main— 0 issues_RejectsOnFollower— leader gate + hint embedding_RejectsWithoutSidecarPath,_RejectsWithoutCapabilityFanout— wiring gates_RejectsZeroProposerNodeID,_RejectsOversizedLocalEpoch— §3.1 input validation_RejectsNotBootstrapped— bootstrap gate with hint_IdempotentRetry— sidecar already active → OK +was_already_active=true, applied_index = originalStorageEnvelopeCutoverIndex, NO fan-out, NO propose_DefensiveCutoverIndexUnknown— active sidecar with index=0 (hand-edit hedge) →CutoverIndexUnknown=true, fallback toRaftAppliedIndex_RejectsOnCapabilityRefusal— names the refusing node_RejectsOnFanoutError— fan-out helper itself erroring → FailedPrecondition_HappyPath— end-to-end fresh cutover with simulated apply effect; wire-level decode confirmsRotateSubEnableStorageEnvelope+ §2.1 constraints_StaleDEKIDRace— simulated apply does NOT flip sidecar (RotateDEK raced) → FailedPrecondition with retry hint_CapabilitySummaryProjection— per-row wire-shape pin for the internalCapabilityVerdict→proto.CapabilityVerdictmappingCaller audit (semantic change)
The new RPC is purely additive: no existing method's signature or semantics change.
WithEncryptionAdminCapabilityFanoutis a new option; existing test fixtures that don't wire it leaves.capabilityFanout=nil, which only affectsEnableStorageEnvelope(refuses) — other mutator RPCs are unaffected.proto.UnimplementedEncryptionAdminServercarries theEnableStorageEnvelopestub returningcodes.Unimplemented, so test doubles that don't override stay unbroken.Self-review (5 lenses)
RotateDEK/BootstrapEncryptionpaths rely on the same.applied_indexreports the ORIGINALStorageEnvelopeCutoverIndex(not the retry's Raft index).CapabilitySummaryis intentionally empty on retries so a caller cannot accidentally re-use the original cutover's membership view. The §2.1 Update docker/setup-qemu-action action to v3 #3 stale-DEKID race is surfaced asFailedPreconditionwith a retry hint (not Aborted; the design pinsFailedPrecondition).applyingProposerhelper pins the §6.4 / §2.1 Update docker/setup-qemu-action action to v3 #3 discrimination logic — the central correctness-critical branch.Refs: design doc §3.1 / §3.2 / §6.4 (6D-6a slice of the §7 decomposition table).
Summary by CodeRabbit
New Features
Tests
Documentation