s3: AdminDeleteBucket DEL_PREFIX safety net (closes TOCTOU)#695
Conversation
Pre-existing TOCTOU between the empty-probe (ScanAt at readTS) and the BucketMetaKey delete commit, flagged by coderabbitai on PR #669 and tracked in the partial admin dashboard design doc's Outstanding section. This doc proposes adopting Option C (DEL_PREFIX safety net) over Options A (BucketGenerationKey bump on every PutObject) and B (OperationGroup.ReadRanges schema extension):

- Option A imposes a permanent throughput tax on every PutObject (the hot path) to fix a race that fires only at admin-delete time.
- Option B is the semantically purest fix, but the schema + coordinator + FSM surface is large and blocks unrelated work.
- Option C closes the orphan window with the smallest patch: existing op type (pb.Op_DEL_PREFIX), no schema change, PutObject hot path unchanged.

Trade-off: a contract change — a PutObject returning 200 OK during the race window can have its data swept by the racing AdminDeleteBucket. This is bounded and matches what docs/admin_deployment.md already advises operators to avoid.

The doc also enumerates every per-bucket key family in internal/s3keys (manifest / upload meta / upload part / blob / gc / route) and the new *PrefixForBucket helpers each one needs. BucketGenerationKey is intentionally not deleted, so re-creating the bucket bumps the generation and orphans from the old generation remain isolated.

Implementation lands in subsequent commits per the CLAUDE.md design-doc-first workflow.
ObjectManifestPrefixForBucket has existed since the s3 adapter was introduced; the other five per-bucket key families never had a helper because the existing call sites always knew the per-upload or per-object prefix they needed. AdminDeleteBucket's DEL_PREFIX safety net (design doc 2026_04_28_proposed_admin_delete_bucket_safety_net.md) needs to sweep every per-bucket prefix in one OperationGroup, so every key family gets a parallel helper:

- UploadMetaPrefixForBucket
- UploadPartPrefixForBucket
- BlobPrefixForBucket
- GCUploadPrefixForBucket
- RoutePrefixForBucket

All six functions now share a single bucketScopedPrefix helper — the encoding is identical (`<family-prefix><EncodeSegment(bucket)><appendU64(gen)>`), so factoring it out avoids drift if the per-key constructors evolve.

Pinned the contract with TestPerBucketPrefixes_IsolateByBucketAndGeneration: for each of the six families, every key under (bucket=A, gen=N) matches the (A, N) prefix, while keys under (B, N) and (A, M) do NOT. The (A, M) check is the load-bearing assertion — if a future encoding change put generation before bucket, the orphan-isolation property bucket re-creation relies on would silently break and DEL_PREFIX would start nuking the new bucket's data. The test fails at construction time, before any prod call site notices.

No behavior change in this commit — pure refactor + new helpers.
Fails on current main: AdminDeleteBucket only deletes BucketMetaKey,
so any keys planted in the 5 non-manifest per-bucket prefixes
(upload_meta, upload_part, blob, gc_upload, route) survive the
delete and become orphaned under a now-deleted bucket meta.
Per CLAUDE.md "test the bug first" — this commit lands the
failing test; the next commit lands the DEL_PREFIX safety net
fix. Together they pin the regression so a future refactor that
strips the safety net trips the assertion.
The test plants directly into the store (via the local adapter
coordinator's Put) rather than racing two goroutines because:
- the empty-probe scan only looks at the manifest prefix, so
orphans in the other 5 prefixes are exactly what can leak
through the race window without disturbing the probe
- direct planting is deterministic; goroutine racing is not
- the safety net's behavior is independent of the race —
it must wipe every per-bucket prefix at delete time, full
stop, and that's what the test asserts
A second test, TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives,
pins the orphan-isolation property the design relies
on: BucketGenerationKey must NOT be deleted, so re-creating the
bucket bumps generation and any pre-existing orphans under the
old generation prefix stay invisible to the new bucket.
Closes the orphan-object race coderabbitai flagged on PR #669 and analysed in docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md. The empty-probe (ScanAt at readTS) and the BucketMetaKey delete (commit at a later commitTS) are not atomic; a concurrent PutObject between the two would leave manifest, blob, upload metadata, upload parts, GC entries, and route key all orphaned under a now-deleted bucket meta. The OCC validator only inspects keys appearing in OperationGroup.ReadKeys, so the existing single-key read set could not catch range-level mutations.

The fix appends DEL_PREFIX ops to the same OperationGroup as the BucketMetaKey delete:

- ObjectManifestPrefixForBucket
- UploadMetaPrefixForBucket
- UploadPartPrefixForBucket
- BlobPrefixForBucket
- GCUploadPrefixForBucket
- RoutePrefixForBucket

All seven ops share a commitTS, so anything that landed in any per-bucket prefix between readTS and commit is tombstoned with the bucket meta. BucketGenerationKey is intentionally NOT in the list — re-creating the bucket bumps the generation, and orphans under the old generation prefix stay isolated from the new bucket. Pinned by TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives.

The same shape is mirrored on the SigV4 path (adapter/s3.go:deleteBucket) so both delete entrypoints share the same race-window guarantees. Factored into one bucketDeleteOperationGroupElems helper so a future per-bucket key family added to either path covers both delete paths in lockstep.

Contract change for clients: a PutObject that returned 200 OK during the race window can have its data swept by the concurrent delete. Operationally bounded — operators are already advised to pause writes before AdminDeleteBucket. The alternative was orphan objects that no API can enumerate or remove; the new contract is strictly cleaner.
DEL_PREFIX is broadcast to every shard (kv/sharded_coordinator.go), but per-shard scans return 0 keys in the common case (the empty-probe already passed), and bucket delete is operator-frequency. The cost is acceptable.

Tests: the previously failing TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes now passes (the 5 non-manifest prefixes are wiped). Existing AdminDeleteBucket tests (HappyPath / MissingBucket / RejectsReadOnly / RejectsNonEmpty) and the full TestS3 race suite continue to pass.

go test -race ./adapter/ — passes.
golangci-lint run ./adapter/... ./internal/s3keys/... — 0 issues.
Two updates now that the DEL_PREFIX safety net has landed:
1. **docs/admin_deployment.md** — new §4.6 "Deleting an S3
bucket" covering the operator-facing contract:
- Standard "bucket must be empty" rule still applies (409
BucketNotEmpty when not).
- The race-window contract change is documented explicitly:
a PutObject that returned 200 OK during the empty-probe →
commit window of a concurrent AdminDeleteBucket can have
its data swept. Calls out the two scenarios where this
matters (planned vs emergency delete) and what operators
should do in each case.
- Confirms re-create-after-delete is safe because
BucketGenerationKey survives — orphans from any previous
escape stay isolated under the old generation prefix.
- Cross-link to the new design doc.
2. **docs/design/2026_04_24_partial_admin_dashboard.md** —
Outstanding open items section: TOCTOU entry struck through
and replaced with the "fixed" summary plus a pointer to the
safety-net design doc. Updated the closing rename trigger so
only the rolling-upgrade flag (criterion 5) blocks promotion
to `_implemented_` now.
No design changes; no behaviour changes. State refresh.
@claude review
Claude finished @bootjp's task in 3m 17s — PR Review: s3: AdminDeleteBucket DEL_PREFIX safety net (closes TOCTOU)

Summary: The design intent is sound — appending

🔴 Critical: Mixed
Code Review
This pull request addresses a TOCTOU race condition in bucket deletion by introducing a safety net that uses DEL_PREFIX operations to tombstone all per-bucket key families during the deletion transaction. This prevents orphaned keys from being left behind if a concurrent write occurs. The changes include new key prefix helpers, updates to both the SigV4 and Admin delete paths, and extensive regression tests and documentation. I have no feedback to provide.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5f48cd7230
    _, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
    	IsTxn:   true,
    	StartTS: startTS,
    -	Elems: []*kv.Elem[kv.OP]{
    -		{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
    -	},
    +	Elems: bucketDeleteOperationGroupElems(name, meta.Generation),
Split bucket delete into supported coordinator operation shapes
AdminDeleteBucket now dispatches IsTxn: true with bucketDeleteOperationGroupElems(...), which includes both kv.Del and kv.DelPrefix. That combination is rejected by real coordinators: kv/sharded_coordinator.go routes any request containing DelPrefix through dispatchDelPrefixBroadcast, which rejects transactional and mixed-operation requests, and the single-group txn path rejects Op_DEL_PREFIX intents in kv/fsm.go. In production this makes bucket deletion fail (admin and SigV4, since both paths share this helper) instead of applying the intended safety net; the new local tests miss it because newLocalAdapterCoordinator bypasses coordinator validation.
Codex P1 on PR #695 caught that the original single-OperationGroup shape would be rejected by the production coordinator on every bucket delete. The fix lands here together with a test-coordinator tightening that catches this class of bug in unit tests.

Background: kv/sharded_coordinator.go:dispatchDelPrefixBroadcast rejects any OperationGroup that mixes DEL_PREFIX with other ops AND any OperationGroup that uses DEL_PREFIX inside a transaction. The earlier shape ([Del BucketMetaKey, DelPrefix×6] with IsTxn=true) hit both rejection rules, so production would have failed every bucket delete with ErrInvalidRequest. The local test coordinator bypassed those checks, so the regression test passed despite the production-breaking shape.

The fix is two-part:

1. **Split AdminDeleteBucket and s3.go:deleteBucket into two Dispatch calls.**
   Phase 1 — `Del BucketMetaKey` in a txn (OCC-protected against a concurrent AdminCreateBucket racing the delete). Lives inside retryS3Mutation so an OCC conflict retries the whole closure.
   Phase 2 — DEL_PREFIX broadcast over every per-bucket key family in a non-txn OperationGroup. Lives outside retryS3Mutation because Phase 1 is the point of no return: a Phase-2 retry would 404 at loadBucketMetaAt. Phase-2 failure is logged via slog.WarnContext and not propagated: the bucket meta is already gone from the operator's POV; orphan keys (if any) are no worse than the pre-fix state on main and can be recovered by a future sweep tool. Surfacing a 500 to the operator after a successful delete would be a worse UX.
   Phase-1-first ordering is deliberate: a Phase-2-first ordering could leave the bucket meta extant while per-bucket data was wiped if Phase 1 then failed (concurrent recreate). Phase-1-first localises any partial failure to "bucket gone, orphan data may persist", which has a clean audit trail.

2. **Tighten localAdapterCoordinator validation** to mirror the production coordinator's dispatch-time rejection rules: reject IsTxn=true with any DelPrefix; reject mixed Del+DelPrefix groups. Without this, a future regression that ships the rejected shape would silently pass tests while breaking production. The existing TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes test now exercises the production-realistic dispatch path and would have caught Codex P1 directly.

Refactor: split the single `bucketDeleteOperationGroupElems` helper into `bucketDeleteSafetyNetElems` (DEL_PREFIX-only, used by Phase 2) and a new `runBucketDeleteSafetyNet` method that shares the dispatch-and-log logic between Admin and SigV4 paths. The Phase-1 Del op shape is small enough to inline at each call site.

Design doc §6.2 rewritten with the two-phase rationale, the Codex P1 finding, the Phase-2 failure semantics, and why Phase-1-first ordering is correct.

Tests:
go test -count=1 -run TestS3Server_AdminDeleteBucket ./adapter/ — passes
go test -race -count=1 -run TestS3 ./adapter/ — passes (1.455s)
golangci-lint run ./adapter/... — 0 issues
@claude review

Addressed the Codex P1 finding (single-OperationGroup shape rejected by production coordinator): the original [Del BucketMetaKey, DelPrefix×6] with IsTxn=true would have hit two rejection rules in kv/sharded_coordinator.go. Two-part fix in the commit above:
Design doc §6.2 rewritten with the two-phase rationale, the Phase-2 failure semantics, why Phase-1-first ordering is correct, and the test-coordinator parity note.
The single comment at
Claude finished @bootjp's task in 1m 55s — PR Review: s3: AdminDeleteBucket DEL_PREFIX safety net (closes TOCTOU)
Review of commit
Per docs/design/README.md's lifecycle convention. The original P1–P4 plan has fully shipped:

- P1 (admin skeleton + Dynamo + AdminForward) — #634 / #635 / #644 / #648
- P2 (S3 endpoints incl. write paths and AdminForward integration) — #658 / #669 / #673 / #695 (TOCTOU safety net)
- P3 (React SPA + embed) — #649 / #650
- P4 (TLS / role / CSRF / operator doc / deployment runbook / scripts/rolling-update.sh admin support) — #674 / #669 / #678

The AdminDeleteBucket TOCTOU caught during PR #669 review (the last "in-flight" item that kept the doc at _partial_) is fully resolved by the safety-net design landed in #695.

What changed:

- git mv 2026_04_24_partial_admin_dashboard.md → 2026_04_24_implemented_admin_dashboard.md (history follows the rename)
- Header Status line: "Partial" → "Implemented"; explanation updated to reflect the post-fix state and the rationale for promotion.
- "Last updated" bumped to 2026-04-28 with the rename trigger.
- Section heading "Outstanding open items" → "Out-of-scope follow-ups" — the remaining three entries (criterion 5, object browser, TLS hot-reload) are not in-flight work; they are deferred-at-design or Non-goal items. The TOCTOU bullet is removed (resolved) and replaced with a one-line cross-link to the safety-net design + admin_deployment.md §4.6 contract.
- Removed the closing "rename trigger" sentence — we just did the rename.
- Status table: the P2 row now lists #695 alongside #658/#669/#673 so a future reader can find the TOCTOU fix from the index.
- Cross-references updated everywhere the old filename appeared:
  docs/admin.md (header link + Cross-references),
  docs/admin_deployment.md (header link + final cross-ref),
  docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md (Background section pointer),
  internal/admin/config.go (Section 7.1 reference comment)

No code changes other than the comment-only filename refresh in config.go.
## Summary

Promote the admin dashboard design doc from `_partial_` → `_implemented_` per `docs/design/README.md`'s lifecycle convention. PR #695 landed the TOCTOU safety-net fix (the last in-flight item that kept the doc at `_partial_`), so the original P1–P4 plan is now fully shipped:

| Phase | Landed via |
|---|---|
| P1 (admin skeleton + Dynamo + AdminForward) | #634 / #635 / #644 / #648 |
| P2 (S3 endpoints + writes + AdminForward S3 + TOCTOU fix) | #658 / #669 / #673 / **#695** |
| P3 (React SPA + embed) | #649 / #650 |
| P4 (TLS / role / CSRF / operator doc / runbook / script wiring) | #674 / #669 / #678 |

The remaining three items in the doc move from "Outstanding open items" (in-flight) to **"Out-of-scope follow-ups"** (deferred-at-design or Non-goal):

- AdminForward criterion 5 — rolling-upgrade flag, deferred behind a cluster-version bump that doesn't exist yet
- S3 object browser — Non-goal per §2.2
- TLS cert hot-reload — out of scope per `docs/admin.md`

## Changes

- `git mv` partial → implemented (history follows the rename)
- Status line / Last-updated / status-table / Out-of-scope section content reflects the promotion
- Cross-references updated in all 4 referencing files: `docs/admin.md`, `docs/admin_deployment.md`, `docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md`, `internal/admin/config.go` (comment-only)

## Test plan

- [x] No code changes other than a comment-only filename refresh in `config.go`
- [x] `go build ./...` passes
- [x] `golangci-lint run ./internal/admin/...` — 0 issues
- [x] `grep -rn "2026_04_24_partial\|2026_04_24_proposed_admin"` returns nothing — no stale references
Summary

Closes the AdminDeleteBucket TOCTOU race that coderabbitai flagged on PR #669. The empty-probe (`ScanAt` at `readTS`) and the `BucketMetaKey` delete (commit at a later `commitTS`) were not atomic — a concurrent `PutObject` between the two would leave manifest, blob chunks, upload metadata, upload parts, GC entries, and route key all orphaned under a now-deleted bucket meta.

Fix

Append `DEL_PREFIX` ops to the same `OperationGroup` as the `BucketMetaKey` delete, covering every per-bucket key family in `internal/s3keys`:

- `ObjectManifestPrefixForBucket`
- `UploadMetaPrefixForBucket` (new)
- `UploadPartPrefixForBucket` (new)
- `BlobPrefixForBucket` (new)
- `GCUploadPrefixForBucket` (new)
- `RoutePrefixForBucket` (new)

All seven ops share a `commitTS`, so anything that snuck into any per-bucket prefix between `readTS` and commit is tombstoned at the same logical moment as the bucket meta. `BucketGenerationKey` is intentionally not deleted — re-creating the bucket bumps the generation and orphans under the old generation prefix stay isolated.

The same shape lands on the SigV4 path (`adapter/s3.go:deleteBucket`) so both delete entrypoints share the race-window guarantees. Factored into one `bucketDeleteOperationGroupElems` helper.

Contract change for clients

A `PutObject` that returned 200 OK during the race window can have its data swept by the concurrent delete. This is operationally bounded — operators are advised to pause writes before bucket delete (now in `docs/admin_deployment.md` §4.6). The alternative was orphan objects that no API can enumerate or remove; the new contract is strictly cleaner.

Commit structure (per CLAUDE.md design-doc-first + test-first)

- `3b8ef475` — design doc proposing Option C (DEL_PREFIX safety net) over Options A (BucketGenerationKey bump on every PutObject) and B (`OperationGroup.ReadRanges` schema extension)
- `360a2ee0` — `s3keys` per-bucket prefix helpers + isolation test (no behaviour change)
- `ad4de053` — failing regression test (would fail on current main)
- `d7c3589e` — DEL_PREFIX safety net implementation (test passes)
- `5f48cd72` — operator doc + partial design doc Outstanding section update

Self-review (5 lenses per CLAUDE.md)

- `DEL_PREFIX` is broadcast to every shard (`kv/sharded_coordinator.go:230`); per-shard scans return 0 keys in the common case (empty-probe already passed).
- `bucketDeleteOperationGroupElems` is shared between Admin and SigV4 paths so leader-direct and forwarded calls land identical ops.
- `AdminDeleteBucket` is operator-frequency, not data-plane. PutObject hot path unchanged (rejected Option A specifically because it would have taxed it).
- `TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives`: generation bumps on recreate, orphans (if any escape from a future regression) stay invisible.
- `TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes` (table-driven across all 6 prefixes), the generation-survival test, and the `s3keys` isolation test (every per-bucket prefix matches its own bucket+gen and rejects sibling buckets / other generations).

Test plan

- `go test -count=1 ./internal/s3keys/...` — passes
- `go test -count=1 -run TestS3Server_AdminDeleteBucket ./adapter/` — passes
- `go test -race -count=1 -run TestS3 ./adapter/` — passes (1.456s)
- `golangci-lint run ./adapter/... ./internal/s3keys/...` — 0 issues