
s3: AdminDeleteBucket DEL_PREFIX safety net (closes TOCTOU) #695

Merged
bootjp merged 6 commits into main from feat/admin-delete-bucket-toctou-fix on Apr 28, 2026
Conversation

bootjp (Owner) commented Apr 27, 2026

Summary

Closes the AdminDeleteBucket TOCTOU race that coderabbitai flagged on PR #669. The empty-probe (ScanAt at readTS) and the BucketMetaKey delete (commit at a later commitTS) were not atomic — a concurrent PutObject between the two would leave manifest, blob chunks, upload metadata, upload parts, GC entries, and route key all orphaned under a now-deleted bucket meta.

Fix

Append DEL_PREFIX ops to the same OperationGroup as the BucketMetaKey delete, covering every per-bucket key family in internal/s3keys:

  • ObjectManifestPrefixForBucket
  • UploadMetaPrefixForBucket (new)
  • UploadPartPrefixForBucket (new)
  • BlobPrefixForBucket (new)
  • GCUploadPrefixForBucket (new)
  • RoutePrefixForBucket (new)

All seven ops share a commitTS, so anything that snuck into any per-bucket prefix between readTS and commit is tombstoned at the same logical moment as the bucket meta. BucketGenerationKey is intentionally not deleted — re-creating the bucket bumps generation and orphans under the old generation prefix stay isolated.

The same shape lands on the SigV4 path (adapter/s3.go:deleteBucket) so both delete entrypoints share the race-window guarantees. Factored into one bucketDeleteOperationGroupElems helper.
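For concreteness, a minimal sketch of the single-OperationGroup shape this summary describes, using the op and helper names from this PR (helper signatures assumed as `(bucket, generation)`; note the review thread below revises this shape into a two-phase dispatch, so treat it as the proposed form, not the merged code):

```go
// Proposed single-group shape: one Del plus six DelPrefix elems, one commitTS.
// Sketch only; the merged code splits this after the Codex P1 finding below.
func bucketDeleteOperationGroupElems(bucket string, gen uint64) []*kv.Elem[kv.OP] {
	return []*kv.Elem[kv.OP]{
		{Op: kv.Del, Key: s3keys.BucketMetaKey(bucket)},
		{Op: kv.DelPrefix, Key: s3keys.ObjectManifestPrefixForBucket(bucket, gen)},
		{Op: kv.DelPrefix, Key: s3keys.UploadMetaPrefixForBucket(bucket, gen)},
		{Op: kv.DelPrefix, Key: s3keys.UploadPartPrefixForBucket(bucket, gen)},
		{Op: kv.DelPrefix, Key: s3keys.BlobPrefixForBucket(bucket, gen)},
		{Op: kv.DelPrefix, Key: s3keys.GCUploadPrefixForBucket(bucket, gen)},
		{Op: kv.DelPrefix, Key: s3keys.RoutePrefixForBucket(bucket, gen)},
	}
}
```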

Contract change for clients

A PutObject that returned 200 OK during the race window can have its data swept by the concurrent delete. This is operationally bounded — operators are advised to pause writes before bucket delete (now in docs/admin_deployment.md §4.6). The alternative was orphan objects that no API can enumerate or remove; the new contract is strictly cleaner.

Commit structure (per CLAUDE.md design-doc-first + test-first)

  1. 3b8ef475 — design doc proposing Option C (DEL_PREFIX safety net) over Options A (BucketGenerationKey bump on every PutObject) and B (OperationGroup.ReadRanges schema extension)
  2. 360a2ee0 — s3keys per-bucket prefix helpers + isolation test (no behaviour change)
  3. ad4de053 — failing regression test (would fail on current main)
  4. d7c3589e — DEL_PREFIX safety net implementation (test passes)
  5. 5f48cd72 — operator doc + partial design doc Outstanding section update

Self-review (5 lenses per CLAUDE.md)

  1. Data loss — the contract change is intentional and documented. Without the fix, data was lost in a worse way (orphaned, unreachable). With the fix, data loss is bounded to the race window and operator-controllable.
  2. Concurrency / distributed failures — DEL_PREFIX is broadcast to every shard (kv/sharded_coordinator.go:230); per-shard scans return 0 keys in the common case (empty-probe already passed). bucketDeleteOperationGroupElems is shared between Admin and SigV4 paths so leader-direct and forwarded calls land identical ops.
  3. Performance — DEL_PREFIX × 6 × N shards on each delete. AdminDeleteBucket is operator-frequency, not data-plane. PutObject hot path unchanged (rejected Option A specifically because it would have taxed it).
  4. Data consistency — BucketGenerationKey isolation pinned by TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives. Generation bumps on recreate, orphans (if any escape from a future regression) stay invisible.
  5. Test coverage — failing-then-passing regression test (TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes table-driven across all 6 prefixes), generation-survival test, and s3keys isolation test (every per-bucket prefix matches its own bucket+gen and rejects sibling buckets / other generations).

Test plan

  • go test -count=1 ./internal/s3keys/... — passes
  • go test -count=1 -run TestS3Server_AdminDeleteBucket ./adapter/ — passes
  • go test -race -count=1 -run TestS3 ./adapter/ — passes (1.456s)
  • golangci-lint run ./adapter/... ./internal/s3keys/... — 0 issues
  • No dedicated S3 Jepsen suite today; if added, this fix should be exercised under it

bootjp added 5 commits April 28, 2026 01:07

Pre-existing TOCTOU between the empty-probe (ScanAt at readTS) and
the BucketMetaKey delete commit, flagged by coderabbitai on PR #669
and tracked in the partial admin dashboard design doc Outstanding
section. This doc proposes adopting Option C (DEL_PREFIX safety
net) over Options A (BucketGenerationKey bump on every PutObject)
and B (OperationGroup.ReadRanges schema extension):

  - Option A imposes a permanent throughput tax on every PutObject
    (the hot path) to fix a race that fires only at admin-delete
    time.
  - Option B is the semantically purest fix but the schema +
    coordinator + FSM surface is large and blocks unrelated work.
  - Option C closes the orphan window with the smallest patch.
    Existing op type (pb.Op_DEL_PREFIX), no schema change,
    PutObject hot path unchanged. Trade-off: a contract change
    that a PutObject returning 200 OK during the race window can
    have its data swept by the racing AdminDeleteBucket. Bounded
    and matches what docs/admin_deployment.md already advises
    operators to avoid.

The doc also enumerates every per-bucket key family in
internal/s3keys (manifest / upload meta / upload part / blob / gc /
route) and the new *PrefixForBucket helpers each one needs.
BucketGenerationKey is intentionally not deleted so re-creating
the bucket bumps generation and orphans from the old generation
remain isolated.

Implementation lands in subsequent commits per CLAUDE.md design-
doc-first workflow.

ObjectManifestPrefixForBucket has existed since the s3 adapter was
introduced; the other five per-bucket key families never had a
helper because the existing call sites always knew the per-upload
or per-object prefix they needed. AdminDeleteBucket's DEL_PREFIX
safety net (design doc
2026_04_28_proposed_admin_delete_bucket_safety_net.md) needs to
sweep every per-bucket prefix in one OperationGroup, so every key
family gets a parallel helper:

  - UploadMetaPrefixForBucket
  - UploadPartPrefixForBucket
  - BlobPrefixForBucket
  - GCUploadPrefixForBucket
  - RoutePrefixForBucket

All six functions now share a single bucketScopedPrefix helper —
the encoding is identical (`<family-prefix><EncodeSegment(bucket)>
<appendU64(gen)>`) so factoring it out avoids drift if the per-key
constructors evolve.
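A sketch of that shared constructor under the stated encoding; EncodeSegment and appendU64 are the repo's existing helpers, with signatures assumed here:

```go
// bucketScopedPrefix builds <family-prefix><EncodeSegment(bucket)><appendU64(gen)>.
// Assumes EncodeSegment(string) []byte and appendU64([]byte, uint64) []byte.
func bucketScopedPrefix(familyPrefix []byte, bucket string, gen uint64) []byte {
	p := append([]byte(nil), familyPrefix...) // copy so family prefix constants stay untouched
	p = append(p, EncodeSegment(bucket)...)
	return appendU64(p, gen)
}
```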

Pinned the contract with TestPerBucketPrefixes_IsolateByBucketAndGeneration:
for each of the six families, every key under
(bucket=A, gen=N) matches the (A, N) prefix, while keys under
(B, N) and (A, M) do NOT. The (A, M) check is the load-bearing
assertion — if a future encoding change put generation before
bucket, the orphan-isolation property bucket re-creation relies on
would silently break and DEL_PREFIX would start nuking the new
bucket's data. The test fails at construction time before any
prod call site notices.

No behavior change in this commit — pure refactor + new helpers.
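Illustrative shape of the isolation assertion described above (not the repo's exact test body; assumes a testing.T in scope, bytes imported, and helper signatures of func(bucket string, gen uint64) []byte). Because generation is a fixed-width suffix, checking the prefixes against each other stands in for checking keys under them:

```go
for _, mk := range []func(string, uint64) []byte{
	s3keys.ObjectManifestPrefixForBucket,
	s3keys.UploadMetaPrefixForBucket,
	s3keys.UploadPartPrefixForBucket,
	s3keys.BlobPrefixForBucket,
	s3keys.GCUploadPrefixForBucket,
	s3keys.RoutePrefixForBucket,
} {
	pA1 := mk("bucket-a", 1)
	if bytes.HasPrefix(mk("bucket-a", 2), pA1) { // the load-bearing (A, M) check
		t.Fatal("generation isolation broken")
	}
	if bytes.HasPrefix(mk("bucket-b", 1), pA1) { // sibling-bucket check
		t.Fatal("bucket isolation broken")
	}
}
```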
Fails on current main: AdminDeleteBucket only deletes BucketMetaKey,
so any keys planted in the 5 non-manifest per-bucket prefixes
(upload_meta, upload_part, blob, gc_upload, route) survive the
delete and become orphaned under a now-deleted bucket meta.

Per CLAUDE.md "test the bug first" — this commit lands the
failing test; the next commit lands the DEL_PREFIX safety net
fix. Together they pin the regression so a future refactor that
strips the safety net trips the assertion.

The test plants directly into the store (via the local adapter
coordinator's Put) rather than racing two goroutines because:
  - the empty-probe scan only looks at the manifest prefix, so
    orphans in the other 5 prefixes are exactly what can leak
    through the race window without disturbing the probe
  - direct planting is deterministic; goroutine racing is not
  - the safety net's behavior is independent of the race —
    it must wipe every per-bucket prefix at delete time, full
    stop, and that's what the test asserts

A second test, TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives,
pins the orphan-isolation property the design relies
on: BucketGenerationKey must NOT be deleted, so re-creating the
bucket bumps generation and any pre-existing orphans under the
old generation prefix stay invisible to the new bucket.

Closes the orphan-object race coderabbitai flagged on PR #669
and analysed in
docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md.

The empty-probe (ScanAt at readTS) and the BucketMetaKey delete
(commit at a later commitTS) are not atomic; a concurrent
PutObject between the two would leave manifest, blob, upload
metadata, upload parts, GC entries, and route key all orphaned
under a now-deleted bucket meta. The OCC validator only inspects
keys appearing in OperationGroup.ReadKeys, so the existing
single-key read set could not catch range-level mutations.

The fix appends DEL_PREFIX ops to the same OperationGroup as
the BucketMetaKey delete:

  - ObjectManifestPrefixForBucket
  - UploadMetaPrefixForBucket
  - UploadPartPrefixForBucket
  - BlobPrefixForBucket
  - GCUploadPrefixForBucket
  - RoutePrefixForBucket

All seven ops share a commitTS, so anything that landed in any
per-bucket prefix between readTS and commit is tombstoned with
the bucket meta. BucketGenerationKey is intentionally NOT in
the list — re-creating the bucket bumps generation and orphans
under the old generation prefix stay isolated from the new
bucket. Pinned by
TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives.

The same shape is mirrored on the SigV4 path
(adapter/s3.go:deleteBucket) so both delete entrypoints share
the same race-window guarantees. Factored into one
bucketDeleteOperationGroupElems helper so a future per-bucket
key family added to either path covers both delete paths in
lockstep.

Contract change for clients: a PutObject that returned 200 OK
during the race window can have its data swept by the
concurrent delete. Operationally bounded — operators are
already advised to pause writes before AdminDeleteBucket. The
alternative was orphan objects that no API can enumerate or
remove; the new contract is strictly cleaner.

DEL_PREFIX is broadcast to every shard
(kv/sharded_coordinator.go), but per-shard scans return 0 keys
in the common case (empty-probe already passed), and bucket
delete is operator-frequency. The cost is acceptable.

Tests: previously failing
TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes
now passes (the 5 non-manifest prefixes are wiped). Existing
AdminDeleteBucket tests (HappyPath / MissingBucket / RejectsReadOnly /
RejectsNonEmpty) and the full TestS3 race suite continue
to pass.

go test -race ./adapter/ — passes.
golangci-lint run ./adapter/... ./internal/s3keys/... — 0 issues.

Two updates now that the DEL_PREFIX safety net has landed:

1. **docs/admin_deployment.md** — new §4.6 "Deleting an S3
   bucket" covering the operator-facing contract:
   - Standard "bucket must be empty" rule still applies (409
     BucketNotEmpty when not).
   - The race-window contract change is documented explicitly:
     a PutObject that returned 200 OK during the empty-probe →
     commit window of a concurrent AdminDeleteBucket can have
     its data swept. Calls out the two scenarios where this
     matters (planned vs emergency delete) and what operators
     should do in each case.
   - Confirms re-create-after-delete is safe because
     BucketGenerationKey survives — orphans from any previous
     escape stay isolated under the old generation prefix.
   - Cross-link to the new design doc.

2. **docs/design/2026_04_24_partial_admin_dashboard.md** —
   Outstanding open items section: TOCTOU entry struck through
   and replaced with the "fixed" summary plus a pointer to the
   safety-net design doc. Updated the closing rename trigger so
   only the rolling-upgrade flag (criterion 5) blocks promotion
   to `_implemented_` now.

No design changes; no behaviour changes. State refresh.
coderabbitai (Bot) commented Apr 27, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 47 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7b81c33-1861-4737-a62e-50be867e1da8

📥 Commits

Reviewing files that changed from the base of the PR and between 303ebe2 and 69c808c.

📒 Files selected for processing (9)
  • adapter/dynamodb_migration_test.go
  • adapter/s3.go
  • adapter/s3_admin.go
  • adapter/s3_admin_test.go
  • docs/admin_deployment.md
  • docs/design/2026_04_24_partial_admin_dashboard.md
  • docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md
  • internal/s3keys/keys.go
  • internal/s3keys/keys_test.go


bootjp (Owner, Author) commented Apr 27, 2026

@claude review

claude (Bot) commented Apr 27, 2026

Claude finished @bootjp's task in 3m 17s


PR Review: s3: AdminDeleteBucket DEL_PREFIX safety net (closes TOCTOU)

Todo List:

  • Read PR context and understand the changes
  • Review adapter/s3_admin.go — core fix implementation
  • Review adapter/s3.go — SigV4 path changes
  • Review internal/s3keys/keys.go — new prefix helpers
  • Review adapter/s3_admin_test.go — test coverage
  • Review internal/s3keys/keys_test.go — key isolation tests
  • Review design doc and admin deployment doc
  • Synthesize findings and post full review

Summary

The design intent is sound — appending DEL_PREFIX ops to the same OperationGroup as the BucketMetaKey delete is the right shape to close the TOCTOU window. The s3keys prefix helpers, isolation tests, and design doc are all well-structured. However, there is a critical implementation bug that causes the production dispatch to always fail, and the test coordinator masks it completely.


🔴 Critical: Mixed Del + DelPrefix in a transaction is rejected by the production coordinator

adapter/s3_admin.go:421-425 and adapter/s3.go:676-681

_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
    IsTxn:   true,
    StartTS: startTS,
    Elems:   bucketDeleteOperationGroupElems(name, meta.Generation),
})

bucketDeleteOperationGroupElems returns [Del, DelPrefix×6]. The production ShardedCoordinator.Dispatch (kv/sharded_coordinator.go:221-264) routes any group containing a DelPrefix element through dispatchDelPrefixBroadcast. That function has two guards that both reject this call:

// kv/sharded_coordinator.go:292-298
func (c *ShardedCoordinator) dispatchDelPrefixBroadcast(isTxn bool, elems []*Elem[OP]) (*CoordinateResponse, error) {
    if isTxn {
        return nil, errors.Wrap(ErrInvalidRequest, "DEL_PREFIX not supported in transactions")
    }
    if err := validateDelPrefixOnly(elems); err != nil {  // rejects the leading kv.Del
        return nil, err
    }
    ...

In production, AdminDeleteBucket and deleteBucket will always return ErrInvalidRequest at the dispatch step, silently breaking bucket deletion. Because ErrInvalidRequest is not in isRetryableS3MutationErr, the retry loop bails out immediately and the HTTP handler returns an internal error. Bucket metadata is never deleted, which is a regression from the previous behavior.

Why do the tests pass? The test coordinator (localAdapterCoordinator.applyElem, adapter/dynamodb_migration_test.go:69-83) is a stub that directly calls store.DeletePrefixAt for DelPrefix ops, bypassing all ShardedCoordinator validation. The test never touches the real dispatch path. The regression test (TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes) validates the desired post-condition against the stub but doesn't validate that the production coordinator can actually execute the operation.

Fix direction: The simplest correct fix is to split the commit into two separate dispatches:

  1. The txn dispatch (IsTxn: true) with only {Op: kv.Del, Key: BucketMetaKey(name)} — this is the current behavior and closes the bucket for new creates.
  2. A non-txn DEL_PREFIX broadcast for each per-bucket prefix — this sweeps orphans after the meta is gone.

This sacrifices the single-commitTS atomicity guarantee from the design doc (the two commits have different timestamps), but it's the only shape the current coordinator supports. The practical impact is a brief window where the meta is deleted but orphan data is not yet swept — operators who read the admin deployment doc's "pause writes" guidance won't be affected, and orphaned data from this window is bounded to a single request's worth. The design doc's Section 4 should be updated to reflect this implementation constraint.

Alternatively, extend dispatchDelPrefixBroadcast to support non-txn mixed Del+DelPrefix groups — but that is a coordinator-level change with broader blast radius.
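Sketched, the suggested two-dispatch split looks like this (field names follow the snippets quoted above; this is the reviewer's proposed shape, with surrounding error handling elided):

```go
// Phase 1: txn-protected meta delete. Del-only, so the txn path accepts it.
if _, err := s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
	IsTxn:   true,
	StartTS: startTS,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
	},
}); err != nil {
	return err
}

// Phase 2: non-txn, DelPrefix-only broadcast, the one shape
// dispatchDelPrefixBroadcast accepts. Commits at a later timestamp.
if _, err := s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.DelPrefix, Key: s3keys.ObjectManifestPrefixForBucket(name, gen)},
		// ... the remaining five per-bucket prefixes
	},
}); err != nil {
	return err // or log and continue; the follow-up commit chooses log-and-continue
}
```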



🟠 Test gap: stub coordinator masks production coordinator constraints

adapter/s3_admin_test.go:378-497 (the two new regression tests)

All adapter tests use localAdapterCoordinator (defined in adapter/dynamodb_migration_test.go:14-83), which is a direct-apply stub that neither validates operation types nor enforces the IsTxn + DelPrefix incompatibility. This is the root cause of the green test suite hiding the production failure above.

The existing test infrastructure should either:

  • Add a ShardedCoordinator-backed integration path for this specific test (even a minimal single-shard version), or
  • Add an explicit assertion in the test that the coordinator accepted the operation group without error and validate against the coordinator's actual constraints.

🟡 Design doc lifecycle inconsistency

docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md

Per CLAUDE.md: "rename *_proposed_*.md → *_partial_*.md once the first milestone ships." If the implementation is landing in this same PR (commits 360a2ee through d7c3589), the file should be renamed *_partial_*.md (or *_implemented_*.md if all milestones ship here) at merge time. The doc currently carries the Proposed status.

Also minor: the doc is dated 2026-04-28 but the branch appears to be merged on 2026-04-27.


✅ What's good

  • internal/s3keys/keys.go — The bucketScopedPrefix factoring is clean. All five new *PrefixForBucket helpers correctly encode bucket-segment + generation-u64, matching the structure of their per-key constructors.
  • internal/s3keys/keys_test.go:TestPerBucketPrefixes_IsolateByBucketAndGeneration — Solid isolation coverage: pinning that a prefix matches its own (bucket, gen) and rejects both sibling buckets and other generations. This is exactly the contract that matters for the orphan-isolation property.
  • TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives — Correct and important. Pinning that BucketGenerationKey survives the delete, and that a recreate lands at a strictly higher generation, is the right safety guarantee.
  • adapter/s3_admin.go:bucketDeleteOperationGroupElems — The helper function is the right abstraction. Sharing it between Admin and SigV4 paths ensures both delete entrypoints stay in sync if a future key family is added. The comment is well-reasoned.
  • docs/admin_deployment.md §4.6 — The "pause writes" guidance and the contract-change explanation are clear and accurate for what the fix is intended to do. This section should be retained when the implementation is corrected.
  • adapter/s3_admin.go comment block on AdminDeleteBucket — The race analysis, the BucketGenerationKey rationale, and the contract change callout are thorough and accurate.

Summary

The test-visible behavior is correct (the stub coordinator does the right thing), but in production both AdminDeleteBucket and deleteBucket will fail every time due to the coordinator rejecting the mixed Del + DelPrefix transaction. The fix needs to be restructured to two separate dispatches before merging.

gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request addresses a TOCTOU race condition in bucket deletion by introducing a safety net that uses DEL_PREFIX operations to tombstone all per-bucket key families during the deletion transaction. This prevents orphaned keys from being left behind if a concurrent write occurs. The changes include new key prefix helpers, updates to both the SigV4 and Admin delete paths, and extensive regression tests and documentation. I have no feedback to provide.

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f48cd7230

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on adapter/s3_admin.go (outdated)
Comment on lines +421 to +424

	_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
		IsTxn:   true,
		StartTS: startTS,
-		Elems: []*kv.Elem[kv.OP]{
-			{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
-		},
+		Elems:   bucketDeleteOperationGroupElems(name, meta.Generation),

P1: Split bucket delete into supported coordinator operation shapes

AdminDeleteBucket now dispatches IsTxn: true with bucketDeleteOperationGroupElems(...), which includes both kv.Del and kv.DelPrefix. That combination is rejected by real coordinators: kv/sharded_coordinator.go routes any request containing DelPrefix through dispatchDelPrefixBroadcast, which rejects transactional and mixed-operation requests, and the single-group txn path rejects Op_DEL_PREFIX intents in kv/fsm.go. In production this makes bucket deletion fail (admin and SigV4, since both paths share this helper) instead of applying the intended safety net; the new local tests miss it because newLocalAdapterCoordinator bypasses coordinator validation.


Codex P1 on PR #695 caught that the original single-OperationGroup
shape would be rejected by the production coordinator on every
bucket delete. The fix lands here together with a test-coordinator
tightening that catches this class of bug in unit tests.

Background:
  kv/sharded_coordinator.go:dispatchDelPrefixBroadcast rejects any
  OperationGroup that mixes DEL_PREFIX with other ops AND any
  OperationGroup that uses DEL_PREFIX inside a transaction. The
  earlier shape ([Del BucketMetaKey, DelPrefix×6] with IsTxn=true)
  hit both rejection rules, so production would have failed every
  bucket delete with ErrInvalidRequest. The local test coordinator
  bypassed those checks, so the regression test passed despite
  the production-breaking shape.

Fix is two-part:

1. **Split AdminDeleteBucket and s3.go:deleteBucket into two
   Dispatch calls**:

   Phase 1 — `Del BucketMetaKey` in a txn (OCC-protected against
             a concurrent AdminCreateBucket racing the delete).
             Lives inside retryS3Mutation so an OCC conflict
             retries the whole closure.
   Phase 2 — DEL_PREFIX broadcast over every per-bucket key
             family in a non-txn OperationGroup. Lives outside
             retryS3Mutation because Phase 1 is the point of no
             return: a Phase-2 retry would 404 at loadBucketMetaAt.

   Phase-2 failure is logged via slog.WarnContext and not
   propagated. The bucket meta is already gone from the operator's
   POV; orphan keys (if any) are no worse than the pre-fix state
   on main and can be recovered by a future sweep tool. Surfacing
   a 500 to the operator after a successful delete would be a
   worse UX.

   Phase-1-first ordering is deliberate: a Phase-2-first ordering
   could leave the bucket meta extant while per-bucket data was
   wiped if Phase 1 then failed (concurrent recreate). Phase-1-
   first localises any partial failure to "bucket gone, orphan
   data may persist", which has a clean audit trail.

2. **Tighten localAdapterCoordinator validation** to mirror the
   production coordinator's dispatch-time rejection rules:
   reject IsTxn=true with any DelPrefix, reject mixed
   Del+DelPrefix groups. Without this, a future regression that
   ships the rejected shape would silently pass tests while
   breaking production. The existing
   TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes
   test now exercises the production-realistic dispatch path and
   would have caught Codex P1 directly.

Refactor: split the single `bucketDeleteOperationGroupElems`
helper into `bucketDeleteSafetyNetElems` (DEL_PREFIX-only, used
by Phase 2) and a new `runBucketDeleteSafetyNet` method that
shares the dispatch-and-log logic between Admin and SigV4
paths. The Phase-1 Del op shape is small enough to inline at
each call site.
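A sketch of the shared Phase-2 helper as described here; the receiver type and exact logging fields are assumptions, while the helper names and log-and-swallow semantics come from this commit message:

```go
// runBucketDeleteSafetyNet dispatches the non-txn DEL_PREFIX broadcast.
// Phase 2 is best-effort by design: Phase 1 already committed.
func (s *S3Server) runBucketDeleteSafetyNet(ctx context.Context, bucket string, gen uint64) {
	_, err := s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
		Elems: bucketDeleteSafetyNetElems(bucket, gen), // DelPrefix-only, non-txn
	})
	if err != nil {
		// The bucket meta is already gone from the operator's POV; warn
		// instead of surfacing a 500 after a successful delete.
		slog.WarnContext(ctx, "bucket delete safety net failed",
			"bucket", bucket, "generation", gen, "error", err)
	}
}
```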

Design doc §6.2 rewritten with the two-phase rationale, the
Codex P1 finding, the Phase-2 failure semantics, and why Phase-1-
first ordering is correct.

Tests:
  go test -count=1 -run TestS3Server_AdminDeleteBucket ./adapter/ — passes
  go test -race -count=1 -run TestS3 ./adapter/ — passes (1.455s)
  golangci-lint run ./adapter/... — 0 issues
bootjp (Owner, Author) commented Apr 27, 2026

@claude review

Addressed the Codex P1 finding (single-OperationGroup shape rejected by production coordinator):

The original [Del BucketMetaKey, DelPrefix×6] with IsTxn=true would have hit two rejection rules in kv/sharded_coordinator.go:dispatchDelPrefixBroadcast — DelPrefix-in-txn AND mixed Del+DelPrefix. Production would have failed every bucket delete with ErrInvalidRequest. My local test coordinator bypassed those checks, which is exactly the gap Codex caught.

Two-part fix in commit 69c808cf:

  1. Split into two Dispatch calls:

    • Phase 1: Del BucketMetaKey in a txn (OCC-protected against concurrent AdminCreateBucket)
    • Phase 2: DEL_PREFIX broadcast (non-txn) over the 6 per-bucket prefixes
    • Phase 2 failure is logged via slog.WarnContext and not propagated — Phase 1 is the point of no return; a Phase-2 retry would 404 at loadBucketMetaAt. Orphan state is no worse than pre-fix.
    • Phase-1-first ordering is deliberate: Phase-2-first could leave bucket meta extant while data was wiped if Phase 1 then failed (concurrent recreate). Phase-1-first localises partial failure to a well-defined "bucket gone, orphans may persist" state.
  2. Tightened localAdapterCoordinator.validateDispatchShape to mirror the production rejection rules. Without this, the regression test passed despite the production-breaking dispatch shape — exactly the gap Codex called out. The existing TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes now exercises the production-realistic dispatch path.

Design doc §6.2 rewritten with the two-phase rationale, the Phase-2 failure semantics, why Phase-1-first ordering is correct, and the test-coordinator parity note.

go test -race passes. Lint 0 issues.

The single comment at adapter/s3_admin.go:424 references the now-removed bucketDeleteOperationGroupElems shape — the helper no longer exists; the fix splits the dispatch into two calls so neither rejection rule fires.

claude (Bot) commented Apr 27, 2026

Claude finished @bootjp's task in 1m 55s


PR Review: s3: AdminDeleteBucket DEL_PREFIX safety net (closes TOCTOU)

Todo List:

  • Read trigger comment to understand what was changed
  • Read adapter/s3_admin.go — two-phase dispatch implementation
  • Read adapter/s3.go — SigV4 path two-phase dispatch
  • Read adapter/dynamodb_migration_test.go — validateDispatchShape additions
  • Read kv/sharded_coordinator.go — verify production constraints
  • Read updated design doc §6.2
  • Synthesize findings and post review

Review of commit 69c808cf

The P1 fix is correct. Here's the full assessment.


✅ Two-phase dispatch — correct shape

adapter/s3_admin.go:439-448 and adapter/s3.go:681-690

Phase 1 dispatches {IsTxn:true, Elems:[Del BucketMetaKey]} — only Del ops, no DelPrefix, so neither production rejection rule fires. Phase 2 dispatches {Elems:[DelPrefix×6]} with no IsTxn — a pure DelPrefix-only non-txn broadcast, which is the only shape dispatchDelPrefixBroadcast accepts. Both shapes are confirmed valid against kv/sharded_coordinator.go:292-298.


✅ validateDispatchShape mirrors production correctly

adapter/dynamodb_migration_test.go:58-81

The two rejection rules are faithfully reproduced:

  • IsTxn && hasDelPrefix → "DEL_PREFIX not supported in transactions"
  • hasDelPrefix && hasOther → "DEL_PREFIX cannot be mixed with other operations"

This closes the gap the previous review flagged: a future regression that ships IsTxn:true + [Del, DelPrefix...] will now fail in tests before it reaches production.
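For reference, a faithful mirror of those two rules looks roughly like this (sketch only; the real validateDispatchShape lives in adapter/dynamodb_migration_test.go, and the error strings are copied from the rules above):

```go
func validateDispatchShape(g *kv.OperationGroup[kv.OP]) error {
	var hasDelPrefix, hasOther bool
	for _, e := range g.Elems {
		if e.Op == kv.DelPrefix {
			hasDelPrefix = true
		} else {
			hasOther = true
		}
	}
	if g.IsTxn && hasDelPrefix {
		return errors.New("DEL_PREFIX not supported in transactions")
	}
	if hasDelPrefix && hasOther {
		return errors.New("DEL_PREFIX cannot be mixed with other operations")
	}
	return nil
}
```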


✅ deletedGeneration capture is correct

var deletedGeneration uint64 is set inside the closure only after Phase 1 successfully commits (deletedGeneration = meta.Generation). The if err != nil { return err } guard after retryS3Mutation ensures Phase 2 is never called when Phase 1 failed, so the zero-value deletedGeneration = 0 is never passed to runBucketDeleteSafetyNet. The same pattern is correct in both AdminDeleteBucket and deleteBucket.
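The capture pattern being reviewed, sketched (closure and helper names from this PR; the retryS3Mutation call shape and surrounding handler details are assumptions):

```go
var deletedGeneration uint64
err := retryS3Mutation(ctx, func() error {
	meta, err := s.loadBucketMetaAt(ctx, name, readTS)
	if err != nil {
		return err
	}
	// ... empty probe + Phase-1 txn dispatch ...
	deletedGeneration = meta.Generation // set only after Phase 1 commits
	return nil
})
if err != nil {
	return err // Phase 2 never runs with the zero-value generation
}
s.runBucketDeleteSafetyNet(ctx, name, deletedGeneration) // Phase 2
```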


✅ Phase-1-first ordering is correct

Phase-2-first would wipe per-bucket data while the bucket meta still exists — if Phase 1 then failed (OCC from concurrent recreate), callers would see "bucket exists, data gone", which is incoherent. Phase-1-first bounds any partial failure to "bucket gone, orphans may persist", which has a clear audit trail in slog.WarnContext and is no worse than the pre-fix race condition.


✅ bucketDeleteSafetyNetElems + runBucketDeleteSafetyNet abstraction

One place to maintain the prefix list, one place to do the error-swallow-and-warn. Adding a new per-bucket key family in the future means editing exactly one function — exactly the right shape.


🟡 Design doc lifecycle — should be renamed (minor, same as previous review)

docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md

Per CLAUDE.md: "rename *_proposed_*.md → *_partial_*.md once the first milestone ships" and "rename *_partial_*.md → *_implemented_*.md once the final milestone ships." The implementation has fully landed — all milestones in §7 have shipped in this PR. The file should be renamed to *_implemented_*.md via git mv before merge so history follows it.



Summary

The critical issue from the previous review is fully resolved. The two-phase dispatch is structurally correct, the test coordinator now has production-realistic rejection rules, both delete paths share a single implementation, and the design doc accurately reflects the implementation. Only the doc lifecycle rename remains.

bootjp merged commit eb85c3d into main Apr 28, 2026
9 checks passed
bootjp deleted the feat/admin-delete-bucket-toctou-fix branch April 28, 2026 05:13
bootjp added a commit that referenced this pull request Apr 28, 2026
Per docs/design/README.md's lifecycle convention. The original
P1–P4 plan has fully shipped:

  - P1 (admin skeleton + Dynamo + AdminForward) — #634/#635/#644/#648
  - P2 (S3 endpoints incl. write paths and AdminForward integration)
        — #658 / #669 / #673 / #695 (TOCTOU safety net)
  - P3 (React SPA + embed) — #649 / #650
  - P4 (TLS / role / CSRF / operator doc / deployment runbook
        / scripts/rolling-update.sh admin support) — #674 / #669
        / #678

The AdminDeleteBucket TOCTOU caught during PR #669 review (the
last "in-flight" item that kept the doc at _partial_) is fully
resolved by the safety-net design landed in #695.

What changed:

  - git mv 2026_04_24_partial_admin_dashboard.md →
            2026_04_24_implemented_admin_dashboard.md (history
            follows the rename)
  - Header Status line: "Partial" → "Implemented", explanation
    updated to reflect the post-fix state and the rationale for
    promotion.
  - "Last updated" bumped to 2026-04-28 with the rename trigger.
  - Section heading "Outstanding open items" → "Out-of-scope
    follow-ups" — the remaining three entries (criterion 5,
    object browser, TLS hot-reload) are not in-flight work; they
    are deferred-at-design or Non-goal items. The TOCTOU bullet
    is removed (resolved) and replaced with a one-line
    cross-link to the safety-net design + admin_deployment.md
    §4.6 contract.
  - Removed the closing "rename trigger" sentence — we just did
    the rename.
  - Status table: P2 row now lists #695 alongside #658/#669/#673
    so a future reader can find the TOCTOU fix from the index.
  - Cross-references updated everywhere the old filename
    appeared:
      docs/admin.md (header link + Cross-references)
      docs/admin_deployment.md (header link + final cross-ref)
      docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md
        (Background section pointer)
      internal/admin/config.go (Section 7.1 reference comment)

No code changes other than the comment-only filename refresh in
config.go.
bootjp added a commit that referenced this pull request Apr 28, 2026
## Summary

Promote the admin dashboard design doc from `_partial_` →
`_implemented_` per `docs/design/README.md`'s lifecycle convention. PR
#695 landed the TOCTOU safety-net fix (the last in-flight item that kept
the doc at `_partial_`), so the original P1–P4 plan is now fully
shipped:

| Phase | Landed via |
|---|---|
| P1 (admin skeleton + Dynamo + AdminForward) | #634 / #635 / #644 / #648 |
| P2 (S3 endpoints + writes + AdminForward S3 + TOCTOU fix) | #658 / #669 / #673 / **#695** |
| P3 (React SPA + embed) | #649 / #650 |
| P4 (TLS / role / CSRF / operator doc / runbook / script wiring) | #674 / #669 / #678 |

The remaining three items in the doc move from "Outstanding open items"
(in-flight) to **"Out-of-scope follow-ups"** (deferred-at-design or
Non-goal):

- AdminForward criterion 5 — rolling-upgrade flag, deferred behind a
cluster-version bump that doesn't exist yet
- S3 object browser — Non-goal per §2.2
- TLS cert hot-reload — out of scope per `docs/admin.md`

## Changes

- `git mv` partial → implemented (history follows the rename)
- Status line / Last-updated / status-table / Out-of-scope section
content reflects the promotion
- Cross-references updated in all 4 referencing files: `docs/admin.md`,
`docs/admin_deployment.md`,
`docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md`,
`internal/admin/config.go` (comment-only)

## Test plan

- [x] No code changes other than a comment-only filename refresh in
`config.go`
- [x] `go build ./...` passes
- [x] `golangci-lint run ./internal/admin/...` — 0 issues
- [x] `grep -rn "2026_04_24_partial\|2026_04_24_proposed_admin"` returns
nothing — no stale references