Skip to content

docs: propose admin purge queue (DLQ-aware) for Web Console#757

Merged
bootjp merged 20 commits into
mainfrom
feat/admin-purge-queue-design
May 16, 2026
Merged

docs: propose admin purge queue (DLQ-aware) for Web Console#757
bootjp merged 20 commits into
mainfrom
feat/admin-purge-queue-design

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 15, 2026

Summary

  • Proposes AdminPurgeQueue admin RPC + DELETE /admin/api/v1/sqs/queues/{name}/messages HTTP endpoint + Web Console "Purge DLQ" button so operators can clear DLQ contents without dropping the queue.
  • AdminQueueSummary gains IsDLQ / DLQSources so the SPA labels the confirmation appropriately ("Purge DLQ" vs "Purge messages").
  • Reuses the existing purgeQueueWithRetry primitive — same generation-bump semantics as the SigV4 wire path, same 60-second meta-stored rate limit.

Per CLAUDE.md design-doc-first: this PR lands the doc; implementation ships in follow-up PRs phased per §7. No code in this PR.

Risk

Doc-only. No runtime change.

Five-lens self-review

  1. Data loss — n/a (doc only).
  2. Concurrency — n/a (doc only); §5 covers the concurrent-purge case (SigV4 vs admin path) and the leader-change-mid-purge case.
  3. Performance — n/a (doc only); §3.2 notes the IsDLQ reverse-scan cost and the deferred reverse-index path.
  4. Data consistency — n/a (doc only); §5 notes that IsDLQ staleness is a display concern, not a correctness concern.
  5. Test coverage — §6 enumerates unit / integration / frontend tests for the implementation PRs.

Test plan

  • doc renders correctly on GitHub
  • reviewers flag scope / design issues before implementation begins

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive design for admin console enhancements: a "Messages" view with cursor pagination and message preview, a DLQ-aware purge action with clear in-progress semantics, message export as JSON, improved path validation and error responses, authorization rules, observability/audit considerations, and a full test and rollout plan for safer queue inspection and draining.

Review Change Stack

User feature request: clear DLQ contents from the Web Console without
dropping the queue itself, so the source queue's RedrivePolicy ARN
reference stays valid through the operation.

Doc proposes:
- AdminPurgeQueue admin RPC (mirrors AdminDeleteQueue's auth/audit
  pattern, wraps the existing purgeQueueWithRetry primitive)
- DELETE /admin/api/v1/sqs/queues/{name}/messages HTTP endpoint
  (RESTful 'delete messages, not queue', distinct from the existing
  endpoint that deletes the queue itself)
- AdminQueueSummary.IsDLQ + DLQSources for DLQ-aware UI labeling
  (reverse-scan the catalog; reverse-index deferred to a follow-up
  if profiling shows the Describe path is hot)
- SPA button on SqsDetailPage with name-confirmation modal,
  labelled 'Purge messages' or 'Purge DLQ' depending on IsDLQ

Non-goals:
- Redrive (StartMessageMoveTask) — separate feature with different
  semantics (per-message dispatch, partial success)
- Filtered purge — AWS PurgeQueue is all-or-nothing and we mirror it
- Cross-queue bulk purge

The 60-second meta-stored rate limit applies uniformly to SigV4 and
admin paths since both wrap the same tryPurgeQueueOnce. The admin
wrapper maps the SQS PurgeQueueInProgress code to a structured
ErrAdminSQSPurgeInProgress sentinel rendered as HTTP 429 with a
Retry-After header.

Rollout: doc lands first (this PR), then backend RPC + sentinel
errors, then HTTP handler + bridge, then SPA button. Each phase
reviews small; the user-visible feature ships in the final phase.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 31 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ac4b3d5f-1ddd-4980-afb7-29d1660421de

📥 Commits

Reviewing files that changed from the base of the PR and between f0036de and 7f55fba.

📒 Files selected for processing (1)
  • docs/design/2026_05_16_proposed_admin_purge_queue.md
📝 Walkthrough

Walkthrough

Adds a design proposal for DLQ-aware SQS admin message peeking and generation-bump purging, new admin HTTP endpoints under /admin/api/v1/sqs/queues/{name}/messages, frontend Messages tab and client wiring, audit/metrics, auth rules, testing plan, and rollout guidance.

Changes

SQS Admin Purge Queue Feature Design

Layer / File(s) Summary
Feature goals and operator workflow
docs/design/2026_05_16_proposed_admin_purge_queue.md
Establishes feature scope, goals, explicit non-goals, and the end-to-end operator triage-to-clear workflow for DLQ-aware message administration.
AdminPeekQueue RPC contract
docs/design/2026_05_16_proposed_admin_purge_queue.md
Defines AdminPeekQueue: cursor-paginated, leader-only visibility-index scanning, FIFO partition pagination, cursor validation/capping, body truncation guarantees, and a dedicated admin-peek rate limiter with typed 429 outcomes.
AdminPurgeQueue RPC contract
docs/design/2026_05_16_proposed_admin_purge_queue.md
Defines AdminPurgeQueue: generation-bump all-or-nothing draining that preserves queue metadata and emits a 60-second purge-in-progress typed error carrying retry information.
Queue summary DLQ metadata
docs/design/2026_05_16_proposed_admin_purge_queue.md
Extends AdminQueueSummary with IsDLQ and DLQSources computed via a paginated reverse-scan over queue metadata and describes population/sorting rules.
HTTP endpoints, routing, and wire semantics
docs/design/2026_05_16_proposed_admin_purge_queue.md
Specifies /admin/api/v1/sqs/queues/{name}/messages GET (peek) and DELETE (purge), requires ServeHTTP routing restructure with path-segment validation, bridge/interface extensions, and exact JSON/error semantics (including next_cursor omission and 429/Retry-After behavior).
Frontend Messages tab and client integration
docs/design/2026_05_16_proposed_admin_purge_queue.md
Adds a Messages tab to SqsDetail.tsx with table preview/truncation, cursor pagination, detail modal with full attributes and Copy-as-JSON export, and client API methods including cancellation support for peek requests.
Audit logging and metrics
docs/design/2026_05_16_proposed_admin_purge_queue.md
Specifies structured admin.sqs.purge_queue audit fields with generation tracking and new Prometheus counters for admin peek/purge outcomes with constrained label cardinality.
Authorization, failure modes, testing, and rollout
docs/design/2026_05_16_proposed_admin_purge_queue.md
Documents authorization expectations (read-only can peek, purge requires elevated rights), failure-mode enumeration (leader change, concurrent purge, DLQ staleness, not-found), comprehensive test plan, phased rollout, and open questions.

🎯 3 (Moderate) | ⏱️ ~25 minutes


🐰 A design doc for queues so clear,
Peek and purge without fear,
DLQs now dance in the UI,
Operators triage, then say goodbye—
Admin magic, hop by hop! 🎪

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: a design proposal for implementing admin purge queue functionality with DLQ awareness for the Web Console.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/admin-purge-queue-design

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a design for an 'Admin Purge Queue' feature in the SQS Web Console, allowing operators to clear queue contents without deleting the queue itself. The design includes a new backend RPC, a RESTful HTTP endpoint, and a DLQ-aware UI. Feedback focused on performance concerns regarding the O(N) catalog scan for DLQ identification, specifically its impact on the AdminDescribeQueue path and the audit logging overhead for write operations. Additionally, a potential routing issue with trailing slashes in the new endpoint was identified.

}
```

`AdminDescribeQueue` populates `IsDLQ` / `DLQSources` by reverse-scanning the catalog: walk every queue's meta, parse its `RedrivePolicy` via `parseRedrivePolicy` (already in `adapter/sqs_redrive.go:43`), match the resolved DLQ target against `name`. The scan is bounded by the catalog cap that `scanApproxCounters` already enforces, so cost is `O(known queues)` per Describe call — the same order the existing counter scan pays.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The design mentions that the O(N) scan is "bounded by the catalog cap that scanApproxCounters already enforces". However, scanApproxCounters (as seen in adapter/sqs_admin.go:67) typically performs a prefix scan on the message keyspace of a single queue, which is efficient. \n\nIn contrast, populating IsDLQ by iterating over every queue's metadata could be expensive if implemented as N individual KV lookups. If the catalog metadata is stored under a common prefix (e.g., !sqs|catalog|meta|), a single prefix scan would be much more efficient. Consider specifying a prefix scan for this discovery or implementing the reverse-index optimization earlier to avoid performance degradation as the number of queues grows, especially since the SPA likely polls this endpoint.

| `DELETE` | `/admin/api/v1/sqs/queues/{name}` *(existing)* | Delete the queue itself |
| `DELETE` | `/admin/api/v1/sqs/queues/{name}/messages` *(new)*| Purge the queue's messages, keep meta |

The router in `internal/admin/sqs_handler.go` already path-prefixes on `pathSqsQueues = "/admin/api/v1/sqs/queues"`. The new endpoint is dispatched by suffix match: `strings.HasSuffix(trailing, "/messages")` after the `{name}` segment is parsed. A new handler `handlePurge` lives in `sqs_handler.go` parallel to `handleDelete`, calling `source.PurgeQueue(ctx, name)` against the `QueuesSource` interface (which is implemented by `sqsQueuesBridge.PurgeQueue` in `main_admin.go`, which itself wraps `AdminPurgeQueue`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The proposed routing logic using strings.HasSuffix(trailing, "/messages") may fail if the request URL contains a trailing slash (e.g., /admin/api/v1/sqs/queues/foo/messages/). Consider using path.Clean or explicitly handling the trailing slash to ensure the endpoint is robust.

Comment on lines +136 to +137
is_dlq=<bool>
dlq_sources=<csv>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Including is_dlq and dlq_sources in the audit log for AdminPurgeQueue implies that the purge operation itself must also perform the O(N) catalog scan to gather this metadata. This adds significant overhead to a write operation that is otherwise O(1) (just a generation bump).\n\nIf the audit log requires this information, it might be more efficient to have the AdminPurgeQueue RPC return the metadata it discovered, or reconsider if the full list of sources is necessary for the audit trail given the cost of acquisition.

Two updates in one commit per the user's request to extend the proposal:

(1) Gemini r1 medium x3:

- doc line ~90 (IsDLQ scan cost): clarify the reverse-scan is ONE
  Pebble prefix scan + O(N) in-memory deserialize, not N point
  lookups. Note that the cap (sqsQueueScanPageLimit, paged) is the
  same bound scanQueueNames already pays for AdminListQueues. The
  reverse-index optimisation is acknowledged in-line as the
  documented next step.
- doc line ~101 (trailing-slash routing): replace suffix match with
  path-segment match using path.Clean. Reject /messages/ (extra
  empty segment after Clean), /messages/foo (undefined
  sub-resource), and /Messages (case mismatch). Same shape Dynamo
  handler already uses for its sub-resources.
- doc line ~137 (audit-log overhead): drop is_dlq / dlq_sources
  from the audit line. Computing those would force every purge
  (O(1) generation bump) to pay the O(N) reverse-scan. DLQ
  context is observable separately via Describe at audit-review
  time. Explain the cost trade-off in-line so a future reader
  doesn't add the fields back without thinking.

(2) New scope from user follow-up: add Peek for queue contents.

Operators clearing a DLQ need to triage first (which messages
are stuck, are they still useful, are they safe to drop) before
purging. SigV4 ReceiveMessage is destructive (bumps receive
count, starts visibility timer, mints receipt handles) and using
it from the AWS CLI just to glance at DLQ contents can advance
messages toward purge eligibility under the very RedrivePolicy
that put them there.

Added to the proposal:

- AdminPeekQueue RPC: read-only, non-destructive sample of
  currently-visible messages. NO receive-count bump, NO visibility
  timer, NO receipt handles minted. Cursor-paginated (opaque
  base64 encoding-json of {Generation, LastKey}). Limit clamped
  to [1, 100], BodyMaxBytes clamped to [256, 65536]. Leader-only
  because the visibility index is leader-written and a follower
  read could see an index entry whose underlying message record
  was already deleted (apply-vs-lease-read race).
- AdminPeekedMessage struct: MessageID, Body (truncated),
  BodyTruncated, BodyOriginalSize, SentTimestamp, ReceiveCount,
  GroupID, DeduplicationID, Attributes.
- GET /admin/api/v1/sqs/queues/{name}/messages?limit=N&cursor=C
  HTTP endpoint, dispatched by method on the same /messages
  sub-resource that DELETE owns for purge.
- SPA Messages tab on SqsDetailPage: paginated table with row-
  click detail modal, page-size selector (20/50/100), Next/Prev
  cursor controls, 'Copy as JSON' on the detail modal so the
  Peek-and-export recovery workflow is one click per message.
- Per-queue read-side throttling integration: peek shares the
  ReceiveMessage budget so a tight peek loop cannot starve real
  consumers.

Non-goals updated:

- per-message delete from the Peek view (receipt-handle interplay
  diverges from AWS; future RFC if operators ask)
- in-flight messages in Peek (rented out to consumers; future PR
  with ?include_in_flight=true)

Test plan extended with peek-specific cases (Limit clamping,
cursor round-trip, BodyMaxBytes truncation, no-side-effects
verification, FIFO vs standard column rendering) and an
integration test that pins peek-after-purge returns zero rows.

Rollout phases re-numbered: 2=purge, 3=peek, 4=HTTP handlers
(both verbs land together since they share the path-segment
router), 5=SPA Messages tab + Purge button, 6=doc rename.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 4m 41s —— View job


Claude's Review

Tasks

  • Read design document in full
  • Review existing SQS adapter / admin code for context
  • Provide comprehensive design review

Overall Assessment

The design is well-structured and clearly motivated. Reusing purgeQueueWithRetry verbatim is the right call — it keeps the OCC/rate-limit semantics identical across the SigV4 and admin paths. The catalog reverse-scan for IsDLQ is a defensible first step and the deferred reverse-index path is correctly identified as the follow-up. The non-goals section is unusually good — it pre-empts all the obvious scope creep questions.

Gemini's points about the O(N) audit log scan are already resolved in §3.6 (lean audit log, DLQ context excluded from the write path). The trailing-slash routing concern is resolved by the path.Clean + segment comparison approach in §3.4. Those are clean calls.

There are several gaps that need resolution before implementation begins.


Design Gaps (should be addressed before Phase 2 starts)

1. Partitioned FIFO queues not addressed in AdminPeekQueue (§3.1)

The design says peek "walks the visibility index for the queue" but there are two visibility keyspaces:

  • Standard / non-partitioned: !sqs|msg|vis|<queue>|<gen>|<visible_at>|<msg_id>
  • Partitioned FIFO: !sqs|msg|vis|p|<queue>|<partition>|<gen>|<visible_at>|<msg_id>

receiveMessage already handles this with effectivePartitionCount fanout across partitions := meta.PartitionCount partitions (sqs_messages.go:885). Peek needs the same fanout. This is not a minor detail — the cursor format {Generation uint64, LastKey []byte} only encodes a single linear position. For a partitioned FIFO queue with N partitions, pagination requires either:

  • A cursor that encodes per-partition last-keys {Generation uint64, PartitionOffsets map[uint32][]byte}, or
  • Sequential partition scanning (scan partition 0 to exhaustion, then partition 1, etc.) with the cursor encoding the current partition and offset within it.

The simpler option (b) matches how receiveMessage handles the fanout and should be the implementation choice. The design should commit to one approach so Phase 3 doesn't re-litigate this.

2. Retry-After duration propagation gap (§3.2 + §3.4)

AdminPurgeQueue returns error. The HTTP handler needs a specific retry-after duration to populate the Retry-After header and the retry_after_seconds JSON field. The design says the header is "populated from the meta's LastPurgedAtMillis + 60s - now()" but the adapter-to-handler path for that duration is unspecified.

Options:

  • The bridge wraps ErrAdminSQSPurgeInProgress in a typed error carrying the time.Duration — the admin package defines ErrQueuesPurgeInProgress as a sentinel value that the bridge inspects, but an additional PurgeInProgressError struct carries the retry delay.
  • AdminPurgeQueue returns (remainingSeconds int64, error) so callers get the duration for free.
  • The handler makes a second AdminDescribeQueue call after a 429 to reconstruct the timing — wasteful.

Option 1 (typed error) matches how existing Go APIs propagate rate-limit metadata and keeps AdminPurgeQueue's signature clean. The bridge contract needs to be spelled out.

3. principalForRead is undefined in internal/admin/sqs_handler.go (§2.1 goal 8)

The design says "Peek is gated by principalForRead (same as List / Describe)". However, List and Describe in the current handler have no role gate beyond the session-auth middleware — any authenticated session can call them. principalForWrite exists but there is no principalForRead. If the intent is to add a live-role re-check for read access, this function needs to be defined. If the intent is that peek is gated only by session auth (same as List/Describe today), then the doc should say "no additional role gate" rather than implying a principalForRead call.

4. Audit log generation_before / generation_after sourcing (§3.6)

The audit line requires generation_before and generation_after. purgeQueueWithRetry currently returns only error. To populate the audit line, the implementation must choose one of:

  • (a) Add a read of loadQueueMetaAt in AdminPurgeQueue before calling purgeQueueWithRetry to capture generation_before — an extra Pebble round-trip that the SigV4 path doesn't pay.
  • (b) Modify purgeQueueWithRetry to return (oldGen, newGen uint64, error).
  • (c) Read the meta after the purge succeeds to get generation_after — racy if another purge starts in the 60-second window.

Option (b) is the cleanest — tryPurgeQueueOnce already knows lastGen and lastGen + 1. Changing purgeQueueWithRetry's signature is a small, contained change. The design should commit to this so the SigV4 path isn't inadvertently broken when purgeQueueWithRetry is modified.

5. ServeHTTP routing redesign scope (§3.4)

The current ServeHTTP in sqs_handler.go:140-148 extracts the full suffix after /queues/ as name:

name := strings.TrimPrefix(r.URL.Path, pathPrefixSqsQueues)
// For /admin/api/v1/sqs/queues/foo/messages → name = "foo/messages"
h.handleDescribe(w, r, name)  // currently receives "foo/messages"

Adding /messages sub-resources requires a complete restructure of ServeHTTP, not an extension. The existing handleDescribe and handleDelete must only receive bare queue names. The design describes the new routing correctly (path-segment match) but doesn't call out that the existing dispatch logic must be replaced. The implementation PR should restructure ServeHTTP into a three-level dispatch: collection root → named queue → named queue sub-resource.

6. ErrQueuesPurgeInProgress missing from admin package sentinel list (§3.4)

The design defines ErrAdminSQSPurgeInProgress for the adapter layer but doesn't define the corresponding admin-package sentinel or add it to writeQueuesError. The bridge needs a way to signal this to the handler, and writeQueuesError needs a 429 Too Many Requests branch with a Retry-After header. The admin-package sentinel list (ErrQueuesForbidden, ErrQueuesNotLeader, ErrQueuesNotFound, ErrQueuesValidation) needs ErrQueuesPurgeInProgress.


Minor Issues

7. BodyMaxBytes query parameter name convention

The API client spec shows bodyMaxBytes (camelCase JS) but the admin API's JSON field names use snake_case (next_cursor, body_truncated, etc.). The URL query parameter should be body_max_bytes to match the convention. Define the exact wire name in §3.4.

8. Peek throttle budget shared with ReceiveMessage

§3.1 says peek "shares the per-queue read-side budget the SigV4 ReceiveMessage path consults". Unlike ReceiveMessage, peek is idempotent and admin-operator–driven. An operator running a tight pagination loop (cursor through a 10k-message DLQ) could exhaust the per-queue budget and return 429 to real consumers. Since this is an admin-only surface, a separate per-queue admin-peek budget (or simply a lower per-request limit with a mandatory backoff signal) would avoid inadvertent consumer starvation. At minimum, document that this sharing is intentional and accepted so implementers don't change it thinking it's an oversight.

9. Multi-partition cursor size bound

If the Phase 3 implementation follows the partitioned-fanout approach for the cursor, the cursor's encoded size will grow with partition count. Set and validate a maximum cursor byte length (e.g., 512 bytes) to prevent clients from crafting pathologically large cursor inputs. ErrAdminSQSValidation should be returned on overflow.

10. "Copy as JSON" payload schema undefined

The detail modal's "Copy as JSON" action is mentioned in §3.5 but the payload schema isn't specified. Operators will export this to downstream systems. Define the schema in the design so the frontend and any operator tooling can agree on the shape before Phase 5.

11. Messages tab: visible vs. total message count

The Messages tab shows only visible_at ≤ now messages (correct — invisible messages are rented out to consumers). But the Counters.NotVisible field on the queue summary is available. The UX should show something like "Showing N visible messages (M currently in-flight)" so operators don't mistake an empty peek result on a busy queue for "the queue is empty". This is a Phase 5 detail but worth noting in §3.5 before implementation.


Test Plan Gaps (§6)

  • §6.2 covers FIFO GroupID/DeduplicationID but doesn't include a partitioned FIFO case (once the multi-partition cursor design is resolved). Add TestAdminPeekQueue_PartitionedFIFO_Pagination pinning that messages across all partitions are reachable through cursor pagination.
  • §6.1 doesn't include a test that verifies generation_before and generation_after in the audit line are correct after a retry (i.e., the values logged are from the committed OCC round, not a stale intermediate read).

Rollout Note

Phase ordering is correct — Purge before Peek makes sense because Purge reuses existing primitives while Peek introduces the new pagination/cursor infrastructure. The note that Phase 5 (SPA) is the user-visible gate is the right framing for the change-management risk.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bcf189e88b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

| `GET` | `/admin/api/v1/sqs/queues/{name}/messages?limit=N&cursor=C` *(new)* | Peek a sample of visible messages |
| `DELETE` | `/admin/api/v1/sqs/queues/{name}/messages` *(new)* | Purge all messages, keep meta intact |

Method-based dispatch on the same `/messages` sub-resource: `GET` → peek, `DELETE` → purge. The router in `internal/admin/sqs_handler.go` already path-prefixes on `pathSqsQueues = "/admin/api/v1/sqs/queues"`. The new endpoint is dispatched by **path-segment match, not suffix match**: after parsing the `{name}` segment, the handler splits the remaining trailing path into segments (using `path.Clean` first to normalise away leading / trailing / duplicated slashes), and accepts the path iff the trailing segments are exactly `["messages"]`. This rejects `/messages/` (one extra empty segment after Clean), `/messages/foo` (sub-resource we don't define), and `/Messages` (case-sensitive AWS naming) without relying on suffix heuristics. Two new handlers `handlePeek` (GET) and `handlePurge` (DELETE) live in `sqs_handler.go` parallel to `handleDelete`, calling `source.PeekQueue(ctx, name, opts)` / `source.PurgeQueue(ctx, name)` against the `QueuesSource` interface (implemented by `sqsQueuesBridge` in `main_admin.go`, which wraps `AdminPeekQueue` / `AdminPurgeQueue`). The path-segment matcher is the same shape DynamoHandler already uses for its sub-resource routing — keeping the two parallel reduces the cognitive load of operators who reason about both surfaces.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle trailing slash before cleaning message subresource path

The routing rule here says path.Clean is applied before segment matching and still claims /messages/ will be rejected, but path.Clean removes trailing slashes, so /messages/ normalizes to /messages and would be accepted by an exact ["messages"] segment check. That contradicts the stated contract and test plan (which expects /messages/ to return 404), so implementers following this text will likely ship behavior that differs from the documented API shape.

Useful? React with 👍 / 👎.

Claude r2 review on PR #757 listed 6 design gaps + 3 minor +
2 test-plan gaps. All are legitimate; doc updated.

Design gaps (commit-by-commit):

1. Partitioned FIFO in Peek (§3.1): two visibility keyspaces exist
   (standard prefix + 'p|' prefix). Peek now commits to sequential
   partition scanning (scan partition 0 to exhaustion, then 1, …)
   matching receiveMessage's dispatch. Cursor is a versioned JSON
   envelope {v=1, gen, partition, last_key} base64url-encoded with
   a 512-byte cap; validation returns ErrAdminSQSValidation on
   overflow or stale-generation cursors.

2. Retry-After propagation (§3.2): AdminPurgeQueue now returns
   (AdminPurgeResult, error). Rate-limit path returns
   *PurgeInProgressError{RetryAfter: time.Duration} that wraps the
   ErrAdminSQSPurgeInProgress sentinel via errors.Is so the handler
   can branch on the sentinel while the duration travels in the
   typed payload. Admin-package mirror struct rewraps it across
   the bridge so admin stays free of the adapter package.

3. principalForRead (§2.1 goal 8): the helper does not exist
   today. Doc now says peek uses the same session-auth gate that
   List/Describe use; no additional live-role re-check. Purge
   alone runs principalForWrite. Earlier wording implied a
   non-existent helper.

4. Audit log generations (§3.6): purgeQueueWithRetry's signature
   changes to (oldGen, newGen uint64, err error) so values come
   from the committed OCC round, not racy pre/post meta reads.
   Caller audit per the standing rule: one external caller
   (purgeQueue in sqs_purge.go) that currently ignores success
   values and will discard them via _, _, err := …. Spelled out
   so the SigV4 path is not silently broken when the signature
   changes.

5. ServeHTTP restructure (§3.4): the current dispatch hands the
   full suffix after /queues/ to handleDescribe/handleDelete as
   the queue name. Adding /messages sub-resources requires a
   three-level restructure: collection root → named queue → named
   queue sub-resource. Existing handlers receive bare {name}
   after the change. Doc calls this out so the implementation PR
   plans for it instead of stacking elif branches.

6. ErrQueuesPurgeInProgress + writeQueuesError 429 (§3.4): admin
   package gains the sentinel + PurgeInProgressError struct +
   429 branch with Retry-After header in writeQueuesError. JSON
   body gets retry_after_seconds field.

Minor:

7. body_max_bytes URL param uses snake_case (matches the rest of
   the admin API). TypeScript client field stays bodyMaxBytes
   (matches JS convention); client adapter does case translation
   at request boundary.

8. Peek throttle (§3.1): explicitly uses a distinct per-queue
   admin-peek bucket (adminPeekRPS=5, adminPeekBurst=20), NOT the
   ReceiveMessage budget. Operator paginating through a 10k-message
   DLQ would otherwise starve real consumers. Bucket key
   'admin_peek:<queue>' so it cannot collide with wire-side
   buckets.

9. Cursor size cap (§3.1): 512-byte cap on encoded cursor;
   pathological client-supplied cursors get ErrAdminSQSValidation.

10. Copy as JSON schema (§3.5): wrapper {schema_version:1, queue,
    exported_at, message:{…AdminPeekedMessage…}} so downstream
    tooling pins the version. Detail modal re-peeks at
    body_max_bytes=65536 if the row was truncated.

11. Messages tab UX (§3.5): tab header shows 'Showing N visible
    messages (M currently in-flight)' so empty peek on busy queue
    isn't mistaken for empty queue. N=peek result, M=counters.not_visible.

Test plan extended (§6):

- TestAdminPeekQueue_PartitionedFIFO_Pagination: PartitionCount=4
  with messages across all partitions; cursor pagination yields
  every message exactly once in (partition, visible_at) order.
- TestAdminPurgeQueue audit-generation test: arranges queue at
  gen N, runs purge, asserts result has GenerationBefore=N /
  After=N+1, runs concurrent same-tick purge to validate the
  rate-limit returns the same generation pair (i.e., values
  logged are the ones that actually landed, not stale
  intermediates).
- TestPurgeQueueSigV4_StillCompiles: exercises the SigV4 wire
  path through the modified purgeQueueWithRetry signature to
  prove the change does not regress.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 4m 18s —— View job


Claude's Review

Tasks

  • Read design document in full
  • Review existing SQS adapter / admin code for context
  • Provide comprehensive design review

Overall Assessment

This is a third pass (r3) triggered by @claude review. The design document has incorporated all findings from the r2 review cleanly: sequential partition scanning, separate admin-peek throttle bucket, PurgeInProgressError typed error, principalForRead clarification, generation values from committed OCC round, explicit ServeHTTP restructure note, ErrQueuesPurgeInProgress sentinel, body_max_bytes snake_case, "Copy as JSON" schema, in-flight count in tab header, and TestAdminPeekQueue_PartitionedFIFO_Pagination. The design is in good shape to proceed to Phase 2. The remaining gaps below are code-level details discovered by reading the actual implementation files.


Remaining Implementation Gaps

1. RetryAfter duration is computed inside tryPurgeQueueOnce but never threaded out

The actual code in sqs_purge.go:73-75:

if meta.LastPurgedAtMillis > 0 && now-meta.LastPurgedAtMillis < sqsPurgeRateLimitMillis {
    return false, newSQSAPIError(http.StatusBadRequest, sqsErrPurgeInProgress, "...")
}

The remaining-wait duration (sqsPurgeRateLimitMillis - (now - meta.LastPurgedAtMillis)) is computable here, but the current return discards it in an opaque sqsAPIError. The design says AdminPurgeQueue "extracts the remaining duration from the meta record (LastPurgedAtMillis + 60s - now())" — but AdminPurgeQueue does not have access to the meta at that point, and re-reading it after the rejection races a concurrent purge in the 60-second window (the exact problem the design flags for generation values).

The Phase 2 PR must resolve this explicitly. Cleanest option: change tryPurgeQueueOnce to return a new unexported struct purgeRateLimitedError{remaining time.Duration} (not sqsAPIError) when rate-limited. AdminPurgeQueue uses errors.As(err, &rle) to extract rle.remaining. The design currently describes isSQSAdminPurgeInProgress(err) as parallel to isSQSAdminQueueDoesNotExist (a bool helper), but this helper must also return the duration — define it as isSQSAdminPurgeInProgress(err) (bool, time.Duration) or use the typed-error pattern directly.

2. tryPurgeQueueOnce internal signature change is implicit

The design says purgeQueueWithRetry changes from error to (oldGen, newGen uint64, err error). The actual retry loop (sqs_purge.go:40-56) calls tryPurgeQueueOnce, which currently returns (bool, error). For the generation pair to flow out of purgeQueueWithRetry, tryPurgeQueueOnce must also be modified (to return (done bool, oldGen, newGen uint64, err error) or similar). The design says this is "one extra return" and it is — but the internal signature change should be spelled out explicitly since tryPurgeQueueOnce is the OCC commit point. The Phase 2 PR's change scope should confirm: only sqs_purge.go changes; the SigV4 caller purgeQueue at sqs_purge.go:33 changes from err := to _, _, err :=.

3. QueuesSource method naming drops the Admin prefix

All existing methods on QueuesSource (internal/admin/sqs_handler.go:62-66) are named AdminListQueues, AdminDescribeQueue, AdminDeleteQueue. The design's new interface methods are Peek / Purge (§3.4 bridge interface) — without the Admin prefix. Either add the prefix for consistency or document the intent. [Fix this →](https://claude.ai/code?q=In%20docs%2Fdesign%2F2026_05_16_proposed_admin_purge_queue.md%20%C2%A73.4%2C%20the%20QueuesSource%20bridge%20interface%20new%20methods%20are%20named%20Peek%20and%20Purge%20but%20existing%20methods%20are%20AdminListQueues%2C%20AdminDescribeQueue%2C%20AdminDeleteQueue.%20Rename%20to%20AdminPeekQueue%20and%20AdminPurgeQueue%20for%20consistency%2C%20or%20add%20a%20comment%20explaining%20the%20naming%20choice.&repo=bootjp%2Felastickv)

4. Admin-peek throttle bucket: incarnation keying not specified

The design introduces a new per-queue admin-peek bucket keyed by adminPeekKey(queue) = "admin_peek:" + queue. The existing bucketStore uses bucketKey{queue, action, incarnation} (from sqs_throttle.go), where incarnation ensures a DeleteQueue + CreateQueue sequence resets the bucket to full capacity. The design does not say whether the admin-peek bucket also uses incarnation keying, or a simpler per-queue map. If it omits incarnation, a freshly re-created queue inherits the old bucket's state. If it uses incarnation, the bucket is naturally evicted by the existing invalidateQueue() sweep (which ranges all entries matching the queue name). Specify which approach Phase 3 should follow.


Minor Notes

5. writeQueuesError uses stdlib errors; new sentinel should match

sqs_handler.go:5 imports "errors" (stdlib), not "github.com/cockroachdb/errors". The design shows ErrQueuesPurgeInProgress = errors.New(...) without specifying which package — it should be stdlib errors.New, consistent with the four existing sentinels in that file (ErrQueuesForbidden, ErrQueuesNotLeader, ErrQueuesNotFound, ErrQueuesValidation, all defined with stdlib errors.New). The wrapcheck linter will catch any cross-package boundary issues at implementation time.

6. path.Clean behavior on /messages/ — Codex finding clarified

Codex flagged that path.Clean("/messages/") normalizes to /"/messages" and would match the segment check. The design says this is intentional: after path.Clean, trailing slashes are removed, so /messages/"messages" → matched by ["messages"]. The design's contract (§3.4) says trailing slash DOES match; the test case "/messages/" in §6.4 asserts 404 based on the request path before path.Clean. These are contradictory. Pick one and document it explicitly: either path.Clean is applied (trailing slash works) or it is not (trailing slash is rejected with 404). The implementation PR must resolve this unambiguity so the test plan and the behavior align. Fix this →`

7. IsDLQ reverse-scan in AdminDescribeQueue vs. AdminPurgeQueue audit

The design correctly removed IsDLQ / DLQSources from the AdminPurgeQueue audit line to avoid adding the O(N) scan to a write path (§3.6 addresses Gemini r1 cleanly). But AdminDescribeQueue now pays the O(N) scan on every call, and the SPA's Messages tab may call Describe to refresh the DLQ chip list after purge. If the SPA calls Describe immediately after purge, this is two back-to-back scan operations. Consider caching the DLQ reverse-scan result in the HTTP response with Cache-Control: max-age=30 for Describe, since DLQ topology changes infrequently. Not blocking, just worth noting before Phase 4 wires the SPA polling.


Confirmed Correct (from code review)

  • The existing purgeQueue SigV4 caller (sqs_purge.go:33) uses if err := s.purgeQueueWithRetry(...) — changing to (oldGen, newGen uint64, err error) only requires _, _, err := at that single call site. ✓
  • The ServeHTTP restructure need is real: line 141 does name := strings.TrimPrefix(r.URL.Path, pathPrefixSqsQueues), so GET /queues/foo/messages would call handleDescribe(w, r, "foo/messages") today and return 404 (queue not found). Phase 4 must restructure before the sub-resource route can be dispatched correctly. ✓
  • Generation and incarnation are distinct (sqs_throttle.go confirms bucketKey.incarnation — PurgeQueue bumps only Generation, not Incarnation). AdminPurgeQueue inherits this correctly by reusing tryPurgeQueueOnce. ✓
  • The writeQueuesError function has no existing 429 branch — confirmed that ErrQueuesPurgeInProgress is additive, not a rename of anything. ✓

Summary

Design is implementation-ready for Phase 2 (Purge) once finding #1 (RetryAfter duration propagation) is resolved in the doc. Findings #2#4 should be addressed before Phase 3 (Peek) begins. Finding #6 (trailing slash contradiction) must be resolved before Phase 4 (HTTP handlers).

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 045cc403e5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Below the table: a page-size selector (20 / 50 / 100), a Refresh button, and Next / Previous controls driven by the cursor. Detail modal shows full body + every attribute + the timestamps; a "Copy as JSON" button copies the row's full record to the clipboard for manual export.

**Copy as JSON payload schema.** The clipboard payload is the exact wire shape of a single `AdminPeekedMessage` entry (top-level keys: `message_id`, `body`, `body_truncated`, `body_original_size`, `sent_timestamp`, `receive_count`, `group_id`, `deduplication_id`, `attributes`) plus a wrapper `{"schema_version": 1, "queue": "<name>", "exported_at": "<ISO8601>", "message": { … }}`. The `schema_version` is what downstream tooling pins so a future change to the export format (e.g. multi-message JSONL bundle) does not silently break exporters. Operator workflows that pipe this into a recovery tool can rely on the schema not shifting under them. The full-body modal shows the FULL body — not the truncated one — by issuing a single-message peek with `body_max_bytes=65536` if the row's `body_truncated` was true on the list response.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop claiming the modal can always show full message bodies

The proposal says the modal can fetch the FULL body by calling peek with body_max_bytes=65536, but this same doc caps BodyMaxBytes at 65,536 while SQS queue message size in this repo is up to 262,144 bytes (adapter/sqs_catalog.go). For any message larger than 64 KiB, operators would still see truncated content in the modal and could purge without actually seeing the full payload, which breaks the triage guarantee this feature is meant to provide.

Useful? React with 👍 / 👎.

Comment on lines +378 to +379
outcome ∈ {"ok", "forbidden", "not_leader", "not_found", "internal_error"}
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add a throttled outcome for admin peek metrics

This outcome set omits a throttling state even though §3.1 introduces a dedicated admin-peek bucket that returns 429 on exhaustion. If implementers follow this enum literally, throttled peek requests have no correct label and will be either dropped or misclassified as internal_error, which makes the new metric unable to distinguish real backend faults from expected rate limiting.

Useful? React with 👍 / 👎.

…757)

Two Codex r3 findings on the round-3 doc:

1. P1 (line 329, Copy-as-JSON full body): the doc claimed the detail
   modal could fetch the full body by re-peeking with
   body_max_bytes=65536, but SQS stored-message size caps at 262144
   (= 256 KiB, sqsMaximumAllowedMaximumMessageSize in
   adapter/sqs_catalog.go:38). The 64-KiB ceiling on BodyMaxBytes
   broke the triage guarantee for messages between 64 KiB and
   256 KiB — operators would still see truncated payloads in the
   modal and could purge without knowing what they were dropping.

   Fix: raise the BodyMaxBytes upper clamp to
   sqsMaximumAllowedMaximumMessageSize (= 262144). The detail modal
   passes body_max_bytes=262144 on the single-message re-peek;
   guaranteed full body for any legitimately stored message.

2. P2 (line 379, peek metric outcomes): the
   elastickv_sqs_admin_peek_queue_total enum omitted a 'throttled'
   outcome even though §3.1 defines a dedicated admin-peek bucket
   that returns 429 on exhaustion. Implementers following the enum
   literally would have classified expected rate-limit rejections
   as internal_error, making the metric unable to distinguish
   backend faults from operator pagination loops.

   Fix: add 'throttled' to the outcome enum. Doc explains why peek
   needs the label but not 'purge_in_progress' (the 60s rate-limit
   is purge-specific; peek is read-only and doesn't share that
   state).

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 5m 54s —— View job


Tasks

  • Read design document in full (current state after r3 fixes)
  • Check what was addressed since r3 (commits 045cc40 + 55db4de)
  • Review remaining gaps from r3 + new Codex P1/P2 findings
  • Provide comprehensive r4 review

Claude's Review — r4

Addressed since r3

  • ✅ Codex P1 (body cap): BodyMaxBytes ceiling raised from 64 KiB to sqsMaximumAllowedMaximumMessageSize (262 144 bytes). Both §3.1 and §3.5 now consistently reference 262 144, and adapter/sqs_catalog.go:38 confirms sqsMaximumAllowedMaximumMessageSize = 262144. The modal's triage guarantee is restored.
  • ✅ Codex P2 (throttled metric): throttled outcome added to elastickv_sqs_admin_peek_queue_total; the note explaining why it isn't on the purge counter is correct.

Both Codex r3 findings are cleanly resolved. Four findings from Claude r3 remain open, plus one new finding surfaced by reading the actual implementation code.


Still Open from r3

1. RetryAfter duration propagation — contradictory phrasing (§3.2)

The document now says two things that cannot both be true:

"extracts the remaining duration from the meta record (LastPurgedAtMillis + 60s - now())"
"The duration is captured once at the rejection point so the handler does not need to recompute it."

Reading adapter/sqs_purge.go:63–76, tryPurgeQueueOnce has access to meta.LastPurgedAtMillis and now at the moment it raises the rate-limit error (line 73). After the function returns, neither value travels with the returned sqsAPIError. So "from the meta record" implies a second loadQueueMetaAt call by the admin wrapper — the race that r3 flagged (a concurrent purge in the 60-second window can reset LastPurgedAtMillis between the rejection and the follow-up read). "At the rejection point" implies the duration is carried in the error itself, which is the correct approach.

The implementation will follow one path; the doc needs to commit to which one. The correct path — and the one that avoids the race — is: tryPurgeQueueOnce computes remaining = sqsPurgeRateLimitMillis - (now - meta.LastPurgedAtMillis) at line 73 and returns a typed internal error (e.g. purgeRateLimitedError{remaining: time.Duration(remaining) * time.Millisecond} rather than sqsAPIError). The admin wrapper uses errors.As to extract the duration. Remove the phrase "from the meta record" from §3.2 and replace it with an explicit description of the typed-error path.

3. QueuesSource interface naming — doc disagrees with actual code (§3.4)

The design shows the bridge interface with "existing" methods List, Describe, Delete. The actual interface in internal/admin/sqs_handler.go:62–65 is:

type QueuesSource interface {
    AdminListQueues(ctx context.Context) ([]string, error)
    AdminDescribeQueue(ctx context.Context, name string) (*QueueSummary, bool, error)
    AdminDeleteQueue(ctx context.Context, principal AuthPrincipal, name string) error
}

If Phase 4 adds new methods as Peek and Purge (short names), the interface becomes inconsistent — three methods with the Admin prefix and two without. The doc must commit to one of:

  • (a) New methods are AdminPeekQueue and AdminPurgeQueue — no interface refactor, consistent with existing naming.
  • (b) Phase 4 also renames the existing three methods to List, Describe, Delete as part of the restructure — valid but must be explicitly called out as a refactoring step so the implementation PR's scope is clear.

Fix this →

4. Admin-peek throttle incarnation keying — key format conflicts with bucketStore (§3.1)

The doc says the admin-peek bucket is "referenced by name from adminPeekKey(queue) = "admin_peek:" + queue". But the actual bucketStore (adapter/sqs_throttle.go) uses a struct key bucketKey{queue, action, incarnation} — not a plain string. The two formats are incompatible.

If the admin-peek bucket uses the existing bucketStore.charge() (the right call for consistency), the invocation would be:

s.throttle.charge(adminPeekThrottle, queueName, "admin_peek", meta.Incarnation, 1)

This participates in incarnation keying automatically — re-creating the queue resets the bucket via the existing invalidateQueue() sweep. The adminPeekKey string the doc describes is just a conceptual label, not the actual map key.

If the admin-peek bucket is a separate, simpler mechanism (a dedicated sync.Map[string]*tokenBucket]), the doc needs to describe how incarnation resets are handled so a re-created queue doesn't inherit the old bucket's state.

Pick one and specify it explicitly. The bucketStore path is the right default — re-using the existing infrastructure avoids a parallel rate-limiter that won't be reset on invalidateQueue().

6. Trailing slash contradiction — still unresolved (§3.4 + §6.4)

The doc says both:

  1. "Path segments are extracted with path.Clean first to normalise away … trailing … slashes, then split on /."
  2. Sub-resource path ["messages", ""] from trailing slash returns 404.
  3. Test plan §6.4: DELETE /admin/api/v1/sqs/queues/orders/messages/ returns 404.

After path.Clean("/admin/api/v1/sqs/queues/orders/messages/")"/admin/api/v1/sqs/queues/orders/messages". Splitting and trimming the queues/ prefix gives ["orders", "messages"] — which matches the messages sub-resource and should succeed (204), not return 404. The ["messages", ""] entry in point (2) can only occur if path.Clean is applied before trimming, which strips the trailing slash entirely and prevents the empty segment from ever appearing.

The contradiction: path.Clean makes trailing-slash requests succeed; the test plan says they should fail. Commit to one:

  • (a) Apply path.Clean → trailing slash is accepted. Update the comment from ["messages", ""] to note that it never occurs. Update §6.4 to assert 204, not 404.
  • (b) Reject trailing slashes explicitly with an early check before path.Cleanstrings.HasSuffix(r.URL.Path, "/") → 404. Update §3.4 to describe the pre-Clean check. The path.Clean step then only normalises double-slashes and . / ...

Fix this →


New Finding

12. "Single-message re-peek" mechanism is undefined (§3.5)

The detail modal is described as triggering "a single-message re-peek with body_max_bytes=262144" when body_truncated=true. But AdminPeekQueue has no API to fetch a specific message by ID — it walks the visibility index from a cursor position. A targeted fetch of message X requires knowing the cursor position just before X, which the SPA doesn't have after the initial list response.

The most practical interpretation is that the SPA re-issues the current page's request (same cursor, same limit) with body_max_bytes=262144. For a page of 100 messages each with 256 KiB bodies, this is a 25 MiB response — potentially large. The doc should specify the actual mechanism:

  • (a) Re-peek the current page with body_max_bytes=262144. The SPA finds the message by ID in the response. Acceptable for default page size (20 messages × 256 KiB = 5 MiB) but needs a note about the upper bound at limit=100.
  • (b) Cache the full message payload client-side on initial peek with body_max_bytes=262144 for all messages on the list page (the list view then only renders 4 KiB; the modal uses the already-fetched full body). Simpler UX, higher initial fetch cost.
  • (c) Add a message_id filter parameter to AdminPeekQueue so the SPA can request exactly one message — a targeted fetch. This is the cleanest from a protocol standpoint but adds a new optional parameter to §3.1.

Define the mechanism before Phase 5 so the frontend and backend implementations agree on the full-body fetch contract.


Overall Assessment

The two Codex findings (body cap, throttled metric) are cleanly resolved. The four open r3 findings (#1, #3, #4, #6) are all clarification/precision issues; none blocks Phase 2 (Purge) which doesn't touch Peek routing or the throttle bucket design. Phase 2 can proceed once finding #1 (RetryAfter propagation) is resolved in the doc — it's the only one that directly affects the Phase 2 implementation code. Findings #3, #4, and #6 must be resolved before Phase 3 (Peek) and Phase 4 (HTTP handlers) respectively. Finding #12 must be resolved before Phase 5 (SPA).

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55db4de62c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
```

`AdminDescribeQueue` populates `IsDLQ` / `DLQSources` by reverse-scanning the catalog. Implementation: a single `ScanAt` over the queue-meta prefix (`SqsQueueMetaPrefix`, the prefix `scanQueueNames` already walks) returns all queue meta records in one storage round-trip; the handler then unmarshals each, runs `parseRedrivePolicy` (already in `adapter/sqs_redrive.go:43`) against the queue's `RedrivePolicy` attribute, and matches the resolved DLQ target against `name`. Cost is **one Pebble prefix scan + O(N) unmarshal-and-string-compare in memory** — not O(N) point lookups. The catalog cap (`sqsQueueScanPageLimit`, paged) already bounds the scan exactly as `scanQueueNames` does today, so the worst-case load is the same one the existing admin list endpoint pays per request.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Paginate DLQ source scan instead of single ScanAt

The design says AdminDescribeQueue can detect DLQ sources with a single ScanAt over SqsQueueMetaPrefix, but that will silently miss sources once the catalog exceeds one page (sqsQueueScanPageLimit is 1024 in adapter/sqs_catalog.go). In deployments with >1024 queues, IsDLQ/DLQSources can be false or incomplete for queues whose source links live past the first page, which would mislabel the purge UI and weaken the DLQ-specific warning this feature is meant to enforce.

Useful? React with 👍 / 👎.

Codex P1 round 4: the previous wording ("a single ScanAt ... returns
all queue meta records in one storage round-trip") would silently miss
DLQ source links in deployments with >1024 queues because
sqsQueueScanPageLimit = 1024 caps each ScanAt page. Queues whose
RedrivePolicy lived past the first page would be invisible to the
IsDLQ / DLQSources detection — the purge UI would render the wrong
framing (no "this is a DLQ" warning, no source-chip list) for the
exact deployments that need the warning most.

Rewrite the §3.3 paragraph to describe the actual reuse: the same
paginated loop scanQueueNames runs today (cursor + page-limit +
prefix-bound-exit). Make the pagination contract explicit so the
implementer cannot ship the bug Codex caught:

  - "paginated prefix scan ... reusing the exact loop shape from
    scanQueueNames"
  - cite the loop exit conditions (empty page / partial page /
    cursor past prefix end)
  - cost stated as ceil(N/1024) round-trips + O(N) in-memory work,
    not "one round-trip"
  - call out the bug-shape that the wording fixes so a future
    refactor doesn't regress to the single-ScanAt form

No code changes — proposal-only PR.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

Round 4 — Codex P1 (line 240): doc claimed "single ScanAt ... returns all queue meta records in one storage round-trip" but sqsQueueScanPageLimit = 1024 caps each page, so deployments with >1024 queues would silently miss DLQ source links past page 1.

Rewrote §3.3 to describe the actual reuse: the same paginated loop scanQueueNames runs today (adapter/sqs_catalog.go:782) — cursor + page-limit + prefix-bound exit. Pagination contract is now explicit so an implementer following the doc literally cannot ship the bug Codex caught.

Cost restated as ceil(N/1024) round-trips + O(N) in-memory work (not "one round-trip"). The bug-shape rationale is in the prose so a future refactor doesn't regress to the single-ScanAt form.

Proposal-only — no code changes.

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 3m 51s —— View job


Tasks


Claude's Review — r5

Addressed since r4

Codex P1 (paginated DLQ source scan, §3.3) — Clean fix. The doc now correctly describes the paginated loop matching scanQueueNamesAt (sqs_catalog.go:1234): each iteration calls ScanAt(..., sqsQueueScanPageLimit, ...), checks len(kvs) < sqsQueueScanPageLimit, and advances via nextScanCursorAfter. Cost correctly restated as ceil(N/1024) round-trips. The bug-shape rationale is clearly documented so a future refactor cannot regress to the single-ScanAt form. ✓


Still Open from r3/r4 — Resolution Required

The five findings flagged in r3 and confirmed still-open in r4 are all still present in this revision. None of the five findings requires new code archaeology — they are precision gaps in the doc text that must be committed to before implementation begins.

1. RetryAfter duration sourcing — two contradictory sentences remain (§3.2)

The paragraph on PurgeQueueInProgress mapping still contains:

"extracts the remaining duration from the meta record (LastPurgedAtMillis + 60s - now())"

"The duration is captured once at the rejection point so the handler does not need to recompute it."

These two phrases cannot both be true. Reading sqs_purge.go:63–76, tryPurgeQueueOnce has meta.LastPurgedAtMillis and now (= time.Now().UnixMilli()) available at line 73 where the rate-limit check fires. The remaining duration is sqsPurgeRateLimitMillis - (now - meta.LastPurgedAtMillis), computable there without any additional Pebble read. "From the meta record" implies the admin wrapper makes a separate loadQueueMetaAt call after seeing the rejection — that races a concurrent purge in the 60-second window.

The correct mechanism is "at the rejection point": tryPurgeQueueOnce returns a purgeRateLimitedError{remaining: time.Duration(remaining) * time.Millisecond} instead of an sqsAPIError. Remove "from the meta record" and replace it with the explicit description of this typed-error path. The PurgeInProgressError struct design (§3.2) is already correct; only the sourcing sentence is wrong. This is the only finding that blocks Phase 2.

Fix this →

3. QueuesSource interface naming (§3.4)

The bridge interface in the doc still shows:

// existing
List(ctx context.Context) ([]string, error)
Describe(ctx context.Context, name string) (*QueueSummary, bool, error)
Delete(ctx context.Context, p AuthPrincipal, name string) error
// new
Peek(...)
Purge(...)

The actual interface in internal/admin/sqs_handler.go:62–65 is AdminListQueues, AdminDescribeQueue, AdminDeleteQueue. The doc's "existing" block disagrees with the code. Pick one and commit:

  • (a) New methods are AdminPeekQueue, AdminPurgeQueue — no refactor, existing naming wins.
  • (b) Phase 4 renames all five to the short form — valid but the doc must explicitly call this out as a refactoring step.

Blocks Phase 4.

Fix this →

4. Admin-peek throttle bucket API incompatibility (§3.1)

The doc describes the admin-peek bucket as "referenced by name from adminPeekKey(queue) = "admin_peek:" + queue". The actual bucketStore.charge() signature (adapter/sqs_throttle.go:182) is:

func (b *bucketStore) charge(cfg *sqsQueueThrottle, queue, action string, incarnation uint64, count int) chargeOutcome

The key is a bucketKey{queue, action, incarnation} struct (sqs_throttle.go:98–101), not a plain string. The adminPeekKey string the doc describes is incompatible with the actual API. If the admin-peek bucket reuses bucketStore.charge() (the right default — it participates in invalidateQueue() sweeps so a recreated queue resets the bucket), the invocation would be:

s.throttle.charge(adminPeekThrottle, queueName, "admin_peek", meta.Incarnation, 1)

The adminPeekKey(queue) string is then just a conceptual label for the action field, not a separate map key. If a separate sync.Map is intended instead, the doc must explain how invalidateQueue() resets it on queue recreation. Pick one and specify it. Blocks Phase 3.

6. Trailing slash contradiction (§3.4 + §6.4)

The doc says both:

  1. "Path segments are extracted with path.Clean first to normalise away leading / trailing / duplicated slashes…"
  2. "["messages", ""] from trailing slash) returns 404"
  3. §6.4: "/messages/ (trailing slash) … return[s] 404"

After path.Clean strips the trailing slash, splitting produces ["orders", "messages"] — which matches the sub-resource and returns 204, not 404. The ["messages", ""] pattern (with the empty segment) only arises if path.Clean is not applied. Commit to one:

  • (a) Apply path.Clean → trailing slash succeeds. Update §3.4 comment to remove ["messages", ""] and update §6.4 test to assert 204.
  • (b) Reject trailing slash before path.Clean with strings.HasSuffix(r.URL.Path, "/") → 404. Describe this pre-check in §3.4 explicitly.

Blocks Phase 4.

Fix this →

12. "Single-message re-peek" mechanism undefined (§3.5)

The doc says "opening the detail modal triggers a single-message re-peek with body_max_bytes=262144" but AdminPeekQueue has no message-ID filter — it walks the visibility index from a cursor position. There is no API surface to fetch exactly one message by ID. The SPA cannot issue a targeted single-message fetch; it must either re-issue the current page or pre-fetch full bodies for all rows on list load.

The three options are the same as r3/r4: (a) re-peek the current page with body_max_bytes=262144, find the message by ID in the response; (b) cache full bodies at list time; (c) add a message_id filter to AdminPeekQueue. Commit to one before Phase 5 so the frontend and backend agree on the full-body contract.


Minor Finding (New)

13. Line number drift in §3.3 scanQueueNames reference

§3.3 says "the exact loop shape from scanQueueNames (adapter/sqs_catalog.go:782)". Line 782 in the current file is inside a comment block in throttleConfigEqual. scanQueueNames is actually at line 1221 (the paginated loop is in scanQueueNamesAt at line 1234). The line number is stale — misleading for an implementer who navigates to it. Update the reference to adapter/sqs_catalog.go:1234 (scanQueueNamesAt).

Fix this →


Overall Assessment

The Codex P1 pagination fix is clean and correct — the ceil(N/1024) cost model and the bug-shape rationale are exactly what's needed to prevent a future regression. The five open findings from r3/r4 all remain and must be addressed before their respective phases begin. Finding #1 is the only one that blocks Phase 2 (Purge); #3 and #4 block Phase 3 (Peek); #6 blocks Phase 4 (HTTP handlers); #12 blocks Phase 5 (SPA). Finding #13 is a trivial line-number correction.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 3m 28s —— View job


Tasks


Claude's Review — r5

Addressed since r4

✅ Codex P1 (paginated DLQ source scan): §3.3 was rewritten to describe the paginated prefix scan. The code in scanQueueNamesAt (adapter/sqs_catalog.go:1234–1265) uses exactly the loop shape the doc now describes — ScanAt per page, break when len(kvs) == 0 or < sqsQueueScanPageLimit, advance via nextScanCursorAfter(last key). The ceil(N/1024) cost model is now accurate. The bug-shape rationale ("a single ScanAt caps at 1024 and silently misses past page 1") is preserved so future refactors can't regress to the single-read form. Fix is correct.


New Finding from r4 Fix

13. §10 Summary still says "one storage round-trip" — inconsistent with §3.3 fix

§3.3 (the section that was just fixed) now correctly says ceil(N/1024) Pebble prefix-scan round-trips. But the Summary at §10 was not updated:

"reverse-scan the catalog (prefix scan over SqsQueueMetaPrefix, one storage round-trip) so the SPA can frame the Messages tab and the Purge confirmation appropriately."

This is the exact bug Codex P1 caught in §3.3 — now it survives in the summary. An implementer reading only §10 (the TL;DR) would think the reverse-scan is a single round-trip and could write ScanAt(..., sqsQueueScanPageLimit, ...) without a loop, silently missing queues >1024.

Fix this →


Still Open from r4

1. RetryAfter duration propagation — §3.2 still self-contradictory

§3.2, line 220 of the current doc:

"The admin wrapper sniffs this via a new helper isSQSAdminPurgeInProgress(err) (parallel to the existing isSQSAdminQueueDoesNotExist), extracts the remaining duration from the meta record (LastPurgedAtMillis + 60s - now()), and returns &PurgeInProgressError{RetryAfter: remaining}. The duration is captured once at the rejection point so the handler does not need to recompute it."

These two phrases cannot both be true. "From the meta record" → the admin wrapper reads the meta a second time after the rejection. "At the rejection point" → tryPurgeQueueOnce carries the duration in the error it returns. The actual code at sqs_purge.go:73–76 returns newSQSAPIError(...) — an opaque sqsAPIError with no duration field. The isSQSAdminQueueDoesNotExist pattern it mirrors only returns bool; a bool variant of isSQSAdminPurgeInProgress can't propagate the duration either.

The two implementation paths are mutually exclusive:

  • (a) Second meta read: admin wrapper calls isSQSAdminPurgeInProgress(err) bool (mirroring the existing pattern), then issues loadQueueMetaAt to reconstruct the timing. This is the racy path r3 and r4 flagged — a concurrent purge in the 60-second window can reset LastPurgedAtMillis between the rejection and the follow-up read.
  • (b) Typed error: tryPurgeQueueOnce returns purgeRateLimitedError{remaining: sqsPurgeRateLimitMillis - (now - meta.LastPurgedAtMillis)} instead of sqsAPIError for the rate-limit case. The admin wrapper uses errors.As to extract the duration. No race, no second read.

Path (b) is the correct choice. The doc must remove the phrase "from the meta record" and fully commit to the typed-error description. Until this is resolved, Phase 2 implementers face a real decision point with two conflicting approaches in the doc.

3. QueuesSource interface naming — doc still conflicts with actual code

The design's §3.4 bridge interface shows:

type QueuesSource interface {
    // existing
    List(ctx context.Context) ([]string, error)
    Describe(ctx context.Context, name string) (*QueueSummary, bool, error)
    Delete(ctx context.Context, p AuthPrincipal, name string) error
    // new
    Peek(...) (PeekResult, error)
    Purge(...) (PurgeResult, error)
}

The actual interface at internal/admin/sqs_handler.go:62–65:

type QueuesSource interface {
    AdminListQueues(ctx context.Context) ([]string, error)
    AdminDescribeQueue(ctx context.Context, name string) (*QueueSummary, bool, error)
    AdminDeleteQueue(ctx context.Context, principal AuthPrincipal, name string) error
}

The existing method names in the doc (List, Describe, Delete) do not match the code (AdminListQueues, AdminDescribeQueue, AdminDeleteQueue). An implementer following the doc would define two interfaces that cannot be satisfied by the same struct. The doc must either:

  • (a) Use AdminListQueues, AdminDescribeQueue, AdminDeleteQueue as the "existing" names and add AdminPeekQueue, AdminPurgeQueue for consistency.
  • (b) Explicitly state that Phase 4 renames all five methods as part of the ServeHTTP restructure.

Fix this →

4. Admin-peek throttle bucket — adminPeekKey string conflicts with bucketStore struct key

The doc says the admin-peek bucket is "referenced by name from adminPeekKey(queue) = "admin_peek:" + queue". But the actual bucketStore (adapter/sqs_throttle.go) uses bucketKey{queue, action, incarnation} (a struct) — not a plain string. The charge() signature is:

func (b *bucketStore) charge(cfg *sqsQueueThrottle, queue, action string, incarnation uint64, count int) chargeOutcome

The adminPeekKey string is not a valid argument to bucketStore.charge(). The doc needs to commit to one of:

  • (a) Reuse bucketStore: The admin-peek bucket is a bucketStore.charge(adminPeekThrottle, queueName, "admin_peek", meta.Incarnation, 1) call. The "admin_peek" action string serves as the bucket discriminator. This automatically participates in incarnation keying — invalidateQueue() sweeps all (queue, *, *) entries including (queue, "admin_peek", *). Document the incarnation keying behavior explicitly.
  • (b) Separate sync.Map: The admin-peek bucket is a separate sync.Map[string]*tokenBucket] keyed by queue name (the adminPeekKey string). This does NOT participate in invalidateQueue() sweeps — a re-created queue inherits the old bucket's state. If this is the intent, document how the bucket is reset on invalidateQueue.

Option (a) is the right default — it re-uses existing infrastructure and gets incarnation reset for free.

6. Trailing slash contradiction — §3.4 and §6.4 still conflict

§3.4 says: "path.Clean … normalise away leading / trailing / duplicated slashes" then immediately says ["messages", ""] (described as "from trailing slash") returns 404. After path.Clean("/admin/api/v1/sqs/queues/orders/messages/")"/admin/api/v1/sqs/queues/orders/messages" — the trailing slash is gone, segments are ["orders", "messages"], which matches the messages sub-resource (→ 204, not 404).

§6.4 test plan: "/messages/ (trailing slash) … returns 404." This contradicts path.Clean semantics.

Commit to one:

  • (a) Apply path.Clean → trailing slash is accepted (→ 204). Remove ["messages", ""] from the example shape list. Update §6.4 to assert 204 for the trailing-slash case.
  • (b) Reject trailing slashes with an early strings.HasSuffix(r.URL.Path, "/") check before path.Clean (→ 404). Document the pre-Clean check explicitly in §3.4.

Fix this →

12. Single-message re-peek mechanism — still undefined

§3.5: "opening the detail modal triggers a single-message re-peek with body_max_bytes set to sqsMaximumAllowedMaximumMessageSize (262144)"

AdminPeekQueue has no message_id filter — it walks the visibility index from a cursor. The SPA cannot request a single specific message. What does "single-message re-peek" actually issue on the wire? The doc must specify the mechanism before Phase 5:

  • (a) Re-issue the same page request (same cursor, same limit) with body_max_bytes=262144. The SPA finds the target by message_id in the response. Worst case: limit=100 × 262144 bytes = ~26 MiB per modal open. Needs an explicit note on the upper-bound for the common limit=20 case (~5 MiB, acceptable).
  • (b) Cache the raw API response on the initial list fetch and use the already-present body for the modal; the list view just truncates the display. Re-fetch only on explicit "Reload" in the modal.
  • (c) Add a message_id optional filter to AdminPeekQueue (the cleanest protocol-wise; Phase 3 scope addition).

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

Codex r4 P1 (new):
- IsDLQ scan paginated: single ScanAt would silently miss sources
  past the 1024-page limit, leaving IsDLQ/DLQSources incomplete in
  >1024-queue deployments. §3.3 now commits to a paginated loop
  matching scanQueueNames's internal shape. Cost stated as
  O(ceil(N/1024)) round-trips + O(N) in-memory comparisons.

Claude r4 four findings still open from r3:

   phrasing was ambiguous — could mean a second loadQueueMetaAt call,
   which races a concurrent purge resetting LastPurgedAtMillis in
   the 60-second window. Doc now commits explicitly to the
   typed-error path: tryPurgeQueueOnce computes remaining inside
   the OCC read at rejection and returns purgeRateLimitedError{
   remaining time.Duration}; admin wrapper extracts via errors.As.
   No second meta read.

   AdminListQueues / AdminDescribeQueue / AdminDeleteQueue (verified
   at internal/admin/sqs_handler.go:62-65). Earlier draft showed
   them as List/Describe/Delete; new methods follow the existing
   convention (AdminPeekQueue, AdminPurgeQueue) so the interface
   doesn't end up half-prefixed.

   bucketKey{queue, action, incarnation}, not strings. Doc now
   uses bucketStore.charge() with action='admin_peek' and the
   queue's current incarnation, participating in invalidateQueue()
   automatically. Earlier draft's free-standing string-keyed map
   would have built parallel rate-limiter infrastructure.

   slashes, so /messages/ normalises to /messages and the request
   succeeds. Earlier draft asserted 404 on /messages/ in the test
   plan — contradicted path.Clean's normalisation. Doc now commits
   to (a) accept trailing slashes; test plan §6.4 updated to
   assert 204 on /messages/.

Claude r4 one new finding:

    targeted fetch; the modal re-issues the current page's request
    with body_max_bytes=262144 and locates the target message by
    ID. Doc adds mitigation: default page size 20, warn before
    re-peek if cumulative size > 20 MiB, cache result so other
    rows on the same page don't re-fetch. Future PR can add a
    ?message_id=… filter if production needs more targeted fetch.

No code in this PR; doc-only update.
@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 4m 45s —— View job


Tasks


Claude's Review — r5

Addressed since r4

The latest commit (Codex P1 fix — §3.3 DLQ pagination) is accurate: scanQueueNamesAt (adapter/sqs_catalog.go:1234) uses exactly the paginated loop shape described — ScanAt(..., sqsQueueScanPageLimit, ...), advance start = nextScanCursorAfter(kvs[len(kvs)-1].Key), break on len(kvs) < sqsQueueScanPageLimit. The ceil(N/1024) cost description matches the code. ✓

All five r4 open findings are resolved. Summary of verification:

  • Configure Renovate #1 (RetryAfter propagation) — §3.2 now explicitly commits to the typed-error path: tryPurgeQueueOnce returns &purgeRateLimitedError{remaining} at the rejection point; AdminPurgeQueue extracts via errors.As; the phrase "from the meta record" is gone. The mechanism is race-free. ✓
  • Update docker/setup-qemu-action action to v3 #3 (QueuesSource naming) — §3.4 bridge interface now shows AdminPeekQueue / AdminPurgeQueue with explicit acknowledgement of the Admin* prefix convention. Matches internal/admin/sqs_handler.go:62–65. ✓
  • Dependency Dashboard #4 (throttle incarnation keying) — §3.1 now says bucketStore.charge(adminPeekThrottle, queueName, "admin_peek", meta.Incarnation, 1), correctly citing the struct key and invalidateQueue() reset. ✓
  • Update module github.com/grpc-ecosystem/go-grpc-middleware to v2 #6 (trailing slash) — §3.4 commits to (a) accept: "path.Clean strips trailing slashes … succeed as 204". §6.4 test plan updated to match. ✓
  • DynamoDB API Impl #12 (single-message re-peek) — §3.5 commits to re-issuing the current page's request; describes the default-page-size-20 mitigation, the cumulative-size warning at 20 MiB, and the client-side cache. ✓

New Findings

13. "admin_peek" action collides with consumer Default bucket — blocks Phase 3 (§3.1)

resolveActionConfig (adapter/sqs_throttle.go:484) handles action = "admin_peek" as an unknown action: neither case bucketActionSend nor case bucketActionReceive matches, so it falls through to the DefaultCapacity branch and returns ("*", cfg.DefaultCapacity, cfg.DefaultRefillPerSecond). The bucket key becomes bucketKey{queue: name, action: "*", incarnation: n}.

When a queue has an operator-configured ThrottleDefaultCapacity (e.g., 100 req/s), ReceiveMessage calls that fall through to Default also resolve to ("*", 100, 50) — the same bucket key as admin-peek resolves to, but with a different (capacity, refill) pair. bucketStore.loadOrInit detects the config mismatch via the bucket.capacity == capacity && bucket.refillRate == refill check and runs CompareAndDelete + LoadOrStore, replacing the bucket. Every interleaved admin-peek / ReceiveMessage call then triggers mutual reconciliation churn — every request rebuilds the bucket at full capacity for its config, letting both callers exceed their configured limit indefinitely.

Concrete scenario: queue orders has ThrottleDefaultCapacity=100. Operator opens the Messages tab (5 admin-peek RPS), consumers receive at 80 RPS. Every admin-peek call replaces the {orders, *, N} bucket with a 20-token bucket; every ReceiveMessage call replaces it with a 100-token bucket. Neither side ever accumulates meaningful state.

Fix options:

  • (a) Add "admin_peek" as an explicit case in resolveActionConfig returning (action, cfg.DefaultCapacity, cfg.DefaultRefillPerSecond) — preserving "admin_peek" as the bucket key so it never collides with "*". One-line change to sqs_throttle.go; the adminPeekThrottle config would set DefaultCapacity/DefaultRefillPerSecond and the "admin_peek" case would use them. The existing sweep / invalidateQueue machinery still applies because the bucket key includes the queue name and incarnation.
  • (b) Use a separate per-queue rate-limiter map (e.g., a sync.Map[string, *rate.Limiter] on SQSServer) for admin-peek, bypassing bucketStore entirely. invalidateQueue would need one extra delete call. Simpler but adds parallel infrastructure.

Option (a) is the smaller change and keeps all throttle state in one place. The "admin_peek" constant should be exported or at least declared as const bucketActionAdminPeek = "admin_peek" to prevent it from being a bare string in two places.

14. writeSQSErrorFromErr needs a *purgeRateLimitedError branch — blocks Phase 2 (§3.2)

writeSQSErrorFromErr (adapter/sqs_catalog.go:228) pattern-matches only on *sqsAPIError via errors.As:

func writeSQSErrorFromErr(w http.ResponseWriter, err error) {
    var apiErr *sqsAPIError
    if errors.As(err, &apiErr) {
        writeSQSError(w, apiErr.status, apiErr.errorType, apiErr.message)
        return
    }
    // falls through to generic 500
}

The design says tryPurgeQueueOnce will return &purgeRateLimitedError{remaining} instead of newSQSAPIError(http.StatusBadRequest, sqsErrPurgeInProgress, …). After this change, the SigV4 purgeQueue handler calls writeSQSErrorFromErr(w, purgeRateLimitedErr) — which doesn't match *sqsAPIError and produces a 500 instead of a 400 PurgeQueueInProgress. This silently breaks the SigV4 wire response for the 60-second rate-limit case.

The design says "the wire-side purgeQueue handler wraps that with the existing sqsAPIError rendering", but doesn't specify WHERE. The only place that can do this without adding logic to purgeQueue itself is writeSQSErrorFromErr. Add a *purgeRateLimitedError branch there:

var purgeErr *purgeRateLimitedError
if errors.As(err, &purgeErr) {
    writeSQSError(w, http.StatusBadRequest, sqsErrPurgeInProgress,
        fmt.Sprintf("…retry after %.0f seconds", purgeErr.remaining.Seconds()))
    return
}

The doc should say explicitly that writeSQSErrorFromErr gains this branch as part of the Phase 2 change, not just that the handler "wraps" it. Fix this →

15. sqsPurgeRateLimitMillis*ms notation is ambiguous (§3.2) — minor

The expression remaining := sqsPurgeRateLimitMillis*ms - (now - meta.LastPurgedAtMillis)*ms uses *ms as a pseudo-unit annotation. sqsPurgeRateLimitMillis is defined as 60_000 (an int64 in milliseconds, adapter/sqs_catalog.go:47). The *ms suffix is trying to denote "in milliseconds" rather than multiplying by time.Millisecond. The correct Go expression is:

remaining := time.Duration(sqsPurgeRateLimitMillis-(now-meta.LastPurgedAtMillis)) * time.Millisecond

An implementer reading sqsPurgeRateLimitMillis*ms might write sqsPurgeRateLimitMillis * time.Millisecond (which adds an extra 60x factor — 60_000 ms = 60_000 * time.Millisecond = 60_000,000,000 ns = 60,000 seconds, not 60). Update the pseudocode to use the explicit Go form or add a note clarifying units. Fix this →


Rollout Impact

Finding Severity Blocks
#13 — admin_peek bucket collides with Default consumer bucket Medium Phase 3 (Peek backend)
#14writeSQSErrorFromErr produces 500 on rate-limit for SigV4 path Medium Phase 2 (Purge backend)
#15*ms notation ambiguity in §3.2 pseudocode Minor —cosmetic—

Phase 2 cannot proceed until finding #14 is addressed in the doc (the SigV4 wire response would regress). Phase 3 needs #13 resolved before the bucketStore.charge call is coded. Finding #15 is cosmetic but cheap to fix alongside #14.

The pagination fix (Codex P1 r4) is confirmed correct against the actual scanQueueNamesAt implementation. With #14 resolved, Phase 2 is ready to implement.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/design/2026_05_16_proposed_admin_purge_queue.md (1)

256-256: 💤 Low value

Optional: Add language specifiers to code fences for better rendering.

The code fences at these lines are missing language identifiers, which can affect syntax highlighting in rendered Markdown. Consider adding specifiers:

  • Line 256: routing/path examples → no language needed, or use text
  • Line 369: structured log format → could use text or log
  • Line 384: Prometheus metric format → could use text or prometheus

This is purely cosmetic and does not affect the technical content.

✨ Proposed fixes

Line 256:

-```
+```text
 /admin/api/v1/sqs/queues            → handleList (existing)

Line 369:

-```
+```text
 admin.sqs.purge_queue

Line 384:

-```
+```text
 elastickv_sqs_admin_purge_queue_total{queue, outcome}

As per coding guidelines, this addresses the markdownlint warning about fenced code blocks requiring language specifiers.

Also applies to: 369-369, 384-384

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_05_16_proposed_admin_purge_queue.md` at line 256, Add
language specifiers to the three Markdown fenced code blocks referenced: update
the block showing the routing/path example that contains
"/admin/api/v1/sqs/queues → handleList" to use a text fence (e.g., ```text),
update the structured log example containing "admin.sqs.purge_queue" to use a
text/log fence (e.g., ```text or ```log), and update the Prometheus metric
example containing "elastickv_sqs_admin_purge_queue_total{queue, outcome}" to
use a text/prometheus fence (e.g., ```text or ```prometheus); this fixes the
markdownlint warning by ensuring each ``` block has an appropriate language
specifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@docs/design/2026_05_16_proposed_admin_purge_queue.md`:
- Line 256: Add language specifiers to the three Markdown fenced code blocks
referenced: update the block showing the routing/path example that contains
"/admin/api/v1/sqs/queues → handleList" to use a text fence (e.g., ```text),
update the structured log example containing "admin.sqs.purge_queue" to use a
text/log fence (e.g., ```text or ```log), and update the Prometheus metric
example containing "elastickv_sqs_admin_purge_queue_total{queue, outcome}" to
use a text/prometheus fence (e.g., ```text or ```prometheus); this fixes the
markdownlint warning by ensuring each ``` block has an appropriate language
specifier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b6b8d205-e54c-4714-b800-88ce0c3340a9

📥 Commits

Reviewing files that changed from the base of the PR and between ee42528 and 875ff7f.

📒 Files selected for processing (1)
  • docs/design/2026_05_16_proposed_admin_purge_queue.md

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbd4237533

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +274 to +275
Path segments are extracted with `path.Clean` first (normalising leading / trailing / duplicated slashes and any `.` / `..` traversal), then split on `/`. The first segment after `queues/` is always the queue name; the rest (zero or more segments) is the sub-resource path. Sub-resource path matches: `[]` → name-only routes; `["messages"]` → messages sub-resource. Any other shape (`["messages", "x"]`, `["Messages"]`) returns 404.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject empty queue segment before applying path.Clean

Cleaning the full request path before segment validation can rewrite malformed targets into valid but different resources: for example, /admin/api/v1/sqs/queues//messages becomes /admin/api/v1/sqs/queues/messages. With this routing rule, a request that omitted the queue name could be interpreted as acting on a queue literally named messages (or even hit delete/describe routes), which risks purging/deleting the wrong queue instead of returning a validation error.

Useful? React with 👍 / 👎.

Codex P1 round 5 (line 275): `/admin/api/v1/sqs/queues//messages`
(omitted queue name, double slash) is normalised by path.Clean to
`/admin/api/v1/sqs/queues/messages`, at which point the first
post-`queues/` segment is the literal string "messages" and the
router would route the request as if the operator asked to act on
a queue NAMED "messages". That is a wrong-queue
purge/peek/delete hazard: a script bug that omits the ${QUEUE}
interpolation silently targets whichever queue is named "messages"
in the cluster.

Add an explicit empty-segment rejection BEFORE path.Clean: the
router inspects r.URL.EscapedPath() (raw, pre-Clean) and rejects
with 400 + {"code":"ValidationError", "message":"empty queue name
segment"} when any consecutive `/` pair appears inside the
`/queues/.../messages` segment range. The check fires pre-Clean so
the duplicate-slash signal is still visible (Clean collapses it).
Test plan §6.4 adds two cases pinning the new 400 paths.

The legal-but-quirky case "queue literally named messages" (the
operator did intentionally call `CreateQueue` with name=messages)
stays allowed — that's a real queue name and a Describe/Delete of
it should succeed.

Proposal-only — no code in this PR.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

Round 5 — Codex P1 (line 275): /admin/api/v1/sqs/queues//messages (omitted queue name) normalises via path.Clean to /admin/api/v1/sqs/queues/messages → router interprets it as "queue literally named messages". Wrong-queue purge/peek/delete hazard when a script bug omits ${QUEUE} interpolation.

Fix (doc-only): router inspects r.URL.EscapedPath() BEFORE path.Clean and rejects with 400 + {"code":"ValidationError","message":"empty queue name segment"} when any consecutive / pair appears inside the /queues/.../messages segment range. Check fires pre-Clean so the duplicate-slash signal is still visible (Clean collapses it). Test plan §6.4 adds two 400 cases.

Legal-but-quirky case "queue literally named messages" stays allowed — that's a real queue name and the operator's intentional CreateQueue should succeed at Describe/Delete.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 6m 24s —— View job


Tasks

  • Read design document in full (current r6 state after Codex P1 empty-segment fix)
  • Verify Codex P1 fix: empty queue name segment pre-Clean check (§3.4 + §6.4)
  • Check status of all r5 open findings (add bolt_store #13/add test and delete op #14/Devlop #15 from r5 second review; §10 summary from r5 first review)
  • Check relevant adapter code (sqs_purge.go, sqs_throttle.go, sqs_catalog.go, sqs_handler.go)
  • Post comprehensive r6 review

Claude's Review — r6

Addressed since r5

Codex r5 P1 (empty queue segment, §3.4) — The fix is present: §3.4 now describes a pre-Clean check on r.URL.EscapedPath() that fires before path.Clean normalises away the // signal. The motivation (wrong-queue purge/peek hazard when ${QUEUE} interpolation is omitted) is clearly documented. See issues with the fix below (#16#18).


Issues with the Codex r5 P1 Fix

16. Doc claims the // check rejects /queues/ — it cannot; and it shouldn't (§3.4)

The paragraph says: "The same check rejects /admin/api/v1/sqs/queues/ (no name, trailing slash)."

/admin/api/v1/sqs/queues/ contains no consecutive // pair — it has one trailing slash, not two adjacent slashes. A check for // in the raw path will never fire on this input. The claim is wrong.

More importantly, /admin/api/v1/sqs/queues/ is a legitimate operator request: after path.Clean, it normalises to /admin/api/v1/sqs/queues (matches pathSqsQueues exactly) and routes to handleList. Rejecting it with 400 would break curl-style clients that append trailing slashes to collection endpoints. The trailing-slash-on-list case is handled correctly by path.Clean already — no pre-Clean check is needed or wanted there.

Remove the /admin/api/v1/sqs/queues/ example from "The same check rejects…". The only path the // check catches is /queues//something (consecutive slashes, empty name segment).

Fix this →

17. Two promised 400 test cases not actually added to §6.4 (item 4 of §6)

The fix paragraph ends with "Test plan §6.4 adds two 400 cases." But the actual §6 item 4 (Integration: TestSqsHandler_Messages_*) still ends with the trailing-slash-succeeds note from r4 — no 400 cases appear. The test plan is inconsistent with the stated promise.

The two cases must be explicitly enumerated in §6 item 4:

  • GET /admin/api/v1/sqs/queues//messages (double slash, empty name) → 400 ValidationError
  • DELETE /admin/api/v1/sqs/queues//messages (double slash, empty name) → 400 ValidationError

Fix this →

18. Check scope "inside the /queues/.../messages segment range" is misleadingly narrow

The doc says the // check fires "if any consecutive / pair appears inside the /queues/.../messages segment range". This wording implies the check only runs for paths that end in /messages. But the pre-Clean check should be a property of the router entry point, not the sub-resource — any path after the /queues/ prefix with an empty name segment is ambiguous. An implementer reading this literally might add the check only inside the hasMessages branch and miss the // case on Describe (GET /queues//) and Delete (DELETE /queues//) paths.

The correct description: the router checks the raw EscapedPath() for any // that would appear where the queue name segment sits (i.e., immediately after the /queues/ prefix), before path.Clean is applied. This fires regardless of what follows the queue name position. Update the wording to say "any consecutive / pair at the queue-name position (immediately after the /queues/ prefix)" rather than "inside the /queues/.../messages segment range".


Still Open from r5 — Phase-Blocking

#13 (r5 second review): "admin_peek" action falls through to bucketActionAny = "*" in resolveActionConfig — blocks Phase 3

resolveActionConfig (adapter/sqs_throttle.go:484–498):

func resolveActionConfig(cfg *sqsQueueThrottle, action string) (string, float64, float64) {
    switch action {
    case bucketActionSend:  // "send"
        if cfg.SendCapacity > 0 { return bucketActionSend, … }
    case bucketActionReceive:  // "recv"
        if cfg.RecvCapacity > 0 { return bucketActionReceive, … }
    }
    if cfg.DefaultCapacity > 0 {
        return bucketActionAny, cfg.DefaultCapacity, cfg.DefaultRefillPerSecond
        //     ^^^^^^^^^^^^^^^ = "*"
    }
    return action, 0, 0
}

action = "admin_peek" matches neither Send nor Receive. If the queue's adminPeekThrottle config has DefaultCapacity > 0, resolveActionConfig returns ("*", cfg.DefaultCapacity, cfg.DefaultRefillPerSecond). The bucket key becomes bucketKey{queue, "*", incarnation} — the same key as any ReceiveMessage or SendMessage call that falls through to Default on the queue's own throttle config. The two configs have different (capacity, refill) pairs, triggering bucketStore.loadOrInit's reconciliation churn on every interleaved admin-peek / consumer call, letting both callers drain the bucket indefinitely.

The fix (add an explicit "admin_peek" case to resolveActionConfig) is one line; the doc should say Phase 3 adds const bucketActionAdminPeek = "admin_peek" and a corresponding case bucketActionAdminPeek that returns (bucketActionAdminPeek, cfg.DefaultCapacity, cfg.DefaultRefillPerSecond) — preserving the "admin_peek" bucket key so it never collides with "*".

#14 (r5 second review): writeSQSErrorFromErr produces 500 on rate-limit for SigV4 path — blocks Phase 2

The doc (§3.2, line ~228) says tryPurgeQueueOnce will return &purgeRateLimitedError{remaining} instead of a sqsAPIError. The SigV4 purgeQueue handler (sqs_purge.go:22-37) calls writeSQSErrorFromErr(w, err) on all errors. writeSQSErrorFromErr (sqs_catalog.go:226-240) pattern-matches only on *sqsAPIError via errors.As. A *purgeRateLimitedError does not match *sqsAPIError, so it falls through to the generic 500 branch — silently breaking the SigV4 wire response for the rate-limit case.

The doc says purgeQueue "wraps that with the existing sqsAPIError rendering" but doesn't say where this wrapping happens. The correct location is writeSQSErrorFromErr: add a *purgeRateLimitedError branch before the 500 fallthrough:

var purgeErr *purgeRateLimitedError
if errors.As(err, &purgeErr) {
    writeSQSError(w, http.StatusBadRequest, sqsErrPurgeInProgress,
        fmt.Sprintf("…retry after %.0f seconds", purgeErr.remaining.Seconds()))
    return
}

The doc must state this explicitly in §3.2, not just say the handler "wraps" the error.

Fix this →

#15 (r5 second review): *ms notation in §3.2 pseudocode risks 60,000× wrong duration

Line ~228: remaining := sqsPurgeRateLimitMillis*ms - (now - meta.LastPurgedAtMillis)*ms

sqsPurgeRateLimitMillis = 60_000 (int64, milliseconds). An implementer reading *ms as multiplication by time.Millisecond writes:

remaining := sqsPurgeRateLimitMillis * time.Millisecond  // = 60_000 × 1_000_000 ns = 60_000 seconds, not 60

The correct expression is:

remaining := time.Duration(sqsPurgeRateLimitMillis-(now-meta.LastPurgedAtMillis)) * time.Millisecond

Replace sqsPurgeRateLimitMillis*ms - (now - meta.LastPurgedAtMillis)*ms with this explicit Go form. The *ms suffix is documentation-only notation that is too close to valid Go to be safe as pseudocode.

Fix this →


Cosmetic Finding (Still Open)

§10 Summary still says "one storage round-trip" (§3.3 fix not mirrored)

Line 542: "reverse-scan the catalog (prefix scan over SqsQueueMetaPrefix, one storage round-trip)". §3.3 was fixed in r4 to say ceil(N/1024) round-trips; §10 was not updated to match. An operator or reviewer reading only the summary gets the pre-fix incorrect cost model.

Fix this →


Phase-Blocking Summary

Finding Severity Blocks
#14writeSQSErrorFromErr 500 on SigV4 rate-limit path Medium Phase 2 (Purge backend)
#16/queues/ incorrectly listed as rejected by // check Low-Medium Phase 4 (HTTP handlers, precise spec)
#17 — Two promised 400 test cases missing from §6.4 Low Phase 4 (test plan)
#13"admin_peek" falls through to "*" in resolveActionConfig Medium Phase 3 (Peek backend)
#18 — Check scope wording too narrow ("messages segment range") Low Phase 4 (implementation clarity)
#15*ms notation ambiguity Low Phase 2 (correct implementation)
§10 — "one storage round-trip" not updated Cosmetic

Phase 2 can proceed once #14 and #15 are resolved in the doc. The SigV4 regression (#14) is the only finding that directly breaks existing behavior. Findings #13 and #16#18 must be resolved before Phase 3 and Phase 4 respectively.

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 6m 3s —— View job


Tasks

  • Read design document in full (current r6 state after empty-segment fix)
  • Verify Codex P1 (line 275) empty-queue-segment fix
  • Check status of r5 open findings
  • Review relevant source files for new findings
  • Post comprehensive r6 review

Claude's Review — r6

Addressed since r5

Codex P1 (empty queue segment, §3.4) — The design now specifies inspecting r.URL.EscapedPath() before path.Clean to detect // in the /queues/.../messages segment range, returning 400 ValidationError so a script bug omitting ${QUEUE} interpolation produces an error rather than silently acting on a queue named messages. The fix is correctly motivated and scoped.


Still Open from r5

14. writeSQSErrorFromErr missing *purgeRateLimitedError branch — blocks Phase 2

§3.2, line 228:

"The wire-side purgeQueue handler in sqs_purge.go wraps that with the existing sqsAPIError{code: PurgeQueueInProgress, …} rendering"

purgeQueue (adapter/sqs_purge.go:22–37) does not have any error-conversion logic — it calls purgeQueueWithRetry and passes the result directly to writeSQSErrorFromErr(w, err). writeSQSErrorFromErr (adapter/sqs_catalog.go:228–242) only matches *sqsAPIError via errors.As; any other error type falls through to the generic 500 branch. After the design's signature change, tryPurgeQueueOnce returns *purgeRateLimitedError (not *sqsAPIError) for the rate-limit case, so writeSQSErrorFromErr(w, purgeRateLimitedErr) produces a 500 instead of the correct 400 PurgeQueueInProgress wire response for every SigV4 PurgeQueue call that hits the 60-second window.

The "wraps that" claim is unresolvable by reading the code: there is no wrapping site. Commit to one of:

  • (a) writeSQSErrorFromErr gains a new *purgeRateLimitedError branch (cleanest — one place to edit, consistent with how *sqsAPIError is handled today). The doc should say: "writeSQSErrorFromErr (adapter/sqs_catalog.go:228) gains a errors.As(*purgeRateLimitedError) branch that renders sqsErrPurgeInProgress with purgeErr.remaining.Seconds(), keeping the SigV4 wire shape unchanged."
  • (b) purgeQueueWithRetry converts the error before returning (adds an errors.As call in the retry loop — more invasive and mixes SQS-wire concerns into the shared helper).

Fix this →

15. *ms unit notation still ambiguous — minor but implementer-hazard

§3.2, line 228: remaining := sqsPurgeRateLimitMillis*ms - (now - meta.LastPurgedAtMillis)*ms

sqsPurgeRateLimitMillis is declared as 60_000 (adapter/sqs_catalog.go:47, an int64 in milliseconds). *ms is not a valid Go expression — ms is not declared anywhere. An implementer who reads this as 60_000 * time.Millisecond would compute a wait of 60,000 seconds (16 hours), not 60 seconds.

The correct Go expression is:

remaining := time.Duration(sqsPurgeRateLimitMillis-(now-meta.LastPurgedAtMillis)) * time.Millisecond

Fix this →

§10 Summary still says "one storage round-trip" — contradicts §3.3 fix

§10, line 542:

"reverse-scan the catalog (prefix scan over SqsQueueMetaPrefix, one storage round-trip)"

§3.3 was fixed (Codex r4 P1) to say ceil(N/1024) Pebble prefix-scan round-trips. The summary was not updated and still claims a single round-trip — the exact bug Codex caught. An implementer who reads only §10 as a TL;DR would write the single-ScanAt form, silently missing DLQ source links for deployments with >1024 queues.

Fix this →

Stale line number in §3.3 — still points to throttleConfigEqual

§3.3 (line 252): "reusing the exact loop shape from scanQueueNames (adapter/sqs_catalog.go:782)"

Line 782 in the current file is throttleConfigEqual (verified). The paginated implementation is scanQueueNamesAt at line 1234.

Fix this →

#13 admin-peek bucket collision with consumer Default bucket — blocks Phase 3

The doc says: bucketStore.charge(adminPeekThrottle, queueName, "admin_peek", meta.Incarnation, 1)

resolveActionConfig (adapter/sqs_throttle.go:484–498) has case bucketActionSend and case bucketActionReceive only — "admin_peek" falls through to the DefaultCapacity branch and returns ("*", cfg.DefaultCapacity, cfg.DefaultRefillPerSecond). The effective bucket key stored is {queueName, "*", incarnation} — the same key as any ReceiveMessage call that also falls through to Default (i.e., any queue with no configured RecvCapacity). Both calls use different (capacity, refillRate) tuples (adminPeekThrottle vs. the queue's consumer throttle), and bucketStore.loadOrInit detects the mismatch, runs CompareAndDelete + LoadOrStore on each request, and resets the bucket to full capacity for the current call's config. Every interleaved admin-peek / ReceiveMessage rebuilds the bucket, making the rate-limiter ineffective for both callers.

The fix requires adding an explicit case to resolveActionConfig:

const bucketActionAdminPeek = "admin_peek"

case bucketActionAdminPeek:
    if cfg.DefaultCapacity > 0 {
        return bucketActionAdminPeek, cfg.DefaultCapacity, cfg.DefaultRefillPerSecond
    }

This ensures admin-peek maps to {queueName, "admin_peek", incarnation}, a distinct key from the consumer's {queueName, "*", incarnation} bucket. The doc must say that Phase 3 adds bucketActionAdminPeek constant and a new case to resolveActionConfig — it is a small but load-bearing change that the current text omits.


New Findings (r6)

16. Empty-segment check (§3.4) doesn't catch the claimed /queues/ trailing-slash case

§3.4 claims: "The same check rejects /admin/api/v1/sqs/queues/ (no name, trailing slash)"

But the check is defined as: "any consecutive / pair appears inside the /queues/.../messages segment range." The URL /admin/api/v1/sqs/queues/ has no consecutive // — it ends with a single trailing slash. The check as described fires only on // and would NOT fire for /queues/. After path.Clean, /admin/api/v1/sqs/queues/ becomes /admin/api/v1/sqs/queues and routes to handleList — not rejected. The claim in the doc is wrong.

Either:

  • (a) The check description is incomplete: the check actually tests whether the queue-name segment extracted from EscapedPath() is empty string (after splitting on / and finding the segment following "queues"), which covers both the // case and the trailing-slash-with-no-name case.
  • (b) The intended behavior is that /admin/api/v1/sqs/queues/ is NOT rejected (it just routes to handleList), and the doc's claim is wrong about it being rejected.

Commit to one and update the §3.4 description to match.

Fix this →

17. §6.4 missing the two promised 400 test cases

The commit message for e6ac008 says "Test plan §6.4 adds two 400 cases" for the empty-queue-segment fix. The §6.4 integration test list (doc lines 477–483) has not been updated — there is no test case for:

  • DELETE /admin/api/v1/sqs/queues//messages (double slash) → 400
  • DELETE /admin/api/v1/sqs/queues//messages from a script that omits the queue variable → 400

The test plan at line 483 ends with the trailing-slash path.Clean test (→ 204). The two new 400 cases must be added to pin the behavior that the Codex P1 fix is meant to enforce, otherwise the fix is untestable.

Fix this →

18. §3.4 grammar in empty-segment paragraph is self-contradictory

The sentence "The same check rejects /admin/api/v1/sqs/queues/ (no name, trailing slash) and /admin/api/v1/sqs/queues/messages (legal queue name but no sub-resource — only when the operator intended messages; routed as a regular Describe / Delete of a queue literally named messages, which is a legitimate name and IS allowed)" says both are rejected by the same check, then says the second "IS allowed." This is logically contradictory. Rewrite as two sentences: one describing the rejection case(s), and a separate sentence clarifying that a queue named messages is a valid queue that IS routed through the normal Describe / Delete path.


Rollout Impact

Finding Severity Blocks
#14writeSQSErrorFromErr produces 500 on rate-limit for SigV4 path Medium Phase 2
#13 — admin_peek maps to "*" bucket, collides with consumer Default bucket Medium Phase 3
#16 — empty-segment check doesn't catch single trailing slash after /queues/ Medium Phase 4
#17 — §6.4 missing promised 400 test cases Minor Phase 4
§10 summary "one storage round-trip" Minor Phase 2
Stale line number in §3.3 Minor
#15*ms notation ambiguity Minor Phase 2
#18 — grammar self-contradiction in §3.4 Minor

Phase 2 cannot safely proceed until #14 (SigV4 regression) is resolved in the doc. #13 must be resolved before Phase 3. #16 and #17 must be resolved before Phase 4. Findings §10, stale line number, #15, and #18 are cheap fixes that should land in the next commit.

Two Codex r9 findings:

1. Codex r9 P1 (line 45, peek confidentiality): the r9 doc said
   peek uses session-auth only with no live-role re-check, but
   peek returns FULL message bodies and attributes (not just the
   metadata that List / Describe surface). A session JWT remains
   valid for the configured TTL (1 hour default), so a key that is
   revoked or downgraded after login can keep reading DLQ payloads
   via peek until the token expires. The confidentiality gap is
   unacceptable for DLQ triage.

   §2.1 Goal 8 now requires a new principalForRead helper parallel
   to principalForWrite: same RoleStore liveness check, just a
   lower role threshold (RoleReadOnly vs RoleFull). nil RoleStore
   (test fixtures) falls back to session-auth-only, same shape
   principalForWrite already uses. Phase 4 ships both helpers
   together. Existing List / Describe handlers are NOT changed in
   this PR — tightening those is a separate scope.

2. Codex r9 P2 (line 285, collection-root dispatch corner case):
   the routing procedure tried to detect the bare collection root
   /admin/api/v1/sqs/queues via 'len(segments) == 0' after a
   strings.TrimPrefix + strings.Split round-trip. Neither call
   produces that outcome for the bare-root input: TrimPrefix on a
   non-matching prefix returns the input unchanged, and Split('')
   returns [''], a one-element slice — so the empty-slice test
   never fires and the bare-root request would have flowed through
   to the queue-name validator and been rejected as 'empty queue
   name segment'. Adding a new step 2 that dispatches the exact
   collection root (with or without trailing slash) to handleList
   before any splitting sidesteps the corner case entirely.

Renumber: previous steps 1-5 become 1-6 (insertion at position 2).
All cross-references in the bad-inputs and good-inputs tables and
in the dot-segment / path.Clean clauses updated to match the new
numbering.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 4m 33s —— View job


Tasks

  • Read design document in full (current r10 state after f0036de + 8a51a18)
  • Check git log to understand what changed since r9
  • Check status of r10 open findings (Configure Renovate #1 step 3 dispatch, Update dependency go to v1.21.4 #2 error message)
  • Check Codex r10 P1/P2: live RoleStore re-check + collection root pre-split
  • Review source files (sqs_handler.go, principal.go, sqs_catalog.go) for context
  • Post comprehensive r11 review

Claude's Review — r11

Addressed since r10

Codex r10 P1 (live RoleStore re-check for AdminPeekQueue, §2.1 goal 8) — §2.1 goal 8 now correctly specifies a new principalForReadSensitive helper that performs a live RoleStore revocation check on peek. The security motivation is clear: peek exposes full message bodies (unlike List/Describe which expose only metadata), so a revoked session JWT that is still within its 1-hour TTL would otherwise expose DLQ payloads indefinitely. The helper is introduced alongside principalForWrite at the call site with a documented rationale so a future reviewer does not "fix" the inconsistency by downgrading the gate. ✓

Codex r10 P2 (collection root before splitting, §3.4) — Step 2 of the routing procedure now explicitly dispatches the collection root (/admin/api/v1/sqs/queues with or without trailing slash) to handleList via a pre-split equality check, before any strings.Split or strings.TrimPrefix runs. The len(segments) == 0 / strings.Split corner case that would have rejected a bare collection-root request is gone. ✓

r10 finding #1 (len(segments)==0 unreachable / contradictory, §3.4) — The len(segments) == 0 bullet has been removed from the validation step entirely. The routing procedure is now renumbered 1–6 with step 2 as the explicit collection-root dispatch. Good-inputs table updated from "step 3 rejects (no segments) → falls through to handleList" to "step 2 dispatches the bare collection root → handleList". ✓


Still Open — Phase 4 Blocking

r10 finding #2 — Error message "empty queue name segment" inaccurate for 4 of 5 bad-input cases

Step 4 (line 284):

"Reject with 400 Bad Request + {"code":"ValidationError","message":"empty queue name segment"} when ANY of these hold:"

The message is accurate for only ONE of the four conditions:

Condition segments[0] when rejected "empty queue name segment" accurate?
segments[0] == "" "" (the empty string) ✅ Yes
strings.ContainsRune(segments[0], '%') e.g. "%2F" (3 bytes, not empty) ❌ No
Any segment equals "." e.g. "." (1 byte, not empty) ❌ No
Any segment equals ".." e.g. ".." (2 bytes, not empty) ❌ No

§6.4 test plan (lines 540–545) pins "empty queue name segment" for all five bad-input test cases:

GET  /queues//messages         → 400 {"message":"empty queue name segment"}  ← accurate
GET  /queues/%2F/messages      → 400 with the same body                      ← "%2F" is not empty
GET  /queues/%252F/messages    → 400 with the same body                      ← "%252F" is not empty  
GET  /queues/orders/../messages → 400 with the same body                     ← "orders" is not empty
GET  /queues/./orders/messages → 400 with the same body                      ← "." is not empty

An operator receiving a 400 for DELETE /queues/orders/../messages and reading "message":"empty queue name segment" will be confused — orders is not empty. An implementer who receives a test failure on the last four cases (because they returned "invalid queue name segment") may incorrectly change their implementation to match the wrong message.

Fix: Replace "empty queue name segment" with "invalid queue name segment" in step 4 (line 284) and in all corresponding §6.4 test assertions (lines 540–545). "invalid queue name segment" is accurate for all four conditions — the segment is invalid whether it is empty, percent-encoded, or a dot-segment.

Fix this →


New Findings

#23principalForReadSensitive role requirement unspecified — blocks Phase 4

§2.1 goal 8 describes the new helper as: "performs the same revocation check principalForWrite does, but classifies the call as a read in the audit pipeline."

Reading principalForWrite in internal/admin/sqs_handler.go:244–279:

if h.roles != nil {
    live, exists := h.roles.LookupRole(principal.AccessKey)
    if !exists { … 403 }
    if !live.AllowsWrite() { … 403 }   // ← this is THE role requirement
    principal.Role = live
} else if !principal.Role.AllowsWrite() { … 403 }

The phrase "same revocation check" in §2.1 is ambiguous: does "same check" include the AllowsWrite() gate, or only the LookupRole revocation step? An implementer who copies principalForWrite and changes only the method name would include the AllowsWrite() gate — blocking read-only users from peeking, the opposite of the stated goal.

The design's intent is unambiguous from §2.1: "Read-only role can peek but not purge." The principalForReadSensitive helper should:

  1. Get the principal from context (missing = 500, wiring bug)
  2. If RoleStore configured: LookupRole(principal.AccessKey) — if !exists → 403 (revoked); otherwise accept any live role (both RoleReadOnly and RoleFull) and override principal.Role with the live value
  3. If no RoleStore: accept the JWT-embedded role unconditionally (same as the session-only gate)

The difference from principalForWrite: step 2 replaces "!exists || !live.AllowsWrite()" with "!exists" (any live role suffices; only revocation is checked). internal/admin/principal.go has only AllowsWrite() defined — there is no AllowsRead() method today, so step 2 cannot be a simple one-line parallel to AllowsWrite(). The doc must spell out the role check explicitly so the Phase 4 implementer doesn't inadvertently write !live.AllowsRead() (undefined) or !live.AllowsWrite() (wrong requirement).

Fix this →

#24 — §6.4 missing revocation test for principalForReadSensitive — Phase 4 test gap

§6.4 integration tests include: "bridge bubbles ErrAdminForbidden through to 403 (peek under nil principal; purge under read-only principal)." This tests that a nil/missing session returns 403 on peek. But it does NOT test the specific scenario principalForReadSensitive closes: a valid JWT session whose key has been revoked (or downgraded to a role below read access) from the RoleStore after login.

This is exactly the security gap Codex r9 P1 flagged: without the live-role check, a revoked session JWT valid until its 1-hour TTL expiry can still read DLQ payloads. Without a test pinning the revocation rejection, a Phase 4 implementation that accidentally omits the LookupRole call (falling back to session-auth only, the pre-Codex-r9-P1 behavior) would pass the entire test plan.

Add one test case to §6.4:

GET /queues/{name}/messages with a valid JWT session but whose access key has been removed from the RoleStore since login → 403 {"code":"forbidden","message":"..."}. Verifies the principalForReadSensitive live-role check fires even when the session cookie is still valid.

The parallel test for purge already exists via principalForWrite's existing coverage in sqs_handler_test.go:132. Peek needs the same.


Confirmed Correct (from source review)

  • RoleFull.AllowsWrite() = true, RoleReadOnly.AllowsWrite() = false, internal/admin/principal.go:14 — confirms there is no AllowsRead() method today; principalForReadSensitive cannot be implemented as a one-line analogue to principalForWrite. ✓
  • principalForWrite in sqs_handler.go:244–279 calls LookupRole + AllowsWrite() when h.roles != nil. principalForReadSensitive must call only LookupRole (omit AllowsWrite()). ✓
  • The "empty queue name segment" error is pinned at line 284 and lines 540–545. The §6.4 test cases for %2F, %252F, orders/.., ./orders all assert "the same body" as the //messages case — which today is "empty queue name segment". An implementation returning "invalid queue name segment" would fail those tests. ✓

Phase Gate Summary

Finding Severity Blocks
r10 #2"empty queue name segment" message inaccurate for 4 of 5 bad-input cases + §6.4 pins wrong expectation Medium Phase 4 (HTTP handler + test plan)
#23principalForReadSensitive role requirement ambiguous; "same check" could be read as including AllowsWrite() Medium Phase 4 (auth implementation)
#24 — §6.4 missing revocation test for principalForReadSensitive Minor Phase 4 (test plan completeness)

Phase 2 (Purge) and Phase 3 (Peek backend) remain unblocked. The principalForReadSensitive helper is introduced in Phase 4 (HTTP handlers); Phases 2 and 3 ship the adapter-layer methods only. The three remaining findings are all Phase 4 concerns. Finding r10 #2 is the most urgent to fix — it pins a factually wrong error message string into the test plan that every Phase 4 test implementer will need to match.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_05_16_proposed_admin_purge_queue.md`:
- Line 300: Update the table row for the route
`/admin/api/v1/sqs/queues/orders/messages/` to make the success code
method-specific: indicate that after normalization via path.Clean to
`/orders/messages` the GET method returns 200 and the DELETE method returns 204
(rather than a single 204 for both); explicitly annotate the methods (GET: 200,
DELETE: 204) in that cell so the dual-method endpoint is unambiguous.
- Around line 272-277: The fenced code blocks in this markdown are missing
language identifiers and triggering MD040; update each triple-backtick fence
around the route list and the three metric/event blocks to include a language
tag (e.g., change ``` to ```text) so the blocks containing
"/admin/api/v1/sqs/queues…", "admin.sqs.purge_queue",
"elastickv_sqs_admin_purge_queue_total{queue, outcome}" and
"elastickv_sqs_admin_peek_queue_total{queue, outcome}" (and the other similar
fences referenced at lines ~450-457 and ~465-471) all start with ```text.
- Line 311: The note about trailing-slash normalization references the wrong
step: update the sentence that currently says the path "Cleans ... in step 4" to
instead reference step 5, since path.Clean is defined in step 5; locate the
paragraph mentioning path.Clean and change the step number accordingly so the
document's ordered procedure correctly points to path.Clean in step 5.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a08cec3-4e9c-4672-8233-7afb262a7233

📥 Commits

Reviewing files that changed from the base of the PR and between 875ff7f and f0036de.

📒 Files selected for processing (1)
  • docs/design/2026_05_16_proposed_admin_purge_queue.md

Comment thread docs/design/2026_05_16_proposed_admin_purge_queue.md Outdated
Comment thread docs/design/2026_05_16_proposed_admin_purge_queue.md Outdated
Comment thread docs/design/2026_05_16_proposed_admin_purge_queue.md Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0036dedcf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ReceiveCount int32 // ApproximateReceiveCount
GroupID string // FIFO MessageGroupId, empty for standard
DeduplicationID string // FIFO MessageDeduplicationId, empty for standard
Attributes map[string]string // SQS message attributes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve typed message attributes in peek results

Define AdminPeekedMessage.Attributes with the same structured shape used by stored SQS message attributes (data type plus string/binary value), not map[string]string. In this repo, message records carry typed attributes (sqsMessageAttributeValue in adapter/sqs_messages.go), so flattening to plain strings would lose DataType and binary payloads for any non-string attribute. That breaks the document’s own “full body + every attribute” triage goal and can cause operators to purge messages without seeing the real attribute content.

Useful? React with 👍 / 👎.

Three CodeRabbit r10 minor findings:

1. Line 277 fence language (MD040): the routing-table fenced block
   lacked a language identifier. Add 'text' so markdown lint stays
   clean.

2. Line 300 method-specific status: the bad-inputs table row for
   '/queues/orders/messages/' (trailing slash) pinned '204' as the
   outcome, but only DELETE returns 204; GET (peek) returns 200.
   Row now reads '200 (GET peek) or 204 (DELETE purge)' so the
   method distinction is explicit.

3. Line 311 stale step reference: after the collection-root pre-
   split insertion (r10), path.Clean moved from step 4 to step 5.
   The trailing-slash note on line 311 still cited 'step 4';
   updated to 'step 5'.

All three are documentation hygiene; no semantic change.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 4m 31s —— View job


Tasks

  • Read design document in full (current r11 state after d313199)
  • Check source files for context (sqs_messages.go, sqs_throttle.go)
  • Verify status of r11 open findings (Configure Renovate #1 step 3 dispatch, Update dependency go to v1.21.4 #2 error message)
  • Check CodeRabbit r10 minor items (fence languages, method codes, step ref)
  • Provide comprehensive r12 review

Claude's Review — r12

Addressed since r11

r11 #1 (step 3 len(segments)==0 under "Reject with 400" header — blocks Phase 4) — Fixed cleanly. The collection root is now handled in step 2 as an explicit pre-split branch: "If the escaped path equals /admin/api/v1/sqs/queues (with or without trailing slash), the request is a list call; the handler returns immediately into handleList." The former five-step procedure is now six steps, with step 2 entirely devoted to the collection-root pre-split. The len(segments) == 0 guard is gone from the validation block. The good-inputs table (line 306) now correctly reads "step 2 dispatches the bare collection root → handleList" — no longer using "rejects" for a 200-family outcome. ✓

CodeRabbit r10 (method-specific success codes) — Table row for /orders/messages/ (trailing slash, step 5) now reads "→ 200 (GET peek) or 204 (DELETE purge)". ✓

CodeRabbit r10 (step reference mismatch) — Line 311 now correctly says "Cleans to … in step 5". ✓

CodeRabbit r10 (fence language at line 272) — The routing-table fence is now ```text. ✓


Still Open — Phase-Blocking

1. Error message "empty queue name segment" wrong for 5 of 6 bad-input cases (§3.4 + §6.4, r11 #2) — blocks Phase 4

Step 4 (line 284) fires {"code":"ValidationError","message":"empty queue name segment"} for four validation conditions. §6.4 (lines 540–545) asserts that same message body for all six bad-input test cases. The message is accurate for exactly one of the six:

§6.4 test case segments[0] value "empty queue name segment" accurate?
//messagessegments[0] == "" "" (0 bytes) ✅ Yes
//messages DELETE — segments[0] == "" "" (0 bytes) ✅ Yes
/%2F/messagesContainsRune('%') "%2F" (3 bytes, NOT empty) ❌ No
/%252F/messagesContainsRune('%') "%252F" (5 bytes, NOT empty) ❌ No
/orders/../messagesseg == ".." ".." (2 bytes, NOT empty) ❌ No
/./orders/messagesseg == "." "." (1 byte, NOT empty) ❌ No

An operator or integration tool receiving 400 {"message":"empty queue name segment"} on DELETE /queues/orders/../messages would be confused — orders is a non-empty, legal segment. The §6.4 test cases pin the wrong wire shape for five of the six cases: a Phase 4 implementation that emits a semantically accurate message like "invalid queue name segment" would fail the test plan as written.

Fix: Change the message in step 4 (line 284) and in all six §6.4 assertions from "empty queue name segment" to "invalid queue name segment" — accurate for all four rejection conditions regardless of whether the segment is empty, percent-encoded, or a dot-segment. This is a two-line change in the doc.

Fix this →


New Findings

2. Attributes map[string]string on AdminPeekedMessage loses DataType and BinaryValue (§3.1, line 97) — blocks Phase 3

The latest Codex P2 comment (on line 97) flagged this but it has not been addressed. The actual SQS attribute type in adapter/sqs_messages.go:86–90 is:

type sqsMessageAttributeValue struct {
    DataType    string `json:"DataType"`
    StringValue string `json:"StringValue,omitempty"`
    BinaryValue []byte `json:"BinaryValue,omitempty"`
}

The design's AdminPeekedMessage.Attributes (§3.1, line 97) is:

Attributes map[string]string // SQS message attributes

A map[string]string cannot represent a BinaryValue []byte attribute at all, and silently discards DataType entirely. For messages with Binary-type attributes (common for serialized protobuf, compressed payloads, etc.), the admin peek response would either omit the attribute or garble it — directly contradicting the design's stated goal ("full body + every attribute" in the detail modal so operators can decide whether to purge).

The Peek adapter at Phase 3 must reconstruct sqsMessageRecord.MessageAttributes (map[string]sqsMessageAttributeValue) into the AdminPeekedMessage wire format. With a map[string]string, the only lossless encoding for binary data would be an undocumented implicit base64 convention, but nothing in the doc says that's the intent.

Fix: Define a typed attribute representation alongside AdminPeekedMessage:

type AdminPeekedMessageAttribute struct {
    DataType    string // "String", "Number", "Binary", or custom types
    StringValue string // set when DataType is String or Number
    BinaryValue []byte // set when DataType is Binary; base64-encoded on the JSON wire
}

Change Attributes to map[string]AdminPeekedMessageAttribute. Update the peek response JSON example in §3.4 to show the nested object shape for one attribute. Add a note that BinaryValue is base64-encoded in the JSON wire response (matching Go's standard encoding/json marshalling of []byte), consistent with sqs_messages.go:89.

This change is Phase 3 scope but must be committed in the design before Phase 3 begins — the bridge layer's translation from adapter.sqsMessageRecord.MessageAttributes to admin.AdminPeekedMessage.Attributes cannot be correct with the current flat map[string]string definition.

Fix this →


Minor Finding

3. Code fences at lines 450 and 465 still missing text language specifier (cosmetic)

d313199 fixed the routing-table fence at line 272 to ```text but the two remaining fences cited in the CodeRabbit r10 comment were not updated:

  • Line 450 (audit log format: admin.sqs.purge_queue …) — still plain ```.
  • Line 465 (Prometheus metrics block: elastickv_sqs_admin_purge_queue_total…) — still plain ```.

Both trigger MD040 (fenced-code-language) in markdownlint. Trivially fixed:

-```
+```text
 admin.sqs.purge_queue

and the same for line 465. The CodeRabbit r10 AI-agent prompt listed these two blocks under "Also applies to: 450-457, 465-471" but the commit only addressed line 272.


Confirmed Correct (verified against source)

  • Step 2 collection-root pre-split: r.URL.EscapedPath() equality check against the prefix path and prefix-plus-slash. strings.Split("", "/")[""] (length 1), confirming the old len(segments) == 0 guard would have been dead code. The pre-split approach sidesteps this correctly. ✓
  • sqsQueueNamePattern (adapter/sqs_catalog.go:65) = ^[a-zA-Z0-9_-]{1,80}(\.fifo)?$ — no %, ., or .. possible in a legal queue name; the step 4 rejections have zero false positives for legitimate traffic. ✓
  • sqsMessageAttributeValue (adapter/sqs_messages.go:86–90) has DataType, StringValue, and BinaryValue []byte — confirms finding Update dependency go to v1.21.4 #2 is real. map[string]string cannot losslessly encode BinaryValue. ✓
  • scanQueueNamesAt (adapter/sqs_catalog.go:1234): line reference in §3.3 is accurate. ✓

Phase Gate Summary

Finding Severity Blocks
#1"empty queue name segment" wrong for 5 of 6 test cases; pins incorrect wire contract Medium Phase 4
#2Attributes map[string]string loses DataType + BinaryValue; binary attributes silently dropped Medium Phase 3
#3 — Fences at lines 450 / 465 missing text specifier Cosmetic

Phase 2 (Purge) remains fully unblocked — no Phase 2 findings exist. Finding #2 must be resolved before Phase 3 begins; finding #1 before Phase 4. Finding #3 is a cheap one-line fix alongside either.

Codex r11 P2 (line 97): the prior draft typed
AdminPeekedMessage.Attributes as map[string]string, which would
have flattened the typed attribute set the storage layer carries
in sqsMessageRecord.MessageAttributes (adapter/sqs_messages.go:86)
— silently dropping the DataType discriminator and any binary
payloads. A DLQ-triage operator would then look at peek output
and have no way to tell that a message was routed there because
of a binary or numeric attribute mismatch; the 'full body + every
attribute' goal from §1 would be undermined exactly where it
matters most.

Add a typed struct AdminPeekedAttribute with DataType +
StringValue + BinaryValue, mirroring the wire shape of
sqsMessageAttributeValue. Update the peek-result JSON wire example
in §3.4 to show the new attribute shape with a String, Number, and
Binary entry so the SPA / downstream tooling can pin the schema.
Binary uses Go's standard encoding/json behaviour ([]byte →
base64), so binary payloads survive the SPA round-trip unchanged.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude finished @bootjp's task in 4m 8s —— View job


Tasks


Claude's Review — r12

Addressed since r11

CodeRabbit r10/r11 minor cleanups (d313199) — Three items resolved cleanly:

  • Fence language identifiers (text) added to the three bare code blocks (MD040 lint). ✓
  • /messages/ trailing-slash row in the bad-inputs table now reads "200 (GET peek) or 204 (DELETE purge)" — the dual-method endpoint is no longer ambiguous. ✓
  • Step reference in the trailing-slash normalization note corrected from "step 4" to "step 5" (path.Clean moved when the collection-root pre-split was inserted in r10). ✓

Codex r11 P2 (typed SQS attribute shape, 5a493b0)AdminPeekedAttribute struct added with DataType string, StringValue string, BinaryValue []byte, mirroring sqsMessageAttributeValue (adapter/sqs_messages.go:86). AdminPeekedMessage.Attributes changes from map[string]string to map[string]AdminPeekedAttribute. The JSON wire example in §3.4 updated to show all three types (String, Number, Binary):

"Source":  { "data_type": "String", "string_value": "checkout" },
"Retry":   { "data_type": "Number", "string_value": "2" },
"Payload": { "data_type": "Binary", "binary_value": "AAECAwQF" }

The Number-as-string_value encoding follows the SQS wire convention (sqsMessageAttributeValue does the same: DataType="Number" with the value in StringValue). Binary values are []byte so Go's encoding/json base64-encodes them automatically, matching the round-trip used by the storage layer. The commit message correctly explains why map[string]string was wrong (silently dropped DataType discriminator and binary payloads for non-string attributes). Fix is clean. ✓


Still Open — Phase 4 Blocking

1. Error message "empty queue name segment" inaccurate for 4 of 5 bad-input test cases (§3.4 step 4 + §6.4, r10 #2)

Step 4 (line 304) pins:

{"code":"ValidationError","message":"empty queue name segment"} when ANY of these hold:

  • segments[0] == ""
  • strings.ContainsRune(segments[0], '%')
  • any segment equals "." or ".."

And §6.4 (line 565–569) asserts every bad-input case returns the same body as line 564's "empty queue name segment" assertion.

The message is correct for exactly one case:

Condition segments[0] value "empty queue name segment" accurate?
segments[0] == "" "" — the empty string ✅ Yes
ContainsRune('%') e.g. "%2F" — 3 bytes, NOT empty ❌ No
seg == "." "." — 1 byte, NOT empty ❌ No
seg == ".." ".." — 2 bytes, NOT empty ❌ No

An operator receiving 400 {"message":"empty queue name segment"} on DELETE /queues/orders/../messages will be confused — orders is not empty. A Phase 4 implementer who returns "invalid queue name segment" for the % and dot-segment cases will fail the four "same body" test assertions in §6.4.

Fix: replace "empty queue name segment" with "invalid queue name segment" in step 4 (line 304) and in the three §6.4 test cases at lines 565–569 that assert "same body". The //messages case (line 564) can optionally keep the more specific "empty queue name segment", or all five cases can use the unified "invalid queue name segment". Either way, §6.4 must pin a single message string so the implementation and the tests agree.

Fix this →

2. principalForReadSensitive role gate still ambiguous (§2.1 goal 8, r11 #23)

§2.1 goal 8 (line 45):

"The new principalForReadSensitive helper performs the same revocation check principalForWrite does, but classifies the call as a read in the audit pipeline."

Reading principalForWrite (internal/admin/sqs_handler.go:244–280), it performs two distinct checks:

  1. LookupRole(principal.AccessKey) — revocation check (is the key still in the RoleStore?)
  2. live.AllowsWrite() — role gate (does the key have write permission?)

The phrase "same revocation check" is intended to mean only step (1), but an implementer who copies principalForWrite and changes only the audit classification would include both steps — which blocks RoleReadOnly from peeking, the opposite of the documented intent ("Read-only role can peek but not purge").

There is also no AllowsRead() method on Role today (internal/admin/principal.go:13–16 defines only AllowsWrite()), so an implementer who tries to write !live.AllowsRead() will get a compile error. The doc needs to spell out exactly what principalForReadSensitive must do:

  1. Get the principal from context (missing = 500, wiring bug)
  2. If RoleStore configured: LookupRole(principal.AccessKey) — if !exists → 403 (key was revoked); accept any live role (both RoleReadOnly and RoleFull), override principal.Role with the live value. Do NOT call AllowsWrite() — there is no AllowsRead() method on Role today; any live key passes step 2.
  3. If no RoleStore: accept the JWT-embedded role unconditionally (same as the session-only gate List/Describe use).

The difference from principalForWrite is in step 2: principalForWrite adds || !live.AllowsWrite() to the rejection condition; principalForReadSensitive omits it entirely.

Fix this →

3. §6.4 missing revocation test for principalForReadSensitive (r11 #24)

§6.4 integration tests (line 560):

"bridge bubbles ErrAdminForbidden through to 403 (peek under nil principal; purge under read-only principal)"

This tests:

  • nil principal → peek returns 403 ✓ (catches missing session)
  • read-only principal → purge returns 403 ✓ (catches role gate on purge)

It does not test the specific case principalForReadSensitive closes: a valid JWT session for a key that was revoked (removed from RoleStore) after login. Without this test, a Phase 4 implementation that accidentally omits the LookupRole call and falls back to session-auth only passes the entire test plan — and the revocation window stays open until the 1-hour JWT TTL expires.

This is the same gap Codex r9 P1 flagged in the auth design, now surfacing at the test-plan level.

Add one test case to §6.4:

GET /admin/api/v1/sqs/queues/{name}/messages with a valid JWT session for an access key that has since been deleted from the RoleStore → 403 {"code":"forbidden","message":"..."}. Verifies that principalForReadSensitive calls LookupRole and rejects revoked keys even when the session cookie is still within its TTL.

The parallel test for purge already exists via principalForWrite's existing gate in the test file.

Fix this →


Confirmed Correct (verified against HEAD)

  • AdminPeekedAttribute{DataType, StringValue, BinaryValue} mirrors sqsMessageAttributeValue (adapter/sqs_messages.go:86): both use []byte for BinaryValue, both use string for StringValue, DataType is string. JSON keys diverge intentionally — admin API uses snake_case per §3.4 convention, SQS wire uses PascalCase per AWS spec. ✓
  • Number attributes use string_value in both the storage layer and the new wire example (AWS SQS stores number attribute values as strings in the StringValue field; DataType is the discriminator). The example { "data_type": "Number", "string_value": "2" } is correct. ✓
  • RoleReadOnly and RoleFull are the only two Role values; AllowsWrite() returns true only for RoleFull; there is no AllowsRead() method. Confirmed at internal/admin/principal.go:13–16. The absence of AllowsRead() is what makes finding Update dependency go to v1.21.4 #2 a compile-time hazard, not just an API design concern. ✓
  • scanQueueNamesAt (adapter/sqs_catalog.go:1234) — paginated loop matches §3.3 description. ✓
  • writeSQSErrorFromErr (adapter/sqs_catalog.go:228) — line reference accurate; the new *purgeRateLimitedError branch is correctly specified as a Phase 2 addition. ✓

Phase Gate Summary

Finding Severity Blocks
#1"empty queue name segment" message wrong for 4 of 5 test cases; §6.4 asserts wrong error string Medium Phase 4
#2principalForReadSensitive role gate unspecified; "same revocation check" ambiguous re AllowsWrite() Medium Phase 4
#3 — §6.4 missing revocation test for principalForReadSensitive Minor Phase 4

Phase 2 (Purge) and Phase 3 (Peek) remain fully unblocked. All prior Phase-2/3-blocking findings have been resolved across r1–r11. The typed attribute fix is clean and correct. Findings #1#3 are all Phase 4 concerns; they are cheap to fix (one-paragraph edits for #2 and #3, one string substitution for #1) and should land before Phase 4 begins.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a493b0014

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Standard / non-partitioned: `!sqs|msg|vis|<queue>|<gen>|<visible_at>|<msg_id>`
- Partitioned FIFO (`PartitionCount > 1`): `!sqs|msg|vis|p|<queue>|<partition>|<gen>|<visible_at>|<msg_id>`

For non-partitioned queues the walk is a single prefix scan over the first keyspace. For partitioned FIFO queues peek does **sequential partition scanning**: scan partition 0 to exhaustion (or until `Limit` is reached), then continue with partition 1, etc. The choice of sequential-rather-than-fanout matches `receiveMessage`'s existing dispatch and makes the cursor format trivial. A per-partition fanout would force the cursor to carry one offset per partition — Claude r2 flagged this as a concrete pagination concern and sequential scanning is the simpler resolution.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rotate peek partition start to avoid starvation bias

The partitioned-FIFO peek plan says to always scan partition 0 to exhaustion before moving on, which creates a strong sampling bias when low-numbered partitions stay hot: operators can page repeatedly and still never see messages from higher partitions without traversing a huge backlog. This contradicts the existing receive-path fairness guard in adapter/sqs_messages.go (lines 875-887), which explicitly rotates the starting partition to prevent starvation. Because this feature is for DLQ triage, hiding entire partitions can lead to incorrect purge decisions based on an unrepresentative view of queue contents.

Useful? React with 👍 / 👎.

Codex r12 P2 (line 146): the partitioned-FIFO peek plan said
'scan partition 0 to exhaustion, then partition 1, etc.'. With
hot low-numbered partitions this creates a sampling bias —
operators paginating through peek would never see high-numbered
partitions until they traversed the entire backlog of every
preceding one. For DLQ triage this is precisely the failure
mode the feature exists to prevent: a misrepresentative sample
could lead an operator to purge based on what they could see,
not what is actually in the queue.

receiveMessage (adapter/sqs_messages.go:875-887) already solves
the same fairness problem via receiveFanoutCounters: a per-queue
atomic counter that rotates the starting partition each call.
Peek reuses the same counter on the first page (empty cursor),
encoding the rotated start partition into the cursor's
StartPartition field. Subsequent pages continue deterministically
from the cursor; the walk terminates when Partition advances back
around to StartPartition (full circle, all partitions traversed)
or Limit is hit.

Cursor schema extended (still 512-byte hard-cap):
  peekCursor.StartPartition uint32  - the rotated start
  peekCursor.Partition      uint32  - current advance position
Single-partition queues encode StartPartition=Partition=0 and
skip the counter (no partition choice to make).

Also collapsed the duplicate peekCursor struct definition the
r10 edit left behind — there was only one cursor type, but two
copies of its Go declaration. Single source of truth restored.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 883c01c055

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

2. **Dispatch the collection root before any splitting.** If the escaped path equals `/admin/api/v1/sqs/queues` (with or without trailing slash — i.e. equals the prefix exactly, or equals the prefix + `"/"`), the request is a list call; the handler returns immediately into `handleList`. Codex r9 P2 flagged that the earlier draft tried to detect this case via `len(segments) == 0` after trimming the prefix, but `strings.TrimPrefix` returns the input unchanged when the prefix does not match — and `strings.Split("", "/")` returns `[""]`, a one-element slice, not the empty slice — so the `len(segments) == 0` test never fired and a bare collection-root request would have flowed through to the queue-name validator and been rejected as "empty queue name segment". The explicit pre-split branch sidesteps the entire `Split` / `TrimPrefix` corner case.
3. **Trim the route prefix** (`/admin/api/v1/sqs/queues/`) — the trailing slash is now part of the prefix so a match guarantees there is at least one further path segment — and split the remainder on `/`. The result is a slice of **raw** (still-escaped) segments. (A failed trim — i.e. the path neither equals the collection root from step 2 nor starts with the prefix-with-slash — falls through to the admin router's default 404 handler; the SQS sub-router never sees it.)
4. **Validate the queue-name segment AND every other segment.** Reject with `400 Bad Request` + `{"code":"ValidationError","message":"empty queue name segment"}` when ANY of these hold:
- `segments[0] == ""` — the raw URL had `…/queues//…` (the empty string between two consecutive slashes ends up as an empty segment after split).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject empty interior segments before path.Clean

Step 4 only rejects an empty first segment (segments[0] == ""), so malformed paths like /admin/api/v1/sqs/queues/orders//messages pass validation and are normalized by path.Clean into a valid destructive route (/orders/messages). That means a structurally invalid URL can still trigger peek/purge instead of returning a validation error, which weakens the safety guarantees this routing hardening is trying to provide.

Useful? React with 👍 / 👎.

4. **Validate the queue-name segment AND every other segment.** Reject with `400 Bad Request` + `{"code":"ValidationError","message":"empty queue name segment"}` when ANY of these hold:
- `segments[0] == ""` — the raw URL had `…/queues//…` (the empty string between two consecutive slashes ends up as an empty segment after split).
- `strings.ContainsRune(segments[0], '%')` — any percent sign in the queue-name segment is illegal regardless of what it would decode to. AWS queue names match `sqsQueueNamePattern` (`^[a-zA-Z0-9_-]{1,80}(\.fifo)?$`, `adapter/sqs_catalog.go:65`); none of those characters need percent-encoding, so a queue-name segment containing `%` could only be an attempt to bypass the pre-`path.Clean` empty-segment check via a single-decode (`%2F`), double-decode (`%252F`), or arbitrary-depth-nested decode (`%25252F`, …) round-trip. Rejecting all `%`-containing segments at the source forecloses the entire decoding-collision class without iterating `url.PathUnescape` and without enumerating the encoded variants. Codex r6 P1 round-2 flagged that the previous "single `PathUnescape` decodes to empty" check would have missed `%252F` (one decode yields the literal string `%2F`, not the empty string); the contains-`%` rule sidesteps the bypass entirely.
- **Any segment equals `"."` or `".."`** — dot-segment traversal. `path.Clean` collapses `..` against the preceding segment, so an input like `/admin/api/v1/sqs/queues/orders/../messages` would pass an "only-validate-segments[0]" check (`segments[0] == "orders"`, a legal name), then `path.Clean` rewrites the trimmed path to `/messages` and step 6 dispatches against a queue literally named `messages` — a wrong-queue operation reachable through a single crafted URL. Codex r7 P1 flagged this as the next confused-deputy vector after the `%`-decoding class was closed. The fix is structural: ban `.` and `..` in **every** segment (queue name and sub-resource alike) at step 4, before `path.Clean` ever runs. AWS queue names cannot equal `.` or `..` anyway (the legal-character set excludes both), and the only sub-resource segment we recognise is the literal string `messages`, so the prohibition costs nothing for legitimate traffic.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate encoded dot-segments before dispatch

The dot-segment guard checks only literal "." / ".." on raw escaped segments, so %2e/%2e%2e variants bypass step 4 and are decoded only later in step 6. This leaves a traversal-equivalent class unblocked by the documented validator (e.g. /admin/api/v1/sqs/queues/orders/%2E%2E/messages), so the procedure does not fully enforce the “ban dot segments in every segment” invariant it claims.

Useful? React with 👍 / 👎.

Two Codex r13 P2 findings, both about routing pre-check coverage:

1. Empty interior segment (line 308): the earlier draft only
   validated segments[0] for emptiness, so
   /admin/api/v1/sqs/queues/orders//messages would pass — then
   path.Clean would normalise it to /orders/messages and the
   structurally invalid URL would silently succeed as a
   peek/purge against a real queue. Step 4 now rejects empty
   segments in any position.

2. Percent-encoded dot-segments (line 310): the literal '.'/'..'
   check missed %2e / %2E / %2e%2e and similar — those would
   stay opaque past step 4 and only become traversal sequences
   at step 6 (decode), by which point path.Clean had already
   collapsed them and dispatched on the wrong queue. Step 4's
   '%-containing' rule now applies to every segment (not just
   segments[0]), catching all encoded dot-segment variants
   alongside the existing %2F / %252F bypass class.

Both rules are a natural generalisation of what step 4 already
did — same structural constraint (queue names can't contain
percent signs, sub-resource is only ever the literal 'messages',
neither needs percent-encoding), just applied uniformly.

§3.4 bad-inputs table gains three rows; §6.4 test plan gains
the matching 400 cases. Empty-interior is the structural-URL
case; %2E%2E / %2e are the encoded-dot-segment cases.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af367446f4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +307 to +309
4. **Validate every segment** (queue-name AND sub-resource). The rules apply uniformly to all elements of `segments`, not just `segments[0]` — Codex r13 P2 caught two cases where the earlier "queue-name-only" wording let bypass classes slip through. Reject with `400 Bad Request` + `{"code":"ValidationError","message":"invalid path segment"}` when ANY segment violates ANY of:
- `segment == ""` — empty segment, including the **interior** case `/admin/api/v1/sqs/queues/orders//messages` (Codex r13 P2 first finding). The earlier draft only checked `segments[0]`, so `path.Clean` would normalise `orders//messages` to `orders/messages` and the malformed URL would silently succeed as a peek/purge — exactly the structural-invalid-URL-with-destructive-routing failure the routing hardening exists to prevent. Empty segments in any position are rejected at step 4.
- `strings.ContainsRune(segment, '%')` — any percent sign in any segment is illegal. The original motivation was the `%2F` confused-deputy class on the queue-name position, but extending the same rule to sub-resource segments (Codex r13 P2 second finding) also catches **percent-encoded dot-segments** (`%2e`, `%2E`, `%2e%2e`, `%2E%2E`, etc.) that would otherwise slip past the literal `.`/`..` check below and only become traversal sequences at step 6 (decode) — by which time `path.Clean` has already run on the cleaned path. AWS queue names match `sqsQueueNamePattern` (`^[a-zA-Z0-9_-]{1,80}(\.fifo)?$`, `adapter/sqs_catalog.go:65`); the only legal sub-resource segment we recognise is the literal string `messages`. Neither needs percent-encoding for anything, so banning `%` in every segment forecloses the entire decoding-collision class — single-decode (`%2F`, `%2E`), double-decode (`%252F`, `%252E`), arbitrary-depth-nested decode (`%25252F`, …) — without iterating `url.PathUnescape` and without enumerating the encoded variants.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Normalize trailing slash before rejecting empty segments

The ordered router flow rejects segment == "" in step 4 before path.Clean runs, so a valid trailing-slash request like /admin/api/v1/sqs/queues/orders/messages/ is split into [..., ""] and returned as 400 instead of being normalized to /orders/messages. This directly conflicts with the document’s stated behavior and tests that expect trailing-slash acceptance, so implementers following these steps will ship a handler that rejects the canonical peek/purge trailing-slash form.

Useful? React with 👍 / 👎.

Comment on lines +497 to +500
outcome ∈ {"ok", "forbidden", "not_leader", "not_found", "purge_in_progress", "internal_error"}

elastickv_sqs_admin_peek_queue_total{queue, outcome}
outcome ∈ {"ok", "forbidden", "not_leader", "not_found", "throttled", "internal_error"}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add validation outcome to admin peek/purge metrics

Both admin operations explicitly return validation errors (ErrAdminSQSValidation) for malformed names/cursors, but the proposed metric outcome enums omit any validation label. As written, those expected 400-class failures have no dedicated bucket and will be lumped into internal_error or dropped, which distorts alerting and makes operator mistakes indistinguishable from backend faults.

Useful? React with 👍 / 👎.

Codex r14 P1 (line 309): step 4's any-segment-empty rule fires
on /admin/api/v1/sqs/queues/orders/messages/ because the split
produces ['orders', 'messages', ''] and the trailing empty
counts as an empty segment. This directly contradicts the
documented trailing-slash-accepted policy (§3.4 trailing-slash
note, §6.4 test plan) — implementers following the procedure
literally would 400 the canonical trailing-slash peek/purge form.

Fix: step 3 now strips a SINGLE trailing empty segment after the
split, before step 4 validates. The convention matches HTTP-mux
trailing-slash normalisation. Interior empties (/orders//messages
splits into ['orders', '', 'messages'], trailing element is
'messages', no strip) still fall into step 4's empty-segment check
and 400. Only the trailing empty is dropped, never an interior one.

Codex r14 P2 (line 500): the metric outcome enums for both
elastickv_sqs_admin_purge_queue_total and
elastickv_sqs_admin_peek_queue_total omitted a 'validation'
label, even though both endpoints explicitly return
ErrAdminSQSValidation for 400-class rejections (empty / %-containing
/ dot-segment paths from §3.4 step 4; invalid cursor / oversize
cursor / wrong-generation cursor from §3.1; empty / malformed
queue name). Bucketing those into internal_error would distort
alerting — operators paginating with a malformed cursor would
look like a backend fault. Add 'validation' to both counters'
outcome set, with a paragraph explaining why the four expected-
fault classes (forbidden / not_leader / not_found / validation)
stay separate from the unexpected internal_error class.

No code in this PR; doc-only update.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 15, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round 10 — Codex P1 (line 45): peek exposes full message bodies/attributes (not just metadata like List/Describe), so the session-auth-only gate left a confidentiality gap — a revoked/downgraded session JWT could still read DLQ payloads via peek until the 1-hour token TTL elapsed.

Fix: introduce principalForReadSensitive helper alongside principalForWrite. Same RoleStore revocation re-check, but classified as a read in the audit pipeline so the audit-line shape stays parallel to List/Describe. List/Describe themselves remain on the session-auth gate because their output is the same metadata the SPA queue list already shows — the divergence is intentional and called out so a future reviewer doesn't "fix" the inconsistency by downgrading peek's gate.

Implementation-side caller audit fits in the code PR (new helper, no existing callers can be misaligned). Doc-only here.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 5m 53s —— View job


Tasks


Claude's Review — r16

Addressed since r12 (commits r13 through r15)

Codex r12 P2 (partition rotation to avoid starvation, §3.1) — The design now commits to rotated sequential scanning (k, k+1, …, k+N-1 mod N starting from a rotated k). The cursor struct encodes StartPartition and Partition so subsequent pages continue deterministically; the walk terminates when Partition wraps back to StartPartition. The motivation (Codex caught that fixed-at-0 start could hide high-index partitions entirely, causing operators to purge based on an unrepresentative sample) is clearly documented. ✓

Codex r13 P2 (extend step-4 validation to every segment, §3.4) — Step 4 now validates every element of segments, not just segments[0]. The two specific cases Codex r13 caught are in the bad-inputs table: interior double-slash (orders//messages) and percent-encoded dot-segments (orders/%2E%2E/messages). ✓

Codex r13 P1 (trailing-slash split before step-4 validation, §3.4 r15) — Step 3 now drops the very last empty element before step 4 runs, so orders/messages/ splits to ["orders", "messages", ""], the trailing empty is dropped, and step 4 sees the clean ["orders", "messages"] instead of rejecting it as an empty-segment violation. ✓

Codex r14 P2 (add validation outcome to metric enums, §3.6 r15) — Both counters now carry a validation label. The rationale (preventing ErrAdminSQSValidation from being misclassified as internal_error, distorting alerts) is documented. ✓


Still Open — Phase 4 Blocking

1. Error message inconsistency between step 4 spec and §6.4 test assertions — blocks Phase 4

Commit r14 updated the step 4 rejection message from "empty queue name segment" to "invalid path segment" (correct — the new message accurately covers all four rejection conditions including percent-containing and dot-segments). But §6.4 was not updated.

Step 4 (line 307) now says:

Reject with 400 Bad Request + {"code":"ValidationError","message":"invalid path segment"}

§6.4 (line 572) still asserts:

GET /queues//messages → 400 {"code":"ValidationError","message":"empty queue name segment"}

and all subsequent test cases assert "the same body" — i.e., "empty queue name segment".

An implementation that returns "invalid path segment" (as step 4 specifies) would fail all nine §6.4 validation test assertions. An implementation that returns "empty queue name segment" (as §6.4 asserts) contradicts step 4's spec, is factually wrong for 8 of the 9 cases (the rejected segment is not empty for %2F, .., orders//messages interior, percent-encoded dot-segments), and would confuse operators.

Fix: update §6.4 to assert "invalid path segment" in the first test case and for all "same body" references. The //messages case's segment IS empty, so either message would be factually accurate there — but "invalid path segment" is the unambiguous umbrella that covers all nine cases without requiring a message-per-condition split.

Fix this →

2. principalForReadSensitive role gate still ambiguous — blocks Phase 4

§2.1 goal 8 (line 45):

"The new principalForReadSensitive helper performs the same revocation check principalForWrite does, but classifies the call as a read in the audit pipeline."

principalForWrite (internal/admin/sqs_handler.go:244–280) performs two distinct operations:

  1. Revocation check: h.roles.LookupRole(principal.AccessKey) + !exists → 403 (key removed from role config since login)
  2. Role gate: !live.AllowsWrite() → 403 (RoleReadOnly blocked)

"The same revocation check" is intended to mean only step (1). But a developer who copies principalForWrite and changes only the audit classification will include both steps — which means RoleReadOnly principals can't peek, the opposite of goal 8's stated "Read-only role can peek but not purge."

There is also no AllowsRead() method on Role today (internal/admin/principal.go:14 defines only AllowsWrite()). An implementer who tries !live.AllowsRead() as a parallel gets a compile error. The doc needs to spell out what the implementation should actually be:

principalForReadSensitive performs the LookupRole revocation check (if !exists { 403 }) — it does NOT call AllowsWrite() and there is no AllowsRead() method on Role today (internal/admin/principal.go:14). Any non-revoked live role (RoleReadOnly or RoleFull) is accepted. If h.roles == nil, the JWT-embedded role is accepted unconditionally — same as List / Describe.

Fix this →


Still Open — Minor / Phase 4

3. §6.4 missing revocation test for principalForReadSensitive (r12 finding #3)

§6.4 line 568: "bridge bubbles ErrAdminForbidden through to 403 (peek under nil principal; purge under read-only principal)"

This tests nil session (wiring bug) and read-only-on-purge (role gate). It does NOT test the confidentiality gap that principalForReadSensitive exists to close: a valid in-window JWT for an access key that was revoked in the RoleStore after login. Without this test, a Phase 4 implementation that accidentally falls back to session-only auth passes the entire test plan, and the 1-hour revocation window remains open.

Add one test case: GET /admin/api/v1/sqs/queues/{name}/messages with a valid JWT session for a key deleted from the RoleStore after login → 403. The parallel test for purge already exists via principalForWrite's existing coverage.


New Findings (r13–r15)

4. Bad-inputs table trailing-slash row references wrong step after r15 fix — minor (§3.4)

Line 327 (bad-inputs table):

/admin/api/v1/sqs/queues/orders/messages/ (trail slash) | step 5 normalises via path.Clean … → 200 (GET) or 204 (DELETE)

After the r15 fix, the trailing slash is handled in step 3 (drop the last empty element: ["orders", "messages", ""]["orders", "messages"]). Step 5's path.Clean then receives orders/messages with no trailing slash and does nothing. The table incorrectly credits step 5 for work done by step 3.

Fix this →

5. receiveFanoutCounters increment contradicts "pure read" claim — minor (§2.1 goal 1 + §3.1)

§2.1 goal 1 states: peek "does NOT bump receive counts, does NOT start a visibility timer, and does NOT issue receipt handles — it is a pure read over the message keyspace."

§3.1 (line 150) says: "peek reads-and-increments receiveFanoutCounters[queueName]"

nextReceiveFanoutStart (adapter/sqs_messages.go:921–931) calls counter.Add(1) — it is a write to an in-memory atomic. Peek therefore has a side effect: each first-page call increments the receiveFanoutCounters entry for the queue, shifting the starting partition for the next receiveMessage call. This does not change any persisted state, visibility, or receive counts, but it is not a "pure read" in the strict sense.

Two options:

  • (a) Use a separate peek rotation counter (adminPeekFanoutCounters, a parallel sync.Map) so peek's rotation is decoupled from receiveMessage's fairness tracking. No side effect on consumer distribution.
  • (b) Accept the shared counter and update §2.1 goal 1 to clarify: "pure read over the message keyspace — receive counts, visibility timers, and receipt handles are not modified; the in-memory partition-rotation counter is incremented on each first-page call as a minor stateless side effect."

Either is acceptable; option (b) is the smaller change. The doc should either commit to (a) or add the §2.1 caveat.


Phase Gate Summary

Finding Severity Blocks
#1 — Error message "empty queue name segment" in §6.4 contradicts step 4 spec "invalid path segment" Medium Phase 4
#2principalForReadSensitive role gate ambiguous; "same revocation check" doesn't exclude AllowsWrite() Medium Phase 4
#3 — §6.4 missing revocation test for principalForReadSensitive Minor Phase 4
#4 — Trailing-slash bad-inputs table credits step 5; should say step 3 after r15 Minor
#5receiveFanoutCounters.Add(1) contradicts "pure read" in §2.1 goal 1 Minor Phase 3

Phase 2 (Purge) and Phase 3 (Peek backend) remain fully unblocked. Finding #5 is minor enough that it does not block Phase 3 implementation — the behavior is correct, only the §2.1 description needs a caveat. Findings #1 and #2 must be resolved before Phase 4 (HTTP handlers) begins; #1 is a one-line change and #2 is a one-paragraph clarification.

@bootjp bootjp merged commit 303b884 into main May 16, 2026
9 of 12 checks passed
@bootjp bootjp deleted the feat/admin-purge-queue-design branch May 16, 2026 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant