feat(sqs): atomic dormancy gate-lift + htfifo capability gate (Phase 3.D PR 5b-3)#734
feat(sqs): atomic dormancy gate-lift + htfifo capability gate (Phase 3.D PR 5b-3)#734
Conversation
… gate (Phase 3.D PR 5b-3) The §11 PR 2 dormancy gate (PartitionCount > 1 hard-rejected at CreateQueue) was a placeholder while the data plane caught up. The data plane is now in place (PR #731 + #732), so this PR atomically lifts the dormancy gate and replaces it with the §8.5 capability gate that polls every cluster peer's /sqs_health for the htfifo capability. What changes: - Remove validatePartitionDormancyGate and the htfifoTemporaryGateMessage constant from sqs_partitioning.go. Both were marked "Removed in PR 5 in the same commit that wires the data plane so the gate-and-lift land atomically" — that PR is this one. - Add (*SQSServer).validateHTFIFOCapability in adapter/sqs_capability_gate.go, called from createQueueCore. Two-stage fail-closed check on PartitionCount > 1: 1. Local: this binary must advertise htfifo (htfifoCapabilityAdvertised). Refuses the create with InvalidAttributeValue if not. 2. Peers: every entry in s.leaderSQS must report htfifo via /sqs_health within the poller's per-peer timeout. Any timeout, HTTP error, malformed body, or missing capability blocks the create. Vacuous on PartitionCount <= 1 and on empty leaderSQS (single- node cluster — the local check is the whole cluster). - collectSQSPeers helper returns leaderSQS values in deterministic sorted order with empty/duplicate addresses filtered, so the poller and operator-facing error messages never depend on Go map iteration order. - buildHTFIFOCapabilityRejection composes the rejection message with each failing peer's address + reason (per-peer Error or "missing capability") so an operator triaging a partial-rolling- upgrade cluster does not need to re-run the poll out-of-band. CreateQueue control flow on PartitionCount > 1: schema validators (validatePartitionConfig, etc.) → validateHTFIFOCapability → htfifoCapabilityAdvertised check (local) → PollSQSHTFIFOCapability(ctx, collectSQSPeers(), …) → reject with InvalidAttributeValue on any failure → createQueueWithRetry Caller audit: validateHTFIFOCapability has exactly one production caller (createQueueCore in sqs_catalog.go); both the JSON handler and the future query-protocol handler reach it through that one path. SetQueueAttributes is unaffected because PartitionCount is immutable post-create (validatePartitionImmutability). Test changes: - Delete TestValidatePartitionDormancyGate_RejectsAboveOne (the function it tested is gone). - Convert TestSQSServer_HTFIFO_DormancyGate_RejectsPartitionedCreate into TestSQSServer_HTFIFO_CapabilityGate_AcceptsOnSingleNode — the same wire payloads now SUCCEED because the local node advertises htfifo and there are no peers to poll. Renamed TestSQSServer_HTFIFO_DormancyGate_AllowsPartitionCountOne → TestSQSServer_HTFIFO_CapabilityGate_AllowsPartitionCountOne for consistency. - Update comments on TestSQSServer_HTFIFO_RejectsQueueScopedDedupOnPartitioned, TestSQSServer_HTFIFO_RejectsNonPowerOfTwoPartitionCount, TestSQSServer_HTFIFO_ImmutabilitySetQueueAttributesRejects, mustCreateFIFOWithThroughputLimit, and the installPartitionedMetaForTest helper to describe the new capability-gate world. New unit tests in sqs_capability_gate_test.go: - TestValidateHTFIFOCapability_ShortCircuitsOnLegacyMeta: PartitionCount in {0, 1} skips the poll entirely (proven by wiring a peer that would FAIL the gate and verifying the short-circuit path bypasses it). - TestValidateHTFIFOCapability_AcceptsWhenAllPeersAdvertise: happy path with two fake peers. - TestValidateHTFIFOCapability_AcceptsOnEmptyPeerList: vacuous case (single-node cluster). - TestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapability: rolling-upgrade fail-closed; offending peer's address surfaces in the InvalidAttributeValue message. - TestValidateHTFIFOCapability_RejectsWhenPeerUnreachable: transient-network fail-closed. - TestCollectSQSPeers_Deterministic: sort + dedup + empty-skip. - TestBuildHTFIFOCapabilityRejection_ShapesOperatorMessage: rejection-message shape pinned (advertising peers absent, failing peers contribute "(reason)" suffix, defensive paths). Self-review (CLAUDE.md): 1. Data loss — None. The gate strictly tightens CreateQueue acceptance vs. the previous dormancy reject; no path now accepts a write that would have been rejected before. The dormancy gate's invariant ("partitioned-shape meta cannot land on a binary that does not handle the partitioned keyspace") is preserved by the local htfifoCapabilityAdvertised check and strengthened by the cluster-wide poll. 2. Concurrency / distributed failures — Poll runs concurrently across peers via the existing PollSQSHTFIFOCapability helper (covered by its own race tests). collectSQSPeers + sort are pure / deterministic. The leaderSQS map is only mutated at SQSServer construction (WithSQSLeaderMap), not at request time, so no read/write races. Leader transitions during the poll are handled by the existing proxyToLeader path that gates createQueue before validateHTFIFOCapability runs. 3. Performance — Poll cost is O(peers) and only paid on PartitionCount > 1 creates (rare control-plane operation). Legacy / single-partition CreateQueue calls pay one short-circuit branch. collectSQSPeers' sort is O(N log N) on a small N (cluster size). No hot-path impact. 4. Data consistency — Schema validators (PartitionCount shape, dedup-scope rule, perMessageGroupId rule) still run BEFORE the capability gate inside parseAttributesIntoMeta, so an invalid shape rejects with the schema's reason rather than the gate's. SetQueueAttributes immutability remains the guard for post-create partition-shape changes. 5. Test coverage — Gate function: 5 unit tests covering the short-circuit, happy path, vacuous empty, rolling-upgrade, and unreachable-peer classes. Helpers: 2 unit tests pinning deterministic order and message shape. Wire-level: existing HT-FIFO integration tests carry forward, with the dormancy- reject test converted to the new accepts-on-single-node happy path.
|
@claude review |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughA cluster-wide HT-FIFO capability gate replaces the temporary dormancy gate for partitioned FIFO CreateQueue. For PartitionCount > 1 the local node must advertise htfifo and all distinct peers from ChangesHT-FIFO Capability Gate Implementation & Integration
Sequence Diagram(s)sequenceDiagram
participant Client
participant SQSServer as Local SQS Server
participant Poller
participant Peer as SQS Peer /sqs_health
Client->>SQSServer: CreateQueue (FifoQueue=true, PartitionCount>1)
SQSServer->>SQSServer: check local htfifo advertisement
SQSServer->>Poller: PollSQSHTFIFOCapability(peers)
Poller->>Peer: GET /sqs_health
Peer-->>Poller: 200 + advertised capabilities / timeout/error
Poller-->>SQSServer: capability report (per-peer results)
alt all advertise
SQSServer->>Client: 200 OK (proceed to metadata write)
else any missing/unreachable
SQSServer->>SQSServer: slog.Warn(operator details)
SQSServer->>Client: 400 BadRequest (sanitized htfifoCapabilityRejectionPublic)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 23 minutes and 7 seconds.Comment |
|
Claude finished @bootjp's task in 2m 15s —— View job Code Review: feat(sqs): atomic dormancy gate-lift + htfifo capability gate (Phase 3.D PR 5b-3)
SummaryThe design is sound: atomically removing the dormancy gate and replacing it with a cluster-wide capability poll is the right approach. The fail-closed semantics, deterministic peer ordering, and operator-facing rejection messages are well-executed. Test coverage is thorough for the new surface area. Issues Found1. Stale comment in
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df11d2b9a4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // or missing capability so a partitioned queue cannot land in a | ||
| // partially-upgraded cluster where some peer would silently | ||
| // store its records under the legacy single-partition keyspace. | ||
| if err := s.validateHTFIFOCapability(ctx, requested); err != nil { |
There was a problem hiding this comment.
Check existing queue before enforcing HTFIFO peer gate
createQueueCore now runs validateHTFIFOCapability before createQueueWithRetry, so any transient peer poll failure (timeout/unreachable/malformed health) returns InvalidAttributeValue even when the queue already exists with identical attributes. That breaks the adapter’s idempotent CreateQueue behavior (TestSQSServer_CatalogCreateIsIdempotent) for partitioned FIFO queues and can cause create-or-get clients to fail during partial outages or rolling upgrades despite no new queue being created. The capability gate should run only on the new-queue creation path after the existing-queue equality check.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request replaces the temporary dormancy gate with a cluster-wide capability check for partitioned FIFO queues. When a queue is created with multiple partitions, the system now verifies that all cluster peers advertise the "htfifo" capability, ensuring data consistency during rolling upgrades. Review feedback highlights potential concurrency issues when accessing the peer map and suggests optimizing the network-based capability polling through caching or low-latency retry configurations.
| func (s *SQSServer) collectSQSPeers() []string { | ||
| if len(s.leaderSQS) == 0 { | ||
| return nil | ||
| } | ||
| peers := make([]string, 0, len(s.leaderSQS)) | ||
| seen := make(map[string]struct{}, len(s.leaderSQS)) | ||
| for _, addr := range s.leaderSQS { | ||
| if addr == "" { | ||
| continue | ||
| } | ||
| if _, ok := seen[addr]; ok { | ||
| continue | ||
| } | ||
| seen[addr] = struct{}{} | ||
| peers = append(peers, addr) | ||
| } | ||
| sort.Strings(peers) | ||
| return peers | ||
| } |
There was a problem hiding this comment.
The collectSQSPeers method iterates over s.leaderSQS, which is a map. If this map is modified concurrently by another goroutine (e.g., during a cluster membership update), this iteration will cause a panic. While the PR description mentions that leaderSQS is only mutated at construction, it is safer to protect this access with a read lock. Additionally, if this operation is part of a read-modify-write sequence requiring leadership, ensure you verify leadership status after acquiring the lock to prevent race conditions.
References
- To prevent race conditions in read-modify-write operations that require leadership, verify leadership status after acquiring the lock that protects the operation.
| peers := s.collectSQSPeers() | ||
| if len(peers) == 0 { | ||
| // Single-node deployment: the local check above is the | ||
| // whole cluster. Vacuously true on the peer side. | ||
| return nil | ||
| } | ||
| report := PollSQSHTFIFOCapability(ctx, peers, PollerConfig{}) | ||
| if report == nil || !report.AllAdvertise { | ||
| return newSQSAPIError(http.StatusBadRequest, sqsErrInvalidAttributeValue, | ||
| buildHTFIFOCapabilityRejection(report)) | ||
| } |
There was a problem hiding this comment.
The validateHTFIFOCapability function performs a cluster-wide HTTP poll for every CreateQueue request where PartitionCount > 1. This introduces a blocking network dependency. Consider caching the capability status. If retries are implemented for these network calls, keep the maximum backoff duration short (e.g., 10ms) to minimize latency impact on the control plane, as is standard for our low-latency components.
References
- For low-latency systems, use a short maximum backoff duration for retries (e.g., 10ms) as long-lived contention is not the primary scenario to handle.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@adapter/sqs_capability_gate.go`:
- Around line 54-57: The current branch returns
buildHTFIFOCapabilityRejection(report) directly to the caller which exposes peer
addresses and poller error text; instead, in the PollSQSHTFIFOCapability failure
path (where report == nil || !report.AllAdvertise) log the full report and
per-peer details server-side (use the same redaction approach as
adapter/sqs_catalog.go) and replace the client-visible payload with a sanitized
rejection message (e.g., a generic capability rejection string) passed into
newSQSAPIError; keep using PollSQSHTFIFOCapability,
buildHTFIFOCapabilityRejection and newSQSAPIError as the touchpoints but ensure
only redacted data is returned to callers while full details are written to
server logs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 800e14f6-c2b6-4468-b08d-c78e73f7fe36
📒 Files selected for processing (7)
adapter/sqs_capability_gate.goadapter/sqs_capability_gate_test.goadapter/sqs_catalog.goadapter/sqs_partitioned_dispatch_test.goadapter/sqs_partitioning.goadapter/sqs_partitioning_integration_test.goadapter/sqs_partitioning_test.go
💤 Files with no reviewable changes (1)
- adapter/sqs_partitioning_test.go
… message (PR #734, round 1) Two PR #734 review findings addressed in one commit because they touch the same code path: 1. Codex P1 — Check existing queue before enforcing HTFIFO peer gate The previous placement of validateHTFIFOCapability inside createQueueCore (before createQueueWithRetry) ran the gate on EVERY CreateQueue call with PartitionCount > 1, including idempotent retries on an already-existing queue with identical attributes. A transient peer poll failure (timeout / unreachable / malformed health) during such a retry would then return InvalidAttributeValue instead of the AWS-correct 200 OK, breaking create-or-get clients during partial outages or rolling upgrades. Fix: move the gate INTO tryCreateQueueOnce after the existence check and BEFORE the OCC dispatch. The order in tryCreateQueueOnce is now: 1. loadQueueMetaAt — check if queue exists at the snapshot 2. exists + matching attrs → return (true, nil) idempotent OK 3. exists + different attrs → return QueueNameExists 4. validateHTFIFOCapability — runs ONLY on the genuine create path 5. loadQueueGenerationAt + dispatch the create The gate may run more than once across OCC retries (each retry that gets to "queue still missing" re-polls), but every retry that hits an existing-queue match short-circuits before the gate runs — so idempotent CreateQueue under a partial cluster outage stays AWS-correct. Caller audit: validateHTFIFOCapability has exactly one production caller (now tryCreateQueueOnce); both the JSON handler (createQueue → createQueueCore → createQueueWithRetry → tryCreateQueueOnce) and the query-protocol handler (sqs_query_protocol.go: 182 → createQueueCore → …) reach it through that one path. Move is symmetric — no caller observes a semantic change for a queue that genuinely needs to be created; only the existing-queue path stops paying the gate cost. 2. CodeRabbit major — Don't send raw peer poll details back to caller buildHTFIFOCapabilityRejection's output (peer addresses + raw poller error text) was returned to the wire layer as the InvalidAttributeValue message, leaking cluster topology to any authenticated CreateQueue caller. This conflicts with the stricter error-redaction policy already used elsewhere in sqs_catalog.go. Fix: the wire-level rejection is now the sanitized constant htfifoCapabilityRejectionPublic ("PartitionCount > 1 requires every cluster peer to advertise the htfifo capability via /sqs_health; one or more peers did not — see server logs for details"). The full per-peer detail goes to slog.Warn with structured fields (queueName, partitionCount, peerCount, detail) so an operator triaging the rolling upgrade can read the failing peer addresses + reasons from the server logs without rerunning the poll out-of-band. Renamed buildHTFIFOCapabilityRejection → formatHTFIFOCapabilityReportForLog to make its server-side-only contract obvious at the call site. Test changes: - New TestSQSServer_HTFIFO_CapabilityGate_IsIdempotentOnExistingQueue (wire-level): creates a partitioned queue on a single-node cluster (gate vacuously passes), poisons leaderSQS with an unreachable address, then re-creates the same queue with identical attrs and expects 200; finally creates a NEW queue with the poisoned peer map and expects the 400 (proves the gate is still in effect for genuine creates). - New TestValidateHTFIFOCapability_PublicMessageDoesNotLeakPeerDetails: pins the sanitization contract — the wire-level message must equal htfifoCapabilityRejectionPublic exactly, never contain a peer host:port. - Updated TestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapability and TestValidateHTFIFOCapability_RejectsWhenPeerUnreachable to assert the sanitized constant + NotContains on the peer address. - Renamed TestBuildHTFIFOCapabilityRejection_ShapesOperatorMessage → TestFormatHTFIFOCapabilityReportForLog_ShapesServerSideDetail to match the renamed helper; assertion that the helper output is server-side-only (no client wire surface assertion here). Below threshold (intentionally not addressed in this round): - Gemini medium on collectSQSPeers concurrency: leaderSQS is only mutated at SQSServer construction (WithSQSLeaderMap), not at request time. Gemini's own comment acknowledges this. - Gemini medium on caching the capability status: CreateQueue is a rare control-plane operation; caching adds a stale-window failure mode (a cluster that already lost a peer would still accept a partitioned queue while the cache is warm). Pure performance suggestion, not correctness.
|
@claude review Round 1 — addressed two threshold-level findings in one commit (they touch the same code path): 1. Codex P1 — Idempotency break: Caller audit: 2. CodeRabbit Major — Information leak: the wire-level Tests:
Below threshold (intentionally not addressed in this round):
Build / |
|
Claude finished @bootjp's task in 2m 50s —— View job Code Review: Round 1 Changes
Verdict: Two threshold-level fixes are correct — three stale comments from round 0 remain openBoth fixes are correct, clean, and well-tested. Flagging the remaining stale comments as a round 2 cleanup item before merge. Idempotency fix (Codex P1) — correctThe gate moves from
Information-leak fix (CodeRabbit Major) — correct
Stale comments — three from round 0 still openThe previous review flagged four stale comments. One was fixed (the 1. // PR 2 of the rollout introduces this field but a temporary CreateQueue gate
// rejects PartitionCount > 1 until PR 5 lifts the gate atomically
// with the data-plane fanout — so the schema exists but no
// partitioned data can land before the data plane is wired.This is now false — PR 5b-3 has lifted the gate. The last sentence claims partitioned data cannot land, which is wrong. 2. // PR 2 of the §11 rollout introduces the schema fields plus the
// validation surface — including the temporary dormancy gate that
// rejects PartitionCount > 1 at CreateQueue. PR 5 lifts the gate
// atomically with the data-plane fanout so a half-deployed cluster
// can never accept a partitioned queue without the data plane to
// serve it. Until then the field exists in the meta type and the
// router function compiles, but no partitioned queue can land."Until then … no partitioned queue can land" is now false. Should be updated to past tense: the gate was lifted in PR 5b-3. 3. // FIFO + perMessageGroupId + PartitionCount=8: accept (the
// dormancy gate runs separately on CreateQueue and rejects this
// at the wire today, but the cross-attribute validator on its
// own does not)."the dormancy gate … rejects this at the wire today" is no longer true — CreateQueue with PartitionCount=8 now succeeds on a capable cluster. Minor (non-blocking)
Positive highlights
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
adapter/sqs_catalog.go (1)
385-392:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStale comment — references the removed dormancy gate.
The second sentence now describes a gate that no longer exists:
The temporary dormancy gate (§11 PR 2) runs separately in createQueue so SetQueueAttributes paths share the schema validator without re-rejecting on the gate.The dormancy gate was removed by this PR. The rationale still holds (the capability gate runs in
tryCreateQueueOnce, not here, precisely soSetQueueAttributespaths don't re-run the network poll), but the reference to "§11 PR 2" and "dormancy gate" is now incorrect.📝 Proposed fix
- // HT-FIFO validation runs after resolveFifoQueueFlag so the - // IsFIFO-only checks see the post-resolution flag. The temporary - // dormancy gate (§11 PR 2) runs separately in createQueue so - // SetQueueAttributes paths share the schema validator without - // re-rejecting on the gate. + // HT-FIFO validation runs after resolveFifoQueueFlag so the + // IsFIFO-only checks see the post-resolution flag. The cluster-wide + // capability gate (validateHTFIFOCapability, §11 PR 5b-3) runs inside + // tryCreateQueueOnce — not here — so SetQueueAttributes paths share + // the schema validator without triggering the network poll.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/sqs_catalog.go` around lines 385 - 392, The comment above the validatePartitionConfig(meta) call is stale because the "dormancy gate (§11 PR 2)" was removed; update the comment to remove the dormancy/dormancy-gate reference and instead state that HT-FIFO validation runs after resolveFifoQueueFlag so IsFIFO-only checks see the post-resolution flag, and note that the capability gate now runs in tryCreateQueueOnce (not here) so SetQueueAttributes paths share the schema validator without re-running the network poll; reference the functions resolveFifoQueueFlag, validatePartitionConfig, createQueue, tryCreateQueueOnce and the SetQueueAttributes path to make the intent clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@adapter/sqs_catalog.go`:
- Around line 385-392: The comment above the validatePartitionConfig(meta) call
is stale because the "dormancy gate (§11 PR 2)" was removed; update the comment
to remove the dormancy/dormancy-gate reference and instead state that HT-FIFO
validation runs after resolveFifoQueueFlag so IsFIFO-only checks see the
post-resolution flag, and note that the capability gate now runs in
tryCreateQueueOnce (not here) so SetQueueAttributes paths share the schema
validator without re-running the network poll; reference the functions
resolveFifoQueueFlag, validatePartitionConfig, createQueue, tryCreateQueueOnce
and the SetQueueAttributes path to make the intent clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 434dc892-4c26-4f9c-a5f3-7d98ebe9b988
📒 Files selected for processing (4)
adapter/sqs_capability_gate.goadapter/sqs_capability_gate_test.goadapter/sqs_catalog.goadapter/sqs_partitioning_integration_test.go
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 166e079abb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := s.validateHTFIFOCapability(ctx, requested); err != nil { | ||
| return false, err |
There was a problem hiding this comment.
Re-check existence before returning capability-gate failure
tryCreateQueueOnce snapshots queue existence at readTS and, when it sees the queue as missing, immediately runs validateHTFIFOCapability; if another request creates the same queue after that snapshot but before this gate call, any transient peer-poll failure returns InvalidAttributeValue and exits without hitting the conflict-retry path, so an idempotent create-or-get call can still fail even though the queue now exists with identical attributes. This race requires concurrent CreateQueue plus a temporary poll failure, but it directly impacts the reliability of idempotent retries during rolling upgrades or partial outages.
Useful? React with 👍 / 👎.
… round 2) Claude review on round 1 flagged that several doc comments still described the §11 PR 2 dormancy gate as if it were live, even though PR 5b-3 has lifted it and replaced it with the cluster-wide htfifo capability check. Documentation-only sweep — no behavior change. Updated: - adapter/sqs_catalog.go (sqsQueueMeta.PartitionCount field doc): was "PR 2 of the rollout introduces this field but a temporary CreateQueue gate rejects PartitionCount > 1 until PR 5 lifts the gate atomically with the data-plane fanout — so the schema exists but no partitioned data can land before the data plane is wired". Now describes validateHTFIFOCapability in tryCreateQueueOnce as the live guard. - adapter/sqs_catalog.go (parseAttributesIntoMeta comment): swapped the dormancy-gate reference for the capability gate. - adapter/sqs.go (sqsCapabilityHTFIFO and htfifoCapabilityAdvertised doc blocks): three stale references updated. The "PR 5 will use this for the CreateQueue capability gate" forward-reference is now "merged via #721 (PR 4-B-3a) and now consumed by validateHTFIFOCapability in the CreateQueue gate (PR 5b-3)". The block describing PR 5 lifting the dormancy gate is now factual past-tense. - adapter/sqs_keys.go (partitioned key family doc): swapped the "dormancy gate currently rejects" sentence for the validateHTFIFOCapability gate. - adapter/sqs_messages.go (decodeClientReceiptHandle doc): the "PR 5b-3 lifts the gate together with the capability check" was itself stale — PR 5b-3 has shipped. Updated to describe the current routing through validateReceiptHandleVersion. - adapter/sqs_partitioning_test.go (two test docstrings): "after PR 5 lifts the dormancy gate" / "the dormancy gate runs separately on CreateQueue and rejects this at the wire today" updated to refer to the capability gate. Verified by build + golangci-lint (0 issues) + targeted SQS test sweep (-race) all clean.
|
@claude review Round 2 — addressed Claude's round-1 callout on stale dormancy-gate doc references. Claude's round-1 verdict was "two threshold-level fixes are correct — three stale comments from round 0 remain open" with explicit "round 2 cleanup" framing. Documentation-only sweep across 5 files; no behavior change. Updated:
Build / Codex P2 ("re-check existence before returning capability-gate failure" at sqs_catalog.go:977) and CodeRabbit minor ("stale comment at sqs_catalog.go:385") are below threshold; this round addresses the latter as part of the doc sweep, the former is a TOCTOU between the snapshot existence read and the gate poll — narrow window, requires concurrent CreateQueue + transient peer outage at the same instant; not addressed in this round per loop priority. |
|
Claude finished @bootjp's task in 1m 56s —— View job Code Review: Round 2 (Doc-only sweep)
Verdict: Five of the six flagged locations are fixed — one stale block in
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dccac18e0c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| report := PollSQSHTFIFOCapability(ctx, peers, PollerConfig{}) | ||
| if report == nil || !report.AllAdvertise { |
There was a problem hiding this comment.
Reject partitioned CreateQueue without routing map coverage
validateHTFIFOCapability now allows PartitionCount > 1 queues after only checking peer /sqs_health, but it never verifies that the queue is actually routable by the partition resolver. In clusters where --sqsFifoPartitionMap is present but missing this queue (or has fewer partitions than requested), CreateQueue succeeds here, yet later Send/Receive/Delete operations build partitioned keys that ShardRouter.ResolveGroup treats as recognized-but-unresolved and fails closed, surfacing as coordinator no route for key errors (and then SQS InternalFailure). This change introduces a create-success/runtime-failure mode that leaves users with unusable queues; CreateQueue should fail early unless every partition for the requested queue can be routed.
Useful? React with 👍 / 👎.
…coverage (PR #734, round 2) Codex P1 review: validateHTFIFOCapability admitted PartitionCount > 1 queues after only checking peer /sqs_health, but never verified that the queue was actually routable. In a cluster where --sqsFifoPartitionMap is configured but missing the new queue (or configured with fewer partitions than requested), CreateQueue succeeded and then SendMessage / ReceiveMessage / DeleteMessage failed closed at kv.ShardRouter ("no route for key") on first use, surfacing as InternalFailure to the client. Operators ended up with unusable queues that couldn't be deleted by the AWS CreateQueue contract's idempotency rules either. Fix: extend validateHTFIFOCapability with a routing-coverage check that runs INDEPENDENTLY of the peer poll (and BEFORE it, so the empty-peer-list short-circuit doesn't bypass coverage). Three branches: - resolver==nil → skip (single-shard / no---sqsFifoPartitionMap deployment, partitioned keys fall through to the engine's default group, this case has worked since PR 5b-3 and stays that way for backward compatibility) - resolver!=nil + queue not in map → reject - resolver!=nil + RoutedPartitionCount(queue) < PartitionCount → reject The wire-level rejection is the new sanitized constant htfifoRoutingCoverageRejectionPublic — same redaction principle as htfifoCapabilityRejectionPublic from round 1: the operator detail (queue name, requested vs. actual partition count) goes to slog.Warn so an authenticated CreateQueue caller cannot probe the operator's --sqsFifoPartitionMap shape via error messages. What changed: - adapter/sqs_partition_resolver.go: new RoutedPartitionCount(queueName string) int method on *SQSPartitionResolver. Nil-receiver returns 0 so the gate's resolver==nil branch + empty-queue branch both treat unconfigured routes as "no coverage", which routes naturally to the rejection path or short-circuit. - adapter/sqs_capability_gate.go: new htfifoRoutingCoverageRejectionPublic constant and validateHTFIFORoutingCoverage helper. validateHTFIFOCapability now calls the routing-coverage check before the peer poll. - adapter/sqs.go: new partitionResolver field on SQSServer + WithSQSPartitionResolver option. - main_sqs.go: startSQSServer takes the resolver as a parameter and threads it through WithSQSPartitionResolver. - main.go: split buildSQSPartitionResolver into a wrapper + buildSQSPartitionResolverConcrete (returns the concrete pointer the SQS server needs without re-introducing the typed-nil interface trap that the wrapper guards against); serversInput populates the runtime's sqsPartitionResolver from the same partition map the coordinator consumes. Tests (sqs_capability_gate_test.go): - TestValidateHTFIFOCapability_RejectsWhenRoutingMapMissingQueue - TestValidateHTFIFOCapability_RejectsWhenRoutingMapPartiallyCoversQueue - TestValidateHTFIFOCapability_AcceptsWhenRoutingMapFullyCoversQueue - TestValidateHTFIFOCapability_AcceptsWhenResolverNil (backward compat) - TestRoutedPartitionCount_NilReceiver - TestRoutedPartitionCount_KnownAndUnknownQueue Caller audit: - validateHTFIFOCapability: one production caller (tryCreateQueueOnce in sqs_catalog.go), tests aside. Returns nil / *sqsAPIError as before; new failure mode joins the existing InvalidAttributeValue bucket. No caller observes a semantic change. - RoutedPartitionCount: only validateHTFIFORoutingCoverage consumes it in production. Nil-receiver semantics documented and tested. - WithSQSPartitionResolver: only main_sqs.go uses it. Default-nil short-circuit preserves existing test fixtures that build &SQSServer{} directly. - buildSQSPartitionResolverConcrete: new in this commit; one caller (serversInput populate). The typed-nil-safe wrapper buildSQSPartitionResolver still exists and is the only path the coordinator's WithPartitionResolver consumes.
|
@claude review Round 3 — addressed Codex P1: routing-map coverage.
Fix: new
Wire-level rejection is the new sanitized constant Wiring: new Tests:
Caller audit:
Build / |
|
Claude finished @bootjp's task in 3m 22s —— View job Code Review: Round 3 (Routing-map coverage fix)
Verdict: Routing-coverage fix is correct — one stale comment persists from Round 0, one doc inaccuracy to addressRouting-coverage fix (Codex P1) — correct
Three-branch logic is correct:
The
Test suite — thoroughSix new tests cover all the important cases and all assert on the sanitized
Wiring in
|
Summary
Phase 3.D PR 5b-3: lift the §11 PR 2 dormancy gate (PartitionCount > 1 hard-rejected at CreateQueue) and replace it with the §8.5 cluster-wide htfifo capability poll, atomically in one commit. The data plane caught up in #731 + #732, so partitioned FIFO queues are now creatable end-to-end.
What changes
Remove
validatePartitionDormancyGateandhtfifoTemporaryGateMessagefromadapter/sqs_partitioning.go. Both were marked "Removed in PR 5 in the same commit that wires the data plane so the gate-and-lift land atomically" — this is that commit.Add
(*SQSServer).validateHTFIFOCapabilityin a newadapter/sqs_capability_gate.go, called fromcreateQueueCore. Two-stage fail-closed check onPartitionCount > 1:htfifoCapabilityAdvertised). Refuses withInvalidAttributeValueotherwise.s.leaderSQSmust report htfifo via/sqs_healthwithin the poller's per-peer timeout. Any timeout, HTTP error, malformed body, or missing capability blocks the create.Vacuous on
PartitionCount <= 1and on emptyleaderSQS(single-node cluster — the local check is the whole cluster).Helpers:
collectSQSPeersreturnsleaderSQSvalues in deterministic sorted order with empty/duplicate addresses filtered.buildHTFIFOCapabilityRejectioncomposes the rejection message with each failing peer's address + reason so an operator triaging a partial-rolling-upgrade cluster does not need to re-run the poll out-of-band.CreateQueue control flow on
PartitionCount > 1Caller audit:
validateHTFIFOCapabilityhas exactly one production caller (createQueueCoreinsqs_catalog.go); both the JSON handler and the future query-protocol handler reach it through that one path.SetQueueAttributesis unaffected becausePartitionCountis immutable post-create (validatePartitionImmutability).Test changes
TestValidatePartitionDormancyGate_RejectsAboveOne(function gone).TestSQSServer_HTFIFO_DormancyGate_RejectsPartitionedCreate→TestSQSServer_HTFIFO_CapabilityGate_AcceptsOnSingleNode: the same wire payloads now SUCCEED because the local node advertises htfifo and there are no peers to poll.TestSQSServer_HTFIFO_DormancyGate_AllowsPartitionCountOne→TestSQSServer_HTFIFO_CapabilityGate_AllowsPartitionCountOnefor consistency.New unit tests (
sqs_capability_gate_test.go):TestValidateHTFIFOCapability_ShortCircuitsOnLegacyMetaTestValidateHTFIFOCapability_AcceptsWhenAllPeersAdvertiseTestValidateHTFIFOCapability_AcceptsOnEmptyPeerListTestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapabilityTestValidateHTFIFOCapability_RejectsWhenPeerUnreachableTestCollectSQSPeers_DeterministicTestBuildHTFIFOCapabilityRejection_ShapesOperatorMessageSelf-review (CLAUDE.md)
htfifoCapabilityAdvertisedcheck and strengthened by the cluster-wide poll.PollSQSHTFIFOCapabilityhelper (covered by its own race tests).collectSQSPeers+ sort are pure / deterministic.leaderSQSis only mutated at SQSServer construction (WithSQSLeaderMap), not at request time, so no read/write races. Leader transitions during the poll are handled by the existingproxyToLeaderpath that gatescreateQueuebeforevalidateHTFIFOCapabilityruns.PartitionCount > 1creates (rare control-plane operation). Legacy / single-partition CreateQueue calls pay one short-circuit branch.collectSQSPeers' sort is O(N log N) on a small N (cluster size). No hot-path impact.PartitionCountshape, dedup-scope rule, perMessageGroupId rule) still run BEFORE the capability gate insideparseAttributesIntoMeta, so an invalid shape rejects with the schema's reason rather than the gate's. SetQueueAttributes immutability remains the guard for post-create partition-shape changes.Test plan
make lint— golangci-lint 0 issuesTestValidateHTFIFOCapability* | TestCollectSQSPeers | TestBuildHTFIFOCapabilityRejection | TestSQSServer_HTFIFO* | TestValidatePartition*(-race, clean)Summary by CodeRabbit
New Features
Bug Fixes
Tests