feat(sqs): atomic dormancy gate-lift + htfifo capability gate (Phase 3.D PR 5b-3)#734

Merged
bootjp merged 4 commits into main from feat/sqs-htfifo-capability-gate
May 4, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented May 4, 2026

Summary

Phase 3.D PR 5b-3: lift the §11 PR 2 dormancy gate (PartitionCount > 1 hard-rejected at CreateQueue) and replace it with the §8.5 cluster-wide htfifo capability poll, atomically in one commit. The data plane caught up in #731 + #732, so partitioned FIFO queues are now creatable end-to-end.

What changes

  • Remove validatePartitionDormancyGate and htfifoTemporaryGateMessage from adapter/sqs_partitioning.go. Both were marked "Removed in PR 5 in the same commit that wires the data plane so the gate-and-lift land atomically" — this is that commit.

  • Add (*SQSServer).validateHTFIFOCapability in a new adapter/sqs_capability_gate.go, called from createQueueCore. Two-stage fail-closed check on PartitionCount > 1:

    1. Local: this binary must advertise htfifo (htfifoCapabilityAdvertised). Refuses with InvalidAttributeValue otherwise.
    2. Peers: every entry in s.leaderSQS must report htfifo via /sqs_health within the poller's per-peer timeout. Any timeout, HTTP error, malformed body, or missing capability blocks the create.

    Vacuous on PartitionCount <= 1 and on empty leaderSQS (single-node cluster — the local check is the whole cluster).

  • Helpers: collectSQSPeers returns leaderSQS values in deterministic sorted order with empty/duplicate addresses filtered. buildHTFIFOCapabilityRejection composes the rejection message with each failing peer's address + reason so an operator triaging a partial-rolling-upgrade cluster does not need to re-run the poll out-of-band.

CreateQueue control flow on PartitionCount > 1

schema validators (validatePartitionConfig, etc.)
→ validateHTFIFOCapability
    → htfifoCapabilityAdvertised check (local)
    → PollSQSHTFIFOCapability(ctx, collectSQSPeers(), …)
    → reject with InvalidAttributeValue on any failure
→ createQueueWithRetry

Caller audit: validateHTFIFOCapability has exactly one production caller (createQueueCore in sqs_catalog.go); both the JSON handler and the future query-protocol handler reach it through that one path. SetQueueAttributes is unaffected because PartitionCount is immutable post-create (validatePartitionImmutability).

Test changes

  • Delete TestValidatePartitionDormancyGate_RejectsAboveOne (function gone).
  • Convert TestSQSServer_HTFIFO_DormancyGate_RejectsPartitionedCreate → TestSQSServer_HTFIFO_CapabilityGate_AcceptsOnSingleNode: the same wire payloads now SUCCEED because the local node advertises htfifo and there are no peers to poll.
  • Rename TestSQSServer_HTFIFO_DormancyGate_AllowsPartitionCountOne → TestSQSServer_HTFIFO_CapabilityGate_AllowsPartitionCountOne for consistency.
  • Refresh comments on related tests + helpers.

New unit tests (sqs_capability_gate_test.go):

  • TestValidateHTFIFOCapability_ShortCircuitsOnLegacyMeta
  • TestValidateHTFIFOCapability_AcceptsWhenAllPeersAdvertise
  • TestValidateHTFIFOCapability_AcceptsOnEmptyPeerList
  • TestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapability
  • TestValidateHTFIFOCapability_RejectsWhenPeerUnreachable
  • TestCollectSQSPeers_Deterministic
  • TestBuildHTFIFOCapabilityRejection_ShapesOperatorMessage

Self-review (CLAUDE.md)

  1. Data loss — None. The gate strictly tightens CreateQueue acceptance vs. the previous dormancy reject; no path now accepts a write that would have been rejected before. The dormancy gate's invariant ("partitioned-shape meta cannot land on a binary that does not handle the partitioned keyspace") is preserved by the local htfifoCapabilityAdvertised check and strengthened by the cluster-wide poll.
  2. Concurrency / distributed failures — Poll runs concurrently across peers via the existing PollSQSHTFIFOCapability helper (covered by its own race tests). collectSQSPeers + sort are pure / deterministic. leaderSQS is only mutated at SQSServer construction (WithSQSLeaderMap), not at request time, so no read/write races. Leader transitions during the poll are handled by the existing proxyToLeader path that gates createQueue before validateHTFIFOCapability runs.
  3. Performance — Poll cost is O(peers) and only paid on PartitionCount > 1 creates (rare control-plane operation). Legacy / single-partition CreateQueue calls pay one short-circuit branch. collectSQSPeers' sort is O(N log N) on a small N (cluster size). No hot-path impact.
  4. Data consistency — Schema validators (PartitionCount shape, dedup-scope rule, perMessageGroupId rule) still run BEFORE the capability gate inside parseAttributesIntoMeta, so an invalid shape rejects with the schema's reason rather than the gate's. SetQueueAttributes immutability remains the guard for post-create partition-shape changes.
  5. Test coverage — Gate function: 5 unit tests covering short-circuit, happy path, vacuous empty, rolling-upgrade, and unreachable-peer classes. Helpers: 2 unit tests pinning deterministic order and message shape. Wire-level: existing HT-FIFO integration tests carry forward, with the dormancy-reject test converted to the new accepts-on-single-node happy path.

Test plan

  • make lint — golangci-lint 0 issues
  • Targeted: TestValidateHTFIFOCapability* | TestCollectSQSPeers | TestBuildHTFIFOCapabilityRejection | TestSQSServer_HTFIFO* | TestValidatePartition* (-race, clean)
  • Wider regression sweep on CreateQueue / DeleteQueue / PurgeQueue / SetQueueAttributes / Send/Receive/Delete / PartitionedFIFO / dispatch (-race, clean)
  • CI: full Jepsen + race on the PR

Summary by CodeRabbit

  • New Features

    • Cluster-wide capability validation for creating partitioned FIFO queues—creation now requires peers to advertise HT‑FIFO.
    • Client-facing rejection messages are sanitized; detailed per-peer diagnostics are recorded in server logs only.
  • Bug Fixes

    • Partitioned FIFO queues can be created on single-node clusters (no longer rejected).
  • Tests

    • Added comprehensive tests for the capability gate, peer collection, deterministic behavior, and server-side log formatting.

… gate (Phase 3.D PR 5b-3)

The §11 PR 2 dormancy gate (PartitionCount > 1 hard-rejected at
CreateQueue) was a placeholder while the data plane caught up. The
data plane is now in place (PR #731 + #732), so this PR atomically
lifts the dormancy gate and replaces it with the §8.5 capability
gate that polls every cluster peer's /sqs_health for the htfifo
capability.

What changes:

  - Remove validatePartitionDormancyGate and the
    htfifoTemporaryGateMessage constant from sqs_partitioning.go.
    Both were marked "Removed in PR 5 in the same commit that wires
    the data plane so the gate-and-lift land atomically" — that PR
    is this one.

  - Add (*SQSServer).validateHTFIFOCapability in
    adapter/sqs_capability_gate.go, called from createQueueCore.
    Two-stage fail-closed check on PartitionCount > 1:
      1. Local: this binary must advertise htfifo
         (htfifoCapabilityAdvertised). Refuses the create with
         InvalidAttributeValue if not.
      2. Peers: every entry in s.leaderSQS must report htfifo via
         /sqs_health within the poller's per-peer timeout. Any
         timeout, HTTP error, malformed body, or missing capability
         blocks the create.
    Vacuous on PartitionCount <= 1 and on empty leaderSQS (single-
    node cluster — the local check is the whole cluster).

  - collectSQSPeers helper returns leaderSQS values in deterministic
    sorted order with empty/duplicate addresses filtered, so the
    poller and operator-facing error messages never depend on Go map
    iteration order.

  - buildHTFIFOCapabilityRejection composes the rejection message
    with each failing peer's address + reason (per-peer Error or
    "missing capability") so an operator triaging a partial-rolling-
    upgrade cluster does not need to re-run the poll out-of-band.

CreateQueue control flow on PartitionCount > 1:

    schema validators (validatePartitionConfig, etc.)
    → validateHTFIFOCapability
        → htfifoCapabilityAdvertised check (local)
        → PollSQSHTFIFOCapability(ctx, collectSQSPeers(), …)
        → reject with InvalidAttributeValue on any failure
    → createQueueWithRetry

Caller audit: validateHTFIFOCapability has exactly one production
caller (createQueueCore in sqs_catalog.go); both the JSON handler
and the future query-protocol handler reach it through that one
path. SetQueueAttributes is unaffected because PartitionCount is
immutable post-create (validatePartitionImmutability).

Test changes:

  - Delete TestValidatePartitionDormancyGate_RejectsAboveOne (the
    function it tested is gone).

  - Convert TestSQSServer_HTFIFO_DormancyGate_RejectsPartitionedCreate
    into TestSQSServer_HTFIFO_CapabilityGate_AcceptsOnSingleNode —
    the same wire payloads now SUCCEED because the local node
    advertises htfifo and there are no peers to poll. Renamed
    TestSQSServer_HTFIFO_DormancyGate_AllowsPartitionCountOne →
    TestSQSServer_HTFIFO_CapabilityGate_AllowsPartitionCountOne for
    consistency.

  - Update comments on
    TestSQSServer_HTFIFO_RejectsQueueScopedDedupOnPartitioned,
    TestSQSServer_HTFIFO_RejectsNonPowerOfTwoPartitionCount,
    TestSQSServer_HTFIFO_ImmutabilitySetQueueAttributesRejects,
    mustCreateFIFOWithThroughputLimit, and the
    installPartitionedMetaForTest helper to describe the new
    capability-gate world.

New unit tests in sqs_capability_gate_test.go:

  - TestValidateHTFIFOCapability_ShortCircuitsOnLegacyMeta:
    PartitionCount in {0, 1} skips the poll entirely (proven by
    wiring a peer that would FAIL the gate and verifying the
    short-circuit path bypasses it).
  - TestValidateHTFIFOCapability_AcceptsWhenAllPeersAdvertise:
    happy path with two fake peers.
  - TestValidateHTFIFOCapability_AcceptsOnEmptyPeerList: vacuous
    case (single-node cluster).
  - TestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapability:
    rolling-upgrade fail-closed; offending peer's address surfaces
    in the InvalidAttributeValue message.
  - TestValidateHTFIFOCapability_RejectsWhenPeerUnreachable:
    transient-network fail-closed.
  - TestCollectSQSPeers_Deterministic: sort + dedup + empty-skip.
  - TestBuildHTFIFOCapabilityRejection_ShapesOperatorMessage:
    rejection-message shape pinned (advertising peers absent,
    failing peers contribute "(reason)" suffix, defensive paths).

Self-review (CLAUDE.md):

  1. Data loss — None. The gate strictly tightens CreateQueue
     acceptance vs. the previous dormancy reject; no path now
     accepts a write that would have been rejected before. The
     dormancy gate's invariant ("partitioned-shape meta cannot
     land on a binary that does not handle the partitioned
     keyspace") is preserved by the local htfifoCapabilityAdvertised
     check and strengthened by the cluster-wide poll.
  2. Concurrency / distributed failures — Poll runs concurrently
     across peers via the existing PollSQSHTFIFOCapability helper
     (covered by its own race tests). collectSQSPeers + sort are
     pure / deterministic. The leaderSQS map is only mutated at
     SQSServer construction (WithSQSLeaderMap), not at request
     time, so no read/write races. Leader transitions during the
     poll are handled by the existing proxyToLeader path that
     gates createQueue before validateHTFIFOCapability runs.
  3. Performance — Poll cost is O(peers) and only paid on
     PartitionCount > 1 creates (rare control-plane operation).
     Legacy / single-partition CreateQueue calls pay one
     short-circuit branch. collectSQSPeers' sort is O(N log N)
     on a small N (cluster size). No hot-path impact.
  4. Data consistency — Schema validators (PartitionCount shape,
     dedup-scope rule, perMessageGroupId rule) still run BEFORE
     the capability gate inside parseAttributesIntoMeta, so an
     invalid shape rejects with the schema's reason rather than
     the gate's. SetQueueAttributes immutability remains the
     guard for post-create partition-shape changes.
  5. Test coverage — Gate function: 5 unit tests covering the
     short-circuit, happy path, vacuous empty, rolling-upgrade,
     and unreachable-peer classes. Helpers: 2 unit tests pinning
     deterministic order and message shape. Wire-level: existing
     HT-FIFO integration tests carry forward, with the dormancy-
     reject test converted to the new accepts-on-single-node
     happy path.
@bootjp
Owner Author

bootjp commented May 4, 2026

@claude review

@coderabbitai

coderabbitai Bot commented May 4, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9a6ca2af-a4db-441d-9cfa-e53e6b351d69

📥 Commits

Reviewing files that changed from the base of the PR and between 166e079 and 63193ad.

📒 Files selected for processing (10)
  • adapter/sqs.go
  • adapter/sqs_capability_gate.go
  • adapter/sqs_capability_gate_test.go
  • adapter/sqs_catalog.go
  • adapter/sqs_keys.go
  • adapter/sqs_messages.go
  • adapter/sqs_partition_resolver.go
  • adapter/sqs_partitioning_test.go
  • main.go
  • main_sqs.go
📝 Walkthrough

Walkthrough

A cluster-wide HT-FIFO capability gate replaces the temporary dormancy gate for partitioned FIFO CreateQueue. For PartitionCount > 1 the local node must advertise htfifo and all distinct peers from s.leaderSQS must report the capability via /sqs_health; otherwise the request is rejected with a sanitized client message while operator-facing details are logged. Dormancy-gate code and tests were removed/updated.

Changes

HT-FIFO Capability Gate Implementation & Integration

| Layer / File(s) | Summary |
| --- | --- |
| Core Gate Logic: adapter/sqs_capability_gate.go | Adds htfifoCapabilityRejectionPublic, (*SQSServer).validateHTFIFOCapability (fail-closed cluster poll), collectSQSPeers (deterministic dedupe+sort), and formatHTFIFOCapabilityReportForLog (operator-facing per-peer failure formatting). |
| Queue Creation Integration: adapter/sqs_catalog.go | Removes temporary dormancy check from createQueueCore and invokes s.validateHTFIFOCapability(ctx, requested) inside tryCreateQueueOnce after existence check and before metadata commit. |
| Dormancy Gate Removal: adapter/sqs_partitioning.go | Deletes htfifoTemporaryGateMessage and validatePartitionDormancyGate; updates validatePartitionConfig comments to reflect gating moved to validateHTFIFOCapability. |
| Test Infrastructure: adapter/sqs_capability_gate_test.go | Adds test helper htfifoCapabilityServer and tests covering short-circuits (nil/PartitionCount<=1), acceptance when all peers advertise, acceptance on empty peer list, rejection for missing capability or unreachable peers, deterministic peer collection, server-side log formatting, and confirmation the public rejection message is sanitized. |
| Integration & Unit Test Updates: adapter/sqs_partitioning_integration_test.go, adapter/sqs_partitioning_test.go, adapter/sqs_partitioned_dispatch_test.go | Replaces dormancy-gate assertions with capability-gate expectations (single-node accepts PartitionCount>1), adds idempotency test for existing queues bypassing gate, removes dormancy-gate unit test, and updates explanatory comments. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SQSServer as Local SQS Server
    participant Poller
    participant Peer as SQS Peer /sqs_health

    Client->>SQSServer: CreateQueue (FifoQueue=true, PartitionCount>1)
    SQSServer->>SQSServer: check local htfifo advertisement
    SQSServer->>Poller: PollSQSHTFIFOCapability(peers)
    Poller->>Peer: GET /sqs_health
    Peer-->>Poller: 200 + advertised capabilities / timeout/error
    Poller-->>SQSServer: capability report (per-peer results)
    alt all advertise
        SQSServer->>Client: 200 OK (proceed to metadata write)
    else any missing/unreachable
        SQSServer->>SQSServer: slog.Warn(operator details)
        SQSServer->>Client: 400 BadRequest (sanitized htfifoCapabilityRejectionPublic)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

"I’m a rabbit with a tiny log,
Peers report back through health and cog.
Gates closed tight till all say ‘go’,
Then partitions dance in ordered row.
Hop, hop — see server logs for more! 🐇"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: removing the temporary dormancy gate and replacing it with an HT-FIFO capability gate, which is the core structural modification across the entire changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@claude

claude Bot commented May 4, 2026

Claude finished @bootjp's task in 2m 15s —— View job


Code Review: feat(sqs): atomic dormancy gate-lift + htfifo capability gate (Phase 3.D PR 5b-3)

  • Gather context and understand the PR
  • Read changed files
  • Review adapter/sqs_capability_gate.go (new file)
  • Review adapter/sqs_capability_gate_test.go (new file)
  • Review adapter/sqs_catalog.go (modified)
  • Review adapter/sqs_partitioning.go (modified)
  • Review test modifications
  • Post final review

Summary

The design is sound: atomically removing the dormancy gate and replacing it with a cluster-wide capability poll is the right approach. The fail-closed semantics, deterministic peer ordering, and operator-facing rejection messages are well-executed. Test coverage is thorough for the new surface area.


Issues Found

1. Stale comment in sqs_catalog.go — PartitionCount applier (lines 481–486)

// PR 2 of the rollout introduces the field but the temporary
// dormancy gate in tryCreateQueueOnce rejects PartitionCount > 1
// until PR 5 lifts the gate atomically with the data plane.

PR 5 has now landed; this comment describes the old gating as if it still exists. A reader discovering this comment will be confused about whether the gate is still active. The applier comment should be updated to reflect the current state (gate is lifted; validateHTFIFOCapability in createQueueCore is the live guard).


2. Stale comment in sqs_catalog.go: sqsQueueMeta.PartitionCount field (lines 126–130)

// PR 2 of the rollout introduces this field but a temporary CreateQueue gate
// rejects PartitionCount > 1 until PR 5 lifts the gate atomically
// with the data-plane fanout — so the schema exists but no
// partitioned data can land before the data plane is wired.

Same issue on the struct field's doc comment. Since sqs_catalog.go is already modified in this PR, this comment should be updated here too — the last sentence is now factually incorrect.


3. Stale comment in sqs_partitioning.go file header (lines 14–18)

// Until then the field exists in the meta type and the router function
// compiles, but no partitioned queue can land.

This is now incorrect — partitioned queues CAN land after this PR. The "Until then" block predates the gate-lift and was never removed.


4. Stale comment in sqs_partitioning_test.go — line 237

// FIFO + perMessageGroupId + PartitionCount=8: accept (the
// dormancy gate runs separately on CreateQueue and rejects this
// at the wire today, but the cross-attribute validator on its
// own does not).

"Rejects this at the wire today" is no longer true. This test comment dates from before the gate-lift.



Observations (non-blocking)

Dead report == nil guard in sqs_capability_gate.go:55

if report == nil || !report.AllAdvertise {

PollSQSHTFIFOCapability never returns nil — it always allocates and returns *HTFIFOCapabilityReport. The nil branch is dead code. It's harmless, but a future reader may spend time understanding why it can't be nil. A comment like // PollSQSHTFIFOCapability never returns nil; nil guard is future-proofing would help, or the guard could be dropped since the function signature guarantees a non-nil return.

htfifoCapabilityAdvertised = false path is untestable

sqs_capability_gate.go:44 branches on htfifoCapabilityAdvertised, which is a package-level constant (= true in sqs.go). The rejection path on line 45–47 cannot be exercised in any unit or integration test without a build tag or test seam. This is not a regression (the constant is intentional), but the test comment in sqs_capability_gate_test.go:79 correctly documents this as a build-time assumption. Worth a design note or a // NOT TESTABLE: build-time const marker on the branch.

collectSQSPeers polls self during the capability check

sqs_capability_gate.go:68 explicitly documents that the map may contain the leader's own SQS address and that "polling self over loopback is cheap." However, the local check on line 44 already verifies htfifoCapabilityAdvertised for this node. Polling self adds one HTTP round-trip to the critical path of CreateQueue. This is noted in the comment and is by-design to keep the invariant uniform — just flagging it for awareness, not requesting a change.

PollerConfig{} zero value passes http.DefaultClient to the poll

sqs_capability_gate.go:54 passes PollerConfig{} to PollSQSHTFIFOCapability. This falls back to http.DefaultClient, which has no Client.Timeout set. The per-peer timeout (3 s from defaultSQSCapabilityPollTimeout) is applied via context.WithTimeout per goroutine, so the effective timeout is still bounded. This is correct, but a future reviewer may wonder why there's no HTTPClient.Timeout — a comment in validateHTFIFOCapability could note that the per-peer context deadline is the timeout boundary.


Positive Highlights

  • Atomic gate-and-lift is exactly the right approach. There is no window where the dormancy gate is removed before the capability check is wired.
  • createQueueCore is the single caller, covering both the JSON handler and the future query-protocol handler. SetQueueAttributes correctly bypasses the gate since PartitionCount is immutable post-create.
  • collectSQSPeers sort + dedup is simple, correct, and makes the error message and test assertions stable across map iteration order.
  • buildHTFIFOCapabilityRejection provides genuinely useful operator diagnostics — naming the failing peer + reason without requiring out-of-band diagnosis is good UX.
  • Test suite covers all the important cases: short-circuit, happy path, vacuous empty, rolling-upgrade reject, unreachable peer, deterministic ordering, and message shape. The conversion of the dormancy-reject test to the capability-gate-accepts test is clean.
  • installPartitionedMetaForTest comment is updated with accurate context about why the helper predates the gate-lift and when new tests should prefer the public surface.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df11d2b9a4


Comment thread adapter/sqs_catalog.go Outdated
// or missing capability so a partitioned queue cannot land in a
// partially-upgraded cluster where some peer would silently
// store its records under the legacy single-partition keyspace.
if err := s.validateHTFIFOCapability(ctx, requested); err != nil {

P1: Check existing queue before enforcing HTFIFO peer gate

createQueueCore now runs validateHTFIFOCapability before createQueueWithRetry, so any transient peer poll failure (timeout/unreachable/malformed health) returns InvalidAttributeValue even when the queue already exists with identical attributes. That breaks the adapter’s idempotent CreateQueue behavior (TestSQSServer_CatalogCreateIsIdempotent) for partitioned FIFO queues and can cause create-or-get clients to fail during partial outages or rolling upgrades despite no new queue being created. The capability gate should run only on the new-queue creation path after the existing-queue equality check.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request replaces the temporary dormancy gate with a cluster-wide capability check for partitioned FIFO queues. When a queue is created with multiple partitions, the system now verifies that all cluster peers advertise the "htfifo" capability, ensuring data consistency during rolling upgrades. Review feedback highlights potential concurrency issues when accessing the peer map and suggests optimizing the network-based capability polling through caching or low-latency retry configurations.

Comment on lines +69 to +87
func (s *SQSServer) collectSQSPeers() []string {
	if len(s.leaderSQS) == 0 {
		return nil
	}
	peers := make([]string, 0, len(s.leaderSQS))
	seen := make(map[string]struct{}, len(s.leaderSQS))
	for _, addr := range s.leaderSQS {
		if addr == "" {
			continue
		}
		if _, ok := seen[addr]; ok {
			continue
		}
		seen[addr] = struct{}{}
		peers = append(peers, addr)
	}
	sort.Strings(peers)
	return peers
}

medium

The collectSQSPeers method iterates over s.leaderSQS, which is a map. If this map is modified concurrently by another goroutine (e.g., during a cluster membership update), this iteration will cause a panic. While the PR description mentions that leaderSQS is only mutated at construction, it is safer to protect this access with a read lock. Additionally, if this operation is part of a read-modify-write sequence requiring leadership, ensure you verify leadership status after acquiring the lock to prevent race conditions.

References
  1. To prevent race conditions in read-modify-write operations that require leadership, verify leadership status after acquiring the lock that protects the operation.
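If the peer map ever did become mutable at request time, the read lock this comment asks for might look like the sketch below. The PR description states the map is only written at construction, so this is a what-if rather than a required change; `peerSet` and its string-to-string map are assumptions, since the real type of leaderSQS is not shown.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// peerSet is a hypothetical wrapper around a node ID -> SQS address map.
type peerSet struct {
	mu    sync.RWMutex
	peers map[string]string
}

func newPeerSet() *peerSet {
	return &peerSet{peers: make(map[string]string)}
}

// set records or replaces a peer's address under the write lock.
func (p *peerSet) set(id, addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.peers[id] = addr
}

// collect returns the distinct non-empty addresses in sorted order. The
// read lock lets many collectors run together while excluding writers.
func (p *peerSet) collect() []string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	seen := make(map[string]struct{}, len(p.peers))
	out := make([]string, 0, len(p.peers))
	for _, addr := range p.peers {
		if addr == "" {
			continue
		}
		if _, dup := seen[addr]; dup {
			continue
		}
		seen[addr] = struct{}{}
		out = append(out, addr)
	}
	sort.Strings(out)
	return out
}

func main() {
	ps := newPeerSet()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ps.set(fmt.Sprintf("node-%d", i), fmt.Sprintf("10.0.0.%d:8080", i%2))
			_ = ps.collect() // concurrent reads are safe alongside writes
		}(i)
	}
	wg.Wait()
	fmt.Println(ps.collect())
}
```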

Comment on lines +48 to +58
	peers := s.collectSQSPeers()
	if len(peers) == 0 {
		// Single-node deployment: the local check above is the
		// whole cluster. Vacuously true on the peer side.
		return nil
	}
	report := PollSQSHTFIFOCapability(ctx, peers, PollerConfig{})
	if report == nil || !report.AllAdvertise {
		return newSQSAPIError(http.StatusBadRequest, sqsErrInvalidAttributeValue,
			buildHTFIFOCapabilityRejection(report))
	}

medium

The validateHTFIFOCapability function performs a cluster-wide HTTP poll for every CreateQueue request where PartitionCount > 1. This introduces a blocking network dependency. Consider caching the capability status. If retries are implemented for these network calls, keep the maximum backoff duration short (e.g., 10ms) to minimize latency impact on the control plane, as is standard for our low-latency components.

References
  1. For low-latency systems, use a short maximum backoff duration for retries (e.g., 10ms) as long-lived contention is not the primary scenario to handle.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapter/sqs_capability_gate.go`:
- Around line 54-57: The current branch returns
buildHTFIFOCapabilityRejection(report) directly to the caller which exposes peer
addresses and poller error text; instead, in the PollSQSHTFIFOCapability failure
path (where report == nil || !report.AllAdvertise) log the full report and
per-peer details server-side (use the same redaction approach as
adapter/sqs_catalog.go) and replace the client-visible payload with a sanitized
rejection message (e.g., a generic capability rejection string) passed into
newSQSAPIError; keep using PollSQSHTFIFOCapability,
buildHTFIFOCapabilityRejection and newSQSAPIError as the touchpoints but ensure
only redacted data is returned to callers while full details are written to
server logs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 800e14f6-c2b6-4468-b08d-c78e73f7fe36

📥 Commits

Reviewing files that changed from the base of the PR and between 637e543 and df11d2b.

📒 Files selected for processing (7)
  • adapter/sqs_capability_gate.go
  • adapter/sqs_capability_gate_test.go
  • adapter/sqs_catalog.go
  • adapter/sqs_partitioned_dispatch_test.go
  • adapter/sqs_partitioning.go
  • adapter/sqs_partitioning_integration_test.go
  • adapter/sqs_partitioning_test.go
💤 Files with no reviewable changes (1)
  • adapter/sqs_partitioning_test.go

Comment thread adapter/sqs_capability_gate.go Outdated
… message (PR #734, round 1)

Two PR #734 review findings addressed in one commit because they
touch the same code path:

1. Codex P1 — Check existing queue before enforcing HTFIFO peer gate

   The previous placement of validateHTFIFOCapability inside
   createQueueCore (before createQueueWithRetry) ran the gate on
   EVERY CreateQueue call with PartitionCount > 1, including
   idempotent retries on an already-existing queue with identical
   attributes. A transient peer poll failure (timeout / unreachable
   / malformed health) during such a retry would then return
   InvalidAttributeValue instead of the AWS-correct 200 OK, breaking
   create-or-get clients during partial outages or rolling upgrades.

   Fix: move the gate INTO tryCreateQueueOnce after the existence
   check and BEFORE the OCC dispatch. The order in tryCreateQueueOnce
   is now:

     1. loadQueueMetaAt — check if queue exists at the snapshot
     2. exists + matching attrs → return (true, nil) idempotent OK
     3. exists + different attrs → return QueueNameExists
     4. validateHTFIFOCapability — runs ONLY on the genuine create
        path
     5. loadQueueGenerationAt + dispatch the create

   The gate may run more than once across OCC retries (each retry
   that gets to "queue still missing" re-polls), but every retry
   that hits an existing-queue match short-circuits before the gate
   runs — so idempotent CreateQueue under a partial cluster outage
   stays AWS-correct.

   Caller audit: validateHTFIFOCapability has exactly one production
   caller (now tryCreateQueueOnce); both the JSON handler
   (createQueue → createQueueCore → createQueueWithRetry →
   tryCreateQueueOnce) and the query-protocol handler
   (sqs_query_protocol.go: 182 → createQueueCore → …) reach it
   through that one path. Move is symmetric — no caller observes a
   semantic change for a queue that genuinely needs to be created;
   only the existing-queue path stops paying the gate cost.

2. CodeRabbit major — Don't send raw peer poll details back to caller

   buildHTFIFOCapabilityRejection's output (peer addresses + raw
   poller error text) was returned to the wire layer as the
   InvalidAttributeValue message, leaking cluster topology to any
   authenticated CreateQueue caller. This conflicts with the
   stricter error-redaction policy already used elsewhere in
   sqs_catalog.go.

   Fix: the wire-level rejection is now the sanitized constant
   htfifoCapabilityRejectionPublic ("PartitionCount > 1 requires
   every cluster peer to advertise the htfifo capability via
   /sqs_health; one or more peers did not — see server logs for
   details"). The full per-peer detail goes to slog.Warn with
   structured fields (queueName, partitionCount, peerCount, detail)
   so an operator triaging the rolling upgrade can read the failing
   peer addresses + reasons from the server logs without rerunning
   the poll out-of-band.

   Renamed buildHTFIFOCapabilityRejection →
   formatHTFIFOCapabilityReportForLog to make its server-side-only
   contract obvious at the call site.

Test changes:

  - New TestSQSServer_HTFIFO_CapabilityGate_IsIdempotentOnExistingQueue
    (wire-level): creates a partitioned queue on a single-node
    cluster (gate vacuously passes), poisons leaderSQS with an
    unreachable address, then re-creates the same queue with
    identical attrs and expects 200; finally creates a NEW queue
    with the poisoned peer map and expects the 400 (proves the
    gate is still in effect for genuine creates).
  - New TestValidateHTFIFOCapability_PublicMessageDoesNotLeakPeerDetails:
    pins the sanitization contract — the wire-level message must
    equal htfifoCapabilityRejectionPublic exactly, never contain
    a peer host:port.
  - Updated TestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapability
    and TestValidateHTFIFOCapability_RejectsWhenPeerUnreachable
    to assert the sanitized constant + NotContains on the peer
    address.
  - Renamed TestBuildHTFIFOCapabilityRejection_ShapesOperatorMessage
    → TestFormatHTFIFOCapabilityReportForLog_ShapesServerSideDetail
    to match the renamed helper; assertion that the helper output
    is server-side-only (no client wire surface assertion here).

Below threshold (intentionally not addressed in this round):
  - Gemini medium on collectSQSPeers concurrency: leaderSQS is
    only mutated at SQSServer construction (WithSQSLeaderMap), not
    at request time. Gemini's own comment acknowledges this.
  - Gemini medium on caching the capability status: CreateQueue is
    a rare control-plane operation; caching adds a stale-window
    failure mode (a cluster that already lost a peer would still
    accept a partitioned queue while the cache is warm). Pure
    performance suggestion, not correctness.
@bootjp
Owner Author

bootjp commented May 4, 2026

@claude review

Round 1 — addressed two threshold-level findings in one commit (they touch the same code path):

1. Codex P1 — Idempotency break: validateHTFIFOCapability ran in createQueueCore before createQueueWithRetry, so a transient peer poll failure during an idempotent CreateQueue retry on an existing partitioned queue with identical attrs would flip a 200-OK into a 400. Moved the gate INTO tryCreateQueueOnce after the existence check and before the OCC dispatch. Idempotent retries now short-circuit via the existing attributesEqual(existing, requested) branch and never touch the network.

Caller audit: validateHTFIFOCapability has exactly one production caller (now tryCreateQueueOnce). Both the JSON handler and the query-protocol handler reach it through that single path. Move is symmetric — no caller observes a semantic change for queues that genuinely need to be created; only the existing-queue path stops paying the gate cost.

2. CodeRabbit Major — Information leak: the wire-level InvalidAttributeValue message included peer addresses + raw poller error text, leaking cluster topology to authenticated callers. The wire message is now the sanitized constant htfifoCapabilityRejectionPublic ("…one or more peers did not — see server logs for details"); the per-peer detail goes to slog.Warn with structured fields (queueName, partitionCount, peerCount, detail) for operator triage. Renamed buildHTFIFOCapabilityRejection → formatHTFIFOCapabilityReportForLog to make its server-side-only contract obvious at the call site.

Tests:

  • New TestSQSServer_HTFIFO_CapabilityGate_IsIdempotentOnExistingQueue (wire-level): create a partitioned queue → poison leaderSQS with an unreachable peer → re-create with identical attrs (must 200) → create a NEW queue (must 400, proves gate is still in effect).
  • New TestValidateHTFIFOCapability_PublicMessageDoesNotLeakPeerDetails: pins the redaction contract — wire message must equal the sanitized constant exactly, never contain a peer host:port.
  • Updated reject-path tests to assert on the sanitized constant + NotContains on the peer address.
  • Renamed TestBuildHTFIFOCapabilityRejection_* → TestFormatHTFIFOCapabilityReportForLog_*.

Below threshold (intentionally not addressed in this round):

  • Gemini medium on collectSQSPeers concurrency: leaderSQS is only mutated at SQSServer construction (WithSQSLeaderMap), not at request time. Gemini's own comment acknowledges this.
  • Gemini medium on caching the capability status: caching adds a stale-window failure mode (a cluster that just lost a peer would still accept a partitioned queue while the cache is warm). Pure performance suggestion, not correctness.

Build / golangci-lint --fix / targeted SQS test sweep all clean under -race.

@claude

claude Bot commented May 4, 2026

Claude finished @bootjp's task in 2m 50s —— View job


Code Review: Round 1 Changes

  • Gather context and understand the PR
  • Read changed files (capability gate, catalog, tests)
  • Review idempotency fix (gate moved to tryCreateQueueOnce)
  • Review information leak fix (sanitized wire message)
  • Review new tests
  • Check stale comments from previous round
  • Post final review

Verdict: Two threshold-level fixes are correct — three stale comments from round 0 remain open

Both fixes are correct, clean, and well-tested. Flagging the remaining stale comments as a round 2 cleanup item before merge.


Idempotency fix (Codex P1) — correct

The gate moves from createQueueCore (before createQueueWithRetry) to tryCreateQueueOnce at sqs_catalog.go:976, placed after the attributesEqual short-circuit on line 957 and before loadQueueGenerationAt on line 979. Exactly right. The comment block at lines 962–975 explains the invariant clearly.

TestSQSServer_HTFIFO_CapabilityGate_IsIdempotentOnExistingQueue is the right test: create → poison peer map → identical-attr retry must 200 → different-name queue must 400. The third assertion is critical — it proves the gate is still live for genuinely new queues, not just suppressed. No complaints here.

Information-leak fix (CodeRabbit Major) — correct

htfifoCapabilityRejectionPublic at sqs_capability_gate.go:18 is the sanitized wire constant; slog.Warn at line 71 gets the per-peer detail via formatHTFIFOCapabilityReportForLog. The rename from buildHTFIFOCapabilityRejection to formatHTFIFOCapabilityReportForLog makes the server-side-only contract clear at the call site.

TestValidateHTFIFOCapability_PublicMessageDoesNotLeakPeerDetails pins the contract with require.Equal(t, htfifoCapabilityRejectionPublic, ...) + require.NotContains(t, ..., oldAddr). The existing rejection tests were also updated with NotContains guards.


Stale comments — three from round 0 still open

The previous review flagged four stale comments. One was fixed (the PartitionCount applier doc). Three remain:

1. sqs_catalog.go:125–130, PartitionCount field doc on sqsQueueMeta

// PR 2 of the rollout introduces this field but a temporary CreateQueue gate
// rejects PartitionCount > 1 until PR 5 lifts the gate atomically
// with the data-plane fanout — so the schema exists but no
// partitioned data can land before the data plane is wired.

This is now false — PR 5b-3 has lifted the gate. The last sentence claims partitioned data cannot land, which is wrong.


2. sqs_partitioning.go:14–18 — file header

// PR 2 of the §11 rollout introduces the schema fields plus the
// validation surface — including the temporary dormancy gate that
// rejects PartitionCount > 1 at CreateQueue. PR 5 lifts the gate
// atomically with the data-plane fanout so a half-deployed cluster
// can never accept a partitioned queue without the data plane to
// serve it. Until then the field exists in the meta type and the
// router function compiles, but no partitioned queue can land.

"Until then … no partitioned queue can land" is now false. Should be updated to past tense: the gate was lifted in PR 5b-3.


3. sqs_partitioning_test.go:234–237 — test comment

// FIFO + perMessageGroupId + PartitionCount=8: accept (the
// dormancy gate runs separately on CreateQueue and rejects this
// at the wire today, but the cross-attribute validator on its
// own does not).

"the dormancy gate … rejects this at the wire today" is no longer true — CreateQueue with PartitionCount=8 now succeeds on a capable cluster.



Minor (non-blocking)

sqs_partitioning.go:188–190 in validateStandardQueueRejectsHTFIFO's doc still says "once PR 5 lifts the dormancy gate" in future tense. Trivially update to past tense — "since PR 5 lifted" — to match the completed state.


Positive highlights

  • The gate placement in tryCreateQueueOnce is tight: after attributesEqual (so idempotent retries never touch the network) and before loadQueueGenerationAt (so a rejected create doesn't burn an OCC slot). Correct on both sides.
  • TestSQSServer_HTFIFO_CapabilityGate_IsIdempotentOnExistingQueue's three-phase structure (create → poison → retry-same-name-200 → new-name-400) is the right test. The third assertion in particular ensures the gate isn't just suppressed.
  • formatHTFIFOCapabilityReportForLog rename makes the server-side-only contract impossible to miss at the call site.
  • The slog.Warn fields (queueName, partitionCount, peerCount, detail) give operators everything they need without a re-poll.


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
adapter/sqs_catalog.go (1)

385-392: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale comment — references the removed dormancy gate.

The second sentence now describes a gate that no longer exists:

The temporary dormancy gate (§11 PR 2) runs separately in createQueue so
SetQueueAttributes paths share the schema validator without re-rejecting on the gate.

The dormancy gate was removed by this PR. The rationale still holds (the capability gate runs in tryCreateQueueOnce, not here, precisely so SetQueueAttributes paths don't re-run the network poll), but the reference to "§11 PR 2" and "dormancy gate" is now incorrect.

📝 Proposed fix
-	// HT-FIFO validation runs after resolveFifoQueueFlag so the
-	// IsFIFO-only checks see the post-resolution flag. The temporary
-	// dormancy gate (§11 PR 2) runs separately in createQueue so
-	// SetQueueAttributes paths share the schema validator without
-	// re-rejecting on the gate.
+	// HT-FIFO validation runs after resolveFifoQueueFlag so the
+	// IsFIFO-only checks see the post-resolution flag. The cluster-wide
+	// capability gate (validateHTFIFOCapability, §11 PR 5b-3) runs inside
+	// tryCreateQueueOnce — not here — so SetQueueAttributes paths share
+	// the schema validator without triggering the network poll.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 385 - 392, The comment above the
validatePartitionConfig(meta) call is stale because the "dormancy gate (§11 PR
2)" was removed; update the comment to remove the dormancy/dormancy-gate
reference and instead state that HT-FIFO validation runs after
resolveFifoQueueFlag so IsFIFO-only checks see the post-resolution flag, and
note that the capability gate now runs in tryCreateQueueOnce (not here) so
SetQueueAttributes paths share the schema validator without re-running the
network poll; reference the functions resolveFifoQueueFlag,
validatePartitionConfig, createQueue, tryCreateQueueOnce and the
SetQueueAttributes path to make the intent clear.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between df11d2b and 166e079.

📒 Files selected for processing (4)
  • adapter/sqs_capability_gate.go
  • adapter/sqs_capability_gate_test.go
  • adapter/sqs_catalog.go
  • adapter/sqs_partitioning_integration_test.go


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 166e079abb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/sqs_catalog.go
Comment on lines +976 to +977
if err := s.validateHTFIFOCapability(ctx, requested); err != nil {
return false, err

P2 Badge Re-check existence before returning capability-gate failure

tryCreateQueueOnce snapshots queue existence at readTS and, when it sees the queue as missing, immediately runs validateHTFIFOCapability; if another request creates the same queue after that snapshot but before this gate call, any transient peer-poll failure returns InvalidAttributeValue and exits without hitting the conflict-retry path, so an idempotent create-or-get call can still fail even though the queue now exists with identical attributes. This race requires concurrent CreateQueue plus a temporary poll failure, but it directly impacts the reliability of idempotent retries during rolling upgrades or partial outages.


… round 2)

Claude review on round 1 flagged that several doc comments still
described the §11 PR 2 dormancy gate as if it were live, even though
PR 5b-3 has lifted it and replaced it with the cluster-wide htfifo
capability check. Documentation-only sweep — no behavior change.

Updated:

- adapter/sqs_catalog.go (sqsQueueMeta.PartitionCount field doc):
  was "PR 2 of the rollout introduces this field but a temporary
  CreateQueue gate rejects PartitionCount > 1 until PR 5 lifts the
  gate atomically with the data-plane fanout — so the schema exists
  but no partitioned data can land before the data plane is wired".
  Now describes validateHTFIFOCapability in tryCreateQueueOnce as
  the live guard.

- adapter/sqs_catalog.go (parseAttributesIntoMeta comment): swapped
  the dormancy-gate reference for the capability gate.

- adapter/sqs.go (sqsCapabilityHTFIFO and htfifoCapabilityAdvertised
  doc blocks): three stale references updated. The "PR 5 will use
  this for the CreateQueue capability gate" forward-reference is now
  "merged via #721 (PR 4-B-3a) and now consumed by
  validateHTFIFOCapability in the CreateQueue gate (PR 5b-3)". The
  block describing PR 5 lifting the dormancy gate is now factual
  past-tense.

- adapter/sqs_keys.go (partitioned key family doc): swapped the
  "dormancy gate currently rejects" sentence for the
  validateHTFIFOCapability gate.

- adapter/sqs_messages.go (decodeClientReceiptHandle doc): the
  "PR 5b-3 lifts the gate together with the capability check" was
  itself stale — PR 5b-3 has shipped. Updated to describe the
  current routing through validateReceiptHandleVersion.

- adapter/sqs_partitioning_test.go (two test docstrings): "after
  PR 5 lifts the dormancy gate" / "the dormancy gate runs separately
  on CreateQueue and rejects this at the wire today" updated to
  refer to the capability gate.

Verified by build + golangci-lint (0 issues) + targeted SQS test
sweep (-race) all clean.
@bootjp
Owner Author

bootjp commented May 4, 2026

@claude review

Round 2 — addressed Claude's round-1 callout on stale dormancy-gate doc references.

Claude's round-1 verdict was "two threshold-level fixes are correct — three stale comments from round 0 remain open" with explicit "round 2 cleanup" framing. Documentation-only sweep across 5 files; no behavior change.

Updated:

  • adapter/sqs_catalog.go: sqsQueueMeta.PartitionCount field doc + parseAttributesIntoMeta comment
  • adapter/sqs.go: sqsCapabilityHTFIFO and htfifoCapabilityAdvertised doc blocks (3 stale forward-references to "PR 5 will lift" / "PR 5 will use this")
  • adapter/sqs_keys.go: partitioned-key-family doc
  • adapter/sqs_messages.go: decodeClientReceiptHandle doc (the "PR 5b-3 lifts the gate" line was itself stale — PR 5b-3 has shipped)
  • adapter/sqs_partitioning_test.go: two test docstrings

Build / golangci-lint --fix (0 issues) / targeted SQS test sweep all clean under -race.

Codex P2 ("re-check existence before returning capability-gate failure" at sqs_catalog.go:977) and CodeRabbit minor ("stale comment at sqs_catalog.go:385") are below threshold. This round addresses the latter as part of the doc sweep. The former is a TOCTOU between the snapshot existence read and the gate poll: the window is narrow and requires a concurrent CreateQueue plus a transient peer outage at the same instant, so it is not addressed in this round per loop priority.

@claude

claude Bot commented May 4, 2026

Claude finished @bootjp's task in 1m 56s —— View job


Code Review: Round 2 (Doc-only sweep)

  • Gather context and understand the PR
  • Read updated files to verify stale comment fixes
    • adapter/sqs_catalog.go — PartitionCount field doc + parseAttributesIntoMeta comment
    • adapter/sqs.go — sqsCapabilityHTFIFO and htfifoCapabilityAdvertised doc blocks
    • adapter/sqs_keys.go — partitioned-key-family doc
    • adapter/sqs_messages.go — decodeClientReceiptHandle doc
    • adapter/sqs_partitioning_test.go — two test docstrings
  • Verify Round 1 stale comments (×3) are resolved
  • Assess Codex P2 TOCTOU (intentionally deferred)
  • Post final review

Verdict: Five of the six flagged locations are fixed — one stale block in sqs_partitioning.go remains


Fixed correctly ✅

All five files in the Round 2 sweep are clean:

  • sqs_catalog.go:122–130: PartitionCount field doc now correctly describes validateHTFIFOCapability in tryCreateQueueOnce as the live guard. ✅
  • sqs_catalog.go:382–392: parseAttributesIntoMeta comment now says "cluster-wide capability gate … runs separately in tryCreateQueueOnce"; also adds the useful note about immutability blocking PartitionCount changes in SetQueueAttributes. This also closes the CodeRabbit minor flagged at line 385. ✅
  • sqs.go:53–84 — All three forward-references to "PR 5 will lift / will use" are updated to past-tense. The block that described PR 5 as a future gate-and-lift now accurately describes the current state. ✅
  • sqs_keys.go:262–264 — "dormancy gate currently rejects" updated to reference validateHTFIFOCapability, PR 5b-3. ✅
  • sqs_messages.go:384–396 — The stale "PR 5b-3 lifts the gate together with the capability check" paragraph is gone; the replacement describes the current routing through validateReceiptHandleVersion inside loadMessageForDelete / loadAndVerifyMessage, which is more informative than the old forward-reference. ✅
  • sqs_partitioning_test.go:166–171 and 233–237 — Both test docstrings correctly updated. ✅

Remaining: sqs_partitioning.go file header (lines 12–18) — still stale

This comment was flagged in Round 0 and called out again in Round 1, but sqs_partitioning.go is not in the Round 2 sweep. Current text:

// PR 2 of the §11 rollout introduces the schema fields plus the
// validation surface — including the temporary dormancy gate that
// rejects PartitionCount > 1 at CreateQueue. PR 5 lifts the gate
// atomically with the data-plane fanout so a half-deployed cluster
// can never accept a partitioned queue without the data plane to
// serve it. Until then the field exists in the meta type and the
// router function compiles, but no partitioned queue can land.

"Until then … no partitioned queue can land" is now false. While this is a doc-only issue with no behavioral impact, it is the last surviving pre-lift forward-reference in the SQS adapter and will mislead any reader who enters the file here.


The same file also has a minor at validateStandardQueueRejectsHTFIFO (line 189): "once PR 5 lifts the dormancy gate" — still future tense, should be "since PR 5 lifted". Trivially fixed in the same pass.


Codex P2 TOCTOU — deferral accepted

The narrow race (tryCreateQueueOnce sees queue absent → peer poll fails simultaneously with a concurrent create → original caller gets InvalidAttributeValue instead of the OCC retry) is real but the probability is low: it requires concurrent same-name creates AND a transient poll failure within the peer-poll window (≤ 3 s). The gate-placement comment at sqs_catalog.go:962–975 accurately documents the invariant. Deferring is a reasonable call.


Overall

The doc sweep is thorough and the five files are clean. One block remains in sqs_partitioning.go. Once that's patched (a two-line edit) the stale-comment debt from PR 5b-3 will be fully cleared.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dccac18e0c


Comment on lines +64 to +65
report := PollSQSHTFIFOCapability(ctx, peers, PollerConfig{})
if report == nil || !report.AllAdvertise {

P1 Badge Reject partitioned CreateQueue without routing map coverage

validateHTFIFOCapability now allows PartitionCount > 1 queues after only checking peer /sqs_health, but it never verifies that the queue is actually routable by the partition resolver. In clusters where --sqsFifoPartitionMap is present but missing this queue (or has fewer partitions than requested), CreateQueue succeeds here, yet later Send/Receive/Delete operations build partitioned keys that ShardRouter.ResolveGroup treats as recognized-but-unresolved and fails closed, surfacing as coordinator no route for key errors (and then SQS InternalFailure). This change introduces a create-success/runtime-failure mode that leaves users with unusable queues; CreateQueue should fail early unless every partition for the requested queue can be routed.


…coverage (PR #734, round 2)

Codex P1 review: validateHTFIFOCapability admitted PartitionCount > 1
queues after only checking peer /sqs_health, but never verified that
the queue was actually routable. In a cluster where
--sqsFifoPartitionMap is configured but missing the new queue (or
configured with fewer partitions than requested), CreateQueue
succeeded and then SendMessage / ReceiveMessage / DeleteMessage
failed closed at kv.ShardRouter ("no route for key") on first use,
surfacing as InternalFailure to the client. Operators ended up with
unusable queues that couldn't be deleted by the AWS CreateQueue
contract's idempotency rules either.

Fix: extend validateHTFIFOCapability with a routing-coverage check
that runs INDEPENDENTLY of the peer poll (and BEFORE it, so the
empty-peer-list short-circuit doesn't bypass coverage). Three
branches:

  - resolver==nil → skip (single-shard deployment with no
    --sqsFifoPartitionMap: partitioned keys fall through to the
    engine's default group; this case has worked since PR 5b-3 and
    stays that way for backward compatibility)
  - resolver!=nil + queue not in map → reject
  - resolver!=nil + RoutedPartitionCount(queue) <
    PartitionCount → reject

The wire-level rejection is the new sanitized constant
htfifoRoutingCoverageRejectionPublic — same redaction principle as
htfifoCapabilityRejectionPublic from round 1: the operator detail
(queue name, requested vs. actual partition count) goes to
slog.Warn so an authenticated CreateQueue caller cannot probe the
operator's --sqsFifoPartitionMap shape via error messages.

What changed:

  - adapter/sqs_partition_resolver.go: new
    RoutedPartitionCount(queueName string) int method on
    *SQSPartitionResolver. Nil-receiver returns 0 so the gate's
    resolver==nil branch + empty-queue branch both treat unconfigured
    routes as "no coverage", which routes naturally to the
    rejection path or short-circuit.
  - adapter/sqs_capability_gate.go: new
    htfifoRoutingCoverageRejectionPublic constant and
    validateHTFIFORoutingCoverage helper. validateHTFIFOCapability
    now calls the routing-coverage check before the peer poll.
  - adapter/sqs.go: new partitionResolver field on SQSServer +
    WithSQSPartitionResolver option.
  - main_sqs.go: startSQSServer takes the resolver as a parameter
    and threads it through WithSQSPartitionResolver.
  - main.go: split buildSQSPartitionResolver into a wrapper +
    buildSQSPartitionResolverConcrete (returns the concrete
    pointer the SQS server needs without re-introducing the
    typed-nil interface trap that the wrapper guards against);
    serversInput populates the runtime's sqsPartitionResolver
    from the same partition map the coordinator consumes.
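
For readers unfamiliar with the typed-nil interface trap mentioned for main.go, here is a minimal demonstration. The `partitionResolver` interface, `sqsResolver` type, and both builder functions are hypothetical and only mirror the wrapper-plus-concrete split described above.

```go
package main

import "fmt"

// partitionResolver is a hypothetical interface a coordinator might consume.
type partitionResolver interface {
	Resolve(key string) (string, bool)
}

// sqsResolver is a hypothetical concrete resolver.
type sqsResolver struct {
	routes map[string]string
}

func (r *sqsResolver) Resolve(key string) (string, bool) {
	if r == nil {
		return "", false
	}
	g, ok := r.routes[key]
	return g, ok
}

// buildConcrete returns the concrete pointer, which is nil when there is
// no partition map.
func buildConcrete(routes map[string]string) *sqsResolver {
	if len(routes) == 0 {
		return nil
	}
	return &sqsResolver{routes: routes}
}

// build wraps buildConcrete and returns a true nil interface when there is
// no resolver, so callers can safely compare against nil.
func build(routes map[string]string) partitionResolver {
	r := buildConcrete(routes)
	if r == nil {
		return nil
	}
	return r
}

func main() {
	// Assigning a nil *sqsResolver directly to the interface produces a
	// non-nil interface value holding a nil pointer.
	var trapped partitionResolver = buildConcrete(nil)
	fmt.Println(trapped == nil)    // false
	fmt.Println(build(nil) == nil) // true
}
```

Code that needs pointer-level nil checks can take the concrete pointer, while interface consumers go through the wrapper; that appears to be the motivation for the split.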

Tests (sqs_capability_gate_test.go):

  - TestValidateHTFIFOCapability_RejectsWhenRoutingMapMissingQueue
  - TestValidateHTFIFOCapability_RejectsWhenRoutingMapPartiallyCoversQueue
  - TestValidateHTFIFOCapability_AcceptsWhenRoutingMapFullyCoversQueue
  - TestValidateHTFIFOCapability_AcceptsWhenResolverNil (backward compat)
  - TestRoutedPartitionCount_NilReceiver
  - TestRoutedPartitionCount_KnownAndUnknownQueue

Caller audit:

  - validateHTFIFOCapability: one production caller
    (tryCreateQueueOnce in sqs_catalog.go), tests aside.
    Returns nil / *sqsAPIError as before; new failure mode joins
    the existing InvalidAttributeValue bucket. No caller observes
    a semantic change.
  - RoutedPartitionCount: only validateHTFIFORoutingCoverage
    consumes it in production. Nil-receiver semantics
    documented and tested.
  - WithSQSPartitionResolver: only main_sqs.go uses it.
    Default-nil short-circuit preserves existing test fixtures
    that build &SQSServer{} directly.
  - buildSQSPartitionResolverConcrete: new in this commit; one
    caller (serversInput populate). The typed-nil-safe wrapper
    buildSQSPartitionResolver still exists and is the only path
    the coordinator's WithPartitionResolver consumes.
@bootjp
Owner Author

bootjp commented May 4, 2026

@claude review

Round 3 — addressed Codex P1: routing-map coverage.

validateHTFIFOCapability admitted partitioned creates after only checking peer /sqs_health, never verifying that the queue was actually routable. In a cluster with --sqsFifoPartitionMap configured but missing the new queue (or under-allocated partitions), CreateQueue succeeded → SendMessage failed closed at kv.ShardRouter → InternalFailure on first use.

Fix: new validateHTFIFORoutingCoverage helper called from validateHTFIFOCapability BEFORE the peer poll (so the empty-peer-list short-circuit doesn't bypass coverage). Three branches:

  • resolver==nil → skip (single-shard / no --sqsFifoPartitionMap; backward compat preserved)
  • resolver!=nil + queue not in map → reject
  • resolver!=nil + RoutedPartitionCount(queue) < PartitionCount → reject

Wire-level rejection is the new sanitized constant htfifoRoutingCoverageRejectionPublic (same redaction principle as the round-1 htfifoCapabilityRejectionPublic). Operator detail (queue name, requested vs. actual count) goes to slog.Warn.

Wiring: new WithSQSPartitionResolver option + partitionResolver field on SQSServer. main.go splits buildSQSPartitionResolver into the existing typed-nil-safe wrapper + new buildSQSPartitionResolverConcrete that returns the concrete pointer for the SQS server. serversInput populates a new sqsPartitionResolver field on the runtime, threaded through to startSQSServer.

Tests:

  • TestValidateHTFIFOCapability_RejectsWhenRoutingMapMissingQueue
  • TestValidateHTFIFOCapability_RejectsWhenRoutingMapPartiallyCoversQueue
  • TestValidateHTFIFOCapability_AcceptsWhenRoutingMapFullyCoversQueue
  • TestValidateHTFIFOCapability_AcceptsWhenResolverNil (backward compat)
  • TestRoutedPartitionCount_NilReceiver
  • TestRoutedPartitionCount_KnownAndUnknownQueue

Caller audit:

  • validateHTFIFOCapability — single production caller (tryCreateQueueOnce); new failure mode joins existing InvalidAttributeValue bucket, no caller-side semantic change.
  • RoutedPartitionCount — only validateHTFIFORoutingCoverage consumes it in production; nil-receiver returns 0 (documented + tested).
  • WithSQSPartitionResolver / buildSQSPartitionResolverConcrete — single caller each, default-nil short-circuit preserves existing fixtures that build &SQSServer{} directly.

Build / golangci-lint (0 issues, full repo) / -race test sweep all clean.

@claude

claude Bot commented May 4, 2026

Claude finished @bootjp's task in 3m 22s


Code Review: Round 3 (Routing-map coverage fix)

  • Gather context and understand the PR
  • Read adapter/sqs_capability_gate.go (new routing coverage logic)
  • Read adapter/sqs_capability_gate_test.go (new tests)
  • Read adapter/sqs_partition_resolver.go (new RoutedPartitionCount method)
  • Read adapter/sqs_catalog.go (gate placement in tryCreateQueueOnce)
  • Read main.go and main_sqs.go (wiring changes)
  • Check remaining stale comment from Round 2 (sqs_partitioning.go)
  • Post final review

Verdict: Routing-coverage fix is correct — one stale comment persists from Round 0, one doc inaccuracy to address


Routing-coverage fix (Codex P1) — correct

validateHTFIFORoutingCoverage is called from validateHTFIFOCapability before the collectSQSPeers() call and before the empty-peer-list short-circuit (sqs_capability_gate.go:74). This is the right placement: a single-node cluster with a pre-configured --sqsFifoPartitionMap that doesn't cover the new queue correctly gets the coverage rejection rather than slipping through the vacuous-empty-peer path. The comment block at lines 67–73 explains this invariant precisely.

Three-branch logic is correct:

  • resolver == nil → skip; single-shard deployments fall through to the engine's default group
  • resolver != nil + RoutedPartitionCount(queue) == 0 → int64(0) >= int64(N) is false for N ≥ 2 → reject
  • resolver != nil + RoutedPartitionCount(queue) < PartitionCount → reject

The int64 widening to dodge gosec G115 is clean. The comment on lines 129–132 explains why the widening is safe (both sides fit in int64 without loss, PartitionCount is bounded by htfifoMaxPartitions=32 upstream). The >= rather than == semantics (over-allocated routes are harmless) are correctly stated and tested in TestValidateHTFIFOCapability_AcceptsWhenRoutingMapFullyCoversQueue.

RoutedPartitionCount nil-receiver returning 0 is correct: len(nil_slice) == 0 in Go, and the function only gets called when the nil check at line 124 has passed. The direct nil-receiver test (TestRoutedPartitionCount_NilReceiver) documents that the method is also safe to call without the outer guard.
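
The Go semantics this relies on can be checked in isolation. The sketch below uses a hypothetical `resolver` type (not the adapter's real one) to show that a method with an explicit nil guard is callable on a nil pointer, and that looking up a missing map key yields a nil slice whose length is 0.

```go
package main

import "fmt"

// resolver is a hypothetical stand-in; only the nil handling is of interest.
type resolver struct {
	routes map[string][]uint64
}

// RoutedPartitionCount is safe on a nil receiver because it checks r before
// touching any field.
func (r *resolver) RoutedPartitionCount(queue string) int {
	if r == nil {
		return 0
	}
	// A missing key yields the zero value (a nil slice), and the length of a
	// nil slice is 0, so no explicit "unknown queue" branch is needed.
	return len(r.routes[queue])
}

func main() {
	var none *resolver
	known := &resolver{routes: map[string][]uint64{"orders.fifo": {1, 2, 3}}}
	fmt.Println(none.RoutedPartitionCount("orders.fifo"))  // 0
	fmt.Println(known.RoutedPartitionCount("absent.fifo")) // 0
	fmt.Println(known.RoutedPartitionCount("orders.fifo")) // 3
}
```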

Test suite — thorough

Six new tests cover all the important cases and all assert on the sanitized htfifoRoutingCoverageRejectionPublic constant (no operator-detail leakage, consistent with the round-1 redaction fix):

| Test | What it pins |
| --- | --- |
| TestValidateHTFIFOCapability_RejectsWhenRoutingMapMissingQueue | queue not in map → reject |
| TestValidateHTFIFOCapability_RejectsWhenRoutingMapPartiallyCoversQueue | partial coverage → reject |
| TestValidateHTFIFOCapability_AcceptsWhenRoutingMapFullyCoversQueue | over-allocated routes ok |
| TestValidateHTFIFOCapability_AcceptsWhenResolverNil | nil resolver → skip |
| TestRoutedPartitionCount_NilReceiver | nil *SQSPartitionResolver returns 0 safely |
| TestRoutedPartitionCount_KnownAndUnknownQueue | basic API contract |

Wiring in main.go — clean

The two-function split (buildSQSPartitionResolver returning kv.PartitionResolver for the coordinator, buildSQSPartitionResolverConcrete returning *adapter.SQSPartitionResolver for the SQS server) avoids the typed-nil interface trap. The comment block on buildSQSPartitionResolver (lines 504–513) explains the hazard exactly. The sqsPartitionResolver field and its comment on serversInput (lines 887–895) explain why the routes seen by the gate match those seen by the coordinator at dispatch time.
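
For readers unfamiliar with the typed-nil interface trap mentioned above, here is a small self-contained demonstration. The interface and builder names are illustrative stand-ins for `kv.PartitionResolver` and the two `main.go` builders, not their real signatures.

```go
package main

import "fmt"

// PartitionResolver and concreteResolver are hypothetical stand-ins for the
// coordinator-facing interface and the concrete SQS resolver type.
type PartitionResolver interface {
	RoutedPartitionCount(queue string) int
}

type concreteResolver struct{ routes map[string][]uint64 }

func (r *concreteResolver) RoutedPartitionCount(queue string) int {
	if r == nil {
		return 0
	}
	return len(r.routes[queue])
}

// buildConcrete returns a nil pointer when no partition map is configured.
func buildConcrete(cfg map[string][]uint64) *concreteResolver {
	if len(cfg) == 0 {
		return nil
	}
	return &concreteResolver{routes: cfg}
}

// buildNaive converts the pointer straight to the interface, so a nil pointer
// becomes a non-nil interface value that wraps a nil pointer.
func buildNaive(cfg map[string][]uint64) PartitionResolver {
	return buildConcrete(cfg)
}

// buildSafe returns an untyped nil interface when nothing is configured.
func buildSafe(cfg map[string][]uint64) PartitionResolver {
	r := buildConcrete(cfg)
	if r == nil {
		return nil
	}
	return r
}

func main() {
	fmt.Println(buildNaive(nil) == nil)    // false: the interface carries a type
	fmt.Println(buildSafe(nil) == nil)     // true
	fmt.Println(buildConcrete(nil) == nil) // true: plain pointer comparison
}
```

A consumer that guards with `if resolver != nil` would wrongly treat the `buildNaive` result as configured, which is why a wrapper for the interface consumer and a separate concrete-pointer builder for the pointer consumer is a reasonable split.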


Issues found

1. sqs_partitioning.go file header (lines 12–18) — stale, FOURTH round

This comment was flagged in Rounds 0, 1, and 2 and has been explicitly called out as the last remaining stale block in this PR. It is NOT in the Round 3 file sweep. Current text:

// PR 2 of the §11 rollout introduces the schema fields plus the
// validation surface — including the temporary dormancy gate that
// rejects PartitionCount > 1 at CreateQueue. PR 5 lifts the gate
// atomically with the data-plane fanout so a half-deployed cluster
// can never accept a partitioned queue without the data plane to
// serve it. Until then the field exists in the meta type and the
// router function compiles, but no partitioned queue can land.

"Until then … no partitioned queue can land" is now false — PR 5b-3 has shipped. Same file, validateStandardQueueRejectsHTFIFO on line 189: "once PR 5 lifts the dormancy gate" is still future tense.


2. sqs.go:226–227: WithSQSPartitionResolver doc says "same pointer" but the code builds two separate instances

// Callers must ensure the resolver passed here matches the one
// installed on the kv coordinator via WithPartitionResolver,
// otherwise the gate would admit a queue that the coordinator
// then fails to route. main.go builds the resolver once and
// hands the same pointer to both consumers.

main.go actually calls buildSQSPartitionResolverConcrete(in.cfg.sqsFifoPartitionMap) (line 895) separately from buildSQSPartitionResolver(cfg.sqsFifoPartitionMap) (line 322). These are two separate allocations: two distinct *SQSPartitionResolver instances, each with a defensive copy of the routes. They are not the same pointer. The semantic invariant (both built from the same config map, both immutable post-construction, so routes are always identical) holds, but the comment misrepresents the mechanism and will mislead a future reader who greps for "same pointer" and finds two separate resolver builder calls.



Non-blocking observations

validateHTFIFORoutingCoverage is ordered correctly relative to PartitionCount <= 1 short-circuit

validateHTFIFOCapability line 60 (if requested == nil || requested.PartitionCount <= 1 { return nil }) runs BEFORE the routing check on line 74. This means validateHTFIFORoutingCoverage is only called with PartitionCount >= 2, so the 0 >= PartitionCount case is impossible at that call site. The nil-receiver returning 0 for RoutedPartitionCount is still a useful safety property, but the 0 >= 2 math is never exercised. This is correct and the test for PartitionCount <= 1 in TestValidateHTFIFOCapability_ShortCircuitsOnLegacyMeta pins the short-circuit correctly.

slog.Warn in validateHTFIFORoutingCoverage does not distinguish "queue absent" from "zero routes" when routed == 0

When the queue is absent from the map, routed == 0 and the log line at line 136 emits "routedPartitionCount": 0. This is accurate (0 routes), but an operator reading the log might not know whether 0 means "queue absent" or "queue explicitly configured with 0 partitions" (the latter is impossible since NewSQSPartitionResolver would have no entry for a 0-partition queue). This is cosmetic — no action required, just noting that the log is unambiguous in practice.


Positive highlights

  • Coverage check before peer poll is the correct ordering. The comment at lines 67–73 explains precisely why: the empty-peer-list short-circuit must not bypass the routing check.
  • RoutedPartitionCount == 0 for missing queues is handled naturally by len(nil_slice) == 0, no explicit branch needed.
  • htfifoRoutingCoverageRejectionPublic keeps the same redaction principle as the round-1 fix — operator detail in slog.Warn, generic message on the wire.
  • buildSQSPartitionResolverConcrete avoids the typed-nil interface trap while sharing the routing map between the coordinator and the SQS server.
  • sqsPartitionResolver field in serversInput is well-commented (lines 887–894) with an explicit back-reference to the Codex P1 finding.

@bootjp bootjp merged commit cfacebf into main May 4, 2026
9 checks passed
@bootjp bootjp deleted the feat/sqs-htfifo-capability-gate branch May 4, 2026 11:42