feat(sqs): /sqs_health capability JSON (Phase 3.D PR 4-B-1)#708
Conversation
Adds the JSON capability-advertisement shape to /sqs_health that the
Phase 3.D §8.5 mixed-version gate requires before the data plane
can safely accept partitioned FIFO queues.
What changes
- /sqs_health now returns JSON when Accept: application/json is set:
{"status":"ok","capabilities":[...]}
The capabilities slice is built deterministically off
htfifoCapabilityAdvertised, a package-level constant that stays
false in this PR — the data plane and the §8 leadership-refusal
hook are not yet wired, so advertising "htfifo" would let peers
build partitioned queues this binary cannot safely serve.
- The legacy "ok\n" text body is preserved byte-identical for any
caller that does not signal application/json (no Accept header,
bare "*/*" wildcard, text/plain). Existing k8s liveness probes
and curl integrations stay unchanged.
What does NOT change yet
This is the first slice of the PR 4-B work. Two follow-ups complete
the design's "advertise htfifo only when routing AND leadership-
refusal are both in place" rule:
- PR 4-B-2: routing resolver in kv/shard_router.go that consumes
--sqsFifoPartitionMap to dispatch (queue, partition) keys to the
operator-chosen Raft group.
- PR 4-B-3: §8 leadership-refusal hook that calls TransferLeadership
when a non-htfifo binary holds an SQS Raft group hosting a
partitioned queue + the catalog-polling helper for the CreateQueue
capability gate (PR 5 lifts the dormancy and starts using it).
PR 4-B-3 also flips htfifoCapabilityAdvertised to true.
Self-review (per CLAUDE.md)
1. Data loss — health endpoint only; no FSM/Pebble/retention path.
No issue.
2. Concurrency — single-shot read off a const; no shared state.
No issue.
3. Performance — JSON marshal of a 2-field struct on a low-QPS
endpoint. No issue.
4. Data consistency — the gate constant is the single source of
truth for the capability list; sqsAdvertisedCapabilities() is the
only producer. Tests pin both true/false branches. No issue.
5. Test coverage — TestServeSQSHealthz_LegacyTextPath (3 sub-tests),
TestServeSQSHealthz_JSONShape (3 sub-tests),
TestServeSQSHealthz_HEAD_JSONOmitsBody,
TestServeSQSHealthz_RejectsNonGETHEAD (2 sub-tests),
TestSQSAdvertisedCapabilities_TracksFlag (true/false guard),
TestClientAcceptsSQSHealthJSON_Boundaries (9 sub-tests).
|
@claude review Phase 3.D PR 4-B-1: capability JSON shape on /sqs_health. Routing + leadership-refusal land in 4-B-2 / 4-B-3, so |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 59 minutes and 52 seconds.Comment |
|
Claude finished @bootjp's task in 2m 19s —— View job Code Review: Phase 3.D PR 4-B-1 —
|
| Lens | Result |
|---|---|
| Data loss | Health endpoint only; no FSM/Pebble path. ✓ |
| Concurrency | Single read off a compile-time const; no shared mutable state. ✓ |
| Performance | json.Marshal of a 2-field struct; negligible. ✓ |
| Data consistency | htfifoCapabilityAdvertised is the single source of truth; both branches pinned by TestSQSAdvertisedCapabilities_TracksFlag. ✓ |
| Test coverage | 19 sub-tests across 6 functions; both Accept paths and both flag branches covered. ✓ |
Verdict: LGTM. No blocking issues. Ready to queue behind PR 4-B-2 and 4-B-3 as planned.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c2930740da
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if raw == "" || raw == "*/*" { | ||
| continue | ||
| } | ||
| if strings.Contains(raw, "application/json") { |
There was a problem hiding this comment.
Match JSON media type case-insensitively in Accept
clientAcceptsSQSHealthJSON uses a case-sensitive substring check, so valid headers like Accept: Application/JSON are treated as non-JSON and the handler falls back to legacy text/plain. HTTP media type type/subtype tokens are case-insensitive, so this causes incorrect content negotiation for standards-compliant clients that vary casing.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces JSON capability advertising to the SQS health endpoint, allowing the service to report supported features like partitioned FIFO queues (htfifo) via the Accept header. The implementation ensures backward compatibility with legacy text-based health checks and includes comprehensive tests for various request methods and header configurations. Review feedback suggests removing the trailing newline from the JSON response to better align with standard API practices and optimizing the test suite by using rec.Body.Bytes() for more efficient response validation.
| return | ||
| } | ||
| _, _ = w.Write(encoded) | ||
| _, _ = io.WriteString(w, "\n") |
| require.Equal(t, "application/json; charset=utf-8", rec.Header().Get("Content-Type")) | ||
|
|
||
| var got sqsHealthBody | ||
| require.NoError(t, json.Unmarshal([]byte(strings.TrimSpace(rec.Body.String())), &got)) |
There was a problem hiding this comment.
To align with the removal of the trailing newline in the response, this test should be updated to not trim whitespace. This makes the test stricter and ensures the response is a clean JSON document. Additionally, using rec.Body.Bytes() is more efficient as it avoids a string conversion.
| require.NoError(t, json.Unmarshal([]byte(strings.TrimSpace(rec.Body.String())), &got)) | |
| require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &got)) |
|
@claude review |
|
Claude finished @bootjp's task in 2m 7s —— View job Code Review: Phase 3.D PR 4-B-1 —
|
| Lens | Result |
|---|---|
| Data loss | Health endpoint only; no FSM/Pebble/retention path. ✓ |
| Concurrency | Single read off a compile-time const; no shared mutable state. ✓ |
| Performance | json.Marshal of a 2-field struct on a low-QPS health endpoint. ✓ |
| Data consistency | htfifoCapabilityAdvertised is the single source of truth; both branches pinned by TestSQSAdvertisedCapabilities_TracksFlag. ✓ |
| Test coverage | 19 sub-tests across 6 functions; both Accept paths and both gate-constant branches covered. ✓ |
Verdict: LGTM with the case-sensitivity note as a suggestion (not a blocker). The capability gate is held correctly at false, backward compatibility is preserved, and the test suite is thorough. Ready to queue behind PR 4-B-2 and 4-B-3 as planned.
#715) ## Summary Routing-layer half of PR 4-B. Adds a `PartitionResolver` that `ShardRouter` consults BEFORE falling through to the byte-range engine. SQS HT-FIFO needs partition-aware dispatch, but the engine's non-overlapping-cover model can't express overlay routes — the resolver-first dispatch sidesteps this cleanly. Stacks on top of #708 (PR 4-B-1, capability JSON). Next is PR 4-B-3 (leadership-refusal + catalog polling + flip `htfifoCapabilityAdvertised` to `true`). ## What's added - `kv.PartitionResolver` interface — `ResolveGroup([]byte) (uint64, bool)`. - `kv.ShardRouter.WithPartitionResolver(...)` — fluent option, nil-safe. - `kv.ShardRouter.resolveGroup(...)` — unified dispatch path: resolver first, engine fallback. Both `groupRequests` (Commit/Abort) and `Get` route through it. - `kv.ShardedCoordinator.WithPartitionResolver(...)` — delegates to the router so `main.go` can install via the existing fluent-construction style. - `adapter.SQSPartitionResolver` — parses `(queue, partition)` from the partitioned key shape, looks up the operator-chosen group. Defensive copy at construction, nil-safe `ResolveGroup`, returns `(0, false)` for legacy / non-matching keys. - `main.go` — builds the resolver from `runtimeConfig.sqsFifoPartitionMap` and installs it. Resolver is `nil` on a non-partitioned cluster — hot path stays engine-only. ## What's NOT added (deferred to PR 4-B-3) - §8 leadership-refusal hook in `kv` (refuses leadership for an SQS Raft group hosting a partitioned queue when the binary lacks `htfifo`). - Catalog-polling helper for the CreateQueue capability gate (PR 5 starts using it). - Flipping `htfifoCapabilityAdvertised` from `false` to `true`. The design's "advertise htfifo only when both routing AND leadership-refusal are in place" rule keeps the constant `false` in this PR — PR 4-B-3 flips it. ## Test plan - [x] `adapter/sqs_partition_resolver_test.go` — 9 top-level tests: nil-on-empty, defensive-copy, partition dispatch across all 5 families, queue-name prefix isolation (`"queue"` vs `"queue1"`), legacy fall-through (8 sub-cases), unknown queue, out-of-range partition, nil receiver, prefix alignment with `sqs_keys.go` constants. - [x] `kv/shard_router_partition_test.go` — 4 tests: resolver wins over engine, engine fallthrough on resolver-miss, nil resolver no-op, `Get` path also routes through the resolver. - [x] `go test -race ./kv/...` pass. - [x] `go test -race ./adapter/...` pass. - [x] `golangci-lint ./kv/... ./adapter/... .` clean. ## Self-review (per CLAUDE.md) 1. **Data loss** — routing layer only; no FSM/Pebble/retention path. No issue. 2. **Concurrency** — `partitionResolver` is set once at startup before any request. `ResolveGroup` reads a constructor-time defensive copy, so a future hot-reload of `--sqsFifoPartitionMap` cannot perturb in-flight requests. No issue. 3. **Performance** — one map lookup + 4-byte BigEndian decode per resolver hit (only on partitioned-prefix matches). Engine-only path adds a single `if s.partitionResolver != nil` branch. No issue. 4. **Data consistency** — resolver output strictly OVERRIDES the engine for partitioned keys; legacy keys flow through unchanged. "queue not found" / "partition out of range" branches return `(0, false)` so the router surfaces an explicit error rather than silently mis-routing. No issue. 5. **Test coverage** — 13 tests across two new files; existing `TestShardRouter*` tests unchanged. Both override and fall-through paths pinned, plus the queue-name prefix-isolation invariant from PR #703. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added partition-based routing for distributed coordination. Requests are now directed to specific nodes based on configured partition mappings. * Enhanced SQS queue handling with optional partition-aware routing. When partition information is unavailable or unconfigured, the system automatically falls back to the existing routing mechanism. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
First slice of Phase 3.D PR 4-B — adds the JSON capability-advertisement shape to
/sqs_healththat the §8.5 mixed-version gate requires.htfifoCapabilityAdvertisedstaysfalsein this PR; subsequent PRs (4-B-2 routing, 4-B-3 leadership-refusal + catalog polling) flip it totrueonce the rest of §11 PR 4 is in place.What changes
/sqs_healthreturns JSON whenAccept: application/jsonis set:htfifoCapabilityAdvertised. Stays empty in this PR — advertisinghtfifobefore the routing + leadership-refusal pair lands would let peers build partitioned queues this binary cannot safely serve.ok\ntext body preserved byte-identical for any caller that does not signalapplication/json(no Accept header, bare*/*wildcard,text/plain).What does NOT change yet
--sqsFifoPartitionMap→ PR 4-B-2 (resolver hook inkv/shard_router.go).htfifoCapabilityAdvertised = trueflip → PR 4-B-3 (with the catalog-polling helper for the CreateQueue capability gate that PR 5 will start using).PR 5 still rejects
PartitionCount > 1per the §11 PR 2 dormancy gate, so no data-plane risk from this PR.Test plan
TestServeSQSHealthz_LegacyTextPath— 3 sub-tests pinning byte-identical legacy body.TestServeSQSHealthz_JSONShape— 3 sub-tests covering plain JSON, q-factor, comma-list.TestServeSQSHealthz_HEAD_JSONOmitsBody— HEAD on JSON path emits content-type but no body.TestServeSQSHealthz_RejectsNonGETHEAD— POST returns 405 in both legacy and JSON modes.TestSQSAdvertisedCapabilities_TracksFlag— both branches of the gate constant.TestClientAcceptsSQSHealthJSON_Boundaries— 9 Accept-header edge cases.go test -race ./adapter/...pass.golangci-lint ./adapter/...clean.Self-review (per CLAUDE.md)
htfifoCapabilityAdvertisedis the single source of truth for the capability list;sqsAdvertisedCapabilities()is the only producer. Both true/false branches are pinned by tests. No issue.