feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a) #737
Conversation
Adds elastickv_sqs_partition_messages_total{queue, partition, action}
so dashboards and alerts can spot uneven MessageGroupId distributions
across partitioned-FIFO queues.
- monitoring/sqs.go: SQSMetrics with cardinality cap (sqsMaxTrackedQueues
= 512, overflow collapses to "_other") mirroring DynamoDBMetrics. Drops
empty queue names and unknown action labels so a future call-site bug
cannot pollute the series space dashboards have to learn about. Nil-
receiver-safe so adapter call sites do not need to nil-guard.
- adapter/sqs.go: SQSPartitionObserver interface + WithSQSPartitionObserver
  option. Re-declared in adapter so it does not import monitoring at the
  package boundary (matches DynamoDB / Redis observer pattern). Action
  constants (send/receive/delete) re-declared on the adapter side and
  validated at runtime by the monitoring side. (A sketch of this seam
  follows after this list.)
- adapter/sqs_fifo.go, adapter/sqs_messages.go: emit the counter on the
partitioned commit branch only (PartitionCount > 1) for send / receive
/ delete. Legacy single-partition queues stay off the metric — the
cardinality cost would buy nothing since partition is always 0.
- monitoring/registry.go, main_sqs.go, main.go: wire the registry
SQSPartitionObserver() into startSQSServer so the SQS server picks
up the production observer on cluster boot. Test fixtures and CLI
tools that build SQSServer without a registry pass nil and the
metric stays at zero.
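As referenced above, a minimal sketch of the adapter-side seam. The interface re-declaration, the `WithSQSPartitionObserver` option, and the `s.partitionObserver == nil` guard are described in this PR (and confirmed in the round-3 review); the `SQSServerOption` type, the `partition uint32` parameter type, and the field layout are assumptions for illustration:

```go
package adapter

// SQSPartitionObserver is re-declared locally so the adapter package does
// not import monitoring at the package boundary.
type SQSPartitionObserver interface {
	ObservePartitionMessage(queue string, partition uint32, action string)
}

// Illustrative server and functional-option types; the real SQSServer has
// many more fields.
type SQSServer struct {
	partitionObserver SQSPartitionObserver
}

type SQSServerOption func(*SQSServer)

// WithSQSPartitionObserver installs the observer; nil is a valid argument
// (test fixtures and CLI tools pass nil and the metric stays at zero).
func WithSQSPartitionObserver(o SQSPartitionObserver) SQSServerOption {
	return func(s *SQSServer) { s.partitionObserver = o }
}

// observePartitionMessage wraps every emit site: a nil observer
// short-circuits, so emit sites never carry their own nil-guard.
func (s *SQSServer) observePartitionMessage(queue string, partition uint32, action string) {
	if s.partitionObserver == nil {
		return
	}
	s.partitionObserver.ObservePartitionMessage(queue, partition, action)
}
```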
Tests (monitoring/sqs_test.go):
- TestSQSMetrics_ObservePartitionMessage_IncrementsByLabelTriple — pin
the (queue, partition, action) label triple counter contract.
- TestSQSMetrics_ObservePartitionMessage_DropsInvalidAction — pin the
typo guard against future drift between adapter and monitoring
constants.
- TestSQSMetrics_ObservePartitionMessage_DropsEmptyQueue — pin that an
empty queue name does not collapse with valid observations onto a
shared series.
- TestSQSMetrics_NilReceiverIsSafe — pin the nil-receiver short-circuit
the adapter relies on.
- TestSQSMetrics_QueueLabelOverflow — pin the cap-and-collapse so a
  misbehaving caller cannot exhaust the Prometheus series budget
  (sketched below, after this list).
- TestSQSMetrics_RegistryWiring — pin that the public Registry exposes
the metric under the documented name.
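For illustration, the overflow case could be pinned roughly like this. This is a hedged sketch, not the PR's actual test body: the `newSQSMetrics` constructor, the `ObservePartitionMessage` signature, direct access to `partitionMessages`, and the `"send"` action literal are all assumptions:

```go
package monitoring

import (
	"fmt"
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestSQSMetrics_QueueLabelOverflow(t *testing.T) {
	m := newSQSMetrics() // assumed unexported constructor
	// Fill the budget: the first sqsMaxTrackedQueues (512) names are tracked as-is.
	for i := 0; i < sqsMaxTrackedQueues; i++ {
		m.ObservePartitionMessage(fmt.Sprintf("q-%d", i), 0, "send")
	}
	// One queue past the cap must collapse into the "_other" bucket.
	m.ObservePartitionMessage("one-too-many", 0, "send")
	got := testutil.ToFloat64(m.partitionMessages.WithLabelValues("_other", "0", "send"))
	if got != 1 {
		t.Fatalf("overflow bucket = %v, want 1", got)
	}
}
```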
Self-review (5 lenses):
1. Data loss — N/A; metrics-only, no storage / Raft / FSM touch.
2. Concurrency — counter increments are atomic via Prometheus; the
   trackedQueues map is only consulted from the dispatch-success path
   under the SQS server's existing concurrency model. No new locks.
3. Performance — one map lookup + one CounterVec lookup per partitioned
send/receive/delete on the success branch. Legacy queues skip the
call entirely. Cardinality bounded at 512 queue x 256 partition x 3
action ~= 393k series worst case; in practice a 32-partition queue
yields 96 series, so the budget is plenty for the SLO panels.
4. Data consistency — the metric is observed AFTER OCC dispatch
   succeeds, so the counter reflects committed state. Receive/delete
   branches that return on retryable errors deliberately do not
   increment. (A sketch of this gating follows after this list.)
5. Test coverage — 6 unit tests in monitoring/, plus the adapter-side
nil-observer path is exercised by all existing partitioned-FIFO
tests in adapter/sqs_partitioned_dispatch_test.go (they pass nil
observer through the test fixture).
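Lens 4 in code terms, continuing the illustrative adapter package from the sketch above: `queueMeta`, `dispatchOCC`, and `sendPartitioned` are invented stand-ins, and only the post-commit, `PartitionCount > 1` gating comes from the PR text:

```go
// Illustrative queue metadata; the real adapter type has more fields.
type queueMeta struct {
	Name           string
	PartitionCount uint32
}

// dispatchOCC stands in for the real OCC dispatch: it returns the partition
// the message landed on once the transaction commits, or an error otherwise.
func (s *SQSServer) dispatchOCC(q *queueMeta, body string) (uint32, error) {
	_, _ = q, body
	return 0, nil // pretend the commit landed on partition 0
}

// sendPartitioned observes strictly after a successful commit, and only on
// the partitioned branch, so the counter reflects committed state.
func (s *SQSServer) sendPartitioned(q *queueMeta, body string) error {
	partition, err := s.dispatchOCC(q, body)
	if err != nil {
		return err // failed or retryable dispatch: deliberately no increment
	}
	if q.PartitionCount > 1 { // legacy single-partition queues stay off the metric
		s.observePartitionMessage(q.Name, partition, "send")
	}
	return nil
}
```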
Jepsen HT-FIFO workload (PR 7b) lands as a separate PR so the metrics
and the workload review loops do not cross.
Refs: docs/design/2026_05_01_partial_split_queue_fifo.md §11 PR 7
@claude review
Claude finished @bootjp's task in 4m 14s

Code Review — feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)

Overall this is a clean, well-structured metrics-only PR that correctly mirrors the existing DynamoDB observer pattern. No blocking issues. A few observations below, from most to least significant.

Findings

1. Cardinality calculation in PR description is off (doc-only issue). The PR description states: "cardinality bounded at 512 queue × 256 partition × 3 action ≈ 393k series worst case". The code caps partitions at 32 (htfifoMaxPartitions), so the actual worst case is 512 × 32 × 3 ≈ 49k series; the 256 figure appears only in the description, not in the code.
2. Prometheus Help text contains an internal design-doc reference:

   Help: "Total HT-FIFO partition operations by queue, partition, and action (send / receive / delete). Per the §11 PR 7 split-queue-FIFO design contract — gives dashboards a hot-partition signal so operators can spot uneven MessageGroupId distributions.",

3. […]
4. Nil-interface safety for the […] If a user creates an SQS queue literally named […]
5. The test correctly pins the […]
6. No concurrent stress test for […] The mutex around […]
7. Minor: observe call order in […]

Summary

Code is correct, tests are thorough and cover the key invariants (nil safety, cardinality cap, label isolation, registry wiring). The five-lens self-review in the PR description is accurate. Two suggestions: clean up the Help string of internal references, and correct the 256→32 cardinality figure in the description (code is fine). LGTM with those minor nits.
Warning: Gemini encountered an error creating the review. You can try again by commenting `/gemini review`.

/gemini review
Code Review
This pull request introduces observability for SQS HT-FIFO operations by tracking message counts per queue, partition, and action (send, receive, delete). It adds a new SQSPartitionObserver interface, a Prometheus-backed implementation in the monitoring package with cardinality capping, and integrates these metrics into the SQS adapter's send, receive, and delete paths. Feedback focuses on performance optimizations for the metrics hot path, specifically suggesting the use of WithLabelValues to avoid heap allocations from map creation and recommending more efficient synchronization, such as sync.RWMutex, to reduce contention during queue label resolution.
```go
m.partitionMessages.With(prometheus.Labels{
	"queue":     queueLabel,
	"partition": strconv.FormatUint(uint64(partition), 10),
	"action":    action,
}).Inc()
```
Using prometheus.Labels (a map[string]string) on every observation causes a heap allocation for the map. On the hot path of SQS operations, this adds significant GC pressure. Switching to WithLabelValues avoids the map allocation. Additionally, strconv.FormatUint allocates a new string; for common partition numbers (e.g., 0-31), consider using a pre-allocated lookup table to further reduce allocations.
Suggested change:

```diff
-m.partitionMessages.With(prometheus.Labels{
-	"queue":     queueLabel,
-	"partition": strconv.FormatUint(uint64(partition), 10),
-	"action":    action,
-}).Inc()
+m.partitionMessages.WithLabelValues(
+	queueLabel,
+	strconv.FormatUint(uint64(partition), 10),
+	action,
+).Inc()
```
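The allocation difference is easy to demonstrate in isolation. A self-contained micro-benchmark sketch (not from the PR; the metric name and label values here are made up for illustration):

```go
package bench

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

var vec = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "bench_total", Help: "benchmark counter"},
	[]string{"queue", "partition", "action"},
)

// With builds a map[string]string on every call.
func BenchmarkWith(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		vec.With(prometheus.Labels{"queue": "q", "partition": "0", "action": "send"}).Inc()
	}
}

// WithLabelValues passes labels positionally; no map is built.
func BenchmarkWithLabelValues(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		vec.WithLabelValues("q", "0", "send").Inc()
	}
}
```

Running `go test -bench . -benchmem` would typically show extra per-op allocations on the `With` path for the label map.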
```go
func (m *SQSMetrics) queueLabelForCardinalityBudget(queue string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.trackedQueues[queue]; ok {
		return queue
	}
	if len(m.trackedQueues) >= sqsMaxTrackedQueues {
		return sqsQueueOverflow
	}
	m.trackedQueues[queue] = struct{}{}
	return queue
}
```
The queueLabelForCardinalityBudget method is called on every SQS operation for partitioned queues. Using a plain sync.Mutex and defer creates a global bottleneck by serializing all metrics observations. Since trackedQueues is read-mostly once initialized, using a sync.RWMutex with a fast-path RLock or a sync.Map would significantly reduce contention. This follows the general rule of preferring efficient synchronization on hot paths.
References
- For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
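For concreteness, Gemini's suggestion would look roughly like the following hypothetical variant, which the PR deliberately did not adopt (see the round-2 rationale below). It assumes `m.mu` is changed to `sync.RWMutex` and uses a read-lock fast path with a re-check under the write lock:

```go
// Hypothetical RWMutex variant of queueLabelForCardinalityBudget (not merged).
func (m *SQSMetrics) queueLabelForCardinalityBudget(queue string) string {
	// Fast path: already-tracked queues only take the read lock.
	m.mu.RLock()
	_, tracked := m.trackedQueues[queue]
	m.mu.RUnlock()
	if tracked {
		return queue
	}

	// Slow path: take the write lock and re-check, since another goroutine
	// may have inserted the queue between the two lock acquisitions.
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.trackedQueues[queue]; ok {
		return queue
	}
	if len(m.trackedQueues) >= sqsMaxTrackedQueues {
		return sqsQueueOverflow
	}
	m.trackedQueues[queue] = struct{}{}
	return queue
}
```

As the round-2 discussion below concludes, the PR kept the plain `sync.Mutex`: the critical section is tiny and the DynamoDB twin uses the same pattern.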
Claude flagged that the Prometheus Help string for
elastickv_sqs_partition_messages_total contained an internal
design-doc pointer ("Per the §11 PR 7 split-queue-FIFO design
contract") which leaks into production /metrics output and the
Grafana help tooltip — neither audience has access to the doc.
Replaced with a description of the observable semantics: which
queues populate the metric (PartitionCount > 1 only) and what
the metric is for (spotting uneven MessageGroupId distribution
across partitions). The cap-and-collapse / dropped-action /
empty-queue behaviours stay implementation details exposed only
through the unit tests.
Refs: PR #737 review thread; Claude review at run 25328847079.
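For reference, the resulting declaration would look roughly like this. The full metric name and the Help string are confirmed by the thread; the Namespace/Subsystem field split and the variable name are assumptions:

```go
package monitoring

import "github.com/prometheus/client_golang/prometheus"

var partitionMessages = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "elastickv",
		Subsystem: "sqs",
		// Full name: elastickv_sqs_partition_messages_total.
		Name: "partition_messages_total",
		Help: "Total HT-FIFO partition operations by queue, partition, and action " +
			"(send / receive / delete). Non-zero only for queues with PartitionCount > 1; " +
			"use to spot uneven MessageGroupId distributions across partitions.",
	},
	[]string{"queue", "partition", "action"},
)
```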
@claude review
Claude finished @bootjp's task in 2m 17s

Code Review (Round 2) — feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)

Round 2 verdict: LGTM with one actionable nit

The two-commit state is clean. Round 1 findings are resolved or correctly accepted. Addressing Gemini's two suggestions below.

Help text fix (commit 9491561) ✅

The updated Help string is correct and operator-facing:

   Help: "Total HT-FIFO partition operations by queue, partition, and action (send / receive / delete). Non-zero only for queues with PartitionCount > 1; use to spot uneven MessageGroupId distributions across partitions.",

The internal doc reference is gone. The replacement text describes the observable semantics precisely and is appropriate for Grafana/Alertmanager output.

Gemini finding 1: `WithLabelValues` vs `prometheus.Labels` — agreed; this is the one actionable item (see the table and the note below it).
| Finding | Status |
|---|---|
| Help text internal doc reference (Round 1 #2) | ✅ Fixed in 9491561 |
| Cardinality figure in PR description (Round 1 #1) | Doc-only; no code change needed |
| `WithLabelValues` vs `prometheus.Labels` (Gemini) | Actionable: swap recommended |
| `sync.RWMutex` (Gemini) | ✅ Skip — mirrors DynamoDB, complexity > benefit at this scale |
| Batch path coverage (Round 1 #3) | ✅ Confirmed correct |
| `_other` name collision (Round 1 #4) | ✅ Accepted trade-off (mirrored from DynamoDB) |
| Nil guard safety (Round 1 #5) | ✅ Correct |
One actionable item remaining: swap With(prometheus.Labels{...}) → WithLabelValues(...) in monitoring/sqs.go:89. Everything else is correct.
Claude review round 2 surfaced one actionable nit (Gemini): the
ObservePartitionMessage path used With(prometheus.Labels{...})
which allocates a map[string]string on every observe call.
Switched to WithLabelValues with positional args (queue,
partition, action) matching the NewCounterVec declaration —
avoids the per-call map allocation and lines up with the
DynamoDBMetrics pattern (monitoring/dynamodb.go) which already
uses WithLabelValues across all observe sites.
Behaviour unchanged. Single caller (the adapter observe wrapper),
return type unchanged, label semantics identical, so no caller
audit needed.
Round 2 also flagged sync.RWMutex for queueLabelForCardinalityBudget
but Claude (and I) agreed to skip — the critical section is ~5ns,
RWMutex has higher overhead than Mutex at that scale, and the
DynamoDB twin uses Mutex. Diverging only in SQS would be
inconsistent without performance evidence.
Refs: PR #737 round 2 review at run 25329442300.
@claude review
Claude finished @bootjp's task in 1m 51s

Code Review (Round 3) — feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)

Round 3 verdict: LGTM — all previous findings resolved

Both round 1 and round 2 changes are correctly applied. No new issues found.

Verification of round 2 fix:
| Component | Status |
|---|---|
| `monitoring/sqs.go` — `WithLabelValues`, clean Help text, action validation, cardinality cap | ✅ |
| `monitoring/registry.go` — `sqs *SQSMetrics` field, `newSQSMetrics` in constructor, `SQSPartitionObserver()` method | ✅ |
| `adapter/sqs.go` — interface re-declared (no cross-package import), `observePartitionMessage` nil guard (`s.partitionObserver == nil`), `WithSQSPartitionObserver` option | ✅ |
| `adapter/sqs_fifo.go` — observe fires post-commit, `PartitionCount > 1` gate | ✅ |
| `adapter/sqs_messages.go` — receive: same gate; delete: adds `meta != nil` check (appropriate, different code path) | ✅ |
| `main.go` — `sqsPartitionObserver` field on `runtimeServerRunner`, sourced from `in.metricsRegistry.SQSPartitionObserver()` | ✅ |
| `main_sqs.go` — `partitionObserver` param threaded into `WithSQSPartitionObserver` in `startSQSServer` | ✅ |
| `monitoring/sqs_test.go` — 6 tests covering counter increment, invalid action drop, empty queue drop, nil receiver, cardinality overflow, registry wiring | ✅ |
One residual non-blocking note (same as round 2 accepted trade-off)
If a queue is literally named _other and the cap is also reached, the _other series will aggregate both. This is the same accepted behavior as DynamoDBMetrics and is documented in the self-review. No action needed.
No further changes required. PR is ready to merge.
## Summary

Phase 3.D PR 7b — Jepsen HT-FIFO workload that stresses partitioned-FIFO queues against the three contracts AWS HT-FIFO is supposed to honour even under partition and node-loss faults: **within-group ordering**, **no message loss**, **no duplicates**. Pattern follows [aphyr's Jepsen RabbitMQ analysis](https://aphyr.com/posts/315-jepsen-rabbitmq): track every `:send` and `:recv` in the operation history, then a custom checker verifies the contracts against the recorded events at the end of the run.

## What's in this PR

- **`jepsen/project.clj`** — Adds `com.cognitect.aws/sqs` at the same version as the existing dynamodb dep, so the SDK wire protocol (auth, retry classification, error parsing) is exercised end-to-end against elastickv rather than a hand-rolled HTTP layer.
- **`jepsen/src/elastickv/db.clj`** — Extends `start-node!` to accept `:sqs-port` (port spec like `:dynamo-port`) and `:sqs-region`. Both are optional, so existing dynamodb / s3 / redis test specs are byte-identical at the args level when `sqs-port` is absent.
- **`jepsen/src/elastickv/jepsen_test.clj`** — Registers `elastickv-sqs-htfifo-test` alongside the other workloads.
- **`jepsen/src/elastickv/sqs_htfifo_workload.clj`** (new, ~430 lines) — The workload. Uses cognitect/aws-api SQS, creates an HT-FIFO queue with `PartitionCount=4` + `ContentBasedDeduplication`, runs sends and receives across N `MessageGroupId` values, and the custom `ht-fifo-checker` validates the three contracts.
- **`jepsen/test/elastickv/sqs_htfifo_workload_test.clj`** (new) — Pure-function tests for the checker plus integration smoke tests for the test-spec builder. 11 tests / 27 assertions.

## Checker contracts

For each `MessageGroupId` independently (a hedged Go sketch of these checks follows at the end of this description):

1. **Within-group ordering** — the sequence of received `seq` values, sorted by global completion time across all consumers, is monotonically non-decreasing.
2. **No loss** — every `(group, seq)` successfully `:sent` eventually appears in the `:recv` history. Sends with `:info` status are treated as possibly-committed and not counted as lost.
3. **No duplicates** — every `(group, seq)` appears at most once in the `:recv` history. `ContentBasedDeduplication` on the queue + a unique `(group, seq)` body is what enforces this server-side; a duplicate here is a real bug (e.g. a deletion that did not commit).

## Open-endpoint mode

The elastickv server starts without `--sqsCredentialsFile`, so the SQS adapter accepts any signed request (mirroring how the S3 adapter is wired in jepsen today). The SDK client signs with dummy credentials, so the SigV4 path still exercises end-to-end at the protocol level.

## Self-review (5 lenses)

1. **Data loss** — N/A; this is a test-only PR. The workload's whole purpose is to *detect* data loss in the system under test.
2. **Concurrency** — The shared per-group `seq-counter` is an `atom` updated via `swap!` (CAS-based), so concurrent sends from different worker threads always assign distinct seqs. The checker is pure; no shared mutable state.
3. **Performance** — Test-only code, runs at low rate (5 ops/sec/worker). Not on any hot path.
4. **Data consistency** — The checker compares committed sends against the receive history globally, so all the consistency assertions are at end-of-run with a complete picture. Sends with `:info` (uncertain commit) are correctly excluded from the loss set, matching Jepsen's standard approach.
5. **Test coverage** — 11 unit tests for the checker pin the contract surface (clean / loss / info-not-loss / duplicates / within-group ordering / cross-group interleaving / failed-send-not-counted / empty-receive). Integration smoke tests pin the test-spec builder. The workload itself is exercised end-to-end on a real cluster via `lein run -m elastickv.sqs-htfifo-workload`.

## Test plan

- [x] `lein test elastickv.sqs-htfifo-workload-test` — 11 tests / 27 assertions pass
- [x] `lein test` for non-redis suite (dynamodb / dynamodb-types / s3 / cli / sqs-htfifo) — 21 tests / 41 assertions pass
- [ ] End-to-end live cluster run — operator-driven (out of scope for the merge gate; relies on a 3-node cluster setup)

The `elastickv.redis-workload` namespace fails to load due to the empty `redis/src/` tree, which is pre-existing on main and unrelated to this PR.

## Out of scope (next milestones)

- Wiring the workload into `scripts/run-jepsen-local.sh` — the existing script is dynamodb-only; an sqs counterpart lands as a follow-up.
- Multi-shard cluster topology that lands distinct partitions on distinct Raft groups. This PR's `PartitionCount=4` routes to the default group on a single-shard cluster — partitioning logic (different keys per partition, ordering preserved within group) is fully exercised, but the cross-shard scaling story is gated on separate work.
- Design-doc lifecycle rename (`*_proposed_*.md` → `*_partial_*.md`) — that is §11 PR 8 in the design doc and is tracked separately.

## Refs

- `docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md` §11 PR 7.
- Closes the testing half of §11 PR 7. PR 7a (metrics) shipped at #737.
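As referenced in the checker-contracts section above, a hedged Go sketch of the three checks. The real checker is Clojure and operates on Jepsen's op history; the event shape and names here are illustrative:

```go
package checker

import "fmt"

type event struct {
	group string
	seq   int
	kind  string // "send-ok" (committed), "send-info" (uncertain), "recv"
}

type gkey struct {
	group string
	seq   int
}

// checkHTFIFO validates within-group ordering, no loss, and no duplicates.
// history must be sorted by global completion time across all consumers.
func checkHTFIFO(history []event) []string {
	var errs []string
	lastSeq := map[string]int{} // highest seq received so far, per group
	sentOK := map[gkey]bool{}   // definitely-committed sends
	recvd := map[gkey]int{}     // receive count per (group, seq)

	for _, e := range history {
		switch e.kind {
		case "send-ok":
			sentOK[gkey{e.group, e.seq}] = true
		case "recv":
			k := gkey{e.group, e.seq}
			recvd[k]++
			if recvd[k] > 1 { // contract 3: no duplicates
				errs = append(errs, fmt.Sprintf("duplicate delivery: %+v", k))
			}
			if e.seq < lastSeq[e.group] { // contract 1: within-group ordering
				errs = append(errs, fmt.Sprintf("order violation in %q: seq %d after %d",
					e.group, e.seq, lastSeq[e.group]))
			} else {
				lastSeq[e.group] = e.seq
			}
		}
		// "send-info" ops are possibly-committed: excluded from the loss set.
	}
	for k := range sentOK {
		if recvd[k] == 0 { // contract 2: no loss of committed sends
			errs = append(errs, fmt.Sprintf("lost message: %+v", k))
		}
	}
	return errs
}
```

An empty returned slice corresponds to a clean run; each entry names the violated contract and the offending `(group, seq)`.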
## Summary

Phase 3.D PR 7a — adds the `elastickv_sqs_partition_messages_total{queue, partition, action}` Prometheus counter so dashboards and alerts can spot uneven `MessageGroupId` distributions across partitioned-FIFO queues. Metrics-only: the Jepsen HT-FIFO workload (PR 7b) ships separately so the two review loops do not cross.

- `monitoring/sqs.go` (new): `SQSMetrics` with cardinality cap (`sqsMaxTrackedQueues = 512`, overflow collapses to `_other`) mirroring `DynamoDBMetrics`. Drops empty queue names and unknown action labels so a future call-site bug cannot pollute the series space dashboards have to learn about. Nil-receiver-safe so adapter call sites do not need to nil-guard.
- `adapter/sqs.go`: `SQSPartitionObserver` interface + `WithSQSPartitionObserver` option. Re-declared in `adapter` so it doesn't import `monitoring` at the package boundary (matches the DynamoDB/Redis observer pattern). Action constants (send/receive/delete) re-declared on the adapter side and validated at runtime by the monitoring side — drift between the two surfaces as a dropped observation, not a wedge.
- `adapter/sqs_fifo.go`, `adapter/sqs_messages.go`: emit the counter on the partitioned commit branch only (`PartitionCount > 1`) for send / receive / delete. Legacy single-partition queues stay off the metric since partition is always 0 and the cardinality cost would buy nothing.
- `monitoring/registry.go`, `main_sqs.go`, `main.go`: wire the registry's `SQSPartitionObserver()` into `startSQSServer` so the SQS server picks up the production observer on cluster boot. Test fixtures and CLI tools that build `SQSServer` without a registry pass `nil` and the metric stays at zero.

## Tests

`monitoring/sqs_test.go` (new, 6 cases):

- `TestSQSMetrics_ObservePartitionMessage_IncrementsByLabelTriple` — pin the `(queue, partition, action)` counter contract.
- `TestSQSMetrics_ObservePartitionMessage_DropsInvalidAction` — pin the typo guard against future drift between adapter and monitoring constants.
- `TestSQSMetrics_ObservePartitionMessage_DropsEmptyQueue` — pin that an empty queue name does not collapse with valid observations onto a shared series.
- `TestSQSMetrics_NilReceiverIsSafe` — pin the nil-receiver short-circuit the adapter relies on.
- `TestSQSMetrics_QueueLabelOverflow` — pin the cap-and-collapse so a misbehaving caller cannot exhaust the Prometheus series budget.
- `TestSQSMetrics_RegistryWiring` — pin that the public `Registry` exposes the metric under the documented name.

## Self-review (5 lenses)

1. Data loss — N/A; metrics-only, no storage / Raft / FSM touch.
2. Concurrency — counter increments are atomic via Prometheus; the `trackedQueues` map is only consulted from the dispatch-success path under the SQS server's existing concurrency model. No new locks.
3. Performance — one map lookup + one `CounterVec` lookup per partitioned send/receive/delete on the success branch. Legacy queues skip the call entirely. Cardinality bounded at 512 queue × 32 partition (`htfifoMaxPartitions`) × 3 action ≈ 49k series worst case; in practice a 32-partition queue yields 96 series, so the budget is plenty for the SLO panels.
4. Data consistency — the metric is observed AFTER OCC dispatch succeeds, so the counter reflects committed state. Receive/delete branches that return on retryable errors deliberately do not increment.
5. Test coverage — 6 unit tests in `monitoring/`, plus the adapter-side nil-observer path is exercised by all existing partitioned-FIFO tests in `adapter/sqs_partitioned_dispatch_test.go` (they pass `nil` observer through the test fixture).

## Test plan

- `go test -race -count=1 ./monitoring/...`
- `go test -race -count=1 -run 'TestSQS' ./adapter/...`
- `go test -race -count=1 ./...` (full suite)
- `golangci-lint --config=.golangci.yaml run ./...` (full repo)

## Refs

- `docs/design/2026_05_01_partial_split_queue_fifo.md` §11 PR 7