
feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)#737

Merged
bootjp merged 3 commits into main from feat/sqs-htfifo-partition-metrics on May 4, 2026

Conversation

Owner

@bootjp bootjp commented May 4, 2026

Summary

Phase 3.D PR 7a — adds the elastickv_sqs_partition_messages_total{queue, partition, action} Prometheus counter so dashboards and alerts can spot uneven MessageGroupId distributions across partitioned-FIFO queues. Metrics-only: the Jepsen HT-FIFO workload (PR 7b) ships separately so the two review loops do not cross.

  • monitoring/sqs.go (new): SQSMetrics with cardinality cap (sqsMaxTrackedQueues = 512, overflow collapses to _other) mirroring DynamoDBMetrics. Drops empty queue names and unknown action labels so a future call-site bug cannot pollute the series space dashboards have to learn about. Nil-receiver-safe so adapter call sites do not need to nil-guard.
  • adapter/sqs.go: SQSPartitionObserver interface + WithSQSPartitionObserver option. Re-declared in adapter so it doesn't import monitoring at the package boundary (matches the DynamoDB/Redis observer pattern). Action constants (send/receive/delete) re-declared on the adapter side and validated at runtime by the monitoring side — drift between the two surfaces as a dropped observation, not a wedge. A sketch of this wiring follows the list.
  • adapter/sqs_fifo.go, adapter/sqs_messages.go: emit the counter on the partitioned commit branch only (PartitionCount > 1) for send / receive / delete. Legacy single-partition queues stay off the metric since partition is always 0 and the cardinality cost would buy nothing.
  • monitoring/registry.go, main_sqs.go, main.go: wire the registry's SQSPartitionObserver() into startSQSServer so the SQS server picks up the production observer on cluster boot. Test fixtures and CLI tools that build SQSServer without a registry pass nil and the metric stays at zero.
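For reviewers, a minimal sketch of the observer wiring described above. Illustrative only: the Option and SQSServer shapes are stand-ins rather than the actual adapter code, and the constant names are assumed.

package adapter

// SQSPartitionObserver is re-declared here so the adapter does not import
// the monitoring package (mirrors the DynamoDB/Redis observer pattern).
type SQSPartitionObserver interface {
    ObservePartitionMessage(queue string, partition uint32, action string)
}

// Action labels the adapter emits; the monitoring side validates them, so
// drift between the two packages surfaces as a dropped observation.
const (
    sqsActionSend    = "send"
    sqsActionReceive = "receive"
    sqsActionDelete  = "delete"
)

// SQSServer and Option are simplified stand-ins for the real adapter types.
type SQSServer struct {
    partitionObserver SQSPartitionObserver
}

type Option func(*SQSServer)

// WithSQSPartitionObserver injects the production observer at construction
// time; test fixtures and CLI tools pass nil and the metric stays at zero.
func WithSQSPartitionObserver(o SQSPartitionObserver) Option {
    return func(s *SQSServer) { s.partitionObserver = o }
}

// observePartitionMessage is the adapter-side wrapper: it no-ops when no
// observer is configured, so call sites never carry their own nil checks.
func (s *SQSServer) observePartitionMessage(queue string, partition uint32, action string) {
    if s.partitionObserver == nil {
        return
    }
    s.partitionObserver.ObservePartitionMessage(queue, partition, action)
}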

Tests

monitoring/sqs_test.go (new, 6 cases):

  • TestSQSMetrics_ObservePartitionMessage_IncrementsByLabelTriple — pin the (queue, partition, action) counter contract.
  • TestSQSMetrics_ObservePartitionMessage_DropsInvalidAction — pin the typo guard against future drift between adapter and monitoring constants.
  • TestSQSMetrics_ObservePartitionMessage_DropsEmptyQueue — pin that an empty queue name does not collapse with valid observations onto a shared series.
  • TestSQSMetrics_NilReceiverIsSafe — pin the nil-receiver short-circuit the adapter relies on.
  • TestSQSMetrics_QueueLabelOverflow — pin the cap-and-collapse so a misbehaving caller cannot exhaust the Prometheus series budget (a rough shape of this test is sketched after the list).
  • TestSQSMetrics_RegistryWiring — pin that the public Registry exposes the metric under the documented name.
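For orientation, the overflow test might look roughly like the following. This is a sketch, not the actual sqs_test.go: the newSQSMetrics constructor signature is assumed, and it relies on same-package access to partitionMessages and the cap constants.

package monitoring

import (
    "fmt"
    "testing"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/testutil"
)

func TestSQSMetrics_QueueLabelOverflow_Sketch(t *testing.T) {
    m := newSQSMetrics(prometheus.NewRegistry()) // assumed constructor shape

    // Fill the tracked-queue budget, then observe one queue past the cap.
    for i := 0; i < sqsMaxTrackedQueues; i++ {
        m.ObservePartitionMessage(fmt.Sprintf("queue-%d", i), 0, "send")
    }
    m.ObservePartitionMessage("one-queue-too-many", 0, "send")

    // The overflow observation must land on the "_other" sentinel series
    // rather than minting a fresh per-queue series.
    got := testutil.ToFloat64(m.partitionMessages.WithLabelValues(sqsQueueOverflow, "0", "send"))
    if got != 1 {
        t.Fatalf("overflow series = %v, want 1", got)
    }
}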

Self-review (5 lenses)

  1. Data loss — N/A; metrics-only, no storage / Raft / FSM touch.
  2. Concurrency — counter increments are atomic via Prometheus; the trackedQueues map is only consulted from the dispatch-success path under the SQS server's existing concurrency model. No new locks.
  3. Performance — one map lookup + one CounterVec lookup per partitioned send/receive/delete on the success branch. Legacy queues skip the call entirely. Cardinality bounded at 512 queue × 32 partition (htfifoMaxPartitions) × 3 action ≈ 49k series worst case; in practice a 32-partition queue yields 96 series, so the budget is plenty for the SLO panels.
  4. Data consistency — the metric is observed AFTER OCC dispatch succeeds, so the counter reflects committed state. Receive/delete branches that return on retryable errors deliberately do not increment (the retry path will observe on the eventual success).
  5. Test coverage — 6 unit tests in monitoring/, plus the adapter-side nil-observer path is exercised by all existing partitioned-FIFO tests in adapter/sqs_partitioned_dispatch_test.go (they pass nil observer through the test fixture).

Test plan

  • go test -race -count=1 ./monitoring/...
  • go test -race -count=1 -run 'TestSQS' ./adapter/...
  • go test -race -count=1 ./... (full suite)
  • golangci-lint --config=.golangci.yaml run ./... (full repo)
  • Jepsen HT-FIFO workload — deferred to PR 7b

Refs

Adds elastickv_sqs_partition_messages_total{queue, partition, action}
so dashboards and alerts can spot uneven MessageGroupId distributions
across partitioned-FIFO queues.

- monitoring/sqs.go: SQSMetrics with cardinality cap (sqsMaxTrackedQueues
  = 512, overflow collapses to "_other") mirroring DynamoDBMetrics. Drops
  empty queue names and unknown action labels so a future call-site bug
  cannot pollute the series space dashboards have to learn about. Nil-
  receiver-safe so adapter call sites do not need to nil-guard.
- adapter/sqs.go: SQSPartitionObserver interface + WithSQSPartitionObserver
  option. Re-declared in adapter so it does not import monitoring at the
  package boundary (matches DynamoDB / Redis observer pattern). Action
  constants (send/receive/delete) re-declared on the adapter side and
  validated at runtime by the monitoring side.
- adapter/sqs_fifo.go, adapter/sqs_messages.go: emit the counter on the
  partitioned commit branch only (PartitionCount > 1) for send / receive
  / delete. Legacy single-partition queues stay off the metric — the
  cardinality cost would buy nothing since partition is always 0.
- monitoring/registry.go, main_sqs.go, main.go: wire the registry
  SQSPartitionObserver() into startSQSServer so the SQS server picks
  up the production observer on cluster boot. Test fixtures and CLI
  tools that build SQSServer without a registry pass nil and the
  metric stays at zero.

Tests (monitoring/sqs_test.go):
- TestSQSMetrics_ObservePartitionMessage_IncrementsByLabelTriple — pin
  the (queue, partition, action) label triple counter contract.
- TestSQSMetrics_ObservePartitionMessage_DropsInvalidAction — pin the
  typo guard against future drift between adapter and monitoring
  constants.
- TestSQSMetrics_ObservePartitionMessage_DropsEmptyQueue — pin that an
  empty queue name does not collapse with valid observations onto a
  shared series.
- TestSQSMetrics_NilReceiverIsSafe — pin the nil-receiver short-circuit
  the adapter relies on.
- TestSQSMetrics_QueueLabelOverflow — pin the cap-and-collapse so a
  misbehaving caller cannot exhaust the Prometheus series budget.
- TestSQSMetrics_RegistryWiring — pin that the public Registry exposes
  the metric under the documented name.

Self-review (5 lenses):
1. Data loss — N/A; metrics-only, no storage / Raft / FSM touch.
2. Concurrency — counter increments are atomic via Prometheus; the
   trackedQueues map is only consulted from the dispatch-success path
   under the SQS server's existing concurrency model. No new locks.
3. Performance — one map lookup + one CounterVec lookup per partitioned
   send/receive/delete on the success branch. Legacy queues skip the
   call entirely. Cardinality bounded at 512 queue x 256 partition x 3
   action ~= 393k series worst case; in practice a 32-partition queue
   yields 96 series, so the budget is plenty for the SLO panels.
4. Data consistency — the metric is observed AFTER OCC dispatch
   succeeds, so the counter reflects committed state. Receive/delete
   branches that return on retryable errors deliberately do not
   increment.
5. Test coverage — 6 unit tests in monitoring/, plus the adapter-side
   nil-observer path is exercised by all existing partitioned-FIFO
   tests in adapter/sqs_partitioned_dispatch_test.go (they pass nil
   observer through the test fixture).

Jepsen HT-FIFO workload (PR 7b) lands as a separate PR so the metrics
and the workload review loops do not cross.

Refs: docs/design/2026_05_01_partial_split_queue_fifo.md §11 PR 7
Owner Author

bootjp commented May 4, 2026

@claude review


coderabbitai Bot commented May 4, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 20 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8ab090d-9b39-4f01-9f58-4018bb8dc998

📥 Commits

Reviewing files that changed from the base of the PR and between f489669 and 8230365.

📒 Files selected for processing (8)
  • adapter/sqs.go
  • adapter/sqs_fifo.go
  • adapter/sqs_messages.go
  • main.go
  • main_sqs.go
  • monitoring/registry.go
  • monitoring/sqs.go
  • monitoring/sqs_test.go


claude Bot commented May 4, 2026

Claude finished @bootjp's task in 4m 14s


Code Review — feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)

  • Read task context and classify request
  • Read all changed files
  • Review monitoring/sqs.go and monitoring/sqs_test.go
  • Review adapter/sqs.go changes
  • Review adapter/sqs_fifo.go and adapter/sqs_messages.go changes
  • Review main.go, main_sqs.go, monitoring/registry.go wiring
  • Post final review

Overall this is a clean, well-structured metrics-only PR that correctly mirrors the existing DynamoDB observer pattern. No blocking issues. A few observations below from most to least significant.

Findings

1. Cardinality calculation in PR description is off (doc-only issue)

The PR description states: "cardinality bounded at 512 queue × 256 partition × 3 action ≈ 393k series worst case".

htfifoMaxPartitions = 32 (adapter/sqs_partitioning.go:26), so the actual worst case is 512 × 32 × 3 = 49,152 series — about 8× less than stated. The code and cardinality cap are both correct; only the description is inaccurate. Worth correcting if the design doc is updated to *_implemented_*.

2. Prometheus Help text contains an internal design-doc reference

monitoring/sqs.go:59–62:

Help: "Total HT-FIFO partition operations by queue, partition, and action (send / receive / delete). Per the §11 PR 7 split-queue-FIFO design contract — gives dashboards a hot-partition signal so operators can spot uneven MessageGroupId distributions.",

"Per the §11 PR 7 split-queue-FIFO design contract" will appear verbatim in production /metrics output and Prometheus help text. External operators and SREs reading this in Grafana or Alertmanager won't have access to the internal doc. Consider trimming to the observable semantics: Fix this →

3. SendMessageBatch / DeleteMessageBatch coverage (confirmed correct)

sendMessageBatch → sendMessageBatchWithRetry → runFifoSendWithRetry → sendFifoMessage (in sqs_messages_batch.go:304, sqs_messages_batch.go:344) — the metric fires on the shared FIFO commit path, so batches are covered.

deleteMessageBatch iterates and calls deleteMessageWithRetry per entry (sqs_messages_batch.go:485) where the observe call already lives. Both batch paths are correctly covered.

4. The _other overflow sentinel vs. a legitimately named _other queue

If a user creates an SQS queue literally named _other (a valid name under the [a-zA-Z0-9_-] character set), its observations would be tracked under the real _other label while it's within the 512-queue cap. Overflow queues would also emit to _other. When both exist, the _other series aggregates both, making it impossible to distinguish the real queue from overflow noise. This is an accepted trade-off mirroring the DynamoDB pattern, but it's worth noting in a dashboards runbook.

5. TestSQSMetrics_NilReceiverIsSafe — belt-and-suspenders guard

The test correctly pins the if m == nil { return } nil-receiver guard in ObservePartitionMessage. The adapter's production nil-check lives in observePartitionMessage (s.partitionObserver == nil), so this guard is only reachable if a caller directly stores a typed nil (*SQSMetrics)(nil) into the interface. That can't happen through Registry.SQSPartitionObserver() since NewRegistry always initializes r.sqs, but the defensive guard + test is appropriate.
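For readers less familiar with the typed-nil trap, a standalone snippet (not project code) shows why an interface-side nil check alone would not cover that case:

package main

import "fmt"

type observer interface{ observe() }

type metrics struct{}

func (m *metrics) observe() {
    if m == nil { // nil-receiver guard, same shape as ObservePartitionMessage
        fmt.Println("nil receiver: dropped safely")
        return
    }
    fmt.Println("observed")
}

func main() {
    var typed *metrics // typed nil pointer
    var o observer = typed

    // The interface value carries a type, so it is not nil even though the
    // pointer inside it is; a caller-side `o != nil` check passes and the
    // method runs on a nil receiver.
    fmt.Println(o == nil) // false
    o.observe()           // prints "nil receiver: dropped safely"
}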

6. No concurrent stress test for queueLabelForCardinalityBudget

The mutex around trackedQueues is correct. The existing tests create independent instances, so no concurrent stress is exercised. Prometheus's own CounterVec is thread-safe, and the race detector on go test -race would surface any problem here. Low risk, but a TestSQSMetrics_ConcurrentObserve could be added to a follow-up if desired.
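If that follow-up is ever written, the shape would be roughly the following: a sketch with the same assumed constructor as above, judged by the race detector rather than by assertions.

package monitoring

import (
    "fmt"
    "sync"
    "testing"

    "github.com/prometheus/client_golang/prometheus"
)

func TestSQSMetrics_ConcurrentObserve_Sketch(t *testing.T) {
    m := newSQSMetrics(prometheus.NewRegistry()) // assumed constructor shape

    // Hammer ObservePartitionMessage from many goroutines so `go test -race`
    // exercises the trackedQueues locking and the CounterVec internals.
    var wg sync.WaitGroup
    for g := 0; g < 16; g++ {
        wg.Add(1)
        go func(g int) {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                m.ObservePartitionMessage(fmt.Sprintf("queue-%d", g), uint32(i%32), "send")
            }
        }(g)
    }
    wg.Wait()
}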

7. Minor: observe call order in commitReceiveRotation

The observe call (monitoring/sqs.go) fires after Dispatch succeeds but before encodeReceiptHandleDispatch (adapter/sqs_messages.go:1279). If encoding fails, the Raft commit has happened (message rotation committed) but the client gets an error. The metric increment is semantically correct (partition was used), and this scenario mirrors pre-existing behaviour in commitReceiveRotation — not a regression introduced by this PR.

Summary

Code is correct, tests are thorough and cover the key invariants (nil safety, cardinality cap, label isolation, registry wiring). The five-lens self-review in the PR description is accurate. Two suggestions: clean up the Help string of internal references, and correct the 256→32 cardinality figure in the description (code is fine). LGTM with those minor nits.

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

Owner Author

bootjp commented May 4, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces observability for SQS HT-FIFO operations by tracking message counts per queue, partition, and action (send, receive, delete). It adds a new SQSPartitionObserver interface, a Prometheus-backed implementation in the monitoring package with cardinality capping, and integrates these metrics into the SQS adapter's send, receive, and delete paths. Feedback focuses on performance optimizations for the metrics hot path, specifically suggesting the use of WithLabelValues to avoid heap allocations from map creation and recommending more efficient synchronization, such as sync.RWMutex, to reduce contention during queue label resolution.

Comment thread monitoring/sqs.go Outdated
Comment on lines +89 to +93
m.partitionMessages.With(prometheus.Labels{
    "queue":     queueLabel,
    "partition": strconv.FormatUint(uint64(partition), 10),
    "action":    action,
}).Inc()
Contributor


medium

Using prometheus.Labels (a map[string]string) on every observation causes a heap allocation for the map. On the hot path of SQS operations, this adds significant GC pressure. Switching to WithLabelValues avoids the map allocation. Additionally, strconv.FormatUint allocates a new string; for common partition numbers (e.g., 0-31), consider using a pre-allocated lookup table to further reduce allocations.

Suggested change
- m.partitionMessages.With(prometheus.Labels{
-     "queue":     queueLabel,
-     "partition": strconv.FormatUint(uint64(partition), 10),
-     "action":    action,
- }).Inc()
+ m.partitionMessages.WithLabelValues(
+     queueLabel,
+     strconv.FormatUint(uint64(partition), 10),
+     action,
+ ).Inc()

Comment thread monitoring/sqs.go
Comment on lines +102 to +113
func (m *SQSMetrics) queueLabelForCardinalityBudget(queue string) string {
    m.mu.Lock()
    defer m.mu.Unlock()
    if _, ok := m.trackedQueues[queue]; ok {
        return queue
    }
    if len(m.trackedQueues) >= sqsMaxTrackedQueues {
        return sqsQueueOverflow
    }
    m.trackedQueues[queue] = struct{}{}
    return queue
}
Contributor


medium

The queueLabelForCardinalityBudget method is called on every SQS operation for partitioned queues. Using a plain sync.Mutex and defer creates a global bottleneck by serializing all metrics observations. Since trackedQueues is read-mostly once initialized, using a sync.RWMutex with a fast-path RLock or a sync.Map would significantly reduce contention. This follows the general rule of preferring efficient synchronization on hot paths.

References
  1. For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Claude flagged that the Prometheus Help string for
elastickv_sqs_partition_messages_total contained an internal
design-doc pointer ("Per the §11 PR 7 split-queue-FIFO design
contract") which leaks into production /metrics output and the
Grafana help tooltip — neither audience has access to the doc.

Replaced with a description of the observable semantics: which
queues populate the metric (PartitionCount > 1 only) and what
the metric is for (spotting uneven MessageGroupId distribution
across partitions). The cap-and-collapse / dropped-action /
empty-queue behaviours stay implementation details exposed only
through the unit tests.

Refs: PR #737 review thread; Claude review at run 25328847079.
Owner Author

bootjp commented May 4, 2026

@claude review


claude Bot commented May 4, 2026

Claude finished @bootjp's task in 2m 17s


Code Review (Round 2) — feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)

  • Read task context
  • Read current state of changed files (post-fix commit 9491561)
  • Evaluate Gemini review suggestions
  • Assess Help text fix
  • Post final review

Round 2 verdict: LGTM with one actionable nit

The two-commit state is clean. Round 1 findings are resolved or correctly accepted. Addressing Gemini's two suggestions below.


Help text fix (commit 9491561) ✅

The updated Help string is correct and operator-facing:

Help: "Total HT-FIFO partition operations by queue, partition, and action (send / receive / delete). Non-zero only for queues with PartitionCount > 1; use to spot uneven MessageGroupId distributions across partitions.",

The internal doc reference is gone. The replacement text describes the observable semantics precisely and is appropriate for Grafana/Alertmanager output.


Gemini finding 1: WithLabelValues instead of prometheus.Labels — Apply this

monitoring/sqs.go:89–93:

m.partitionMessages.With(prometheus.Labels{
    "queue":     queueLabel,
    "partition": strconv.FormatUint(uint64(partition), 10),
    "action":    action,
}).Inc()

With(prometheus.Labels{...}) allocates a map[string]string on every call (plus the strconv.FormatUint string). WithLabelValues avoids the map allocation and takes the label values positionally in the order declared in NewCounterVec ("queue", "partition", "action"):

m.partitionMessages.WithLabelValues(
    queueLabel,
    strconv.FormatUint(uint64(partition), 10),
    action,
).Inc()

The existing DynamoDB metrics (monitoring/dynamodb.go:197, 230–238, 311–328) all use WithLabelValues — sqs.go is the only metric in the package that uses With(prometheus.Labels{...}). This is an inconsistency with a concrete allocation cost on the hot path. Worth a one-liner fix.


Gemini finding 2: sync.RWMutex for queueLabelForCardinalityBudget — Skip, document why

monitoring/sqs.go:52, 103–113 uses sync.Mutex, identical to monitoring/dynamodb.go:77 which holds the same cardinality pattern. Gemini's suggestion is theoretically valid (read-mostly once the 512-queue cap is reached), but:

  1. Upgrading from RLock to Lock is not atomic in Go — the correct implementation requires release/reacquire plus a double-check, making the function meaningfully more complex.
  2. The critical section is ~5ns (map lookup + branch + possible insert of a struct{}). For short critical sections, sync.RWMutex has higher overhead than sync.Mutex due to the reader-count atomic ops; the crossover point is typically when the critical section is ≥1µs or contention is severe.
  3. DynamoDB uses sync.Mutex for the same pattern and hasn't been flagged for contention. Changing only sqs.go would be inconsistent with no performance evidence justifying the divergence.

No change needed here. If cardinality-map contention is ever measured as a bottleneck, the correct fix would be to apply sync.Map with a separate atomic counter (removing the len() call from under the lock) — and apply it to both SQSMetrics and DynamoDBMetrics together.
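For reference, that hypothetical variant would look roughly like this. A sketch, not a recommendation; note the cap becomes approximate under concurrent first observations, which is a semantic difference from the mutex version.

package monitoring

import (
    "sync"
    "sync/atomic"
)

const otherQueueLabel = "_other" // overflow sentinel, as in the PR

// queueTracker trades the mutex for a lock-free fast path: Load handles the
// warm, read-mostly case, and the atomic counter replaces the len() call.
type queueTracker struct {
    tracked sync.Map     // queue name -> struct{}{}
    count   atomic.Int64 // approximate number of tracked queues
    max     int64        // e.g. 512
}

func (t *queueTracker) label(queue string) string {
    if _, ok := t.tracked.Load(queue); ok {
        return queue // common case after warm-up: one lock-free map read
    }
    if t.count.Load() >= t.max {
        return otherQueueLabel
    }
    // Two goroutines racing on a brand-new queue both reach here; LoadOrStore
    // ensures the queue is counted once, but the cap can overshoot slightly.
    if _, loaded := t.tracked.LoadOrStore(queue, struct{}{}); !loaded {
        t.count.Add(1)
    }
    return queue
}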


Summary

Finding | Status
Help text internal doc reference (Round 1 #2) | ✅ Fixed in 9491561
Cardinality figure in PR description (Round 1 #1) | Doc-only; no code change needed
WithLabelValues vs prometheus.Labels (Gemini) | ⚠️ Apply — inconsistency with existing pattern, allocation cost
sync.RWMutex (Gemini) | ✅ Skip — mirrors DynamoDB, complexity > benefit at this scale
Batch path coverage (Round 1 #3) | ✅ Confirmed correct
_other name collision (Round 1 #4) | ✅ Accepted trade-off (mirrored from DynamoDB)
Nil guard safety (Round 1 #5) | ✅ Correct

One actionable item remaining: swap With(prometheus.Labels{...}) → WithLabelValues(...) in monitoring/sqs.go:89. Everything else is correct.

Claude review round 2 surfaced one actionable nit (Gemini): the
ObservePartitionMessage path used With(prometheus.Labels{...})
which allocates a map[string]string on every observe call.
Switched to WithLabelValues with positional args (queue,
partition, action) matching the NewCounterVec declaration —
avoids the per-call map allocation and lines up with the
DynamoDBMetrics pattern (monitoring/dynamodb.go) which already
uses WithLabelValues across all observe sites.

Behaviour unchanged. Single caller (the adapter observe wrapper),
return type unchanged, label semantics identical, so no caller
audit needed.

Round 2 also flagged sync.RWMutex for queueLabelForCardinalityBudget
but Claude (and I) agreed to skip — the critical section is ~5ns,
RWMutex has higher overhead than Mutex at that scale, and the
DynamoDB twin uses Mutex. Diverging only in SQS would be
inconsistent without performance evidence.

Refs: PR #737 round 2 review at run 25329442300.
Owner Author

bootjp commented May 4, 2026

@claude review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.


claude Bot commented May 4, 2026

Claude finished @bootjp's task in 1m 51s


Code Review (Round 3) — feat(sqs): HT-FIFO partition metrics counter (Phase 3.D PR 7a)

  • Read task context and classify request
  • Read current state of changed files (post-round-2 commit 8230365)
  • Verify WithLabelValues fix landed correctly
  • Fresh review pass on all changed files
  • Post final verdict

Round 3 verdict: LGTM — all previous findings resolved

Both round 1 and round 2 changes are correctly applied. No new issues found.


Verification of round 2 fix: WithLabelValues (commit 8230365) ✅

monitoring/sqs.go:93–97 now reads:

m.partitionMessages.WithLabelValues(
    queueLabel,
    strconv.FormatUint(uint64(partition), 10),
    action,
).Inc()

The prometheus.Labels{} map allocation is gone. The prometheus import is still needed for prometheus.CounterVec, prometheus.Registerer, prometheus.NewCounterVec, and prometheus.CounterOpts — no orphaned import introduced by the change. Consistent with the monitoring/dynamodb.go hot path.

The inline comment (line 89–92) correctly explains the why (allocation cost), not just the what.


Full wiring audit

Component — Status
monitoring/sqs.go — WithLabelValues, clean Help text, action validation, cardinality cap
monitoring/registry.go — sqs *SQSMetrics field, newSQSMetrics in constructor, SQSPartitionObserver() method
adapter/sqs.go — interface re-declared (no cross-package import), observePartitionMessage nil guard (s.partitionObserver == nil), WithSQSPartitionObserver option
adapter/sqs_fifo.go — observe fires post-commit, PartitionCount > 1 gate
adapter/sqs_messages.go — receive: same gate; delete: adds meta != nil check (appropriate, different code path)
main.go — sqsPartitionObserver field on runtimeServerRunner, sourced from in.metricsRegistry.SQSPartitionObserver()
main_sqs.go — partitionObserver param threaded into WithSQSPartitionObserver in startSQSServer
monitoring/sqs_test.go — 6 tests covering counter increment, invalid action drop, empty queue drop, nil receiver, cardinality overflow, registry wiring

One residual non-blocking note (same as round 2 accepted trade-off)

If a queue is literally named _other and the cap is also reached, the _other series will aggregate both. This is the same accepted behavior as DynamoDBMetrics and is documented in the self-review. No action needed.


No further changes required. PR is ready to merge.

@bootjp bootjp merged commit 04a997a into main May 4, 2026
9 checks passed
@bootjp bootjp deleted the feat/sqs-htfifo-partition-metrics branch May 4, 2026 16:34
bootjp added a commit that referenced this pull request May 4, 2026
## Summary

Phase 3.D PR 7b — Jepsen HT-FIFO workload that stresses partitioned-FIFO
queues against the three contracts AWS HT-FIFO is supposed to honour
even under partition and node-loss faults: **within-group ordering**,
**no message loss**, **no duplicates**.

Pattern follows [aphyr's Jepsen RabbitMQ
analysis](https://aphyr.com/posts/315-jepsen-rabbitmq): track every
`:send` and `:recv` in the operation history, then a custom checker
verifies the contracts against the recorded events at the end of the
run.

## What's in this PR

- **`jepsen/project.clj`** — Adds `com.cognitect.aws/sqs` at the same
version as the existing dynamodb dep, so the SDK wire protocol (auth,
retry classification, error parsing) is exercised end-to-end against
elastickv rather than a hand-rolled HTTP layer.
- **`jepsen/src/elastickv/db.clj`** — Extends `start-node!` to accept
`:sqs-port` (port spec like `:dynamo-port`) and `:sqs-region`. Both are
optional, so existing dynamodb / s3 / redis test specs are
byte-identical at the args level when `sqs-port` is absent.
- **`jepsen/src/elastickv/jepsen_test.clj`** — Registers
`elastickv-sqs-htfifo-test` alongside the other workloads.
- **`jepsen/src/elastickv/sqs_htfifo_workload.clj`** (new, ~430 lines) —
The workload. Uses cognitect/aws-api SQS, creates an HT-FIFO queue with
`PartitionCount=4` + `ContentBasedDeduplication`, runs sends and
receives across N `MessageGroupId` values, and the custom
`ht-fifo-checker` validates the three contracts.
- **`jepsen/test/elastickv/sqs_htfifo_workload_test.clj`** (new) —
Pure-function tests for the checker plus integration smoke tests for the
test-spec builder. 11 tests / 27 assertions.

## Checker contracts

For each `MessageGroupId` independently (a Go sketch of the per-group check follows this list):

1. **Within-group ordering** — the sequence of received `seq` values,
sorted by global completion time across all consumers, is monotonically
non-decreasing.
2. **No loss** — every `(group, seq)` successfully `:sent` eventually
appears in the `:recv` history. Sends with `:info` status are treated as
possibly-committed and not counted as lost.
3. **No duplicates** — every `(group, seq)` appears at most once in the
`:recv` history. `ContentBasedDeduplication` on the queue + a unique
`(group, seq)` body is what enforces this server-side; a duplicate here
is a real bug (e.g. a deletion that did not commit).
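For readers more at home in Go than Clojure, the per-group check reduces to roughly the sketch below. Illustrative only: the real checker is the Clojure `ht-fifo-checker` in `sqs_htfifo_workload.clj`, and the helper names here are invented.

package main

import "fmt"

// checkGroup applies the three contracts to one MessageGroupId. sent holds
// seqs whose sends definitely committed (:ok); uncertain (:info) sends are
// excluded before calling, matching the checker description above. recvs
// must already be ordered by global completion time across all consumers.
func checkGroup(sent []int, recvs []int) []string {
    var violations []string

    seen := map[int]int{}
    prev := -1
    for _, seq := range recvs {
        // Contract 1: within-group ordering is monotonically non-decreasing.
        if seq < prev {
            violations = append(violations, fmt.Sprintf("order: seq %d received after %d", seq, prev))
        }
        prev = seq
        seen[seq]++
    }

    // Contract 3: no duplicates.
    for seq, n := range seen {
        if n > 1 {
            violations = append(violations, fmt.Sprintf("duplicate: seq %d received %d times", seq, n))
        }
    }

    // Contract 2: no loss, i.e. every committed send is eventually received.
    for _, seq := range sent {
        if seen[seq] == 0 {
            violations = append(violations, fmt.Sprintf("lost: seq %d", seq))
        }
    }
    return violations
}

func main() {
    // One ordering violation: seq 2 completes after seq 3 within the group.
    fmt.Println(checkGroup([]int{1, 2, 3}, []int{1, 3, 2}))
}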

## Open-endpoint mode

The elastickv server starts without `--sqsCredentialsFile`, so the SQS
adapter accepts any signed request (mirroring how the S3 adapter is
wired in jepsen today). The SDK client signs with dummy credentials, so
the SigV4 path still exercises end-to-end at the protocol level.

## Self-review (5 lenses)

1. **Data loss** — N/A; this is a test-only PR. The workload's whole
purpose is to *detect* data loss in the system under test.
2. **Concurrency** — The shared per-group `seq-counter` is an `atom`
updated via `swap!` (CAS-based), so concurrent sends from different
worker threads always assign distinct seqs. The checker is pure; no
shared mutable state.
3. **Performance** — Test-only code, runs at low rate (5
ops/sec/worker). Not on any hot path.
4. **Data consistency** — The checker compares committed sends against
the receive history globally, so all the consistency assertions are at
end-of-run with a complete picture. Sends with `:info` (uncertain
commit) are correctly excluded from the loss set, matching Jepsen's
standard approach.
5. **Test coverage** — 11 unit tests for the checker pin the contract
surface (clean / loss / info-not-loss / duplicates / within-group
ordering / cross-group interleaving / failed-send-not-counted /
empty-receive). Integration smoke tests pin the test-spec builder. The
workload itself is exercised end-to-end on a real cluster via `lein run
-m elastickv.sqs-htfifo-workload`.

## Test plan

- [x] `lein test elastickv.sqs-htfifo-workload-test` — 11 tests / 27
assertions pass
- [x] `lein test` for non-redis suite (dynamodb / dynamodb-types / s3 /
cli / sqs-htfifo) — 21 tests / 41 assertions pass
- [ ] End-to-end live cluster run — operator-driven (out of scope for
the merge gate; relies on a 3-node cluster setup)

The `elastickv.redis-workload` namespace fails to load due to the empty
`redis/src/` tree, which is pre-existing on main and unrelated to this
PR.

## Out of scope (next milestones)

- Wiring the workload into `scripts/run-jepsen-local.sh` — the existing
script is dynamodb-only; an sqs counterpart lands as a follow-up.
- Multi-shard cluster topology that lands distinct partitions on
distinct Raft groups. This PR's `PartitionCount=4` routes to the default
group on a single-shard cluster — partitioning logic (different keys per
partition, ordering preserved within group) is fully exercised, but the
cross-shard scaling story is gated on separate work.
- Design-doc lifecycle rename (`*_proposed_*.md` → `*_partial_*.md`) —
that is §11 PR 8 in the design doc and is tracked separately.

## Refs

- `docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md` §11 PR 7.
- Closes the testing half of §11 PR 7. PR 7a (metrics) shipped at #737.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added AWS SQS integration with HT‑FIFO support, SQS port/region configuration, and runtime options to exercise FIFO dedupe/order semantics

* **Tests**
  * Added comprehensive unit and workload tests validating ordering, no‑loss, no‑duplicates, and option handling

* **Chores**
  * CI updated to run the SQS HT‑FIFO workload as part of Jepsen test runs
<!-- end of auto-generated comment: release notes by coderabbit.ai -->