feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) #738
Conversation
Adds the Jepsen workload that stresses partitioned-FIFO (HT-FIFO) queues against the three contracts AWS HT-FIFO is supposed to honour even under partition and node-loss faults — within-group ordering, no message loss, no duplicates.

Pattern follows aphyr's classic Jepsen RabbitMQ analysis (the user pointed at https://aphyr.com/posts/315-jepsen-rabbitmq as a starting point): track every `:send` and `:recv` in the history, then a custom checker verifies the contracts against the recorded events. A minimal sketch of the history shapes follows below.

Files:
- `jepsen/project.clj` — adds `com.cognitect.aws/sqs` at the same version as the existing dynamodb dep, so the SDK wire protocol (auth, retry classification, error parsing) is exercised end-to-end against elastickv rather than a hand-rolled HTTP layer.
- `jepsen/src/elastickv/db.clj` — extends `start-node!` to accept `:sqs-port` (port spec like `dynamo-port`) and `:sqs-region`. Both are optional, so existing dynamodb / s3 / redis test specs are byte-identical at the args level when `sqs-port` is absent.
- `jepsen/src/elastickv/jepsen_test.clj` — registers the new workload under `elastickv-sqs-htfifo-test` alongside the other workloads.
- `jepsen/src/elastickv/sqs_htfifo_workload.clj` — the workload itself (~430 lines). Uses cognitect/aws-api SQS, creates an HT-FIFO queue with `PartitionCount=4` + `ContentBasedDeduplication`, runs sends and receives across N `MessageGroupId` values, and the custom `ht-fifo-checker` validates the three contracts.
- `jepsen/test/elastickv/sqs_htfifo_workload_test.clj` — pure-function tests for the checker (synthetic histories pin the clean / loss / duplicate / ordering-violation / cross-group / info-send-not-loss / failed-send-not-counted / empty-receive cases) and integration smoke tests for the test-spec builder.

Open-endpoint setup: the elastickv server starts without `--sqsCredentialsFile`, so the SQS adapter accepts any signed request (matching how the S3 adapter is wired in jepsen today). The SDK client signs with dummy credentials, so the SigV4 path is still exercised end-to-end at the protocol level.

Tests run: `lein test elastickv.sqs-htfifo-workload-test` passes 11 tests / 27 assertions. The full non-redis suite (sqs / dynamodb / dynamodb-types / s3 / cli) passes 21 tests / 41 assertions. The `elastickv.redis-workload` namespace fails to load due to the empty `redis/src/` tree, which is pre-existing on main and unrelated to this PR.

Out of scope for this PR (next milestones):
- Wiring the workload into `scripts/run-jepsen-local.sh` — the existing script is dynamodb-only; an sqs counterpart can land as a follow-up.
- Multi-shard cluster topology that lands distinct partitions on distinct Raft groups. This PR's `PartitionCount=4` routes to the default group on a single-shard cluster — partitioning logic (different keys per partition, ordering preserved within group) is fully exercised, but the cross-shard scaling story is gated on separate work.
- Design-doc lifecycle rename (proposed → partial) — that is §11 PR 8 in the design doc and is tracked separately.

Refs: `docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md` §11 PR 7.
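For orientation, a minimal sketch of the op shapes this tracking produces; the `:value` layout (`[group seq]` on sends, a vector of such tuples on receives) follows the description above, and the remaining keys are Jepsen's standard history fields:

```clojure
;; Hypothetical history fragment, not from the PR's test output.
(def example-history
  [{:process 0 :type :invoke :f :send :value [2 0]   :time 100}
   {:process 0 :type :ok     :f :send :value [2 0]   :time 180}   ; committed send
   {:process 1 :type :invoke :f :recv :value nil     :time 200}
   {:process 1 :type :ok     :f :recv :value [[2 0]] :time 260}]) ; delivered once, in order
```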
@claude review |
📝 Walkthrough

Adds an SQS HT‑FIFO Jepsen workload: new Cognitect SQS dependency, SQS startup wiring, a full workload (client, queue provisioning, send/recv semantics, checker), tests, and CI workflow steps to run the workload against elastickv.

Changes: SQS HT-FIFO workload — see the sequence diagram below.
```mermaid
sequenceDiagram
    participant Runner as Jepsen Test Runner
    participant Client as HTFIFO Client
    participant SQS as SQS Service
    participant Server as elastickv Server
    participant Checker as HT-FIFO Checker
    Runner->>Client: setup (create queue)
    Client->>SQS: CreateQueue (FIFO, dedupe, partitions)
    SQS-->>Client: QueueUrl
    Client-->>Runner: ready
    loop workload
        Runner->>Client: invoke :send
        Client->>Client: choose group, next seq
        Client->>SQS: SendMessage (MessageGroupId, body "group:seq")
        SQS-->>Client: SendMessageResponse
        Runner->>Client: invoke :recv
        Client->>SQS: ReceiveMessage (batch)
        SQS-->>Client: Messages
        Client->>Client: decode bodies -> (group,seq)
        Client->>SQS: DeleteMessage (per message)
        SQS-->>Client: DeleteResponse
        Client-->>Runner: received tuples
    end
    Runner->>Checker: submit history
    Checker->>Checker: collect sends/receives
    Checker->>Checker: validate ordering, loss, duplicates
    Checker-->>Runner: report
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 passed.
Claude finished @bootjp's task in 4m 5s — Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b)
The overall design is solid — the three-contract checker, atom-based seq counters, and Jepsen lifecycle wiring all look correct. A few findings below, ordered by severity.

Findings

1. Misleading loss formula — the in-flight subtraction is dead code; committed and in-flight tuples are disjoint by construction, so subtracting one from the other can never change the result.
Code Review
This pull request introduces a Jepsen workload for testing elastickv's SQS-compatible HT-FIFO queues, featuring a custom checker for ordering, loss, and duplicates. It also updates the database configuration to include SQS parameters. The feedback identifies two potential issues in the test logic: silent failure of message deletions, which could cause false duplicate reports, and a drain phase that may be too short relative to the visibility timeout, which could yield false reports of lost messages.
```clojure
(try
  (sqs-invoke! sqs :DeleteMessage
               {:QueueUrl queue-url
                :ReceiptHandle receipt-handle})
  (catch clojure.lang.ExceptionInfo _
    ;; A failed delete leaves the message visible after the
    ;; visibility window — the next receive will see it again.
    ;; The checker will count it as a duplicate, which is the
    ;; correct signal: an at-least-once delivery on a FIFO
    ;; queue indicates a delete-side bug.
    nil))
```
The current error handling for DeleteMessage silently ignores all ExceptionInfo (including network/transport faults) and proceeds to mark the :recv operation as :ok. In a Jepsen environment where network partitions are expected, a failed delete will result in the message being redelivered by the SQS-compatible server after the visibility timeout. The checker (line 341) will then flag this as a duplicate and fail the test. Since redelivery on failed delete is standard SQS FIFO behavior, this leads to false positives. If a delete fails due to a transport error, the operation should be classified as :info (uncertain) rather than :ok. This can be achieved by letting the exception propagate to the outer catch block (line 239), which already correctly classifies transport anomalies.
```clojure
(sqs-invoke! sqs :DeleteMessage
             {:QueueUrl queue-url
              :ReceiptHandle receipt-handle})
```
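A sketch of the propagation this comment suggests, assuming the outer handler's shape; `do-receive-and-delete!` and `transport-error?` are hypothetical names, not the PR's actual code:

```clojure
;; Hedged sketch: let DeleteMessage failures reach an outer classifier so
;; the recv op ends as :info (uncertain) instead of :ok.
(defn classified-recv [sqs op]
  (try
    (do-receive-and-delete! sqs op)                      ; may throw mid-batch
    (catch clojure.lang.ExceptionInfo e
      (if (transport-error? e)
        (assoc op :type :info :error (ex-message e))     ; outcome unknown
        (assoc op :type :fail :error (ex-message e)))))) ; definite rejection
```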
```clojure
:shard-ranges (:shard-ranges opts)}))
rate       (double (or (:rate opts) 5))
time-limit (or (:time-limit opts) 30)
drain-time (or (:drain-time opts) (max 5 (quot time-limit 6)))
```
The default drain-time is too short (5s for a 30s test) compared to the visibility-timeout-seconds (30s). If a message is received by a worker that is subsequently partitioned or killed near the end of the main phase, the message will remain invisible and unavailable for redelivery until the visibility timeout expires. If the drain phase ends before this happens, the checker will incorrectly report the message as lost. To ensure all messages can be recovered, the drain phase should be at least as long as the visibility timeout.
```clojure
;; before
drain-time (or (:drain-time opts) (max 5 (quot time-limit 6)))
;; after
drain-time (or (:drain-time opts) (max visibility-timeout-seconds (quot time-limit 6)))
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jepsen/src/elastickv/sqs_htfifo_workload.clj`:
- Around line 201-209: The :send branch drops the [group seq-num] tuple when
sqs-invoke! throws, causing collect-sends to misclassify transient failures;
modify the :send handling so group, seq-num (from seq-counters/next-seq!) and
body (encode-body) are computed first, then call sqs-invoke! inside a try/catch
and always assoc the op with :value [group seq-num]; on success set :type :ok,
on exception set :type :info (optionally include the exception under :error) so
collect-sends can see the tuple; apply the same pattern to the similar send code
around the other block referenced (lines 239-264).
- Line 58: The drain phase default is shorter than the message visibility
timeout, so update the default drain-window used in the drain phase (the
drain-phase/default drain duration configured around lines ~428-429) to be at
least visibility-timeout-seconds (the var defn named
visibility-timeout-seconds), or compute the drain timeout as (max
existing-default visibility-timeout-seconds). If the drain window is specified
in milliseconds convert visibility-timeout-seconds to ms before using it, and
replace the hard-coded 5s default with a value derived from
visibility-timeout-seconds so deleted-but-not-actually-removed messages can
reappear before the checker runs.
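A sketch of the pattern the first prompt item asks for: compute the tuple before the network call so both the `:ok` and `:info` paths carry it. `next-seq!`, `encode-body`, and `sqs-invoke!` are names that appear in this thread; everything else is assumed.

```clojure
;; Hedged sketch, not the PR's code: the op always carries
;; :value [group seq-num], whatever the send outcome.
(defn send! [sqs queue-url op group]
  (let [seq-num (next-seq! group)
        op'     (assoc op :value [group seq-num])]      ; tuple fixed up front
    (try
      (sqs-invoke! sqs :SendMessage
                   {:QueueUrl       queue-url
                    :MessageGroupId (str group)
                    :MessageBody    (encode-body group seq-num)})
      (assoc op' :type :ok)
      (catch clojure.lang.ExceptionInfo e
        ;; Uncertain commit: keep the tuple so the checker can see it.
        (assoc op' :type :info :error (ex-message e))))))
```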
📒 Files selected for processing (5)
- jepsen/project.clj
- jepsen/src/elastickv/db.clj
- jepsen/src/elastickv/jepsen_test.clj
- jepsen/src/elastickv/sqs_htfifo_workload.clj
- jepsen/test/elastickv/sqs_htfifo_workload_test.clj
Addresses CodeRabbit / Gemini / Claude findings on PR #738. Severity-ordered:

Major (CodeRabbit + Gemini both flagged):

1. Drain phase shorter than visibility timeout. Default drain-time was max(5, time-limit/6); for the default 30s test that's 5s, while visibility-timeout-seconds is 30s. A delete that fails or a worker killed mid-receive leaves messages invisible for 30s; if the drain phase is only 5s, those messages can never reappear before the checker runs and they get reported as :lost. Fix: max with visibility-timeout-seconds.

High (Gemini):

2. DeleteMessage failure was silently swallowed and the recv op marked :ok, so the undeleted message reappearing in a later :recv was scored as a duplicate. Real SQS contract: a message becomes "received" (and the no-duplicate-delivery contract activates) only after a successful DeleteMessage acks it. Fix: only successfully-deleted tuples enter :value (so the checker sees them); failed deletes are logged via clojure.tools.logging/warn and dropped from :value (the message will reappear naturally — that's correct SQS redelivery, not a duplicate the checker should flag).

Major (CodeRabbit) — defensive, no immediate bug:

3. :send exceptions previously dropped the [group seq-num] tuple from the op's :value. Today the checker's "lost" formula is committed-only (committed and in-flight are disjoint by construction), so this didn't false-positive — but the standard Jepsen pattern is to keep :value on every op so the history stays usable for future analysis. Fix: localize the :send try/catch so the op carries :value [group seq-num] through both :ok and :info paths; on exception, re-raise with the enriched op stashed in ex-data so the outer error-classification cond preserves the tuple while still routing through the existing :info / :fail branches.

P2 polish (Claude):

4. Dead in-flight subtraction in the loss formula. committed and in-flight are disjoint by construction (next-seq! assigns each tuple exactly once and an op is either :ok or :info), so the subtraction was a no-op. Removed; the comment now explains the disjoint invariant rather than implying the subtraction does work.

5. Docstring said "monotonically non-decreasing" while the checker uses `(>= p c)`, which enforces strictly increasing. Updated the docstring to match the implementation and noted that equal seqs would already be flagged as duplicates by contract 3.

6. collect-receives sort key was just :time. Two workers polling at the same wall-clock nanosecond, or two messages within one batch sharing the op's :time, could produce a non-deterministic order that spuriously trips the per-group ordering check. Added :process and a per-batch :index to the sort key (juxt :time :process :index) so within-batch order is preserved (matches the server's FIFO response order) and cross-process ties are stable.

Test additions:
- TestSQSHTFIFO_checker-info-recv-is-ignored — pins that an :info recv (delete failed mid-batch) is dropped by collect-receives, so the next :ok recv records the redelivery without a duplicate flag.
- TestSQSHTFIFO_checker-same-time-batch-preserves-within-group-order — pins the new :index sort key by feeding a batch of three same-time same-group messages and asserting ordering-violations stays empty.
- send-op / recv-op test helpers now include :process so synthetic histories satisfy the new sort key.

Caller audit per the standing semantic-change rule:
- :recv :value semantics now mean "tuples whose delete acked"; consumer is collect-receives only, single caller. Aligns with the SQS at-least-once-delivery / at-most-once-after-ack model.
- collect-sends in-flight set semantics unchanged; the checker reports its size for diagnostics but no longer subtracts it from loss.
- :send exception path now uses ex-data :enriched-op as a stash; the outer catch is the only consumer.

Tests: lein test elastickv.sqs-htfifo-workload-test passes 13/32. Full non-redis suite (sqs / dynamodb / dynamodb-types / s3 / cli) passes 23/46.

Refs: PR #738 review thread; Claude review at run 25331884279.
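A sketch of the committed-only loss rule described above; `lost-messages` and the exact filters are illustrative, not the checker's real code:

```clojure
(require '[clojure.set :as set])

;; Hedged sketch: lost = committed sends that never show up in an :ok recv.
;; :info sends are possibly-committed, so they are excluded on both sides.
(defn lost-messages [history]
  (let [committed (->> history
                       (filter #(and (= :send (:f %)) (= :ok (:type %))))
                       (map :value)
                       set)
        received  (->> history
                       (filter #(and (= :recv (:f %)) (= :ok (:type %))))
                       (mapcat :value)
                       set)]
    (set/difference committed received)))
```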
@claude review |
Claude finished @bootjp's task in 4m 10s — Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b)
Overall the implementation is solid. The previous review round's findings (dead in-flight subtraction, delete-failure handling, :send tuple retention, docstring precision, and the collect-receives sort key) have been addressed.

What looks good
Findings

1. Drain window equals (not exceeds) visibility timeout — narrow race window

```clojure
drain-time (or (:drain-time opts)
               (max visibility-timeout-seconds
                    (max 5 (quot time-limit 6))))
```

With defaults (time-limit 30, visibility-timeout-seconds 30) the drain window is exactly 30s: a message that becomes invisible right at drain-start reappears at exactly drain-end or later, so the checker can still score it as :lost while the server owns the message. The fix from the previous review made drain ≥ visibility timeout, which is necessary but not sufficient. Adding a small buffer (e.g., +10s) closes the race:

```clojure
drain-time (or (:drain-time opts)
               (+ visibility-timeout-seconds 10))
```

This replaces the `(max 5 (quot time-limit 6))` floor, which the +10s buffer dominates at any reasonable time-limit.

2. Test queue persists across runs (latent, --no-cluster mode).
Three findings on round 2; all addressed plus a docstring nit.
Major:
1. Drain window equals (not exceeds) the visibility timeout. With
defaults (time-limit=30, visibility-timeout-seconds=30), drain
was max(30, max(5, 5)) = 30s. A message that becomes invisible
right at drain-start reappears at exactly drain-end (or later),
and gen/time-limit can fire before the next :recv surfaces it —
the checker would then report :lost while the server still owns
the message. Fix: drain-time = visibility-timeout-seconds + 10s.
The +10s buffer also dominates the (max 5, time-limit/6) short-
test floor at any reasonable time-limit, so the floor is gone.
Major (latent, --no-cluster mode):
2. Test queue persists across runs. The previous queue-name was a
top-level constant ("jepsen-htfifo.fifo"), so re-running the
workload against the same cluster (e.g. --no-rebuild --no-cluster)
left old messages in the queue. They'd appear in the receive
history without corresponding committed sends in the current
run, inflating :received and — because seqs reset to 0 each
run — potentially trip per-group ordering checks. Worse, the
5-minute ContentBasedDeduplication window could block current-
run sends that match prior-run bodies. Fix: per-run unique
queue name "jepsen-htfifo-<millis>.fifo" generated at workload
construction time and stashed on the HTFIFOClient record so
every worker's setup! converges on the same name. New
:queue-name opt provides an explicit override knob.
Caller audit per the standing semantic-change rule:
- queue-name was a private def used only in create-htfifo-queue!
and the setup! info log. Both are inside the same namespace.
After the change, both read the value from the client record.
No external callers; no test changes needed (the unit tests
exercise the checker on synthetic histories, not the queue
setup path).
- create-htfifo-queue! gained a queue-name parameter; the only
caller is setup!.
P3 polish:
3. when-let on (str/split body #":" 2) was misleading — str/split
never returns nil, so when-let always entered the body. The
actual nil-safety lived in the str/blank? checks below. Replaced
with plain let so the intent is explicit (str/blank? handles the
nil seq-str from a no-colon body correctly).
Nit (Claude):
4. Added a docstring to elastickv-sqs-htfifo-test in jepsen_test.clj
pointing at the workload's own -main as the entry point, matching
the existing dynamodb / s3 pattern. Top-level -main still
dispatches Redis only; this is consistent and intentional.
Tests: lein test elastickv.sqs-htfifo-workload-test passes 13/32.
Full non-redis suite (sqs / dynamodb / s3 / cli) passes 23/46.
Refs: PR #738 round 2 review at run 25333566777.
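A minimal sketch of the per-run naming described in finding 2; the function name is illustrative, while the `jepsen-htfifo-<millis>.fifo` scheme comes from the commit text:

```clojure
;; Hedged sketch: a fresh queue name per run keeps stale messages and the
;; 5-minute dedup window of a previous run out of this run's history.
(defn fresh-queue-name []
  (str "jepsen-htfifo-" (System/currentTimeMillis) ".fifo"))
```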
@claude review |
Claude finished @bootjp's task in 4m 39s — Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 3
All previous findings from rounds 1 and 2 have been addressed correctly. The code is in good shape. Full audit below.

Previous findings — status

Round 1 (all addressed):
Round 2 (all addressed):
Fresh analysis — round 3

What looks good
One minor observation
[nil "--drain-time SECONDS" "Receive-only drain phase after the main generator finishes."
:default nil
:parse-fn #(Integer/parseInt %)]A user running "Receive-only drain phase after the main generator finishes (default: visibility-timeout + 10s = 40s)."No behavior change; purely a UX improvement. Fix this → Tests coverage checkThe 11 tests in
All three contracts are covered by direct tests. The SummaryNo blocking issues found in round 3. The one observation above (CLI help text for |
Round 3 review verdict: ready to merge. One cosmetic nit: the --drain-time CLI option had :default nil in the spec, so the computed default (visibility-timeout + 10s = 40s) was invisible to users running --help. Updated the description string to spell it out. No behaviour change. Refs: PR #738 round 3 review at run 25333949895.
@claude review |
Claude finished @bootjp's task in 3m 2s — Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 4
All findings from rounds 1–3 have been resolved correctly. Full independent audit below.

Round 3 finding — status

✅ Addressed in commit.

Fresh analysis — round 4

What looks good
One very minor observation
```clojure
(repeat (max 1 (Math/round (* 10.0 (double send-fraction)))) send-op)
```

When `send-fraction` sits at either extreme, the `(max 1 ...)` clamp still keeps at least one slot of each op type in the ten-slot generator mix, so the ratio is approximate at the boundaries. Cosmetic; not a blocker.

Summary

No blocking issues found in round 4. The round 3 cosmetic finding (CLI help text) is addressed. The implementation is correct across all three contracts, the error classification matches Jepsen's standard conventions, and the test coverage is thorough. Ready to merge.
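For context on the observation above, a sketch of a clamped ten-slot mix. `send-op` / `recv-op` are names from this thread, `gen/mix` is Jepsen's standard generator combinator, and the slot arithmetic is an assumption:

```clojure
(require '[jepsen.generator :as gen])

;; Hedged sketch: with send-fraction 1.0 this still yields 10 send slots
;; and 1 recv slot (~9% receives), matching the clamp behaviour described
;; later in the thread.
(defn op-mix [send-fraction send-op recv-op]
  (let [send-slots (max 1 (Math/round (* 10.0 (double send-fraction))))
        recv-slots (max 1 (- 10 send-slots))]
    (gen/mix (vec (concat (repeat send-slots send-op)
                          (repeat recv-slots recv-op))))))
```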
Adds the new SQS HT-FIFO workload (PR #738) to the GitHub Actions Jepsen runs so it gets exercised on every push and on the self-hosted VM lane.

- .github/workflows/jepsen-test.yml (per-push, ubuntu-latest):
  - Cluster startup adds --sqsAddress 127.0.0.1:6350N for each of the 3 nodes plus --raftSqsMap for the leader-SQS lookup table. The new port band 63501-63503 keeps SQS clear of redis (63791-3), dynamo (63801-3), and s3 (63901-3).
  - The wait-for-listeners loop now also checks the SQS ports.
  - New step "Run SQS HT-FIFO Jepsen workload against elastickv" runs `lein run -m elastickv.sqs-htfifo-workload --local --time-limit 5 --rate 5 --concurrency 5 --partition-count 4 --group-count 6 --sqs-ports 63501,63502,63503 --host 127.0.0.1` with a 120s outer timeout (3min job timeout). Same shape as the redis / dynamodb / s3 workload steps.
- .github/workflows/jepsen.yml (VM-based, self-hosted):
  - New step "Run SQS HT-FIFO Jepsen workload" inside the Vagrant ctrl node, using the same nodes / time-limit / rate / faults dispatch inputs as the redis / dynamodb / s3 steps.

Routing flows through jepsen/src/elastickv/db.clj's existing :sqs-port + :sqs-region wiring (added in PR #738).

Not in scope (follow-up): .github/workflows/jepsen-test-scheduled.yml (the 6-hourly stress run) currently launches via cmd/server/demo.go, which doesn't speak SQS yet. Wiring HT-FIFO into the scheduled stress run would require either adding the SQS adapter to demo.go or switching the scheduled lane to the manual binary launch pattern that jepsen-test.yml uses. That change is larger than this PR's CI-wiring scope and is left as a follow-up.

The capability gate (PR 5b-3, #734) accepts CreateQueue with PartitionCount > 1 on this single-shard 3-node cluster because no --sqsFifoPartitionMap is supplied: validateHTFIFORoutingCoverage returns nil when partitionResolver is nil, and the peer poll succeeds because every node is the same binary. The data plane exercises all the partitioned-FIFO key shapes (msg, vis, group, dedup) and the new partition-aware dispatch helpers; the storage groups still consolidate to the default Raft group on this topology.

Refs: docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §11 PR 7.
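The :sqs-port / :sqs-region wiring is optional by design; a hedged sketch of how such args can be threaded. Only `--sqsAddress` appears in the text above; `--sqsRegion`, `base-args`, and `node-port` are assumptions, not the repo's code:

```clojure
;; Hedged sketch, not the repo's start-node!: optional :sqs-port /
;; :sqs-region only contribute args when present, so non-SQS test
;; specs build byte-identical command lines.
(defn start-node-args [node {:keys [sqs-port sqs-region] :as spec}]
  (cond-> (base-args node spec)                        ; assumed helper
    sqs-port   (conj "--sqsAddress"
                     (str "127.0.0.1:" (node-port sqs-port node)))
    sqs-region (conj "--sqsRegion" sqs-region)))       ; flag name assumed
```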
@claude review |
Claude finished @bootjp's task in 4m 0s — Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 5
All findings from rounds 1–4 have been addressed. Full independent audit below.

Round 4 finding — status

✅ Addressed in commit.

Fresh analysis — round 5

What looks good
Findings

1. CI timeout is tight given the 40s drain phase

```yaml
timeout-minutes: 3
run: |
  timeout 120 ~/lein run -m elastickv.sqs-htfifo-workload --local \
    --time-limit 5 --rate 5 --concurrency 5 ...
```

The active test window is time-limit (5s) + drain-time (40s default) = 45s, plus Lein JVM startup, inside the 120s shell timeout. For local runs the nemesis is a no-op, so no message can become invisible due to a partition or kill; the 40s drain default buys nothing here. Passing --drain-time 15 keeps the same correctness with comfortable headroom:

```sh
timeout 120 ~/lein run -m elastickv.sqs-htfifo-workload --local \
  --time-limit 5 --rate 5 --concurrency 5 \
  --partition-count 4 --group-count 6 \
  --drain-time 15 \
  --sqs-ports 63501,63502,63503 --host 127.0.0.1
```

2. Minor: the --send-fraction help text reads as an exact probability, but the (max 1 ...) clamp keeps both op types in the mix even at 0.0 / 1.0; the description should mention the clamping.
Round 5 verdict: Ready to merge. Two findings, both addressed:

1. (Actionable) CI shell timeout was tight against the workload's default 40s drain phase. Active test window was time-limit (5s) + drain-time (40s) = 45s, plus ~15-20s Lein JVM startup ≈ 60s inside a 120s shell timeout. Pass --drain-time 15 in the CI step: in --local mode the nemesis is a no-op, so no message can become invisible due to partition/kill, making the 40s default (which exists to outlast the visibility-timeout window during real faults) overkill. 15s gives the same correctness with comfortable headroom under the 120s shell timeout.

2. (Pre-existing minor) The --send-fraction CLI description said "Probability a generator op is :send", but the implementation clamps both ends with (max 1 ...) to prevent generator starvation, so 0.0 still emits ~9% sends and 1.0 still emits ~9% receives. Updated the description string to spell out the clamping at the boundaries.

Tests: lein test elastickv.sqs-htfifo-workload-test passes 13/32. YAML lint clean.

Refs: PR #738 round 5 review at run 25335805377.
@claude review |
Claude finished @bootjp's task in 3m 12s — Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 6
Round 5 finding — status

✅ Addressed in commit.

Fresh analysis — round 6

What looks good
Findings

No new blocking issues. One minor observation — a genuine edge case, but well within acceptable bounds for a test workload.
Summary
Phase 3.D PR 7b — Jepsen HT-FIFO workload that stresses partitioned-FIFO queues against the three contracts AWS HT-FIFO is supposed to honour even under partition and node-loss faults: within-group ordering, no message loss, no duplicates.
Pattern follows aphyr's Jepsen RabbitMQ analysis: track every `:send` and `:recv` in the operation history, then a custom checker verifies the contracts against the recorded events at the end of the run.

What's in this PR
- `jepsen/project.clj` — Adds `com.cognitect.aws/sqs` at the same version as the existing dynamodb dep, so the SDK wire protocol (auth, retry classification, error parsing) is exercised end-to-end against elastickv rather than a hand-rolled HTTP layer.
- `jepsen/src/elastickv/db.clj` — Extends `start-node!` to accept `:sqs-port` (port spec like `:dynamo-port`) and `:sqs-region`. Both are optional, so existing dynamodb / s3 / redis test specs are byte-identical at the args level when `sqs-port` is absent.
- `jepsen/src/elastickv/jepsen_test.clj` — Registers `elastickv-sqs-htfifo-test` alongside the other workloads.
- `jepsen/src/elastickv/sqs_htfifo_workload.clj` (new, ~430 lines) — The workload. Uses cognitect/aws-api SQS, creates an HT-FIFO queue with `PartitionCount=4` + `ContentBasedDeduplication`, runs sends and receives across N `MessageGroupId` values, and the custom `ht-fifo-checker` validates the three contracts.
- `jepsen/test/elastickv/sqs_htfifo_workload_test.clj` (new) — Pure-function tests for the checker plus integration smoke tests for the test-spec builder. 11 tests / 27 assertions.
For each
MessageGroupIdindependently:seqvalues, sorted by global completion time across all consumers, is monotonically non-decreasing.(group, seq)successfully:senteventually appears in the:recvhistory. Sends with:infostatus are treated as possibly-committed and not counted as lost.(group, seq)appears at most once in the:recvhistory.ContentBasedDeduplicationon the queue + a unique(group, seq)body is what enforces this server-side; a duplicate here is a real bug (e.g. a deletion that did not commit).Open-endpoint mode
The elastickv server starts without
--sqsCredentialsFile, so the SQS adapter accepts any signed request (mirroring how the S3 adapter is wired in jepsen today). The SDK client signs with dummy credentials, so the SigV4 path still exercises end-to-end at the protocol level.Self-review (5 lenses)
seq-counteris anatomupdated viaswap!(CAS-based), so concurrent sends from different worker threads always assign distinct seqs. The checker is pure; no shared mutable state.:info(uncertain commit) are correctly excluded from the loss set, matching Jepsen's standard approach.lein run -m elastickv.sqs-htfifo-workload.Test plan
lein test elastickv.sqs-htfifo-workload-test— 11 tests / 27 assertions passlein testfor non-redis suite (dynamodb / dynamodb-types / s3 / cli / sqs-htfifo) — 21 tests / 41 assertions passThe
elastickv.redis-workloadnamespace fails to load due to the emptyredis/src/tree, which is pre-existing on main and unrelated to this PR.Out of scope (next milestones)
scripts/run-jepsen-local.sh— the existing script is dynamodb-only; an sqs counterpart lands as a follow-up.PartitionCount=4routes to the default group on a single-shard cluster — partitioning logic (different keys per partition, ordering preserved within group) is fully exercised, but the cross-shard scaling story is gated on separate work.*_proposed_*.md→*_partial_*.md) — that is §11 PR 8 in the design doc and is tracked separately.Refs
docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md§11 PR 7.Summary by CodeRabbit
New Features
Tests
Chores