
feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) #738

Merged
bootjp merged 6 commits into main from feat/sqs-jepsen-htfifo on May 4, 2026

Conversation

bootjp (Owner) commented May 4, 2026

Summary

Phase 3.D PR 7b — Jepsen HT-FIFO workload that stresses partitioned-FIFO queues against the three contracts AWS HT-FIFO is supposed to honour even under partition and node-loss faults: within-group ordering, no message loss, no duplicates.

Pattern follows aphyr's Jepsen RabbitMQ analysis: track every :send and :recv in the operation history, then a custom checker verifies the contracts against the recorded events at the end of the run.

What's in this PR

  • jepsen/project.clj — Adds com.cognitect.aws/sqs at the same version as the existing dynamodb dep, so the SDK wire protocol (auth, retry classification, error parsing) is exercised end-to-end against elastickv rather than a hand-rolled HTTP layer.
  • jepsen/src/elastickv/db.clj — Extends start-node! to accept :sqs-port (port spec like :dynamo-port) and :sqs-region. Both are optional, so existing dynamodb / s3 / redis test specs are byte-identical at the args level when sqs-port is absent.
  • jepsen/src/elastickv/jepsen_test.clj — Registers elastickv-sqs-htfifo-test alongside the other workloads.
  • jepsen/src/elastickv/sqs_htfifo_workload.clj (new, ~430 lines) — The workload. Uses cognitect/aws-api SQS, creates an HT-FIFO queue with PartitionCount=4 + ContentBasedDeduplication, runs sends and receives across N MessageGroupId values, and the custom ht-fifo-checker validates the three contracts.
  • jepsen/test/elastickv/sqs_htfifo_workload_test.clj (new) — Pure-function tests for the checker plus integration smoke tests for the test-spec builder. 11 tests / 27 assertions.

Checker contracts

For each MessageGroupId independently (a minimal checker sketch in Clojure follows this list):

  1. Within-group ordering — the sequence of received seq values, sorted by global completion time across all consumers, is monotonically non-decreasing.
  2. No loss — every (group, seq) successfully :sent eventually appears in the :recv history. Sends with :info status are treated as possibly-committed and not counted as lost.
  3. No duplicates — every (group, seq) appears at most once in the :recv history. ContentBasedDeduplication on the queue + a unique (group, seq) body is what enforces this server-side; a duplicate here is a real bug (e.g. a deletion that did not commit).
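
A minimal Clojure sketch of how these three contracts can be checked once the send/receive tuples are collected (illustrative only, not the PR's checker; committed is the set of [group seq] tuples from :ok sends, received a seq of {:group g :seq s} maps already sorted by completion time):

(require '[clojure.set :as cset])

(defn check-ht-fifo [committed received]
  (let [tuples    (map (juxt :group :seq) received)
        ;; Contract 2 (no loss): every committed tuple must be received.
        lost      (cset/difference committed (set tuples))
        ;; Contract 3 (no duplicates): no tuple received more than once.
        dups      (keep (fn [[t n]] (when (> n 1) t)) (frequencies tuples))
        ;; Contract 1 (within-group ordering): per group, seqs in receive
        ;; order must never go backwards.
        unordered (keep (fn [[g evs]]
                          (let [seqs (map :seq evs)]
                            (when (some true? (map > seqs (rest seqs))) g)))
                        (group-by :group received))]
    {:valid?              (and (empty? lost) (empty? dups) (empty? unordered))
     :lost                lost
     :duplicates          dups
     :ordering-violations unordered}))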

Open-endpoint mode

The elastickv server starts without --sqsCredentialsFile, so the SQS adapter accepts any signed request (mirroring how the S3 adapter is wired in jepsen today). The SDK client signs with dummy credentials, so the SigV4 path is still exercised end-to-end at the protocol level.
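
For illustration, a cognitect aws-api client pointed at a local elastickv SQS endpoint with dummy static credentials might be built like this (a sketch only; the region, host, and port values here are assumptions, not taken from the PR):

(require '[cognitect.aws.client.api :as aws]
         '[cognitect.aws.credentials :as credentials])

;; Requests are still SigV4-signed with the dummy key pair, so the signing
;; path is exercised even though the open-endpoint server accepts any
;; signed request.
(def sqs
  (aws/client {:api                  :sqs
               :region               "us-east-1"
               :endpoint-override    {:protocol :http
                                      :hostname "127.0.0.1"
                                      :port     63501}
               :credentials-provider (credentials/basic-credentials-provider
                                      {:access-key-id     "dummy"
                                       :secret-access-key "dummy"})}))

;; e.g. (aws/invoke sqs {:op :GetQueueUrl :request {:QueueName "jepsen-htfifo.fifo"}})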

Self-review (5 lenses)

  1. Data loss — N/A; this is a test-only PR. The workload's whole purpose is to detect data loss in the system under test.
  2. Concurrency — The shared per-group seq-counter is an atom updated via swap! (CAS-based), so concurrent sends from different worker threads always assign distinct seqs (a small sketch of this pattern follows the list). The checker is pure; no shared mutable state.
  3. Performance — Test-only code, runs at low rate (5 ops/sec/worker). Not on any hot path.
  4. Data consistency — The checker compares committed sends against the receive history globally, so all the consistency assertions are at end-of-run with a complete picture. Sends with :info (uncertain commit) are correctly excluded from the loss set, matching Jepsen's standard approach.
  5. Test coverage — 11 unit tests for the checker pin the contract surface (clean / loss / info-not-loss / duplicates / within-group ordering / cross-group interleaving / failed-send-not-counted / empty-receive). Integration smoke tests pin the test-spec builder. The workload itself is exercised end-to-end on a real cluster via lein run -m elastickv.sqs-htfifo-workload.
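
A minimal sketch of the swap!-based per-group counter mentioned in lens 2 (the real next-seq! lives in sqs_htfifo_workload.clj; this only illustrates the fetch-and-increment shape):

;; One shared atom mapping group -> how many seqs have been handed out.
(def seq-counters (atom {}))

(defn next-seq!
  "Atomically assigns the next seq for group; concurrent callers always
  get distinct values because swap! retries on contention."
  [counters group]
  ;; swap! returns the updated map; the value just assigned is (dec count),
  ;; so seqs start at 0.
  (dec (get (swap! counters update group (fnil inc 0)) group)))

;; (next-seq! seq-counters "g0") ;=> 0, then 1, 2, ...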

Test plan

  • lein test elastickv.sqs-htfifo-workload-test — 11 tests / 27 assertions pass
  • lein test for non-redis suite (dynamodb / dynamodb-types / s3 / cli / sqs-htfifo) — 21 tests / 41 assertions pass
  • End-to-end live cluster run — operator-driven (out of scope for the merge gate; relies on a 3-node cluster setup)

The elastickv.redis-workload namespace fails to load due to the empty redis/src/ tree, which is pre-existing on main and unrelated to this PR.

Out of scope (next milestones)

  • Wiring the workload into scripts/run-jepsen-local.sh — the existing script is dynamodb-only; an sqs counterpart lands as a follow-up.
  • Multi-shard cluster topology that lands distinct partitions on distinct Raft groups. This PR's PartitionCount=4 routes to the default group on a single-shard cluster — partitioning logic (different keys per partition, ordering preserved within group) is fully exercised, but the cross-shard scaling story is gated on separate work.
  • Design-doc lifecycle rename (*_proposed_*.md → *_partial_*.md) — that is §11 PR 8 in the design doc and is tracked separately.

Refs

  • docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §11 PR 7

Summary by CodeRabbit

  • New Features

    • Added AWS SQS integration with HT‑FIFO support, SQS port/region configuration, and runtime options to exercise FIFO dedupe/order semantics
  • Tests

    • Added comprehensive unit and workload tests validating ordering, no‑loss, no‑duplicates, and option handling
  • Chores

    • CI updated to run the SQS HT‑FIFO workload as part of Jepsen test runs

Adds the Jepsen workload that stresses partitioned-FIFO (HT-FIFO)
queues against the three contracts AWS HT-FIFO is supposed to honour
even under partition and node-loss faults — within-group ordering,
no message loss, no duplicates.

Pattern follows aphyr's classic Jepsen RabbitMQ analysis (the user
pointed at https://aphyr.com/posts/315-jepsen-rabbitmq as a starting
point): track every :send and :recv in the history, then a custom
checker verifies the contracts against the recorded events.

Files:

- jepsen/project.clj — adds com.cognitect.aws/sqs at the same version
  as the existing dynamodb dep, so the SDK wire protocol (auth, retry
  classification, error parsing) is exercised end-to-end against
  elastickv rather than a hand-rolled HTTP layer.
- jepsen/src/elastickv/db.clj — extends start-node! to accept
  :sqs-port (port spec like dynamo-port) and :sqs-region. Both are
  optional so existing dynamodb / s3 / redis test specs are
  byte-identical at the args level when sqs-port is absent.
- jepsen/src/elastickv/jepsen_test.clj — registers the new workload
  under elastickv-sqs-htfifo-test alongside the other workloads.
- jepsen/src/elastickv/sqs_htfifo_workload.clj — the workload itself
  (~430 lines). Uses cognitect/aws-api SQS, creates an HT-FIFO queue
  with PartitionCount=4 + ContentBasedDeduplication, runs sends and
  receives across N MessageGroupId values, and the custom
  ht-fifo-checker validates the three contracts.
- jepsen/test/elastickv/sqs_htfifo_workload_test.clj — pure-function
  tests for the checker (synthetic histories pin clean / loss /
  duplicate / ordering-violation / cross-group / info-send-not-loss /
  failed-send-not-counted / empty-receive cases) and integration
  smoke tests for the test-spec builder.

Open-endpoint setup: the elastickv server starts without
--sqsCredentialsFile so the SQS adapter accepts any signed request
(matching how the S3 adapter is wired in jepsen today). The SDK
client signs with dummy credentials so the SigV4 path still
exercises end-to-end at the protocol level.

Tests run: lein test elastickv.sqs-htfifo-workload-test passes 11
tests / 27 assertions. Full non-redis suite (sqs / dynamodb /
dynamodb-types / s3 / cli) passes 21 tests / 41 assertions. The
elastickv.redis-workload namespace fails to load due to the empty
redis/src/ tree, which is pre-existing on main and unrelated to
this PR.

Out of scope for this PR (next milestones):
- Wiring the workload into scripts/run-jepsen-local.sh — the
  existing script is dynamodb-only; an sqs counterpart can land as
  a follow-up.
- Multi-shard cluster topology that lands distinct partitions on
  distinct Raft groups. This PR's PartitionCount=4 routes to the
  default group on a single-shard cluster — partitioning logic
  (different keys per partition, ordering preserved within group)
  is fully exercised, but the cross-shard scaling story is gated
  on separate work.
- Design-doc lifecycle rename (proposed → partial) — that is §11
  PR 8 in the design doc and is tracked separately.

Refs: docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §11 PR 7.
chatgpt-codex-connector: Codex usage limit reached for code reviews; no review posted.

bootjp (Owner, Author) commented May 4, 2026

@claude review

coderabbitai (bot) commented May 4, 2026

Warning: Rate limit exceeded. @bootjp has exceeded the hourly limit for commits that can be reviewed; another review can be requested with the @coderabbitai review command after 36 minutes and 59 seconds, or by pushing new commits to this PR.


📥 Commits

Reviewing files that changed from the base of the PR and between 52c6e41 and de3eb7c.

📒 Files selected for processing (2)
  • .github/workflows/jepsen-test.yml
  • jepsen/src/elastickv/sqs_htfifo_workload.clj
📝 Walkthrough

Adds an SQS HT‑FIFO Jepsen workload: new Cognitect SQS dependency, SQS startup wiring, a full workload (client, queue provisioning, send/recv semantics, checker), tests, and CI workflow steps to run the workload against elastickv.

Changes

  • Dependency (jepsen/project.clj): Adds com.cognitect.aws/sqs 847.2.1365.0 to project dependencies.
  • Node Startup (jepsen/src/elastickv/db.clj): start-node! destructuring accepts :sqs-port and :sqs-region; when present appends --sqsAddress and optionally --sqsRegion to server startup args.
  • Workload Implementation, core (jepsen/src/elastickv/sqs_htfifo_workload.clj): New namespace implementing HT‑FIFO workload: SQS client factory, idempotent queue creation with FIFO/dedupe/partition attributes, per-group monotonic seq counters, message encode/decode, HTFIFOClient with :send/:recv semantics and anomaly classification.
  • Workload Implementation, checker & generator (jepsen/src/elastickv/sqs_htfifo_workload.clj): Custom ht-fifo-checker validating within-group strict ordering, no loss, no duplicates; mixed send/recv generator and drain phase; CLI options and -main wiring.
  • Test Entrypoint Wiring (jepsen/src/elastickv/jepsen_test.clj): Requires elastickv.sqs-htfifo-workload and adds elastickv-sqs-htfifo-test that delegates to the workload test wrapper.
  • Unit Tests (jepsen/test/elastickv/sqs_htfifo_workload_test.clj): New tests validating workload spec/defaults/options, client creation with host override, and extensive checker unit tests (valid history, loss, in-flight/info handling, duplicates, ordering, cross-group behavior, failed sends, empty receives, same-time batch order).
  • CI / Workflows (.github/workflows/jepsen-test.yml, .github/workflows/jepsen.yml): CI updated to include SQS RAFT mapping and --sqsAddress args, readiness checks for SQS ports, and new workflow steps to run the SQS HT‑FIFO workload against ports 63501–63503.

sequenceDiagram
    participant Runner as Jepsen Test Runner
    participant Client as HTFIFO Client
    participant SQS as SQS Service
    participant Server as elastickv Server
    participant Checker as HT-FIFO Checker

    Runner->>Client: setup (create queue)
    Client->>SQS: CreateQueue (FIFO, dedupe, partitions)
    SQS-->>Client: QueueUrl
    Client-->>Runner: ready

    loop workload
        Runner->>Client: invoke :send
        Client->>Client: choose group, next seq
        Client->>SQS: SendMessage (MessageGroupId, body "group:seq")
        SQS-->>Client: SendMessageResponse

        Runner->>Client: invoke :recv
        Client->>SQS: ReceiveMessage (batch)
        SQS-->>Client: Messages
        Client->>Client: decode bodies -> (group,seq)
        Client->>SQS: DeleteMessage (per message)
        SQS-->>Client: DeleteResponse
        Client-->>Runner: received tuples
    end

    Runner->>Checker: submit history
    Checker->>Checker: collect sends/receives
    Checker->>Checker: validate ordering, loss, duplicates
    Checker-->>Runner: report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped a queue with FIFO breeze,
Grouped my seeds by gentle decrees,
Seq numbers climbed, no double sight,
Jepsen checks the hops at night—
HT‑FIFO hums, the system's pleased.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b)' clearly and specifically summarizes the main change—adding a Jepsen HT-FIFO workload for SQS testing—which aligns with the primary contribution across all modified files.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


claude (bot) commented May 4, 2026

Claude finished @bootjp's task in 4m 5s —— View job


Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b)

  • Gather context and understand the request
  • Read changed files
  • Review sqs_htfifo_workload.clj (core workload logic)
  • Review sqs_htfifo_workload_test.clj (unit tests)
  • Review supporting changes (project.clj, db.clj, jepsen_test.clj)
  • Post comprehensive review feedback

The overall design is solid — the three-contract checker, atom-based seq counters, and Jepsen lifecycle wiring all look correct. A few findings below, ordered by severity.


Findings

1. Misleading loss formula — in-flight subtraction is dead code

sqs_htfifo_workload.clj:338-342

lost (cset/difference committed received-tuples in-flight)

The comment says "we exclude in-flight since their commit status is undefined," but committed and in-flight are always disjoint — every (group, seq) tuple is assigned by next-seq! exactly once, and a single op is either :ok (goes into committed) or :info (goes into in-flight), never both. So subtracting in-flight from committed is always a no-op.

In-flight sends are already correctly excluded from loss by not including them in committed in the first place. The dead subtraction and the comment together imply a property that isn't how the code actually works, which will confuse future readers.

Fix: Drop the in-flight argument from cset/difference and update the comment to say ":info sends are excluded from committed entirely, so they can't appear in lost." Fix this →
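
With the dead argument dropped, the binding collapses to a single set difference (a sketch of the suggested change, reusing the names from the snippet above):

;; :info sends never enter committed, so no in-flight subtraction is needed.
lost (cset/difference committed received-tuples)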


2. Docstring promises "non-decreasing" but implementation enforces strictly increasing

sqs_htfifo_workload.clj:11 (contract 1) and ordering-violations line 312:

;; Contract: "monotonically non-decreasing"
;;
;; Implementation:
(filter (fn [[p c]] (>= p c)) pairs)   ; flags when prev >= curr

(>= p c) means the ordering check reports a violation when the next seq equals the current one, i.e., it enforces strictly increasing, not just non-decreasing. Since seqs are monotone by construction and duplicates are caught separately, this isn't observable as a false positive today. But if the contract definition changes or the body encoding changes, this disagreement will cause confusion.

Fix: Update the docstring to say "strictly increasing" (not "non-decreasing"), since that's what the checker enforces. Fix this →


3. Swallowed DeleteMessage failure is not logged

sqs_htfifo_workload.clj:228-235

(catch clojure.lang.ExceptionInfo _
  ;; A failed delete leaves the message visible after the
  ;; visibility window — the next receive will see it again.
  nil)

The comment is correct about what happens next, but silently swallowing the exception makes it very hard to distinguish between "the server has a delete-side bug" and "the client had a transient network error before completing the delete." Both paths produce a duplicate in the checker, but the error rate in server logs (or a warn-level log here) is how you'd triage the root cause.

Fix: Add (warn e "DeleteMessage failed for receipt-handle" receipt-handle) before the nil. Fix this →


4. ordering-violations doesn't handle same-:time events within a batch deterministically

sqs_htfifo_workload.clj:289-300

(defn- collect-receives [history]
  (->> history
       (filter #(and (= :recv (:f %)) (= :ok (:type %))))
       (mapcat (fn [op]
                 (map (fn [[g s]] {:group g :seq s :time (:time op)}) (:value op))))
       (sort-by :time)))

When a single recv batch returns two or more messages from the same group, they all share the same :time. Clojure's sort-by is stable, so their relative order in the sorted sequence is the order from (:value op), which is the order in the server's response. This is actually what you want — AWS FIFO ordering within a batch is meaningful and the server's response order should reflect it.

The only risk is if sort-by on :time interleaves same-time events from different recv ops (e.g., two workers polling at identical timestamps). In practice Jepsen assigns wall-clock :time in nanos and collisions are rare, but it's not impossible. If two recv ops at the same time both return a g0 message, the relative order between them in the sorted list is undefined and could produce a spurious ordering violation.

Recommendation: Sort by (juxt :time :worker-id) or accept this as a known false-positive edge case and document it, since the ordering guarantee only applies within a single consumer's view.
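
For illustration, a deterministic tie-break could look like the sketch below (not the PR's code; it assumes each op carries the standard Jepsen :process field and adds an :index for a message's position within its batch):

(defn- collect-receives [history]
  (->> history
       (filter #(and (= :recv (:f %)) (= :ok (:type %))))
       (mapcat (fn [op]
                 (map-indexed (fn [i [g s]]
                                {:group g :seq s
                                 :time    (:time op)
                                 :process (:process op)
                                 :index   i})
                              (:value op))))
       ;; :process breaks ties between workers at the same nanosecond;
       ;; :index preserves the server's within-batch response order.
       (sort-by (juxt :time :process :index))))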


5. sqs-htfifo-workload.clj is missing from jepsen_test.clj -main

jepsen/src/elastickv/jepsen_test.clj:27-29

(defn -main [& args]
  (cli/run! (cli/single-test-cmd {:test-fn elastickv-test}) args))

-main only dispatches the Redis workload. The SQS workload has its own -main in sqs_htfifo_workload.clj, so it's accessible, but a user running lein run through jepsen_test.clj won't reach it. The existing dynamodb/s3 workloads are in the same situation (they also have their own -main), so this is consistent with the existing pattern. Worth documenting in a comment or the elastickv-sqs-htfifo-test docstring that the entry point is lein run -m elastickv.sqs-htfifo-workload.


6. Minor: decode-body split edge case

sqs_htfifo_workload.clj:163-164

(when-let [[group seq-str] (str/split body #":" 2)]

str/split with limit 2 on "g0" (no colon) returns ["g0"]. The destructuring [group seq-str] binds group="g0" and seq-str=nil. Then (not (str/blank? nil)) → false → returns nil correctly. This works, but it's an implicit reliance on str/blank? accepting nil. Adding a direct test case for the no-colon path would make this intent explicit (there's no unit test for decode-body directly since it's private, but a round-trip test via the checker for a history with a garbled body would catch regressions).
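
A direct test for the no-colon path could reach the private fn through its var, along these lines (test namespace and names are illustrative):

(ns elastickv.sqs-htfifo-workload-decode-test
  (:require [clojure.test :refer [deftest is]]
            [elastickv.sqs-htfifo-workload]))

(deftest decode-body-no-colon-returns-nil
  ;; decode-body is private, so call it through the var.
  (is (nil? (#'elastickv.sqs-htfifo-workload/decode-body "g0"))))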


What looks good

  • Checker math — The three-contract structure is clean and complete. collect-sends correctly separates :ok from :info; duplicate-receives and ordering-violations are straightforward and correct.
  • Atomic seq counter — next-seq! via swap! + dec is the idiomatic Clojure pattern for atomic fetch-and-increment. Sharing the atom across workers via the record field is correct.
  • Jepsen lifecycle — open! → per-worker SQS client, setup! → queue creation, close! → aws/stop. The queue-url propagation from setup! to workers (via assoc on the record before open! is called per worker) is handled correctly.
  • Error classification — The fault/unavailable/interrupted → :info, 5xx → :info, client-side rejection → :fail pattern matches Jepsen's standard approach and the existing dynamodb workload's style.
  • db.clj change — The sqs-port / sqs-region extension to start-node! is backward-compatible (both are under when), and the port-for helper handles both map and scalar forms correctly.
  • Test coverage — The 8 checker unit tests pin all three contracts plus the edge cases (info-not-loss, failed-not-counted, empty-receives, cross-group interleaving). Good baseline.
  • project.clj change — Adding com.cognitect.aws/sqs at the same version as the DynamoDB dep is the right approach.

Summary

Three findings worth addressing before merge: the dead in-flight subtraction (misleading comment + code), the docstring/implementation disagreement on "non-decreasing" vs "strictly increasing," and the silent delete failure. Finding 4 (same-timestamp ordering) is a latent false-positive risk worth documenting. Everything else is minor.

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a Jepsen workload for testing elastickv's SQS-compatible HT-FIFO queues, featuring a custom checker for ordering, loss, and duplicates. It also updates the database configuration to include SQS parameters. Feedback identifies two potential issues in the test logic: silent failure of message deletions which could cause false duplicate reports, and a drain phase that may be too short relative to the visibility timeout, potentially leading to false reports of lost messages.

Comment on lines +225 to +235
(try
(sqs-invoke! sqs :DeleteMessage
{:QueueUrl queue-url
:ReceiptHandle receipt-handle})
(catch clojure.lang.ExceptionInfo _
;; A failed delete leaves the message visible after the
;; visibility window — the next receive will see it again.
;; The checker will count it as a duplicate, which is the
;; correct signal: an at-least-once delivery on a FIFO
;; queue indicates a delete-side bug.
nil)))
Severity: high

The current error handling for DeleteMessage silently ignores all ExceptionInfo (including network/transport faults) and proceeds to mark the :recv operation as :ok. In a Jepsen environment where network partitions are expected, a failed delete will result in the message being redelivered by the SQS-compatible server after the visibility timeout. The checker (line 341) will then flag this as a duplicate and fail the test. Since redelivery on failed delete is standard SQS FIFO behavior, this leads to false positives. If a delete fails due to a transport error, the operation should be classified as :info (uncertain) rather than :ok. This can be achieved by letting the exception propagate to the outer catch block (line 239), which already correctly classifies transport anomalies.

            (sqs-invoke! sqs :DeleteMessage
                         {:QueueUrl      queue-url
                          :ReceiptHandle receipt-handle})

:shard-ranges (:shard-ranges opts)}))
rate (double (or (:rate opts) 5))
time-limit (or (:time-limit opts) 30)
drain-time (or (:drain-time opts) (max 5 (quot time-limit 6)))
Severity: medium

The default drain-time is too short (5s for a 30s test) compared to the visibility-timeout-seconds (30s). If a message is received by a worker that is subsequently partitioned or killed near the end of the main phase, the message will remain invisible and unavailable for redelivery until the visibility timeout expires. If the drain phase ends before this happens, the checker will incorrectly report the message as lost. To ensure all messages can be recovered, the drain phase should be at least as long as the visibility timeout.

Suggested change
drain-time (or (:drain-time opts) (max 5 (quot time-limit 6)))
drain-time (or (:drain-time opts) (max visibility-timeout-seconds (quot time-limit 6)))

coderabbitai (bot) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jepsen/src/elastickv/sqs_htfifo_workload.clj`:
- Around line 201-209: The :send branch drops the [group seq-num] tuple when
sqs-invoke! throws, causing collect-sends to misclassify transient failures;
modify the :send handling so group, seq-num (from seq-counters/next-seq!) and
body (encode-body) are computed first, then call sqs-invoke! inside a try/catch
and always assoc the op with :value [group seq-num]; on success set :type :ok,
on exception set :type :info (optionally include the exception under :error) so
collect-sends can see the tuple; apply the same pattern to the similar send code
around the other block referenced (lines 239-264).
- Line 58: The drain phase default is shorter than the message visibility
timeout, so update the default drain-window used in the drain phase (the
drain-phase/default drain duration configured around lines ~428-429) to be at
least visibility-timeout-seconds (the var defn named
visibility-timeout-seconds), or compute the drain timeout as (max
existing-default visibility-timeout-seconds). If the drain window is specified
in milliseconds convert visibility-timeout-seconds to ms before using it, and
replace the hard-coded 5s default with a value derived from
visibility-timeout-seconds so deleted-but-not-actually-removed messages can
reappear before the checker runs.


📥 Commits

Reviewing files that changed from the base of the PR and between 675adc3 and 40ab758.

📒 Files selected for processing (5)
  • jepsen/project.clj
  • jepsen/src/elastickv/db.clj
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/src/elastickv/sqs_htfifo_workload.clj
  • jepsen/test/elastickv/sqs_htfifo_workload_test.clj

Addresses CodeRabbit / Gemini / Claude findings on PR #738. Severity-
ordered:

Major (CodeRabbit + Gemini both flagged):
1. Drain phase shorter than visibility timeout. Default
   drain-time was max(5, time-limit/6); for the default 30s test
   that's 5s, while visibility-timeout-seconds is 30s. A delete
   that fails or a worker killed mid-receive leaves messages
   invisible for 30s; if the drain phase is only 5s, those
   messages can never reappear before the checker runs and they
   get reported as :lost. Fix: max with visibility-timeout-seconds.

High (Gemini):
2. DeleteMessage failure was silently swallowed and the recv op
   marked :ok, so the undeleted message reappearing in a later
   :recv was scored as a duplicate. Real SQS contract: a message
   becomes "received" (and the no-duplicate-delivery contract
   activates) only AFTER a successful DeleteMessage acks it.
   Fix: only successfully-deleted tuples enter :value (so the
   checker sees them), failed deletes are logged via
   clojure.tools.logging/warn and dropped from :value (the
   message will reappear naturally — that's correct SQS
   redelivery, not a duplicate the checker should flag).

Major (CodeRabbit) — defensive, no immediate bug:
3. :send exceptions previously dropped the [group seq-num] tuple
   from the op's :value. Today the checker's "lost" formula is
   committed-only (committed and in-flight are disjoint by
   construction), so this didn't false-positive — but the standard
   Jepsen pattern is to keep :value on every op so the history is
   reasonable for future analysis. Fix: localize the :send try/catch
   so the op carries :value [group seq-num] through both :ok and
   :info paths; on exception, re-raise with the enriched op stashed
   in ex-data so the outer error-classification cond preserves the
   tuple while still routing through the existing :info / :fail
   branches.

P2 polish (Claude):
4. Dead in-flight subtraction in the loss formula. committed and
   in-flight are disjoint by construction (next-seq! assigns each
   tuple exactly once and an op is either :ok or :info), so the
   subtraction was a no-op. Removed; comment now explains the
   disjoint invariant rather than implying the subtraction does
   work.
5. Docstring said "monotonically non-decreasing" while the checker
   uses `(>= p c)`, which enforces strictly increasing. Updated the
   docstring to match the implementation and noted that equal seqs
   would already be flagged as duplicates by contract 3.
6. collect-receives sort key was just :time. Two workers polling at
   the same wall-clock nanosecond, OR two messages within one batch
   sharing the op's :time, could produce a non-deterministic order
   that spuriously trips the per-group ordering check. Added
   :process and a per-batch :index to the sort key (juxt :time
   :process :index) so within-batch order is preserved (matches the
   server's FIFO response order) and cross-process ties are stable.

Test additions:
- TestSQSHTFIFO_checker-info-recv-is-ignored — pin that an :info
  recv (delete failed mid-batch) is dropped by collect-receives so
  the next :ok recv records the redelivery without a duplicate flag.
- TestSQSHTFIFO_checker-same-time-batch-preserves-within-group-order
  — pin the new :index sort key by feeding a batch of three same-time
  same-group messages and asserting ordering-violations stays empty.
- send-op / recv-op test helpers now include :process so synthetic
  histories satisfy the new sort key.

Caller audit per the standing semantic-change rule:
- :recv :value semantics now mean "tuples whose delete acked";
  consumer is collect-receives only, single caller. Aligns with the
  SQS at-least-once-delivery / at-most-once-after-ack model.
- collect-sends in-flight set semantics unchanged; the checker
  reports its size for diagnostics but does not subtract it from
  loss anymore.
- :send exception path now uses ex-data :enriched-op as a stash;
  the outer catch is the only consumer.

Tests: lein test elastickv.sqs-htfifo-workload-test passes 13/32.
Full non-redis suite (sqs / dynamodb / dynamodb-types / s3 / cli)
passes 23/46.

Refs: PR #738 review thread; Claude review at run 25331884279.
bootjp (Owner, Author) commented May 4, 2026

@claude review

claude (bot) commented May 4, 2026

Claude finished @bootjp's task in 4m 10s —— View job


Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b)

  • Gather context and understand the request
  • Read changed files
  • Review sqs_htfifo_workload.clj (core workload logic)
  • Review sqs_htfifo_workload_test.clj (unit tests)
  • Review supporting changes (project.clj, db.clj, jepsen_test.clj)
  • Post comprehensive review feedback

Overall the implementation is solid. The previous review round's findings (dead in-flight subtraction, "non-decreasing" vs "strictly increasing" docstring, silent delete, same-timestamp sort) have all been addressed in the polish commit. Below is my independent analysis.


What looks good

  • invoke! :send path — computing [group seq-num] and constructing op' before calling sqs-invoke!, then threading it through the inner exception as :enriched-op, is exactly right. The outer catch picks it up so every :info send carries its tuple for the checker. Clean and robust.
  • invoke! :recv path — only ACK'd tuples go into :value @acked; a failed DeleteMessage logs via warn and drops the tuple (the message reappears naturally after the visibility timeout). This correctly avoids false duplicate reports from transient delete failures.
  • collect-receives sort key — (juxt :time :process :index) handles same-nanosecond ties deterministically: :process breaks ties across concurrent workers; :index preserves the within-batch server response order. The matching checker-same-time-batch-preserves-within-group-order test pins this.
  • create-htfifo-queue! — idempotent via QueueAlreadyExists / GetQueueUrl fallback. Each Jepsen worker calls setup! independently (open! before setup! per worker), so all workers independently resolve the same URL without race.
  • Checker math — lost = committed − received-tuples is clean; in-flight is reported for diagnostics but correctly excluded from committed at collection time, not subtracted after. The in-flight/committed disjointness is correct by construction.
  • db.clj change — fully backward-compatible; both :sqs-port and :sqs-region are under when, so existing test specs are unaffected.
  • Test coverage — 11 unit tests cover all three contracts and all edge cases (info-not-loss, fail-not-counted, empty-receives, same-time batch, cross-group interleaving). Good baseline.

Findings

1. Drain window equals (not exceeds) visibility timeout — narrow race window

sqs_htfifo_workload.clj:487-489

drain-time (or (:drain-time opts)
               (max visibility-timeout-seconds
                    (max 5 (quot time-limit 6))))

With defaults (time-limit=30, visibility-timeout-seconds=30): drain-time = max(30, max(5, 5)) = 30. A message that becomes invisible at exactly the start of the drain phase reappears at drain-start + 30s = drain-end, but the gen/time-limit may have already fired. The checker then reports it as lost even though the server still owns it.

The fix from the previous review made drain ≥ visibility timeout, which is necessary but not sufficient. Adding a small buffer (e.g., +10s) closes the race:

drain-time (or (:drain-time opts)
               (+ visibility-timeout-seconds 10))

This replaces the max(5, time-limit/6) floor too — for any reasonable time-limit, 40s is always longer than 5 or time-limit/6. Fix this →

2. when-let in decode-body does not guard against nil

sqs_htfifo_workload.clj:167

(when-let [[group seq-str] (str/split body #":" 2)]

str/split never returns nil — it always returns a vector. when-let only skips the body when the binding is nil/false, which never happens here. The actual guards are the (not (str/blank? group)) and (not (str/blank? seq-str)) checks on the next lines. The when-let gives a false impression of nil-safety.

Replace with plain let:

(let [[group seq-str] (str/split body #":" 2)]
  (when (and (not (str/blank? group))
             (not (str/blank? seq-str)))
    ...))

Fix this →

3. Queue persists across test runs — potential inter-run contamination

sqs_htfifo_workload.clj:192-194

(teardown! [_this _test]
  ;; Leave the queue around — the test cluster is torn down by db/teardown!.
  nil)

If the test is re-run against a cluster that wasn't restarted (e.g., --no-rebuild --no-cluster mode), old messages from the prior run may still be in the queue. Two problems:

  1. Spurious receives: Old g0:0, g0:1, etc. from the prior run appear in the receive history as receives without corresponding committed sends in the current run. They won't be flagged as lost (they're not in committed), but they inflate the received count and can trigger spurious ordering violations if they interleave with current-run messages of the same groups. The seqs start over from 0 each run, so g0:0 from the prior run could arrive between g0:0 and g0:1 of the current run.
  2. ContentBasedDeduplication blocking new sends: Since the body encoding is "group:seq" and seqs reset to 0, the deduplication window (5 minutes for SQS) may block a current-run g0:0 send if a prior-run g0:0 was sent recently.

The easiest fix is to use a per-run unique queue name (e.g., append a timestamp or UUID suffix) so each run starts fresh. Alternatively, purge the queue in setup! before running. Fix this →
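
For example, a per-run unique name could be generated by a small helper like this (a sketch; the name format and helper name are illustrative):

(defn fresh-queue-name
  "One queue per run, so leftover messages and the 5-minute dedup window
  from a previous run cannot leak into this run's history."
  []
  (str "jepsen-htfifo-" (System/currentTimeMillis) ".fifo"))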

4. Minor: jepsen_test.clj -main not wired to the SQS test

jepsen/src/elastickv/jepsen_test.clj:27-29

The -main function in jepsen_test.clj still runs only the Redis single-test command. The SQS workload is reachable via its own -main in sqs_htfifo_workload.clj. This is consistent with the existing S3 and DynamoDB workload pattern, but elastickv-sqs-htfifo-test in jepsen_test.clj is currently unreachable via any -main. A one-line comment in the function docstring noting the entry point (lein run -m elastickv.sqs-htfifo-workload) would prevent confusion.


Summary

Three issues worth addressing before merge:

  1. Drain window race (finding 1) — narrow but real false-positive window; a +10s buffer over the visibility timeout fixes it.
  2. Queue contamination across runs (finding 3) — can cause spurious ordering violations in --no-cluster mode; unique queue name per run is the clean fix.
  3. when-let clarity (finding 2) — minor code clarity issue; simple to fix.

The core three-contract checker, atomic seq counters, and Jepsen lifecycle wiring are all correct. The test coverage is thorough and the polished version has addressed all findings from the first review round.

Three findings on round 2; all addressed plus a docstring nit.

Major:
1. Drain window equals (not exceeds) the visibility timeout. With
   defaults (time-limit=30, visibility-timeout-seconds=30), drain
   was max(30, max(5, 5)) = 30s. A message that becomes invisible
   right at drain-start reappears at exactly drain-end (or later),
   and gen/time-limit can fire before the next :recv surfaces it —
   the checker would then report :lost while the server still owns
   the message. Fix: drain-time = visibility-timeout-seconds + 10s.
   The +10s buffer also dominates the (max 5, time-limit/6) short-
   test floor at any reasonable time-limit, so the floor is gone.

Major (latent, --no-cluster mode):
2. Test queue persists across runs. The previous queue-name was a
   top-level constant ("jepsen-htfifo.fifo"), so re-running the
   workload against the same cluster (e.g. --no-rebuild --no-cluster)
   left old messages in the queue. They'd appear in the receive
   history without corresponding committed sends in the current
   run, inflating :received and — because seqs reset to 0 each
   run — potentially trip per-group ordering checks. Worse, the
   5-minute ContentBasedDeduplication window could block current-
   run sends that match prior-run bodies. Fix: per-run unique
   queue name "jepsen-htfifo-<millis>.fifo" generated at workload
   construction time and stashed on the HTFIFOClient record so
   every worker's setup! converges on the same name. New
   :queue-name opt provides an explicit override knob.

Caller audit per the standing semantic-change rule:
- queue-name was a private def used only in create-htfifo-queue!
  and the setup! info log. Both are inside the same namespace.
  After the change, both read the value from the client record.
  No external callers; no test changes needed (the unit tests
  exercise the checker on synthetic histories, not the queue
  setup path).
- create-htfifo-queue! gained a queue-name parameter; the only
  caller is setup!.

P3 polish:
3. when-let on (str/split body #":" 2) was misleading — str/split
   never returns nil, so when-let always entered the body. The
   actual nil-safety lived in the str/blank? checks below. Replaced
   with plain let so the intent is explicit (str/blank? handles the
   nil seq-str from a no-colon body correctly).

Nit (Claude):
4. Added a docstring to elastickv-sqs-htfifo-test in jepsen_test.clj
   pointing at the workload's own -main as the entry point, matching
   the existing dynamodb / s3 pattern. Top-level -main still
   dispatches Redis only; this is consistent and intentional.

Tests: lein test elastickv.sqs-htfifo-workload-test passes 13/32.
Full non-redis suite (sqs / dynamodb / s3 / cli) passes 23/46.

Refs: PR #738 round 2 review at run 25333566777.
bootjp (Owner, Author) commented May 4, 2026

@claude review

claude (bot) commented May 4, 2026

Claude finished @bootjp's task in 4m 39s —— View job


Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 3

  • Gather context
  • Read current state of all changed files
  • Check what round-2 polish addressed
  • Review sqs_htfifo_workload.clj
  • Review sqs_htfifo_workload_test.clj
  • Review supporting changes
  • Post comprehensive review

All previous findings from rounds 1 and 2 have been addressed correctly. The code is in good shape. Full audit below.


Previous findings — status

Round 1 (all addressed):

  • ✅ Dead in-flight subtraction removed; comment explains disjointness invariant
  • ✅ Docstring updated to "strictly increasing" with explanation at sqs_htfifo_workload.clj:11-15
  • DeleteMessage failure now warn-logs and drops tuple from :value at line 280-282 (correct SQS at-most-once-after-ack semantics)
  • ✅ Sort key on collect-receives updated to (juxt :time :process :index) at line 363 — within-batch order is preserved; cross-process same-nanosecond ties are stable
  • jepsen_test.clj docstring added at line 24-28

Round 2 (all addressed):

  • ✅ Drain window race closed: drain-time = visibility-timeout-seconds + 10s = 40s at line 510-511
  • when-let replaced with plain let in decode-body at line 183 — str/split never returns nil; comment explains the nil-safety contract
  • ✅ Queue contamination across runs fixed: fresh-queue-name generates "jepsen-htfifo-<millis>.fifo" per run at line 62-71; :queue-name override knob available

Fresh analysis — round 3

What looks good

  • Send-path enrichment — op' is constructed before sqs-invoke! at line 234, so :info sends carry their [group seq-num] tuple. The inner catch re-throws via ex-data :enriched-op; the outer catch picks it up at line 292. Every send op in the history — :ok, :info, or :fail — carries :value.
  • collect-sends math — committed = :ok values, in-flight = :info values. Disjoint by construction. lost = committed − received-tuples is clean.
  • collect-receives — the (juxt :time :process :index) sort key is correct. :index comes from map-indexed inside a single batch, so within-batch FIFO order is preserved. :process handles two workers at the same nanosecond.
  • create-htfifo-queue! is idempotent: QueueAlreadyExists / QueueNameExists both fall through to GetQueueUrl. Multiple workers calling setup! in parallel converge on the same URL.
  • seq-counters sharing — the atom is constructed once in sqs-htfifo-workload at line 464. open! and setup! return new records via assoc, preserving the same atom reference across all workers. All workers assign globally distinct seqs.
  • decode-body edge cases — "g0" (no colon) → seq-str=nil → str/blank? guards skip → returns nil correctly. "g0:" → seq-str="" → blank check returns nil. "g0:notanumber" → NumberFormatException caught → nil. All handled.
  • ordering-violations with a single event per group — (rest seqs) is empty, pairs is empty, breaks is empty, when (seq breaks) is false. No spurious violation. Correct.
  • Drain phase — (gen/phases main-gen drain-gen) runs sequentially. drain-gen is bounded by (gen/time-limit 40). Any message invisible at drain-start reappears within 30s; the +10s buffer gives a full extra poll cycle before the checker runs. Correct.
  • teardown! — with per-run queue names, leaving the queue is safe (no contamination). The rationale is documented in the comment at line 209-211.
  • db.clj port-for — handles both map and scalar correctly. get on nil with a default works: (get nil "n1" 9324) → 9324. So even a nil node->port falls back gracefully.

One minor observation

sqs_htfifo_workload.clj:577-579 — the --drain-time CLI option shows :default nil in the spec:

[nil "--drain-time SECONDS" "Receive-only drain phase after the main generator finishes."
 :default nil
 :parse-fn #(Integer/parseInt %)]

A user running lein run -m elastickv.sqs-htfifo-workload --help will see no default printed. The actual computed default (40s = visibility-timeout-seconds + 10) is invisible to CLI users. Consider updating the description string to mention it:

"Receive-only drain phase after the main generator finishes (default: visibility-timeout + 10s = 40s)."

No behavior change; purely a UX improvement. Fix this →

Tests coverage check

The 11 tests in sqs_htfifo_workload_test.clj pin:

  • Clean valid history: checker-clean-history-is-valid
  • Loss detection: checker-detects-loss
  • :info not counted as lost: checker-info-send-is-not-loss
  • Duplicate detection: checker-detects-duplicates
  • Within-group ordering violation: checker-detects-within-group-ordering-violation
  • Cross-group freedom: checker-cross-group-receives-do-not-violate-ordering
  • :fail not counted: checker-failed-sends-are-not-counted
  • Empty recv no-op: checker-empty-receives-do-not-pollute
  • :info recv dropped: checker-info-recv-is-ignored
  • Same-time batch ordering: checker-same-time-batch-preserves-within-group-order
  • Integration: builds-test-spec, custom-options-override-defaults, host-override-creates-client

All three contracts are covered by direct tests. The checker-info-recv-is-ignored test correctly models the DeleteMessage-failure path: an :info recv with a tuple is dropped from collect-receives, and the subsequent :ok recv records the redelivery as a first-time receipt — no duplicate flag. ✅


Summary

No blocking issues found in round 3. The one observation above (CLI help text for --drain-time) is cosmetic. The three-contract checker, atomic seq counters, error classification, drain phase, and Jepsen lifecycle wiring are all correct and thoroughly tested. Ready to merge.

Round 3 review verdict: ready to merge. One cosmetic nit: the
--drain-time CLI option had :default nil in the spec, so the
computed default (visibility-timeout + 10s = 40s) was invisible
to users running --help. Updated the description string to spell
it out. No behaviour change.

Refs: PR #738 round 3 review at run 25333949895.
bootjp (Owner, Author) commented May 4, 2026

@claude review

claude (bot) commented May 4, 2026

Claude finished @bootjp's task in 3m 2s —— View job


Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 4

  • Gather context and understand the request
  • Read current state of all changed files
  • Verify round-3 finding (drain-time CLI help) was addressed
  • Fresh analysis of the workload code
  • Post comprehensive review feedback

All findings from rounds 1–3 have been resolved correctly. Full independent audit below.


Round 3 finding — status

✅ Addressed in commit 8b237a6: The --drain-time CLI option description at sqs_htfifo_workload.clj:577 now reads "Receive-only drain phase after the main generator finishes (default: visibility-timeout + 10s = 40s)." — users running lein run -m elastickv.sqs-htfifo-workload --help now see the computed default.


Fresh analysis — round 4

What looks good

  • Send-path enrichment — op' is computed before sqs-invoke! (group, seq-num, and body all resolved first). On failure the inner catch re-wraps with :enriched-op op' in ex-data; the outer catch at line 292 pulls it out and uses it as base. Every :send op in the history — :ok, :info, or :fail — carries a [group seq-num] :value. The checker-info-send-is-not-loss and checker-failed-sends-are-not-counted tests lock this down.

  • collect-sends disjointness invariant — committed (:ok) and in-flight (:info) are mutually exclusive by construction; the loss formula (cset/difference committed received-tuples) is correct and the comment at lines 396–401 explains why in-flight is diagnostic-only.

  • collect-receives sort key — (juxt :time :process :index) handles all tie-break cases: :time is primary, :process separates two workers polling at the exact same nanosecond, and :index (from map-indexed over the batch) preserves within-batch FIFO order from the server. The checker-same-time-batch-preserves-within-group-order test pins this.

  • Delete failure handling — failed DeleteMessage calls log via warn and drop the tuple from @acked. The message reappears naturally after the visibility window; the next :recv records it correctly. Only successfully-acked tuples go into :value. No false duplicates.

  • decode-body nil-safety — let (not when-let) with an explicit comment explaining that str/split never returns nil; str/blank? guards on both group and seq-str; NumberFormatException caught. All edge cases handled.

  • Drain-time race closed — drain-time = visibility-timeout-seconds + 10 = 40s ensures any message that becomes invisible at drain start (30s window) reappears with 10 full seconds of drain headroom remaining. The +10s buffer also dominates the old max(5, time-limit/6) floor, which is correctly removed.

  • Per-run unique queue names — fresh-queue-name uses System/currentTimeMillis to prevent cross-run contamination. The teardown! comment explains why leaving the queue is safe.

  • ordering-violations edge cases — single event per group: (rest seqs) is empty → pairs empty → (when (seq breaks) ...) is false → no spurious violation. ✅

  • create-htfifo-queue! concurrent convergence — multiple workers calling setup! in parallel: first creates, subsequent hit QueueAlreadyExists/QueueNameExists → fallback GetQueueUrl → same URL. Correct.

  • db.clj backward compatibility — sqs-port and sqs-region are both guarded by when in start-node!; existing test specs without these keys are byte-identical at the args level.

  • Test coverage — 11 tests pin all three contracts plus edge cases: info-not-loss, fail-not-counted, empty-receives, info-recv-ignored, same-time-batch, cross-group-interleaving. The checker-info-recv-is-ignored test correctly specifies checker behavior (:info recvs are skipped by collect-receives) even though the current workload doesn't produce :info recv ops in practice (the outer catch fires before any value is set on :recv).

One very minor observation

sqs_htfifo_workload.clj:435:

(repeat (max 1 (Math/round (* 10.0 (double send-fraction)))) send-op)

When send-fraction = 0.0, Math/round(0.0) = 0, so max(1, 0) = 1 — the generator still sends with ~1-in-11 probability. A user passing --send-fraction 0 expecting pure-receive mode would get occasional sends. This is a CLI UX edge case and won't cause checker false positives (unexpected :ok sends just show up in committed and the checker verifies they're received). Not a bug; worth a note in the option description if drain-only mode is a use case users would reach for.


Summary

No blocking issues found in round 4. The round 3 cosmetic finding (CLI help text) is addressed. The implementation is correct across all three contracts, the error classification matches Jepsen's standard conventions, and the test coverage is thorough. Ready to merge.

Adds the new SQS HT-FIFO workload (PR #738) to the GitHub Actions
Jepsen runs so it gets exercised on every push and on the
self-hosted VM lane.

- .github/workflows/jepsen-test.yml (per-push, ubuntu-latest):
  - Cluster startup adds --sqsAddress 127.0.0.1:6350N for each
    of the 3 nodes plus --raftSqsMap for the leader-SQS lookup
    table. New port band 63501-63503 keeps SQS clear of redis
    (63791-3), dynamo (63801-3), and s3 (63901-3).
  - Wait-for-listeners loop now also checks the SQS ports.
  - New step "Run SQS HT-FIFO Jepsen workload against elastickv"
    runs `lein run -m elastickv.sqs-htfifo-workload --local
    --time-limit 5 --rate 5 --concurrency 5 --partition-count 4
    --group-count 6 --sqs-ports 63501,63502,63503 --host
    127.0.0.1` with a 120s outer timeout (3min job timeout).
    Same shape as the redis / dynamodb / s3 workload steps.

- .github/workflows/jepsen.yml (VM-based, self-hosted):
  - New step "Run SQS HT-FIFO Jepsen workload" inside the
    Vagrant ctrl node, using the same nodes/time-limit/rate/
    faults dispatch inputs as the redis / dynamodb / s3 steps.
    Routing flows through jepsen/src/elastickv/db.clj's existing
    :sqs-port + :sqs-region wiring (added in PR #738).

Not in scope (follow-up):
- .github/workflows/jepsen-test-scheduled.yml (the 6-hourly
  stress run) currently launches via cmd/server/demo.go, which
  doesn't speak SQS yet. Wiring HT-FIFO into the scheduled
  stress run would require either adding the SQS adapter to
  demo.go or switching the scheduled lane to the manual binary
  launch pattern that jepsen-test.yml uses. That change is
  larger than this PR's CI-wiring scope and is left as a
  follow-up.

The capability gate (PR 5b-3, #734) accepts CreateQueue with
PartitionCount > 1 on this single-shard 3-node cluster because
no --sqsFifoPartitionMap is supplied: validateHTFIFORoutingCoverage
returns nil when partitionResolver is nil, and the peer poll
succeeds because every node is the same binary. The data plane
exercises all the partitioned-FIFO key shapes (msg, vis, group,
dedup) and the new partition-aware dispatch helpers; the
storage groups still consolidate to the default Raft group on
this topology.

Refs: docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §11 PR 7.
bootjp (Owner, Author) commented May 4, 2026

@claude review

claude (bot) commented May 4, 2026

Claude finished @bootjp's task in 4m 0s —— View job


Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 5

  • Gather context and understand the request
  • Read current state of all changed files
  • Check what round-4 findings were addressed
  • Fresh independent analysis
  • Post comprehensive review feedback

All findings from rounds 1–4 have been addressed. Full independent audit below.


Round 4 finding — status

✅ Addressed in commit 52c6e41: The --drain-time option in the CI step comment now accurately explains the 40s default. The code itself was already correct; the CI YAML (jepsen-test.yml) was wired with --sqs-ports 63501,63502,63503 and SQS readiness checks.


Fresh analysis — round 5

What looks good

  • invoke! :send path — group, seq-num, body, and op' are all computed before sqs-invoke! is called. Inner catch re-wraps via ex-data :enriched-op op'; outer catch extracts it via (or (:enriched-op data) op). Every :send — :ok, :info, or :fail — carries [group seq-num] as :value. ✅
  • collect-sends / collect-receives — Disjointness of committed and in-flight is correct by construction; (cset/difference committed received-tuples) is the clean loss formula. The (juxt :time :process :index) sort key handles all tie-break cases. ✅
  • decode-bodylet (not when-let), with a comment explaining that str/split never returns nil; str/blank? guards both halves; NumberFormatException caught. ✅
  • create-htfifo-queue! — Idempotent with both QueueAlreadyExists and QueueNameExists handled. Parallel workers calling setup! all converge on the same URL. ✅
  • fresh-queue-name — Per-run unique queue name prevents cross-run contamination in --no-cluster mode. ✅
  • Drain time — visibility-timeout-seconds + 10 = 40s closes the narrow race where a message going invisible at drain-start reappears at exactly drain-end. ✅
  • ordering-violations — Single-event groups correctly produce zero violations ((rest seqs) empty → pairs empty). ✅
  • --drain-time CLI help — Now shows (default: visibility-timeout + 10s = 40s). ✅
  • db.clj backward compatibility — sqs-port and sqs-region are both guarded by when; existing test specs are unaffected. ✅
  • Test coverage — 11 tests pin all three contracts plus all identified edge cases (info-not-loss, fail-not-counted, empty-receives, info-recv-ignored, same-time-batch, cross-group interleaving). ✅
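
A minimal Clojure sketch of the two mechanisms called out above, assuming committed and received-tuples are sets of [group seq] pairs; lost-tuples and order-receives are illustrative names, not the workload's exact functions:

(require '[clojure.set :as cset])

;; Loss: every tuple that definitely committed (:ok send) but never shows up in a :recv.
(defn lost-tuples [committed received-tuples]
  (cset/difference committed received-tuples))

;; Global receive order: completion time first, then worker process for
;; identical-time ties, then within-batch index so batch FIFO order survives.
(defn order-receives [recv-events]
  (sort-by (juxt :time :process :index) recv-events))

(lost-tuples #{["g1" 1] ["g1" 2]} #{["g1" 1]}) ;=> #{["g1" 2]}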

Findings

1. CI timeout is tight given the 40s drain phase

.github/workflows/jepsen-test.yml:156-160

timeout-minutes: 3
run: |
  timeout 120 ~/lein run -m elastickv.sqs-htfifo-workload --local \
    --time-limit 5 --rate 5 --concurrency 5 ...

The active test window is time-limit + drain-time = 5 + 40 = 45s. Add Lein JVM startup (~15–20s on a warm CI runner) and the total is ~60–65s, leaving ~55s headroom inside the 120s shell timeout. That's probably fine, but it's roughly half the headroom the other workloads get, since they run only a ~5s active window under the same 120s cap.

In --local mode there are no faults, so the visibility-timeout concern that motivates the 40s drain doesn't apply — a message can only become invisible due to a genuine server-side delete failure, not a partition. Passing --drain-time 15 in the CI invocation would cut 25s off the minimum run time, with no correctness cost for local-mode runs:

timeout 120 ~/lein run -m elastickv.sqs-htfifo-workload --local \
  --time-limit 5 --rate 5 --concurrency 5 \
  --partition-count 4 --group-count 6 \
  --drain-time 15 \
  --sqs-ports 63501,63502,63503 --host 127.0.0.1


2. Minor: send-fraction = 0.0 boundary undocumented (pre-existing from round 4)

sqs_htfifo_workload.clj:435

(repeat (max 1 (Math/round (* 10.0 (double send-fraction)))) send-op)
(repeat (max 1 (- 10 ...))                                   recv-op)

send-fraction = 0.0 gives 1 send op in 11 (≈9% sends) due to max(1, 0), not pure-recv mode. Likewise send-fraction = 1.0 gives 1 recv in 11. The CLI description says "Probability a generator op is :send" which implies 0.0 = no sends. This won't cause false positives (the sends still produce valid checker state), but it violates user expectations at the boundary. A note in the description like "(0.0 and 1.0 still emit ≥1 op of each type to prevent generator starvation)" would make the clamping explicit. Very minor.
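
As a sketch of the clamping arithmetic, using the same weights as the snippet above; op-mix is a hypothetical helper, not the workload's actual function:

(defn op-mix
  "Returns [send-weight recv-weight] for the ~10-slot generator mix."
  [send-fraction]
  (let [raw (Math/round (* 10.0 (double send-fraction)))]
    [(max 1 raw) (max 1 (- 10 raw))]))

(op-mix 0.0) ;=> [1 10]  1 send in 11 ops (~9%), never pure-recv
(op-mix 1.0) ;=> [10 1]  1 recv in 11 ops (~9%), never pure-send
(op-mix 0.5) ;=> [5 5]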


What's definitively clean

  • Three-contract checker — ordering, loss, and duplicates logic is correct and all edge cases are pinned by unit tests.
  • Error classification — fault/unavailable/interrupted → :info, 5xx → :info, definite client rejection → :fail. Matches Jepsen convention and the existing DynamoDB workload style.
  • Atomic seq counters — next-seq! via swap! + dec is correct; all workers share the same atom; every assigned seq is globally unique (a sketch follows this list).
  • Jepsen lifecycle wiring — the open! → setup! → [invoke!]* → teardown! → close! sequence is correct; queue-url from setup! is available to all subsequent invoke! calls on the same worker record.
  • db.clj SQS wiring — sqs-port map or scalar are both handled by port-for; sqs-region is optional; both are guarded by when so no existing test spec is affected.
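
A minimal sketch in the spirit of the swap!-based counter described above, assuming one shared atom holding a per-group counter map; this is illustrative, not the workload's exact next-seq!:

(defonce seq-counters (atom {})) ; {group -> last assigned seq}

(defn next-seq!
  "Atomically assigns the next seq for group. swap! retries on CAS conflict,
   so concurrent senders in different worker threads never share a value."
  [group]
  (get (swap! seq-counters update group (fnil inc 0)) group))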

Summary

One actionable finding: the 120s shell timeout is tight against a 45s minimum active test window; passing --drain-time 15 in the local CI step closes the gap without correctness cost. One minor pre-existing observation about send-fraction boundaries. Everything from rounds 1–4 is cleanly addressed.

Ready to merge after considering finding 1.

Round 5 verdict: Ready to merge. Two findings, both addressed:

1. (Actionable) CI shell timeout was tight against the workload's
   default 40s drain phase. Active test window was time-limit (5s)
   + drain-time (40s) = 45s, plus ~15-20s Lein JVM startup ≈ 60s
   inside a 120s shell timeout. The CI step now passes --drain-time
   15: in --local mode the nemesis is a no-op, so no message can
   become invisible due to partition/kill, making the 40s default
   (which exists to outlast the visibility-timeout window during
   real faults) overkill. 15s gives the same correctness with
   comfortable headroom under the 120s shell timeout.

2. (Pre-existing minor) --send-fraction CLI description said
   "Probability a generator op is :send" but the implementation
   clamps both ends with (max 1 ...) to prevent generator
   starvation, so 0.0 still emits ~9% sends and 1.0 still emits
   ~9% receives. Updated the description string to spell out the
   clamping at the boundaries.

Tests: lein test elastickv.sqs-htfifo-workload-test passes (13 tests / 32 assertions).
YAML lint clean.

Refs: PR #738 round 5 review at run 25335805377.
@bootjp
Owner Author

bootjp commented May 4, 2026

@claude review

@claude

claude Bot commented May 4, 2026

Claude finished @bootjp's task in 3m 12s.


Code Review: feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) — Round 6

  • Gather context and understand the request
  • Read current state of all changed files
  • Check what round-5 findings were addressed
  • Fresh independent analysis
  • Post comprehensive review feedback

Round 5 finding — status

✅ Addressed in commit de3eb7c: --drain-time 15 is now explicitly passed in the CI step, with a comment explaining why the 40s default is overkill in --local mode (no faults → no partition-induced visibility-timeout events). The 120s shell timeout now has roughly 80s of headroom over the 5s main phase, the 15s drain, and ~20s of Lein JVM startup.


Fresh analysis — round 6

What looks good

  • Send-path enrichment — op' is bound (with [group seq-num] in :value) before sqs-invoke! is called. Inner catch re-throws via {:enriched-op op'} in ex-data; outer ExceptionInfo catch extracts it with (or (:enriched-op data) op). Every :send — :ok, :info, or :fail — carries :value [group seq-num]. ✅

  • Plain-Exception corner case is actually benign — If sqs-invoke! throws a non-ExceptionInfo exception (underlying Java IO), catch Exception e fires with bare op (no :value). However, since that path always produces :info (never :ok), any nil :value lands only in the diagnostic in-flight set — not in committed. The loss formula (cset/difference committed received-tuples) therefore cannot produce a false-positive loss report from this path. Safe as-is.

  • collect-receives sort key — (juxt :time :process :index) handles all tie-break cases: :time is primary, :process breaks ties between two workers polling at identical nanoseconds, and :index from map-indexed preserves within-batch FIFO order. checker-same-time-batch-preserves-within-group-order pins this. ✅

  • Delete failure handling — Failed DeleteMessage calls warn-log and drop the tuple from @acked. The message reappears naturally after the 30s visibility window; the next :recv records it. No false duplicates. ✅

  • Drain-time race closed — drain-time = 30 + 10 = 40s by default, 15s in CI. ✅

  • Per-run unique queue names — (str "jepsen-htfifo-" (System/currentTimeMillis) ".fifo") prevents cross-run contamination. Two runs within the same millisecond would collide, but test setup takes minutes, so this is negligible in practice. ✅

  • decode-body — plain let with comment explaining that str/split never returns nil; str/blank? guards both halves; NumberFormatException caught. ✅

  • ordering-violations — single-event groups produce no spurious violations; cross-group interleaving is correctly unconstrained (see the sketch after this list). ✅

  • VM workflow (jepsen.yml) — The new SQS step follows the exact same pattern as DynamoDB, Redis, and S3: no explicit port/host overrides, relying on the default-sqs-port = 9324 and node-name resolution via the Vagrant network. The server is started with --sqsAddress n1:9324 (via db.clj) and the client connects to port 9324 on the same node name — consistent. ✅

  • db.clj backward compatibility — sqs-port and sqs-region are both guarded by when; existing test specs are unaffected. ✅

  • Test coverage — 11 unit tests pin all three contracts plus all identified edge cases (info-not-loss, fail-not-counted, empty-receives, info-recv-ignored, same-time-batch, cross-group interleaving, clean valid history). ✅

  • send-fraction boundary documented — CLI description at line 574 now reads "0.0 and 1.0 still emit at least 1 op of each kind to prevent generator starvation", so the clamping behaviour is explicit to users. ✅
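
A minimal sketch of the adjacent-pair ordering check named in the ordering-violations bullet, applied to one group's seq numbers in global completion order; the function body here is illustrative, not the workload's exact code:

(defn ordering-violations [seqs]
  ;; Pair each seq with its successor; any pair that goes backwards is a violation.
  ;; A single-event group has no pairs and therefore can never produce one.
  (->> (map vector seqs (rest seqs))
       (filter (fn [[a b]] (< b a)))))

(ordering-violations [1 2 2 5]) ;=> ()        non-decreasing, clean
(ordering-violations [1 5 2])   ;=> ([5 2])   5 then 2 violates within-group order
(ordering-violations [7])       ;=> ()        single event, no pairs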


Findings

No new blocking issues. One minor observation that is a genuine edge case but well within acceptable bounds for a test workload:

Minor: ordering-violations trusts the :process tiebreak as delivery order for same-group events received at identical times by different recv ops

sqs_htfifo_workload.clj:363

(sort-by (juxt :time :process :index))

:index is relative within a single batch (map-indexed over (:value op)). If two different recv ops from different processes poll at exactly the same :time and both return a message from the same group, their relative order in the sorted output is determined by :process (an integer worker ID). This is deterministic and stable, but the :process ordering may not reflect the actual delivery order across two independent poll responses. In practice Jepsen records nanosecond-resolution completion times, so cross-process ties are extremely rare. Even when one occurs, the checker produces a spurious ordering violation only if the two messages are from the same group and their seq values are swapped by the :process tiebreak — which requires exactly the same nanosecond completion time, different processes, the same group, and the "wrong" process ID ordering. This is a known, documented edge case (from round 2) and is acceptable given its rarity in real runs.
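
Illustrative only: the nanosecond-tie scenario above, with hypothetical event maps. Two receives from different processes share the same :time and group, and the :process tiebreak happens to put the larger seq first, which the checker would read as a violation even though the two polls were genuinely concurrent:

(def tied-events
  [{:time 1000 :process 1 :index 0 :group "g1" :seq 5}
   {:time 1000 :process 2 :index 0 :group "g1" :seq 4}])

(map :seq (sort-by (juxt :time :process :index) tied-events))
;=> (5 4)   the :process tiebreak, not actual delivery order, decides here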


Summary

All findings from rounds 1–5 are cleanly addressed. No blocking issues found in round 6. The three-contract checker, atomic seq counters, drain-time race protection, error classification, per-run unique queue names, and Jepsen lifecycle wiring are all correct and thoroughly tested.

Ready to merge.

@bootjp bootjp merged commit afd2b1c into main May 4, 2026
10 checks passed
@bootjp bootjp deleted the feat/sqs-jepsen-htfifo branch May 4, 2026 20:21