ci(jepsen): add scheduled dedup-mode workflow + design doc M4 criterion#889
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review PR-C: Jepsen scheduled workflow (dedup gate ON) + design doc M4 criterion. Mechanism: new workflow .github/workflows/jepsen-test-scheduled-dedup.yml runs daily at 03:17 UTC with ELASTICKV_REDIS_ONEPHASE_DEDUP=1 at the job level. Launch step asserts the env var is '1' before waiting on listeners — fails fast on misconfig. Scoped to Redis workload only because DynamoDB/S3/SQS do not route through the dedup loop. actionlint clean. Please verify: (1) the env-var assertion is sufficient to guarantee the cluster process has the gate on (no log-grep needed since the job env propagates to all run: steps deterministically); (2) the cadence (17 3 * * *, off-peak non-zero minute) is sensible vs the existing 6-hourly general workflow; (3) the scope rationale (Redis-only) holds — DynamoDB/S3/SQS adapters do not call runTransactionWithDedup, listPushCoreWithDedup, or the standalone SET dedup path. |
|
Claude finished @bootjp's task in 3m 34s —— View job PR #889 Review — Jepsen Scheduled Workflow (Dedup Mode) + Design Doc M4Todos
Verification: Three Requested Points1. Env-var assertion is sufficient to guarantee the gate is activeVerdict: Yes, with one sequencing note.
The assertion in the launch step is a correct config-drift guard. One minor sequencing note: the demo cluster is backgrounded before the assertion runs ( 2. Cadence is sensibleVerdict: Yes.
The 7-consecutive-days-green criterion in §M4 maps cleanly to 7 individual run IDs in the new workflow's history — no ambiguity about what "consecutive" means. 3. Redis-only scope rationale holdsVerdict: Confirmed by code inspection. A Issues FoundBug: Silent failure in
|
There was a problem hiding this comment.
Code Review
This pull request updates the design document for proposed transaction secondary idempotency, adding detailed criteria for Jepsen testing, local reproduction steps, workflow scope rationale, and demo cluster gate confirmation. The review feedback suggests a minor consistency improvement to use the full path for the scheduled Jepsen workflow file.
| workflow (`.github/workflows/jepsen-test-scheduled-dedup.yml`, | ||
| daily at 03:17 UTC). The general scheduled workflow | ||
| (`jepsen-test-scheduled.yml`, every 6 h) continues to run *without* |
gemini PR #889 MEDIUM: consistency with the new dedup workflow on line 539 which uses .github/workflows/... prefix. The bare filename next line was a typo from the diff. Caller audit: pure documentation; no Go callers.
|
Claude finished @bootjp's task in 1m 27s —— View job PR #889 Review — Round 2 (post d9f67f2)Todos
What Changed in Round 1 Fix✅ gemini MEDIUM path consistency — Confirmed Fixed
Both references now use full paths. Fix is correct and complete. Still Open from Round 1Bug:
|
963a2d3 to
8321ad0
Compare
gemini PR #889 MEDIUM: consistency with the new dedup workflow on line 539 which uses .github/workflows/... prefix. The bare filename next line was a typo from the diff. Caller audit: pure documentation; no Go callers.
d9f67f2 to
2b662f3
Compare
claude[bot] PR #889 blocking finding: the previous loop shape until [ "$n" -ge 3 ]; do ~/lein deps && break; done exited on iteration count, not lein success. When all attempts failed the last executed command was sleep (exit 0), so the step reported green despite no dependencies being warmed -- a transient Maven Central outage would have silently produced a cluster running the Jepsen workload without a warmed cache, masking the dedup signal. Replaced with the pattern from jepsen-test-scheduled.yml: until ~/lein deps; do ...; if [ "$n" -ge "$max" ]; then exit 1; fi; ...; done Loop now exits on lein success, otherwise reaches the explicit exit 1 once max retries are hit. Backoff also aligned to 30*n seconds for parity (previously 15*n). Caller audit: pure shell-script change; no Go code touched, no semantic change to existing callers. actionlint clean.
32ad1ef to
2cb6645
Compare
gemini PR #889 MEDIUM: consistency with the new dedup workflow on line 539 which uses .github/workflows/... prefix. The bare filename next line was a typo from the diff. Caller audit: pure documentation; no Go callers.
claude[bot] PR #889 blocking finding: the previous loop shape until [ "$n" -ge 3 ]; do ~/lein deps && break; done exited on iteration count, not lein success. When all attempts failed the last executed command was sleep (exit 0), so the step reported green despite no dependencies being warmed -- a transient Maven Central outage would have silently produced a cluster running the Jepsen workload without a warmed cache, masking the dedup signal. Replaced with the pattern from jepsen-test-scheduled.yml: until ~/lein deps; do ...; if [ "$n" -ge "$max" ]; then exit 1; fi; ...; done Loop now exits on lein success, otherwise reaches the explicit exit 1 once max retries are hit. Backoff also aligned to 30*n seconds for parity (previously 15*n). Caller audit: pure shell-script change; no Go code touched, no semantic change to existing callers. actionlint clean.
|
@claude review Round 2 fix (commit 2cb6645):
actionlint clean. Pure shell-script change; no Go code touched. |
|
Claude finished @bootjp's task in 1m 4s —— View job PR #889 Review — Round 3 (post 2cb6645)Todos
Blocking Issue:
|
| Finding | Round raised | Status |
|---|---|---|
lein deps retry loop silent success on exhaustion |
Round 1 (blocking) | ✅ Fixed in 2cb6645 |
Doc path consistency (jepsen-test-scheduled.yml → full path) |
Round 1 / Gemini MEDIUM | ✅ Fixed in d9f67f2 |
| Assertion ordering (nohup fires before env var check) | Round 1 (cosmetic) | Open — non-blocking, functionally acceptable |
workflow_dispatch time-limit default 300 vs general 300 |
Round 1 (minor) | Open — intentional, no action needed |
The only remaining open item is the cosmetic assertion ordering (the if [ "${ELASTICKV_REDIS_ONEPHASE_DEDUP:-}" != "1" ] block runs after the nohup go run launch rather than before it). As noted in Round 1, this does not produce false-positive M4 signal because the cleanup step (Stop demo cluster) runs if: always() and kills the orphan process, and the workload step never runs. Not a correctness issue.
Summary
Approved. The one blocking issue is fixed. The fix pattern is identical to the general workflow, backoff is aligned, and the comment block correctly attributes and documents the prior failure mode. No new issues found in Round 3. The PR is ready to merge.
gemini PR #889 MEDIUM: consistency with the new dedup workflow on line 539 which uses .github/workflows/... prefix. The bare filename next line was a typo from the diff. Caller audit: pure documentation; no Go callers.
claude[bot] PR #889 blocking finding: the previous loop shape until [ "$n" -ge 3 ]; do ~/lein deps && break; done exited on iteration count, not lein success. When all attempts failed the last executed command was sleep (exit 0), so the step reported green despite no dependencies being warmed -- a transient Maven Central outage would have silently produced a cluster running the Jepsen workload without a warmed cache, masking the dedup signal. Replaced with the pattern from jepsen-test-scheduled.yml: until ~/lein deps; do ...; if [ "$n" -ge "$max" ]; then exit 1; fi; ...; done Loop now exits on lein success, otherwise reaches the explicit exit 1 once max retries are hit. Backoff also aligned to 30*n seconds for parity (previously 15*n). Caller audit: pure shell-script change; no Go code touched, no semantic change to existing callers. actionlint clean.
2cb6645 to
5c644e7
Compare
|
@claude review Head moved to 5c644e7 after rebase onto the updated PR-A branch (which now contains the merged PR-B content). The PR #889 diff itself (jepsen-test-scheduled-dedup.yml + design doc M4 section) is unchanged — only the merge base advanced. The round-2 "Approved" verdict on commit 2cb6645 (lein-deps loop fix to match jepsen-test-scheduled.yml pattern) is still the substantive content. |
|
Claude finished @bootjp's task in 1m 20s —— View job PR #889 Review — Round 4 (post-rebase to 5c644e7)Todos
Rebase Verification
These are the same three commits as the Round 3 review (Round 2 fix + Round 1 fix + original). The rebase advanced the merge base but did not alter commit content. File State Confirmation
set -uo pipefail
n=0
max=3
until ~/lein deps; do
n=$((n + 1))
if [ "$n" -ge "$max" ]; then
echo "lein deps failed after $n attempts" >&2
exit 1
fi
sleep_secs=$((n * 30))
echo "lein deps failed (attempt $n/$max), retrying in ${sleep_secs}s..." >&2
sleep "$sleep_secs"
doneThe blocking fix is present and byte-for-byte correct. Loop exits on
Both workflow references now use full paths ( All Findings — Final Status
No new issues introduced by the rebase. SummaryApproved. The rebase is a merge-base-only change — no PR content was modified. All prior blocking and MEDIUM findings are fixed. The two remaining open items (assertion ordering, |
Stacked on PR-B (#888). Adds .github/workflows/jepsen-test-scheduled-dedup.yml: a daily Jepsen run that launches the demo cluster with ELASTICKV_REDIS_ONEPHASE_DEDUP=1 and executes the Redis workload. The 7-consecutive-days-green criterion in design doc §M4 is now operationally checkable. Why a separate workflow vs adding the env var to the existing jepsen-test-scheduled.yml ========================================================= The legacy path (gate off) must also stay covered. The existing 6-hourly workflow runs the legacy path; this new daily workflow runs the dedup path. Both must stay green for option-2 to be safe to default-on. Mixing the gate into the existing workflow would lose legacy coverage without adding the additional dedup-on signal that the design doc actually calls for. Workflow scope ============== - Cadence: daily at 03:17 UTC (off-peak; non-zero minute matching the project's cron pattern guidance). - Workload: Redis only. The dedup feature ships behind the Redis adapter's onePhaseTxnDedup flag (RPUSH/LPUSH via listPushCoreWithDedup, MULTI/EXEC via runTransactionWithDedup, standalone SET via single-mop EXEC routing). DynamoDB / S3 / SQS do NOT route through the dedup loop, so re-running them under the gate would add hours of CI for zero signal on the new code path. - Cluster gate assertion: the launch step exits 2 immediately if the job-level env var is not '1'. The env var is set on the job and inherited by every run: step; nothing in demo.go can intercept or unset it before NewRedisServer reads os.Getenv. So if the asserted value is '1' at launch time, the dedup gate IS active in the cluster process — no log-grep guesswork. Design doc updates (docs/design/2026_05_21_..._idempotency.md) ============================================================== §M4 expanded with: - Local reproduction script reference (make jepsen-redis with the env var on). - 7-day criterion specifically tied to the new workflow file name. - Workflow scope rationale (Redis-only is intentional, not an oversight). - Gate assertion mechanism (env at job level, fail-fast on '!=1'). Caller audit (per /loop semantic-change rule) ============================================= No Go code changed. This is pure infrastructure: a new workflow file and a doc update. No production behavior change, no new test coverage on existing Go callers. Validation ========== - actionlint .github/workflows/jepsen-test-scheduled-dedup.yml clean. - The workflow's lein / go-mod-download / cache steps mirror the existing scheduled workflow line for line, so cache invalidation semantics are unchanged.
gemini PR #889 MEDIUM: consistency with the new dedup workflow on line 539 which uses .github/workflows/... prefix. The bare filename next line was a typo from the diff. Caller audit: pure documentation; no Go callers.
claude[bot] PR #889 blocking finding: the previous loop shape until [ "$n" -ge 3 ]; do ~/lein deps && break; done exited on iteration count, not lein success. When all attempts failed the last executed command was sleep (exit 0), so the step reported green despite no dependencies being warmed -- a transient Maven Central outage would have silently produced a cluster running the Jepsen workload without a warmed cache, masking the dedup signal. Replaced with the pattern from jepsen-test-scheduled.yml: until ~/lein deps; do ...; if [ "$n" -ge "$max" ]; then exit 1; fi; ...; done Loop now exits on lein success, otherwise reaches the explicit exit 1 once max retries are hit. Backoff also aligned to 30*n seconds for parity (previously 15*n). Caller audit: pure shell-script change; no Go code touched, no semantic change to existing callers. actionlint clean.
5c644e7 to
448fad0
Compare
…t to fix CI flake (#890) ## Summary Fixes the CI flake observed in [Test workflow run 26678774684](https://github.com/bootjp/elastickv/actions/runs/26678774684): ``` --- FAIL: TestSQSServer_Throttle_SetQueueAttributesInvalidatesBucket (1.79s) sqs_throttle_integration_test.go:170: expected throttle, got 200 ``` ## Root cause Identical race to the one fixed for `TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket` in commit [54c6cd5](54c6cd56) (PR #819 follow-up): the 1-token-per-second refill rate races the test's own wall clock under `-race` on slow CI runners. For **this** test: 1. `mustSetQueueAttributes(Capacity=10, Refill=1)` 2. `for range 10 { send }` — drains the bucket 3. Sanity send — expects HTTP 400 (throttle) Each send goes through Raft propose+apply at ~100-250ms under `-race`. The 11 writes from steps 1-3 elapse ~1.1-2.75s. At Refill=1/sec the bucket has accumulated ≥1 token by step 3, so the sanity send returns HTTP 200 instead of 400 — **falsely** indicating a bucket-invalidation regression that does not exist. ## Fix Drop the initial Refill from `"1"` to `"0.01"` (1 token per 100 seconds) so no test-window wall-clock can accumulate to a whole token. The test's intent — *verify that a Capacity/Refill **raise** invalidates the cached bucket on the very next request* — is independent of the **initial** refill rate. The post-set assertion at line 182 is exercised against the fresh `Capacity=20/Refill=20` bucket, which is what the test actually claims to pin. ## Why only this test, not the sibling `TestSQSServer_Throttle_DeleteQueueInvalidatesBucket` has the same `Refill=1` initial config but **no post-drain sanity assertion** — it just drains without status checks, then verifies fresh capacity post-recreate. No race window there. ## Relation to other open PRs Unrelated to PRs #887/#888/#889 (option-2 dedup work). The flake surfaced on PR #889's CI run but the fix lives in pre-existing SQS test code that none of the dedup PRs touch. ## Caller audit (per /loop semantic-change rule) Test-only change. The throttle config validator (`sqs_catalog.go:163`) accepts fractional `float64 SendRefillPerSecond`; `0.01` is non-zero so `IsEmpty` (line 172) returns `false` and throttling stays enabled — the test still exercises the throttle path. Matches the prior fix's caller-audit conclusion verbatim. ## Validation - `go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -timeout 120s` passes (5.3s wall, all three iterations green) - `gofmt`, `go vet`, `golangci-lint run` all clean <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Updated test configuration and expanded documentation to improve test stability on slow CI environments. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/890?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Stacked on PR-B (#888). Adds
.github/workflows/jepsen-test-scheduled-dedup.yml: a daily Jepsen run that launches the demo cluster withELASTICKV_REDIS_ONEPHASE_DEDUP=1and executes the Redis workload. The 7-consecutive-days-green criterion in design doc §M4 is now operationally checkable.Base:
feat/standalone-set-dedup(PR #888).Why a separate workflow vs adding the env var to existing scheduled workflow
The legacy path (gate off) must also stay covered. The existing 6-hourly workflow runs the legacy path; this new daily workflow runs the dedup path. Both must stay green for option-2 to be safe to default-on. Mixing the gate into the existing workflow would lose legacy coverage without adding the additional dedup-on signal that the design doc actually calls for.
Workflow scope
onePhaseTxnDedupflag (RPUSH/LPUSH vialistPushCoreWithDedup, MULTI/EXEC viarunTransactionWithDedup, standalone SET via single-mop EXEC routing). DynamoDB / S3 / SQS do NOT route through the dedup loop, so re-running them under the gate would add hours of CI for zero signal on the new code path.1. The env var is set on the job and inherited by everyrun:step; nothing indemo.gocan intercept or unset it beforeNewRedisServerreadsos.Getenv. So if the asserted value is1at launch time, the dedup gate IS active in the cluster process — no log-grep guesswork.Design doc updates
docs/design/2026_05_21_proposed_txn_secondary_idempotency.md§M4 expanded with:make jepsen-rediswith the env var on).!= "1").Caller audit (per /loop semantic-change rule)
No Go code changed. This is pure infrastructure: a new workflow file and a doc update. No production behavior change, no new test coverage on existing Go callers.
Validation
actionlint .github/workflows/jepsen-test-scheduled-dedup.ymlclean