From da6ccec39ae45da0c20a035fc820e10a8bd33c0e Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sat, 30 May 2026 16:34:31 +0900
Subject: [PATCH 1/3] ci(jepsen): add scheduled dedup-mode workflow + design doc M4 criterion
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stacked on PR-B (#888). Adds
.github/workflows/jepsen-test-scheduled-dedup.yml: a daily Jepsen run
that launches the demo cluster with ELASTICKV_REDIS_ONEPHASE_DEDUP=1
and executes the Redis workload. The 7-consecutive-days-green
criterion in design doc §M4 is now operationally checkable.

Why a separate workflow vs adding the env var to the existing
jepsen-test-scheduled.yml
=========================================================

The legacy path (gate off) must also stay covered. The existing
6-hourly workflow runs the legacy path; this new daily workflow runs
the dedup path. Both must stay green for option-2 to be safe to
default-on. Mixing the gate into the existing workflow would lose
legacy coverage without adding the additional dedup-on signal that
the design doc actually calls for.

Workflow scope
==============

- Cadence: daily at 03:17 UTC (off-peak; non-zero minute matching the
  project's cron pattern guidance).
- Workload: Redis only. The dedup feature ships behind the Redis
  adapter's onePhaseTxnDedup flag (RPUSH/LPUSH via
  listPushCoreWithDedup, MULTI/EXEC via runTransactionWithDedup,
  standalone SET via single-mop EXEC routing). DynamoDB / S3 / SQS do
  NOT route through the dedup loop, so re-running them under the gate
  would add hours of CI for zero signal on the new code path.
- Cluster gate assertion: the launch step exits 2 immediately if the
  job-level env var is not '1'. The env var is set on the job and
  inherited by every run: step; nothing in demo.go can intercept or
  unset it before NewRedisServer reads os.Getenv. So if the asserted
  value is '1' at launch time, the dedup gate IS active in the
  cluster process — no log-grep guesswork.

Design doc updates (docs/design/2026_05_21_..._idempotency.md)
==============================================================

§M4 expanded with:

- Local reproduction script reference (make jepsen-redis with the env
  var on).
- 7-day criterion specifically tied to the new workflow file name.
- Workflow scope rationale (Redis-only is intentional, not an
  oversight).
- Gate assertion mechanism (env at job level, fail-fast on '!=1').

Caller audit (per /loop semantic-change rule)
=============================================

No Go code changed. This is pure infrastructure: a new workflow file
and a doc update. No production behavior change, no new test coverage
on existing Go callers.

Validation
==========

- actionlint .github/workflows/jepsen-test-scheduled-dedup.yml clean.
- The workflow's lein / go-mod-download / cache steps mirror the
  existing scheduled workflow line for line, so cache invalidation
  semantics are unchanged.
---
 .../workflows/jepsen-test-scheduled-dedup.yml | 193 ++++++++++++++++++
 ...5_21_proposed_txn_secondary_idempotency.md |  27 ++-
 2 files changed, 217 insertions(+), 3 deletions(-)
 create mode 100644 .github/workflows/jepsen-test-scheduled-dedup.yml

diff --git a/.github/workflows/jepsen-test-scheduled-dedup.yml b/.github/workflows/jepsen-test-scheduled-dedup.yml
new file mode 100644
index 000000000..fb89525bb
--- /dev/null
+++ b/.github/workflows/jepsen-test-scheduled-dedup.yml
@@ -0,0 +1,193 @@
+# Jepsen Scheduled Stress Test — Option-2 Dedup Mode
+#
+# Daily run with ELASTICKV_REDIS_ONEPHASE_DEDUP=1 so the demo cluster
+# exercises the option-2 idempotency path. The criterion for design doc
+# §M4 closure is 7 consecutive days of green runs with no
+# :duplicate-elements / :G-single-item-realtime anomalies in the Redis
+# workload's analysis output.
+#
+# Scope: Redis workload only. The dedup feature only ships behind the
+# Redis adapter's onePhaseTxnDedup flag (RPUSH/LPUSH, MULTI/EXEC,
+# standalone SET); DynamoDB / S3 / SQS do not route through the dedup
+# loop, so re-running them here would add hours of CI for no signal
+# on the new code path.
+#
+# Cadence: 03:17 UTC daily (off-peak; non-zero minute per ScheduleWakeup
+# guidance). The general 6-hourly scheduled workflow continues to run
+# without the dedup gate so the legacy path also stays covered.
+
+on:
+  schedule:
+    - cron: '17 3 * * *'
+  workflow_dispatch:
+    inputs:
+      time-limit:
+        description: "Workload runtime seconds"
+        required: false
+        default: "300"
+      rate:
+        description: "Ops/sec per worker"
+        required: false
+        default: "10"
+      concurrency:
+        description: "Number of worker threads"
+        required: false
+        default: "8"
+      key-count:
+        description: "Number of distinct keys"
+        required: false
+        default: "16"
+      max-writes-per-key:
+        description: "Maximum writes per key before exhaustion"
+        required: false
+        default: "250"
+      max-txn-length:
+        description: "Maximum micro-ops per transaction"
+        required: false
+        default: "4"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}-jepsen-dedup-scheduled
+
+name: Jepsen Scheduled Stress Test (Option-2 Dedup)
+permissions:
+  contents: read
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    env:
+      GOCACHE: /tmp/go-build
+      # Enable the Redis adapter option-2 dedup gate for this run. This
+      # is the load-bearing differentiator from the general scheduled
+      # workflow — the demo cluster's redis adapter routes RPUSH/LPUSH,
+      # MULTI/EXEC, and standalone SET through runTransactionWithDedup,
+      # exercising the FSM exact-ts probe and the reusable retry
+      # state. Anomalies in :duplicate-elements / :G-single-item-realtime
+      # under this flag indicate a regression in option-2 plumbing.
+      ELASTICKV_REDIS_ONEPHASE_DEDUP: "1"
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          submodules: recursive
+      - uses: actions/setup-java@v5
+        with:
+          distribution: temurin
+          java-version: '21'
+      - uses: actions/setup-go@v6
+        with:
+          go-version-file: 'go.mod'
+      - name: Install netcat and graphviz
+        run: sudo apt-get update && sudo apt-get install -y netcat-openbsd graphviz
+      - name: Install Leiningen
+        run: |
+          curl -L https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein > ~/lein
+          chmod +x ~/lein
+          ~/lein version
+      - name: Cache Maven and Leiningen artifacts
+        uses: actions/cache@v5
+        with:
+          path: |
+            ~/.m2/repository
+            ~/.lein
+          key: ${{ runner.os }}-maven-${{ hashFiles('jepsen/project.clj') }}
+          restore-keys: |
+            ${{ runner.os }}-maven-
+      - name: Pre-fetch Go modules
+        run: |
+          mkdir -p "$GOCACHE" /tmp/go-tmp
+          export GOCACHE GOTMPDIR=/tmp/go-tmp
+          go mod download
+      - name: Warm Leiningen Maven cache
+        working-directory: jepsen
+        run: |
+          set -uo pipefail
+          n=0
+          until [ "$n" -ge 3 ]; do
+            ~/lein deps && break
+            n=$((n+1))
+            sleep_secs=$((15 * n))
+            echo "lein deps failed (attempt $n); sleeping ${sleep_secs}s before retry..."
+            sleep "$sleep_secs"
+          done
+      - name: Launch demo cluster (dedup gate ON)
+        run: |
+          set -euo pipefail
+          mkdir -p "$GOCACHE" /tmp/go-tmp
+          export GOTMPDIR=/tmp/go-tmp
+          # The ELASTICKV_REDIS_ONEPHASE_DEDUP=1 env var is inherited
+          # from the job env above. demo.go reads it via the redis
+          # server's WithOnePhaseTxnDedup option wired in
+          # adapter/redis.go NewRedisServer.
+          nohup go run cmd/server/demo.go > /tmp/elastickv-demo.log 2>&1 &
+          echo $! > /tmp/elastickv-demo.pid
+
+          echo "ELASTICKV_REDIS_ONEPHASE_DEDUP=${ELASTICKV_REDIS_ONEPHASE_DEDUP}"
+          # The env var is set at the JOB level above and inherited by
+          # all `run:` steps; nothing in demo.go can intercept or unset
+          # it before NewRedisServer reads os.Getenv. So if the env var
+          # is "1" here, the dedup gate IS active in the cluster. We
+          # print it explicitly so a failed run's log makes the
+          # configuration unambiguous (vs the general 6-hourly workflow
+          # whose runs would have an empty value here).
+          if [ "${ELASTICKV_REDIS_ONEPHASE_DEDUP:-}" != "1" ]; then
+            echo "FATAL: ELASTICKV_REDIS_ONEPHASE_DEDUP is not '1' — this workflow runs only with the dedup gate on"
+            exit 2
+          fi
+
+          echo "Waiting for redis listeners (63791-63793)..."
+          for i in {1..90}; do
+            if nc -z 127.0.0.1 63791 && nc -z 127.0.0.1 63792 && nc -z 127.0.0.1 63793; then
+              echo "Cluster is up"
+              exit 0
+            fi
+            sleep 1
+          done
+
+          echo "Demo cluster failed to start; dumping log:"
+          tail -n 200 /tmp/elastickv-demo.log || true
+          exit 1
+      - name: Run Redis Jepsen workload (dedup mode) against elastickv
+        working-directory: jepsen
+        timeout-minutes: 10
+        run: |
+          timeout 480 ~/lein run -m elastickv.redis-workload \
+            --time-limit ${{ inputs.time-limit || '300' }} \
+            --rate ${{ inputs.rate || '10' }} \
+            --concurrency ${{ inputs.concurrency || '8' }} \
+            --key-count ${{ inputs.key-count || '16' }} \
+            --max-writes-per-key ${{ inputs.max-writes-per-key || '250' }} \
+            --max-txn-length ${{ inputs.max-txn-length || '4' }} \
+            --ports 63791,63792,63793 \
+            --host 127.0.0.1
+      - name: Dump demo cluster log on failure
+        if: failure()
+        run: |
+          echo "=== first 200 lines (startup) ==="
+          head -n 200 /tmp/elastickv-demo.log || true
+          echo "=== last 1000 lines (most recent activity) ==="
+          tail -n 1000 /tmp/elastickv-demo.log || true
+          echo "=== full log line count ==="
+          wc -l /tmp/elastickv-demo.log || true
+      - name: Upload demo cluster log on failure
+        if: failure()
+        uses: actions/upload-artifact@v7
+        with:
+          name: elastickv-demo-log-dedup
+          path: /tmp/elastickv-demo.log
+          retention-days: 14
+          if-no-files-found: warn
+      - name: Upload Jepsen store on failure
+        if: failure()
+        uses: actions/upload-artifact@v7
+        with:
+          name: jepsen-store-redis-dedup
+          path: jepsen/store
+          retention-days: 14
+      - name: Stop demo cluster
+        if: always()
+        run: |
+          if [ -f /tmp/elastickv-demo.pid ]; then
+            pid=$(cat /tmp/elastickv-demo.pid)
+            kill "$pid" 2>/dev/null || true
+            wait "$pid" 2>/dev/null || true
+          fi
diff --git a/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md b/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
index c463ceb03..b60f388d6 100644
--- a/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
+++ b/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
@@ -538,9 +538,30 @@ preserves availability and adds correctness.
 - Local Jepsen reproduction. Because the trigger is election churn
   (see "Resolved" below), reproduce by running the 3-node demo under
   CPU pressure or with shortened election timeouts so leadership
-  flaps during the workload.
-- Scheduled Jepsen run goes 7 consecutive days without
-  `:duplicate-elements` / `:G-single-item-realtime`.
+  flaps during the workload. Local script: `make jepsen-redis` against
+  `cmd/server/demo.go` with `ELASTICKV_REDIS_ONEPHASE_DEDUP=1`.
+- **Scheduled Jepsen run criterion.** 7 consecutive days without
+  `:duplicate-elements` / `:G-single-item-realtime` in the dedup-mode
+  workflow (`.github/workflows/jepsen-test-scheduled-dedup.yml`,
+  daily at 03:17 UTC). The general scheduled workflow
+  (`jepsen-test-scheduled.yml`, every 6 h) continues to run *without*
+  the gate so the legacy path stays covered — both must stay green
+  for option-2 to be safe to default-on.
+- **Workflow scope rationale.** The dedup-mode workflow exercises only
+  the Redis workload. The dedup feature ships behind the Redis
+  adapter's `onePhaseTxnDedup` flag (RPUSH/LPUSH via
+  `listPushCoreWithDedup`, MULTI/EXEC via `runTransactionWithDedup`,
+  standalone SET via single-mop EXEC routing); DynamoDB / S3 / SQS do
+  not route through the dedup loop, so re-running them under the gate
+  would add hours of CI for zero signal on the new code path.
+- **Demo cluster gate confirmation.** The launch step asserts
+  `ELASTICKV_REDIS_ONEPHASE_DEDUP=1` before waiting on the listeners.
+  The env var is set at the workflow job level and inherited by every
+  `run:` step — nothing in `demo.go` can intercept or unset it before
+  `NewRedisServer` reads `os.Getenv`. A misconfigured workflow (e.g.
+  the env var dropped during a careless edit) exits non-zero
+  immediately rather than producing a clean run that would prove
+  nothing about the dedup code path.

 Scope estimate: M1–M3 are adapter + one `store` helper + a one-field
 one-phase request change (~250 LOC Go + tests), no FSM dedup table, no GC.

From d3f91452b668efa8c6e33acca476f8ac73e7da4d Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sat, 30 May 2026 16:40:38 +0900
Subject: [PATCH 2/3] fix(docs): use full workflow path for jepsen-test-scheduled.yml

gemini PR #889 MEDIUM: consistency with the new dedup workflow on
line 539 which uses .github/workflows/... prefix. The bare filename
next line was a typo from the diff.

Caller audit: pure documentation; no Go callers.
---
 docs/design/2026_05_21_proposed_txn_secondary_idempotency.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md b/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
index b60f388d6..98dd6d4b1 100644
--- a/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
+++ b/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
@@ -544,7 +544,7 @@ preserves availability and adds correctness.
   `:duplicate-elements` / `:G-single-item-realtime` in the dedup-mode
   workflow (`.github/workflows/jepsen-test-scheduled-dedup.yml`,
   daily at 03:17 UTC). The general scheduled workflow
-  (`jepsen-test-scheduled.yml`, every 6 h) continues to run *without*
+  (`.github/workflows/jepsen-test-scheduled.yml`, every 6 h) continues to run *without*
   the gate so the legacy path stays covered — both must stay green
   for option-2 to be safe to default-on.
 - **Workflow scope rationale.** The dedup-mode workflow exercises only

From 448fad0a9da3891be20de4f0018731321811e5df Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sat, 30 May 2026 17:04:03 +0900
Subject: [PATCH 3/3] fix(ci): lein-deps retry loop must exit non-zero on exhaustion

claude[bot] PR #889 blocking finding: the previous loop shape

  until [ "$n" -ge 3 ]; do ~/lein deps && break; done

exited on iteration count, not lein success. When all attempts failed
the last executed command was sleep (exit 0), so the step reported
green despite no dependencies being warmed -- a transient Maven
Central outage would have silently produced a cluster running the
Jepsen workload without a warmed cache, masking the dedup signal.

Replaced with the pattern from jepsen-test-scheduled.yml:

  until ~/lein deps; do ...; if [ "$n" -ge "$max" ]; then exit 1; fi; ...; done

Loop now exits on lein success, otherwise reaches the explicit exit 1
once max retries are hit. Backoff also aligned to 30*n seconds for
parity (previously 15*n).

Caller audit: pure shell-script change; no Go code touched, no
semantic change to existing callers. actionlint clean.
---
 .../workflows/jepsen-test-scheduled-dedup.yml | 24 +++++++++++++++----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/jepsen-test-scheduled-dedup.yml b/.github/workflows/jepsen-test-scheduled-dedup.yml
index fb89525bb..d7970b5a0 100644
--- a/.github/workflows/jepsen-test-scheduled-dedup.yml
+++ b/.github/workflows/jepsen-test-scheduled-dedup.yml
@@ -100,13 +100,27 @@ jobs:
       - name: Warm Leiningen Maven cache
         working-directory: jepsen
         run: |
+          # Matches the retry pattern used in jepsen-test-scheduled.yml so
+          # both workflows fail the step (not silently succeed) when Maven
+          # Central exhausts the retry budget. The previous shape
+          # `until [ "$n" -ge 3 ]; do ~/lein deps && break; done` exited
+          # the loop on the iteration count rather than on lein-deps
+          # success; when every attempt failed the loop terminated with
+          # the last command being `sleep` (exit 0), reporting the step
+          # as green despite no dependencies being warmed -- claude[bot]
+          # PR #889 blocking finding. Backoff also aligned to 30*n
+          # seconds for parity with the general workflow.
           set -uo pipefail
           n=0
-          until [ "$n" -ge 3 ]; do
-            ~/lein deps && break
-            n=$((n+1))
-            sleep_secs=$((15 * n))
-            echo "lein deps failed (attempt $n); sleeping ${sleep_secs}s before retry..."
+          max=3
+          until ~/lein deps; do
+            n=$((n + 1))
+            if [ "$n" -ge "$max" ]; then
+              echo "lein deps failed after $n attempts" >&2
+              exit 1
+            fi
+            sleep_secs=$((n * 30))
+            echo "lein deps failed (attempt $n/$max), retrying in ${sleep_secs}s..." >&2
             sleep "$sleep_secs"
           done
       - name: Launch demo cluster (dedup gate ON)
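
The exit-status difference between the two loop shapes in this hunk can be reproduced outside CI. The following is a minimal sketch under stated assumptions, not the workflow's code: `always_fail` is a hypothetical stand-in for `~/lein deps` during an outage, `:` replaces the real `sleep`, and `return 1` replaces the workflow's `exit 1` so that both shapes can run in one shell.

```shell
# Hypothetical stand-in for `~/lein deps` when every attempt fails.
always_fail() { return 1; }

# Old shape: the loop condition is the attempt counter, so the loop ends
# after three failures and its status is that of the last body command.
old_loop() {
  local n=0
  until [ "$n" -ge 3 ]; do
    always_fail && break
    n=$((n + 1))
    :  # stand-in for `sleep "$sleep_secs"`, which exits 0
  done
}

# New shape: the loop condition is the command itself, and exhausting the
# retry budget takes an explicit failure path.
new_loop() {
  local n=0 max=3
  until always_fail; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      return 1  # the workflow uses `exit 1` here
    fi
    :  # stand-in for the backoff sleep
  done
}

old_status=0; old_loop || old_status=$?
new_status=0; new_loop || new_status=$?
echo "old=$old_status new=$new_status"
```

With a command that never succeeds, this prints `old=0 new=1`: the old shape reports success because an `until` loop returns the status of the last command run in its body, while the new shape surfaces the failure to the step.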