Jepsen Scheduled Stress (Redis workload) — duplicate-elements + future-read + G-single/G2-item-realtime on main
Summary
The Jepsen Scheduled Stress Test run #27033231956 (Redis list-append workload, scheduled cron on main) failed validation with a hard anomaly set. The prior scheduled stress run on main at 12:57 UTC the same day (#27016167818) passed.
The only commit that landed on main between the two scheduled runs is PR #928, which is confined to internal/backup/ (S3 backup encoder); it does not touch the Redis adapter, the Raft engine, the coordinator, or any path the Redis workload exercises. So this failure was not introduced by PR #928; it is a pre-existing edge case that the stress harness occasionally exposes.
Per-PR Jepsen runs (jepsen-test.yml) use --time-limit 5 --rate 5 --concurrency 5. The scheduled stress uses --time-limit 150 --rate 10 --concurrency 8 --key-count 16 --max-writes-per-key 250 --max-txn-length 4 — ~30× wall-clock time, 2× throughput per client, longer transactions. The duplicate / future-read patterns surface only at that scale.
:duplicate-elements
{:duplicates {187 2}} on op index 525, process 1, :f :txn reading key 9 = [1 2 3], appending [13 50], reading key 15. Element 187 appears twice in some list.
:future-read
op index 1073, process 18, transactional read returned a list whose tail contained values not yet appended in real-time at the read-start instant.
:G-single-item-realtime
first witness at op index 1667 (~84.2s into the run); second around 1669. SCC traversed; cycle confirmed.
:G2-item-realtime
op index 2797 (~137.7s into the run). Multi-key anti-dependency cycle.
:cycle-search-timeout
{:scc-size 495 :anomaly-spec-type :G-single-item-realtime} — Elle exhausted its budget exploring one SCC. Indicates the history has many candidate cycles to check, consistent with the realtime violations above.
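To make the {:duplicates {187 2}} entry concrete, here is a minimal Go sketch of the kind of check that yields such a map: count occurrences in one observed list and report any element seen more than once. This is an illustration only, not Elle's implementation, and the list contents are hypothetical because the report does not include the actual list.

```go
package main

import "fmt"

// duplicates returns every element that occurs more than once in an observed
// list, mapped to its occurrence count. The result has the same shape as the
// {:duplicates {187 2}} entry in the verdict.
func duplicates(list []int) map[int]int {
	counts := make(map[int]int, len(list))
	for _, v := range list {
		counts[v]++
	}
	dups := make(map[int]int)
	for v, n := range counts {
		if n > 1 {
			dups[v] = n
		}
	}
	return dups
}

func main() {
	// Hypothetical list contents; the real list from the failed run is not
	// shown in this report.
	observed := []int{185, 186, 187, 188, 187}
	fmt.Println(duplicates(observed)) // map[187:2]
}
```

In a list-append workload every appended value is unique per key, so any non-empty result here points at a write that was applied more than once.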
The two most diagnostic anomalies are :duplicate-elements and :future-read; neither is mere "noise" from real-time analysis:
:duplicate-elements in a list-append workload means the same value was appended twice to the same list. The standard cause is a client retry of a transaction that actually committed at the server but whose ACK was lost (timeout / connection close / leader change mid-RPC). Without an idempotency token or a matching commit-deduplication path, the retry produces a second commit.
:future-read means a successful read returned a value before that value's write started in wall-clock real time. Combined with :duplicate-elements, the most likely explanation is that the value committed twice by the retry is observable on a path that is not tied to the first write's commit timestamp.
Both are consistent with an ambiguous-commit window: the Redis adapter, the Lua bridge, or the txn dispatch returns an uncertain error (io.unknown) on the path back to the client and the client retries, but the original commit had already been applied. The result is a duplicate append, with the tail of the list reordered relative to other clients' reads.
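The ambiguous-commit window can be sketched in a few lines of Go. Everything here is hypothetical (the store type, errUnknown, appendWithRetry); it is not elastickv code and only models the suspected sequence: the write is applied, the ACK is lost, and a naive client retries.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknown stands in for an indeterminate result such as a timeout: the
// caller cannot tell whether the append was applied.
var errUnknown = errors.New("unknown: commit outcome indeterminate")

// store is a hypothetical single-key list server, not elastickv code.
type store struct {
	list    []int
	dropAck bool // when true, the next append commits but its ACK is lost
}

// appendValue always applies the write; it may only fail to report success.
func (s *store) appendValue(v int) error {
	s.list = append(s.list, v)
	if s.dropAck {
		s.dropAck = false
		return errUnknown
	}
	return nil
}

// appendWithRetry is a naive client that treats every error as "not applied".
func appendWithRetry(s *store, v, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = s.appendValue(v); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	s := &store{list: []int{185, 186}, dropAck: true}
	if err := appendWithRetry(s, 187, 3); err != nil {
		fmt.Println("append failed:", err)
	}
	fmt.Println(s.list) // [185 186 187 187]
}
```

The first attempt commits 187 and reports errUnknown; the retry commits it again, which is the pattern the :duplicate-elements entry would show.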
Reproducibility unknown — only two scheduled stress runs exist in the visible history; 12:57 passed, 18:37 failed. Recommend manually re-dispatching once or twice to confirm whether this is reliably reproducible or sporadic.
Suggested investigation order
Audit the Redis adapter's RPC-failure → client-error → retry boundary for any commit-applied-but-not-acked window. The relevant entry points are adapter/redis.go and the Lua dispatch (adapter/redis_lua_pool.go); the Raft propose ACK path is in kv/sharded_coordinator.go / kv/shard_store.go.
Check the EXEC reuse / dedup path (PR #46: runTransactionWithDedup, prepareDispatch, onePhaseTxnDedup). The dedup probe was meant to make exactly this class of retry idempotent for one-phase txns; verify the same protection covers the path that produced the duplicate here (:r 9 → :append 13 50 → :r 15).
Reproduce locally with ./scripts/run-jepsen-local.sh using the same stress parameters once it's clear which adapter code path is suspect. The stress params are at the top of jepsen-test-scheduled.yml — paste them into the local script.
If reproduced, add an Elle regression test under jepsen/ (or a Go-side targeted test under adapter/) that exercises the ambiguous-commit retry directly so the next regression is caught at PR time, not at scheduled-stress time.
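As a starting point for the Go-side targeted test, the property to assert is that a retry carrying the same transaction identifier is applied at most once, even when the first ACK is lost. The sketch below uses hypothetical names (dedupStore, appendOnce, retryAppend, the "txn-1" ID) and an in-memory map; it does not reflect how runTransactionWithDedup or onePhaseTxnDedup are actually implemented.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnknown = errors.New("unknown: commit outcome indeterminate")

// dedupStore is a hypothetical server that remembers applied transaction IDs.
type dedupStore struct {
	list    []int
	applied map[string]bool
	dropAck bool // when true, the next call loses its ACK after applying
}

func newDedupStore() *dedupStore {
	return &dedupStore{applied: make(map[string]bool)}
}

// appendOnce applies the write at most once per txnID. A retry of an
// already-applied txnID is acknowledged without appending again.
func (s *dedupStore) appendOnce(txnID string, v int) error {
	if !s.applied[txnID] {
		s.list = append(s.list, v)
		s.applied[txnID] = true
	}
	if s.dropAck {
		s.dropAck = false
		return errUnknown
	}
	return nil
}

// retryAppend reuses the same txnID on every attempt, so retries are idempotent.
func retryAppend(s *dedupStore, txnID string, v, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = s.appendOnce(txnID, v); err == nil {
			return nil
		}
	}
	return err
}

// countOf reports how many times v occurs in list.
func countOf(list []int, v int) int {
	n := 0
	for _, x := range list {
		if x == v {
			n++
		}
	}
	return n
}

func main() {
	s := newDedupStore()
	s.dropAck = true // first attempt commits, ACK is lost
	err := retryAppend(s, "txn-1", 187, 3)
	fmt.Println(err, countOf(s.list, 187)) // <nil> 1
}
```

A real test would inject the lost ACK at the adapter boundary and assert the same invariant against the actual dedup path.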
Verdict (from the failed run)
{:valid? false, :anomaly-types (:G-single-item-realtime :G2-item-realtime :cycle-search-timeout :duplicate-elements :future-read)}
Reproduction profile
main, HEAD 00a8d0bc ("backup: Phase 0b M4-2b impl - S3 collision-rename reversal (#928)").
.github/workflows/jepsen-test-scheduled.yml (display name Jepsen Scheduled Stress Test).
elastickv.redis-workload (list-append, Elle checker).
--time-limit 150 --rate 10 --concurrency 8 --key-count 16 --max-writes-per-key 250 --max-txn-length 4 --ports 63791,63792,63793 --host 127.0.0.1.
Out of scope
impl/snapshot-skip-b3-skip-gate (cold-start FSM snapshot-restore skip metrics plumbing): different branch, per-PR Jepsen passed, metrics-only change.
internal/backup/.
Links