From 85444daa3b2296a54e75d1db445a84966066bbfe Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 1 Jun 2026 20:30:10 +0900 Subject: [PATCH 1/4] test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Structural fix for the recurring leader-churn flake pattern. Multiple commits over the past months have absorbed post-readiness leader churn per-test by wrapping the first write in retry helpers: 5ba2ef94 test(redis): retry initial writes on 'not leader' to absorb CI leader churn (added doEventually / rpushEventually / lpushEventually) 2b6e8f28 test(adapter): widen raft election budget + retry gRPC on leader churn 78b9d091 test(grpc): retry leader churn in consistency loops 0fb7db76 test(redis): fix flaky TestRedis_LuaRPopLPushBullMQLikeLists ba3f5bf8 test(redis): wait for follower apply catch-up in MultiExec list tests The root cause is the same in every case: waitForStableLeader (the existing helper) only proves leadership *appears* stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as "etcd raft engine is not leader" on the test's first write. This commit closes the post-readiness window at the readiness layer instead of asking every new test author to remember to use *Eventually helpers. After waitForStableLeader returns, drive one no-op Raft entry through full quorum and retry on transient leader errors until it commits. Any subsequent test write inherits a leader that has already held quorum through at least one apply. Mechanism ========= The probe payload is a 9-byte HLC-lease entry with ceiling=0: [0x02][0x00 * 8] ^ raftEncodeHLCLease kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64 ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0 a true no-op at the state-machine layer (no HLC mutation, no storage write, no encryption-band dispatch). The entry still traverses the full propose -> append -> replicate -> commit -> apply pipeline, so a successful Propose proves the leader can actually drive quorum. waitForWriteableLeader retries the propose on transient leader-unavailable errors (the same set isTransientNotLeaderErr already classifies) up to the caller-supplied waitTimeout (10 s at the current createNode call site). Per-propose timeout is 2 s — generous enough to absorb scheduler jitter on a loaded runner, tight enough that a genuinely-stuck leader fails fast instead of consuming the entire 10 s budget on one attempt. Caller audit per /loop semantic-change rule ============================================ - waitForRaftReadiness: single production call site (test_util.go createNode); behavior change is additive — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry. - waitForWriteableLeader / raftReadinessProbePayload: package- private to adapter/test_util.go; no external callers. Verified by grep across the repo. - applyHLCLease (kv/fsm.go): receives one extra invocation per test that calls createNode. Behavior on ceiling=0 is documented no-op (kv/fsm.go:185 `if ceilingMs > 0`). No state divergence across replicas — every FSM applies the same payload. - Side effects on the engine / FSM: - pendingApplyIdx advances by 1 (normal flow) - encryption applier untouched (opcode 0x02 is not in the encryption band) - snapshots include one extra applied index (already varies by test timing) - lease provider does not track HLC entries - Per-test wall-clock impact: one extra Raft commit on startup, ~10 ms on a stable cluster. 29 test files use createNode, so total CI overhead is ~290 ms. Negligible. Validation ========== go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.4s go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 -> ok 5.6s gofmt + go vet + golangci-lint -> 0 issues Follow-up note ============== Existing per-test *Eventually wrappers (rpushEventually, lpushEventually) remain valid for tests that issue writes long after createNode, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases. Subsequent commits can audit whether tests still calling rpushEventually / lpushEventually right after createNode can drop back to direct calls now that the startup window is closed at the readiness layer. --- adapter/test_util.go | 64 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 64 insertions(+) diff --git a/adapter/test_util.go b/adapter/test_util.go index a31e16077..079a342a4 100644 --- a/adapter/test_util.go +++ b/adapter/test_util.go @@ -312,6 +312,70 @@ func waitForRaftReadiness(t *testing.T, nodes []Node, peers []raftengine.Server, // transient "not leader" errors. A stability window catches the // flip and loops until the leader actually holds. waitForStableLeader(t, nodes, peers, waitTimeout) + + // Prove the leader can actually COMMIT by driving one no-op + // entry through full Raft quorum before we hand the cluster + // to the test body. The 300 ms stability window above only + // proves leadership *appears* held; a leader can still step + // down on the first real heartbeat-quorum miss after that. + // 5ba2ef94 / rpushEventually / lpushEventually have been + // absorbing that post-readiness churn per-test for the past + // several months — this commits-one-entry probe closes the + // window at the readiness layer instead, so any subsequent + // write inherits a leader that has already held quorum + // through one apply. + waitForWriteableLeader(t, nodes, waitTimeout) +} + +// raftReadinessProbePayload is the 9-byte Raft entry waitForWriteableLeader +// proposes. The FSM dispatches on data[0]: 0x02 (raftEncodeHLCLease) selects +// applyHLCLease, which reads `data[1:]` as a big-endian uint64 physical +// ceiling. With ceiling=0 the function's `if ceilingMs > 0` guard skips the +// HLC update and returns nil — a true no-op at the state-machine layer, +// while the entry still goes through full Raft quorum so a successful +// Propose proves the leader holds. See kv/fsm.go applyHLCLease + the +// raftEncodeHLCLease comment. +var raftReadinessProbePayload = []byte{ + 0x02, // raftEncodeHLCLease + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op) +} + +// waitForWriteableLeader commits one no-op Raft entry through nodes[0]'s +// engine to prove the leader is not just *named* but actually able to +// drive quorum. Retries on transient leader-unavailable errors up to +// timeout — a re-election that races the probe is what this helper is +// designed to ride out. +// +// Without this step, waitForRaftReadiness would return after a 300 ms +// stability window during which leadership only *looks* stable; a +// leader that steps down on its first real heartbeat-quorum miss can +// surface as "etcd raft engine is not leader" on the test's first +// write. See commits 5ba2ef94 / rpushEventually / lpushEventually for +// the per-test work-arounds this readiness gate is meant to obsolete. +func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { + t.Helper() + if len(nodes) == 0 || nodes[0].engine == nil { + return + } + deadline := time.Now().Add(timeout) + const perProposeTimeout = 2 * time.Second + var lastErr error + for { + proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout) + _, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload) + cancel() + if err == nil { + return + } + lastErr = err + if !isTransientNotLeaderErr(err) { + t.Fatalf("readiness probe failed with non-transient error: %v", err) + } + if !time.Now().Before(deadline) { + t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr) + } + time.Sleep(leaderChurnRetryInterval) + } } // waitForStableLeader polls the cluster until nodes[0] has been the leader From cd6a67e895fb727c099626e0e6debaca08ff3c67 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 1 Jun 2026 20:49:33 +0900 Subject: [PATCH 2/4] =?UTF-8?q?fix(test-util):=20address=20gemini=20PR=20#?= =?UTF-8?q?898=20HIGH=20=E2=80=94=20bound=20per-propose=20ctx=20+=20classi?= =?UTF-8?q?fy=20ctx=20errors=20as=20transient?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit gemini HIGH (adapter/test_util.go:373): 1. Per-propose timeout could exceed outer budget. Hardcoded perProposeTimeout=2s combined with the outer waitTimeout (10s at createNode) means a propose started within the last <2s of the budget could block past the caller deadline. Fix: currTimeout = min(perProposeTimeout, remaining). 2. context.DeadlineExceeded not classified as transient. The base isTransientNotLeaderErr (redis_retry.go) recognizes only raft-engine leader-unavailable signals, not ctx-timeout. A per-propose ctx that fires on a slow CI runner would have immediately t.Fatalf-ed instead of retrying within the outer budget — directly re-introducing the flake class the readiness probe is meant to eliminate. Widened with a probe-local isReadinessProbeRetryable that ORs context.DeadlineExceeded and context.Canceled. Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep waitForWriteableLeader under cyclop=10 after the new branch. Caller audit: all three new symbols (proposeReadinessOnce, isReadinessProbeRetryable, raftReadinessProbePayload) are package-private; sole caller is waitForWriteableLeader. Verified by grep across the repo. Validation: go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.2s go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok golangci-lint run -> 0 issues --- adapter/test_util.go | 42 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 5 deletions(-) diff --git a/adapter/test_util.go b/adapter/test_util.go index 079a342a4..d25924962 100644 --- a/adapter/test_util.go +++ b/adapter/test_util.go @@ -358,17 +358,18 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { return } deadline := time.Now().Add(timeout) - const perProposeTimeout = 2 * time.Second var lastErr error for { - proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout) - _, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload) - cancel() + remaining := time.Until(deadline) + if remaining <= 0 { + t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr) + } + err := proposeReadinessOnce(nodes[0].engine, remaining) if err == nil { return } lastErr = err - if !isTransientNotLeaderErr(err) { + if !isReadinessProbeRetryable(err) { t.Fatalf("readiness probe failed with non-transient error: %v", err) } if !time.Now().Before(deadline) { @@ -378,6 +379,37 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { } } +// proposeReadinessOnce drives one no-op probe through the engine's +// Propose, bounded by min(perProposeTimeout, remaining) so the +// per-attempt ctx cannot block past the outer deadline (gemini PR +// #898 HIGH point 1). Extracted from waitForWriteableLeader to keep +// the outer loop under the cyclop budget; the retry/timeout +// rationale lives at the call site. +func proposeReadinessOnce(engine raftengine.Engine, remaining time.Duration) error { + const perProposeTimeout = 2 * time.Second + currTimeout := perProposeTimeout + if remaining < currTimeout { + currTimeout = remaining + } + proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout) + defer cancel() + _, err := engine.Propose(proposeCtx, raftReadinessProbePayload) + return err +} + +// isReadinessProbeRetryable widens isTransientNotLeaderErr for the +// readiness-probe loop: a per-propose ctx timeout under heavy CI +// load returns context.DeadlineExceeded, which the base classifier +// does NOT treat as transient. Without this widening the probe +// would t.Fatalf on a slow runner instead of retrying within its +// outer budget. context.Canceled is treated symmetrically (gemini +// PR #898 HIGH point 2). +func isReadinessProbeRetryable(err error) bool { + return isTransientNotLeaderErr(err) || + errors.Is(err, context.DeadlineExceeded) || + errors.Is(err, context.Canceled) +} + // waitForStableLeader polls the cluster until nodes[0] has been the leader // (and all other nodes agree on it as the leader address) for a continuous // stability window. If leadership flips during the window, the loop restarts. From 23a146a06574dc6d01343f4f524d34600cd61e7b Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 1 Jun 2026 20:57:51 +0900 Subject: [PATCH 3/4] fix(docs): strip commit-hash refs from waitForWriteableLeader doc comments claude[bot] PR #898 minor #2: CLAUDE.md convention forbids referencing specific commits/symbols in code comments since they rot as the codebase evolves. Dropped the "5ba2ef94 / rpushEventually / lpushEventually" enumeration from both doc blocks; replaced with a structural pointer to "the per-test retry wrappers in this file" and a direction to the PR description for full history. The ground truth for the flake-fix lineage lives in the PR description, which is durable while file-local commit refs are not. No behavior change. No caller audit needed (comment-only). --- adapter/test_util.go | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/adapter/test_util.go b/adapter/test_util.go index d25924962..b3b20e7b6 100644 --- a/adapter/test_util.go +++ b/adapter/test_util.go @@ -318,12 +318,11 @@ func waitForRaftReadiness(t *testing.T, nodes []Node, peers []raftengine.Server, // to the test body. The 300 ms stability window above only // proves leadership *appears* held; a leader can still step // down on the first real heartbeat-quorum miss after that. - // 5ba2ef94 / rpushEventually / lpushEventually have been - // absorbing that post-readiness churn per-test for the past - // several months — this commits-one-entry probe closes the - // window at the readiness layer instead, so any subsequent - // write inherits a leader that has already held quorum - // through one apply. + // The per-test retry wrappers in this file have been absorbing + // that post-readiness churn site-by-site; this commits-one-entry + // probe closes the window at the readiness layer instead, so any + // subsequent write inherits a leader that has already held + // quorum through one apply. waitForWriteableLeader(t, nodes, waitTimeout) } @@ -350,8 +349,9 @@ var raftReadinessProbePayload = []byte{ // stability window during which leadership only *looks* stable; a // leader that steps down on its first real heartbeat-quorum miss can // surface as "etcd raft engine is not leader" on the test's first -// write. See commits 5ba2ef94 / rpushEventually / lpushEventually for -// the per-test work-arounds this readiness gate is meant to obsolete. +// write — the failure mode the per-test retry wrappers in this file +// have been working around. See the PR description for the full +// flake-fix history this readiness gate is meant to obsolete. func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { t.Helper() if len(nodes) == 0 || nodes[0].engine == nil { From e73b625298213e6d25a51418cd9f6a1dfa6ffc91 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 1 Jun 2026 21:03:35 +0900 Subject: [PATCH 4/4] fix(test-util): address codex P2 (re-elect on probe retry) + claude[bot] residual nit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit codex P2 (adapter/test_util.go:367): When a real re-election makes another node leader during the readiness probe, nodes[0].engine.Propose returns errNotLeader on every attempt for the rest of the budget. The previous loop kept proposing through that follower engine until t.Fatalf fired — spuriously failing the readiness gate even though the cluster had recovered with a different leader. Fix: on transient not-leader errors, re-run ensureNodeZeroIsLeader before the next attempt. ensureNodeZeroIsLeader is idempotent in the steady state so this costs nothing when leadership has not actually moved. Gated on isTransientNotLeaderErr specifically (not the broader isReadinessProbeRetryable) so ctx-timeout cases do not pay for the re-nudge — they are likely just slow scheduler ticks, not leadership changes. Signature change: waitForWriteableLeader now takes (peers, waitInterval) so the probe loop can do leader re-nudges. Sole caller (waitForRaftReadiness in the same file) already had those values to hand; verified by grep. claude[bot] residual nit (round-2 review): Stripped "gemini PR #898 HIGH point 1/2" attribution refs from the proposeReadinessOnce / isReadinessProbeRetryable doc comments. Same CLAUDE.md convention as Minor #2 — the WHY is kept; the PR-process attribution rots. Validation: go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.9s go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok golangci-lint run -> 0 issues --- adapter/test_util.go | 34 ++++++++++++++++++++++++++-------- 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/adapter/test_util.go b/adapter/test_util.go index b3b20e7b6..ba88cfefb 100644 --- a/adapter/test_util.go +++ b/adapter/test_util.go @@ -323,7 +323,13 @@ func waitForRaftReadiness(t *testing.T, nodes []Node, peers []raftengine.Server, // probe closes the window at the readiness layer instead, so any // subsequent write inherits a leader that has already held // quorum through one apply. - waitForWriteableLeader(t, nodes, waitTimeout) + // + // `peers` and `waitInterval` are passed through so the probe loop + // can re-nudge leadership back to nodes[0] when a real re-election + // elects a different node mid-probe — without that, the loop + // would keep proposing through a follower engine until the budget + // expires and spuriously fail the readiness gate. + waitForWriteableLeader(t, nodes, peers, waitTimeout, waitInterval) } // raftReadinessProbePayload is the 9-byte Raft entry waitForWriteableLeader @@ -352,7 +358,7 @@ var raftReadinessProbePayload = []byte{ // write — the failure mode the per-test retry wrappers in this file // have been working around. See the PR description for the full // flake-fix history this readiness gate is meant to obsolete. -func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { +func waitForWriteableLeader(t *testing.T, nodes []Node, peers []raftengine.Server, timeout, waitInterval time.Duration) { t.Helper() if len(nodes) == 0 || nodes[0].engine == nil { return @@ -372,6 +378,19 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { if !isReadinessProbeRetryable(err) { t.Fatalf("readiness probe failed with non-transient error: %v", err) } + // If the per-attempt error was specifically "not leader" (vs + // a ctx timeout), a real re-election has elected a different + // node since waitForStableLeader returned. Keep aiming the + // probe at nodes[0] by re-running ensureNodeZeroIsLeader + // before the next attempt; otherwise the loop would spin on + // a follower engine until the budget expires and spuriously + // fail the readiness gate even though the cluster has + // recovered with a different leader. ensureNodeZeroIsLeader + // is idempotent in the steady state, so this costs nothing + // when leadership has not actually moved. + if isTransientNotLeaderErr(err) && len(peers) > 0 { + ensureNodeZeroIsLeader(t, nodes, peers, time.Until(deadline), waitInterval) + } if !time.Now().Before(deadline) { t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr) } @@ -381,10 +400,10 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) { // proposeReadinessOnce drives one no-op probe through the engine's // Propose, bounded by min(perProposeTimeout, remaining) so the -// per-attempt ctx cannot block past the outer deadline (gemini PR -// #898 HIGH point 1). Extracted from waitForWriteableLeader to keep -// the outer loop under the cyclop budget; the retry/timeout -// rationale lives at the call site. +// per-attempt ctx cannot block past the outer deadline when the +// overall budget is nearly exhausted. Extracted from +// waitForWriteableLeader to keep the outer loop under the cyclop +// budget; the retry/timeout rationale lives at the call site. func proposeReadinessOnce(engine raftengine.Engine, remaining time.Duration) error { const perProposeTimeout = 2 * time.Second currTimeout := perProposeTimeout @@ -402,8 +421,7 @@ func proposeReadinessOnce(engine raftengine.Engine, remaining time.Duration) err // load returns context.DeadlineExceeded, which the base classifier // does NOT treat as transient. Without this widening the probe // would t.Fatalf on a slow runner instead of retrying within its -// outer budget. context.Canceled is treated symmetrically (gemini -// PR #898 HIGH point 2). +// outer budget. context.Canceled is treated symmetrically. func isReadinessProbeRetryable(err error) bool { return isTransientNotLeaderErr(err) || errors.Is(err, context.DeadlineExceeded) ||