From 85444daa3b2296a54e75d1db445a84966066bbfe Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Mon, 1 Jun 2026 20:30:10 +0900
Subject: [PATCH 1/4] test(util): commit one no-op entry to prove leader
 writeable in waitForRaftReadiness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Structural fix for the recurring leader-churn flake pattern. Multiple
commits over the past months have absorbed post-readiness leader
churn per-test by wrapping the first write in retry helpers:

  5ba2ef94  test(redis): retry initial writes on 'not leader' to absorb CI leader churn
            (added doEventually / rpushEventually / lpushEventually)
  2b6e8f28  test(adapter): widen raft election budget + retry gRPC on leader churn
  78b9d091  test(grpc): retry leader churn in consistency loops
  0fb7db76  test(redis): fix flaky TestRedis_LuaRPopLPushBullMQLikeLists
  ba3f5bf8  test(redis): wait for follower apply catch-up in MultiExec list tests

The root cause is the same in every case: waitForStableLeader (the
existing helper) only proves leadership *appears* stable for a 300 ms
window. The freshly-elected leader can still step down on its first
real heartbeat-quorum miss after that window closes, surfacing as
"etcd raft engine is not leader" on the test's first write.

This commit closes the post-readiness window at the readiness layer
instead of asking every new test author to remember to use *Eventually
helpers. After waitForStableLeader returns, drive one no-op Raft
entry through full quorum and retry on transient leader errors until
it commits. Any subsequent test write inherits a leader that has
already held quorum through at least one apply.

Mechanism
=========

The probe payload is a 9-byte HLC-lease entry with ceiling=0:

  [0x02][0x00 * 8]
   ^ raftEncodeHLCLease

kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64
ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0
a true no-op at the state-machine layer (no HLC mutation, no
storage write, no encryption-band dispatch). The entry still
traverses the full propose -> append -> replicate -> commit ->
apply pipeline, so a successful Propose proves the leader can
actually drive quorum.

waitForWriteableLeader retries the propose on transient
leader-unavailable errors (the same set isTransientNotLeaderErr
already classifies) up to the caller-supplied waitTimeout
(10 s at the current createNode call site). Per-propose timeout
is 2 s — generous enough to absorb scheduler jitter on a loaded
runner, tight enough that a genuinely-stuck leader fails fast
instead of consuming the entire 10 s budget on one attempt.

Caller audit per /loop semantic-change rule
============================================

  - waitForRaftReadiness: single production call site
    (test_util.go createNode); behavior change is additive — same
    semantics PLUS the writeable-leader gate. Tests that previously
    raced into a step-down window now block until the leader has
    committed one entry.

  - waitForWriteableLeader / raftReadinessProbePayload: package-
    private to adapter/test_util.go; no external callers. Verified
    by grep across the repo.

  - applyHLCLease (kv/fsm.go): receives one extra invocation per
    test that calls createNode. Behavior on ceiling=0 is documented
    no-op (kv/fsm.go:185 `if ceilingMs > 0`). No state divergence
    across replicas — every FSM applies the same payload.

  - Side effects on the engine / FSM:
    - pendingApplyIdx advances by 1 (normal flow)
    - encryption applier untouched (opcode 0x02 is not in the
      encryption band)
    - snapshots include one extra applied index (already varies by
      test timing)
    - lease provider does not track HLC entries

  - Per-test wall-clock impact: one extra Raft commit on startup,
    ~10 ms on a stable cluster. 29 test files use createNode, so
    total CI overhead is ~290 ms. Negligible.

Validation
==========

  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.4s
  go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 -> ok 5.6s
  gofmt + go vet + golangci-lint -> 0 issues

Follow-up note
==============

Existing per-test *Eventually wrappers (rpushEventually,
lpushEventually) remain valid for tests that issue writes long
after createNode, where mid-test leader churn under CI load can
still hit. The readiness probe targets the startup window
specifically — it does not obsolete the helpers for other use
cases.

Subsequent commits can audit whether tests still calling
rpushEventually / lpushEventually right after createNode can drop
back to direct calls now that the startup window is closed at
the readiness layer.
---
 adapter/test_util.go | 64 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/adapter/test_util.go b/adapter/test_util.go
index a31e16077..079a342a4 100644
--- a/adapter/test_util.go
+++ b/adapter/test_util.go
@@ -312,6 +312,70 @@ func waitForRaftReadiness(t *testing.T, nodes []Node, peers []raftengine.Server,
 	// transient "not leader" errors. A stability window catches the
 	// flip and loops until the leader actually holds.
 	waitForStableLeader(t, nodes, peers, waitTimeout)
+
+	// Prove the leader can actually COMMIT by driving one no-op
+	// entry through full Raft quorum before we hand the cluster
+	// to the test body. The 300 ms stability window above only
+	// proves leadership *appears* held; a leader can still step
+	// down on the first real heartbeat-quorum miss after that.
+	// 5ba2ef94 / rpushEventually / lpushEventually have been
+	// absorbing that post-readiness churn per-test for the past
+	// several months — this commits-one-entry probe closes the
+	// window at the readiness layer instead, so any subsequent
+	// write inherits a leader that has already held quorum
+	// through one apply.
+	waitForWriteableLeader(t, nodes, waitTimeout)
+}
+
+// raftReadinessProbePayload is the 9-byte Raft entry waitForWriteableLeader
+// proposes. The FSM dispatches on data[0]: 0x02 (raftEncodeHLCLease) selects
+// applyHLCLease, which reads `data[1:]` as a big-endian uint64 physical
+// ceiling. With ceiling=0 the function's `if ceilingMs > 0` guard skips the
+// HLC update and returns nil — a true no-op at the state-machine layer,
+// while the entry still goes through full Raft quorum so a successful
+// Propose proves the leader holds. See kv/fsm.go applyHLCLease + the
+// raftEncodeHLCLease comment.
+var raftReadinessProbePayload = []byte{
+	0x02,                                           // raftEncodeHLCLease
+	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op)
+}
+
+// waitForWriteableLeader commits one no-op Raft entry through nodes[0]'s
+// engine to prove the leader is not just *named* but actually able to
+// drive quorum. Retries on transient leader-unavailable errors up to
+// timeout — a re-election that races the probe is what this helper is
+// designed to ride out.
+//
+// Without this step, waitForRaftReadiness would return after a 300 ms
+// stability window during which leadership only *looks* stable; a
+// leader that steps down on its first real heartbeat-quorum miss can
+// surface as "etcd raft engine is not leader" on the test's first
+// write. See commits 5ba2ef94 / rpushEventually / lpushEventually for
+// the per-test work-arounds this readiness gate is meant to obsolete.
+func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
+	t.Helper()
+	if len(nodes) == 0 || nodes[0].engine == nil {
+		return
+	}
+	deadline := time.Now().Add(timeout)
+	const perProposeTimeout = 2 * time.Second
+	var lastErr error
+	for {
+		proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout)
+		_, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload)
+		cancel()
+		if err == nil {
+			return
+		}
+		lastErr = err
+		if !isTransientNotLeaderErr(err) {
+			t.Fatalf("readiness probe failed with non-transient error: %v", err)
+		}
+		if !time.Now().Before(deadline) {
+			t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr)
+		}
+		time.Sleep(leaderChurnRetryInterval)
+	}
 }
 
 // waitForStableLeader polls the cluster until nodes[0] has been the leader

From cd6a67e895fb727c099626e0e6debaca08ff3c67 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Mon, 1 Jun 2026 20:49:33 +0900
Subject: [PATCH 2/4] =?UTF-8?q?fix(test-util):=20address=20gemini=20PR=20#?=
 =?UTF-8?q?898=20HIGH=20=E2=80=94=20bound=20per-propose=20ctx=20+=20classi?=
 =?UTF-8?q?fy=20ctx=20errors=20as=20transient?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

gemini HIGH (adapter/test_util.go:373):

1. Per-propose timeout could exceed outer budget. Hardcoded
   perProposeTimeout=2s combined with the outer waitTimeout (10s
   at createNode) means a propose started within the last <2s
   of the budget could block past the caller deadline. Fix:
   currTimeout = min(perProposeTimeout, remaining).

2. context.DeadlineExceeded not classified as transient. The
   base isTransientNotLeaderErr (redis_retry.go) recognizes only
   raft-engine leader-unavailable signals, not ctx-timeout. A
   per-propose ctx that fires on a slow CI runner would have
   immediately t.Fatalf-ed instead of retrying within the outer
   budget — directly re-introducing the flake class the readiness
   probe is meant to eliminate. Widened with a probe-local
   isReadinessProbeRetryable that ORs context.DeadlineExceeded
   and context.Canceled.

Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep
waitForWriteableLeader under cyclop=10 after the new branch.

Caller audit: all three new symbols (proposeReadinessOnce,
isReadinessProbeRetryable, raftReadinessProbePayload) are
package-private; sole caller is waitForWriteableLeader.
Verified by grep across the repo.

Validation:
  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.2s
  go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok
  golangci-lint run -> 0 issues
---
 adapter/test_util.go | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/adapter/test_util.go b/adapter/test_util.go
index 079a342a4..d25924962 100644
--- a/adapter/test_util.go
+++ b/adapter/test_util.go
@@ -358,17 +358,18 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
 		return
 	}
 	deadline := time.Now().Add(timeout)
-	const perProposeTimeout = 2 * time.Second
 	var lastErr error
 	for {
-		proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout)
-		_, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload)
-		cancel()
+		remaining := time.Until(deadline)
+		if remaining <= 0 {
+			t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr)
+		}
+		err := proposeReadinessOnce(nodes[0].engine, remaining)
 		if err == nil {
 			return
 		}
 		lastErr = err
-		if !isTransientNotLeaderErr(err) {
+		if !isReadinessProbeRetryable(err) {
 			t.Fatalf("readiness probe failed with non-transient error: %v", err)
 		}
 		if !time.Now().Before(deadline) {
@@ -378,6 +379,37 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
 	}
 }
 
+// proposeReadinessOnce drives one no-op probe through the engine's
+// Propose, bounded by min(perProposeTimeout, remaining) so the
+// per-attempt ctx cannot block past the outer deadline (gemini PR
+// #898 HIGH point 1). Extracted from waitForWriteableLeader to keep
+// the outer loop under the cyclop budget; the retry/timeout
+// rationale lives at the call site.
+func proposeReadinessOnce(engine raftengine.Engine, remaining time.Duration) error {
+	const perProposeTimeout = 2 * time.Second
+	currTimeout := perProposeTimeout
+	if remaining < currTimeout {
+		currTimeout = remaining
+	}
+	proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout)
+	defer cancel()
+	_, err := engine.Propose(proposeCtx, raftReadinessProbePayload)
+	return err
+}
+
+// isReadinessProbeRetryable widens isTransientNotLeaderErr for the
+// readiness-probe loop: a per-propose ctx timeout under heavy CI
+// load returns context.DeadlineExceeded, which the base classifier
+// does NOT treat as transient. Without this widening the probe
+// would t.Fatalf on a slow runner instead of retrying within its
+// outer budget. context.Canceled is treated symmetrically (gemini
+// PR #898 HIGH point 2).
+func isReadinessProbeRetryable(err error) bool {
+	return isTransientNotLeaderErr(err) ||
+		errors.Is(err, context.DeadlineExceeded) ||
+		errors.Is(err, context.Canceled)
+}
+
 // waitForStableLeader polls the cluster until nodes[0] has been the leader
 // (and all other nodes agree on it as the leader address) for a continuous
 // stability window. If leadership flips during the window, the loop restarts.

From 23a146a06574dc6d01343f4f524d34600cd61e7b Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Mon, 1 Jun 2026 20:57:51 +0900
Subject: [PATCH 3/4] fix(docs): strip commit-hash refs from
 waitForWriteableLeader doc comments

claude[bot] PR #898 minor #2: CLAUDE.md convention forbids referencing
specific commits/symbols in code comments since they rot as the
codebase evolves. Dropped the "5ba2ef94 / rpushEventually /
lpushEventually" enumeration from both doc blocks; replaced with a
structural pointer to "the per-test retry wrappers in this file"
and a direction to the PR description for full history.

The ground truth for the flake-fix lineage lives in the PR
description, which is durable while file-local commit refs are not.

No behavior change. No caller audit needed (comment-only).
---
 adapter/test_util.go | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/adapter/test_util.go b/adapter/test_util.go
index d25924962..b3b20e7b6 100644
--- a/adapter/test_util.go
+++ b/adapter/test_util.go
@@ -318,12 +318,11 @@ func waitForRaftReadiness(t *testing.T, nodes []Node, peers []raftengine.Server,
 	// to the test body. The 300 ms stability window above only
 	// proves leadership *appears* held; a leader can still step
 	// down on the first real heartbeat-quorum miss after that.
-	// 5ba2ef94 / rpushEventually / lpushEventually have been
-	// absorbing that post-readiness churn per-test for the past
-	// several months — this commits-one-entry probe closes the
-	// window at the readiness layer instead, so any subsequent
-	// write inherits a leader that has already held quorum
-	// through one apply.
+	// The per-test retry wrappers in this file have been absorbing
+	// that post-readiness churn site-by-site; this commits-one-entry
+	// probe closes the window at the readiness layer instead, so any
+	// subsequent write inherits a leader that has already held
+	// quorum through one apply.
 	waitForWriteableLeader(t, nodes, waitTimeout)
 }
 
@@ -350,8 +349,9 @@ var raftReadinessProbePayload = []byte{
 // stability window during which leadership only *looks* stable; a
 // leader that steps down on its first real heartbeat-quorum miss can
 // surface as "etcd raft engine is not leader" on the test's first
-// write. See commits 5ba2ef94 / rpushEventually / lpushEventually for
-// the per-test work-arounds this readiness gate is meant to obsolete.
+// write — the failure mode the per-test retry wrappers in this file
+// have been working around. See the PR description for the full
+// flake-fix history this readiness gate is meant to obsolete.
 func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
 	t.Helper()
 	if len(nodes) == 0 || nodes[0].engine == nil {

From e73b625298213e6d25a51418cd9f6a1dfa6ffc91 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Mon, 1 Jun 2026 21:03:35 +0900
Subject: [PATCH 4/4] fix(test-util): address codex P2 (re-elect on probe
 retry) + claude[bot] residual nit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

codex P2 (adapter/test_util.go:367):
  When a real re-election makes another node leader during the
  readiness probe, nodes[0].engine.Propose returns errNotLeader on
  every attempt for the rest of the budget. The previous loop kept
  proposing through that follower engine until t.Fatalf fired —
  spuriously failing the readiness gate even though the cluster
  had recovered with a different leader.

  Fix: on transient not-leader errors, re-run ensureNodeZeroIsLeader
  before the next attempt. ensureNodeZeroIsLeader is idempotent in
  the steady state so this costs nothing when leadership has not
  actually moved. Gated on isTransientNotLeaderErr specifically
  (not the broader isReadinessProbeRetryable) so ctx-timeout cases
  do not pay for the re-nudge — they are likely just slow scheduler
  ticks, not leadership changes.

  Signature change: waitForWriteableLeader now takes (peers,
  waitInterval) so the probe loop can do leader re-nudges. Sole
  caller (waitForRaftReadiness in the same file) already had those
  values to hand; verified by grep.

claude[bot] residual nit (round-2 review):
  Stripped "gemini PR #898 HIGH point 1/2" attribution refs from
  the proposeReadinessOnce / isReadinessProbeRetryable doc
  comments. Same CLAUDE.md convention as Minor #2 — the WHY is
  kept; the PR-process attribution rots.

Validation:
  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.9s
  go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok
  golangci-lint run -> 0 issues
---
 adapter/test_util.go | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/adapter/test_util.go b/adapter/test_util.go
index b3b20e7b6..ba88cfefb 100644
--- a/adapter/test_util.go
+++ b/adapter/test_util.go
@@ -323,7 +323,13 @@ func waitForRaftReadiness(t *testing.T, nodes []Node, peers []raftengine.Server,
 	// probe closes the window at the readiness layer instead, so any
 	// subsequent write inherits a leader that has already held
 	// quorum through one apply.
-	waitForWriteableLeader(t, nodes, waitTimeout)
+	//
+	// `peers` and `waitInterval` are passed through so the probe loop
+	// can re-nudge leadership back to nodes[0] when a real re-election
+	// elects a different node mid-probe — without that, the loop
+	// would keep proposing through a follower engine until the budget
+	// expires and spuriously fail the readiness gate.
+	waitForWriteableLeader(t, nodes, peers, waitTimeout, waitInterval)
 }
 
 // raftReadinessProbePayload is the 9-byte Raft entry waitForWriteableLeader
@@ -352,7 +358,7 @@ var raftReadinessProbePayload = []byte{
 // write — the failure mode the per-test retry wrappers in this file
 // have been working around. See the PR description for the full
 // flake-fix history this readiness gate is meant to obsolete.
-func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
+func waitForWriteableLeader(t *testing.T, nodes []Node, peers []raftengine.Server, timeout, waitInterval time.Duration) {
 	t.Helper()
 	if len(nodes) == 0 || nodes[0].engine == nil {
 		return
@@ -372,6 +378,19 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
 		if !isReadinessProbeRetryable(err) {
 			t.Fatalf("readiness probe failed with non-transient error: %v", err)
 		}
+		// If the per-attempt error was specifically "not leader" (vs
+		// a ctx timeout), a real re-election has elected a different
+		// node since waitForStableLeader returned. Keep aiming the
+		// probe at nodes[0] by re-running ensureNodeZeroIsLeader
+		// before the next attempt; otherwise the loop would spin on
+		// a follower engine until the budget expires and spuriously
+		// fail the readiness gate even though the cluster has
+		// recovered with a different leader. ensureNodeZeroIsLeader
+		// is idempotent in the steady state, so this costs nothing
+		// when leadership has not actually moved.
+		if isTransientNotLeaderErr(err) && len(peers) > 0 {
+			ensureNodeZeroIsLeader(t, nodes, peers, time.Until(deadline), waitInterval)
+		}
 		if !time.Now().Before(deadline) {
 			t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr)
 		}
@@ -381,10 +400,10 @@ func waitForWriteableLeader(t *testing.T, nodes []Node, timeout time.Duration) {
 
 // proposeReadinessOnce drives one no-op probe through the engine's
 // Propose, bounded by min(perProposeTimeout, remaining) so the
-// per-attempt ctx cannot block past the outer deadline (gemini PR
-// #898 HIGH point 1). Extracted from waitForWriteableLeader to keep
-// the outer loop under the cyclop budget; the retry/timeout
-// rationale lives at the call site.
+// per-attempt ctx cannot block past the outer deadline when the
+// overall budget is nearly exhausted. Extracted from
+// waitForWriteableLeader to keep the outer loop under the cyclop
+// budget; the retry/timeout rationale lives at the call site.
 func proposeReadinessOnce(engine raftengine.Engine, remaining time.Duration) error {
 	const perProposeTimeout = 2 * time.Second
 	currTimeout := perProposeTimeout
@@ -402,8 +421,7 @@ func proposeReadinessOnce(engine raftengine.Engine, remaining time.Duration) err
 // load returns context.DeadlineExceeded, which the base classifier
 // does NOT treat as transient. Without this widening the probe
 // would t.Fatalf on a slow runner instead of retrying within its
-// outer budget. context.Canceled is treated symmetrically (gemini
-// PR #898 HIGH point 2).
+// outer budget. context.Canceled is treated symmetrically.
 func isReadinessProbeRetryable(err error) bool {
 	return isTransientNotLeaderErr(err) ||
 		errors.Is(err, context.DeadlineExceeded) ||