test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness#898
Conversation
…orRaftReadiness Structural fix for the recurring leader-churn flake pattern. Multiple commits over the past months have absorbed post-readiness leader churn per-test by wrapping the first write in retry helpers: 5ba2ef9 test(redis): retry initial writes on 'not leader' to absorb CI leader churn (added doEventually / rpushEventually / lpushEventually) 2b6e8f2 test(adapter): widen raft election budget + retry gRPC on leader churn 78b9d09 test(grpc): retry leader churn in consistency loops 0fb7db7 test(redis): fix flaky TestRedis_LuaRPopLPushBullMQLikeLists ba3f5bf test(redis): wait for follower apply catch-up in MultiExec list tests The root cause is the same in every case: waitForStableLeader (the existing helper) only proves leadership *appears* stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as "etcd raft engine is not leader" on the test's first write. This commit closes the post-readiness window at the readiness layer instead of asking every new test author to remember to use *Eventually helpers. After waitForStableLeader returns, drive one no-op Raft entry through full quorum and retry on transient leader errors until it commits. Any subsequent test write inherits a leader that has already held quorum through at least one apply. Mechanism ========= The probe payload is a 9-byte HLC-lease entry with ceiling=0: [0x02][0x00 * 8] ^ raftEncodeHLCLease kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64 ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0 a true no-op at the state-machine layer (no HLC mutation, no storage write, no encryption-band dispatch). The entry still traverses the full propose -> append -> replicate -> commit -> apply pipeline, so a successful Propose proves the leader can actually drive quorum. waitForWriteableLeader retries the propose on transient leader-unavailable errors (the same set isTransientNotLeaderErr already classifies) up to the caller-supplied waitTimeout (10 s at the current createNode call site). Per-propose timeout is 2 s — generous enough to absorb scheduler jitter on a loaded runner, tight enough that a genuinely-stuck leader fails fast instead of consuming the entire 10 s budget on one attempt. Caller audit per /loop semantic-change rule ============================================ - waitForRaftReadiness: single production call site (test_util.go createNode); behavior change is additive — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry. - waitForWriteableLeader / raftReadinessProbePayload: package- private to adapter/test_util.go; no external callers. Verified by grep across the repo. - applyHLCLease (kv/fsm.go): receives one extra invocation per test that calls createNode. Behavior on ceiling=0 is documented no-op (kv/fsm.go:185 `if ceilingMs > 0`). No state divergence across replicas — every FSM applies the same payload. - Side effects on the engine / FSM: - pendingApplyIdx advances by 1 (normal flow) - encryption applier untouched (opcode 0x02 is not in the encryption band) - snapshots include one extra applied index (already varies by test timing) - lease provider does not track HLC entries - Per-test wall-clock impact: one extra Raft commit on startup, ~10 ms on a stable cluster. 29 test files use createNode, so total CI overhead is ~290 ms. Negligible. Validation ========== go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.4s go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 -> ok 5.6s gofmt + go vet + golangci-lint -> 0 issues Follow-up note ============== Existing per-test *Eventually wrappers (rpushEventually, lpushEventually) remain valid for tests that issue writes long after createNode, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases. Subsequent commits can audit whether tests still calling rpushEventually / lpushEventually right after createNode can drop back to direct calls now that the startup window is closed at the readiness layer.
|
Warning Review limit reached
More reviews will be available in 24 minutes and 2 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a waitForWriteableLeader helper in adapter/test_util.go to ensure that the Raft leader can successfully commit a no-op entry before completing readiness checks, preventing post-readiness test failures. Feedback points out two issues in the retry loop: the hardcoded proposal timeout can exceed the overall deadline, and context-related errors (such as context.DeadlineExceeded) are not classified as transient, which could cause premature test failures. A code suggestion is provided to bound the proposal timeout and handle context errors appropriately.
| deadline := time.Now().Add(timeout) | ||
| const perProposeTimeout = 2 * time.Second | ||
| var lastErr error | ||
| for { | ||
| proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout) | ||
| _, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload) | ||
| cancel() | ||
| if err == nil { | ||
| return | ||
| } | ||
| lastErr = err | ||
| if !isTransientNotLeaderErr(err) { | ||
| t.Fatalf("readiness probe failed with non-transient error: %v", err) | ||
| } |
There was a problem hiding this comment.
There are two issues with the current retry loop implementation:
- Context Timeout Bounding: The
perProposeTimeoutis hardcoded to2 * time.Second. If the remaining time until the overalldeadlineis less than 2 seconds, theProposecall can block past the overalltimeoutbudget. - Transient Error Classification: If a single proposal times out (e.g., under heavy CI load or during leader stabilization),
Proposereturnscontext.DeadlineExceeded. SinceisTransientNotLeaderErrdoes not classifycontext.DeadlineExceededorcontext.Canceledas transient, the test will immediately fail viat.Fatalfinstead of retrying up to the overalltimeoutbudget. This defeats the purpose of the retry loop and can introduce new CI flakes.
We should bound the proposal timeout to the remaining budget and treat context errors as transient/retryable.
deadline := time.Now().Add(timeout)
const perProposeTimeout = 2 * time.Second
var lastErr error
for {
remaining := time.Until(deadline)
currTimeout := perProposeTimeout
if remaining < currTimeout {
currTimeout = remaining
}
proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout)
_, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload)
cancel()
if err == nil {
return
}
lastErr = err
if !isTransientNotLeaderErr(err) && !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
t.Fatalf("readiness probe failed with non-transient error: %v", err)
}References
- Ensure that synchronous, potentially blocking, or indefinitely retrying operations executed within a polling loop are wrapped with a bounded timeout context. This prevents the polling/watcher goroutine from blocking indefinitely during failures (e.g., network partitions), allowing the loop to continue ticking, re-evaluate state changes, and maintain a safe fail-closed posture.
… classify ctx errors as transient gemini HIGH (adapter/test_util.go:373): 1. Per-propose timeout could exceed outer budget. Hardcoded perProposeTimeout=2s combined with the outer waitTimeout (10s at createNode) means a propose started within the last <2s of the budget could block past the caller deadline. Fix: currTimeout = min(perProposeTimeout, remaining). 2. context.DeadlineExceeded not classified as transient. The base isTransientNotLeaderErr (redis_retry.go) recognizes only raft-engine leader-unavailable signals, not ctx-timeout. A per-propose ctx that fires on a slow CI runner would have immediately t.Fatalf-ed instead of retrying within the outer budget — directly re-introducing the flake class the readiness probe is meant to eliminate. Widened with a probe-local isReadinessProbeRetryable that ORs context.DeadlineExceeded and context.Canceled. Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep waitForWriteableLeader under cyclop=10 after the new branch. Caller audit: all three new symbols (proposeReadinessOnce, isReadinessProbeRetryable, raftReadinessProbePayload) are package-private; sole caller is waitForWriteableLeader. Verified by grep across the repo. Validation: go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.2s go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok golangci-lint run -> 0 issues
|
@claude review Round 1 fixes (commit cd6a67e):
Caller audit: all 3 new symbols package-private, sole caller is go test -race -count=2 passes (7.2s); golangci-lint 0 issues. |
|
Claude finished @bootjp's task in 2m 36s —— View job Review of PR #898 — Round 1 fixes (cd6a67e)Tasks
SummaryBoth Gemini HIGH findings are correctly addressed. The logic is sound, the no-op payload is verified against the FSM, and the error chain handles wrapping correctly. A few minor observations below — none are blockers. Verification: no-op payloadConfirmed against // applyHLCLease — ceiling=0 path
ceilingMs := int64(binary.BigEndian.Uint64(data)) // = 0
if f.hlc != nil && ceilingMs > 0 { // guard fires → skipped
f.hlc.SetPhysicalCeiling(ceilingMs)
}
return nilThe 9-byte probe payload ( Gemini HIGH #1 — per-propose context bounding (
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd6a67e895
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if remaining <= 0 { | ||
| t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr) | ||
| } | ||
| err := proposeReadinessOnce(nodes[0].engine, remaining) |
There was a problem hiding this comment.
Re-elect node 0 before retrying the readiness probe
When the probe races with a real re-election where another node becomes leader, this loop keeps proposing only through nodes[0].engine; handleProposal returns errNotLeader whenever that engine is a follower, and nothing here re-runs ensureNodeZeroIsLeader or otherwise targets the new leader. In that startup-churn case the helper now spins until readiness probe never committed even though the cluster has recovered with a different leader, so the readiness gate can introduce a new createNode flake instead of riding out the re-election it is meant to handle.
Useful? React with 👍 / 👎.
…ments claude[bot] PR #898 minor #2: CLAUDE.md convention forbids referencing specific commits/symbols in code comments since they rot as the codebase evolves. Dropped the "5ba2ef94 / rpushEventually / lpushEventually" enumeration from both doc blocks; replaced with a structural pointer to "the per-test retry wrappers in this file" and a direction to the PR description for full history. The ground truth for the flake-fix lineage lives in the PR description, which is durable while file-local commit refs are not. No behavior change. No caller audit needed (comment-only).
|
@claude review Round 2 cleanup (commit 23a146a):
claude[bot] minors #1 (redundant deadline check) and #3 ( Comment-only change. |
|
Claude finished @bootjp's task in 1m 42s —— View job Review of PR #898 — Round 2 cleanup (23a146a)Tasks
SummaryRound 2 correctly addresses Minor #2 from the previous review. The targeted doc changes are clean, the convention violation is fixed, and the diff is comment-only — zero behavior change confirmed. Verification: targeted changes
Before (cd6a67e): After (23a146a): Correct. The specific commit hash and symbol-name enumeration are gone; the structural description is still informative and won't rot.
Before: After: Correct. The commit/symbol refs are removed. The "See the PR description" pointer is fine in principle (the PR description is the right home for flake-fix history, per CLAUDE.md), though "the PR description" without a number is slightly ambiguous once this is read out of PR context. Not a blocker — Residual nit (not in scope for Round 2, but same category)The Round 1 additions in // proposeReadinessOnce …
// per-attempt ctx cannot block past the outer deadline (gemini PR
// #898 HIGH point 1). …
// isReadinessProbeRetryable …
// context.Canceled is treated symmetrically (gemini
// PR #898 HIGH point 2).These are the same class of rot as the commit-hash refs — they'll be opaque to any reader who isn't looking at this PR. The WHY is still meaningful; only the attribution needs to go. Suggested replacements: // per-attempt ctx cannot block past the outer deadline when the
// overall budget is nearly exhausted.
// context.Canceled is treated symmetrically.Minor nit — not a blocker, and it's a separate touch from Round 2's stated scope. Could be addressed in a follow-up pass if desired. VerdictRound 2 is clean. CLAUDE.md Minor #2 fully resolved. The two residual |
…ot] residual nit codex P2 (adapter/test_util.go:367): When a real re-election makes another node leader during the readiness probe, nodes[0].engine.Propose returns errNotLeader on every attempt for the rest of the budget. The previous loop kept proposing through that follower engine until t.Fatalf fired — spuriously failing the readiness gate even though the cluster had recovered with a different leader. Fix: on transient not-leader errors, re-run ensureNodeZeroIsLeader before the next attempt. ensureNodeZeroIsLeader is idempotent in the steady state so this costs nothing when leadership has not actually moved. Gated on isTransientNotLeaderErr specifically (not the broader isReadinessProbeRetryable) so ctx-timeout cases do not pay for the re-nudge — they are likely just slow scheduler ticks, not leadership changes. Signature change: waitForWriteableLeader now takes (peers, waitInterval) so the probe loop can do leader re-nudges. Sole caller (waitForRaftReadiness in the same file) already had those values to hand; verified by grep. claude[bot] residual nit (round-2 review): Stripped "gemini PR #898 HIGH point 1/2" attribution refs from the proposeReadinessOnce / isReadinessProbeRetryable doc comments. Same CLAUDE.md convention as Minor #2 — the WHY is kept; the PR-process attribution rots. Validation: go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.9s go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok golangci-lint run -> 0 issues
|
@claude review Round 2 fixes (commit e73b625):
go test -race -count=2 passes (7.9s); golangci-lint 0 issues. |
|
Claude finished @bootjp's task in 2m 15s —— View job Review of PR #898 — Round 2 fixes (e73b625)Tasks
SummaryBoth changes are correct. The codex P2 concern is fully addressed, and the doc comment cleanup is complete. No blockers. Verification: codex P2 fixThe re-nudge is gated on Loop trace for the not-leader path:
Loop trace for the ctx-timeout path:
Both paths are correct. Verification: signature change
Verification: doc comment cleanup
Minor observations1. When 2. 50ms sleep after a successful After the re-nudge completes, the loop checks the deadline and then sleeps VerdictRound 2 is clean. Codex P2 fully resolved, CLAUDE.md attribution cleanup complete. The probe loop now handles both leadership-moved (re-nudge) and ctx-timeout (retry without re-nudge) paths correctly. Approved. |
… tests Third flake-root-cause PR, after PR #891 (SQS throttle refill) and PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake mechanism has been fixed per-test for months; this PR generalises the structural fix and migrates all extant sites. Recurring flake (PR #818 lineage): Tests that "set a short TTL, sleep past it, assert the key is gone" race their own wall clock under -race on slow CI runners. The post-sleep read can sometimes still observe an un-expired view because the inter-call pause does not absorb scheduler jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible) by switching to require.Eventually with a TTL-derived deadline. The other 8 sites in the test suite kept the fragile time.Sleep(ttl + 50ms) pattern. Structural fix ============== - New helper eventuallyExpired(t, ttl, condition, msg) in test_util.go. Encapsulates the require.Eventually + ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so future test writers reach for it instead of inventing fresh time.Sleep margins. - Migrated 8 sites to the helper: * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired, TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired. * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL, TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*. * redis_misskey_compat_test.go: post-EX-1 expiry assertion. - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go) convenience wrappers. Their sole callers were the migrated sites. - Added errors stdlib import to the three test files where the helper's condition closures use errors.Is. Caller audit per /loop semantic-change rule ============================================ - eventuallyExpired: new helper, 8 callers in 3 test files. All test-only. - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo confirms zero remaining call sites. - collectionFastPathTTL / luaFastPathTTL consts: still used by the migrated sites as the ttl argument to eventuallyExpired. Unchanged. Validation ========== go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s gofmt + go vet + golangci-lint run -> 0 issues Closes the "Redis TTL window deadline-multiplier" follow-up noted in PR #891's description.
… tests (#903) ## Summary Third flake-root-cause PR, after #891 (SQS throttle refill) and #898 (Raft readiness probe). Same recipe: a recurring CI-flake mechanism has been patched per-test for months; this PR generalises the structural fix and migrates the extant sites. ## Recurring flake (PR #818 lineage) Tests that *"set a short TTL, sleep past it, assert the key is gone"* race their own wall clock under `-race` on slow CI runners. The post-sleep read sometimes observes an un-expired view because the inter-call pause does not absorb scheduler jitter. [PR #818](#818) fixed one instance (`TestRedis_ExpiredKey_BecomesInvisible`) by switching to `require.Eventually` with a TTL-derived deadline. The other 8 sites kept the fragile `time.Sleep(ttl + 50ms)` pattern. ## Structural fix - **New helper** `eventuallyExpired(t, ttl, condition, msg)` in `test_util.go`. Encapsulates the `require.Eventually + ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms)` idiom so future test writers reach for it instead of inventing fresh `time.Sleep` margins. - **Migrated 8 sites** to the helper: - `redis_collection_fastpath_test.go`: `TestRedis_HGET_TTLExpired`, `TestRedis_HEXISTS_TTLExpired`, `TestRedis_SISMEMBER_TTLExpired` - `redis_lua_collection_fastpath_test.go`: `TestLua_HGET_HashTTL`, `TestLua_HEXISTS_HashTTL`, `TestLua_ZSCORE_*`, `TestLua_ZRANGEBYSCORE_*` - `redis_misskey_compat_test.go`: post-`EX 1` expiry assertion - **Removed unused** `waitForTTLExpiry` (redis_collection_fastpath_test.go) and `waitForLuaTTL` (redis_lua_collection_fastpath_test.go) convenience wrappers. Sole callers were the migrated sites. - Added `errors` stdlib import to the three test files where the helper's condition closures use `errors.Is`. ## Caller audit (per /loop semantic-change rule) - `eventuallyExpired`: new helper, 8 callers in 3 test files. All test-only. - `waitForTTLExpiry` / `waitForLuaTTL`: removed. Repo-wide grep confirms zero remaining call sites. - `collectionFastPathTTL` / `luaFastPathTTL` consts: still used by the migrated sites as the `ttl` argument to `eventuallyExpired`. Unchanged. ## Validation - `go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2` → ok 32.3s - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Scope Closes the *"Redis TTL window deadline-multiplier"* follow-up noted in [PR #891's description](#891). ## Out-of-scope (still on the flake follow-up backlog) - VM pool / cache hit-rate assertion → "don't assert on cache hit rates" rule. The structural lesson from commit [7783c27](7783c279) is to never assert on cache hit rates; needs a doc rather than a code change.
PR #898 added waitForWriteableLeader to waitForRaftReadiness, which drives one no-op Raft entry through full quorum before createNode returns. The post-readiness leader-churn window the rpushEventually / lpushEventually wrappers absorbed is now closed at the readiness layer instead. The wrappers are dead code. Changes: - Drop rpushEventually + lpushEventually from test_util.go (the only two callers, in redis_lua_compat_test.go, were first-write -after-createNode sites — replaced with direct RPush/LPush calls + a comment pointing back to PR #898). - Drop doEventually too: its only callers were the two helpers above. retryNotLeader (the underlying classifier loop) stays — grpc_test.go still uses it for *mid-test* leader churn inside its 9999-iteration consistency loops, which PR #898 does not address. - Drop the now-unused redis import from test_util.go. - Refresh the leaderChurnRetryTimeout / retryNotLeader doc comments to call out that startup-window churn is closed at the readiness layer, and retryNotLeader is for mid-test loops. Caller audit per /loop semantic-change rule ============================================ - rpushEventually / lpushEventually: grep across the repo confirms zero remaining call sites after the migration. Safe to delete. - doEventually: only called by the two helpers above; also safe to delete. - retryNotLeader: kept. Sole external caller is adapter/grpc_test.go (9 sites inside two consistency-loop tests). All those wrap mid-test RPCs that can race a real re-election under CI load — the readiness probe does not help mid-test churn, so the helper retains its purpose. Validation ========== go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.1s (the direct RPush/LPush calls inherit the writeable-leader gate from createNode, no flake) gofmt + go vet + golangci-lint run -> 0 issues
PR #898 added waitForWriteableLeader to waitForRaftReadiness, which drives one no-op Raft entry through full quorum before createNode returns. The post-readiness leader-churn window the rpushEventually / lpushEventually wrappers absorbed is now closed at the readiness layer instead. The wrappers are dead code. Changes: - Drop rpushEventually + lpushEventually from test_util.go (the only two callers, in redis_lua_compat_test.go, were first-write -after-createNode sites — replaced with direct RPush/LPush calls + a comment pointing back to PR #898). - Drop doEventually too: its only callers were the two helpers above. retryNotLeader (the underlying classifier loop) stays — grpc_test.go still uses it for *mid-test* leader churn inside its 9999-iteration consistency loops, which PR #898 does not address. - Drop the now-unused redis import from test_util.go. - Refresh the leaderChurnRetryTimeout / retryNotLeader doc comments to call out that startup-window churn is closed at the readiness layer, and retryNotLeader is for mid-test loops. Caller audit per /loop semantic-change rule ============================================ - rpushEventually / lpushEventually: grep across the repo confirms zero remaining call sites after the migration. Safe to delete. - doEventually: only called by the two helpers above; also safe to delete. - retryNotLeader: kept. Sole external caller is adapter/grpc_test.go (9 sites inside two consistency-loop tests). All those wrap mid-test RPCs that can race a real re-election under CI load — the readiness probe does not help mid-test churn, so the helper retains its purpose. Validation ========== go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.1s (the direct RPush/LPush calls inherit the writeable-leader gate from createNode, no flake) gofmt + go vet + golangci-lint run -> 0 issues
…#902) ## Summary Follow-up to PR #898. The readiness probe (`waitForWriteableLeader`) closes the startup leader-churn window at the readiness layer, so the per-test `rpushEventually` / `lpushEventually` wrappers that absorbed that window become dead code. ## What's dropped - `rpushEventually` + `lpushEventually` (`test_util.go`): only callers were `redis_lua_compat_test.go` lines 128–129 (first writes immediately after `createNode`). Migrated to direct `rdb.RPush` / `rdb.LPush` with a comment pointing back to PR #898. - `doEventually`: only used by the two helpers above; no other callers. - Unused `github.com/redis/go-redis/v9` import in `test_util.go`. ## What's kept - `retryNotLeader`: still used in `grpc_test.go` (9 sites) wrapping mid-test RPCs inside 9999-iteration consistency loops. The readiness probe does NOT help **mid-test** churn (a real re-election can fire halfway through a long-running loop), so this helper retains its purpose. The doc comment is refreshed to clarify the startup vs mid-test distinction. ## Caller audit (per /loop semantic-change rule) Repo-wide grep: - `rpushEventually\b` / `lpushEventually\b`: 0 callers remain (only historical comment refs in `redis_lua_compat_test.go`). - `doEventually\b`: 0 callers remain. - `retryNotLeader\b`: 9 callers, all in `grpc_test.go`. Unchanged. ## Validation - `go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3` → ok 5.1s (direct RPush/LPush inherits the writeable-leader gate from `createNode`; no flake) - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Scope Closes the "*Eventually helper unnecessization audit" follow-up noted in PR #898's description.
Summary
Structural fix for the recurring leader-churn CI flake pattern. Closes the post-readiness step-down window at the readiness layer instead of asking every new test author to remember to use
*Eventuallyhelpers.Companion to PR #891 (SQS throttle structural fix): same recipe — each per-test fix to a recurring flake mechanism leaves the same trap for the next test. PR #891 hardened the SQS throttle-refill subset; this PR hardens the leader-churn subset.
Recurring flake pattern (5+ commits this year)
5ba2ef94doEventually/rpushEventually/lpushEventually, wrapped 2 RPUSH/LPUSH calls inTestRedis_LuaRPopLPushBullMQLikeLists2b6e8f2878b9d0910fb7db76TestRedis_LuaRPopLPushBullMQLikeLists(re-fix)ba3f5bf8Root cause is the same:
waitForStableLeaderonly proves leadership appears stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as"etcd raft engine is not leader"on the test's first write.Mechanism
Add
waitForWriteableLeaderthat drives one no-op Raft entry through full quorum afterwaitForStableLeaderreturns:kv/fsm.go applyHLCLeasereadsdata[1:]as a big-endian uint64 ceiling. Theif ceilingMs > 0guard at line 185 makes ceiling=0 a true no-op (no HLC mutation, no storage write, no encryption dispatch).Proposeproves the leader can actually drive quorum.isTransientNotLeaderErralready classifies) up to caller-suppliedwaitTimeout(10s at the currentcreateNodecall site). Per-propose timeout is 2s.Wired into
waitForRaftReadinessas the final step.Caller audit (per /loop semantic-change rule)
waitForRaftReadiness: single production call site (createNode); behavior change is additive — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry.waitForWriteableLeader/raftReadinessProbePayload: package-private toadapter/test_util.go; no external callers (grep verified).applyHLCLease(kv/fsm.go): receives one extra invocation percreateNodecall. Behavior onceiling=0is documented no-op. No state divergence across replicas — every FSM applies the same payload.pendingApplyIdxadvances by 1 (normal flow)0x02not in encryption band)createNode→ ~290 ms total CI overhead. Negligible.Validation
go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3→ ok 5.4sgo test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2→ ok 5.6sgofmt+go vet+golangci-lint run→ 0 issuesFollow-up note
Existing per-test
*Eventuallywrappers (rpushEventually,lpushEventually) remain valid for tests that issue writes long aftercreateNode, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases.Subsequent commits can audit whether tests still calling
rpushEventually/lpushEventuallyimmediately aftercreateNodecan drop back to direct calls now that the startup window is closed at the readiness layer.Out-of-scope (remaining recurring flake categories)
Tracked in the PR #891 description; still to be addressed in follow-up PRs: