test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness by bootjp · Pull Request #898 · bootjp/elastickv

bootjp · 2026-06-01T11:30:44Z

Summary

Structural fix for the recurring leader-churn CI flake pattern. Closes the post-readiness step-down window at the readiness layer instead of asking every new test author to remember to use *Eventually helpers.

Companion to PR #891 (SQS throttle structural fix): same recipe — each per-test fix to a recurring flake mechanism leaves the same trap for the next test. PR #891 hardened the SQS throttle-refill subset; this PR hardens the leader-churn subset.

Recurring flake pattern (5+ commits this year)

Commit	What it touched
`5ba2ef94`	added `doEventually` / `rpushEventually` / `lpushEventually`, wrapped 2 RPUSH/LPUSH calls in `TestRedis_LuaRPopLPushBullMQLikeLists`
`2b6e8f28`	widened raft election budget + retried gRPC on leader churn
`78b9d091`	retried leader churn in consistency loops
`0fb7db76`	fixed flaky `TestRedis_LuaRPopLPushBullMQLikeLists` (re-fix)
`ba3f5bf8`	waited for follower apply catch-up in MultiExec

Root cause is the same: waitForStableLeader only proves leadership appears stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as "etcd raft engine is not leader" on the test's first write.

Mechanism

Add waitForWriteableLeader that drives one no-op Raft entry through full quorum after waitForStableLeader returns:

var raftReadinessProbePayload = []byte{
    0x02,                                           // raftEncodeHLCLease
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op)
}

State-machine no-op: kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64 ceiling. The if ceilingMs > 0 guard at line 185 makes ceiling=0 a true no-op (no HLC mutation, no storage write, no encryption dispatch).
Quorum-proving: the entry still traverses the full propose → append → replicate → commit → apply pipeline, so a successful Propose proves the leader can actually drive quorum.
Retry semantics: retries on transient leader-unavailable errors (same set isTransientNotLeaderErr already classifies) up to caller-supplied waitTimeout (10s at the current createNode call site). Per-propose timeout is 2s.

Wired into waitForRaftReadiness as the final step.

Caller audit (per /loop semantic-change rule)

waitForRaftReadiness: single production call site (createNode); behavior change is additive — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry.
waitForWriteableLeader / raftReadinessProbePayload: package-private to adapter/test_util.go; no external callers (grep verified).
applyHLCLease (kv/fsm.go): receives one extra invocation per createNode call. Behavior on ceiling=0 is documented no-op. No state divergence across replicas — every FSM applies the same payload.
Side effects:
- pendingApplyIdx advances by 1 (normal flow)
- encryption applier untouched (opcode 0x02 not in encryption band)
- snapshots include one extra applied index (already varies by test timing)
- lease provider does not track HLC entries
Per-test wall-clock: ~10 ms extra on a stable cluster. 29 test files use createNode → ~290 ms total CI overhead. Negligible.

Validation

go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 → ok 5.4s
go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 → ok 5.6s
gofmt + go vet + golangci-lint run → 0 issues

Follow-up note

Existing per-test *Eventually wrappers (rpushEventually, lpushEventually) remain valid for tests that issue writes long after createNode, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases.

Subsequent commits can audit whether tests still calling rpushEventually / lpushEventually immediately after createNode can drop back to direct calls now that the startup window is closed at the readiness layer.

Out-of-scope (remaining recurring flake categories)

Tracked in the PR #891 description; still to be addressed in follow-up PRs:

Redis TTL window racing → per-test deadline-multiplier helper
VM pool / cache hit-rate assertions → "don't assert on cache hit rates" rule

…orRaftReadiness Structural fix for the recurring leader-churn flake pattern. Multiple commits over the past months have absorbed post-readiness leader churn per-test by wrapping the first write in retry helpers: 5ba2ef9 test(redis): retry initial writes on 'not leader' to absorb CI leader churn (added doEventually / rpushEventually / lpushEventually) 2b6e8f2 test(adapter): widen raft election budget + retry gRPC on leader churn 78b9d09 test(grpc): retry leader churn in consistency loops 0fb7db7 test(redis): fix flaky TestRedis_LuaRPopLPushBullMQLikeLists ba3f5bf test(redis): wait for follower apply catch-up in MultiExec list tests The root cause is the same in every case: waitForStableLeader (the existing helper) only proves leadership *appears* stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as "etcd raft engine is not leader" on the test's first write. This commit closes the post-readiness window at the readiness layer instead of asking every new test author to remember to use *Eventually helpers. After waitForStableLeader returns, drive one no-op Raft entry through full quorum and retry on transient leader errors until it commits. Any subsequent test write inherits a leader that has already held quorum through at least one apply. Mechanism ========= The probe payload is a 9-byte HLC-lease entry with ceiling=0: [0x02][0x00 * 8] ^ raftEncodeHLCLease kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64 ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0 a true no-op at the state-machine layer (no HLC mutation, no storage write, no encryption-band dispatch). The entry still traverses the full propose -> append -> replicate -> commit -> apply pipeline, so a successful Propose proves the leader can actually drive quorum. waitForWriteableLeader retries the propose on transient leader-unavailable errors (the same set isTransientNotLeaderErr already classifies) up to the caller-supplied waitTimeout (10 s at the current createNode call site). Per-propose timeout is 2 s — generous enough to absorb scheduler jitter on a loaded runner, tight enough that a genuinely-stuck leader fails fast instead of consuming the entire 10 s budget on one attempt. Caller audit per /loop semantic-change rule ============================================ - waitForRaftReadiness: single production call site (test_util.go createNode); behavior change is additive — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry. - waitForWriteableLeader / raftReadinessProbePayload: package- private to adapter/test_util.go; no external callers. Verified by grep across the repo. - applyHLCLease (kv/fsm.go): receives one extra invocation per test that calls createNode. Behavior on ceiling=0 is documented no-op (kv/fsm.go:185 `if ceilingMs > 0`). No state divergence across replicas — every FSM applies the same payload. - Side effects on the engine / FSM: - pendingApplyIdx advances by 1 (normal flow) - encryption applier untouched (opcode 0x02 is not in the encryption band) - snapshots include one extra applied index (already varies by test timing) - lease provider does not track HLC entries - Per-test wall-clock impact: one extra Raft commit on startup, ~10 ms on a stable cluster. 29 test files use createNode, so total CI overhead is ~290 ms. Negligible. Validation ========== go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.4s go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 -> ok 5.6s gofmt + go vet + golangci-lint -> 0 issues Follow-up note ============== Existing per-test *Eventually wrappers (rpushEventually, lpushEventually) remain valid for tests that issue writes long after createNode, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases. Subsequent commits can audit whether tests still calling rpushEventually / lpushEventually right after createNode can drop back to direct calls now that the startup window is closed at the readiness layer.

coderabbitai · 2026-06-01T11:30:52Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 24 minutes and 2 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c39c03a-2fde-4899-bccc-9b838b5d7cfc

📥 Commits

Reviewing files that changed from the base of the PR and between 2b118a4 and e73b625.

📒 Files selected for processing (1)

adapter/test_util.go

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch leader-churn-helper

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a waitForWriteableLeader helper in adapter/test_util.go to ensure that the Raft leader can successfully commit a no-op entry before completing readiness checks, preventing post-readiness test failures. Feedback points out two issues in the retry loop: the hardcoded proposal timeout can exceed the overall deadline, and context-related errors (such as context.DeadlineExceeded) are not classified as transient, which could cause premature test failures. A code suggestion is provided to bound the proposal timeout and handle context errors appropriately.

gemini-code-assist · 2026-06-01T11:31:53Z

+	deadline := time.Now().Add(timeout)
+	const perProposeTimeout = 2 * time.Second
+	var lastErr error
+	for {
+		proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout)
+		_, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload)
+		cancel()
+		if err == nil {
+			return
+		}
+		lastErr = err
+		if !isTransientNotLeaderErr(err) {
+			t.Fatalf("readiness probe failed with non-transient error: %v", err)
+		}


There are two issues with the current retry loop implementation:

Context Timeout Bounding: The perProposeTimeout is hardcoded to 2 * time.Second. If the remaining time until the overall deadline is less than 2 seconds, the Propose call can block past the overall timeout budget.

Transient Error Classification: If a single proposal times out (e.g., under heavy CI load or during leader stabilization), Propose returns context.DeadlineExceeded. Since isTransientNotLeaderErr does not classify context.DeadlineExceeded or context.Canceled as transient, the test will immediately fail via t.Fatalf instead of retrying up to the overall timeout budget. This defeats the purpose of the retry loop and can introduce new CI flakes.

We should bound the proposal timeout to the remaining budget and treat context errors as transient/retryable.

deadline := time.Now().Add(timeout) const perProposeTimeout = 2 * time.Second var lastErr error for { remaining := time.Until(deadline) currTimeout := perProposeTimeout if remaining < currTimeout { currTimeout = remaining } proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout) _, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload) cancel() if err == nil { return } lastErr = err if !isTransientNotLeaderErr(err) && !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) { t.Fatalf("readiness probe failed with non-transient error: %v", err) }

References

Ensure that synchronous, potentially blocking, or indefinitely retrying operations executed within a polling loop are wrapped with a bounded timeout context. This prevents the polling/watcher goroutine from blocking indefinitely during failures (e.g., network partitions), allowing the loop to continue ticking, re-evaluate state changes, and maintain a safe fail-closed posture.

… classify ctx errors as transient gemini HIGH (adapter/test_util.go:373): 1. Per-propose timeout could exceed outer budget. Hardcoded perProposeTimeout=2s combined with the outer waitTimeout (10s at createNode) means a propose started within the last <2s of the budget could block past the caller deadline. Fix: currTimeout = min(perProposeTimeout, remaining). 2. context.DeadlineExceeded not classified as transient. The base isTransientNotLeaderErr (redis_retry.go) recognizes only raft-engine leader-unavailable signals, not ctx-timeout. A per-propose ctx that fires on a slow CI runner would have immediately t.Fatalf-ed instead of retrying within the outer budget — directly re-introducing the flake class the readiness probe is meant to eliminate. Widened with a probe-local isReadinessProbeRetryable that ORs context.DeadlineExceeded and context.Canceled. Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep waitForWriteableLeader under cyclop=10 after the new branch. Caller audit: all three new symbols (proposeReadinessOnce, isReadinessProbeRetryable, raftReadinessProbePayload) are package-private; sole caller is waitForWriteableLeader. Verified by grep across the repo. Validation: go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.2s go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok golangci-lint run -> 0 issues

bootjp · 2026-06-01T11:49:45Z

@claude review

Round 1 fixes (commit cd6a67e):

gemini HIGH point 1: Per-propose ctx is now bounded by min(perProposeTimeout, remaining). A propose started within the last <2 s of the outer budget can no longer block past the caller deadline.
gemini HIGH point 2: context.DeadlineExceeded + context.Canceled are now treated as transient via a probe-local isReadinessProbeRetryable widener. A per-propose ctx timeout on a slow CI runner now retries within the outer budget instead of immediately fataling — which would have directly re-introduced the flake class this probe is meant to eliminate.
Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep waitForWriteableLeader under cyclop=10 budget.

Caller audit: all 3 new symbols package-private, sole caller is waitForWriteableLeader.

go test -race -count=2 passes (7.2s); golangci-lint 0 issues.

claude · 2026-06-01T11:50:02Z

Claude finished @bootjp's task in 2m 36s —— View job

Review of PR #898 — Round 1 fixes (`cd6a67e`)

Tasks

Load tools
Read adapter/test_util.go changes
Verify FSM no-op claim (kv/fsm.go)
Verify error wrapping / errors.Is chain
Review code for correctness, concurrency safety, and edge cases
Post findings

Summary

Both Gemini HIGH findings are correctly addressed. The logic is sound, the no-op payload is verified against the FSM, and the error chain handles wrapping correctly. A few minor observations below — none are blockers.

Verification: no-op payload

Confirmed against kv/fsm.go:272-280:

// applyHLCLease — ceiling=0 path
ceilingMs := int64(binary.BigEndian.Uint64(data))  // = 0
if f.hlc != nil && ceilingMs > 0 {                  // guard fires → skipped
    f.hlc.SetPhysicalCeiling(ceilingMs)
}
return nil

The 9-byte probe payload (0x02 + 8 zero bytes) is a true no-op at the state-machine layer while still traversing the full propose→replicate→commit→apply pipeline. Payload length exactly matches hlcLeasePayloadLen = 8 for data[1:]. Correct.

Gemini HIGH #1 — per-propose context bounding (`proposeReadinessOnce`)

currTimeout := perProposeTimeout
if remaining < currTimeout {
    currTimeout = remaining
}
proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout)

Fix is correct. context.Background() as the base is appropriate here since the timeout is already bounded by remaining. When remaining is nearly zero, currTimeout becomes tiny → Propose returns immediately with context.DeadlineExceeded → isReadinessProbeRetryable returns true → the top-of-loop remaining <= 0 check on the next iteration catches the expired budget and calls t.Fatalf. No runaway blocking past the outer deadline.

Gemini HIGH #2 — transient error widening (`isReadinessProbeRetryable`)

func isReadinessProbeRetryable(err error) bool {
    return isTransientNotLeaderErr(err) ||
        errors.Is(err, context.DeadlineExceeded) ||
        errors.Is(err, context.Canceled)
}

Propose wraps context errors with errors.WithStack(ctx.Err()) (engine.go:748,757), so errors.Is correctly unwraps them. Fix is correct and complete for the known error surface of Propose.

Minor observations

1. Duplicate deadline check in waitForWriteableLeader (line 375)

if !isReadinessProbeRetryable(err) {
    t.Fatalf(...)
}
if !time.Now().Before(deadline) {   // ← technically reachable but redundant
    t.Fatalf(...)
}
time.Sleep(leaderChurnRetryInterval)

The second deadline check is pre-empted by the top-of-loop remaining <= 0 check on the next iteration. Its only effect is to skip the 50ms sleep when the deadline expired during a propose. Harmless — could be removed, but it's not wrong. No action needed.

2. Commit hash refs in inline comments (5ba2ef94, rpushEventually, lpushEventually)

CLAUDE.md says: "Don't reference the current task, fix, or callers since those belong in the PR description and rot as the codebase evolves." The references in waitForWriteableLeader's doc comment and the waitForRaftReadiness block comment will become stale as those helpers are cleaned up. Consider replacing them with a single structural note ("see PR description for the per-test retry history this probe is intended to obsolete") or dropping them. Not a blocker — the PR description already captures the full history.

3. raftReadinessProbePayload is a mutable package-level var

Since []byte can't be const, this is unavoidable. No production code path could mutate it, and it's package-private. Mention only for completeness — no change needed.

Verdict

Round 1 addresses both HIGH items correctly. The probe mechanism is sound and the retry logic behaves as documented. Approved from a correctness standpoint; the commit-hash comments are the only callout worth a follow-up cleanup pass.
| Branch: leader-churn-helper

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd6a67e895

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-01T11:57:12Z

+		if remaining <= 0 {
+			t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr)
+		}
+		err := proposeReadinessOnce(nodes[0].engine, remaining)


Re-elect node 0 before retrying the readiness probe

When the probe races with a real re-election where another node becomes leader, this loop keeps proposing only through nodes[0].engine; handleProposal returns errNotLeader whenever that engine is a follower, and nothing here re-runs ensureNodeZeroIsLeader or otherwise targets the new leader. In that startup-churn case the helper now spins until readiness probe never committed even though the cluster has recovered with a different leader, so the readiness gate can introduce a new createNode flake instead of riding out the re-election it is meant to handle.

Useful? React with 👍 / 👎.

…ments claude[bot] PR #898 minor #2: CLAUDE.md convention forbids referencing specific commits/symbols in code comments since they rot as the codebase evolves. Dropped the "5ba2ef94 / rpushEventually / lpushEventually" enumeration from both doc blocks; replaced with a structural pointer to "the per-test retry wrappers in this file" and a direction to the PR description for full history. The ground truth for the flake-fix lineage lives in the PR description, which is durable while file-local commit refs are not. No behavior change. No caller audit needed (comment-only).

bootjp · 2026-06-01T11:58:09Z

@claude review

Round 2 cleanup (commit 23a146a):

claude[bot] minor Update dependency go to v1.21.4 #2 addressed: stripped 5ba2ef94 / rpushEventually / lpushEventually enumerations from the two doc blocks (CLAUDE.md convention — file-local commit-hash/symbol refs rot). Replaced with structural pointers ("per-test retry wrappers in this file") and a direction to the PR description for the full flake-fix lineage.

claude[bot] minors #1 (redundant deadline check) and #3 (var payload) were marked "no action needed" / unavoidable — no change.

Comment-only change. gofmt + build clean.

claude · 2026-06-01T11:58:24Z

Claude finished @bootjp's task in 1m 42s —— View job

Review of PR #898 — Round 2 cleanup (`23a146a`)

Tasks

Load tools
Read adapter/test_util.go current state
Verify doc comment changes vs CLAUDE.md conventions
Confirm no unintended logic changes
Post findings

Summary

Round 2 correctly addresses Minor #2 from the previous review. The targeted doc changes are clean, the convention violation is fixed, and the diff is comment-only — zero behavior change confirmed.

Verification: targeted changes

waitForRaftReadiness block comment (lines 321–325)

Before (cd6a67e):

// 5ba2ef94 / rpushEventually / lpushEventually have been
// absorbing that post-readiness churn per-test for the past
// several months — this commits-one-entry probe closes the ...

After (23a146a):

// The per-test retry wrappers in this file have been absorbing
// that post-readiness churn site-by-site; this commits-one-entry
// probe closes the window at the readiness layer instead, ...

Correct. The specific commit hash and symbol-name enumeration are gone; the structural description is still informative and won't rot.

waitForWriteableLeader doc comment (lines 352–354)

Before:

// write. See commits 5ba2ef94 / rpushEventually / lpushEventually for
// the per-test work-arounds this readiness gate is meant to obsolete.

After:

// write — the failure mode the per-test retry wrappers in this file
// have been working around. See the PR description for the full
// flake-fix history this readiness gate is meant to obsolete.

Correct. The commit/symbol refs are removed. The "See the PR description" pointer is fine in principle (the PR description is the right home for flake-fix history, per CLAUDE.md), though "the PR description" without a number is slightly ambiguous once this is read out of PR context. Not a blocker — git log --follow or git blame gets you there.

Residual nit (not in scope for Round 2, but same category)

The Round 1 additions in proposeReadinessOnce (line 385) and isReadinessProbeRetryable (line 405) still carry review-process artifact references:

// proposeReadinessOnce …
// per-attempt ctx cannot block past the outer deadline (gemini PR
// #898 HIGH point 1). …

// isReadinessProbeRetryable …
// context.Canceled is treated symmetrically (gemini
// PR #898 HIGH point 2).

These are the same class of rot as the commit-hash refs — they'll be opaque to any reader who isn't looking at this PR. The WHY is still meaningful; only the attribution needs to go. Suggested replacements:

// per-attempt ctx cannot block past the outer deadline when the
// overall budget is nearly exhausted.

// context.Canceled is treated symmetrically.

Minor nit — not a blocker, and it's a separate touch from Round 2's stated scope. Could be addressed in a follow-up pass if desired.

Verdict

Round 2 is clean. CLAUDE.md Minor #2 fully resolved. The two residual (gemini PR #898 HIGH point N) parentheticals in the Round 1 helpers are the only remaining callout worth a follow-up pass.
| Branch: leader-churn-helper

…ot] residual nit codex P2 (adapter/test_util.go:367): When a real re-election makes another node leader during the readiness probe, nodes[0].engine.Propose returns errNotLeader on every attempt for the rest of the budget. The previous loop kept proposing through that follower engine until t.Fatalf fired — spuriously failing the readiness gate even though the cluster had recovered with a different leader. Fix: on transient not-leader errors, re-run ensureNodeZeroIsLeader before the next attempt. ensureNodeZeroIsLeader is idempotent in the steady state so this costs nothing when leadership has not actually moved. Gated on isTransientNotLeaderErr specifically (not the broader isReadinessProbeRetryable) so ctx-timeout cases do not pay for the re-nudge — they are likely just slow scheduler ticks, not leadership changes. Signature change: waitForWriteableLeader now takes (peers, waitInterval) so the probe loop can do leader re-nudges. Sole caller (waitForRaftReadiness in the same file) already had those values to hand; verified by grep. claude[bot] residual nit (round-2 review): Stripped "gemini PR #898 HIGH point 1/2" attribution refs from the proposeReadinessOnce / isReadinessProbeRetryable doc comments. Same CLAUDE.md convention as Minor #2 — the WHY is kept; the PR-process attribution rots. Validation: go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.9s go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok golangci-lint run -> 0 issues

bootjp · 2026-06-01T12:03:48Z

@claude review

Round 2 fixes (commit e73b625):

codex P2 (line 367, "Re-elect node 0 before retrying"): On transient not-leader errors, the loop now re-runs ensureNodeZeroIsLeader before the next attempt. Without it, a real re-election that moved leadership off nodes[0] mid-probe would have spun on the follower engine until the budget expired and spuriously failed the gate. Gated on isTransientNotLeaderErr specifically (not the broader retryable widener) so ctx-timeout cases don't pay for the re-nudge. Signature: waitForWriteableLeader now takes (peers, waitInterval); sole caller updated.
claude[bot] residual nit (round-2 review): stripped gemini PR #898 HIGH point 1/2 attribution refs from proposeReadinessOnce + isReadinessProbeRetryable doc comments. Same CLAUDE.md convention as Minor Update dependency go to v1.21.4 #2 cleanup.

go test -race -count=2 passes (7.9s); golangci-lint 0 issues.

claude · 2026-06-01T12:04:01Z

Claude finished @bootjp's task in 2m 15s —— View job

Review of PR #898 — Round 2 fixes (`e73b625`)

Tasks

Load tools
Read adapter/test_util.go current state
Verify codex P2 fix (ensureNodeZeroIsLeader re-nudge on transient not-leader)
Verify signature change (waitForWriteableLeader takes peers, waitInterval)
Verify doc comment cleanup (gemini attribution refs removed)
Review for correctness, concurrency safety, and edge cases
Post findings

Summary

Both changes are correct. The codex P2 concern is fully addressed, and the doc comment cleanup is complete. No blockers.

Verification: codex P2 fix

The re-nudge is gated on isTransientNotLeaderErr(err) && len(peers) > 0 (line 391), not the broader isReadinessProbeRetryable. This is the right gate: ctx-timeout retries (context.DeadlineExceeded / context.Canceled) don't trigger the re-election nudge, since a ctx timeout doesn't imply that leadership has moved — it just means the per-attempt budget ran out on a slow runner. A true not-leader error does imply leadership may have moved off nodes[0], so the narrower gate is correct.

Loop trace for the not-leader path:

Top-of-loop remaining <= 0 check → ok
proposeReadinessOnce → errNotLeader
isReadinessProbeRetryable → true (via isTransientNotLeaderErr)
isTransientNotLeaderErr(err) && len(peers) > 0 → true → ensureNodeZeroIsLeader(t, nodes, peers, time.Until(deadline), waitInterval)
Deadline check → sleep → retry

Loop trace for the ctx-timeout path:

Top-of-loop remaining <= 0 → ok
proposeReadinessOnce → context.DeadlineExceeded
isReadinessProbeRetryable → true
isTransientNotLeaderErr(err) → false → re-nudge skipped
Deadline check → sleep → retry

Both paths are correct.

Verification: signature change

waitForWriteableLeader(t, nodes, peers, timeout, waitInterval time.Duration) — sole caller at line 332 passes (t, nodes, peers, waitTimeout, waitInterval). No other callers (grep confirmed). ✓

Verification: doc comment cleanup

proposeReadinessOnce (line 401–406) and isReadinessProbeRetryable (line 419–424) no longer carry (gemini PR #898 HIGH point N) attribution refs. The comments now state the WHY only — the constraint (per-attempt ctx bounding, ctx-error widening). CLAUDE.md minor from Round 2 review fully resolved. ✓

Minor observations

1. ensureNodeZeroIsLeader failure message when budget is nearly exhausted

When time.Until(deadline) is close to zero, ensureNodeZeroIsLeader passes that as waitTimeout to its own require.Eventually calls. If those time out, the t.FailNow comes from require.Eventually with the message "no node became leader" or "node 0 did not become leader" — not from the outer t.Fatalf("readiness probe never committed within %s"). The failure is still caught and the message is still informative; the attribution in the log will differ from what the outer loop would print. Not a bug — just a note for future diagnosis. Consistent with how waitForRaftReadiness already passes the full waitTimeout to ensureNodeZeroIsLeader and waitForStableLeader independently.

2. 50ms sleep after a successful ensureNodeZeroIsLeader

After the re-nudge completes, the loop checks the deadline and then sleeps leaderChurnRetryInterval (50ms) before the next proposeReadinessOnce. This is fine: the top-of-loop remaining <= 0 check and the min(perProposeTimeout, remaining) bound in proposeReadinessOnce both catch a near-exhausted budget on the next iteration. No issue.

Verdict

Round 2 is clean. Codex P2 fully resolved, CLAUDE.md attribution cleanup complete. The probe loop now handles both leadership-moved (re-nudge) and ctx-timeout (retry without re-nudge) paths correctly. Approved.
| Branch: leader-churn-helper

… tests Third flake-root-cause PR, after PR #891 (SQS throttle refill) and PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake mechanism has been fixed per-test for months; this PR generalises the structural fix and migrates all extant sites. Recurring flake (PR #818 lineage): Tests that "set a short TTL, sleep past it, assert the key is gone" race their own wall clock under -race on slow CI runners. The post-sleep read can sometimes still observe an un-expired view because the inter-call pause does not absorb scheduler jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible) by switching to require.Eventually with a TTL-derived deadline. The other 8 sites in the test suite kept the fragile time.Sleep(ttl + 50ms) pattern. Structural fix ============== - New helper eventuallyExpired(t, ttl, condition, msg) in test_util.go. Encapsulates the require.Eventually + ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so future test writers reach for it instead of inventing fresh time.Sleep margins. - Migrated 8 sites to the helper: * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired, TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired. * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL, TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*. * redis_misskey_compat_test.go: post-EX-1 expiry assertion. - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go) convenience wrappers. Their sole callers were the migrated sites. - Added errors stdlib import to the three test files where the helper's condition closures use errors.Is. Caller audit per /loop semantic-change rule ============================================ - eventuallyExpired: new helper, 8 callers in 3 test files. All test-only. - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo confirms zero remaining call sites. - collectionFastPathTTL / luaFastPathTTL consts: still used by the migrated sites as the ttl argument to eventuallyExpired. Unchanged. Validation ========== go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s gofmt + go vet + golangci-lint run -> 0 issues Closes the "Redis TTL window deadline-multiplier" follow-up noted in PR #891's description.

… tests (#903) ## Summary Third flake-root-cause PR, after #891 (SQS throttle refill) and #898 (Raft readiness probe). Same recipe: a recurring CI-flake mechanism has been patched per-test for months; this PR generalises the structural fix and migrates the extant sites. ## Recurring flake (PR #818 lineage) Tests that *"set a short TTL, sleep past it, assert the key is gone"* race their own wall clock under `-race` on slow CI runners. The post-sleep read sometimes observes an un-expired view because the inter-call pause does not absorb scheduler jitter. [PR #818](#818) fixed one instance (`TestRedis_ExpiredKey_BecomesInvisible`) by switching to `require.Eventually` with a TTL-derived deadline. The other 8 sites kept the fragile `time.Sleep(ttl + 50ms)` pattern. ## Structural fix - **New helper** `eventuallyExpired(t, ttl, condition, msg)` in `test_util.go`. Encapsulates the `require.Eventually + ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms)` idiom so future test writers reach for it instead of inventing fresh `time.Sleep` margins. - **Migrated 8 sites** to the helper: - `redis_collection_fastpath_test.go`: `TestRedis_HGET_TTLExpired`, `TestRedis_HEXISTS_TTLExpired`, `TestRedis_SISMEMBER_TTLExpired` - `redis_lua_collection_fastpath_test.go`: `TestLua_HGET_HashTTL`, `TestLua_HEXISTS_HashTTL`, `TestLua_ZSCORE_*`, `TestLua_ZRANGEBYSCORE_*` - `redis_misskey_compat_test.go`: post-`EX 1` expiry assertion - **Removed unused** `waitForTTLExpiry` (redis_collection_fastpath_test.go) and `waitForLuaTTL` (redis_lua_collection_fastpath_test.go) convenience wrappers. Sole callers were the migrated sites. - Added `errors` stdlib import to the three test files where the helper's condition closures use `errors.Is`. ## Caller audit (per /loop semantic-change rule) - `eventuallyExpired`: new helper, 8 callers in 3 test files. All test-only. - `waitForTTLExpiry` / `waitForLuaTTL`: removed. Repo-wide grep confirms zero remaining call sites. - `collectionFastPathTTL` / `luaFastPathTTL` consts: still used by the migrated sites as the `ttl` argument to `eventuallyExpired`. Unchanged. ## Validation - `go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2` → ok 32.3s - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Scope Closes the *"Redis TTL window deadline-multiplier"* follow-up noted in [PR #891's description](#891). ## Out-of-scope (still on the flake follow-up backlog) - VM pool / cache hit-rate assertion → "don't assert on cache hit rates" rule. The structural lesson from commit [7783c27](7783c279) is to never assert on cache hit rates; needs a doc rather than a code change.

PR #898 added waitForWriteableLeader to waitForRaftReadiness, which drives one no-op Raft entry through full quorum before createNode returns. The post-readiness leader-churn window the rpushEventually / lpushEventually wrappers absorbed is now closed at the readiness layer instead. The wrappers are dead code. Changes: - Drop rpushEventually + lpushEventually from test_util.go (the only two callers, in redis_lua_compat_test.go, were first-write -after-createNode sites — replaced with direct RPush/LPush calls + a comment pointing back to PR #898). - Drop doEventually too: its only callers were the two helpers above. retryNotLeader (the underlying classifier loop) stays — grpc_test.go still uses it for *mid-test* leader churn inside its 9999-iteration consistency loops, which PR #898 does not address. - Drop the now-unused redis import from test_util.go. - Refresh the leaderChurnRetryTimeout / retryNotLeader doc comments to call out that startup-window churn is closed at the readiness layer, and retryNotLeader is for mid-test loops. Caller audit per /loop semantic-change rule ============================================ - rpushEventually / lpushEventually: grep across the repo confirms zero remaining call sites after the migration. Safe to delete. - doEventually: only called by the two helpers above; also safe to delete. - retryNotLeader: kept. Sole external caller is adapter/grpc_test.go (9 sites inside two consistency-loop tests). All those wrap mid-test RPCs that can race a real re-election under CI load — the readiness probe does not help mid-test churn, so the helper retains its purpose. Validation ========== go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.1s (the direct RPush/LPush calls inherit the writeable-leader gate from createNode, no flake) gofmt + go vet + golangci-lint run -> 0 issues

…#902) ## Summary Follow-up to PR #898. The readiness probe (`waitForWriteableLeader`) closes the startup leader-churn window at the readiness layer, so the per-test `rpushEventually` / `lpushEventually` wrappers that absorbed that window become dead code. ## What's dropped - `rpushEventually` + `lpushEventually` (`test_util.go`): only callers were `redis_lua_compat_test.go` lines 128–129 (first writes immediately after `createNode`). Migrated to direct `rdb.RPush` / `rdb.LPush` with a comment pointing back to PR #898. - `doEventually`: only used by the two helpers above; no other callers. - Unused `github.com/redis/go-redis/v9` import in `test_util.go`. ## What's kept - `retryNotLeader`: still used in `grpc_test.go` (9 sites) wrapping mid-test RPCs inside 9999-iteration consistency loops. The readiness probe does NOT help **mid-test** churn (a real re-election can fire halfway through a long-running loop), so this helper retains its purpose. The doc comment is refreshed to clarify the startup vs mid-test distinction. ## Caller audit (per /loop semantic-change rule) Repo-wide grep: - `rpushEventually\b` / `lpushEventually\b`: 0 callers remain (only historical comment refs in `redis_lua_compat_test.go`). - `doEventually\b`: 0 callers remain. - `retryNotLeader\b`: 9 callers, all in `grpc_test.go`. Unchanged. ## Validation - `go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3` → ok 5.1s (direct RPush/LPush inherits the writeable-leader gate from `createNode`; no flake) - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Scope Closes the "*Eventually helper unnecessization audit" follow-up noted in PR #898's description.

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

bootjp merged commit 6f2941b into main Jun 1, 2026
9 checks passed

bootjp deleted the leader-churn-helper branch June 1, 2026 14:02

This was referenced Jun 1, 2026

chore(test): drop redundant *Eventually wrappers obsoleted by PR #898 #902

Merged

test(redis): structural fix — eventuallyExpired helper for TTL-expiry tests #903

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness#898

test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness#898
bootjp merged 4 commits into
mainfrom
leader-churn-helper

bootjp commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bootjp commented Jun 1, 2026

Summary

Recurring flake pattern (5+ commits this year)

Mechanism

Caller audit (per /loop semantic-change rule)

Validation

Follow-up note

Out-of-scope (remaining recurring flake categories)

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #898 — Round 1 fixes (cd6a67e)

Summary

Verification: no-op payload

Gemini HIGH #1 — per-propose context bounding (proposeReadinessOnce)

Gemini HIGH #2 — transient error widening (isReadinessProbeRetryable)

Minor observations

Verdict

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #898 — Round 2 cleanup (23a146a)

Summary

Verification: targeted changes

Residual nit (not in scope for Round 2, but same category)

Verdict

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #898 — Round 2 fixes (e73b625)

Summary

Verification: codex P2 fix

Verification: signature change

Verification: doc comment cleanup

Minor observations

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

claude Bot commented Jun 1, 2026 •

edited

Loading

Review of PR #898 — Round 1 fixes (`cd6a67e`)

Gemini HIGH #1 — per-propose context bounding (`proposeReadinessOnce`)

Gemini HIGH #2 — transient error widening (`isReadinessProbeRetryable`)

claude Bot commented Jun 1, 2026 •

edited

Loading

Review of PR #898 — Round 2 cleanup (`23a146a`)

claude Bot commented Jun 1, 2026 •

edited

Loading

Review of PR #898 — Round 2 fixes (`e73b625`)