Skip to content

test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness#898

Merged
bootjp merged 4 commits into
mainfrom
leader-churn-helper
Jun 1, 2026
Merged

test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness#898
bootjp merged 4 commits into
mainfrom
leader-churn-helper

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 1, 2026

Summary

Structural fix for the recurring leader-churn CI flake pattern. Closes the post-readiness step-down window at the readiness layer instead of asking every new test author to remember to use *Eventually helpers.

Companion to PR #891 (SQS throttle structural fix): same recipe — each per-test fix to a recurring flake mechanism leaves the same trap for the next test. PR #891 hardened the SQS throttle-refill subset; this PR hardens the leader-churn subset.

Recurring flake pattern (5+ commits this year)

Commit What it touched
5ba2ef94 added doEventually / rpushEventually / lpushEventually, wrapped 2 RPUSH/LPUSH calls in TestRedis_LuaRPopLPushBullMQLikeLists
2b6e8f28 widened raft election budget + retried gRPC on leader churn
78b9d091 retried leader churn in consistency loops
0fb7db76 fixed flaky TestRedis_LuaRPopLPushBullMQLikeLists (re-fix)
ba3f5bf8 waited for follower apply catch-up in MultiExec

Root cause is the same: waitForStableLeader only proves leadership appears stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as "etcd raft engine is not leader" on the test's first write.

Mechanism

Add waitForWriteableLeader that drives one no-op Raft entry through full quorum after waitForStableLeader returns:

var raftReadinessProbePayload = []byte{
    0x02,                                           // raftEncodeHLCLease
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op)
}
  • State-machine no-op: kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64 ceiling. The if ceilingMs > 0 guard at line 185 makes ceiling=0 a true no-op (no HLC mutation, no storage write, no encryption dispatch).
  • Quorum-proving: the entry still traverses the full propose → append → replicate → commit → apply pipeline, so a successful Propose proves the leader can actually drive quorum.
  • Retry semantics: retries on transient leader-unavailable errors (same set isTransientNotLeaderErr already classifies) up to caller-supplied waitTimeout (10s at the current createNode call site). Per-propose timeout is 2s.

Wired into waitForRaftReadiness as the final step.

Caller audit (per /loop semantic-change rule)

  • waitForRaftReadiness: single production call site (createNode); behavior change is additive — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry.
  • waitForWriteableLeader / raftReadinessProbePayload: package-private to adapter/test_util.go; no external callers (grep verified).
  • applyHLCLease (kv/fsm.go): receives one extra invocation per createNode call. Behavior on ceiling=0 is documented no-op. No state divergence across replicas — every FSM applies the same payload.
  • Side effects:
    • pendingApplyIdx advances by 1 (normal flow)
    • encryption applier untouched (opcode 0x02 not in encryption band)
    • snapshots include one extra applied index (already varies by test timing)
    • lease provider does not track HLC entries
  • Per-test wall-clock: ~10 ms extra on a stable cluster. 29 test files use createNode → ~290 ms total CI overhead. Negligible.

Validation

  • go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 → ok 5.4s
  • go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 → ok 5.6s
  • gofmt + go vet + golangci-lint run → 0 issues

Follow-up note

Existing per-test *Eventually wrappers (rpushEventually, lpushEventually) remain valid for tests that issue writes long after createNode, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases.

Subsequent commits can audit whether tests still calling rpushEventually / lpushEventually immediately after createNode can drop back to direct calls now that the startup window is closed at the readiness layer.

Out-of-scope (remaining recurring flake categories)

Tracked in the PR #891 description; still to be addressed in follow-up PRs:

  • Redis TTL window racing → per-test deadline-multiplier helper
  • VM pool / cache hit-rate assertions → "don't assert on cache hit rates" rule

…orRaftReadiness

Structural fix for the recurring leader-churn flake pattern. Multiple
commits over the past months have absorbed post-readiness leader
churn per-test by wrapping the first write in retry helpers:

  5ba2ef9  test(redis): retry initial writes on 'not leader' to absorb CI leader churn
            (added doEventually / rpushEventually / lpushEventually)
  2b6e8f2  test(adapter): widen raft election budget + retry gRPC on leader churn
  78b9d09  test(grpc): retry leader churn in consistency loops
  0fb7db7  test(redis): fix flaky TestRedis_LuaRPopLPushBullMQLikeLists
  ba3f5bf  test(redis): wait for follower apply catch-up in MultiExec list tests

The root cause is the same in every case: waitForStableLeader (the
existing helper) only proves leadership *appears* stable for a 300 ms
window. The freshly-elected leader can still step down on its first
real heartbeat-quorum miss after that window closes, surfacing as
"etcd raft engine is not leader" on the test's first write.

This commit closes the post-readiness window at the readiness layer
instead of asking every new test author to remember to use *Eventually
helpers. After waitForStableLeader returns, drive one no-op Raft
entry through full quorum and retry on transient leader errors until
it commits. Any subsequent test write inherits a leader that has
already held quorum through at least one apply.

Mechanism
=========

The probe payload is a 9-byte HLC-lease entry with ceiling=0:

  [0x02][0x00 * 8]
   ^ raftEncodeHLCLease

kv/fsm.go applyHLCLease reads data[1:] as a big-endian uint64
ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0
a true no-op at the state-machine layer (no HLC mutation, no
storage write, no encryption-band dispatch). The entry still
traverses the full propose -> append -> replicate -> commit ->
apply pipeline, so a successful Propose proves the leader can
actually drive quorum.

waitForWriteableLeader retries the propose on transient
leader-unavailable errors (the same set isTransientNotLeaderErr
already classifies) up to the caller-supplied waitTimeout
(10 s at the current createNode call site). Per-propose timeout
is 2 s — generous enough to absorb scheduler jitter on a loaded
runner, tight enough that a genuinely-stuck leader fails fast
instead of consuming the entire 10 s budget on one attempt.

Caller audit per /loop semantic-change rule
============================================

  - waitForRaftReadiness: single production call site
    (test_util.go createNode); behavior change is additive — same
    semantics PLUS the writeable-leader gate. Tests that previously
    raced into a step-down window now block until the leader has
    committed one entry.

  - waitForWriteableLeader / raftReadinessProbePayload: package-
    private to adapter/test_util.go; no external callers. Verified
    by grep across the repo.

  - applyHLCLease (kv/fsm.go): receives one extra invocation per
    test that calls createNode. Behavior on ceiling=0 is documented
    no-op (kv/fsm.go:185 `if ceilingMs > 0`). No state divergence
    across replicas — every FSM applies the same payload.

  - Side effects on the engine / FSM:
    - pendingApplyIdx advances by 1 (normal flow)
    - encryption applier untouched (opcode 0x02 is not in the
      encryption band)
    - snapshots include one extra applied index (already varies by
      test timing)
    - lease provider does not track HLC entries

  - Per-test wall-clock impact: one extra Raft commit on startup,
    ~10 ms on a stable cluster. 29 test files use createNode, so
    total CI overhead is ~290 ms. Negligible.

Validation
==========

  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3 -> ok 5.4s
  go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2 -> ok 5.6s
  gofmt + go vet + golangci-lint -> 0 issues

Follow-up note
==============

Existing per-test *Eventually wrappers (rpushEventually,
lpushEventually) remain valid for tests that issue writes long
after createNode, where mid-test leader churn under CI load can
still hit. The readiness probe targets the startup window
specifically — it does not obsolete the helpers for other use
cases.

Subsequent commits can audit whether tests still calling
rpushEventually / lpushEventually right after createNode can drop
back to direct calls now that the startup window is closed at
the readiness layer.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 24 minutes and 2 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c39c03a-2fde-4899-bccc-9b838b5d7cfc

📥 Commits

Reviewing files that changed from the base of the PR and between 2b118a4 and e73b625.

📒 Files selected for processing (1)
  • adapter/test_util.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch leader-churn-helper

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a waitForWriteableLeader helper in adapter/test_util.go to ensure that the Raft leader can successfully commit a no-op entry before completing readiness checks, preventing post-readiness test failures. Feedback points out two issues in the retry loop: the hardcoded proposal timeout can exceed the overall deadline, and context-related errors (such as context.DeadlineExceeded) are not classified as transient, which could cause premature test failures. A code suggestion is provided to bound the proposal timeout and handle context errors appropriately.

Comment thread adapter/test_util.go
Comment on lines +360 to +373
deadline := time.Now().Add(timeout)
const perProposeTimeout = 2 * time.Second
var lastErr error
for {
proposeCtx, cancel := context.WithTimeout(context.Background(), perProposeTimeout)
_, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload)
cancel()
if err == nil {
return
}
lastErr = err
if !isTransientNotLeaderErr(err) {
t.Fatalf("readiness probe failed with non-transient error: %v", err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two issues with the current retry loop implementation:

  1. Context Timeout Bounding: The perProposeTimeout is hardcoded to 2 * time.Second. If the remaining time until the overall deadline is less than 2 seconds, the Propose call can block past the overall timeout budget.
  2. Transient Error Classification: If a single proposal times out (e.g., under heavy CI load or during leader stabilization), Propose returns context.DeadlineExceeded. Since isTransientNotLeaderErr does not classify context.DeadlineExceeded or context.Canceled as transient, the test will immediately fail via t.Fatalf instead of retrying up to the overall timeout budget. This defeats the purpose of the retry loop and can introduce new CI flakes.

We should bound the proposal timeout to the remaining budget and treat context errors as transient/retryable.

	deadline := time.Now().Add(timeout)
	const perProposeTimeout = 2 * time.Second
	var lastErr error
	for {
		remaining := time.Until(deadline)
		currTimeout := perProposeTimeout
		if remaining < currTimeout {
			currTimeout = remaining
		}
		proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout)
		_, err := nodes[0].engine.Propose(proposeCtx, raftReadinessProbePayload)
		cancel()
		if err == nil {
			return
		}
		lastErr = err
		if !isTransientNotLeaderErr(err) && !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
			t.Fatalf("readiness probe failed with non-transient error: %v", err)
		}
References
  1. Ensure that synchronous, potentially blocking, or indefinitely retrying operations executed within a polling loop are wrapped with a bounded timeout context. This prevents the polling/watcher goroutine from blocking indefinitely during failures (e.g., network partitions), allowing the loop to continue ticking, re-evaluate state changes, and maintain a safe fail-closed posture.

… classify ctx errors as transient

gemini HIGH (adapter/test_util.go:373):

1. Per-propose timeout could exceed outer budget. Hardcoded
   perProposeTimeout=2s combined with the outer waitTimeout (10s
   at createNode) means a propose started within the last <2s
   of the budget could block past the caller deadline. Fix:
   currTimeout = min(perProposeTimeout, remaining).

2. context.DeadlineExceeded not classified as transient. The
   base isTransientNotLeaderErr (redis_retry.go) recognizes only
   raft-engine leader-unavailable signals, not ctx-timeout. A
   per-propose ctx that fires on a slow CI runner would have
   immediately t.Fatalf-ed instead of retrying within the outer
   budget — directly re-introducing the flake class the readiness
   probe is meant to eliminate. Widened with a probe-local
   isReadinessProbeRetryable that ORs context.DeadlineExceeded
   and context.Canceled.

Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep
waitForWriteableLeader under cyclop=10 after the new branch.

Caller audit: all three new symbols (proposeReadinessOnce,
isReadinessProbeRetryable, raftReadinessProbePayload) are
package-private; sole caller is waitForWriteableLeader.
Verified by grep across the repo.

Validation:
  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.2s
  go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok
  golangci-lint run -> 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

@claude review

Round 1 fixes (commit cd6a67e):

  • gemini HIGH point 1: Per-propose ctx is now bounded by min(perProposeTimeout, remaining). A propose started within the last <2 s of the outer budget can no longer block past the caller deadline.
  • gemini HIGH point 2: context.DeadlineExceeded + context.Canceled are now treated as transient via a probe-local isReadinessProbeRetryable widener. A per-propose ctx timeout on a slow CI runner now retries within the outer budget instead of immediately fataling — which would have directly re-introduced the flake class this probe is meant to eliminate.
  • Extracted proposeReadinessOnce + isReadinessProbeRetryable to keep waitForWriteableLeader under cyclop=10 budget.

Caller audit: all 3 new symbols package-private, sole caller is waitForWriteableLeader.

go test -race -count=2 passes (7.2s); golangci-lint 0 issues.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 36s —— View job


Review of PR #898 — Round 1 fixes (cd6a67e)

Tasks

  • Load tools
  • Read adapter/test_util.go changes
  • Verify FSM no-op claim (kv/fsm.go)
  • Verify error wrapping / errors.Is chain
  • Review code for correctness, concurrency safety, and edge cases
  • Post findings

Summary

Both Gemini HIGH findings are correctly addressed. The logic is sound, the no-op payload is verified against the FSM, and the error chain handles wrapping correctly. A few minor observations below — none are blockers.


Verification: no-op payload

Confirmed against kv/fsm.go:272-280:

// applyHLCLease — ceiling=0 path
ceilingMs := int64(binary.BigEndian.Uint64(data))  // = 0
if f.hlc != nil && ceilingMs > 0 {                  // guard fires → skipped
    f.hlc.SetPhysicalCeiling(ceilingMs)
}
return nil

The 9-byte probe payload (0x02 + 8 zero bytes) is a true no-op at the state-machine layer while still traversing the full propose→replicate→commit→apply pipeline. Payload length exactly matches hlcLeasePayloadLen = 8 for data[1:]. Correct.


Gemini HIGH #1 — per-propose context bounding (proposeReadinessOnce)

currTimeout := perProposeTimeout
if remaining < currTimeout {
    currTimeout = remaining
}
proposeCtx, cancel := context.WithTimeout(context.Background(), currTimeout)

Fix is correct. context.Background() as the base is appropriate here since the timeout is already bounded by remaining. When remaining is nearly zero, currTimeout becomes tiny → Propose returns immediately with context.DeadlineExceededisReadinessProbeRetryable returns true → the top-of-loop remaining <= 0 check on the next iteration catches the expired budget and calls t.Fatalf. No runaway blocking past the outer deadline.


Gemini HIGH #2 — transient error widening (isReadinessProbeRetryable)

func isReadinessProbeRetryable(err error) bool {
    return isTransientNotLeaderErr(err) ||
        errors.Is(err, context.DeadlineExceeded) ||
        errors.Is(err, context.Canceled)
}

Propose wraps context errors with errors.WithStack(ctx.Err()) (engine.go:748,757), so errors.Is correctly unwraps them. Fix is correct and complete for the known error surface of Propose.


Minor observations

1. Duplicate deadline check in waitForWriteableLeader (line 375)

if !isReadinessProbeRetryable(err) {
    t.Fatalf(...)
}
if !time.Now().Before(deadline) {   // ← technically reachable but redundant
    t.Fatalf(...)
}
time.Sleep(leaderChurnRetryInterval)

The second deadline check is pre-empted by the top-of-loop remaining <= 0 check on the next iteration. Its only effect is to skip the 50ms sleep when the deadline expired during a propose. Harmless — could be removed, but it's not wrong. No action needed.

2. Commit hash refs in inline comments (5ba2ef94, rpushEventually, lpushEventually)

CLAUDE.md says: "Don't reference the current task, fix, or callers since those belong in the PR description and rot as the codebase evolves." The references in waitForWriteableLeader's doc comment and the waitForRaftReadiness block comment will become stale as those helpers are cleaned up. Consider replacing them with a single structural note ("see PR description for the per-test retry history this probe is intended to obsolete") or dropping them. Not a blocker — the PR description already captures the full history.

3. raftReadinessProbePayload is a mutable package-level var

Since []byte can't be const, this is unavoidable. No production code path could mutate it, and it's package-private. Mention only for completeness — no change needed.


Verdict

Round 1 addresses both HIGH items correctly. The probe mechanism is sound and the retry logic behaves as documented. Approved from a correctness standpoint; the commit-hash comments are the only callout worth a follow-up cleanup pass.
| Branch: leader-churn-helper

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd6a67e895

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/test_util.go
if remaining <= 0 {
t.Fatalf("readiness probe never committed within %s (last err: %v)", timeout, lastErr)
}
err := proposeReadinessOnce(nodes[0].engine, remaining)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-elect node 0 before retrying the readiness probe

When the probe races with a real re-election where another node becomes leader, this loop keeps proposing only through nodes[0].engine; handleProposal returns errNotLeader whenever that engine is a follower, and nothing here re-runs ensureNodeZeroIsLeader or otherwise targets the new leader. In that startup-churn case the helper now spins until readiness probe never committed even though the cluster has recovered with a different leader, so the readiness gate can introduce a new createNode flake instead of riding out the re-election it is meant to handle.

Useful? React with 👍 / 👎.

…ments

claude[bot] PR #898 minor #2: CLAUDE.md convention forbids referencing
specific commits/symbols in code comments since they rot as the
codebase evolves. Dropped the "5ba2ef94 / rpushEventually /
lpushEventually" enumeration from both doc blocks; replaced with a
structural pointer to "the per-test retry wrappers in this file"
and a direction to the PR description for full history.

The ground truth for the flake-fix lineage lives in the PR
description, which is durable while file-local commit refs are not.

No behavior change. No caller audit needed (comment-only).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

@claude review

Round 2 cleanup (commit 23a146a):

  • claude[bot] minor Update dependency go to v1.21.4 #2 addressed: stripped 5ba2ef94 / rpushEventually / lpushEventually enumerations from the two doc blocks (CLAUDE.md convention — file-local commit-hash/symbol refs rot). Replaced with structural pointers ("per-test retry wrappers in this file") and a direction to the PR description for the full flake-fix lineage.

claude[bot] minors #1 (redundant deadline check) and #3 (var payload) were marked "no action needed" / unavoidable — no change.

Comment-only change. gofmt + build clean.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 1m 42s —— View job


Review of PR #898 — Round 2 cleanup (23a146a)

Tasks

  • Load tools
  • Read adapter/test_util.go current state
  • Verify doc comment changes vs CLAUDE.md conventions
  • Confirm no unintended logic changes
  • Post findings

Summary

Round 2 correctly addresses Minor #2 from the previous review. The targeted doc changes are clean, the convention violation is fixed, and the diff is comment-only — zero behavior change confirmed.


Verification: targeted changes

waitForRaftReadiness block comment (lines 321–325)

Before (cd6a67e):

// 5ba2ef94 / rpushEventually / lpushEventually have been
// absorbing that post-readiness churn per-test for the past
// several months — this commits-one-entry probe closes the ...

After (23a146a):

// The per-test retry wrappers in this file have been absorbing
// that post-readiness churn site-by-site; this commits-one-entry
// probe closes the window at the readiness layer instead, ...

Correct. The specific commit hash and symbol-name enumeration are gone; the structural description is still informative and won't rot.

waitForWriteableLeader doc comment (lines 352–354)

Before:

// write. See commits 5ba2ef94 / rpushEventually / lpushEventually for
// the per-test work-arounds this readiness gate is meant to obsolete.

After:

// write — the failure mode the per-test retry wrappers in this file
// have been working around. See the PR description for the full
// flake-fix history this readiness gate is meant to obsolete.

Correct. The commit/symbol refs are removed. The "See the PR description" pointer is fine in principle (the PR description is the right home for flake-fix history, per CLAUDE.md), though "the PR description" without a number is slightly ambiguous once this is read out of PR context. Not a blocker — git log --follow or git blame gets you there.


Residual nit (not in scope for Round 2, but same category)

The Round 1 additions in proposeReadinessOnce (line 385) and isReadinessProbeRetryable (line 405) still carry review-process artifact references:

// proposeReadinessOnce …
// per-attempt ctx cannot block past the outer deadline (gemini PR
// #898 HIGH point 1). …

// isReadinessProbeRetryable …
// context.Canceled is treated symmetrically (gemini
// PR #898 HIGH point 2).

These are the same class of rot as the commit-hash refs — they'll be opaque to any reader who isn't looking at this PR. The WHY is still meaningful; only the attribution needs to go. Suggested replacements:

// per-attempt ctx cannot block past the outer deadline when the
// overall budget is nearly exhausted.

// context.Canceled is treated symmetrically.

Minor nit — not a blocker, and it's a separate touch from Round 2's stated scope. Could be addressed in a follow-up pass if desired.


Verdict

Round 2 is clean. CLAUDE.md Minor #2 fully resolved. The two residual (gemini PR #898 HIGH point N) parentheticals in the Round 1 helpers are the only remaining callout worth a follow-up pass.
| Branch: leader-churn-helper

…ot] residual nit

codex P2 (adapter/test_util.go:367):
  When a real re-election makes another node leader during the
  readiness probe, nodes[0].engine.Propose returns errNotLeader on
  every attempt for the rest of the budget. The previous loop kept
  proposing through that follower engine until t.Fatalf fired —
  spuriously failing the readiness gate even though the cluster
  had recovered with a different leader.

  Fix: on transient not-leader errors, re-run ensureNodeZeroIsLeader
  before the next attempt. ensureNodeZeroIsLeader is idempotent in
  the steady state so this costs nothing when leadership has not
  actually moved. Gated on isTransientNotLeaderErr specifically
  (not the broader isReadinessProbeRetryable) so ctx-timeout cases
  do not pay for the re-nudge — they are likely just slow scheduler
  ticks, not leadership changes.

  Signature change: waitForWriteableLeader now takes (peers,
  waitInterval) so the probe loop can do leader re-nudges. Sole
  caller (waitForRaftReadiness in the same file) already had those
  values to hand; verified by grep.

claude[bot] residual nit (round-2 review):
  Stripped "gemini PR #898 HIGH point 1/2" attribution refs from
  the proposeReadinessOnce / isReadinessProbeRetryable doc
  comments. Same CLAUDE.md convention as Minor #2 — the WHY is
  kept; the PR-process attribution rots.

Validation:
  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=2 -> ok 7.9s
  go test ./adapter/ -run TestSQSServer_Throttle -race -count=2 -> ok
  golangci-lint run -> 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

@claude review

Round 2 fixes (commit e73b625):

  • codex P2 (line 367, "Re-elect node 0 before retrying"): On transient not-leader errors, the loop now re-runs ensureNodeZeroIsLeader before the next attempt. Without it, a real re-election that moved leadership off nodes[0] mid-probe would have spun on the follower engine until the budget expired and spuriously failed the gate. Gated on isTransientNotLeaderErr specifically (not the broader retryable widener) so ctx-timeout cases don't pay for the re-nudge. Signature: waitForWriteableLeader now takes (peers, waitInterval); sole caller updated.

  • claude[bot] residual nit (round-2 review): stripped gemini PR #898 HIGH point 1/2 attribution refs from proposeReadinessOnce + isReadinessProbeRetryable doc comments. Same CLAUDE.md convention as Minor Update dependency go to v1.21.4 #2 cleanup.

go test -race -count=2 passes (7.9s); golangci-lint 0 issues.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 15s —— View job


Review of PR #898 — Round 2 fixes (e73b625)

Tasks

  • Load tools
  • Read adapter/test_util.go current state
  • Verify codex P2 fix (ensureNodeZeroIsLeader re-nudge on transient not-leader)
  • Verify signature change (waitForWriteableLeader takes peers, waitInterval)
  • Verify doc comment cleanup (gemini attribution refs removed)
  • Review for correctness, concurrency safety, and edge cases
  • Post findings

Summary

Both changes are correct. The codex P2 concern is fully addressed, and the doc comment cleanup is complete. No blockers.


Verification: codex P2 fix

The re-nudge is gated on isTransientNotLeaderErr(err) && len(peers) > 0 (line 391), not the broader isReadinessProbeRetryable. This is the right gate: ctx-timeout retries (context.DeadlineExceeded / context.Canceled) don't trigger the re-election nudge, since a ctx timeout doesn't imply that leadership has moved — it just means the per-attempt budget ran out on a slow runner. A true not-leader error does imply leadership may have moved off nodes[0], so the narrower gate is correct.

Loop trace for the not-leader path:

  1. Top-of-loop remaining <= 0 check → ok
  2. proposeReadinessOnceerrNotLeader
  3. isReadinessProbeRetryable → true (via isTransientNotLeaderErr)
  4. isTransientNotLeaderErr(err) && len(peers) > 0 → true → ensureNodeZeroIsLeader(t, nodes, peers, time.Until(deadline), waitInterval)
  5. Deadline check → sleep → retry

Loop trace for the ctx-timeout path:

  1. Top-of-loop remaining <= 0 → ok
  2. proposeReadinessOncecontext.DeadlineExceeded
  3. isReadinessProbeRetryable → true
  4. isTransientNotLeaderErr(err)false → re-nudge skipped
  5. Deadline check → sleep → retry

Both paths are correct.


Verification: signature change

waitForWriteableLeader(t, nodes, peers, timeout, waitInterval time.Duration) — sole caller at line 332 passes (t, nodes, peers, waitTimeout, waitInterval). No other callers (grep confirmed). ✓


Verification: doc comment cleanup

proposeReadinessOnce (line 401–406) and isReadinessProbeRetryable (line 419–424) no longer carry (gemini PR #898 HIGH point N) attribution refs. The comments now state the WHY only — the constraint (per-attempt ctx bounding, ctx-error widening). CLAUDE.md minor from Round 2 review fully resolved. ✓


Minor observations

1. ensureNodeZeroIsLeader failure message when budget is nearly exhausted

When time.Until(deadline) is close to zero, ensureNodeZeroIsLeader passes that as waitTimeout to its own require.Eventually calls. If those time out, the t.FailNow comes from require.Eventually with the message "no node became leader" or "node 0 did not become leader" — not from the outer t.Fatalf("readiness probe never committed within %s"). The failure is still caught and the message is still informative; the attribution in the log will differ from what the outer loop would print. Not a bug — just a note for future diagnosis. Consistent with how waitForRaftReadiness already passes the full waitTimeout to ensureNodeZeroIsLeader and waitForStableLeader independently.

2. 50ms sleep after a successful ensureNodeZeroIsLeader

After the re-nudge completes, the loop checks the deadline and then sleeps leaderChurnRetryInterval (50ms) before the next proposeReadinessOnce. This is fine: the top-of-loop remaining <= 0 check and the min(perProposeTimeout, remaining) bound in proposeReadinessOnce both catch a near-exhausted budget on the next iteration. No issue.


Verdict

Round 2 is clean. Codex P2 fully resolved, CLAUDE.md attribution cleanup complete. The probe loop now handles both leadership-moved (re-nudge) and ctx-timeout (retry without re-nudge) paths correctly. Approved.
| Branch: leader-churn-helper

@bootjp bootjp merged commit 6f2941b into main Jun 1, 2026
9 checks passed
@bootjp bootjp deleted the leader-churn-helper branch June 1, 2026 14:02
bootjp added a commit that referenced this pull request Jun 2, 2026
… tests

Third flake-root-cause PR, after PR #891 (SQS throttle refill) and
PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake
mechanism has been fixed per-test for months; this PR generalises
the structural fix and migrates all extant sites.

Recurring flake (PR #818 lineage):
  Tests that "set a short TTL, sleep past it, assert the key is
  gone" race their own wall clock under -race on slow CI runners.
  The post-sleep read can sometimes still observe an un-expired
  view because the inter-call pause does not absorb scheduler
  jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible)
  by switching to require.Eventually with a TTL-derived deadline.
  The other 8 sites in the test suite kept the fragile
  time.Sleep(ttl + 50ms) pattern.

Structural fix
==============

  - New helper eventuallyExpired(t, ttl, condition, msg) in
    test_util.go. Encapsulates the require.Eventually +
    ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so
    future test writers reach for it instead of inventing fresh
    time.Sleep margins.

  - Migrated 8 sites to the helper:
    * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired,
      TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired.
    * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL,
      TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*.
    * redis_misskey_compat_test.go: post-EX-1 expiry assertion.

  - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath
    _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go)
    convenience wrappers. Their sole callers were the migrated sites.

  - Added errors stdlib import to the three test files where the
    helper's condition closures use errors.Is.

Caller audit per /loop semantic-change rule
============================================

  - eventuallyExpired: new helper, 8 callers in 3 test files. All
    test-only.
  - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo
    confirms zero remaining call sites.
  - collectionFastPathTTL / luaFastPathTTL consts: still used by
    the migrated sites as the ttl argument to eventuallyExpired.
    Unchanged.

Validation
==========

  go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s
  gofmt + go vet + golangci-lint run -> 0 issues

Closes the "Redis TTL window deadline-multiplier" follow-up noted
in PR #891's description.
bootjp added a commit that referenced this pull request Jun 2, 2026
… tests (#903)

## Summary

Third flake-root-cause PR, after #891 (SQS throttle refill) and #898
(Raft readiness probe). Same recipe: a recurring CI-flake mechanism has
been patched per-test for months; this PR generalises the structural fix
and migrates the extant sites.

## Recurring flake (PR #818 lineage)

Tests that *"set a short TTL, sleep past it, assert the key is gone"*
race their own wall clock under `-race` on slow CI runners. The
post-sleep read sometimes observes an un-expired view because the
inter-call pause does not absorb scheduler jitter. [PR
#818](#818) fixed one instance
(`TestRedis_ExpiredKey_BecomesInvisible`) by switching to
`require.Eventually` with a TTL-derived deadline. The other 8 sites kept
the fragile `time.Sleep(ttl + 50ms)` pattern.

## Structural fix

- **New helper** `eventuallyExpired(t, ttl, condition, msg)` in
`test_util.go`. Encapsulates the `require.Eventually +
ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms)` idiom so future test
writers reach for it instead of inventing fresh `time.Sleep` margins.

- **Migrated 8 sites** to the helper:
- `redis_collection_fastpath_test.go`: `TestRedis_HGET_TTLExpired`,
`TestRedis_HEXISTS_TTLExpired`, `TestRedis_SISMEMBER_TTLExpired`
- `redis_lua_collection_fastpath_test.go`: `TestLua_HGET_HashTTL`,
`TestLua_HEXISTS_HashTTL`, `TestLua_ZSCORE_*`, `TestLua_ZRANGEBYSCORE_*`
  - `redis_misskey_compat_test.go`: post-`EX 1` expiry assertion

- **Removed unused** `waitForTTLExpiry`
(redis_collection_fastpath_test.go) and `waitForLuaTTL`
(redis_lua_collection_fastpath_test.go) convenience wrappers. Sole
callers were the migrated sites.

- Added `errors` stdlib import to the three test files where the
helper's condition closures use `errors.Is`.

## Caller audit (per /loop semantic-change rule)

- `eventuallyExpired`: new helper, 8 callers in 3 test files. All
test-only.
- `waitForTTLExpiry` / `waitForLuaTTL`: removed. Repo-wide grep confirms
zero remaining call sites.
- `collectionFastPathTTL` / `luaFastPathTTL` consts: still used by the
migrated sites as the `ttl` argument to `eventuallyExpired`. Unchanged.

## Validation

- `go test ./adapter/ -run
'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey'
-race -count=2` → ok 32.3s
- `gofmt` + `go vet` + `golangci-lint run` → 0 issues

## Scope

Closes the *"Redis TTL window deadline-multiplier"* follow-up noted in
[PR #891's description](#891).

## Out-of-scope (still on the flake follow-up backlog)

- VM pool / cache hit-rate assertion → "don't assert on cache hit rates"
rule. The structural lesson from commit
[7783c27](7783c279) is to
never assert on cache hit rates; needs a doc rather than a code change.
bootjp added a commit that referenced this pull request Jun 2, 2026
PR #898 added waitForWriteableLeader to waitForRaftReadiness, which
drives one no-op Raft entry through full quorum before createNode
returns. The post-readiness leader-churn window the
rpushEventually / lpushEventually wrappers absorbed is now closed at
the readiness layer instead. The wrappers are dead code.

Changes:
  - Drop rpushEventually + lpushEventually from test_util.go (the
    only two callers, in redis_lua_compat_test.go, were first-write
    -after-createNode sites — replaced with direct RPush/LPush
    calls + a comment pointing back to PR #898).
  - Drop doEventually too: its only callers were the two helpers
    above. retryNotLeader (the underlying classifier loop) stays —
    grpc_test.go still uses it for *mid-test* leader churn inside
    its 9999-iteration consistency loops, which PR #898 does not
    address.
  - Drop the now-unused redis import from test_util.go.
  - Refresh the leaderChurnRetryTimeout / retryNotLeader doc
    comments to call out that startup-window churn is closed at
    the readiness layer, and retryNotLeader is for mid-test loops.

Caller audit per /loop semantic-change rule
============================================

  - rpushEventually / lpushEventually: grep across the repo
    confirms zero remaining call sites after the migration. Safe
    to delete.
  - doEventually: only called by the two helpers above; also safe
    to delete.
  - retryNotLeader: kept. Sole external caller is
    adapter/grpc_test.go (9 sites inside two consistency-loop
    tests). All those wrap mid-test RPCs that can race a real
    re-election under CI load — the readiness probe does not
    help mid-test churn, so the helper retains its purpose.

Validation
==========

  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists
    -race -count=3 -> ok 5.1s
    (the direct RPush/LPush calls inherit the writeable-leader
     gate from createNode, no flake)
  gofmt + go vet + golangci-lint run -> 0 issues
bootjp added a commit that referenced this pull request Jun 2, 2026
PR #898 added waitForWriteableLeader to waitForRaftReadiness, which
drives one no-op Raft entry through full quorum before createNode
returns. The post-readiness leader-churn window the
rpushEventually / lpushEventually wrappers absorbed is now closed at
the readiness layer instead. The wrappers are dead code.

Changes:
  - Drop rpushEventually + lpushEventually from test_util.go (the
    only two callers, in redis_lua_compat_test.go, were first-write
    -after-createNode sites — replaced with direct RPush/LPush
    calls + a comment pointing back to PR #898).
  - Drop doEventually too: its only callers were the two helpers
    above. retryNotLeader (the underlying classifier loop) stays —
    grpc_test.go still uses it for *mid-test* leader churn inside
    its 9999-iteration consistency loops, which PR #898 does not
    address.
  - Drop the now-unused redis import from test_util.go.
  - Refresh the leaderChurnRetryTimeout / retryNotLeader doc
    comments to call out that startup-window churn is closed at
    the readiness layer, and retryNotLeader is for mid-test loops.

Caller audit per /loop semantic-change rule
============================================

  - rpushEventually / lpushEventually: grep across the repo
    confirms zero remaining call sites after the migration. Safe
    to delete.
  - doEventually: only called by the two helpers above; also safe
    to delete.
  - retryNotLeader: kept. Sole external caller is
    adapter/grpc_test.go (9 sites inside two consistency-loop
    tests). All those wrap mid-test RPCs that can race a real
    re-election under CI load — the readiness probe does not
    help mid-test churn, so the helper retains its purpose.

Validation
==========

  go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists
    -race -count=3 -> ok 5.1s
    (the direct RPush/LPush calls inherit the writeable-leader
     gate from createNode, no flake)
  gofmt + go vet + golangci-lint run -> 0 issues
bootjp added a commit that referenced this pull request Jun 2, 2026
…#902)

## Summary

Follow-up to PR #898. The readiness probe (`waitForWriteableLeader`)
closes the startup leader-churn window at the readiness layer, so the
per-test `rpushEventually` / `lpushEventually` wrappers that absorbed
that window become dead code.

## What's dropped

- `rpushEventually` + `lpushEventually` (`test_util.go`): only callers
were `redis_lua_compat_test.go` lines 128–129 (first writes immediately
after `createNode`). Migrated to direct `rdb.RPush` / `rdb.LPush` with a
comment pointing back to PR #898.
- `doEventually`: only used by the two helpers above; no other callers.
- Unused `github.com/redis/go-redis/v9` import in `test_util.go`.

## What's kept

- `retryNotLeader`: still used in `grpc_test.go` (9 sites) wrapping
mid-test RPCs inside 9999-iteration consistency loops. The readiness
probe does NOT help **mid-test** churn (a real re-election can fire
halfway through a long-running loop), so this helper retains its
purpose. The doc comment is refreshed to clarify the startup vs mid-test
distinction.

## Caller audit (per /loop semantic-change rule)

Repo-wide grep:
- `rpushEventually\b` / `lpushEventually\b`: 0 callers remain (only
historical comment refs in `redis_lua_compat_test.go`).
- `doEventually\b`: 0 callers remain.
- `retryNotLeader\b`: 9 callers, all in `grpc_test.go`. Unchanged.

## Validation

- `go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race
-count=3` → ok 5.1s (direct RPush/LPush inherits the writeable-leader
gate from `createNode`; no flake)
- `gofmt` + `go vet` + `golangci-lint run` → 0 issues

## Scope

Closes the "*Eventually helper unnecessization audit" follow-up noted in
PR #898's description.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant