Skip to content

test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests#891

Merged
bootjp merged 2 commits into
mainfrom
fix/sqs-throttle-invalidate-flake
May 30, 2026
Merged

test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests#891
bootjp merged 2 commits into
mainfrom
fix/sqs-throttle-invalidate-flake

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 30, 2026

Summary

Structural fix for the recurring SQS-throttle CI flake. PR #890 just merged a per-test fix for SetQueueAttributesInvalidatesBucket; this PR generalizes the rule across the whole file so the next drain-pattern test added to adapter/sqs_throttle_integration_test.go cannot regress to the same flake.

Why a structural fix, not another per-test commit

The same CI-flake mechanism has been fixed three times in three different SQS-throttle tests over ~6 weeks:

Commit Test
54c6cd56 NoOpSetQueueAttributesPreservesBucket (PR #819)
204b7294 SetQueueAttributesInvalidatesBucket (PR #890)
This PR structural prevention for the remaining drain-pattern tests

Each per-test fix touched one test in isolation, leaving the same trap in sibling tests for the next slow CI runner to step on. Two sibling tests are still unprotected at the time of this PR:

  • TestSQSServer_Throttle_SendBucketRejectsAfterCapacityHIGH risk, drain 10 then assert 11th rejects, Refill=1
  • TestSQSServer_Throttle_RecvBucketRejectsAfterCapacityHIGH risk, receive-side mirror, same pattern

Plus BatchChargesByEntryCount is theoretically at risk on a very slow runner.

Root cause (recurring across all three flake instances)

Tests that (a) set ThrottleSend/RecvRefillPerSecond AND (b) drain the bucket then assert a "throttled" response are racing their own wall clock. Under -race on slow CI runners each request elapses ~100-250 ms (full HTTP + Raft propose+apply). An N-write drain plus its assertion send takes ~N×150 ms ± noise. At Refill=1/sec the bucket accumulates ≥1 token over that window and the "throttled" assertion fires against a refilled bucket, returning 200 instead of 400 — a false positive against a real production-contract regression test.

Structural fix

  1. Shared file-head constants + helpers in adapter/sqs_throttle_integration_test.go:

    const slowRefillRate = "0.01"  // 1 token per 100 seconds
    func newSendThrottleAttrs(capacity int) map[string]string
    func newRecvThrottleAttrs(capacity int) map[string]string

    Helpers lock the refill rate to slowRefillRate so a future drain-pattern test cannot pick a racing value by accident.

  2. File-head doc-comment block explaining:

    • The recurring flake mechanism (with prior-commit refs)
    • The rule: drain-then-assert MUST use the helpers
    • The exception: config-round-trip / validator-rejection / post-raise tests where the rate is part of the test contract
  3. Migrated all drain-pattern call sites to the helpers:

Not migrated (intentional):

  • GetQueueAttributesRoundTrip (Refill=50/5 — exact values are the test contract)
  • RejectsCapacityBelowBatchMin (Refill=1 — value is part of the rejected input the validator must catch)
  • SetQueueAttributesInvalidatesBucket post-raise phase (Refill=20 — raising both Capacity AND Refill is what the phase asserts fires)

Caller audit (per /loop semantic-change rule)

Test-only file. Helpers are package-private and only referenced within this single file (verified by grep across the repo). No production code touched. No other test file uses these names.

Retry-After header assertion change in SendBucketRejectsAfterCapacity was the single observable behavior change: header depends on refillRate via the §3.4 formula. Old "1" matched Refill=1; new "100" matches Refill=0.01. The header contract is still exercised — only the asserted literal changed.

Validation

  • go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -timeout 180s → ok 5.1s, all three iterations green
  • gofmt + go vet + golangci-lint run → 0 issues

Out-of-scope follow-ups (other recurring flake categories)

Broader CI-flake history (47 fix-CI-flake commits) has other recurring patterns that deserve similar structural treatment — tracked as future work, out of scope for this PR:

  • Leader churn + Raft async apply: many tests retry on "not leader" or wait for follower catch-up. A shared eventuallyLeaderOK(t, ...) helper would centralize the retry + budget knobs.
  • Redis TTL window racing: multiple "widen TTL" commits suggest a per-test deadline-multiplier helper.
  • VM pool / cache hit-rate assertions: dropped as flaky in 7783c279 — the structural lesson is to avoid asserting on cache hit rates.

This PR focuses on the most-repeated single flake (3 fixes for the same mechanism in 6 weeks). The remaining categories each warrant their own follow-up PR after we see how this pattern works in practice.

Summary by CodeRabbit

  • Tests
    • Enhanced SQS throttle test infrastructure to improve test reliability in continuous integration environments.

Review Change Stack

…attern tests

The SQS throttle test file has been hit by the same CI flake mechanism
three times in different tests:

  54c6cd5  NoOpSetQueueAttributesPreservesBucket   (PR #819)
  204b729  SetQueueAttributesInvalidatesBucket    (PR #890 first commit)
  <this>    structural prevention for the remaining drain-pattern tests

Each fix touched one test in isolation, leaving the same trap in
sibling tests for the next slow CI runner to step on. This commit
extracts the rule into a shared helper so future drain-pattern tests
cannot regress to the same flake.

Root cause (recurring)
======================

Tests that (a) set ThrottleSend/RecvRefillPerSecond AND (b) drain
the bucket then assert a "throttled" response are racing their own
wall clock: under -race on slow CI runners each request elapses
~100-250ms (full HTTP + Raft propose+apply). An N-write drain plus
its assertion send takes ~N*150ms +/- noise. At Refill=1/sec the
bucket accumulates >=1 token over that window and the "throttled"
assertion fires against a refilled bucket, returning 200 instead
of the expected 400 -- a false positive against a real production
contract regression.

Three commits have fixed exactly this -- each time in just the one
test that flaked. Sibling tests using the same pattern remained
unprotected.

Structural fix
==============

1. Shared file-head constants + helpers:

     const slowRefillRate = "0.01"  // 1 token per 100 seconds
     func newSendThrottleAttrs(capacity int) map[string]string
     func newRecvThrottleAttrs(capacity int) map[string]string

   Helpers lock the refill rate to slowRefillRate so a future
   drain-pattern test cannot pick a racing value by accident.

2. File-head doc-comment block explaining:
   - The recurring flake mechanism (with prior-commit refs).
   - The rule: drain-then-assert MUST use the helpers.
   - The exception: config-round-trip / validator-rejection /
     post-raise tests where the rate is part of the test contract.

3. Migrated all drain-pattern call sites to the helpers:
   - TestSQSServer_Throttle_SendBucketRejectsAfterCapacity (HIGH risk;
     same drain-then-assert pattern at Refill=1, unprotected until
     now). Retry-After assertion updated from "1" to "100" since the
     formula `ceil(needed/refillRate)` now returns 100 s for
     refillRate=0.01.
   - TestSQSServer_Throttle_RecvBucketRejectsAfterCapacity (HIGH
     risk; receive-side mirror, same race).
   - TestSQSServer_Throttle_BatchChargesByEntryCount (LOW risk in
     practice -- only 2 calls -- but migrated for consistency).
   - TestSQSServer_Throttle_SetQueueAttributesInvalidatesBucket
     (already fixed inline by PR #890 first commit; refactored to
     use the helper).
   - TestSQSServer_Throttle_DeleteQueueInvalidatesBucket (no drain
     assertion -- both call sites migrated for uniformity).
   - TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket
     (already fixed inline by PR #819; local slowRefill const +
     throttleAttrs map literal dropped in favor of the shared
     helper).

Not migrated (intentional):
   - TestSQSServer_Throttle_GetQueueAttributesRoundTrip
     (Refill=50/5 -- exact values are the test contract).
   - TestSQSServer_Throttle_RejectsCapacityBelowBatchMin
     (Refill=1 -- value is part of the rejected input the
     validator must catch).
   - SetQueueAttributesInvalidatesBucket post-raise phase
     (Refill=20 -- raising both Capacity AND Refill is the
     thing this phase is asserting fires).

Caller audit per /loop semantic-change rule
============================================

Test-only file. Helpers are package-private (lowercase) and only
referenced within this single file (verified by grep
'newSendThrottleAttrs\|newRecvThrottleAttrs\|slowRefillRate'
across the repo). No production code touched. No other test file
uses these names.

Retry-After header assertion in
TestSQSServer_Throttle_SendBucketRejectsAfterCapacity was the
single observable behavior change: header value depends on the
configured refillRate via the documented §3.4 formula. Old "1"
matched Refill=1; new "100" matches Refill=0.01 (verified by
reading computeRetryAfter in sqs_throttle.go:525). The header
contract test is still exercised -- only the asserted literal
value changed.

Validation
==========

  go test ./adapter/ -run TestSQSServer_Throttle -race -count=3
    -> ok (5.1s, all three iterations green)
  gofmt + go vet + golangci-lint run -> 0 issues

Out-of-scope follow-ups (other recurring flake categories)
==========================================================

This commit addresses the SQS-throttle subset. The broader CI flake
history (47 fix-CI-flake commits over the last several months) has
other recurring patterns that deserve similar structural treatment:

  - Leader churn + Raft async apply: many tests retry on "not
    leader" or wait for follower catch-up. A shared
    eventuallyLeaderOK(t, ...) helper would centralize the retry
    + budget knobs.
  - Redis TTL window racing: multiple "widen TTL" commits suggest
    a per-test deadline-multiplier helper would be a better
    pattern than ad-hoc widening.
  - VM pool / cache hit-rate assertions: dropped as flaky in
    commit 7783c27. The structural lesson is to not assert on
    cache hit rates at all; need a doc somewhere.

These are tracked as future work; out of scope for this PR which
focuses on the most-repeated single flake (3 fixes for the same
mechanism in 6 weeks).
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 48 minutes and 43 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c637023e-be93-4fb7-8964-cdfdccd1a5da

📥 Commits

Reviewing files that changed from the base of the PR and between 17524a7 and e5f3176.

📒 Files selected for processing (1)
  • adapter/sqs_throttle_integration_test.go
📝 Walkthrough

Walkthrough

This PR hardens SQS throttle integration tests against CI flakes by introducing a shared slowRefillRate constant and helper functions that generate standardized throttle attribute maps. Six tests are updated to use these helpers, replacing hard-coded refill values, and one Retry-After header assertion is adjusted to match the new slow refill rate's rounding.

Changes

SQS Throttle Test CI-Flake Guardrails

Layer / File(s) Summary
Test infrastructure and helpers
adapter/sqs_throttle_integration_test.go
Defines slowRefillRate = "0.01" and introduces newSendThrottleAttrs(capacity) and newRecvThrottleAttrs(capacity) helper functions that produce standardized throttle attribute maps locked to the slow refill rate.
Migrate integration tests to use throttle helpers
adapter/sqs_throttle_integration_test.go
SendBucketRejectsAfterCapacity, ReceiveBucketRejectsAfterCapacity, BatchCharges, SetQueueAttributesInvalidatesBucket, DeleteQueueInvalidatesBucket, and NoOpSetQueueAttributesPreservesBucket tests are updated to call the new helpers instead of inline attribute maps; the Retry-After assertion in SendBucketRejectsAfterCapacity is changed from "1" to "100" to align with the slow refill rate.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • bootjp/elastickv#890: Also modifies adapter/sqs_throttle_integration_test.go to address the same SQS throttle CI flake by adjusting SetQueueAttributesInvalidatesBucket's slow refill configuration.

Poem

🐰 A refill rate so slow it stays,
Constants shared through all the days,
Tests unified, no more the drift,
CI flakes begone—a stable gift!
Helpers bundled, once or twice,
Throttle testing, oh so nice! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests' accurately describes the main change: introducing shared helper functions for throttle configuration in SQS tests to fix a CI flake.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sqs-throttle-invalidate-flake

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors SQS throttle integration tests to use structural helpers with a slow refill rate (0.01 tokens/sec), preventing CI flakes caused by wall-clock races during drain-then-assert patterns. The reviewer pointed out a potential new flake in TestSQSServer_Throttle_SendBucketRejectsAfterCapacity where asserting an exact Retry-After value of "100" might fail on slow CI runners if more than a second elapses between requests. They recommend asserting a safe range (e.g., [90, 100]) instead.

Comment on lines 137 to 139
if ra := resp.Header.Get("Retry-After"); ra != "100" {
t.Fatalf("11th send: Retry-After=%q (expected 100 — slowRefillRate)", ra)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Asserting an exact Retry-After value of "100" can introduce a new test flake on slow CI runners.

Why this happens:

The Retry-After header is calculated as math.Ceil(needed / refillRate).
With refillRate = 0.01, needed / refillRate is (1 - elapsed * 0.01) / 0.01 = 100 - elapsed.
If the time elapsed (elapsed) between the 10th send and the 11th send is strictly greater than 1.0 second (which is highly possible on slow or busy CI runners under -race), 100 - elapsed will be less than 99.0.
Consequently, math.Ceil(100 - elapsed) will round up to 99 or less, causing the exact assertion ra != "100" to fail.

Recommendation:

Parse the Retry-After header value and assert that it falls within a safe range (e.g., [90, 100]) to tolerate realistic scheduling or network delays on slow CI runners.

	raVal, err := strconv.Atoi(resp.Header.Get("Retry-After"))
	if err != nil {
		t.Fatalf("11th send: invalid Retry-After header: %v", err)
	}
	if raVal < 90 || raVal > 100 {
		t.Fatalf("11th send: Retry-After=%d (expected between 90 and 100)", raVal)
	}

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
adapter/sqs_throttle_integration_test.go (1)

56-57: ⚡ Quick win

Remove //nolint:unparam by simplifying helper signatures.

These suppressions can be avoided by making the helpers zero-arg (current usage is fixed at capacity 10), then re-introducing parameters when a real caller needs a different value.

Suggested diff
+const drainPatternCapacity = 10
+
-//nolint:unparam // capacity parameter is forward-looking; see doc.
-func newSendThrottleAttrs(capacity int) map[string]string {
+func newSendThrottleAttrs() map[string]string {
 	return map[string]string{
-		"ThrottleSendCapacity":        strconv.Itoa(capacity),
+		"ThrottleSendCapacity":        strconv.Itoa(drainPatternCapacity),
 		"ThrottleSendRefillPerSecond": slowRefillRate,
 	}
 }
 
-//nolint:unparam // capacity parameter is forward-looking; see newSendThrottleAttrs.
-func newRecvThrottleAttrs(capacity int) map[string]string {
+func newRecvThrottleAttrs() map[string]string {
 	return map[string]string{
-		"ThrottleRecvCapacity":        strconv.Itoa(capacity),
+		"ThrottleRecvCapacity":        strconv.Itoa(drainPatternCapacity),
 		"ThrottleRecvRefillPerSecond": slowRefillRate,
 	}
 }

As per coding guidelines, "**/*.go: ... Avoid adding //nolint annotations — refactor instead."

Also applies to: 69-70

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@adapter/sqs_throttle_integration_test.go` around lines 56 - 57, The helper
newSendThrottleAttrs currently takes an unused capacity parameter and is
suppressed with //nolint:unparam; remove the nolint and simplify the signature
to a zero-arg function (e.g., newSendThrottleAttrs()) since all current callers
use the fixed capacity (10), and update any analogous helper (the one at lines
69-70) the same way so both helpers have no unused parameters; adjust call sites
to invoke the zero-arg functions and re-run linters to ensure the //nolint
annotations can be removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@adapter/sqs_throttle_integration_test.go`:
- Around line 56-57: The helper newSendThrottleAttrs currently takes an unused
capacity parameter and is suppressed with //nolint:unparam; remove the nolint
and simplify the signature to a zero-arg function (e.g., newSendThrottleAttrs())
since all current callers use the fixed capacity (10), and update any analogous
helper (the one at lines 69-70) the same way so both helpers have no unused
parameters; adjust call sites to invoke the zero-arg functions and re-run
linters to ensure the //nolint annotations can be removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff78a714-1ad0-480d-b5ea-945544181de8

📥 Commits

Reviewing files that changed from the base of the PR and between 3599556 and 17524a7.

📒 Files selected for processing (1)
  • adapter/sqs_throttle_integration_test.go

…pers

gemini HIGH (Retry-After exact-value assertion is itself flaky):
  Asserting Retry-After == "100" exactly re-introduces the wall-clock
  flake at the assertion layer. computeRetryAfter formula is
  ceil((1 - currentTokens) / refillRate), and currentTokens is the
  refill accumulated during the drain loop (each drain send takes
  ~150 ms wall, at refill=0.01 that yields ~0.015 tokens per send,
  so by the 11th request currentTokens is a small positive number).
  ceil((1 - 0.015) / 0.01) = ceil(98.5) = 99, not 100. Range check
  [1, 100] captures the §3.4 contract (positive integer bounded by
  1/refillRate) without re-racing the wall clock.

coderabbit nitpick (//nolint:unparam suppressions):
  Removed by making helpers zero-arg. Capacity is hardcoded as a
  documented const drainBucketCapacity = "10" (the §3.2 validator
  batch-floor minimum, doubles as the smallest value usable by
  both single and batch drains). YAGNI: future tests needing a
  different capacity should construct the map literally rather
  than parameterizing the helper. All 9 helper call sites updated
  from foo(10) to foo() via sed.

Caller audit: test-only. Helpers are package-private; grep across
the repo confirms zero external references. Const drainBucketCapacity
is new name, no collision risk verified.

Validation:
  go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -> ok
  golangci-lint run -> 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 30, 2026

@claude review

Round 1 fixes (commit e5f3176):

  • gemini HIGH (Retry-After exact-value assertion is itself flaky): Replaced exact != "100" check with a range check [1, 100]. The formula ceil((1 - currentTokens) / refillRate) depends on drain-window elapsed time (refill accumulates during the loop), so asserting "100" exactly re-introduces a wall-clock flake at the assertion layer. The range pins the §3.4 contract (positive integer bounded by 1/refillRate) without racing.

  • coderabbit nitpick (//nolint:unparam): Removed by making helpers zero-arg. Capacity is now a documented file-level const drainBucketCapacity = "10" (the §3.2 validator batch-floor minimum). All 9 call sites updated from foo(10)foo().

Caller audit: test-only file, helpers package-private, no external references. go test -race -count=3 passes (5.3s); golangci-lint 0 issues.

@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Claude finished @bootjp's task in 2m 23s —— View job


Review of PR #891 — Round 1 fixes (commit e5f3176)

  • Read the changed file and diff
  • Verify the Retry-After range fix correctness
  • Verify the zero-arg helper refactor
  • Check for any remaining issues
  • Post review findings

Both Round 1 fixes are correct. Here's the verification:

Fix 1: Retry-After range [1, 100] — ✅ Correct

Traced through computeRetryAfter (sqs_throttle.go:525) and writeSQSThrottlingError (sqs.go:585):

needed = 1 - currentTokens
secs   = ceil(needed / 0.01)

The mathematical bounds with slowRefillRate=0.01:

  • Upper bound 100: when currentTokens = 0 exactly → ceil(1.0/0.01) = 100. This is then written as int(retryAfter/time.Second)100.
  • Lower bound 1: from two guards — computeRetryAfter clamps secs < 1 → 1, and writeSQSThrottlingError additionally clamps retryAfter < time.Second → time.Second. On an extremely slow runner where the drain loop itself takes >100s, currentTokens ≥ 1 triggers the needed ≤ 0 path which returns time.SecondraSecs = 1.
  • No escape outside [1, 100]: throttleRetryAfterCap = time.Hour only applies above 3600s, which is unreachable at capacity=10 with refill=0.01.

The range [1, 100] is the exact tight bound. The commit message math (ceil(98.5) = 99 for a 10×150ms drain) is also correct — the range handles intermediate values without over-constraining.

Fix 2: Zero-arg helpers + drainBucketCapacity const — ✅ Clean

  • //nolint:unparam annotations removed as requested
  • drainBucketCapacity = "10" (string type is correct — values go directly into map[string]string)
  • strconv import is still valid: used by strconv.Atoi(ra) at line 152 and strconv.Itoa(i) at lines 98, 124, 210–211
  • All 9 call sites correctly updated from foo(10)foo()

Pre-existing gap (not introduced by Round 1, low priority)

RecvBucketRejectsAfterCapacity (adapter/sqs_throttle_integration_test.go:166-190) checks status and x-amzn-ErrorType but never asserts Retry-After. The send test at line 151-158 does verify it. If the recv path ever diverges on the header, there's no test coverage for it. Worth a follow-up but out of scope for this PR.

Summary

Both reviewer concerns from Round 1 are addressed correctly:

  • gemini HIGH: range check [1, 100] is mathematically tight and won't re-introduce a wall-clock flake
  • coderabbit nitpick: zero-arg helpers eliminate //nolint:unparam without leaving dead code

LGTM.

@bootjp bootjp merged commit 5e7b04a into main May 30, 2026
9 checks passed
@bootjp bootjp deleted the fix/sqs-throttle-invalidate-flake branch May 30, 2026 14:13
bootjp added a commit that referenced this pull request Jun 1, 2026
…orRaftReadiness (#898)

## Summary

Structural fix for the recurring **leader-churn CI flake** pattern.
Closes the post-readiness step-down window at the readiness layer
instead of asking every new test author to remember to use `*Eventually`
helpers.

Companion to PR #891 (SQS throttle structural fix): same recipe — *each
per-test fix to a recurring flake mechanism leaves the same trap for the
next test*. PR #891 hardened the SQS throttle-refill subset; this PR
hardens the leader-churn subset.

## Recurring flake pattern (5+ commits this year)

| Commit | What it touched |
|---|---|
| [`5ba2ef94`](5ba2ef94) |
added `doEventually` / `rpushEventually` / `lpushEventually`, wrapped 2
RPUSH/LPUSH calls in `TestRedis_LuaRPopLPushBullMQLikeLists` |
| [`2b6e8f28`](2b6e8f28) |
widened raft election budget + retried gRPC on leader churn |
| [`78b9d091`](78b9d091) |
retried leader churn in consistency loops |
| [`0fb7db76`](0fb7db76) |
fixed flaky `TestRedis_LuaRPopLPushBullMQLikeLists` (re-fix) |
| [`ba3f5bf8`](ba3f5bf8) |
waited for follower apply catch-up in MultiExec |

Root cause is the same: `waitForStableLeader` only proves leadership
*appears* stable for a 300 ms window. The freshly-elected leader can
still step down on its first real heartbeat-quorum miss after that
window closes, surfacing as `"etcd raft engine is not leader"` on the
test's first write.

## Mechanism

Add `waitForWriteableLeader` that drives one no-op Raft entry through
full quorum after `waitForStableLeader` returns:

```go
var raftReadinessProbePayload = []byte{
    0x02,                                           // raftEncodeHLCLease
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op)
}
```

- **State-machine no-op**: `kv/fsm.go applyHLCLease` reads `data[1:]` as
a big-endian uint64 ceiling. The `if ceilingMs > 0` guard at line 185
makes ceiling=0 a true no-op (no HLC mutation, no storage write, no
encryption dispatch).
- **Quorum-proving**: the entry still traverses the full propose →
append → replicate → commit → apply pipeline, so a successful `Propose`
proves the leader can actually drive quorum.
- **Retry semantics**: retries on transient leader-unavailable errors
(same set `isTransientNotLeaderErr` already classifies) up to
caller-supplied `waitTimeout` (10s at the current `createNode` call
site). Per-propose timeout is 2s.

Wired into `waitForRaftReadiness` as the final step.

## Caller audit (per /loop semantic-change rule)

- `waitForRaftReadiness`: single production call site (`createNode`);
behavior change is **additive** — same semantics PLUS the
writeable-leader gate. Tests that previously raced into a step-down
window now block until the leader has committed one entry.
- `waitForWriteableLeader` / `raftReadinessProbePayload`:
package-private to `adapter/test_util.go`; no external callers (grep
verified).
- `applyHLCLease` (`kv/fsm.go`): receives one extra invocation per
`createNode` call. Behavior on `ceiling=0` is documented no-op. No state
divergence across replicas — every FSM applies the same payload.
- Side effects:
  - `pendingApplyIdx` advances by 1 (normal flow)
  - encryption applier untouched (opcode `0x02` not in encryption band)
- snapshots include one extra applied index (already varies by test
timing)
  - lease provider does not track HLC entries
- Per-test wall-clock: ~10 ms extra on a stable cluster. 29 test files
use `createNode` → ~290 ms total CI overhead. Negligible.

## Validation

- `go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race
-count=3` → ok 5.4s
- `go test ./adapter/ -run
'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI'
-race -count=2` → ok 5.6s
- `gofmt` + `go vet` + `golangci-lint run` → 0 issues

## Follow-up note

Existing per-test `*Eventually` wrappers (`rpushEventually`,
`lpushEventually`) remain valid for tests that issue writes **long
after** `createNode`, where mid-test leader churn under CI load can
still hit. The readiness probe targets the startup window specifically —
it does not obsolete the helpers for other use cases.

Subsequent commits can audit whether tests still calling
`rpushEventually` / `lpushEventually` *immediately after* `createNode`
can drop back to direct calls now that the startup window is closed at
the readiness layer.

## Out-of-scope (remaining recurring flake categories)

Tracked in the PR #891 description; still to be addressed in follow-up
PRs:
- Redis TTL window racing → per-test deadline-multiplier helper
- VM pool / cache hit-rate assertions → "don't assert on cache hit
rates" rule
bootjp added a commit that referenced this pull request Jun 2, 2026
… tests

Third flake-root-cause PR, after PR #891 (SQS throttle refill) and
PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake
mechanism has been fixed per-test for months; this PR generalises
the structural fix and migrates all extant sites.

Recurring flake (PR #818 lineage):
  Tests that "set a short TTL, sleep past it, assert the key is
  gone" race their own wall clock under -race on slow CI runners.
  The post-sleep read can sometimes still observe an un-expired
  view because the inter-call pause does not absorb scheduler
  jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible)
  by switching to require.Eventually with a TTL-derived deadline.
  The other 8 sites in the test suite kept the fragile
  time.Sleep(ttl + 50ms) pattern.

Structural fix
==============

  - New helper eventuallyExpired(t, ttl, condition, msg) in
    test_util.go. Encapsulates the require.Eventually +
    ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so
    future test writers reach for it instead of inventing fresh
    time.Sleep margins.

  - Migrated 8 sites to the helper:
    * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired,
      TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired.
    * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL,
      TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*.
    * redis_misskey_compat_test.go: post-EX-1 expiry assertion.

  - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath
    _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go)
    convenience wrappers. Their sole callers were the migrated sites.

  - Added errors stdlib import to the three test files where the
    helper's condition closures use errors.Is.

Caller audit per /loop semantic-change rule
============================================

  - eventuallyExpired: new helper, 8 callers in 3 test files. All
    test-only.
  - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo
    confirms zero remaining call sites.
  - collectionFastPathTTL / luaFastPathTTL consts: still used by
    the migrated sites as the ttl argument to eventuallyExpired.
    Unchanged.

Validation
==========

  go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s
  gofmt + go vet + golangci-lint run -> 0 issues

Closes the "Redis TTL window deadline-multiplier" follow-up noted
in PR #891's description.
bootjp added a commit that referenced this pull request Jun 2, 2026
… tests (#903)

## Summary

Third flake-root-cause PR, after #891 (SQS throttle refill) and #898
(Raft readiness probe). Same recipe: a recurring CI-flake mechanism has
been patched per-test for months; this PR generalises the structural fix
and migrates the extant sites.

## Recurring flake (PR #818 lineage)

Tests that *"set a short TTL, sleep past it, assert the key is gone"*
race their own wall clock under `-race` on slow CI runners. The
post-sleep read sometimes observes an un-expired view because the
inter-call pause does not absorb scheduler jitter. [PR
#818](#818) fixed one instance
(`TestRedis_ExpiredKey_BecomesInvisible`) by switching to
`require.Eventually` with a TTL-derived deadline. The other 8 sites kept
the fragile `time.Sleep(ttl + 50ms)` pattern.

## Structural fix

- **New helper** `eventuallyExpired(t, ttl, condition, msg)` in
`test_util.go`. Encapsulates the `require.Eventually +
ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms)` idiom so future test
writers reach for it instead of inventing fresh `time.Sleep` margins.

- **Migrated 8 sites** to the helper:
- `redis_collection_fastpath_test.go`: `TestRedis_HGET_TTLExpired`,
`TestRedis_HEXISTS_TTLExpired`, `TestRedis_SISMEMBER_TTLExpired`
- `redis_lua_collection_fastpath_test.go`: `TestLua_HGET_HashTTL`,
`TestLua_HEXISTS_HashTTL`, `TestLua_ZSCORE_*`, `TestLua_ZRANGEBYSCORE_*`
  - `redis_misskey_compat_test.go`: post-`EX 1` expiry assertion

- **Removed unused** `waitForTTLExpiry`
(redis_collection_fastpath_test.go) and `waitForLuaTTL`
(redis_lua_collection_fastpath_test.go) convenience wrappers. Sole
callers were the migrated sites.

- Added `errors` stdlib import to the three test files where the
helper's condition closures use `errors.Is`.

## Caller audit (per /loop semantic-change rule)

- `eventuallyExpired`: new helper, 8 callers in 3 test files. All
test-only.
- `waitForTTLExpiry` / `waitForLuaTTL`: removed. Repo-wide grep confirms
zero remaining call sites.
- `collectionFastPathTTL` / `luaFastPathTTL` consts: still used by the
migrated sites as the `ttl` argument to `eventuallyExpired`. Unchanged.

## Validation

- `go test ./adapter/ -run
'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey'
-race -count=2` → ok 32.3s
- `gofmt` + `go vet` + `golangci-lint run` → 0 issues

## Scope

Closes the *"Redis TTL window deadline-multiplier"* follow-up noted in
[PR #891's description](#891).

## Out-of-scope (still on the flake follow-up backlog)

- VM pool / cache hit-rate assertion → "don't assert on cache hit rates"
rule. The structural lesson from commit
[7783c27](7783c279) is to
never assert on cache hit rates; needs a doc rather than a code change.
bootjp added a commit that referenced this pull request Jun 2, 2026
…gatorRaceFree (#906)

## Summary

Fixes the CI flake observed in [Actions run
26765510693](https://github.com/bootjp/elastickv/actions/runs/26765510693/job/78890657730?pr=902):

```
--- FAIL: TestHotKeysAggregatorRaceFree (0.11s)
    hot_keys_test.go:356: Expected value not to be nil.
```

Surfaced on PR #902's CI run but **unrelated to PR #902** — the failing
test is on `main` in `keyviz/`, which PR #902 (Redis adapter
`*Eventually` helper cleanup) doesn't touch.

## Root cause

Same wall-clock-racing pattern as the SQS-throttle (PR #891) and
Redis-TTL (PR #903) flake series. The test:

1. Launches an aggregator goroutine (Step=5ms tick).
2. Spawns 8 workers each calling `Observe` 500 times.
3. `time.Sleep(50 * time.Millisecond)`.
4. Asserts both per-route snapshots are non-nil.

The aggregator publishes a snapshot only on a tick (or on ctx-done
drain). Under `-race` on a slow CI runner, the `Run` goroutine itself
can be slow to schedule — the 50ms wait may not contain a single actual
tick worth of scheduler time, so `hotKeysSnap` stays at its initial
`nil` `atomic.Pointer` load and the assertion fires.

## Fix

`require.Eventually` with a 3-second budget + 5ms poll cadence,
asserting both snapshots non-nil:

```go
require.Eventually(t, func() bool {
    return s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil
}, 3*time.Second, 5*time.Millisecond, "aggregator must publish a snapshot for each route within the budget")
```

The test still pins the load-bearing property (snapshot reads are
lock-free; the `atomic.Pointer.Load` works without contention with the
publisher); only the **wait mechanism** changes from "sleep blindly and
hope" to "poll until ready or fail". Total time bound is at most 3s on a
slow runner, ~5–50ms in the common case.

## Validation

- `go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5
-timeout 120s` → ok 1.1s (5/5 iterations green)
- `gofmt` + `golangci-lint run` → 0 issues

## Pattern continuity

This is the same kind of structural fix as PR #891 / #903: **replace
fixed-time-window sleep with `require.Eventually`** for any condition
that depends on a background goroutine's progress under `-race` on CI.
Future flakes in this category should follow the same pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant