test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests by bootjp · Pull Request #891 · bootjp/elastickv

bootjp · 2026-05-30T13:38:54Z

Summary

Structural fix for the recurring SQS-throttle CI flake. PR #890 just merged a per-test fix for SetQueueAttributesInvalidatesBucket; this PR generalizes the rule across the whole file so the next drain-pattern test added to adapter/sqs_throttle_integration_test.go cannot regress to the same flake.

Why a structural fix, not another per-test commit

The same CI-flake mechanism has been fixed three times in three different SQS-throttle tests over ~6 weeks:

Commit	Test
`54c6cd56`	`NoOpSetQueueAttributesPreservesBucket` (PR #819)
`204b7294`	`SetQueueAttributesInvalidatesBucket` (PR #890)
This PR	structural prevention for the remaining drain-pattern tests

Each per-test fix touched one test in isolation, leaving the same trap in sibling tests for the next slow CI runner to step on. Two sibling tests are still unprotected at the time of this PR:

TestSQSServer_Throttle_SendBucketRejectsAfterCapacity — HIGH risk, drain 10 then assert 11th rejects, Refill=1
TestSQSServer_Throttle_RecvBucketRejectsAfterCapacity — HIGH risk, receive-side mirror, same pattern

Plus BatchChargesByEntryCount is theoretically at risk on a very slow runner.

Root cause (recurring across all three flake instances)

Tests that (a) set ThrottleSend/RecvRefillPerSecond AND (b) drain the bucket then assert a "throttled" response are racing their own wall clock. Under -race on slow CI runners each request elapses ~100-250 ms (full HTTP + Raft propose+apply). An N-write drain plus its assertion send takes ~N×150 ms ± noise. At Refill=1/sec the bucket accumulates ≥1 token over that window and the "throttled" assertion fires against a refilled bucket, returning 200 instead of 400 — a false positive against a real production-contract regression test.

Structural fix

Shared file-head constants + helpers in adapter/sqs_throttle_integration_test.go:
```
const slowRefillRate = "0.01"  // 1 token per 100 seconds
func newSendThrottleAttrs(capacity int) map[string]string
func newRecvThrottleAttrs(capacity int) map[string]string
```
Helpers lock the refill rate to slowRefillRate so a future drain-pattern test cannot pick a racing value by accident.
File-head doc-comment block explaining:
- The recurring flake mechanism (with prior-commit refs)
- The rule: drain-then-assert MUST use the helpers
- The exception: config-round-trip / validator-rejection / post-raise tests where the rate is part of the test contract
Migrated all drain-pattern call sites to the helpers:
- SendBucketRejectsAfterCapacity (HIGH risk — unprotected until now). Retry-After assertion updated from "1" to "100" (formula ceil(needed/refillRate) returns 100s for refillRate=0.01; verified by reading computeRetryAfter in sqs_throttle.go:525).
- RecvBucketRejectsAfterCapacity (HIGH risk; same migration)
- BatchChargesByEntryCount (LOW risk in practice, migrated for consistency)
- SetQueueAttributesInvalidatesBucket (already fixed by PR test(sqs): slow throttle refill on SetQueueAttributesInvalidatesBucket to fix CI flake #890, refactored to use the helper)
- DeleteQueueInvalidatesBucket (no drain assertion, migrated for uniformity)
- NoOpSetQueueAttributesPreservesBucket (already fixed by PR test(sqs): slow throttle refill on no-op SetQueueAttributes test to fix CI flake #819 with a local slowRefill const, refactored to use the shared helper)

Not migrated (intentional):

GetQueueAttributesRoundTrip (Refill=50/5 — exact values are the test contract)
RejectsCapacityBelowBatchMin (Refill=1 — value is part of the rejected input the validator must catch)
SetQueueAttributesInvalidatesBucket post-raise phase (Refill=20 — raising both Capacity AND Refill is what the phase asserts fires)

Caller audit (per /loop semantic-change rule)

Test-only file. Helpers are package-private and only referenced within this single file (verified by grep across the repo). No production code touched. No other test file uses these names.

Retry-After header assertion change in SendBucketRejectsAfterCapacity was the single observable behavior change: header depends on refillRate via the §3.4 formula. Old "1" matched Refill=1; new "100" matches Refill=0.01. The header contract is still exercised — only the asserted literal changed.

Validation

go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -timeout 180s → ok 5.1s, all three iterations green
gofmt + go vet + golangci-lint run → 0 issues

Out-of-scope follow-ups (other recurring flake categories)

Broader CI-flake history (47 fix-CI-flake commits) has other recurring patterns that deserve similar structural treatment — tracked as future work, out of scope for this PR:

Leader churn + Raft async apply: many tests retry on "not leader" or wait for follower catch-up. A shared eventuallyLeaderOK(t, ...) helper would centralize the retry + budget knobs.
Redis TTL window racing: multiple "widen TTL" commits suggest a per-test deadline-multiplier helper.
VM pool / cache hit-rate assertions: dropped as flaky in 7783c279 — the structural lesson is to avoid asserting on cache hit rates.

This PR focuses on the most-repeated single flake (3 fixes for the same mechanism in 6 weeks). The remaining categories each warrant their own follow-up PR after we see how this pattern works in practice.

Summary by CodeRabbit

Tests
- Enhanced SQS throttle test infrastructure to improve test reliability in continuous integration environments.

…attern tests The SQS throttle test file has been hit by the same CI flake mechanism three times in different tests: 54c6cd5 NoOpSetQueueAttributesPreservesBucket (PR #819) 204b729 SetQueueAttributesInvalidatesBucket (PR #890 first commit) <this> structural prevention for the remaining drain-pattern tests Each fix touched one test in isolation, leaving the same trap in sibling tests for the next slow CI runner to step on. This commit extracts the rule into a shared helper so future drain-pattern tests cannot regress to the same flake. Root cause (recurring) ====================== Tests that (a) set ThrottleSend/RecvRefillPerSecond AND (b) drain the bucket then assert a "throttled" response are racing their own wall clock: under -race on slow CI runners each request elapses ~100-250ms (full HTTP + Raft propose+apply). An N-write drain plus its assertion send takes ~N*150ms +/- noise. At Refill=1/sec the bucket accumulates >=1 token over that window and the "throttled" assertion fires against a refilled bucket, returning 200 instead of the expected 400 -- a false positive against a real production contract regression. Three commits have fixed exactly this -- each time in just the one test that flaked. Sibling tests using the same pattern remained unprotected. Structural fix ============== 1. Shared file-head constants + helpers: const slowRefillRate = "0.01" // 1 token per 100 seconds func newSendThrottleAttrs(capacity int) map[string]string func newRecvThrottleAttrs(capacity int) map[string]string Helpers lock the refill rate to slowRefillRate so a future drain-pattern test cannot pick a racing value by accident. 2. File-head doc-comment block explaining: - The recurring flake mechanism (with prior-commit refs). - The rule: drain-then-assert MUST use the helpers. - The exception: config-round-trip / validator-rejection / post-raise tests where the rate is part of the test contract. 3. Migrated all drain-pattern call sites to the helpers: - TestSQSServer_Throttle_SendBucketRejectsAfterCapacity (HIGH risk; same drain-then-assert pattern at Refill=1, unprotected until now). Retry-After assertion updated from "1" to "100" since the formula `ceil(needed/refillRate)` now returns 100 s for refillRate=0.01. - TestSQSServer_Throttle_RecvBucketRejectsAfterCapacity (HIGH risk; receive-side mirror, same race). - TestSQSServer_Throttle_BatchChargesByEntryCount (LOW risk in practice -- only 2 calls -- but migrated for consistency). - TestSQSServer_Throttle_SetQueueAttributesInvalidatesBucket (already fixed inline by PR #890 first commit; refactored to use the helper). - TestSQSServer_Throttle_DeleteQueueInvalidatesBucket (no drain assertion -- both call sites migrated for uniformity). - TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket (already fixed inline by PR #819; local slowRefill const + throttleAttrs map literal dropped in favor of the shared helper). Not migrated (intentional): - TestSQSServer_Throttle_GetQueueAttributesRoundTrip (Refill=50/5 -- exact values are the test contract). - TestSQSServer_Throttle_RejectsCapacityBelowBatchMin (Refill=1 -- value is part of the rejected input the validator must catch). - SetQueueAttributesInvalidatesBucket post-raise phase (Refill=20 -- raising both Capacity AND Refill is the thing this phase is asserting fires). Caller audit per /loop semantic-change rule ============================================ Test-only file. Helpers are package-private (lowercase) and only referenced within this single file (verified by grep 'newSendThrottleAttrs\|newRecvThrottleAttrs\|slowRefillRate' across the repo). No production code touched. No other test file uses these names. Retry-After header assertion in TestSQSServer_Throttle_SendBucketRejectsAfterCapacity was the single observable behavior change: header value depends on the configured refillRate via the documented §3.4 formula. Old "1" matched Refill=1; new "100" matches Refill=0.01 (verified by reading computeRetryAfter in sqs_throttle.go:525). The header contract test is still exercised -- only the asserted literal value changed. Validation ========== go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -> ok (5.1s, all three iterations green) gofmt + go vet + golangci-lint run -> 0 issues Out-of-scope follow-ups (other recurring flake categories) ========================================================== This commit addresses the SQS-throttle subset. The broader CI flake history (47 fix-CI-flake commits over the last several months) has other recurring patterns that deserve similar structural treatment: - Leader churn + Raft async apply: many tests retry on "not leader" or wait for follower catch-up. A shared eventuallyLeaderOK(t, ...) helper would centralize the retry + budget knobs. - Redis TTL window racing: multiple "widen TTL" commits suggest a per-test deadline-multiplier helper would be a better pattern than ad-hoc widening. - VM pool / cache hit-rate assertions: dropped as flaky in commit 7783c27. The structural lesson is to not assert on cache hit rates at all; need a doc somewhere. These are tracked as future work; out of scope for this PR which focuses on the most-repeated single flake (3 fixes for the same mechanism in 6 weeks).

chatgpt-codex-connector · 2026-05-30T13:38:58Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

coderabbitai · 2026-05-30T13:39:05Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 48 minutes and 43 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c637023e-be93-4fb7-8964-cdfdccd1a5da

📥 Commits

Reviewing files that changed from the base of the PR and between 17524a7 and e5f3176.

📒 Files selected for processing (1)

adapter/sqs_throttle_integration_test.go

📝 Walkthrough

Walkthrough

This PR hardens SQS throttle integration tests against CI flakes by introducing a shared slowRefillRate constant and helper functions that generate standardized throttle attribute maps. Six tests are updated to use these helpers, replacing hard-coded refill values, and one Retry-After header assertion is adjusted to match the new slow refill rate's rounding.

Changes

SQS Throttle Test CI-Flake Guardrails

Layer / File(s)	Summary
Test infrastructure and helpers `adapter/sqs_throttle_integration_test.go`	Defines `slowRefillRate = "0.01"` and introduces `newSendThrottleAttrs(capacity)` and `newRecvThrottleAttrs(capacity)` helper functions that produce standardized throttle attribute maps locked to the slow refill rate.
Migrate integration tests to use throttle helpers `adapter/sqs_throttle_integration_test.go`	SendBucketRejectsAfterCapacity, ReceiveBucketRejectsAfterCapacity, BatchCharges, SetQueueAttributesInvalidatesBucket, DeleteQueueInvalidatesBucket, and NoOpSetQueueAttributesPreservesBucket tests are updated to call the new helpers instead of inline attribute maps; the Retry-After assertion in SendBucketRejectsAfterCapacity is changed from `"1"` to `"100"` to align with the slow refill rate.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

bootjp/elastickv#890: Also modifies adapter/sqs_throttle_integration_test.go to address the same SQS throttle CI flake by adjusting SetQueueAttributesInvalidatesBucket's slow refill configuration.

Poem

🐰 A refill rate so slow it stays,
Constants shared through all the days,
Tests unified, no more the drift,
CI flakes begone—a stable gift!
Helpers bundled, once or twice,
Throttle testing, oh so nice! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests' accurately describes the main change: introducing shared helper functions for throttle configuration in SQS tests to fix a CI flake.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/sqs-throttle-invalidate-flake

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request refactors SQS throttle integration tests to use structural helpers with a slow refill rate (0.01 tokens/sec), preventing CI flakes caused by wall-clock races during drain-then-assert patterns. The reviewer pointed out a potential new flake in TestSQSServer_Throttle_SendBucketRejectsAfterCapacity where asserting an exact Retry-After value of "100" might fail on slow CI runners if more than a second elapses between requests. They recommend asserting a safe range (e.g., [90, 100]) instead.

gemini-code-assist · 2026-05-30T13:40:10Z

+	if ra := resp.Header.Get("Retry-After"); ra != "100" {
+		t.Fatalf("11th send: Retry-After=%q (expected 100 — slowRefillRate)", ra)
 	}


Asserting an exact Retry-After value of "100" can introduce a new test flake on slow CI runners.

Why this happens:

The Retry-After header is calculated as math.Ceil(needed / refillRate).
With refillRate = 0.01, needed / refillRate is (1 - elapsed * 0.01) / 0.01 = 100 - elapsed.
If the time elapsed (elapsed) between the 10th send and the 11th send is strictly greater than 1.0 second (which is highly possible on slow or busy CI runners under -race), 100 - elapsed will be less than 99.0.
Consequently, math.Ceil(100 - elapsed) will round up to 99 or less, causing the exact assertion ra != "100" to fail.

Recommendation:

Parse the Retry-After header value and assert that it falls within a safe range (e.g., [90, 100]) to tolerate realistic scheduling or network delays on slow CI runners.

raVal, err := strconv.Atoi(resp.Header.Get("Retry-After")) if err != nil { t.Fatalf("11th send: invalid Retry-After header: %v", err) } if raVal < 90 || raVal > 100 { t.Fatalf("11th send: Retry-After=%d (expected between 90 and 100)", raVal) }

coderabbitai

🧹 Nitpick comments (1)

adapter/sqs_throttle_integration_test.go (1)

56-57: ⚡ Quick win

Remove //nolint:unparam by simplifying helper signatures.

These suppressions can be avoided by making the helpers zero-arg (current usage is fixed at capacity 10), then re-introducing parameters when a real caller needs a different value.

Suggested diff

+const drainPatternCapacity = 10
+
-//nolint:unparam // capacity parameter is forward-looking; see doc.
-func newSendThrottleAttrs(capacity int) map[string]string {
+func newSendThrottleAttrs() map[string]string {
 	return map[string]string{
-		"ThrottleSendCapacity":        strconv.Itoa(capacity),
+		"ThrottleSendCapacity":        strconv.Itoa(drainPatternCapacity),
 		"ThrottleSendRefillPerSecond": slowRefillRate,
 	}
 }
 
-//nolint:unparam // capacity parameter is forward-looking; see newSendThrottleAttrs.
-func newRecvThrottleAttrs(capacity int) map[string]string {
+func newRecvThrottleAttrs() map[string]string {
 	return map[string]string{
-		"ThrottleRecvCapacity":        strconv.Itoa(capacity),
+		"ThrottleRecvCapacity":        strconv.Itoa(drainPatternCapacity),
 		"ThrottleRecvRefillPerSecond": slowRefillRate,
 	}
 }

As per coding guidelines, "**/*.go: ... Avoid adding //nolint annotations — refactor instead."

Also applies to: 69-70

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@adapter/sqs_throttle_integration_test.go` around lines 56 - 57, The helper
newSendThrottleAttrs currently takes an unused capacity parameter and is
suppressed with //nolint:unparam; remove the nolint and simplify the signature
to a zero-arg function (e.g., newSendThrottleAttrs()) since all current callers
use the fixed capacity (10), and update any analogous helper (the one at lines
69-70) the same way so both helpers have no unused parameters; adjust call sites
to invoke the zero-arg functions and re-run linters to ensure the //nolint
annotations can be removed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@adapter/sqs_throttle_integration_test.go`:
- Around line 56-57: The helper newSendThrottleAttrs currently takes an unused
capacity parameter and is suppressed with //nolint:unparam; remove the nolint
and simplify the signature to a zero-arg function (e.g., newSendThrottleAttrs())
since all current callers use the fixed capacity (10), and update any analogous
helper (the one at lines 69-70) the same way so both helpers have no unused
parameters; adjust call sites to invoke the zero-arg functions and re-run
linters to ensure the //nolint annotations can be removed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff78a714-1ad0-480d-b5ea-945544181de8

📥 Commits

Reviewing files that changed from the base of the PR and between 3599556 and 17524a7.

📒 Files selected for processing (1)

adapter/sqs_throttle_integration_test.go

…pers gemini HIGH (Retry-After exact-value assertion is itself flaky): Asserting Retry-After == "100" exactly re-introduces the wall-clock flake at the assertion layer. computeRetryAfter formula is ceil((1 - currentTokens) / refillRate), and currentTokens is the refill accumulated during the drain loop (each drain send takes ~150 ms wall, at refill=0.01 that yields ~0.015 tokens per send, so by the 11th request currentTokens is a small positive number). ceil((1 - 0.015) / 0.01) = ceil(98.5) = 99, not 100. Range check [1, 100] captures the §3.4 contract (positive integer bounded by 1/refillRate) without re-racing the wall clock. coderabbit nitpick (//nolint:unparam suppressions): Removed by making helpers zero-arg. Capacity is hardcoded as a documented const drainBucketCapacity = "10" (the §3.2 validator batch-floor minimum, doubles as the smallest value usable by both single and batch drains). YAGNI: future tests needing a different capacity should construct the map literally rather than parameterizing the helper. All 9 helper call sites updated from foo(10) to foo() via sed. Caller audit: test-only. Helpers are package-private; grep across the repo confirms zero external references. Const drainBucketCapacity is new name, no collision risk verified. Validation: go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -> ok golangci-lint run -> 0 issues

bootjp · 2026-05-30T13:50:20Z

@claude review

Round 1 fixes (commit e5f3176):

gemini HIGH (Retry-After exact-value assertion is itself flaky): Replaced exact != "100" check with a range check [1, 100]. The formula ceil((1 - currentTokens) / refillRate) depends on drain-window elapsed time (refill accumulates during the loop), so asserting "100" exactly re-introduces a wall-clock flake at the assertion layer. The range pins the §3.4 contract (positive integer bounded by 1/refillRate) without racing.
coderabbit nitpick (//nolint:unparam): Removed by making helpers zero-arg. Capacity is now a documented file-level const drainBucketCapacity = "10" (the §3.2 validator batch-floor minimum). All 9 call sites updated from foo(10) → foo().

Caller audit: test-only file, helpers package-private, no external references. go test -race -count=3 passes (5.3s); golangci-lint 0 issues.

claude · 2026-05-30T13:50:36Z

Claude finished @bootjp's task in 2m 23s —— View job

Review of PR #891 — Round 1 fixes (commit `e5f3176`)

Read the changed file and diff
Verify the Retry-After range fix correctness
Verify the zero-arg helper refactor
Check for any remaining issues
Post review findings

Both Round 1 fixes are correct. Here's the verification:

Fix 1: Retry-After range [1, 100] — ✅ Correct

Traced through computeRetryAfter (sqs_throttle.go:525) and writeSQSThrottlingError (sqs.go:585):

needed = 1 - currentTokens
secs   = ceil(needed / 0.01)

The mathematical bounds with slowRefillRate=0.01:

Upper bound 100: when currentTokens = 0 exactly → ceil(1.0/0.01) = 100. This is then written as int(retryAfter/time.Second) → 100.
Lower bound 1: from two guards — computeRetryAfter clamps secs < 1 → 1, and writeSQSThrottlingError additionally clamps retryAfter < time.Second → time.Second. On an extremely slow runner where the drain loop itself takes >100s, currentTokens ≥ 1 triggers the needed ≤ 0 path which returns time.Second → raSecs = 1.
No escape outside [1, 100]: throttleRetryAfterCap = time.Hour only applies above 3600s, which is unreachable at capacity=10 with refill=0.01.

The range [1, 100] is the exact tight bound. The commit message math (ceil(98.5) = 99 for a 10×150ms drain) is also correct — the range handles intermediate values without over-constraining.

Fix 2: Zero-arg helpers + `drainBucketCapacity` const — ✅ Clean

//nolint:unparam annotations removed as requested
drainBucketCapacity = "10" (string type is correct — values go directly into map[string]string)
strconv import is still valid: used by strconv.Atoi(ra) at line 152 and strconv.Itoa(i) at lines 98, 124, 210–211
All 9 call sites correctly updated from foo(10) → foo()

Pre-existing gap (not introduced by Round 1, low priority)

RecvBucketRejectsAfterCapacity (adapter/sqs_throttle_integration_test.go:166-190) checks status and x-amzn-ErrorType but never asserts Retry-After. The send test at line 151-158 does verify it. If the recv path ever diverges on the header, there's no test coverage for it. Worth a follow-up but out of scope for this PR.

Summary

Both reviewer concerns from Round 1 are addressed correctly:

gemini HIGH: range check [1, 100] is mathematically tight and won't re-introduce a wall-clock flake
coderabbit nitpick: zero-arg helpers eliminate //nolint:unparam without leaving dead code

LGTM.

…orRaftReadiness (#898) ## Summary Structural fix for the recurring **leader-churn CI flake** pattern. Closes the post-readiness step-down window at the readiness layer instead of asking every new test author to remember to use `*Eventually` helpers. Companion to PR #891 (SQS throttle structural fix): same recipe — *each per-test fix to a recurring flake mechanism leaves the same trap for the next test*. PR #891 hardened the SQS throttle-refill subset; this PR hardens the leader-churn subset. ## Recurring flake pattern (5+ commits this year) | Commit | What it touched | |---|---| | [`5ba2ef94`](5ba2ef94) | added `doEventually` / `rpushEventually` / `lpushEventually`, wrapped 2 RPUSH/LPUSH calls in `TestRedis_LuaRPopLPushBullMQLikeLists` | | [`2b6e8f28`](2b6e8f28) | widened raft election budget + retried gRPC on leader churn | | [`78b9d091`](78b9d091) | retried leader churn in consistency loops | | [`0fb7db76`](0fb7db76) | fixed flaky `TestRedis_LuaRPopLPushBullMQLikeLists` (re-fix) | | [`ba3f5bf8`](ba3f5bf8) | waited for follower apply catch-up in MultiExec | Root cause is the same: `waitForStableLeader` only proves leadership *appears* stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as `"etcd raft engine is not leader"` on the test's first write. ## Mechanism Add `waitForWriteableLeader` that drives one no-op Raft entry through full quorum after `waitForStableLeader` returns: ```go var raftReadinessProbePayload = []byte{ 0x02, // raftEncodeHLCLease 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op) } ``` - **State-machine no-op**: `kv/fsm.go applyHLCLease` reads `data[1:]` as a big-endian uint64 ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0 a true no-op (no HLC mutation, no storage write, no encryption dispatch). - **Quorum-proving**: the entry still traverses the full propose → append → replicate → commit → apply pipeline, so a successful `Propose` proves the leader can actually drive quorum. - **Retry semantics**: retries on transient leader-unavailable errors (same set `isTransientNotLeaderErr` already classifies) up to caller-supplied `waitTimeout` (10s at the current `createNode` call site). Per-propose timeout is 2s. Wired into `waitForRaftReadiness` as the final step. ## Caller audit (per /loop semantic-change rule) - `waitForRaftReadiness`: single production call site (`createNode`); behavior change is **additive** — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry. - `waitForWriteableLeader` / `raftReadinessProbePayload`: package-private to `adapter/test_util.go`; no external callers (grep verified). - `applyHLCLease` (`kv/fsm.go`): receives one extra invocation per `createNode` call. Behavior on `ceiling=0` is documented no-op. No state divergence across replicas — every FSM applies the same payload. - Side effects: - `pendingApplyIdx` advances by 1 (normal flow) - encryption applier untouched (opcode `0x02` not in encryption band) - snapshots include one extra applied index (already varies by test timing) - lease provider does not track HLC entries - Per-test wall-clock: ~10 ms extra on a stable cluster. 29 test files use `createNode` → ~290 ms total CI overhead. Negligible. ## Validation - `go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3` → ok 5.4s - `go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2` → ok 5.6s - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Follow-up note Existing per-test `*Eventually` wrappers (`rpushEventually`, `lpushEventually`) remain valid for tests that issue writes **long after** `createNode`, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases. Subsequent commits can audit whether tests still calling `rpushEventually` / `lpushEventually` *immediately after* `createNode` can drop back to direct calls now that the startup window is closed at the readiness layer. ## Out-of-scope (remaining recurring flake categories) Tracked in the PR #891 description; still to be addressed in follow-up PRs: - Redis TTL window racing → per-test deadline-multiplier helper - VM pool / cache hit-rate assertions → "don't assert on cache hit rates" rule

… tests Third flake-root-cause PR, after PR #891 (SQS throttle refill) and PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake mechanism has been fixed per-test for months; this PR generalises the structural fix and migrates all extant sites. Recurring flake (PR #818 lineage): Tests that "set a short TTL, sleep past it, assert the key is gone" race their own wall clock under -race on slow CI runners. The post-sleep read can sometimes still observe an un-expired view because the inter-call pause does not absorb scheduler jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible) by switching to require.Eventually with a TTL-derived deadline. The other 8 sites in the test suite kept the fragile time.Sleep(ttl + 50ms) pattern. Structural fix ============== - New helper eventuallyExpired(t, ttl, condition, msg) in test_util.go. Encapsulates the require.Eventually + ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so future test writers reach for it instead of inventing fresh time.Sleep margins. - Migrated 8 sites to the helper: * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired, TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired. * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL, TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*. * redis_misskey_compat_test.go: post-EX-1 expiry assertion. - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go) convenience wrappers. Their sole callers were the migrated sites. - Added errors stdlib import to the three test files where the helper's condition closures use errors.Is. Caller audit per /loop semantic-change rule ============================================ - eventuallyExpired: new helper, 8 callers in 3 test files. All test-only. - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo confirms zero remaining call sites. - collectionFastPathTTL / luaFastPathTTL consts: still used by the migrated sites as the ttl argument to eventuallyExpired. Unchanged. Validation ========== go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s gofmt + go vet + golangci-lint run -> 0 issues Closes the "Redis TTL window deadline-multiplier" follow-up noted in PR #891's description.

… tests (#903) ## Summary Third flake-root-cause PR, after #891 (SQS throttle refill) and #898 (Raft readiness probe). Same recipe: a recurring CI-flake mechanism has been patched per-test for months; this PR generalises the structural fix and migrates the extant sites. ## Recurring flake (PR #818 lineage) Tests that *"set a short TTL, sleep past it, assert the key is gone"* race their own wall clock under `-race` on slow CI runners. The post-sleep read sometimes observes an un-expired view because the inter-call pause does not absorb scheduler jitter. [PR #818](#818) fixed one instance (`TestRedis_ExpiredKey_BecomesInvisible`) by switching to `require.Eventually` with a TTL-derived deadline. The other 8 sites kept the fragile `time.Sleep(ttl + 50ms)` pattern. ## Structural fix - **New helper** `eventuallyExpired(t, ttl, condition, msg)` in `test_util.go`. Encapsulates the `require.Eventually + ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms)` idiom so future test writers reach for it instead of inventing fresh `time.Sleep` margins. - **Migrated 8 sites** to the helper: - `redis_collection_fastpath_test.go`: `TestRedis_HGET_TTLExpired`, `TestRedis_HEXISTS_TTLExpired`, `TestRedis_SISMEMBER_TTLExpired` - `redis_lua_collection_fastpath_test.go`: `TestLua_HGET_HashTTL`, `TestLua_HEXISTS_HashTTL`, `TestLua_ZSCORE_*`, `TestLua_ZRANGEBYSCORE_*` - `redis_misskey_compat_test.go`: post-`EX 1` expiry assertion - **Removed unused** `waitForTTLExpiry` (redis_collection_fastpath_test.go) and `waitForLuaTTL` (redis_lua_collection_fastpath_test.go) convenience wrappers. Sole callers were the migrated sites. - Added `errors` stdlib import to the three test files where the helper's condition closures use `errors.Is`. ## Caller audit (per /loop semantic-change rule) - `eventuallyExpired`: new helper, 8 callers in 3 test files. All test-only. - `waitForTTLExpiry` / `waitForLuaTTL`: removed. Repo-wide grep confirms zero remaining call sites. - `collectionFastPathTTL` / `luaFastPathTTL` consts: still used by the migrated sites as the `ttl` argument to `eventuallyExpired`. Unchanged. ## Validation - `go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2` → ok 32.3s - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Scope Closes the *"Redis TTL window deadline-multiplier"* follow-up noted in [PR #891's description](#891). ## Out-of-scope (still on the flake follow-up backlog) - VM pool / cache hit-rate assertion → "don't assert on cache hit rates" rule. The structural lesson from commit [7783c27](7783c279) is to never assert on cache hit rates; needs a doc rather than a code change.

…gatorRaceFree (#906) ## Summary Fixes the CI flake observed in [Actions run 26765510693](https://github.com/bootjp/elastickv/actions/runs/26765510693/job/78890657730?pr=902): ``` --- FAIL: TestHotKeysAggregatorRaceFree (0.11s) hot_keys_test.go:356: Expected value not to be nil. ``` Surfaced on PR #902's CI run but **unrelated to PR #902** — the failing test is on `main` in `keyviz/`, which PR #902 (Redis adapter `*Eventually` helper cleanup) doesn't touch. ## Root cause Same wall-clock-racing pattern as the SQS-throttle (PR #891) and Redis-TTL (PR #903) flake series. The test: 1. Launches an aggregator goroutine (Step=5ms tick). 2. Spawns 8 workers each calling `Observe` 500 times. 3. `time.Sleep(50 * time.Millisecond)`. 4. Asserts both per-route snapshots are non-nil. The aggregator publishes a snapshot only on a tick (or on ctx-done drain). Under `-race` on a slow CI runner, the `Run` goroutine itself can be slow to schedule — the 50ms wait may not contain a single actual tick worth of scheduler time, so `hotKeysSnap` stays at its initial `nil` `atomic.Pointer` load and the assertion fires. ## Fix `require.Eventually` with a 3-second budget + 5ms poll cadence, asserting both snapshots non-nil: ```go require.Eventually(t, func() bool { return s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil }, 3*time.Second, 5*time.Millisecond, "aggregator must publish a snapshot for each route within the budget") ``` The test still pins the load-bearing property (snapshot reads are lock-free; the `atomic.Pointer.Load` works without contention with the publisher); only the **wait mechanism** changes from "sleep blindly and hope" to "poll until ready or fail". Total time bound is at most 3s on a slow runner, ~5–50ms in the common case. ## Validation - `go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -timeout 120s` → ok 1.1s (5/5 iterations green) - `gofmt` + `golangci-lint run` → 0 issues ## Pattern continuity This is the same kind of structural fix as PR #891 / #903: **replace fixed-time-window sleep with `require.Eventually`** for any condition that depends on a background goroutine's progress under `-race` on CI. Future flakes in this category should follow the same pattern.

gemini-code-assist Bot reviewed May 30, 2026

View reviewed changes

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

bootjp merged commit 5e7b04a into main May 30, 2026
9 checks passed

bootjp deleted the fix/sqs-throttle-invalidate-flake branch May 30, 2026 14:13

bootjp mentioned this pull request Jun 1, 2026

test(util): commit one no-op entry to prove leader writeable in waitForRaftReadiness #898

Merged

bootjp mentioned this pull request Jun 1, 2026

test(redis): structural fix — eventuallyExpired helper for TTL-expiry tests #903

Merged

bootjp mentioned this pull request Jun 2, 2026

test(keyviz): replace fixed sleep with eventually in TestHotKeysAggregatorRaceFree #906

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests#891

test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests#891
bootjp merged 2 commits into
mainfrom
fix/sqs-throttle-invalidate-flake

bootjp commented May 30, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

chatgpt-codex-connector Bot commented May 30, 2026

Uh oh!

coderabbitai Bot commented May 30, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 30, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

bootjp commented May 30, 2026

Uh oh!

claude Bot commented May 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bootjp commented May 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why a structural fix, not another per-test commit

Root cause (recurring across all three flake instances)

Structural fix

Caller audit (per /loop semantic-change rule)

Validation

Out-of-scope follow-ups (other recurring flake categories)

Summary by CodeRabbit

Uh oh!

chatgpt-codex-connector Bot commented May 30, 2026

Uh oh!

coderabbitai Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 30, 2026

Choose a reason for hiding this comment

Why this happens:

Recommendation:

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

bootjp commented May 30, 2026

Uh oh!

claude Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #891 — Round 1 fixes (commit e5f3176)

Fix 1: Retry-After range [1, 100] — ✅ Correct

Fix 2: Zero-arg helpers + drainBucketCapacity const — ✅ Clean

Pre-existing gap (not introduced by Round 1, low priority)

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bootjp commented May 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 30, 2026 •

edited

Loading

claude Bot commented May 30, 2026 •

edited

Loading

Review of PR #891 — Round 1 fixes (commit `e5f3176`)

Fix 2: Zero-arg helpers + `drainBucketCapacity` const — ✅ Clean