test(sqs): structural fix — shared throttle-refill helper for drain-pattern tests#891
Conversation
…attern tests The SQS throttle test file has been hit by the same CI flake mechanism three times in different tests: 54c6cd5 NoOpSetQueueAttributesPreservesBucket (PR #819) 204b729 SetQueueAttributesInvalidatesBucket (PR #890 first commit) <this> structural prevention for the remaining drain-pattern tests Each fix touched one test in isolation, leaving the same trap in sibling tests for the next slow CI runner to step on. This commit extracts the rule into a shared helper so future drain-pattern tests cannot regress to the same flake. Root cause (recurring) ====================== Tests that (a) set ThrottleSend/RecvRefillPerSecond AND (b) drain the bucket then assert a "throttled" response are racing their own wall clock: under -race on slow CI runners each request elapses ~100-250ms (full HTTP + Raft propose+apply). An N-write drain plus its assertion send takes ~N*150ms +/- noise. At Refill=1/sec the bucket accumulates >=1 token over that window and the "throttled" assertion fires against a refilled bucket, returning 200 instead of the expected 400 -- a false positive against a real production contract regression. Three commits have fixed exactly this -- each time in just the one test that flaked. Sibling tests using the same pattern remained unprotected. Structural fix ============== 1. Shared file-head constants + helpers: const slowRefillRate = "0.01" // 1 token per 100 seconds func newSendThrottleAttrs(capacity int) map[string]string func newRecvThrottleAttrs(capacity int) map[string]string Helpers lock the refill rate to slowRefillRate so a future drain-pattern test cannot pick a racing value by accident. 2. File-head doc-comment block explaining: - The recurring flake mechanism (with prior-commit refs). - The rule: drain-then-assert MUST use the helpers. - The exception: config-round-trip / validator-rejection / post-raise tests where the rate is part of the test contract. 3. Migrated all drain-pattern call sites to the helpers: - TestSQSServer_Throttle_SendBucketRejectsAfterCapacity (HIGH risk; same drain-then-assert pattern at Refill=1, unprotected until now). Retry-After assertion updated from "1" to "100" since the formula `ceil(needed/refillRate)` now returns 100 s for refillRate=0.01. - TestSQSServer_Throttle_RecvBucketRejectsAfterCapacity (HIGH risk; receive-side mirror, same race). - TestSQSServer_Throttle_BatchChargesByEntryCount (LOW risk in practice -- only 2 calls -- but migrated for consistency). - TestSQSServer_Throttle_SetQueueAttributesInvalidatesBucket (already fixed inline by PR #890 first commit; refactored to use the helper). - TestSQSServer_Throttle_DeleteQueueInvalidatesBucket (no drain assertion -- both call sites migrated for uniformity). - TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket (already fixed inline by PR #819; local slowRefill const + throttleAttrs map literal dropped in favor of the shared helper). Not migrated (intentional): - TestSQSServer_Throttle_GetQueueAttributesRoundTrip (Refill=50/5 -- exact values are the test contract). - TestSQSServer_Throttle_RejectsCapacityBelowBatchMin (Refill=1 -- value is part of the rejected input the validator must catch). - SetQueueAttributesInvalidatesBucket post-raise phase (Refill=20 -- raising both Capacity AND Refill is the thing this phase is asserting fires). Caller audit per /loop semantic-change rule ============================================ Test-only file. Helpers are package-private (lowercase) and only referenced within this single file (verified by grep 'newSendThrottleAttrs\|newRecvThrottleAttrs\|slowRefillRate' across the repo). No production code touched. No other test file uses these names. Retry-After header assertion in TestSQSServer_Throttle_SendBucketRejectsAfterCapacity was the single observable behavior change: header value depends on the configured refillRate via the documented §3.4 formula. Old "1" matched Refill=1; new "100" matches Refill=0.01 (verified by reading computeRetryAfter in sqs_throttle.go:525). The header contract test is still exercised -- only the asserted literal value changed. Validation ========== go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -> ok (5.1s, all three iterations green) gofmt + go vet + golangci-lint run -> 0 issues Out-of-scope follow-ups (other recurring flake categories) ========================================================== This commit addresses the SQS-throttle subset. The broader CI flake history (47 fix-CI-flake commits over the last several months) has other recurring patterns that deserve similar structural treatment: - Leader churn + Raft async apply: many tests retry on "not leader" or wait for follower catch-up. A shared eventuallyLeaderOK(t, ...) helper would centralize the retry + budget knobs. - Redis TTL window racing: multiple "widen TTL" commits suggest a per-test deadline-multiplier helper would be a better pattern than ad-hoc widening. - VM pool / cache hit-rate assertions: dropped as flaky in commit 7783c27. The structural lesson is to not assert on cache hit rates at all; need a doc somewhere. These are tracked as future work; out of scope for this PR which focuses on the most-repeated single flake (3 fixes for the same mechanism in 6 weeks).
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Warning Review limit reached
More reviews will be available in 48 minutes and 43 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR hardens SQS throttle integration tests against CI flakes by introducing a shared ChangesSQS Throttle Test CI-Flake Guardrails
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors SQS throttle integration tests to use structural helpers with a slow refill rate (0.01 tokens/sec), preventing CI flakes caused by wall-clock races during drain-then-assert patterns. The reviewer pointed out a potential new flake in TestSQSServer_Throttle_SendBucketRejectsAfterCapacity where asserting an exact Retry-After value of "100" might fail on slow CI runners if more than a second elapses between requests. They recommend asserting a safe range (e.g., [90, 100]) instead.
| if ra := resp.Header.Get("Retry-After"); ra != "100" { | ||
| t.Fatalf("11th send: Retry-After=%q (expected 100 — slowRefillRate)", ra) | ||
| } |
There was a problem hiding this comment.
Asserting an exact Retry-After value of "100" can introduce a new test flake on slow CI runners.
Why this happens:
The Retry-After header is calculated as math.Ceil(needed / refillRate).
With refillRate = 0.01, needed / refillRate is (1 - elapsed * 0.01) / 0.01 = 100 - elapsed.
If the time elapsed (elapsed) between the 10th send and the 11th send is strictly greater than 1.0 second (which is highly possible on slow or busy CI runners under -race), 100 - elapsed will be less than 99.0.
Consequently, math.Ceil(100 - elapsed) will round up to 99 or less, causing the exact assertion ra != "100" to fail.
Recommendation:
Parse the Retry-After header value and assert that it falls within a safe range (e.g., [90, 100]) to tolerate realistic scheduling or network delays on slow CI runners.
raVal, err := strconv.Atoi(resp.Header.Get("Retry-After"))
if err != nil {
t.Fatalf("11th send: invalid Retry-After header: %v", err)
}
if raVal < 90 || raVal > 100 {
t.Fatalf("11th send: Retry-After=%d (expected between 90 and 100)", raVal)
}There was a problem hiding this comment.
🧹 Nitpick comments (1)
adapter/sqs_throttle_integration_test.go (1)
56-57: ⚡ Quick winRemove
//nolint:unparamby simplifying helper signatures.These suppressions can be avoided by making the helpers zero-arg (current usage is fixed at capacity 10), then re-introducing parameters when a real caller needs a different value.
Suggested diff
+const drainPatternCapacity = 10 + -//nolint:unparam // capacity parameter is forward-looking; see doc. -func newSendThrottleAttrs(capacity int) map[string]string { +func newSendThrottleAttrs() map[string]string { return map[string]string{ - "ThrottleSendCapacity": strconv.Itoa(capacity), + "ThrottleSendCapacity": strconv.Itoa(drainPatternCapacity), "ThrottleSendRefillPerSecond": slowRefillRate, } } -//nolint:unparam // capacity parameter is forward-looking; see newSendThrottleAttrs. -func newRecvThrottleAttrs(capacity int) map[string]string { +func newRecvThrottleAttrs() map[string]string { return map[string]string{ - "ThrottleRecvCapacity": strconv.Itoa(capacity), + "ThrottleRecvCapacity": strconv.Itoa(drainPatternCapacity), "ThrottleRecvRefillPerSecond": slowRefillRate, } }As per coding guidelines, "
**/*.go: ... Avoid adding//nolintannotations — refactor instead."Also applies to: 69-70
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@adapter/sqs_throttle_integration_test.go` around lines 56 - 57, The helper newSendThrottleAttrs currently takes an unused capacity parameter and is suppressed with //nolint:unparam; remove the nolint and simplify the signature to a zero-arg function (e.g., newSendThrottleAttrs()) since all current callers use the fixed capacity (10), and update any analogous helper (the one at lines 69-70) the same way so both helpers have no unused parameters; adjust call sites to invoke the zero-arg functions and re-run linters to ensure the //nolint annotations can be removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@adapter/sqs_throttle_integration_test.go`:
- Around line 56-57: The helper newSendThrottleAttrs currently takes an unused
capacity parameter and is suppressed with //nolint:unparam; remove the nolint
and simplify the signature to a zero-arg function (e.g., newSendThrottleAttrs())
since all current callers use the fixed capacity (10), and update any analogous
helper (the one at lines 69-70) the same way so both helpers have no unused
parameters; adjust call sites to invoke the zero-arg functions and re-run
linters to ensure the //nolint annotations can be removed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ff78a714-1ad0-480d-b5ea-945544181de8
📒 Files selected for processing (1)
adapter/sqs_throttle_integration_test.go
…pers gemini HIGH (Retry-After exact-value assertion is itself flaky): Asserting Retry-After == "100" exactly re-introduces the wall-clock flake at the assertion layer. computeRetryAfter formula is ceil((1 - currentTokens) / refillRate), and currentTokens is the refill accumulated during the drain loop (each drain send takes ~150 ms wall, at refill=0.01 that yields ~0.015 tokens per send, so by the 11th request currentTokens is a small positive number). ceil((1 - 0.015) / 0.01) = ceil(98.5) = 99, not 100. Range check [1, 100] captures the §3.4 contract (positive integer bounded by 1/refillRate) without re-racing the wall clock. coderabbit nitpick (//nolint:unparam suppressions): Removed by making helpers zero-arg. Capacity is hardcoded as a documented const drainBucketCapacity = "10" (the §3.2 validator batch-floor minimum, doubles as the smallest value usable by both single and batch drains). YAGNI: future tests needing a different capacity should construct the map literally rather than parameterizing the helper. All 9 helper call sites updated from foo(10) to foo() via sed. Caller audit: test-only. Helpers are package-private; grep across the repo confirms zero external references. Const drainBucketCapacity is new name, no collision risk verified. Validation: go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -> ok golangci-lint run -> 0 issues
|
@claude review Round 1 fixes (commit e5f3176):
Caller audit: test-only file, helpers package-private, no external references. |
|
Claude finished @bootjp's task in 2m 23s —— View job Review of PR #891 — Round 1 fixes (commit
|
…orRaftReadiness (#898) ## Summary Structural fix for the recurring **leader-churn CI flake** pattern. Closes the post-readiness step-down window at the readiness layer instead of asking every new test author to remember to use `*Eventually` helpers. Companion to PR #891 (SQS throttle structural fix): same recipe — *each per-test fix to a recurring flake mechanism leaves the same trap for the next test*. PR #891 hardened the SQS throttle-refill subset; this PR hardens the leader-churn subset. ## Recurring flake pattern (5+ commits this year) | Commit | What it touched | |---|---| | [`5ba2ef94`](5ba2ef94) | added `doEventually` / `rpushEventually` / `lpushEventually`, wrapped 2 RPUSH/LPUSH calls in `TestRedis_LuaRPopLPushBullMQLikeLists` | | [`2b6e8f28`](2b6e8f28) | widened raft election budget + retried gRPC on leader churn | | [`78b9d091`](78b9d091) | retried leader churn in consistency loops | | [`0fb7db76`](0fb7db76) | fixed flaky `TestRedis_LuaRPopLPushBullMQLikeLists` (re-fix) | | [`ba3f5bf8`](ba3f5bf8) | waited for follower apply catch-up in MultiExec | Root cause is the same: `waitForStableLeader` only proves leadership *appears* stable for a 300 ms window. The freshly-elected leader can still step down on its first real heartbeat-quorum miss after that window closes, surfacing as `"etcd raft engine is not leader"` on the test's first write. ## Mechanism Add `waitForWriteableLeader` that drives one no-op Raft entry through full quorum after `waitForStableLeader` returns: ```go var raftReadinessProbePayload = []byte{ 0x02, // raftEncodeHLCLease 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // BE uint64 ceiling = 0 (no-op) } ``` - **State-machine no-op**: `kv/fsm.go applyHLCLease` reads `data[1:]` as a big-endian uint64 ceiling. The `if ceilingMs > 0` guard at line 185 makes ceiling=0 a true no-op (no HLC mutation, no storage write, no encryption dispatch). - **Quorum-proving**: the entry still traverses the full propose → append → replicate → commit → apply pipeline, so a successful `Propose` proves the leader can actually drive quorum. - **Retry semantics**: retries on transient leader-unavailable errors (same set `isTransientNotLeaderErr` already classifies) up to caller-supplied `waitTimeout` (10s at the current `createNode` call site). Per-propose timeout is 2s. Wired into `waitForRaftReadiness` as the final step. ## Caller audit (per /loop semantic-change rule) - `waitForRaftReadiness`: single production call site (`createNode`); behavior change is **additive** — same semantics PLUS the writeable-leader gate. Tests that previously raced into a step-down window now block until the leader has committed one entry. - `waitForWriteableLeader` / `raftReadinessProbePayload`: package-private to `adapter/test_util.go`; no external callers (grep verified). - `applyHLCLease` (`kv/fsm.go`): receives one extra invocation per `createNode` call. Behavior on `ceiling=0` is documented no-op. No state divergence across replicas — every FSM applies the same payload. - Side effects: - `pendingApplyIdx` advances by 1 (normal flow) - encryption applier untouched (opcode `0x02` not in encryption band) - snapshots include one extra applied index (already varies by test timing) - lease provider does not track HLC entries - Per-test wall-clock: ~10 ms extra on a stable cluster. 29 test files use `createNode` → ~290 ms total CI overhead. Negligible. ## Validation - `go test ./adapter/ -run TestRedis_LuaRPopLPushBullMQLikeLists -race -count=3` → ok 5.4s - `go test ./adapter/ -run 'TestRedis_SET|TestRedis_LPush|TestRedis_RPush|TestSQSServer_Throttle|TestRedis_MULTI' -race -count=2` → ok 5.6s - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Follow-up note Existing per-test `*Eventually` wrappers (`rpushEventually`, `lpushEventually`) remain valid for tests that issue writes **long after** `createNode`, where mid-test leader churn under CI load can still hit. The readiness probe targets the startup window specifically — it does not obsolete the helpers for other use cases. Subsequent commits can audit whether tests still calling `rpushEventually` / `lpushEventually` *immediately after* `createNode` can drop back to direct calls now that the startup window is closed at the readiness layer. ## Out-of-scope (remaining recurring flake categories) Tracked in the PR #891 description; still to be addressed in follow-up PRs: - Redis TTL window racing → per-test deadline-multiplier helper - VM pool / cache hit-rate assertions → "don't assert on cache hit rates" rule
… tests Third flake-root-cause PR, after PR #891 (SQS throttle refill) and PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake mechanism has been fixed per-test for months; this PR generalises the structural fix and migrates all extant sites. Recurring flake (PR #818 lineage): Tests that "set a short TTL, sleep past it, assert the key is gone" race their own wall clock under -race on slow CI runners. The post-sleep read can sometimes still observe an un-expired view because the inter-call pause does not absorb scheduler jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible) by switching to require.Eventually with a TTL-derived deadline. The other 8 sites in the test suite kept the fragile time.Sleep(ttl + 50ms) pattern. Structural fix ============== - New helper eventuallyExpired(t, ttl, condition, msg) in test_util.go. Encapsulates the require.Eventually + ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so future test writers reach for it instead of inventing fresh time.Sleep margins. - Migrated 8 sites to the helper: * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired, TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired. * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL, TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*. * redis_misskey_compat_test.go: post-EX-1 expiry assertion. - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go) convenience wrappers. Their sole callers were the migrated sites. - Added errors stdlib import to the three test files where the helper's condition closures use errors.Is. Caller audit per /loop semantic-change rule ============================================ - eventuallyExpired: new helper, 8 callers in 3 test files. All test-only. - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo confirms zero remaining call sites. - collectionFastPathTTL / luaFastPathTTL consts: still used by the migrated sites as the ttl argument to eventuallyExpired. Unchanged. Validation ========== go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s gofmt + go vet + golangci-lint run -> 0 issues Closes the "Redis TTL window deadline-multiplier" follow-up noted in PR #891's description.
… tests (#903) ## Summary Third flake-root-cause PR, after #891 (SQS throttle refill) and #898 (Raft readiness probe). Same recipe: a recurring CI-flake mechanism has been patched per-test for months; this PR generalises the structural fix and migrates the extant sites. ## Recurring flake (PR #818 lineage) Tests that *"set a short TTL, sleep past it, assert the key is gone"* race their own wall clock under `-race` on slow CI runners. The post-sleep read sometimes observes an un-expired view because the inter-call pause does not absorb scheduler jitter. [PR #818](#818) fixed one instance (`TestRedis_ExpiredKey_BecomesInvisible`) by switching to `require.Eventually` with a TTL-derived deadline. The other 8 sites kept the fragile `time.Sleep(ttl + 50ms)` pattern. ## Structural fix - **New helper** `eventuallyExpired(t, ttl, condition, msg)` in `test_util.go`. Encapsulates the `require.Eventually + ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms)` idiom so future test writers reach for it instead of inventing fresh `time.Sleep` margins. - **Migrated 8 sites** to the helper: - `redis_collection_fastpath_test.go`: `TestRedis_HGET_TTLExpired`, `TestRedis_HEXISTS_TTLExpired`, `TestRedis_SISMEMBER_TTLExpired` - `redis_lua_collection_fastpath_test.go`: `TestLua_HGET_HashTTL`, `TestLua_HEXISTS_HashTTL`, `TestLua_ZSCORE_*`, `TestLua_ZRANGEBYSCORE_*` - `redis_misskey_compat_test.go`: post-`EX 1` expiry assertion - **Removed unused** `waitForTTLExpiry` (redis_collection_fastpath_test.go) and `waitForLuaTTL` (redis_lua_collection_fastpath_test.go) convenience wrappers. Sole callers were the migrated sites. - Added `errors` stdlib import to the three test files where the helper's condition closures use `errors.Is`. ## Caller audit (per /loop semantic-change rule) - `eventuallyExpired`: new helper, 8 callers in 3 test files. All test-only. - `waitForTTLExpiry` / `waitForLuaTTL`: removed. Repo-wide grep confirms zero remaining call sites. - `collectionFastPathTTL` / `luaFastPathTTL` consts: still used by the migrated sites as the `ttl` argument to `eventuallyExpired`. Unchanged. ## Validation - `go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2` → ok 32.3s - `gofmt` + `go vet` + `golangci-lint run` → 0 issues ## Scope Closes the *"Redis TTL window deadline-multiplier"* follow-up noted in [PR #891's description](#891). ## Out-of-scope (still on the flake follow-up backlog) - VM pool / cache hit-rate assertion → "don't assert on cache hit rates" rule. The structural lesson from commit [7783c27](7783c279) is to never assert on cache hit rates; needs a doc rather than a code change.
…gatorRaceFree (#906) ## Summary Fixes the CI flake observed in [Actions run 26765510693](https://github.com/bootjp/elastickv/actions/runs/26765510693/job/78890657730?pr=902): ``` --- FAIL: TestHotKeysAggregatorRaceFree (0.11s) hot_keys_test.go:356: Expected value not to be nil. ``` Surfaced on PR #902's CI run but **unrelated to PR #902** — the failing test is on `main` in `keyviz/`, which PR #902 (Redis adapter `*Eventually` helper cleanup) doesn't touch. ## Root cause Same wall-clock-racing pattern as the SQS-throttle (PR #891) and Redis-TTL (PR #903) flake series. The test: 1. Launches an aggregator goroutine (Step=5ms tick). 2. Spawns 8 workers each calling `Observe` 500 times. 3. `time.Sleep(50 * time.Millisecond)`. 4. Asserts both per-route snapshots are non-nil. The aggregator publishes a snapshot only on a tick (or on ctx-done drain). Under `-race` on a slow CI runner, the `Run` goroutine itself can be slow to schedule — the 50ms wait may not contain a single actual tick worth of scheduler time, so `hotKeysSnap` stays at its initial `nil` `atomic.Pointer` load and the assertion fires. ## Fix `require.Eventually` with a 3-second budget + 5ms poll cadence, asserting both snapshots non-nil: ```go require.Eventually(t, func() bool { return s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil }, 3*time.Second, 5*time.Millisecond, "aggregator must publish a snapshot for each route within the budget") ``` The test still pins the load-bearing property (snapshot reads are lock-free; the `atomic.Pointer.Load` works without contention with the publisher); only the **wait mechanism** changes from "sleep blindly and hope" to "poll until ready or fail". Total time bound is at most 3s on a slow runner, ~5–50ms in the common case. ## Validation - `go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -timeout 120s` → ok 1.1s (5/5 iterations green) - `gofmt` + `golangci-lint run` → 0 issues ## Pattern continuity This is the same kind of structural fix as PR #891 / #903: **replace fixed-time-window sleep with `require.Eventually`** for any condition that depends on a background goroutine's progress under `-race` on CI. Future flakes in this category should follow the same pattern.
Summary
Structural fix for the recurring SQS-throttle CI flake. PR #890 just merged a per-test fix for
SetQueueAttributesInvalidatesBucket; this PR generalizes the rule across the whole file so the next drain-pattern test added toadapter/sqs_throttle_integration_test.gocannot regress to the same flake.Why a structural fix, not another per-test commit
The same CI-flake mechanism has been fixed three times in three different SQS-throttle tests over ~6 weeks:
54c6cd56NoOpSetQueueAttributesPreservesBucket(PR #819)204b7294SetQueueAttributesInvalidatesBucket(PR #890)Each per-test fix touched one test in isolation, leaving the same trap in sibling tests for the next slow CI runner to step on. Two sibling tests are still unprotected at the time of this PR:
TestSQSServer_Throttle_SendBucketRejectsAfterCapacity— HIGH risk, drain 10 then assert 11th rejects, Refill=1TestSQSServer_Throttle_RecvBucketRejectsAfterCapacity— HIGH risk, receive-side mirror, same patternPlus
BatchChargesByEntryCountis theoretically at risk on a very slow runner.Root cause (recurring across all three flake instances)
Tests that (a) set
ThrottleSend/RecvRefillPerSecondAND (b) drain the bucket then assert a "throttled" response are racing their own wall clock. Under-raceon slow CI runners each request elapses ~100-250 ms (full HTTP + Raft propose+apply). An N-write drain plus its assertion send takes ~N×150 ms ± noise. At Refill=1/sec the bucket accumulates ≥1 token over that window and the "throttled" assertion fires against a refilled bucket, returning 200 instead of 400 — a false positive against a real production-contract regression test.Structural fix
Shared file-head constants + helpers in
adapter/sqs_throttle_integration_test.go:Helpers lock the refill rate to
slowRefillRateso a future drain-pattern test cannot pick a racing value by accident.File-head doc-comment block explaining:
Migrated all drain-pattern call sites to the helpers:
SendBucketRejectsAfterCapacity(HIGH risk — unprotected until now). Retry-After assertion updated from"1"to"100"(formulaceil(needed/refillRate)returns 100s for refillRate=0.01; verified by readingcomputeRetryAfterinsqs_throttle.go:525).RecvBucketRejectsAfterCapacity(HIGH risk; same migration)BatchChargesByEntryCount(LOW risk in practice, migrated for consistency)SetQueueAttributesInvalidatesBucket(already fixed by PR test(sqs): slow throttle refill on SetQueueAttributesInvalidatesBucket to fix CI flake #890, refactored to use the helper)DeleteQueueInvalidatesBucket(no drain assertion, migrated for uniformity)NoOpSetQueueAttributesPreservesBucket(already fixed by PR test(sqs): slow throttle refill on no-op SetQueueAttributes test to fix CI flake #819 with a localslowRefillconst, refactored to use the shared helper)Not migrated (intentional):
GetQueueAttributesRoundTrip(Refill=50/5 — exact values are the test contract)RejectsCapacityBelowBatchMin(Refill=1 — value is part of the rejected input the validator must catch)SetQueueAttributesInvalidatesBucketpost-raise phase (Refill=20 — raising both Capacity AND Refill is what the phase asserts fires)Caller audit (per /loop semantic-change rule)
Test-only file. Helpers are package-private and only referenced within this single file (verified by grep across the repo). No production code touched. No other test file uses these names.
Retry-After header assertion change in
SendBucketRejectsAfterCapacitywas the single observable behavior change: header depends onrefillRatevia the §3.4 formula. Old"1"matched Refill=1; new"100"matches Refill=0.01. The header contract is still exercised — only the asserted literal changed.Validation
go test ./adapter/ -run TestSQSServer_Throttle -race -count=3 -timeout 180s→ ok 5.1s, all three iterations greengofmt+go vet+golangci-lint run→ 0 issuesOut-of-scope follow-ups (other recurring flake categories)
Broader CI-flake history (47 fix-CI-flake commits) has other recurring patterns that deserve similar structural treatment — tracked as future work, out of scope for this PR:
eventuallyLeaderOK(t, ...)helper would centralize the retry + budget knobs.7783c279— the structural lesson is to avoid asserting on cache hit rates.This PR focuses on the most-repeated single flake (3 fixes for the same mechanism in 6 weeks). The remaining categories each warrant their own follow-up PR after we see how this pattern works in practice.
Summary by CodeRabbit