test(keyviz): replace fixed sleep with eventually in TestHotKeysAggregatorRaceFree #906
…gatorRaceFree

CI flake observed on Actions run 26765510693 (PR #902 branch but unrelated to that PR -- the keyviz/ change is on main and surfaced there by chance):

    --- FAIL: TestHotKeysAggregatorRaceFree (0.11s)
        hot_keys_test.go:356: Expected value not to be nil.

Root cause: same wall-clock racing pattern as the SQS-throttle and Redis-TTL flake series (PR #891, #903). The test:

1. Launches an aggregator goroutine (Step=5ms tick).
2. Spawns 8 workers each calling Observe 500 times.
3. Sleeps 50ms.
4. Asserts both per-route snapshots are non-nil.

The aggregator publishes a snapshot only on a tick (or on ctx-done drain). Under -race on a slow CI runner, the Run goroutine itself can be slow to schedule -- the 50ms wait may not contain a single actual tick worth of scheduler time, so the atomic.Pointer load of hotKeysSnap still returns its initial nil and the assertion fires.

Fix: require.Eventually with a 3-second budget + 5ms poll cadence, asserting both snapshots non-nil. The test still pins the load-bearing property (snapshot reads are lock-free; the atomic.Pointer.Load works without contention with the publisher); only the wait mechanism changes from "sleep blindly and hope" to "poll until ready or fail". Total time bound is at most 3 s on a slow runner, ~5-50 ms in the common case.

Validation:

    go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -> ok 1.1s
    gofmt + golangci-lint -> 0 issues

Caller audit: test-only file; no Go callers affected.

The flake lineage matches the structural fixes in PR #891/#903 (replace fixed-time-window sleep with require.Eventually); future flakes in this category should follow the same pattern.
Code Review
This pull request replaces a static sleep with require.Eventually in TestHotKeysAggregatorRaceFree to resolve test flakiness on slow CI runners where the aggregator's goroutine might not schedule in time. There are no review comments, so I have no feedback to provide.
@claude review PR #906: hot-keys aggregator flake fix. Replaces fixed time.Sleep(50ms) with require.Eventually (3s/5ms) in TestHotKeysAggregatorRaceFree. Surfaced on Actions run 26765510693. gemini-code-assist already reviewed: "There are no review comments.", i.e. no findings. Same structural pattern as PR #891 / #903 (fixed-time-window sleep → polling) but for the keyviz aggregator publish-on-tick mechanism. Test still pins the load-bearing property (snapshot reads are lock-free atomic.Pointer.Load); only the wait mechanism changes. go test ./keyviz/ -race -count=5 passes; lint clean.
Claude finished @bootjp's task in 1m 38s

Code Review: PR #906
LGTM: correct fix, no issues found.

Root cause analysis

Confirmed. The old code at lines 351–357 (pre-patch):

    time.Sleep(50 * time.Millisecond)
    s1 := s.HotKeysSnapshot(1)
    s2 := s.HotKeysSnapshot(2)
    require.NotNil(t, s1)
    require.NotNil(t, s2)

The aggregator goroutine with …

Correctness of the polling condition

The key safety property of

    s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil

is that it's monotonically stable once satisfied. Looking at …

    slot.hotKeysSnap.Store(snap) // snap is always a fresh non-nil pointer

Goroutine lifecycle ordering

Preserved correctly: …

The …

Comment verbosity

The 8-line block comment justifying the switch to …

Pattern consistency

Matches the established …

No dead imports

The …

Summary

No issues. The fix is minimal (no scope creep), correct (monotone poll condition, correct goroutine lifecycle), and follows the project's established flake-fix pattern. Gemini's "no findings" aligns.
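The monotonicity argument in the review can be illustrated with a small standalone program. This is a sketch under stated assumptions, not the keyviz code: `snap` and `observeMonotone` are hypothetical names, and the only property carried over from the review is that the publisher always stores a fresh non-nil pointer. Under that assumption a reader that has once loaded a non-nil value can never load nil again, so a poll condition of the form "snapshot != nil" cannot flip back to false. It assumes Go 1.19 or later for `atomic.Pointer` and `atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// snap is a hypothetical snapshot payload.
type snap struct{ seq int }

// observeMonotone runs one publisher that stores `writes` fresh non-nil
// snapshots while `readers` goroutines load concurrently without locks.
// It reports whether any reader loaded nil after having loaded non-nil.
func observeMonotone(writes, readers int) (regressed bool) {
	var p atomic.Pointer[snap] // starts out nil
	var bad, done atomic.Bool
	var wg sync.WaitGroup

	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			seen := false
			for !done.Load() {
				if p.Load() != nil {
					seen = true
				} else if seen {
					bad.Store(true) // nil after non-nil would break the poll
				}
			}
		}()
	}

	// The publisher never stores nil, only fresh pointers.
	for i := 1; i <= writes; i++ {
		p.Store(&snap{seq: i})
	}
	done.Store(true)
	wg.Wait()
	return bad.Load()
}

func main() {
	fmt.Println("regressed:", observeMonotone(1000, 4)) // prints "regressed: false"
}
```

Running the sketch with `go run -race` also exercises the lock-free read path, which is the property the original test is meant to pin.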
Summary

Fixes the CI flake observed in Actions run 26765510693:

    --- FAIL: TestHotKeysAggregatorRaceFree (0.11s)
        hot_keys_test.go:356: Expected value not to be nil.

Surfaced on PR #902's CI run but unrelated to PR #902: the failing test is on main in keyviz/, which PR #902 (Redis adapter *Eventually helper cleanup) doesn't touch.

Root cause

Same wall-clock-racing pattern as the SQS-throttle (PR #891) and Redis-TTL (PR #903) flake series. The test:

1. Launches an aggregator goroutine (Step=5ms tick).
2. Spawns 8 workers each calling Observe 500 times.
3. time.Sleep(50 * time.Millisecond).
4. Asserts both per-route snapshots are non-nil.

The aggregator publishes a snapshot only on a tick (or on ctx-done drain). Under -race on a slow CI runner, the Run goroutine itself can be slow to schedule; the 50ms wait may not contain a single actual tick worth of scheduler time, so the atomic.Pointer load of hotKeysSnap still returns its initial nil and the assertion fires.

Fix

require.Eventually with a 3-second budget + 5ms poll cadence, asserting both snapshots non-nil (s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil).

The test still pins the load-bearing property (snapshot reads are lock-free; the atomic.Pointer.Load works without contention with the publisher); only the wait mechanism changes from "sleep blindly and hope" to "poll until ready or fail". Total time bound is at most 3s on a slow runner, ~5–50ms in the common case.

Validation

    go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -timeout 120s → ok 1.1s (5/5 iterations green)
    gofmt + golangci-lint run → 0 issues

Pattern continuity

This is the same kind of structural fix as PR #891 / #903: replace fixed-time-window sleep with require.Eventually for any condition that depends on a background goroutine's progress under -race on CI. Future flakes in this category should follow the same pattern.