Skip to content

test(keyviz): replace fixed sleep with eventually in TestHotKeysAggregatorRaceFree#906

Merged
bootjp merged 1 commit into
mainfrom
fix/hot-keys-aggregator-flake
Jun 2, 2026
Merged

test(keyviz): replace fixed sleep with eventually in TestHotKeysAggregatorRaceFree#906
bootjp merged 1 commit into
mainfrom
fix/hot-keys-aggregator-flake

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 2, 2026

Summary

Fixes the CI flake observed in Actions run 26765510693:

--- FAIL: TestHotKeysAggregatorRaceFree (0.11s)
    hot_keys_test.go:356: Expected value not to be nil.

Surfaced on PR #902's CI run but unrelated to PR #902 — the failing test is on main in keyviz/, which PR #902 (Redis adapter *Eventually helper cleanup) doesn't touch.

Root cause

Same wall-clock-racing pattern as the SQS-throttle (PR #891) and Redis-TTL (PR #903) flake series. The test:

  1. Launches an aggregator goroutine (Step=5ms tick).
  2. Spawns 8 workers each calling Observe 500 times.
  3. time.Sleep(50 * time.Millisecond).
  4. Asserts both per-route snapshots are non-nil.

The aggregator publishes a snapshot only on a tick (or on ctx-done drain). Under -race on a slow CI runner, the Run goroutine itself can be slow to schedule — the 50ms wait may not contain a single actual tick worth of scheduler time, so hotKeysSnap stays at its initial nil atomic.Pointer load and the assertion fires.

Fix

require.Eventually with a 3-second budget + 5ms poll cadence, asserting both snapshots non-nil:

require.Eventually(t, func() bool {
    return s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil
}, 3*time.Second, 5*time.Millisecond, "aggregator must publish a snapshot for each route within the budget")

The test still pins the load-bearing property (snapshot reads are lock-free; the atomic.Pointer.Load works without contention with the publisher); only the wait mechanism changes from "sleep blindly and hope" to "poll until ready or fail". Total time bound is at most 3s on a slow runner, ~5–50ms in the common case.

Validation

  • go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -timeout 120s → ok 1.1s (5/5 iterations green)
  • gofmt + golangci-lint run → 0 issues

Pattern continuity

This is the same kind of structural fix as PR #891 / #903: replace fixed-time-window sleep with require.Eventually for any condition that depends on a background goroutine's progress under -race on CI. Future flakes in this category should follow the same pattern.

…gatorRaceFree

CI flake observed on Actions run 26765510693 (PR #902 branch but
unrelated to that PR -- the keyviz/ change is on main and surfaced
there by chance):

  --- FAIL: TestHotKeysAggregatorRaceFree (0.11s)
      hot_keys_test.go:356: Expected value not to be nil.

Root cause: same wall-clock racing pattern as the SQS-throttle and
Redis-TTL flake series (PR #891, #903). The test:

  1. Launches an aggregator goroutine (Step=5ms tick).
  2. Spawns 8 workers each calling Observe 500 times.
  3. Sleeps 50ms.
  4. Asserts both per-route snapshots are non-nil.

The aggregator publishes a snapshot only on a tick (or on ctx-done
drain). Under -race on a slow CI runner, the Run goroutine itself
can be slow to schedule -- the 50ms wait may not contain a single
actual tick worth of scheduler time, so hotKeysSnap stays at its
initial nil atomic.Pointer load and the assertion fires.

Fix: require.Eventually with a 3-second budget + 5ms poll cadence,
asserting both snapshots non-nil. The test still pins the load-bearing
property (snapshot reads are lock-free; the atomic.Pointer.Load works
without contention with the publisher); only the WAIT mechanism
changes from "sleep blindly and hope" to "poll until ready or fail".
Total time bound is at most 3 s on a slow runner, ~5-50 ms in the
common case.

Validation:
  go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -> ok 1.1s
  gofmt + golangci-lint -> 0 issues

Caller audit: test-only file; no Go callers affected. The flake
linage matches the structural fixes in PR #891/#903 (replace
fixed-time-window sleep with require.Eventually); future flakes
in this category should follow the same pattern.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 41 minutes and 6 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d77b2340-d85c-473f-bbb4-2911074e1765

📥 Commits

Reviewing files that changed from the base of the PR and between 214d6d7 and 0559e64.

📒 Files selected for processing (1)
  • keyviz/hot_keys_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/hot-keys-aggregator-flake

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces a static sleep with require.Eventually in TestHotKeysAggregatorRaceFree to resolve test flakiness on slow CI runners where the aggregator's goroutine might not schedule in time. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

PR #906: hot-keys aggregator flake fix. Replaces fixed time.Sleep(50ms) with require.Eventually (3s/5ms) in TestHotKeysAggregatorRaceFree. Surfaced on Actions run 26765510693.

gemini-code-assist already reviewed: "There are no review comments." — no findings.

Same structural pattern as PR #891 / #903 (fixed-time-window sleep → polling) but for the keyviz aggregator publish-on-tick mechanism. Test still pins the load-bearing property (snapshot reads are lock-free atomic.Pointer.Load); only the wait mechanism changes.

go test ./keyviz/ -race -count=5 passes; lint clean.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 1m 38s —— View job


Code Review: PR #906

  • Gather context and understand the request
  • Read changed file (keyviz/hot_keys_test.go)
  • Check diff against origin/main
  • Review for correctness, completeness, and missed edge cases
  • Post findings

LGTM — correct fix, no issues found.

Root cause analysis

Confirmed. The old code at lines 351–357 (pre-patch):

time.Sleep(50 * time.Millisecond)
s1 := s.HotKeysSnapshot(1)
s2 := s.HotKeysSnapshot(2)
require.NotNil(t, s1)
require.NotNil(t, s2)

The aggregator goroutine with Step=5ms only publishes on a tick (or ctx.Done drain). Under -race on a slow CI runner, the goroutine itself may not get scheduled for the full 50ms — the time.Sleep in the test goroutine elapses without the aggregator actually firing a tick. The atomic.Pointer load returns the initial nil. Fix is correct.

Correctness of the polling condition

The key safety property of require.Eventually's condition:

s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil

is that it's monotonically stable once satisfied. Looking at publishAndReset (hot_keys.go:361):

slot.hotKeysSnap.Store(snap)  // snap is always a fresh non-nil pointer

publishAndReset only ever stores a new non-nil snapshot — it never stores nil back. So once both routes get their first publish, the condition stays true on all subsequent polls. No TOCTOU concern between the two HotKeysSnapshot calls within the closure.

Goroutine lifecycle ordering

Preserved correctly:

  1. Workers complete (wg.Wait())
  2. Poll until both snapshots non-nil (require.Eventually)
  3. cancel() — signals aggregator to stop
  4. wgRun.Wait() — waits for aggregator goroutine to exit

The t.Cleanup(cancel) at line 327 is a correct safety net: if require.Eventually times out and the test fails, the context gets cancelled during test cleanup, preventing the aggregator goroutine from leaking.

Comment verbosity

The 8-line block comment justifying the switch to require.Eventually is on the longer side per CLAUDE.md convention ("only add a comment when the WHY is non-obvious"). Here the WHY is genuinely non-obvious (CI-only race, specific Actions run link, scheduler jitter under -race), so the comment is warranted. The reference to the observed failure (Actions run 26765510693) makes it valuable as a post-mortem anchor for future readers.

Pattern consistency

Matches the established PR #891 / PR #903 pattern exactly. Good.

No dead imports

The time package import remains live via 3*time.Second and 5*time.Millisecond in the require.Eventually call. context and sync are still used in the broader test file.

Summary

No issues. The fix is minimal (no scope creep), correct (monotone poll condition, correct goroutine lifecycle), and follows the project's established flake-fix pattern. Gemini's "no findings" aligns.

@bootjp bootjp merged commit aa1d650 into main Jun 2, 2026
10 checks passed
@bootjp bootjp deleted the fix/hot-keys-aggregator-flake branch June 2, 2026 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant