test(redis): widen TTL window on ExpiredKey_BecomesInvisible to fix CI flake#818
Conversation
…I flake The 200ms TTL races against the SET → GET round-trip latency on a 3-node Raft cluster under -race on CI runners. SET goes through Raft propose → quorum apply, which can take 100-250ms under contention. By the time the visibility-before-expiry GET fires, the wall-clock-based TTL has already burned and the GET returns redis.Nil — failing the require.NoError at line 193. The flake recurred consistently today across PRs that share no code with the Redis adapter: - #813 (admin Dynamo HTTP, internal/admin only) - #814 (admin S3 HTTP, internal/admin only) - #815 (admin SPA Dynamo, web/admin only) - #816 (admin SPA S3, web/admin only) That cross-PR pattern locates the failure squarely in the adapter test, not in the PR under test. Fix: PX 200 -> PX 2000. 2s is comfortably past the worst observed SET-ack latency (~250ms on CI) while keeping the test under a 7s wall-clock budget (2s TTL + 5s Eventually deadline). Also widens the Eventually deadline from 1s to 5s so the expiry-detection loop has headroom past the longer TTL plus any Raft-replicated DEL on expiry. The test's intent — verify that an expired key becomes invisible to subsequent GETs — is preserved exactly. Only the timing constants change; the assertions are unchanged. Tested locally: go test -race -count=3 -timeout=180s -run TestRedis_ExpiredKey_BecomesInvisible ./adapter/ ok github.com/bootjp/elastickv/adapter 10.792s
|
@claude review New PR to fix the recurring Root cause: Tested locally with -race -count=3 — passes 3/3. Test intent unchanged; only timing constants. |
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 52 minutes and 17 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR stabilizes a flaky TTL expiry test by increasing timing parameters. The ChangesRedis TTL Test Stabilization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude finished @bootjp's task in 1m 51s —— View job PR #818 Review
Overall: Approve with one style note. The fix is correct — root cause analysis is sound, timing margins are appropriate, and the test assertions are unchanged. One CLAUDE.md violation in the inline comments worth addressing before merge. What's GoodRoot cause is accurately diagnosed. The Timing margins are appropriate.
Both assertions are preserved: "visible before expiry" and "gone after expiry" — the test still covers both halves of the TTL contract. One Style Issue: PR Numbers in Inline Comments
// — a recurring flake
// observed across PRs #813/#814/#815/#816 today, all of which
// share no code with the Redis adapter.CLAUDE.md explicitly forbids this:
Six months from now, Suggested replacement (lines 189–213): // PX=2000ms (not shorter): SET on a 3-node Raft cluster must reach
// quorum before returning OK, which takes 100–250ms under -race on
// CI runners. With a 200ms TTL the "visible before expiry" GET
// raced the TTL clock and returned redis.Nil intermittently.
// 2s is comfortably past the worst observed SET-ack latency.
const initialTTL = 2 * time.Second
require.NoError(t, rdb.Do(ctx, "SET", "expiry:short", "v", "PX",
initialTTL.Milliseconds()).Err())
got, err := rdb.Get(ctx, "expiry:short").Result()
require.NoError(t, err)
require.Equal(t, "v", got, "key must be visible before expiry")
// 5s deadline: headroom past the 2s TTL + 25ms poll cadence +
// any Raft-replicated expiry propagation latency.
require.Eventually(t, func() bool {
_, e := rdb.Get(ctx, "expiry:short").Result()
return errors.Is(e, redis.Nil)
}, 5*time.Second, 25*time.Millisecond, "key must be gone after expiry")Five-Pass Self-Review
The style issue is minor enough that this is mergeable as-is if you prefer speed — the PR numbers will just need to be cleaned up eventually. Either way, the fix itself is solid. |
There was a problem hiding this comment.
Code Review
This pull request addresses test flakiness in TestRedis_ExpiredKey_BecomesInvisible by increasing the key's TTL to 2 seconds and extending the Eventually timeout to 5 seconds to accommodate Raft replication latency on CI runners. Review feedback suggests simplifying the verbose comments to remove transient details and refactoring the assertion timeout to be relative to the initial TTL for better maintainability.
| // PX=2000ms (was 200ms). The previous 200ms TTL races against | ||
| // the SET → GET round-trip latency on a 3-node Raft cluster | ||
| // under -race on CI runners: SET proposes through Raft, waits | ||
| // for quorum apply, and then returns OK, which can take | ||
| // 100–250ms on a slow runner. By the time the "visible before | ||
| // expiry" GET fires, the wall-clock-based TTL has already | ||
| // fired and the GET returns redis.Nil — a recurring flake | ||
| // observed across PRs #813/#814/#815/#816 today, all of which | ||
| // share no code with the Redis adapter. | ||
| // | ||
| // 2s is comfortably past the worst observed SET-ack latency | ||
| // (~250ms on CI) while still letting the "must be gone after | ||
| // expiry" loop's 5s deadline (raised below) catch the expiry | ||
| // well before the test deadline. |
There was a problem hiding this comment.
This comment block is quite verbose and contains transient information (specific PR numbers and "today") that will lose context over time. Consider simplifying it to focus on the technical justification for the increased TTL and the race condition it addresses.
| // PX=2000ms (was 200ms). The previous 200ms TTL races against | |
| // the SET → GET round-trip latency on a 3-node Raft cluster | |
| // under -race on CI runners: SET proposes through Raft, waits | |
| // for quorum apply, and then returns OK, which can take | |
| // 100–250ms on a slow runner. By the time the "visible before | |
| // expiry" GET fires, the wall-clock-based TTL has already | |
| // fired and the GET returns redis.Nil — a recurring flake | |
| // observed across PRs #813/#814/#815/#816 today, all of which | |
| // share no code with the Redis adapter. | |
| // | |
| // 2s is comfortably past the worst observed SET-ack latency | |
| // (~250ms on CI) while still letting the "must be gone after | |
| // expiry" loop's 5s deadline (raised below) catch the expiry | |
| // well before the test deadline. | |
| // PX=2000ms (was 200ms). The previous 200ms TTL was prone to racing | |
| // against the SET → GET round-trip latency on a 3-node Raft cluster | |
| // under -race on CI runners. 2s is comfortably past the worst | |
| // observed SET-ack latency (~250ms) while still allowing the | |
| // expiry check to complete well within the test deadline. |
| // 5s deadline (was 1s) so a slow CI runner has headroom past | ||
| // the 2s TTL + the Eventually poll cadence (25ms tick) + | ||
| // any additional latency from Raft-replicated DEL on expiry. | ||
| require.Eventually(t, func() bool { | ||
| _, e := rdb.Get(ctx, "expiry:short").Result() | ||
| return errors.Is(e, redis.Nil) | ||
| }, time.Second, 25*time.Millisecond, "key must be gone after expiry") | ||
| }, 5*time.Second, 25*time.Millisecond, "key must be gone after expiry") |
There was a problem hiding this comment.
The Eventually timeout is hardcoded to 5*time.Second. To improve maintainability, consider defining this timeout relative to initialTTL. This ensures that if the TTL is adjusted in the future, the assertion deadline remains valid and correctly accounts for the expected expiration window.
// Use a deadline with headroom past the TTL to account for poll
// cadence and Raft-replicated DEL latency on expiry.
require.Eventually(t, func() bool {
_, e := rdb.Get(ctx, "expiry:short").Result()
return errors.Is(e, redis.Nil)
}, initialTTL + 3*time.Second, 25*time.Millisecond, "key must be gone after expiry")…#818 r1) Two style notes from Gemini medium x2 / Claude bot on PR #818. 1) CLAUDE.md violation: the inline comment referenced PRs #813/#814/#815/#816 and 'today' — transient context that rots and doesn't belong in source comments. Per CLAUDE.md: 'Don't reference the current task, fix, or callers ... those belong in the PR description'. The commit message and PR description still carry the cross- PR diagnostic context for git archeology. 2) Eventually deadline now derives from initialTTL (initialTTL + 3*time.Second) rather than a hardcoded 5s. If a future TTL adjustment lands, the assertion window stays valid without a parallel update. Trimmed comment focuses on the technical justification: - WHY 2s TTL (SET-ack latency on 3-node Raft under -race) - WHY ttl+3s deadline (poll cadence + Raft-replicated DEL latency on expiry) No behavior change to the test. initialTTL=2s + deadline=5s (2+3) matches the previous 2s + 5s. Tested locally with -race -count=2 (7.4s wall) — passes both runs. go test -race -count=2 -timeout=120s ./adapter/... in scope: passes golangci-lint: clean
|
@claude review Round 1 addresses both findings (1 Claude bot, 2 Gemini medium):
No behavior change. Same 2s TTL, same effective 5s assertion deadline; tested locally with -race -count=2. |
|
Claude finished @bootjp's task in 57s —— View job PR #818 Round 2 Review
Overall: Approve. Both Round 1 findings are cleanly addressed. No remaining issues. Fix 1: PR number references removed ✅The previous comment block referencing // initialTTL must comfortably outlast the SET → first-GET
// round-trip. SET on a 3-node Raft cluster must reach quorum
// before returning OK, which can take 100–250ms under -race on
// CI runners. The original 200ms TTL raced that window: by the
// time the "visible before expiry" GET fired, the wall-clock
// TTL had already burned and the GET returned redis.Nil. 2s
// is comfortably past the worst observed SET-ack latency.This is exactly the kind of non-obvious WHY that CLAUDE.md mandates — it documents the failure mode that occurs if the value drops below ~250ms, which is not derivable from the constant itself. A future developer adjusting this value will understand the constraint. No transient references. Fix 2:
|
…ix CI flake (#819) ## Summary Fixes the recurring `TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket` CI flake that has been hitting the same admin PR series (#813, #815, #816) that #818 just unblocked. ## Root cause The test sequence: 1. `SetQueueAttributes` (capacity=10, refill=1/sec) 2. **10 drain sends** (drain the bucket) 3. Sanity-check send → expects 400 (drained) 4. `SetQueueAttributes` with identical values (no-op) 5. **Post-no-op send → expects 400** ← intermittently returns 200 Each step goes through a real Raft propose+apply. Under `-race` on slow CI runners each round-trip takes 100–250ms. Total wall-clock from start to step 5 reaches 1.5–2.5s. At **1 token/sec** refill, by step 5 the bucket has accumulated 1+ tokens and the send returns 200 — falsely indicating a no-op-invalidate-bypass regression. The test's intent — verify that a no-op `SetQueueAttributes` does not reset bucket state — is independent of the refill rate. Slowing the refill removes the race without changing what's tested. ## Fix `refill 1/sec → 0.01/sec` (1 token per 100s). Even a 10s test window can't accumulate to a whole token. The throttle config validator (`adapter/sqs_catalog.go:163` `SendRefillPerSecond float64`) accepts fractional values. `0.01 != 0` so `IsEmpty` returns false and the throttle path is still exercised — only the refill cadence changes. ## Self-review (5 passes) 1. **Data loss** — none; test-only constant. 2. **Concurrency** — closes a wall-clock vs. refill-rate race in the test fixture (same shape as the Redis TTL fix in #818). 3. **Performance** — no runtime change; the test still completes in ~1–2s. 4. **Consistency** — the test still verifies the no-op-preserves-bucket invariant identically. The refill rate isn't load-bearing for the assertion. 5. **Test coverage** — unchanged. Same drain → sanity → no-op → recheck sequence; only the refill cadence differs. ## Test plan - [x] `go test -race -count=2 -timeout=120s -run TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket ./adapter/` — passes both (2.1s) - [x] `golangci-lint --config=.golangci.yaml run` — clean - [ ] CI ## Context Second flake-fix PR today (after #818 for Redis TTL). The admin PR series (#813, #815, #816) keeps hitting unrelated adapter-package flakes because the admin code lives in `internal/admin/` / `web/admin/` but `go test ./...` runs the whole tree. Each fix is small and scoped.
Summary
Fixes the recurring
TestRedis_ExpiredKey_BecomesInvisibleCI flake that has been blocking unrelated admin PRs (#813, #814, #815, #816) all day.Root cause
The test sets a key with
PX 200(200ms TTL) then immediately reads it back:The
SETgoes through a 3-node Raft cluster: client → leader proposes → quorum apply → leader applies → response. Under-raceon CI runners this round-trip can take 100–250ms. The TTL clock starts at the moment the leader applies, so by the timeSETreturns OK on a slow CI runner, the key may have <50ms of TTL left. The immediately-followingGETthen loses the race and hitsredis.Nil.That this is the bug — not anything in any of #813-#816 — is locked in by the cross-PR failure pattern: the flake fired today on PRs touching only
internal/admin/andweb/admin/, with zero code overlap withadapter/.Fix
PX 200 → PX 2000. 2s is comfortably past the worst observed SET-ack latency (~250ms on CI) while keeping the test under a 7s wall-clock budget (2s TTL + 5sEventuallydeadline, raised from 1s).The test's intent — verify expired keys become invisible to subsequent GETs — is preserved exactly. Only the timing constants change; the assertions are unchanged.
Self-review (5 passes)
Test plan
go test -race -count=3 -timeout=180s -run TestRedis_ExpiredKey_BecomesInvisible ./adapter/— passes 3/3 (10.792s)golangci-lint --config=.golangci.yaml run— cleanContext
Discovered while running the PR review loop on #813-#817. The Claude bot's Round 8 review on #813 marked the PR as "Ready to merge once CI passes" — only the Redis flake was blocking. This fix unblocks #813 (and the parallel admin PRs) without touching any admin code.
Summary by CodeRabbit