test(redis): widen TTL window on ExpiredKey_BecomesInvisible to fix CI flake by bootjp · Pull Request #818 · bootjp/elastickv

bootjp · 2026-05-23T21:04:47Z

Summary

Fixes the recurring TestRedis_ExpiredKey_BecomesInvisible CI flake that has been blocking unrelated admin PRs (#813, #814, #815, #816) all day.

Root cause

The test sets a key with PX 200 (200ms TTL) then immediately reads it back:

require.NoError(t, rdb.Do(ctx, "SET", "expiry:short", "v", "PX", "200").Err())

got, err := rdb.Get(ctx, "expiry:short").Result()
require.NoError(t, err)                                          // ← fails: redis.Nil
require.Equal(t, "v", got, "key must be visible before expiry")

The SET goes through a 3-node Raft cluster: client → leader proposes → quorum apply → leader applies → response. Under -race on CI runners this round-trip can take 100–250ms. The TTL clock starts at the moment the leader applies, so by the time SET returns OK on a slow CI runner, the key may have <50ms of TTL left. The immediately-following GET then loses the race and hits redis.Nil.

That this is the bug — not anything in any of #813-#816 — is locked in by the cross-PR failure pattern: the flake fired today on PRs touching only internal/admin/ and web/admin/, with zero code overlap with adapter/.
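
The race described above comes down to simple arithmetic, sketched here in Go. This is an illustrative model rather than code from the PR: `remainingTTL`, `visible`, and the sample delays are hypothetical names and values, and the model assumes the TTL clock starts at leader apply and that `sinceApply` has elapsed by the time the GET is evaluated.

```go
package main

import (
	"fmt"
	"time"
)

// TTLs are taken from the PR; slowElapsed is the worst SET-ack latency
// quoted there, used here as a sample point rather than a measurement.
const (
	oldTTL      = 200 * time.Millisecond
	newTTL      = 2 * time.Second
	slowElapsed = 250 * time.Millisecond
)

// remainingTTL models how much TTL is left when the GET is evaluated,
// assuming the TTL clock started sinceApply ago.
func remainingTTL(ttl, sinceApply time.Duration) time.Duration {
	return ttl - sinceApply
}

// visible reports whether the "visible before expiry" GET would still
// find the key; a non-positive remainder corresponds to redis.Nil.
func visible(ttl, sinceApply time.Duration) bool {
	return remainingTTL(ttl, sinceApply) > 0
}

func main() {
	delays := []time.Duration{50 * time.Millisecond, 150 * time.Millisecond, slowElapsed}
	for _, d := range delays {
		fmt.Printf("elapsed=%v old: left=%v visible=%t | new: left=%v visible=%t\n",
			d, remainingTTL(oldTTL, d), visible(oldTTL, d),
			remainingTTL(newTTL, d), visible(newTTL, d))
	}
}
```

With the old 200ms TTL the key is only visible while the elapsed time stays under 200ms, which a 250ms round-trip violates; with the 2s TTL the same delay leaves well over a second of margin.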

Fix

PX 200 → PX 2000. 2s is comfortably past the worst observed SET-ack latency (~250ms on CI) while keeping the test under a 7s wall-clock budget (2s TTL + 5s Eventually deadline, raised from 1s).

The test's intent — verify expired keys become invisible to subsequent GETs — is preserved exactly. Only the timing constants change; the assertions are unchanged.

Self-review (5 passes)

Data loss — none. Test-only constants.
Concurrency — closes a wall-clock vs. Raft-apply-latency race in the test fixture.
Performance — test now takes ~3s instead of ~250ms locally; still well under the 30s package timeout.
Consistency — the broader test still asserts both halves (visible before expiry, invisible after expiry) of the TTL contract.
Test coverage — no change to coverage; same assertions, friendlier timing.

Test plan

go test -race -count=3 -timeout=180s -run TestRedis_ExpiredKey_BecomesInvisible ./adapter/ — passes 3/3 (10.792s)
golangci-lint --config=.golangci.yaml run — clean
CI

Context

Discovered while running the PR review loop on #813-#817. The Claude bot's Round 8 review on #813 marked the PR as "Ready to merge once CI passes" — only the Redis flake was blocking. This fix unblocks #813 (and the parallel admin PRs) without touching any admin code.

Summary by CodeRabbit

Tests
- Improved reliability of the Redis TTL compatibility test by adjusting timing parameters to reduce intermittent test failures.

…I flake

The 200ms TTL races against the SET → GET round-trip latency on a 3-node Raft cluster under -race on CI runners. SET goes through Raft propose → quorum apply, which can take 100-250ms under contention. By the time the visibility-before-expiry GET fires, the wall-clock-based TTL has already burned and the GET returns redis.Nil, failing the require.NoError at line 193.

The flake recurred consistently today across PRs that share no code with the Redis adapter:

- #813 (admin Dynamo HTTP, internal/admin only)
- #814 (admin S3 HTTP, internal/admin only)
- #815 (admin SPA Dynamo, web/admin only)
- #816 (admin SPA S3, web/admin only)

That cross-PR pattern locates the failure squarely in the adapter test, not in the PR under test.

Fix: PX 200 -> PX 2000. 2s is comfortably past the worst observed SET-ack latency (~250ms on CI) while keeping the test under a 7s wall-clock budget (2s TTL + 5s Eventually deadline). Also widens the Eventually deadline from 1s to 5s so the expiry-detection loop has headroom past the longer TTL plus any Raft-replicated DEL on expiry.

The test's intent (verify that an expired key becomes invisible to subsequent GETs) is preserved exactly. Only the timing constants change; the assertions are unchanged.

Tested locally:

go test -race -count=3 -timeout=180s -run TestRedis_ExpiredKey_BecomesInvisible ./adapter/
ok github.com/bootjp/elastickv/adapter 10.792s

bootjp · 2026-05-23T21:04:58Z

@claude review

New PR to fix the recurring TestRedis_ExpiredKey_BecomesInvisible flake that has been blocking #813-#816 CI all day.

Root cause: PX 200 (200ms TTL) races against 3-node Raft SET round-trip latency (100-250ms under -race). Fix: PX 2000 + Eventually deadline raised 1s→5s.

Tested locally with -race -count=3 — passes 3/3. Test intent unchanged; only timing constants.

coderabbitai · 2026-05-23T21:04:58Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e65bbe0-0c97-4707-acca-67af2daa19f5

📥 Commits

Reviewing files that changed from the base of the PR and between f831c93 and 4b4a650.

📒 Files selected for processing (1)

adapter/redis_ttl_compat_test.go

📝 Walkthrough

This PR stabilizes a flaky TTL expiry test by increasing timing parameters. The TestRedis_ExpiredKey_BecomesInvisible test's PX TTL is extended from 200ms to 2s, and the Eventually deadline is extended from 1s to 5s, with expanded comments documenting the race condition.

Changes

Redis TTL Test Stabilization

Layer / File(s)	Summary
TTL expiry test timing adjustment `adapter/redis_ttl_compat_test.go`	PX TTL increased from 200ms to 2s and Eventually deadline increased from 1s to 5s to reduce test flakiness. Surrounding comments expanded to document the observed race condition and the rationale for new timing parameters.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

A rabbit hops through timing's test,
Where sleepy keys must pass their quest,
Two seconds now, no more the race—
Five seconds catch them in their place! 🐰⏱️

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and concisely describes the main change: fixing a CI flake in the TestRedis_ExpiredKey_BecomesInvisible test by widening TTL timing parameters.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


claude · 2026-05-23T21:05:10Z

Claude finished @bootjp's task in 1m 51s

PR #818 Review

Gather context
Read changed file and diff
Review timing constants and test logic
Post findings

Overall: Approve with one style note. The fix is correct — root cause analysis is sound, timing margins are appropriate, and the test assertions are unchanged. One CLAUDE.md violation in the inline comments worth addressing before merge.

What's Good

Root cause is accurately diagnosed. The PX 200 → SET → GET sequence on a 3-node Raft cluster creates a genuine race: SET is only acknowledged after quorum apply (100–250ms under -race), leaving <50ms of TTL for the immediately-following GET. The cross-PR failure evidence (flake on PRs touching only internal/admin/ + web/admin/) confirms the race is the adapter's own timing, not anything in the PRs it blocked.

Timing margins are appropriate.

PX 2000 gives 8–10× headroom over worst observed SET-ack latency (250ms). This is robust without being unreasonable.
Eventually deadline 5s gives ~3s of polling time past the 2s TTL expiry, easily enough for the 25ms poll cadence + any expiry propagation delay.
Test wall-clock budget: ~2–3s per run, well under the 30s package timeout even under -race -count=3.
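
The margins in the list above can be checked with a few lines of Go. This is a sketch for illustration only: the function names are hypothetical, the 250ms, 2s, 5s, and 25ms figures come from the review, and the 200ms ack value is an assumption used to reproduce the upper end of the quoted "8–10×" range.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialTTL         = 2 * time.Second
	worstAck           = 250 * time.Millisecond // worst observed SET-ack latency per the review
	assumedAck         = 200 * time.Millisecond // assumption: a slightly faster ack
	eventuallyDeadline = 5 * time.Second
	pollInterval       = 25 * time.Millisecond
)

// headroom returns how many times the TTL exceeds the SET-ack latency.
func headroom(ttl, ack time.Duration) float64 {
	return float64(ttl) / float64(ack)
}

// pollsAfterExpiry returns how many poll ticks fit between TTL expiry
// and the end of the polling window.
func pollsAfterExpiry(ttl, window, tick time.Duration) int64 {
	return int64((window - ttl) / tick)
}

func main() {
	fmt.Printf("headroom at %v ack: %.0fx\n", worstAck, headroom(initialTTL, worstAck))
	fmt.Printf("headroom at %v ack: %.0fx\n", assumedAck, headroom(initialTTL, assumedAck))
	fmt.Printf("polling time past expiry: %v (%d ticks)\n",
		eventuallyDeadline-initialTTL,
		pollsAfterExpiry(initialTTL, eventuallyDeadline, pollInterval))
}
```

Against the 250ms worst case the ratio is 8; the 3s left after expiry allows 120 polls at a 25ms cadence.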

const initialTTL = 2 * time.Second is a nice touch — avoids the magic number 2000 in the PX call and documents the unit at the declaration site.

initialTTL.Milliseconds() (int64) passed to rdb.Do as interface{} is fine — go-redis v9 serializes integers correctly for PX.
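
As a small illustration of the conversion mentioned above, the sketch below builds the argument list for a `SET ... PX` command from a `time.Duration`. `setArgs` is a hypothetical helper, not part of the PR or of go-redis; it only shows that `Duration.Milliseconds()` yields an `int64` that can sit in an `[]interface{}` alongside the string arguments.

```go
package main

import (
	"fmt"
	"time"
)

// initialTTL mirrors the constant introduced in the PR.
const initialTTL = 2 * time.Second

// setArgs is a hypothetical helper that assembles the arguments for
// "SET key value PX <milliseconds>" in the variadic shape a generic
// command call expects.
func setArgs(key, value string, ttl time.Duration) []interface{} {
	return []interface{}{"SET", key, value, "PX", ttl.Milliseconds()}
}

func main() {
	args := setArgs("expiry:short", "v", initialTTL)
	fmt.Printf("%v (PX argument has type %T)\n", args, args[4])
}
```

Declaring the TTL as a `time.Duration` keeps the unit at the declaration site, and the millisecond value is derived in one place instead of being repeated as a literal.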

Both assertions are preserved: "visible before expiry" and "gone after expiry" — the test still covers both halves of the TTL contract.

One Style Issue: PR Numbers in Inline Comments

adapter/redis_ttl_compat_test.go:195–197:

// — a recurring flake
// observed across PRs #813/#814/#815/#816 today, all of which
// share no code with the Redis adapter.

CLAUDE.md explicitly forbids this:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

Six months from now, #813–#816 is noise in the source file. The diagnostics behind those numbers belong in the commit message and PR description (where they already live), not in the test code. The rest of the comment — explaining why 2s and why 5s — is genuinely non-obvious and warranted; just trim the PR references.

Suggested replacement (lines 189–213):

// PX=2000ms (not shorter): SET on a 3-node Raft cluster must reach
// quorum before returning OK, which takes 100–250ms under -race on
// CI runners. With a 200ms TTL the "visible before expiry" GET
// raced the TTL clock and returned redis.Nil intermittently.
// 2s is comfortably past the worst observed SET-ack latency.
const initialTTL = 2 * time.Second
require.NoError(t, rdb.Do(ctx, "SET", "expiry:short", "v", "PX",
    initialTTL.Milliseconds()).Err())

got, err := rdb.Get(ctx, "expiry:short").Result()
require.NoError(t, err)
require.Equal(t, "v", got, "key must be visible before expiry")

// 5s deadline: headroom past the 2s TTL + 25ms poll cadence +
// any Raft-replicated expiry propagation latency.
require.Eventually(t, func() bool {
    _, e := rdb.Get(ctx, "expiry:short").Result()
    return errors.Is(e, redis.Nil)
}, 5*time.Second, 25*time.Millisecond, "key must be gone after expiry")
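
To see both halves of the TTL contract without a Redis or Raft dependency, the following sketch replays the test's sequence against an in-memory store driven by a manually advanced clock. Everything here is hypothetical scaffolding (`fakeClock`, `ttlStore`, `runScenario`, and `errNil` standing in for `redis.Nil`); it assumes the TTL clock starts when the SET is applied and that `ackDelay` passes before the first GET.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNil stands in for redis.Nil in this sketch.
var errNil = errors.New("nil")

const (
	oldTTL   = 200 * time.Millisecond
	newTTL   = 2 * time.Second
	ackDelay = 250 * time.Millisecond // simulated SET round-trip
)

// fakeClock is advanced by hand so the example is deterministic.
type fakeClock struct{ now time.Time }

func (c *fakeClock) Advance(d time.Duration) { c.now = c.now.Add(d) }

type entry struct {
	value    string
	expireAt time.Time
}

// ttlStore is a single-node stand-in for the adapter under test.
type ttlStore struct {
	clock *fakeClock
	data  map[string]entry
}

// SetPX stores value; the TTL clock starts at the current fake time.
func (s *ttlStore) SetPX(key, value string, ttl time.Duration) {
	s.data[key] = entry{value: value, expireAt: s.clock.now.Add(ttl)}
}

// Get returns errNil for missing or expired keys.
func (s *ttlStore) Get(key string) (string, error) {
	e, ok := s.data[key]
	if !ok || !s.clock.now.Before(e.expireAt) {
		return "", errNil
	}
	return e.value, nil
}

// runScenario replays SET, delayed first GET, then a GET after the
// full TTL has passed, and reports both observations.
func runScenario(ttl, delay time.Duration) (visibleBefore, goneAfter bool) {
	clock := &fakeClock{now: time.Unix(0, 0)}
	store := &ttlStore{clock: clock, data: map[string]entry{}}
	store.SetPX("expiry:short", "v", ttl)

	clock.Advance(delay)
	_, err := store.Get("expiry:short")
	visibleBefore = err == nil

	clock.Advance(ttl)
	_, err = store.Get("expiry:short")
	goneAfter = errors.Is(err, errNil)
	return visibleBefore, goneAfter
}

func main() {
	for _, ttl := range []time.Duration{oldTTL, newTTL} {
		before, after := runScenario(ttl, ackDelay)
		fmt.Printf("ttl=%v visibleBefore=%t goneAfter=%t\n", ttl, before, after)
	}
}
```

In this model the 200ms TTL fails the first observation and the 2s TTL passes both, which matches the failure mode the review describes.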

Five-Pass Self-Review

Data loss — none; test-only constants.
Concurrency — the original wall-clock vs. Raft-apply race is closed by the wider TTL window.
Performance — test runtime increases by ~2s; acceptable under the 30s package timeout.
Consistency — both "visible" and "expired" assertions are preserved.
Test coverage — unchanged; same branches exercised.

The style issue is minor enough that this is mergeable as-is if you prefer speed — the PR numbers will just need to be cleaned up eventually. Either way, the fix itself is solid.
| Branch: fix/redis-ttl-test-flake

gemini-code-assist

Code Review

This pull request addresses test flakiness in TestRedis_ExpiredKey_BecomesInvisible by increasing the key's TTL to 2 seconds and extending the Eventually timeout to 5 seconds to accommodate Raft replication latency on CI runners. Review feedback suggests simplifying the verbose comments to remove transient details and refactoring the assertion timeout to be relative to the initial TTL for better maintainability.

gemini-code-assist · 2026-05-23T21:07:06Z

+	// PX=2000ms (was 200ms). The previous 200ms TTL races against
+	// the SET → GET round-trip latency on a 3-node Raft cluster
+	// under -race on CI runners: SET proposes through Raft, waits
+	// for quorum apply, and then returns OK, which can take
+	// 100–250ms on a slow runner. By the time the "visible before
+	// expiry" GET fires, the wall-clock-based TTL has already
+	// fired and the GET returns redis.Nil — a recurring flake
+	// observed across PRs #813/#814/#815/#816 today, all of which
+	// share no code with the Redis adapter.
+	//
+	// 2s is comfortably past the worst observed SET-ack latency
+	// (~250ms on CI) while still letting the "must be gone after
+	// expiry" loop's 5s deadline (raised below) catch the expiry
+	// well before the test deadline.


This comment block is quite verbose and contains transient information (specific PR numbers and "today") that will lose context over time. Consider simplifying it to focus on the technical justification for the increased TTL and the race condition it addresses.

Suggested change

-	// PX=2000ms (was 200ms). The previous 200ms TTL races against
-	// the SET → GET round-trip latency on a 3-node Raft cluster
-	// under -race on CI runners: SET proposes through Raft, waits
-	// for quorum apply, and then returns OK, which can take
-	// 100–250ms on a slow runner. By the time the "visible before
-	// expiry" GET fires, the wall-clock-based TTL has already
-	// fired and the GET returns redis.Nil — a recurring flake
-	// observed across PRs #813/#814/#815/#816 today, all of which
-	// share no code with the Redis adapter.
-	//
-	// 2s is comfortably past the worst observed SET-ack latency
-	// (~250ms on CI) while still letting the "must be gone after
-	// expiry" loop's 5s deadline (raised below) catch the expiry
-	// well before the test deadline.
+	// PX=2000ms (was 200ms). The previous 200ms TTL was prone to racing
+	// against the SET → GET round-trip latency on a 3-node Raft cluster
+	// under -race on CI runners. 2s is comfortably past the worst
+	// observed SET-ack latency (~250ms) while still allowing the
+	// expiry check to complete well within the test deadline.

gemini-code-assist · 2026-05-23T21:07:06Z

+	// 5s deadline (was 1s) so a slow CI runner has headroom past
+	// the 2s TTL + the Eventually poll cadence (25ms tick) +
+	// any additional latency from Raft-replicated DEL on expiry.
 	require.Eventually(t, func() bool {
 		_, e := rdb.Get(ctx, "expiry:short").Result()
 		return errors.Is(e, redis.Nil)
-	}, time.Second, 25*time.Millisecond, "key must be gone after expiry")
+	}, 5*time.Second, 25*time.Millisecond, "key must be gone after expiry")


The Eventually timeout is hardcoded to 5*time.Second. To improve maintainability, consider defining this timeout relative to initialTTL. This ensures that if the TTL is adjusted in the future, the assertion deadline remains valid and correctly accounts for the expected expiration window.

// Use a deadline with headroom past the TTL to account for poll
// cadence and Raft-replicated DEL latency on expiry.
require.Eventually(t, func() bool {
	_, e := rdb.Get(ctx, "expiry:short").Result()
	return errors.Is(e, redis.Nil)
}, initialTTL + 3*time.Second, 25*time.Millisecond, "key must be gone after expiry")
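
The idea of tying the polling deadline to the TTL can be sketched independently of testify. The code below is a simplified stand-in, not the library's implementation: `eventually` is a hand-rolled polling loop that evaluates the condition inline on each tick, and `expiryDeadline`, `expiryHeadroom`, and the demo durations are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialTTL     = 2 * time.Second
	expiryHeadroom = 3 * time.Second // poll cadence plus replicated-delete latency
	demoWait       = time.Second
	demoTick       = time.Millisecond
	shortWait      = 20 * time.Millisecond
)

// expiryDeadline derives the polling deadline from the TTL so the two
// values cannot drift apart when the TTL is changed later.
func expiryDeadline(ttl time.Duration) time.Duration {
	return ttl + expiryHeadroom
}

// eventually polls cond every tick until it returns true or waitFor
// elapses. It reports whether cond became true in time.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.After(waitFor)
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-deadline:
			return false
		case <-ticker.C:
			if cond() {
				return true
			}
		}
	}
}

func main() {
	fmt.Println("deadline for a 2s TTL:", expiryDeadline(initialTTL))

	calls := 0
	ok := eventually(func() bool { calls++; return calls >= 3 }, demoWait, demoTick)
	fmt.Println("condition met:", ok)
}
```

With the deadline expressed as `ttl + headroom`, raising the TTL automatically widens the assertion window by the same amount, which is the maintainability point raised in the review comment.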

…#818 r1)

Two style notes from Gemini medium x2 / Claude bot on PR #818.

1) CLAUDE.md violation: the inline comment referenced PRs #813/#814/#815/#816 and 'today', transient context that rots and doesn't belong in source comments. Per CLAUDE.md: 'Don't reference the current task, fix, or callers ... those belong in the PR description'. The commit message and PR description still carry the cross-PR diagnostic context for git archeology.

2) Eventually deadline now derives from initialTTL (initialTTL + 3*time.Second) rather than a hardcoded 5s. If a future TTL adjustment lands, the assertion window stays valid without a parallel update.

Trimmed comment focuses on the technical justification:

- WHY 2s TTL (SET-ack latency on 3-node Raft under -race)
- WHY ttl+3s deadline (poll cadence + Raft-replicated DEL latency on expiry)

No behavior change to the test. initialTTL=2s + deadline=5s (2+3) matches the previous 2s + 5s.

Tested locally with -race -count=2 (7.4s wall): passes both runs.

go test -race -count=2 -timeout=120s ./adapter/... in scope: passes
golangci-lint: clean

bootjp · 2026-05-23T21:12:41Z

@claude review

Round 1 addresses both findings (1 Claude bot, 2 Gemini medium):

CLAUDE.md violation — removed PR-number references and 'today' phrasing from inline comments. Cross-PR diagnostic context lives in the commit message / PR description where it belongs.
Eventually deadline derives from initialTTL — initialTTL + 3*time.Second keeps the assertion window in sync if the TTL is adjusted in future.

No behavior change. Same 2s TTL, same effective 5s assertion deadline; tested locally with -race -count=2.

claude · 2026-05-23T21:12:55Z

Claude finished @bootjp's task in 57s

PR #818 Round 2 Review

Overall: Approve. Both Round 1 findings are cleanly addressed. No remaining issues.

Fix 1: PR number references removed ✅

The previous comment block referencing #813/#814/#815/#816 and "today" is gone. The replacement explains the technical constraint precisely:

// initialTTL must comfortably outlast the SET → first-GET
// round-trip. SET on a 3-node Raft cluster must reach quorum
// before returning OK, which can take 100–250ms under -race on
// CI runners. The original 200ms TTL raced that window: by the
// time the "visible before expiry" GET fired, the wall-clock
// TTL had already burned and the GET returned redis.Nil. 2s
// is comfortably past the worst observed SET-ack latency.

This is exactly the kind of non-obvious WHY that CLAUDE.md mandates — it documents the failure mode that occurs if the value drops below ~250ms, which is not derivable from the constant itself. A future developer adjusting this value will understand the constraint. No transient references.

Fix 2: `Eventually` deadline derives from `initialTTL` ✅

}, initialTTL+3*time.Second, 25*time.Millisecond, "key must be gone after expiry")

initialTTL + 3*time.Second (= 5s total) keeps the assertion window proportional. If initialTTL is ever increased, the deadline scales with it automatically. The accompanying comment explains the headroom rationale (poll cadence + Raft-replicated DEL latency) without over-specifying.

Five-Pass Sanity Check

Data loss — none; test-only constants.
Concurrency — wall-clock vs. Raft-apply race closed by the wider TTL window; no new races introduced.
Performance — test runtime ~5s max under -race, well under the 30s package timeout.
Consistency — both "visible before expiry" and "gone after expiry" assertions are preserved.
Test coverage — unchanged; same branches exercised, comments improved.

Ready to merge.
| Branch: fix/redis-ttl-test-flake

…ix CI flake (#819)

## Summary

Fixes the recurring `TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket` CI flake that has been hitting the same admin PR series (#813, #815, #816) that #818 just unblocked.

## Root cause

The test sequence:

1. `SetQueueAttributes` (capacity=10, refill=1/sec)
2. **10 drain sends** (drain the bucket)
3. Sanity-check send → expects 400 (drained)
4. `SetQueueAttributes` with identical values (no-op)
5. **Post-no-op send → expects 400** ← intermittently returns 200

Each step goes through a real Raft propose+apply. Under `-race` on slow CI runners each round-trip takes 100–250ms. Total wall-clock from start to step 5 reaches 1.5–2.5s. At **1 token/sec** refill, by step 5 the bucket has accumulated 1+ tokens and the send returns 200, falsely indicating a no-op-invalidate-bypass regression.

The test's intent (verify that a no-op `SetQueueAttributes` does not reset bucket state) is independent of the refill rate. Slowing the refill removes the race without changing what's tested.

## Fix

`refill 1/sec → 0.01/sec` (1 token per 100s). Even a 10s test window can't accumulate to a whole token.

The throttle config validator (`adapter/sqs_catalog.go:163` `SendRefillPerSecond float64`) accepts fractional values. `0.01 != 0` so `IsEmpty` returns false and the throttle path is still exercised; only the refill cadence changes.

## Self-review (5 passes)

1. **Data loss**: none; test-only constant.
2. **Concurrency**: closes a wall-clock vs. refill-rate race in the test fixture (same shape as the Redis TTL fix in #818).
3. **Performance**: no runtime change; the test still completes in ~1–2s.
4. **Consistency**: the test still verifies the no-op-preserves-bucket invariant identically. The refill rate isn't load-bearing for the assertion.
5. **Test coverage**: unchanged. Same drain → sanity → no-op → recheck sequence; only the refill cadence differs.

## Test plan

- [x] `go test -race -count=2 -timeout=120s -run TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket ./adapter/`: passes both (2.1s)
- [x] `golangci-lint --config=.golangci.yaml run`: clean
- [ ] CI

## Context

Second flake-fix PR today (after #818 for Redis TTL). The admin PR series (#813, #815, #816) keeps hitting unrelated adapter-package flakes because the admin code lives in `internal/admin/` / `web/admin/` but `go test ./...` runs the whole tree. Each fix is small and scoped.
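
The refill arithmetic in the #819 commit message can be sketched as follows. This is a simplified linear-refill model written for illustration, not the adapter's throttle implementation: `tokensAfter` and `sendAllowed` are hypothetical names, the bucket is assumed to start empty after the drain, and one whole token is assumed to be required per send.

```go
package main

import "fmt"

// tokensAfter returns the bucket level after elapsedSec seconds of
// linear refill from an empty bucket, capped at capacity.
func tokensAfter(refillPerSec, elapsedSec, capacity float64) float64 {
	level := refillPerSec * elapsedSec
	if level > capacity {
		return capacity
	}
	return level
}

// sendAllowed reports whether at least one whole token is available.
func sendAllowed(refillPerSec, elapsedSec, capacity float64) bool {
	return tokensAfter(refillPerSec, elapsedSec, capacity) >= 1
}

func main() {
	const capacity = 10.0

	// Old rate: 1 token/sec over the 1.5-2.5s the sequence can take on CI.
	for _, elapsed := range []float64{0.5, 1.5, 2.5} {
		fmt.Printf("refill=1/s    elapsed=%.1fs allowed=%t\n",
			elapsed, sendAllowed(1, elapsed, capacity))
	}

	// New rate: 0.01 token/sec, even across a 10s window.
	fmt.Printf("refill=0.01/s elapsed=10.0s allowed=%t\n",
		sendAllowed(0.01, 10, capacity))
}
```

Under this model the old rate lets a send through once the sequence takes a second or more, while the slower rate stays below one token for the whole window, so the post-no-op send keeps returning 400.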

gemini-code-assist Bot reviewed May 23, 2026

bootjp merged commit aca0aba into main May 23, 2026
9 checks passed

bootjp deleted the fix/redis-ttl-test-flake branch May 23, 2026 21:23

This was referenced May 23, 2026

internal/admin: HTTP handlers for DynamoDB item RPCs (Phase 3a) #813

Merged

test(sqs): slow throttle refill on no-op SetQueueAttributes test to fix CI flake #819

Merged

bootjp merged 2 commits into main from fix/redis-ttl-test-flake
