
test(redis): widen TTL window on ExpiredKey_BecomesInvisible to fix CI flake #818

Merged
bootjp merged 2 commits into main from fix/redis-ttl-test-flake
May 23, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented May 23, 2026

Summary

Fixes the recurring TestRedis_ExpiredKey_BecomesInvisible CI flake that has been blocking unrelated admin PRs (#813, #814, #815, #816) all day.

Root cause

The test sets a key with PX 200 (200ms TTL) then immediately reads it back:

require.NoError(t, rdb.Do(ctx, "SET", "expiry:short", "v", "PX", "200").Err())

got, err := rdb.Get(ctx, "expiry:short").Result()
require.NoError(t, err)                                          // ← fails: redis.Nil
require.Equal(t, "v", got, "key must be visible before expiry")

The SET goes through a 3-node Raft cluster: client → leader proposes → quorum apply → leader applies → response. Under -race on CI runners this round-trip can take 100–250ms. The TTL clock starts at the moment the leader applies, so by the time SET returns OK on a slow CI runner, the key may have <50ms of TTL left. The immediately-following GET then loses the race and hits redis.Nil.

The cross-PR failure pattern confirms that the flake originates in this test rather than in any of #813-#816: it fired today on PRs touching only internal/admin/ and web/admin/, which have zero code overlap with adapter/.

Fix

PX 200 → PX 2000. 2s is comfortably past the worst observed SET-ack latency (~250ms on CI) while keeping the test under a 7s wall-clock budget (2s TTL + 5s Eventually deadline, raised from 1s).

The test's intent — verify expired keys become invisible to subsequent GETs — is preserved exactly. Only the timing constants change; the assertions are unchanged.

Self-review (5 passes)

  1. Data loss — none. Test-only constants.
  2. Concurrency — closes a wall-clock vs. Raft-apply-latency race in the test fixture.
  3. Performance — test now takes ~3s instead of ~250ms locally; still well under the 30s package timeout.
  4. Consistency — the broader test still asserts both halves (visible before expiry, invisible after expiry) of the TTL contract.
  5. Test coverage — no change to coverage; same assertions, friendlier timing.

Test plan

  • go test -race -count=3 -timeout=180s -run TestRedis_ExpiredKey_BecomesInvisible ./adapter/ — passes 3/3 (10.792s)
  • golangci-lint --config=.golangci.yaml run — clean
  • CI

Context

Discovered while running the PR review loop on #813-#817. The Claude bot's Round 8 review on #813 marked the PR as "Ready to merge once CI passes" — only the Redis flake was blocking. This fix unblocks #813 (and the parallel admin PRs) without touching any admin code.

Summary by CodeRabbit

  • Tests
    • Improved reliability of the Redis TTL compatibility test by adjusting timing parameters to reduce intermittent test failures.


…I flake

The 200ms TTL races against the SET → GET round-trip latency on
a 3-node Raft cluster under -race on CI runners. SET goes through
Raft propose → quorum apply, which can take 100-250ms under
contention. By the time the visibility-before-expiry GET fires,
the wall-clock-based TTL has already burned and the GET returns
redis.Nil — failing the require.NoError at line 193.

The flake recurred consistently today across PRs that share no
code with the Redis adapter:
- #813 (admin Dynamo HTTP, internal/admin only)
- #814 (admin S3 HTTP, internal/admin only)
- #815 (admin SPA Dynamo, web/admin only)
- #816 (admin SPA S3, web/admin only)

That cross-PR pattern locates the failure squarely in the
adapter test, not in the PR under test.

Fix: PX 200 -> PX 2000. 2s is comfortably past the worst observed
SET-ack latency (~250ms on CI) while keeping the test under
a 7s wall-clock budget (2s TTL + 5s Eventually deadline).

Also widens the Eventually deadline from 1s to 5s so the
expiry-detection loop has headroom past the longer TTL plus
any Raft-replicated DEL on expiry.

The test's intent — verify that an expired key becomes invisible
to subsequent GETs — is preserved exactly. Only the timing
constants change; the assertions are unchanged.

Tested locally:
  go test -race -count=3 -timeout=180s -run TestRedis_ExpiredKey_BecomesInvisible ./adapter/
  ok  	github.com/bootjp/elastickv/adapter	10.792s
@bootjp
Owner Author

bootjp commented May 23, 2026

@claude review

New PR to fix the recurring TestRedis_ExpiredKey_BecomesInvisible flake that has been blocking #813-#816 CI all day.

Root cause: PX 200 (200ms TTL) races against 3-node Raft SET round-trip latency (100-250ms under -race). Fix: PX 2000 + Eventually deadline raised 1s→5s.

Tested locally with -race -count=3 — passes 3/3. Test intent unchanged; only timing constants.

@coderabbitai

coderabbitai Bot commented May 23, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 52 minutes and 17 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e65bbe0-0c97-4707-acca-67af2daa19f5

📥 Commits

Reviewing files that changed from the base of the PR and between f831c93 and 4b4a650.

📒 Files selected for processing (1)
  • adapter/redis_ttl_compat_test.go
📝 Walkthrough

Walkthrough

This PR stabilizes a flaky TTL expiry test by increasing timing parameters. The TestRedis_ExpiredKey_BecomesInvisible test's PX TTL is extended from 200ms to 2s, and the Eventually deadline is extended from 1s to 5s, with expanded comments documenting the race condition.

Changes

Redis TTL Test Stabilization

| Layer / File(s) | Summary |
|---|---|
| TTL expiry test timing adjustment (adapter/redis_ttl_compat_test.go) | PX TTL increased from 200ms to 2s and Eventually deadline increased from 1s to 5s to reduce test flakiness. Surrounding comments expanded to document the observed race condition and the rationale for new timing parameters. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

A rabbit hops through timing's test,
Where sleepy keys must pass their quest,
Two seconds now, no more the race—
Five seconds catch them in their place! 🐰⏱️

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and concisely describes the main change: fixing a CI flake in the TestRedis_ExpiredKey_BecomesInvisible test by widening TTL timing parameters. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@claude

claude Bot commented May 23, 2026

Claude finished @bootjp's task in 1m 51s


PR #818 Review

  • Gather context
  • Read changed file and diff
  • Review timing constants and test logic
  • Post findings

Overall: Approve with one style note. The fix is correct — root cause analysis is sound, timing margins are appropriate, and the test assertions are unchanged. One CLAUDE.md violation in the inline comments worth addressing before merge.


What's Good

Root cause is accurately diagnosed. The PX 200 → SET → GET sequence on a 3-node Raft cluster creates a genuine race: SET is only acknowledged after quorum apply (100–250ms under -race), leaving <50ms of TTL for the immediately-following GET. The cross-PR failure evidence (flake on PRs touching only internal/admin/ + web/admin/) confirms the race is the adapter's own timing, not anything in the PRs it blocked.

Timing margins are appropriate.

  • PX 2000 gives 8–10× headroom over worst observed SET-ack latency (250ms). This is robust without being unreasonable.
  • Eventually deadline 5s gives ~3s of polling time past the 2s TTL expiry, easily enough for the 25ms poll cadence + any expiry propagation delay.
  • Test wall-clock budget: ~2–3s per run, well under the 30s package timeout even under -race -count=3.

const initialTTL = 2 * time.Second is a nice touch — avoids the magic number 2000 in the PX call and documents the unit at the declaration site.

initialTTL.Milliseconds() (int64) passed to rdb.Do as interface{} is fine — go-redis v9 serializes integers correctly for PX.

Both assertions are preserved: "visible before expiry" and "gone after expiry" — the test still covers both halves of the TTL contract.


One Style Issue: PR Numbers in Inline Comments

adapter/redis_ttl_compat_test.go:195–197:

// — a recurring flake
// observed across PRs #813/#814/#815/#816 today, all of which
// share no code with the Redis adapter.

CLAUDE.md explicitly forbids this:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

Six months from now, #813–#816 is noise in the source file. The diagnostics behind those numbers belong in the commit message and PR description (where they already live), not in the test code. The rest of the comment — explaining why 2s and why 5s — is genuinely non-obvious and warranted; just trim the PR references.

Suggested replacement (lines 189–213):

// PX=2000ms (not shorter): SET on a 3-node Raft cluster must reach
// quorum before returning OK, which takes 100–250ms under -race on
// CI runners. With a 200ms TTL the "visible before expiry" GET
// raced the TTL clock and returned redis.Nil intermittently.
// 2s is comfortably past the worst observed SET-ack latency.
const initialTTL = 2 * time.Second
require.NoError(t, rdb.Do(ctx, "SET", "expiry:short", "v", "PX",
    initialTTL.Milliseconds()).Err())

got, err := rdb.Get(ctx, "expiry:short").Result()
require.NoError(t, err)
require.Equal(t, "v", got, "key must be visible before expiry")

// 5s deadline: headroom past the 2s TTL + 25ms poll cadence +
// any Raft-replicated expiry propagation latency.
require.Eventually(t, func() bool {
    _, e := rdb.Get(ctx, "expiry:short").Result()
    return errors.Is(e, redis.Nil)
}, 5*time.Second, 25*time.Millisecond, "key must be gone after expiry")



Five-Pass Self-Review

  1. Data loss — none; test-only constants.
  2. Concurrency — the original wall-clock vs. Raft-apply race is closed by the wider TTL window.
  3. Performance — test runtime increases by ~2s; acceptable under the 30s package timeout.
  4. Consistency — both "visible" and "expired" assertions are preserved.
  5. Test coverage — unchanged; same branches exercised.

The style issue is minor enough that this is mergeable as-is if you prefer speed — the PR numbers will just need to be cleaned up eventually. Either way, the fix itself is solid.
Branch: fix/redis-ttl-test-flake

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses test flakiness in TestRedis_ExpiredKey_BecomesInvisible by increasing the key's TTL to 2 seconds and extending the Eventually timeout to 5 seconds to accommodate Raft replication latency on CI runners. Review feedback suggests simplifying the verbose comments to remove transient details and refactoring the assertion timeout to be relative to the initial TTL for better maintainability.

Comment thread adapter/redis_ttl_compat_test.go Outdated
Comment on lines +189 to +202
// PX=2000ms (was 200ms). The previous 200ms TTL races against
// the SET → GET round-trip latency on a 3-node Raft cluster
// under -race on CI runners: SET proposes through Raft, waits
// for quorum apply, and then returns OK, which can take
// 100–250ms on a slow runner. By the time the "visible before
// expiry" GET fires, the wall-clock-based TTL has already
// fired and the GET returns redis.Nil — a recurring flake
// observed across PRs #813/#814/#815/#816 today, all of which
// share no code with the Redis adapter.
//
// 2s is comfortably past the worst observed SET-ack latency
// (~250ms on CI) while still letting the "must be gone after
// expiry" loop's 5s deadline (raised below) catch the expiry
// well before the test deadline.
Contributor


medium

This comment block is quite verbose and contains transient information (specific PR numbers and "today") that will lose context over time. Consider simplifying it to focus on the technical justification for the increased TTL and the race condition it addresses.

Suggested change
// PX=2000ms (was 200ms). The previous 200ms TTL was prone to racing
// against the SET → GET round-trip latency on a 3-node Raft cluster
// under -race on CI runners. 2s is comfortably past the worst
// observed SET-ack latency (~250ms) while still allowing the
// expiry check to complete well within the test deadline.

Comment thread adapter/redis_ttl_compat_test.go Outdated
Comment on lines +211 to +217
+ // 5s deadline (was 1s) so a slow CI runner has headroom past
+ // the 2s TTL + the Eventually poll cadence (25ms tick) +
+ // any additional latency from Raft-replicated DEL on expiry.
  require.Eventually(t, func() bool {
  	_, e := rdb.Get(ctx, "expiry:short").Result()
  	return errors.Is(e, redis.Nil)
- }, time.Second, 25*time.Millisecond, "key must be gone after expiry")
+ }, 5*time.Second, 25*time.Millisecond, "key must be gone after expiry")
Contributor


medium

The Eventually timeout is hardcoded to 5*time.Second. To improve maintainability, consider defining this timeout relative to initialTTL. This ensures that if the TTL is adjusted in the future, the assertion deadline remains valid and correctly accounts for the expected expiration window.

	// Use a deadline with headroom past the TTL to account for poll
	// cadence and Raft-replicated DEL latency on expiry.
	require.Eventually(t, func() bool {
		_, e := rdb.Get(ctx, "expiry:short").Result()
		return errors.Is(e, redis.Nil)
	}, initialTTL + 3*time.Second, 25*time.Millisecond, "key must be gone after expiry")

…#818 r1)

Two style notes from Gemini medium x2 / Claude bot on PR #818.

1) CLAUDE.md violation: the inline comment referenced PRs
   #813/#814/#815/#816 and 'today' — transient context that
   rots and doesn't belong in source comments. Per
   CLAUDE.md: 'Don't reference the current task, fix, or
   callers ... those belong in the PR description'. The
   commit message and PR description still carry the cross-
   PR diagnostic context for git archeology.

2) Eventually deadline now derives from initialTTL
   (initialTTL + 3*time.Second) rather than a hardcoded 5s.
   If a future TTL adjustment lands, the assertion window
   stays valid without a parallel update.

Trimmed comment focuses on the technical justification:
- WHY 2s TTL (SET-ack latency on 3-node Raft under -race)
- WHY ttl+3s deadline (poll cadence + Raft-replicated DEL
  latency on expiry)

No behavior change to the test. initialTTL=2s + deadline=5s
(2+3) matches the previous 2s + 5s. Tested locally
with -race -count=2 (7.4s wall) — passes both runs.

go test -race -count=2 -timeout=120s ./adapter/... in scope: passes
golangci-lint: clean
@bootjp
Owner Author

bootjp commented May 23, 2026

@claude review

Round 1 addresses both findings (1 Claude bot, 2 Gemini medium):

  • CLAUDE.md violation — removed PR-number references and 'today' phrasing from inline comments. Cross-PR diagnostic context lives in the commit message / PR description where it belongs.
  • Eventually deadline derives from initialTTL: initialTTL + 3*time.Second keeps the assertion window in sync if the TTL is adjusted in the future.

No behavior change. Same 2s TTL, same effective 5s assertion deadline; tested locally with -race -count=2.

@claude

claude Bot commented May 23, 2026

Claude finished @bootjp's task in 57s


PR #818 Round 2 Review

  • Gather context
  • Read updated file
  • Verify both Round 1 fixes were applied correctly
  • Check for remaining issues
  • Post findings

Overall: Approve. Both Round 1 findings are cleanly addressed. No remaining issues.


Fix 1: PR number references removed ✅

The previous comment block referencing #813/#814/#815/#816 and "today" is gone. The replacement explains the technical constraint precisely:

// initialTTL must comfortably outlast the SET → first-GET
// round-trip. SET on a 3-node Raft cluster must reach quorum
// before returning OK, which can take 100–250ms under -race on
// CI runners. The original 200ms TTL raced that window: by the
// time the "visible before expiry" GET fired, the wall-clock
// TTL had already burned and the GET returned redis.Nil. 2s
// is comfortably past the worst observed SET-ack latency.

This is exactly the kind of non-obvious WHY that CLAUDE.md mandates — it documents the failure mode that occurs if the value drops below ~250ms, which is not derivable from the constant itself. A future developer adjusting this value will understand the constraint. No transient references.

Fix 2: Eventually deadline derives from initialTTL

}, initialTTL+3*time.Second, 25*time.Millisecond, "key must be gone after expiry")

initialTTL + 3*time.Second (= 5s total) keeps the assertion window proportional. If initialTTL is ever increased, the deadline scales with it automatically. The accompanying comment explains the headroom rationale (poll cadence + Raft-replicated DEL latency) without over-specifying.

Five-Pass Sanity Check

  1. Data loss — none; test-only constants.
  2. Concurrency — wall-clock vs. Raft-apply race closed by the wider TTL window; no new races introduced.
  3. Performance — test runtime ~5s max under -race, well under the 30s package timeout.
  4. Consistency — both "visible before expiry" and "gone after expiry" assertions are preserved.
  5. Test coverage — unchanged; same branches exercised, comments improved.

Ready to merge.
Branch: fix/redis-ttl-test-flake

@bootjp bootjp merged commit aca0aba into main May 23, 2026
9 checks passed
@bootjp bootjp deleted the fix/redis-ttl-test-flake branch May 23, 2026 21:23
bootjp added a commit that referenced this pull request May 23, 2026
…ix CI flake (#819)

## Summary

Fixes the recurring
`TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket` CI flake
that has been hitting the same admin PR series (#813, #815, #816) that
#818 just unblocked.

## Root cause

The test sequence:

1. `SetQueueAttributes` (capacity=10, refill=1/sec)
2. **10 drain sends** (drain the bucket)
3. Sanity-check send → expects 400 (drained)
4. `SetQueueAttributes` with identical values (no-op)
5. **Post-no-op send → expects 400** ← intermittently returns 200

Each step goes through a real Raft propose+apply. Under `-race` on slow
CI runners each round-trip takes 100–250ms. Total wall-clock from start
to step 5 reaches 1.5–2.5s. At **1 token/sec** refill, by step 5 the
bucket has accumulated 1+ tokens and the send returns 200 — falsely
indicating a no-op-invalidate-bypass regression.

The test's intent — verify that a no-op `SetQueueAttributes` does not
reset bucket state — is independent of the refill rate. Slowing the
refill removes the race without changing what's tested.

## Fix

`refill 1/sec → 0.01/sec` (1 token per 100s). Even a 10s test window
can't accumulate to a whole token.

The throttle config validator (`adapter/sqs_catalog.go:163`
`SendRefillPerSecond float64`) accepts fractional values. `0.01 != 0` so
`IsEmpty` returns false and the throttle path is still exercised — only
the refill cadence changes.

## Self-review (5 passes)

1. **Data loss** — none; test-only constant.
2. **Concurrency** — closes a wall-clock vs. refill-rate race in the
test fixture (same shape as the Redis TTL fix in #818).
3. **Performance** — no runtime change; the test still completes in
~1–2s.
4. **Consistency** — the test still verifies the no-op-preserves-bucket
invariant identically. The refill rate isn't load-bearing for the
assertion.
5. **Test coverage** — unchanged. Same drain → sanity → no-op → recheck
sequence; only the refill cadence differs.

## Test plan

- [x] `go test -race -count=2 -timeout=120s -run
TestSQSServer_Throttle_NoOpSetQueueAttributesPreservesBucket ./adapter/`
— passes both (2.1s)
- [x] `golangci-lint --config=.golangci.yaml run` — clean
- [ ] CI

## Context

Second flake-fix PR today (after #818 for Redis TTL). The admin PR
series (#813, #815, #816) keeps hitting unrelated adapter-package flakes
because the admin code lives in `internal/admin/` / `web/admin/` but `go
test ./...` runs the whole tree. Each fix is small and scoped.