Skip to content

test(redis): structural fix — eventuallyExpired helper for TTL-expiry tests#903

Merged
bootjp merged 2 commits into
mainfrom
chore/redis-ttl-deadline-multiplier
Jun 2, 2026
Merged

test(redis): structural fix — eventuallyExpired helper for TTL-expiry tests#903
bootjp merged 2 commits into
mainfrom
chore/redis-ttl-deadline-multiplier

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 1, 2026

Summary

Third flake-root-cause PR, after #891 (SQS throttle refill) and #898 (Raft readiness probe). Same recipe: a recurring CI-flake mechanism has been patched per-test for months; this PR generalises the structural fix and migrates the extant sites.

Recurring flake (PR #818 lineage)

Tests that "set a short TTL, sleep past it, assert the key is gone" race their own wall clock under -race on slow CI runners. The post-sleep read sometimes observes an un-expired view because the inter-call pause does not absorb scheduler jitter. PR #818 fixed one instance (TestRedis_ExpiredKey_BecomesInvisible) by switching to require.Eventually with a TTL-derived deadline. The other 8 sites kept the fragile time.Sleep(ttl + 50ms) pattern.

Structural fix

  • New helper eventuallyExpired(t, ttl, condition, msg) in test_util.go. Encapsulates the require.Eventually + ttlExpiryHeadroom(3s) + ttlExpiryPoll(25ms) idiom so future test writers reach for it instead of inventing fresh time.Sleep margins.

  • Migrated 8 sites to the helper:

    • redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired, TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired
    • redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL, TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*
    • redis_misskey_compat_test.go: post-EX 1 expiry assertion
  • Removed unused waitForTTLExpiry (redis_collection_fastpath_test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go) convenience wrappers. Sole callers were the migrated sites.

  • Added errors stdlib import to the three test files where the helper's condition closures use errors.Is.

Caller audit (per /loop semantic-change rule)

  • eventuallyExpired: new helper, 8 callers in 3 test files. All test-only.
  • waitForTTLExpiry / waitForLuaTTL: removed. Repo-wide grep confirms zero remaining call sites.
  • collectionFastPathTTL / luaFastPathTTL consts: still used by the migrated sites as the ttl argument to eventuallyExpired. Unchanged.

Validation

  • go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 → ok 32.3s
  • gofmt + go vet + golangci-lint run → 0 issues

Scope

Closes the "Redis TTL window deadline-multiplier" follow-up noted in PR #891's description.

Out-of-scope (still on the flake follow-up backlog)

  • VM pool / cache hit-rate assertion → "don't assert on cache hit rates" rule. The structural lesson from commit 7783c279 is to never assert on cache hit rates; needs a doc rather than a code change.

… tests

Third flake-root-cause PR, after PR #891 (SQS throttle refill) and
PR #898 (Raft readiness probe). Same pattern: a recurring CI-flake
mechanism has been fixed per-test for months; this PR generalises
the structural fix and migrates all extant sites.

Recurring flake (PR #818 lineage):
  Tests that "set a short TTL, sleep past it, assert the key is
  gone" race their own wall clock under -race on slow CI runners.
  The post-sleep read can sometimes still observe an un-expired
  view because the inter-call pause does not absorb scheduler
  jitter. PR #818 fixed one instance (ExpiredKey_BecomesInvisible)
  by switching to require.Eventually with a TTL-derived deadline.
  The other 8 sites in the test suite kept the fragile
  time.Sleep(ttl + 50ms) pattern.

Structural fix
==============

  - New helper eventuallyExpired(t, ttl, condition, msg) in
    test_util.go. Encapsulates the require.Eventually +
    ttlExpiryHeadroom (3 s) + ttlExpiryPoll (25 ms) idiom so
    future test writers reach for it instead of inventing fresh
    time.Sleep margins.

  - Migrated 8 sites to the helper:
    * redis_collection_fastpath_test.go: TestRedis_HGET_TTLExpired,
      TestRedis_HEXISTS_TTLExpired, TestRedis_SISMEMBER_TTLExpired.
    * redis_lua_collection_fastpath_test.go: TestLua_HGET_HashTTL,
      TestLua_HEXISTS_HashTTL, TestLua_ZSCORE_*, TestLua_ZRANGEBYSCORE_*.
    * redis_misskey_compat_test.go: post-EX-1 expiry assertion.

  - Removed the now-unused waitForTTLExpiry (redis_collection_fastpath
    _test.go) and waitForLuaTTL (redis_lua_collection_fastpath_test.go)
    convenience wrappers. Their sole callers were the migrated sites.

  - Added errors stdlib import to the three test files where the
    helper's condition closures use errors.Is.

Caller audit per /loop semantic-change rule
============================================

  - eventuallyExpired: new helper, 8 callers in 3 test files. All
    test-only.
  - waitForTTLExpiry / waitForLuaTTL: removed. grep across the repo
    confirms zero remaining call sites.
  - collectionFastPathTTL / luaFastPathTTL consts: still used by
    the migrated sites as the ttl argument to eventuallyExpired.
    Unchanged.

Validation
==========

  go test ./adapter/ -run 'TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey' -race -count=2 -> ok 32.3s
  gofmt + go vet + golangci-lint run -> 0 issues

Closes the "Redis TTL window deadline-multiplier" follow-up noted
in PR #891's description.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 53 minutes and 13 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cef606f6-3b47-4e30-b9e6-5cfcd9926962

📥 Commits

Reviewing files that changed from the base of the PR and between fe9e941 and ffff49b.

📒 Files selected for processing (4)
  • adapter/redis_collection_fastpath_test.go
  • adapter/redis_lua_collection_fastpath_test.go
  • adapter/redis_misskey_compat_test.go
  • adapter/test_util.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/redis-ttl-deadline-multiplier

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new helper function, eventuallyExpired, in adapter/test_util.go to address test flakiness caused by static time.Sleep calls during TTL expiration checks. Various Redis and Lua TTL tests across redis_collection_fastpath_test.go, redis_lua_collection_fastpath_test.go, and redis_misskey_compat_test.go have been refactored to use this helper, which utilizes require.Eventually with a generous deadline and a 25ms polling interval to absorb CI scheduler jitter. There are no review comments, and we have no further feedback to provide.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2b55e38ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/test_util.go Outdated
ttlExpiryHeadroom = 3 * time.Second
ttlExpiryPoll = 25 * time.Millisecond
)
require.Eventually(t, condition, ttl+ttlExpiryHeadroom, ttlExpiryPoll, msg)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait until the TTL elapses before polling

require.Eventually runs the condition once immediately, so these TTL-expiry assertions can now pass before the TTL has elapsed. In cases where the post-expiry result is also the result for a missing/hidden collection (for example HEXISTS/SISMEMBER returning false or Lua range returning an empty array), a regression that deletes or hides the key as soon as PExpire is called would satisfy the helper on the first poll and the test would never exercise the expired-after-deadline state. Sleep until ttl (or otherwise assert the key is still visible first) before starting the retry window.

Useful? React with 👍 / 👎.

codex P2 on PR #903 (adapter/test_util.go:742):
  require.Eventually runs the condition once immediately before its
  first poll tick. For TTL-expiry tests where the "expired" result
  shape coincides with "missing/hidden" (HEXISTS/SISMEMBER false,
  Lua ZRANGEBYSCORE empty array), a regression that deletes/hides
  the key at PExpire time would satisfy the first poll and the
  test would pass without ever exercising the actual
  expired-after-deadline state — silently losing the regression-
  detection power the test was meant to provide.

  Fix: time.Sleep(ttl) before require.Eventually so the first poll
  fires AFTER the TTL deadline. Total time bound is unchanged
  (ttl + ttlExpiryHeadroom); polling just starts later. The
  PExpire-deletes-key regression now fails the test because the
  PRE-sleep state has already deleted the key, and after the
  sleep we would have expected to observe the expiry transition,
  not a stale already-gone result.

Validation:
  go test ./adapter/ -run "TestRedis_HGET_TTLExpired|TestRedis_HEXISTS_TTLExpired|TestRedis_SISMEMBER_TTLExpired|TestLua_HGET_HashTTL|TestLua_HEXISTS|TestLua_ZSCORE|TestLua_ZRANGEBYSCORE|TestRedis_Misskey" -race -count=2 -> ok 30.3s
  gofmt + golangci-lint -> 0 issues

No new caller audit needed: eventuallyExpired signature unchanged,
sole new behavior is the pre-poll sleep window. Existing 8 callers
benefit transparently.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

Round 1 fix (commit ffff49b) for codex P2:

  • codex P2 (Wait until the TTL elapses before polling): require.Eventually runs the condition once immediately before its first tick. For HEXISTS/SISMEMBER tests where "expired" coincides with "missing/hidden", a PExpire-deletes-key regression would satisfy the first poll and the test would pass without exercising the actual deadline transition. Added time.Sleep(ttl) before require.Eventually so the first poll fires AFTER the TTL deadline. Total time budget unchanged (ttl + ttlExpiryHeadroom).

go test -race -count=2 (8 migrated sites) passes; lint clean.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 1m 40s —— View job


Review: Round 1 fix for codex P2

Tasks:

  • Load tools
  • Read changed files
  • Analyze the Round 1 fix for codex P2
  • Post review feedback

Summary

The fix correctly addresses the codex P2 concern. Adding time.Sleep(ttl) before require.Eventually in eventuallyExpired ensures the first poll fires after the TTL deadline, so the helper cannot trivially pass on a pre-expiry observation. The design is sound.

eventuallyExpiredadapter/test_util.go:734-751

Structure is correct. The sequencing is:

time.Sleep(ttl)                            // ① key must have TTL-expired by now
require.Eventually(condition, 3s, 25ms)    // ② CI-jitter slack for sweeper lag

Total wall-clock budget: ttl + 3s — unchanged from original intent, just split between a mandatory wait and a jitter-absorbing retry window.

Comment (lines 741-748) accurately explains the reasoning and links back to the P2 source. Good.

One residual limitation worth noting (not blocking)

The P2 concern was about a PExpire-deletes-key regression: if PExpire deleted the key immediately (a bug), the condition err == nil && !ok (HEXISTS/SISMEMBER) would satisfy the first poll, masking the regression.

With the time.Sleep(ttl) fix:

  • Correct behavior: key was alive after PExpire, expired after ttl → poll sees !ok
  • Regression: key deleted at PExpire time → key is still gone after time.Sleep(ttl) → poll still sees !ok

Both cases are observationally identical after ttl elapses. The sleep ensures we don't pass before the deadline, but it can't distinguish why the key is gone after the deadline. A test that asserts HEXISTS == 1 immediately after PExpire (before the sleep) would be the only way to catch that specific regression — but that's out of scope for a flake-fix PR, and the existing test suite likely has other coverage for PExpire semantics.

This is an inherent limitation of TTL testing, not a flaw in this fix.

Minor observations

  • redis_misskey_compat_test.go: lockTTL = 1 * time.Second means the test now sleeps a full second before polling. This is a meaningful increase in test time, but correct and consistent with the helper's contract. Acceptable.

  • Import consistency: All three test files correctly use stdlib errors (not github.com/cockroachdb/errors) in condition closures. Correct — errors.Is(err, redis.Nil) works with either package.

  • Condition closure for ZRANGEBYSCORE TTL test (redis_lua_collection_fastpath_test.go:846-853): checks len(gotArr) == 0 after a type assertion — no change in the fix, already robust.

Verdict

LGTM. The fix is correct, well-commented, and addresses the P2 concern within the scope of what TTL-expiry tests can structurally guarantee. The residual limitation I noted is pre-existing and inherent, not introduced by this PR.

@bootjp bootjp merged commit 214d6d7 into main Jun 2, 2026
9 checks passed
@bootjp bootjp deleted the chore/redis-ttl-deadline-multiplier branch June 2, 2026 06:39
bootjp added a commit that referenced this pull request Jun 2, 2026
…gatorRaceFree (#906)

## Summary

Fixes the CI flake observed in [Actions run
26765510693](https://github.com/bootjp/elastickv/actions/runs/26765510693/job/78890657730?pr=902):

```
--- FAIL: TestHotKeysAggregatorRaceFree (0.11s)
    hot_keys_test.go:356: Expected value not to be nil.
```

Surfaced on PR #902's CI run but **unrelated to PR #902** — the failing
test is on `main` in `keyviz/`, which PR #902 (Redis adapter
`*Eventually` helper cleanup) doesn't touch.

## Root cause

Same wall-clock-racing pattern as the SQS-throttle (PR #891) and
Redis-TTL (PR #903) flake series. The test:

1. Launches an aggregator goroutine (Step=5ms tick).
2. Spawns 8 workers each calling `Observe` 500 times.
3. `time.Sleep(50 * time.Millisecond)`.
4. Asserts both per-route snapshots are non-nil.

The aggregator publishes a snapshot only on a tick (or on ctx-done
drain). Under `-race` on a slow CI runner, the `Run` goroutine itself
can be slow to schedule — the 50ms wait may not contain a single actual
tick worth of scheduler time, so `hotKeysSnap` stays at its initial
`nil` `atomic.Pointer` load and the assertion fires.

## Fix

`require.Eventually` with a 3-second budget + 5ms poll cadence,
asserting both snapshots non-nil:

```go
require.Eventually(t, func() bool {
    return s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil
}, 3*time.Second, 5*time.Millisecond, "aggregator must publish a snapshot for each route within the budget")
```

The test still pins the load-bearing property (snapshot reads are
lock-free; the `atomic.Pointer.Load` works without contention with the
publisher); only the **wait mechanism** changes from "sleep blindly and
hope" to "poll until ready or fail". Total time bound is at most 3s on a
slow runner, ~5–50ms in the common case.

## Validation

- `go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5
-timeout 120s` → ok 1.1s (5/5 iterations green)
- `gofmt` + `golangci-lint run` → 0 issues

## Pattern continuity

This is the same kind of structural fix as PR #891 / #903: **replace
fixed-time-window sleep with `require.Eventually`** for any condition
that depends on a background goroutine's progress under `-race` on CI.
Future flakes in this category should follow the same pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant