Skip to content

test(redis): wait for follower apply catch-up in MultiExec list tests#582

Merged
bootjp merged 6 commits into
mainfrom
test/fix-multiexec-flake
Apr 22, 2026
Merged

test(redis): wait for follower apply catch-up in MultiExec list tests#582
bootjp merged 6 commits into
mainfrom
test/fix-multiexec-flake

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 22, 2026

Summary

  • Root-cause TestRedis_MultiExec_DelThenRPushRecreatesList flake: the test reads readTS() directly on nodes[1], which returns the node-local store.LastCommitTS(). That value can trail the raft-committed EXEC on a follower (or a leader that just responded), while the client-facing LRANGE path waits for local apply via LeaseRead/LinearizableRead. The direct ScanAt(readTS) therefore observes the pre-EXEC state.
  • Fix by introducing waitForListState helper that polls resolveListMeta + raw item scan at the node-local readTS via require.Eventually (250ms / 5s) until both reflect the expected post-EXEC state. Real synchronization on raft apply progress, not a sleep.
  • Swap the ad-hoc readTS/resolveListMeta/ScanAt block in TestRedis_MultiExec_DelThenRPushRecreatesList for the helper.

Test plan

  • go test -race -count=10 -run TestRedis_MultiExec ./adapter/ -> 10/10 pass (20.5s)
  • go test -race -count=1 -short ./adapter/... passes (~62s)
  • make lint clean (0 issues)

TestRedis_MultiExec_DelThenRPushRecreatesList was intermittently failing
under -race on CI (PR #577): after MULTI / DEL list / RPUSH list new1
new2 / EXEC, the final direct store.ScanAt(..., readTS) on nodes[1]
returned the pre-EXEC ["old1","old2"] even though the immediately
preceding LRANGE via the redis client correctly returned
["new1","new2"].

Why the race exists:
- The go-redis client connects to nodes[1], which is not necessarily the
  raft leader.
- LRANGE goes through the adapter's LeaseRead / LinearizableRead path,
  which blocks until the local apply watermark reaches a safe point, so
  it observes the EXEC commit even on a slightly-lagging follower.
- readTS() (adapter/ts.go:snapshotTS) returns the local store's
  LastCommitTS() directly, with no wait-apply. On a follower (or a
  leader that has just responded to the client), LastCommitTS can
  momentarily trail the raft-committed EXEC because the test reads it
  on a different code path than the one that drove the client-visible
  wait.
- The direct ScanAt(readTS) therefore sees the stale, pre-EXEC state.

Fix: introduce waitForListState(t, node, key, expectedLen,
expectedValues) that polls resolveListMeta AND the raw item scan at the
node-local readTS via require.Eventually (250ms interval, 5s timeout)
until both reflect the expected post-EXEC state. This is a real
synchronization on the raft apply progress of the target node, not a
timing hack: it succeeds on the first iteration when the node is
already caught up and only retries while the apply is genuinely behind.

Use it in place of the ad-hoc readTS/resolveListMeta/ScanAt block in
TestRedis_MultiExec_DelThenRPushRecreatesList.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 43 minutes and 41 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 43 minutes and 41 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0dfce6b-2494-49c4-ade2-c6b03b92614c

📥 Commits

Reviewing files that changed from the base of the PR and between 20f97f4 and 9e0258d.

📒 Files selected for processing (1)
  • adapter/redis_multi_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/fix-multiexec-flake

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a waitForListState helper function in adapter/redis_multi_test.go to handle replication lag in tests by polling for expected list states. A review comment identifies a logic error in the helper: it fails when a list is expected to be empty (length 0) because it incorrectly treats a non-existent key as a failure, preventing its use for verifying deletions.

Comment thread adapter/redis_multi_test.go Outdated
Comment on lines +39 to +41
if err != nil || !exists || meta.Len != int64(expectedLen) {
return false
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic for checking list existence and length in waitForListState is incorrect for cases where the list is expected to be empty or deleted (expectedLen == 0). In Redis, an empty list is typically represented by a non-existent key. The current check !exists || meta.Len != int64(expectedLen) will return false (triggering a retry/timeout) if the key does not exist, even if expectedLen is 0. This prevents the helper from being used to verify that a list has been successfully deleted.

Suggested change
if err != nil || !exists || meta.Len != int64(expectedLen) {
return false
}
if err != nil || (expectedLen > 0 && !exists) || (exists && meta.Len != int64(expectedLen)) {
return false
}

bootjp added a commit that referenced this pull request Apr 22, 2026
#583)

## Summary

Drop the `require.GreaterOrEqual(t, pool.Hits(), uint64(1))` line from
`TestLua_VMReuseDoesNotLeakGlobals`. It was flaking on CI (PR #582 run
`24781704678`), and the assertion is inherently racy on `sync.Pool`.

## Root cause

The assertion required the Script-B `pool.get(nil)` to return exactly
the `pooledLuaState` that Script-A put back moments earlier. But
`sync.Pool` is documented to discard pooled items under GC pressure,
which can easily fire between a Put and a Get on the race runtime. The
test's own comment above line 80 already acknowledges this
non-determinism — but then the Hits assertion at line 112 reintroduces
the same source of flake.

```
--- FAIL: TestLua_VMReuseDoesNotLeakGlobals (0.00s)
    Error: "0" is not greater than or equal to "1"
    redis_lua_pool_test.go:112
```

## Change

- Remove the flaky Hits assertion from
`TestLua_VMReuseDoesNotLeakGlobals`.
- Keep the test's actual purpose intact (no leak of `GLOBAL_LEAK`,
`LEAKY_TABLE` across pooled states; whitelisted globals still intact).
- Comment points at `TestLua_PoolRecordsReuseVsAllocation` which asserts
`hits + misses` deterministically across a controlled loop.

## Verification

- `go test -race -count=20 -run TestLua_VMReuseDoesNotLeakGlobals
./adapter/` → 20/20 pass
- `go test -race -count=1 -short ./adapter/...` → ok in 63s
- `make lint` clean

Pair with PR #582 (multi-exec flake) to unblock the CI pipeline on
`perf/raft-dispatcher-lanes` and `fix/idempotent-rollback-marker`.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Addressed gemini's medium (d63f92b):

  • waitForListState now accepts !exists as success when expectedLen == 0. In Redis an empty / deleted list is represented by the absence of its meta key; the previous check was failing post-DEL waits.
  • For expectedLen > 0, !exists still means "not yet applied" and keeps polling.

Verified: go test -race -count=10 -run TestRedis_MultiExec ./adapter/ 10/10 green.

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a waitForListState test helper in adapter/redis_multi_test.go to handle eventual consistency when verifying list states across nodes. It also refactors TestRedis_MultiExec_DelThenRPushRecreatesList to use this helper. Feedback was provided to add a validation check within the helper to ensure the length of expectedValues matches expectedLen, which prevents potential out-of-bounds panics during test execution.

Comment thread adapter/redis_multi_test.go Outdated
Comment on lines +33 to +35
func waitForListState(t *testing.T, n Node, key []byte, expectedLen int, expectedValues []string) {
t.Helper()
ctx := context.Background()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The waitForListState helper should verify that the length of expectedValues matches expectedLen if it is provided. This prevents a potential out-of-bounds panic at line 71 if the helper is misused in a test with mismatched arguments.

func waitForListState(t *testing.T, n Node, key []byte, expectedLen int, expectedValues []string) {
	t.Helper()
	if expectedValues != nil && len(expectedValues) != expectedLen {
		t.Fatalf("waitForListState: expectedValues length %d does not match expectedLen %d", len(expectedValues), expectedLen)
	}
	ctx := context.Background()

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Refactored waitForListState to satisfy cyclop (was 12, limit 10).

  • Split the require.Eventually body into four helpers:
    • listStateMatches — single-shot predicate
    • listMetaMatches — meta + expectedLen==0 empty/absent semantics
    • scanListItemsScanAt + length check
    • listValuesMatch — optional value equality
  • No behavior change; polling/timeouts and the expectedLen==0 branch preserved.
  • go test -race -count=10 -run TestRedis_MultiExec ./adapter/ → pass (20.7s).
  • make lint → 0 issues.

Commit: 403367e

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a polling mechanism in the Redis multi-test suite to handle eventual consistency issues where a node might lag in applying Raft commits. Specifically, it adds the waitForListState helper and several supporting functions to ensure tests wait for the expected state before making assertions. Feedback was provided to improve the robustness of the listValuesMatch helper by adding a length check to prevent potential panics when comparing slices of different sizes.

Comment on lines +89 to +99
func listValuesMatch(kvs []*store.KVPair, expectedValues []string) bool {
if expectedValues == nil {
return true
}
for i, kvp := range kvs {
if string(kvp.Value) != expectedValues[i] {
return false
}
}
return true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The listValuesMatch function is susceptible to a panic if len(kvs) is greater than len(expectedValues). While scanListItems ensures that len(kvs) == expectedLen, there is no validation that the caller-provided expectedLen matches the actual length of the expectedValues slice. Adding a length check at the beginning of the function makes the helper more robust and prevents confusing panics during test failures.

func listValuesMatch(kvs []*store.KVPair, expectedValues []string) bool {
	if expectedValues == nil {
		return true
	}
	if len(kvs) != len(expectedValues) {
		return false
	}
	for i, kvp := range kvs {
		if string(kvp.Value) != expectedValues[i] {
			return false
		}
	}
	return true
}

Fail fast in waitForListState when the caller supplies an expectedValues
slice whose length does not match expectedLen. Without this check a
misuse could cause an out-of-bounds panic inside listValuesMatch when
scanListItems succeeds with expectedLen items. The guard centralizes
the invariant at the single public entry point so listValuesMatch and
listMetaMatches stay simple.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Addressed the latest two review threads in 9e0258d:

  • waitForListState misuse guard (review 14:04 UTC) + listValuesMatch panic (review 14:08 UTC, line 99): Added a single fail-fast check at the top of waitForListState: if expectedValues != nil && len(expectedValues) != expectedLen, the helper t.Fatalfs immediately. This centralizes the invariant at the public entry point, so scanListItems's existing len(kvs) == expectedLen guarantee plus this length check means listValuesMatch cannot panic. Keeping the check in one place instead of duplicating into both leaf helpers preserves readability.

  • Rejected: empty-list / deleted-list check (review 13:46 UTC): This comment describes the pre-fix code. The current listMetaMatches (lines 60-69) already handles expectedLen == 0 with !exists || meta.Len == 0, landed in commit d63f92b before the review snapshot was taken. No change needed.

Test results on this commit:

  • go test -race -count=10 -run TestRedis_MultiExec ./adapter/ -> 10/10 pass (20.9s)
  • go test -race -count=1 -short ./adapter/... -> pass (64.9s)
  • make lint -> 0 issues

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a waitForListState helper function and associated utilities in adapter/redis_multi_test.go to handle eventual consistency during tests. These helpers poll the node's state until it reflects the expected list metadata and values, addressing a race condition where direct store scans might occur before the local node has updated its commit timestamp. The TestRedis_MultiExec_DelThenRPushRecreatesList test was refactored to use this new polling mechanism, making it more robust. I have no feedback to provide.

@bootjp bootjp merged commit 63feb7d into main Apr 22, 2026
8 checks passed
@bootjp bootjp deleted the test/fix-multiexec-flake branch April 22, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant