perf(lua): verify leader once at startTS, use snapshot reads thereafter #546
Conversation
Previously `leaderAwareGetAt` called `VerifyLeader()` on every `redis.call()` inside a Lua script (up to 3× per GET: prefixed key + legacy TTL index + bare key). With p50 = 7 calls and `VerifyLeader` at ~500 ms each, this caused 5-10 s of total latency, matching the observed Grafana data.

Changes:
- `newLuaScriptContext` calls `coordinator.VerifyLeader()` once and errors early if the node is not the leader
- `leaderAwareGetAt` refactored into `doGetAt(verify bool)`; `snapshotGetAt` added as the no-verify variant
- `readRedisStringAt` chain (`decodePrefixedString`, `readBareLegacyString`, `readLegacyTTL`) parameterised with `rawGetFn`; `readRedisStringAtSnapshot` added using `snapshotGetAt`
- Lua string reads use `readRedisStringAtSnapshot`, reducing `VerifyLeader` calls from O(redis.call count) to O(1) per script invocation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Code Review
This pull request optimizes Redis Lua script execution by performing a single leadership verification at the start of the script context, allowing subsequent reads to bypass per-call verification. Key changes include refactoring string read helpers to support snapshot reads and updating the Lua script context initialization to handle leadership checks. Feedback suggests using context.Context for the leadership verification to respect timeouts and ensuring the read timestamp is acquired after the linearizable read fence for consistency. Additionally, a variable shadowing issue was noted in the Lua script execution loop.
/gemini review
⚠️ Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
…ompat Covers the gap identified in the 5-perspective code review: `EVAL`, `EVALSHA`, and `execLuaCompat` all propagate `LinearizableRead` errors to the client as Redis error replies, and the happy path returns the expected result.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
adapter/redis_lua_context.go (1)
193-205: ⚠️ Potential issue | 🟡 Minor

Clarify why the `LinearizableRead` return value is discarded in favor of `server.readTS()`.

`LinearizableRead(ctx)` returns a Raft log index (as documented in `internal/raftengine/engine.go`), which serves as a fence confirming the FSM has applied all committed entries. However, the subsequent call to `server.readTS()` obtains the MVCC timestamp independently. While the current comment explains the reasoning, adding an explicit note that the Raft index is intentionally not used for the snapshot (since the MVCC timestamp is obtained separately after FSM application) would clarify the design for future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/redis_lua_context.go` around lines 193 - 205, The comment should explicitly state that the Raft log index returned by coordinator.LinearizableRead(ctx) is intentionally discarded because it only serves as a fence to ensure the FSM has applied committed entries; the code then reads the MVCC snapshot timestamp via server.readTS() for snapshotGetAt. Update the comment inside newLuaScriptContext to mention LinearizableRead is used solely for leader/fence verification and that startTS is derived from server.readTS() (not from the returned Raft index) before constructing luaScriptContext with startTS and readPin.
🧹 Nitpick comments (2)
adapter/distribution_server_test.go (1)
747-749: Mirror the stub's follower behavior in `LinearizableRead`.

This stub already returns `kv.ErrLeaderNotFound` from `VerifyLeader` when `leader` is false; `LinearizableRead` should do the same so future leader-check paths do not accidentally pass in follower-mode tests.

Proposed test-stub fix

```diff
 func (s *distributionCoordinatorStub) LinearizableRead(_ context.Context) (uint64, error) {
+	if !s.leader {
+		return 0, kv.ErrLeaderNotFound
+	}
 	return 0, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/distribution_server_test.go` around lines 747 - 749, The LinearizableRead method on distributionCoordinatorStub currently always returns (0, nil) which bypasses follower-mode checks; update distributionCoordinatorStub.LinearizableRead to mirror VerifyLeader's behavior by returning kv.ErrLeaderNotFound when the stub's leader flag is false (same as VerifyLeader) and otherwise perform the normal successful return, ensuring follower-mode tests fail leader-only paths consistently.

adapter/redis_lua_linearizable_read_test.go (1)
24-96: Add a test verifying the one-time leader-read contract for multi-call scripts.

These tests verify error propagation but miss the core perf regression guard. A Lua script with multiple `redis.call()` operations should verify that `LinearizableRead` is called exactly once upfront and subsequent operations do not trigger per-call `VerifyLeaderForKey` checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/redis_lua_linearizable_read_test.go` around lines 24 - 96, Add a test (e.g., TestEval_MultiCall_OneLeaderRead) that constructs a lua test server via newLuaTestServer but with a spy/fixture that increments a counter each time the LinearizableRead/VerifyLeaderForKey path is invoked, then run a multi-call script (e.g., "redis.call('GET','k1'); redis.call('GET','k2'); return 1") by calling r.eval with a recordingConn and the EVAL args, and finally assert the script returns successfully (conn.err empty) and that the spy counter == 1 to ensure VerifyLeaderForKey/LinearizableRead was performed only once for the whole script; reference newLuaTestServer, recordingConn, r.eval and the VerifyLeaderForKey/LinearizableRead hook when implementing the spy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 033d9567-7318-4457-8d0a-993c410e6017
📒 Files selected for processing (11)
- adapter/distribution_server_test.go
- adapter/dynamodb_test.go
- adapter/redis_compat_helpers.go
- adapter/redis_info_test.go
- adapter/redis_keys_pattern_test.go
- adapter/redis_lua.go
- adapter/redis_lua_context.go
- adapter/redis_lua_linearizable_read_test.go
- adapter/redis_retry_test.go
- adapter/s3_test.go
- kv/coordinator.go
```go
// readRedisStringAtSnapshot reads a string without re-verifying leadership on
// every sub-call. The caller must have already called coordinator.VerifyLeader()
// once before invoking this (e.g. at Lua script startTS acquisition time).
```
Update the snapshot-read precondition comments.
The Lua path now gates snapshot reads with LinearizableRead(ctx), not coordinator.VerifyLeader(). Keeping the old wording makes the safety contract ambiguous for future callers.
📝 Proposed comment update
```diff
-// every sub-call. The caller must have already called coordinator.VerifyLeader()
-// once before invoking this (e.g. at Lua script startTS acquisition time).
+// every sub-call. The caller must have already completed a LinearizableRead
+// before invoking this (e.g. at Lua script startTS acquisition time).
 func (r *RedisServer) readRedisStringAtSnapshot(key []byte, readTS uint64) ([]byte, *time.Time, error) {
 	return r.readRedisStringWith(key, readTS, r.snapshotGetAt)
 }
```

```diff
-// The caller must have already called coordinator.VerifyLeader() once (e.g. at
-// Lua script startTS acquisition time) before using this method.
+// The caller must have already completed a LinearizableRead (e.g. at Lua script
+// startTS acquisition time) before using this method.
 func (r *RedisServer) snapshotGetAt(key []byte, readTS uint64) ([]byte, error) {
 	return r.doGetAt(key, readTS, false)
 }
```

Also applies to: 467-469
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@adapter/redis_compat_helpers.go` around lines 383 - 385, Update the
precondition comments to reflect that snapshot reads are gated by
LinearizableRead(ctx) instead of coordinator.VerifyLeader(); specifically change
the comment above readRedisStringAtSnapshot to say the caller must have already
performed a LinearizableRead(ctx) (e.g., at Lua script startTS acquisition) and
make the same wording change in the other similar comment block around the
second snapshot-read helper (the block at the other location referenced in the
review). Ensure the revised comments mention LinearizableRead(ctx) as the
required precondition and keep the example context (Lua script startTS
acquisition).