feat(lease-read): engine-driven lease anchor via LastQuorumAck#561
Conversation
The caller-side lease (`kv.leaseState` + `refreshLeaseAfterDispatch`) sampled `time.Now()` BEFORE each slow-path `LinearizableRead` and used it as the extend base. When production `LinearizableRead` latency grew to ~1s (step-queue congestion) while `LeaseDuration` is 700ms, every extend landed ~300ms in the past -- the next read always missed the cache and the lease never warmed up, defeating the entire lease-read optimisation.

Switch the authority to the engine:

- `internal/raftengine`: add `LeaseProvider.LastQuorumAck()` returning the wall-clock instant at which a majority of followers most recently responded to the leader.
- `internal/raftengine/etcd`: implement `quorumAckTracker` and hook it from `handleStep`'s follower-response path (`MsgAppResp` / `MsgHeartbeatResp`). Reset the tracker on leader-loss transitions in `refreshStatus` so a future re-election cannot surface a stale instant. Single-node clusters short-circuit to `time.Now()`.
- `kv`: `Coordinate.LeaseRead` and `ShardedCoordinator.groupLeaseRead` consult `LastQuorumAck` FIRST, falling back to the existing caller-side lease and finally to `LinearizableRead`. The engine's lease is refreshed on every heartbeat independent of read latency, so a slow `LinearizableRead` no longer leaves the cache cold.

Safety: we record `time.Now()` when the leader observes the follower response, which is an upper bound on the follower's true ack time. `leaseSafetyMargin` (300ms) covers the resulting overshoot (one-way delay + scheduling slop + clock drift), keeping `lease_expiry` strictly inside `ack_time + electionTimeout`.

Caller-side lease machinery (`leaseState`, `leaseRefreshingTxn`, `RegisterLeaderLossCallback`) is retained for test affordance and as a secondary fast-path during rollout, but is no longer load-bearing.
/gemini review
**Warning: Rate limit exceeded.** Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 55 minutes and 39 seconds. ⌛
📝 Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Coordinator
    participant Engine
    participant AckTracker
    participant Followers
    Client->>Coordinator: LeaseRead(ctx)
    Coordinator->>Engine: State()
    Engine-->>Coordinator: StateLeader
    Coordinator->>Engine: LastQuorumAck()
    Engine->>AckTracker: load()
    AckTracker-->>Engine: lastQuorumAckTime
    Engine-->>Coordinator: lastQuorumAckTime
    alt Quorum Ack Fresh (within lease duration)
        Coordinator-->>Client: Fast-path success (appliedIndex)
    else Stale or Zero
        Coordinator->>Coordinator: LinearizableRead(ctx)
        Coordinator->>Engine: extend lease
        Coordinator-->>Client: Success
    end
    Note over Followers,AckTracker: Meanwhile - Leader receives follower responses
    Followers->>Engine: MsgAppResp / MsgHeartbeatResp
    Engine->>AckTracker: recordAck(peerID)
    AckTracker->>AckTracker: Update peer timestamp & recompute majority boundary
    AckTracker->>AckTracker: Publish quorumAckUnixNano atomically
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request implements an engine-driven lease mechanism to improve the performance and reliability of lease-based reads, particularly under network or scheduling congestion. By tracking the majority-ack instant directly within the Raft engine using follower heartbeat and append responses, the system can serve leader-local reads more efficiently. The review feedback highlights several critical improvement opportunities: ensuring time monotonicity by avoiding the conversion of timestamps to Unix nanoseconds, optimizing the ack-tracking logic to reduce memory allocations and lock contention on high-frequency paths, and properly handling cluster membership changes to prevent stale peer data from corrupting the quorum calculation.
```go
if followerQuorum <= 0 {
	return
}
now := time.Now().UnixNano()
```
Using time.Now().UnixNano() and then reconstructing the time via time.Unix(0, ns) strips the monotonic clock reading from the time.Time object. This makes subsequent time.Since(ack) calls in the lease-read path susceptible to wall-clock jumps (e.g., NTP adjustments). If the system clock is adjusted backwards, the lease could be held for longer than the safety window, which is unsafe.
Consider storing the time.Time object directly using an atomic.Pointer[time.Time] or atomic.Value to preserve monotonicity.
References
- For frequently accessed fields on hot paths, prefer atomic operations over mutexes to improve performance and maintain consistency.
```go
acks := make([]int64, 0, len(t.peerAcks))
for _, a := range t.peerAcks {
	acks = append(acks, a)
}
// Sort descending so acks[0] is the most recent. The followerQuorum-th
// entry (1-indexed) is the oldest ack among the top quorum -- i.e. the
// boundary instant by which majority liveness was confirmed.
sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
t.quorumAckUnixNano.Store(acks[followerQuorum-1])
```
Allocating a new slice and sorting it on every heartbeat or append response can be inefficient, especially in clusters with many peers. Since followerQuorum is typically very small (e.g., 1 or 2), you can find the Nth-most-recent timestamp in a single linear pass, without allocating or sorting.
```go
// See docs/lease_read_design.md for the full argument.
type quorumAckTracker struct {
	mu       sync.Mutex
	peerAcks map[uint64]int64 // peer ID → last ack unix nano observed on leader
```
The peerAcks map is never pruned of peers that have been removed from the cluster. While reset() clears it on leader loss, a long-lived leader in a cluster with frequent membership changes will accumulate stale entries. More importantly, if the cluster size shrinks, an old timestamp from a removed peer might still be among the top followerQuorum acks, potentially inflating the majority-ack instant incorrectly.
```go
e.mu.RLock()
clusterSize := len(e.peers)
state := e.status.State
e.mu.RUnlock()
```
Acquiring e.mu.RLock() on the hot lease-read path to check clusterSize and state introduces unnecessary lock contention. Since these fields are updated infrequently compared to read volume, consider storing them in atomic variables (e.g., atomic.Int32) to allow truly lock-free peeking in LastQuorumAck.
References
- For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
```go
state := c.engine.State()
if state == raftengine.StateLeader {
	if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
		return lp.AppliedIndex(), nil
	}
}
```
There is redundant state checking and lock acquisition here. c.engine.State() is called at line 376, and then lp.LastQuorumAck() is called at line 378, which internally acquires the engine's RLock again to check the same state. This doubles the synchronization overhead on the fast path. Consider refactoring LastQuorumAck to return both the timestamp and the leader status, or pass the already-acquired state into it to ensure leadership is verified correctly relative to the lock.
References
- To prevent race conditions in read-modify-write operations that require leadership, verify leadership status after acquiring the lock that protects the operation.
Actionable comments posted: 1
🧹 Nitpick comments (2)
kv/lease_read_test.go (1)
24-24: Add coverage for the new engine-driven fast path.

`fakeLeaseEngine.LastQuorumAck()` is now wired via `lastQuorumAckUnixNano`, but none of the `TestCoordinate_LeaseRead_*` cases in this file actually exercise the new primary path (`state == Leader && !ack.IsZero() && time.Since(ack) < leaseDur`). Every current test still reaches the fast path via `c.lease.extend(...)` (i.e. the secondary/caller-side lease), so the main behavioral change of this PR -- that a fresh engine ack alone returns `AppliedIndex` without touching `LinearizableRead` or `c.lease` -- is untested at the Coordinate boundary. A regression that broke the engine-driven branch would not be caught here.

Suggested additions (table-driven friendly):

- Fresh ack, cold caller-side lease → fast path, 0 `LinearizableRead` calls, `c.lease.valid(now)` still false.
- Stale ack (`time.Since(ack) >= leaseDur`), cold caller-side lease → slow path.
- Fresh ack but `state != Leader` → slow path (and `LinearizableRead` invoked).
- Same expectations replicated for `groupLeaseRead` in `sharded_coordinator_test.go`.

Example helper on `fakeLeaseEngine`:

```go
func (e *fakeLeaseEngine) setQuorumAck(t time.Time) {
	if t.IsZero() {
		e.lastQuorumAckUnixNano.Store(0)
		return
	}
	e.lastQuorumAckUnixNano.Store(t.UnixNano())
}
```

As per coding guidelines: "Add test coverage for TTL/HLC behaviors when touching clocks, OCC, or replication logic in `store/` and `kv/` packages."

Also applies to: 68-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv/lease_read_test.go` at line 24: Tests in kv/lease_read_test.go don't cover the new engine-driven fast path implemented via fakeLeaseEngine.LastQuorumAck() (backed by lastQuorumAckUnixNano), so add table-driven cases to the existing TestCoordinate_LeaseRead_* that exercise: (1) fresh engine ack with a cold caller-side lease to ensure the primary fast path returns immediately, does not call Coordinate.LinearizableRead, and leaves c.lease.valid(now) false; (2) stale ack with cold caller-side lease to ensure the slow path is taken; and (3) fresh ack but state != Leader to ensure the slow path and LinearizableRead are invoked; use a helper on fakeLeaseEngine (e.g. setQuorumAck) to set lastQuorumAckUnixNano for each case; repeat equivalent tests for groupLeaseRead in sharded_coordinator_test.go to cover both branches.

internal/raftengine/etcd/quorum_ack_test.go (1)

39-62: Test name claims more than it verifies.

`TestQuorumAckTracker_QuorumAckIsOldestOfTopN` only asserts monotonicity (`!third.Before(second)`), not that `load()` equals the boundary entry (the `followerQuorum`-th most recent ack). As written, an implementation bug that published the most recent ack instead of the oldest of the top N would still pass this test.

Consider asserting the exact value by capturing the individual ack times (e.g. record timestamps `t2`, `t3`, `t4` via `time.Now()` around each `recordAck`, then `require.Equal(t, t3.UnixNano(), tr.load().UnixNano())` since `t3` is the older of the top two after peer 4's ack lands).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/raftengine/etcd/quorum_ack_test.go` around lines 39-62: The test TestQuorumAckTracker_QuorumAckIsOldestOfTopN only checks monotonicity; fix it by capturing exact timestamps for each ack and asserting the quorum instant equals the boundary follower ack (use quorumAckTracker.recordAck to record after calling time.Now() into variables like t2, t3, t4 and then compare tr.load() to the expected boundary timestamp -- e.g. require.Equal on UnixNano between tr.load() and t3 for the case where t3 is the older of the top N); reference quorumAckTracker, recordAck, load, and followerQuorum when making the assertion so the test verifies the published instant is the followerQuorum-th most recent ack rather than merely non-regressing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/raftengine/etcd/quorum_ack.go`:
- Around line 42-82: The tracker retains removed peers in peerAcks causing stale
acks and unbounded growth; add a method quorumAckTracker.forgetPeer(peerID
uint64) that locks mu, deletes peerID from peerAcks, and (optionally) recomputes
quorumAckUnixNano or leaves recompute to the next recordAck; then call this new
forgetPeer from Engine.removePeer alongside the existing teardown so removed
peers no longer contribute to quorum computation and the map does not leak
entries; ensure symbols referenced are quorumAckTracker, peerAcks,
quorumAckUnixNano, recordAck, forgetPeer, reset, and Engine.removePeer.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cff706b6-3bee-4ea5-b080-6144b39cc565
📒 Files selected for processing (7)
- `internal/raftengine/engine.go`
- `internal/raftengine/etcd/engine.go`
- `internal/raftengine/etcd/quorum_ack.go`
- `internal/raftengine/etcd/quorum_ack_test.go`
- `kv/coordinator.go`
- `kv/lease_read_test.go`
- `kv/sharded_coordinator.go`
Code Review
This pull request introduces an engine-driven lease mechanism to improve the performance and reliability of lease-based reads, particularly under congestion. It adds a LastQuorumAck method to the LeaseProvider interface and implements a quorumAckTracker to maintain the wall-clock time of the most recent majority acknowledgment. Feedback focuses on optimizing the hot path by reducing memory allocations and redundant locking through the use of stack-allocated buffers and atomic operations. Additionally, a concern was raised regarding the handling of peer removals in the acknowledgment tracker to prevent potential lease safety violations.
```go
acks := make([]int64, 0, len(t.peerAcks))
for _, a := range t.peerAcks {
	acks = append(acks, a)
}
// Sort descending so acks[0] is the most recent. The followerQuorum-th
// entry (1-indexed) is the oldest ack among the top quorum -- i.e. the
// boundary instant by which majority liveness was confirmed.
sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
t.quorumAckUnixNano.Store(acks[followerQuorum-1])
```
This block executes inside the main Raft event loop for every MsgAppResp and MsgHeartbeatResp. Allocating a new slice and performing a reflective sort (sort.Slice) on every response can introduce significant overhead and GC pressure on high-throughput clusters. Since Raft clusters are typically small (3-7 nodes), we can use a stack-allocated buffer to avoid allocations in the common case.
```go
var buf [16]int64
acks := buf[:0]
if len(t.peerAcks) > len(buf) {
	acks = make([]int64, 0, len(t.peerAcks))
}
for _, a := range t.peerAcks {
	acks = append(acks, a)
}
// Sort descending so acks[0] is the most recent. The followerQuorum-th
// entry (1-indexed) is the oldest ack among the top quorum.
sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
t.quorumAckUnixNano.Store(acks[followerQuorum-1])
```

```go
// the strictly-safe boundary by at most the one-way delay + scheduling
// slop -- which is exactly what leaseSafetyMargin is sized to cover.
// See docs/lease_read_design.md for the full argument.
type quorumAckTracker struct {
```
The quorumAckTracker currently lacks a mechanism to handle peer removal. If a peer is removed from the cluster, its last recorded acknowledgment remains in the peerAcks map indefinitely. This could allow a leader to incorrectly satisfy the quorum requirement using a "ghost" acknowledgment from a non-existent peer, potentially violating lease safety after a configuration change. Please add a removePeer(peerID uint64) method to prune the map and ensure it is called from Engine.removePeer.
References
- To maintain consistency between the application's peer list and the Raft ConfState, ensure the peer list remains complete and consistent across configuration changes.
```go
e.mu.RLock()
clusterSize := len(e.peers)
state := e.status.State
e.mu.RUnlock()
```
Acquiring e.mu.RLock() here introduces contention on the hot lease-read path. Since the caller (Coordinate.LeaseRead) likely already checked the engine state, this lock is often redundant. To achieve a truly lock-free fast path, consider mirroring the cluster size and Raft state in atomic variables (similar to how appliedIndex is handled).
References
- For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
```go
state := c.engine.State()
if state == raftengine.StateLeader {
	if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
		return lp.AppliedIndex(), nil
	}
}
```
This logic results in redundant locking and state checks. c.engine.State() acquires a read lock, and then lp.LastQuorumAck() acquires the same lock again. Since LastQuorumAck is only useful when the node is the leader, consider refactoring this to minimize lock acquisitions on the fast path, perhaps by making LastQuorumAck lock-free using atomic operations.
References
- For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
```go
state := engine.State()
if state == raftengine.StateLeader {
	if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
		return lp.AppliedIndex(), nil
	}
}
```
Similar to Coordinate.LeaseRead, this implementation performs redundant state checks and locking. Both engine.State() and lp.LastQuorumAck() acquire the engine's RWMutex. Optimizing LastQuorumAck to be lock-free would improve performance on this hot path, consistent with repository rules regarding atomic operations.
References
- For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
Addresses review feedback on PR #561.

Safety fix (gemini HIGH): `quorumAckTracker` now exposes `removePeer`, called from `Engine.removePeer` so a shrinking-then-growing cluster cannot keep a stale ack from a removed peer falsely satisfying the majority threshold. `recomputeLocked` factored out of `recordAck` to share the sort-and-publish path.

Hot-path fix (gemini MEDIUM x3): `LastQuorumAck` is now fully lock-free. The multi-node path reads `ackTracker.quorumAckUnixNano` atomically as before; the single-node leader path reads a new `singleNodeLeaderAckUnixNano` atomic, populated from `refreshStatus` every tick while leader of a 1-node cluster and cleared otherwise. The previous implementation held `e.mu.RLock` to check peer count and state, doubling the lock traffic with the caller's `engine.State()`.

Coverage (coderabbit nitpick):

- `TestCoordinate_LeaseRead_EngineAckFastPath`: fresh ack + cold caller-side lease -> fast path, 0 `LinearizableRead` calls.
- `TestCoordinate_LeaseRead_EngineAckStaleFallsThrough`: stale ack -> slow path.
- `TestCoordinate_LeaseRead_EngineAckIgnoredWhenNotLeader`: fresh ack with non-leader state -> slow path.
- `TestQuorumAckTracker_RemovedPeerCannotSatisfyQuorum`: removed peer's ack no longer counts toward the new cluster's quorum.
- `TestQuorumAckTracker_RemovePeerZeroQuorumKeepsCurrent`: `followerQuorum=0` preserves the published instant so the next `recordAck` refreshes it.
/gemini review
**Warning:** You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Pull request overview
This PR fixes a lease-read regression by shifting the lease “anchor time” authority from the caller (pre-read time.Now() sampling) to the Raft engine, using a majority-ack observation (LastQuorumAck) derived from follower responses to keep lease reads warm even when LinearizableRead is slow.
Changes:
- Extend `raftengine.LeaseProvider` with `LastQuorumAck() time.Time` and update KV lease-read logic to prefer this engine-driven anchor before falling back to the existing caller-side lease and then `LinearizableRead`.
- Implement quorum-ack tracking in the etcd Raft engine (per-peer ack timestamps, majority "oldest-of-top-N" computation), including reset on leader loss and peer removal handling.
- Add unit tests for quorum-ack tracking and KV coordinator behavior using the new fast path.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| `internal/raftengine/engine.go` | Adds `LastQuorumAck()` to `LeaseProvider` contract and documents required safety checks. |
| `internal/raftengine/etcd/engine.go` | Implements `LastQuorumAck()` plus leader-side ack tracking, reset on leader loss, and peer-removal pruning. |
| `internal/raftengine/etcd/quorum_ack.go` | Introduces `quorumAckTracker` to compute and publish the majority-ack instant. |
| `internal/raftengine/etcd/quorum_ack_test.go` | Adds unit tests for quorum ack semantics, reset, pruning, and concurrency. |
| `kv/coordinator.go` | Makes engine-driven quorum-ack the primary `LeaseRead` fast path, preserving the old caller-side lease as secondary. |
| `kv/sharded_coordinator.go` | Mirrors the engine-driven lease-read fast path for sharded groups. |
| `kv/lease_read_test.go` | Extends the fake engine and adds tests validating the new engine-ack fast path behavior. |
Comments suppressed due to low confidence (1)
internal/raftengine/etcd/engine.go:1262
- recordQuorumAck() is called before rawNode.Step(). If Step returns etcdraft.ErrStepPeerNotFound (e.g., response from a removed/unknown peer), the ack still gets recorded and can incorrectly advance LastQuorumAck / satisfy quorum after a config change. Consider moving recordQuorumAck after a successful Step, or explicitly ignoring acks from peers not present in e.peers (and/or only recording when Step does not return ErrStepPeerNotFound).
```go
e.recordLeaderContact(msg)
e.recordQuorumAck(msg)
if err := e.rawNode.Step(msg); err != nil {
	if errors.Is(err, etcdraft.ErrStepPeerNotFound) {
		return
	}
```
```go
// when the local node is not the leader. Single-node clusters
// report time.Now() unconditionally since self is the quorum.
```
The LeaseProvider doc says single-node clusters "report time.Now() unconditionally", but the etcd engine implementation can return zero (e.g., before the first refreshStatus tick) and otherwise returns the last tick time rather than calling time.Now at read time. Please either adjust the documentation to match the actual contract (e.g., "returns a recent non-zero time while leader in a single-node cluster") or change implementations to satisfy the documented behavior.
```diff
- // when the local node is not the leader. Single-node clusters
- // report time.Now() unconditionally since self is the quorum.
+ // when the local node is not the leader. In a single-node cluster,
+ // self is the quorum, so while leader this returns a recent non-zero
+ // timestamp maintained by the engine; callers must not assume it is
+ // computed via time.Now at the instant of the call.
```
```go
linearizableErr       error
linearizableCalls     atomic.Int32
state                 atomic.Value // stores raftengine.State; default Leader
lastQuorumAckUnixNano atomic.Int64 // 0 = no ack yet. Updated by ackNow().
```
The struct field comment mentions it is "Updated by ackNow()", but there is no ackNow helper on fakeLeaseEngine (the tests use setQuorumAck). Update the comment to match the actual helper/method so the test harness stays self-explanatory.
```diff
- lastQuorumAckUnixNano atomic.Int64 // 0 = no ack yet. Updated by ackNow().
+ lastQuorumAckUnixNano atomic.Int64 // 0 = no ack yet. Updated by setQuorumAck().
```
```go
// Recorder alternates between two peer IDs so a 3-node followerQuorum
// always has at least one entry and the sort path runs.
go func() {
	defer wg.Done()
	for {
		select {
		case <-stop:
			return
		default:
			tr.recordAck(2, 1)
			tr.recordAck(3, 1)
		}
	}
}()
go func() {
	defer wg.Done()
	for {
		select {
		case <-stop:
			return
		default:
			_ = tr.load()
		}
	}
```
TestQuorumAckTracker_ConcurrentRecordAndLoad spins in tight loops with a select default in two goroutines. In parallel test runs this can burn CPU and cause flakiness under constrained CI. Consider adding a small backoff (e.g., time.Sleep / runtime.Gosched) or using a time-bounded loop with a ticker to reduce busy-waiting while still exercising concurrent access.
Addresses copilot review on PR #561.

`recordQuorumAck` ran before `rawNode.Step`, so a late `MsgAppResp` / `MsgHeartbeatResp` from a just-removed peer (which `rawNode.Step` would reject with `ErrStepPeerNotFound`) still landed an ack in the tracker -- resurrecting the "ghost" entry that `removePeer` just pruned. Filter on `e.peers` membership before recording; the event loop is the sole writer to `e.peers`, so the map read is race-free.
/gemini review
**Warning:** You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
```go
state := c.engine.State()
if state == raftengine.StateLeader {
	if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
		return lp.AppliedIndex(), nil
	}
```
The engine-driven fast path uses time.Since(ack) < leaseDur. Because LastQuorumAck() is reconstructed from a unix-nano value (no monotonic component), a backwards wall-clock adjustment can make time.Since(ack) negative and incorrectly treat a very old ack as "fresh". To keep the lease safety argument intact, guard against ack.After(time.Now()) (treat as stale/force slow path), or compare using now := time.Now(); elapsed := now.Sub(ack); if elapsed < 0 { ... } before applying the duration check.
```go
// The lease is maintained inside the engine from ongoing
// MsgAppResp / MsgHeartbeatResp traffic, so callers do not sample
// time.Now() before the slow path to "extend" a lease afterwards.
// That earlier pre-read sampling was racy under congestion: if a
// LinearizableRead took longer than LeaseDuration, the extension
// would land already expired and the lease never warmed up. The
// engine-driven anchor is refreshed every heartbeat independent of
// read latency.
//
```
The doc comment says callers "do not sample time.Now() before the slow path to extend a lease afterwards", but this function still samples now := time.Now() before LinearizableRead and uses it to extend the secondary caller-side lease (c.lease.extend(now.Add(leaseDur), ...)). Please adjust the comment to clarify that only the primary path is engine-driven, while the legacy caller-side lease (with pre-read sampling) remains as a secondary fallback during rollout.
```diff
- // The lease is maintained inside the engine from ongoing
- // MsgAppResp / MsgHeartbeatResp traffic, so callers do not sample
- // time.Now() before the slow path to "extend" a lease afterwards.
- // That earlier pre-read sampling was racy under congestion: if a
- // LinearizableRead took longer than LeaseDuration, the extension
- // would land already expired and the lease never warmed up. The
- // engine-driven anchor is refreshed every heartbeat independent of
- // read latency.
- //
+ // The primary lease path is maintained inside the engine from ongoing
+ // MsgAppResp / MsgHeartbeatResp traffic, so that path does not rely on
+ // callers sampling time.Now() before the slow path to "extend" a lease
+ // afterwards. That earlier pre-read sampling was racy under
+ // congestion: if a LinearizableRead took longer than LeaseDuration,
+ // the extension would land already expired and the lease never warmed
+ // up. The engine-driven anchor is refreshed every heartbeat
+ // independent of read latency.
+ //
+ // During rollout, a legacy caller-side lease extension remains as a
+ // secondary fallback after a successful slow-path read, including the
+ // older pre-read time sampling used to anchor that fallback window.
+ //
```
```go
// Engine-driven lease anchor -- see Coordinate.LeaseRead for why
// this is the primary check.
state := engine.State()
if state == raftengine.StateLeader {
	if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
		return lp.AppliedIndex(), nil
	}
}
now := time.Now()
```
Same wall-clock skew issue as in Coordinate.LeaseRead: if LastQuorumAck() is in the future relative to time.Now() (e.g. due to a backwards clock step), time.Since(ack) becomes negative and this fast path can incorrectly accept a stale quorum ack. Add a guard to treat ack.After(now) as invalid/stale before applying the duration check.
```diff
- // Engine-driven lease anchor -- see Coordinate.LeaseRead for why
- // this is the primary check.
- state := engine.State()
- if state == raftengine.StateLeader {
- 	if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
- 		return lp.AppliedIndex(), nil
- 	}
- }
- now := time.Now()
+ now := time.Now()
+ // Engine-driven lease anchor -- see Coordinate.LeaseRead for why
+ // this is the primary check.
+ state := engine.State()
+ if state == raftengine.StateLeader {
+ 	if ack := lp.LastQuorumAck(); !ack.IsZero() && !ack.After(now) && now.Sub(ack) < leaseDur {
+ 		return lp.AppliedIndex(), nil
+ 	}
+ }
```
| // Safety: callers must verify the lease with | ||
| // time.Since(LastQuorumAck()) < LeaseDuration() && | ||
| // engine.State() == raftengine.StateLeader | ||
| // before serving a leader-local read. The LeaseDuration is bounded | ||
| // by electionTimeout - safety_margin, which guarantees that any new | ||
| // leader candidate cannot yet accept writes during that window. |
The interface contract recommends time.Since(LastQuorumAck()) < LeaseDuration() as the safety check. Since implementations may return a time.Time reconstructed from unix nanos (no monotonic clock), a backwards wall-clock adjustment can make time.Since(...) negative and incorrectly treat an old ack as fresh. Consider updating the contract to require treating LastQuorumAck() values later than time.Now() as invalid/stale (or otherwise specify a monotonic-safe comparison).
| // Safety: callers must verify the lease with | |
| // time.Since(LastQuorumAck()) < LeaseDuration() && | |
| // engine.State() == raftengine.StateLeader | |
| // before serving a leader-local read. The LeaseDuration is bounded | |
| // by electionTimeout - safety_margin, which guarantees that any new | |
| // leader candidate cannot yet accept writes during that window. | |
| // Safety: callers must verify the lease against a single `now := time.Now()` | |
| // and MUST treat any LastQuorumAck() later than `now` as invalid/stale | |
| // before serving a leader-local read, e.g. | |
| // ack := LastQuorumAck() | |
| // now := time.Now() | |
| // !ack.IsZero() && | |
| // !ack.After(now) && | |
| // now.Sub(ack) < LeaseDuration() && | |
| // engine.State() == raftengine.StateLeader | |
| // The LeaseDuration is bounded by electionTimeout - safety_margin, | |
| // which guarantees that any new leader candidate cannot yet accept | |
| // writes during that window. |
Addresses copilot review on PR #561. LastQuorumAck() returns a time reconstructed from UnixNano, which has no monotonic component. A backwards wall-clock step can leave now.Sub(ack) negative and let a stale ack pass the `<` duration comparison -- a lease-safety violation. Three related changes: 1. Coordinate.LeaseRead / groupLeaseRead: sample now := time.Now() once, then validate the engine-driven ack through a new engineLeaseAckValid helper that requires state==Leader, ack non-zero, !ack.After(now), and now.Sub(ack) < leaseDur. Treating future-dated acks as invalid forces the slow path instead of trusting a negative elapsed duration. Factoring the check into a helper keeps both call sites under the cyclomatic-complexity cap. 2. raftengine.LeaseProvider.LastQuorumAck doc: specify the skew-safe check (single-now sample + !ack.After(now)) so every implementation and every caller follows the same contract. 3. Coordinate.LeaseRead doc refreshed to clarify that only the PRIMARY engine-driven path skips pre-read sampling; the SECONDARY caller-side lease fallback still uses now.Add(leaseDur) during rollout.
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
| func (t *quorumAckTracker) removePeer(peerID uint64, followerQuorum int) { | ||
| t.mu.Lock() | ||
| defer t.mu.Unlock() | ||
| if _, ok := t.peerAcks[peerID]; !ok { | ||
| return | ||
| } | ||
| delete(t.peerAcks, peerID) | ||
| if followerQuorum <= 0 { | ||
| return | ||
| } | ||
| t.recomputeLocked(followerQuorum) |
In quorumAckTracker.removePeer, returning early when peerID has no recorded entry skips recomputing quorumAckUnixNano. If membership changes reduce followerQuorum, the tracker can unnecessarily keep quorumAck at 0 (or stale) until the next recordAck, even though the remaining recorded peers may already satisfy the new quorum. Consider always recomputing when followerQuorum > 0 (and delete() on a nil/missing entry is safe), or providing an explicit recompute-on-config-change path.
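A self-contained sketch of the suggested fix, using illustrative field names rather than the real tracker's. Since `delete` on a missing key is a no-op, the unconditional recompute is safe and lets a reduced `followerQuorum` republish immediately:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// tracker is a minimal stand-in for quorumAckTracker.
type tracker struct {
	mu                sync.Mutex
	peerAcks          map[uint64]int64
	quorumAckUnixNano atomic.Int64
}

func (t *tracker) removePeer(peerID uint64, followerQuorum int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.peerAcks, peerID) // safe even when peerID was never recorded
	if followerQuorum <= 0 {
		return
	}
	t.recomputeLocked(followerQuorum) // runs even if peerID had no entry
}

func (t *tracker) recomputeLocked(followerQuorum int) {
	if len(t.peerAcks) < followerQuorum {
		t.quorumAckUnixNano.Store(0)
		return
	}
	acks := make([]int64, 0, len(t.peerAcks))
	for _, a := range t.peerAcks {
		acks = append(acks, a)
	}
	sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
	t.quorumAckUnixNano.Store(acks[followerQuorum-1])
}

func main() {
	tr := &tracker{peerAcks: map[uint64]int64{2: 100, 3: 200}}
	// Peer 4 never acked; removing it with the reduced quorum (1) still
	// recomputes and publishes a quorum instant from the remaining peers.
	tr.removePeer(4, 1)
	fmt.Println(tr.quorumAckUnixNano.Load()) // 200
}
```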
| // when the local node is not the leader. Single-node clusters | ||
| // report time.Now() unconditionally since self is the quorum. |
LeaseProvider.LastQuorumAck docs say "Single-node clusters report time.Now() unconditionally since self is the quorum" while also stating the method returns zero when the local node is not leader. In practice the etcd engine only publishes a non-zero value while leader (and updates it periodically), so this wording is misleading/contradictory. Please tighten the contract wording (e.g., "single-node leaders may return a recent time.Now()"), so callers/other implementations don't interpret it as returning non-zero when not leader.
| // when the local node is not the leader. Single-node clusters | |
| // report time.Now() unconditionally since self is the quorum. | |
| // when the local node is not the leader. Single-node leaders may | |
| // return a recent time.Now() value since self is the quorum. |
Addresses copilot review on PR #561. - quorumAckTracker.removePeer no longer early-returns when peerID has no recorded entry. A shrink that reduces followerQuorum may let the remaining peers now satisfy the smaller threshold, and without an explicit recompute the published instant stays stale (or zero) until the next recordAck. delete() is safe on a missing key, so the unconditional path is harmless otherwise. - LeaseProvider.LastQuorumAck doc: the previous wording said "single-node clusters report time.Now() unconditionally" which contradicted the "zero when not leader" clause. Tighten to "single-node leaders may return a recent time.Now()" so callers and alternate engines don't interpret it as returning non-zero on a non-leader.
| func (e *Engine) LastQuorumAck() time.Time { | ||
| if e == nil { | ||
| return time.Time{} | ||
| } | ||
| if ns := e.singleNodeLeaderAckUnixNano.Load(); ns != 0 { | ||
| return time.Unix(0, ns) | ||
| } | ||
| return e.ackTracker.load() | ||
| } |
LastQuorumAck() currently returns ackTracker.load() unconditionally for multi-node clusters. That can return a non-zero time even when the node is follower/candidate (e.g., late MsgAppResp/MsgHeartbeatResp arriving after a step-down can repopulate ackTracker), which violates the LeaseProvider contract (“zero time when ... not the leader”) and can also leak stale quorum acks across re-elections. Consider gating the return value on leader state (or maintaining an atomic leader flag) so non-leaders always observe time.Time{} for this API.
| if !isFollowerResponse(msg.Type) { | ||
| return | ||
| } | ||
| if msg.From == 0 || msg.From == e.nodeID { | ||
| return | ||
| } | ||
| // Reject acks from peers not in the current membership. Without | ||
| // this filter, a late MsgAppResp from a just-removed peer (which | ||
| // rawNode.Step will immediately reject with ErrStepPeerNotFound) | ||
| // would still land an ack in the tracker -- resurrecting the | ||
| // "ghost" entry that removePeer just pruned. Since we run on the | ||
| // event-loop goroutine (the sole writer to e.peers), the map read | ||
| // here is race-free. | ||
| if _, ok := e.peers[msg.From]; !ok { | ||
| return | ||
| } | ||
| clusterSize := len(e.peers) | ||
| if clusterSize <= 1 { | ||
| return | ||
| } |
recordQuorumAck updates the tracker for follower-response messages without checking that the local node is currently leader. After a leader-loss transition (where ackTracker.reset() runs), late follower responses can still arrive and repopulate ackTracker while we are follower/candidate, making LastQuorumAck() non-zero and potentially reusing stale liveness on a later re-election. Suggest adding a leader-state gate here (or otherwise ensuring the tracker cannot be updated outside leader role).
| time.Sleep(2 * time.Millisecond) | ||
| tr.recordAck(3, 2) | ||
| second := tr.load() | ||
| require.False(t, second.IsZero()) | ||
|
|
||
| // Now peer 4 acks with a later timestamp. Quorum ack should still | ||
| // be the older of the top two (either 2 or 3, not 4) because the | ||
| // 5-node quorum is 3 including self (2 followers + self), and the | ||
| // OLDEST of the top two followers is still the limiting factor. | ||
| time.Sleep(2 * time.Millisecond) |
These time.Sleep(2 * time.Millisecond) calls aren’t required for the assertions (the test only checks non-zero/non-regression) and can add flakiness/latency to the test suite on slow CI. Consider removing them or rewriting the tracker to accept an injected timestamp in tests so ordering can be controlled deterministically.
| time.Sleep(2 * time.Millisecond) | |
| tr.recordAck(3, 2) | |
| second := tr.load() | |
| require.False(t, second.IsZero()) | |
| // Now peer 4 acks with a later timestamp. Quorum ack should still | |
| // be the older of the top two (either 2 or 3, not 4) because the | |
| // 5-node quorum is 3 including self (2 followers + self), and the | |
| // OLDEST of the top two followers is still the limiting factor. | |
| time.Sleep(2 * time.Millisecond) | |
| tr.recordAck(3, 2) | |
| second := tr.load() | |
| require.False(t, second.IsZero()) | |
| // Now peer 4 acks. Quorum ack should still be the older of the top | |
| // two (either 2 or 3, not 4) because the 5-node quorum is 3 | |
| // including self (2 followers + self), and the OLDEST of the top | |
| // two followers is still the limiting factor. |
Addresses copilot review on PR #561. A transport-level MsgAppResp / MsgHeartbeatResp can land at handleStep shortly after a step-down -- ackTracker.reset() has already fired, but the response is already in flight. Previously recordQuorumAck would admit the late ack and repopulate the tracker, and LastQuorumAck would surface the resulting non-zero instant. A subsequent re-election would then observe stale liveness from the prior term, violating the LeaseProvider contract ("zero time when the local node is not the leader") and potentially widening an already-expired lease window. Fix: atomic isLeader mirror on the Engine. - refreshStatus writes it every tick before the other lease-related atomics, so every role transition is observed before the next handleStep runs. - recordQuorumAck loads it and drops the response on the floor when not leader. Prevents the tracker from being repopulated after reset() and before the next refreshStatus pass. - LastQuorumAck returns time.Time{} when not leader, honoring the interface contract and covering the narrow window between recordQuorumAck's check and ack-tracker publication. Also drop the flaky 2ms time.Sleep calls from TestQuorumAckTracker_QuorumAckIsOldestOfTopN: the require.False(third.Before(second)) assertion holds trivially when timestamps are equal, so the sleeps added latency without exercising new behaviour and could flake on slow CI.
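The atomic leader mirror described above can be sketched as follows. Field and method names are simplified stand-ins for the real Engine; the point is that both the write path and the read path gate on the same atomic flag:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// engine is a stand-in for the etcd engine's lease-related state.
type engine struct {
	isLeader          atomic.Bool  // mirror written by refreshStatus each tick
	quorumAckUnixNano atomic.Int64 // published quorum-ack instant
}

// recordQuorumAck drops late follower responses while not leader, so a
// step-down cannot be followed by the tracker being repopulated.
func (e *engine) recordQuorumAck(nowUnixNano int64) {
	if !e.isLeader.Load() {
		return // late MsgAppResp after a step-down: ignore
	}
	e.quorumAckUnixNano.Store(nowUnixNano)
}

// LastQuorumAck returns the zero time for non-leaders, honouring the
// LeaseProvider contract even in the narrow window between the write-side
// check and publication.
func (e *engine) LastQuorumAck() time.Time {
	if !e.isLeader.Load() {
		return time.Time{}
	}
	ns := e.quorumAckUnixNano.Load()
	if ns == 0 {
		return time.Time{}
	}
	return time.Unix(0, ns)
}

func main() {
	e := &engine{}
	e.isLeader.Store(true)
	e.recordQuorumAck(time.Now().UnixNano())
	fmt.Println(e.LastQuorumAck().IsZero()) // false: leader with a fresh ack

	e.isLeader.Store(false) // step-down
	e.recordQuorumAck(time.Now().UnixNano())
	fmt.Println(e.LastQuorumAck().IsZero()) // true: non-leaders see zero
}
```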
| // Recorder alternates between two peer IDs so a 3-node followerQuorum | ||
| // always has at least one entry and the sort path runs. | ||
| go func() { | ||
| defer wg.Done() | ||
| for { | ||
| select { | ||
| case <-stop: | ||
| return | ||
| default: | ||
| tr.recordAck(2, 1) | ||
| tr.recordAck(3, 1) | ||
| } | ||
| } | ||
| }() | ||
| go func() { | ||
| defer wg.Done() | ||
| for { | ||
| select { | ||
| case <-stop: | ||
| return | ||
| default: | ||
| _ = tr.load() | ||
| } | ||
| } | ||
| }() |
This concurrency test uses two goroutines with tight default: loops and no backoff, which can busy-spin and consume significant CPU on CI (especially with t.Parallel()). Consider adding a small sleep/Gosched in the loops or using a time-bounded context/ticker-driven loop so the test still exercises concurrency without pegging a core.
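One way to keep the concurrency coverage without a busy-spin, sketched with an atomic counter standing in for `tr.recordAck` / `tr.load` (the `runConcurrent` name and threshold are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runConcurrent runs a writer and a reader goroutine that yield with
// runtime.Gosched() between iterations, so the loops still interleave
// under the race detector without pegging a CI core.
func runConcurrent(iters int64) int64 {
	var (
		stop  atomic.Bool
		wg    sync.WaitGroup
		value atomic.Int64
	)
	wg.Add(2)
	go func() { // writer, analogous to tr.recordAck
		defer wg.Done()
		for !stop.Load() {
			value.Add(1)
			runtime.Gosched() // yield instead of busy-spinning
		}
	}()
	go func() { // reader, analogous to tr.load
		defer wg.Done()
		for !stop.Load() {
			_ = value.Load()
			runtime.Gosched()
		}
	}()
	for value.Load() < iters {
		runtime.Gosched()
	}
	stop.Store(true)
	wg.Wait()
	return value.Load()
}

func main() {
	fmt.Println(runConcurrent(1000) >= 1000) // true
}
```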
| func (t *quorumAckTracker) recomputeLocked(followerQuorum int) { | ||
| if len(t.peerAcks) < followerQuorum { | ||
| // Not enough peers have reported to form a majority yet. | ||
| t.quorumAckUnixNano.Store(0) | ||
| return | ||
| } | ||
| acks := make([]int64, 0, len(t.peerAcks)) | ||
| for _, a := range t.peerAcks { | ||
| acks = append(acks, a) | ||
| } | ||
| // Sort descending so acks[0] is the most recent. The followerQuorum-th | ||
| // entry (1-indexed) is the oldest ack among the top quorum -- i.e. | ||
| // the boundary instant by which majority liveness was confirmed. | ||
| sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] }) | ||
| t.quorumAckUnixNano.Store(acks[followerQuorum-1]) | ||
| } |
quorumAckTracker.recomputeLocked allocates a fresh slice and sorts all peer ack timestamps on every follower response. Since recordAck is called for each MsgAppResp / MsgHeartbeatResp, this can become a hot/allocating path as cluster size or heartbeat frequency grows. Consider reducing allocations (e.g., reuse a buffer) and computing the Nth-most-recent timestamp without a full sort.
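The buffer-reuse half of the suggestion can be sketched as below (illustrative field names; the published atomic is replaced by a plain field for brevity). Re-slicing to zero length keeps the backing array, so once it has grown to the cluster size the steady-state allocation rate is zero:

```go
package main

import (
	"fmt"
	"sort"
)

// quorumTracker is a stand-in illustrating scratch-buffer reuse in the
// hot recompute path.
type quorumTracker struct {
	peerAcks map[uint64]int64
	ackBuf   []int64 // reused across recomputes; grows once, then stable
	quorum   int64   // published Nth-most-recent ack (stand-in for the atomic)
}

func (t *quorumTracker) recompute(followerQuorum int) {
	if len(t.peerAcks) < followerQuorum {
		t.quorum = 0
		return
	}
	t.ackBuf = t.ackBuf[:0] // reuse the backing array, no fresh allocation
	for _, a := range t.peerAcks {
		t.ackBuf = append(t.ackBuf, a)
	}
	sort.Slice(t.ackBuf, func(i, j int) bool { return t.ackBuf[i] > t.ackBuf[j] })
	t.quorum = t.ackBuf[followerQuorum-1]
}

func main() {
	t := &quorumTracker{peerAcks: map[uint64]int64{2: 300, 3: 100, 4: 200}}
	t.recompute(2) // top two are 300 and 200; the oldest of those is 200
	fmt.Println(t.quorum) // 200
}
```

Avoiding the full sort (e.g. a selection of the Nth element) is a further option, but for typical cluster sizes the allocation, not the O(n log n) sort, dominates.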
| if postRemovalClusterSize > 1 { | ||
| followerQuorum = postRemovalClusterSize / 2 //nolint:mnd | ||
| } | ||
| e.ackTracker.removePeer(nodeID, followerQuorum) |
ackTracker is only recomputed on peer removal. When postRemovalClusterSize <= 1, followerQuorum is passed as 0, and quorumAckTracker.removePeer intentionally retains the previously published quorum instant. If the cluster later grows (or any config change increases the required follower quorum) before new follower responses arrive, LastQuorumAck can expose a stale non-zero instant and incorrectly allow lease-read fast paths under the new membership. Consider clearing/resetting the published quorum instant (or recomputing against the new followerQuorum) on membership changes that reduce quorum to 0 and on additions/increases to the follower quorum, so a new configuration cannot inherit a quorum-ack from the prior one.
| e.ackTracker.removePeer(nodeID, followerQuorum) | |
| ackTrackerQuorum := followerQuorum | |
| if ackTrackerQuorum == 0 { | |
| // Force the tracker to recompute/clear any previously published | |
| // quorum instant instead of retaining stale state from the old | |
| // configuration when the cluster shrinks to a single node. | |
| ackTrackerQuorum = 1 | |
| } | |
| e.ackTracker.removePeer(nodeID, ackTrackerQuorum) |
… sort buffer Addresses copilot review on PR #561. - engine.removePeer: when the cluster shrinks to a single node the prior code passed followerQuorum=0 to quorumAckTracker.removePeer, which by design preserves the currently published instant so the next recordAck can refresh it. But if the cluster subsequently grew back without a recordAck arriving first, LastQuorumAck's multi-node fallback would surface the stale instant from the previous configuration. Route shrink-to-<=1 through ackTracker.reset() so any future multi-node membership starts fresh. - quorumAckTracker.recomputeLocked: reuse a per-tracker ackBuf across calls instead of allocating a fresh []int64 on every recordAck. Heartbeat frequency × cluster size keeps the allocation rate non-trivial otherwise; re-slicing in place + append-on-growth brings the steady-state alloc to zero. - quorum_ack_test concurrency loops now call runtime.Gosched() between iterations so the test exercises interleavings without pegging a CI core.
| func (e *fakeLeaseEngine) LastQuorumAck() time.Time { | ||
| ns := e.lastQuorumAckUnixNano.Load() | ||
| if ns == 0 { | ||
| return time.Time{} | ||
| } | ||
| return time.Unix(0, ns) | ||
| } |
fakeLeaseEngine.LastQuorumAck() doesn't honor the raftengine.LeaseProvider contract that non-leaders return a zero time. As written it can return a non-zero ack even when e.State() is follower, which could let tests accidentally pass if production code ever stops checking state before using LastQuorumAck. Consider gating on e.State()==raftengine.StateLeader (or the atomic state value) and returning time.Time{} otherwise.
| // Followers needed for majority = floor(clusterSize / 2): 1 for a | ||
| // 3-node cluster, 2 for 5-node, matching raft quorum semantics. | ||
| followerQuorum := clusterSize / 2 //nolint:mnd | ||
| e.ackTracker.recordAck(msg.From, followerQuorum) |
New //nolint:mnd on the quorum calculation looks avoidable here. To keep the code self-explanatory without suppressing the linter, consider introducing a small helper (e.g., followerQuorumForClusterSize) or a named constant for the divisor, and reuse it in both recordQuorumAck and removePeer.
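The suggested helper is small; a sketch using the `followerQuorumForClusterSize` name proposed above:

```go
package main

import "fmt"

// followerQuorumForClusterSize centralises the floor(clusterSize/2)
// formula: followers needed for a raft majority once the leader counts
// itself -- 1 for a 3-node cluster, 2 for 5-node.
func followerQuorumForClusterSize(clusterSize int) int {
	return clusterSize / 2
}

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		fmt.Println(n, followerQuorumForClusterSize(n))
	}
}
```

Documenting the formula once in the helper also lets any lint suppression live in a single place instead of at every call site.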
| if postRemovalClusterSize <= 1 { | ||
| e.ackTracker.reset() | ||
| } else { | ||
| e.ackTracker.removePeer(nodeID, postRemovalClusterSize/2) //nolint:mnd | ||
| } |
New //nolint:mnd on the quorum calculation looks avoidable here as well. Consider reusing the same named helper/constant used in recordQuorumAck so the quorum math is centralized and the linter suppression isn't needed.
…der state Addresses copilot review on PR #561. - followerQuorumForClusterSize helper centralises the cluster/2 formula so recordQuorumAck and removePeer share a single point of truth. The //nolint:mnd suppression now lives only in the helper where the formula is documented in context, rather than sprinkled at each call site. - fakeLeaseEngine.LastQuorumAck now gates on State() == StateLeader, matching the production etcd engine and the LeaseProvider contract ("zero time when the local node is not the leader"). Without this mirror, a test could set a fresh ack plus a non-leader state and still take the fast path, masking a regression if production code ever stopped gating on engine.State() before consulting LastQuorumAck.
| // Single time.Now() sample so primary/secondary/extension all see | ||
| // the same instant. Clock-skew safety delegated to | ||
| // engineLeaseAckValid (see Coordinate.LeaseRead). | ||
| now := time.Now() | ||
| state := engine.State() | ||
| if engineLeaseAckValid(state, lp.LastQuorumAck(), now, leaseDur) { |
now is taken before LastQuorumAck(). If the engine updates the quorum-ack timestamp between these two reads, ack.After(now) can become true and the primary fast path will be skipped even though the ack is fresh. Consider reading ack first and then sampling now (reused for secondary/extend) to avoid spurious slow-path reads.
| // Single time.Now() sample so primary/secondary/extension all see | |
| // the same instant. Clock-skew safety delegated to | |
| // engineLeaseAckValid (see Coordinate.LeaseRead). | |
| now := time.Now() | |
| state := engine.State() | |
| if engineLeaseAckValid(state, lp.LastQuorumAck(), now, leaseDur) { | |
| // Read the quorum ack before sampling time so a concurrent ack | |
| // update cannot appear to be in the future relative to now and | |
| // spuriously disable the primary fast path. Reuse the same now | |
| // sample for primary/secondary/extension checks. | |
| ack := lp.LastQuorumAck() | |
| now := time.Now() | |
| state := engine.State() | |
| if engineLeaseAckValid(state, ack, now, leaseDur) { |
| // Publish leader state atomically so recordQuorumAck / LastQuorumAck | ||
| // can gate on it without acquiring e.mu. MUST run before the | ||
| // single-node ack store below, otherwise a brand-new leader tick | ||
| // could publish a ack instant while isLeader is still false. |
Minor grammar nit in the comment: "a ack instant" should be "an ack instant".
| // could publish a ack instant while isLeader is still false. | |
| // could publish an ack instant while isLeader is still false. |
Resolve conflicts from #561 (engine-driven lease) landing: - internal/raftengine/etcd/engine.go: keep both accessor groups -- LastQuorumAck from #561 and the DispatchDropCount / DispatchErrorCount / StepQueueFullCount accessors this branch introduced for the monitoring DispatchCollector. - kv/coordinator.go: the secondary caller-side lease fast path now consults the locally-cached `state` from the engineLeaseAckValid call instead of re-reading engine.State(), and observes the hit via c.leaseObserver. Also wire the engine-driven primary fast path through the observer so the lease-hit ratio panel counts every fast-path hit regardless of which tier produced it. Extract a one-line observeLeaseRead helper to keep LeaseRead under the cyclomatic-complexity cap after the extra branch. - kv/sharded_coordinator.go: mirror the Coordinate.LeaseRead restructure for groupLeaseRead -- use the shared `state`, observe hits on both tiers, and share an observeLeaseRead package helper to keep the function body under the cyclomatic-complexity cap.
Summary
Fixes the root cause of the production lease-read regression: under step-queue congestion, `LinearizableRead` took ~1s, exceeding the 700ms `LeaseDuration`. The caller-side lease sampled `time.Now()` BEFORE the slow read and used it as the extend base, so every extend landed ~300ms in the past and the next read always missed the cache. The lease never warmed up; the optimisation was effectively dead under load.

Switch the authority from caller-driven to engine-driven.

- `raftengine.LeaseProvider`: add `LastQuorumAck() time.Time`.
- etcd engine: `quorumAckTracker` records per-peer last-response time from `MsgAppResp` / `MsgHeartbeatResp` in `handleStep`. `LastQuorumAck()` returns the oldest ack among the top-`followerQuorum` peers, which is the wall-clock instant by which majority liveness was confirmed. Reset on leader-loss transitions so a re-election cannot surface a stale anchor. Single-node clusters short-circuit to `time.Now()`.
- kv: `Coordinate.LeaseRead` / `groupLeaseRead` consult `LastQuorumAck` first, fall back to the existing caller-side lease (preserved as a secondary fast path during rollout), and finally to `LinearizableRead`. The engine refreshes its anchor on every heartbeat independent of read latency, so a slow `LinearizableRead` no longer leaves the cache cold.

Safety

We record `time.Now()` when the leader observes the follower response, which is an upper bound on the follower's true ack time. `leaseSafetyMargin` (300ms) covers the resulting overshoot (one-way delay + scheduling slop + clock drift), keeping `lease_expiry` strictly inside `follower_ack_time + electionTimeout`.

Test plan

- `TestQuorumAckTracker_*` covers single-node / majority threshold / oldest-of-top-N semantics / reset / concurrent access
- `go test -race ./kv/... ./internal/raftengine/...` passes
- `linearizable_read_calls` rate drops while GET p99 stays low

Summary by CodeRabbit