feat(lease-read): engine-driven lease anchor via LastQuorumAck #561

Merged
bootjp merged 9 commits into main from feat/engine-driven-lease on Apr 20, 2026

Conversation


@bootjp bootjp commented Apr 20, 2026

Summary

Fixes the root cause of the production lease-read regression: under step-queue congestion, LinearizableRead took ~1s, exceeding the 700ms LeaseDuration. The caller-side lease sampled time.Now() BEFORE the slow read and used it as the extend base, so every extend landed ~300ms in the past and the next read always missed the cache. The lease never warmed up; the optimisation was effectively dead under load.

Switch the authority from caller-driven to engine-driven.

  • raftengine.LeaseProvider: add LastQuorumAck() time.Time.
  • etcd engine: quorumAckTracker records per-peer last-response time from MsgAppResp / MsgHeartbeatResp in handleStep. LastQuorumAck() returns the oldest ack among the top-followerQuorum peers, which is the wall-clock instant by which majority liveness was confirmed. Reset on leader-loss transitions so a re-election cannot surface a stale anchor. Single-node clusters short-circuit to time.Now().
  • kv: Coordinate.LeaseRead / groupLeaseRead consult LastQuorumAck first, fall back to the existing caller-side lease (preserved as a secondary fast path during rollout), and finally to LinearizableRead. The engine refreshes its anchor on every heartbeat independent of read latency, so a slow LinearizableRead no longer leaves the cache cold.

Safety

We record time.Now() when the leader observes the follower response, which is an upper bound on the follower's true ack time. leaseSafetyMargin (300ms) covers the resulting overshoot (one-way delay + scheduling slop + clock drift), keeping lease_expiry strictly inside follower_ack_time + electionTimeout.

Test plan

  • TestQuorumAckTracker_* covers single-node / majority threshold / oldest-of-top-N semantics / reset / concurrent access
  • go test -race ./kv/... ./internal/raftengine/... passes
  • Existing lease tests continue to work because the secondary (caller-side) path remains available
  • Production: deploy and verify GET on a write-idle key stays fast across >1s slow reads (key signal: linearizable_read_calls rate drops while GET p99 stays low)

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced lease read consistency by tracking quorum acknowledgment freshness, replacing previous time-and-generation-based validation for improved reliability during high concurrency.

The caller-side lease (kv.leaseState + refreshLeaseAfterDispatch)
sampled time.Now() BEFORE each slow-path LinearizableRead and used it
as the extend base. When production LinearizableRead latency grew to
~1s (step-queue congestion) while LeaseDuration is 700ms, every
extend landed ~300ms in the past -- the next read always missed the
cache and the lease never warmed up, defeating the entire lease-read
optimisation.

Switch the authority to the engine:

- internal/raftengine: add LeaseProvider.LastQuorumAck() returning
  the wall-clock instant at which a majority of followers most
  recently responded to the leader.
- internal/raftengine/etcd: implement quorumAckTracker and hook it
  from handleStep's follower-response path (MsgAppResp /
  MsgHeartbeatResp). Reset the tracker on leader-loss transitions in
  refreshStatus so a future re-election cannot surface a stale
  instant. Single-node clusters short-circuit to time.Now().
- kv: Coordinate.LeaseRead and ShardedCoordinator.groupLeaseRead
  consult LastQuorumAck FIRST, falling back to the existing
  caller-side lease and finally to LinearizableRead. The engine's
  lease is refreshed on every heartbeat independent of read latency,
  so a slow LinearizableRead no longer leaves the cache cold.

Safety: we record time.Now() when the leader observes the follower
response, which is an upper bound on the follower's true ack time.
leaseSafetyMargin (300ms) covers the resulting overshoot (one-way
delay + scheduling slop + clock drift), keeping lease_expiry strictly
inside ack_time + electionTimeout.

Caller-side lease machinery (leaseState, leaseRefreshingTxn,
RegisterLeaderLossCallback) is retained for test affordance and as a
secondary fast-path during rollout, but is no longer load-bearing.

bootjp commented Apr 20, 2026

/gemini review


coderabbitai Bot commented Apr 20, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 55 minutes and 39 seconds before requesting another review.



📥 Commits

Reviewing files that changed from the base of the PR and between aadf7ff and 40a923e.

📒 Files selected for processing (7)
  • internal/raftengine/engine.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/quorum_ack.go
  • internal/raftengine/etcd/quorum_ack_test.go
  • kv/coordinator.go
  • kv/lease_read_test.go
  • kv/sharded_coordinator.go
📝 Walkthrough

The PR extends the LeaseProvider interface with a LastQuorumAck() method that exposes the engine's most recent observation of follower majority liveness. A new concurrency-safe quorumAckTracker component records follower acknowledgments, and the etcd engine integrates it to compute and publish the majority-ack instant. The lease-read fast-path in coordinators is refactored to use this quorum acknowledgment freshness instead of time+generation-based validity.

Changes

Cohort / File(s) Summary
LeaseProvider Interface
internal/raftengine/engine.go
Added LastQuorumAck() time.Time method to the LeaseProvider interface, exposing the engine's last quorum acknowledgment instant.
Etcd Engine Quorum Ack Integration
internal/raftengine/etcd/engine.go
Introduced ackTracker field, implemented LastQuorumAck() method with nil-guard and single-node shortcut, and integrated recordQuorumAck() call in handleStep() to track follower responses; resets tracker on leader-to-non-leader transitions.
Quorum Ack Tracker Component
internal/raftengine/etcd/quorum_ack.go
New unexported component quorumAckTracker with concurrency-safe per-peer ack timestamp tracking, majority boundary computation, and lock-free reads via atomic int64; provides recordAck(), reset(), and load() methods.
Quorum Ack Tracker Tests
internal/raftengine/etcd/quorum_ack_test.go
Added five unit tests validating quorum ack initialization, threshold publication, monotonicity, reset behavior, and concurrent correctness.
Lease-Read Fast-Path Logic
kv/coordinator.go, kv/sharded_coordinator.go
Refactored fast-path from time+generation-based lease validity to engine-driven check using State() == StateLeader and LastQuorumAck() freshness; simplified error handling and updated documentation.
Test Double Implementation
kv/lease_read_test.go
Updated fakeLeaseEngine test double to track and expose LastQuorumAck() via atomic.Int64 field.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Coordinator
    participant Engine
    participant AckTracker
    participant Followers

    Client->>Coordinator: LeaseRead(ctx)
    Coordinator->>Engine: State()
    Engine-->>Coordinator: StateLeader
    
    Coordinator->>Engine: LastQuorumAck()
    Engine->>AckTracker: load()
    AckTracker-->>Engine: lastQuorumAckTime
    Engine-->>Coordinator: lastQuorumAckTime
    
    alt Quorum Ack Fresh (within lease duration)
        Coordinator-->>Client: Fast-path success (appliedIndex)
    else Stale or Zero
        Coordinator->>Coordinator: LinearizableRead(ctx)
        Coordinator->>Engine: extend lease
        Coordinator-->>Client: Success
    end
    
    Note over Followers,AckTracker: Meanwhile: Leader receives follower responses
    Followers->>Engine: MsgAppResp / MsgHeartbeatResp
    Engine->>AckTracker: recordAck(peerID)
    AckTracker->>AckTracker: Update peer timestamp & recompute majority boundary
    AckTracker->>AckTracker: Publish quorumAckUnixNano atomically

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 14.29%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title accurately and concisely describes the main change, introducing an engine-driven lease anchor via LastQuorumAck to improve lease-read performance.




@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements an engine-driven lease mechanism to improve the performance and reliability of lease-based reads, particularly under network or scheduling congestion. By tracking the majority-ack instant directly within the Raft engine using follower heartbeat and append responses, the system can serve leader-local reads more efficiently. The review feedback highlights several critical improvement opportunities: ensuring time monotonicity by avoiding the conversion of timestamps to Unix nanoseconds, optimizing the ack-tracking logic to reduce memory allocations and lock contention on high-frequency paths, and properly handling cluster membership changes to prevent stale peer data from corrupting the quorum calculation.

if followerQuorum <= 0 {
return
}
now := time.Now().UnixNano()

high

Using time.Now().UnixNano() and then reconstructing the time via time.Unix(0, ns) strips the monotonic clock reading from the time.Time object. This makes subsequent time.Since(ack) calls in the lease-read path susceptible to wall-clock jumps (e.g., NTP adjustments). If the system clock is adjusted backwards, the lease could be held for longer than the safety window, which is unsafe.

Consider storing the time.Time object directly using an atomic.Pointer[time.Time] or atomic.Value to preserve monotonicity.

References
  1. For frequently accessed fields on hot paths, prefer atomic operations over mutexes to improve performance and maintain consistency.

Comment thread internal/raftengine/etcd/quorum_ack.go Outdated
Comment on lines +73 to +81
acks := make([]int64, 0, len(t.peerAcks))
for _, a := range t.peerAcks {
acks = append(acks, a)
}
// Sort descending so acks[0] is the most recent. The followerQuorum-th
// entry (1-indexed) is the oldest ack among the top quorum -- i.e. the
// boundary instant by which majority liveness was confirmed.
sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
t.quorumAckUnixNano.Store(acks[followerQuorum-1])

medium

Allocating a new slice and sorting it on every heartbeat or append response can be inefficient, especially in clusters with many peers. Since followerQuorum is typically very small (e.g., 1 or 2), you can find the Nth-most-recent timestamp in $O(N)$ time without a full sort and without allocating a slice if you use a small fixed-size array on the stack for common cluster sizes.

// See docs/lease_read_design.md for the full argument.
type quorumAckTracker struct {
mu sync.Mutex
peerAcks map[uint64]int64 // peer ID → last ack unix nano observed on leader

medium

The peerAcks map is never pruned of peers that have been removed from the cluster. While reset() clears it on leader loss, a long-lived leader in a cluster with frequent membership changes will accumulate stale entries. More importantly, if the cluster size shrinks, an old timestamp from a removed peer might still be among the top followerQuorum acks, potentially inflating the majority-ack instant incorrectly.

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +633 to +636
e.mu.RLock()
clusterSize := len(e.peers)
state := e.status.State
e.mu.RUnlock()

medium

Acquiring e.mu.RLock() on the hot lease-read path to check clusterSize and state introduces unnecessary lock contention. Since these fields are updated infrequently compared to read volume, consider storing them in atomic variables (e.g., atomic.Int32) to allow truly lock-free peeking in LastQuorumAck.

References
  1. For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Comment thread kv/coordinator.go
Comment on lines +376 to +381
state := c.engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
return lp.AppliedIndex(), nil
}
}

medium

There is redundant state checking and lock acquisition here. c.engine.State() is called at line 376, and then lp.LastQuorumAck() is called at line 378, which internally acquires the engine's RLock again to check the same state. This doubles the synchronization overhead on the fast path. Consider refactoring LastQuorumAck to return both the timestamp and the leader status, or pass the already-acquired state into it to ensure leadership is verified correctly relative to the lock.

References
  1. To prevent race conditions in read-modify-write operations that require leadership, verify leadership status after acquiring the lock that protects the operation.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
kv/lease_read_test.go (1)

24-24: Add coverage for the new engine-driven fast path.

fakeLeaseEngine.LastQuorumAck() is now wired via lastQuorumAckUnixNano, but none of the TestCoordinate_LeaseRead_* cases in this file actually exercise the new primary path (state==Leader && !ack.IsZero() && time.Since(ack) < leaseDur). Every current test still reaches the fast path via c.lease.extend(...) (i.e. the secondary/caller-side lease), so the main behavioral change of this PR — that a fresh engine ack alone returns AppliedIndex without touching LinearizableRead or c.lease — is untested at the Coordinate boundary. A regression that broke the engine-driven branch would not be caught here.

Suggested additions (table-driven friendly):

  • Fresh ack, cold caller-side lease → fast path, 0 LinearizableRead calls, c.lease.valid(now) still false.
  • Stale ack (time.Since(ack) >= leaseDur), cold caller-side lease → slow path.
  • Fresh ack but state != Leader → slow path (and LinearizableRead invoked).
  • Same expectations replicated for groupLeaseRead in sharded_coordinator_test.go.

Example helper on fakeLeaseEngine:

func (e *fakeLeaseEngine) setQuorumAck(t time.Time) {
    if t.IsZero() {
        e.lastQuorumAckUnixNano.Store(0)
        return
    }
    e.lastQuorumAckUnixNano.Store(t.UnixNano())
}

As per coding guidelines: "Add test coverage for TTL/HLC behaviors when touching clocks, OCC, or replication logic in store/ and kv/ packages."

Also applies to: 68-74

internal/raftengine/etcd/quorum_ack_test.go (1)

39-62: Test name claims more than it verifies.

TestQuorumAckTracker_QuorumAckIsOldestOfTopN only asserts monotonicity (!third.Before(second)), not that load() equals the boundary entry (the followerQuorum-th most recent ack). As written, an implementation bug that published the most recent ack instead of the oldest of the top N would still pass this test.

Consider asserting the exact value by capturing the individual ack times (e.g. record timestamps t2, t3, t4 via time.Now() around each recordAck, then require.Equal(t, t3.UnixNano(), tr.load().UnixNano()) since t3 is the older of the top two after peer 4's ack lands).

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/raftengine/etcd/quorum_ack.go`:
- Around line 42-82: The tracker retains removed peers in peerAcks causing stale
acks and unbounded growth; add a method quorumAckTracker.forgetPeer(peerID
uint64) that locks mu, deletes peerID from peerAcks, and (optionally) recomputes
quorumAckUnixNano or leaves recompute to the next recordAck; then call this new
forgetPeer from Engine.removePeer alongside the existing teardown so removed
peers no longer contribute to quorum computation and the map does not leak
entries; ensure symbols referenced are quorumAckTracker, peerAcks,
quorumAckUnixNano, recordAck, forgetPeer, reset, and Engine.removePeer.



📥 Commits

Reviewing files that changed from the base of the PR and between a45ca29 and aadf7ff.

📒 Files selected for processing (7)
  • internal/raftengine/engine.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/quorum_ack.go
  • internal/raftengine/etcd/quorum_ack_test.go
  • kv/coordinator.go
  • kv/lease_read_test.go
  • kv/sharded_coordinator.go

Comment thread internal/raftengine/etcd/quorum_ack.go

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an engine-driven lease mechanism to improve the performance and reliability of lease-based reads, particularly under congestion. It adds a LastQuorumAck method to the LeaseProvider interface and implements a quorumAckTracker to maintain the wall-clock time of the most recent majority acknowledgment. Feedback focuses on optimizing the hot path by reducing memory allocations and redundant locking through the use of stack-allocated buffers and atomic operations. Additionally, a concern was raised regarding the handling of peer removals in the acknowledgment tracker to prevent potential lease safety violations.

Comment thread internal/raftengine/etcd/quorum_ack.go Outdated
Comment on lines +73 to +81
acks := make([]int64, 0, len(t.peerAcks))
for _, a := range t.peerAcks {
acks = append(acks, a)
}
// Sort descending so acks[0] is the most recent. The followerQuorum-th
// entry (1-indexed) is the oldest ack among the top quorum -- i.e. the
// boundary instant by which majority liveness was confirmed.
sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
t.quorumAckUnixNano.Store(acks[followerQuorum-1])

high

This block executes inside the main Raft event loop for every MsgAppResp and MsgHeartbeatResp. Allocating a new slice and performing a reflective sort (sort.Slice) on every response can introduce significant overhead and GC pressure on high-throughput clusters. Since Raft clusters are typically small (3-7 nodes), we can use a stack-allocated buffer to avoid allocations in the common case.

	var buf [16]int64
	acks := buf[:0]
	if len(t.peerAcks) > len(buf) {
		acks = make([]int64, 0, len(t.peerAcks))
	}
	for _, a := range t.peerAcks {
		acks = append(acks, a)
	}
	// Sort descending so acks[0] is the most recent. The followerQuorum-th
	// entry (1-indexed) is the oldest ack among the top quorum.
	sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
	t.quorumAckUnixNano.Store(acks[followerQuorum-1])

// the strictly-safe boundary by at most the one-way delay + scheduling
// slop -- which is exactly what leaseSafetyMargin is sized to cover.
// See docs/lease_read_design.md for the full argument.
type quorumAckTracker struct {

high

The quorumAckTracker currently lacks a mechanism to handle peer removal. If a peer is removed from the cluster, its last recorded acknowledgment remains in the peerAcks map indefinitely. This could allow a leader to incorrectly satisfy the quorum requirement using a "ghost" acknowledgment from a non-existent peer, potentially violating lease safety after a configuration change. Please add a removePeer(peerID uint64) method to prune the map and ensure it is called from Engine.removePeer.

References
  1. To maintain consistency between the application's peer list and the Raft ConfState, ensure the peer list remains complete and consistent across configuration changes.

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +633 to +636
e.mu.RLock()
clusterSize := len(e.peers)
state := e.status.State
e.mu.RUnlock()

medium

Acquiring e.mu.RLock() here introduces contention on the hot lease-read path. Since the caller (Coordinate.LeaseRead) likely already checked the engine state, this lock is often redundant. To achieve a truly lock-free fast path, consider mirroring the cluster size and Raft state in atomic variables (similar to how appliedIndex is handled).

References
  1. For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Comment thread kv/coordinator.go
Comment on lines +376 to +381
state := c.engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
return lp.AppliedIndex(), nil
}
}

medium

This logic results in redundant locking and state checks. c.engine.State() acquires a read lock, and then lp.LastQuorumAck() acquires the same lock again. Since LastQuorumAck is only useful when the node is the leader, consider refactoring this to minimize lock acquisitions on the fast path, perhaps by making LastQuorumAck lock-free using atomic operations.

References
  1. For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Comment thread kv/sharded_coordinator.go
Comment on lines +756 to +761
state := engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
return lp.AppliedIndex(), nil
}
}

medium

Similar to Coordinate.LeaseRead, this implementation performs redundant state checks and locking. Both engine.State() and lp.LastQuorumAck() acquire the engine's RWMutex. Optimizing LastQuorumAck to be lock-free would improve performance on this hot path, consistent with repository rules regarding atomic operations.

References
  1. For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Addresses review feedback on PR #561.

Safety fix (gemini HIGH): quorumAckTracker now exposes removePeer,
called from Engine.removePeer so a shrinking-then-growing cluster
cannot keep a stale ack from a removed peer falsely satisfying the
majority threshold. recomputeLocked factored out of recordAck to
share the sort-and-publish path.

Hot-path fix (gemini MEDIUM x3): LastQuorumAck is now fully
lock-free. Multi-node path reads ackTracker.quorumAckUnixNano
atomically as before; single-node leader path reads a new
singleNodeLeaderAckUnixNano atomic, populated from refreshStatus
every tick while leader of a 1-node cluster and cleared otherwise.
The previous implementation held e.mu.RLock to check peer count and
state, doubling the lock traffic with the caller's engine.State().

Coverage (coderabbit nitpick):
- TestCoordinate_LeaseRead_EngineAckFastPath: fresh ack + cold
  caller-side lease -> fast path, 0 LinearizableRead.
- TestCoordinate_LeaseRead_EngineAckStaleFallsThrough: stale ack ->
  slow path.
- TestCoordinate_LeaseRead_EngineAckIgnoredWhenNotLeader: fresh ack
  with non-leader state -> slow path.
- TestQuorumAckTracker_RemovedPeerCannotSatisfyQuorum: removed peer
  ack no longer counts toward the new cluster's quorum.
- TestQuorumAckTracker_RemovePeerZeroQuorumKeepsCurrent: followerQuorum=0
  preserves the published instant so the next recordAck refreshes it.

bootjp commented Apr 20, 2026

/gemini review

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


Copilot AI left a comment


Pull request overview

This PR fixes a lease-read regression by shifting the lease “anchor time” authority from the caller (pre-read time.Now() sampling) to the Raft engine, using a majority-ack observation (LastQuorumAck) derived from follower responses to keep lease reads warm even when LinearizableRead is slow.

Changes:

  • Extend raftengine.LeaseProvider with LastQuorumAck() time.Time and update KV lease-read logic to prefer this engine-driven anchor before falling back to the existing caller-side lease and then LinearizableRead.
  • Implement quorum-ack tracking in the etcd Raft engine (per-peer ack timestamps, majority “oldest-of-top-N” computation), including reset on leader loss and peer removal handling.
  • Add unit tests for quorum-ack tracking and KV coordinator behavior using the new fast path.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File Description
internal/raftengine/engine.go Adds LastQuorumAck() to LeaseProvider contract and documents required safety checks.
internal/raftengine/etcd/engine.go Implements LastQuorumAck() plus leader-side ack tracking, reset on leader loss, and peer-removal pruning.
internal/raftengine/etcd/quorum_ack.go Introduces quorumAckTracker to compute and publish majority-ack instant.
internal/raftengine/etcd/quorum_ack_test.go Adds unit tests for quorum ack semantics, reset, pruning, and concurrency.
kv/coordinator.go Makes engine-driven quorum-ack the primary LeaseRead fast path, preserving the old caller-side lease as secondary.
kv/sharded_coordinator.go Mirrors the engine-driven lease-read fast path for sharded groups.
kv/lease_read_test.go Extends fake engine and adds tests validating the new engine-ack fast path behavior.
Comments suppressed due to low confidence (1)

internal/raftengine/etcd/engine.go:1262

  • recordQuorumAck() is called before rawNode.Step(). If Step returns etcdraft.ErrStepPeerNotFound (e.g., response from a removed/unknown peer), the ack still gets recorded and can incorrectly advance LastQuorumAck / satisfy quorum after a config change. Consider moving recordQuorumAck after a successful Step, or explicitly ignoring acks from peers not present in e.peers (and/or only recording when Step does not return ErrStepPeerNotFound).
	e.recordLeaderContact(msg)
	e.recordQuorumAck(msg)
	if err := e.rawNode.Step(msg); err != nil {
		if errors.Is(err, etcdraft.ErrStepPeerNotFound) {
			return
		}

Comment thread internal/raftengine/engine.go Outdated
Comment on lines +112 to +113
// when the local node is not the leader. Single-node clusters
// report time.Now() unconditionally since self is the quorum.

Copilot AI Apr 20, 2026


The LeaseProvider doc says single-node clusters "report time.Now() unconditionally", but the etcd engine implementation can return zero (e.g., before the first refreshStatus tick) and otherwise returns the last tick time rather than calling time.Now at read time. Please either adjust the documentation to match the actual contract (e.g., "returns a recent non-zero time while leader in a single-node cluster") or change implementations to satisfy the documented behavior.

Suggested change
// when the local node is not the leader. Single-node clusters
// report time.Now() unconditionally since self is the quorum.
// when the local node is not the leader. In a single-node cluster,
// self is the quorum, so while leader this returns a recent non-zero
// timestamp maintained by the engine; callers must not assume it is
// computed via time.Now at the instant of the call.

Comment thread kv/lease_read_test.go
linearizableErr error
linearizableCalls atomic.Int32
state atomic.Value // stores raftengine.State; default Leader
lastQuorumAckUnixNano atomic.Int64 // 0 = no ack yet. Updated by ackNow().

Copilot AI Apr 20, 2026


The struct field comment mentions it is "Updated by ackNow()", but there is no ackNow helper on fakeLeaseEngine (the tests use setQuorumAck). Update the comment to match the actual helper/method so the test harness stays self-explanatory.

Suggested change
lastQuorumAckUnixNano atomic.Int64 // 0 = no ack yet. Updated by ackNow().
lastQuorumAckUnixNano atomic.Int64 // 0 = no ack yet. Updated by setQuorumAck().

Comment on lines +122 to +145
// Recorder alternates between two peer IDs so a 3-node followerQuorum
// always has at least one entry and the sort path runs.
go func() {
defer wg.Done()
for {
select {
case <-stop:
return
default:
tr.recordAck(2, 1)
tr.recordAck(3, 1)
}
}
}()
go func() {
defer wg.Done()
for {
select {
case <-stop:
return
default:
_ = tr.load()
}
}

Copilot AI Apr 20, 2026


TestQuorumAckTracker_ConcurrentRecordAndLoad spins in tight loops with a select default in two goroutines. In parallel test runs this can burn CPU and cause flakiness under constrained CI. Consider adding a small backoff (e.g., time.Sleep / runtime.Gosched) or using a time-bounded loop with a ticker to reduce busy-waiting while still exercising concurrent access.

Addresses copilot review on PR #561.

recordQuorumAck ran before rawNode.Step, so a late MsgAppResp / MsgHeartbeatResp
from a just-removed peer (which rawNode.Step would reject with
ErrStepPeerNotFound) still landed an ack in the tracker -- resurrecting
the "ghost" entry that removePeer just pruned. Filter on e.peers
membership before recording; the event loop is the sole writer to
e.peers so the map read is race-free.
@bootjp
Owner Author

bootjp commented Apr 20, 2026

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment thread kv/coordinator.go Outdated
Comment on lines +376 to +380
state := c.engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
return lp.AppliedIndex(), nil
}

Copilot AI Apr 20, 2026


The engine-driven fast path uses time.Since(ack) < leaseDur. Because LastQuorumAck() is reconstructed from a unix-nano value (no monotonic component), a backwards wall-clock adjustment can make time.Since(ack) negative and incorrectly treat a very old ack as "fresh". To keep the lease safety argument intact, guard against ack.After(time.Now()) (treat as stale/force slow path), or compare using now := time.Now(); elapsed := now.Sub(ack); if elapsed < 0 { ... } before applying the duration check.

Comment thread kv/coordinator.go Outdated
Comment on lines +347 to +355
// The lease is maintained inside the engine from ongoing
// MsgAppResp / MsgHeartbeatResp traffic, so callers do not sample
// time.Now() before the slow path to "extend" a lease afterwards.
// That earlier pre-read sampling was racy under congestion: if a
// LinearizableRead took longer than LeaseDuration, the extension
// would land already expired and the lease never warmed up. The
// engine-driven anchor is refreshed every heartbeat independent of
// read latency.
//

Copilot AI Apr 20, 2026


The doc comment says callers "do not sample time.Now() before the slow path to extend a lease afterwards", but this function still samples now := time.Now() before LinearizableRead and uses it to extend the secondary caller-side lease (c.lease.extend(now.Add(leaseDur), ...)). Please adjust the comment to clarify that only the primary path is engine-driven, while the legacy caller-side lease (with pre-read sampling) remains as a secondary fallback during rollout.

Suggested change
// The lease is maintained inside the engine from ongoing
// MsgAppResp / MsgHeartbeatResp traffic, so callers do not sample
// time.Now() before the slow path to "extend" a lease afterwards.
// That earlier pre-read sampling was racy under congestion: if a
// LinearizableRead took longer than LeaseDuration, the extension
// would land already expired and the lease never warmed up. The
// engine-driven anchor is refreshed every heartbeat independent of
// read latency.
//
// The primary lease path is maintained inside the engine from ongoing
// MsgAppResp / MsgHeartbeatResp traffic, so that path does not rely on
// callers sampling time.Now() before the slow path to "extend" a lease
// afterwards. That earlier pre-read sampling was racy under
// congestion: if a LinearizableRead took longer than LeaseDuration,
// the extension would land already expired and the lease never warmed
// up. The engine-driven anchor is refreshed every heartbeat
// independent of read latency.
//
// During rollout, a legacy caller-side lease extension remains as a
// secondary fallback after a successful slow-path read, including the
// older pre-read time sampling used to anchor that fallback window.
//

Comment thread kv/sharded_coordinator.go Outdated
Comment on lines 754 to 762
// Engine-driven lease anchor -- see Coordinate.LeaseRead for why
// this is the primary check.
state := engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
return lp.AppliedIndex(), nil
}
}
now := time.Now()

Copilot AI Apr 20, 2026


Same wall-clock skew issue as in Coordinate.LeaseRead: if LastQuorumAck() is in the future relative to time.Now() (e.g. due to a backwards clock step), time.Since(ack) becomes negative and this fast path can incorrectly accept a stale quorum ack. Add a guard to treat ack.After(now) as invalid/stale before applying the duration check.

Suggested change
// Engine-driven lease anchor -- see Coordinate.LeaseRead for why
// this is the primary check.
state := engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && time.Since(ack) < leaseDur {
return lp.AppliedIndex(), nil
}
}
now := time.Now()
now := time.Now()
// Engine-driven lease anchor -- see Coordinate.LeaseRead for why
// this is the primary check.
state := engine.State()
if state == raftengine.StateLeader {
if ack := lp.LastQuorumAck(); !ack.IsZero() && !ack.After(now) && now.Sub(ack) < leaseDur {
return lp.AppliedIndex(), nil
}
}

Comment thread internal/raftengine/engine.go Outdated
Comment on lines +104 to +109
// Safety: callers must verify the lease with
// time.Since(LastQuorumAck()) < LeaseDuration() &&
// engine.State() == raftengine.StateLeader
// before serving a leader-local read. The LeaseDuration is bounded
// by electionTimeout - safety_margin, which guarantees that any new
// leader candidate cannot yet accept writes during that window.

Copilot AI Apr 20, 2026


The interface contract recommends time.Since(LastQuorumAck()) < LeaseDuration() as the safety check. Since implementations may return a time.Time reconstructed from unix nanos (no monotonic clock), a backwards wall-clock adjustment can make time.Since(...) negative and incorrectly treat an old ack as fresh. Consider updating the contract to require treating LastQuorumAck() values later than time.Now() as invalid/stale (or otherwise specify a monotonic-safe comparison).

Suggested change
// Safety: callers must verify the lease with
// time.Since(LastQuorumAck()) < LeaseDuration() &&
// engine.State() == raftengine.StateLeader
// before serving a leader-local read. The LeaseDuration is bounded
// by electionTimeout - safety_margin, which guarantees that any new
// leader candidate cannot yet accept writes during that window.
// Safety: callers must verify the lease against a single `now := time.Now()`
// and MUST treat any LastQuorumAck() later than `now` as invalid/stale
// before serving a leader-local read, e.g.
// ack := LastQuorumAck()
// now := time.Now()
// !ack.IsZero() &&
// !ack.After(now) &&
// now.Sub(ack) < LeaseDuration() &&
// engine.State() == raftengine.StateLeader
// The LeaseDuration is bounded by electionTimeout - safety_margin,
// which guarantees that any new leader candidate cannot yet accept
// writes during that window.

Addresses copilot review on PR #561.

LastQuorumAck() returns a time reconstructed from UnixNano, which has
no monotonic component. A backwards wall-clock step can leave
now.Sub(ack) negative and let a stale ack pass the `<` duration
comparison -- a lease-safety violation.

Three related changes:

1. Coordinate.LeaseRead / groupLeaseRead: sample now := time.Now()
   once, then validate the engine-driven ack through a new
   engineLeaseAckValid helper that requires state==Leader, ack
   non-zero, !ack.After(now), and now.Sub(ack) < leaseDur. Treating
   future-dated acks as invalid forces the slow path instead of
   trusting a negative elapsed duration. Factoring the check into a
   helper keeps both call sites under the cyclomatic-complexity cap.
2. raftengine.LeaseProvider.LastQuorumAck doc: specify the
   skew-safe check (single-now sample + !ack.After(now)) so every
   implementation and every caller follows the same contract.
3. Coordinate.LeaseRead doc refreshed to clarify that only the
   PRIMARY engine-driven path skips pre-read sampling; the
   SECONDARY caller-side lease fallback still uses now.Add(leaseDur)
   during rollout.
@bootjp
Owner Author

bootjp commented Apr 20, 2026

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment on lines +70 to +80
func (t *quorumAckTracker) removePeer(peerID uint64, followerQuorum int) {
t.mu.Lock()
defer t.mu.Unlock()
if _, ok := t.peerAcks[peerID]; !ok {
return
}
delete(t.peerAcks, peerID)
if followerQuorum <= 0 {
return
}
t.recomputeLocked(followerQuorum)

Copilot AI Apr 20, 2026


In quorumAckTracker.removePeer, returning early when peerID has no recorded entry skips recomputing quorumAckUnixNano. If membership changes reduce followerQuorum, the tracker can unnecessarily keep quorumAck at 0 (or stale) until the next recordAck, even though the remaining recorded peers may already satisfy the new quorum. Consider always recomputing when followerQuorum > 0 (and delete() on a nil/missing entry is safe), or providing an explicit recompute-on-config-change path.

Comment thread internal/raftengine/engine.go Outdated
Comment on lines +118 to +119
// when the local node is not the leader. Single-node clusters
// report time.Now() unconditionally since self is the quorum.

Copilot AI Apr 20, 2026


LeaseProvider.LastQuorumAck docs say "Single-node clusters report time.Now() unconditionally since self is the quorum" while also stating the method returns zero when the local node is not leader. In practice the etcd engine only publishes a non-zero value while leader (and updates it periodically), so this wording is misleading/contradictory. Please tighten the contract wording (e.g., "single-node leaders may return a recent time.Now()"), so callers/other implementations don't interpret it as returning non-zero when not leader.

Suggested change
// when the local node is not the leader. Single-node clusters
// report time.Now() unconditionally since self is the quorum.
// when the local node is not the leader. Single-node leaders may
// return a recent time.Now() value since self is the quorum.

Addresses copilot review on PR #561.

- quorumAckTracker.removePeer no longer early-returns when peerID has
  no recorded entry. A shrink that reduces followerQuorum may let
  the remaining peers now satisfy the smaller threshold, and without
  an explicit recompute the published instant stays stale (or zero)
  until the next recordAck. delete() is safe on a missing key, so
  the unconditional path is harmless otherwise.
- LeaseProvider.LastQuorumAck doc: the previous wording said "single-
  node clusters report time.Now() unconditionally" which contradicted
  the "zero when not leader" clause. Tighten to "single-node leaders
  may return a recent time.Now()" so callers and alternate engines
  don't interpret it as returning non-zero on a non-leader.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines +630 to +638
func (e *Engine) LastQuorumAck() time.Time {
if e == nil {
return time.Time{}
}
if ns := e.singleNodeLeaderAckUnixNano.Load(); ns != 0 {
return time.Unix(0, ns)
}
return e.ackTracker.load()
}

Copilot AI Apr 20, 2026


LastQuorumAck() currently returns ackTracker.load() unconditionally for multi-node clusters. That can return a non-zero time even when the node is follower/candidate (e.g., late MsgAppResp/MsgHeartbeatResp arriving after a step-down can repopulate ackTracker), which violates the LeaseProvider contract (“zero time when ... not the leader”) and can also leak stale quorum acks across re-elections. Consider gating the return value on leader state (or maintaining an atomic leader flag) so non-leaders always observe time.Time{} for this API.

Comment on lines +1279 to +1298
if !isFollowerResponse(msg.Type) {
return
}
if msg.From == 0 || msg.From == e.nodeID {
return
}
// Reject acks from peers not in the current membership. Without
// this filter, a late MsgAppResp from a just-removed peer (which
// rawNode.Step will immediately reject with ErrStepPeerNotFound)
// would still land an ack in the tracker -- resurrecting the
// "ghost" entry that removePeer just pruned. Since we run on the
// event-loop goroutine (the sole writer to e.peers), the map read
// here is race-free.
if _, ok := e.peers[msg.From]; !ok {
return
}
clusterSize := len(e.peers)
if clusterSize <= 1 {
return
}

Copilot AI Apr 20, 2026


recordQuorumAck updates the tracker for follower-response messages without checking that the local node is currently leader. After a leader-loss transition (where ackTracker.reset() runs), late follower responses can still arrive and repopulate ackTracker while we are follower/candidate, making LastQuorumAck() non-zero and potentially reusing stale liveness on a later re-election. Suggest adding a leader-state gate here (or otherwise ensuring the tracker cannot be updated outside leader role).

Comment on lines +49 to +58
time.Sleep(2 * time.Millisecond)
tr.recordAck(3, 2)
second := tr.load()
require.False(t, second.IsZero())

// Now peer 4 acks with a later timestamp. Quorum ack should still
// be the older of the top two (either 2 or 3, not 4) because the
// 5-node quorum is 3 including self (2 followers + self), and the
// OLDEST of the top two followers is still the limiting factor.
time.Sleep(2 * time.Millisecond)

Copilot AI Apr 20, 2026


These time.Sleep(2 * time.Millisecond) calls aren’t required for the assertions (the test only checks non-zero/non-regression) and can add flakiness/latency to the test suite on slow CI. Consider removing them or rewriting the tracker to accept an injected timestamp in tests so ordering can be controlled deterministically.

Suggested change
time.Sleep(2 * time.Millisecond)
tr.recordAck(3, 2)
second := tr.load()
require.False(t, second.IsZero())
// Now peer 4 acks with a later timestamp. Quorum ack should still
// be the older of the top two (either 2 or 3, not 4) because the
// 5-node quorum is 3 including self (2 followers + self), and the
// OLDEST of the top two followers is still the limiting factor.
time.Sleep(2 * time.Millisecond)
tr.recordAck(3, 2)
second := tr.load()
require.False(t, second.IsZero())
// Now peer 4 acks. Quorum ack should still be the older of the top
// two (either 2 or 3, not 4) because the 5-node quorum is 3
// including self (2 followers + self), and the OLDEST of the top
// two followers is still the limiting factor.

Addresses copilot review on PR #561.

A transport-level MsgAppResp / MsgHeartbeatResp can land at handleStep
shortly after a step-down -- ackTracker.reset() has already fired, but
the response is already in flight. Previously recordQuorumAck would
admit the late ack and repopulate the tracker, and LastQuorumAck
would surface the resulting non-zero instant. A subsequent
re-election would then observe stale liveness from the prior term,
violating the LeaseProvider contract ("zero time when the local node
is not the leader") and potentially widening an already-expired
lease window.

Fix: atomic isLeader mirror on the Engine.

- refreshStatus writes it every tick before the other lease-related
  atomics, so every role transition is observed before the next
  handleStep runs.
- recordQuorumAck loads it and drops the response on the floor when
  not leader. Prevents the tracker from being repopulated after
  reset() and before the next refreshStatus pass.
- LastQuorumAck returns time.Time{} when not leader, honoring the
  interface contract and covering the narrow window between
  recordQuorumAck's check and ack-tracker publication.

Also drop the flaky 2ms time.Sleep calls from
TestQuorumAckTracker_QuorumAckIsOldestOfTopN: the
require.False(third.Before(second)) assertion holds trivially when
timestamps are
equal, so the sleeps added latency without exercising new behaviour
and could flake on slow CI.
@bootjp bootjp requested a review from Copilot April 20, 2026 18:31
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines +123 to +147
// Recorder alternates between two peer IDs so a 3-node followerQuorum
// always has at least one entry and the sort path runs.
go func() {
defer wg.Done()
for {
select {
case <-stop:
return
default:
tr.recordAck(2, 1)
tr.recordAck(3, 1)
}
}
}()
go func() {
defer wg.Done()
for {
select {
case <-stop:
return
default:
_ = tr.load()
}
}
}()

Copilot AI Apr 20, 2026


This concurrency test uses two goroutines with tight default: loops and no backoff, which can busy-spin and consume significant CPU on CI (especially with t.Parallel()). Consider adding a small sleep/Gosched in the loops or using a time-bounded context/ticker-driven loop so the test still exercises concurrency without pegging a core.

Comment on lines +89 to +104
func (t *quorumAckTracker) recomputeLocked(followerQuorum int) {
if len(t.peerAcks) < followerQuorum {
// Not enough peers have reported to form a majority yet.
t.quorumAckUnixNano.Store(0)
return
}
acks := make([]int64, 0, len(t.peerAcks))
for _, a := range t.peerAcks {
acks = append(acks, a)
}
// Sort descending so acks[0] is the most recent. The followerQuorum-th
// entry (1-indexed) is the oldest ack among the top quorum -- i.e.
// the boundary instant by which majority liveness was confirmed.
sort.Slice(acks, func(i, j int) bool { return acks[i] > acks[j] })
t.quorumAckUnixNano.Store(acks[followerQuorum-1])
}

Copilot AI Apr 20, 2026


quorumAckTracker.recomputeLocked allocates a fresh slice and sorts all peer ack timestamps on every follower response. Since recordAck is called for each MsgAppResp / MsgHeartbeatResp, this can become a hot/allocating path as cluster size or heartbeat frequency grows. Consider reducing allocations (e.g., reuse a buffer) and computing the Nth-most-recent timestamp without a full sort.

Comment thread internal/raftengine/etcd/engine.go Outdated
if postRemovalClusterSize > 1 {
followerQuorum = postRemovalClusterSize / 2 //nolint:mnd
}
e.ackTracker.removePeer(nodeID, followerQuorum)

Copilot AI Apr 20, 2026


ackTracker is only recomputed on peer removal. When postRemovalClusterSize <= 1, followerQuorum is passed as 0, and quorumAckTracker.removePeer intentionally retains the previously published quorum instant. If the cluster later grows (or any config change increases the required follower quorum) before new follower responses arrive, LastQuorumAck can expose a stale non-zero instant and incorrectly allow lease-read fast paths under the new membership. Consider clearing/resetting the published quorum instant (or recomputing against the new followerQuorum) on membership changes that reduce quorum to 0 and on additions/increases to the follower quorum, so a new configuration cannot inherit a quorum-ack from the prior one.

Suggested change
e.ackTracker.removePeer(nodeID, followerQuorum)
ackTrackerQuorum := followerQuorum
if ackTrackerQuorum == 0 {
// Force the tracker to recompute/clear any previously published
// quorum instant instead of retaining stale state from the old
// configuration when the cluster shrinks to a single node.
ackTrackerQuorum = 1
}
e.ackTracker.removePeer(nodeID, ackTrackerQuorum)

… sort buffer

Addresses copilot review on PR #561.

- engine.removePeer: when the cluster shrinks to a single node the
  prior code passed followerQuorum=0 to quorumAckTracker.removePeer,
  which by design preserves the currently published instant so the
  next recordAck can refresh it. But if the cluster subsequently
  grew back without a recordAck arriving first, LastQuorumAck's
  multi-node fallback would surface the stale instant from the
  previous configuration. Route shrink-to-<=1 through
  ackTracker.reset() so any future multi-node membership starts fresh.
- quorumAckTracker.recomputeLocked: reuse a per-tracker ackBuf
  across calls instead of allocating a fresh []int64 on every
  recordAck. Heartbeat frequency × cluster size keeps the allocation
  rate non-trivial otherwise; re-slicing in place + append-on-growth
  brings the steady-state alloc to zero.
- quorum_ack_test concurrency loops now Gosched() between iterations
  so the test exercises interleavings without pegging a CI core.
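The reusable-buffer recompute described in the second bullet can be sketched like this. Field and method names (peerAcks, ackBuf, recomputeLocked, recordAck) follow the thread; the constructor and the exact recordAck signature are assumptions for a self-contained example.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// quorumAckTracker sketch focused on the zero-alloc recompute path.
type quorumAckTracker struct {
	mu                sync.Mutex
	peerAcks          map[uint64]int64 // peerID -> last ack (unix nanos)
	ackBuf            []int64          // reused across recomputeLocked calls
	quorumAckUnixNano atomic.Int64     // published majority-ack instant
}

func newTracker() *quorumAckTracker {
	return &quorumAckTracker{peerAcks: make(map[uint64]int64)}
}

func (t *quorumAckTracker) recordAck(peerID uint64, ackUnixNano int64, followerQuorum int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.peerAcks[peerID] = ackUnixNano
	t.recomputeLocked(followerQuorum)
}

// recomputeLocked publishes the oldest ack among the top-followerQuorum
// peers. Re-slicing ackBuf to length zero keeps its backing array, so
// steady-state recomputes allocate nothing after the first growth.
func (t *quorumAckTracker) recomputeLocked(followerQuorum int) {
	if len(t.peerAcks) < followerQuorum {
		t.quorumAckUnixNano.Store(0) // not enough reporters for a majority
		return
	}
	t.ackBuf = t.ackBuf[:0]
	for _, a := range t.peerAcks {
		t.ackBuf = append(t.ackBuf, a)
	}
	// Sort descending: ackBuf[followerQuorum-1] is the oldest of the top N.
	sort.Slice(t.ackBuf, func(i, j int) bool { return t.ackBuf[i] > t.ackBuf[j] })
	t.quorumAckUnixNano.Store(t.ackBuf[followerQuorum-1])
}

func main() {
	tr := newTracker()
	// 5-node cluster: followerQuorum = 2. Peers 2 and 3 ack at t=100, t=200.
	tr.recordAck(2, 100, 2)
	tr.recordAck(3, 200, 2)
	fmt.Println(tr.quorumAckUnixNano.Load()) // 100: oldest of the top two
	// Peer 4 acks later; the published instant advances only to 200.
	tr.recordAck(4, 300, 2)
	fmt.Println(tr.quorumAckUnixNano.Load()) // 200
}
```

The review's full-sort concern still applies asymptotically; a selection of the Nth-largest element would avoid the O(n log n) sort, but for typical cluster sizes the sorted buffer stays tiny.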
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread kv/lease_read_test.go
Comment on lines +67 to +73
func (e *fakeLeaseEngine) LastQuorumAck() time.Time {
ns := e.lastQuorumAckUnixNano.Load()
if ns == 0 {
return time.Time{}
}
return time.Unix(0, ns)
}

Copilot AI Apr 20, 2026


fakeLeaseEngine.LastQuorumAck() doesn't honor the raftengine.LeaseProvider contract that non-leaders return a zero time. As written it can return a non-zero ack even when e.State() is follower, which could let tests accidentally pass if production code ever stops checking state before using LastQuorumAck. Consider gating on e.State()==raftengine.StateLeader (or the atomic state value) and returning time.Time{} otherwise.

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +1321 to +1324
// Followers needed for majority = floor(clusterSize / 2): 1 for a
// 3-node cluster, 2 for 5-node, matching raft quorum semantics.
followerQuorum := clusterSize / 2 //nolint:mnd
e.ackTracker.recordAck(msg.From, followerQuorum)

Copilot AI Apr 20, 2026


New //nolint:mnd on the quorum calculation looks avoidable here. To keep the code self-explanatory without suppressing the linter, consider introducing a small helper (e.g., followerQuorumForClusterSize) or a named constant for the divisor, and reuse it in both recordQuorumAck and removePeer.

Comment on lines +2831 to +2835
if postRemovalClusterSize <= 1 {
e.ackTracker.reset()
} else {
e.ackTracker.removePeer(nodeID, postRemovalClusterSize/2) //nolint:mnd
}

Copilot AI Apr 20, 2026


New //nolint:mnd on the quorum calculation looks avoidable here as well. Consider reusing the same named helper/constant used in recordQuorumAck so the quorum math is centralized and the linter suppression isn't needed.

bootjp added 2 commits April 21, 2026 04:23
…der state

Addresses copilot review on PR #561.

- followerQuorumForClusterSize helper centralises the cluster/2
  formula so recordQuorumAck and removePeer share a single point of
  truth. The //nolint:mnd suppression now lives only in the helper
  where the formula is documented in context, rather than sprinkled
  at each call site.
- fakeLeaseEngine.LastQuorumAck now gates on State() == StateLeader,
  matching the production etcd engine and the LeaseProvider contract
  ("zero time when the local node is not the leader"). Without this
  mirror, a test could set a fresh ack plus a non-leader state and
  still take the fast path, masking a regression if production code
  ever stopped gating on engine.State() before consulting
  LastQuorumAck.
@bootjp bootjp requested a review from Copilot April 20, 2026 19:26
@bootjp bootjp merged commit a9a9faf into main Apr 20, 2026
10 checks passed
@bootjp bootjp deleted the feat/engine-driven-lease branch April 20, 2026 19:29
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread kv/sharded_coordinator.go
Comment on lines +754 to +759
// Single time.Now() sample so primary/secondary/extension all see
// the same instant. Clock-skew safety delegated to
// engineLeaseAckValid (see Coordinate.LeaseRead).
now := time.Now()
state := engine.State()
if engineLeaseAckValid(state, lp.LastQuorumAck(), now, leaseDur) {

Copilot AI Apr 20, 2026


now is taken before LastQuorumAck(). If the engine updates the quorum-ack timestamp between these two reads, ack.After(now) can become true and the primary fast path will be skipped even though the ack is fresh. Consider reading ack first and then sampling now (reused for secondary/extend) to avoid spurious slow-path reads.

Suggested change
// Single time.Now() sample so primary/secondary/extension all see
// the same instant. Clock-skew safety delegated to
// engineLeaseAckValid (see Coordinate.LeaseRead).
now := time.Now()
state := engine.State()
if engineLeaseAckValid(state, lp.LastQuorumAck(), now, leaseDur) {
// Read the quorum ack before sampling time so a concurrent ack
// update cannot appear to be in the future relative to now and
// spuriously disable the primary fast path. Reuse the same now
// sample for primary/secondary/extension checks.
ack := lp.LastQuorumAck()
now := time.Now()
state := engine.State()
if engineLeaseAckValid(state, ack, now, leaseDur) {

// Publish leader state atomically so recordQuorumAck / LastQuorumAck
// can gate on it without acquiring e.mu. MUST run before the
// single-node ack store below, otherwise a brand-new leader tick
// could publish a ack instant while isLeader is still false.

Copilot AI Apr 20, 2026


Minor grammar nit in the comment: "a ack instant" should be "an ack instant".

Suggested change
// could publish a ack instant while isLeader is still false.
// could publish an ack instant while isLeader is still false.

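The ordering constraint in the quoted comment can be illustrated with a small sketch. All names here are hypothetical stand-ins for the engine's actual fields: the point is only that the leader flag must be published before the ack instant is stored, so `LastQuorumAck` never observes a fresh anchor under a stale non-leader flag.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// engine (sketch): leader state and the last quorum-ack instant are
// both atomics so LastQuorumAck can gate on leadership without taking
// a mutex, mirroring the comment's "without acquiring e.mu".
type engine struct {
	isLeader atomic.Bool
	lastAck  atomic.Int64 // unix nanos of the last quorum ack; 0 = none
}

func (e *engine) becomeLeaderSingleNode() {
	e.isLeader.Store(true) // MUST precede the ack store below
	e.lastAck.Store(time.Now().UnixNano())
}

// LastQuorumAck returns the zero time when the local node is not the
// leader or no ack has been recorded, per the LeaseProvider contract.
func (e *engine) LastQuorumAck() time.Time {
	if !e.isLeader.Load() {
		return time.Time{}
	}
	ns := e.lastAck.Load()
	if ns == 0 {
		return time.Time{}
	}
	return time.Unix(0, ns)
}

func main() {
	var e engine
	fmt.Println(e.LastQuorumAck().IsZero()) // true: not leader, no anchor visible
	e.becomeLeaderSingleNode()
	fmt.Println(e.LastQuorumAck().IsZero()) // false: anchor published after the flag
}
```

If the two stores in `becomeLeaderSingleNode` were reversed, a reader could briefly see a non-zero ack while `isLeader` still reads false, which is the window the "MUST run before" comment closes.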
bootjp added a commit that referenced this pull request Apr 20, 2026
Resolve conflicts from #561 (engine-driven lease) landing:

- internal/raftengine/etcd/engine.go: keep both accessor groups --
  LastQuorumAck from #561 and the DispatchDropCount /
  DispatchErrorCount / StepQueueFullCount accessors this branch
  introduced for the monitoring DispatchCollector.
- kv/coordinator.go: the secondary caller-side lease fast path now
  consults the locally-cached `state` from the engineLeaseAckValid
  call instead of re-reading engine.State(), and observes the hit
  via c.leaseObserver. Also wire the engine-driven primary fast
  path through the observer so the lease-hit ratio panel counts
  every fast-path hit regardless of which tier produced it.
  Extract a one-line observeLeaseRead helper to keep LeaseRead
  under the cyclomatic-complexity cap after the extra branch.
- kv/sharded_coordinator.go: mirror the Coordinate.LeaseRead
  restructure for groupLeaseRead -- use the shared `state`, observe
  hits on both tiers, and share an observeLeaseRead package helper
  to keep the function body under the cyclomatic-complexity cap.
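The tiered structure this merge commit describes can be sketched as follows. Everything here is a simplified, hypothetical shape (the real `LeaseRead` consults `engineLeaseAckValid` and a caller-side lease object, not booleans): it shows the three tiers and the single `observeLeaseRead` helper through which every tier reports, so the lease-hit ratio panel counts hits regardless of which tier produced them.

```go
package main

import "fmt"

// leaseObserver (sketch): stand-in for c.leaseObserver in the commit.
type leaseObserver interface{ Observe(tier string) }

type printObserver struct{}

func (printObserver) Observe(tier string) { fmt.Println("lease_read tier:", tier) }

// observeLeaseRead is the one-line helper the commit extracts so the
// extra branch keeps LeaseRead under the cyclomatic-complexity cap.
func observeLeaseRead(o leaseObserver, tier string) {
	if o != nil {
		o.Observe(tier)
	}
}

// leaseRead mirrors the three-tier fallback: engine-driven ack first,
// caller-side lease second, LinearizableRead last.
func leaseRead(engineAckValid, callerLeaseValid bool, o leaseObserver) string {
	switch {
	case engineAckValid:
		observeLeaseRead(o, "engine")
		return "fast-path (engine ack)"
	case callerLeaseValid:
		observeLeaseRead(o, "caller")
		return "fast-path (caller lease)"
	default:
		observeLeaseRead(o, "linearizable")
		return "slow-path (LinearizableRead)"
	}
}

func main() {
	o := printObserver{}
	fmt.Println(leaseRead(true, false, o))
	fmt.Println(leaseRead(false, true, o))
	fmt.Println(leaseRead(false, false, o))
}
```

Funnelling all three tiers through one observer call is what lets a dashboard distinguish "engine fast path went cold" from "lease reads stopped entirely", which is the regression this PR set out to make visible.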
