-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kv: conditionally handleReady after tick #138357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
db7583b to
6fdca0e
Compare
Informs cockroachdb#133885. Informs cockroachdb#133216. This commit adjusts `Replica.tick` to check `RawNode.HasReady` when deciding whether a `Ready` struct should be processed (`Replica.handleRaftReady`) from the ticked replica. Previously, we would unconditionally pass through `Replica.handleRaftReady` after a tick, even if doing so was unnecessary and we knew ahead of time that we would hit this early return: https://github.com/cockroachdb/cockroach/blob/57aab736c34ce5dc7988bd53e0604fde48cef441/pkg/kv/kvserver/replica_raft.go#L1025. This was mostly harmless in the past with quiescence and a cheap short-circuit in `handleRaftReady`, but has become more problematic recently for the following three reasons: 1. Leader leases don't permit quiescence. We have some work planned to allow follower replicas to quiesce under a leader lease (see cockroachdb#133885), but even after that work, we likely won't allow quiescence for the leaseholder. This means that 2. `Replica.handleRaftReady` has been getting more expensive, even in the no-op case. This is because of work done in `Replica.updateProposalQuotaRaftMuLocked` and work done for replica admission control v2 (RACv2). 3. At some point in the future, we would like to revert 5e6698 and drop the raft tick interval back down to something like 200ms. This will allow us to address cockroachdb#133576 and further reduce failover times. Release note: None
6fdca0e to
072e061
Compare
|
@pav-kv I updated this PR with the additional condition that we discussed. I confirmed that it was failing reliably on |
pav-kv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nvanbenschoten Great, thanks a lot. I'm closing this PR, and will review the other one.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miraradeva)
|
Oups, sorry, wrong PR :) |
| defer r.raftMu.Unlock() | ||
| defer func() { | ||
| if exists && err == nil { | ||
| if processReady && err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MaybeSendPingsRaftMuLocked call can trigger raft messages. Wonder if we should check HasReady after it, to capture these potentially new messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, this line relied on the fact that we return true unconditionally (except when we're quiesced). Logically, this line is not checking "has ready", but rather checks that we passed below line 1456.
So it should be something like:
if noOp || err != nil {
return
}
MaybeSendPings()
if !processReady { // have we accounted for ReportUnreachable already?
processReady = HasReady()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially there is a less ugly way than this defer. At the very least, it can be moved below line 1456 (caution: I think it doesn't want Replica.mu held).
| // group has Ready to process. | ||
| processReady = r.mu.internalRaftGroup.HasReady() || | ||
| // Also do so if we just marked a replica as unreachable. This can be | ||
| // helpful for RACv2, which responds to changes in the Progress.StateType of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Progress.State
| for remoteReplica := range unreachableRemotes { | ||
| if bypassFn != nil && bypassFn(remoteReplica) { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another call to ReportUnreachable in updatePausedFollowersLocked (and there are no more ReportUnreachables). Should we capture that in processReady flag as well, for completeness?
| // Also do so if we just marked a replica as unreachable. This can be | ||
| // helpful for RACv2, which responds to changes in the Progress.StateType of | ||
| // follower replicas, despite that not being a Ready event. | ||
| len(unreachableRemotes) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Informs #133885.
Informs #133216.
This commit adjusts
Replica.tickto checkRawNode.HasReadywhen deciding whether aReadystruct should be processed (Replica.handleRaftReady) from the ticked replica. Previously, we would unconditionally pass throughReplica.handleRaftReadyafter a tick, even if doing so was unnecessary and we knew ahead of time that we would hit this early return:cockroach/pkg/kv/kvserver/replica_raft.go
Line 1025 in 57aab73
This was mostly harmless in the past with quiescence and a cheap short-circuit in
handleRaftReady, but has become more problematic recently for the following three reasons:Replica.handleRaftReadyhas been getting more expensive, even in the no-op case. This is because of work done inReplica.updateProposalQuotaRaftMuLockedand work done for replica admission control v2 (RACv2).Release note: None