-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: ensure cleanup of all abandoned learners before an AdminRelocateRange
#80205
base: master
Are you sure you want to change the base?
kvserver: ensure cleanup of all abandoned learners before an AdminRelocateRange
#80205
Conversation
@nvanbenschoten, this PR is just a proposal and not something I'm super sold on. What do you think? |
…locateRange` Previously, after cockroachdb#79405, `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` could return a range descriptor that contained learner replicas. This was because of the idempotency introduced in cockroachdb#79405, which relaxed how/when we failed replication changes due to descriptor mismatch. The hazard was that, while we would ensure that the learner we were trying to remove was indeed gone, we wouldn't do enough to ensure that there weren't new learners that were added to the range due to other concurrent `AdminRelocateRange` calls. This was a violation of its contract and could sometimes fail `AdminRelocateRange` requests. This commit makes it such that when `maybeLeaveAtomicChangeReplicasAndRemoveLearners` returns a non-error response, it guarantees that _all_ the learner replicas have been removed according to the returned range descriptor. Release note: None
3aae6de
to
f4d79e9
Compare
This commit is an alternative to cockroachdb#80205. In cockroachdb#79405, we relaxed the check that `AdminChangeReplicas` performs comparing the expected range descriptor state against the range descriptor in KV when we were performing single learner removals. This was because these learner removals are performed at the end of a replication change (i.e. after its already completed) and thus, could be made idempotent. The above change messed with a subtle part of the contract of `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` since now it was no longer guaranteed to always successfully return with a range desc that did not contain learners (due to intervening replication changes that were now no longer detected because of the relaxation mentioned above). This commit fixes this regression by making `maybeLeaveAtomicChangeReplicasAndRemoveLearners` check at the end if we've raced with another replication change. If so, we return an error instead of incorrectly returning a successful response with a descriptor that could still have learners or be in a joint config. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like this approach better than #80205. That PR feels like a partial regression of #79405, so I don't understand the benefit of a state with both of those PRs vs. a state with neither.
However, this PR does pose a risk of getting stuck in the loop if some other actor continuously adds learners.
I wonder if there's a third option that scopes down the call to maybeLeaveAtomicChangeReplicasAndRemoveLearners
at the end of changeReplicasImpl
. If we only want the demoted learners to be removed in this case, why are we requiring that all learners be removed?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go
line 1324 at r1 (raw file):
// Now the config isn't joint any more, but we may have demoted some voters // into learners. These learners should go as well. for {
Now that #79379 has landed, Should we be checking hasOutstandingLearnerSnapshotInFlight
inside of this loop?
pkg/kv/kvserver/replica_command.go
line 1338 at r1 (raw file):
var err error log.VEventf(ctx, 2, `removing learner replica %v from %v`, removalTarget, desc) desc, err = execChangeReplicasTxn(
Having seen fallout already, I'm somewhat wary that #79405 might have other consequences for callers that are removing a single learner replica and expect execChangeReplicasTxn
to be a true compare-and-swap on the full descriptor. Do you think that's a concern? If so, do you think there's merit in using changeReplicasTxnArgs
to scope these limited compare-and-swap fast-paths to just the callers that want them, instead of inferring them from the shape of the internalReplicationChanges
?
Previously, after #79405,
maybeLeaveAtomicChangeReplicasAndRemoveLearners()
could return a rangedescriptor that contained learner replicas. This was because of the idempotency
introduced in #79405, which relaxed how/when we failed replication changes due
to descriptor mismatch. The hazard was that, while we would ensure that the
learner we were trying to remove was indeed gone, we wouldn't do enough to
ensure that there weren't new learners that were added to the range due to
other concurrent
AdminRelocateRange
calls.This was a violation of its contract and could sometimes fail
AdminRelocateRange
requests.This commit makes it such that when
maybeLeaveAtomicChangeReplicasAndRemoveLearners
returns a non-error response, it guarantees that all the learner replicas
have been removed according to the returned range descriptor.
Resolves #79887
Release note: None