
kvserver: do not allow lease transfers to replicas in StateSnapshot #63507

Conversation

@aayushshah15 (Contributor) commented on Apr 12, 2021

In some roachtest failures (#61541, for example) we've seen disruptive
effects from a range's lease being cooperatively transferred to a
follower replica that was either uninitialized or in need of a snapshot.

As of the time of this change, our default cluster settings make it such
that a snapshot would take ~64 seconds to transfer. This means that, if
our lease rebalancing heuristics (via either the replicate queue or the
store rebalancer) happen to make a decision to transfer a range lease to
a lagging follower during an unfortunate window of time (i.e. when the
follower is in need of a snapshot), we would be wedging all traffic on
that range until this follower applies this snapshot and then applies
the lease.

This commit prevents this behavior by rejecting lease transfers to replicas
that the leader (usually the leaseholder) knows are in need of a snapshot.

Resolves #61604

Release note: None
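
For illustration, here is a minimal sketch of the kind of check described above, assuming the `go.etcd.io/etcd/raft/v3` tracker API that was current at the time; the helper name and wiring are hypothetical, not the actual CockroachDB code:

```go
package main

import (
	"errors"
	"fmt"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// checkLeaseTransferTarget is a hypothetical helper: given the raft leader's
// status, it rejects a lease transfer to a replica that the leader knows is
// waiting for a snapshot. The check is best-effort: if status is nil (this
// replica is not the raft leader), no progress information is available and
// the check is skipped.
func checkLeaseTransferTarget(status *raft.Status, target uint64) error {
	if status == nil {
		return nil // not the leader; skip the best-effort check
	}
	pr, ok := status.Progress[target]
	if !ok {
		return nil // no progress entry for the target; let other validation handle it
	}
	if pr.State == tracker.StateSnapshot {
		return errors.New("cannot transfer lease to replica in need of a snapshot")
	}
	return nil
}

func main() {
	// Fabricated example: replica 7 is mid-snapshot, replica 2 is replicating.
	st := &raft.Status{Progress: map[uint64]tracker.Progress{
		7: {State: tracker.StateSnapshot},
		2: {State: tracker.StateReplicate},
	}}
	fmt.Println(checkLeaseTransferTarget(st, 7)) // rejected
	fmt.Println(checkLeaseTransferTarget(st, 2)) // <nil>, allowed
}
```

The error message mirrors the one quoted in the review thread below; the rest of the wiring is illustrative only.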

@cockroach-teamcity (Member) commented:

This change is Reviewable

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch 2 times, most recently from 37740be to a7e0cab on April 12, 2021 23:41
@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch 4 times, most recently from 1e19f06 to 6699dd3 on April 13, 2021 06:32
@nvanbenschoten (Member) left a comment

Reviewed 1 of 1 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 55 at r2 (raw file):

}

var errTransferLeaseToStateSnapshot = errors.Errorf("cannot transfer lease to replica in need of a snapshot")

Does kvnemesis hit this under stress and with a high rate of rebalancing + lease transfers?


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 90 at r2 (raw file):

	}

	if raftProgressTracker := cArgs.EvalCtx.GetProgress(); raftProgressTracker != nil {

Let's give this a comment, explaining the purpose of this check and that it is best-effort and only performed if the leaseholder also happens to be the leader.


pkg/kv/kvserver/batcheval/eval_context.go, line 102 at r2 (raw file):

	GetRangeInfo(context.Context) roachpb.RangeInfo

	// GetProgress returns this replica's raft progress tracker.

GetProgress is pretty ambiguous. What "progress" are we talking about? How about GetRaftProgress?


pkg/kv/kvserver/batcheval/eval_context.go, line 102 at r2 (raw file):

	GetRangeInfo(context.Context) roachpb.RangeInfo

	// GetProgress returns this replica's raft progress tracker.

What do you think about exposing the entire RaftStatus() method here? It seems more generally useful and allows us to be more explicit about the fact that the check you are adding is best-effort and only run if the leaseholder happens to also be the leader.
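
For reference, a heavily abbreviated, hypothetical sketch of what exposing the full raft status on the interface could look like (not the real batcheval.EvalContext):

```go
package batchevalsketch

import "go.etcd.io/etcd/raft/v3"

// EvalContext is a heavily abbreviated, hypothetical stand-in for the real
// batcheval.EvalContext interface discussed above.
type EvalContext interface {
	// RaftStatus returns this replica's raft status. Only the raft leader
	// tracks per-follower Progress, so on a follower the Progress map is
	// empty and any check built on it has to be best-effort.
	RaftStatus() *raft.Status
}
```

Returning the full status rather than just a progress map also lets callers see this replica's RaftState and judge how authoritative the progress information is.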

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch 2 times, most recently from 785fe33 to 7d8fa94 on April 22, 2021 19:28
@andreimatei (Contributor) left a comment

looks good to me, but a test would be bomb :) Ideally a unit test for TransferLease but I don't know if that's doable without a full cluster.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/replica_eval_context_span.go, line 217 at r3 (raw file):

// GetProgress is part of the EvalContext interface.
func (rec SpanSetReplicaEvalContext) RaftStatus() *raft.Status {
return rec.i.RaftStatus()

I think this is missing a tab :P


pkg/kv/kvserver/batcheval/eval_context.go, line 102 at r3 (raw file):

	GetRangeInfo(context.Context) roachpb.RangeInfo

	// RaftStatus returns this replica's raft status tracker.

maybe add some comments here about what a leader sees vs what a follower sees?

@aayushshah15 (Contributor, Author) commented

> looks good to me, but a test would be bomb :)

The super aggressive log truncation is currently making the test flake, but I did add a test for it. Am I misunderstanding you?

@aayushshah15 self-assigned this on Apr 23, 2021
@andreimatei (Contributor) left a comment

I just missed it; reviewable fail.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/client_lease_test.go, line 289 at r3 (raw file):

		ReplicationMode: base.ReplicationManual,
		ServerArgs: base.TestServerArgs{
			// Encourage very aggressive log truncations.

tell me why. Isn't this redundant with MustForceRaftLogScanAndProcess ?


pkg/kv/kvserver/client_lease_test.go, line 334 at r3 (raw file):

		progress := leaseholderStore.RaftStatus(scratchDesc.RangeID).Progress
		if progress == nil {
			return errors.Errorf("expected leaseholder to be collocated with the leader")

nit: these errors don't tell me whether they're expected until they eventually go away, or whether they're never expected. If they are expected for a while, consider rephrasing as "leaseholder not yet collocated with the leader".

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch 4 times, most recently from 906ff76 to 0bf64a3 on April 24, 2021 21:56
@aayushshah15 (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/client_lease_test.go, line 289 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

tell me why. Isn't this redundant with MustForceRaftLogScanAndProcess ?

You're right, this wasn't needed.


pkg/kv/kvserver/client_lease_test.go, line 334 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: these errors don't tell me whether they're expected until they eventually go away, or whether they're never expected. If they are expected for a while, consider rephrasing as "leaseholder not yet collocated with the leader".

Done.


pkg/kv/kvserver/replica_eval_context_span.go, line 217 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this is missing a tab :P

Fixed.


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 90 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's give this a comment, explaining the purpose of this check and that it is best-effort and only performed if the leaseholder also happens to be the leader.

Done.


pkg/kv/kvserver/batcheval/eval_context.go, line 102 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What do you think about exposing the entire RaftStatus() method here? It seems more generally useful and allows us to be more explicit about the fact that the check you are adding is best-effort and only run if the leaseholder happens to also be the leader.

Done.


pkg/kv/kvserver/batcheval/eval_context.go, line 102 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe add some comments here about what a leader sees vs what a follower sees?

Done.

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch from 0bf64a3 to 0b306e7 on April 24, 2021 22:35
@aayushshah15 (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 55 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does kvnemesis hit this under stress and with a high rate of rebalancing + lease transfers?

I had a hard time hitting it with the 16MB COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD but with something like 1MB I can get kvnemesis to hit this reasonably quickly.

 error applying x.TransferLeaseOperation(ctx, /Table/50/"e5f12ee7", 7) // cannot transfer lease to replica in need of a snapshot

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch from 0b306e7 to eb940d3 on April 24, 2021 22:40
@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 1 of 4 files at r3, 9 of 9 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)

Previously, the testing knob was only being used to get the snapshot
queue to skip over learners or non-voters. However, the intention had
always been to be able to get it to skip over voting replicas as well.

Release note: None
@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch from eb940d3 to 6495ddf on August 6, 2021 18:15
@aayushshah15 requested a review from a team as a code owner on August 6, 2021 18:15
@nvanbenschoten (Member) commented

One change we could make to harden the protection here is to only allow a leaseholder to transfer its lease away if it is also the leader. However, that still is inherently racy because a leader doesn't immediately know that it needs to send a snapshot to a follower in need. So then we could change the condition from rejecting lease transfers to replicas in StateSnapshot to only allowing lease transfers to replicas in StateReplicate.

But we would need to be careful that doing so doesn't introduce an undesirable cyclical dependency between lease transfers and raft leaders. Those almost always have subtle and unexpected effects - just ask @andreimatei.

And this will all still be a little racy if we don't ensure lease transfers and log truncation are properly synchronized.
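
For concreteness, a sketch of that stricter condition (hypothetical helper, same etcd raft v3 tracker assumptions as the snippet near the top of this page; not the eventual implementation):

```go
package leasecheck // hypothetical package

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// targetMaySafelyReceiveLease is a hypothetical predicate for the stricter
// condition: allow a transfer only when the leader sees the target in
// StateReplicate. Targets in StateProbe (match index unknown to the leader)
// or StateSnapshot are rejected.
func targetMaySafelyReceiveLease(status *raft.Status, target uint64) bool {
	if status == nil {
		// Not the raft leader: no progress information, so the best-effort
		// check cannot be applied here.
		return true
	}
	pr, ok := status.Progress[target]
	return ok && pr.State == tracker.StateReplicate
}
```

Note that this also rejects StateProbe targets, i.e. followers whose match index the leader does not yet know.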

@nvanbenschoten (Member) commented

> So then we could change the condition from rejecting lease transfers to replicas in StateSnapshot to only allowing lease transfers to replicas in StateReplicate.

This actually seems like a targeted improvement to this PR that incurs very little additional risk. What do you think Aayush?

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch from 6495ddf to 6ff8fe4 on August 10, 2021 13:02
@aayushshah15 (Contributor, Author) commented on Aug 10, 2021

> This actually seems like a targeted improvement to this PR that incurs very little additional risk. What do you think Aayush?

Makes sense, I made the change.

The test added in this PR ended up being a little flaky because the raft leader doesn't learn about the present state of a follower until it receives a subsequent MsgAppResp from it. I've updated the test to account for this.
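
As an illustration of what the test has to account for, here is a generic polling sketch (not the actual test code; CockroachDB tests would typically use their own retry helpers):

```go
package leasecheck // hypothetical package

import (
	"time"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// waitForStateReplicate polls the leader's raft status until it reports the
// target replica in StateReplicate, or gives up after the timeout. A test
// would run something like this before asserting on lease-transfer behavior,
// since the leader only learns a follower's state from a later MsgAppResp.
func waitForStateReplicate(statusFn func() *raft.Status, target uint64, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if st := statusFn(); st != nil {
			if pr, ok := st.Progress[target]; ok && pr.State == tracker.StateReplicate {
				return true
			}
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}
```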

@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch from 6ff8fe4 to cbfed38 on August 10, 2021 19:21
@aayushshah15 (Contributor, Author) commented

@nvanbenschoten, I'd like your take on the following:

I'm not super sure about the approach implemented by this PR anymore. It makes AdminRelocateRange flaky (and thus its callers, like EXPERIMENTAL_RELOCATE ...) because AdminRelocateRange attempts to transfer the range lease right after a rebalance operation. Often, what ends up happening is that the leader has the lease target in StateSnapshot until it receives some kind of raft message response from this potential lease target (which might take up to a heartbeat interval unless there's traffic on the range). This makes AdminRelocateRange fail to transfer the lease (which can fail the entire operation if the lease transfer was required to remove the current leaseholder).

  • We can simply add some kind of retry logic here, and that will "solve" this issue, but I'm not sure how we feel about that.

  • Another possibility is that we don't reject anything at the command evaluation level, but instead have the allocator / store-rebalancer avoid transferring leases to replicas that the leader knows are not in StateReplicate.

What do you think?

@nvanbenschoten (Member) commented

Both of these options sound reasonable to me. I think it would help to spell out each of the places that attempt to transfer leases and discuss what will happen in each case if they run into this new error. Will some mechanism retry? How much work will be rolled back? Do we risk thrashing?

This will help us determine where the check and any retry loops should live. For instance, it may lead us to decide that we need a retry loop in AdminTransferLease and that all other initiators of lease transfers already properly handle this error.

In some roachtest failures (cockroachdb#61541, for example) we've seen disruptive
effects from a range's lease being cooperatively transferred to a
follower replica that was either uninitialized or in need of a snapshot.

As of the time of this change, our default cluster settings make it such
that a snapshot would take ~64 seconds to transfer. This means that, if
our lease rebalancing heuristics (via either the replicate queue or the
store rebalancer) happen to make a decision to transfer a range lease to
a lagging follower during an unfortunate window of time (i.e. when the
follower is in need of a snapshot), we would be wedging all traffic on
that range until this follower applies this snapshot and then applies
the lease.

This commit prevents this behavior by rejecting lease transfers to replicas
that are in need of a snapshot and replicas that the leader doesn't yet know
the match index of (i.e. replicas in `StateProbe`).

Release note: None
@aayushshah15 force-pushed the 20210409_leaseTransferToStateSnapshot branch from cbfed38 to 7bb3f5c on August 13, 2021 00:54
@aayushshah15 (Contributor, Author) commented

> I think it would help to spell out each of the places that attempt to transfer
> leases and discuss what will happen in each case if they run into this new
> error.

There are three places where we initiate lease transfers:

  1. via the replicateQueue
  2. via the StoreRebalancer
  3. via AdminRelocateRange (used by the merge queue, store-rebalancer, and by EXPERIMENTAL_RELOCATE)

If 1 & 2 run into this new error, they will log an error and move on. The
replicateQueue will reconsider the replica in the next scanner cycle, and the
StoreRebalancer might reconsider the replica after about a minute or so.

For (3), AdminRelocateRange will try to do lease transfers for two reasons: if
the current leaseholder needs to be removed, or in order to move the lease to
the first voting replica supplied to AdminRelocateRange (this is part of its
contract). If it gets this error, it will return an error to its client without
rolling any of its completed relocations back.

So, I don't believe we're risking thrashing of any sort under any circumstance.
I do think adding retry logic in AdminTransferLease or even in AdminRelocateRange
is a little fraught, because it feels hard to reason about the consequences of adding
non-deterministic delays to something like this. Additionally, a raft heartbeat
interval is ~1s based on our defaults, which is quite long (and it doesn't account
for replication lag).

I'm going to try a new patch that teaches the replicateQueue and StoreRebalancer
to not initiate lease transfers to replicas in need of snapshots. It should be
a fair bit clearer and a much more reasonable backport candidate.
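
For illustration, a rough sketch of what that allocator-side filtering could look like (hypothetical helper and types, same raft-tracker assumptions as the snippets above; the change that actually landed is #69696, referenced below):

```go
package leasecheck // hypothetical package

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// filterLeaseTransferCandidates drops candidate replica IDs that the raft
// leader does not currently see in StateReplicate, so a caller such as a
// replication queue or store rebalancer never even proposes a transfer to a
// follower that is probing or waiting for a snapshot.
func filterLeaseTransferCandidates(status *raft.Status, candidates []uint64) []uint64 {
	if status == nil {
		// Not the leader: no progress information; leave the list unchanged
		// and let the caller decide whether to proceed or wait.
		return candidates
	}
	filtered := make([]uint64, 0, len(candidates))
	for _, id := range candidates {
		if pr, ok := status.Progress[id]; ok && pr.State == tracker.StateReplicate {
			filtered = append(filtered, id)
		}
	}
	return filtered
}
```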

craig bot pushed a commit that referenced this pull request Sep 2, 2021
68652: kvserver: count draining nodes as live when computing quorum r=aayushshah15 a=aayushshah15

Similar to #67714

Draining nodes were considered non-live by the allocator when it made the
determination of whether a range could achieve quorum. This meant that, for
instance, on a cluster with a replication factor of 5, if we had 3 or more
nodes marked draining, we (with a high likelihood) wouldn't be able to
decommission _any other_ nodes from the cluster.

Furthermore, for the same reason as above, the system would also incorrectly
decide not to rebalance ranges that had more than a quorum of replicas on
draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
subset of "stuck decommissioning" issues we've seen in the wild.

Follows from cockroachlabs/support#1105

Release justification: bug fix

Release note(bug fix): Previously, draining a quorum of nodes (i.e. >=2 if
replication factor is 3, >=3 if replication factor is 5, etc) would block the
subsequent decommissioning of any other nodes in the cluster. This patch fixes
this bug. Now, if the lowest replication factor of some zone in the cluster is
RF, operators should be able to safely drain up to RF-1 nodes simultaneously.
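
A minimal sketch of the quorum rule described in this commit message (hypothetical types; the real allocator logic is considerably more involved):

```go
package quorumsketch

// replicaLiveness is a hypothetical, simplified view of a replica's node:
// whether the node is live and whether it is draining.
type replicaLiveness struct {
	Live     bool
	Draining bool
}

// rangeHasQuorum mirrors the rule described above: replicas on draining
// nodes count as live when deciding whether a range can still reach quorum,
// because a draining node is still up, it is just shedding leases.
func rangeHasQuorum(replicas []replicaLiveness) bool {
	live := 0
	for _, r := range replicas {
		if r.Live || r.Draining {
			live++
		}
	}
	return live*2 > len(replicas) // strict majority of the full replica set
}
```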

69115: mt_start_sql: enable enterprise features for multitenant sql servers r=JeffSwenson a=JeffSwenson

Enterprise features are controlled by the enterprise.license setting.
Currently this setting applies only to the host tenant cluster. This
change enables enterprise features for all tenant clusters. Enabling
enterprise features for all tenants is reasonable, because multi-tenant
deployments are an enterprise feature.

Release note: None

69571: jobs: don't reset next_run when resuming active schedules r=pbardea a=pbardea

Consider an active schedule that was created with a specific first_run
time. The first_run would populate the next_run_time on the schedule.
The resumption of this schedule before it executed would re-evaluate the
next run time based on the schedule's recurrence.

This commit changes the scheduling system to only recompute the next run
time on paused schedules.

Release justification: bug fix
Release note (bug fix): Fix a bug where resuming an active schedule
would always reset its next run time. This was sometimes undesirable
with schedules that had a first_run option specified.

69696: kvserver: stop transferring leases to replicas that may need snapshots r=aayushshah15 a=aayushshah15

This commit disallows the `replicateQueue` from initiating lease
transfers to replicas that may be in need of a raft snapshot. Note that
the `StoreRebalancer` already has a stronger form of this check since it
disallows lease transfers to replicas that are lagging behind the raft
leader (which includes the set of replicas that need a snapshot).

In cases where the raft leader is not the leaseholder, we disallow the
replicateQueue from any sort of lease transfer until leaseholdership and
leadership are collocated. We rely on calls to
`maybeTransferRaftLeadershipToLeaseholderLocked()` (called on every raft
tick) to make sure that such periods of leadership / leaseholdership
misalignment are ephemeral and rare.

Alternative to #63507

Release justification: bug fix

Resolves #61604

Release note (bug fix): Fixes a bug that can cause prolonged
unavailability due to lease transfer to a replica that may be in need of
a raft snapshot.
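
A sketch of the combined condition described in this commit message (hypothetical predicate, same raft-tracker assumptions as the snippets above):

```go
package leasecheck // hypothetical package

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// shouldConsiderLeaseTransfer is a hypothetical predicate mirroring the two
// conditions described above: if this replica is not currently the raft
// leader, skip lease transfers entirely until leadership and leaseholdership
// are collocated; otherwise only consider targets that the leader sees in
// StateReplicate.
func shouldConsiderLeaseTransfer(status *raft.Status, target uint64) bool {
	if status == nil || status.RaftState != raft.StateLeader {
		return false
	}
	pr, ok := status.Progress[target]
	return ok && pr.State == tracker.StateReplicate
}
```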

69727: kv: deflake TestPriorityRatchetOnAbortOrPush r=nvanbenschoten a=nvanbenschoten

Fixes #68584.

The test was flaky for the reasons described in #68584. There doesn't appear to
be an easy way to fix this behavior, and it's not clear how valuable doing so
even is given how little we rely on transaction priorities anymore, so the
commit just deflakes the test by rejecting them.

Release justification: deflaking a test.

69728: clusterversion,kv: remove old cluster versions corresponding to 20.2 and 21.1 r=nvanbenschoten a=nvanbenschoten

Fixes most of #66544.

This pull request removes (almost) all old cluster versions corresponding to 20.2 and 21.1 which fall under the KV group's responsibility.

The PR leaves one remaining "combo version" of `TruncatedAndRangeAppliedStateMigration` plus `PostTruncatedAndRangeAppliedStateMigration`, which we'll want to remove before closing #66544. I've left this for @irfansharif both because its removal is a little more involved than the others and because I figured that he deserves to reap the fruit of his labor and get to delete the code related to the replicated truncated state and missing RangeAppliedStateKeys. Making that possible was a hard-earned win.

Release justification: cluster version cleanup

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Jeff <swenson@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@aayushshah15 (Contributor, Author) commented

Closing in favor of #69696
