
storage: remove the replica up to date check when transferring lease #42379

Merged 1 commit on Nov 25, 2019

Conversation

@darinpp (Contributor) commented Nov 11, 2019

Previously, we had a check that filtered out all replicas that
were lagging behind the leader as candidates for a lease transfer.

We remove that check so that, when a lease preference names
a node that is constantly lagging, the lease transfer can occur without delay.

This removes the check that restricts the candidates for a lease transfer to
replicas that aren't lagging behind. etcd/raft implements the
leadership transfer extension (§3.10 of the Raft dissertation), in which the
old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.
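For reference, the etcd/raft behavior the description relies on is exposed through the `Node.TransferLeadership` API. The sketch below is illustrative only; the helper name, node IDs, and timeout are assumptions, not CockroachDB code:

```go
import (
	"context"
	"time"

	"go.etcd.io/etcd/raft"
)

// transferRaftLeadership asks the current Raft leader to hand leadership to
// the transferee. While the transfer is in flight, etcd/raft drops new
// proposals on the old leader, catches the transferee's log up via normal
// append messages, and then sends MsgTimeoutNow so the transferee campaigns
// immediately. This is why a lagging follower is still a workable transfer
// target at the Raft level.
func transferRaftLeadership(n raft.Node, leaderID, transfereeID uint64) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	n.TransferLeadership(ctx, leaderID, transfereeID)
}
```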

@cockroach-teamcity (Member)

This change is Reviewable

target := rq.allocator.TransferLeaseTarget(
ctx,
zone,
candidates,
desc.Replicas().Voters(),
Member

I agree that this particular check is misguided, but there'll probably be some way in which we're shooting ourselves in the foot if we just remove it. For example, I'm not seeing any code in TransferLeaseTarget that would stop it from transferring to a node that just went down, so the transferred lease will time out, someone else will get the lease, attempt to transfer again, time out, ... until the store dead timeout has passed. I think we at least want to check the follower's liveness. There is some kind of check here:

existing, _ = a.storePool.liveAndDeadReplicas(rangeID, existing)

but note that it's never reached if we have lease preferences, and it uses the 5min timeout.

store.cfg.NodeLiveness.GetLivenessStatusMap() (or the per-node .GetLiveness(nodeID).IsLive) seems like a better signal.

So you could replace filterBehindReplicas with a new onlyLiveReplicas(...), and perhaps also remove some of the inefficient liveness checking in TransferLeaseTarget (though then all callers would have to prune; maybe it's better to push the liveness map into TransferLeaseTarget instead?)
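To make the suggestion concrete, here is a hedged sketch of what an onlyLiveReplicas filter could look like; the function signature and the liveness predicate are placeholders for illustration, not the actual CockroachDB API:

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// onlyLiveReplicas is a hypothetical stand-in for filterBehindReplicas.
// Rather than dropping replicas that lag the Raft leader, it only drops
// replicas whose node is not currently considered live, e.g. according to
// a map obtained from NodeLiveness.GetLivenessStatusMap().
func onlyLiveReplicas(
	replicas []roachpb.ReplicaDescriptor,
	isLive func(nodeID roachpb.NodeID) bool,
) []roachpb.ReplicaDescriptor {
	live := make([]roachpb.ReplicaDescriptor, 0, len(replicas))
	for _, repl := range replicas {
		if isLive(repl.NodeID) {
			live = append(live, repl)
		}
	}
	return live
}
```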

@darinpp (Contributor, Author)

This is already the case. The first thing that TransferLeaseTarget does is call StorePool.getStoreList. It grabs the NodeLiveness record for each node and calls Liveness.LivenessStatus and in turn Liveness.IsLive.

@tbg (Member) commented Nov 12, 2019

As an end-to-end test, we should do something like #38065: set up a cluster in which latencies are nontrivial, start with lease preferences in region 1 on a table that is seeing steady write traffic, then flip the lease preferences over to a faraway region and make sure that the leases transfer reasonably quickly (and stay where they ought to be). One optional test would be to kill one of the new preferred leaseholders just before you update the leaseholder preference, and make sure that the outage isn't >10s.

Writing this test will be annoying. For one, roachtests don't really hard-code the geography, but you want to know the latencies between them because to get a repro, you want to move the new lease preference as far away from the old one as possible. It also almost seems like you should be able to use ./cockroach demo --global, but this isn't quite there and there's no precedent for writing complex logic against it. You could also look into doing this via shakespeare which would actually be a pretty good fit, but you would have to get some @knz time to get started, and we don't run shakespeare in nightlies yet. Another problem that will have to be solved either way is that - I think - we don't actually have a way to query whether the leaseholder preference is met (we can query crdb_internal.ranges, but this isn't how we want to do it). @andreimatei, are leaseholder prefs in the constraint violation report? I think I looked and found that the answer was no (or the lease transferred faster than I could check).
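On the question of checking whether the leases actually landed where the preference says, here is a hedged sketch of the crude crdb_internal.ranges polling approach mentioned above (which, as noted, is not how we ultimately want to do it); the helper name, the min/max aggregation, and the store ID parameter are assumptions:

```go
import (
	gosql "database/sql"
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForLeaseholders polls crdb_internal.ranges until every range of the
// given table reports its leaseholder on the expected store. A real test
// would prefer a first-class way to ask whether lease preferences are met.
func waitForLeaseholders(t *testing.T, db *gosql.DB, table string, storeID int) {
	testutils.SucceedsSoon(t, func() error {
		var minLH, maxLH int
		if err := db.QueryRow(
			`SELECT min(lease_holder), max(lease_holder)
			   FROM crdb_internal.ranges WHERE table_name = $1`, table,
		).Scan(&minLH, &maxLH); err != nil {
			return err
		}
		if minLH != storeID || maxLH != storeID {
			return fmt.Errorf("leases still on stores in [%d, %d], want %d", minLH, maxLH, storeID)
		}
		return nil
	})
}
```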

@ajwerner (Contributor)

I wonder if there’s a way to utilize the new latency stuff we added for cockroach demo and then write the test as an augmentation of partitionccl/TestRepartitioning by putting some load into the test on a cluster which was set up with high latency. The first commit of #41240 adds leaseholder preferences to that test. The test rather crudely checks for leaseholder placement by examining traces for keys. That may be an easier path to a reliable reproduction cycle than relying on roachtests or hooking up Shakespeare.

@tbg (Member) commented Nov 12, 2019

Using the latency injection in unit tests seems like a good idea in general, if we can make that work I'd greatly prefer it.

@darinpp requested a review from tbg on November 19, 2019 at 04:49
@darinpp (Contributor, Author) commented Nov 19, 2019

I added a test as well. Also verified that the test fails with the old check.

@tbg (Member) left a comment

:lgtm:, thanks!

It looks like the test is flaky:

[Maybe stress pull request] --- FAIL: TestTransferLeaseToLaggingNode (56.02s)
[05:41:21]
[Maybe stress pull request]     replicate_queue_test.go:604: condition failed to evaluate within 45s: waiting for n1 to get the lease
[05:41:21]
[Maybe stress pull request]         goroutine 25 [running]:
[05:41:21]
[Maybe stress pull request]         runtime/debug.Stack(0xc0003161c0, 0x4562000, 0xc00046aa60)
[05:41:21]
[Maybe stress pull request]         	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
[05:41:21]
[Maybe stress pull request]         github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x4660a00, 0xc0001ca800, 0xc0003161c0)
[05:41:21]
[Maybe stress pull request]         	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:37 +0x6b
[05:41:21]
[Maybe stress pull request]         github.com/cockroachdb/cockroach/pkg/storage_test.TestTransferLeaseToLaggingNode(0xc0001ca800)
[05:41:21]
[Maybe stress pull request]         	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replicate_queue_test.go:604 +0x905
[05:41:21]
[Maybe stress pull request]         testing.tRunner(0xc0001ca800, 0x3d9e638)
[05:41:21]
[Maybe stress pull request]         	/usr/local/go/src/testing/testing.go:865 +0xc0
[05:41:21]
[Maybe stress pull request]         created by testing.(*T).Run
[05:41:21]
[Maybe stress pull request]         	/usr/local/go/src/testing/testing.go:916 +0x35a
[05:41:21]
[Maybe stress pull request] FAIL

https://teamcity.cockroachdb.com/viewLog.html?buildId=1597934&buildTypeId=Cockroach_UnitTests_Test&tab=buildLog&_focus=689#_state=689

Make sure it survives stress{,rac}ing for a few minutes on your GCEworker (with STRESSFLAGS='-p 8' or something like that).

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)


pkg/storage/replicate_queue.go, line 940 at r1 (raw file):

Previously, darinpp wrote…

This is already the case. The first thing that TransferLeaseTarget does is call StorePool.getStoreList. It grabs the NodeLiveness record for each node and calls Liveness.LivenessStatus and in turn Liveness.IsLive.

👍


pkg/storage/replicate_queue_test.go, line 567 at r2 (raw file):

	s := sqlutils.MakeSQLRunner(tc.Conns[0])

	// Create a table and set the replication factor to 3

.


pkg/storage/replicate_queue_test.go, line 579 at r2 (raw file):

	}

	// Get the table's range

.


pkg/storage/replicate_queue_test.go, line 618 at r2 (raw file):

		rangeDesc := n1Repl.Desc()
		for _, r := range rangeDesc.Replicas().Voters() {
			// We are interested in the replica on the 3rd node

.

@darinpp force-pushed the immediate_leasholder_change branch 7 times, most recently from 464e5ef to 43e6ff7 on November 25, 2019 at 15:18
Previously, we had a check that filtered out all replicas that
were lagging behind the leader as candidates for a lease transfer.

We had to remove that check because, when the lease preference specifies
a node that is constantly lagging, the lease transfer will never occur.

This removes the check that restricts the candidates for a lease transfer to
replicas that aren't lagging behind. etcd/raft implements the
leadership transfer extension (§3.10 of the Raft dissertation), in which the
old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.
@darinpp (Contributor, Author) commented Nov 25, 2019

bors r+

craig bot pushed a commit that referenced this pull request Nov 25, 2019
42379: storage: remove the replica up to date check when transferring lease r=darinpp a=darinpp

Previously, we had a check that filtered out all replicas that
were lagging behind the leader as candidates for a lease transfer.

We remove that check so that, when a lease preference names
a node that is constantly lagging, the lease transfer can occur without delay.

This removes the check that restricts the candidates for a lease transfer to
replicas that aren't lagging behind. etcd/raft implements the
leadership transfer extension (§3.10 of the Raft dissertation), in which the
old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.

42724: storage: create cluster setting for a minimum lease transfer interval r=nvanbenschoten a=nvanbenschoten

This minimum interval was previously set to 1 second in a constant, which was frustrating when setting up datasets with lease preferences. This change turns the interval into a configurable cluster setting so that it can be changed on-demand.

This value also seems fairly high, but to avoid causing any instability or thrashing I opted not to touch the default for the new cluster setting.

Release note (sql change): A new kv.allocator.min_lease_transfer_interval cluster setting was introduced, which allows the minimum interval between lease transfers initiated from each node to be configured.

Co-authored-by: Darin <darinp@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
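For anyone reproducing the lease-preference setup described in #42724 above, the new setting can be lowered like any other cluster setting. A minimal, hypothetical sketch; the 100ms value is arbitrary:

```go
import gosql "database/sql"

// lowerMinLeaseTransferInterval reduces the minimum interval between lease
// transfers initiated by each node, so freshly configured lease preferences
// are acted on quickly rather than at most once per second.
func lowerMinLeaseTransferInterval(db *gosql.DB) error {
	_, err := db.Exec(
		`SET CLUSTER SETTING kv.allocator.min_lease_transfer_interval = '100ms'`)
	return err
}
```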
@andreimatei (Contributor)

If bors fails, please put a Fixes #38065 line in the commit message.

Also, may I suggest a more user-understandable release note, like:
Release note (bug fix): A bug causing a range's lease to sometimes not be transferred to the location indicated in lease_preferences in geographically-distributed deployments has been fixed.

@craig (bot, Contributor) commented Nov 25, 2019

Build succeeded

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 12, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application, then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristic here that attempts to determine "too much
lag" against the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
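To illustrate the shape of the snapshot check described above, here is a hedged sketch written against etcd/raft's tracker package; the function name, the exact conditions, and the firstIndex plumbing are illustrative, not the actual implementation:

```go
import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// replicaMayNeedSnapshot is an illustrative predicate: it reports whether,
// from the Raft leader's point of view, the follower with the given replica
// ID might need a snapshot to catch up. Only the leader tracks follower
// progress, so the caller must be the leader (as the proposal-buffer check
// guarantees).
func replicaMayNeedSnapshot(st *raft.Status, replicaID, firstIndex uint64) bool {
	pr, ok := st.Progress[replicaID]
	if !ok {
		// Unknown follower: be conservative.
		return true
	}
	if pr.State == tracker.StateSnapshot {
		// Raft has already decided this follower needs a snapshot.
		return true
	}
	// If the follower's next expected entry has already been truncated away,
	// it can only be caught up via a snapshot.
	return pr.Next < firstIndex
}
```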
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 21, 2022
craig bot pushed a commit that referenced this pull request Jun 22, 2022
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter and has thus been removed. Closes #78332.

Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071)

`@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763.
Fixes #79385.
Part of #81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log such that it will be able to apply the lease transfer through log entry application, then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristic here that attempts to determine "too much lag" against the possibility of starvation of lease transfers under sustained write load and a resulting sustained replication lag. See #38065 and #42379, which removed such a heuristic. For now, we don't try to make such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in this world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let it through. 

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to etcd/raft. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in `AdminTransferLease`, which all lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease transfer in the proposal buffer after we have revoked our current lease is more disruptive than doing so earlier, before we have revoked our current lease. Best-effort protection is also able to respond more gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions because the check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the older leader could have caught the incoming leaseholder up on its log, the new leader may not be able to because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about #35701 in #81561.

This race is theoretical — we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in #81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.

83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events
were generated per-key and applied individually to the simulator state
by finding the range containing that key. This patch batches load events
on the same key, then applies the load events in ascending order, over
the range tree.

This results in a speedup of 5x on a 32 store, 32k replicas, 16k QPS
cluster.

Release note: None

Co-authored-by: Saad Ur Rahman <saadurrahman@apache.org>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>