storage: remove the replica up to date check when transferring lease #42379
Conversation
```go
target := rq.allocator.TransferLeaseTarget(
	ctx,
	zone,
	candidates,
	desc.Replicas().Voters(),
```
I agree that this particular check is misguided, but there'll probably be some way in which we're shooting ourselves in the foot if we just remove it. For example, I'm not seeing any code in `TransferLeaseTarget` that would stop it from transferring to a node that just went down, so the transferred lease will time out, someone else will get the lease, attempt to transfer again, time out, ... until the store dead timeout has passed. I think we at least want to check the follower's liveness. There is some kind of check in `pkg/storage/allocator.go` (line 852 at `0cf7360`):

`existing, _ = a.storePool.liveAndDeadReplicas(rangeID, existing)`

but note that it's never reached if we have lease preferences, and it uses the 5min timeout. `store.cfg.NodeLiveness.GetLivenessStatusMap()` (or the per-node `.GetLiveness(nodeID).IsLive`) seems like a better signal.

So you could replace `filterBehindReplicas` with a new `onlyLiveReplicas(...)`, and perhaps also remove some of the inefficient liveness checking in `TransferLeaseTarget` (though all callers would have to prune; maybe it's better to push the liveness map into `TransferLeaseTarget` instead?)
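A minimal sketch of what the suggested `onlyLiveReplicas` helper could look like, assuming a liveness map keyed by node ID such as the one `GetLivenessStatusMap` provides. The type and function names below stand in for the real `storage` package types and are illustrative, not CockroachDB's actual API:

```go
package main

import "fmt"

// NodeID and ReplicaDescriptor stand in for the real storage types.
type NodeID int

type ReplicaDescriptor struct {
	NodeID NodeID
}

// onlyLiveReplicas keeps only the replicas whose node is currently live
// according to a liveness map (e.g. built from GetLivenessStatusMap).
func onlyLiveReplicas(replicas []ReplicaDescriptor, isLive map[NodeID]bool) []ReplicaDescriptor {
	var live []ReplicaDescriptor
	for _, r := range replicas {
		if isLive[r.NodeID] {
			live = append(live, r)
		}
	}
	return live
}

func main() {
	replicas := []ReplicaDescriptor{{NodeID: 1}, {NodeID: 2}, {NodeID: 3}}
	liveness := map[NodeID]bool{1: true, 2: false, 3: true}
	fmt.Println(onlyLiveReplicas(replicas, liveness)) // n2 is filtered out
}
```

Unlike `filterBehindReplicas`, a filter like this depends only on liveness, so a constantly lagging (but live) preferred node remains a valid transfer target.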
This is already the case. The first thing that `TransferLeaseTarget` does is call `StorePool.getStoreList`. It grabs the `NodeLiveness` record for each node and calls `Liveness.LivenessStatus` and in turn `Liveness.IsLive`.
As an end-to-end test, we should do something like #38065: set up a cluster in which latencies are nontrivial, start with lease preferences in region 1 on a table that is seeing steady write traffic, then flip the lease preferences over to a faraway region and make sure that the leases transfer reasonably quickly (and stay where they ought to be). One optional test would be to kill one of the new preferred leaseholders just before you update the leaseholder preference, and make sure that the outage isn't >10s. Writing this test will be annoying. For one, roachtests don't really hard-code the geography, but you want to know the latencies between regions because, to get a repro, you want to move the new lease preference as far away from the old one as possible. It also almost seems like you should be able to use
I wonder if there's a way to use the new latency injection we added for `cockroach demo` and then write the test as an augmentation of `partitionccl/TestRepartitioning` by putting some load into the test on a cluster that was set up with high latency. The first commit of #41240 adds leaseholder preferences to that test. The test rather crudely checks for leaseholder placement by examining traces for keys. That may be an easier path to a reliable reproduction cycle than relying on roachtests or hooking up Shakespeare.
Using the latency injection in unit tests seems like a good idea in general; if we can make that work, I'd greatly prefer it.
Force-pushed from c5e925d to 0de2008.
I added a test as well. Also verified that the test fails with the old check.
It looks like the test is flaky:
```
--- FAIL: TestTransferLeaseToLaggingNode (56.02s)
    replicate_queue_test.go:604: condition failed to evaluate within 45s: waiting for n1 to get the lease
    goroutine 25 [running]:
    runtime/debug.Stack(0xc0003161c0, 0x4562000, 0xc00046aa60)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x9d
    github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x4660a00, 0xc0001ca800, 0xc0003161c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:37 +0x6b
    github.com/cockroachdb/cockroach/pkg/storage_test.TestTransferLeaseToLaggingNode(0xc0001ca800)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replicate_queue_test.go:604 +0x905
    testing.tRunner(0xc0001ca800, 0x3d9e638)
        /usr/local/go/src/testing/testing.go:865 +0xc0
    created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:916 +0x35a
FAIL
```
Make sure it survives stress{,rac}ing for a few minutes on your GCE worker (with `STRESSFLAGS='-p 8'` or something like that).
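For reference, the suggested stress invocations might look like the following; the flag value comes from the comment above, and actually running these requires a CockroachDB checkout on a GCE worker, so the sketch only prints the commands:

```shell
# Commands a reviewer would run from the repo root on a GCE worker.
# Printed rather than executed so this sketch runs anywhere.
echo "make stress     PKG=./pkg/storage TESTS=TestTransferLeaseToLaggingNode STRESSFLAGS='-p 8'"
echo "make stressrace PKG=./pkg/storage TESTS=TestTransferLeaseToLaggingNode STRESSFLAGS='-p 8'"
```

`stressrace` is the race-detector variant implied by "stress{,rac}ing" above.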
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)
pkg/storage/replicate_queue.go, line 940 at r1 (raw file):
Previously, darinpp wrote…
> This is already the case. The first thing that `TransferLeaseTarget` does is call `StorePool.getStoreList`. It grabs the `NodeLiveness` record for each node and calls `Liveness.LivenessStatus` and in turn `Liveness.IsLive`.
👍
pkg/storage/replicate_queue_test.go, line 567 at r2 (raw file):
```go
s := sqlutils.MakeSQLRunner(tc.Conns[0]) // Create a table and set the replication factor to 3
```
.
pkg/storage/replicate_queue_test.go, line 579 at r2 (raw file):
```go
} // Get the table's range
```
.
pkg/storage/replicate_queue_test.go, line 618 at r2 (raw file):
```go
rangeDesc := n1Repl.Desc()
for _, r := range rangeDesc.Replicas().Voters() { // We are interested in the replica on the 3rd node
```
.
Force-pushed from 464e5ef to 43e6ff7.
Previously, we had a check that filtered out all replicas that are lagging behind the leader in case of a lease transfer. We had to remove that check because when the lease preference specifies a node that is constantly lagging, the lease transfer would never occur. This removes the check that the candidates for lease transfer are only replicas that aren't lagging behind. etcd implements the leadership transfer extension (§3.10 of the Raft thesis), where the old leader brings the new leader's log up to date while blocking any new requests. Release note (bug fix): It is now possible to transfer range leases to lagging replicas.
Force-pushed from 43e6ff7 to 371738c.
bors r+
42379: storage: remove the replica up to date check when transferring lease r=darinpp a=darinpp

Previously, we had a check that filtered out all replicas that are lagging behind the leader in case of a lease transfer. We remove that check so that, when a lease preference points at a node that is constantly lagging, the lease transfer can occur without delay. This removes the check that the candidates for lease transfer are only replicas that aren't lagging behind. etcd implements the leadership transfer extension (§3.10 of the Raft thesis), where the old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.

42724: storage: create cluster setting for a minimum lease transfer interval r=nvanbenschoten a=nvanbenschoten

This minimum interval was previously set to 1 second in a constant, which was frustrating when setting up datasets with lease preferences. This change turns the interval into a configurable cluster setting so that it can be changed on demand. This value also seems fairly high, but to avoid causing any instability or thrashing I opted not to touch the default for the new cluster setting.

Release note (sql change): A new `kv.allocator.min_lease_transfer_interval` cluster setting was introduced, which allows the minimum interval between lease transfers initiated from each node to be configured.

Co-authored-by: Darin <darinp@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
If bors fails, please put a
Also, may I suggest a more user-understandable release note like:
Build succeeded
Fixes cockroachdb#81763. Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log such that it will be able to apply the lease transfer through log entry application, then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristic here that attempts to determine "too much lag" against the possibility of starvation of lease transfers under sustained write load and a resulting sustained replication lag. See cockroachdb#38065 and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in this world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to `etcd/raft`. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in `AdminTransferLease`, which all lease transfer operations flow through. The change includes protection at two levels because rejecting a lease transfer in the proposal buffer after we have revoked our current lease is more disruptive than doing so earlier, before we have revoked our current lease. Best-effort protection is also able to respond more gracefully to invalid targets (e.g. it picks the next best target).

However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions, because the check is performed:

1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the older leader could have caught the incoming leaseholder up on its log, the new leader may not be able to because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about cockroachdb#35701 in cockroachdb#81561. This race is theoretical; we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in cockroachdb#81764 to limit its blast radius.

---

Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.
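The distinction drawn above can be sketched with a small model, assuming (as in `etcd/raft`'s Progress tracking) that the leader knows each follower's next expected log index and its own first retained index after log truncation. The type and function names here are illustrative, not CockroachDB's actual code:

```go
package main

import "fmt"

// followerProgress models the per-follower replication state a Raft
// leader tracks; Next is the next log index the leader would send.
type followerProgress struct {
	Next uint64
}

// needsSnapshot reports whether catching this follower up requires a Raft
// snapshot: once the leader's log has been truncated so that it starts at
// firstIndex, any follower whose Next index falls before firstIndex can no
// longer be caught up by ordinary log replication.
func needsSnapshot(firstIndex uint64, p followerProgress) bool {
	return p.Next < firstIndex
}

func main() {
	const firstIndex = 100 // leader's log was truncated up to index 99
	// Connected to the leader's log: safe to transfer the lease.
	fmt.Println(needsSnapshot(firstIndex, followerProgress{Next: 150}))
	// Disconnected: a transfer would stall until a snapshot completes.
	fmt.Println(needsSnapshot(firstIndex, followerProgress{Next: 80}))
}
```

Holding latches that block log truncation while performing this check in the proposal buffer is what makes the comparison against `firstIndex` stable, which is why only that check is airtight.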
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter, so it has been removed. Closes #78332. Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071) `@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763. Fixes #79385. Part of #81561. The full description of this change (background, patch details, and remaining work) appears earlier in this thread.

83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events were generated per key and applied individually to the simulator state by finding the range containing that key. This patch batches load events on the same key, then applies the load events in ascending order over the range tree. This results in a speedup of 5x on a 32-store, 32k-replica, 16k-QPS cluster. Release note: None

Co-authored-by: Saad Ur Rahman <saadurrahman@apache.org>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Previously, we had a check that filtered out all replicas that
are lagging behind the leader in case of a lease transfer.
We remove that check so that, when a lease preference points at
a node that is constantly lagging, the lease transfer can occur without delay.
This removes the check that the candidates for lease transfer are only
replicas that aren't lagging behind. etcd implements the leadership
transfer extension (§3.10 of the Raft thesis), where the old leader
brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.
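The leadership transfer handshake the commit message relies on (Raft thesis §3.10, as implemented by etcd/raft) can be modeled with a small self-contained sketch; `peer` and `transferLeadership` are illustrative names, not etcd/raft's API:

```go
package main

import "fmt"

// peer models a Raft participant by its log alone.
type peer struct {
	log []string
}

// transferLeadership models the §3.10 handshake: the old leader stops
// accepting new proposals, replicates any missing entries to the
// transferee, and then tells it to call an election immediately (the
// equivalent of etcd/raft's MsgTimeoutNow, modeled here as a print).
func transferLeadership(leader, transferee *peer) {
	// 1. New client proposals are blocked for the duration (implicit:
	//    nothing is appended to leader.log below).
	// 2. Catch the transferee's log up to the leader's.
	for i := len(transferee.log); i < len(leader.log); i++ {
		transferee.log = append(transferee.log, leader.log[i])
	}
	// 3. Prompt the transferee to campaign without waiting for an
	//    election timeout.
	fmt.Println("transferee log caught up:", len(transferee.log) == len(leader.log))
}

func main() {
	leader := &peer{log: []string{"a", "b", "c", "d"}}
	follower := &peer{log: []string{"a", "b"}}
	transferLeadership(leader, follower)
	fmt.Println(follower.log)
}
```

This is why removing the up-to-date check is safe from etcd's point of view: the outgoing leader catches the target up as part of the transfer, so a lagging target delays the handshake rather than breaking it.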