
storage: remove the replica up to date check when transferring lease #42379

Merged 1 commit on Nov 25, 2019

Conversation

@darinpp (Contributor) commented Nov 11, 2019

Previously, we had a check that filtered out all replicas that
were lagging behind the leader as candidates for a lease transfer.

We remove that check so that, when a lease preference names
a node that is constantly lagging, the lease transfer can occur without delay.

This removes the check that restricts the candidates for a lease transfer to
replicas that aren't lagging behind. etcd/raft implements the
leadership transfer extension (§3.10 of the Raft dissertation), in which the
old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.
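For reference, the etcd/raft behavior the description relies on is exposed through the `Node.TransferLeadership` API. The sketch below is illustrative only; the helper name, node IDs, and timeout are assumptions, not CockroachDB code:

```go
import (
	"context"
	"time"

	"go.etcd.io/etcd/raft"
)

// transferRaftLeadership asks the current Raft leader to hand leadership to
// the transferee. While the transfer is in flight, etcd/raft drops new
// proposals on the old leader, catches the transferee's log up via normal
// append messages, and then sends MsgTimeoutNow so the transferee campaigns
// immediately. This is why a lagging follower is still a workable transfer
// target at the Raft level.
func transferRaftLeadership(n raft.Node, leaderID, transfereeID uint64) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	n.TransferLeadership(ctx, leaderID, transfereeID)
}
```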

@cockroach-teamcity (Member)

This change is Reviewable

target := rq.allocator.TransferLeaseTarget(
ctx,
zone,
candidates,
desc.Replicas().Voters(),
Member

I agree that this particular check is misguided, but there'll probably be some way in which we're shooting ourselves in the foot if we just remove it. For example, I'm not seeing any code in TransferLeaseTarget that would stop it from transferring to a node that just went down, so the transferred lease will time out, someone else will get the lease, attempt to transfer again, time out, ... until the store dead timeout has passed. I think we at least want to check the follower's liveness. There is some kind of check here:

existing, _ = a.storePool.liveAndDeadReplicas(rangeID, existing)

but note that it's never reached if we have lease preferences, and it uses the 5min timeout.

store.cfg.NodeLiveness.GetLivenessStatusMap() (or the per-node .GetLiveness(nodeID).IsLive) seems like a better signal.

So you could replace filterBehindReplicas with a new onlyLiveReplicas(...), and perhaps also remove some of the inefficient liveness checking in TransferLeaseTarget (though then all callers would have to prune; maybe it's better to push the liveness map into TransferLeaseTarget instead?)
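To make the suggestion concrete, here is a hedged sketch of what an onlyLiveReplicas filter could look like; the function signature and the liveness predicate are placeholders for illustration, not the actual CockroachDB API:

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// onlyLiveReplicas is a hypothetical stand-in for filterBehindReplicas.
// Rather than dropping replicas that lag the Raft leader, it only drops
// replicas whose node is not currently considered live, e.g. according to
// a map obtained from NodeLiveness.GetLivenessStatusMap().
func onlyLiveReplicas(
	replicas []roachpb.ReplicaDescriptor,
	isLive func(nodeID roachpb.NodeID) bool,
) []roachpb.ReplicaDescriptor {
	live := make([]roachpb.ReplicaDescriptor, 0, len(replicas))
	for _, repl := range replicas {
		if isLive(repl.NodeID) {
			live = append(live, repl)
		}
	}
	return live
}
```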

@darinpp (Contributor, Author)

This is already the case. The first thing that TransferLeaseTarget does is call StorePool.getStoreList. It grabs the NodeLiveness record for each node and calls Liveness.LivenessStatus and in turn Liveness.IsLive.

@tbg (Member) commented Nov 12, 2019

As an end-to-end test, we should do something like #38065: set up a cluster in which latencies are nontrivial, start with lease preferences in region 1 on a table that is seeing steady write traffic, then flip the lease preferences over to a faraway region and make sure that the leases transfer reasonably quickly (and stay where they ought to be). One optional test would be to kill one of the new preferred leaseholders just before you update the leaseholder preference, and make sure that the outage isn't >10s.

Writing this test will be annoying. For one, roachtests don't really hard-code the geography, but you want to know the latencies between them because to get a repro, you want to move the new lease preference as far away from the old one as possible. It also almost seems like you should be able to use ./cockroach demo --global, but this isn't quite there and there's no precedent for writing complex logic against it. You could also look into doing this via shakespeare which would actually be a pretty good fit, but you would have to get some @knz time to get started, and we don't run shakespeare in nightlies yet. Another problem that will have to be solved either way is that - I think - we don't actually have a way to query whether the leaseholder preference is met (we can query crdb_internal.ranges, but this isn't how we want to do it). @andreimatei, are leaseholder prefs in the constraint violation report? I think I looked and found that the answer was no (or the lease transferred faster than I could check).
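On the question of checking whether the leases actually landed where the preference says, here is a hedged sketch of the crude crdb_internal.ranges polling approach mentioned above (which, as noted, is not how we ultimately want to do it); the helper name, the min/max aggregation, and the store ID parameter are assumptions:

```go
import (
	gosql "database/sql"
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForLeaseholders polls crdb_internal.ranges until every range of the
// given table reports its leaseholder on the expected store. A real test
// would prefer a first-class way to ask whether lease preferences are met.
func waitForLeaseholders(t *testing.T, db *gosql.DB, table string, storeID int) {
	testutils.SucceedsSoon(t, func() error {
		var minLH, maxLH int
		if err := db.QueryRow(
			`SELECT min(lease_holder), max(lease_holder)
			   FROM crdb_internal.ranges WHERE table_name = $1`, table,
		).Scan(&minLH, &maxLH); err != nil {
			return err
		}
		if minLH != storeID || maxLH != storeID {
			return fmt.Errorf("leases still on stores in [%d, %d], want %d", minLH, maxLH, storeID)
		}
		return nil
	})
}
```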

@ajwerner (Contributor)

I wonder if there’s a way to utilize the new latency stuff we added for cockroach demo and then write the test as an augmentation of partitionccl/TestRepartitioning by putting some load into the test on a cluster which was set up with high latency. The first commit of #41240 adds leaseholder preferences to that test. The test rather crudely checks for leaseholder placement by examining traces for keys. That may be an easier path to a reliable reproduction cycle than relying on roachtests or hooking up Shakespeare.

@tbg (Member) commented Nov 12, 2019

Using the latency injection in unit tests seems like a good idea in general, if we can make that work I'd greatly prefer it.

@darinpp requested a review from tbg on November 19, 2019 at 04:49
@darinpp (Contributor, Author) commented Nov 19, 2019

I added a test as well. Also verified that the test fails with the old check.

@tbg (Member) left a comment

:lgtm:, thanks!

It looks like the test is flaky:

[Maybe stress pull request] --- FAIL: TestTransferLeaseToLaggingNode (56.02s)
[05:41:21]
[Maybe stress pull request]     replicate_queue_test.go:604: condition failed to evaluate within 45s: waiting for n1 to get the lease
[05:41:21]
[Maybe stress pull request]         goroutine 25 [running]:
[05:41:21]
[Maybe stress pull request]         runtime/debug.Stack(0xc0003161c0, 0x4562000, 0xc00046aa60)
[05:41:21]
[Maybe stress pull request]         	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
[05:41:21]
[Maybe stress pull request]         github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x4660a00, 0xc0001ca800, 0xc0003161c0)
[05:41:21]
[Maybe stress pull request]         	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:37 +0x6b
[05:41:21]
[Maybe stress pull request]         github.com/cockroachdb/cockroach/pkg/storage_test.TestTransferLeaseToLaggingNode(0xc0001ca800)
[05:41:21]
[Maybe stress pull request]         	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replicate_queue_test.go:604 +0x905
[05:41:21]
[Maybe stress pull request]         testing.tRunner(0xc0001ca800, 0x3d9e638)
[05:41:21]
[Maybe stress pull request]         	/usr/local/go/src/testing/testing.go:865 +0xc0
[05:41:21]
[Maybe stress pull request]         created by testing.(*T).Run
[05:41:21]
[Maybe stress pull request]         	/usr/local/go/src/testing/testing.go:916 +0x35a
[05:41:21]
[Maybe stress pull request] FAIL

https://teamcity.cockroachdb.com/viewLog.html?buildId=1597934&buildTypeId=Cockroach_UnitTests_Test&tab=buildLog&_focus=689#_state=689

Make sure it survives stress{,rac}ing for a few minutes on your GCEworker (with STRESSFLAGS='-p 8' or something like that).

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @darinpp)


pkg/storage/replicate_queue.go, line 940 at r1 (raw file):

Previously, darinpp wrote…

This is already the case. The first thing that TransferLeaseTarget does is call StorePool.getStoreList. It grabs the NodeLiveness record for each node and calls Liveness.LivenessStatus and in turn Liveness.IsLive.

👍


pkg/storage/replicate_queue_test.go, line 567 at r2 (raw file):

	s := sqlutils.MakeSQLRunner(tc.Conns[0])

	// Create a table and set the replication factor to 3

.


pkg/storage/replicate_queue_test.go, line 579 at r2 (raw file):

	}

	// Get the table's range

.


pkg/storage/replicate_queue_test.go, line 618 at r2 (raw file):

		rangeDesc := n1Repl.Desc()
		for _, r := range rangeDesc.Replicas().Voters() {
			// We are interested in the replica on the 3rd node

.

@darinpp force-pushed the immediate_leasholder_change branch 7 times, most recently from 464e5ef to 43e6ff7 on November 25, 2019 at 15:18
Previously, we had a check that filtered out all replicas that
were lagging behind the leader as candidates for a lease transfer.

We had to remove that check because, when the lease preference specifies
a node that is constantly lagging, the lease transfer will never occur.

This removes the check that restricts the candidates for a lease transfer to
replicas that aren't lagging behind. etcd/raft implements the
leadership transfer extension (§3.10 of the Raft dissertation), in which the
old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.
@darinpp (Contributor, Author) commented Nov 25, 2019

bors r+

craig bot pushed a commit that referenced this pull request Nov 25, 2019
42379: storage: remove the replica up to date check when transferring lease r=darinpp a=darinpp

Previously, we had a check that filtered out all replicas that
were lagging behind the leader as candidates for a lease transfer.

We remove that check so that, when a lease preference names
a node that is constantly lagging, the lease transfer can occur without delay.

This removes the check that restricts the candidates for a lease transfer to
replicas that aren't lagging behind. etcd/raft implements the
leadership transfer extension (§3.10 of the Raft dissertation), in which the
old leader brings the new leader's log up to date while blocking any new requests.

Release note (bug fix): It is now possible to transfer range leases to lagging replicas.

42724: storage: create cluster setting for a minimum lease transfer interval r=nvanbenschoten a=nvanbenschoten

This minimum interval was previously set to 1 second in a constant, which was frustrating when setting up datasets with lease preferences. This change turns the interval into a configurable cluster setting so that it can be changed on-demand.

This value also seems fairly high, but to avoid causing any instability or thrashing I opted not to touch the default for the new cluster setting.

Release note (sql change): A new kv.allocator.min_lease_transfer_interval cluster setting was introduced, which allows the minimum interval between lease transfers initiated from each node to be configured.

Co-authored-by: Darin <darinp@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
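For anyone reproducing the lease-preference setup described in #42724 above, the new setting can be lowered like any other cluster setting. A minimal, hypothetical sketch; the 100ms value is arbitrary:

```go
import gosql "database/sql"

// lowerMinLeaseTransferInterval reduces the minimum interval between lease
// transfers initiated by each node, so freshly configured lease preferences
// are acted on quickly rather than at most once per second.
func lowerMinLeaseTransferInterval(db *gosql.DB) error {
	_, err := db.Exec(
		`SET CLUSTER SETTING kv.allocator.min_lease_transfer_interval = '100ms'`)
	return err
}
```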
@andreimatei (Contributor)

If bors fails, please put a Fixes #38065 line in the commit message.

Also, may I suggest a more user-understandable release note, like:
Release note (bug fix): A bug causing a range's lease to sometimes not be transferred to the location indicated in lease_preferences in geographically-distributed deployments has been fixed.

@craig (bot, Contributor) commented Nov 25, 2019

Build succeeded

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 12, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application, then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristic here that attempts to determine "too much
lag" against the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
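To illustrate the shape of the snapshot check described above, here is a hedged sketch written against etcd/raft's tracker package; the function name, the exact conditions, and the firstIndex plumbing are illustrative, not the actual implementation:

```go
import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/tracker"
)

// replicaMayNeedSnapshot is an illustrative predicate: it reports whether,
// from the Raft leader's point of view, the follower with the given replica
// ID might need a snapshot to catch up. Only the leader tracks follower
// progress, so the caller must be the leader (as the proposal-buffer check
// guarantees).
func replicaMayNeedSnapshot(st *raft.Status, replicaID, firstIndex uint64) bool {
	pr, ok := st.Progress[replicaID]
	if !ok {
		// Unknown follower: be conservative.
		return true
	}
	if pr.State == tracker.StateSnapshot {
		// Raft has already decided this follower needs a snapshot.
		return true
	}
	// If the follower's next expected entry has already been truncated away,
	// it can only be caught up via a snapshot.
	return pr.Next < firstIndex
}
```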
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 21, 2022
craig bot pushed a commit that referenced this pull request Jun 22, 2022
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter and has thus been removed. Closes #78332.

Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071)

`@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763.
Fixes #79385.
Part of #81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log such that it will be able to apply the lease transfer through log entry application, then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristic here that attempts to determine "too much lag" against the possibility of starvation of lease transfers under sustained write load and a resulting sustained replication lag. See #38065 and #42379, which removed such a heuristic. For now, we don't try to make such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in this world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let it through. 

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to etcd/raft. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in `AdminTransferLease`, which all lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease transfer in the proposal buffer after we have revoked our current lease is more disruptive than doing so earlier, before we have revoked our current lease. Best-effort protection is also able to respond more gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions because the check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the older leader could have caught the incoming leaseholder up on its log, the new leader may not be able to because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about #35701 in #81561.

This race is theoretical — we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in #81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.

83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events
were generated per-key and applied individually to the simulator state
by finding the range containing that key. This patch batches load events
on the same key, then applies the load events in ascending order, over
the range tree.

This results in a speedup of 5x on a 32 store, 32k replicas, 16k QPS
cluster.

Release note: None

Co-authored-by: Saad Ur Rahman <saadurrahman@apache.org>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>