
kvserver: use shorter lease durations #79494

Closed
erikgrinaker opened this issue Apr 6, 2022 · 5 comments · Fixed by #91947
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@erikgrinaker
Contributor

erikgrinaker commented Apr 6, 2022

Currently, range leases are fairly long-lived at 9 seconds:

multiplier * electionTimeoutTicks * tickInterval = 3 * 15 * 200ms = 9s

func (cfg RaftConfig) RangeLeaseDurations() (rangeLeaseActive, rangeLeaseRenewal time.Duration) {
	rangeLeaseActive = time.Duration(cfg.RangeLeaseRaftElectionTimeoutMultiplier *
		float64(cfg.RaftElectionTimeout()))

This is also true for epoch-based leases, since the node liveness timeout simply calls through to the range lease interval:

func (cfg RaftConfig) NodeLivenessDurations() (livenessActive, livenessRenewal time.Duration) {
	livenessActive = cfg.RangeLeaseActiveDuration()

This can result in a fairly long delay when the leaseholder is lost. We should try to reduce this, e.g. to 5 seconds. However, many other components also rely on node liveness, so we must balance this against the impact of node flapping (or consider using separate intervals for node liveness and epoch-based lease expiration).
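
For reference, a minimal standalone sketch of how the 9s default falls out of these constants, assuming the values quoted above (200ms tick interval, 15 election ticks, 3x multiplier); this is an illustration, not the actual configuration code:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults, as quoted in the issue text above (not read from pkg/base).
	const (
		tickInterval             = 200 * time.Millisecond
		raftElectionTimeoutTicks = 15
		rangeLeaseMultiplier     = 3.0
	)

	raftElectionTimeout := raftElectionTimeoutTicks * tickInterval // 15 * 200ms = 3s
	rangeLeaseActive := time.Duration(rangeLeaseMultiplier * float64(raftElectionTimeout))

	fmt.Println(raftElectionTimeout, rangeLeaseActive) // 3s 9s
}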

Jira issue: CRDB-14889

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Apr 6, 2022
@erikgrinaker erikgrinaker added this to Incoming in Replication via automation Apr 6, 2022
@bdarnell
Contributor

bdarnell commented Apr 6, 2022

We should consider the raft heartbeat and election constants at the same time. All of these constants were set very early and never seriously reevaluated. Network timeouts too.

The entire recovery flow is complicated and it's more than just the 9s lease timeout, although that's the single biggest piece. Cutting the lease timeout in half is probably the best thing we can do to speed up recovery, but beyond that I'm not sure whether further work on the lease timeout would be as effective as improving other areas.

The Raft election interval is currently set to 3s, but within raft it's actually transformed into a random number between X and 2X, so we could theoretically be spending 6s waiting on raft elections regardless of the lease duration. (It's quantized by the 200ms tick interval, and each replica picks independently, so each node has a 1/5 chance of picking a 6s timeout; but the fastest node wins, so with a replication factor of 3 (and one node down) there's only a 1/25 chance that we have to wait the full 6s. A replication factor of 5 reduces that to 1/625.)
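
For context, here's a rough sketch of the randomization scheme etcd-style Raft implementations use (an assumption that this mirrors etcd/raft's randomized election timeout in spirit, not CockroachDB's exact code):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const (
		tickInterval         = 200 * time.Millisecond
		electionTimeoutTicks = 15 // 3s base election timeout
	)

	// Each replica independently picks a timeout in [X, 2X), quantized to ticks.
	randomizedTicks := electionTimeoutTicks + rand.Intn(electionTimeoutTicks)
	fmt.Println(time.Duration(randomizedTicks) * tickInterval) // somewhere in [3s, 5.8s]
}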

The raft election interval is currently set to 3x the heartbeat interval, which is common in the raft literature but I'm not certain it makes sense. It's useful for unreliable transports, so that a single dropped packet doesn't trigger an election, but since we're using TCP for everything we could probably run with the election and heartbeat intervals closer together. Something like electionTimeout = heartbeatInterval + 2*RTT.
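
As a hypothetical illustration of that sizing (the interval and RTT values below are assumptions, not current defaults):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical sizing, not current CockroachDB behavior: tolerate one missed
	// heartbeat plus two network roundtrips before calling an election.
	heartbeatInterval := 1 * time.Second
	rtt := 100 * time.Millisecond // assumed intra-region RTT
	fmt.Println(heartbeatInterval + 2*rtt) // electionTimeout = 1.2s
}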

Network RTT is a factor in a lot of these considerations. It would be great if we could give ranges different configurations based on whether they're contained within a single region or not.

Because we use the raft prevote feature, it's less harmful to set the raft election timeout too short than it is to set the lease duration too short. In theory, the range can keep operating as normal if one node incorrectly thinks the leader has failed because network congestion prevented the timely delivery of a heartbeat.

Setting the raft heartbeat interval too low has a cost in both network traffic and cpu on a per-range basis. We "coalesce" heartbeats to reduce this cost but it's still there and it adds up; this has implications for our maximum data density per node.

Tweaking the tick interval probably doesn't make too much difference either way, but reducing it might help by reducing the chance of contested elections (due to two replicas picking the same effective election timeout).

Similarly, setting the node liveness heartbeat interval too low increases pressure on this critical range, and setting the lease duration too low increases the chance that a lease will appear to be expired while the node is in fact still up. This is disruptive because when a liveness epoch expires, other replicas can attempt to invalidate it so they can steal the lease.

From memory, the overall recovery flow (for the worst case of a node that vanishes just after sending all its heartbeats) is something like this:

  • Some SQL gateway node tries to talk to a range on the dead node, and waits for the 3s network timeout. (This can be cut short if something in the network is able to send a TCP RST packet)
  • That gateway tries to talk to other replicas of the same range. One of those replicas will check the lease, so this is where we wait for it to expire (with the gateway bouncing between replicas). We've already waited 3s on the network timeout, so we wait 6 more seconds.
  • If the range is not quiesced, one of the surviving replicas might try to start a raft election during this time (does this state of affairs automatically unquiesce? I can't remember).
  • Once the liveness epoch has expired, one of the living replicas sends a CPut to the node liveness range to mark the node as dead and waits for it to commit.
  • If the range is still quiesced, this is where the raft election starts. The timeout has already elapsed, so we just go through the 3 RTT election process (prevote, vote, append).
  • One (or both) of the living replicas attempts to acquire the lease through a raft append operation.
  • If the lease and the raft leader are not colocated, we go through another raft election/leader transfer (because this is a cooperative leadership transfer, there's no election timeout, just the 3 RTTs; there might even be a fast path that does 2 RTTs instead).
  • Now the new lease is valid and usable

So that's a 9s wait plus 2 global RTTs (writing to node liveness) plus about 6 RTTs for the range's quorum. If we reduce the 9s lease duration to 5s, the network and raft election timeouts start to become relevant.
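
To make that concrete, a back-of-envelope worked example (the RTT values here are assumptions picked purely for illustration):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Worst case from the flow above: full lease/liveness expiration, two writes
	// to the (possibly remote) liveness range, and ~6 roundtrips within the
	// range's quorum.
	leaseWait := 9 * time.Second
	globalRTT := 150 * time.Millisecond // assumed RTT to the liveness range's quorum
	quorumRTT := 10 * time.Millisecond  // assumed RTT within the range's own quorum

	fmt.Println(leaseWait + 2*globalRTT + 6*quorumRTT) // ~9.36s
}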

@mwang1026 mwang1026 moved this from Incoming to Backlog in Replication May 5, 2022
@rail
Member

rail commented May 25, 2022

Manually synced with Jira

@jlinder jlinder removed the sync-me-3 label May 27, 2022
@tbg
Member

tbg commented Sep 22, 2022

From a discussion yesterday: maybe the single most impactful thing we can do here is to make sure that the expiration-based lease duration on the liveness range is such that a non-cooperative lease transfer happens quickly enough to avoid invalidating all epoch-based leases in the cluster.

@erikgrinaker
Contributor Author

Wrote up a separate issue to track this: #88443

@erikgrinaker
Contributor Author

Initial PR in #91947, reducing the lease interval to 5.0s and the Raft election timeout to 2.0s.

craig bot pushed a commit that referenced this issue Nov 30, 2022
91810: roachtest: add `failover/non-system/crash` r=erikgrinaker a=erikgrinaker

`failover/non-system/crash` benchmarks the maximum duration of range
unavailability following a leaseholder crash with only non-system
ranges. It tests the simplest possible failure:

  - A process crash, where the host/OS remains available (in particular, the
    TCP/IP stack is responsive and sends immediate RST packets to peers).

  - No system ranges located on the crashed node.

  - SQL clients do not connect to the crashed node.

  - The workload consists of individual point reads and writes.

Since the lease unavailability is probabilistic, depending e.g. on the
time since the last heartbeat and other variables, we run 9 crashes and
record the pMax latency to find the upper bound on unavailability. We
expect this worst-case latency to be slightly larger than the lease
interval (9s), to account for lease acquisition and retry latencies. We
do not assert this, but instead export latency histograms for graphing.

The cluster layout is as follows:

n1-n3: System ranges and SQL gateways.
n4-n6: Workload ranges.
n7:    Workload runner.

The test runs a kv50 workload with batch size 1, using 256 concurrent
workers directed at n1-n3 with a rate of 2048 reqs/s. n4-n6 are killed
and restarted in order, with 30 seconds between each operation, for 3
cycles totaling 9 crashes.

Touches #79494.
Epic: CRDB-18520.

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 4, 2022
92542: base: reduce network timeouts r=erikgrinaker a=erikgrinaker

**`*`: don't use `NetworkTimeout` where inappropriate**

`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts.

Release note: None
  
**rpc: unify heartbeat interval and timeout**

Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s.

Release note: None
  
**base: add `DialTimeout`**

This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout` to account for the additional roundtrips in TCP + TLS handshakes.

**base: reduce network timeouts**

This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds).

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages.

An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Touches #79494.
Epic: None.

Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT` which defaults to `2s`.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 5, 2022
92991: roachtest: add `failover/non-system/blackhole` tests r=erikgrinaker a=erikgrinaker

This patch adds roachtests to benchmark the maximum unavailability during leaseholder network outages on non-system ranges, both symmetric and asymmetric outages. Initial results, with a query timeout of 30 s:

| Test             | pMax read | pMax write |
|------------------|-----------|------------|
| `crash`          | 14.5 s    | 14.5 s     |
| `blackhole`      | 16.6 s    | 18.3 s     |
| `blackhole-recv` | 30.1 s    | 30.1 s     |
| `blackhole-send` | 30.1 s    | 30.1 s     |

Touches #79494.

Epic: None
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@craig craig bot closed this as completed in ec83b9e Dec 11, 2022
craig bot pushed a commit that referenced this issue Dec 12, 2022
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough.

However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.
Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@exalate-issue-sync exalate-issue-sync bot changed the title kvserver: consider shorter leases kvserver: use shorter lease durations Jan 1, 2023