kvserver: use shorter lease durations #79494
We should consider the raft heartbeat and election constants at the same time. All of these constants were set very early and never seriously reevaluated. Network timeouts too. The entire recovery flow is complicated, and it's more than just the 9s lease timeout, although that's the single biggest piece. Cutting the lease timeout in half is probably the best thing we can do to speed up recovery, but beyond that I'm not sure whether further work on the lease timeout would be as effective as improving other areas.

The raft election interval is currently set to 3s, but within raft it's actually transformed into a random number between X and 2X, so we could theoretically be spending 6s waiting on raft elections regardless of the lease duration. (It's quantized by the 200ms tick interval, and each replica picks independently. So each node has a 1/5 chance of picking a 6s timeout, but the fastest node wins, so with a replication factor of 3 (and one down node) there's a 1/25 chance that we have to wait the full 6s. A replication factor of 5 reduces that to 1/625.)

The raft election interval is currently set to 3x the heartbeat interval, which is common in the raft literature, but I'm not certain it makes sense here. It's useful for unreliable transports, so that a single dropped packet doesn't trigger an election, but since we're using TCP for everything we could probably run with the election and heartbeat intervals closer together.

Network RTT is a factor in a lot of these considerations. It would be great if we could give ranges different configurations based on whether they're contained within a single region or not.

Because we use the raft prevote feature, it's less harmful to set the raft election timeout too short than it is to set the lease duration too short. In theory, the range can keep operating as normal if one node incorrectly thinks the leader has failed because network congestion prevented the timely delivery of a heartbeat.
Setting the raft heartbeat interval too low has a cost in both network traffic and CPU on a per-range basis. We "coalesce" heartbeats to reduce this cost, but it's still there and it adds up; this has implications for our maximum data density per node. Tweaking the tick interval probably doesn't make too much difference either way, but reducing it might help by reducing the chance of contested elections (due to two replicas picking the same effective election timeout).

Similarly, setting the node liveness heartbeat interval too low increases pressure on this critical range, and setting the lease duration too low increases the chance that a lease will appear to be expired while the node is in fact still up. This is disruptive because when a liveness epoch expires, other replicas can attempt to invalidate it so they can steal the lease.

From memory, the overall recovery flow (for the worst case of a node that vanishes just after sending all its heartbeats) is something like this:
So that's a 9s wait plus 2 global RTTs (writing to node liveness) plus about 6 RTTs for the range's quorum. If we reduce the 9s lease duration to 5s, the network and raft election timeouts start to become relevant.
Manually synced with Jira
From a discussion yesterday: maybe the single most impactful thing we can do here is to make sure that the expiration-based lease duration on the liveness range is such that a non-cooperative lease transfer happens quickly enough to avoid invalidating all epoch-based leases in the cluster.
Wrote up a separate issue to track this: #88443
Initial PR in #91947, reducing the lease interval to 5.0s and the Raft election timeout to 2.0s.
91810: roachtest: add `failover/non-system/crash` r=erikgrinaker a=erikgrinaker

`failover/non-system/crash` benchmarks the maximum duration of range unavailability following a leaseholder crash with only non-system ranges. It tests the simplest possible failure:

- A process crash, where the host/OS remains available (in particular, the TCP/IP stack is responsive and sends immediate RST packets to peers).
- No system ranges located on the crashed node.
- SQL clients do not connect to the crashed node.
- The workload consists of individual point reads and writes.

Since the lease unavailability is probabilistic, depending e.g. on the time since the last heartbeat and other variables, we run 9 crashes and record the pMax latency to find the upper bound on unavailability. We expect this worst-case latency to be slightly larger than the lease interval (9s), to account for lease acquisition and retry latencies. We do not assert this, but instead export latency histograms for graphing.

The cluster layout is as follows:

- n1-n3: System ranges and SQL gateways.
- n4-n6: Workload ranges.
- n7: Workload runner.

The test runs a kv50 workload with batch size 1, using 256 concurrent workers directed at n1-n3 with a rate of 2048 reqs/s. n4-n6 are killed and restarted in order, with 30 seconds between each operation, for 3 cycles totaling 9 crashes.

Touches #79494.
Epic: CRDB-18520.
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
92542: base: reduce network timeouts r=erikgrinaker a=erikgrinaker

***: don't use `NetworkTimeout` where inappropriate**

`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts. Release note: None

**rpc: unify heartbeat interval and timeout**

Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s. Release note: None

**base: add `DialTimeout`**

This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout`, to account for the additional roundtrips in TCP + TLS handshakes.

**base: reduce network timeouts**

This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds). When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages. An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Touches #79494.
Epic: None.

Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT`, which defaults to `2s`.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
92991: roachtest: add `failover/non-system/blackhole` tests r=erikgrinaker a=erikgrinaker

This patch adds roachtests to benchmark the maximum unavailability during leaseholder network outages on non-system ranges, both symmetric and asymmetric outages. Initial results, with a query timeout of 30 s:

| Test             | pMax read | pMax write |
|------------------|-----------|------------|
| `crash`          | 14.5 s    | 14.5 s     |
| `blackhole`      | 16.6 s    | 18.3 s     |
| `blackhole-recv` | 30.1 s    | 30.1 s     |
| `blackhole-send` | 30.1 s    | 30.1 s     |

Touches #79494.
Epic: None
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough. However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.
Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Currently, range leases are fairly long-lived at 9 seconds:
`cockroach/pkg/base/config.go`, lines 442 to 444 in 8817c28 (snippet not shown)
This is also true for epoch-based leases, since the node liveness timeout simply calls through to the range lease interval:
`cockroach/pkg/base/config.go`, lines 471 to 472 in 8817c28 (snippet not shown)
This can result in a fairly long delay when the leaseholder is lost. We should try to reduce this to e.g. 5 seconds. However, many other components also rely on node liveness, so we must balance this against the impact of node flapping (or consider using separate intervals for node liveness and epoch-based lease expiration).
Jira issue: CRDB-14889