
kvserver: use shorter lease durations #79494

Closed
erikgrinaker opened this issue Apr 6, 2022 · 5 comments · Fixed by #91947
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@erikgrinaker
Contributor

erikgrinaker commented Apr 6, 2022

Currently, range leases are fairly long-lived at 9 seconds:

multiplier * electionTimeoutTicks * tickInterval = 3 * 15 * 200ms = 9s

func (cfg RaftConfig) RangeLeaseDurations() (rangeLeaseActive, rangeLeaseRenewal time.Duration) {
	rangeLeaseActive = time.Duration(cfg.RangeLeaseRaftElectionTimeoutMultiplier *
		float64(cfg.RaftElectionTimeout()))

This is also true for epoch-based leases, since the node liveness timeout simply calls through to the range lease interval:

func (cfg RaftConfig) NodeLivenessDurations() (livenessActive, livenessRenewal time.Duration) {
	livenessActive = cfg.RangeLeaseActiveDuration()

This can result in a fairly long delay when the leaseholder is lost. We should try to reduce this, e.g. to 5 seconds. However, many other components also rely on node liveness, so we must balance this against the impact of node flapping (or consider using separate intervals for node liveness and epoch-based lease expiration).
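
For reference, a minimal standalone sketch of how the 9s default falls out of these constants, assuming the values quoted above (200ms tick interval, 15 election ticks, 3x multiplier); this is an illustration, not the actual configuration code:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults, as quoted in the issue text above (not read from pkg/base).
	const (
		tickInterval             = 200 * time.Millisecond
		raftElectionTimeoutTicks = 15
		rangeLeaseMultiplier     = 3.0
	)

	raftElectionTimeout := raftElectionTimeoutTicks * tickInterval // 15 * 200ms = 3s
	rangeLeaseActive := time.Duration(rangeLeaseMultiplier * float64(raftElectionTimeout))

	fmt.Println(raftElectionTimeout, rangeLeaseActive) // 3s 9s
}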

Jira issue: CRDB-14889

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Apr 6, 2022
@erikgrinaker erikgrinaker added this to Incoming in Replication via automation Apr 6, 2022
@bdarnell
Contributor

bdarnell commented Apr 6, 2022

We should consider the raft heartbeat and election constants at the same time. All of these constants were set very early and never seriously reevaluated. Network timeouts too.

The entire recovery flow is complicated and it's more than just the 9s lease timeout, although that's the single biggest piece. Cutting the lease timeout in half is probably the best thing we can do to speed up recovery, but beyond that I'm not sure whether further work on the lease timeout would be as effective as improving other areas.

The Raft election interval is currently set to 3s, but within raft it's actually transformed into a random number between X and 2X, so we could theoretically be spending 6s waiting on raft elections regardless of the lease duration. (It's quantized by the 200ms tick interval, and each replica picks independently, so each node has a 1/5 chance of picking a 6s timeout; but the fastest node wins, so with a replication factor of 3 (and one node down) there's only a 1/25 chance that we have to wait the full 6s. A replication factor of 5 reduces that to 1/625.)
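
For context, here's a rough sketch of the randomization scheme etcd-style Raft implementations use (an assumption that this mirrors etcd/raft's randomized election timeout in spirit, not CockroachDB's exact code):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const (
		tickInterval         = 200 * time.Millisecond
		electionTimeoutTicks = 15 // 3s base election timeout
	)

	// Each replica independently picks a timeout in [X, 2X), quantized to ticks.
	randomizedTicks := electionTimeoutTicks + rand.Intn(electionTimeoutTicks)
	fmt.Println(time.Duration(randomizedTicks) * tickInterval) // somewhere in [3s, 5.8s]
}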

The raft election interval is currently set to 3x the heartbeat interval, which is common in the raft literature but I'm not certain it makes sense. It's useful for unreliable transports, so that a single dropped packet doesn't trigger an election, but since we're using TCP for everything we could probably run with the election and heartbeat intervals closer together. Something like electionTimeout = heartbeatInterval + 2*RTT.
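
As a hypothetical illustration of that sizing (the interval and RTT values below are assumptions, not current defaults):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical sizing, not current CockroachDB behavior: tolerate one missed
	// heartbeat plus two network roundtrips before calling an election.
	heartbeatInterval := 1 * time.Second
	rtt := 100 * time.Millisecond // assumed intra-region RTT
	fmt.Println(heartbeatInterval + 2*rtt) // electionTimeout = 1.2s
}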

Network RTT is a factor in a lot of these considerations. It would be great if we could give ranges different configurations based on whether they're contained within a single region or not.

Because we use the raft prevote feature, it's less harmful to set the raft election timeout too short than it is to set the lease duration too short. In theory, the range can keep operating as normal if one node incorrectly thinks the leader has failed because network congestion prevented the timely delivery of a heartbeat.

Setting the raft heartbeat interval too low has a cost in both network traffic and cpu on a per-range basis. We "coalesce" heartbeats to reduce this cost but it's still there and it adds up; this has implications for our maximum data density per node.

Tweaking the tick interval probably doesn't make too much difference either way, but reducing it might help by reducing the chance of contested elections (due to two replicas picking the same effective election timeout).

Similarly, setting the node liveness heartbeat interval too low increases pressure on this critical range, and setting the lease duration too low increases the chance that a lease will appear to be expired while the node is in fact still up. This is disruptive because when a liveness epoch expires, other replicas can attempt to invalidate it so they can steal the lease.

From memory, the overall recovery flow (for the worst case of a node that vanishes just after sending all its heartbeats) is something like this:

  • Some SQL gateway node tries to talk to a range on the dead node, and waits for the 3s network timeout. (This can be cut short if something in the network is able to send a TCP RST packet)
  • That gateway tries to talk to other replicas of the same range. One of those replicas will check the lease, so this is where we wait for it to expire (with the gateway bouncing between replicas). We've already waited 3s on the network timeout, so we wait 6 more seconds.
  • If the range is not quiesced, one of the surviving replicas might try to start a raft election during this time (does this state of affairs automatically unquiesce? I can't remember).
  • Once the liveness epoch has expired, one of the living replicas sends a CPut to the node liveness range to mark the node as dead and waits for it to commit.
  • If the range is still quiesced, this is where the raft election starts. The timeout has already elapsed, so we just go through the 3 RTT election process (prevote, vote, append).
  • One (or both) of the living replicas attempts to acquire the lease through a raft append operation.
  • If the lease and the raft leader are not colocated, we go through another raft election/leader transfer (because this is a cooperative leadership transfer, there's no election timeout, just the 3 RTTs; there might even be a fast path that does 2 RTTs instead).
  • Now the new lease is valid and usable

So that's a 9s wait plus 2 global RTTs (writing to node liveness) plus about 6 RTTs for the range's quorum. If we reduce the 9s lease duration to 5s, the network and raft election timeouts start to become relevant.
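
To make that concrete, a back-of-envelope worked example (the RTT values here are assumptions picked purely for illustration):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Worst case from the flow above: full lease/liveness expiration, two writes
	// to the (possibly remote) liveness range, and ~6 roundtrips within the
	// range's quorum.
	leaseWait := 9 * time.Second
	globalRTT := 150 * time.Millisecond // assumed RTT to the liveness range's quorum
	quorumRTT := 10 * time.Millisecond  // assumed RTT within the range's own quorum

	fmt.Println(leaseWait + 2*globalRTT + 6*quorumRTT) // ~9.36s
}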

@mwang1026 mwang1026 moved this from Incoming to Backlog in Replication May 5, 2022
@rail
Member

rail commented May 25, 2022

Manually synced with Jira

@jlinder jlinder removed the sync-me-3 label May 27, 2022
@tbg
Member

tbg commented Sep 22, 2022

From a discussion yesterday: maybe the single most impactful thing we can do here is to make sure that the expiration-based lease duration on the liveness range is such that a non-cooperative lease transfer happens quickly enough to avoid invalidating all epoch-based leases in the cluster.

@erikgrinaker
Contributor Author

Wrote up a separate issue to track this: #88443

@erikgrinaker
Contributor Author

Initial PR in #91947, reducing the lease interval to 5.0s and the Raft election timeout to 2.0s.

craig bot pushed a commit that referenced this issue Nov 30, 2022
91810: roachtest: add `failover/non-system/crash` r=erikgrinaker a=erikgrinaker

`failover/non-system/crash` benchmarks the maximum duration of range
unavailability following a leaseholder crash with only non-system
ranges. It tests the simplest possible failure:

  - A process crash, where the host/OS remains available (in particular, the
    TCP/IP stack is responsive and sends immediate RST packets to peers).

  - No system ranges located on the crashed node.

  - SQL clients do not connect to the crashed node.

  - The workload consists of individual point reads and writes.

Since the lease unavailability is probabilistic, depending e.g. on the
time since the last heartbeat and other variables, we run 9 crashes and
record the pMax latency to find the upper bound on unavailability. We
expect this worst-case latency to be slightly larger than the lease
interval (9s), to account for lease acquisition and retry latencies. We
do not assert this, but instead export latency histograms for graphing.

The cluster layout is as follows:

n1-n3: System ranges and SQL gateways.
n4-n6: Workload ranges.
n7:    Workload runner.

The test runs a kv50 workload with batch size 1, using 256 concurrent
workers directed at n1-n3 with a rate of 2048 reqs/s. n4-n6 are killed
and restarted in order, with 30 seconds between each operation, for 3
cycles totaling 9 crashes.

Touches #79494.
Epic: CRDB-18520.

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 4, 2022
92542: base: reduce network timeouts r=erikgrinaker a=erikgrinaker

**`*`: don't use `NetworkTimeout` where inappropriate**

`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts.

Release note: None
  
**rpc: unify heartbeat interval and timeout**

Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s.

Release note: None
  
**base: add `DialTimeout`**

This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout` to account for the additional roundtrips in TCP + TLS handshakes.

**base: reduce network timeouts**

This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds).

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages.

An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Touches #79494.
Epic: None.

Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT` which defaults to `2s`.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 5, 2022
92991: roachtest: add `failover/non-system/blackhole` tests r=erikgrinaker a=erikgrinaker

This patch adds roachtests to benchmark the maximum unavailability during leaseholder network outages on non-system ranges, both symmetric and asymmetric outages. Initial results, with a query timeout of 30 s:

| Test             | pMax read | pMax write |
|------------------|-----------|------------|
| `crash`          | 14.5 s    | 14.5 s     |
| `blackhole`      | 16.6 s    | 18.3 s     |
| `blackhole-recv` | 30.1 s    | 30.1 s     |
| `blackhole-send` | 30.1 s    | 30.1 s     |

Touches #79494.

Epic: None
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@craig craig bot closed this as completed in ec83b9e Dec 11, 2022
craig bot pushed a commit that referenced this issue Dec 12, 2022
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough.

However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.
Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@exalate-issue-sync exalate-issue-sync bot changed the title kvserver: consider shorter leases kvserver: use shorter lease durations Jan 1, 2023