
base: reduce network timeouts #92542

Merged: 4 commits from the reduce-network-timeouts branch into cockroachdb:master on Dec 4, 2022

Conversation

@erikgrinaker (Contributor) commented Nov 27, 2022

*: don't use NetworkTimeout where inappropriate

`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts.

Release note: None

rpc: unify heartbeat interval and timeout

Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s.

Release note: None

base: add DialTimeout

This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout` to account for the additional roundtrips in TCP + TLS handshakes.

base: reduce network timeouts

This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds).

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages.

An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Touches #79494.
Epic: None.

Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT`, which defaults to `2s`.
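
To illustrate the shape of the change, here is a minimal sketch of the constants described above. The `envDuration` helper is a stand-in for CockroachDB's actual environment-variable plumbing, not the real implementation:

```go
package base

import (
	"os"
	"time"
)

// NetworkTimeout is the timeout for a single network round trip
// (RPC heartbeats, gRPC keepalive). Defaults to 2s and can be
// overridden via COCKROACH_NETWORK_TIMEOUT.
var NetworkTimeout = envDuration("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)

// DialTimeout accounts for the additional round trips of the TCP + TLS
// handshakes when establishing a new connection.
var DialTimeout = 2 * NetworkTimeout

// envDuration is a hypothetical stand-in for CockroachDB's env helper.
func envDuration(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
```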

@erikgrinaker erikgrinaker requested review from knz, tbg, nvanbenschoten and a team November 27, 2022 18:27
@erikgrinaker erikgrinaker requested a review from a team as a code owner November 27, 2022 18:27
@erikgrinaker erikgrinaker self-assigned this Nov 27, 2022
@cockroach-teamcity (Member)

This change is Reviewable

@tbg (Member) left a comment

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @nvanbenschoten)


pkg/base/config.go line 57 at r1 (raw file):

	// variance), with a lower bound of 200ms, so the worst-case RTO is 600ms. 2s
	// should therefore be sufficient under most normal conditions.
	// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB

Maybe add something similar for AWS, here's the p99 over one week map:

https://www.cloudping.co/grid/p_99/timeframe/1W

Looks like the numbers come out to around the same, between Sao Paulo and Singapore.


pkg/base/config.go line 58 at r1 (raw file):

	// should therefore be sufficient under most normal conditions.
	// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB
	NetworkTimeout = 2 * time.Second

You will need to update this one too:

MinConnectTimeout: minConnectionTimeout}))

Otherwise, a "black hole" dial will still take 5s. Probably NetworkTimeout should "just" reference rpc.MinConnectionTimeout.

I'm not sure if the latter also closes over the TLS handshake. I don't think so, but I am not sure. Would be good to clarify which is which.
In terms of authority on TLS roundtrips, I found https://www.gnutls.org/manual/html_node/Reducing-round_002dtrips.html useful.
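
For context, a rough sketch of the kind of dialer wiring under discussion, assuming gRPC's `grpc.WithConnectParams` option; the names and the 4s value are illustrative, and the actual CockroachDB plumbing in the rpc package may differ:

```go
package rpc

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialTimeout stands in for a base constant (e.g. base.DialTimeout) so that
// a "black hole" dial fails within the same budget as other network
// operations instead of a separately hardcoded 5s.
const dialTimeout = 4 * time.Second

func dialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff:           backoff.DefaultConfig,
			MinConnectTimeout: dialTimeout,
		}),
	}
}
```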

@erikgrinaker (Contributor, Author) left a comment

I also added an environment variable COCKROACH_NETWORK_TIMEOUT to control this, even though 2s should be sufficient for most users. FWIW, 20.2 accidentally dropped the gRPC connection timeout to the default of 1s, and we only had reports from a single user that this was problematic (they had >500ms RTTs due to a network topology that routed Japan to South America via Europe).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/base/config.go line 57 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe add something similar for AWS, here's the p99 over one week map:

https://www.cloudping.co/grid/p_99/timeframe/1W

Looks like the numbers come out to around the same, between Sao Paulo and Singapore.

Yep, added this.


pkg/base/config.go line 58 at r1 (raw file):

Good catch, thanks. I updated minConnectionTimeout to reference NetworkTimeout, since that's the canonical setting.

Previously, tbg (Tobias Grieger) wrote…

I'm not sure if the latter also closes over the TLS handshake. I don't think so, but I am not sure. Would be good to clarify which is which.

It possibly does. I added some analysis on this.

@tbg (Member) left a comment

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @nvanbenschoten)

@knz (Contributor) left a comment

The gRPC dial timeout is also reduced to the network timeout (from 5 seconds to 2 seconds).

Does this include the TLS handshake?

2 seconds is very short, especially for geo-distributed clusters. With the syn, syn-ack, ack exchange before the TCP connection is established + 3 back-and-forth for the TLS handshake, at 250ms roundtrip you're already at the limit.

I'd be interested in @bdarnell's opinion on this too.

@erikgrinaker (Contributor, Author)

Does this include the TLS handshake?

2 seconds is very short, especially for geo-distributed clusters. With the syn, syn-ack, ack exchange before the TCP connection is established + 3 back-and-forth for the TLS handshake, at 250ms roundtrip you're already at the limit.

I'm not sure, but I believe it might. The TCP+TLS handshake is 3 RTTs. Cloud providers have a max nominal RTT latency between regions of ~350ms, so if we assume 400ms RTT we still have time both for the TCP+TLS handshake and a single packet loss (600ms RTO): 3*400ms + 600ms = 1800ms.

FWIW, we accidentally ran with a 1 second gRPC dial timeout in 20.2, 21.1, and 21.2, with only a single reported problem because of it (see #71417).
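
To make the budget explicit, a small illustrative calculation under the assumptions above (3 handshake round trips at a 400ms RTT plus one 600ms retransmission timeout, against a 2s dial timeout); the numbers come from the comment, not from measurement:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		rtt           = 400 * time.Millisecond // assumed worst-case inter-region RTT
		handshakeRTTs = 3                      // TCP (1 RTT) + TLS 1.2 handshake (2 RTTs)
		rto           = 600 * time.Millisecond // worst-case retransmission timeout
		dialTimeout   = 2 * time.Second        // proposed gRPC dial timeout
	)
	worstCase := handshakeRTTs*rtt + rto // 3*400ms + 600ms = 1800ms
	fmt.Printf("worst case %s within dial timeout %s: %t\n",
		worstCase, dialTimeout, worstCase <= dialTimeout)
}
```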

@erikgrinaker (Contributor, Author) commented Nov 28, 2022

Does this include the TLS handshake?

I'm not sure, but I believe it might.

Confirmed, the timeout includes the TLS handshake. I'll run some further tests to verify when I get a chance (including a fake high-latency network and inspecting the network traffic).

@erikgrinaker (Contributor, Author) commented Nov 28, 2022

A more conservative option here might be to keep the network timeout at 3 seconds (like today), but apply it to the RPC heartbeats (from 6 to 3 seconds with 1.5 second interval) and the gRPC dial timeout (from 5 to 3 seconds). I still think 2 seconds should be fine for most scenarios though.

@bdarnell (Contributor)

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node.

Any timeout that is long enough to allow for a full TCP+TLS handshake is going to be much longer than necessary for detecting a failure on an established connection. I know this is easier said than done, but the ideal answer would be to replace the undifferentiated NetworkTimeout with something that can vary by context (at least two, for new vs established connection, but potentially more, so that same-region calls can use a very aggressive timeout).

In the meantime, I'm tentatively OK with moving to 2s by default, especially since we'd be introducing a new configuration knob in case it's not working.

@erikgrinaker (Contributor, Author) commented Nov 29, 2022

Any timeout that is long enough to allow for a full TCP+TLS handshake is going to be much longer than necessary for detecting a failure on an established connection. I know this is easier said than done, but the ideal answer would be to replace the undifferentiated NetworkTimeout with something that can vary by context (at least two, for new vs established connection, but potentially more, so that same-region calls can use a very aggressive timeout).

Yeah, I've been thinking along the same lines. We could drop NetworkTimeout down to e.g. 1 second, which applies to operations on the order of 1 RTT (with room for a retransmit), and then use a multiplier for operations that require multiple roundtrips -- so the dial timeout would be 3 * NetworkTimeout. But maybe that's overcomplicating things a bit.

@bdarnell (Contributor)

I wouldn't even necessarily require room for a retransmit. Retransmit timeouts are pretty high by modern within-region standards (e.g. you mentioned a 200ms minimum above). It could be worthwhile to try another replica after, say, 100ms even though there's a chance a packet was dropped and hasn't been retransmitted yet. (but we shouldn't retry the same replica more than once faster than the retransmission timeout)

@erikgrinaker (Contributor, Author) commented Nov 29, 2022

Possibly, but I feel like we're already pretty aggressively lowering the timeouts here. I think I'd want to hold back a little bit and not go all in on this just yet, since it can be pretty fragile under latency fluctuations. We can consider tightening it further later, and even then we might want to try it out in CC for a while first.

But the RTT factoring seems reasonable, I'll give that a try (i.e. set NetworkTimeout = time.Second and use roundtrip factors as appropriate).

@erikgrinaker (Contributor, Author) commented Dec 1, 2022

I've restructured this to use NetworkTimeout = time.Second as a measure of a single roundtrip (including retransmit or other latency tolerance), and added DialTimeout = 2 * NetworkTimeout to account for the additional TCP+TLS handshake roundtrips but with reduced retransmit tolerance. I've also cleaned up the use of NetworkTimeout throughout the code base.

As a consequence of these changes, we've reduced the dial timeout from 5 to 2 seconds, gRPC keepalive interval/timeout from 3 to 1 second, and RPC heartbeat interval/timeout from 3/6 seconds to 1 second.

This is rather tight, so I'm open to bumping NetworkTimeout to 2 seconds out of caution (which doubles all of these values). I'm also going to run some high-latency tests to verify this change.
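
A hedged sketch of the structure described here, condensed into one place for illustration (the exact names, packages, and values in the actual patch may differ): a base `NetworkTimeout` sized for roughly one round trip, a `DialTimeout` derived from it for the multi-roundtrip TCP+TLS handshake, and gRPC keepalive parameters tied to the same constant:

```go
package base

import (
	"time"

	"google.golang.org/grpc/keepalive"
)

// NetworkTimeout is a rough budget for a single network round trip,
// including some tolerance for a packet retransmission.
// (Illustrative 1s value; the PR later settles on 2s out of caution.)
const NetworkTimeout = 1 * time.Second

// DialTimeout covers the extra round trips of the TCP + TLS handshakes,
// with reduced retransmit tolerance.
const DialTimeout = 2 * NetworkTimeout

// Illustrative gRPC client keepalive parameters derived from the same constant.
var clientKeepalive = keepalive.ClientParameters{
	Time:                NetworkTimeout, // ping interval on an idle connection
	Timeout:             NetworkTimeout, // wait before declaring the connection dead
	PermitWithoutStream: true,
}
```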

@erikgrinaker (Contributor, Author) commented Dec 2, 2022

Had a quick look at our CC clusters in CentMon, and the worst p99 RPC heartbeat latency in any cluster over the past 90 days was 557 ms. This was on a single-region cluster in the US, and the high latency appeared to be due to CPU overload/throttling: 5 nodes with 2 vCPUs where several nodes were running close to 100% CPU, beyond the Kubernetes resource requests. So I don't think we ever want to drop this below 1 second, to make sure we're tolerant to node overload as well. Updated the NetworkTimeout comment saying as much.

@erikgrinaker (Contributor, Author) commented Dec 3, 2022

I wrote up a roachtest to measure the pMax latency during leaseholder loss with a network blackhole as well (#92991), and the results are promising. Each configuration sampled 9 outages and shows a fair bit of variance, but it's a clear improvement nonetheless.

| Branch | crash | blackhole |
|---|---|---|
| master | 14.5 s | 18.3 s |
| Reduce lease | 8.3 s | 14.5 s |
| Reduce lease + network 1s | 9.1 s | 8.3 s |
| Reduce lease + network 2s | 7.8 s | 9.1 s |

Notably, the marginal gains of a 1 second vs. 2 second network timeout are fairly low, and may not be worth the instability risk, at least not for this pass.

@erikgrinaker (Contributor, Author)

Bumped the network/dial timeouts to 2/4 seconds, out of caution. Opened #93007 to consider reducing them to 1/2 seconds after some verification in cloud. Will do some high-latency testing for #91947.

bors r+

@craig bot (Contributor) commented Dec 4, 2022

Build succeeded:

@craig craig bot merged commit 8a5cb51 into cockroachdb:master Dec 4, 2022
@erikgrinaker erikgrinaker deleted the reduce-network-timeouts branch December 5, 2022 09:47
craig bot pushed a commit that referenced this pull request Dec 12, 2022
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough.

However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.
Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
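
An illustrative sketch of the follow-up change described in #93399: the ping interval is split out from the (longer) heartbeat timeout so that failures are detected quickly even though the timeout itself stays conservative. The constant names and their placement are assumptions for illustration, not the exact implementation:

```go
package base

import "time"

const (
	// PingInterval is how often RPC heartbeats and gRPC keepalive pings are
	// sent. Described as overridable via COCKROACH_PING_INTERVAL.
	PingInterval = 1 * time.Second

	// HeartbeatTimeout is how long a heartbeat may go unanswered before the
	// connection is considered failed (reverted to the previous 6s value).
	HeartbeatTimeout = 6 * time.Second
)
```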