
base: reduce network timeouts #92542

Merged: 4 commits from the reduce-network-timeouts branch into cockroachdb:master on Dec 4, 2022

Conversation

@erikgrinaker (Contributor) commented Nov 27, 2022

*: don't use NetworkTimeout where inappropriate

`NetworkTimeout` should be used for network roundtrip timeouts, not for request processing timeouts.

Release note: None

rpc: unify heartbeat interval and timeout

Previously, the RPC heartbeat timeout (6s) was set to twice the heartbeat interval (3s). This is rather excessive, so this patch sets them to an equal value of 3s.

Release note: None

base: add DialTimeout

This patch adds a `DialTimeout` constant, set to `2 * NetworkTimeout` to account for the additional roundtrips in TCP + TLS handshakes.

base: reduce network timeouts

This patch reduces the network timeout from 3 seconds to 2 seconds. This change also affects gRPC keepalive intervals/timeouts (3 to 2 seconds), RPC heartbeats and timeouts (3 to 2 seconds), and the gRPC dial timeout (6 to 4 seconds).

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node. Reducing them therefore improves recovery time during infrastructure outages.

An environment variable `COCKROACH_NETWORK_TIMEOUT` has been introduced to tweak this timeout if needed.

Touches #79494.
Epic: None.

Release note (ops change): The network timeout for RPC connections between cluster nodes has been reduced from 3 seconds to 2 seconds, with a connection timeout of 4 seconds, in order to reduce unavailability and tail latencies during infrastructure outages. This can now be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT`, which defaults to `2s`.
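
To illustrate the shape of the change, here is a minimal sketch of the constants described above. The `envDuration` helper is a stand-in for CockroachDB's actual environment-variable plumbing, not the real implementation:

```go
package base

import (
	"os"
	"time"
)

// NetworkTimeout is the timeout for a single network round trip
// (RPC heartbeats, gRPC keepalive). Defaults to 2s and can be
// overridden via COCKROACH_NETWORK_TIMEOUT.
var NetworkTimeout = envDuration("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)

// DialTimeout accounts for the additional round trips of the TCP + TLS
// handshakes when establishing a new connection.
var DialTimeout = 2 * NetworkTimeout

// envDuration is a hypothetical stand-in for CockroachDB's env helper.
func envDuration(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
```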

@erikgrinaker erikgrinaker requested review from knz, tbg, nvanbenschoten and a team November 27, 2022 18:27
@erikgrinaker erikgrinaker requested a review from a team as a code owner November 27, 2022 18:27
@erikgrinaker erikgrinaker self-assigned this Nov 27, 2022
@cockroach-teamcity (Member)

This change is Reviewable

@tbg (Member) left a comment

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @nvanbenschoten)


pkg/base/config.go line 57 at r1 (raw file):

	// variance), with a lower bound of 200ms, so the worst-case RTO is 600ms. 2s
	// should therefore be sufficient under most normal conditions.
	// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB

Maybe add something similar for AWS, here's the p99 over one week map:

https://www.cloudping.co/grid/p_99/timeframe/1W

Looks like the numbers come out to around the same, between Sao Paulo and Singapore.


pkg/base/config.go line 58 at r1 (raw file):

	// should therefore be sufficient under most normal conditions.
	// https://datastudio.google.com/reporting/fc733b10-9744-4a72-a502-92290f608571/page/70YCB
	NetworkTimeout = 2 * time.Second

You will need to update this one too:

MinConnectTimeout: minConnectionTimeout}))

Otherwise, a "black hole" dial will still take 5s. Probably NetworkTimeout should "just" reference rpc.MinConnectionTimeout.

I'm not sure if the latter also closes over the TLS handshake. I don't think so, but I am not sure. Would be good to clarify which is which.
In terms of authority on TLS roundtrips, I found https://www.gnutls.org/manual/html_node/Reducing-round_002dtrips.html useful.
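
For context, a rough sketch of the kind of dialer wiring under discussion, assuming gRPC's `grpc.WithConnectParams` option; the names and the 4s value are illustrative, and the actual CockroachDB plumbing in the rpc package may differ:

```go
package rpc

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialTimeout stands in for a base constant (e.g. base.DialTimeout) so that
// a "black hole" dial fails within the same budget as other network
// operations instead of a separately hardcoded 5s.
const dialTimeout = 4 * time.Second

func dialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff:           backoff.DefaultConfig,
			MinConnectTimeout: dialTimeout,
		}),
	}
}
```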

@erikgrinaker (Contributor, Author) left a comment

I also added an environment variable COCKROACH_NETWORK_TIMEOUT to control this, even though 2s should be sufficient for most users. FWIW, 20.2 accidentally dropped the gRPC connection timeout to the default of 1s, and we only had reports from a single user that this was problematic (they had >500ms RTTs due to a network topology that routed Japan to South America via Europe).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/base/config.go line 57 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe add something similar for AWS, here's the p99 over one week map:

https://www.cloudping.co/grid/p_99/timeframe/1W

Looks like the numbers come out to around the same, between Sao Paulo and Singapore.

Yep, added this.


pkg/base/config.go line 58 at r1 (raw file):

Good catch, thanks. I updated minConnectionTimeout to reference NetworkTimeout, since that's the canonical setting.

Previously, tbg (Tobias Grieger) wrote…

I'm not sure if the latter also closes over the TLS handshake. I don't think so, but I am not sure. Would be good to clarify which is which.

It possibly does. I added some analysis on this.

@tbg (Member) left a comment

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @nvanbenschoten)

@knz (Contributor) left a comment

The gRPC dial timeout is also reduced to the network timeout (from 5 seconds to 2 seconds).

Does this include the TLS handshake?

2 seconds is very short, especially for geo-distributed clusters. With the syn, syn-ack, ack exchange before the TCP connection is established + 3 back-and-forth for the TLS handshake, at 250ms roundtrip you're already at the limit.

I'd be interested in @bdarnell's opinion on this too.

@erikgrinaker (Contributor, Author)

Does this include the TLS handshake?

2 seconds is very short, especially for geo-distributed clusters. With the syn, syn-ack, ack exchange before the TCP connection is established + 3 back-and-forth for the TLS handshake, at 250ms roundtrip you're already at the limit.

I'm not sure, but I believe it might. The TCP+TLS handshake is 3 RTTs. Cloud providers have a max nominal RTT latency between regions of ~350ms, so if we assume 400ms RTT we still have time both for the TCP+TLS handshake and a single packet loss (600ms RTO): 3*400ms + 600ms = 1800ms.

FWIW, we accidentally ran with a 1 second gRPC dial timeout in 20.2, 21.1, and 21.2, with only a single reported problem because of it (see #71417).
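
To make the budget explicit, a small illustrative calculation under the assumptions above (3 handshake round trips at a 400ms RTT plus one 600ms retransmission timeout, against a 2s dial timeout); the numbers come from the comment, not from measurement:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		rtt           = 400 * time.Millisecond // assumed worst-case inter-region RTT
		handshakeRTTs = 3                      // TCP (1 RTT) + TLS 1.2 handshake (2 RTTs)
		rto           = 600 * time.Millisecond // worst-case retransmission timeout
		dialTimeout   = 2 * time.Second        // proposed gRPC dial timeout
	)
	worstCase := handshakeRTTs*rtt + rto // 3*400ms + 600ms = 1800ms
	fmt.Printf("worst case %s within dial timeout %s: %t\n",
		worstCase, dialTimeout, worstCase <= dialTimeout)
}
```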

@erikgrinaker (Contributor, Author) commented Nov 28, 2022

Does this include the TLS handshake?

I'm not sure, but I believe it might.

Confirmed, the timeout includes the TLS handshake. I'll run some further tests to verify when I get a chance (including a fake high-latency network and inspecting the network traffic).

@erikgrinaker (Contributor, Author) commented Nov 28, 2022

A more conservative option here might be to keep the network timeout at 3 seconds (like today), but apply it to the RPC heartbeats (from 6 to 3 seconds with 1.5 second interval) and the gRPC dial timeout (from 5 to 3 seconds). I still think 2 seconds should be fine for most scenarios though.

@bdarnell (Contributor)

When a peer is unresponsive, these timeouts determine how quickly RPC calls (and thus critical operations such as lease acquisitions) will be retried against a different node.

Any timeout that is long enough to allow for a full TCP+TLS handshake is going to be much longer than necessary for detecting a failure on an established connection. I know this is easier said than done, but the ideal answer would be to replace the undifferentiated NetworkTimeout with something that can vary by context (at least two, for new vs established connection, but potentially more, so that same-region calls can use a very aggressive timeout).

In the meantime, I'm tentatively OK with moving to 2s by default, especially since we'd be introducing a new configuration knob in case it's not working.

@erikgrinaker (Contributor, Author) commented Nov 29, 2022

Any timeout that is long enough to allow for a full TCP+TLS handshake is going to be much longer than necessary for detecting a failure on an established connection. I know this is easier said than done, but the ideal answer would be to replace the undifferentiated NetworkTimeout with something that can vary by context (at least two, for new vs established connection, but potentially more, so that same-region calls can use a very aggressive timeout).

Yeah, I've been thinking along the same lines. We could drop NetworkTimeout down to e.g. 1 second, which applies to operations on the order of 1 RTT (with room for a retransmit), and then use a multiplier for operations that require multiple roundtrips -- so the dial timeout would be 3 * NetworkTimeout. But maybe that's overcomplicating things a bit.

@bdarnell (Contributor)

I wouldn't even necessarily require room for a retransmit. Retransmit timeouts are pretty high by modern within-region standards (e.g. you mentioned a 200ms minimum above). It could be worthwhile to try another replica after, say, 100ms even though there's a chance a packet was dropped and hasn't been retransmitted yet. (but we shouldn't retry the same replica more than once faster than the retransmission timeout)

@erikgrinaker (Contributor, Author) commented Nov 29, 2022

Possibly, but I feel like we're already pretty aggressively lowering the timeouts here. I think I'd want to hold back a little bit and not go all in on this just yet, since it can be pretty fragile under latency fluctuations. We can consider tightening it further later, and even then we might want to try it out in CC for a while first.

But the RTT factoring seems reasonable, I'll give that a try (i.e. set NetworkTimeout = time.Second and use roundtrip factors as appropriate).

@erikgrinaker (Contributor, Author) commented Dec 1, 2022

I've restructured this to use NetworkTimeout = time.Second as a measure of a single roundtrip (including retransmit or other latency tolerance), and added DialTimeout = 2 * NetworkTimeout to account for the additional TCP+TLS handshake roundtrips but with reduced retransmit tolerance. I've also cleaned up the use of NetworkTimeout throughout the code base.

As a consequence of these changes, we've reduced the dial timeout from 5 to 2 seconds, gRPC keepalive interval/timeout from 3 to 1 second, and RPC heartbeat interval/timeout from 3/6 seconds to 1 second.

This is rather tight, so I'm open to bumping NetworkTimeout to 2 seconds out of caution (which doubles all of these values). I'm also going to run some high-latency tests to verify this change.
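
A hedged sketch of the structure described here, condensed into one place for illustration (the exact names, packages, and values in the actual patch may differ): a base `NetworkTimeout` sized for roughly one round trip, a `DialTimeout` derived from it for the multi-roundtrip TCP+TLS handshake, and gRPC keepalive parameters tied to the same constant:

```go
package base

import (
	"time"

	"google.golang.org/grpc/keepalive"
)

// NetworkTimeout is a rough budget for a single network round trip,
// including some tolerance for a packet retransmission.
// (Illustrative 1s value; the PR later settles on 2s out of caution.)
const NetworkTimeout = 1 * time.Second

// DialTimeout covers the extra round trips of the TCP + TLS handshakes,
// with reduced retransmit tolerance.
const DialTimeout = 2 * NetworkTimeout

// Illustrative gRPC client keepalive parameters derived from the same constant.
var clientKeepalive = keepalive.ClientParameters{
	Time:                NetworkTimeout, // ping interval on an idle connection
	Timeout:             NetworkTimeout, // wait before declaring the connection dead
	PermitWithoutStream: true,
}
```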

@erikgrinaker (Contributor, Author) commented Dec 2, 2022

Had a quick look at our CC clusters in CentMon, and the worst p99 RPC heartbeat latency in any cluster over the past 90 days was 557 ms. This was on a single-region cluster in the US, and the high latency appeared to be due to CPU overload/throttling: 5 nodes with 2 vCPUs where several nodes were running close to 100% CPU, beyond the Kubernetes resource requests. So I don't think we ever want to drop this below 1 second, to make sure we're tolerant to node overload as well. Updated the NetworkTimeout comment saying as much.

@erikgrinaker (Contributor, Author) commented Dec 3, 2022

I wrote up a roachtest to measure the pMax latency during leaseholder loss with a network blackhole as well (#92991), and the results are promising. Each configuration sampled 9 outages and shows a fair bit of variance, but it's a clear improvement nonetheless.

| Branch | crash | blackhole |
|---|---|---|
| master | 14.5 s | 18.3 s |
| Reduce lease | 8.3 s | 14.5 s |
| Reduce lease + network 1s | 9.1 s | 8.3 s |
| Reduce lease + network 2s | 7.8 s | 9.1 s |

Notably, the marginal gains of a 1 second vs. 2 second network timeout are fairly low, and may not be worth the instability risk, at least not for this pass.

@erikgrinaker (Contributor, Author)

Bumped the network/dial timeouts to 2/4 seconds, out of caution. Opened #93007 to consider reducing them to 1/2 seconds after some verification in cloud. Will do some high-latency testing for #91947.

bors r+

@craig bot (Contributor) commented Dec 4, 2022

Build succeeded:

@craig craig bot merged commit 8a5cb51 into cockroachdb:master Dec 4, 2022
@erikgrinaker erikgrinaker deleted the reduce-network-timeouts branch December 5, 2022 09:47
craig bot pushed a commit that referenced this pull request Dec 12, 2022
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough.

However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.
Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
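
An illustrative sketch of the follow-up change described in #93399: the ping interval is split out from the (longer) heartbeat timeout so that failures are detected quickly even though the timeout itself stays conservative. The constant names and their placement are assumptions for illustration, not the exact implementation:

```go
package base

import "time"

const (
	// PingInterval is how often RPC heartbeats and gRPC keepalive pings are
	// sent. Described as overridable via COCKROACH_PING_INTERVAL.
	PingInterval = 1 * time.Second

	// HeartbeatTimeout is how long a heartbeat may go unanswered before the
	// connection is considered failed (reverted to the previous 6s value).
	HeartbeatTimeout = 6 * time.Second
)
```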