Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: avoid RPC heartbeat head-of-line blocking #93397

Open
erikgrinaker opened this issue Dec 10, 2022 · 1 comment
Open

rpc: avoid RPC heartbeat head-of-line blocking #93397

erikgrinaker opened this issue Dec 10, 2022 · 1 comment
Labels
A-kv-server Relating to the KV-level RPC server C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Dec 10, 2022

RPC heartbeats are important to detect peer failures and fail over to other nodes. However, they currently need to have very high timeouts (6 seconds) because they can be head-of-line blocked by other RPC traffic. For example, an experiment with a 500ms RTT cluster running a TPCC import would frequently hit the 6 second heartbeat timeout, even though the network latency was a fraction of this.

Furthermore, on idle clusters heartbeats were occasionally seen to take 3 RTTs rather than 1, long after the connection had initially been established (which takes 3 RTTs for the handshake). It's unclear what the cause of this is -- packet dumps showed that the TCP connection was intact throughout, so further analysis is needed.

We should avoid head-of-line blocking and other interference with RPC heartbeats, to get them closer to the basic network RTT, such that we can reduce the heartbeat timeout further. This may require switching the gRPC transport to e.g. QUIC.

Jira issue: CRDB-22311

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server T-kv KV Team labels Dec 10, 2022
@blathers-crl blathers-crl bot added this to Incoming in KV Dec 10, 2022
@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels Dec 10, 2022
@erikgrinaker
Copy link
Contributor Author

One alternative here could be to use a lower heartbeat timeout for the SystemClass connection, which is less susceptible to network congestion, and terminate all RPC connections to the node when the system connection fails.

craig bot pushed a commit that referenced this issue Dec 12, 2022
93399: rpc: tweak heartbeat intervals and timeouts r=erikgrinaker a=erikgrinaker

The RPC heartbeat interval and timeout were recently reduced to 2 seconds (`base.NetworkTimeout`), with the assumption that heartbeats require a single network roundtrip and 2 seconds would therefore be more than enough.

However, high-latency experiments showed that clusters under TPCC import load were very unstable even with a relatively moderate 400ms RTT, showing frequent RPC heartbeat timeouts because RPC `Ping` requests are head-of-line blocked by other RPC traffic.

This patch therefore reverts the RPC heartbeat timeout back to the previous 6 second value, which is stable under TPCC import load with 400ms RTT, but struggles under 500ms RTT (which is also the case for 22.2). However, the RPC heartbeat interval and gRPC keepalive ping intervals have been split out to a separate setting `PingInterval` (`COCKROACH_PING_INTERVAL`), with a default value of 1 second, to fail faster despite the very high timeout.

Unfortunately, this increases the maximum lease recovery time during network outages from 9.7 seconds to 14.0 seconds (as measured by the `failover/non-system/blackhole` roachtest), but that's still better than the 18.1 seconds in 22.2.

Touches #79494.
Touches #92542.
Touches #93397.
Epic: none

Release note (ops change): The RPC heartbeat and gRPC keepalive ping intervals have been reduced to 1 second, to detect failures faster. This is adjustable via the new `COCKROACH_PING_INTERVAL` environment variable. The timeouts remain unchanged.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-server Relating to the KV-level RPC server C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
KV
Incoming
Development

No branches or pull requests

1 participant