rpc: avoid RPC heartbeat head-of-line blocking #93397
Labels
A-kv-server
Relating to the KV-level RPC server
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Projects
RPC heartbeats are important to detect peer failures and fail over to other nodes. However, they currently need to have very high timeouts (6 seconds) because they can be head-of-line blocked by other RPC traffic. For example, an experiment with a 500ms RTT cluster running a TPCC import would frequently hit the 6 second heartbeat timeout, even though the network latency was a fraction of this.
Furthermore, on idle clusters heartbeats were occasionally seen to take 3 RTTs rather than 1, long after the connection had initially been established (which takes 3 RTTs for the handshake). It's unclear what the cause of this is -- packet dumps showed that the TCP connection was intact throughout, so further analysis is needed.
We should avoid head-of-line blocking and other interference with RPC heartbeats, to get them closer to the basic network RTT, such that we can reduce the heartbeat timeout further. This may require switching the gRPC transport to e.g. QUIC.
Jira issue: CRDB-22311
The text was updated successfully, but these errors were encountered: