[Jepsen] During network partitions, CQL requests never time out #822
During a network partition, YugaByte CE does not appear to return immediate failures to client requests; instead, requests appear to time out after 10+ seconds. This behavior could pose problems for production clients: during a fault, increasing latencies by 1-2 orders of magnitude could tie up work queues or available concurrency on nodes, causing cascading failures. High latencies for operations on partitioned shards might also delay or starve operations on shards with a healthy majority component available, reducing goodput. Finally, timeouts are indeterminate results, which force clients to deal with increased ambiguity--did these requests succeed or fail? Clients might retry ambiguous failures multiple times, overloading a struggling cluster.
This plot shows the latency of client operations (writes, in this case) during a network partition isolating 2 nodes from 3 others in a 5-node Jepsen cluster. Time flows horizontally; latency is plotted vertically. Yellow operations are indeterminate results (e.g., timeouts), and pink operations are known failures. The grey regions indicate when a network partition was in effect.
This client has a 10-second timeout in effect, so we know YB's server latency is at least 10 seconds under these circumstances--it might be higher.
YugaByte could mitigate these issues by returning definite failure messages to clients immediately when a leader's lease has expired, no leader has been recently reachable, no leader is known, and so on, while continuing to search for a leader in the background. Once communication has been re-established, requests may flow again.
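A minimal self-contained sketch of that fail-fast idea (every type and name below is illustrative, not YugaByte's actual API):

```cpp
#include <string>
#include <utility>

// Stand-in types for this sketch only; none of this is YugaByte's real API.
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status Unavailable(std::string msg) { return {false, std::move(msg)}; }
};

struct Tablet {
  bool has_known_leader = false;   // e.g., a leader was recently reachable
  bool leader_lease_valid = false; // e.g., the leader's lease has not expired
};

Status HandleWrite(const Tablet& tablet) {
  // Fail immediately with a *definite* error instead of letting the request
  // sit until a timeout: the client learns right away that the write was not
  // applied and can retry without ambiguity. Leader discovery would keep
  // running in the background so later requests can succeed.
  if (!tablet.has_known_leader || !tablet.leader_lease_valid) {
    return Status::Unavailable("no reachable leader for tablet; retry later");
  }
  // Otherwise, forward to the leader as usual (elided in this sketch).
  return Status::OK();
}
```

The key property is that the failure is definite: the client knows the write did not happen, which avoids the retry-amplification that indeterminate timeouts invite.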
A bit more data here: I dug into the Cassandra client timeouts, and discovered that these 12-second timeouts were client-imposed. In fact, Yugabyte's CQL layer refuses to time out any client request--even with up to 500 seconds of network partitions, leader elections, node failures, etc.
This looks to be because the CQL layer specifies MonoTime::Max() for timeouts: https://github.com/YugaByte/yugabyte-db/blob/0274a1c229b8e358d096c7010931402123345de2/src/yb/yql/cql/cqlserver/cql_rpc.cc#L296
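For contrast, a bounded deadline might look roughly like the sketch below. Apart from MonoTime::Max() (visible at the linked line), the identifiers and flag plumbing here are assumptions modeled on the MonoTime/MonoDelta style used elsewhere in the codebase:

```cpp
// Sketch only: contrasting the unbounded deadline with a bounded one.
MonoTime CqlDeadline(bool bounded) {
  if (!bounded) {
    // Current behavior at the linked line: the request may wait forever.
    return MonoTime::Max();
  }
  // Bounded alternative (assumed plumbing): cap the CQL call at the client
  // read/write budget, so a partitioned request surfaces a definite timeout.
  return MonoTime::Now() +
         MonoDelta::FromMilliseconds(FLAGS_client_read_write_timeout_ms);
}
```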
While the request between YQL & TServer (at the ybclient layer) does time out based on `client_read_write_timeout_ms`, the executor layer keeps restarting the request, so the CQL call as a whole never completes.
Here's a simple repro that @amitanandaiyer had:
1. Create a 3-node yb-ctl cluster.
2. Now kill 2 tservers.
3. Issue a CQL request (e.g., from cqlsh): the client-side call times out, but the request keeps getting retried inside the system until the tservers come back (or the partition heals).
Given that there are retries in the layer between YQL and TServer, and a timeout enforced by the ybclient RPC layer, do we need to restart on a timeout again at the executor level? Could we change the NeedsRestart() logic so that a timeout is no longer treated as a reason to restart?
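Something along these lines, perhaps (a sketch with a stand-in Status stub; the predicate names mirror yb's Status type but are assumptions, and only the intent is confirmed by the commit below):

```cpp
// Sketch of the proposed NeedsRestart() change in ql/exec/executor.cc.
// Status here is a stub for illustration, not yb's actual Status class.
struct Status {
  bool try_again = false;
  bool timed_out = false;
  bool IsTryAgain() const { return try_again; }
  bool IsTimedOut() const { return timed_out; }
};

bool NeedsRestart(const Status& s) {
  // Before (per this thread): a timed-out status also triggered an
  // executor-level restart, so a request against a partitioned shard was
  // retried forever:
  //   return s.IsTryAgain() || s.IsTimedOut();

  // After: rely on the ybclient layer's own retry budget
  // (client_read_write_timeout_ms) and do not restart on timeout.
  return s.IsTryAgain();
}
```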
…ut retries already happen at ybclient layer

Summary: Even when the client request (e.g., from cqlsh) times out, the request is retried forever in the ql/exec/executor.cc layer, and we see that the request keeps being retried in the system (until, e.g., the partition heals). Timeout-related restarts are already handled at the ybclient RPC layer (used for YQL-to-TServer communication). Each request is already tried/retried for an overall `client_read_write_timeout_ms` amount of time (60s default), with each individual RPC's default timeout being `retryable_rpc_single_call_timeout_ms` (default 2500ms). Given the above, the NeedsRestart() logic in the executor layer shouldn't also need to consider timeouts as a reason for restart.

Test Plan: Look for test failures. Added new test.

Reviewers: kannan, sergei, mihnea, robert

Reviewed By: robert

Differential Revision: https://phabricator.dev.yugabyte.com/D6097
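As a back-of-the-envelope check on those two flags (purely illustrative arithmetic, ignoring any backoff between attempts):

```cpp
#include <cstdio>

// Two-level timeout budget described in the commit message: an overall
// budget of client_read_write_timeout_ms (60000 ms by default), spent in
// individual RPCs of retryable_rpc_single_call_timeout_ms (2500 ms by
// default). Ignoring backoff, that allows roughly 24 attempts per request.
int main() {
  const int overall_budget_ms = 60000;   // client_read_write_timeout_ms
  const int per_rpc_timeout_ms = 2500;   // retryable_rpc_single_call_timeout_ms
  std::printf("max attempts (no backoff): %d\n",
              overall_budget_ms / per_rpc_timeout_ms);  // prints 24
  return 0;
}
```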