[Jepsen] During network partitions, CQL requests never time out #822
During a network partition, YugaByte CE does not appear to return immediate failures to client requests; instead, requests appear to time out after 10+ seconds. This behavior could pose problems for production clients: during a fault, increasing latencies by 1-2 orders of magnitude could tie up work queues or available concurrency on nodes, causing cascading failures. High latencies for operations on partitioned shards might also delay or starve operations on shards with a healthy majority component available, reducing goodput. Finally, timeouts are indeterminate results, which force clients to deal with increased ambiguity: did these requests succeed or fail? Clients might retry ambiguous failures multiple times, overloading a struggling cluster.
This plot shows the latency of client operations (writes, in this case) during a network partition isolating 2 nodes from 3 others in a 5-node Jepsen cluster. Time flows horizontally; latency is plotted vertically. Yellow operations are indeterminate results (e.g. timeouts), and pink operations are known failures. The grey regions indicate when a network partition was in effect.
This client has a 10-second timeout in effect, so we know YB's server latency is at least 10 seconds under these circumstances; it might be higher.
YugaByte could mitigate these issues by returning definite failure messages to clients immediately when a leader's lease has expired, no leader has been recently reachable, no leader is known, and so on, while continuing to look for a leader in the background. Once communication has been re-established, requests may flow again.
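The fail-fast behavior suggested above might look something like the following toy sketch. All of the names here (`TabletState`, `TryWrite`, `WriteResult`) are hypothetical illustrations, not YugaByte's actual code:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Definite, immediate outcomes the client can act on.
enum class WriteResult { kOk, kNoLeader };

// Hypothetical per-tablet view of leadership.
struct TabletState {
  std::optional<int> leader;       // id of last known leader, if any
  Clock::time_point lease_expiry;  // when that leader's lease runs out
};

// Return a definite kNoLeader error right away when the lease has
// expired or no leader is known; a background task would keep probing
// for a new leader so that later requests can succeed.
WriteResult TryWrite(const TabletState& t, Clock::time_point now) {
  if (!t.leader.has_value() || now >= t.lease_expiry) {
    return WriteResult::kNoLeader;  // immediate, unambiguous failure
  }
  return WriteResult::kOk;  // forward to the leader as usual
}
```

The key property is that the error is *definite*: the write was never sent, so the client need not treat it as ambiguous or retry it cautiously.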
A bit more data here: I dug into the Cassandra client timeouts, and discovered that these 12-second timeouts were client-imposed. In fact, YugaByte's CQL layer refuses to time out any client request, even with up to 500 seconds of network partitions, leader elections, node failures, etc.
This looks to be because the CQL layer specifies MonoTime::Max() for timeouts:
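The effect of an effectively infinite deadline can be illustrated with a toy model. This is not YugaByte's `MonoTime` class, just a `std::chrono` stand-in showing why a `Max()` deadline never fires:

```cpp
#include <cassert>
#include <chrono>

// Toy stand-in for a monotonic deadline. A deadline of max() compares
// greater than any real clock reading, so the expiry check below can
// never return true and the request effectively waits forever.
using MonoTime = std::chrono::steady_clock::time_point;

bool DeadlineExpired(MonoTime now, MonoTime deadline) {
  return now >= deadline;
}
```

With `MonoTime::max()` as the deadline, `DeadlineExpired` stays false no matter how far `now` advances, so only the client's own socket timeout can terminate the request.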
While the request between YQL & TServer (at the ybclient layer) is indeed timing out based on the
Here's a simple repro that @amitanandaiyer had:
1. Create a 3-node yb-ctl cluster.
2. Kill 2 of the tservers.
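Those steps might look roughly like this with `yb-ctl` (the exact flags and subcommands vary by version; check `yb-ctl --help` before running):

```shell
# Create a 3-node local cluster with replication factor 3.
yb-ctl --rf 3 create

# Kill two of the three tservers, leaving no majority.
yb-ctl stop_node 2
yb-ctl stop_node 3
```

With a majority of tservers down, CQL requests against the remaining node should exhibit the hang described above.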
Given that there are retries in the layer between YQL and TServer, and a timeout enforced by the ybclient layer, do we need to restart on a timeout again?
Could we change the NeedsRestart() logic from:
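A sketch of the kind of change being discussed. The `Status` shape and both predicate bodies are assumptions for illustration, not the actual `NeedsRestart()` code:

```cpp
#include <cassert>

// Hypothetical stand-in for the real status type.
struct Status {
  bool network_error = false;
  bool timed_out = false;
};

// Assumed current shape: restart the operation on network errors and
// on timeouts, which re-drives a layer that already retries internally.
bool NeedsRestartBefore(const Status& s) {
  return s.network_error || s.timed_out;
}

// Sketch of the proposed change: since the YQL-to-TServer layer already
// retries with its own timeout, don't restart again on TimedOut;
// surface it so the client gets a prompt, definite result.
bool NeedsRestartAfter(const Status& s) {
  return s.network_error;
}
```

Under this change, a timed-out sub-request would propagate up instead of being silently re-driven, which is one way to bound the end-to-end latency the Jepsen report observed.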