[Jepsen] During network partitions, CQL requests never time out #822

aphyr opened this issue Jan 30, 2019 · 4 comments


@aphyr aphyr commented Jan 30, 2019

During a network partition, YugaByte CE does not appear to return immediate failures to client requests; instead, requests appear to time out after 10+ seconds. This behavior could pose problems for production clients: during a fault, increasing latencies by 1-2 orders of magnitude could tie up work queues or available concurrency on nodes, causing cascading failures. High latencies for operations on partitioned shards might also delay or starve operations on shards with a healthy majority component available, reducing goodput. Finally, timeouts are indeterminate results, which force clients to deal with increased ambiguity--did these requests succeed or fail? Clients might retry ambiguous failures multiple times, overloading a struggling cluster.

This plot shows the latency of client operations (writes, in this case) during a network partition isolating 2 nodes from 3 others in a 5-node Jepsen cluster. Time flows horizontally; latency is plotted vertically. Yellow are indeterminate results (e.g. timeouts), and pink operations are known failures. The grey regions indicate when a network partition was in effect.

[Image: latency-raw 64]

This client has a 10-second timeout in effect, so we know YB's server latency is at least 10 seconds under these circumstances--it might be higher.

YugaByte could mitigate these issues by returning definite failure messages to clients immediately when a leader's lease has expired, no leader has been recently reachable, no leader is known, and so on, but continuing to make requests for a leader in the background. Once communication has been re-established, requests may flow again.
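The fail-fast conditions listed above can be sketched as a small decision function. This is only an illustration: `LeaderView`, `Decision`, `Decide`, and the `max_silence` threshold are hypothetical names invented for this sketch, not part of YugaByte's code, and it assumes the server tracks lease expiry and last leader contact as monotonic timestamps.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical illustration only; none of these names come from YugaByte.
using Clock = std::chrono::steady_clock;

struct LeaderView {
  bool leader_known;               // is any leader known for this shard?
  Clock::time_point lease_expiry;  // when the known leader's lease ends
  Clock::time_point last_contact;  // last successful contact with a leader
};

enum class Decision { kForward, kFailFast };

// Returns kFailFast when the request cannot be served promptly, so the
// caller can send a definite error instead of letting the request hang.
// Leader discovery is assumed to continue in the background either way.
Decision Decide(const LeaderView& v, Clock::time_point now,
                std::chrono::milliseconds max_silence) {
  assert(max_silence.count() >= 0);
  if (!v.leader_known) return Decision::kFailFast;          // no leader is known
  if (now >= v.lease_expiry) return Decision::kFailFast;    // lease has expired
  if (now - v.last_contact > max_silence) {
    return Decision::kFailFast;                             // leader not recently reachable
  }
  return Decision::kForward;
}
```

A definite failure from a check like this is cheap to compute and unambiguous for the client, unlike a timeout.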

@aphyr aphyr commented Feb 1, 2019

A bit more data here: I dug into the Cassandra client timeouts, and discovered that these 12-second timeouts were client-imposed. In fact, Yugabyte's CQL layer refuses to time out any client request--even with up to 500 seconds of network partitions, leader elections, node failures, etc.

[Image: rate 1]

This looks to be because the CQL layer specifies MonoTime::Max() for timeouts:
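To illustrate why a maximal deadline amounts to no timeout at all, here is a minimal sketch using `std::chrono` rather than YugaByte's `MonoTime`. The helper names `NoDeadline`, `DeadlineAfter`, and `Expired` are assumptions made for this example and are not taken from the YugaByte source.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Stand-in for an "infinite" deadline, analogous in spirit to MonoTime::Max().
inline Clock::time_point NoDeadline() { return Clock::time_point::max(); }

// A bounded deadline: `timeout` after `start`.
inline Clock::time_point DeadlineAfter(Clock::time_point start,
                                       std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  return start + timeout;
}

// A request has timed out once the clock reaches its deadline.
inline bool Expired(Clock::time_point deadline, Clock::time_point now) {
  return now >= deadline;
}
```

With `NoDeadline()`, `Expired` stays false for any realistic value of `now`, so a retry loop guarded only by this check has no reason to stop on its own.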

@kmuthukk kmuthukk commented Feb 2, 2019

hi @robertpang , @spolitov

While the request between YQL & TServer (at the ybclient layer) is indeed timing out based on client_read_write_timeout_ms, @m-iancu noticed that the NeedsRestart() logic returning true seems to keep the request being retried forever.


Here's a simple repro that @amitanandaiyer had:

create a 3 node yb-ctl cluster
on cqlsh

create keyspace x; 
CREATE TABLE x.test (hk int, pk1 int, pk2 int, payload int, PRIMARY KEY ((hk), pk1, pk2));
insert into x.test (hk, pk1, pk2, payload) values (1, 2, 3, 4);
select * from x.test;

now kill 2 tservers.
(ideally ts2 and ts3 so the cqlsh is still talking to ts1)
issue the insert again: insert into x.test (hk, pk1, pk2, payload) values (1, 2, 3, 4);
this will hang forever
and can be verified by looking at links
cqlsh will time out, but the request keeps showing up in /rpcz until the dead tablet servers are brought back.

@kmuthukk kmuthukk added this to To do in Jepsen Testing via automation Feb 2, 2019
@kmuthukk kmuthukk added this to To Do in YBase features via automation Feb 2, 2019
@kmuthukk kmuthukk commented Feb 2, 2019


Given that there are retries in the layer between YQL and TServer, and a timeout enforced by the client_read_write_timeout_ms gflag, do we need to restart on a timeout again in the executor layer's NeedsRestart logic?

Could we change the NeedsRestart() logic from:

return s.IsTryAgain() || s.IsExpired() || s.IsTimedOut();

to:

return s.IsTryAgain() || s.IsExpired();
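The effect of dropping `IsTimedOut()` from the predicate can be seen in a toy model of a restart loop. The `Status` type below is a simplified stand-in, and `Execute` with its `max_attempts` guard is a hypothetical loop written for this sketch; the real executor is more involved.

```cpp
#include <cassert>
#include <functional>

// Simplified stand-in for a status type; the real one has many more states.
enum class Code { kOk, kTryAgain, kExpired, kTimedOut };

struct Status {
  Code code;
  bool ok() const { return code == Code::kOk; }
  bool IsTryAgain() const { return code == Code::kTryAgain; }
  bool IsExpired() const { return code == Code::kExpired; }
  bool IsTimedOut() const { return code == Code::kTimedOut; }
};

// Proposed predicate: a timeout is no longer a reason to restart.
bool NeedsRestart(const Status& s) { return s.IsTryAgain() || s.IsExpired(); }

// Hypothetical executor loop: runs `op` until it succeeds or returns a
// status that does not warrant a restart. `max_attempts` exists only to
// keep this sketch bounded; `*attempts` reports how many times `op` ran.
Status Execute(const std::function<Status()>& op, int max_attempts, int* attempts) {
  assert(max_attempts > 0 && attempts != nullptr);
  Status s{Code::kOk};
  *attempts = 0;
  while (*attempts < max_attempts) {
    s = op();
    ++*attempts;
    if (s.ok() || !NeedsRestart(s)) break;
  }
  return s;
}
```

Under this predicate, an operation that keeps timing out is attempted once by the loop and its timeout is returned to the caller, while `TryAgain` and `Expired` results are still restarted.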


yugabyte-ci pushed a commit that referenced this issue Feb 5, 2019
…ut retries already happen at ybclient layer

Even when a client request (e.g., from cqlsh) times out, the request is retried forever in the ql/exec/ layer, and we see that the request keeps being retried in the system (until, e.g., the partition heals).

Timeout-related restarts are already handled at the ybclient RPC layer (used for YQL to TServer communication). Each request is already tried/retried for an overall `client_read_write_timeout_ms` amount of time (60s default), with each individual RPC's default timeout being `retryable_rpc_single_call_timeout_ms` (default 2500ms).
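As a rough worked example of that retry budget, the sketch below models a loop in which every RPC uses its full per-call timeout and the next attempt starts immediately. `MaxAttempts`, `RetryResult`, and `RetryUntilDeadline` are hypothetical helpers for illustration; the real ybclient retry logic is not shown in this thread and may include backoff.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Upper bound on full-length attempts that fit in the overall budget,
// assuming no backoff between attempts (an assumption of this sketch).
constexpr long long MaxAttempts(milliseconds overall, milliseconds per_call) {
  return per_call.count() > 0 ? overall.count() / per_call.count() : 0;
}

struct RetryResult {
  int attempts;          // number of RPCs issued
  milliseconds elapsed;  // simulated time spent
};

// Simulated retry loop in which every attempt times out.
RetryResult RetryUntilDeadline(milliseconds overall, milliseconds per_call) {
  assert(per_call.count() > 0);
  RetryResult r{0, milliseconds(0)};
  while (r.elapsed < overall) {
    // Clip the last call so it never outlives the overall deadline.
    milliseconds call = std::min(per_call, overall - r.elapsed);
    r.elapsed += call;
    ++r.attempts;
  }
  return r;
}
```

With the defaults quoted above, 60000 ms / 2500 ms gives at most 24 attempts before the overall timeout is reached, so the layer below the executor already bounds the work done per request.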

Given the above, the NeedsRestart() logic in the executor layer shouldn't also need to consider timeouts as a reason for restart.

Test Plan: Look for test failures. Added new test.

Reviewers: kannan, sergei, mihnea, robert

Reviewed By: robert

Differential Revision:
@aphyr aphyr changed the title Requests time out during network partitions, instead of failing fast During network partitions, CQL requests never time out Feb 6, 2019
@kmuthukk kmuthukk assigned amitanandaiyer and unassigned spolitov Feb 6, 2019
@kmuthukk kmuthukk commented Feb 6, 2019

Fixed in 630955b

@kmuthukk kmuthukk closed this Feb 6, 2019
YBase features automation moved this from To Do to Done Feb 6, 2019
Jepsen Testing automation moved this from To do to Done Feb 6, 2019
@mbautin mbautin changed the title During network partitions, CQL requests never time out [Jepsen] During network partitions, CQL requests never time out Feb 7, 2019