[Jepsen] Recovery after network partitions seems to take a while #821
This is a slightly different workload which performs single-row inserts and all-row reads throughout the test, instead of waiting for recovery before doing a read. When we schedule partitions in long-running patterns, it looks like it takes about 20 seconds to fully recover. However, that same workload with 30s/30s healthy/partitioned cycles results in a cluster which never appears to recover to a healthy state. I'm wondering if more rapid (~30s) changes in network topology can knock YB into a confused state that takes longer to recover from...
Recovery times are variable, and there does appear to be a relationship between partition frequency and recovery time: 55+ second partitions seem reliably recoverable, while partitions of ~40 seconds or shorter seem to prevent recovery. Here are 70, 60, 50, 40, 30, 20, and 10-second partitions in sequence. 50 seconds seems like it can go either way.
I'm starting to suspect that there's some sort of state you can get a cluster into that is harder to get out of than normal, which is why we're seeing variable results in these tests. Consider, for instance, these three 1000-second tests with 20-second cycles of partitions followed by 60-second windows for recovery. Most of the time the cluster recovers OK, but sometimes it seems to get wedged for a good long while! Here's the full logs for that last test: http://aphyr.com/media/20190131T124802.000-0500.zip
Looking into http://aphyr.com/media/20190131T124802.000-0500.zip, I grepped for "worker 0|nemesis." in jepsen.log.
So it looks like a lot of these requests may not even be hitting the YB servers. The stack trace when we run into such an error looks something like this:
@aphyr does the partition affect the communication between the jepsen client and the yb-servers at all? Or is it fair to expect that those communication links should always be open, and that the only links being broken/mended are the ones between the yb-servers themselves?
Yes, I've also been wondering about this! It's odd that we get timeouts during the connection process, right? This isn't something Jepsen does: we only interfere with network connections between database nodes, not from the clients to the nodes. I can think of a few ways this might happen:
20 seconds would explain some of the long recovery times we've seen (though some are longer!). I'll try tuning that down.
Yeah, tuning that down should help. As for issue 3) in #821 (comment): Kudu/Raft has an enhancement called pre-elections that prevents the minority partition from bumping up their terms, thereby reducing the election churn. I'm working on porting that over to Yugabyte in #604. Will expedite my effort on that front; it should also help mitigate these "late elections" to some extent.
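For context, pre-elections (sometimes called pre-votes) work roughly like this: before a node increments its term and campaigns for real, it first asks its peers whether they would grant it a vote. A node stranded in a minority partition can never win a pre-vote, so it never inflates its term and disrupts the majority once the partition heals. The sketch below illustrates only that idea; the `Peer` and `PreVoteCandidate` names are invented here and are not Kudu's or Yugabyte's actual classes.

```java
import java.util.List;

interface Peer {
    // Hypothetical RPC: would you grant a vote for this proposed term, given this log position?
    boolean wouldGrantVote(long proposedTerm, long lastLogIndex);
}

class PreVoteCandidate {
    private final long currentTerm;
    private final long lastLogIndex;
    private final List<Peer> peers;

    PreVoteCandidate(long currentTerm, long lastLogIndex, List<Peer> peers) {
        this.currentTerm = currentTerm;
        this.lastLogIndex = lastLogIndex;
        this.peers = peers;
    }

    /** Returns true if a real election (with a persisted term bump) should be started. */
    boolean preVote() {
        long proposedTerm = currentTerm + 1;   // not persisted yet -- the key difference from a real election
        int votes = 1;                         // vote for ourselves
        for (Peer p : peers) {
            if (p.wouldGrantVote(proposedTerm, lastLogIndex)) {
                votes++;
            }
        }
        // Only bump the term and campaign for real if a majority is reachable and willing.
        return votes > (peers.size() + 1) / 2;
    }
}
```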
Pre-elections miiiight help, but I feel like that shouldn't really cause major issues... Once the partition resolves, any request (either a vote request from the minority nodes to the majority ones, or an append-entries request from the majority leader to a minority node) will force the majority leader to step down and snap its term forward to the higher minority term; it should immediately be able to go through an election, and one of the majority nodes should win, since they likely have committed log entries above the minority nodes. I'm not sure why this scenario would result in multi-second election delays--but I know there's a lot more complexity involved here than basic Raft. We also have to worry about membership changes, right? So maybe this is something that gets worse when joint consensus is involved...
Yeah, I agree that once the majority leader steps down, he should "immediately" be re-elected. Ideally, that should probably be sub-second. But there could be some inefficiency in the process that's causing this to take 2 election rounds (about 3-6 sec). That by itself is probably still OK. But if the client timeout is < 6 sec, it could cause an operation timeout and prompt the jepsen client to create a new client/re-connect and run into issue 1). Will try to dig in a little more to see if I can identify the cause/fix for issue 1. That's probably the main contributor to the jepsen client not making progress.
For the logs that I've looked at, I haven't seen any membership changes. But if it were to happen, it may add some more corner cases to the equation -- though from a perf standpoint, config changes shouldn't have much effect here, and shouldn't cause additional timeouts to the jepsen client.
I see this in one of the runs. Will look into what's happening on refresh Nodes.
Merging relevant portions of jepsen.log with the time taken for querying system tables from …
It looks like most of the queries that result in an "Error opening client" are preceded by a query to a system table, which takes > 1 sec. We seem to be setting the connection timeout in …
We should bump it up to 1 + 6 sec to allow for the time to make one CQL call.
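For reference, here is roughly where those two knobs live in the DataStax 3.x Java driver, which (as far as I can tell) is what the Jepsen CQL client wraps. The 7-second connect timeout and 12-second read timeout below are illustrative values, not the ones the test actually uses.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class TimeoutExample {
    public static Cluster build(String contactPoint) {
        SocketOptions socketOptions = new SocketOptions()
            // TCP connect timeout: how long we wait to establish the connection itself.
            .setConnectTimeoutMillis(7_000)
            // Per-request read timeout: must also cover the system-table query the driver
            // issues while initializing its control connection.
            .setReadTimeoutMillis(12_000);
        return Cluster.builder()
            .addContactPoint(contactPoint)
            .withSocketOptions(socketOptions)
            .build();
    }
}
```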
@aphyr I believe increasing the …
We lowered that timeout so that we don't get stuck waiting indefinitely for YB to start doing operations again--previous tests weren't making many requests during partitions because of high timeouts. We can raise it here, but... does it feel like maybe it shouldn't take six seconds to do a query that every Cassandra client has to do before anything else? That seems weirdly high...
I'm OK with lower timeouts if we want that. My concern is that if we expect to do a read to CQL as part of opening the connection, we should probably keep the connectTimeout slightly higher than the timeout for the read. But yeah, it shouldn't take six seconds. Unless we are in a partition ...
I think connectTimeout here refers to the TCP connect timeout, so raising it won't make a difference if it's the query that's slow. It... is weird that it makes a query as a part of the connection process, but that just seems to be how the Cassandra client is designed. Wonder if we can turn that off somehow... I should go digging in that code.
I'm investigating why queries to the master might take long. It looks like the nemesis had just healed at that point. Will try to find another case where it takes long during good connectivity.
There seems to be a setting for this. We could probably set it to something much larger, if we want to disable node discovery. From what I understand, we want queries from the jepsen client to go only to the node that we connect to, and we do not want the jepsen client to forward the request to a different node (which may be "local" for that key/tablet).
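One client-side way to approximate this, assuming the DataStax 3.x Java driver, is to whitelist the single contact point so the driver only routes queries through the node it connected to. This is a sketch of that approach, not necessarily the setting referred to above, and it does not change how YB routes requests internally.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

import java.net.InetSocketAddress;
import java.util.Collections;

public class PinnedClient {
    public static Cluster build(String host) {
        InetSocketAddress only = new InetSocketAddress(host, 9042);
        return Cluster.builder()
            .addContactPointsWithPorts(Collections.singletonList(only))
            // Whitelist a single host: the driver may still learn about peers, but it will
            // only open connections to (and route queries through) this one node.
            .withLoadBalancingPolicy(
                new WhiteListPolicy(new RoundRobinPolicy(), Collections.singletonList(only)))
            .build();
    }
}
```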
It looks like even if querying the system table takes > 1 sec, it doesn't cause an issue unless it is > 2 sec. I'm thinking that it is probably some other timeout kicking in; I thought that the timeout used is 10 sec or 12 sec.
While these would account for temporary/unnecessary glitches, they would not account for your third figure, where things are consistently down. If the masters are in good condition, querying the system tablets should eventually become O(x ms) and things should get back to normal. In the logs you uploaded, it looks like the proxies are busy in meta_cache.cc, repeatedly querying the masters without any success.
Typically, the master leader should respond with a successful response. But it looks like all 3 masters respond with NOT_THE_LEADER even though the leader election was successful, and one of them is/was currently the leader. The status NOT_THE_LEADER also seems to be used in the response if …
My guess is that the issue in the 3rd figure that you ran into is essentially caused by the catalog_manager; this makes all the system.xyz table operations hang/not succeed. The Cassandra client might interpret this failure in a way that prevents the jepsen clients from even making queries to other (normal) tables.
@bmatican noticed that the service queues at the masters on your 3rd run have been running over, so that would explain why the tservers did not hear back from the masters, causing the system to not respond.
One simple fix would be to set the …
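As a toy illustration of why an overflowing service queue shows up as silence from the tservers' point of view: once a bounded queue is full, new RPCs are rejected or dropped rather than processed, and the remote caller just times out. This sketch is not YB's RPC code; the names and capacity are made up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ServiceQueue {
    // Bounded queue of pending RPCs; the capacity here is arbitrary.
    private final BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(50);

    /**
     * Returns false when the queue is full. In a real RPC server this surfaces as a
     * rejected or dropped call, and the remote caller eventually times out.
     */
    boolean submit(Runnable rpc) {
        return pending.offer(rpc);
    }
}
```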
Yep! When a client fails to open a connection like this, that's not the end of the world; we immediately close the client and try again, and as you observe, YugaByte DB does recover after some time. Since Cassandra clients make that query for peers as a part of their default connection process, I think it'd be worth making that query read from a local cache--I assume right now it's trying to talk over the network to other nodes, which means new clients are going to have trouble talking to YB during partitions. I can open a new ticket for that, if you think it'd be helpful! Meta-cache and service queue issues... yeah! Those seem like useful things to address!
…borting tasks.
Summary: Fix the race condition in the master that's causing the master to get stuck when it re-elects itself as the leader.
Issue: As part of becoming a leader, the master will abort and wait for all (previously created) tasks.
- There could have been items added to pending_tasks_ after Abort that are waited on but never aborted.
- Even the tasks that have been aborted aren't immediately removed from pending_tasks_.
Fixes include:
- Ensuring that once Abort is called with the intention of waiting (AbortAndClose), all further tasks that are added will be automatically aborted.
- An aborted task is immediately unregistered, causing it to be promptly removed from the pending_tasks_ list.
Other minor fixes:
- Fix a race condition between Shutdown and RunLoaders in SysCatalog.
- A follower won't reject election requests from the current leader because it believes the leader to be in good standing.
Test Plan: Jenkins + added a test to repro Kyle's scenario.
Reviewers: sergei, timur, bogdan
Reviewed By: bogdan
Subscribers: mikhail, ybase, bharat
Differential Revision: https://phabricator.dev.yugabyte.com/D6186
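A simplified sketch of the shutdown pattern that commit describes, written in Java rather than YugaByte's actual C++ (the TaskRegistry/Task names are invented for illustration): once abort-and-close has begun, late-registered tasks are aborted on arrival, and aborted tasks are unregistered immediately, so the waiter cannot hang on tasks that will never complete.

```java
import java.util.HashSet;
import java.util.Set;

class TaskRegistry {
    interface Task { void abort(); }   // aborting a task eventually leads to unregister(this)

    private final Set<Task> pendingTasks = new HashSet<>();
    private boolean closing = false;

    // Called whenever a background task is created.
    synchronized void register(Task task) {
        if (closing) {
            task.abort();              // fix #1: tasks added after abortAndClose() are auto-aborted
            return;
        }
        pendingTasks.add(task);
    }

    // Fix #2: an aborted (or finished) task unregisters itself immediately,
    // instead of lingering in pendingTasks until some later cleanup.
    synchronized void unregister(Task task) {
        pendingTasks.remove(task);
        notifyAll();
    }

    // Abort everything and wait for the registry to drain.
    synchronized void abortAndClose() throws InterruptedException {
        closing = true;
        for (Task t : new HashSet<>(pendingTasks)) {
            t.abort();
        }
        while (!pendingTasks.isEmpty()) {
            wait();                    // without fixes #1/#2, this wait could hang forever
        }
    }
}
```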
I've done a little more experimenting on version 1.1.13.0-b2, and it looks like not only can YB get stuck for long durations given ongoing network partitions, but it can also get stuck for 5+ minutes even in the absence of failures. However, additional network partitions can kick the cluster into a working state again! http://jepsen.io.s3.amazonaws.com/analyses/yugabyte-db-1.1.9/20190221T100947-stuck.zip
This could again be caused by a leader/master getting stuck upon re-election; if one master gets stuck, we won't make progress indefinitely, until another partition causes the leadership to move. So here, I think 1 master got stuck with the issue when it got re-elected after the small partition, and new clients could not connect for 5+ minutes. But the other 2 masters possibly did not run into this issue, so we are able to make progress when the leadership flips over to one of them. I have a fix for this issue, which should be in any build that includes 1ae0df3.
Verified that this was indeed the case: n1 got stuck while opening the catalog table.
Actually, both n1 and n2 ran into this issue, but n3 did not. So we were able to make progress when n3 got re-elected as the leader.
There are options which I've used to make recovery faster:
In both 1.1.13.0-b2 and 1.1.15.0-b16, it looks like these tuning flags make recovery less reliable, and might introduce spurious failures, at least in my LXC cluster. This is the behavior with stock flags: And with the suggested flags: You can reproduce these by checking out Jepsen 9ea7c3e53b06947d743d4f2bed82dceb0ac088b2, and running:
OK, I've done a bunch more testing; recovery times on 1.1.15.0-b16 now look to be around 25 seconds, and they're generally reliable about coming back after that time. Here's a typical example from Jepsen 337d5fde62839c631b896c862bb676140ea10d23, running …
Thanks @aphyr for confirming. |
This may not be a real issue in YB; I'm still exploring behavior and collecting more evidence!
On Yugabyte CE, version 1.1.10.0, recovery from network partitions seems to take a little longer than one might expect. In particular, when performing a series of single-row CQL inserts during a mixture of randomized 3/2 split network partitions, followed by the end of all partitions, 30 seconds for the cluster to heal, and a final read of all rows, that read consistently times out.
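For a rough idea of the workload shape: the real test is written in Clojure using Jepsen, but it boils down to single-row inserts issued while partitions come and go, followed by one read of everything. The Java-driver sketch below is only illustrative; the keyspace, table name, and row count are made up.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SingleRowWorkload {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("n1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS jepsen "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS jepsen.set_test (id int PRIMARY KEY)");

            // Phase 1: single-row inserts, issued continuously while the nemesis
            // introduces and heals 3/2 network partitions.
            for (int i = 0; i < 1000; i++) {
                session.execute("INSERT INTO jepsen.set_test (id) VALUES (" + i + ")");
            }

            // Phase 2: after partitions end and the cluster is given time to heal,
            // a single read of all rows -- the step that times out in these tests.
            ResultSet rs = session.execute("SELECT id FROM jepsen.set_test");
            for (Row row : rs) {
                System.out.println(row.getInt("id"));
            }
        }
    }
}
```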
This cluster is running on 5 nodes, with replication factor 3. n1, n2, and n3 are masters; n1, n2, n3, n4, and n5 are tservers.
For instance, in this test, the second partition (the shaded region in this plot) appears to prevent all clients from performing writes, and even with 30 seconds of recovery before the final read, that read times out:
This may be expected behavior, or might be some sort of client issue, so I'm going to keep exploring. I'll update this issue as I find out more. :)