functional-tester: no commit for 30 seconds with leader failures #4477
Comments
Happened once more |
A leader failure followed by an immediate network partition can slow down the cluster, and it sometimes needs more than 30 seconds to reach 10,000 commits after the first failure. Related to etcd-io#4477.
Messages like this indicate a bad network connection in the cluster. In most cases, it should recover in seconds, as below:
|
... continued from the previous comment. We got a few more of these. But for a few cases like the one below, the cluster handled 0 commits in 30 seconds:
So in most cases the cluster recovered on its own.
So I expect that if we gave them more time (say, 30 seconds more), the cluster could just recover from the network partition and do the normal stress commits afterwards. Real logs (successful case):
|
Do we have any clue why this can happen? This is unexpected since we did not inject any failures.
This is also unexpected. Can you verify that |
I have no idea how the network partition happened. In the logs from the last 12 hours, there are about 73 occurrences.
So network partitions happen even without failure injections. But as you pointed out, even with the network partitions, the cluster should still be able to handle PUT requests.
But the logs just stopped after the file purge:
And the actual cleanup happened seconds after the logs stopped:
So clearly there was no sign of commits in agent1 after the network partition. I will investigate more. |
@gyuho Yea. Let's investigate. Seems like an interesting case. |
@xiang90 Yeah, from the logs we use in the production testing cluster I cannot tell exactly what was happening after the network partition. I will run another testing cluster with more verbose output, testing only this case. Thanks. |
Exit the stress goroutine only when the stresser.Cancel method gets called. Network partitions also cause errors in Stress, and in such cases we should retry rather than exit on any error (exiting would stop the whole stress). Fixes etcd-io#4477.
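For reference, here is a minimal sketch of the retry behavior described above. It is written against the public clientv3 API rather than the actual etcd-tester stresser code, so names like stress and the endpoint are illustrative; the point is that request errors (for example timeouts during a partition or election) are retried, and the goroutine exits only when its context is canceled:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// stress keeps issuing PUTs until ctx is canceled. Request errors
// (e.g. timeouts during a network partition or leader election)
// are retried instead of terminating the goroutine.
func stress(ctx context.Context, cli *clientv3.Client) {
	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			return // exit only on explicit cancel
		default:
		}

		reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		_, err := cli.Put(reqCtx, fmt.Sprintf("foo%d", i), "bar")
		cancel()
		if err != nil {
			// transient failure: back off briefly and retry
			time.Sleep(100 * time.Millisecond)
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithCancel(context.Background())
	go stress(ctx, cli)
	time.Sleep(30 * time.Second)
	cancel() // the only way the stress goroutine stops
	time.Sleep(time.Second)
}
```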
@xiang90 Did more investigation as you suggested:
tester logs (killing agent1, the leader):
etcd cluster logs (agent2):
etcd cluster logs (agent3): |
This contradicts your previous assumption (#4477 (comment), item 6) that even without our injected network failures there still are connection failures. |
We probably need to investigate the failure in case3 more. |
I think I misread the timestamps in the old logs. Will investigate more. Thanks! |
@xiang90 Sorry, I was wrong about #4477 (comment). I misread the logs, confused by the UTC timezone. As you pointed out, a bad network connection should look something like this etcd log from #4477 (comment):
And right before this log entry, we see the following. So I believe:
|
Yeah I restarted the testing cluster this afternoon. Will have more data to support that during this holiday. |
Awesome! Thanks! |
Exit the stress goroutine only when the stresser.Cancel method gets called. Network partitions also cause errors in Stress, and in such cases we should retry rather than exit on any error (exiting would stop the whole stress). Fixes coreos#4477.
@xiang90 Ran only |
My guess is that first a leader election happened and took several seconds to resolve. So no PUT requests could go through during those several seconds. Then the PUT requests timed out, since we set a timeout on each request, and then the stresser exited. Can you find a way to verify that this is exactly what happened? Also, it seems that as long as we increase the timeout to 2x the election timeout, this can be resolved. |
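A rough sketch of that mitigation, building on the sketch above (same imports), with an illustrative electionTimeout value rather than the tester's actual configuration. The only point is that the per-request timeout should cover at least one full leader election:

```go
// putWithElectionMargin issues a single PUT with a timeout of twice the
// election timeout, so one leader election can complete before the request
// is declared failed. electionTimeout is an illustrative value, not the
// tester's real configuration.
func putWithElectionMargin(cli *clientv3.Client, key, val string) error {
	const electionTimeout = time.Second
	ctx, cancel := context.WithTimeout(context.Background(), 2*electionTimeout)
	defer cancel()
	_, err := cli.Put(ctx, key, val)
	return err // a timeout here likely means an election is still in progress
}
```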
Exactly. And that's what I specifically tested over the weekend. I printed out the timestamp of each request's start time to see if it matches the timestamp of the leader failure. Look at this log.
You can see at |
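For anyone reproducing this, here is a small sketch of the instrumentation described above (a hypothetical helper using clientv3 and the standard log package, not the exact tester change). It records each request's start time so failed PUTs can be lined up against the leader-failure timestamp in the member logs:

```go
// timedPut logs the start time and duration of any failed PUT so it can be
// correlated with the leader-failure timestamp in the etcd member logs.
func timedPut(cli *clientv3.Client, key, val string) error {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := cli.Put(ctx, key, val)
	if err != nil {
		log.Printf("stress PUT started at %s failed after %s: %v",
			start.Format(time.RFC3339Nano), time.Since(start), err)
	}
	return err
}
```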
From multiple failure cases, we confirmed that requests timed out because of the leader failure:
Will redeploy the testing cluster with #4547 |
Got one more:
but much less frequently. Will investigate more. |
Resolved via #4569. There was no issue in etcd itself. We were stopping the stressers too early:
|
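To illustrate the class of bug (a hypothetical sketch, not the actual #4569 change): if the tester cancels the stressers before waiting out the recovery window, the commit count it checks afterwards is always near zero, even though the cluster is healthy. The fixed ordering keeps the stressers running through the progress check and cancels them last. Names here (Stresser, runCase, injectFailure, recoverFailure) are illustrative; imports needed are errors and time.

```go
// Stresser is a minimal stand-in for the tester's stresser interface.
type Stresser interface {
	Cancel()
	SuccessCount() int
}

// runCase waits for the cluster to make progress under stress before
// canceling the stresser. Canceling first (the bug) guarantees ~0 commits.
func runCase(s Stresser, injectFailure, recoverFailure func() error) error {
	if err := injectFailure(); err != nil {
		return err
	}
	if err := recoverFailure(); err != nil {
		return err
	}
	before := s.SuccessCount()
	time.Sleep(30 * time.Second) // recovery window: stresser keeps writing
	if s.SuccessCount()-before < 10000 {
		return errors.New("cluster too slow: not enough commits in 30s")
	}
	s.Cancel() // stop stressing only after the progress check
	return nil
}
```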
The failure case was "kill one member for long time and expect it to recover from incoming snapshot".

2016/02/10 04:27:19: kill triggered to agent3 (leader at term 457)

agent1 (follower at term 457): 04:27:19 E | rafthttp: the connection with agent3 became inactive
agent2 (follower at term 457): 04:27:19 E | rafthttp: the connection with agent3 became inactive

2016/02/10 04:27:26.730656: agent1 became leader at term 463
2016/02/10 04:27:26.728370: agent2 became follower at term 463

Immediately another network partition happened between agent1 and agent2. And below is what was happening in agent2:

This caused, at 2016/02/10 04:27:49:

etcd-tester: [round#134 case#3] injection error: cluster too slow: only commit 0 requests in 30s

because 30 seconds weren't long enough to recover from the consecutive failures: a leader failure, then another network partition between the remaining members. If we gave it more time, it should have recovered. This is very rare; it has only happened once in the 1000+ rounds we have been running so far.

To make functional-test more robust against such cases, we should test the leader failure.
Here are the full logs:
agent1-2016-02-10T04-27-49Z-etcd.log.txt
agent2-2016-02-10T04-27-49Z-etcd.log.txt
agent3-2016-02-10T04-27-49Z-etcd.log.txt
/cc @xiang90