
functional-tester: no commit for 30 seconds with leader failures #4477

Closed

gyuho opened this issue Feb 10, 2016 · 21 comments

gyuho commented Feb 10, 2016

  • 3 servers running in functional-tester suite
  • Case: kill one member for long time and expect it to recover from incoming snapshot
  1. 2016/02/10 04:27:19: kill triggered to agent3 (leader at term 457)
    • This is a leader failure, but the test case just randomly chooses a member to kill. It just happened to be the leader in this run.
    • agent1 (follower at term 457)
      • 04:27:19 E | rafthttp: the connection with agent3 became inactive
    • agent2 (follower at term 457)
      • 04:27:19 E | rafthttp: the connection with agent3 became inactive
  2. 2016/02/10 04:27:26.730656: agent1 became leader at term 463
  3. 2016/02/10 04:27:26.728370: agent2 became follower at term 463

Immediately afterwards, another network partition occurred between agent1 and agent2. Below is what was happening in agent2:

04:27:40.034313 I | fileutil: purged file agent.etcd/member/snap/00000000000001c0-0000000003205924.snap successfully
04:27:49.503099 E | rafthttp: failed to read agent1 on stream Message (unexpected EOF)
04:27:49.503145 I | rafthttp: the connection with agent1 became inactive

This caused 2016/02/10 04:27:49: etcd-tester: [round#134 case#3] injection error: cluster too slow: only commit 0 requests in 30s, because 30 seconds was not long enough to recover from the consecutive failures: a leader failure followed by another network partition between the remaining members. Given more time, the cluster should have recovered. This is very rare; it has happened only once out of the 1000+ rounds we have been running so far.

To make the functional-test more robust against such cases, we should test leader failure explicitly.
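For illustration only, here is a minimal sketch of what a dedicated leader-kill failure case could look like; the Cluster interface and its LeaderIndex/KillMember/RestartMember methods are hypothetical stand-ins, not the actual functional-tester API:

```go
package tester

import (
	"fmt"
	"time"
)

// Cluster is a hypothetical abstraction over the functional-tester cluster;
// the real tester has its own failure and cluster types.
type Cluster interface {
	LeaderIndex() (int, error)   // index of the current leader member
	KillMember(idx int) error    // stop the etcd process on that member
	RestartMember(idx int) error // bring the member back up
}

// killLeader injects a leader failure instead of killing a random member,
// so that every round exercises the leader-election path described above.
func killLeader(c Cluster, downtime time.Duration) error {
	lead, err := c.LeaderIndex()
	if err != nil {
		return fmt.Errorf("cannot find leader: %v", err)
	}
	if err := c.KillMember(lead); err != nil {
		return err
	}
	time.Sleep(downtime) // keep the leader down long enough to force recovery
	return c.RestartMember(lead)
}
```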

Here are the full logs:

agent1-2016-02-10T04-27-49Z-etcd.log.txt
agent2-2016-02-10T04-27-49Z-etcd.log.txt
agent3-2016-02-10T04-27-49Z-etcd.log.txt

/cc @xiang90


gyuho commented Feb 10, 2016

Happened once more: 2016/02/10 08:11:40 etcd-tester: [round#343 case#3] injection error: cluster too slow: only commit 0 requests in 30s

gyuho added a commit to gyuho/etcd that referenced this issue Feb 10, 2016
Leader failure, immediate network partition can slow down the
cluster, and this sometimes need more than 30 seconds to reach
10,000 commits since the first failure.

Related to etcd-io#4477.

gyuho commented Feb 11, 2016

Messages like rafthttp: failed to read 877e7f4aa0ff117e on stream Message (unexpected EOF) come from https://github.com/coreos/etcd/blob/master/rafthttp/peer_status.go#L53-L65, which is called at https://github.com/coreos/etcd/blob/master/rafthttp/stream.go#L265-L283.

This indicates a bad network connection in the cluster. In most cases, it recovers within seconds, as below:

15:46:49.353993 E | rafthttp: failed to read 179aaf9458e18548 on stream MsgApp v2 (unexpected EOF)
15:46:49.354038 I | rafthttp: the connection with 179aaf9458e18548 became inactive
15:46:50.291357 I | etcdserver: start to snapshot (applied: 2255647, lastsnap: 2245597)
15:46:50.301827 I | etcdserver: saved snapshot at index 2255647
15:46:50.302042 I | etcdserver: compacted raft log at 2250647
15:46:51.522093 I | rafthttp: the connection with 179aaf9458e18548 became active

@gyuho gyuho changed the title functional-tester: test leader failure more robustly functional-tester: no commit for 30 seconds with leader failures Feb 12, 2016

gyuho commented Feb 12, 2016

... continued from the previous comment.

We got a few more case #3 failures in the functional-test (these are the only cases that have been failing the functional-test; we have seen 5 failures out of the 2,500+ rounds so far).

In case #3, we randomly kill one member and wait for about 30 seconds to check whether the cluster can still handle stress while losing one member. The case fails if the cluster handles only 10,000 or fewer commits during that 30-second window. 99% of the time, etcd handles over 10,000 commits in just 3 seconds.
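For reference, a minimal sketch of the kind of check the tester applies, assuming a hypothetical commitsSince helper that counts the stress PUTs completed since a given time (the real tester counts successes reported by its stressers):

```go
package tester

import (
	"fmt"
	"time"
)

const (
	waitWindow = 30 * time.Second // time the cluster gets after the member is killed
	minCommits = 10000            // commits required during that window
)

// checkProgress fails the case if the cluster could not keep committing
// stress requests while one member was down, using the threshold described
// in the comment above.
func checkProgress(commitsSince func(time.Time) int) error {
	start := time.Now()
	time.Sleep(waitWindow)
	if n := commitsSince(start); n <= minCommits {
		return fmt.Errorf("cluster too slow: only commit %d requests in 30s", n)
	}
	return nil
}
```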

But in a few cases like the ones below, it handled 0 commits in 30 seconds:

  1. The failure injection kills one random member
    • the member that got killed happened to be the leader
  2. The remaining two members must now elect a new leader and continue to operate
    • they are also expected to handle more than 10,000 commits within 30 seconds
  3. In the 3-node cluster, it took a total of 5 seconds to recover from the leader failure
    • agent1 follower, agent2 follower, agent3 leader
    • agent3 (leader) got killed by the tester
    • agent2 became the new leader with votes from agent1 and agent2
  4. Now we have 25 seconds left
  5. At almost the same time, the periodic fileutil.PurgeFile got triggered in both agent1 and agent2
    • This could block or slow down the stress commits (???); a rough sketch of such a purge loop is shown after this list
  6. And there was also a network partition between those two members
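For context, here is a rough sketch of the pattern behind such a periodic snapshot-file purge; this illustrates the general idea only and is not the actual fileutil.PurgeFile implementation:

```go
package purge

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

// purgeLoop keeps at most max files matching pattern in dir, checking on
// every tick and deleting the oldest files (by name order) first. The real
// implementation is more involved, but the work per tick is essentially a
// directory listing plus a few file removals.
func purgeLoop(dir, pattern string, max int, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			files, err := filepath.Glob(filepath.Join(dir, pattern))
			if err != nil {
				continue
			}
			sort.Strings(files)
			for len(files) > max {
				os.Remove(files[0]) // best effort in this sketch; errors ignored
				files = files[1:]
			}
		case <-stop:
			return
		}
	}
}
```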

So all case #3 failure cases have three things in common:

  1. the leader got killed
  2. the periodic PurgeFile got triggered, possibly slowing down the stress commits (???)
  3. a network partition happened between the two surviving nodes

So I expect that if we gave them more time (say, 30 more seconds), the cluster would recover from the network partition and resume the normal stress commits after PurgeFile, thereby passing case #3.


Real logs: successful case #3

---------------------------------
2016-02-11 16:36:53 etcd-tester: [round#1057 case#3] start failure kill one member for long time and expect it to recover from incoming snapshot
2016-02-11 16:36:53 etcd-tester: [round#1057 case#3] start injecting failure...
2016-02-11 16:36:59 etcd-tester: [round#1057 case#3] injected failure
---------------------------------
agent1 35111e6c11e936fa
agent2 25a3f9edab423fd3
agent3 179aaf9458e18548

2016-02-11 16:36:52
agent1 follower
agent2 leader
agent3 follower

2016-02-11 16:36:53
kill agent2 leader

2016-02-11 16:36:57
agent1 follower
agent2 inactive
agent3 leader

2016-02-11 16:36:58
agent1 follower
keep writing
2016-02-11 16:36:58.038086 I | etcdserver: start to snapshot (applied: 21073960, lastsnap: 21063946)
2016-02-11 16:36:58.050095 I | etcdserver: saved snapshot at index 21073960
2016-02-11 16:36:58.050282 I | etcdserver: compacted raft log at 21068960

2016-02-11 16:36:58
agent3 leader
keep writing
2016-02-11 16:36:58.067153 I | etcdserver: start to snapshot (applied: 21074117, lastsnap: 21064103)
2016-02-11 16:36:58.076783 I | etcdserver: saved snapshot at index 21074117
2016-02-11 16:36:58.076941 I | etcdserver: compacted raft log at 21069117

Real logs: failed case #3

---------------------------------
2016-02-11 16:37:58 etcd-tester: [round#1058 case#3] start injecting failure...
2016-02-11 16:38:28 etcd-tester: [round#1058 case#3] injection error: cluster too slow: only commit 0 requests in 30s
---------------------------------
agent1 35111e6c11e936fa
agent2 25a3f9edab423fd3
agent3 179aaf9458e18548

2016-02-11 16:37:57 
agent1 follower
agent2 follower
agent3 leader

2016-02-11 16:37:57 
kill agent3 leader

2016-02-11 16:38:04
agent1 follower
agent2 leader
agent3 inactive


= new leader election took only 5 seconds since failure injection


agent1 follower
2016-02-11 16:38:15.980782 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-0000000001456554.snap successfully
2016-02-11 16:38:15.980841 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-0000000001458c70.snap successfully
2016-02-11 16:38:15.980904 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-000000000145b396.snap successfully

agent2 leader
2016-02-11 16:38:22.029865 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-000000000145af3e.snap successfully
2016-02-11 16:38:28.994092 E | rafthttp: failed to read 35111e6c11e936fa on stream Message (unexpected EOF)
2016-02-11 16:38:28.994134 I | rafthttp: the connection with 35111e6c11e936fa became inactive


xiang90 commented Feb 12, 2016

And there was also a network partition between those two members

Do we have any clue why this can happen? This is unexpected since we did not inject any failures.

At almost the same time, the periodic fileutil.PurgeFile got triggered in both agent1 and agent2
This could block or slow down the stress commits

This is also unexpected. Can you verify that PurgeFile can slow down etcd in a separate test?


gyuho commented Feb 12, 2016

@xiang90

And there was also a network partition between those two members

Do we have any clue why this can happen? This is unexpected since we did not inject any failures.

I have no idea how the network partition happened. From the logs of the last 12 hours, there are about 73 occurrences of xxxxxx stream MsgApp v2 (unexpected EOF), and if you look at the ones that are not related to failure injection:

On the tester side:

2016/02/11 15:44:48 etcd-tester: [round#1008 case#2] canceled stressers
2016/02/11 15:44:48 etcd-tester: [round#1008 case#2] checking current revisions...
2016/02/11 15:44:49 etcd-tester: [round#1008 case#2] all members are consistent with current revisions

In the agent etcd log:

2016/02/11 15:44:48 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2016/02/11 15:44:48 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
...
2016/02/11 15:44:48 grpc: Server.processUnaryRPC failed to write status: stream error: code = 1 desc = "context canceled"
2016-02-11 15:44:49.957652 E | rafthttp: failed to read 35111e6c11e936fa on stream MsgApp v2 (unexpected EOF)
2016-02-11 15:44:49.957706 I | rafthttp: the connection with 35111e6c11e936fa became inactive
2016-02-11 15:44:50.358926 I | etcdserver: start to snapshot (applied: 1488642, lastsnap: 1478523)
2016-02-11 15:44:50.364511 I | etcdserver: saved snapshot at index 1488642
2016-02-11 15:44:50.364669 I | etcdserver: compacted raft log at 1483642
2016-02-11 15:44:51.562849 I | etcdserver: start to snapshot (applied: 1498660, lastsnap: 1488642)
2016-02-11 15:44:51.573142 I | etcdserver: saved snapshot at index 1498660
2016-02-11 15:44:51.573300 I | etcdserver: compacted raft log at 1493660
2016-02-11 15:44:52.656202 I | etcdserver: start to snapshot (applied: 1508677, lastsnap: 1498660)
2016-02-11 15:44:52.672499 I | etcdserver: saved snapshot at index 1508677
2016-02-11 15:44:52.672683 I | etcdserver: compacted raft log at 1503677
2016-02-11 15:44:53.884568 I | etcdserver: start to snapshot (applied: 1518682, lastsnap: 1508677)
2016-02-11 15:44:53.893050 I | etcdserver: saved snapshot at index 1518682
2016-02-11 15:44:53.893236 I | etcdserver: compacted raft log at 1513682
2016-02-11 15:44:54.665878 I | rafthttp: the connection with 35111e6c11e936fa became active

So network partitions happen even without failure injection. But as you pointed out, even with the network partitions, the cluster should still be able to handle PUT requests.

And PurgeFile might not be related; it is just a common element in all case #3 failure cases. I will try to measure what impact it could have had on etcd. The weird thing is that the agent1 follower was supposed to keep handling PUT requests and should have shown logs like:

2016-02-11 15:41:24.741652 I | fileutil: purged file agent.etcd/member/snap/0000000000000002-0000000000029c5b.snap successfully
2016-02-11 15:41:24.741703 I | fileutil: purged file agent.etcd/member/snap/0000000000000002-000000000002c3cd.snap successfully
2016-02-11 15:41:25.651095 I | etcdserver: start to snapshot (applied: 241563, lastsnap: 231486)
2016-02-11 15:41:25.658882 I | etcdserver: saved snapshot at index 241563
2016-02-11 15:41:25.659043 I | etcdserver: compacted raft log at 236563
2016-02-11 15:41:26.893827 I | etcdserver: start to snapshot (applied: 251721, lastsnap: 241563)
2016-02-11 15:41:26.898386 I | etcdserver: saved snapshot at index 251721

But the logs just stopped after the file purge:

2016-02-11 16:38:04.237591 I | raft: raft.node: 35111e6c11e936fa elected leader 25a3f9edab423fd3 at term 155
2016-02-11 16:38:15.976614 W | rafthttp: the connection to peer 179aaf9458e18548 is unhealthy
2016-02-11 16:38:15.980782 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-0000000001456554.snap successfully
2016-02-11 16:38:15.980841 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-0000000001458c70.snap successfully
2016-02-11 16:38:15.980904 I | fileutil: purged file agent.etcd/member/snap/0000000000000092-000000000145b396.snap successfully

[END OF LOG]
[ CLEAN UP HAPPENS ]

And the actual cleanup happened seconds after the logs stopped:

2016/02/11 16:38:28 etcd-tester: [round#1058 case#3] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/11 16:38:28 etcd-tester: [round#1058 case#3] cleaning up...

So clearly there was no sign of commits in agent1 after the network partition.

I will investigate more.


xiang90 commented Feb 12, 2016

@gyuho Yea. Let's investigate. Seems like an interesting case.


gyuho commented Feb 12, 2016

@xiang90 Yeah, from the logs we use in the production testing cluster I cannot tell exactly what was happening after the network partition. I will run another testing cluster with more verbose output, testing only case #3.

Thanks.

gyuho added a commit to gyuho/etcd that referenced this issue Feb 12, 2016
Exit the stress goroutine only when the stresser.Cancel
method gets called. Network partitions also cause errors
in Stress and in such cases, we should retry rather than
exiting for any error (this will stop the whole stress).

Fixes etcd-io#4477.

gyuho commented Feb 13, 2016

@xiang90 Did more investigation as you suggested:

  1. in the cluster without failure injection, I don't see any message like failed to read 27b7e0c85df3375f on stream Message (unexpected EOF) (3-node cluster, only stressing processes)
  2. in the cluster WITH failure injection, I confirmed that the EOF messages happened at the moment we injected failures or bad network connections. Here are examples:

tester logs (killing agent1 leader): [screenshot]

etcd cluster logs (agent2): [screenshot]

etcd cluster logs (agent3): [screenshot]


xiang90 commented Feb 14, 2016

in the cluster without failure injection, I don't see any message like failed to read 27b7e0c85df3375f on stream Message (unexpected EOF) (3-node cluster, only stressing processes)

This contradicts your previous assumption (#4477 (comment), item 6) that even without our injected network failures there are still connection failures.


xiang90 commented Feb 14, 2016

We probably need to investigate the case #3 failures more.


gyuho commented Feb 14, 2016

I think I misread the timestamps in the old logs. Will investigate more. Thanks!


gyuho commented Feb 14, 2016

@xiang90 Sorry, I was wrong about #4477 (comment). I misread the logs, confused by the UTC timezone.

As you pointed out, a bad network connection should show up as something like context deadline exceeded. I looked at the logs again for the timestamps where I claimed there was a bad connection, and it turns out there was a failure injection from the functional tester at those times.

etcd log from #4477 (comment):

2016-02-10 15:46:49.353993 E | rafthttp: failed to read 179aaf9458e18548 on stream MsgApp v2 (unexpected EOF)
2016-02-10 15:46:49.354038 I | rafthttp: the connection with 179aaf9458e18548 became inactive
2016-02-10 15:46:50.291357 I | etcdserver: start to snapshot (applied: 2255647, lastsnap: 2245597)
2016-02-10 15:46:50.301827 I | etcdserver: saved snapshot at index 2255647
2016-02-10 15:46:50.302042 I | etcdserver: compacted raft log at 2250647
2016-02-10 15:46:51.522093 I | rafthttp: the connection with 179aaf9458e18548 became active

And right before this failed to read 179aaf9458e18548 on stream MsgApp v2, you can see the failure injection:

[screenshot]

So I believe:

  • the "no commits in 30 seconds" failures come from our wrong logic of returning from the stresser
  • (unexpected EOF) only happens when there is a failure injection


xiang90 commented Feb 14, 2016

@gyuho

Sorry, I was wrong about #4477 (comment).

No problem. But we probably want to observe more failure cases to be certain about this.

the "no commits in 30 seconds" failures come from our wrong logic of returning from the stresser

Can you try to confirm this? And perhaps explain why the stresser returns early?


gyuho commented Feb 14, 2016

Yeah, I restarted the testing cluster this afternoon. I will have more data to support that over the holiday.


xiang90 commented Feb 14, 2016

Awesome! Thanks!

gyuho referenced this issue in gyuho/etcd Feb 15, 2016
Exit the stress goroutine only when the stresser.Cancel
method gets called. Network partitions also cause errors
in Stress and in such cases, we should retry rather than
exiting for any error (this will stop the whole stress).

Fixes coreos#4477.

gyuho commented Feb 16, 2016

@xiang90 Ran only case #3 (kill the leader and wait for 10,000 commits; this is now case #4) over the holidays until we ran out of space. We got a total of 82 cluster too slow: only commit 0 requests in 30s errors out of 1,333 rounds. This happens when there is a leader failure and the cluster then goes through leader election: stress requests triggered at the same time as the leader failure time out, and the current stress implementation does not do any retrying (continue).


xiang90 commented Feb 16, 2016

This happens when there is a leader failure and the cluster then goes through leader election: stress requests triggered at the same time as the leader failure time out, and the current stress implementation does not do any retrying (continue).

My guess is that first a leader election happened, and it took several seconds to resolve, so no put requests could go through during those seconds. Then the put requests timed out, since we set a timeout on each request. Then the stresser exited.

Can you find a way to verify that this is exactly what happened?

Also, it seems that as long as we increase the timeout to 2x the election timeout, this can be resolved.


gyuho commented Feb 16, 2016

My guess is that first a leader election happened, and it took several seconds to resolve, so no put requests could go through during those seconds. Then the put requests timed out, since we set a timeout on each request. Then the stresser exited.

Exactly. And that is what I specifically tested over the weekend: I printed out each request's start time to see whether it matches the timestamp of the leader failure. Look at this log:

[ LEADER FAILING ]
2016/02/15 07:16:47 etcd-tester: [round#2 case#0] start injecting failure...
2016/02/15 07:16:47 etcd-tester: stopping 10.240.0.4:2378

...

2016/02/15 07:16:52 [Stress] error: rpc error: code = 4 desc = "context deadline exceeded" (10.240.0.2:2378). Took 5.000291927s
(started at 2016-02-15 07:16:47.354562742 +0000 UTC)
..
[ MANY MORE WITH THE SAME TIMESTAMP ]

You can see that the leader failed at 07:16:47, and that is when this PUT request was triggered. We set the context timeout to 5 seconds, which exactly matches the time it took to get this error. I see the same pattern in all of the failures. Let me know if you need more data. Thanks.
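For illustration, a minimal sketch of how each stress request can be timed against a per-request deadline; the 5-second timeout matches the log above, and put is a hypothetical stand-in for the actual client call:

```go
package stress

import (
	"context"
	"fmt"
	"time"
)

const requestTimeout = 5 * time.Second // per-request timeout used by the stresser

// timedPut issues one stress PUT with a deadline and reports how long it
// took and when it started, so failures can be matched against the
// leader-kill timestamp in the tester log.
func timedPut(put func(ctx context.Context, key, val string) error, key, val string) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()
	if err := put(ctx, key, val); err != nil {
		fmt.Printf("[Stress] error: %v. Took %v (started at %v)\n", err, time.Since(start), start)
	}
}
```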

gyuho added a commit to gyuho/etcd that referenced this issue Feb 17, 2016
gyuho added a commit to gyuho/etcd that referenced this issue Feb 17, 2016
gyuho added a commit to gyuho/etcd that referenced this issue Feb 17, 2016

gyuho commented Feb 17, 2016

From multiple failure cases, we confirmed that requests timed out because of the leader failure:

--------------------------------------------------------------------------------------
[ CASE 01 ]

tester            | 2016-02-17 20:38:12.002139017  === start injecting failure (killing the agent3 leader)
agent3 10.240.0.4 | 2016-02-17 20:38:12            === osutil: received terminated signal, shutting down...

agent1 10.240.0.2 | 2016-02-17 20:38:13            === election started, voting...
agent2 10.240.0.3 | 2016-02-17 20:38:13            === election started, voting...

tester            | 2016-02-17 20:38:16            === PUT request to agent1 timed out...
tester            | 2016-02-17 20:38:16            === PUT request to agent2 timed out...

agent2            | 2016-02-17 20:38:17.169098     === became the leader

tester            | 2016-02-17 20:38:42            === [round#4 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester            | 2016-02-17 20:38:42            === [round#4 case#0] cleaning up...

--------------------------------------------------------------------------------------
[ CASE 02 ]

tester            | 2016-02-17 21:04:02.14378407   === start injecting failure (killing the agent3 leader)
agent3 10.240.0.4 | 2016-02-17 21:04:02.145743     === osutil: received terminated signal, shutting down...

agent1 10.240.0.2 | 2016-02-17 21:04:04            === election started, voting...
agent2 10.240.0.3 | 2016-02-17 21:04:04            === election started, voting...

tester            | 2016/02/17 21:04:07            === PUT requests to agent1 timed out...
tester            | 2016/02/17 21:04:07            === PUT requests to agent2 timed out...

agent2            | 2016-02-17 21:04:07.242309     === became the leader

tester            | 2016-02-17 21:04:32            === [round#38 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester            | 2016-02-17 21:04:32            === [round#38 case#0] cleaning up...

--------------------------------------------------------------------------------------
[ CASE 03 ]

tester            | 2016-02-17 21:10:14.837887717  === start injecting failure (killing the agent1 leader)
agent1 10.240.0.2 | 2016-02-17 21:10:14.840088     === osutil: received terminated signal, shutting down...

agent2 10.240.0.3 | 2016-02-17 21:10:16.268397     === election started, voting...
agent3 10.240.0.4 | 2016-02-17 21:10:16            === election started, voting...

tester            | 2016/02/17 21:10:19            === PUT requests to agent2 timed out...
tester            | 2016/02/17 21:10:19            === PUT requests to agent3 timed out...

agent2            | 2016-02-17 21:10:21.321989     === became the leader

tester            | 2016-02-17 21:10:44            === [round#46 case#0] injection error: cluster too slow: only commit 65 requests in 30s
tester            | 2016-02-17 21:10:44            === [round#46 case#0] cleaning up...

--------------------------------------------------------------------------------------
[ CASE 04 ]

tester            | 2016-02-17 21:21:50.825446897  === start injecting failure (killing the agent3 leader)
agent3 10.240.0.4 | 2016-02-17 21:21:50.829599     === osutil: received terminated signal, shutting down...

tester            | 2016/02/17 21:21:55            === PUT requests to agent1 timed out...
tester            | 2016/02/17 21:21:55            === PUT requests to agent2 timed out...

agent1 10.240.0.2 | 2016-02-17 21:21:56.058468     === election started, voting...
agent2 10.240.0.3 | 2016-02-17 21:21:56            === election started, voting...

agent1            | 2016-02-17 21:21:56.064566     === became the leader

tester            | 2016-02-17 21:22:20            === [round#61 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester            | 2016-02-17 21:22:20            === [round#61 case#0] cleaning up...

--------------------------------------------------------------------------------------
[ CASE 05 ]

tester            | 2016-02-17 21:35:15.42699146   === start injecting failure (killing the agent1 leader)
agent1 10.240.0.2 | 2016-02-17 21:35:15.431652     === osutil: received terminated signal, shutting down...

agent2 10.240.0.3 | 2016-02-17 21:35:19.396989     === election started, voting...
agent3 10.240.0.4 | 2016-02-17 21:35:19            === election started, voting...

tester            | 2016/02/17 21:35:20            === PUT requests to agent2 timed out...
tester            | 2016/02/17 21:35:20            === PUT requests to agent3 timed out...

agent2            | 2016-02-17 21:35:21.704066     === became the leader

tester            | 2016-02-17 21:35:45            === [round#78 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester            | 2016-02-17 21:35:45            === [round#78 case#0] cleaning up...

--------------------------------------------------------------------------------------

[ CASE 06 ]

tester            | 2016-02-17 21:42:59.181099145   === start injecting failure (killing the agent3 leader)
agent3 10.240.0.4 | 2016-02-17 21:42:59.183210      === osutil: received terminated signal, shutting down...

agent1 10.240.0.2 | 2016-02-17 21:43:00.359816      === election started, voting...
agent2 10.240.0.3 | 2016-02-17 21:43:00             === election started, voting...

tester            | 2016/02/17 21:43:04             === PUT requests to agent1 timed out...
tester            | 2016/02/17 21:43:04             === PUT requests to agent2 timed out...

agent2            | 2016-02-17 21:43:05.372366      === became the leader

tester            | 2016-02-17 21:43:29             === [round#88 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester            | 2016-02-17 21:43:29             === [round#88 case#0] cleaning up...

--------------------------------------------------------------------------------------

[ CASE 07 ]

tester | 2016-02-17 21:53:05.719050313   === start injecting failure (killing the agent2 leader)

agent1 | 2016-02-17 21:53:06.855987      === election started, voting...
agent3 | 2016-02-17 21:53:06             === election started, voting...

tester | 2016/02/17 21:53:10             === PUT requests to agent1 timed out...
tester | 2016/02/17 21:53:10             === PUT requests to agent2 timed out...

agent1 | 2016-02-17 21:53:13.790345      === became the leader

tester | 2016-02-17 21:53:35             === [round#101 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester | 2016-02-17 21:53:35             === [round#101 case#0] cleaning up...

--------------------------------------------------------------------------------------

[ CASE 08 ]

tester | 2016-02-17 21:55:33.399030693   === start injecting failure (killing the agent3 leader)

agent1 | 2016-02-17 21:55:34.847045      === election started, voting...
agent2 | 2016-02-17 21:55:34             === election started, voting...

tester | 2016/02/17 21:55:38             === PUT requests to agent1 timed out...
tester | 2016/02/17 21:55:38             === PUT requests to agent2 timed out...

agent2 | 2016-02-17 21:55:38.458385      === became the leader

tester | 2016-02-17 21:56:03             === [round#104 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester | 2016-02-17 21:56:03             === [round#104 case#0] cleaning up...

--------------------------------------------------------------------------------------

[ CASE 09 ]

tester | 2016-02-17 21:59:27.292407883   === start injecting failure (killing the agent2 leader)

agent1 | 2016-02-17 21:59:28.553853      === election started, voting...
agent3 | 2016-02-17 21:59:28             === election started, voting...

tester | 2016/02/17 21:59:32             === PUT requests to agent1 timed out...
tester | 2016/02/17 21:59:32             === PUT requests to agent2 timed out...

agent3 | 2016-02-17 21:59:35.061743      === became the leader

tester | 2016-02-17 21:59:57             === [round#109 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester | 2016-02-17 21:59:57             === [round#109 case#0] cleaning up...

--------------------------------------------------------------------------------------

[ CASE 10 ]

tester | 2016-02-17 22:23:47.19856007   === start injecting failure (killing the agent1 leader)

agent2 | 2016-02-17 22:23:48.431637     === election started, voting...
agent3 | 2016-02-17 22:23:48            === election started, voting...

tester | 2016/02/17 21:23:52            === PUT requests to agent1 timed out...
tester | 2016/02/17 21:23:52            === PUT requests to agent2 timed out...

agent2 | 2016-02-17 22:23:52.441024     === became the leader

tester | 2016/02/17 22:24:17            === [round#141 case#0] injection error: cluster too slow: only commit 0 requests in 30s
tester | 2016/02/17 22:24:17            === [round#141 case#0] cleaning up...

--------------------------------------------------------------------------------------

Will redeploy the testing cluster with #4547

gyuho added a commit to gyuho/etcd that referenced this issue Feb 17, 2016
gyuho added a commit to gyuho/etcd that referenced this issue Feb 17, 2016

gyuho commented Feb 18, 2016

Got one more:

2016/02/18 03:09:04 etcd-tester: [round#292 case#0] injection error: cluster too slow: only commit 200 requests in 30s
2016/02/18 03:09:04 etcd-tester: [round#292 case#0] cleaning up...

but much less frequently. Will investigate more.


gyuho commented Feb 19, 2016

Resolved via #4569. There was no issue in etcd itself; we were stopping the stressers too early:

We confirmed that PUT requests were timing out and returning when triggered at the same moment as a leader failure. The fix saves those requests by continuing in the for loop.
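For reference, a minimal sketch of the retry behavior described above, assuming a hypothetical put helper: on a request error (for example, a timeout during leader election) the loop continues instead of returning, and only cancellation of the stresser's context stops it:

```go
package stress

import (
	"context"
	"fmt"
	"time"
)

// run keeps issuing stress PUTs until the stresser is canceled. A request
// that times out (for example, one issued right as the leader is killed)
// is retried by continuing the loop rather than exiting the goroutine,
// which is what previously stopped the whole stress.
func run(ctx context.Context, put func(ctx context.Context, key, val string) error) {
	for i := 0; ; i++ {
		select {
		case <-ctx.Done(): // Cancel was called on the stresser
			return
		default:
		}
		reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		err := put(reqCtx, fmt.Sprintf("foo%d", i), "bar")
		cancel()
		if err != nil {
			continue // e.g. context deadline exceeded during leader election
		}
	}
}
```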
