
functional-test: tester client error #4517

Closed
gyuho opened this issue Feb 13, 2016 · 27 comments


gyuho commented Feb 13, 2016

There was a bad connection on the client side. This is not an etcd error. It might be worth adding some retry logic.

2016/02/13 01:12:51 etcd-tester: [round#83 case#2] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/13 01:12:51 etcd-tester: [round#83 case#2] cleaning up...
2016/02/13 01:12:53 etcd-tester: [round#83 case#3] start failure kill leader member
2016/02/13 01:12:53 etcd-tester: [round#83 case#3] start injecting failure...
2016/02/13 01:12:53 etcd-tester: [round#83 case#3] injected failure
2016/02/13 01:12:53 etcd-tester: [round#83 case#3] start recovering failure...
2016/02/13 01:12:56 etcd-tester: [round#83 case#3] recovered failure
gyuho changed the title from "functional-test: kill one random member client error" to "functional-test: tester client error" on Feb 13, 2016

gyuho commented Feb 13, 2016

[screenshot from 2016-02-13 15:08:30]


xiang90 commented Feb 14, 2016

@gyuho Do we know why there is a deadline issue?


gyuho commented Feb 14, 2016

@xiang90 We make the gRPC client call with a timeout: https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-tester/cluster.go#L290. When I look at the logs around the context deadline exceeded error, other than the terminated agent I don't see any other error messages, so I think this is just a client-side network issue.

tester:

2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"

In agents (agent1 got killed):

[agent log screenshots 1 and 2]


xiang90 commented Feb 14, 2016

so I think this is just client side network issue.

I do not believe there should be a network issue between a client and a server except when we inject failures. Can you please check whether the deadline issues are caused by the failure we injected? And can you also verify that the injected failure caused the server to be unresponsive for longer than the client-side timeout?


gyuho commented Feb 14, 2016

Agree. I will investigate more. Thanks.


gyuho commented Feb 14, 2016

@xiang90

These are the kind of logs you see when the restart goes through successfully on all agents via port :9027.

[agent log screenshot]

check if the deadline issues are caused by the failure we injected?

2016/02/14 18:32:38 etcd-tester: [round#772 case#4] start failure kill one member for long time and expect it to recover from incoming snapshot
2016/02/14 18:32:38 etcd-tester: [round#772 case#4] start injecting failure...
2016/02/14 18:32:40 etcd-tester: [round#772 case#4] injected failure
2016/02/14 18:32:40 etcd-tester: [round#772 case#4] start recovering failure...
2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 18:34:44 etcd-tester: [round#772 case#4] cleaning up...
  1. we kill one member agent2 (failure injection)
  2. then recover the member agent2 that we killed
  3. timeout when connecting to port :9027 of agent2

But on the agent side, the request went through:

[agent2 log screenshot]

The client-side request (to recover the agent) went through. And if you look at the other logs on the agent2 machine around 2016/02/14 18:32:40:

[agent2 log screenshot]

you can tell the etcd and etcd-agent server sides were working fine.

So the restart request was successful, which means the rpc context deadline exceeded error is from the WaitHealth function:

func (f *failureKillOneForLongTime) Recover(c *cluster, round int) error {
    i := round % c.Size
    if _, err := c.Agents[i].Restart(); err != nil {
        return err
    }
    return c.WaitHealth()
}

We can resolve this issue with retry logic in the gRPC calls (WaitHealth, etc.); a rough sketch follows.
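For example, a hypothetical version of the Recover function above that retries WaitHealth on timeouts (a sketch only, not the actual tester code; the attempt count and delay are arbitrary):

func (f *failureKillOneForLongTime) Recover(c *cluster, round int) error {
    i := round % c.Size
    if _, err := c.Agents[i].Restart(); err != nil {
        return err
    }
    var err error
    for attempt := 0; attempt < 5; attempt++ {
        if err = c.WaitHealth(); err == nil {
            return nil
        }
        // only retry timeouts; propagate any other error immediately
        if grpc.ErrorDesc(err) != context.DeadlineExceeded.Error() {
            return err
        }
        time.Sleep(time.Second)
    }
    return err
}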


And this is not case #4 specific:

  1. 2016/02/13 23:05:29 etcd-tester: [round#21 case#0] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
  2. 2016/02/13 23:07:30 etcd-tester: [round#21 case#1] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
  3. 2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"


xiang90 commented Feb 14, 2016

@gyuho I want to know exactly why there is that deadline exceeded error. I understand it only happens for the client. However, can we find out what the root cause of that error is? I do not think extending the timeout without finding the root cause is the right thing to do.


gyuho commented Feb 14, 2016

Got it. I will try to find that. Thanks.


xiang90 commented Feb 14, 2016

To be more specific, you observed

So the restart request was successful, which means the rpc context deadline exceeded error is from the WaitHealth function:

We want to find out why there is this deadline exceeded error.


gyuho commented Feb 14, 2016

So we have 3 cases with timeout errors:

  1. 2016/02/13 23:05:29 etcd-tester: [round#21 case#0] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
  2. 2016/02/13 23:07:30 etcd-tester: [round#21 case#1] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
  3. 2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"

And all those are from this function taking longer than 1 second to return the response:

// setHealthKey sets health key on all given urls.
func setHealthKey(us []string) error {
    for _, u := range us {
        conn, err := grpc.Dial(u, grpc.WithInsecure(), grpc.WithTimeout(5*time.Second))
        if err != nil {
            return fmt.Errorf("%v (%s)", err, u)
        }
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        kvc := pb.NewKVClient(conn)
        _, err = kvc.Put(ctx, &pb.PutRequest{Key: []byte("health"), Value: []byte("good")})
        cancel()
        if err != nil {
            return err
        }
    }
    return nil
}

First case: Grafana queried `metrics` and used up all the file descriptors, blocking the `WaitHealth` call?
2016/02/13 23:03:32 http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 5ms
...
2016/02/13 23:05:29 http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1
2016/02/13 23:05:29 etcd-tester: [round#21 case#0] recovery error: rpc error: code = 4 desc = "context deadline exceeded"

Second case: we killed the majority and tried to recover, but there were port conflicts on `:2380`, so the `WaitHealth` request to the node with the conflicting peer port timed out (we need `dropPort` for the `stop` method as well to cover this case).

[agent log screenshots]


Third case: we kill one member for a long time, but `WaitHealth` is not aware of which member got killed. When `WaitHealth` sends a request to the member that was killed by the failure injection, it gets `context deadline exceeded`.
2016/02/14 18:32:40 etcd-tester: [round#772 case#4] start recovering failure... tester/tester.log
2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"


gyuho commented Feb 14, 2016

@xiang90 Third case is interesting. Here's the error:

2016-02-14 18:32:42.349362 I | etcdserver: raft applied incoming snapshot at index 314332
unexpected fault address 0x7f502600900a
fatal error: fault
[signal 0x7 code=0x2 addr=0x7f502600900a pc=0x86b037]

goroutine 57 [running]:
runtime.throw(0xe8ba28, 0x5)
    /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc824c9d2e0 sp=0xc824c9d2c8
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:21 +0x10c fp=0xc824c9d330 sp=0xc824c9d2e0
github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt.(*freelist).read(0xc82504a090, 0x7f5026009000)
    /home/gyuho/go/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/freelist.go:162 +0x27 fp=0xc824c9d3d0 sp=0xc824c9d330
github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt.Open(0xc821be71e0, 0x19, 0x180, 0x1368500, 0xc821be7220, 0x0, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/db.go:216 +0x5eb fp=0xc824c9d518 sp=0xc824c9d3d0
github.com/coreos/etcd/storage/backend.newBackend(0xc821be71e0, 0x19, 0x5f5e100, 0x2710, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/storage/backend/backend.go:84 +0x73 fp=0xc824c9d5e8 sp=0xc824c9d518
github.com/coreos/etcd/storage/backend.NewDefaultBackend(0xc821be71e0, 0x19, 0x0, 0x0)
    /home/gyuho/go/src/github.com/coreos/etcd/storage/backend/backend.go:80 +0x53 fp=0xc824c9d620 sp=0xc824c9d5e8
github.com/coreos/etcd/etcdserver.(*EtcdServer).applySnapshot(0xc82007a000, 0xc82038aa80, 0xc8200a6e70)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:590 +0x597 fp=0xc824c9d8c8 sp=0xc824c9d620
github.com/coreos/etcd/etcdserver.(*EtcdServer).applyAll(0xc82007a000, 0xc82038aa80, 0xc8200a6e70)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:551 +0x4d fp=0xc824c9de40 sp=0xc824c9d8c8
github.com/coreos/etcd/etcdserver.(*EtcdServer).run.func2(0x7f5536a2d5e0, 0xc82038aa40)
    /home/gyuho/go/src/github.com/coreos/etcd/etcdserver/server.go:532 +0x32 fp=0xc824c9de60 sp=0xc824c9de40
github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc820228ba0)
    /home/gyuho/go/src/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0x322 fp=0xc824c9df98 sp=0xc824c9de60
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1 fp=0xc824c9dfa0 sp=0xc824c9df98
created by github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
    /home/gyuho/go/src/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x275

goroutine 1 [chan receive]:
github.com/coreos/etcd/etcdmain.Main()
    /home/gyuho/go/src/github.com/coreos/etcd/etcdmain/etcd.go:192 +0x195b
main.main()
    /home/gyuho/go/src/github.com/coreos/etcd/main.go:29 +0x14

Full error: etcd.txt (https://storage.googleapis.com/failure-archive/agent2/2016-02-14T18%3A34%3A44Z/etcd.log)

gyuho added a commit to gyuho/etcd that referenced this issue Feb 15, 2016
gyuho added a commit to gyuho/etcd that referenced this issue Feb 15, 2016

gyuho commented Feb 15, 2016

@xiang90 I looked at all the errors since 02/13/2016 15:05:29 PST, when I restarted the testing cluster. There are 66 errors in total:

  1. 11 client errors, as discussed in this issue (functional-test: tester client error #4517).
  2. 55 injection errors: cluster too slow. If you take one of those errors, you can tell the stresser returned earlier than it should have.
2016/02/13 23:05:29 etcd-tester: [round#21 case#0] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/13 23:07:30 etcd-tester: [round#21 case#1] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/13 23:10:55 etcd-tester: [round#23 case#5] injection error: cluster too slow: only commit 75 requests in 30s
2016/02/13 23:13:36 etcd-tester: [round#25 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/13 23:15:06 etcd-tester: [round#26 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 00:04:51 etcd-tester: [round#58 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 00:26:00 etcd-tester: [round#71 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 00:28:34 etcd-tester: [round#73 case#4] injection error: cluster too slow: only commit 6918 requests in 30s
2016/02/14 00:38:20 etcd-tester: [round#79 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 00:49:07 etcd-tester: [round#86 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 00:53:22 etcd-tester: [round#89 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 01:06:15 etcd-tester: [round#97 case#5] injection error: cluster too slow: only commit 9668 requests in 30s
2016/02/14 03:36:18 etcd-tester: [round#193 case#5] injection error: cluster too slow: only commit 178 requests in 30s
2016/02/14 03:50:29 etcd-tester: [round#202 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 04:10:47 etcd-tester: [round#215 case#5] injection error: cluster too slow: only commit 59 requests in 30s
2016/02/14 04:13:35 etcd-tester: [round#217 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 04:20:38 etcd-tester: [round#222 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 06:00:05 etcd-tester: [round#286 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 06:19:32 etcd-tester: [round#299 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 07:05:55 etcd-tester: [round#329 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 07:08:35 etcd-tester: [round#331 case#4] injection error: cluster too slow: only commit 67 requests in 30s
2016/02/14 07:22:33 etcd-tester: [round#340 case#5] injection error: cluster too slow: only commit 98 requests in 30s
2016/02/14 08:06:52 etcd-tester: [round#369 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 08:28:05 etcd-tester: [round#383 case#5] injection error: cluster too slow: only commit 8414 requests in 30s
2016/02/14 13:13:38 etcd-tester: [round#564 case#4] injection error: cluster too slow: only commit 5 requests in 30s
2016/02/14 13:44:02 etcd-tester: [round#584 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 13:45:38 etcd-tester: [round#585 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 14:05:21 etcd-tester: [round#598 case#5] injection error: cluster too slow: only commit 71 requests in 30s
2016/02/14 14:43:38 etcd-tester: [round#623 case#4] injection error: cluster too slow: only commit 27 requests in 30s
2016/02/14 15:08:43 etcd-tester: [round#639 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 15:17:35 etcd-tester: [round#645 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 16:00:26 etcd-tester: [round#673 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 16:13:48 etcd-tester: [round#682 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 16:16:40 etcd-tester: [round#684 case#5] injection error: cluster too slow: only commit 247 requests in 30s
2016/02/14 16:24:06 etcd-tester: [round#689 case#5] injection error: cluster too slow: only commit 135 requests in 30s
2016/02/14 16:28:36 etcd-tester: [round#692 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 17:15:13 etcd-tester: [round#722 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 18:31:48 etcd-tester: [round#771 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 18:34:44 etcd-tester: [round#772 case#4] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 18:41:05 etcd-tester: [round#776 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 18:54:11 etcd-tester: [round#785 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 19:09:46 etcd-tester: [round#795 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 19:11:16 etcd-tester: [round#796 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 19:28:01 etcd-tester: [round#807 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 19:32:32 etcd-tester: [round#810 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 19:35:19 etcd-tester: [round#812 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 21:47:45 etcd-tester: [round#895 case#5] injection error: cluster too slow: only commit 160 requests in 30s
2016/02/14 21:58:32 etcd-tester: [round#902 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/14 23:34:41 etcd-tester: [round#963 case#1] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:36:48 etcd-tester: [round#963 case#2] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:39:54 etcd-tester: [round#964 case#0] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:42:03 etcd-tester: [round#964 case#1] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:44:24 etcd-tester: [round#964 case#2] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:46:44 etcd-tester: [round#964 case#3] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:48:57 etcd-tester: [round#964 case#4] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/14 23:58:35 etcd-tester: [round#969 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 00:21:09 etcd-tester: [round#984 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 00:25:35 etcd-tester: [round#987 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 00:29:36 etcd-tester: [round#989 case#2] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016/02/15 00:30:14 etcd-tester: [round#989 case#5] injection error: cluster too slow: only commit 72 requests in 30s
2016/02/15 00:37:41 etcd-tester: [round#994 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 00:46:34 etcd-tester: [round#1000 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 01:18:13 etcd-tester: [round#1020 case#4] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 01:25:55 etcd-tester: [round#1025 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 01:47:30 etcd-tester: [round#1039 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 02:12:13 etcd-tester: [round#1055 case#5] injection error: cluster too slow: only commit 0 requests in 30s
2016/02/15 02:11:43 etcd-tester: [round#1055 case#5] start failure kill the leader for long time and expect it to recover from incoming snapshot
2016/02/15 02:11:43 etcd-tester: [round#1055 case#5] start injecting failure...
2016/02/15 02:12:13 etcd-tester: [round#1055 case#5] injection error: cluster too slow: only commit 0 requests in 30s

These errors occur before we cancel the stressers, and I believe they come from bad network connections.
To confirm that the stressers return even without being canceled, I set up the cluster with verbose output. Below is such a case:

2016/02/15 05:48:23 etcd-tester: [round#208 case#0] start failure kill the leader for long time and expect it to recover from incoming snapshot
2016/02/15 05:48:23 etcd-tester: [round#208 case#0] start injecting failure...
2016/02/15 05:48:23 etcd-tester: stopping 10.240.0.3:2378


2016/02/15 05:48:28 Stress return error: rpc error: code = 13 desc = "transport is closing" (10.240.0.3:2378)
... [as expected...]

[ but these errors are not expected unless we cancel them]
2016/02/15 05:48:28 Stress return error: rpc error: code = 4 desc = "context deadline exceeded" (10.240.0.2:2378)
...
2016/02/15 05:48:28 Stress return error: rpc error: code = 4 desc = "context deadline exceeded" (10.240.0.4:2378)
...

2016/02/15 05:48:53 etcd-tester: [round#208 case#0] injection error: cluster too slow: only commit 4797 requests in 30s
2016/02/15 05:48:53 etcd-tester: [round#208 case#0] cleaning up...

So this shows that, in the current implementation, a bad network connection makes the stressers return early even without being canceled, and this stops commits entirely. We should only return when there is an explicit cancel on the stresser, as in #4504; a rough sketch of that behavior follows.
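A minimal sketch of that idea (not the actual stresser code; it assumes a parent context ctx owned by the tester, a pb.KVClient kvc, and key/value bytes):

for {
    putctx, putcancel := context.WithTimeout(ctx, 5*time.Second)
    _, err := kvc.Put(putctx, &pb.PutRequest{Key: key, Value: value})
    putcancel()
    if err == nil {
        continue
    }
    if ctx.Err() == context.Canceled {
        return nil // explicit cancel from the tester: stop stressing
    }
    if grpc.ErrorDesc(err) == context.DeadlineExceeded.Error() {
        continue // slow cluster (e.g. during leader election): keep trying
    }
    return err // anything else is a real failure
}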


xiang90 commented Feb 15, 2016

bad network connection

How can you tell that the deadline exceeded error is caused by a bad network connection? Maybe it is because the etcd server was slow, which we should fix.


xiang90 commented Feb 15, 2016

@gyuho A network error should happen really, really rarely on GCE (probably below 0.01%). It should be much more infrequent than what we have observed, which is around 1% of all cases.


gyuho commented Feb 15, 2016

@xiang90 That's a good point. The current tester code makes it harder to debug that part, since we don't log or differentiate between the two errors: one from a cancel, the other from a bad network. I think we should keep this issue open and keep improving the tester code.


xiang90 commented Feb 15, 2016

@gyuho We can tell from the errors. I think you have missed a point I mentioned above: the deadline exceeded error has a root cause. My personal guess is that the server side was experiencing some kind of bad situation and could not reply in time, which caused the client to time out.

I do not think masking the error on the tester side is the right thing to do unless we are sure why we need to mask that error.


gyuho commented Feb 15, 2016

@xiang90 Thanks for the clarification. Here are some etcd server-side events that might have slowed down the processing of PUT requests:

2016/02/15 07:15:31 etcd-tester: [round#0 case#0] start injecting failure..
2016/02/15 07:15:31 etcd-tester: stopping 10.240.0.2:2378   =========== agent1

[ a bunch of 'context deadline exceeded' errors to 10.240.0.3:2378, 10.240.0.4:2378 ]
2016/02/15 07:15:36 [Stress] error: rpc error: code = 4 desc = "context deadline exceeded" (10.240.0.4:2378). Took 5.000187192s (started at 2016-02-15 07:15:31.157153848 +0000 UTC)
2016/02/15 07:15:36 [Stress] error: rpc error: code = 13 desc = "context deadline exceeded" (10.240.0.3:2378). Took 5.000669925s (started at 2016-02-15 07:15:31.160518597 +0000 UTC)
...

I logged the time each PUT request started, and you can see that requests triggered at the time of the leader failure got context deadline exceeded errors. The failure injection kills the leader at 2016/02/15 07:15:31, and the cluster then goes through an election and cannot process client requests in time. So, to answer your questions:

My personal guess is that the server side was experiencing some kind of bad situation and could not reply in time

Leader failure happens.

The network error should happen really really rarely

Sorry, I think I misused the term bad network. A timeout is not necessarily a bad network. In this case, requests time out because of the leader failures.


Here are more examples:
2016/02/15 07:16:47 etcd-tester: [round#2 case#0] start injecting failure...
2016/02/15 07:16:47 etcd-tester: stopping 10.240.0.4:2378

...

2016/02/15 07:16:52 [Stress] error: rpc error: code = 4 desc = "context deadline exceeded" (10.240.0.2:2378). Took 5.000291927s
(started at 2016-02-15 07:16:47.354562742 +0000 UTC)


gyuho commented Feb 16, 2016

My investigation got messy. To summarize, the client error happens when:

  1. WaitHealth is not aware of which node was terminated and waits on that node
    • we can apply retry logic
  2. Port conflict: sometimes the port bind fails with "address already in use" when restarting
    • we can drop the port when we terminate the node (a sketch follows this list)
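For the second item, a hypothetical sketch of what dropping and restoring the peer port could look like, using iptables via os/exec (the agent's actual implementation may differ; the helper names here are made up):

// dropPort adds an iptables rule that drops outgoing TCP traffic to the given
// destination port, so connections toward a stopped member cannot linger.
func dropPort(port int) error {
    return exec.Command("iptables", "-A", "OUTPUT", "-p", "tcp",
        "--destination-port", strconv.Itoa(port), "-j", "DROP").Run()
}

// recoverPort removes the DROP rule added by dropPort before the member restarts.
func recoverPort(port int) error {
    return exec.Command("iptables", "-D", "OUTPUT", "-p", "tcp",
        "--destination-port", strconv.Itoa(port), "-j", "DROP").Run()
}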


xiang90 commented Feb 16, 2016

Port conflict: sometimes port is bind already in use when restarting

Let's fix this first.

gyuho added a commit to gyuho/etcd that referenced this issue Feb 16, 2016
https://github.com/golang/go/blob/master/src/os/exec_posix.go#L18 shows that
cmd.Process.Kill calls syscall.SIGKILL to the command. But
http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_01.html explains
'If you send a SIGKILL to a process, you remove any chance for the process to
do a tidy cleanup and shutdown, which might have unfortunate consequences.'
This sends SIGTERM, SIGINT, SIGHUP syscalls in order to the PID so that
the process could have more time to clean up the resources.

Related to etcd-io#4517.
gyuho added a commit to gyuho/etcd that referenced this issue Feb 16, 2016
https://github.com/golang/go/blob/master/src/os/exec_posix.go#L18 shows that
cmd.Process.Kill calls syscall.SIGKILL to the command. But
http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_01.html explains
'If you send a SIGKILL to a process, you remove any chance for the process to
do a tidy cleanup and shutdown, which might have unfortunate consequences.'
This sends SIGTERM, SIGINT syscalls to the PID so that the process could
have more time to clean up the resources.

Related to etcd-io#4517.
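A rough sketch of the signal escalation described in the commit message above (a hypothetical helper, not the agent's actual code; it assumes the os, os/exec, syscall, and time packages):

// gracefulStop asks the child process to exit via SIGTERM, then SIGINT,
// waiting up to timeout after each signal, and falls back to SIGKILL.
func gracefulStop(cmd *exec.Cmd, timeout time.Duration) error {
    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()

    for _, sig := range []os.Signal{syscall.SIGTERM, syscall.SIGINT} {
        if err := cmd.Process.Signal(sig); err != nil {
            return err
        }
        select {
        case err := <-done:
            return err // the process exited after the signal
        case <-time.After(timeout):
            // still running; escalate to the next signal
        }
    }
    return cmd.Process.Kill() // last resort: SIGKILL
}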

gyuho commented Feb 18, 2016

2016-02-18 01:32:16 etcd-tester: [round#168 case#0] injected failure
2016-02-18 01:32:16 etcd-tester: [round#168 case#0] start recovering failure...

2016-02-18 01:32:17 http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 5ms
2016-02-18 01:32:20 http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s
[ REPEATING ]
2016-02-18 01:34:15 http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s

2016-02-18 01:34:16 etcd-tester: [round#168 case#0] recovery error: rpc error: code = 4 desc = "context deadline exceeded"
2016-02-18 01:34:16 etcd-tester: [round#168 case#0] cleaning up...

2016-02-18 01:34:32 etcd-tester: [round#168] compactKV error (rpc error: code = 13 desc = "transport is closing")
2016-02-18 01:34:32 etcd-tester: [round#168 case#0] cleaning up...

One cause of this kind of client error is depletion of Unix file descriptors.
There was nothing bad happening on the etcd side, and port 9028 is only used by Prometheus.
I increased the limit to 120000 and will redeploy the testing cluster to see if this still happens.
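For reference, a minimal sketch of checking and raising the limit from inside the process (Linux, using the syscall and log packages); raising the limit only papers over the problem while the root cause is found:

// raiseFDLimit reports the current RLIMIT_NOFILE and raises the soft limit.
func raiseFDLimit(n uint64) error {
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        return err
    }
    log.Printf("RLIMIT_NOFILE: cur=%d max=%d", rl.Cur, rl.Max)
    rl.Cur = n // the soft limit cannot exceed rl.Max without privileges
    return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
}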


xiang90 commented Feb 18, 2016

@gyuho It is strange that etcd runs out of fds... We probably should not simply increase the fd limit; we need to figure out why it runs out of fds in the first place.


gyuho commented Feb 18, 2016

@xiang90 Agreed. The log doesn't tell which process is using up all the ports. I will put the limit back to 1024 and try to find the root cause.

gyuho self-assigned this Feb 18, 2016

gyuho commented Feb 20, 2016

@xiang90 We had one more client error as below:

2016-02-19 23:39:47.762438 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 5ms
2016-02-19 23:39:47.767648 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 10ms
2016-02-19 23:39:47.777790 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 20ms
2016-02-19 23:39:47.797888 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 40ms
2016-02-19 23:39:47.838125 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 80ms
2016-02-19 23:39:47.918340 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 160ms
2016-02-19 23:39:48.078581 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 320ms
2016-02-19 23:39:48.398907 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 640ms
2016-02-19 23:39:49.039093 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s
2016-02-19 23:39:50.039307 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s
2016-02-19 23:39:51.039529 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s
...
2016-02-19 23:42:05.212652 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s
2016-02-19 23:42:06.213028 I | http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 1s
2016-02-19 23:42:06.363379 I | etcd-tester: [round#185 case#0] wait full health error: rpc error: code = 4 desc = "context deadline exceeded"
2016-02-19 23:42:06.363473 I | etcd-tester: [round#185 case#0] cleaning up...
2016-02-19 23:42:07.315023 I | etcd-tester: [round#185] compacting storage at 0 (current revision 0)
2016-02-19 23:42:17.351144 I | etcd-tester: [round#185] compactKV error (rpc error: code = 2 desc = "grpc: the client connection is closing")
2016-02-19 23:42:17.351186 I | etcd-tester: [round#185 case#0] cleaning up...

This indicates that the tester machine that launches the etcd clients ran out of file descriptors.
ulimit -n returns 1024 on that machine. I was scanning TCP port usage periodically and found
that we were using more TCP connections than we should have been.

  1. 2016-02-19 23:41:59 tcp port scan: /home/gyuho/go/bin/etcd-tester was using a total of 1018 TCP connections
  2. 2016-02-19 23:42:06 ran out of TCP ports -> wait full health (set key-value client call) failed
  3. 2016-02-19 23:42:17 ran out of TCP ports -> compact kv client call failed

This suggests we might be leaking client connections. Here is, roughly, how we spawn the stressers:

    // simplified sketch of the stresser: one gRPC connection per member
    // endpoint, with 200 goroutines issuing Put requests in a loop
    conn, err := grpc.Dial(endpoint, grpc.WithInsecure())
    if err != nil {
        return err
    }
    defer conn.Close()
    kvc := pb.NewKVClient(conn)

    wg.Add(200)
    for i := 0; i < 200; i++ {
        go func(i int) {
            defer wg.Done()
            for {
                ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
                _, err := kvc.Put(ctx, &pb.PutRequest{Key: key, Value: value})
                cancel()
                if err != nil {
                    if grpc.ErrorDesc(err) == context.DeadlineExceeded.Error() {
                        // timed-out Put: keep stressing
                        continue
                    }
                    return
                }
            }
        }(i)
    }

We have one gRPC TCP connection per member, so we should expect to see only a handful
of connections bound to our gRPC port 2378. But if you look at my tcp port scan logs, the
number of TCP ports in use has been growing. I will redeploy the testing cluster with
more detailed port counting and stresser counting (a sketch of such a check is after the links below).

  1. tcp port scanning log: https://storage.googleapis.com/failure-archive/tester_logs/process.log
  2. tester log: https://storage.googleapis.com/failure-archive/tester_logs/tester.log
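A rough sketch of the kind of check behind those numbers (hypothetical, Linux-only; not the exact script I am running): count how many of the process's open file descriptors are sockets by reading /proc/<pid>/fd.

// openSocketCount returns how many of this process's file descriptors are sockets.
func openSocketCount() (int, error) {
    dir := fmt.Sprintf("/proc/%d/fd", os.Getpid())
    d, err := os.Open(dir)
    if err != nil {
        return 0, err
    }
    defer d.Close()
    names, err := d.Readdirnames(-1)
    if err != nil {
        return 0, err
    }
    n := 0
    for _, name := range names {
        target, err := os.Readlink(filepath.Join(dir, name))
        if err != nil {
            continue // the fd may have been closed since the listing
        }
        if strings.HasPrefix(target, "socket:") {
            n++
        }
    }
    return n, nil
}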


gyuho commented Feb 20, 2016

So it starts out normal; in round 2:

stresser count: map[10.240.0.4:2378:200 10.240.0.2:2378:200 10.240.0.3:2378:200] | gRPC port count: map[10.240.0.3:2378:11 10.240.0.2:2378:13 10.240.0.4:2378:11]

And it slowly increases until it crashes. Around round 165, I got:

stresser count: map[10.240.0.3:2378:200 10.240.0.4:2378:200 10.240.0.2:2378:200] | gRPC port count: map[10.240.0.3:2378:335 10.240.0.4:2378:334 10.240.0.2:2378:344]

And immediately after that I start getting http: Accept error: accept tcp [::]:9028: accept4: too many open files; retrying in 5ms. The stresser count stays the same, 200; only TCP ports are being used up. I will look into this more on the gRPC side.


gyuho commented Feb 20, 2016

@xiang90 I think #4572 fixed this issue. I redeployed with your patch and port usage stays stable now.

For round 1, we start with about 5 ~ 6 gRPC connections.

stresser count: map[10.240.0.2:2378:200 10.240.0.4:2378:200 10.240.0.3:2378:200]
gRPC port count: map[10.240.0.3:2378:5 10.240.0.4:2378:5 10.240.0.2:2378:6]

And after 90 rounds, the gRPC connection usage stays in the same range, which is the expected behavior:

stresser count: map[10.240.0.2:2378:200 10.240.0.4:2378:200 10.240.0.3:2378:200]
gRPC port count: map[10.240.0.2:2378:20 10.240.0.4:2378:3 10.240.0.3:2378:3]


xiang90 commented Feb 20, 2016

Thanks. Let's close this issue then.

xiang90 closed this as completed Feb 20, 2016
gyuho reopened this Feb 21, 2016
gyuho added a commit to gyuho/etcd that referenced this issue Feb 24, 2016
WaitHealth fails (very rarely) with slow leader election, when it takes more
than one minute, like the one described in
etcd-io#4517.

And adds logs for client errors.

gyuho commented Feb 27, 2016

I observed the same error again, but it is not from the goroutine leak. I restarted the testing cluster.
The logs do not indicate any etcd errors, only etcd server-to-client connection timeouts.
I added lines for more verbose logging, and if it reproduces that error, I will reopen that issue,
because it is different from the problem we had in this issue.
