functional-test: tester client error #4517
@gyuho Do we know why there is a deadline issue?
@xiang90 We make the gRPC client call with a timeout: https://github.com/coreos/etcd/blob/master/tools/functional-tester/etcd-tester/cluster.go#L290. And when we look at the logs around the tester:
In the agents' logs (agent1 got killed):
I do not believe there should be a network issue between a client and a server except when we injected the failures. Can you please check whether the deadline issues are caused by the failures we injected? And can you also verify that the injected failure caused the server to be unresponsive for longer than the client-side timeout?
Agree. I will investigate more. Thanks.
These are the kind of logs we see when
But on the agent side, the request went through: the client-side request (to recover the agent) succeeded. And if you look at the other logs, you can tell the restart request was successful, which means this code ran:

```go
func (f *failureKillOneForLongTime) Recover(c *cluster, round int) error {
	i := round % c.Size
	if _, err := c.Agents[i].Restart(); err != nil {
		return err
	}
	return c.WaitHealth()
}
```

We could resolve this issue with retry logic in the gRPC calls (WaitHealth, etc.; sketched below). And this is not
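A minimal sketch of that retry idea, assuming the tester's `cluster` type; `waitHealthWithRetry`, `maxRetries`, and `backoff` are hypothetical names, not actual tester code:

```go
// waitHealthWithRetry is a hypothetical wrapper around c.WaitHealth that
// retries transient failures (e.g. a deadline exceeded right after an agent
// restart) a bounded number of times before giving up.
func waitHealthWithRetry(c *cluster, maxRetries int, backoff time.Duration) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = c.WaitHealth(); err == nil {
			return nil
		}
		// the restarted member may still be replaying its WAL; wait and retry
		time.Sleep(backoff)
	}
	return err
}
```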
@gyuho I want to know exactly why there is that deadline exceeded error. I understand it only happens for the client. However, can we find out the root cause of that error? I do not think extending the timeout without finding the root cause is the right thing to do.
Got it. I will try to find that. Thanks.
To be more specific, you observed
We want to find out why there is this deadline exceeded error.
So we got 3 cases with timeout errors:
And all of those come from this function taking longer than 1 second to return a response:

```go
// setHealthKey sets health key on all given urls.
func setHealthKey(us []string) error {
	for _, u := range us {
		conn, err := grpc.Dial(u, grpc.WithInsecure(), grpc.WithTimeout(5*time.Second))
		if err != nil {
			return fmt.Errorf("%v (%s)", err, u)
		}
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		kvc := pb.NewKVClient(conn)
		_, err = kvc.Put(ctx, &pb.PutRequest{Key: []byte("health"), Value: []byte("good")})
		cancel()
		if err != nil {
			return err
		}
	}
	return nil
}
```

First case: Grafana queried `metrics` and used up all the file descriptors, so it blocked the `WaitHealth` call?
Second case: we killed the majority and tried to recover, but there were port conflicts on `:2380`, so the `WaitHealth` request to the node with the conflicting peer port timed out (we need `dropPort` for the `stop` method as well in this case).
Third case: we kill one member for a long time, but `WaitHealth` is not aware of which member got killed. When `WaitHealth` sends a request to the member that was killed by the failure injection, it gets `context deadline exceeded`.
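For the third case, it would already help if the health-check error said which endpoint timed out. A sketch (not the actual tester change; `setHealthKeyVerbose` is a made-up name) that wraps every error with the endpoint URL and closes each connection:

```go
// setHealthKeyVerbose follows the same flow as setHealthKey above, but closes
// each connection and wraps every error with the endpoint URL, so the tester
// log shows which member failed the health check.
func setHealthKeyVerbose(us []string) error {
	for _, u := range us {
		conn, err := grpc.Dial(u, grpc.WithInsecure(), grpc.WithTimeout(5*time.Second))
		if err != nil {
			return fmt.Errorf("dial: %v (%s)", err, u)
		}
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err = pb.NewKVClient(conn).Put(ctx, &pb.PutRequest{Key: []byte("health"), Value: []byte("good")})
		cancel()
		conn.Close()
		if err != nil {
			// the endpoint is included so the log tells which member timed out
			return fmt.Errorf("put: %v (%s)", err, u)
		}
	}
	return nil
}
```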
@xiang90 Third case is interesting. Here's the error:
Full error: etcd.txt (https://storage.googleapis.com/failure-archive/agent2/2016-02-14T18%3A34%3A44Z/etcd.log)
@xiang90 I looked at all the errors since
These errors occur before we cancel the stressers, and I believe they come from bad network connections.
So in the current implementation, a bad network connection makes the stressers return early even without being canceled, and this halts the stresser entirely. We should only return when there is an explicit cancel on the stresser, as in #4504 (see the sketch below).
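A rough sketch of that direction (names are made up; #4504 is the actual change): the stresser keeps going on transient errors and only returns when its own context is explicitly canceled, logging everything else so the two kinds of error can be told apart:

```go
// stressLoop is a sketch: it returns only when the tester explicitly cancels
// ctx; any other Put error (bad connection, deadline exceeded, slow server)
// is logged and retried, so the stress does not silently stop mid-round.
func stressLoop(ctx context.Context, kvc pb.KVClient) {
	for {
		pctx, cancel := context.WithTimeout(ctx, time.Second)
		_, err := kvc.Put(pctx, &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")})
		cancel()
		if ctx.Err() == context.Canceled {
			return // explicit cancel from the tester: stop cleanly
		}
		if err != nil {
			log.Printf("stresser put error (will retry): %v", err)
		}
	}
}
```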
How can you tell the deadline exceeded error is caused by a bad network connection? Maybe it is because the etcd server was slow, which we should fix.
@gyuho A network error should happen really, really rarely on GCE (probably below 0.01%). It should be much more infrequent than what we have observed, which is around 1% of all cases.
@xiang90 That's a good point. The current tester code makes it harder to debug that part, since we don't log or differentiate between the two errors: one from cancel, the other from a bad network. I think we should keep this issue open, but keep improving the tester code.
@gyuho We can tell from the errors. I think you have missed a point that I mentioned above about the
I do not think masking the error at the tester side is the right thing to do unless we are sure why we need to mask that error.
@xiang90 Thanks for clarification. Here are some etcd server side events that might have slowed down processing PUT requests:
I logged the time that the PUT request started, and you can see requests triggered at the time of leader failure got
Leader failure happens.
Sorry, I think I misused the term
Here are more examples:
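As for how those request times were captured, here is a minimal sketch of such timing instrumentation (the helper name, threshold, and log format are made up): time each Put and log slow ones, so client-side deadline errors can be lined up with leader-election events in the server log:

```go
// timedPut is a hypothetical helper: it logs any Put that errors or takes
// longer than warnAfter, so slow requests can be correlated with leader
// failures in the etcd server logs.
func timedPut(ctx context.Context, kvc pb.KVClient, req *pb.PutRequest, warnAfter time.Duration) error {
	start := time.Now()
	_, err := kvc.Put(ctx, req)
	if took := time.Since(start); err != nil || took > warnAfter {
		log.Printf("put started %v, took %v, err: %v", start, took, err)
	}
	return err
}
```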
My investigation got messy. To summarize, the client error happens when:
Let's fix this first.
https://github.com/golang/go/blob/master/src/os/exec_posix.go#L18 shows that cmd.Process.Kill sends syscall.SIGKILL to the command. But http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_01.html explains: 'If you send a SIGKILL to a process, you remove any chance for the process to do a tidy cleanup and shutdown, which might have unfortunate consequences.' This sends the SIGTERM, SIGINT, SIGHUP signals in order to the PID so that the process has more time to clean up its resources. Related to etcd-io#4517.
https://github.com/golang/go/blob/master/src/os/exec_posix.go#L18 shows that cmd.Process.Kill sends syscall.SIGKILL to the command. But http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_01.html explains: 'If you send a SIGKILL to a process, you remove any chance for the process to do a tidy cleanup and shutdown, which might have unfortunate consequences.' This sends the SIGTERM, SIGINT signals in order to the PID so that the process has more time to clean up its resources. Related to etcd-io#4517.
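A hedged sketch of the graceful-stop idea in those commits (the function name and wait interval are illustrative, not the agent's actual code): try the gentler signals first, and only fall back to SIGKILL if the process is still around:

```go
// gracefulStop sends SIGTERM, then SIGINT, waiting in between so the etcd
// process has a chance to clean up; SIGKILL is the last resort.
func gracefulStop(cmd *exec.Cmd, wait time.Duration) error {
	for _, sig := range []os.Signal{syscall.SIGTERM, syscall.SIGINT} {
		if err := cmd.Process.Signal(sig); err != nil {
			return err // the process may already be gone
		}
		time.Sleep(wait)
	}
	return cmd.Process.Kill() // equivalent to sending SIGKILL
}
```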
One reason for this kind of client error is depletion of Unix file descriptors.
@gyuho It is strange that etcd runs out of fds... We probably should not simply increase the fd limit; we need to figure out why it runs out of fds in the first place.
@xiang90 Agree. The log doesn't tell which one is using up all the ports. I will put the limit back to 1024 and try to find the root cause.
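One way to chase the root cause is to have the tester log its own fd usage every round; a Linux-only sketch (hypothetical helper, reads /proc/self/fd and the RLIMIT_NOFILE soft limit):

```go
// logFDUsage is a Linux-only sketch: it counts the open file descriptors of
// this process and compares them against the soft RLIMIT_NOFILE limit, so a
// leak shows up as a steadily growing count across rounds.
func logFDUsage(round int) {
	fds, err := ioutil.ReadDir("/proc/self/fd")
	if err != nil {
		log.Printf("round %d: cannot read /proc/self/fd: %v", round, err)
		return
	}
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		log.Printf("round %d: getrlimit failed: %v", round, err)
		return
	}
	log.Printf("round %d: %d open fds (soft limit %d)", round, len(fds), rlim.Cur)
}
```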
@xiang90 We had one more client error as below:
This indicates that the tester machine that launches etcd clients ran out of file descriptors.
Which suggests we might be leaking client calls. Here's how we spawn the stressers:

```go
conn, err := grpc.Dial(...) // abbreviated: one gRPC connection per member
defer conn.Close()
for i := 0; i < 200; i++ {
	go func(i int) {
		defer wg.Done()
		for {
			_, err := kvc.Put(...) // abbreviated
			if err != nil {
				if grpc.ErrorDesc(err) == context.DeadlineExceeded.Error() {
					continue
				}
				return
			}
		}
	}(i)
}
```

We have one gRPC TCP connection per member, so we should expect to see only a handful
So it starts out normal; in round 2:
And it slowly increases until it crashes. Around round 165, I got:
And immediately getting
@xiang90 I think #4572 fixed this issue. I redeployed with your patch and the port usage stays stable now. For round 1, we start with about 5~6 gRPC connections.
And after 90 rounds, the gRPC connection usage is the same, which is the expected behavior:
Thanks. Let's close this issue then.
WaitHealth fails (very rarely) with slow leader election, when it takes more than one minute, like the one described in etcd-io#4517. This also adds logs for client errors.
I observed the same error again, but it's not from the leaky goroutine. I restarted the testing cluster.
There was a bad connection on the client side. This is not an etcd error. It might be worth having some retry logic.