Client watchGrpcStream tight-loops if server is taken down while watcher is running #9578

Closed
fasaxc opened this issue Apr 17, 2018 · 4 comments


fasaxc (Contributor) commented on Apr 17, 2018

While debugging an issue in our project, calico-felix, which uses the etcdv3 API, I noticed that if I let the product start a watch and then take down the single-node etcd server, the product starts using a lot of CPU (150%+). Profiling, I traced it down to watchGrpcStream.run(). It looks like the connection fails quickly (connection refused, presumably), and that results in an immediate retry:

(pprof) top 30 --cum
Showing nodes accounting for 6.02s, 63.44% of 9.49s total
Dropped 96 nodes (cum <= 0.05s)
Showing top 30 nodes out of 95
      flat  flat%   sum%        cum   cum%
         0     0%     0%      5.54s 58.38%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).newWatchClient
     0.10s  1.05%  1.05%      5.54s 58.38%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).openWatchClient
         0     0%  1.05%      5.54s 58.38%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).run
     0.25s  2.63%  3.69%      5.54s 58.38%  runtime.systemstack
     0.05s  0.53%  4.21%      4.70s 49.53%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchClient).Watch
     0.04s  0.42%  4.64%      4.65s 49.00%  github.com/projectcalico/felix/vendor/google.golang.org/grpc.NewClientStream
     0.14s  1.48%  6.11%      4.61s 48.58%  github.com/projectcalico/felix/vendor/google.golang.org/grpc.newClientStream

I'm using v3.3.3 of the client (I started with v3.3.0 but then upgraded to see if it fixed the issue).
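
To illustrate the failure mode: the reconnect path behaves roughly like the loop below, where every failed attempt is retried immediately with no pause. This is a simplified sketch of the pattern only, not the actual clientv3 code; dialWatch is a stand-in for the real gRPC Watch stream setup.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialWatch stands in for the gRPC stream setup that fails immediately
// with "connection refused" while the etcd server is down.
func dialWatch(ctx context.Context) error {
	return errors.New("connection refused")
}

// runWatch has the problematic shape: every failed attempt is retried
// straight away, so a fast failure becomes a CPU-burning tight loop.
func runWatch(ctx context.Context) int {
	attempts := 0
	for ctx.Err() == nil {
		if err := dialWatch(ctx); err != nil {
			attempts++
			continue // no sleep, no backoff: retry immediately
		}
		// ... stream established, serve watch events ...
	}
	return attempts
}

func main() {
	// Even a 10ms window is enough to rack up a large number of retries.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println("retries in 10ms:", runWatch(ctx))
}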

xiang90 (Contributor) commented on Apr 17, 2018

#8914

#9171

fasaxc (Contributor, Author) commented on Apr 18, 2018

How about a quick fix, e.g. time.Sleep(100 * time.Millisecond), until the redesign in #9171 lands?
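
Roughly, the idea is to pause briefly before each retry when opening the stream fails. A minimal sketch of that pattern follows (names are illustrative only, not the actual clientv3 internals):

package watchfix

import (
	"context"
	"errors"
	"time"
)

// openStream stands in for the call that opens the watch stream; while
// the server is down it fails fast with "connection refused".
func openStream(ctx context.Context) error {
	return errors.New("connection refused")
}

// openWithPause keeps retrying, but sleeps 100ms after every failed
// attempt so the reconnect loop cannot spin the CPU.
func openWithPause(ctx context.Context) error {
	for {
		if err := openStream(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
			// fixed pause before the next attempt
		}
	}
}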

fasaxc added a commit to fasaxc/etcd-1 that referenced this issue Apr 24, 2018
Add a 100ms sleep to avoid tight loop if reconnection fails quickly.

Fixes etcd-io#9578
liggitt (Contributor) commented on Jun 13, 2018

> How about a quick fix, e.g. time.Sleep(100 * time.Millisecond), until the redesign in #9171 lands?

Big +1 to a minimal fix here that can be picked back to the 3.2.x and 3.3.x streams; the impact of the hot loop is pretty severe.

A simple backoff (start fast, multiply the backoff, cap it at 100ms) took our client application from 700% CPU consumption while etcd was down to 10%.

The Unavailable error code description explicitly references retrying with a backoff:
https://github.com/coreos/etcd/blob/88acced1cd7ad670001d1280b97de4fe7b647687/vendor/google.golang.org/grpc/codes/codes.go#L134-L140
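
For reference, the backoff we used is shaped like the sketch below (names and constants are illustrative, not the fix that eventually landed in the client):

package backoff

import (
	"context"
	"time"
)

// retryWithBackoff retries open until it succeeds or ctx is cancelled.
// The delay starts very short, doubles on every failure, and is capped
// at 100ms, matching the scheme described above.
func retryWithBackoff(ctx context.Context, open func(context.Context) error) error {
	const (
		initialDelay = 1 * time.Millisecond
		maxDelay     = 100 * time.Millisecond
	)
	delay := initialDelay
	for {
		if err := open(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}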

stale bot commented on Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Apr 7, 2020
stale bot closed this as completed on Apr 28, 2020