Client watchGrpcStream tight-loops if server is taken down while watcher is running #9578

Closed
fasaxc opened this issue Apr 17, 2018 · 4 comments


fasaxc (Contributor) commented on Apr 17, 2018

While debugging an issue in our project, calico-felix, which uses the etcdv3 API, I noticed that if I let the product start a watch and then take down the single-node etcd server, the product starts using a lot of CPU (150%+). Profiling, I traced it down to watchGrpcStream.run(). It looks like the connection fails quickly (connection refused, presumably), and that results in an immediate retry:

(pprof) top 30 --cum
Showing nodes accounting for 6.02s, 63.44% of 9.49s total
Dropped 96 nodes (cum <= 0.05s)
Showing top 30 nodes out of 95
      flat  flat%   sum%        cum   cum%
         0     0%     0%      5.54s 58.38%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).newWatchClient
     0.10s  1.05%  1.05%      5.54s 58.38%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).openWatchClient
         0     0%  1.05%      5.54s 58.38%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/clientv3.(*watchGrpcStream).run
     0.25s  2.63%  3.69%      5.54s 58.38%  runtime.systemstack
     0.05s  0.53%  4.21%      4.70s 49.53%  github.com/projectcalico/felix/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchClient).Watch
     0.04s  0.42%  4.64%      4.65s 49.00%  github.com/projectcalico/felix/vendor/google.golang.org/grpc.NewClientStream
     0.14s  1.48%  6.11%      4.61s 48.58%  github.com/projectcalico/felix/vendor/google.golang.org/grpc.newClientStream

I'm using v3.3.3 of the client (I started with v3.3.0 but then upgraded to see if it fixed the issue).
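
To illustrate the failure mode: the reconnect path behaves roughly like the loop below, where every failed attempt is retried immediately with no pause. This is a simplified sketch of the pattern only, not the actual clientv3 code; dialWatch is a stand-in for the real gRPC Watch stream setup.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialWatch stands in for the gRPC stream setup that fails immediately
// with "connection refused" while the etcd server is down.
func dialWatch(ctx context.Context) error {
	return errors.New("connection refused")
}

// runWatch has the problematic shape: every failed attempt is retried
// straight away, so a fast failure becomes a CPU-burning tight loop.
func runWatch(ctx context.Context) int {
	attempts := 0
	for ctx.Err() == nil {
		if err := dialWatch(ctx); err != nil {
			attempts++
			continue // no sleep, no backoff: retry immediately
		}
		// ... stream established, serve watch events ...
	}
	return attempts
}

func main() {
	// Even a 10ms window is enough to rack up a large number of retries.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println("retries in 10ms:", runWatch(ctx))
}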

xiang90 (Contributor) commented on Apr 17, 2018

#8914

#9171

fasaxc (Contributor, Author) commented on Apr 18, 2018

How about a quick fix, e.g. time.Sleep(100 * time.Millisecond), until the redesign in #9171 lands?
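
Roughly, the idea is to pause briefly before each retry when opening the stream fails. A minimal sketch of that pattern follows (names are illustrative only, not the actual clientv3 internals):

package watchfix

import (
	"context"
	"errors"
	"time"
)

// openStream stands in for the call that opens the watch stream; while
// the server is down it fails fast with "connection refused".
func openStream(ctx context.Context) error {
	return errors.New("connection refused")
}

// openWithPause keeps retrying, but sleeps 100ms after every failed
// attempt so the reconnect loop cannot spin the CPU.
func openWithPause(ctx context.Context) error {
	for {
		if err := openStream(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
			// fixed pause before the next attempt
		}
	}
}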

fasaxc added a commit to fasaxc/etcd-1 that referenced this issue Apr 24, 2018
Add a 100ms sleep to avoid tight loop if reconnection fails quickly.

Fixes etcd-io#9578
liggitt (Contributor) commented on Jun 13, 2018

> How about a quick fix, e.g. time.Sleep(100 * time.Millisecond), until the redesign in #9171 lands?

Big +1 to a minimal fix here that can be picked back to the 3.2.x and 3.3.x streams; the impact of the hot loop is pretty severe.

A simple backoff (start fast, multiply the backoff, cap it at 100ms) took our client application from 700% CPU consumption while etcd was down to 10%.

The Unavailable error code description explicitly references retrying with a backoff:
https://github.com/coreos/etcd/blob/88acced1cd7ad670001d1280b97de4fe7b647687/vendor/google.golang.org/grpc/codes/codes.go#L134-L140
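
For reference, the backoff we used is shaped like the sketch below (names and constants are illustrative, not the fix that eventually landed in the client):

package backoff

import (
	"context"
	"time"
)

// retryWithBackoff retries open until it succeeds or ctx is cancelled.
// The delay starts very short, doubles on every failure, and is capped
// at 100ms, matching the scheme described above.
func retryWithBackoff(ctx context.Context, open func(context.Context) error) error {
	const (
		initialDelay = 1 * time.Millisecond
		maxDelay     = 100 * time.Millisecond
	)
	delay := initialDelay
	for {
		if err := open(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}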

stale bot commented on Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Apr 7, 2020
stale bot closed this as completed on Apr 28, 2020