Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd pod exited as 'completed' triggering app health check failures #10

Open
vdice opened this issue Oct 31, 2016 · 1 comment
Open
Labels

Comments

@vdice
Copy link
Member

vdice commented Oct 31, 2016

Deis' own champagne cluster was affected by the nsqd pod exiting w/ Completed status, seemingly triggering errors in both the registry and workflow-manager pods. (The alerts leading us to investigate were health check failures on the k8s-claimer app deployed on this cluster.)

Here is the log tail from said pod:

...
[nsqd] 2016/10/31 21:25:50.576088 200 GET /stats?format=json (10.20.1.9:34742) 39.962µs
[nsqd] 2016/10/31 21:25:51.050988 HTTP: closing [::]:4151
[nsqd] 2016/10/31 21:25:51.052387 TCP: closing [::]:4150
[nsqd] 2016/10/31 21:25:51.052510 NSQ: persisting topic/channel metadata to /opt/nsq/data/nsqd.905.dat
[nsqd] 2016/10/31 21:25:51.534562 NSQ: closing topics
[nsqd] 2016/10/31 21:25:51.534729 TOPIC(metrics): closing
[nsqd] 2016/10/31 21:25:51.534804 TOPIC(metrics): closing ... messagePump
[nsqd] 2016/10/31 21:25:51.534844 CHANNEL(consumer): closing
[nsqd] 2016/10/31 21:25:51.534921 DISKQUEUE(metrics:consumer): closing
[nsqd] 2016/10/31 21:25:51.534989 DISKQUEUE(metrics:consumer): closing ... ioLoop
[nsqd] 2016/10/31 21:25:51.536344 PROTOCOL(V2): [10.20.2.9:53476] exiting ioloop
[nsqd] 2016/10/31 21:25:51.547640 ERROR: client(10.20.2.9:53476) - failed to read command - read tcp 10.20.0.5:4150->10.20.2.9:53476: use of closed network connection
[nsqd] 2016/10/31 21:25:51.549589 PROTOCOL(V2): [10.20.1.9:40500] exiting ioloop
[nsqd] 2016/10/31 21:25:51.549655 ERROR: client(10.20.1.9:40500) - failed to read command - read tcp 10.20.0.5:4150->10.20.1.9:40500: use of closed network connection
[nsqd] 2016/10/31 21:25:51.551025 PROTOCOL(V2): [10.20.1.9:40500] exiting messagePump
[nsqd] 2016/10/31 21:25:51.551086 PROTOCOL(V2): [10.20.0.7:48059] exiting ioloop
[nsqd] 2016/10/31 21:25:51.551144 ERROR: client(10.20.0.7:48059) - failed to read command - read tcp 10.20.0.5:4150->10.20.0.7:48059: use of closed network connection
[nsqd] 2016/10/31 21:25:51.551182 PROTOCOL(V2): [10.20.0.7:48059] exiting messagePump
[nsqd] 2016/10/31 21:25:51.551246 PROTOCOL(V2): [10.20.2.9:53476] exiting messagePump
[nsqd] 2016/10/31 21:25:51.664931 DISKQUEUE(metrics): closing
[nsqd] 2016/10/31 21:25:51.665069 DISKQUEUE(metrics): closing ... ioLoop
[nsqd] 2016/10/31 21:25:51.809062 TOPIC(logs): closing
[nsqd] 2016/10/31 21:25:51.809250 TOPIC(logs): closing ... messagePump
[nsqd] 2016/10/31 21:25:51.809313 CHANNEL(consume): closing
[nsqd] 2016/10/31 21:25:51.809359 DISKQUEUE(logs:consume): closing
[nsqd] 2016/10/31 21:25:51.809416 DISKQUEUE(logs:consume): closing ... ioLoop
[nsqd] 2016/10/31 21:25:51.810602 PROTOCOL(V2): [10.20.2.6:44983] exiting ioloop
[nsqd] 2016/10/31 21:25:51.810664 ERROR: client(10.20.2.6:44983) - failed to read command - read tcp 10.20.0.5:4150->10.20.2.6:44983: use of closed network connection
[nsqd] 2016/10/31 21:25:51.810752 PROTOCOL(V2): [10.20.2.6:44983] exiting messagePump
[nsqd] 2016/10/31 21:25:51.961055 DISKQUEUE(logs): closing
[nsqd] 2016/10/31 21:25:51.961237 DISKQUEUE(logs): closing ... ioLoop
[nsqd] 2016/10/31 21:25:52.081016 QUEUESCAN: closing
[nsqd] 2016/10/31 21:25:52.081297 200 GET /stats?format=json (10.20.1.9:34742) 896.959019ms
[nsqd] 2016/10/31 21:25:52.166461 200 GET /stats?format=json (10.20.2.9:43496) 650.746772ms
[nsqd] 2016/10/31 21:25:52.167008 ID: closing
[nsqd] 2016/10/31 21:25:52.167217 LOOKUP: closing

Cluster state when investigated:

$ kd get po
NAME                                     READY     STATUS      RESTARTS   AGE
deis-builder-808234085-4er8f             1/1       Running     11         14d
deis-controller-3550476522-vjmsw         1/1       Running     0          19d
deis-logger-1750174080-h61sh             1/1       Running     0          19d
deis-logger-fluentd-0s73r                1/1       Running     1          19d
deis-logger-fluentd-m6m7c                1/1       Running     0          19d
deis-logger-fluentd-w2gbi                1/1       Running     0          19d
deis-logger-redis-4282343581-o53zz       1/1       Running     0          19d
deis-monitor-grafana-947410455-7cujl     1/1       Running     0          19d
deis-monitor-influxdb-802438432-eyz7v    1/1       Running     0          19d
deis-monitor-telegraf-6qx76              1/1       Running     3          19d
deis-monitor-telegraf-c8d4t              1/1       Running     0          19d
deis-monitor-telegraf-cz6g7              1/1       Running     0          19d
deis-nsqd-2528936859-oe5je               0/1       Completed   3          19d
deis-registry-2153931602-3jmt9           0/1       Error       0          19d
deis-registry-proxy-s9p27                1/1       Running     0          19d
deis-registry-proxy-vg02u                1/1       Running     1          19d
deis-registry-proxy-vih1z                1/1       Running     0          19d
deis-router-1833288991-6itpr             1/1       Running     0          19d
deis-workflow-manager-3855316227-r1z8o   0/1       Error       0          19d

Log tail from registry pod, unhelpful besides timestamps:

10.20.0.1 - - [31/Oct/2016:21:23:20 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "Go-http-client/1.1"
time="2016-10-31T21:23:20.983954657Z" level=info msg="response completed" go.version=go1.6.2 http.request.host="10.20.0.8:5000" http.request.id=7825c2e9-11a2-45f3-b906-af8f6094dfc2 http.request.method=GET http.request.remoteaddr="10.20.0.1:52909" http.request.uri="/v2/" http.request.useragent="Go-http-client/1.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=174.29994ms http.response.status=200 http.response.written=2 instance.id=bc726225-3d34-4142-a327-fc7f79f87ccf service=registry version=v2.1.1-601-g0afef00
10.20.0.1 - - [31/Oct/2016:21:23:20 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "Go-http-client/1.1"

Note: Interestingly, unless there is some time skew at play, the last log from registry is at 21:23:20 whereas the nsqd timestamp is a bit later at 21:25:51 -- so it may indeed be a registry issue...

@Cryptophobia
Copy link

This issue was moved to teamhephy/nsqd#2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants