
ec2 reboot caused etcd status problems and empty leader #615

Closed
PaulCampbell opened this issue Mar 7, 2014 · 8 comments

Comments

@PaulCampbell

On rebooting one of the machines in our CoreOS cluster of 7 machines, fleetctl status started reporting only 1 unit running, despite the other units still apparently responding as normal.

Our etcd log from the time of the reboot shows the following sequence of events:

Mar 07 16:54:08 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:54:08.428 INFO | 10.28.70.224: state changed from 'snapshotting' to 'follower'.
Mar 07 16:54:11 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:54:11.338 INFO | 10.28.70.224: snapshot of 642303 events at index 642303 completed
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.004 INFO | 10.28.70.224: state changed from 'follower' to 'candidate'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.004 INFO | 10.28.70.224: leader changed from '10.28.7.6' to ''.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.006 INFO | 10.28.70.224: state changed from 'candidate' to 'leader'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.006 INFO | 10.28.70.224: leader changed from '' to '10.28.70.224'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.057 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.107 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.157 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'

I guess I'm particularly interested in:

leader changed from '10.28.7.6' to ''.

Does this look like an expected log?

Thanks
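
The election sequence in a log like the one above can be isolated with a quick grep. This is just a sketch over a saved excerpt (the file name etcd.log and its two sample lines are stand-ins), not a fleet or etcd command:

```shell
# Stand-in excerpt of the journal output quoted above.
cat > etcd.log <<'EOF'
Mar 07 16:58:35.004 INFO | 10.28.70.224: leader changed from '10.28.7.6' to ''.
Mar 07 16:58:35.006 INFO | 10.28.70.224: leader changed from '' to '10.28.70.224'.
EOF
# Print only the leader transitions, in order.
grep -o "leader changed from '[^']*' to '[^']*'" etcd.log
```

Run against a full journal dump, this makes it easy to see whether the empty-leader line is always immediately followed by a new leader being elected.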

@philips
Contributor

philips commented Mar 7, 2014

I think that is a bug in the leader changed log line. Otherwise it looks normal for a machine going down. Did the cluster recover?

The fleet behavior sounds like a bug to me. /cc @bcwaldon

@bcwaldon
Contributor

bcwaldon commented Mar 7, 2014

@PaulCampbell Definitely interested in helping debug the fleet issue. Could you try to describe more explicitly the state before and after the reboot?

@bashcoder

I thought I had a similar scenario in a test today, but it eventually worked itself out after a few minutes.

@AndrewVos

UNIT                                    LOAD    ACTIVE  SUB     DESC                            MACHINE
elasticsearch.1.service                 loaded  active  running elasticsearch                   534ead73.../10.28.7.6
elasticsearch.2.service                 -       -       -       elasticsearch                   -
elasticsearch.3.service                 -       -       -       elasticsearch                   -
elasticsearch.4.service                 -       -       -       elasticsearch                   -
elasticsearch.5.service                 -       -       -       elasticsearch                   -
elasticsearch.6.service                 loaded  active  running elasticsearch                   008faf7b.../10.28.70.224
elasticsearch.7.service                 -       -       -       elasticsearch                   -
github-authentication-proxy.1.service   loaded  active  running github-authentication-proxy     6ec5b6a7.../10.28.69.136
github-authentication-proxy.2.service   loaded  active  running github-authentication-proxy     008faf7b.../10.28.70.224
github-authentication-proxy.3.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.4.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.5.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.6.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.7.service   -       -       -       github-authentication-proxy     -
prodotti.1.service                      -       -       -       prodotti                        -
prodotti.2.service                      loaded  active  running prodotti                        534ead73.../10.28.7.6
prodotti.3.service                      -       -       -       prodotti                        -
prodotti.4.service                      loaded  active  running prodotti                        6ec5b6a7.../10.28.69.136
prodotti.5.service                      -       -       -       prodotti                        -
prodotti.6.service                      -       -       -       prodotti                        -
prodotti.7.service                      loaded  active  running prodotti                        008faf7b.../10.28.70.224

@AndrewVos

That's what the cluster looks like now. All seven elasticsearch containers are actually running, but we can't get the prodotti services to all run. There are job offers for each.

This is confusing, and stressful because these boxes are all in production right now :)

@AndrewVos

Before reboot, we had all ES instances showing as running. We rebooted because one of the prodotti services wouldn't launch (I suppose nobody would accept the job offer).

@bcwaldon
Contributor

bcwaldon commented Mar 9, 2014

@AndrewVos could you file an issue in coreos/fleet? Sharing the X-Fleet portions of the units would help diagnose your issue as well.
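
For anyone unfamiliar with what is being asked for: fleet reads scheduling hints from an [X-Fleet] section in each systemd unit file. A hypothetical example (the unit glob and option shown are illustrative of the fleet options of this era, not taken from this cluster's units):

```ini
[X-Fleet]
# Keep multiple prodotti instances off the same machine.
X-Conflicts=prodotti.*.service
```

Options like this are the usual reason a unit stays unscheduled even though job offers exist, which is why sharing them helps diagnosis.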

@AndrewVos

@PaulCampbell could you close this issue? I've moved it over to coreos/fleet#210

Labels: None yet

5 participants