
ec2 reboot caused etcd status problems and empty leader #615

Closed
PaulCampbell opened this issue Mar 7, 2014 · 8 comments

Comments

@PaulCampbell

On rebooting one of the machines in our CoreOS cluster of 7 machines, fleetctl status started reporting only 1 unit running, despite the other units still apparently responding as normal.

Our etcd log from the time of the reboot shows the following sequence of events:

Mar 07 16:54:08 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:54:08.428 INFO | 10.28.70.224: state changed from 'snapshotting' to 'follower'.
Mar 07 16:54:11 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:54:11.338 INFO | 10.28.70.224: snapshot of 642303 events at index 642303 completed
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.004 INFO | 10.28.70.224: state changed from 'follower' to 'candidate'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.004 INFO | 10.28.70.224: leader changed from '10.28.7.6' to ''.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.006 INFO | 10.28.70.224: state changed from 'candidate' to 'leader'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.006 INFO | 10.28.70.224: leader changed from '' to '10.28.70.224'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.057 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.107 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar 7 16:58:35.157 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'

I guess I'm particularly interested in:

leader changed from '10.28.7.6' to ''.

Does this look like an expected log?

Thanks
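
The election sequence in a log like the one above can be isolated with a quick grep. This is just a sketch over a saved excerpt (the file name etcd.log and its two sample lines are stand-ins), not a fleet or etcd command:

```shell
# Stand-in excerpt of the journal output quoted above.
cat > etcd.log <<'EOF'
Mar 07 16:58:35.004 INFO | 10.28.70.224: leader changed from '10.28.7.6' to ''.
Mar 07 16:58:35.006 INFO | 10.28.70.224: leader changed from '' to '10.28.70.224'.
EOF
# Print only the leader transitions, in order.
grep -o "leader changed from '[^']*' to '[^']*'" etcd.log
```

Run against a full journal dump, this makes it easy to see whether the empty-leader line is always immediately followed by a new leader being elected.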

@philips
Contributor

philips commented Mar 7, 2014

I think that is a bug in the leader changed log line. Otherwise it looks normal for a machine going down. Did the cluster recover?

The fleet behavior sounds like a bug to me. /cc @bcwaldon

@bcwaldon
Contributor

bcwaldon commented Mar 7, 2014

@PaulCampbell Definitely interested in helping debug the fleet issue. Could you try to describe more explicitly the state before and after the reboot?

@bashcoder

I thought I had a similar scenario in a test today, but it eventually worked itself out after a few minutes.

@AndrewVos

UNIT                                    LOAD    ACTIVE  SUB     DESC                            MACHINE
elasticsearch.1.service                 loaded  active  running elasticsearch                   534ead73.../10.28.7.6
elasticsearch.2.service                 -       -       -       elasticsearch                   -
elasticsearch.3.service                 -       -       -       elasticsearch                   -
elasticsearch.4.service                 -       -       -       elasticsearch                   -
elasticsearch.5.service                 -       -       -       elasticsearch                   -
elasticsearch.6.service                 loaded  active  running elasticsearch                   008faf7b.../10.28.70.224
elasticsearch.7.service                 -       -       -       elasticsearch                   -
github-authentication-proxy.1.service   loaded  active  running github-authentication-proxy     6ec5b6a7.../10.28.69.136
github-authentication-proxy.2.service   loaded  active  running github-authentication-proxy     008faf7b.../10.28.70.224
github-authentication-proxy.3.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.4.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.5.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.6.service   -       -       -       github-authentication-proxy     -
github-authentication-proxy.7.service   -       -       -       github-authentication-proxy     -
prodotti.1.service                      -       -       -       prodotti                        -
prodotti.2.service                      loaded  active  running prodotti                        534ead73.../10.28.7.6
prodotti.3.service                      -       -       -       prodotti                        -
prodotti.4.service                      loaded  active  running prodotti                        6ec5b6a7.../10.28.69.136
prodotti.5.service                      -       -       -       prodotti                        -
prodotti.6.service                      -       -       -       prodotti                        -
prodotti.7.service                      loaded  active  running prodotti                        008faf7b.../10.28.70.224

@AndrewVos

That's what the cluster looks like now. All seven elasticsearch containers are actually running, but we can't get the prodotti services to all run. There are job offers for each.

This is confusing, and stressful because these boxes are all in production right now :)

@AndrewVos

Before reboot, we had all ES instances showing as running. We rebooted because one of the prodotti services wouldn't launch (I suppose nobody would accept the job offer).

@bcwaldon
Contributor

bcwaldon commented Mar 9, 2014

@AndrewVos could you file an issue in coreos/fleet? Sharing the X-Fleet portions of the units would help diagnose your issue as well.
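
For anyone unfamiliar with what is being asked for: fleet reads scheduling hints from an [X-Fleet] section in each systemd unit file. A hypothetical example (the unit glob and option shown are illustrative of the fleet options of this era, not taken from this cluster's units):

```ini
[X-Fleet]
# Keep multiple prodotti instances off the same machine.
X-Conflicts=prodotti.*.service
```

Options like this are the usual reason a unit stays unscheduled even though job offers exist, which is why sharing them helps diagnosis.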

@AndrewVos

@PaulCampbell could you close this issue? I've moved it over to coreos/fleet#210

Labels: None yet

5 participants