ec2 reboot caused etcd status problems and empty leader #615
I think that is a bug in the 'leader changed' log line. Otherwise it looks normal for a machine going down. Did the cluster recover? The fleet behavior sounds like a bug, I believe. /cc @bcwaldon
@PaulCampbell Definitely interested in helping debug the fleet issue. Could you try to describe more explicitly the state before and after the reboot?
I thought I had a similar scenario in a test today, but it eventually worked itself out after a few minutes.
That's what the cluster looks like now. All seven elasticsearch containers are actually running, but we can't get all of the prodotti services to run. There are job offers for each. This is confusing, and stressful, because these boxes are all in production right now :)
Before the reboot, all ES instances showed as running. We rebooted because one of the prodotti services wouldn't launch (I suppose no machine would accept the job offer).
@AndrewVos could you file an issue in coreos/fleet? Sharing the X-Fleet portions of the units would help diagnose your issue as well. |
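(For reference, the X-Fleet section is the fleet-specific part of a unit file that controls scheduling. The fragment below is purely illustrative; the unit name and command are assumptions, not taken from this issue, and the option names follow the fleet syntax of that era:)

```ini
# prodotti@.service -- illustrative template unit, not from the issue
[Unit]
Description=prodotti web service

[Service]
ExecStart=/usr/bin/docker run --rm prodotti

[X-Fleet]
# Never schedule two instances of this template on the same machine.
X-Conflicts=prodotti@*.service
```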
@PaulCampbell could you close this issue, I've moved it over to coreos/fleet#210 |
After rebooting one of the machines in our 7-machine CoreOS cluster, fleetctl status started reporting only 1 unit running, even though the other units still appeared to respond normally.
Our etcd log from the time of the reboot shows the following sequence of events:
```
Mar 07 16:54:08 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:54:08.428 INFO | 10.28.70.224: state changed from 'snapshotting' to 'follower'.
Mar 07 16:54:11 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:54:11.338 INFO | 10.28.70.224: snapshot of 642303 events at index 642303 completed
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.004 INFO | 10.28.70.224: state changed from 'follower' to 'candidate'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.004 INFO | 10.28.70.224: leader changed from '10.28.7.6' to ''.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.006 INFO | 10.28.70.224: state changed from 'candidate' to 'leader'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.006 INFO | 10.28.70.224: leader changed from '' to '10.28.70.224'.
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.057 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.107 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
Mar 07 16:58:35 ip-10-28-70-224 etcd-bootstrap[489]: [etcd] Mar  7 16:58:35.157 INFO | 10.28.70.224: warning: heartbeat timed out: '10.28.7.6'
```
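(Aside: the leader transitions in a log like this can be extracted mechanically, which makes the transient empty-leader state during the election easy to see. A minimal sketch, assuming the 'leader changed' message format shown above; the two sample lines are taken from the log with the journalctl prefixes trimmed:)

```python
import re

# Two of the log lines above, timestamps trimmed; addresses are verbatim.
LOG = """\
10.28.70.224: leader changed from '10.28.7.6' to ''.
10.28.70.224: leader changed from '' to '10.28.70.224'.
"""

# Each match yields an (old_leader, new_leader) pair; an empty string
# means "no leader", i.e. the transient state during an election.
PATTERN = re.compile(r"leader changed from '([^']*)' to '([^']*)'")

def leader_transitions(log: str) -> list[tuple[str, str]]:
    return PATTERN.findall(log)

print(leader_transitions(LOG))
# [('10.28.7.6', ''), ('', '10.28.70.224')]
```

A brief empty leader between 'candidate' and 'leader' states is expected during an election; an empty leader that persists is what would indicate a problem.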
I'm particularly interested in whether this looks like an expected log for a machine going down.
Thanks