Kubo 0.7.0 (vSphere): all K8s cluster shut down and then powered up: etcd nodes are in failing state #111

Closed
guillierf opened this Issue Sep 12, 2017 · 6 comments

guillierf commented Sep 12, 2017

I ran this very simple test:

- deploy K8s cluster: OK
- bosh instances: OK
Instance                                             Process State  AZ  IPs
etcd/5eb70526-4522-44c6-8ceb-96ce1ac70e1a            running        z1  10.40.207.94
etcd/ed79acab-7a95-4285-8c04-2eda607af558            running        z1  10.40.207.93
etcd/f96ef4cf-042d-4d94-bcdb-75d823d8203d            running        z1  10.40.207.95
master-haproxy/bd9f1ff6-356e-47bd-a7ef-289085d40583  running        z1  10.40.207.92
master/47962537-e72c-4141-863b-3135d20a56bb          running        z1  10.40.207.96
master/7fbbbc8d-d7a0-4fda-9deb-4980adc9709e          running        z1  10.40.207.97
worker-haproxy/5cab4451-b93a-47a2-b2ee-663ea925dd67  running        z1  10.40.207.101
worker/1b99456d-c198-4279-9cf9-6893509d248c          running        z1  10.40.207.99
worker/3ace8643-2d99-4264-927a-1f0eaeb03743          running        z1  10.40.207.100
worker/455ddbef-d0e2-40e6-b26d-c1a23af383a6          running        z1  10.40.207.98

- on vCenter, shut down all those VMs
- on vCenter, restart all those VMs
- now I get this state:

Instance                                             Process State  AZ  IPs
etcd/5eb70526-4522-44c6-8ceb-96ce1ac70e1a            failing        z1  10.40.207.94
etcd/ed79acab-7a95-4285-8c04-2eda607af558            failing        z1  10.40.207.93
etcd/f96ef4cf-042d-4d94-bcdb-75d823d8203d            failing        z1  10.40.207.95
master-haproxy/bd9f1ff6-356e-47bd-a7ef-289085d40583  running        z1  10.40.207.92
master/47962537-e72c-4141-863b-3135d20a56bb          running        z1  10.40.207.96
master/7fbbbc8d-d7a0-4fda-9deb-4980adc9709e          running        z1  10.40.207.97
worker-haproxy/5cab4451-b93a-47a2-b2ee-663ea925dd67  running        z1  10.40.207.101
worker/1b99456d-c198-4279-9cf9-6893509d248c          running        z1  10.40.207.99
worker/3ace8643-2d99-4264-927a-1f0eaeb03743          running        z1  10.40.207.100
worker/455ddbef-d0e2-40e6-b26d-c1a23af383a6          running        z1  10.40.207.98

=> All etcd nodes are in the failing state.

Ideally, all the nodes should come back up correctly after the restart.

(The workaround is to bosh restart one of the etcd nodes.)
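
The workaround can be sketched as a small Python script that scans `bosh instances` output for etcd instances in the failing state and builds a restart command for one of them. This is only an illustration under assumptions: the sample text is copied from this report, the deployment name `my-kubo` is a placeholder, and `failing_instances` and `restart_command` are hypothetical helper names. The script prints the command instead of running it.

```python
# Sample rows copied from the `bosh instances` output in this report.
BOSH_INSTANCES = """\
Instance Process State AZ IPs
etcd/5eb70526-4522-44c6-8ceb-96ce1ac70e1a failing z1 10.40.207.94
etcd/ed79acab-7a95-4285-8c04-2eda607af558 failing z1 10.40.207.93
etcd/f96ef4cf-042d-4d94-bcdb-75d823d8203d failing z1 10.40.207.95
master/47962537-e72c-4141-863b-3135d20a56bb running z1 10.40.207.96
"""


def failing_instances(table_text, group=None):
    """Return instance names (group/id) whose process state is 'failing'."""
    failing = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip the header and any line that does not start with "<group>/<id>".
        if len(fields) < 2 or "/" not in fields[0]:
            continue
        name, state = fields[0], fields[1]
        if state != "failing":
            continue
        if group is not None and not name.startswith(group + "/"):
            continue
        failing.append(name)
    return failing


def restart_command(deployment, instance):
    """Build the argument list for restarting a single instance."""
    return ["bosh", "-d", deployment, "restart", instance]


failing_etcd = failing_instances(BOSH_INSTANCES, group="etcd")
# "my-kubo" is a placeholder; substitute the real deployment name.
command = restart_command("my-kubo", failing_etcd[0])
print(" ".join(command))
```

Printing the command keeps the sketch safe to run anywhere. To execute it for real, the list could be passed to `subprocess.run(command, check=True)` on a machine where the bosh CLI is installed and targeted at the director.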


cf-gitbot commented Sep 12, 2017

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/151028322

The labels on this GitHub issue will be updated when the story is started.

Author

guillierf commented Sep 12, 2017

Found something more interesting:

After all the steps described above, kubectl get pod gives the following result:

$ kubectl get pod --all-namespaces
NAMESPACE    NAME                                   READY  STATUS            RESTARTS  AGE
kube-system  heapster-1569517067-0gxgm              1/1    Running           1         46m
kube-system  kube-dns-3329716278-9pmdz              1/3    CrashLoopBackOff  20        47m
kube-system  kubernetes-dashboard-1367211859-67172  0/1    CrashLoopBackOff  8         46m
kube-system  monitoring-influxdb-564852376-dmmfh    1/1    Running           1         46m

kube-dns and kubernetes-dashboard are in the CrashLoopBackOff state.
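
For anyone repeating this check, here is a minimal Python sketch that picks the unhealthy pods out of `kubectl get pod --all-namespaces` output. It assumes the six-column layout shown in this comment, and `unhealthy_pods` is a hypothetical helper name. It treats a pod as unhealthy when its status is not `Running` or when fewer containers are ready than expected.

```python
# Sample text copied from the kubectl output in this comment.
KUBECTL_PODS = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system heapster-1569517067-0gxgm 1/1 Running 1 46m
kube-system kube-dns-3329716278-9pmdz 1/3 CrashLoopBackOff 20 47m
kube-system kubernetes-dashboard-1367211859-67172 0/1 CrashLoopBackOff 8 46m
kube-system monitoring-influxdb-564852376-dmmfh 1/1 Running 1 46m
"""


def unhealthy_pods(table_text):
    """Return (name, status, restarts) for pods that are not fully ready."""
    result = []
    # The first line is the column header, so skip it.
    for row in table_text.splitlines()[1:]:
        fields = row.split()
        if len(fields) < 6:
            continue  # not a complete pod row
        _namespace, name, ready, status, restarts = fields[:5]
        ready_count, total = (int(part) for part in ready.split("/"))
        if status != "Running" or ready_count < total:
            result.append((name, status, int(restarts)))
    return result


for name, status, restarts in unhealthy_pods(KUBECTL_PODS):
    print(f"{name}: {status} ({restarts} restarts)")
```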

Member

mkjelland commented Sep 28, 2017

Hi @guillierf! Sorry for the delay. We will discuss prioritizing the etcd issue in our next planning meeting on Tuesday.

I think you already know this, but we have a workaround for the kube-dns issue specifically, and we are working on the actual fix for that now.

@cf-gitbot cf-gitbot added scheduled and removed unscheduled labels Oct 4, 2017

@cf-gitbot cf-gitbot added unscheduled and removed scheduled labels Nov 20, 2017

Member

glestaris commented Nov 23, 2017

Hey @guillierf. Kubo v0.9.0+ scales the etcd nodes down to 1, so you should not encounter this problem anymore. Closing. Please re-open if you need help or the issue persists.

@glestaris glestaris closed this Nov 23, 2017

@cf-gitbot cf-gitbot added accepted and removed unscheduled labels Nov 23, 2017


pbakre commented Nov 28, 2017

I am also facing this issue. However, in my case, when I try to restart the etcd node, I get the following error:
14:56:04 | Preparing deployment: Preparing deployment (00:00:00)
L Error: Instance group 'etcd' references an unknown network 'Comp-VM-Net-1'

14:56:04 | Error: Instance group 'etcd' references an unknown network 'Comp-VM-Net-1'

Started Tue Nov 28 14:56:04 UTC 2017
Finished Tue Nov 28 14:56:04 UTC 2017
Duration 00:00:00

Any help with this is really appreciated.

Member

glestaris commented Dec 11, 2017

Hey @pbakre, have you tried upgrading to v0.9.0+?
