
Apiserver stopped working (no changes made) #2667

Closed
osirisguitar opened this issue Oct 20, 2021 · 12 comments

@osirisguitar

I have two clusters that have both stopped working in the same way. kubectl can't connect to them and the apiserver is not running. One of the clusters is a single-node (inspection report comes from that one), the other has three nodes. No changes have been made to the machines where they are running.

I tried upgrading the single-node cluster from v1.21 to v1.22, but that made no difference. It's not the problem from #2486; both info.yaml and cluster.yaml have the expected contents...

I don't know why the apiserver isn't included in the inspect report... This is what it says (after upgrading to 1.22, which is why it shows revision 2585 here instead of 2546 as in the inspect tarball created before the upgrade).

● snap.microk8s.daemon-apiserver.service - Service for snap application microk8s.daemon-apiserver
   Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-apiserver.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2021-10-18 11:26:40 UTC; 2 days ago
  Process: 33264 ExecStart=/usr/bin/snap run microk8s.daemon-apiserver (code=exited, status=0/SUCCESS)
 Main PID: 33264 (code=exited, status=0/SUCCESS)

Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: ++ /snap/microk8s/2585/bin/uname -m
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + ARCH=x86_64
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + export LD_LIBRARY_PATH=:/snap/microk8s/2585/lib:/snap/microk8s/2585/usr/lib:/snap/microk8s/2585/lib/x86_64-linux-gnu:/snap/mic
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + LD_LIBRARY_PATH=:/snap/microk8s/2585/lib:/snap/microk8s/2585/usr/lib:/snap/microk8s/2585/lib/x86_64-linux-gnu:/snap/microk8s/2
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + export LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void::/snap/microk8s/2585/lib:/snap/microk
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void::/snap/microk8s/2585/lib:/snap/microk8s/2585
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + '[' -e /var/snap/microk8s/2585/var/lock/lite.lock ']'
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + echo 'Will not run along with kubelite'
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: Will not run along with kubelite
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + exit 0

inspection-report-20211018_080210.tar.gz

@balchua
Collaborator

balchua commented Oct 20, 2021

@osirisguitar thanks for reporting. Kubelite wraps most, if not all, of the Kubernetes control plane components; that's why the apiserver log you see says it is not starting.
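
To see what is actually failing, you can check the kubelite service directly; a rough sketch, assuming the standard snap unit name for recent MicroK8s releases:

# Status of the combined control-plane daemon (kubelite)
sudo systemctl status snap.microk8s.daemon-kubelite

# Recent kubelite logs, to see why the apiserver part is not coming up
sudo journalctl -u snap.microk8s.daemon-kubelite -n 200 --no-pager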

Looking at the logs, it looks like dqlite isn't starting or it is unable to find the leader.
@ktsakalozos @MathieuBordere any thoughts?
Thanks.

@ktsakalozos
Member

Hi @osirisguitar

Could you share the output of ls -l /var/snap/microk8s/current/var/kubernetes/backend/ as well as cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml and cat /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml ?

It is strange both clusters stopped working at the same time. Can you think of anything that might have changed?

@osirisguitar
Author

osirisguitar commented Oct 21, 2021

That no one made any changes to the machines is what stresses me out the most... The only thing I know of is that they have been automatically migrated between Hyper-V hosts in a cluster, but that is supposed to be completely undetectable to the VMs. The only thing I've found is that a VM could get a new MAC address during a migration (I don't know whether that has happened to these machines...).

And: I really appreciate the help!

From the single-node cluster

(the same one the inspect report above was taken from)

ls -l /var/snap/microk8s/current/var/kubernetes/backend/

-rw-rw---- 1 root microk8s 8363120 Oct 14 19:43 0000000091522082-0000000091522412
-rw-rw---- 1 root microk8s 8371544 Oct 14 19:44 0000000091522413-0000000091522917
-rw-rw---- 1 root microk8s 8385512 Oct 14 19:44 0000000091522918-0000000091523502
-rw-rw---- 1 root microk8s 8374856 Oct 14 19:44 0000000091523503-0000000091524053
-rw-rw---- 1 root microk8s 8359232 Oct 14 19:47 0000000091524054-0000000091524387
-rw-rw---- 1 root microk8s 8382632 Oct 14 19:49 0000000091524388-0000000091524761
-rw-rw---- 1 root microk8s 8386808 Oct 14 19:49 0000000091524762-0000000091525364
-rw-rw---- 1 root microk8s 8369456 Oct 14 19:49 0000000091525365-0000000091525954
-rw-rw---- 1 root microk8s 8378744 Oct 14 19:51 0000000091525955-0000000091526388
-rw-rw---- 1 root microk8s 8363048 Oct 14 19:53 0000000091526389-0000000091526718
-rw-rw---- 1 root microk8s 8385440 Oct 14 19:54 0000000091526719-0000000091527245
-rw-rw---- 1 root microk8s 8385728 Oct 14 19:54 0000000091527246-0000000091527833
-rw-rw---- 1 root microk8s 8386232 Oct 14 19:54 0000000091527834-0000000091528371
-rw-rw---- 1 root microk8s 8355128 Oct 14 19:57 0000000091528372-0000000091528705
-rw-rw---- 1 root microk8s 8388248 Oct 14 19:59 0000000091528706-0000000091529100
-rw-rw---- 1 root microk8s 8356136 Oct 14 19:59 0000000091529101-0000000091529676
-rw-rw---- 1 root microk8s 8381552 Oct 14 19:59 0000000091529677-0000000091530263
-rw-rw---- 1 root microk8s 8385368 Oct 14 20:01 0000000091530264-0000000091530675
-rw-rw---- 1 root microk8s 8375648 Oct 14 20:03 0000000091530676-0000000091531009
-rw-rw---- 1 root microk8s 5354096 Oct 14 20:04 0000000091531010-0000000091531329
-rw-rw---- 1 root microk8s    2220 May 19 06:33 cluster.crt
-rw-rw---- 1 root microk8s    3272 May 19 06:33 cluster.key
-rw-rw---- 1 root microk8s       0 Oct 14 20:04 cluster.yaml
-rw-rw-r-- 1 root microk8s       2 Oct 21 08:51 failure-domain
-rw-rw---- 1 root microk8s      57 May 19 06:33 info.yaml
srw-rw---- 1 root microk8s       0 Oct  4 20:09 kine.sock
-rw-rw---- 1 root microk8s      63 Oct  4 20:09 localnode.yaml
-rw-rw---- 1 root microk8s      32 May 19 06:33 metadata1
-rw-rw---- 1 root microk8s 5320601 Oct 14 19:59 snapshot-1-91529576-12821376770
-rw-rw---- 1 root microk8s      72 Oct 14 19:59 snapshot-1-91529576-12821376770.meta
-rw-rw---- 1 root microk8s 5328981 Oct 14 20:00 snapshot-1-91530600-12821455927
-rw-rw---- 1 root microk8s      72 Oct 14 20:00 snapshot-1-91530600-12821455927.meta

cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

Empty

cat /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml

- Address: 127.0.0.1:19001
  ID: 3297041220608546238
  Role: 0

From the multi-node cluster

ls -l /var/snap/microk8s/current/var/kubernetes/backend/

-rw-rw---- 1 root microk8s 8385512 Sep 30 10:34 0000000176463498-0000000176464025
-rw-rw---- 1 root microk8s 8382344 Sep 30 10:35 0000000176464026-0000000176464623
-rw-rw---- 1 root microk8s 8381624 Sep 30 10:36 0000000176464624-0000000176465040
-rw-rw---- 1 root microk8s 8362832 Sep 30 10:36 0000000176465041-0000000176465652
-rw-rw---- 1 root microk8s 8375144 Sep 30 10:37 0000000176465653-0000000176466207
-rw-rw---- 1 root microk8s 8382632 Sep 30 10:39 0000000176466208-0000000176466581
-rw-rw---- 1 root microk8s 8387168 Sep 30 10:39 0000000176466582-0000000176467189
-rw-rw---- 1 root microk8s 8381768 Sep 30 10:40 0000000176467190-0000000176467722
-rw-rw---- 1 root microk8s 8384000 Sep 30 10:41 0000000176467723-0000000176468229
-rw-rw---- 1 root microk8s 8377880 Sep 30 10:42 0000000176468230-0000000176468822
-rw-rw---- 1 root microk8s 8374784 Sep 30 10:43 0000000176468823-0000000176469315
-rw-rw---- 1 root microk8s 8386952 Sep 30 10:44 0000000176469316-0000000176469749
-rw-rw---- 1 root microk8s 8380544 Sep 30 10:44 0000000176469750-0000000176470379
-rw-rw---- 1 root microk8s 8379464 Sep 30 10:46 0000000176470380-0000000176470823
-rw-rw---- 1 root microk8s 8384792 Sep 30 10:46 0000000176470824-0000000176471398
-rw-rw---- 1 root microk8s 8384576 Sep 30 10:47 0000000176471399-0000000176471970
-rw-rw---- 1 root microk8s 3397232 Oct 20 07:43 0000000176471971-0000000176472186
-rw-rw---- 1 root microk8s    2216 Mar 15  2021 cluster.crt
-rw-rw---- 1 root microk8s    3272 Mar 15  2021 cluster.key
-rw-rw---- 1 root microk8s     209 Sep 30 10:47 cluster.yaml
-rw-rw-r-- 1 root microk8s       2 Oct 21 08:48 failure-domain
-rw-rw---- 1 root microk8s      63 Mar 15  2021 info.yaml
srw-rw---- 1 root microk8s       0 Sep 24 12:14 kine.sock
-rw-rw---- 1 root microk8s      69 Sep 24 12:14 localnode.yaml
-rw-rw---- 1 root microk8s      32 Oct 21 08:49 metadata1
-rw-rw---- 1 root microk8s      32 Oct 21 08:49 metadata2
-rw-rw---- 1 root microk8s 6173043 Sep 30 10:46 snapshot-30-176471135-513165012
-rw-rw---- 1 root microk8s     136 Sep 30 10:46 snapshot-30-176471135-513165012.meta
-rw-rw---- 1 root microk8s 5891395 Sep 30 10:47 snapshot-30-176472159-513221856
-rw-rw---- 1 root microk8s     136 Sep 30 10:47 snapshot-30-176472159-513221856.meta

cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

- Address: 192.168.100.178:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.100.183:19001
  ID: 10170341938016627293
  Role: 0
- Address: 192.168.100.181:19001
  ID: 18208731239841283841
  Role: 0

cat /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml

- Address: 192.168.100.178:19001
  ID: 3297041220608546238
  Role: 0

@MathieuBordere

  • For the single-node cluster, it looks like replacing the contents of the empty cluster.yaml with the contents of localnode.yaml will get it running again. It's best to stop microk8s, perform the edit, and then start it again (see the command sketch at the end of this comment). This was supposed to have been fixed in More robust filewriting go-dqlite#147, but apparently it still occurs under some circumstances.

  • For the multi-node cluster, it looks like there weren't any writes on that node since Sep 30, and then suddenly some activity on Oct 20:

-rw-rw---- 1 root microk8s 8384576 Sep 30 10:47 0000000176471399-0000000176471970
-rw-rw---- 1 root microk8s 3397232 Oct 20 07:43 0000000176471971-0000000176472186

Was this node shut down for a while and then started again? Can you please provide the contents of /var/snap/microk8s/current/var/kubernetes/backend/ as well as cluster.yaml & localnode.yaml for the other nodes in that cluster too?
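
For the single-node repair, the steps would look roughly like this (a sketch only; back up the backend directory first and adjust the paths if your layout differs):

# Stop MicroK8s so nothing touches the dqlite files while editing
sudo microk8s stop

# Keep a backup of the backend directory before changing anything
sudo cp -a /var/snap/microk8s/current/var/kubernetes/backend \
           /var/snap/microk8s/current/var/kubernetes/backend.bak

# cluster.yaml is empty; on a single-node cluster it should contain the
# same single entry as localnode.yaml, so copy that over
sudo cp /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml \
        /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# Start MicroK8s again and wait for it to report ready
sudo microk8s start
microk8s status --wait-ready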

@osirisguitar
Author

Yes, node 1 in the multi-node cluster has been turned off for a while. I had actually forgotten about that.

I'll fix cluster.yaml on the single node and check the files on the other nodes of the multi-node cluster ASAP.

@osirisguitar
Author

The single-node cluster seems to be back up, thank you so much for the help! microk8s.status now mostly reports that microk8s is running (not every time), the API is responding on port 16443, and the pods seem to be starting.

My big question though is how and why this happened... And how can I prevent it from happening again?

Multi-node cluster

Node 2:

cluster.yaml

- Address: 192.168.100.178:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.100.183:19001
  ID: 10170341938016627293
  Role: 0
- Address: 192.168.100.181:19001
  ID: 18208731239841283841
  Role: 0

localnode.yaml:

- Address: 192.168.100.181:19001
  ID: 18208731239841283841
  Role: 0

Node 3:

Doesn't have a /var/snap/microk8s/current directory, says microk8s isn't installed. What the h.. happened here?

It does have directories 2407, 2487 and common in /var/snap/microk8s...

This could probably explain why there's no leader for dqlite...

@osirisguitar
Author

So, microk8s is disabled on node 3 because of a failed auto-refresh... Could auto-refreshes be what broke my clusters? I had no idea that was even enabled.

ID   Status  Spawn                      Ready  Summary
140  Abort   22 days ago, at 16:31 UTC  -      Auto-refresh snap "microk8s"
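
In case anyone else ends up here: the usual way to dig into and clear a stuck change is roughly the following (a sketch; 140 is the change ID from the output above, yours will differ):

# List snap changes for microk8s to find the stuck one and its ID
snap changes microk8s

# Show the individual tasks of that change
snap tasks 140

# Abort the change, then reboot so snapd and microk8s come up cleanly
sudo snap abort 140
sudo reboot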

@osirisguitar
Author

Tried to force abort the stuck snap job and rebooted the machine - multi-node cluster is also back up now!

Super happy with having everything running again, but worried about stability.

One cluster just self-died by emptying cluster.yaml, the other by getting itself stuck in snap auto-refresh...

@osirisguitar
Author

So, any ideas why this happened in the single-node cluster? Why did it just lose the contents of cluster.yaml?

@balchua
Collaborator

balchua commented Oct 25, 2021

The dqlite fix for a more robust write of cluster.yaml is being merged into the different versions. That should fix the single-node issue.

@osirisguitar
Author

My final question: is there an auto-refresh always active for the microk8s snap? I read somewhere that there is and that it can't be turned off...

@balchua
Collaborator

balchua commented Oct 25, 2021

As far as I remember, if you set up a snap proxy you have more control over when the updates happen.
#1658 (comment)
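
If a full snap proxy is more than you need, snapd's own refresh settings also give some control; a sketch (exact behaviour depends on your snapd version, and older versions cap how far ahead a hold can be set):

# Constrain auto-refreshes to a predictable window, e.g. Sunday early morning
sudo snap set system refresh.timer=sun,02:00-04:00

# Or postpone refreshes until a specific date (RFC 3339 timestamp)
sudo snap set system refresh.hold=2022-01-01T00:00:00Z

# Check the current refresh schedule and when the next refresh would run
snap refresh --time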
