
Apiserver stopped working (no changes made) #2667

Closed
osirisguitar opened this issue Oct 20, 2021 · 12 comments

@osirisguitar

I have two clusters that have both stopped working in the same way. kubectl can't connect to them and the apiserver is not running. One of the clusters is a single-node (inspection report comes from that one), the other has three nodes. No changes have been made to the machines where they are running.

I tried upgrading the single-node cluster from v1.21 to v1.22, but that made no difference. It's not the problem from #2486; both info.yaml and cluster.yaml have the expected contents...

I don't know why the apiserver isn't included in the inspect report... This is what it says (after upgrading to 1.22, which is why it shows revision 2585 here instead of 2546 as in the inspect tarball created before the upgrade).

● snap.microk8s.daemon-apiserver.service - Service for snap application microk8s.daemon-apiserver
   Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-apiserver.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2021-10-18 11:26:40 UTC; 2 days ago
  Process: 33264 ExecStart=/usr/bin/snap run microk8s.daemon-apiserver (code=exited, status=0/SUCCESS)
 Main PID: 33264 (code=exited, status=0/SUCCESS)

Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: ++ /snap/microk8s/2585/bin/uname -m
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + ARCH=x86_64
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + export LD_LIBRARY_PATH=:/snap/microk8s/2585/lib:/snap/microk8s/2585/usr/lib:/snap/microk8s/2585/lib/x86_64-linux-gnu:/snap/mic
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + LD_LIBRARY_PATH=:/snap/microk8s/2585/lib:/snap/microk8s/2585/usr/lib:/snap/microk8s/2585/lib/x86_64-linux-gnu:/snap/microk8s/2
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + export LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void::/snap/microk8s/2585/lib:/snap/microk
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void::/snap/microk8s/2585/lib:/snap/microk8s/2585
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + '[' -e /var/snap/microk8s/2585/var/lock/lite.lock ']'
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + echo 'Will not run along with kubelite'
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: Will not run along with kubelite
Oct 18 11:26:40 pm-cluster microk8s.daemon-apiserver[33264]: + exit 0

inspection-report-20211018_080210.tar.gz

@balchua
Collaborator

balchua commented Oct 20, 2021

@osirisguitar thanks for reporting. Kubelite wraps most, if not all, of the Kubernetes control plane components; that's why the apiserver log you see says it is not starting.
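
To see what is actually failing, you can check the kubelite service directly; a rough sketch, assuming the standard snap unit name for recent MicroK8s releases:

# Status of the combined control-plane daemon (kubelite)
sudo systemctl status snap.microk8s.daemon-kubelite

# Recent kubelite logs, to see why the apiserver part is not coming up
sudo journalctl -u snap.microk8s.daemon-kubelite -n 200 --no-pager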

Looking at the logs, it looks like dqlite isn't starting or it is unable to find the leader.
@ktsakalozos @MathieuBordere any thoughts?
Thanks.

@ktsakalozos
Member

Hi @osirisguitar

Could you share the output of ls -l /var/snap/microk8s/current/var/kubernetes/backend/ as well as cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml and cat /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml ?

It is strange both clusters stopped working at the same time. Can you think of anything that might have changed?

@osirisguitar
Author

osirisguitar commented Oct 21, 2021

That no one made any changes to the machines is what stresses me out the most... The only thing I know of is that they have been automatically migrated between Hyper-V hosts in a cluster, but that is supposed to be completely undetectable to the VMs. The only thing I've found is that a VM could get a new MAC address during a migration (I don't know whether that has happened to these machines...).

And: I really appreciate the help!

From the single-node cluster

(the same one the inspect report above was taken from)

ls -l /var/snap/microk8s/current/var/kubernetes/backend/

-rw-rw---- 1 root microk8s 8363120 Oct 14 19:43 0000000091522082-0000000091522412
-rw-rw---- 1 root microk8s 8371544 Oct 14 19:44 0000000091522413-0000000091522917
-rw-rw---- 1 root microk8s 8385512 Oct 14 19:44 0000000091522918-0000000091523502
-rw-rw---- 1 root microk8s 8374856 Oct 14 19:44 0000000091523503-0000000091524053
-rw-rw---- 1 root microk8s 8359232 Oct 14 19:47 0000000091524054-0000000091524387
-rw-rw---- 1 root microk8s 8382632 Oct 14 19:49 0000000091524388-0000000091524761
-rw-rw---- 1 root microk8s 8386808 Oct 14 19:49 0000000091524762-0000000091525364
-rw-rw---- 1 root microk8s 8369456 Oct 14 19:49 0000000091525365-0000000091525954
-rw-rw---- 1 root microk8s 8378744 Oct 14 19:51 0000000091525955-0000000091526388
-rw-rw---- 1 root microk8s 8363048 Oct 14 19:53 0000000091526389-0000000091526718
-rw-rw---- 1 root microk8s 8385440 Oct 14 19:54 0000000091526719-0000000091527245
-rw-rw---- 1 root microk8s 8385728 Oct 14 19:54 0000000091527246-0000000091527833
-rw-rw---- 1 root microk8s 8386232 Oct 14 19:54 0000000091527834-0000000091528371
-rw-rw---- 1 root microk8s 8355128 Oct 14 19:57 0000000091528372-0000000091528705
-rw-rw---- 1 root microk8s 8388248 Oct 14 19:59 0000000091528706-0000000091529100
-rw-rw---- 1 root microk8s 8356136 Oct 14 19:59 0000000091529101-0000000091529676
-rw-rw---- 1 root microk8s 8381552 Oct 14 19:59 0000000091529677-0000000091530263
-rw-rw---- 1 root microk8s 8385368 Oct 14 20:01 0000000091530264-0000000091530675
-rw-rw---- 1 root microk8s 8375648 Oct 14 20:03 0000000091530676-0000000091531009
-rw-rw---- 1 root microk8s 5354096 Oct 14 20:04 0000000091531010-0000000091531329
-rw-rw---- 1 root microk8s    2220 May 19 06:33 cluster.crt
-rw-rw---- 1 root microk8s    3272 May 19 06:33 cluster.key
-rw-rw---- 1 root microk8s       0 Oct 14 20:04 cluster.yaml
-rw-rw-r-- 1 root microk8s       2 Oct 21 08:51 failure-domain
-rw-rw---- 1 root microk8s      57 May 19 06:33 info.yaml
srw-rw---- 1 root microk8s       0 Oct  4 20:09 kine.sock
-rw-rw---- 1 root microk8s      63 Oct  4 20:09 localnode.yaml
-rw-rw---- 1 root microk8s      32 May 19 06:33 metadata1
-rw-rw---- 1 root microk8s 5320601 Oct 14 19:59 snapshot-1-91529576-12821376770
-rw-rw---- 1 root microk8s      72 Oct 14 19:59 snapshot-1-91529576-12821376770.meta
-rw-rw---- 1 root microk8s 5328981 Oct 14 20:00 snapshot-1-91530600-12821455927
-rw-rw---- 1 root microk8s      72 Oct 14 20:00 snapshot-1-91530600-12821455927.meta

cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

Empty

cat /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml

- Address: 127.0.0.1:19001
  ID: 3297041220608546238
  Role: 0

From the multi-node cluster

ls -l /var/snap/microk8s/current/var/kubernetes/backend/

-rw-rw---- 1 root microk8s 8385512 Sep 30 10:34 0000000176463498-0000000176464025
-rw-rw---- 1 root microk8s 8382344 Sep 30 10:35 0000000176464026-0000000176464623
-rw-rw---- 1 root microk8s 8381624 Sep 30 10:36 0000000176464624-0000000176465040
-rw-rw---- 1 root microk8s 8362832 Sep 30 10:36 0000000176465041-0000000176465652
-rw-rw---- 1 root microk8s 8375144 Sep 30 10:37 0000000176465653-0000000176466207
-rw-rw---- 1 root microk8s 8382632 Sep 30 10:39 0000000176466208-0000000176466581
-rw-rw---- 1 root microk8s 8387168 Sep 30 10:39 0000000176466582-0000000176467189
-rw-rw---- 1 root microk8s 8381768 Sep 30 10:40 0000000176467190-0000000176467722
-rw-rw---- 1 root microk8s 8384000 Sep 30 10:41 0000000176467723-0000000176468229
-rw-rw---- 1 root microk8s 8377880 Sep 30 10:42 0000000176468230-0000000176468822
-rw-rw---- 1 root microk8s 8374784 Sep 30 10:43 0000000176468823-0000000176469315
-rw-rw---- 1 root microk8s 8386952 Sep 30 10:44 0000000176469316-0000000176469749
-rw-rw---- 1 root microk8s 8380544 Sep 30 10:44 0000000176469750-0000000176470379
-rw-rw---- 1 root microk8s 8379464 Sep 30 10:46 0000000176470380-0000000176470823
-rw-rw---- 1 root microk8s 8384792 Sep 30 10:46 0000000176470824-0000000176471398
-rw-rw---- 1 root microk8s 8384576 Sep 30 10:47 0000000176471399-0000000176471970
-rw-rw---- 1 root microk8s 3397232 Oct 20 07:43 0000000176471971-0000000176472186
-rw-rw---- 1 root microk8s    2216 Mar 15  2021 cluster.crt
-rw-rw---- 1 root microk8s    3272 Mar 15  2021 cluster.key
-rw-rw---- 1 root microk8s     209 Sep 30 10:47 cluster.yaml
-rw-rw-r-- 1 root microk8s       2 Oct 21 08:48 failure-domain
-rw-rw---- 1 root microk8s      63 Mar 15  2021 info.yaml
srw-rw---- 1 root microk8s       0 Sep 24 12:14 kine.sock
-rw-rw---- 1 root microk8s      69 Sep 24 12:14 localnode.yaml
-rw-rw---- 1 root microk8s      32 Oct 21 08:49 metadata1
-rw-rw---- 1 root microk8s      32 Oct 21 08:49 metadata2
-rw-rw---- 1 root microk8s 6173043 Sep 30 10:46 snapshot-30-176471135-513165012
-rw-rw---- 1 root microk8s     136 Sep 30 10:46 snapshot-30-176471135-513165012.meta
-rw-rw---- 1 root microk8s 5891395 Sep 30 10:47 snapshot-30-176472159-513221856
-rw-rw---- 1 root microk8s     136 Sep 30 10:47 snapshot-30-176472159-513221856.meta

cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

- Address: 192.168.100.178:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.100.183:19001
  ID: 10170341938016627293
  Role: 0
- Address: 192.168.100.181:19001
  ID: 18208731239841283841
  Role: 0

cat /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml

- Address: 192.168.100.178:19001
  ID: 3297041220608546238
  Role: 0

@MathieuBordere

  • For the single-node cluster, it looks like replacing the contents of the empty cluster.yaml with the contents of localnode.yaml will get it running again. It's best to stop microk8s, perform the edit, and then start it again (see the command sketch at the end of this comment). This was supposed to have been fixed in More robust filewriting go-dqlite#147, but apparently it still occurs under some circumstances.

  • For the multi-node cluster, it looks like there weren't any writes on that node since Sep 30, and then suddenly some activity on Oct 20:

-rw-rw---- 1 root microk8s 8384576 Sep 30 10:47 0000000176471399-0000000176471970
-rw-rw---- 1 root microk8s 3397232 Oct 20 07:43 0000000176471971-0000000176472186

Was this node shut down for a while and then started again? Can you please provide the contents of /var/snap/microk8s/current/var/kubernetes/backend/ as well as cluster.yaml & localnode.yaml for the other nodes in that cluster too?
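
For the single-node repair, the steps would look roughly like this (a sketch only; back up the backend directory first and adjust the paths if your layout differs):

# Stop MicroK8s so nothing touches the dqlite files while editing
sudo microk8s stop

# Keep a backup of the backend directory before changing anything
sudo cp -a /var/snap/microk8s/current/var/kubernetes/backend \
           /var/snap/microk8s/current/var/kubernetes/backend.bak

# cluster.yaml is empty; on a single-node cluster it should contain the
# same single entry as localnode.yaml, so copy that over
sudo cp /var/snap/microk8s/current/var/kubernetes/backend/localnode.yaml \
        /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# Start MicroK8s again and wait for it to report ready
sudo microk8s start
microk8s status --wait-ready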

@osirisguitar
Author

Yes, node 1 in the multi-node cluster has been turned off for a while. I had actually forgotten about that.

I'll fix cluster.yaml on the single node and check the files on the other nodes of the multi-node cluster ASAP.

@osirisguitar
Author

The single-node cluster seems to be back up, thank you so much for the help! microk8s.status now mostly reports that microk8s is running (not every time), the API is responding on port 16443, and the pods seem to be starting.

My big question though is how and why this happened... And how can I prevent it from happening again?

Multi-node cluster

Node 2:

cluster.yaml

- Address: 192.168.100.178:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.100.183:19001
  ID: 10170341938016627293
  Role: 0
- Address: 192.168.100.181:19001
  ID: 18208731239841283841
  Role: 0

localnode.yaml:

- Address: 192.168.100.181:19001
  ID: 18208731239841283841
  Role: 0

Node 3:

Doesn't have a /var/snap/microk8s/current directory, says microk8s isn't installed. What the h.. happened here?

It does have directories 2407, 2487 and common in /var/snap/microk8s...

This could probably explain why there's no leader for dqlite...

@osirisguitar
Author

So, microk8s is disabled on node 3 because of a failed auto-refresh... Could auto-refreshes be what broke my clusters? I had no idea that was even enabled.

ID   Status  Spawn                      Ready  Summary
140  Abort   22 days ago, at 16:31 UTC  -      Auto-refresh snap "microk8s"
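
In case anyone else ends up here: the usual way to dig into and clear a stuck change is roughly the following (a sketch; 140 is the change ID from the output above, yours will differ):

# List snap changes for microk8s to find the stuck one and its ID
snap changes microk8s

# Show the individual tasks of that change
snap tasks 140

# Abort the change, then reboot so snapd and microk8s come up cleanly
sudo snap abort 140
sudo reboot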

@osirisguitar
Author

Tried to force abort the stuck snap job and rebooted the machine - multi-node cluster is also back up now!

Super happy with having everything running again, but worried about stability.

One cluster just self-died by emptying cluster.yaml, the other by getting itself stuck in snap auto-refresh...

@osirisguitar
Author

So, any ideas why this happened in the single-node cluster? Why did it just lose the contents of cluster.yaml?

@balchua
Collaborator

balchua commented Oct 25, 2021

The dqlite fix for a more robust write of cluster.yaml is being merged into the different versions. That should fix the single-node issue.

@osirisguitar
Author

My final question: is there an auto-refresh always active for the microk8s snap? I read somewhere that there is and that it can't be turned off...

@balchua
Collaborator

balchua commented Oct 25, 2021

As far as I remember, if you set up a snap proxy you have more control over when the updates happen.
#1658 (comment)
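
If a full snap proxy is more than you need, snapd's own refresh settings also give some control; a sketch (exact behaviour depends on your snapd version, and older versions cap how far ahead a hold can be set):

# Constrain auto-refreshes to a predictable window, e.g. Sunday early morning
sudo snap set system refresh.timer=sun,02:00-04:00

# Or postpone refreshes until a specific date (RFC 3339 timestamp)
sudo snap set system refresh.hold=2022-01-01T00:00:00Z

# Check the current refresh schedule and when the next refresh would run
snap refresh --time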
