kubelite restart loop on pristine install (after remove --purge) #4342

Open
midnight-wonderer opened this issue Dec 16, 2023 · 6 comments

@midnight-wonderer

midnight-wonderer commented Dec 16, 2023

Summary

kubelite stuck in a loop of restarts

Reproduction Steps

  • remove microk8s with
    snap remove --purge microk8s
  • restart the server
    reboot now
  • reinstall
    snap install microk8s --classic --channel=1.29
  • configure the Calico IP autodetection method
    per the doc https://microk8s.io/docs/change-cidr#configure-calico-ip-autodetection-method-4
    by editing and applying /var/snap/microk8s/current/args/cni-network/cni.yaml,
    changing first-found to can-reach=<reference address> (see the sketch after this list)
  • join nodes with add-node and join.
    The cluster consists of 3 nodes.
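For reference, the Calico change in the fourth step looks roughly like this (a sketch only; IP_AUTODETECTION_METHOD is the Calico setting the linked doc refers to, and <reference address> stays a placeholder for an address on your own network):

# edit the calico-node env block in the bundled manifest (the relevant lines are shown here as comments)
sudo vi /var/snap/microk8s/current/args/cni-network/cni.yaml
#   - name: IP_AUTODETECTION_METHOD
#     value: "can-reach=<reference address>"   # was "first-found"
# then re-apply the manifest
microk8s kubectl apply -f /var/snap/microk8s/current/args/cni-network/cni.yaml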

Logs excerpt

Run snap logs -f microk8s and you will see the loop. I captured (and lightly edited) one iteration of the loop here:

Long logs

Too long to post here; see the Gist instead.

@ktsakalozos
Member

Hi @midnight-wonderer,

It is hard to say what may be the issue from the attached log. Could you share a microk8s inspect tarball? Maybe there is more information there.

Have you tried the suggestion in https://stackoverflow.com/questions/44133503/kubelet-error-failed-to-start-containermanager-failed-to-initialise-top-level-q ?
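(For anyone following along: the inspection tarball can be generated on the affected node with the command below; it prints the path of the archive to attach.)

sudo microk8s inspect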

@midnight-wonderer
Author

midnight-wonderer commented Jan 4, 2024

Thank you for looking into it.

The report

I don't have the inspection report anymore.
I had one saved, but I don't know whether it is from the time the issue above occurred.
So I doubt it will help; it would probably just add confusion, since it may be missing the crucial information.

Stackoverflow references

No, I haven't tried that one. I forgot to mention that MicroK8s was normal the first time I installed it on the server. There is something triggering the behavior afterward. Once triggered, the behavior persisted.

My decision

After this ordeal, I have concluded that I am not ready for full-blown MicroK8s. I would not be able to do anything if the same situation happened in an in-use production cluster.

I decided to run my cluster without control plane HA, just a single manager node with multiple workers.
The cluster is stable so far.

My situation

We are a small software company, and maintaining a K8s cluster is already a bit too big for us. However, we have no choice since we grew out of Docker Stack in a Docker Swarm cluster.

The single-node MicroK8s cluster is as much as we can chew right now, and it has been much more stable than the HA setup.
We will look into HA mode later, and I hope that by the time we are ready, you guys will have made it super stable.

The thread

I'll leave it up to you whether this issue should stay open or be closed.

Last info

The following is from memory only (I won't have any more input):

  • On a pristine install, the logs still loop even on a single node. I repeated all the steps I mentioned above except the last one.
  • I deduce that this behavior can survive snap purges and reboots once it is triggered.
  • To break the cycle, I observed the logs right after snap install and, if the node was unstable, restarted the MicroK8s services.
  • I repeated the same procedure for every node until every single one was stable, and only then started joining (this time without HA mode); a rough sketch follows this list.
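A rough sketch of that per-node procedure (my own wording of the steps above, not exact commands):

sudo snap install microk8s --classic --channel=1.29
sudo snap logs -f microk8s        # watch for the kubelite restart loop
sudo snap restart microk8s        # if the loop appears, restart the MicroK8s services
microk8s status --wait-ready      # only join the node once it reports ready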

Speculation

My cluster differs from most in that I set up a point-to-point WireGuard mesh and run MicroK8s over WireGuard. The instability might be triggered by systemd service ordering: maybe the wg-quick interfaces come up after MicroK8s has already started initializing?
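If that ordering theory is right, a systemd drop-in along these lines might help (purely hypothetical; wg0 and the wg-quick unit name are placeholders for the actual interface):

sudo mkdir -p /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d
sudo tee /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d/10-wireguard.conf <<'EOF'
[Unit]
# wg0 is a placeholder for the actual WireGuard interface
After=wg-quick@wg0.service
Wants=wg-quick@wg0.service
EOF
sudo systemctl daemon-reload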

Also, FWIW, my WireGuard mesh is IPv4 in IPv6.

P.S. I still run a MicroK8s cluster and continue to be a fan.

@kquinsland

FWIW, I get a loop like that as well; it continues for some time (several minutes, perhaps as many as 10!) and then things just start up fine.
I suspect it has something to do with Calico setup/init on the host, but the logs go by so fast and clear my scrollback buffer before I can really get my head around what's going on.

Taking a super quick look at the logs:

Note: --boot=0 means "since the most recent boot", and I can tell you that the host I took the logs from was booted shortly before posting here. I did a microk8s stop; apt update ..., reboot, etc. on the host before attempting to update to the 1.29/stable channel.

root@node03:~# journalctl --boot=0 --unit=snap.microk8s.daemon-kubelite.service | grep "Failed with result 'exit-code'." | wc -l
125

Just before the failure, I get this:

Mar 04 09:16:21 node03 microk8s.daemon-kubelite[3506]: F0304 09:16:21.058573    3506 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 3.039s CPU time.
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 8.
Mar 04 09:16:21 node03 systemd[1]: Stopped Service for snap application microk8s.daemon-kubelite.
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 3.039s CPU time.
Mar 04 09:16:21 node03 systemd[1]: Started Service for snap application microk8s.daemon-kubelite.
Mar 04 09:16:21 node03 microk8s.daemon-kubelite[3714]: + source /snap/microk8s/6539/actions/common/utils.sh

@neoaggelos
Contributor

Hi @kquinsland

Can you check if the br_netfilter module is loaded? Also, can you check if adding the following arguments to kube-proxy helps?

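# --conntrack-max-per-core=0 tells kube-proxy to leave the host's nf_conntrack sysctls untouched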
echo '
--conntrack-max-per-core=0
' | sudo tee -a /var/snap/microk8s/current/args/kube-proxy

sudo snap restart microk8s

@kquinsland

Hi, @neoaggelos.

Good timing on your reply! We've had some stormy weather here and I've just had my power cut so I am in a good position to start the cluster up from a cold boot.

karl@node03:~$ sudo lsmod | grep br_
br_netfilter           32768  0
bridge                307200  1 br_netfilter
karl@node03:~$ sudo lsmod | grep overlay
<not loaded>

My boot loop is now slightly different:

Mar 05 08:38:02 node03 microk8s.daemon-kubelite[33336]: F0305 08:38:02.037812   33336 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established: no such file or directory

I let things "sit" for a few minutes as I had to dash off to deal with another matter, and a few minutes later the cluster had come up on its own.

Take a look at the two ls results and note the ~300-second time delta:

root@node03:/proc/sys/net/netfilter# ls -lah
total 0
dr-xr-xr-x 1 root root 0 Mar  5 08:28 .
dr-xr-xr-x 1 root root 0 Mar  5 08:28 ..
dr-xr-xr-x 1 root root 0 Mar  5 08:45 nf_log
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_log_all_netns
root@node03:/proc/sys/net/netfilter# ls -lah
total 0
dr-xr-xr-x 1 root root 0 Mar  5 08:28 .
dr-xr-xr-x 1 root root 0 Mar  5 08:28 ..
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_acct
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_buckets
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_checksum
-r--r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_count
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_loose
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_closereq
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_closing
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_open
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_partopen
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_request
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_respond
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_timewait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_events
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_expect_max
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_frag6_high_thresh
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_frag6_low_thresh
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_frag6_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_generic_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_gre_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_gre_timeout_stream
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_helper
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_icmp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_icmpv6_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_log_invalid
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_max
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_closed
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_cookie_echoed
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_cookie_wait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_established
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_heartbeat_sent
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_shutdown_ack_sent
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_shutdown_recd
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_shutdown_sent
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_conntrack_tcp_be_liberal
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_ignore_invalid_rst
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_loose
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_max_retrans
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_close
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_conntrack_tcp_timeout_close_wait
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_conntrack_tcp_timeout_established
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_fin_wait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_last_ack
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_max_retrans
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_syn_recv
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_syn_sent
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_time_wait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_unacknowledged
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_timestamp
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_udp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_udp_timeout_stream
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_flowtable_tcp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_flowtable_udp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_hooks_lwtunnel
dr-xr-xr-x 1 root root 0 Mar  5 08:45 nf_log
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_log_all_netns

This smells like some order-of-operations / dependency issue where microk8s starts before nf_conntrack_* has spun up fully?
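One way to test that (a hypothetical sketch): time how long after boot the conntrack sysctls actually appear, and compare against kubelite's restart attempts.

# print the time at which nf_conntrack_max finally shows up after a boot
until [ -e /proc/sys/net/netfilter/nf_conntrack_max ]; do sleep 1; done; date
# compare against the kubelite restart attempts since this boot
journalctl --boot=0 --unit=snap.microk8s.daemon-kubelite.service | tail -n 20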

@jarbaugh8

I also ran into this issue (I have hit it before but didn't dig into it right away, and it resolved itself overnight that time).

modprobe nf_conntrack allowed kubelite to start up for me immediately this time.
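If loading the module by hand fixes it, making the load persistent across boots might avoid the loop entirely (a sketch, not an official MicroK8s recommendation):

sudo modprobe nf_conntrack                                          # load the module now
echo nf_conntrack | sudo tee /etc/modules-load.d/nf_conntrack.conf  # have systemd load it at every boot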
