kubelite restart loop on pristine install (after remove --purge) #4342

Open
midnight-wonderer opened this issue Dec 16, 2023 · 6 comments

@midnight-wonderer

midnight-wonderer commented Dec 16, 2023

Summary

kubelite stuck in a loop of restarts

Reproduction Steps

  • remove microk8s with
    snap remove --purge microk8s
  • restart the server
    reboot now
  • reinstall
    snap install microk8s --classic --channel=1.29
  • configure the Calico IP autodetection method
    per the doc https://microk8s.io/docs/change-cidr#configure-calico-ip-autodetection-method-4
    by editing and applying /var/snap/microk8s/current/args/cni-network/cni.yaml,
    changing first-found to can-reach=<reference address> (see the sketch after this list)
  • join nodes with add-node and join.
    The cluster consists of 3 nodes.
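For reference, the Calico change in the fourth step looks roughly like this (a sketch only; IP_AUTODETECTION_METHOD is the Calico setting the linked doc refers to, and <reference address> stays a placeholder for an address on your own network):

# edit the calico-node env block in the bundled manifest (the relevant lines are shown here as comments)
sudo vi /var/snap/microk8s/current/args/cni-network/cni.yaml
#   - name: IP_AUTODETECTION_METHOD
#     value: "can-reach=<reference address>"   # was "first-found"
# then re-apply the manifest
microk8s kubectl apply -f /var/snap/microk8s/current/args/cni-network/cni.yaml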

Logs excerpt

Run snap logs -f microk8s and you will see the loop. I captured (and lightly edited) one iteration of the loop here:

Long logs

Too long to post here; see the Gist instead.

@ktsakalozos
Member

Hi @midnight-wonderer,

It is hard to say what may be the issue from the attached log. Could you share a microk8s inspect tarball? Maybe there is more information there.

Have you tried the suggestion in https://stackoverflow.com/questions/44133503/kubelet-error-failed-to-start-containermanager-failed-to-initialise-top-level-q ?
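(For anyone following along: the inspection tarball can be generated on the affected node with the command below; it prints the path of the archive to attach.)

sudo microk8s inspect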

@midnight-wonderer
Author

midnight-wonderer commented Jan 4, 2024

Thank you for looking into it.

The report

I don't have the inspection report anymore.
I had one saved, but I don't know whether it is from the time the issue above occurred.
So I doubt it will help; it would probably just add confusion, since it may be missing the crucial information.

Stackoverflow references

No, I haven't tried that one. I forgot to mention that MicroK8s was normal the first time I installed it on the server. There is something triggering the behavior afterward. Once triggered, the behavior persisted.

My decision

After this ordeal, I have concluded that I am not ready for full-blown MicroK8s. I would not be able to do anything if the same situation happened in an in-use production cluster.

I decided to run my cluster without control plane HA, just a single manager node with multiple workers.
The cluster is stable so far.

My situation

We are a small software company, and maintaining a K8s cluster is already a bit too big for us. However, we have no choice since we grew out of Docker Stack in a Docker Swarm cluster.

The single-node MicroK8s cluster is as much as we can chew right now, and it has been much more stable than the HA setup.
We will look into HA mode later, and I hope that by the time we are ready, you guys will have made it super stable.

The thread

I'll leave it up to you whether this issue should stay open or be closed.

Last info

The following is from memory only (I won't have any more input):

  • On a pristine install, the logs still loop even on a single node. I repeated all the steps I mentioned above except the last one.
  • I deduce that this behavior can survive snap purges and reboots once it is triggered.
  • To break the cycle, I observed the logs right after snap install and, if the node was unstable, restarted the MicroK8s services.
  • I repeated the same procedure for every node until every single one was stable, and only then started joining (this time without HA mode); a rough sketch follows this list.
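A rough sketch of that per-node procedure (my own wording of the steps above, not exact commands):

sudo snap install microk8s --classic --channel=1.29
sudo snap logs -f microk8s        # watch for the kubelite restart loop
sudo snap restart microk8s        # if the loop appears, restart the MicroK8s services
microk8s status --wait-ready      # only join the node once it reports ready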

Speculation

My cluster differs from most in that I set up a point-to-point WireGuard mesh and run MicroK8s over WireGuard. The instability might be triggered by systemd service ordering: maybe the wg-quick interfaces come up after MicroK8s has already started initializing?
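If that ordering theory is right, a systemd drop-in along these lines might help (purely hypothetical; wg0 and the wg-quick unit name are placeholders for the actual interface):

sudo mkdir -p /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d
sudo tee /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d/10-wireguard.conf <<'EOF'
[Unit]
# wg0 is a placeholder for the actual WireGuard interface
After=wg-quick@wg0.service
Wants=wg-quick@wg0.service
EOF
sudo systemctl daemon-reload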

Also, FWIW, my WireGuard mesh is IPv4 in IPv6.

P.S. I still run a MicroK8s cluster and continue to be a fan.

@kquinsland

FWIW, I get a loop like that as well; it continues for some time (several minutes, perhaps as many as 10!) and then things just start up fine.
I suspect it has something to do with Calico setup/init on the host, but the logs go by so fast and clear my scrollback buffer before I can really get my head around what's going on.

Taking a super quick look at the logs:

Note: --boot=0 means "since the most recent boot", and I can tell you that the host I took the logs from was booted shortly before posting here. I did a microk8s stop; apt update ..., reboot, etc. on the host before attempting to update to the 1.29/stable channel.

root@node03:~# journalctl --boot=0 --unit=snap.microk8s.daemon-kubelite.service | grep "Failed with result 'exit-code'." | wc -l
125

Just before the failure, I get this:

Mar 04 09:16:21 node03 microk8s.daemon-kubelite[3506]: F0304 09:16:21.058573    3506 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 3.039s CPU time.
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 8.
Mar 04 09:16:21 node03 systemd[1]: Stopped Service for snap application microk8s.daemon-kubelite.
Mar 04 09:16:21 node03 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 3.039s CPU time.
Mar 04 09:16:21 node03 systemd[1]: Started Service for snap application microk8s.daemon-kubelite.
Mar 04 09:16:21 node03 microk8s.daemon-kubelite[3714]: + source /snap/microk8s/6539/actions/common/utils.sh

@neoaggelos
Contributor

Hi @kquinsland

Can you check if the br_netfilter module is loaded? Also, can you check if adding the following arguments to kube-proxy helps?

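# --conntrack-max-per-core=0 tells kube-proxy to leave the host's nf_conntrack sysctls untouched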
echo '
--conntrack-max-per-core=0
' | sudo tee -a /var/snap/microk8s/current/args/kube-proxy

sudo snap restart microk8s

@kquinsland

Hi, @neoaggelos.

Good timing on your reply! We've had some stormy weather here and I've just had my power cut so I am in a good position to start the cluster up from a cold boot.

karl@node03:~$ sudo lsmod | grep br_
br_netfilter           32768  0
bridge                307200  1 br_netfilter
karl@node03:~$ sudo lsmod | grep overlay
<not loaded>

My boot loop is now slightly different:

Mar 05 08:38:02 node03 microk8s.daemon-kubelite[33336]: F0305 08:38:02.037812   33336 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established: no such file or directory

I let things "sit" for a few minutes as I had to dash off to deal with another matter, and a few minutes later the cluster had come up on its own.

Take a look at the two ls results and note the ~300-second time delta:

root@node03:/proc/sys/net/netfilter# ls -lah
total 0
dr-xr-xr-x 1 root root 0 Mar  5 08:28 .
dr-xr-xr-x 1 root root 0 Mar  5 08:28 ..
dr-xr-xr-x 1 root root 0 Mar  5 08:45 nf_log
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_log_all_netns
root@node03:/proc/sys/net/netfilter# ls -lah
total 0
dr-xr-xr-x 1 root root 0 Mar  5 08:28 .
dr-xr-xr-x 1 root root 0 Mar  5 08:28 ..
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_acct
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_buckets
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_checksum
-r--r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_count
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_loose
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_closereq
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_closing
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_open
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_partopen
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_request
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_respond
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_dccp_timeout_timewait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_events
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_expect_max
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_frag6_high_thresh
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_frag6_low_thresh
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_frag6_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_generic_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_gre_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_gre_timeout_stream
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_helper
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_icmp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_icmpv6_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_log_invalid
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_max
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_closed
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_cookie_echoed
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_cookie_wait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_established
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_heartbeat_sent
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_shutdown_ack_sent
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_shutdown_recd
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_sctp_timeout_shutdown_sent
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_conntrack_tcp_be_liberal
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_ignore_invalid_rst
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_loose
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_max_retrans
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_close
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_conntrack_tcp_timeout_close_wait
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_conntrack_tcp_timeout_established
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_fin_wait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_last_ack
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_max_retrans
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_syn_recv
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_syn_sent
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_time_wait
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_tcp_timeout_unacknowledged
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_timestamp
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_udp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_conntrack_udp_timeout_stream
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_flowtable_tcp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_flowtable_udp_timeout
-rw-r--r-- 1 root root 0 Mar  5 08:50 nf_hooks_lwtunnel
dr-xr-xr-x 1 root root 0 Mar  5 08:45 nf_log
-rw-r--r-- 1 root root 0 Mar  5 08:45 nf_log_all_netns

This smells like some order-of-operations / dependency issue where microk8s starts before nf_conntrack_* has spun up fully?
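One way to test that (a hypothetical sketch): time how long after boot the conntrack sysctls actually appear, and compare against kubelite's restart attempts.

# print the time at which nf_conntrack_max finally shows up after a boot
until [ -e /proc/sys/net/netfilter/nf_conntrack_max ]; do sleep 1; done; date
# compare against the kubelite restart attempts since this boot
journalctl --boot=0 --unit=snap.microk8s.daemon-kubelite.service | tail -n 20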

@jarbaugh8

I also ran into this issue (I have hit it before but didn't dig into it right away, and it resolved itself overnight that time).

modprobe nf_conntrack allowed kubelite to start up for me immediately this time.
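If loading the module by hand fixes it, making the load persistent across boots might avoid the loop entirely (a sketch, not an official MicroK8s recommendation):

sudo modprobe nf_conntrack                                          # load the module now
echo nf_conntrack | sudo tee /etc/modules-load.d/nf_conntrack.conf  # have systemd load it at every boot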
