[BUG] Microk8s crashes when joining a node using ha-cluster #1807

Closed
wsdt opened this issue Dec 11, 2020 · 24 comments

wsdt commented Dec 11, 2020

I'm not sure if the problem occurs because my master node is an Ubuntu machine and the worker is Windows 10 Enterprise (WSL enabled), but I thought this might be of interest.

Version: 1.19/stable

Steps to reproduce:

  1. Check with microk8s status and microk8s inspect before joining the cluster; everything seems to be fine.
  2. Add-On ha-cluster is enabled on both the master and worker node.
  3. Running microk8s join x.x.x.x:25000/{TOKEN} makes microk8s crash silently.

No error message is output.
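For reference, the join workflow looks roughly like this (a sketch; the IP and token are placeholders following the command syntax above):

# On the master (control-plane) node: generate a join token
microk8s add-node
# This prints a command of the form:
#   microk8s join 10.10.40.24:25000/<token>

# On the joining node: run the printed command
microk8s join 10.10.40.24:25000/<token>

# Back on the master: the new node should appear
microk8s kubectl get nodes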

Output of microk8s status before joining:

microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    ha-cluster           # Configure high availability on the current node
  disabled:
    ambassador           # Ambassador API Gateway and Ingress
    cilium               # SDN, fast with full network policy
    dashboard            # The Kubernetes dashboard
    dns                  # CoreDNS
    fluentd              # Elasticsearch-Fluentd-Kibana logging and monitoring
    gpu                  # Automatic enablement of Nvidia CUDA
    helm                 # Helm 2 - the package manager for Kubernetes
    helm3                # Helm 3 - Kubernetes package manager
    host-access          # Allow Pods connecting to Host services smoothly
    ingress              # Ingress controller for external access
    istio                # Core Istio service mesh services
    jaeger               # Kubernetes Jaeger operator with its simple config
    knative              # The Knative framework on Kubernetes.
    kubeflow             # Kubeflow for easy ML deployments
    linkerd              # Linkerd is a service mesh for Kubernetes and other frameworks
    metallb              # Loadbalancer for your Kubernetes cluster
    metrics-server       # K8s Metrics Server for API access to service metrics
    multus               # Multus CNI enables attaching multiple network interfaces to pods
    prometheus           # Prometheus operator for monitoring and logging
    rbac                 # Role-Based Access Control for authorisation
    registry             # Private image registry exposed on localhost:32000
    storage              # Storage class; allocates storage from host directory

Output of microk8s inspect before joining:

Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-apiserver is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-control-plane-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting juju
  Inspect Juju
Inspecting kubeflow
  Inspect Kubeflow

Output of join (finishes without further output):

Contacting cluster at 10.10.40.24
Waiting for this node to finish joining the cluster. .. .. .. .. .. .. .. .. .. ..

Output of microk8s status after joining:

microk8s is not running. Use microk8s inspect for a deeper inspection.

Output of microk8s inspect after joining:

Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
 FAIL:  Service snap.microk8s.daemon-apiserver is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-apiserver
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-control-plane-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting juju
  Inspect Juju
Inspecting kubeflow
  Inspect Kubeflow

Building the report tarball
  Report tarball is at /var/snap/microk8s/1791/inspection-report-20201211_113621.tar.gz
An error occurred when trying to execute 'sudo microk8s.inspect' with 'multipass': returned exit code 1.

And as you can imagine, the node is not added on the master node.

I reinstalled microk8s and removed the VM. Then everything seemed fine again, but after trying to join, microk8s crashed again:

FAIL: Service snap.microk8s.daemon-apiserver is not running
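The inspect output points to the apiserver journal for details; on the joining node that check would look like this (a sketch):

sudo journalctl -u snap.microk8s.daemon-apiserver -n 100 --no-pager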

Approximately 15 minutes later, microk8s seemed to be up and running again (but the api-server was still down). After trying to join the cluster again, I received a Python stack trace. It may just be because the api-server was down, but I thought I'd append it here just in case.

Contacting cluster at 10.10.40.24
Traceback (most recent call last):
  File "/snap/microk8s/1791/scripts/cluster/join.py", line 967, in <module>
    join_dqlite(connection_parts)
  File "/snap/microk8s/1791/scripts/cluster/join.py", line 900, in join_dqlite
    update_dqlite(info["cluster_cert"], info["cluster_key"], info["voters"], hostname_override)
  File "/snap/microk8s/1791/scripts/cluster/join.py", line 818, in update_dqlite
    with open("{}/info.yaml".format(cluster_backup_dir)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/var/snap/microk8s/1791/var/kubernetes/backend.backup/info.yaml'
An error occurred when trying to execute 'sudo microk8s.join 10.10.40.24:25000/04f6ac0ea469893c594e5b30954618f0' with 'multipass': returned exit code 1.
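The FileNotFoundError suggests the dqlite backup directory was never populated on the joining node. A quick check of the dqlite state directories (a sketch; the paths come from the traceback, with 'current' as the usual symlink to the active snap revision) would be:

ls -l /var/snap/microk8s/current/var/kubernetes/backend/
ls -l /var/snap/microk8s/current/var/kubernetes/backend.backup/
cat /var/snap/microk8s/current/var/kubernetes/backend/info.yaml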

NOTE: Resolved in the meantime by disabling the ha-cluster add-on on both nodes. It would be great if this issue could be fixed soon!
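For anyone hitting the same thing, the workaround described above amounts to roughly the following (a sketch; disabling ha-cluster reverts the datastore, so it may ask for confirmation):

# run on both the master and the joining node
microk8s disable ha-cluster
microk8s status --wait-ready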

@wsdt wsdt changed the title [BUG] Microk8s on Windows crashes when joining a cluster (Master-Node = Linux) [BUG] Microk8s crashes when joining a node using ha-cluster Dec 11, 2020
Collaborator

balchua commented Dec 11, 2020

It would be helpful if you could attach the inspect tarball.
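For reference, the tarball is produced by running the inspect command on the affected node; the path is printed at the end, as in the output above:

microk8s inspect
# ...
# Report tarball is at /var/snap/microk8s/<revision>/inspection-report-<timestamp>.tar.gz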

@bbarclay

Yes, same problem here. Sometimes it works, other times it doesn't. No clue as to why, but I do find that to get out of it, I have to reinstall microk8s. I wish there were a better workaround.

Collaborator

balchua commented Dec 14, 2020

@bbarclay Just trying to get more info.
Are your nodes all running on Linux?

Author

wsdt commented Dec 15, 2020

My issue is caused by the ha-cluster add-on. As soon as I disabled it, joining worked (but I would love to use it).
This seems to be a well-known issue; I've since come across it in other issues as well (after creating this one).

Author

wsdt commented Dec 15, 2020

@balchua I was wrong. It is indeed an issue with the Windows setup.
Even with ha-cluster disabled, several services crash after the join command.

These error messages occurred after a fresh installation (multipass and microk8s removed) when joining the Linux master (Ubuntu). Therefore, I'm fairly sure this is reproducible.

Machine: Windows 10 Enterprise

Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-flanneld is running
  Service snap.microk8s.daemon-containerd is running
 FAIL:  Service snap.microk8s.daemon-apiserver is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-apiserver
 FAIL:  Service snap.microk8s.daemon-apiserver-kicker is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-apiserver-kicker
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
 FAIL:  Service snap.microk8s.daemon-scheduler is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-scheduler
 FAIL:  Service snap.microk8s.daemon-controller-manager is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-controller-manager
 FAIL:  Service snap.microk8s.daemon-etcd is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-etcd
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

I've attached the current error log below:
inspection-report-20201215_180615.tar.gz

Another configuration (channel 1.19), both nodes with the ha-cluster add-on disabled:
inspection-report-20201215_190800.tar.gz

Collaborator

balchua commented Dec 15, 2020

@wsdt In a non-HA cluster, those "FAIL" services are not supposed to run on the joining node, since there is only one control plane; that is why they are reported as FAIL.
As I understand it, the inspect tarball belongs to the joining node.
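A quick way to see which microk8s services are active on the joining node (a sketch) would be:

snap services microk8s
sudo journalctl -u snap.microk8s.daemon-kubelet -n 50 --no-pager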

Author

wsdt commented Dec 15, 2020

@balchua Exactly, the tarball belongs to the joining node.
But it seems that, because of this, the master node cannot enable e.g. the dns add-on on that worker node (the request timed out), although joining did "work".

Collaborator

balchua commented Dec 15, 2020

For a non-HA cluster, you can only enable add-ons from the node with the control plane. The dns issue you mentioned above may be a different issue.
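In other words, with a single control plane the add-ons are managed from the main node only; the usual flow would be something like this (a sketch):

# on the control-plane (main) node
microk8s enable dns
microk8s kubectl get pods -n kube-system -o wide   # the coredns pod may be scheduled on any node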

Author

wsdt commented Dec 15, 2020

I know; I did enable the dns add-on on the master node.

It might be, but ha-cluster didn't work either (as the initial error messages above indicate).
And all of this can't be a configuration issue, as I did several fresh installations.

Collaborator

balchua commented Dec 15, 2020

Thanks @wsdt for the clarifications. Would it be ok to provide the inspect tarball of the main node, the one you are joining to?
The logs there may reveal something.

Author

wsdt commented Dec 15, 2020

@balchua
Thank you for the prompt response :-)

inspection-report.tar.gz

Collaborator

balchua commented Dec 15, 2020

@wsdt OK, the worker node has successfully joined your main node in this particular setup.

What's failing at the moment is dns.
Care to try the potential solution for dns crashlooping in the FAQ: https://microk8s.io/docs/troubleshooting#heading--common-issues?

You may have to do this on all nodes.
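From memory, the common-issues section mostly boils down to making sure forwarded pod traffic is allowed; roughly the following (a sketch, the linked doc is authoritative, and the cni0 interface name may differ depending on the CNI in use):

sudo iptables -P FORWARD ACCEPT
# if ufw is enabled:
sudo ufw allow in on cni0 && sudo ufw allow out on cni0
sudo ufw default allow routed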

Author

wsdt commented Dec 16, 2020

@balchua Thank you, I executed the proposed commands on both nodes and even disabled the network firewall completely on the Windows worker node. The nodes can definitely talk to each other.

Below is the tarball of the master node, after trying to enable dns again (it is listed as enabled, but the worker node cannot be reached):
inspection-report.tar.gz

microk8s kubectl get nodes
results in the following (no roles assigned, and the Windows node is listed with its IP instead of its host name 'track'):

NAME          STATUS   ROLES    AGE     VERSION
10.10.40.25   Ready    <none>   16h     v1.19.5-34+8af48932a5ef06
trick         Ready    <none>   4d23h   v1.19.5-34+8af48932a5ef06

PS: Pod logs cannot be retrieved either (timed out, as the worker node is not reachable).
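Since log retrieval goes from the API server to the kubelet on the worker (port 10250), a quick connectivity check from the master could look like this (a sketch; the IP is taken from the node list above):

nc -vz 10.10.40.25 10250              # kubelet port used for logs/exec
microk8s kubectl get pods -A -o wide  # shows which node each pod landed on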

Collaborator

balchua commented Dec 16, 2020

@wsdt Can you also get the inspect tarball of the Windows node? I think you need to use the multipass command. Sorry, I don't know much about it, as I don't use Windows a lot.

Author

wsdt commented Dec 16, 2020

@balchua
No problem, thanks :-)

inspection-report-windows.tar.gz

Collaborator

balchua commented Dec 16, 2020

So far I don't see anything wrong with the cluster. As I see it, only coredns is deployed.

Author

wsdt commented Dec 16, 2020

Interesting, because logs cannot be retrieved (the container in the pod is failing; the container itself works 100%, and last time I saw that something was wrong with the internet connection inside the container).

All of this seems very buggy to me.
I had hoped MicroK8s would be an easier way to deploy with Kubernetes.

Thank you anyway.

Collaborator

balchua commented Dec 16, 2020

Definitely need an extra pair of eyes and brains. Maybe @ktsakalozos will be able to help. 😊

Btw, I don't see your pod deployed in the cluster.
Even if it's crashlooping as you mentioned, it's not there.

Author

wsdt commented Dec 16, 2020

@balchua Yes, I removed it afterwards. If necessary, I'll start the pod and attach the tarballs.

Thank you! :-)

Author

wsdt commented Dec 26, 2020

Ended up using Kubernetes natively and now everything seems fine.

@wsdt wsdt closed this as completed Dec 26, 2020
@maxstreese

Hey, I'm late to the party and most likely wrong, but I also had microk8s crash silently on nodes when trying to join them into a cluster.

The one thing that resolved it for me (I assume) was renaming the hosts. If, like me, you just grab some Raspberry Pis and put Ubuntu Server on them, they are all simply called "ubuntu" by default. When I first tried to create the cluster and they all had the same name, things just crashed silently: the nodes I tried to join did not report any error, but joining did nothing, and afterwards microk8s was no longer running on them and could not be started again unless I left the cluster and/or reset microk8s on that node.

After renaming all four of them to be unique and trying again, things worked fine.

As you are mixing Windows and Ubuntu machines, I guess your issue was a different one, but since I haven't seen the host-name issue mentioned anywhere yet, this comment may help some confused Pi users. Cheers!
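If it helps other Pi users, giving each node a unique hostname before joining is roughly the following (a sketch; 'pi-node-1' is just an example name, and microk8s should be restarted so it picks up the change):

sudo hostnamectl set-hostname pi-node-1   # unique per node
microk8s stop && microk8s start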


devZer0 commented Mar 17, 2021

I set up 2 microk8s nodes today with Ubuntu Linux 20.04 LTS (virtual machines on Proxmox), and when joining the 2nd node to the first one with "microk8s join..." I had repeated crashes on the first node.

Collaborator

balchua commented Mar 17, 2021

@devZer0 can you upload the inspect tarball? Thanks


devZer0 commented Mar 18, 2021

OK, I reinstalled everything and re-joined the node. Curiously, joining worked this time. Something seems to have crashed, though, as dashboard-proxy disconnected while joining with this message:

E0318 00:12:54.343951 842232 portforward.go:385] error copying from local connection to remote stream: read tcp4 172.16.31.207:10443->172.22.3.6:61181: read: connection reset by peer
E0318 00:13:54.830044 842232 portforward.go:233] lost connection to pod

There are lots of errors like "Exec process "566247c54dfcee017b3c4e4605c6b5b1c48f47deadd16b65b43ec152d0a8281d" exits with exit code 0 and error <nil>" in the syslog.
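Those exec messages come from containerd; its journal (and the kubelet's) can be checked directly (a sketch):

sudo journalctl -u snap.microk8s.daemon-containerd -n 100 --no-pager
sudo journalctl -u snap.microk8s.daemon-kubelet -n 100 --no-pager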

i have attached the inspect tarball:

inspection-report-20210318_002035.tar.gz

I'm also curious: the system has 4 vCPUs and a load average of >1 (one day later, >2), so the VM's CPU is being constantly hogged. The syslog is growing very large, already at >60 MB.

On the second node, the load average is about 0.5, and there are also lots of messages in the syslog.
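To see what is hogging the CPU and how much space the logs are taking, something like this should help (a sketch):

top -bn1 | head -n 20         # busiest processes right now
sudo du -h /var/log/syslog    # size of the syslog
sudo journalctl --disk-usage  # journald disk usage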
