
MiniKube enters NodeNotReady when the K8S worker node is a newly provisioned node by cluster autoscaler #46

Closed
itsnagaraj opened this issue Aug 31, 2021 · 6 comments


@itsnagaraj

Our builds intermittently fail with the error below. For some reason the minikube node enters the NodeNotReady state (and never recovers), which causes the workloads to not be scheduled and the build to fail. On further investigation, this behaviour is more prominent when the kind workload is scheduled on a newly provisioned worker node (EC2 instance). Any hints on how we can avoid this error?

LAST SEEN   TYPE      REASON                    OBJECT                                      MESSAGE
106d        Normal    NodeHasSufficientMemory   node/minikube                               Node minikube status is now: NodeHasSufficientMemory
106d        Normal    NodeHasNoDiskPressure     node/minikube                               Node minikube status is now: NodeHasNoDiskPressure
106d        Normal    NodeHasSufficientPID      node/minikube                               Node minikube status is now: NodeHasSufficientPID
106d        Normal    RegisteredNode            node/minikube                               Node minikube event: Registered Node minikube in Controller
106d        Normal    Starting                  node/minikube                               Starting kube-proxy.
16m         Normal    RegisteredNode            node/minikube                               Node minikube event: Registered Node minikube in Controller
15m         Normal    NodeNotReady              node/minikube                               Node minikube status is now: NodeNotReady
15m         Warning   FailedScheduling          pod/aws-stub-8559dd85c6-n959r     0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
15m         Warning   FailedScheduling          pod/aws-stub-8559dd85c6-n959r     0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
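For reference, this is roughly how we confirm the node condition and the unreachable taint when a build gets stuck (just a sketch of standard kubectl commands; the node name minikube is from our setup):

# Node readiness and the taints set by the node lifecycle controller
kubectl get nodes -o wide
kubectl describe node minikube | grep -A 3 Taints

# Recent cluster events, oldest first
kubectl get events --all-namespaces --sort-by=.lastTimestamp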
@furikake

Does minikube logs give more clues?

@itsnagaraj itsnagaraj changed the title MiniKube enters NodeNotReady when the K8S worker node is newly provisioned node by cluster autoscaler MiniKube enters NodeNotReady when the K8S worker node is a newly provisioned node by cluster autoscaler Aug 31, 2021
@itsnagaraj
Author

itsnagaraj commented Aug 31, 2021

Does minikube logs give more clues?

I did not capture those logs. Next time this happens, I will capture them.
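For next time, roughly what I plan to capture when a build fails (standard minikube/kubectl commands; the file names are just placeholders):

# minikube's own logs from the failing run
minikube logs > minikube-logs.txt 2>&1

# Node and event state at the time of the failure
kubectl describe node minikube > node-describe.txt
kubectl get events --all-namespaces --sort-by=.lastTimestamp > events.txt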

@itsnagaraj
Author

itsnagaraj commented Sep 1, 2021

Logs from the minikube kube-controller-manager:

bash-5.0# tail kube-controller-manager-minikube_kube-system_kube-controller-manager-16790cbbf4e920f36e9a0099c860ef3a8f3d2e59f5c753ec58421fcbad62cf77.log
{"log":"I0901 00:25:09.311469       1 event.go:291] \"Event occurred\" object=\"kube-system/coredns-f9fd979d6-w9lb9\" kind=\"Pod\" apiVersion=\"v1\" type=\"Warning\" reason=\"NodeNotReady\" message=\"Node is not ready\"\n","stream":"stderr","time":"2021-09-01T00:25:09.311544102Z"}
{"log":"I0901 00:25:09.317360       1 event.go:291] \"Event occurred\" object=\"kube-system/etcd-minikube\" kind=\"Pod\" apiVersion=\"v1\" type=\"Warning\" reason=\"NodeNotReady\" message=\"Node is not ready\"\n","stream":"stderr","time":"2021-09-01T00:25:09.317460194Z"}
{"log":"I0901 00:25:09.320589       1 event.go:291] \"Event occurred\" object=\"kube-system/kube-apiserver-minikube\" kind=\"Pod\" apiVersion=\"v1\" type=\"Warning\" reason=\"NodeNotReady\" message=\"Node is not ready\"\n","stream":"stderr","time":"2021-09-01T00:25:09.32067275Z"}
{"log":"I0901 00:25:09.323255       1 event.go:291] \"Event occurred\" object=\"kube-system/kube-controller-manager-minikube\" kind=\"Pod\" apiVersion=\"v1\" type=\"Warning\" reason=\"NodeNotReady\" message=\"Node is not ready\"\n","stream":"stderr","time":"2021-09-01T00:25:09.323330302Z"}
{"log":"I0901 00:25:09.326567       1 node_lifecycle_controller.go:1195] Controller detected that all Nodes are not-Ready. Entering master disruption mode.\n","stream":"stderr","time":"2021-09-01T00:25:09.32663584Z"}
{"log":"I0901 00:25:09.326659       1 event.go:291] \"Event occurred\" object=\"kube-system/kube-proxy-ncslh\" kind=\"Pod\" apiVersion=\"v1\" type=\"Warning\" reason=\"NodeNotReady\" message=\"Node is not ready\"\n","stream":"stderr","time":"2021-09-01T00:25:09.326738028Z"}
{"log":"I0901 00:25:36.090104       1 event.go:291] \"Event occurred\" object=\"default/aws-stub\" kind=\"Deployment\" apiVersion=\"apps/v1\" type=\"Normal\" reason=\"ScalingReplicaSet\" message=\"Scaled up replica set aws-stub-8559dd85c6 to 1\"\n","stream":"stderr","time":"2021-09-01T00:25:36.090202319Z"}

I0901 00:24:29.263542       1 node_lifecycle_controller.go:1245] Controller detected that zone  is now in state Normal.
I0901 00:24:29.263616       1 event.go:291] "Event occurred" object="minikube" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node minikube event: Registered Node minikube in Controller"
I0901 00:24:29.266905       1 shared_informer.go:247] Caches are synced for ReplicationController
I0901 00:24:29.270606       1 shared_informer.go:247] Caches are synced for stateful set
I0901 00:24:29.275111       1 shared_informer.go:247] Caches are synced for certificate-csrapproving
I0901 00:24:29.278757       1 shared_informer.go:247] Caches are synced for PVC protection
I0901 00:24:29.282290       1 shared_informer.go:247] Caches are synced for job
I0901 00:24:29.291643       1 shared_informer.go:247] Caches are synced for HPA
I0901 00:24:29.291756       1 shared_informer.go:247] Caches are synced for ClusterRoleAggregator
I0901 00:24:29.291772       1 shared_informer.go:247] Caches are synced for disruption
I0901 00:24:29.291778       1 disruption.go:339] Sending events to api server.
I0901 00:24:29.291865       1 shared_informer.go:247] Caches are synced for deployment
I0901 00:24:29.291916       1 shared_informer.go:247] Caches are synced for ReplicaSet
I0901 00:24:29.291997       1 shared_informer.go:247] Caches are synced for endpoint_slice
I0901 00:24:29.292190       1 shared_informer.go:247] Caches are synced for expand
I0901 00:24:29.293131       1 shared_informer.go:247] Caches are synced for endpoint
I0901 00:24:29.443618       1 shared_informer.go:247] Caches are synced for resource quota
I0901 00:24:29.456188       1 shared_informer.go:247] Caches are synced for attach detach
I0901 00:24:29.492244       1 shared_informer.go:247] Caches are synced for resource quota
I0901 00:24:29.545339       1 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0901 00:24:29.845496       1 shared_informer.go:247] Caches are synced for garbage collector
I0901 00:24:29.852184       1 shared_informer.go:247] Caches are synced for garbage collector
I0901 00:24:29.852198       1 garbagecollector.go:137] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I0901 00:25:09.297574       1 event.go:291] "Event occurred" object="minikube" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node minikube status is now: NodeNotReady"
I0901 00:25:09.302113       1 event.go:291] "Event occurred" object="kube-system/kube-scheduler-minikube" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.304739       1 event.go:291] "Event occurred" object="kube-system/storage-provisioner" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.308867       1 event.go:291] "Event occurred" object="kube-system/coredns-f9fd979d6-qbgtj" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.311469       1 event.go:291] "Event occurred" object="kube-system/coredns-f9fd979d6-w9lb9" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.317360       1 event.go:291] "Event occurred" object="kube-system/etcd-minikube" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.320589       1 event.go:291] "Event occurred" object="kube-system/kube-apiserver-minikube" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.323255       1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager-minikube" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:09.326567       1 node_lifecycle_controller.go:1195] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
I0901 00:25:09.326659       1 event.go:291] "Event occurred" object="kube-system/kube-proxy-ncslh" kind="Pod" apiVersion="v1" type="Warning" reason="NodeNotReady" message="Node is not ready"
I0901 00:25:36.090104       1 event.go:291] "Event occurred" object="default/aws-stub" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set providers-aws-stub-8559dd85c6 to 1"
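Since the controller-manager only reports that the node went not-Ready, the next step is to look at the kubelet inside the minikube node itself. Roughly (a sketch; this assumes the default driver, where the kubelet runs as a systemd unit inside the minikube node):

# Shell into the minikube node and inspect the kubelet service
minikube ssh
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 100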

@itsnagaraj
Author

Minikube node status - might be related to https://github.com/kubernetes/kubernetes/issues/34314

Ran kubectl describe node minikube

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Mon, 17 May 2021 04:40:31 +0000   Wed, 01 Sep 2021 02:01:52 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 17 May 2021 04:40:31 +0000   Wed, 01 Sep 2021 02:01:52 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Mon, 17 May 2021 04:40:31 +0000   Wed, 01 Sep 2021 02:01:52 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Mon, 17 May 2021 04:40:31 +0000   Wed, 01 Sep 2021 02:01:52 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
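The Ready condition only says the kubelet stopped posting status, so another thing worth checking is when the kubelet last heartbeat (a sketch with standard kubectl; the node lease is named after the node):

# Last heartbeat recorded in the node lease and in the Ready condition
kubectl get lease -n kube-node-lease minikube -o yaml | grep renewTime
kubectl get node minikube -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'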

@itsnagaraj
Author

itsnagaraj commented Sep 6, 2021

More logs from minikube. Looks like an issue with kube-dns:

10m         Warning   NodeNotReady             pod/coredns-f9fd979d6-qbgtj            Node is not ready
112d        Normal    Scheduled                pod/coredns-f9fd979d6-w9lb9            Successfully assigned kube-system/coredns-f9fd979d6-w9lb9 to minikube
112d        Normal    Pulled                   pod/coredns-f9fd979d6-w9lb9            Container image "k8s.gcr.io/coredns:1.7.0" already present on machine
112d        Normal    Created                  pod/coredns-f9fd979d6-w9lb9            Created container coredns
112d        Normal    Started                  pod/coredns-f9fd979d6-w9lb9            Started container coredns
112d        Warning   Unhealthy                pod/coredns-f9fd979d6-w9lb9            Readiness probe failed: HTTP probe failed with statuscode: 503
10m         Warning   NodeNotReady             pod/coredns-f9fd979d6-w9lb9            Node is not ready
112d        Normal    SuccessfulCreate         replicaset/coredns-f9fd979d6           Created pod: coredns-f9fd979d6-qbgtj
112d        Normal    SuccessfulCreate         replicaset/coredns-f9fd979d6           Created pod: coredns-f9fd979d6-w9lb9
112d        Normal    ScalingReplicaSet        deployment/coredns                     Scaled up replica set coredns-f9fd979d6 to 2
112d        Normal    Pulled                   pod/etcd-minikube                      Container image "k8s.gcr.io/etcd:3.4.13-0" already present on machine
112d        Normal    Created                  pod/etcd-minikube                      Created container etcd
112d        Normal    Started                  pod/etcd-minikube                      Started container etcd
10m         Warning   NodeNotReady             pod/etcd-minikube                      Node is not ready
112d        Normal    LeaderElection           endpoints/k8s.io-minikube-hostpath     minikube_630050fb-1db9-47bf-855f-99d995190e17 became leader
112d        Normal    Pulled                   pod/kube-apiserver-minikube            Container image "k8s.gcr.io/kube-apiserver:v1.19.4" already present on machine
112d        Normal    Created                  pod/kube-apiserver-minikube            Created container kube-apiserver
112d        Normal    Started                  pod/kube-apiserver-minikube            Started container kube-apiserver
10m         Warning   NodeNotReady             pod/kube-apiserver-minikube            Node is not ready
112d        Normal    Pulled                   pod/kube-controller-manager-minikube   Container image "k8s.gcr.io/kube-controller-manager:v1.19.4" already present on machine
112d        Normal    Created                  pod/kube-controller-manager-minikube   Created container kube-controller-manager
112d        Normal    Started                  pod/kube-controller-manager-minikube   Started container kube-controller-manager
10m         Warning   NodeNotReady             pod/kube-controller-manager-minikube   Node is not ready
112d        Warning   FailedToUpdateEndpoint   endpoints/kube-dns                     Failed to update endpoint kube-system/kube-dns: Operation cannot be fulfilled on endpoints "kube-dns": the object has been modified; please apply your changes to the latest version and try again
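To dig into the kube-dns side, roughly what I am running (standard kubectl; k8s-app=kube-dns is the default CoreDNS label):

# CoreDNS pod state, recent logs, and the kube-dns endpoints object
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
kubectl -n kube-system get endpoints kube-dns -o yaml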

@itsnagaraj
Author

itsnagaraj commented Sep 7, 2021

After more investigation, this seems to be an issue with the kubelet not having enough capacity to start up properly when there is one big pod (with multiple containers) that builds docker images, hosts the kind container, and runs the integration tests as well.

Splitting kind into its own pod seems to give the kubelet enough capacity to start up properly. I tested this by triggering 5 or more concurrent builds, and all of the builds were successful.
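For anyone hitting the same thing, a rough sketch of the change (the names, image, and resource numbers below are placeholders from our setup, not a drop-in fix): instead of one big pod doing everything, kind gets its own pod with explicit resource requests so the kubelet is not starved:

# Hypothetical manifest: run kind in a dedicated pod with explicit resources
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: kind-runner                             # placeholder name
spec:
  containers:
  - name: kind
    image: example.registry/kind-runner:latest  # placeholder image
    securityContext:
      privileged: true                          # kind-in-a-pod typically needs a privileged container
    resources:
      requests:
        cpu: "2"                                # placeholder sizing
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi
EOF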
