
Bug: Inflight check failed for node, Instance Type "" not found #3156

Closed

korenyoni opened this issue Jan 5, 2023 · 5 comments
Labels
bug Something isn't working

Comments

korenyoni (Contributor) commented Jan 5, 2023

Version

Karpenter Version: v0.21.1

Kubernetes Version: Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.7-eks-fb459a0", GitCommit:"c240013134c03a740781ffa1436ba2688b50b494", GitTreeState:"clean", BuildDate:"2022-10-24T20:36:26Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

Expected Behavior

We have one spot instance that failed its AWS EC2 health checks. It is running 0 pods and Karpenter is not removing it (it's been this way for the past 11 hours).


$ kubectl describe node ...

Name:               [redacted]
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    k8s.io/cloud-provider-aws=[redacted]
                    karpenter.k8s.aws/instance-ami-id=ami-0788d7e6b881be91c
                    karpenter.k8s.aws/instance-category=t
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-family=t2
                    karpenter.k8s.aws/instance-generation=2
                    karpenter.k8s.aws/instance-hypervisor=xen
                    karpenter.k8s.aws/instance-memory=32768
                    karpenter.k8s.aws/instance-pods=44
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/provisioner-name=dind
                    kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    node-type=dind
                    node.kubernetes.io/instance-type=t2.2xlarge
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Wed, 04 Jan 2023 23:00:45 +0200
Taints:             node.kubernetes.io/unreachable:NoExecute
                    codefresh.io=dinds:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:              Failed to get lease: leases.coordination.k8s.io "[redacted]" not found
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
  ----             ------    -----------------                 ------------------                ------                   -------
  Ready            Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
  MemoryPressure   Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
  DiskPressure     Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
  PIDPressure      Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
Addresses:
System Info:
  Machine ID:                 
  System UUID:                
  Boot ID:                    
  Kernel Version:             
  OS Image:                   
  Operating System:           
  Architecture:               
  Container Runtime Version:  
  Kubelet Version:            
  Kube-Proxy Version:         
ProviderID:                   aws:///us-east-1a/i-[redacted]
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:
  Type     Reason               Age                 From       Message
  ----     ------               ----                ----       -------
  Warning  FailedInflightCheck  82s (x63 over 10h)  karpenter  Instance Type "" not found

Karpenter should gracefully remove said node, instead of leaving the cluster in this state, where one of the nodes is reporting NotReady but Karpenter cannot do anything about it because the node is failing the inflight checks.

Actual Behavior

(screenshot: CleanShot 2023-01-05 at 10 33 51)

EDIT: upon further investigation, I see that the node is missing the beta.kubernetes.io/instance-type label. I suspect the failed EC2 health checks somehow prevented this label from being added. Either way, Karpenter should not end up in this deadlock just because the label is missing.
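
For reference, one way to double-check which instance-type labels a node actually carries (replace <node-name> with the real node name; -L just adds those label keys as output columns):

$ kubectl get node <node-name> \
    -L node.kubernetes.io/instance-type \
    -L beta.kubernetes.io/instance-type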

Steps to Reproduce the Problem

I suppose you can reproduce it by rolling the dice until one of your Spot instances fails its EC2 health checks, and somehow ends up missing the beta.kubernetes.io/instance-type label.
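
To spot such nodes without describing them one by one, a label-selector query like this should work (a sketch; it assumes the affected nodes are Karpenter-provisioned and therefore carry karpenter.sh/provisioner-name):

$ # Karpenter-provisioned nodes that are missing the beta instance-type label
$ kubectl get nodes -l 'karpenter.sh/provisioner-name,!beta.kubernetes.io/instance-type'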

Resource Specs and Logs

See previous sections.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
korenyoni added the bug label Jan 5, 2023
korenyoni (Contributor, Author) commented Jan 5, 2023

So the node is missing the beta.kubernetes.io/instance-type label (probably related to its failed EC2 health checks, somehow).

2023-01-05T09:00:47.494Z    INFO    controller.inflightchecks    Inflight check failed for node, Instance Type "" not found    {"commit": "06cb81f-dirty", "node": "ip-10-13-127-85.ec2.internal"}

From the logs and peeking at karpenter-core, all I can tell is that it's coming from one of two places:

https://github.com/aws/karpenter-core/blob/6039eb43a1144aeaa5a28fbf2c3479669587b0c7/pkg/controllers/inflightchecks/nodeshape.go#L55-L61

https://github.com/aws/karpenter-core/blob/4e7255c0e5b0037327a93c0b936fae6ea67c30c5/pkg/controllers/inflightchecks/failedinit.go#L64-L70

So I'm not sure whether it's the failedInit or the nodeShape inflight check.

@tzneal, as the author of these inflight checks, do you know if the "deadlock" I describe is indeed caused by one of these checks failing (because beta.kubernetes.io/instance-type is somehow missing), and if there's a way to prevent it?

Or do these checks simply add verbosity to the k8s events?
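
A rough way to see exactly which message the controller is emitting (assuming Karpenter runs in the karpenter namespace as a Deployment named karpenter with a controller container; adjust for your install):

$ kubectl -n karpenter logs deployment/karpenter -c controller | grep -i inflight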

andrewhibbert (Contributor) commented:

I have this issue too, with an EC2 instance that is stuck in pending. I think this type of issue should be fixed by #2235 / #2544.

tzneal (Contributor) commented Jan 5, 2023

It looks like the node never started successfully, so kubelet never came up. This is a known issue: if the node fails to register correctly, it currently remains so that you can troubleshoot why it failed to come up.

There is a feature request at kubernetes-sigs/karpenter#750 for a node auto-repair feature, which would automatically remove nodes that experience startup issues. There are some complexities to it, as removing the node may not help. For example, if your userdata is just bad, removing the node and launching another will just fail again, and we'll get into a cycle of launching and terminating a node over and over again.
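
If you just need the stuck node gone in the meantime, deleting the Node object should let Karpenter's termination finalizer terminate the underlying instance (this assumes the node still carries the karpenter.sh/termination finalizer; draining is a no-op here since it runs 0 pods):

$ kubectl delete node <node-name>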

korenyoni (Contributor, Author) commented Jan 5, 2023

> It looks like the node never started successfully, so kubelet never came up. This is a known issue: if the node fails to register correctly, it currently remains so that you can troubleshoot why it failed to come up.
>
> There is a feature request at kubernetes-sigs/karpenter#750 for a node auto-repair feature, which would automatically remove nodes that experience startup issues. There are some complexities to it, as removing the node may not help. For example, if your userdata is just bad, removing the node and launching another will just fail again, and we'll get into a cycle of launching and terminating a node over and over again.

I see, thanks for the response.

I think for kubernetes-sigs/karpenter#750 you would need some sort of exponential backoff per provisioner to avoid the infinite cycle you were describing.

tzneal (Contributor) commented Jan 23, 2023

I'm going to close this in favor of kubernetes-sigs/karpenter#750
