
Bug: Inflight check failed for node, Instance Type "" not found #3156

Closed

korenyoni opened this issue Jan 5, 2023 · 5 comments
Labels
bug Something isn't working

Comments

korenyoni (Contributor) commented Jan 5, 2023

Version

Karpenter Version: v0.21.1

Kubernetes Version: Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.7-eks-fb459a0", GitCommit:"c240013134c03a740781ffa1436ba2688b50b494", GitTreeState:"clean", BuildDate:"2022-10-24T20:36:26Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

Expected Behavior

We have one spot instance that failed its AWS EC2 health checks. It is running 0 pods and Karpenter is not removing it (it's been this way for the past 11 hours).


$ kubectl describe node ...

Name:               [redacted]
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    k8s.io/cloud-provider-aws=[redacted]
                    karpenter.k8s.aws/instance-ami-id=ami-0788d7e6b881be91c
                    karpenter.k8s.aws/instance-category=t
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-family=t2
                    karpenter.k8s.aws/instance-generation=2
                    karpenter.k8s.aws/instance-hypervisor=xen
                    karpenter.k8s.aws/instance-memory=32768
                    karpenter.k8s.aws/instance-pods=44
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/provisioner-name=dind
                    kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    node-type=dind
                    node.kubernetes.io/instance-type=t2.2xlarge
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Wed, 04 Jan 2023 23:00:45 +0200
Taints:             node.kubernetes.io/unreachable:NoExecute
                    codefresh.io=dinds:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:              Failed to get lease: leases.coordination.k8s.io "[redacted]" not found
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
  ----             ------    -----------------                 ------------------                ------                   -------
  Ready            Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
  MemoryPressure   Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
  DiskPressure     Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
  PIDPressure      Unknown   Wed, 04 Jan 2023 23:00:45 +0200   Wed, 04 Jan 2023 23:01:49 +0200   NodeStatusNeverUpdated   Kubelet never posted node status.
Addresses:
System Info:
  Machine ID:                 
  System UUID:                
  Boot ID:                    
  Kernel Version:             
  OS Image:                   
  Operating System:           
  Architecture:               
  Container Runtime Version:  
  Kubelet Version:            
  Kube-Proxy Version:         
ProviderID:                   aws:///us-east-1a/i-[redacted]
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:
  Type     Reason               Age                 From       Message
  ----     ------               ----                ----       -------
  Warning  FailedInflightCheck  82s (x63 over 10h)  karpenter  Instance Type "" not found

Karpenter should gracefully remove said node, instead of leaving the cluster in this state, where one of the nodes is reporting NotReady but Karpenter cannot do anything about it because the node is failing the inflight checks.

Actual Behavior

(screenshot: CleanShot 2023-01-05 at 10 33 51)

EDIT: upon further investigation, I see that the node is missing the beta.kubernetes.io/instance-type label. I suspect the failed EC2 health checks somehow prevented this label from being added. Either way, Karpenter should not end up in this deadlock just because the label is missing.
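
For reference, one way to double-check which instance-type labels a node actually carries (replace <node-name> with the real node name; -L just adds those label keys as output columns):

$ kubectl get node <node-name> \
    -L node.kubernetes.io/instance-type \
    -L beta.kubernetes.io/instance-type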

Steps to Reproduce the Problem

I suppose you can reproduce it by rolling the dice until one of your Spot instances fails its EC2 health checks, and somehow ends up missing the beta.kubernetes.io/instance-type label.
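
To spot such nodes without describing them one by one, a label-selector query like this should work (a sketch; it assumes the affected nodes are Karpenter-provisioned and therefore carry karpenter.sh/provisioner-name):

$ # Karpenter-provisioned nodes that are missing the beta instance-type label
$ kubectl get nodes -l 'karpenter.sh/provisioner-name,!beta.kubernetes.io/instance-type'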

Resource Specs and Logs

See previous sections.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
korenyoni added the bug label Jan 5, 2023
korenyoni (Contributor, Author) commented Jan 5, 2023

So the node is missing the beta.kubernetes.io/instance-type label (probably related to its failed EC2 health checks, somehow).

2023-01-05T09:00:47.494Z    INFO    controller.inflightchecks    Inflight check failed for node, Instance Type "" not found    {"commit": "06cb81f-dirty", "node": "ip-10-13-127-85.ec2.internal"}

From the logs and peeking at karpenter-core, all I can tell is that it's coming from one of two places:

https://github.com/aws/karpenter-core/blob/6039eb43a1144aeaa5a28fbf2c3479669587b0c7/pkg/controllers/inflightchecks/nodeshape.go#L55-L61

https://github.com/aws/karpenter-core/blob/4e7255c0e5b0037327a93c0b936fae6ea67c30c5/pkg/controllers/inflightchecks/failedinit.go#L64-L70

So I'm not sure whether it's the failedInit or the nodeShape inflight check.

@tzneal, as the author of these inflight checks, do you know if the "deadlock" I describe is indeed caused by one of these checks failing (because beta.kubernetes.io/instance-type is somehow missing), and if there's a way to prevent it?

Or do these checks simply add verbosity to the k8s events?
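
A rough way to see exactly which message the controller is emitting (assuming Karpenter runs in the karpenter namespace as a Deployment named karpenter with a controller container; adjust for your install):

$ kubectl -n karpenter logs deployment/karpenter -c controller | grep -i inflight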

andrewhibbert (Contributor) commented:

I have this issue too, with an EC2 instance that is stuck in pending. I think this type of issue should be fixed by #2235 / #2544.

tzneal (Contributor) commented Jan 5, 2023

It looks like the node never started successfully, so kubelet never came up. This is a known issue: if the node fails to register correctly, it currently remains so that you can troubleshoot why it failed to come up.

There is a feature request at kubernetes-sigs/karpenter#750 for a node auto-repair feature, which would automatically remove nodes that experience startup issues. There are some complexities to it, as removing the node may not help. For example, if your userdata is just bad, removing the node and launching another will just fail again, and we'll get into a cycle of launching and terminating a node over and over again.
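
If you just need the stuck node gone in the meantime, deleting the Node object should let Karpenter's termination finalizer terminate the underlying instance (this assumes the node still carries the karpenter.sh/termination finalizer; draining is a no-op here since it runs 0 pods):

$ kubectl delete node <node-name>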

korenyoni (Contributor, Author) commented Jan 5, 2023

> It looks like the node never started successfully, so kubelet never came up. This is a known issue: if the node fails to register correctly, it currently remains so that you can troubleshoot why it failed to come up.
>
> There is a feature request at kubernetes-sigs/karpenter#750 for a node auto-repair feature, which would automatically remove nodes that experience startup issues. There are some complexities to it, as removing the node may not help. For example, if your userdata is just bad, removing the node and launching another will just fail again, and we'll get into a cycle of launching and terminating a node over and over again.

I see, thanks for the response.

I think for kubernetes-sigs/karpenter#750 you would need some sort of exponential backoff per provisioner to avoid the infinite cycle you were describing.

tzneal (Contributor) commented Jan 23, 2023

I'm going to close this in favor of kubernetes-sigs/karpenter#750
