
Liveness controller doesn't terminate nodes in all cases #1135

Closed
mattsre opened this issue Jan 12, 2022 · 5 comments
Labels: bug (Something isn't working), burning (Time sensitive issues)

@mattsre (Contributor) commented Jan 12, 2022

Version

Karpenter: v0.5.4

Kubernetes: v1.21.2

Expected Behavior

Karpenter should gracefully handle scenarios where nodes are unable to join the cluster. It could wait forever (or for a configurable timeout period) for the nodes to join, or it could continually delete the nodes that failed to join and launch replacements.

Actual Behavior

Every ~5 minutes, Karpenter batches pods and tries to launch a new set of instances. When following the Getting Started guide and scaling the inflate deployment to 10 replicas to test autoscaling, I ended up with ~30 m5.xlarge instances running but unable to join the cluster after 1 hour of testing. Old nodes were never deleted; only new ones were created. Additionally, after scaling down the inflate deployment, these empty nodes did not have a TTL applied, and I had to delete them manually.

Steps to Reproduce the Problem

The easiest way I've found to do this is to intentionally misconfigure the EKS VPC CNI plugin:

  1. Create an EKS cluster
  2. Setup VPC CNI plugin
  3. Remove the required "AmazonEKS_CNI_Policy" from either the EKS node role or the IAM role configured for the VPC CNI plugin (see the CLI sketch after this list)
  4. Follow Getting Started guide - https://karpenter.sh/docs/getting-started/
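
A minimal sketch of step 3 using the AWS CLI, assuming the node role is named "KarpenterNodeRole" (the role name is an assumption; substitute your own):

# Detach the CNI policy so ipamd loses the permissions it needs to manage ENIs
aws iam detach-role-policy \
  --role-name KarpenterNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy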

At this point, ipamd should be unable to create or attach new ENIs on the nodes Karpenter launches, which causes them to fail to join the cluster (the aws-node pod on the new nodes should get stuck in CrashLoopBackOff).
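
To confirm the failure mode, you can watch the VPC CNI pods with kubectl (a hedged example; k8s-app=aws-node is the label the VPC CNI DaemonSet applies to its pods):

# On the broken nodes these pods should sit in CrashLoopBackOff
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide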

When scaling to 10 replicas, Karpenter should get stuck in a loop which constantly creates new nodes.

Resource Specs and Logs

I've attached the logs from the controller during an example run. The pod used was the pause image provided in the Getting Started example deployment - https://karpenter.sh/docs/getting-started/#automatic-node-provisioning

Here is the provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: redacted
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        [
          "us-east-1a",
          "us-east-1b",
          "us-east-1c",
          "us-east-1d",
          "us-east-1e",
          "us-east-1f",
        ]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  ttlSecondsAfterEmpty: 60

KarpenterEndlessLaunchLogs.txt

mattsre added the bug (Something isn't working) label Jan 12, 2022
ellistarn changed the title from "Karpenter endlessly creates nodes if created nodes fail to join the cluster" to "Liveness controller doesn't terminate nodes in all cases" Jan 12, 2022
ellistarn added the burning (Time sensitive issues) label Jan 12, 2022
@ellistarn (Contributor) commented

This is such a good find. Currently, the liveness controller terminates nodes that fail to connect: https://github.com/aws/karpenter/blob/main/pkg/controllers/node/liveness.go#L49. However, for nodes that do connect but don't become ready, this can happen! We've been aware of this gap, but hadn't seen an example of it happening, so we hadn't prioritized it.

I think we probably want to apply liveness to all node unreadiness, with the ability for users to configure a longer static stability window via spec.livenessTTLSeconds. Defaulting to 5 minutes probably makes sense, since that matches the upstream pod static stability value.
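
For illustration only, a sketch of how the proposed field might look on a Provisioner. Note that spec.livenessTTLSeconds is just the proposal from this comment and does not exist in v0.5.x:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical field from the proposal above; not an implemented API
  livenessTTLSeconds: 300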

You can protect against this by setting provisioner.spec.limits, which is a simple failsafe mechanism.

@mattsre (Contributor, Author) commented Jan 12, 2022

Are limits actually implemented? I had read the design doc here: https://github.com/aws/karpenter/blob/main/designs/limits.md but was under the impression they weren't implemented yet. We actually have them commented out in our current Provisioner configuration, as I was waiting until they were available 😄

@suket22 (Contributor) commented Jan 12, 2022

They are implemented. You can limit by CPU and memory. We've got a doc update in a PR right now that should hopefully clear up that confusion - #1117
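
For reference, a hedged sketch of what CPU and memory limits on a Provisioner might look like, based on the limits design doc. The exact field layout and the values below are assumptions; check the doc update in #1117 for the authoritative syntax:

spec:
  limits:
    resources:
      # Stop launching new capacity once total provisioned resources hit these caps
      cpu: "1000"
      memory: 1000Gi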

@mattsre (Contributor, Author) commented Jan 12, 2022

Awesome, thanks. I'll enable those in the meantime as a failsafe.

@suket22 (Contributor) commented Jan 26, 2022

Released as part of https://github.com/aws/karpenter/releases/tag/v0.5.6

suket22 closed this as completed Jan 26, 2022