
Liveness controller doesn't terminate nodes in all cases #1135

Closed
mattsre opened this issue Jan 12, 2022 · 5 comments
Labels: bug (Something isn't working), burning (Time sensitive issues)

@mattsre (Contributor) commented Jan 12, 2022

Version

Karpenter: v0.5.4

Kubernetes: v1.21.2

Expected Behavior

Karpenter should gracefully handle scenarios where nodes are unable to join the cluster. It could wait forever (or for a configurable timeout period) for the nodes to join, or it could continually delete the nodes that failed to join and launch replacements.

Actual Behavior

Every ~5 minutes, Karpenter batches pods and tries to launch a new set of instances. When following the Getting Started guide and scaling the inflate deployment to 10 replicas to test autoscaling, I ended up with ~30 m5.xlarge instances running but unable to join the cluster after 1 hour of testing. Old nodes were never deleted; only new ones were created. Additionally, after scaling down the inflate deployment, these empty nodes did not have a TTL applied, and I had to delete them manually.

Steps to Reproduce the Problem

The easiest way I've found to do this is to intentionally misconfigure the EKS VPC CNI plugin:

  1. Create an EKS cluster
  2. Setup VPC CNI plugin
  3. Remove the required "AmazonEKS_CNI_Policy" from either the EKS node role or the IAM role configured for the VPC CNI plugin (see the CLI sketch after this list)
  4. Follow Getting Started guide - https://karpenter.sh/docs/getting-started/
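
A minimal sketch of step 3 using the AWS CLI, assuming the node role is named "KarpenterNodeRole" (the role name is an assumption; substitute your own):

# Detach the CNI policy so ipamd loses the permissions it needs to manage ENIs
aws iam detach-role-policy \
  --role-name KarpenterNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy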

At this point, ipamd should be unable to create or attach new ENIs on the nodes Karpenter launches, which causes them to fail to join the cluster (the aws-node pod on the new nodes should get stuck in CrashLoopBackOff).
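
To confirm the failure mode, you can watch the VPC CNI pods with kubectl (a hedged example; k8s-app=aws-node is the label the VPC CNI DaemonSet applies to its pods):

# On the broken nodes these pods should sit in CrashLoopBackOff
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide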

When scaling to 10 replicas, Karpenter should get stuck in a loop which constantly creates new nodes.

Resource Specs and Logs

I've attached the logs from the controller during an example run. The pod used was the pause image provided in the Getting Started example deployment - https://karpenter.sh/docs/getting-started/#automatic-node-provisioning

Here is the provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: redacted
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        [
          "us-east-1a",
          "us-east-1b",
          "us-east-1c",
          "us-east-1d",
          "us-east-1e",
          "us-east-1f",
        ]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  ttlSecondsAfterEmpty: 60

KarpenterEndlessLaunchLogs.txt

mattsre added the bug (Something isn't working) label Jan 12, 2022
ellistarn changed the title from "Karpenter endlessly creates nodes if created nodes fail to join the cluster" to "Liveness controller doesn't terminate nodes in all cases" Jan 12, 2022
ellistarn added the burning (Time sensitive issues) label Jan 12, 2022
@ellistarn (Contributor) commented

This is such a good find. Currently, the liveness controller terminates nodes that fail to connect: https://github.com/aws/karpenter/blob/main/pkg/controllers/node/liveness.go#L49. However, for nodes that do connect but don't become ready, this can happen! We've been aware of this gap, but hadn't seen an example of it happening, so we hadn't prioritized it.

I think we probably want to apply liveness to all node unreadiness, with the ability for users to configure a longer static stability window via spec.livenessTTLSeconds. Defaulting to 5 minutes probably makes sense, since that matches the upstream pod static stability value.
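
For illustration only, a sketch of how the proposed field might look on a Provisioner. Note that spec.livenessTTLSeconds is just the proposal from this comment and does not exist in v0.5.x:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical field from the proposal above; not an implemented API
  livenessTTLSeconds: 300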

You can protect against this by setting provisioner.spec.limits, which is a simple failsafe mechanism.

@mattsre (Contributor, Author) commented Jan 12, 2022

Are limits actually implemented? I had read the design doc here: https://github.com/aws/karpenter/blob/main/designs/limits.md but was under the impression they weren't implemented yet. We actually have them commented out in our current Provisioner configuration, as I was waiting until they were available 😄

@suket22 (Contributor) commented Jan 12, 2022

They are implemented. You can limit by CPU and memory. We've got a doc update in a PR right now that should hopefully clear up that confusion - #1117
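
For reference, a hedged sketch of what CPU and memory limits on a Provisioner might look like, based on the limits design doc. The exact field layout and the values below are assumptions; check the doc update in #1117 for the authoritative syntax:

spec:
  limits:
    resources:
      # Stop launching new capacity once total provisioned resources hit these caps
      cpu: "1000"
      memory: 1000Gi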

@mattsre (Contributor, Author) commented Jan 12, 2022

Awesome, thanks. I'll enable those in the meantime as a failsafe.

@suket22 (Contributor) commented Jan 26, 2022

Released as part of https://github.com/aws/karpenter/releases/tag/v0.5.6

suket22 closed this as completed Jan 26, 2022