
Karpenter not deprovisioning and deleting customized gpu node #3862

Closed
iamtito opened this issue May 6, 2023 · 3 comments
Labels
bug Something isn't working

Comments


iamtito commented May 6, 2023

Version

Karpenter Version: v0.20.0

Kubernetes Version: v1.23

Expected Behavior

When I scale the deployment down to 0, the node should be deprovisioned and deleted, but that's not happening.

Actual Behavior

Scaling the deployment down does not bring the node down:

kubectl -n app scale deployment app1-service --replicas=0

Steps to Reproduce the Problem

Create a provisioner using this config. I deployed this configuration:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
  namespace: "app"
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r5.large"]
    - key: kubernetes.io/os
      operator: In
      values: ["linux"]
  providerRef:
    name: default
  consolidation:
    enabled: true
  labels:
    node: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
  namespace: "app"
spec:
  instanceProfile: "IAMProveApp1"
  subnetSelector:
    Role: "private"
  securityGroupSelector:
    aws:eks:cluster-name: dev-cluster
  tags:
    karpenter.sh/discovery/node: default
    karpenter.sh/discovery: dev-cluster
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
  namespace: "app"
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.4xlarge"]
  kubeletConfiguration:
    containerRuntime: dockerd
  providerRef:
    name: gpu
  consolidation:
    enabled: true
  taints:
    - key: app
      value: app1-service
      effect: NoSchedule
  labels:
    app: app1-service
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: gpu
  namespace: "app"
spec:
  labels:
    app: app1-service
  amiSelector:
    karpenter.sh/discovery: dev-cluster  # <-- We tagged the AMI to select with this key; it is a GPU base AMI with prebuilt images in it
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 150Gi
  subnetSelector:
    Role: "private"
  securityGroupSelector:
    aws:eks:cluster-name: dev-cluster
  tags:
    app: app1-service
    karpenter.sh/discovery/node: GPU
    karpenter.sh/discovery/app: app1-service
    karpenter.sh/discovery: dev-cluster

and use this manifest, app1-service.yaml, to deploy it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app1-service
  template:
    metadata:
      labels:
        app: app1-service
    spec:
      terminationGracePeriodSeconds: 0
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - app1-service
      tolerations:
        - effect: NoSchedule
          key: app
          operator: Equal
          value: app1-service
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2

Deploy it via kubectl apply -n app -f app1-service.yaml

Scale-up should provision the AMI from the custom amiSelector.
Scale-down should deprovision and delete the node.
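
For reference, the full cycle I expect (a sketch, assuming the inflate deployment above in namespace app) is:

kubectl -n app scale deployment inflate --replicas=1   # should launch a g4dn.4xlarge via the gpu provisioner
kubectl get nodes -l app=app1-service -w               # watch the GPU node join
kubectl -n app scale deployment inflate --replicas=0   # consolidation should then remove and delete the node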

Resource Specs and Logs

Below are the current config specs that got deployed to the cluster:

$ kubectl get provisioner
NAME      AGE
default   9d
gpu       3d5h

$ kubectl get AWSNodeTemplate
NAME      AGE
default   3d5h
gpu       3d5h

$ kubectl get provisioner default gpu -o yaml
apiVersion: v1
items:
- apiVersion: karpenter.sh/v1alpha5
  kind: Provisioner
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1alpha5","kind":"Provisioner","metadata":{"annotations":{},"name":"default"},"spec":{"consolidation":{"enabled":true},"labels":{"node":"default"},"providerRef":{"name":"default"},"requirements":[{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot"]},{"key":"node.kubernetes.io/instance-type","operator":"In","values":["r5.large"]},{"key":"kubernetes.io/os","operator":"In","values":["linux"]}]}}
    creationTimestamp: "2023-04-26T21:15:13Z"
    generation: 2
    managedFields:
    - apiVersion: karpenter.sh/v1alpha5
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          .: {}
          f:consolidation:
            .: {}
            f:enabled: {}
          f:labels:
            .: {}
            f:node: {}
          f:providerRef:
            .: {}
            f:name: {}
          f:requirements: {}
      manager: HashiCorp
      operation: Update
      time: "2023-04-26T21:15:13Z"
    - apiVersion: karpenter.sh/v1alpha5
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:resources:
            .: {}
            f:attachable-volumes-aws-ebs: {}
            f:cpu: {}
            f:ephemeral-storage: {}
            f:memory: {}
            f:pods: {}
      manager: karpenter
      operation: Update
      subresource: status
      time: "2023-04-26T21:29:16Z"
    name: default
    resourceVersion: "39970268"
    uid: 3082a50c-7dcc-4e85-adfc-3c4dd78ea3b7
  spec:
    consolidation:
      enabled: true
    labels:
      node: default
    providerRef:
      name: default
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
      - spot
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - r5.large
    - key: kubernetes.io/os
      operator: In
      values:
      - linux
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
  status:
    resources:
      attachable-volumes-aws-ebs: "25"
      cpu: "2"
      ephemeral-storage: 20959212Ki
      memory: 16078488Ki
      pods: "29"
      
- apiVersion: karpenter.sh/v1alpha5
  kind: Provisioner
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1alpha5","kind":"Provisioner","metadata":{"annotations":{},"name":"gpu"},"spec":{"consolidation":{"enabled":true},"kubeletConfiguration":{"containerRuntime":"dockerd"},"labels":{"app":"app1-service"},"providerRef":{"name":"gpu"},"requirements":[{"key":"karpenter.sh/capacity-type","operator":"In","values":["on-demand"]},{"key":"node.kubernetes.io/instance-type","operator":"In","values":["g4dn.4xlarge"]}],"taints":[{"effect":"NoSchedule","key":"app","value":"app1-service"}]}}
    creationTimestamp: "2023-05-03T12:15:51Z"
    generation: 4
    managedFields:
    - apiVersion: karpenter.sh/v1alpha5
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          .: {}
          f:consolidation:
            .: {}
            f:enabled: {}
          f:kubeletConfiguration:
            .: {}
            f:containerRuntime: {}
          f:labels:
            .: {}
            f:app: {}
          f:providerRef:
            .: {}
            f:name: {}
          f:requirements: {}
          f:taints: {}
      manager: HashiCorp
      operation: Update
      time: "2023-05-06T16:39:12Z"
    - apiVersion: karpenter.sh/v1alpha5
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:resources:
            .: {}
            f:attachable-volumes-aws-ebs: {}
            f:cpu: {}
            f:ephemeral-storage: {}
            f:memory: {}
            f:nvidia.com/gpu: {}
            f:pods: {}
      manager: karpenter
      operation: Update
      subresource: status
      time: "2023-05-06T16:41:37Z"
    name: gpu
    resourceVersion: "44220337"
    uid: 2d1ab463-06b0-46ae-b13e-9a665ba25a9b
  spec:
    consolidation:
      enabled: true
    kubeletConfiguration:
      containerRuntime: dockerd
    labels:
      app: app1-service
    providerRef:
      name: gpu
    requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
      - on-demand
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - g4dn.4xlarge
    - key: kubernetes.io/os
      operator: In
      values:
      - linux
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    taints:
    - effect: NoSchedule
      key: app
      value: app1-service
  status:
    resources:
      attachable-volumes-aws-ebs: "39"
      cpu: "16"
      ephemeral-storage: 157274092Ki
      memory: 65042920Ki
      nvidia.com/gpu: "1"
      pods: "29"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ kubectl get AWSNodeTemplate default gpu -o yaml
apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1alpha1
  kind: AWSNodeTemplate
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.k8s.aws/v1alpha1","kind":"AWSNodeTemplate","metadata":{"annotations":{},"name":"default"},"spec":{"instanceProfile":"eks-1cc37185-87f0-0c88-44b4-4db5a32ae91a","securityGroupSelector":{"aws:eks:cluster-name":"dev-cluster"},"subnetSelector":{"Role":"private"},"tags":{"karpenter.sh/discovery":"dev-cluster","karpenter.sh/discovery/node":"default"}}}
    creationTimestamp: "2023-05-03T12:15:51Z"
    generation: 1
    managedFields:
    - apiVersion: karpenter.k8s.aws/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          .: {}
          f:instanceProfile: {}
          f:securityGroupSelector:
            .: {}
            f:aws:eks:cluster-name: {}
          f:subnetSelector:
            .: {}
            f:Role: {}
          f:tags:
            .: {}
            f:karpenter.sh/discovery: {}
            f:karpenter.sh/discovery/node: {}
      manager: HashiCorp
      operation: Update
      time: "2023-05-03T12:15:51Z"
    name: default
    resourceVersion: "42727203"
    uid: fae27ccb-b308-404c-b622-b61233d71975
  spec:
    instanceProfile: eks-1cc37185-87f0-0c88-44b4-4db5a32ae91a
    securityGroupSelector:
      aws:eks:cluster-name: dev-cluster
    subnetSelector:
      Role: private
    tags:
      karpenter.sh/discovery: dev-cluster
      karpenter.sh/discovery/node: default
- apiVersion: karpenter.k8s.aws/v1alpha1
  kind: AWSNodeTemplate
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.k8s.aws/v1alpha1","kind":"AWSNodeTemplate","metadata":{"annotations":{},"name":"gpu"},"spec":{"amiSelector":{"karpenter.sh/discovery":"dev-cluster"},"blockDeviceMappings":[{"deviceName":"/dev/xvda","ebs":{"volumeSize":"150Gi"}}],"labels":{"app":"app1-service"},"securityGroupSelector":{"aws:eks:cluster-name":"dev-cluster"},"subnetSelector":{"Role":"private"},"tags":{"app":"app1-service","karpenter.sh/discovery":"dev-cluster","karpenter.sh/discovery/app":"app1-service","karpenter.sh/discovery/node":"gpu"}}}
    creationTimestamp: "2023-05-03T12:15:51Z"
    generation: 10
    managedFields:
    - apiVersion: karpenter.k8s.aws/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          .: {}
          f:amiSelector:
            .: {}
            f:karpenter.sh/discovery: {}
          f:blockDeviceMappings: {}
          f:securityGroupSelector:
            .: {}
            f:aws:eks:cluster-name: {}
          f:subnetSelector:
            .: {}
            f:Role: {}
          f:tags:
            .: {}
            f:app: {}
            f:karpenter.sh/discovery: {}
            f:karpenter.sh/discovery/app: {}
            f:karpenter.sh/discovery/node: {}
          f:userData: {}
      manager: HashiCorp
      operation: Update
      time: "2023-05-06T15:33:52Z"
    name: gpu
    resourceVersion: "44207417"
    uid: 838a5d29-b348-4476-adf2-e9f9712222c2
  spec:
    amiSelector:
      karpenter.sh/discovery: dev-cluster
    blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 150Gi
    securityGroupSelector:
      aws:eks:cluster-name: dev-cluster
    subnetSelector:
      Role: private
    tags:
      app: app1-service
      karpenter.sh/discovery: dev-cluster
      karpenter.sh/discovery/app: app1-service
      karpenter.sh/discovery/node: gpu
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Logs:

2023-05-06T15:45:53.971Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "default", incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name DoesNotExist not in karpenter.sh/provisioner-name In [default]; incompatible with provisioner "gpu", did not tolerate app=app1-service:NoSchedule	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:47:57.971Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:47:57.982Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "gpu", did not tolerate app=app1-service:NoSchedule; incompatible with provisioner "default", incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name DoesNotExist not in karpenter.sh/provisioner-name In [default]	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:47:57.982Z	INFO	controller.provisioner	found provisionable pod(s)	{"commit": "f60dacd", "pods": 2}
2023-05-06T15:47:57.982Z	INFO	controller.provisioner	serviced new node(s) to fit pod(s)	{"commit": "f60dacd", "nodes": 1, "pods": 1}
2023-05-06T15:47:57.989Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"125m","pods":"3"} from types g4dn.4xlarge	{"commit": "f60dacd", "provisioner": "gpu"}
2023-05-06T15:47:59.662Z	DEBUG	controller.provisioner.cloudprovider	created launch template	{"commit": "f60dacd", "provisioner": "gpu", "launch-template-name": "Karpenter-dev-cluster-123456", "launch-template-id": "lt-0f3aef76435882c8d"}
2023-05-06T15:48:02.821Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "f60dacd", "provisioner": "gpu", "launched-instance": "i-02c2ad45690ea2d4c", "hostname": "ip-10-111-22-33.ec2.internal", "type": "g4dn.4xlarge", "zone": "us-east-1a", "capacity-type": "on-demand"}
2023-05-06T15:48:17.886Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:48:17.886Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "default", incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name DoesNotExist not in karpenter.sh/provisioner-name In [default]; incompatible with provisioner "gpu", did not tolerate app=app1-service:NoSchedule	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:48:20.000Z	DEBUG	controller.deprovisioning	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:48:40.837Z	DEBUG	controller.deprovisioning	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:49:01.586Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:49:01.587Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "default", incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name DoesNotExist not in karpenter.sh/provisioner-name In [default]; incompatible with provisioner "gpu", did not tolerate app=app1-service:NoSchedule	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:49:52.987Z	DEBUG	controller.deprovisioning	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:50:28.991Z	DEBUG	controller.aws	deleted launch template	{"commit": "f60dacd"}
2023-05-06T15:50:29.096Z	DEBUG	controller.aws	deleted launch template	{"commit": "f60dacd"}
2023-05-06T15:50:30.099Z	INFO	controller.inflightchecks	Inflight check failed for node, Expected resource "nvidia.com/gpu" didn't register on the node	{"commit": "f60dacd", "node": "ip-10-111-22-33.ec2.internal"}
2023-05-06T15:50:31.541Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:50:31.541Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "default", incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name DoesNotExist not in karpenter.sh/provisioner-name In [default]; incompatible with provisioner "gpu", did not tolerate app=app1-service:NoSchedule	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}
2023-05-06T15:50:44.504Z	DEBUG	controller.deprovisioning	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/instance":"karpenter","app.kubernetes.io/name":"karpenter"}}}	{"commit": "f60dacd", "pod": "karpenter/karpenter-57c5f67dd6-7g9fd"}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
iamtito added the bug label May 6, 2023
jonathan-innis (Contributor) commented May 8, 2023

Do you see the karpenter.sh/initialized label on the deployed node? If not, you may be hitting this troubleshooting item.
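
A quick way to check is kubectl's label-column flag, which prints the label value (empty until Karpenter marks the node initialized) for every node:

kubectl get nodes -L karpenter.sh/initialized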

iamtito (Author) commented May 10, 2023

Thanks @jonathan-innis, the karpenter.sh/initialized label is not showing up on the node. I also added the below to the deployment, but the pod didn't get scheduled onto the node, and the node still doesn't have the label.

...
...
  resources:
    limits:
      nvidia.com/gpu: 1
....
....
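
For context, that block sits under the container in the inflate deployment above:

      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            limits:
              nvidia.com/gpu: 1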

jonathan-innis (Contributor) commented
@iamtito You'll need to install the NVIDIA device plugin (https://github.com/NVIDIA/k8s-device-plugin) as a DaemonSet on the cluster in order for the nvidia.com/gpu resource to get registered on GPU nodes so that pods can schedule to them. This will also fix the karpenter.sh/initialized issue, since Karpenter isn't considering the node initialized because that resource is missing.
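
A typical install (a sketch via the plugin's Helm chart; check its README for the current release and values) looks like:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system

Once the DaemonSet pod is running on the GPU node, nvidia.com/gpu should appear under the node's allocatable resources (kubectl describe node <node-name>).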

Alternatively, if you use the Bottlerocket AMIFamily, the image has built-in support for the plugin, so you don't need to install the DaemonSet separately.
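
That would be a one-line change on the node template (a sketch; amiFamily replaces your custom amiSelector here):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: gpu
spec:
  amiFamily: Bottlerocket  # Bottlerocket images have device plugin support built in
  subnetSelector:
    Role: "private"
  securityGroupSelector:
    aws:eks:cluster-name: dev-cluster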
