New Node creation stuck in a loop #1573

Closed
vasu-git opened this issue Mar 25, 2022 · 16 comments
Labels: bug (Something isn't working), burning (Time sensitive issues)

vasu-git commented Mar 25, 2022

Version

Karpenter: v0.7.3

Kubernetes: v1.21.5-eks-bc4871b

  • I have a pretty basic EKS cluster in AWS. I deployed Karpenter and a default provisioner (on-demand instances only, with no restriction on instance type). The cluster also has a few daemonsets deployed (splunk, ebs-csi-driver, etc.).
  • I am trying to deploy a new pod (Prometheus) which also has a PVC volume. We use the ebs-csi-driver addon in our cluster to provision volumes. The full pod spec is below:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ieng.proofpoint.com/init-container-inserted: "true"
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2022-03-25T00:06:25Z"
  generateName: prometheus-prometheus-
  labels:
    app: prometheus
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/version: 2.24.0
  name: prometheus-prometheus-0
  namespace: prometheus-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: prometheus-prometheus
    uid: 01cd86c4-3007-4e1b-80fe-207bfefedb02
  resourceVersion: "1977637"
  uid: d3767c88-9fba-45c8-969e-711298019caf
spec:
  containers:
  - args:
    - --web.console.templates=/etc/prometheus/consoles
    - --web.console.libraries=/etc/prometheus/console_libraries
    - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
    - --storage.tsdb.path=/prometheus
    - --storage.tsdb.retention.time=7d
    - --web.enable-lifecycle
    - --query.max-concurrency=20
    - --query.timeout=2m
    - --web.external-url=prometheus-operator.dev-test8-labawsuse.lab.ppops.net
    - --web.route-prefix=/
    - --log.level=warn
    - --web.config.file=/etc/prometheus/web_config/web-config.yaml
    - --storage.tsdb.max-block-duration=2h
    - --storage.tsdb.min-block-duration=2h
    image: repocache.nonprod.ppops.net/dev-docker-local/prometheus/prometheus:v2.24.0
    imagePullPolicy: IfNotPresent
    name: prometheus
    ports:
    - containerPort: 9090
      name: web
      protocol: TCP
    readinessProbe:
      failureThreshold: 120
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "1"
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/prometheus/config_out
      name: config-out
      readOnly: true
    - mountPath: /etc/prometheus/certs
      name: tls-assets
      readOnly: true
    - mountPath: /prometheus
      name: prometheus-prometheus-db
      subPath: prometheus-db
    - mountPath: /etc/prometheus/rules/prometheus-prometheus-rulefiles-0
      name: prometheus-prometheus-rulefiles-0
    - mountPath: /etc/prometheus/web_config/web-config.yaml
      name: web-config
      readOnly: true
      subPath: web-config.yaml
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  - args:
    - --listen-address=:8080
    - --reload-url=http://localhost:9090/-/reload
    - --config-file=/etc/prometheus/config/prometheus.yaml.gz
    - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-rulefiles-0
    - --log-level=warn
    command:
    - /bin/prometheus-config-reloader
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: SHARD
      value: "0"
    image: repocache.nonprod.ppops.net/dev-docker-local/prometheus-config-reloader:4.10
    imagePullPolicy: IfNotPresent
    name: config-reloader
    resources:
      limits:
        cpu: 200m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 25Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/prometheus/config
      name: config
    - mountPath: /etc/prometheus/config_out
      name: config-out
    - mountPath: /etc/prometheus/rules/prometheus-prometheus-rulefiles-0
      name: prometheus-prometheus-rulefiles-0
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  - args:
    - sidecar
    - --prometheus.url=http://localhost:9090/
    - --grpc-address=[$(POD_IP)]:10901
    - --http-address=[$(POD_IP)]:10902
    - --objstore.config=$(OBJSTORE_CONFIG)
    - --tsdb.path=/prometheus
    - --log.level=info
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: AWS_REGION
      valueFrom:
        configMapKeyRef:
          key: region
          name: ieng-config
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::11111111111:role/dev-test8-labawsuse-prometheus-operator_prometheus
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: OBJSTORE_CONFIG
      valueFrom:
        secretKeyRef:
          key: thanos.yaml
          name: thanos-objstore-config
    image: repocache.nonprod.ppops.net/dev-docker-local/thanos:4.7
    imagePullPolicy: IfNotPresent
    name: thanos-sidecar
    ports:
    - containerPort: 10902
      name: http
      protocol: TCP
    - containerPort: 10901
      name: grpc
      protocol: TCP
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
    - mountPath: /prometheus
      name: prometheus-prometheus-db
      subPath: prometheus-db
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: prometheus-prometheus-0
  initContainers:
  - args:
    - -query-k8s
    - -namespace=$(NAMESPACE)
    - -pod-name=$(POD_NAME)
    env:
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: repocache.nonprod.ppops.net/dev-docker-local/certificate-init-container:1.76
    imagePullPolicy: IfNotPresent
    name: certificate-init-container
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 50Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - all
      runAsNonRoot: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/tls
      name: tls
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  - args:
    - --watch-interval=0
    - --listen-address=:8080
    - --config-file=/etc/prometheus/config/prometheus.yaml.gz
    - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-rulefiles-0
    - --log-level=warn
    command:
    - /bin/prometheus-config-reloader
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: SHARD
      value: "0"
    image: repocache.nonprod.ppops.net/dev-docker-local/prometheus-config-reloader:4.10
    imagePullPolicy: IfNotPresent
    name: init-config-reloader
    resources:
      limits:
        cpu: 200m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 25Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/prometheus/config
      name: config
    - mountPath: /etc/prometheus/config_out
      name: config-out
    - mountPath: /etc/prometheus/rules/prometheus-prometheus-rulefiles-0
      name: prometheus-prometheus-rulefiles-0
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  - args:
    - --dry-run=false
    - --secret-name=prometheus
    env:
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: repocache.nonprod.ppops.net/dev-docker-local/create-tls-secret:1.551
    imagePullPolicy: IfNotPresent
    name: create-tls-secret
    resources:
      limits:
        cpu: 100m
        memory: 250Mi
      requests:
        cpu: 100m
        memory: 250Mi
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/tls
      name: tls
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  - command:
    - /bin/chmod
    - -R
    - "777"
    - /prometheus
    image: repocache.nonprod.ppops.net/dev-docker-local/busybox:v1.28.0
    imagePullPolicy: IfNotPresent
    name: prometheus-data-permission-fix
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-prometheus-db
      subPath: prometheus-db
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hg4dq
      readOnly: true
  nodeName: ip-10-93-173-53.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 333
  priorityClassName: cloud15-services
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 65534
  serviceAccount: prometheus
  serviceAccountName: prometheus
  subdomain: prometheus-operated
  terminationGracePeriodSeconds: 600
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: prometheus
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: prometheus-prometheus-db
    persistentVolumeClaim:
      claimName: prometheus-prometheus-db-prometheus-prometheus-0
  - name: config
    secret:
      defaultMode: 420
      secretName: prometheus-prometheus
  - name: tls-assets
    secret:
      defaultMode: 420
      secretName: prometheus-prometheus-tls-assets
  - emptyDir: {}
    name: config-out
  - configMap:
      defaultMode: 420
      name: prometheus-prometheus-rulefiles-0
    name: prometheus-prometheus-rulefiles-0
  - name: web-config
    secret:
      defaultMode: 420
      secretName: prometheus-prometheus-web-config
  - emptyDir: {}
    name: tls
  - emptyDir: {}
    name: tmp
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 900
          path: token
  - name: kube-api-access-hg4dq
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
  • Karpenter picks up this pod and tries to provision a t3a.medium instance, but the pod never reaches the Running state, so Karpenter terminates the instance and provisions a new t3a.medium instance. The same process repeats again and again in an infinite loop.
  • There were also a few error messages in the ebs-csi-controller logs while the CSI controller was trying to provision the volume for the Prometheus pod.
  • So initially I thought this might be an issue with the ebs-csi-controller. However, just to test it out, I updated the default provisioner to only create instances of type t3a.xlarge (a rough sketch of that restriction is shown after this list), after which the Prometheus pod deployed successfully without any errors. This confirmed that the issue was not with the ebs-csi-controller.
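
For reference, restricting the default provisioner to a single instance type looks roughly like the snippet below. This is an illustrative sketch against the karpenter.sh/v1alpha5 Provisioner API used by v0.7.x; only the node.kubernetes.io/instance-type requirement is the relevant change, the rest is assumed to mirror the existing default provisioner.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # On-demand only, as in the original default provisioner
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    # The restriction added for the test: only allow t3a.xlarge
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3a.xlarge"]
  # Matches the 30s empty-node TTL visible in the Karpenter logs below
  ttlSecondsAfterEmpty: 30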

Below are log snippets from the ebs-csi-controller, Karpenter, and Kubernetes events in the pod's namespace (captured while t3a.medium instances were being provisioned).

Ebs-csi-controller logs

I0325 00:06:30.330285       1 controller.go:1332] provision "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0" class "gp2-prometheus-server": started
W0325 00:06:30.330378       1 controller.go:958] Retrying syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c", failure 1
E0325 00:06:30.330390       1 controller.go:981] error syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c": failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:30.330462       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"
I0325 00:06:30.330489       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:32.330537       1 controller.go:1332] provision "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0" class "gp2-prometheus-server": started
W0325 00:06:32.330644       1 controller.go:958] Retrying syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c", failure 2
E0325 00:06:32.330669       1 controller.go:981] error syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c": failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:32.330690       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"
I0325 00:06:32.330715       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:36.330814       1 controller.go:1332] provision "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0" class "gp2-prometheus-server": started
W0325 00:06:36.330911       1 controller.go:958] Retrying syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c", failure 3
E0325 00:06:36.330933       1 controller.go:981] error syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c": failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:36.330965       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"
I0325 00:06:36.330994       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:44.331100       1 controller.go:1332] provision "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0" class "gp2-prometheus-server": started
W0325 00:06:44.331205       1 controller.go:958] Retrying syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c", failure 4
E0325 00:06:44.331234       1 controller.go:981] error syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c": failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:06:44.331272       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"
I0325 00:06:44.331322       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:07:00.331397       1 controller.go:1332] provision "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0" class "gp2-prometheus-server": started
W0325 00:07:00.331519       1 controller.go:958] Retrying syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c", failure 5
E0325 00:07:00.331544       1 controller.go:981] error syncing claim "106773fc-37d6-4ea4-a9df-104064dab30c": failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found
I0325 00:07:00.331573       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"
I0325 00:07:00.331598       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"prometheus-operator", Name:"prometheus-prometheus-db-prometheus-prometheus-0", UID:"106773fc-37d6-4ea4-a9df-104064dab30c", APIVersion:"v1", ResourceVersion:"1974177", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: error getting CSINode for selected node "ip-10-93-173-53.ec2.internal": csinode.storage.k8s.io "ip-10-93-173-53.ec2.internal" not found

Karpenter logs

2022-03-24T22:35:17.729Z	DEBUG	controller.provisioning	Excluding instance type t3.micro because there are not enough resources for daemons	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:17.729Z	DEBUG	controller.provisioning	Excluding instance type t3a.micro because there are not enough resources for daemons	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:17.729Z	DEBUG	controller.provisioning	Excluding instance type t3.nano because there are not enough resources for kubelet and system overhead	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:17.730Z	DEBUG	controller.provisioning	Excluding instance type t3a.nano because there are not enough resources for kubelet and system overhead	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:17.732Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [t3.medium t3a.medium c5n.large m1.large m6i.large m4.large m5ad.large m5dn.large m5a.large m5d.large m6a.large t3.large m5.large m5n.large t3a.large m5zn.large c4.xlarge c5a.xlarge c5.xlarge c6a.xlarge]	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:18.108Z	DEBUG	controller.provisioning	Discovered security groups: [sg-0ea7e40b433cccb59]	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:18.111Z	DEBUG	controller.provisioning	Discovered kubernetes version 1.21	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:18.150Z	DEBUG	controller.provisioning	Discovered ami-0e1b6f116a3733fef for query /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:18.191Z	DEBUG	controller.provisioning	Discovered launch template Karpenter-dev-test8-labawsuse-11256910872157935272	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:20.397Z	INFO	controller.provisioning	Launched instance: i-09a21abd7a60be47d, hostname: ip-10-93-173-100.ec2.internal, type: t3a.medium, zone: us-east-1a, capacityType: on-demand	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:20.416Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-93-173-100.ec2.internal	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:20.416Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:20.470Z	INFO	controller.volume	Bound persistent volume claim to node ip-10-93-173-100.ec2.internal	{"commit": "78d3031", "resource": "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"}
2022-03-24T22:35:23.242Z	INFO	controller.node	Added TTL to empty node	{"commit": "78d3031", "node": "ip-10-93-173-48.ec2.internal"}
2022-03-24T22:35:53.001Z	INFO	controller.node	Triggering termination after 30s for empty node	{"commit": "78d3031", "node": "ip-10-93-173-48.ec2.internal"}
2022-03-24T22:35:53.034Z	INFO	controller.termination	Cordoned node	{"commit": "78d3031", "node": "ip-10-93-173-48.ec2.internal"}
2022-03-24T22:35:53.204Z	INFO	controller.termination	Deleted node	{"commit": "78d3031", "node": "ip-10-93-173-48.ec2.internal"}
2022-03-24T22:35:53.933Z	DEBUG	controller.provisioning	Discovered 373 EC2 instance types	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:35:54.155Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:16.491Z	INFO	controller.provisioning	Batched 1 pods in 1.000822065s	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:16.497Z	DEBUG	controller.provisioning	Excluding instance type t3a.nano because there are not enough resources for kubelet and system overhead	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:16.499Z	DEBUG	controller.provisioning	Excluding instance type t3.nano because there are not enough resources for kubelet and system overhead	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:16.502Z	DEBUG	controller.provisioning	Excluding instance type t3a.micro because there are not enough resources for daemons	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:16.504Z	DEBUG	controller.provisioning	Excluding instance type t3.micro because there are not enough resources for daemons	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:16.521Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [t3a.medium t3.medium c5n.large m1.large m5ad.large m6a.large t3a.large m5.large m5d.large m5n.large m5zn.large t3.large m4.large m5dn.large m6i.large m5a.large c4.xlarge c5.xlarge c5ad.xlarge c5a.xlarge]	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:18.969Z	INFO	controller.provisioning	Launched instance: i-0dc2662164288c92d, hostname: ip-10-93-173-106.ec2.internal, type: t3a.medium, zone: us-east-1a, capacityType: on-demand	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:18.995Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-93-173-106.ec2.internal	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:18.995Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:19.008Z	INFO	controller.volume	Bound persistent volume claim to node ip-10-93-173-106.ec2.internal	{"commit": "78d3031", "resource": "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"}
2022-03-24T22:36:19.593Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-02b8fa0d166e619bb (us-east-1c) subnet-0e7eb33644d652c33 (us-east-1b) subnet-0dd62ea3b1e5fcac2 (us-east-1a)]	{"commit": "78d3031", "provisioner": "default"}
2022-03-24T22:36:41.575Z	INFO	controller.node	Added TTL to empty node	{"commit": "78d3031", "node": "ip-10-93-173-58.ec2.internal"}
2022-03-24T22:37:11.000Z	INFO	controller.node	Triggering termination after 30s for empty node	{"commit": "78d3031", "node": "ip-10-93-173-58.ec2.internal"}
2022-03-24T22:37:11.030Z	INFO	controller.termination	Cordoned node	{"commit": "78d3031", "node": "ip-10-93-173-58.ec2.internal"}
2022-03-24T22:37:11.230Z	INFO	controller.termination	Deleted node	{"commit": "78d3031", "node": "ip-10-93-173-58.ec2.internal"}
2022-03-24T22:37:42.170Z	INFO	controller.node	Added TTL to empty node	{"commit": "78d3031", "node": "ip-10-93-173-61.ec2.internal"}
2022-03-24T22:38:12.000Z	INFO	controller.node	Triggering termination after 30s for empty node	{"commit": "78d3031", "node": "ip-10-93-173-61.ec2.internal"}
2022-03-24T22:38:12.033Z	INFO	controller.termination	Cordoned node	{"commit": "78d3031", "node": "ip-10-93-173-61.ec2.internal"}
2022-03-24T22:38:12.224Z	INFO	controller.termination	Deleted node	{"commit": "78d3031", "node": "ip-10-93-173-61.ec2.internal"}

Kubernetes Events

78s         Normal    TaintManagerEviction   pod/prometheus-prometheus-0                                              Cancelling deletion of Pod prometheus-operator/prometheus-prometheus-0
59m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-b2w6s on node ip-10-93-173-26.ec2.internal
59m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-skwtj on node ip-10-93-173-26.ec2.internal
59m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
59m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
58m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-7llmt on node ip-10-93-173-100.ec2.internal
58m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
57m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
56m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-5m7c9 on node ip-10-93-173-66.ec2.internal
56m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-hk8mj on node ip-10-93-173-66.ec2.internal
56m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
55m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
55m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-89dmh on node ip-10-93-173-26.ec2.internal
54m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
53m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
53m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-vsx6c on node ip-10-93-173-75.ec2.internal
53m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
52m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-5zm6r on node ip-10-93-173-57.ec2.internal
52m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-44wfr on node ip-10-93-173-57.ec2.internal
52m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
52m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
50m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
50m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-nqgl5 on node ip-10-93-173-77.ec2.internal
50m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
50m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
49m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-8rrxp on node ip-10-93-173-98.ec2.internal
49m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-6nh85 on node ip-10-93-173-98.ec2.internal
49m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
48m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-5tr95 on node ip-10-93-173-67.ec2.internal
48m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-tddll on node ip-10-93-173-67.ec2.internal
48m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
47m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-pqlpb on node ip-10-93-173-121.ec2.internal
47m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-vtf5v on node ip-10-93-173-121.ec2.internal
46m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
45m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
45m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-58tmv on node ip-10-93-173-71.ec2.internal
45m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
45m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
43m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
41m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-khq4f on node ip-10-93-173-48.ec2.internal
41m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-dl2rs on node ip-10-93-173-48.ec2.internal
41m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
41m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
39m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
37m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-8kz5v on node ip-10-93-173-27.ec2.internal
37m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-gzcwp on node ip-10-93-173-27.ec2.internal
37m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
37m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
36m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-lsp24 on node ip-10-93-173-66.ec2.internal
36m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-5hbzt on node ip-10-93-173-66.ec2.internal
36m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
36m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
34m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-nm4tm on node ip-10-93-173-48.ec2.internal
34m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-nqsqp on node ip-10-93-173-48.ec2.internal
34m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
34m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
33m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
31m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-5hgpd on node ip-10-93-173-121.ec2.internal
31m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-f65bk on node ip-10-93-173-121.ec2.internal
31m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
29m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
28m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-66sr4 on node ip-10-93-173-80.ec2.internal
28m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-bzltc on node ip-10-93-173-80.ec2.internal
28m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
28m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
27m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-dsfhf on node ip-10-93-173-12.ec2.internal
27m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
26m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-lk5fq on node ip-10-93-173-62.ec2.internal
26m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-nljd7 on node ip-10-93-173-62.ec2.internal
25m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
24m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
22m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-qkgcd on node ip-10-93-173-70.ec2.internal
22m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-mmqvp on node ip-10-93-173-70.ec2.internal
22m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
22m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
21m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
21m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-57fwp on node ip-10-93-173-41.ec2.internal
21m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
20m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-w7pt7 on node ip-10-93-173-71.ec2.internal
20m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-kqcck on node ip-10-93-173-71.ec2.internal
20m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
19m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
19m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-6rprc on node ip-10-93-173-65.ec2.internal
19m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 1 node(s) were unschedulable, 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
18m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
17m         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
17m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-khsf2 on node ip-10-93-173-5.ec2.internal
17m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
16m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/kube-proxy-2khb4 on node ip-10-93-173-12.ec2.internal
16m         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-k6dsj on node ip-10-93-173-12.ec2.internal
11m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
11m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu, 5 Insufficient memory.
10m         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 3 Insufficient cpu, 6 Insufficient memory.
9m19s       Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
7m42s       Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-2n9d2 on node ip-10-93-173-48.ec2.internal
7m42s       Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-fnnp6 on node ip-10-93-173-48.ec2.internal
7m37s       Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
7m34s       Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 5 Insufficient memory.
6m4s        Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
4m29s       Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/secrets-store-csi-driver-npc6p on node ip-10-93-173-112.ec2.internal
4m29s       Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/csi-secrets-store-provider-aws-47z54 on node ip-10-93-173-112.ec2.internal
4m22s       Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient memory.
3m32s       Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by kube-system/aws-node-rccxw on node ip-10-93-173-46.ec2.internal
3m27s       Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/5 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
2m22s       Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
2m20s       Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-tzpl7 on node ip-10-93-173-23.ec2.internal
2m16s       Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/6 nodes are available: 2 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 3 Insufficient cpu, 4 Insufficient memory.
2m13s       Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
83s         Warning   NetworkNotReady        pod/prometheus-prometheus-0                                              network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
82s         Normal    Preempted              pod/prometheus-prometheus-0                                              Preempted by ieng/splunk-kubernetes-logging-fm48x on node ip-10-93-173-66.ec2.internal
77s         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/7 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 4 Insufficient memory.
75s         Warning   FailedScheduling       pod/prometheus-prometheus-0                                              0/8 nodes are available: 3 Insufficient cpu, 4 Insufficient memory, 4 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate.
7m34s       Normal    Provisioning           persistentvolumeclaim/prometheus-prometheus-db-prometheus-prometheus-0   External provisioner is provisioning volume for claim "prometheus-operator/prometheus-prometheus-db-prometheus-prometheus-0"
20m         Warning   ProvisioningFailed     persistentvolumeclaim/prometheus-prometheus-db-prometheus-prometheus-0   (combined from similar events): failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: no topology key found on CSINode ip-10-93-173-71.ec2.internal
2m17s       Normal    ExternalProvisioning   persistentvolumeclaim/prometheus-prometheus-db-prometheus-prometheus-0   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
5m25s       Warning   ProvisioningFailed     persistentvolumeclaim/prometheus-prometheus-db-prometheus-prometheus-0   failed to provision volume with StorageClass "gp2-prometheus-server": error generating accessibility requirements: no topology key found on CSINode ip-10-93-173-112.ec2.internal
78s         Normal    SuccessfulCreate       statefulset/prometheus-prometheus                                        create Pod prometheus-prometheus-0 in StatefulSet prometheus-prometheus successful
11m         Normal    info                   helmrelease/prometheus                                                   dependencies do not meet ready condition (dependency 'prometheus-operator/prometheus-operator' is not ready), retrying in 30s

Expected Behavior

Karpenter should provision a node that is big enough for the unschedulable pod.

Actual Behavior

Node creation gets stuck in an infinite loop, seemingly because the node being created is not big enough for the unschedulable pod.

Just FYI, other pod/node creation is working fine. I only see this issue when I try to deploy this particular pod.

@vasu-git vasu-git added the bug Something isn't working label Mar 25, 2022
@tzneal
Contributor

tzneal commented Mar 25, 2022

Can you list your daemon sets and what their resource requests are?

@vasu-git
Author

$ kubectl get daemonsets -A -o=jsonpath='{.items..resources.requests}'
{"cpu":"25m"} {"cpu":"100m"} {"cpu":"70m","memory":"30Mi"}
$ kubectl get daemonsets -A -o=jsonpath='{.items..resources.limits}'
{"memory":"500Mi"} {"cpu":"50m","memory":"100Mi"} {"cpu":"100m","memory":"100Mi"} {"cpu":"200m","memory":"200Mi"} {"cpu":"100m","memory":"100Mi"} {"cpu":"70m","memory":"100Mi"}

@ellistarn ellistarn added the burning Time sensitive issues label Mar 25, 2022
@dewjam
Contributor

dewjam commented Mar 25, 2022

Hey @vasu-git ,
We were able to reproduce the issue in a test env and believe we have found the root cause. In short, it seems to be related to DaemonSets which have resource Limits defined, but not Requests. One simple workaround you could test is to ensure resource requests are defined in the DaemonSet specs for now.
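
For illustration, a minimal sketch of that workaround, with requests set explicitly on the DaemonSet container (the name, image, and values below are hypothetical):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-logging-agent
spec:
  selector:
    matchLabels:
      app: example-logging-agent
  template:
    metadata:
      labels:
        app: example-logging-agent
    spec:
      containers:
      - name: agent
        image: registry.example.com/agent:latest
        resources:
          # Explicit requests let Karpenter account for this pod when sizing a node,
          # rather than it only seeing the limits.
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 200m
            memory: 200Mi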

We are working on a fix which should allow Karpenter to handle this scenario in future releases. Thanks for reporting the issue!

@vasu-git
Author

Thanks @dewjam

@infestonn

infestonn commented Mar 28, 2022

@dewjam hey
I have a similar issue. I tried your workaround but it didn't help.
Karpenter adds a new node to fit my "Pending" pod. But the pod is preempted by DaemonSet pods -> Karpenter adds a new node with the same capacity to schedule the pod. This is an endless process unless I remove my pod manually.
I also noticed interesting behaviour:

  • once Karpenter adds a new node (e.g. m5d.large) to the cluster, the node's CPU capacity = 2000m. After ~30-40s, when kubelet starts to report its status, capacity is reduced by 200m. I presume this is because I have --system-reserved=memory=256Mi,cpu=100m --kube-reserved=memory=256Mi,cpu=100m in the bootstrap args.
  • I have 8 DaemonSets. Only 2 of these DaemonSets' pods are scheduled on a new node instantly, whereas the rest are added only after some time (30-120s) – this is when preemption is happening.

karpenter v0.5.3

@dewjam
Contributor

dewjam commented Mar 28, 2022

Hey @infestonn ,
I think you're absolutely correct that it's related to the CPU/memory reservations you're passing to Kubelet. Can you confirm you're using a custom launch template?

@dewjam
Contributor

dewjam commented Mar 28, 2022

@infestonn
If I understand correctly, your issue is related to how Karpenter calculates overhead. In short, it calculates it based on an instance type's CPU/memory resources, but doesn't take into account what is in custom launch templates. So if the values supplied in the custom launch template exceed what Karpenter calculates, it's possible a pod will stay in a pending state.

As an example, using the default values, Karpenter expects an m5d.large instance type to have 1930m of allocatable CPU instead of the 1800m you have:

Allocatable:
  cpu:                         1930m

@infestonn

infestonn commented Mar 29, 2022

Hey @infestonn , I think you're absolutely correct that it's related to the CPU/memory reservations you're passing to Kubelet. Can you confirm you're using a custom launch template?

$ kubectl get provisioners.karpenter.sh default -o jsonpath='{.spec.provider.launchTemplate}'                                           
uscalq-eks-karpenter20220311161351524700000001

@infestonn

@infestonn If I understand correctly, your issue is related to how Karpenter calculates overhead. In short, it calculates it based on an instance type's CPU/memory resources, but doesn't take into account what is in custom launch templates. So if the values supplied in the custom launch template exceed what Karpenter calculates, it's possible a pod will stay in a pending state.

As an example, using the default values, Karpenter expects an m5d.large instance type to have 1930m of allocatable CPU instead of the 1800m you have:

Allocatable:
  cpu:                         1930m

Yes. I think that's right. Not sure how to find a workaround yet.

@dewjam dewjam self-assigned this Mar 29, 2022
@dewjam
Contributor

dewjam commented Mar 29, 2022

@infestonn You could set kube-reserved and system-reserved to what Karpenter expects as a workaround. Or is there a specific reason kube-reserved and system-reserved need to be set statically?
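
As a rough sketch of that, you would align the reservations passed to kubelet (whether via the --kube-reserved/--system-reserved bootstrap flags you use today, or via a kubelet config file) with the overhead Karpenter computes. The values below are placeholders only; the 1930m figure above implies roughly 70m of CPU overhead on an m5d.large, but the exact numbers vary by instance type:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Placeholder values: align these with the overhead Karpenter computes
# for the instance types your provisioner can launch.
kubeReserved:
  cpu: "70m"
  memory: "500Mi"
systemReserved:
  memory: "100Mi"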

@johngmyers

Because of kubernetes/kubernetes#102382, it is extremely inadvisable to set resource requests on high-priority daemonsets. Otherwise, a daemonset update could cause critical workloads to be (unnecessarily) preempted. This is why kube-reserved should instead be set to account for the resource requirements of critical daemonsets.

@dewjam
Contributor

dewjam commented Mar 29, 2022

Were you able to reproduce the behavior in kubernetes/kubernetes#102382? It seems like setting kube-reserved to account for DaemonSet resources could be burdensome to manage and error prone. I would hope there is a more elegant solution.

Is your recommendation to not define resource "limits" on critical DaemonSets as well? (If you define resource limits but not requests, requests default to the limits.)
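
A minimal illustration of that defaulting behaviour (the names here are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: limits-only-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      # No requests are set, so the API server defaults them to the limits:
      # this container effectively requests cpu: 100m and memory: 100Mi.
      limits:
        cpu: 100m
        memory: 100Mi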

@infestonn

@infestonn You could set kube-reserved and system-reserved to what Karpenter expects as a workaround. Or is there a specific reason kube-reserved and system-reserved need to be set statically?

What does "what Karpenter expects" means? Does provisioner API have any parameter for that?

@dewjam
Contributor

dewjam commented Apr 4, 2022

Sorry for the confusion @infestonn . Karpenter calculates kube-reserved and system-reserved (overhead) based on CPU/memory resources allocated to instance types (https://github.com/aws/karpenter/blob/main/pkg/cloudprovider/aws/instancetype.go#L193-L230). Basically it takes total available resources minus calculated overhead to get Allocatable resources.

When determining the best instance type for a workload, Karpenter assumes the instance will be launched with kube-reserved and system-reserved values that were calculated using the same overhead formula. At the moment Karpenter is unaware when custom values are set in a launch template, so it's possible for Karpenter to launch an instance which has insufficient Allocatable resources.
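
To put rough numbers on it with the figures from this thread (an m5d.large with 2 vCPU = 2000m):

Karpenter's assumption:  2000m capacity - ~70m calculated overhead                  = 1930m allocatable
Custom launch template:  2000m capacity - 100m kube-reserved - 100m system-reserved = 1800m allocatable

A pod that Karpenter sized against 1930m of allocatable CPU may never fit on the node that actually registers with 1800m, so provisioning repeats.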

We are discussing exposing these params in the Karpenter Provisioner spec, though I don't have any timeline I can share.

@infestonn

Sorry for the confusion @infestonn . Karpenter calculates kube-reserved and system-reserved (overhead) based on CPU/memory resources allocated to instance types (https://github.com/aws/karpenter/blob/main/pkg/cloudprovider/aws/instancetype.go#L193-L230). Basically it takes total available resources minus calculated overhead to get Allocatable resources.

When determining the best instance type for a workload, Karpenter assumes the instance will be launched with kube-reserved and system-reserved values that were calculated using the same overhead formula. At the moment Karpenter is unaware when custom values are set in a launch template, so it's possible for Karpenter to launch an instance which has insufficient Allocatable resources.

We are discussing exposing these params in the Karpenter Provisioner spec, though I don't have any timeline I can share.

Thank you for the clarification.
That's what I presumed from the very beginning.

@dewjam
Contributor

dewjam commented Apr 8, 2022

Hello @vasu-git ,
#1616 was recently merged, which fixes the issue with DaemonSet resources and adds support for init container resources. It should be included in the next Karpenter release.

I'm going to go ahead and close out this issue, but please feel free to reopen if you have any questions in the meantime.

Thanks for reporting this issue!

@dewjam dewjam closed this as completed Apr 8, 2022