
Karpenter nodes get stuck on "NotReady" state #1415

Closed
devopsjnr opened this issue Feb 24, 2022 · 15 comments

Labels
bug Something isn't working

@devopsjnr commented Feb 24, 2022

Version 0.5.6

Karpenter started creating new nodes and worked as expected.
After a while (approximately 40 minutes), some nodes switch from Ready to NotReady and stay like that for hours; nothing moves.
It seems to happen randomly; most of the pods on the NotReady nodes are in "Running" state and then move to "Terminating".

The provisioner has ttlSecondsAfterEmpty: 60, and ttlSecondsUntilExpired isn't defined.

Here are the Events from the node description:
[Screenshot: node events, including NodeHasDiskPressure]

NodeHasDiskPressure - I think Karpenter nodes start with a 20GB disk. Is it possible to extend the disk size through the provisioner? It might help with this situation.

Here is another node that has just become NotReady. This time I can't really understand why:
[Screenshot: node events]

@devopsjnr devopsjnr added the bug Something isn't working label Feb 24, 2022
@ellistarn
Contributor

Interesting -- it looks like your pods are consuming the ephemeral storage on the node. Can you list the pod specs applied to the node? Can you provide provisioner specs as well?

@devopsjnr
Author

This is an example of one of the pod specs:

spec:
  containers:
  - env:
    - name: JAVA_TOOL_OPTIONS
      value: -Xmx750m -Xms750m
    - name: prf-url
      value: prf:8080/v1
    - name: custom-url
      value: custom:8080/v1
    - name: con-url
    image: <image>
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health/liveness
        port: 8080
        scheme: HTTP
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    name: <name>
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health/readiness
        port: 8080
        scheme: HTTP
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    resources: {}
    securityContext: {}
    startupProbe:
      failureThreshold: 14
      httpGet:
        path: /actuator/health/liveness
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: <name>
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: <name>
  - name: <name>
  initContainers:
  - args:
    - migrate
    env:
    - name: FLYWAY_CONFIG_FILES
      value: /flyway/configs/flyway.conf
    image: <image>
    imagePullPolicy: Always
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /flyway/migrations
      name: <name>
    - mountPath: /flyway/configs
      name: <name>
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: <name>
      readOnly: true
  nodeName: ip-172-31-9-224.<region>.compute.internal
  nodeSelector:
    env: integration
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: <sa>
  serviceAccountName: <name>
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: <name>
    name: <name>
  - configMap:
      defaultMode: 420
      name: <name>
    name: <name>
  - name: kube-api-access
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

This is the provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: integration-provisioner
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge"]
  labels:
    env: integration
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector: 
      Name: integration-private
    securityGroupSelector: 
      aws:eks:cluster-name: integration
  ttlSecondsAfterEmpty: 60

@ellistarn
Contributor

Great. I see you're not using custom launch templates or Bottlerocket, so that simplifies some concerns I had about the disk.

I'm focused on the log line NodeHasDiskPressure. You're running out of disk due to some combination of image size, logging output, etc. Have you run this workload without Karpenter before? Can you check your EC2 console and see how big the root EBS volume is on this instance? I assume it should be 20GB. Then it may be worth connecting to the instance (aws ssm start-session --target $INSTANCE_ID) and checking what's using up your disk with something like du -h.

@devopsjnr
Author

@ellistarn I have run this workload with a node group before, and the EBS volume used to be 200GB. Now with Karpenter it is only 20GB (I think this is the EKS default). Hence my question whether there's a way to choose the volume size from the provisioner, or whether 20GB is unchangeable. I would really rather not deal with launch templates.

@devopsjnr
Author

@ellistarn Now there's a new message. All these pods used to work within a node group before, and the provisioner configuration is similar to what the node group used.

[Screenshot: node events showing the messages below]

"failed to garbage collect required amount of images. Wanted to free X bytes, but freed 0 bytes"
"System OOM encountered, victim process: java, pid: 17173"

@devopsjnr
Author

devopsjnr commented Feb 25, 2022

@ellistarn Do you have any insights?
My cluster is burning.

@ellistarn
Contributor

@bwagner5 is working on this right now.
#939

@devopsjnr
Author

devopsjnr commented Feb 26, 2022

Thank you. In the meantime I am running with my own launch template, so the disk size is now 200GB.
However, nodes are still moving from Ready to NotReady some time after they come up, and they stay stuck like that for hours; I need to delete them manually (all pods are in Terminating).

@ellistarn @bwagner5 Any idea why this keeps happening?

This is how my cluster looks currently. I really need your advice because the env is down and other people are using it :(

@tzneal
Contributor

tzneal commented Feb 26, 2022

The example pod spec you listed above doesn't have any resource requests. Without them, I believe Karpenter will pack nodes until it reaches the ENI pod limit, which varies based on instance type. Are you seeing lots of OOM errors, or does kubectl describe node <nodename> show memory pressure issues?
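
For illustration, adding requests and limits to the container in the pod spec above might look like the sketch below. The values are placeholders only (sized loosely around the -Xmx750m heap plus JVM non-heap overhead), not recommendations; they should come from observed usage:

    resources:
      requests:
        cpu: 250m        # placeholder value
        memory: 1Gi      # placeholder: heap (-Xmx750m) plus non-heap overhead
      limits:
        memory: 1536Mi   # placeholder value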

@devopsjnr
Author

@tzneal I do see OOM errors.

This is an example of one of the nodes description:

Allocated resources:

  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         800m (20%)  1 (25%)
  memory                      712Mi (4%)  1224Mi (8%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
Events:
  Type     Reason                   Age                   From     Message
  ----     ------                   ----                  ----     -------
  Normal   NodeNotReady             57m (x6 over 3h12m)   kubelet  Node ip-172-31-9-97.<region>.compute.internal status is now: NodeNotReady
  Warning  SystemOOM                57m                   kubelet  System OOM encountered, victim process: java, pid: 1357
  Warning  SystemOOM                48m                   kubelet  System OOM encountered, victim process: java, pid: 648
  Warning  SystemOOM                36m                   kubelet  System OOM encountered, victim process: java, pid: 4364
  Warning  SystemOOM                19m                   kubelet  System OOM encountered, victim process: java, pid: 12247
  Normal   NodeHasNoDiskPressure    19m (x14 over 3h12m)  kubelet  Node ip-172-31-9-97.<region>.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientMemory  19m (x14 over 3h12m)  kubelet  Node ip-172-31-9-97.<region>.compute.internal status is now: NodeHasSufficientMemory
  Warning  SystemOOM                13m                   kubelet  System OOM encountered, victim process: java, pid: 5325

Should I add memory resource limits or requests to the provisioner itself?

@tzneal
Contributor

tzneal commented Feb 26, 2022

Karpenter already has resource requests defined for itself in its Helm chart.

In my experience, you really need memory resource requests on your containers or scheduling won't work well in Kubernetes, regardless of whether you use Karpenter or any other autoscaler.

If you look at your node output above, it says that Kubernetes is only aware of 712Mi of memory requests, but you have Java processes getting OOM killed by the kernel, so you are running out of physical memory on the node.

@tzneal
Contributor

tzneal commented Feb 26, 2022

Setting resource requests is also a best practice listed here

It's a best practice to define these requests and limits in your pod definitions. If you don't include these values, the scheduler doesn't understand what resources are needed. Without this information, the scheduler might schedule the pod on a node without sufficient resources to provide acceptable application performance.

@devopsjnr
Author

@tzneal Thanks again for your attention.

Karpenter already has resource requests defined for itself in its Helm chart.

So what is the meaning of the resource requests and limits in the provisioner's spec? (You can see my provisioner above for reference.)

In my experience, you really need memory resource requests on your containers or scheduling won't work well in Kubernetes, regardless of whether you use Karpenter or any other autoscaler.

Are you certain that adding resource requests and limits to my deployments will solve the issue?

bwagner5 changed the title from "Karpenter nods get stuck on "NotReady" state" to "Karpenter nodes get stuck on "NotReady" state" Mar 9, 2022
@bwagner5
Contributor

The Block Device Mappings PR has been merged and should be released next week. You can check out the preview docs here: https://karpenter.sh/preview/aws/provisioning/#block-device-mappings

If you are fine with the other defaults Karpenter is currently providing, you should be able to use the following mapping once we do a release:

spec:
  provider:
    blockDeviceMappings:
      - deviceName: /dev/xvda
        volumeSize: 200Gi
        volumeType: gp3
        encrypted: true
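
For context, that mapping would sit under spec.provider in the provisioner posted earlier in this thread, alongside the subnet and security group selectors. A sketch, assuming the preview syntax above (the 200Gi size mirrors the launch-template value used as a workaround):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: integration-provisioner
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge"]
  labels:
    env: integration
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector:
      Name: integration-private
    securityGroupSelector:
      aws:eks:cluster-name: integration
    # Hypothetical addition, following the preview snippet above
    blockDeviceMappings:
      - deviceName: /dev/xvda
        volumeSize: 200Gi
        volumeType: gp3
        encrypted: true
  ttlSecondsAfterEmpty: 60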
