New nodes are failing to communicate with the API server #1683

Closed
devopsjnr opened this issue Apr 18, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@devopsjnr

Karpenter version: v0.5.6
EKS version: v1.21.5

After installing Karpenter, I can see new nodes coming up while pods wait for resources. After a while the node fails and Karpenter tries to create a new one; this happens in an infinite loop.

I looked at the kubelet logs and that's what I saw:

Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.529560    3197 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.565078    3197 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.618444    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.720171    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.821478    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.929080    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.030623    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.130806    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.231704    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.332771    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.433601    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.534758    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.635732    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: W0418 13:10:59.648402    3197 transport.go:260] Unable to cancel request for *exec.roundTripper
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.648452    3197 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://E0FA7ECA2AEBF138BAC98206DB6E77E5.gr7.eu-west-1.eks.amazonaws.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-104-24-95.eu-west-1.compute.internal?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.736515    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.837609    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
Apr 18 13:10:59 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:59.938568    3197 kubelet.go:2294] "Error getting node" err="node \"ip-10-104-24-95.eu-west-1.compute.internal\" not found"
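
For context, logs like these can be collected straight from the node with something along these lines (a sketch; it assumes shell access to the instance via SSM or SSH):

# Tail the kubelet unit logs on the worker node
journalctl -u kubelet --no-pager --since "10 minutes ago"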

I added the instance profile's role to the aws-auth configmap:

apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:nodes"
      - "system:bootstrappers"
      "rolearn": "arn:aws:iam::<accountID>:role/KarpenterNodeInstanceProfile"
      "username": "system:node:{{EC2PrivateDNSName}}"

Also, the instance profile role has the right permissions:

  1. AmazonEKSWorkerNodePolicy
  2. AmazonEC2ContainerRegistryReadOnly
  3. AmazonSSMManagedInstanceCore
  4. AmazonEKS_CNI_Policy
  5. A policy called "karpenter" which contains the following:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:TerminateInstances",
                "ec2:RunInstances",
                "ec2:CreateFleet"
            ],
            "Resource": [
                "arn:aws:ec2:*:xxxxxxx:security-group/*",
                "arn:aws:ec2:*:xxxxxxx:instance/*",
                "arn:aws:ec2:*:xxxxxxx:network-interface/*",
                "arn:aws:ec2:*:xxxxxxx:volume/*",
                "arn:aws:ec2:*:xxxxxxx:subnet/*",
                "arn:aws:ec2:*:xxxxxxx:fleet/*",
                "arn:aws:ec2:*::image/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups"
            ],
            "Resource": "*"
        }
    ]
}

Instance profile role's Trust Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EKSWorkerAssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
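
Both of the above can be verified from the CLI as well (a sketch; the role name KarpenterNodeInstanceProfile is taken from the ARN above):

# List the managed policies attached to the node role
aws iam list-attached-role-policies --role-name KarpenterNodeInstanceProfile
# Show the role, including its trust (assume-role) policy document
aws iam get-role --role-name KarpenterNodeInstanceProfile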

As I am using a custom CNI, hostNetwork is set to true.
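
One way to confirm that setting actually landed on the controller (a sketch; the karpenter namespace and deployment name are assumptions based on a default install):

# Print whether hostNetwork is enabled on the Karpenter controller pod template
kubectl -n karpenter get deployment karpenter -o jsonpath='{.spec.template.spec.hostNetwork}'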

What could be preventing the nodes from communicating with the API server?

devopsjnr added the bug label on Apr 18, 2022
@tzneal (Contributor) commented Apr 18, 2022

Hey @devopsjnr, can you try upgrading to the latest Karpenter, v0.8.2? It would be helpful to see whether the issue still occurs there for you. There are some specific notes regarding upgrading from pre-v0.6.2 versions here: https://karpenter.sh/v0.8.2/upgrade-guide/#upgrading-to-v062
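
For reference, a minimal sketch of the upgrade, assuming the chart was installed from the charts.karpenter.sh Helm repository into the karpenter namespace (adjust names and values to match your install, and read the upgrade guide first):

helm repo update
helm upgrade karpenter karpenter/karpenter --namespace karpenter --version 0.8.2 --reuse-values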

@devopsjnr (Author)

@tzneal I already installed v0.5.6 on another EKS cluster in another AWS account, hence I'd rather stick to that version and confirm it works before I proceed with any upgrade. I know this version works; I just can't find what I have misconfigured.

Thanks!

@suket22 (Contributor) commented Apr 18, 2022

"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Apr 18 13:10:58 ip-10-104-24-95.eu-west-1.compute.internal kubelet[3197]: E0418 13:10:58.618444 3197 kubelet.go:2294] "Error getting node" err="node "ip-10-104-24-95.eu-west-1.compute.internal" not found"

What CNI are you using? If it's the VPC CNI, you'd want to look at the aws-node pod logs, the IPAMD logs on the worker node, etc. This sounds like an issue there rather than with the kubelet itself.

As I am using a custom CNI, hostNetwork is set to true

Could you look at the logs for this CNI?
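
A sketch of the kind of checks meant here, assuming the CNI runs as a DaemonSet in kube-system:

# VPC CNI: tail the aws-node pod logs (k8s-app=aws-node is the label the VPC CNI DaemonSet uses)
kubectl -n kube-system logs -l k8s-app=aws-node --tail=100
# The IPAMD log lives on the worker node at /var/log/aws-routed-eni/ipamd.log
# For other CNIs, find their DaemonSet and check its pods the same way
kubectl -n kube-system get daemonsets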

@felix-zhe-huang (Contributor) commented Apr 19, 2022

Are you using the v0.5.6 getting started guide to install Karpenter? In that case, Karpenter is instructed to use the instance profile KarpenterNodeInstanceProfile-${CLUSTER_NAME} created by this step. That's different from the role ARN used in your aws-auth configmap. Can you also check whether the KarpenterNodeInstanceProfile-${CLUSTER_NAME} instance profile has the correct permissions?
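
One way to confirm which role actually sits behind that instance profile (a sketch; substitute your cluster name):

# Show the role attached to the instance profile Karpenter launches nodes with
aws iam get-instance-profile --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}"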

@devopsjnr (Author)

@felix-zhe-huang

Are you using the v0.5.6 getting started guide to install Karpenter?

Yes, installing with Terraform.

That's a different roleARN used in the your aws-auth configmap.

It's the same role ARN; my role's name just did not contain the cluster name. However, I added the cluster name and updated the configmap, and I'm still getting the same error.
Here is my updated aws-auth:

apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:nodes"
      - "system:bootstrappers"
      "rolearn": "arn:aws:iam::<accountID>:role/KarpenterNodeInstanceProfile-devcluster"
      "username": "system:node:{{EC2PrivateDNSName}}"

Is there anything different now that I'm not seeing?

@suket22

What CNI are you using? If it's the VPC CNI, you'd want to look at the aws-node pod logs, IPAMD logs on the worker node etc. This sounds like an issue there and not with the kubelet itself.

I'm using Cilium; the logs look okay.
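
For anyone debugging a similar setup, a sketch of the usual Cilium-side checks (assuming Cilium runs as the cilium DaemonSet in kube-system):

# Check Cilium agent health and recent agent logs
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system logs ds/cilium --tail=100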

dewjam self-assigned this on Apr 19, 2022
@dewjam (Contributor) commented Apr 20, 2022

Hello @devopsjnr ,
The log messages read to me like a network connectivity or security group issue.

net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

From a failed node, can you try to connect to the apiserver endpoint using curl?

curl https://<api-server-endpoint> -k

You should see a 403 status code returned.

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}

If you don't, and the request instead times out, then there is likely a network connectivity problem between the kubelet and the API server endpoint.
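
A sketch for looking up what to test against (the cluster name is a placeholder):

# Print the API server endpoint and the EKS-managed cluster security group
aws eks describe-cluster --name <cluster-name> \
  --query "cluster.{endpoint:endpoint,clusterSG:resourcesVpcConfig.clusterSecurityGroupId}"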

@devopsjnr (Author) commented Apr 20, 2022

Hi @dewjam, thanks for your reply.
I'm not sure I can run a curl command from the failed node, because for that I would need at least one running pod, and I can't get any pod running on the nodes Karpenter creates.
The nodes are not coming up at all and the pods stay in a waiting state.

I also tried a specific security group (see my provisioner). Should I check anything in particular regarding my security groups?

@dewjam (Contributor) commented Apr 21, 2022

Hey @devopsjnr ,
Are the nodes being terminated right away? If not, then you should be able to connect to a node via the AWS Console and SSM to do some troubleshooting (or perhaps SSH).
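
A sketch of the SSM route (the instance ID is a placeholder; it relies on the AmazonSSMManagedInstanceCore policy already attached to the node role):

# Open a shell on the failed node via Session Manager, then run the curl test from there
aws ssm start-session --target <instance-id>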

@dewjam (Contributor) commented Apr 22, 2022

Seems to be related to #1634

@dewjam (Contributor) commented Apr 26, 2022

Hello @devopsjnr ,
Is this still an issue? If you are looking for EKS security group requirements, this is a good reference: https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html (specifically the "Cluster Security Group" section).
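
A sketch for dumping the rules of the cluster security group to compare against that reference (the security group ID is a placeholder):

# Inspect the inbound/outbound rules on the cluster security group
aws ec2 describe-security-groups --group-ids <cluster-security-group-id>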

@devopsjnr (Author)

Upgraded to the latest version. Problem solved, thanks all.
