Karpenter Addon: on upgrade to 0.32 (v1beta1 CRDs) cannot terminate pre-existing nodes #1004

Open
Feder1co5oave opened this issue May 17, 2024 · 0 comments
Labels
bug Something isn't working

Feder1co5oave commented May 17, 2024

Describe the bug

We need to pay attention to the upgrade path for the Karpenter addon on an existing EKS installation.

The upgrade path from any version <0.32.x to any >0.32 is as follows: https://karpenter.sh/v0.32/upgrading/v1beta1-migration/#upgrade-procedure

  1. upgrade to 0.31.4

  2. install the new v1beta1 CRDs. I usually do this manually via Helm:
    helm upgrade --kube-context=<kubecontext> --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version v0.32.9 --namespace karpenter

  3. upgrade to 0.32.x

  4. convert existing v1alpha5 Karpenter resources into v1beta1 (Provisioner -> NodePool, AWSNodeTemplate -> EC2NodeClass)
    This can be tricky for some use cases but should be doable for the default created by the addon.
    These v1beta1 resources should coexist with the older v1alpha5 ones to make a seamless migration possible.

  5. rotate existing nodes to migrate them from v1alpha5's Machine to v1beta1's NodeClaim.

    kubectl get machines -owide
    kubectl delete nodes <nodes>
    

    To do this, Karpenter must be able to terminate the EC2 instances underlying the old nodes. Unfortunately, Karpenter's new (stricter) IAM policy does not allow it to terminate instances that lack the required tags. As a result, pre-existing nodes will NEVER be removed by Karpenter: kubectl delete node does nothing. To fix this you need to either terminate the instances manually (the CCM then deletes the nodes automatically after a while) or tag them. I chose the former:
    aws ec2 terminate-instances --instance-ids i-0f0d1e695993762f8
    This requirement is described here: https://karpenter.sh/v0.32/upgrading/v1beta1-migration/#updating-tag-based-permissions

    While migrating between alpha and beta, you will need to maintain the old tag permissions as well as the new permissions.

    The addon does not do this. When upgrading to 0.32.0 or later, it installs only the new scoped IAM policy, which makes it impossible for Karpenter to terminate the instances of pre-existing nodes:

    if (semver.gte(version, "v0.32.0")) {
        karpenterPolicyDocument = iam.PolicyDocument.fromJson(KarpenterControllerPolicyBeta(cluster, partition, region));
    } else {
        karpenterPolicyDocument = iam.PolicyDocument.fromJson(KarpenterControllerPolicy);
    }

  6. make sure no old Machines are present: kubectl get machines
    At this point we could get rid of the old IAM policy and migrate to the new, stricter one.

  7. delete all AWSNodeTemplates and Provisioners

  8. delete v1alpha5 CRDs.

  9. upgrade to v0.33.x (and install the new CRDs, see "Karpenter Addon: On upgrade to 0.33+, new CRDs are not installed by blueprints" #962)
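
The failure in step 5 comes down to which tags each instance carries. Below is a rough sketch of how one might list the instances Karpenter can no longer terminate before rotating nodes. It assumes the v1beta1 scoped policy allows termination only when an instance has both the `kubernetes.io/cluster/<cluster-name>: owned` tag and a `karpenter.sh/nodepool` tag; check the policy actually deployed in your account. The `Instance` shape and both function names are hypothetical and belong to neither Karpenter nor EKS Blueprints.

```typescript
// Hypothetical types and helpers for illustration only.
type Tags = Record<string, string>;

interface Instance {
  instanceId: string;
  tags: Tags;
}

// Assumption: the scoped v1beta1 policy permits ec2:TerminateInstances only
// for instances tagged as owned by the cluster AND carrying a
// karpenter.sh/nodepool tag (any value).
function isTerminableUnderBetaPolicy(tags: Tags, clusterName: string): boolean {
  const ownedByCluster = tags[`kubernetes.io/cluster/${clusterName}`] === "owned";
  const hasNodePoolTag = "karpenter.sh/nodepool" in tags;
  return ownedByCluster && hasNodePoolTag;
}

// Returns the IDs of instances that would have to be terminated manually
// (or re-tagged) because the scoped policy would not cover them.
function findStrandedInstances(instances: Instance[], clusterName: string): string[] {
  return instances
    .filter((instance) => !isTerminableUnderBetaPolicy(instance.tags, clusterName))
    .map((instance) => instance.instanceId);
}
```

The input would come from something like the output of `aws ec2 describe-instances`, converted to this shape. Any ID returned is a candidate for the manual `aws ec2 terminate-instances` call shown in step 5.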

Expected Behavior

Apart from the necessary manual actions (upgrading existing AWSNodeTemplates and Provisioners to the new v1beta1 version), there should be a safe upgrade path available from the addon.

Reproduction Steps

Deploy EKS blueprints with Karpenter addon version v0.31.3, and deploy any workload to spin up some worker nodes.
Update the version to v0.32.9 (you should manually install the new v1beta1 CRDs, as per #962).
Delete any one of the pre-existing nodes:

kubectl get machines -owide
kubectl delete nodes <nodes>

Karpenter will not be able to delete pre-existing nodes.

Possible Solution

While upgrading to the new v1beta1 Karpenter APIs, in version 0.32.x, install both old and new IAM policies specified here
as suggested by https://karpenter.sh/v0.32/upgrading/v1beta1-migration/#updating-tag-based-permissions
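
One way the suggestion above could look is sketched below. It is not the addon's implementation. `Statement` is a minimal stand-in for an IAM policy statement, `parseVersion` is a hypothetical replacement for the `semver` package so that the snippet has no dependencies, and `selectStatements` is a made-up name. The idea is simply that 0.32.x gets the statements of both policies, while older and newer versions get one set each.

```typescript
// Minimal stand-in for an IAM policy statement (illustrative only).
interface Statement {
  Sid?: string;
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string | string[];
  Condition?: Record<string, unknown>;
}

// Hypothetical helper: parses "v0.32.9" or "0.32.9" into [0, 32, 9].
function parseVersion(version: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) {
    throw new Error(`Unrecognised version: ${version}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Picks the statements to install for a given Karpenter version:
//   < 0.32    -> old (v1alpha5) policy only
//   0.32.x    -> new and old policies together, for the migration window
//   >= 0.33   -> new (v1beta1) scoped policy only
function selectStatements(
  version: string,
  alphaStatements: Statement[],
  betaStatements: Statement[],
): Statement[] {
  const [major, minor] = parseVersion(version);
  if (major === 0 && minor < 32) {
    return alphaStatements;
  }
  if (major === 0 && minor === 32) {
    return [...betaStatements, ...alphaStatements];
  }
  return betaStatements;
}
```

In the addon, the resulting statements would still need to be turned into an `iam.PolicyDocument`. The old statements drop out on their own once the version moves to 0.33.x, which lines up with step 6 of the upgrade path.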

CDK CLI Version

2.133.0 (build dcc1e75)

EKS Blueprints Version

1.14.1

Node.js Version

v18.18.2

Environment details (OS name and version, etc.)

Ubuntu Linux 22.04

@Feder1co5oave Feder1co5oave added the bug Something isn't working label May 17, 2024