Handle helm upgrade of AWS VPC CNI more gracefully #57

Closed
realvz opened this issue Dec 20, 2019 · 19 comments
Labels
bug Something isn't working

Comments

@realvz

realvz commented Dec 20, 2019

According to the documentation in the aws-vpc-cni Helm chart:
"If you receive an error similar to Error: release aws-vpc-cni failed: "aws-node" already exists, simply rerun the above command."

This causes issues with automation. It should be handled automatically in the Helm chart rather than requiring the user to rerun the command.

@max-rocket-internet
Contributor

This should be handled automatically in the helm chart

For sure, but how? To my knowledge, it's not possible to blindly overwrite existing resources without errors. The behaviour has also changed over the last couple of releases of Helm.

@3oris

3oris commented Feb 18, 2020

For a freshly set-up EKS cluster, the statement quoted above:

If you receive an error similar to Error: release aws-vpc-cni failed: "aws-node" already exists, simply rerun the above command.

is just not true:

$ helm version
version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}

$ helm upgrade --install --recreate-pods --force aws-vpc-cni --namespace kube-system eks/aws-vpc-cni
Flag --recreate-pods has been deprecated, functionality will no longer be updated. Consult the documentation for other methods to recreate pods
Release "aws-vpc-cni" does not exist. Installing it now.
Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ServiceAccount, namespace: kube-system, name: aws-node

$ helm upgrade --install --recreate-pods --force aws-vpc-cni --namespace kube-system eks/aws-vpc-cni
Flag --recreate-pods has been deprecated, functionality will no longer be updated. Consult the documentation for other methods to recreate pods
Release "aws-vpc-cni" does not exist. Installing it now.
Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ServiceAccount, namespace: kube-system, name: aws-node

$ helm upgrade --install --recreate-pods --force aws-vpc-cni --namespace kube-system eks/aws-vpc-cni
Flag --recreate-pods has been deprecated, functionality will no longer be updated. Consult the documentation for other methods to recreate pods
Release "aws-vpc-cni" does not exist. Installing it now.
Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ServiceAccount, namespace: kube-system, name: aws-node

The chart simply cannot be installed into an existing EKS cluster.

@jaypipes
Contributor

Thanks for the heads up @3oris, this is embarrassing indeed. I will work on getting an upgrade job placed into the VPC CNI system once we move the chart sources in there.

@jaypipes jaypipes added the bug Something isn't working label Feb 18, 2020
@mohsen0
Contributor

mohsen0 commented Feb 18, 2020

I removed the DaemonSet object and ServiceAccount before applying the change and managing the CNI with Helm. After that, everything went smoothly.
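For reference, a minimal sketch of that manual cleanup (assuming the default aws-node objects in kube-system and the install command from the chart README):

# Remove the pre-installed VPC CNI objects so the chart can create its own.
kubectl -n kube-system delete daemonset aws-node
kubectl -n kube-system delete serviceaccount aws-node

# Then install the chart as usual.
helm upgrade --install aws-vpc-cni --namespace kube-system eks/aws-vpc-cni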

@3oris

3oris commented Feb 18, 2020

@mohsen0 For me it doesn't: Helm then complains about the pre-existing CRD. I know I could delete all related resources and then apply the chart. For me the question is more or less: should an EKS cluster come with a pre-installed CNI plugin if the CNI plugin is not, after all, part of the "managed service" promise? As it stands, the cluster ships with an unmanaged component out of the box that I have to manage myself, yet adequate means of automation (in terms of CD) cannot be applied, so I have to work around that.

@stefanprodan
Collaborator

The CNI chart is not compatible with Helm v3; we should move the CRD to the crds directory. Here is an example: https://github.com/aws/eks-charts/tree/master/stable/appmesh-controller
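For illustration, the Helm v3 convention being referred to looks roughly like this (file names are illustrative; the linked appmesh-controller chart is the real example):

aws-vpc-cni/
├── Chart.yaml
├── values.yaml
├── crds/                     # CRDs here (e.g. ENIConfig) are installed by Helm v3 but never templated
│   └── eniconfig.yaml
└── templates/
    ├── daemonset.yaml
    └── serviceaccount.yaml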

@mohsen0
Contributor

mohsen0 commented Feb 18, 2020

@3oris I understand there is room for more improvement and clarity, but I suggest a more constructive, positive approach; then your contributions will be much more appreciated.

@max-rocket-internet
Contributor

CRD issue aside, you can just install the chart (it fails), then purge it, then reinstall, and all is done. This obviously deletes all the resources and then recreates them with the chart. I've used this method many times now and it works fine without any interruption.
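For reference, a rough sketch of that flow in Helm 2 syntax (the release and namespace names are assumptions; the exact commands differ for Helm 3):

helm install eks/aws-vpc-cni --name aws-vpc-cni --namespace kube-system   # fails: "aws-node" already exists
helm delete --purge aws-vpc-cni                                           # drop the FAILED release and its resources
helm install eks/aws-vpc-cni --name aws-vpc-cni --namespace kube-system   # clean install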

@max-rocket-internet
Contributor

According to the documentation in the aws-vpc-cni Helm chart:
"If you receive an error similar to Error: release aws-vpc-cni failed: "aws-node" already exists, simply rerun the above command."

This was true for older versions of Helm. For 2.16 it doesn't work, and probably not for Helm 3 either.

@3oris

3oris commented Feb 20, 2020

Here is how we work around this in Terraform for now: we remove all aws-node-related pre-deployed resources as soon as the control plane becomes available (and before the worker nodes become ready), and then apply the Helm chart.

data "aws_eks_cluster" "eks_cluster" {
  name = module.eks-cluster.cluster_id
}

data "aws_eks_cluster_auth" "eks_cluster_auth" {
  name = module.eks-cluster.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks_cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.eks_cluster_auth.token
  load_config_file       = false
  # bug: https://github.com/terraform-providers/terraform-provider-kubernetes/issues/759
  version = "1.10.0"
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.eks_cluster.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster.certificate_authority.0.data)
    token                  = data.aws_eks_cluster_auth.eks_cluster_auth.token
    load_config_file       = false
  }
}

module "eks-cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 8.1"

  cluster_name = local.cluster_name
  subnets      = module.vpc.private_subnets
  vpc_id       = module.vpc.vpc_id

  workers_role_name = format("%s-nodes-role", local.cluster_name)

  enable_irsa       = true
  manage_aws_auth   = true

  node_groups = [
    {
      name = "managed-nodes"
    }
  ]

  node_groups_defaults = {
    ...
  }

  map_roles = [
    ...
  ]
  tags = local.tags
}

# employing the data resource wrapper above,
# creation of the resources below will be triggered
# as soon as the control plane becomes ready

resource "null_resource" "remove_aws_vpc_cni_plugin" {
  provisioner "local-exec" {
    environment = {
      CLUSTER_ENDPOINT = data.aws_eks_cluster.eks_cluster.endpoint
      CLUSTER_CA       = data.aws_eks_cluster.eks_cluster.certificate_authority.0.data
      CLUSTER_TOKEN    = data.aws_eks_cluster_auth.eks_cluster_auth.token
    }
    command = format("%s/remove-aws-vpc-cni-plugin.sh", path.module)
  }
}

data "helm_repository" "eks" {
  name = "eks"
  url  = "https://aws.github.io/eks-charts"
}

resource "helm_release" "aws_vpc_cni" {
  depends_on = [null_resource.remove_aws_vpc_cni_plugin]
  repository = data.helm_repository.eks.metadata[0].name

  name      = "aws-vpc-cni"
  chart     = "aws-vpc-cni"
  version   = "1.0.2"
  namespace = "kube-system"
}

where remove-aws-vpc-cni-plugin.sh is:

#!/bin/bash

# Required environment variables:
#CLUSTER_ENDPOINT
#CLUSTER_CA
#CLUSTER_TOKEN
#REGION

# Wait until the API server answers health checks.
until curl -k -s $CLUSTER_ENDPOINT/healthz >/dev/null; do sleep 4; done

MANIFEST_FILE=`mktemp -p /tmp`
CONFIG_FILE=`mktemp -p /tmp`
CA_FILE=`mktemp -p /tmp`

trap "{ rm -f $MANIFEST_FILE $CONFIG_FILE $CA_FILE; }" EXIT

echo $CLUSTER_CA | base64 -d > $CA_FILE

# Derive the major.minor version of the pre-installed VPC CNI from the
# aws-node DaemonSet image tag.
VERSION=$(kubectl get ds aws-node -n kube-system -o yaml \
  --kubeconfig $CONFIG_FILE \
  --server $CLUSTER_ENDPOINT \
  --certificate-authority $CA_FILE \
  --token "$CLUSTER_TOKEN" \
  | grep "image:" | cut -d "/" -f 2 | cut -d ":" -f 2 | sed -E 's#^v([[:digit:]]+)\.([[:digit:]]+)\..+$#\1.\2#g')

# Fetch the matching upstream manifest and point it at the cluster's region.
curl -sqL \
  https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-$VERSION/config/v$VERSION/aws-k8s-cni.yaml \
  | sed -e "s#us-west-2#$REGION#" > $MANIFEST_FILE

# Delete the pre-deployed aws-node resources so the Helm chart can own them.
kubectl delete \
  --kubeconfig $CONFIG_FILE \
  --server $CLUSTER_ENDPOINT \
  --certificate-authority $CA_FILE \
  --token "$CLUSTER_TOKEN" \
  --filename $MANIFEST_FILE \
  --wait

This can also be applied to an existing cluster. As @max-rocket-internet stated, there should be no noticeable interruption.

Now you could annotate the aws-node service account via the helm_release resource and enjoy IAM Roles for Service Accounts.
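For example, a hedged sketch of what that could look like in the helm_release above (the serviceAccount.annotations values key is an assumption about the chart, and the role ARN is a placeholder):

resource "helm_release" "aws_vpc_cni" {
  depends_on = [null_resource.remove_aws_vpc_cni_plugin]
  repository = data.helm_repository.eks.metadata[0].name

  name      = "aws-vpc-cni"
  chart     = "aws-vpc-cni"
  version   = "1.0.2"
  namespace = "kube-system"

  # Annotate the aws-node service account with the IRSA role ARN (placeholder value).
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = "arn:aws:iam::123456789012:role/aws-node-irsa"
  }
}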

@mbarrien
Contributor

mbarrien commented May 14, 2020

Note that as of Helm 3.2.0, there is theoretically a way to adopt existing resources into Helm if you add the appropriate annotations and labels to them: helm/helm#7649

Just add the following annotations and labels to existing objects:

annotations:
  meta.helm.sh/release-name: your-release-name
  meta.helm.sh/release-namespace: your-release-namespace
labels:
  app.kubernetes.io/managed-by: Helm

In theory, you can add the annotations and labels via kubectl annotate/kubectl label to the objects that AWS pre-creates, then run Helm. I haven't tested this yet, and it is still ugly if you're using something like Terraform, but in theory you don't have to delete the Kubernetes objects now.
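As a hedged sketch, the kubectl side of that could look like the following (the release name aws-vpc-cni and the exact set of pre-created objects are assumptions):

# Mark the pre-created namespaced objects as owned by the Helm release.
for obj in daemonset/aws-node serviceaccount/aws-node; do
  kubectl -n kube-system annotate --overwrite "$obj" \
    meta.helm.sh/release-name=aws-vpc-cni \
    meta.helm.sh/release-namespace=kube-system
  kubectl -n kube-system label --overwrite "$obj" app.kubernetes.io/managed-by=Helm
done
# Repeat for the cluster-scoped objects the chart also renders
# (e.g. the aws-node ClusterRole and ClusterRoleBinding) before running helm upgrade --install.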

@willejs
Contributor

willejs commented Jul 2, 2020

The above suggestion does not work, as matchLabels on the DaemonSet is an immutable field. I am going to attempt to allow us to override that in the chart and make a PR, which may be feasible if it does not try to modify any other immutable fields.
I am using helmfile and was planning on adding the annotations and labels as a 'prepare' event on the release (roughly as sketched below). I would rather not delete the whole existing 'release', as it could cause pods to fail sandbox creation at our scale. If this doesn't work, the only alternative is a blue/green deploy onto a new node group.
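For reference, the helmfile 'prepare' hook idea could look roughly like this (the adoption script name is hypothetical; it would apply the labels and annotations from the previous comment):

releases:
  - name: aws-vpc-cni
    namespace: kube-system
    chart: eks/aws-vpc-cni
    hooks:
      - events: ["prepare"]
        command: "./adopt-aws-node-into-helm.sh"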

@TBBle

TBBle commented Jul 13, 2020

Now that #192 has been merged with documentation on how to adopt the EKS-deployed resources into the Helm chart, is there anything more to do here?

@Skaronator

Skaronator commented Oct 27, 2020

What happens when you just delete the existing resources and install the CNI helm chart a few seconds later? Does everything recover or do you break your cluster and existing nodes?

I did that yesterday and everything seemed to continue working just fine until I had replaced all the nodes in my cluster. The AWS NLB stopped working and I only got an "empty reply from server" error until I deleted the LB (by deleting the ingress-nginx Helm chart) and created a new one.

Update: I did the exact same thing on another cluster and everything went well; the only difference was that I was a bit faster doing everything. Weird. Maybe it was just an AWS EKS bug, since I did both changes together with an EKS upgrade.

@v-yarotsky

Would it be an option to use a pre-install Helm hook to add the labels and annotations to existing resources? It seems like that would make it much easier to install the chart onto existing clusters.

@TBBle

TBBle commented Feb 26, 2021

I don't think so: Helm won't run the pre-install hooks if it isn't going to do the install because there are conflicting resources it cannot adopt.

@jdn5126
Contributor

jdn5126 commented Nov 22, 2022

VPC CNI is fixing the CRDs issue for Helm v3 in aws/amazon-vpc-cni-k8s#2144

@yama6a

yama6a commented Dec 4, 2022

If you run into problems with @3oris's great workaround (see above), you can try replacing the shell script with this one. Some things have changed since he posted his solution, so I applied some fixes. This works for me now; thanks a lot, @3oris, for your solution!

Fixes:

  • make mktemp macOS-compatible (the -p flag is not portable; use -t instead)
  • fix the output filter used to determine the VPC CNI add-on version ($VERSION)
  • fix the location of aws-k8s-cni.yaml, which has changed since v1.10

#!/bin/bash

#CLUSTER_ENDPOINT
#CLUSTER_CA
#CLUSTER_TOKEN
#REGION

until curl -k -s $CLUSTER_ENDPOINT/healthz >/dev/null; do sleep 4; done

MANIFEST_FILE=`mktemp -t manifest_`
CONFIG_FILE=`mktemp -t config_`
CA_FILE=`mktemp -t ca_`

trap "{ rm -f $MANIFEST_FILE $CONFIG_FILE $CA_FILE; }" EXIT

echo $CLUSTER_CA | base64 -d > $CA_FILE

VERSION=$(kubectl get ds aws-node -n kube-system -o yaml \
  --kubeconfig $CONFIG_FILE \
  --server $CLUSTER_ENDPOINT \
  --certificate-authority $CA_FILE \
  --token "$CLUSTER_TOKEN" \
  | grep -i "image: \d" | grep amazon-k8s-cni: | cut -d "/" -f 2 | cut -d ":" -f 2 | sed -E 's#^v([[:digit:]]+)\.([[:digit:]]+)\..+$#\1.\2#g')

curl -sqL \
  https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-$VERSION/config/master/aws-k8s-cni.yaml \
  | sed -e "s#us-west-2#$REGION#" > $MANIFEST_FILE

kubectl delete \
  --kubeconfig $CONFIG_FILE \
  --server $CLUSTER_ENDPOINT \
  --certificate-authority $CA_FILE \
  --token "$CLUSTER_TOKEN" \
  --filename $MANIFEST_FILE \
  --wait

It seems like Terraform doesn't like the helm_repository data source anymore for this use, so I replaced it like this:

resource "null_resource" "remove_aws_vpc_cni_plugin" {
  provisioner "local-exec" {
    environment = {
      CLUSTER_ENDPOINT = data.aws_eks_cluster.cluster.endpoint
      CLUSTER_CA       = data.aws_eks_cluster.cluster.certificate_authority.0.data
      CLUSTER_TOKEN    = data.aws_eks_cluster_auth.cluster.token
    }
    command = format("%s/remove-aws-vpc-cni-plugin.sh", path.module)
  }
}

resource "helm_release" "vcni" {
  depends_on = [null_resource.remove_aws_vpc_cni_plugin]
  repository = "https://aws.github.io/eks-charts"

  name       = "aws-vpc-cni"
  chart      = "aws-vpc-cni"
  namespace  = "kube-system"

  set {
    name  = "env.ENABLE_PREFIX_DELEGATION"
    value = true
  }
}

@jdn5126
Contributor

jdn5126 commented Jan 24, 2024

Closing, as the aws-vpc-cni Helm chart is Helm 3-compatible and the README has been updated to discuss adopting existing resources vs. fresh installs.

@jdn5126 jdn5126 closed this as completed Jan 24, 2024