
[aws-eks] AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade #7457

Closed
pbaranow opened this issue Apr 20, 2020 · 4 comments · Fixed by #7526 or bifravst/cloudformation-cleaner#13

Comments

@pbaranow

During an EKS cluster upgrade from 1.14 to 1.15, performed together with a Cluster Autoscaler update, the CA update failed while trying to prepare the kubeconfig file. CDK version used: 1.31.0.

I have a Python CDK script which describes an EKS cluster and deploys Cluster Autoscaler as part of the setup. The CA version matches the EKS version. I had EKS 1.14 with CA 1.14 deployed and running, and decided to upgrade to 1.15.

When checking the diff I see that Custom::AWSCDK-EKS-Cluster will be upgraded to 1.15, the LaunchConfigs for the ASGs will be updated, and the Custom::AWSCDK-EKS-KubernetesResource with Cluster Autoscaler will be updated.

I deployed the changes. Excerpts from the logs:

  0/36 | 12:29:28 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.KubectlProvider.NestedStack/@aws-cdk--aws-eks.KubectlProvider.NestedStackResource
  0/36 | 12:29:28 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.ClusterResourceProvider.NestedStack/@aws-cdk--aws-eks.ClusterResourceProvider.NestedStackResource
...
  1/36 | 12:30:15 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.KubectlProvider.NestedStack/@aws-cdk--aws-eks.KubectlProvider.NestedStackResource
  2/36 | 12:30:15 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.ClusterResourceProvider.NestedStack/@aws-cdk--aws-eks.ClusterResourceProvider.NestedStackResource
  2/36 | 12:30:22 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
  3/36 | 12:30:33 PM | UPDATE_COMPLETE      | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
  3/36 | 12:30:36 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | dev-eks/manifest-ClusterAutoscalerDeployment/Resource/Default
...
  7/36 | 12:30:45 PM | UPDATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | dev-eks/manifest-ClusterAutoscalerDeployment/Resource/Default
...
Failed to update resource. Error: Command '['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::000000000000:role/dev-eks-deveksCreationRole11111111-2222222222222', '--name', 'dev-eks', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.
...
  7/36 | 12:30:48 PM | UPDATE_ROLLBACK_IN_P | AWS::CloudFormation::Stack            | dev-eks The following resource(s) failed to update: [deveksmanifestClusterAutoscalerDeployment11111111].
...
  9/36 | 12:31:30 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
 9/36 Currently in progress: dev-eks
 10/36 | 12:53:03 PM | UPDATE_COMPLETE      | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
 10/36 | 12:53:04 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | dev-eks/manifest-ClusterAutoscalerDeployment/Resource/Default

If I understand correctly, the Control Plane update starts at 12:30:22 and is reported complete at 12:30:33. I can see in the AWS EKS Console that the update was submitted at 12:30:27, which matches the log from the CDK CLI. But 11 seconds is strangely quick.

Then, 3 seconds after the Control Plane update is reported complete, the update of the Cluster Autoscaler resource starts and fails with exit status 255 while trying to prepare the kubeconfig file. A rollback starts automatically; at 12:31:30 there is another message about a Control Plane update, and at 12:53:03 another log that the update is complete - it took over 20 minutes. Then the rest of the rollback continues.

In the end the stack is in UPDATE_ROLLBACK_COMPLETE, but EKS is on 1.15 - it was not rolled back to 1.14, even though that step took 20 minutes.

Another attempt to deploy the changes succeeded - the Cluster Autoscaler resource was updated, and there was no need to update the Control Plane since it was already at 1.15.

It seems that the Control Plane update had been started but was not yet complete when the update of the KubernetesResource started. This caused the kubeconfig preparation to fail. The next attempt ran when the 1.15 Control Plane was already up, so the resource update finished successfully.

Reproduction Steps

  • CDK script with an EKS 1.14 cluster and a matching Cluster Autoscaler added as a KubernetesResource (using the add_resource method) - see the sketch after this list
  • deploy resources
  • change the script to deploy EKS 1.15 and the matching CA version
  • deploy resources
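
For illustration, a minimal Python CDK sketch of this setup follows. The construct names, capacity settings, and the abridged Cluster Autoscaler manifest are placeholders rather than the exact script from this report, and the exact `Cluster` props vary between CDK releases:

```python
from aws_cdk import core, aws_eks as eks


class EksStack(core.Stack):
    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # EKS control plane; bumping `version` from "1.14" to "1.15" is what
        # triggers the Custom::AWSCDK-EKS-Cluster update seen in the logs.
        cluster = eks.Cluster(
            self, "dev-eks",
            version="1.14",
            default_capacity=2,
        )

        # Cluster Autoscaler applied as a Custom::AWSCDK-EKS-KubernetesResource.
        # The real manifest is much larger; the relevant point is that the CA
        # image tag is kept in lock-step with the cluster version.
        ca_manifest = {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": "cluster-autoscaler", "namespace": "kube-system"},
            "spec": {
                # ... abridged; image: k8s.gcr.io/cluster-autoscaler:v1.14.x
            },
        }
        cluster.add_resource("ClusterAutoscalerDeployment", ca_manifest)
```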

Error Log

Failed to update resource. Error: Command '['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::000000000000:role/dev-eks-deveksCreationRole11111111-2222222222222', '--name', 'dev-eks', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.
    at invokeUserFunction (/var/task/framework.js:85:19)
    at process._tickCallback (internal/process/next_tick.js:68:7)
        new CustomResource (/tmp/jsii-kernel-HmLPhK/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.js:56:25)
        \_ new KubernetesResource (/tmp/jsii-kernel-HmLPhK/node_modules/@aws-cdk/aws-eks/lib/k8s-resource.js:19:9)
        \_ Cluster.addResource (/tmp/jsii-kernel-HmLPhK/node_modules/@aws-cdk/aws-eks/lib/cluster.js:254:16)
        \_ /usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7663:51
        \_ Kernel._wrapSandboxCode (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8296:20)
        \_ /usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7663:25
        \_ Kernel._ensureSync (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8272:20)
        \_ Kernel.invoke (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7662:26)
        \_ KernelHost.processRequest (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7371:28)
        \_ KernelHost.run (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7311:14)
        \_ Immediate._onImmediate (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7314:37)
        \_ processImmediate (internal/timers.js:439:21)
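
(For context: the kubectl provider behind Custom::AWSCDK-EKS-KubernetesResource prepares a kubeconfig by shelling out to the AWS CLI, roughly as in the simplified sketch below. This is not the actual handler source, just an illustration of why a control plane that is still mid-upgrade surfaces here as a non-zero exit status.)

```python
import subprocess

def prepare_kubeconfig(cluster_name: str, role_arn: str,
                       kubeconfig: str = "/tmp/kubeconfig") -> None:
    # Same command as in the error above; subprocess raises CalledProcessError
    # carrying the CLI's exit status (255 in this report) when the CLI fails,
    # e.g. because the cluster status is not ACTIVE while the control plane
    # is still updating.
    subprocess.check_call([
        "aws", "eks", "update-kubeconfig",
        "--role-arn", role_arn,
        "--name", cluster_name,
        "--kubeconfig", kubeconfig,
    ])
```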

Environment

  • Framework Version: 1.31.0
  • OS : Linux
  • Language : Python

Other


This is a 🐛 Bug Report

@pbaranow pbaranow added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 20, 2020
@SomayaB SomayaB added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Apr 20, 2020
@pbaranow
Author

There is another side effect of this issue. I was looking into this upgrade scenario today and I see that right after this short Control Plane "update", CDK (or rather CloudFormation) starts updating the worker node AutoScalingGroups. I don't think this is expected behavior, as the Control Plane should be updated first and only then followed by the worker node update.

@eladb
Contributor

eladb commented Apr 22, 2020

Thanks for reporting this @pbaranow. Indeed it seems that the version upgrade completes prematurely. If, by any chance, you are able to paste here the CloudWatch logs from this update for the Lambda functions with the descriptions "onEvent handler for EKS cluster resource provider" and "isComplete handler for EKS cluster resource provider", that would be super helpful.

@eladb eladb added the p1 label Apr 22, 2020
eladb pushed a commit that referenced this issue Apr 22, 2020
The `UpdateClusterVersion` operation takes a while to begin, and until then the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Add logic to the cluster version update `onEvent` method to wait up to 5 minutes until the cluster status is no longer `ACTIVE`, so that the subsequent `isComplete` query will be based on the version update operation itself.

Extended the timeout of `onEvent` to 15m to ensure it does not interrupt the operation.

TESTING: Updated unit tests to verify this retry behavior and performed a manual upgrade test while examining the logs.

Fixes #7457
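
As a rough illustration of the waiting logic this commit describes (a sketch, not the actual provider handler code), the `onEvent` side would poll DescribeCluster until the status leaves `ACTIVE`:

```python
import time
import boto3

eks = boto3.client("eks")

def wait_until_update_visible(cluster_name: str,
                              timeout_sec: int = 300,
                              interval_sec: int = 10) -> None:
    """Hypothetical helper: block until the cluster status is no longer ACTIVE,
    i.e. until the UpdateClusterVersion operation has actually started."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        status = eks.describe_cluster(name=cluster_name)["cluster"]["status"]
        if status != "ACTIVE":
            return  # update is now in progress; isComplete can poll meaningfully
        time.sleep(interval_sec)
    raise TimeoutError(f"{cluster_name} still ACTIVE after {timeout_sec}s")
```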
@pbaranow
Author

I re-ran the scenario, so a new set of logs is attached. It seems that the isComplete handler started a bit too early - its first request was logged at 06:21:02 and reported the cluster in Active state, while the onEvent handler started at 06:20:58 and logged that the update was starting at 06:21:00.

upgrade.sandbox.eks.txt

@eladb
Contributor

eladb commented Apr 23, 2020

Thanks, see #7526 - fix is under way.

@mergify mergify bot closed this as completed in #7526 Apr 23, 2020
mergify bot pushed a commit that referenced this issue Apr 23, 2020
The `UpdateClusterVersion` operation takes a while to begin, and until then the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Modify how `IsComplete` is implemented for cluster version (and config) updates. Extract the update ID and use `DescribeUpdate` to monitor the status of the update. This also allows us to fix a latent bug and fail the update in case the version update failed.

The update ID is returned from `OnEvent` via a custom field called `EksUpdateId` and passed on to the subsequent `IsComplete` invocation. This was already supported by the custom resource provider framework but not documented or officially tested, so we've added that here as well (docs + test).

TESTING: Added unit tests to verify the new type of update waiter and performed a manual upgrade test while examining the logs.

Fixes #7457
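
Sketched in Python for illustration (the actual handlers are part of the CDK cluster resource provider; the field name `EksUpdateId` comes from the commit message above, while the event shape and cluster-name lookup here are assumptions), the `IsComplete` side roughly becomes:

```python
import boto3

eks = boto3.client("eks")

def is_complete(event: dict) -> dict:
    """Hypothetical isComplete handler: poll the specific update started by
    onEvent, via DescribeUpdate, instead of looking at the cluster status."""
    update_id = event["EksUpdateId"]
    cluster_name = event["PhysicalResourceId"]  # assumption: physical id is the cluster name

    update = eks.describe_update(name=cluster_name, updateId=update_id)["update"]
    status = update["status"]  # InProgress | Successful | Failed | Cancelled

    if status == "InProgress":
        return {"IsComplete": False}
    if status == "Successful":
        return {"IsComplete": True}
    # Failed/Cancelled: raise so CloudFormation marks the update as failed,
    # addressing the latent bug mentioned above where a failed version update
    # previously went unnoticed.
    raise RuntimeError(f"update {update_id} ended with status {status}: {update.get('errors')}")
```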
@iliapolo iliapolo changed the title AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade [aws-eks] AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade Aug 16, 2020
@iliapolo iliapolo removed the needs-triage This issue or PR still needs to be triaged. label Aug 16, 2020