
[aws-eks] AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade #7457

Closed
pbaranow opened this issue Apr 20, 2020 · 4 comments · Fixed by #7526 or bifravst/cloudformation-cleaner#13

Comments

@pbaranow

During an EKS cluster upgrade from 1.14 to 1.15, performed together with a Cluster Autoscaler update, the CA update failed while trying to prepare the kubeconfig file. CDK version used: 1.31.0.

I have a Python CDK script which describes an EKS cluster and deploys Cluster Autoscaler as part of the setup. The CA version matches the EKS version. I had EKS 1.14 with CA 1.14 deployed and running, and decided to upgrade to 1.15.

When checking the diff I see that Custom::AWSCDK-EKS-Cluster will be upgraded to 1.15, the LaunchConfigs for the ASGs will be updated, and the Custom::AWSCDK-EKS-KubernetesResource with Cluster Autoscaler will be updated.

I deployed the changes. Excerpts from the logs:

  0/36 | 12:29:28 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.KubectlProvider.NestedStack/@aws-cdk--aws-eks.KubectlProvider.NestedStackResource
  0/36 | 12:29:28 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.ClusterResourceProvider.NestedStack/@aws-cdk--aws-eks.ClusterResourceProvider.NestedStackResource
...
  1/36 | 12:30:15 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.KubectlProvider.NestedStack/@aws-cdk--aws-eks.KubectlProvider.NestedStackResource
  2/36 | 12:30:15 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack            | @aws-cdk--aws-eks.ClusterResourceProvider.NestedStack/@aws-cdk--aws-eks.ClusterResourceProvider.NestedStackResource
  2/36 | 12:30:22 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
  3/36 | 12:30:33 PM | UPDATE_COMPLETE      | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
  3/36 | 12:30:36 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | dev-eks/manifest-ClusterAutoscalerDeployment/Resource/Default
...
  7/36 | 12:30:45 PM | UPDATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | dev-eks/manifest-ClusterAutoscalerDeployment/Resource/Default
...
Failed to update resource. Error: Command '['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::000000000000:role/dev-eks-deveksCreationRole11111111-2222222222222', '--name', 'dev-eks', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.
...
  7/36 | 12:30:48 PM | UPDATE_ROLLBACK_IN_P | AWS::CloudFormation::Stack            | dev-eks The following resource(s) failed to update: [deveksmanifestClusterAutoscalerDeployment11111111].
...
  9/36 | 12:31:30 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
 9/36 Currently in progress: dev-eks
 10/36 | 12:53:03 PM | UPDATE_COMPLETE      | Custom::AWSCDK-EKS-Cluster            | dev-eks/Resource/Resource/Default
 10/36 | 12:53:04 PM | UPDATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | dev-eks/manifest-ClusterAutoscalerDeployment/Resource/Default

If I understand correctly, the Control Plane update starts at 12:30:22 and is reported complete at 12:30:33. I can see in the AWS EKS Console that the update was submitted at 12:30:27, which matches the log from the CDK CLI. But 11 seconds is strangely quick.

Then, 3 seconds after the Control Plane update is reported complete, the update of the Cluster Autoscaler resource starts and fails with exit status 255 while trying to prepare the kubeconfig file. A rollback starts automatically; at 12:31:30 there is another message about a Control Plane update, and at 12:53:03 another log that the update is complete - it took over 20 minutes. Then the rest of the rollback continues.

In the end the stack is in UPDATE_ROLLBACK_COMPLETE, but EKS is on 1.15 - it was not rolled back to 1.14, even though that step took 20 minutes.

Another attempt to deploy the changes succeeded - the Cluster Autoscaler resource was updated, and there was no need to update the Control Plane since it was already at 1.15.

It seems that the Control Plane update had been started but was not yet complete when the update of the KubernetesResource started. This caused the kubeconfig preparation to fail. The next attempt ran when the 1.15 Control Plane was already up, so the resource update finished successfully.

Reproduction Steps

  • CDK script with an EKS 1.14 cluster and a matching Cluster Autoscaler added as a KubernetesResource (using the add_resource method) - see the sketch after this list
  • deploy resources
  • change the script to deploy EKS 1.15 and the matching CA version
  • deploy resources
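
For illustration, a minimal Python CDK sketch of this setup follows. The construct names, capacity settings, and the abridged Cluster Autoscaler manifest are placeholders rather than the exact script from this report, and the exact `Cluster` props vary between CDK releases:

```python
from aws_cdk import core, aws_eks as eks


class EksStack(core.Stack):
    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # EKS control plane; bumping `version` from "1.14" to "1.15" is what
        # triggers the Custom::AWSCDK-EKS-Cluster update seen in the logs.
        cluster = eks.Cluster(
            self, "dev-eks",
            version="1.14",
            default_capacity=2,
        )

        # Cluster Autoscaler applied as a Custom::AWSCDK-EKS-KubernetesResource.
        # The real manifest is much larger; the relevant point is that the CA
        # image tag is kept in lock-step with the cluster version.
        ca_manifest = {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": "cluster-autoscaler", "namespace": "kube-system"},
            "spec": {
                # ... abridged; image: k8s.gcr.io/cluster-autoscaler:v1.14.x
            },
        }
        cluster.add_resource("ClusterAutoscalerDeployment", ca_manifest)
```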

Error Log

Failed to update resource. Error: Command '['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::000000000000:role/dev-eks-deveksCreationRole11111111-2222222222222', '--name', 'dev-eks', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.
    at invokeUserFunction (/var/task/framework.js:85:19)
    at process._tickCallback (internal/process/next_tick.js:68:7)
        new CustomResource (/tmp/jsii-kernel-HmLPhK/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.js:56:25)
        \_ new KubernetesResource (/tmp/jsii-kernel-HmLPhK/node_modules/@aws-cdk/aws-eks/lib/k8s-resource.js:19:9)
        \_ Cluster.addResource (/tmp/jsii-kernel-HmLPhK/node_modules/@aws-cdk/aws-eks/lib/cluster.js:254:16)
        \_ /usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7663:51
        \_ Kernel._wrapSandboxCode (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8296:20)
        \_ /usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7663:25
        \_ Kernel._ensureSync (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8272:20)
        \_ Kernel.invoke (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7662:26)
        \_ KernelHost.processRequest (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7371:28)
        \_ KernelHost.run (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7311:14)
        \_ Immediate._onImmediate (/usr/local/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7314:37)
        \_ processImmediate (internal/timers.js:439:21)
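
(For context: the kubectl provider behind Custom::AWSCDK-EKS-KubernetesResource prepares a kubeconfig by shelling out to the AWS CLI, roughly as in the simplified sketch below. This is not the actual handler source, just an illustration of why a control plane that is still mid-upgrade surfaces here as a non-zero exit status.)

```python
import subprocess

def prepare_kubeconfig(cluster_name: str, role_arn: str,
                       kubeconfig: str = "/tmp/kubeconfig") -> None:
    # Same command as in the error above; subprocess raises CalledProcessError
    # carrying the CLI's exit status (255 in this report) when the CLI fails,
    # e.g. because the cluster status is not ACTIVE while the control plane
    # is still updating.
    subprocess.check_call([
        "aws", "eks", "update-kubeconfig",
        "--role-arn", role_arn,
        "--name", cluster_name,
        "--kubeconfig", kubeconfig,
    ])
```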

Environment

  • Framework Version: 1.31.0
  • OS : Linux
  • Language : Python

Other


This is a 🐛 Bug Report

@pbaranow pbaranow added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 20, 2020
@SomayaB SomayaB added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Apr 20, 2020
@pbaranow
Author

There is another side effect of this issue. I was looking into this upgrade scenario today and I see that right after this short Control Plane "update", CDK (or rather CloudFormation) starts updating the worker node AutoScalingGroups. I don't think this is expected behavior, as the Control Plane should be updated first and only then followed by the worker node update.

@eladb
Contributor

eladb commented Apr 22, 2020

Thanks for reporting this @pbaranow. Indeed it seems that the version upgrade completes prematurely. If, by any chance, you are able to paste here the CloudWatch logs from this update for the Lambda functions with the descriptions "onEvent handler for EKS cluster resource provider" and "isComplete handler for EKS cluster resource provider", that would be super helpful.

@eladb eladb added the p1 label Apr 22, 2020
eladb pushed a commit that referenced this issue Apr 22, 2020
The `UpdateClusterVersion` operation takes a while to begin, and until then the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Add logic to the cluster version update `onEvent` method to wait up to 5 minutes until the cluster status is no longer `ACTIVE`, so that the subsequent `isComplete` query will be based on the version update operation itself.

Extended the timeout of `onEvent` to 15m to ensure it does not interrupt the operation.

TESTING: Updated unit tests to verify this retry behavior and performed a manual upgrade test while examining the logs.

Fixes #7457
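
As a rough illustration of the waiting logic this commit describes (a sketch, not the actual provider handler code), the `onEvent` side would poll DescribeCluster until the status leaves `ACTIVE`:

```python
import time
import boto3

eks = boto3.client("eks")

def wait_until_update_visible(cluster_name: str,
                              timeout_sec: int = 300,
                              interval_sec: int = 10) -> None:
    """Hypothetical helper: block until the cluster status is no longer ACTIVE,
    i.e. until the UpdateClusterVersion operation has actually started."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        status = eks.describe_cluster(name=cluster_name)["cluster"]["status"]
        if status != "ACTIVE":
            return  # update is now in progress; isComplete can poll meaningfully
        time.sleep(interval_sec)
    raise TimeoutError(f"{cluster_name} still ACTIVE after {timeout_sec}s")
```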
@pbaranow
Author

I re-ran the scenario, so a new set of logs is attached. It seems that the isComplete handler started a bit too early - its first request was logged at 06:21:02 and reported the cluster in Active state, while the onEvent handler started at 06:20:58 and logged that the update was starting at 06:21:00.

upgrade.sandbox.eks.txt

@eladb
Contributor

eladb commented Apr 23, 2020

Thanks, see #7526 - fix is under way.

@mergify mergify bot closed this as completed in #7526 Apr 23, 2020
mergify bot pushed a commit that referenced this issue Apr 23, 2020
The `UpdateClusterVersion` operation takes a while to begin, and until then the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Modify how `IsComplete` is implemented for cluster version (and config) updates. Extract the update ID and use `DescribeUpdate` to monitor the status of the update. This also allows us to fix a latent bug and fail the update in case the version update failed.

The update ID is returned from `OnEvent` via a custom field called `EksUpdateId` and passed on to the subsequent `IsComplete` invocation. This was already supported by the custom resource provider framework but not documented or officially tested, so we've added that here as well (docs + test).

TESTING: Added unit tests to verify the new type of update waiter and performed a manual upgrade test while examining the logs.

Fixes #7457
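
Sketched in Python for illustration (the actual handlers are part of the CDK cluster resource provider; the field name `EksUpdateId` comes from the commit message above, while the event shape and cluster-name lookup here are assumptions), the `IsComplete` side roughly becomes:

```python
import boto3

eks = boto3.client("eks")

def is_complete(event: dict) -> dict:
    """Hypothetical isComplete handler: poll the specific update started by
    onEvent, via DescribeUpdate, instead of looking at the cluster status."""
    update_id = event["EksUpdateId"]
    cluster_name = event["PhysicalResourceId"]  # assumption: physical id is the cluster name

    update = eks.describe_update(name=cluster_name, updateId=update_id)["update"]
    status = update["status"]  # InProgress | Successful | Failed | Cancelled

    if status == "InProgress":
        return {"IsComplete": False}
    if status == "Successful":
        return {"IsComplete": True}
    # Failed/Cancelled: raise so CloudFormation marks the update as failed,
    # addressing the latent bug mentioned above where a failed version update
    # previously went unnoticed.
    raise RuntimeError(f"update {update_id} ended with status {status}: {update.get('errors')}")
```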
@iliapolo iliapolo changed the title AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade [aws-eks] AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade Aug 16, 2020
@iliapolo iliapolo removed the needs-triage This issue or PR still needs to be triaged. label Aug 16, 2020