[aws-eks] AWSCDK-EKS-KubernetesResource started too early after AWSCDK-EKS-Cluster upgrade #7457
Comments
There is another side effect of this issue. I was looking into this upgrade scenario today and I see that right after this short Control Plane "update", CDK (or rather CloudFormation) starts updating the worker nodes' AutoScalingGroups. I don't think this is expected behavior: the Control Plane should be updated first, followed by the worker nodes.
Thanks for reporting this @pbaranow. Indeed it seems that the version upgrade completes prematurely. If, by any chance, you are able to paste here the CloudWatch logs from this update for the Lambda functions with descriptions "onEvent handler for EKS cluster resource provider" and "isComplete handler for EKS cluster resource provider", that would be super helpful.
The `UpdateClusterVersion` operation takes a while to begin, and until then the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun. Add logic to the cluster version update `onEvent` method to wait up to 5 minutes until the cluster status is no longer `ACTIVE`, so that the subsequent `isComplete` query will be based on the version update operation itself. Extended the timeout of `onEvent` to 15m to ensure it does not interrupt the operation. TESTING: Updated unit tests to verify this retry behavior and performed a manual upgrade test while examining the logs. Fixes #7457
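The wait described above can be sketched roughly as follows. This is a minimal Python sketch using boto3-style EKS calls; the actual provider handlers live in the aws-eks module, and the function name `wait_for_update_to_start` is illustrative, not the module's real code:

```python
import time

def wait_for_update_to_start(eks, cluster_name, timeout_seconds=300, poll_seconds=10):
    """Block inside onEvent until the cluster leaves ACTIVE, i.e. until the
    UpdateClusterVersion operation has actually started."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = eks.describe_cluster(name=cluster_name)["cluster"]["status"]
        if status != "ACTIVE":
            return status  # typically "UPDATING": safe to hand off to isComplete
        time.sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_name} still ACTIVE after {timeout_seconds}s")

# Stub client for illustration; in the Lambda this would be boto3.client("eks").
class StubEks:
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_cluster(self, name):
        return {"cluster": {"status": next(self._statuses)}}

print(wait_for_update_to_start(StubEks(["ACTIVE", "UPDATING"]), "demo", poll_seconds=0))
# prints "UPDATING"
```

With this wait in place, by the time `isComplete` first polls the cluster it observes `UPDATING` rather than a stale `ACTIVE`, so it no longer declares victory before the update has started.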
I re-ran the scenario, so a new set of logs is attached. It seems that the isComplete handler started a bit too early - the first request was logged at 06:21:02 and reported the cluster in Active state, while the onEvent handler started at 06:20:58 and logged that the update was starting at 06:21:00.
Thanks, see #7526 - fix is under way.
The `UpdateClusterVersion` operation takes a while to begin, and until then the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun. Modify how `IsComplete` is implemented for cluster version (and config) updates. Extract the update ID and use `DescribeUpdate` to monitor the status of the update. This also allows us to fix a latent bug and fail the update in case the version update failed. The update ID is returned from `OnEvent` via a custom field called `EksUpdateId` and passed on to the subsequent `IsComplete` invocation. This was already supported by the custom resource provider framework but not documented or officially tested, so we've added that here as well (docs + test). TESTING: Added unit tests to verify the new type of update waiter and performed a manual upgrade test while examining the logs. Fixes #7457
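Roughly, the new waiter judges the update by its own status rather than the cluster's. Below is a hedged Python sketch: `DescribeUpdate` is a real EKS API, but the handler shape, the `event` fields, and the stub client are illustrative assumptions, not the module's actual implementation:

```python
def is_complete(eks, event):
    """isComplete sketch: poll DescribeUpdate using the EksUpdateId that
    onEvent returned, instead of the cluster's coarse ACTIVE/UPDATING status."""
    update = eks.describe_update(
        name=event["ClusterName"], updateId=event["EksUpdateId"]
    )["update"]
    status = update["status"]  # InProgress | Successful | Failed | Cancelled
    if status == "InProgress":
        return {"IsComplete": False}
    if status == "Successful":
        return {"IsComplete": True}
    # Previously a failed version update could go unnoticed; now it fails the resource.
    raise RuntimeError(f"cluster update {event['EksUpdateId']} ended as {status}")

# Stub client for illustration; the Lambda would use boto3.client("eks").
class StubEks:
    def __init__(self, status):
        self._status = status

    def describe_update(self, name, updateId):
        return {"update": {"status": self._status}}

event = {"ClusterName": "demo", "EksUpdateId": "update-1234"}
print(is_complete(StubEks("InProgress"), event))  # {'IsComplete': False}
print(is_complete(StubEks("Successful"), event))  # {'IsComplete': True}
```

Tying completion to the specific update operation also means a `Failed` or `Cancelled` update surfaces as a CloudFormation failure instead of silently reporting success.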
During an EKS cluster upgrade from 1.14 to 1.15, together with a Cluster Autoscaler update, the CA update failed while trying to prepare the kubeconfig file. CDK version used: 1.31.0.
I have a Python CDK script which describes the EKS cluster and deploys Cluster Autoscaler as part of the setup. The CA version matches the EKS version. I had EKS 1.14 with CA 1.14 deployed and running, and decided to upgrade to 1.15.
When checking the diff I see that the Custom::AWSCDK-EKS-Cluster resource will be upgraded to 1.15, the LaunchConfigs for the ASGs will be updated, and the Custom::AWSCDK-EKS-KubernetesResource with Cluster Autoscaler will be updated.
I deployed the changes. Excerpts from the logs:
If I understand correctly, the Control Plane update starts at 12:30:22 and is reported complete at 12:30:33. The AWS EKS Console shows the update was submitted at 12:30:27, which matches the log from the CDK CLI. But 11 seconds is suspiciously quick.
Then, 3 seconds after the Control Plane update is reported complete, the update of the Cluster Autoscaler resource starts and fails with error 255 while trying to prepare the kubeconfig file. A rollback starts automatically; at 12:31:30 there is another message about a Control Plane update, and at 12:53:03 another log that the update is complete - it took over 20 minutes. Then the rest of the rollback continues.
In the end the stack is in UPDATE_ROLLBACK_COMPLETE, but EKS is on 1.15 - it was not rolled back to 1.14, even though the rollback took over 20 minutes.
Another attempt to deploy the changes succeeded - the Cluster Autoscaler resource was updated, and there was no need to update the Control Plane since it was already at 1.15.
It seems that the Control Plane update had started but was not yet complete when the update of the KubernetesResource began. This caused the kubeconfig preparation to fail. The next attempt started when the 1.15 Control Plane was already running, so the resource update finished successfully.
Reproduction Steps
Error Log
Environment
Other
This is 🐛 Bug Report