
fix(eks): cluster-resource-handler fails to verify delete operations #26375

Closed
wants to merge 9 commits

Conversation


@kishiel kishiel commented Jul 14, 2023

Motivation

The recent upgrade of the custom resource handler's underlying aws-sdk-js dependency caused deletion events to fail. The response structure differs between SDK versions, and the new response lacked the property the handler was previously targeting.

This change leverages the HTTP response status code to follow the same logic as before.
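As a rough illustration of the idea (this is not the actual handler code; the interface and function names below are hypothetical): aws-sdk-js v3 responses expose the status via `$metadata.httpStatusCode`, so a delete-verification check can key off that field rather than a v2-era response property:

```typescript
// Sketch only: a hypothetical delete-verification check against an
// SDK v3-style response shape. The real handler's types and logic differ.

// AWS SDK for JavaScript v3 command responses carry metadata like this:
interface SdkV3Response {
  $metadata: { httpStatusCode?: number };
}

// Treat the delete call as accepted when the API answered with a 2xx status.
function isDeleteAccepted(response: SdkV3Response): boolean {
  const status = response.$metadata.httpStatusCode;
  return status !== undefined && status >= 200 && status < 300;
}

console.log(isDeleteAccepted({ $metadata: { httpStatusCode: 200 } })); // true
console.log(isDeleteAccepted({ $metadata: {} }));                      // false
```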

Some additional changes that come with this fix:

  • Nginx dashboard's latest update breaks the cluster tests, and as a result I've removed them
  • Nginx ingress controller was also removed as it is affected by a separate new defect either in EKS or VPC (I can't tell) which causes NLBs created by the EKS cluster to survive stack deletion. This appears to be a transient issue, but it even appeared in the simplistic hello-k8s chart.

There's a broader question I have about the use of external Helm charts in these snapshot tests: because they don't target specific versions, the snapshot tests, which are supposed to be deterministic, become nondeterministic. Has the EKS team considered owning a few simple Helm chart repos that would provide us with stable versions (and thus deterministic tests)?

I spent nearly 5 days fighting with the snapshot tests and found that a few tests (eks-cluster-test, bottlerocket, and cluster-imported test) would never use new assets and would instead deploy the old ones. If I synthesized the test using `npx cdk -a test/integ.testname.js` and deployed that, it would use the new assets. The only way I got past this was to completely delete the existing snapshot directory. I couldn't find references to this behavior anywhere, but I wanted to call it out in case I deleted something that was not replaced by the snapshot update.

Remaining work

The unit tests for the cluster and fargate handlers were not updated. I took a stab at them, but the interface mocks we have are no longer accurate, and I couldn't get them to mimic this change.

Fixes #26325


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added bug This issue is a bug. effort/medium Medium work item – several days of effort p2 beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK labels Jul 14, 2023
@aws-cdk-automation aws-cdk-automation requested a review from a team July 14, 2023 23:47
@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 8eb6cce
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository


mrgrain commented Jul 17, 2023

This is great @kishiel thanks for looking into this. Can you check why the unit tests fail?


mrgrain commented Jul 17, 2023

Thanks again @kishiel. Gonna close this, as the team is looking into this with priority now.
Reference: #26283
