aws-eks: max HelmChart Timeout exceeds Lambda Duration #22257

Open
michaelfedell opened this issue Sep 27, 2022 · 4 comments
Labels
@aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service), bug (This issue is a bug.), effort/small (Small work item – less than a day of effort), p2

Comments

@michaelfedell
Contributor

Describe the bug

The HelmChart construct allows a user to set a timeout, which is passed to the Helm command along with the --wait option so that Helm waits for the timeout to elapse before marking the operation as failed. This timeout can be set to a maximum of 15 minutes, which is the same timeout given to the kubectl custom resource provider Lambda function.
However, initialization latency causes the Lambda timeout (15m) to expire before the Helm operation reaches its own timeout (also 15m). This severs the cluster connection and leaves orphaned Helm operations stuck in the pending-upgrade state. That state is hard to resolve: it typically requires a user to manually edit a Helm release secret or roll back the release (see this issue in the Helm repo for context).
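
To make the race concrete, here is a minimal sketch of the timing described above. It is a toy model only, not CDK or Helm code: the function name, the outcome labels, and the 5-second initialization latency are assumptions chosen for illustration. The only figures taken from this issue are the 15-minute Lambda timeout and the 15-minute Helm timeout.

```typescript
// Toy model of the timeout race; all names here are hypothetical.
const LAMBDA_TIMEOUT_SECONDS = 15 * 60; // handler timeout stated in this issue

type Outcome = 'helm-reports-failure' | 'lambda-killed-first';

/**
 * helmTimeoutSeconds: the value passed to `helm --timeout`.
 * initLatencySeconds: assumed time the handler spends before helm starts.
 */
function timeoutOutcome(helmTimeoutSeconds: number, initLatencySeconds: number): Outcome {
  // Helm can only report its own timeout if that moment falls
  // strictly inside the Lambda invocation's lifetime.
  const helmDeadline = initLatencySeconds + helmTimeoutSeconds;
  return helmDeadline < LAMBDA_TIMEOUT_SECONDS
    ? 'helm-reports-failure'
    : 'lambda-killed-first';
}

console.log(timeoutOutcome(900, 5)); // lambda-killed-first
console.log(timeoutOutcome(870, 5)); // helm-reports-failure
```

With any non-zero latency, a 900s Helm timeout lands past the Lambda deadline, which matches the severed connection reported here.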

Expected Behavior

Expect timeouts to align cooperatively so that the Lambda does not time out before the HelmChart can return a response.

Current Behavior

The Lambda times out first when HelmChart is given the max timeout, resulting in a severed connection and a tainted Helm installation.

Reproduction Steps

  1. choose or create an invalid helm installation
  2. create the cdk HelmChart construct with max timeout prop (Duration.minutes(15))
  3. cdk deploy
new eks.HelmChart(stack, 'MyFailingChart', {
  cluster,
  chart: 'chart',
  wait: true,
  timeout: Duration.minutes(15),
});

Possible Solution

Limit the max timeout for the HelmChart custom resource to be 870s (14.5m) instead of 900s to allow for initialization etc.

https://github.com/aws/aws-cdk/blob/main/packages/%40aws-cdk/aws-eks/lib/helm-chart.ts#L114-L116
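
A rough sketch of what such a cap could look like, written as standalone TypeScript rather than as a patch to the linked file. The constant and function names are hypothetical, and the 30-second headroom is simply the difference between the 900s Lambda timeout and the 870s proposed above; the actual initialization latency has not been measured here.

```typescript
// Hypothetical names; only the 900s and 870s figures come from this issue.
const PROVIDER_LAMBDA_TIMEOUT_SECONDS = 900;
const ASSUMED_HEADROOM_SECONDS = 30; // 900 - 870, as proposed above
const MAX_HELM_TIMEOUT_SECONDS =
  PROVIDER_LAMBDA_TIMEOUT_SECONDS - ASSUMED_HEADROOM_SECONDS;

/**
 * Returns the timeout unchanged when it fits under the cap and
 * throws otherwise. `undefined` means the user set no timeout.
 */
function validateHelmTimeout(timeoutSeconds?: number): number | undefined {
  if (timeoutSeconds === undefined) {
    return undefined;
  }
  if (timeoutSeconds > MAX_HELM_TIMEOUT_SECONDS) {
    throw new Error(
      `Helm chart timeout cannot be higher than ${MAX_HELM_TIMEOUT_SECONDS} seconds.`,
    );
  }
  return timeoutSeconds;
}

console.log(validateHelmTimeout(600)); // 600
```

Failing at synth time keeps the behavior explicit; silently clamping the value would be another option, but it would hide the change from the user.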

Additional Information/Context

No response

CDK CLI Version

2.39.1 (build f188fac)

Framework Version

2.28.1

Node.js Version

v16.14.0

OS

macOS

Language

TypeScript

Language Version

No response

Other information

Related to #22254

@michaelfedell michaelfedell added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 27, 2022
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Sep 27, 2022
@peterwoodworth
Contributor

This is an interesting issue, thanks for bringing it up @michaelfedell. Is there a way to ensure that these take the max amount of time so that I can try to reproduce it?

@peterwoodworth peterwoodworth added p2 effort/small Small work item – less than a day of effort and removed needs-triage This issue or PR still needs to be triaged. labels Sep 27, 2022
@michaelfedell
Contributor Author

Here's a quick and dirty helm chart that should continuously fail until timeout is reached:
https://github.com/michaelfedell/helm-fail

@michaelfedell
Contributor Author

If you reference that helm chart when installing and use --wait --timeout 30s, you'll see the helm failure. If you install that chart in the context of a lambda with --wait --timeout 15m (or with construct props wait: true, timeout: Duration.minutes(15)), you'll see the issue referenced above.

You could also simulate this with any custom lambda that has a shorter duration than the helm timeout, or by running the helm command locally and then killing the shell session before the timeout is reached, e.g.:

helm upgrade --wait --timeout 90s helm-fail ./helm-fail
^D
helm list -a

@peterwoodworth
Contributor

Thanks for the reproduction steps @michaelfedell,

This issue has been marked as p2, which means that we are unable to work on this immediately.

We use +1s to help prioritize our work, and are happy to re-evaluate this issue based on community feedback. You can reach out to the cdk.dev community on Slack to solicit support for reprioritization.

We accept contributions! Check out our contributing guide if you're interested - there's a low chance the team will be able to address this soon but we'd be happy to review a PR 🙂

@otaviomacedo otaviomacedo removed their assignment Nov 18, 2022