
Helm chart change in dnsPolicy might be breaking for AWS Fargate users #5817

Closed
pdrastil opened this issue Mar 9, 2024 · 3 comments · Fixed by #5818
Labels
documentation Improvements or additions to documentation

Comments

@pdrastil

pdrastil commented Mar 9, 2024

Description

How can the docs be improved?

Hi, I would like to ask the maintainers to improve the installation / upgrade guides regarding the recent change to dnsPolicy in the Helm chart in v0.34.0 (#4947). This change is breaking for anyone running an AWS EKS cluster with a Fargate profile that hosts only the Karpenter controller. In this scenario Karpenter ends up in a crash-looping state with the following error:

{"level":"FATAL","time":"2024-03-08T22:41:14.975Z","logger":"controller","message":"Checking EC2 API connectivity, WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.us-east-1.amazonaws.com/\": dial tcp: lookup sts.us-east-1.amazonaws.com: i/o timeout","commit":"cd02bab"}

I would say use of Fargate is quite common for operators who don't want a special process for upgrading the initial node pool, so the dnsPolicy: Default option should be really well documented, as Karpenter will no longer start with the current Helm chart compared to previously released versions. The issue happens when Karpenter starts on an EKS cluster without nodes and is expected to provision the initial EC2 instances for kube-system, so core-dns and the other controllers start after Karpenter. This is usually the case when your kube-system contains other privileged containers (cilium, metrics-server, etc.) and it's unfeasible to create & manage individual Fargate profiles for each component.
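For anyone who hits this before the docs are updated, here is a minimal sketch of the workaround on the Helm side. The release name, namespace, and use of --reuse-values are assumptions about a typical install; dnsPolicy is the chart value that was changed in #4947:

```sh
# Revert the controller to the pre-v0.34.0 behavior: with dnsPolicy=Default
# the pod inherits the node's (VPC) resolver, so it can reach
# sts.<region>.amazonaws.com even when no cluster DNS is running yet.
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --reuse-values \
  --set dnsPolicy=Default
```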

Proposed improvements:

  • Include the error above in the troubleshooting guide, together with the dnsPolicy: Default fix (see the sketch above)
  • Include a note in the installation guide for users who want to use AWS Fargate: this option must be set if you plan to host only the Karpenter controller on AWS Fargate without core-dns. This prevents confusion for new adopters who follow the installation guide and end up with a crash-looping container.
  • Include a note in the upgrade guide that this is breaking for AWS Fargate users who use Fargate only to host the Karpenter controller. The current statement about dual-stack clusters / service meshes is completely misleading and seems harmless, as most operators do not explicitly set the service-mesh injection label on kube-system or similar namespaces, and AWS EKS clusters don't support dual-stack at this moment. Karpenter will be in a CrashLooping state even if core-dns is already hosted on an EC2 instance and available.
@pdrastil added the documentation and needs-triage labels Mar 9, 2024
@jonathan-innis
Contributor

and it's unfeasible to create & manage individual Fargate profiles for each component

Just curious about this. Couldn't you create a Fargate profile that just selected on the core DNS application so that core DNS was up when you started Karpenter? You shouldn't need a different Fargate profile for all of the components.
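For reference, a sketch of that approach; the cluster name is a placeholder, and per AWS's Fargate documentation the coredns deployment also needs its compute-type annotation removed before its pods will schedule onto Fargate:

```sh
# Create a Fargate profile that matches only the core DNS pods.
eksctl create fargateprofile \
  --cluster my-cluster \
  --name coredns \
  --namespace kube-system \
  --labels k8s-app=kube-dns

# EKS pins coredns to EC2 via an annotation; remove it so the pods
# can be rescheduled onto the new profile.
kubectl patch deployment coredns -n kube-system --type json \
  -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'
```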

should be really well documented as Karpenter will no longer start with current Helm chart compared to previously released versions

We went back and forth and ultimately decided to go with the Kubernetes default ClusterFirst as the default installation for Karpenter, since we believed this would be the least surprising configuration for users (#4947). As you can see, there's been a lot of back-and-forth discussion on which one is the "right" configuration for Karpenter, but ultimately we decided that users who were using ServiceMeshes with Karpenter were more surprised by this change.

Include error above in troubleshooting guide

Noted. We'll open a PR to add some more verbose lingo to the TS guide.

Include a note in the installation guide for users who want to use AWS Fargate: this option must be set if you plan to host only the Karpenter controller on AWS Fargate without core-dns

I'd agree that we may also want to call out the case where you aren't running any cluster DNS at all, though I'd imagine this kind of setup is incredibly uncommon: you are eventually going to want to run something that reaches out to a Kubernetes service and needs cluster DNS. I'd imagine the much more common setup is the one already called out in the docs, where the DNS is just not running when Karpenter starts up. If you think we should expand on our existing wording, there is currently a callout made here. We'd happily accept a docs PR if you had ideas around how to improve this wording.

Karpenter will be in CrashLooping state even if core-dns is already hosted on EC2 instance and available

This isn't true as far as I'm aware. Assuming that the core-dns pods are already up and running, Karpenter can run with a ClusterFirst DNS policy on Fargate, and this should be the default setup unless you are planning to have Karpenter manage your cluster DNS capacity under the hood.
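If it helps anyone debugging this, a quick way to check which policy a running install actually ended up with; the deployment name and namespace here are assumptions about a default install:

```sh
# Print the effective dnsPolicy on the Karpenter controller pod spec.
kubectl get deployment karpenter -n kube-system \
  -o jsonpath='{.spec.template.spec.dnsPolicy}'
```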

Include note in upgrade guide that this is breaking for AWS Fargate

Noted, the PR that updates the documentation for the TS item will also include this.

@pdrastil
Author

Hi @jonathan-innis, thanks for the reply.

Just curious about this. Couldn't you create a Fargate profile that just selected on the core DNS application so that core DNS was up when you started Karpenter? You shouldn't need a different Fargate profile for all of the components.

You are right that I can technically create a second Fargate profile that selects core-dns. For my use case it also means going through a security exception process, as pods running on Fargate bypass our security scanners, and this is hard to sell to the security team because we have commercial and FedRAMP environments. So far Karpenter is the only deployment that got security approval to do this on commercial setups, as the benefits were too hard to ignore.

We went back and forth and ultimately decided to go with the Kubernetes default ClusterFirst as the default installation for Karpenter, since we believed this would be the least surprising configuration for users (#4947). As you can see, there's been a lot of back-and-forth discussion on which one is the "right" configuration for Karpenter, but ultimately we decided that users who were using ServiceMeshes with Karpenter were more surprised by this change.

I'm fine with the change. I'm just pointing out that the upgrade guide for 0.34.x should call this out as breaking for this setup.

If you think that we should expand on our existing wording, there is currently a callout made here. We'd happily accept a docs PR if you had ideas around how to improve this wording.

Thanks for pointing out that this is already done; I just hadn't noticed it.

This isn't true as far as I'm aware. Assuming that the core-dns pods are already up and running, Karpenter can run with a ClusterFirst DNS policy on Fargate, and this should be the default setup unless you are planning to have Karpenter manage your cluster DNS capacity under the hood.

I've bumped into this when attempting to upgrade an existing cluster that was in a healthy state. I'll try to investigate more today to check whether this is related to running core-dns on EC2 and Karpenter on Fargate.

@pdrastil
Author

@jonathan-innis - I've managed to test both setups today, with the following results:

  1. coredns running on EC2 + Karpenter on Fargate
    This setup works only with the Default DNS policy. If ClusterFirst is used, Karpenter ends up in a crash-looping state.

  2. coredns & Karpenter both running on Fargate
    This setup works with the ClusterFirst policy and both components start fine.

Based on the AWS steps to reconfigure coredns for Fargate, it feels like there is some kind of isolation that explains why the mixed setup works only with the Default policy.
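For anyone who wants to reproduce the comparison, here is a sketch of what I ran; it assumes a Fargate profile that matches the test pod's namespace, and busybox's nslookup stands in for Karpenter's STS call:

```sh
# ClusterFirst (the chart's new default) sends lookups to cluster DNS;
# this times out when coredns isn't reachable from the Fargate pod.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"dnsPolicy":"ClusterFirst"}}' \
  -- nslookup sts.us-east-1.amazonaws.com

# Default inherits the VPC resolver and succeeds without coredns.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"dnsPolicy":"Default"}}' \
  -- nslookup sts.us-east-1.amazonaws.com
```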
