Issues with EKS module bootstrap and Cilium with newly released EKS AL2023 #1678
For 1 - I have it on my list to add AL2023 support to the EKS module so that it provides functionality similar to the other OSes we support, but glad you were able to sort it out with the generic `user_data_template_path` approach.
@ajvn I can take a look at number 2. There should be a default route via the primary ENI, which I assume is `ens5`. Since you are running Cilium, any secondary ENIs would be attached due to EC2 API calls from Cilium, so I am confused about the ordering here. Was the AWS VPC CNI running before you tried to run Cilium? It is possible that some other entity present in AL2023 is installing these routes, but the AMI should be disabling all udev triggers.
I've added a new node group to test out AL2023 in an existing cluster which is already using Cilium. The other node groups are using Ubuntu's take on EKS; there's a single default route on those nodes, only via the primary interface.
Quick update: the Cilium failure seems to happen only if the node initially fails to join the cluster due to some error (e.g. wrongly parsed user data). This leads me to believe that perhaps the second route is getting added after kubelet is started and the Cilium agent pod starts. I'm still not sure what's adding it.
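To pin down when the extra route shows up, a quick check one could run on the node (standard iproute2 commands; nothing here is specific to this AMI):

```sh
# List the current default routes with their metrics:
ip route show default

# Stream routing-table changes as they happen (e.g. while kubelet and the
# Cilium agent start), to catch the moment the second default route appears:
ip monitor route
```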
Point 1 is now resolved in v20.5.0 of the EKS module.
No worries, and thank you for tackling it this quickly. We have to address breaking changes for the v20.x upgrade anyway, and we have a workaround for the time being.
whoops - there was a typo there
Ah, alright 😄. I'll let you know once we get around to upgrading our module and testing this functionality out.
@bryantbiggs Gave it a go; the only "problem" is where the cluster service CIDR gets specified. If it's specified in the module itself, it will also be added to the user data of the non-AL2023 nodes that are part of that module block, in the form of an exported environment variable which is not used as a bootstrap script argument (I haven't checked whether it's used somewhere else). Doing the same as part of the node group definition should yield the same results, but I still haven't tested it; will do soon. See the sketch below for the two placements.
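A hypothetical Terraform sketch of the two placements being discussed; the input names are from my reading of terraform-aws-eks v20.x and should be verified against the module docs:

```hcl
# Sketch only: where the service CIDR can be supplied. Input names
# (cluster_service_ipv4_cidr, cluster_service_cidr) are assumptions.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.5"

  cluster_name    = "example"
  cluster_version = "1.26"

  # Module-level: propagates into the user data of every node group
  # in this block, including non-AL2023 ones.
  cluster_service_ipv4_cidr = "172.20.0.0/16"

  eks_managed_node_groups = {
    al2023 = {
      ami_type = "AL2023_x86_64_STANDARD"
      # Node group-level alternative (untested in this thread):
      # cluster_service_cidr = "172.20.0.0/16"
    }
  }
}
```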
Thank you for that info - I assume you are using self-managed node groups? For managed node groups we'll need to wait for the next provider release, which has the new AMI types: #1696 (comment). In terms of the CIDR, are you saying that the cluster service CIDR is required?
We are using managed node groups, but with a custom AMI (for now using one obtained via the SSM parameter). You can reproduce it like this:
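A sketch of that setup under my assumptions: point a managed node group at the EKS-optimized AL2023 AMI fetched via the public SSM parameter (the parameter path follows the announcement's convention; the Kubernetes version matches this cluster's 1.26):

```sh
# Fetch the current EKS-optimized AL2023 AMI ID for Kubernetes 1.26,
# then pass it to the node group / launch template as a custom AMI.
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.26/amazon-linux-2023/x86_64/standard/recommended/image_id \
  --query Parameter.Value \
  --output text
```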
So if I tried …
@cartermckinnon / @ndbaker1 does `nodeadm` derive the cluster DNS IP from the service CIDR?
I got the same behavior; my repository is here. I'm going to roll back this PR because Cilium isn't starting, with the error:

```
failed to start: daemon creation failed: error while initializing daemon: failed while reinitializing datapath: unable to install ip rule for ENI multi-node NodePort: failed to find interface with default route: Found multiple default routes with the same priority: {Ifindex: 2 Dst: 0.0.0.0/0 Src: 10.0.26.111 Gw: 10.0.16.1 Flags: [] Table: 254 Realm: 0} vs {Ifindex: 16 Dst: 0.0.0.0/0 Src: 10.0.29.146 Gw: 10.0.16.1 Flags: [] Table: 254 Realm: 0}" subsys=daemon
```

PS: One detail: the error only appears on Karpenter nodes... I might be missing something there. Digging... Let me know if you want me to run a few tests :)
It does; it takes the cluster service CIDR (IPv4/v6).
After more testing, the issue appears to be that both the default interface and the Cilium-created interface get a default route with the same priority. AL2023 uses systemd-networkd (via amazon-ec2-net-utils), and there's a generated configuration that seems to address all of the interfaces. The way we worked around this is by adding the following to the launch template:
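A sketch of the approach described (giving secondary ENIs a higher, i.e. worse, route metric so the primary ENI's default route wins), assuming amazon-ec2-net-utils generates per-interface networkd config at runtime; the file name, match pattern, and metric value here are illustrative, not the exact snippet from this thread:

```sh
#!/usr/bin/env bash
# Illustrative user-data fragment: de-prioritize default routes on
# secondary ENIs (ens6-ens9 here) so only the primary ENI (ens5) holds
# the preferred default route. Adjust the glob and metric as needed.
mkdir -p /etc/systemd/network
cat <<'EOF' > /etc/systemd/network/05-secondary-eni.network
[Match]
Name=ens[6-9]*

[Network]
DHCP=yes

[DHCPv4]
RouteMetric=1024
EOF
systemctl restart systemd-networkd
```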
You might want to adjust the interface match and metric for your environment. With the default values, after Cilium creates and attaches a new interface, you'd see something like this:
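An illustrative reconstruction of that routing table, reusing the addresses from the error quoted earlier (the metric value is a placeholder): two default routes with equal metrics, one per ENI.

```
default via 10.0.16.1 dev ens5 proto dhcp src 10.0.26.111 metric 512
default via 10.0.16.1 dev ens6 proto dhcp src 10.0.29.146 metric 512
```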
And then Cilium would report the original issue and go into a crash loop. With the workaround applied, the secondary interface's default route carries a higher metric, leaving a single preferred default route.
Cilium is happy, and everything works with all of the benefits of AL2023. The proper solution to this would be either on the AWS side, where user-attached secondary interface(s) would default to a lower priority than the default interface, or on the Cilium side.
@ajvn Thank you, this also fixed it for me. Note that the Cilium docs about AWS ENI configuration also mention a related setting.
I have implemented this as:
A second observation: I had a single …
Something like this?
What happened:
Hello folks, we've been testing the recently released EKS-optimized AL2023 AMI and have some feedback to provide.
While I'm aware that these might not strictly be issues with the AMI itself, I believe a large enough part of your user base has a similar setup that it's probably beneficial to let you know about the issues we've encountered.
When provided as part of `pre_bootstrap_user_data` with `enable_bootstrap_user_data` set to false, it seems like `nodeadm` cannot parse it, and the node fails during boot with `Could not find NodeConfig within UserData`. This can be sorted by using a pure YAML config file provided via `user_data_template_path`, and it looks something like this:
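A minimal sketch of such a NodeConfig; the `${...}` template variables are assumptions based on how the module's `templatefile()` inputs are commonly named, so verify them against the module source. Depending on how it's delivered, the file may also need the MIME multipart wrapper with the `application/node.eks.aws` content type.

```yaml
# Hypothetical nodeadm config template; variable names are placeholders.
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: ${cluster_name}
    apiServerEndpoint: ${cluster_endpoint}
    certificateAuthority: ${cluster_auth_base64}
    cidr: ${cluster_service_cidr}
```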
There are two default routes, via both the `ens5` and `ens6` interfaces, and Cilium fails with the "Found multiple default routes with the same priority" error quoted in the comments above. When locally testing a bare-bones AL2023 VDI image, there's only one default route, granted the underlying virtualization and hardware are vastly different. The solution to this is dropping the second default route on the `ens6` interface:
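A hedged one-liner for that, assuming `ens6` carries the duplicate route; note it is not persistent, since systemd-networkd may re-install the route on the next lease renewal:

```sh
# Remove the duplicate default route on the secondary interface.
sudo ip route del default dev ens6
```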
After these two conditions are handled, the node can join the cluster, becomes `Ready`, and the Cilium agent is running.

What you expected to happen:
As this is a Tech Preview, it's already in a pretty good state.
How to reproduce it (as minimally and precisely as possible):
Use the EKS-optimized AL2023 AMI from the initial announcement for one of the node groups in a cluster where Cilium is used.
Anything else we need to know?:
I don't think so.
Environment:
- EKS Platform version (`aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.12
- Kubernetes version (`aws eks describe-cluster --name <name> --query cluster.version`): 1.26
- AMI: ami-0bec930579ece14e9
- Kernel (`uname -a`): 6.1.75-99.163.amzn2023.x86_64
- Release information (`cat /etc/eks/release` on a node):
on a node):The text was updated successfully, but these errors were encountered: