Improve Cilium integration with managed Kubernetes providers #16631
Conversation
As Cilium will remove the node taint 'node.cilium.io/agent-not-ready=true:NoSchedule' once it is up and ready, the documentation now includes all the necessary steps for users to create clusters using that taint. Creating nodes with this taint prevents pods from being scheduled onto those nodes until Cilium has configured the node it is deployed on. Signed-off-by: André Martins <andre@cilium.io>
To replicate the same steps users follow, GH workflows will now create clusters with node taints, which Cilium will remove once it is ready on that node. Signed-off-by: André Martins <andre@cilium.io>
With a node taint set up at node creation, users will no longer be required to restart application pods, since application pods will only start once Cilium is deployed and running in the cluster. Signed-off-by: André Martins <andre@cilium.io>
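For GKE, the documented flow sets this taint at cluster (or node pool) creation time via the `--node-taints` flag. A hedged sketch; the cluster name and zone are placeholders:

```
gcloud container clusters create my-cluster \
  --node-taints node.cilium.io/agent-not-ready=true:NoSchedule \
  --zone europe-west1-b
```

Nodes created this way stay unschedulable for regular pods (Cilium's DaemonSet tolerates the taint) until the agent comes up and removes it.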
Speaking as a user of AKS and Cilium, I think @aanm's assessment here aligns with what I look for as an operator. We've had to deal with weird operational issues around the default pool that AKS forces you to launch for as long as we've been on AKS (almost two years now), so having to work around it while setting up a CNI is not a big surprise. It's much more important to me that I can follow the GSG and get a prod-ready cluster; we already had to build tooling to deal with the managed cluster bootstrap anyway.
@aanm / @thejosephstevens / @nbusseneau / @christarazi We are also users of AKS and Cilium together. In my opinion this change makes it very "hard" to use Cilium with AKS. It will also become very hard to update Cilium (or get the backport for 1.10) within an already-running AKS cluster. Will there be documentation on how to do this?

We deploy AKS clusters using Terraform, and this workaround will be very cumbersome to implement with Terraform. In my opinion, improving the STARTUP_SCRIPT would have been the better way to fix these issues, because it doesn't break so much in the deployment/operation tasks that operators have. See e.g. #16356 for AKS.

The best solution would be a better integration of Cilium into the cloud providers. For example, it should be possible to deploy AKS clusters with CNI=None, so that one can choose a "custom" CNI like Cilium to be installed and used. That way no problems would occur with Azure CNI being active before Cilium is ready, e.g. as suggested here: Azure/AKS#2092.

In my eyes this big change/merge/backport is not really a good thing for us operators. Sure, it makes Cilium more stable on cloud providers' managed Kubernetes offerings, but the price to pay is too high. As long as there is no direct integration into the cloud provider, investing time into the STARTUP_SCRIPT and making it much more fail-safe would have been the better way in my eyes.

@aanm with Azure/aks-engine#4476 you refer to "aks-engine". I think you should open the issue here: https://github.com/Azure/AKS/issues ... aks-engine is not the "managed AKS solution".
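For what it's worth, a secondary AKS node pool can carry the taint in Terraform; this is a hypothetical fragment (resource names, VM size, and count are assumptions), and it does not help with the forced default pool, which is part of why this is cumbersome:

```hcl
# Hypothetical sketch: an additional node pool tainted until Cilium is ready.
resource "azurerm_kubernetes_cluster_node_pool" "workload" {
  name                  = "workload"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D4s_v3"
  node_count            = 3

  # Keep application pods off these nodes until the Cilium agent removes the taint.
  node_taints = ["node.cilium.io/agent-not-ready=true:NoSchedule"]
}
```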
This should be far easier to set up on AKS now: https://docs.microsoft.com/en-us/azure/aks/use-byo-cni?tabs=azure-cli and won't need the node taint hack.
How does this avoid needing the node taint? As best I can tell, AKS still expects you to deploy a DaemonSet for the CNI, so you would still have race conditions when scheduling on node scale-up.
FYI @mattstam @thejosephstevens the documentation for the BYO CNI is being done in #19379
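Per the Microsoft page linked above, the BYO CNI path creates the cluster without any CNI plugin preinstalled; a hedged sketch, with resource group and cluster name as placeholders:

```
az aks create \
  --resource-group my-rg \
  --name my-cluster \
  --network-plugin none
```

Cilium is then installed into the cluster afterwards, so Azure CNI is never active first.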
This PR introduces steps on how to install Cilium on cloud providers. The reasoning behind this PR is described in detail in #16602. Initially, the changes in this PR were supposed to be small, but the commit "pkg/k8s: replace GetNode instances with a k8s store" is necessary to ensure that nodes that get the taint node.cilium.io/agent-not-ready set after Cilium has started will also have that taint removed without restarting Cilium, by watching for Node events.

Marked this PR as needs-backport/1.10 as it tremendously improves the deployment of Cilium on cloud providers.
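The taint-removal step described above can be sketched as a pure function. This is a simplified illustration, not Cilium's actual Go implementation: the agent reacts to Node events and patches the node's spec, but the core of that patch is just filtering its taint out of the list.

```python
# Key of the taint Cilium removes once the agent is ready on a node.
AGENT_NOT_READY_TAINT = "node.cilium.io/agent-not-ready"

def remove_agent_not_ready_taint(taints):
    """Return the node's taint list with the Cilium readiness taint dropped.

    Each taint is modeled as a dict with "key", "value", and "effect",
    mirroring the shape of a Kubernetes Node's spec.taints entries.
    """
    return [t for t in taints if t["key"] != AGENT_NOT_READY_TAINT]

taints = [
    {"key": "node.cilium.io/agent-not-ready", "value": "true", "effect": "NoSchedule"},
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},
]
print(remove_agent_not_ready_taint(taints))
# → [{'key': 'dedicated', 'value': 'gpu', 'effect': 'NoSchedule'}]
```

Because the agent now watches Node events, this filtering also fires when the taint is added to a node after Cilium has already started, which is the case the "pkg/k8s" commit addresses.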
It should be easier to review the changes on a per-commit basis.
Fixes #7404
Fixes #16602
Fixes #15177
Fixes #16542