
Improve Cilium integration with managed Kubernetes providers #16631

Merged

merged 5 commits into master from pr/fix-clouds on Jul 1, 2021

Conversation

@aanm (Member) commented Jun 23, 2021

This PR introduces steps on how to install Cilium on cloud providers. The reasoning behind this PR is described in detail in #16602. Initially, this PR was supposed to contain only a few changes, but the commit "pkg/k8s: replace GetNode instances with a k8s store" is necessary to ensure that nodes which get the taint node.cilium.io/agent-not-ready set after Cilium has started also have that taint removed, without restarting Cilium, by watching for Node events.

Marked this PR as needs-backport/1.10 as it tremendously improves the deployment of Cilium in cloud providers.

It should be easier to review the changes on a per-commit basis.

Provide new installation steps to deploy Cilium on managed Kubernetes providers (GKE, EKS, AKS) that allow node pools to be scaled up and down.
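For context, here is a minimal sketch of what such installation steps look like on the three providers. The cluster, node-pool, region, and resource-group names are placeholders, and the exact flags in the final documentation may differ; check each provider's CLI docs:

```sh
# GKE: create a node pool whose nodes start out tainted, so no pods are
# scheduled there until the Cilium agent removes the taint.
gcloud container node-pools create pool-1 \
  --cluster my-cluster --zone us-west2-a \
  --node-taints node.cilium.io/agent-not-ready=true:NoSchedule

# AKS: same idea via the az CLI.
az aks nodepool add \
  --cluster-name my-cluster --resource-group my-rg --name pool1 \
  --node-taints "node.cilium.io/agent-not-ready=true:NoSchedule"

# EKS: eksctl applies taints from a ClusterConfig.
cat <<'EOF' | eksctl create cluster -f -
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
nodeGroups:
  - name: ng-1
    taints:
      node.cilium.io/agent-not-ready: "true:NoSchedule"
EOF
```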

Fixes #7404
Fixes #16602
Fixes #15177
Fixes #16542

⚠️ Note for reviewers: this PR was run with a test commit to test the new changes in the conformance GH workflows. All of them have passed except multicluster, which is currently broken on master.

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label on Jun 23, 2021
@aanm added the release-note/major label (This PR introduces major new functionality to Cilium.) on Jun 23, 2021
maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Jun 23, 2021
@aanm force-pushed the pr/fix-clouds branch 3 times, most recently from 75481c4 to 8c0df17, on June 23, 2021 04:00
@aanm (Member, Author) commented Jun 23, 2021

test-me-please

@aanm force-pushed the pr/fix-clouds branch 2 times, most recently from 3968932 to 4e4db06, on June 23, 2021 14:01
@aanm (Member, Author) commented Jun 23, 2021

test-runtime

@aanm (Member, Author) commented Jun 23, 2021

test-runtime

1 similar comment
@aanm (Member, Author) commented Jun 23, 2021

test-runtime

@aanm (Member, Author) commented Jun 24, 2021

test-runtime

@aanm (Member, Author) commented Jun 24, 2021

test-me-please

@aanm force-pushed the pr/fix-clouds branch 4 times, most recently from b34e3c5 to 8403809, on June 25, 2021 00:40
@aanm closed this on Jun 25, 2021
@aanm reopened this on Jun 25, 2021
aanm added 3 commits July 1, 2021 22:48
As Cilium will remove the node taint
'node.cilium.io/agent-not-ready=true:NoSchedule' once it is up and
ready, the documentation has all the necessary steps for users to create
clusters using that taint. Having nodes created with this taint will
prevent pods from being scheduled onto those nodes until Cilium has
configured the node where it is deployed.

Signed-off-by: André Martins <andre@cilium.io>
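As an illustration of the taint lifecycle this commit relies on (a sketch only; the node name is a placeholder, and the agent removes the taint through the Kubernetes API rather than via kubectl):

```sh
# A freshly created node carries the taint, so regular pods stay Pending:
kubectl get node my-node -o jsonpath='{.spec.taints}'
# [{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}]

# Manual equivalent of what the Cilium agent does once it is up and
# ready on the node (the trailing '-' removes the taint):
kubectl taint nodes my-node node.cilium.io/agent-not-ready:NoSchedule-
```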
To replicate the same steps users perform, GH workflows will now create
clusters with node taints, which Cilium will remove once it is ready on
each node.

Signed-off-by: André Martins <andre@cilium.io>
With a node taint set up on node creation, users will no longer be
required to restart application pods, since application pods will only
start once Cilium is deployed and running in the cluster.

Signed-off-by: André Martins <andre@cilium.io>
pkg/k8s/node.go (review thread resolved)
@aanm merged commit 703b38f into master on Jul 1, 2021
@aanm deleted the pr/fix-clouds branch on July 1, 2021 22:41
@thejosephstevens (Contributor) commented:

> > I'm holding off on deep-diving into the workflow changes as I have a bit of a high-level concern to address first: this solution, while neat, feels like a complete kludge from a user perspective. It complicates cluster setup a lot, especially on AKS and EKS where node pools are cumbersome to manage, while also restricting the options available for setting up the clusters (e.g. the load balancer choices on AKS). Are we really OK with these drawbacks? Do we want this to be the default Cilium installation experience?
>
> I see it as an unfortunate price to pay. Right now the AKS integration is broken, and although usability is worse with this PR, it guarantees that there won't be any Cilium issues in it. Once we have a better integration in AKS, and/or Azure/aks-engine#4476 is fixed, the usability will improve.
>
> > I would advocate for having this as an optional thing for advanced users, and not the default installation. In particular, the simple getting started guide suddenly becomes a whole lot more complex :/
>
> The issue I see with it is that people assume the GSG is enough to run in production, which it kind of should be, and if they only follow the "simple" GSG, they might face the issues we are trying to avoid with this PR.

Speaking as a user of AKS and Cilium, I think @aanm 's assessment here aligns with what I look for as an operator. We've had to deal with weird operational issues around the default pool that AKS forces you to launch for as long as we've been on AKS (almost two years now), so having to work around it while setting up a CNI is not a big surprise.

It's a lot more important to me that I can follow the GSG and get a prod-ready cluster; we already had to build tooling to deal with the managed cluster bootstrap anyway.

@aanm added this to "Needs backport from master" in 1.10.3 on Jul 2, 2021
@aanm removed this from "Needs backport from master" in 1.10.2 on Jul 2, 2021
@smnmtzgr commented Jul 7, 2021

@aanm / @thejosephstevens / @nbusseneau / @christarazi

We are also users of AKS and Cilium together.

In my opinion this change makes it very "hard" to use Cilium with AKS. It will also become very hard to update Cilium (or get the backport for 1.10) within an already running AKS cluster. Will there be a documentation section on how to do this? We deploy AKS clusters using Terraform, and this workaround will be very cumbersome to implement with Terraform.

In my opinion, improving the STARTUP_SCRIPT would have been the better way to fix these issues, because it doesn't break as much in the deployment/operation tasks that operators have. E.g. #16356 for AKS.

The best solution would be a better integration of Cilium into the cloud providers. For example, it should be possible to deploy AKS clusters with CNI=None, so that one can choose a "custom" CNI like Cilium to be installed and used. Then no problems would occur with Azure CNI being active before Cilium is ready, e.g. as suggested here: Azure/AKS#2092.
Sure, that's something you would have to discuss with the cloud providers, but it would be the best option.

In my eyes this big change/merge/backport is not really a good thing for us operators. Sure, it makes Cilium more stable on cloud providers' managed Kubernetes offerings, but the price to pay is too high. As long as there is no direct integration with the cloud providers, investing time into the STARTUP_SCRIPT to make it a lot more fail-safe would have been the better way in my eyes.

@aanm with Azure/aks-engine#4476 you refer to "aks-engine". I think you should open the issue here: https://github.com/Azure/AKS/issues ... aks-engine is not the "managed AKS solution".

@mattstam commented Apr 5, 2022

This should be far easier to set up on AKS now: https://docs.microsoft.com/en-us/azure/aks/use-byo-cni?tabs=azure-cli and won't need the node taint hack.
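For reference, a minimal sketch of the BYO CNI flow from the linked docs (cluster and resource-group names are placeholders):

```sh
# Create an AKS cluster with no CNI plugin preinstalled, then install
# Cilium into it; nodes stay NotReady until a CNI is running.
az aks create --name my-cluster --resource-group my-rg \
  --network-plugin none
```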

@thejosephstevens (Contributor) commented:

> This should be far easier to set up on AKS now: https://docs.microsoft.com/en-us/azure/aks/use-byo-cni?tabs=azure-cli and won't need the node taint hack.

How does this avoid needing the node taint? As best I can tell, AKS still expects you to deploy a DaemonSet for the CNI, so you would still have race conditions when scheduling on node scale-up.

@aanm (Member, Author) commented Apr 23, 2022

FYI @mattstam @thejosephstevens the documentation for the BYO CNI is being done in #19379

Labels
integration/cloud — Related to integration with cloud environments such as AKS, EKS, GKE, etc.
release-note/major — This PR introduces major new functionality to Cilium.
sig/k8s — Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.

Projects
1.10.3 — Backport done to v1.10