
workflows: update AKS workflows with new taints #584

Merged · 1 commit merged into master on Oct 20, 2021
Conversation

nbusseneau
Member

Re-impacted from: cilium/cilium#17529

Context: we recommend users taint all nodepools with node.cilium.io/agent-not-ready=true:NoSchedule to prevent application pods from being managed by the default AKS CNI plugin.

To this end, the workflow we proposed for users installing Cilium on AKS was to replace the initial AKS node pool with a new tainted system node pool, since it is not possible to taint the initial AKS node pool, cf. Azure/AKS#1402.

AKS recently pushed a change on the API side that forbids setting custom taints on system node pools, cf. Azure/AKS#2578.

It is therefore no longer possible for us to recommend that users taint all nodepools with node.cilium.io/agent-not-ready=true:NoSchedule to prevent application pods from being managed by the default AKS CNI plugin.

To work around this new limitation, we propose the following workflow instead:

  • Replace the initial node pool with a system node pool tainted with CriticalAddonsOnly=true:NoSchedule, preventing application pods from being scheduled on it.
  • Create a secondary user node pool tainted with node.cilium.io/agent-not-ready=true:NoSchedule to prevent application pods from being scheduled on the user node pool until Cilium is ready to manage them.
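The proposed workflow above can be sketched with the Azure CLI. The resource group, cluster, and pool names below are hypothetical placeholders for illustration, not values from this PR:

```shell
# Hypothetical names for illustration.
resourceGroup="test-rg"
clusterName="test-cluster"

# 1. Add a tainted system node pool; CriticalAddonsOnly keeps application
#    pods off of it.
az aks nodepool add \
  --resource-group "$resourceGroup" --cluster-name "$clusterName" \
  --name systempool --mode System \
  --node-taints "CriticalAddonsOnly=true:NoSchedule"

# 2. Delete the initial node pool, which cannot be tainted (Azure/AKS#1402).
az aks nodepool delete \
  --resource-group "$resourceGroup" --cluster-name "$clusterName" \
  --name nodepool1

# 3. Add a user node pool tainted so that application pods are not scheduled
#    on it until Cilium is ready to manage them.
az aks nodepool add \
  --resource-group "$resourceGroup" --cluster-name "$clusterName" \
  --name userpool --mode User \
  --node-taints "node.cilium.io/agent-not-ready=true:NoSchedule"
```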

@nbusseneau
Member Author

Link to workflow run testing PR changes: https://github.com/cilium/cilium-cli/actions/runs/1346273735

@nbusseneau
Member Author

@tklauser Weird issue while trying to run cilium install -- seems consistent:

2021-10-15T14:44:56.791125839Z 🔑 Generating certificates for Hubble...
2021-10-15T14:44:56.815733603Z 
2021-10-15T14:44:56.815775605Z Error: Unable to install Cilium: unable to create secret kube-system/hubble-server-certs: secrets "hubble-server-certs" already exists
2021-10-15T14:44:56.815802205Z ↩️ Rolling back installation...

This does not ring a bell to me: why would there already be a hubble-server-certs before install?
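For reference, a quick way to check for leftover secrets before an install, using standard kubectl (nothing specific to this PR):

```shell
# List the Cilium/Hubble secrets if they exist; --ignore-not-found avoids
# an error when they don't.
kubectl -n kube-system get secrets cilium-ca hubble-server-certs --ignore-not-found

# If hubble-server-certs exists, check when it was created:
kubectl -n kube-system get secret hubble-server-certs \
  -o jsonpath='{.metadata.creationTimestamp}'
```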

@tklauser
Member

This does not ring a bell to me: why would there already be a hubble-server-certs before install?

Not sure why that would be, to be honest. It also looks like the CA already exists, which is somewhat unexpected in a fresh cluster:

 2021-10-15T14:44:56.791087238Z 🔑 Found existing CA in secret cilium-ca

Is it possible that we somehow ended up (re-)using an existing cluster where Cilium was previously installed and not properly uninstalled?

@nbusseneau
Member Author

Is it possible that we somehow ended up (re-)using an existing cluster where Cilium was previously installed and not properly uninstalled?

Not possible as the clusters are unique per workflow run, so they are completely clean when created. I'll try to reproduce manually on Monday...

@tklauser
Member

Is it possible that we somehow ended up (re-)using an existing cluster where Cilium was previously installed and not properly uninstalled?

Not possible as the clusters are unique per workflow run, so they are completely clean when created. I'll try to reproduce manually on Monday...

From the sysdump it looks like Cilium is already fully deployed before the installation of the agent DaemonSet and operator Deployment from the cilium-cli-install pod even started 🤔

NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
kube-system   azure-cni-networkmonitor-jf7pc        1/1     Running   0          50s     10.240.0.66   aks-userpool-17851042-vmss000000     <none>           <none>
kube-system   azure-cni-networkmonitor-k4275        1/1     Running   0          48s     10.240.0.97   aks-userpool-17851042-vmss000001     <none>           <none>
kube-system   azure-cni-networkmonitor-nlkf5        1/1     Running   0          98s     10.240.0.35   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   azure-ip-masq-agent-jpjww             1/1     Running   0          48s     10.240.0.97   aks-userpool-17851042-vmss000001     <none>           <none>
kube-system   azure-ip-masq-agent-l89s7             1/1     Running   0          98s     10.240.0.35   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   azure-ip-masq-agent-zwc5f             1/1     Running   0          50s     10.240.0.66   aks-userpool-17851042-vmss000000     <none>           <none>
kube-system   cilium-8gqhc                          1/1     Running   0          50s     10.240.0.66   aks-userpool-17851042-vmss000000     <none>           <none>
kube-system   cilium-cli-install-qfxh7              0/1     Error     0          35s     10.240.0.66   aks-userpool-17851042-vmss000000     <none>           <none>
kube-system   cilium-gl87d                          1/1     Running   0          48s     10.240.0.97   aks-userpool-17851042-vmss000001     <none>           <none>
kube-system   cilium-operator-8ff45c8f8-lftqk       1/1     Running   0          35s     10.240.0.97   aks-userpool-17851042-vmss000001     <none>           <none>
kube-system   cilium-rqfs7                          1/1     Running   0          98s     10.240.0.35   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   coredns-autoscaler-54d55c8b75-przwk   1/1     Running   0          2m47s   10.240.0.45   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   coredns-d4866bcb7-f4jt4               1/1     Running   0          103s    10.240.0.57   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   coredns-d4866bcb7-fh6tz               0/1     Running   0          13s     10.240.0.54   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   kube-proxy-kps52                      1/1     Running   0          48s     10.240.0.97   aks-userpool-17851042-vmss000001     <none>           <none>
kube-system   kube-proxy-m5qfr                      1/1     Running   0          50s     10.240.0.66   aks-userpool-17851042-vmss000000     <none>           <none>
kube-system   kube-proxy-qj654                      1/1     Running   0          98s     10.240.0.35   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   metrics-server-569f6547dd-n6vfz       1/1     Running   0          2m47s   10.240.0.48   aks-systempool-17851042-vmss000000   <none>           <none>
kube-system   tunnelfront-6dc5bb978-m6hhv           1/1     Running   0          2m47s   10.240.0.42   aks-systempool-17851042-vmss000000   <none>           <none>

@nbusseneau nbusseneau marked this pull request as draft October 18, 2021 13:51
@nbusseneau nbusseneau temporarily deployed to ci October 18, 2021 15:07 Inactive
@nbusseneau
Member Author

nbusseneau commented Oct 18, 2021

I found the issue. We use an in-cluster script to run cilium install, and the pod running this script is scheduled on the initial system nodepool, since it is the only one available at the time the script is installed via Helm. While the script runs, we delete the initial system nodepool and replace it with our own, so the pod gets rescheduled on another node. The script then restarts from the beginning and fails, because the secret was already created during the previous run.

  • Immediate solution (implemented in the next push): wait for the initial system nodepool to be deleted before installing our Helm chart.
  • Alternate solution: allow cilium install to run on top of a configuration where secrets are already set up. This would also help in case an installation is interrupted midway and re-run later.
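In workflow terms, the immediate solution amounts to deleting the initial node pool synchronously, i.e. without --no-wait, so that the node the install pod would run on is gone before the Helm chart is installed. A sketch, with hypothetical resource and chart names:

```shell
# Intentionally synchronous: no --no-wait, so this returns only once the
# initial node pool is fully deleted and the install pod cannot be
# scheduled there and then evicted mid-run.
az aks nodepool delete \
  --resource-group "$resourceGroup" --cluster-name "$clusterName" \
  --name nodepool1

# Only now install the in-cluster install script via Helm
# (release and chart path are hypothetical placeholders).
helm install cilium-cli-install ./cilium-cli-install-chart \
  --namespace kube-system
```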

@nbusseneau nbusseneau temporarily deployed to ci October 18, 2021 15:30 Inactive
@nbusseneau
Member Author

New link to workflow run testing workflow changes: https://github.com/cilium/cilium-cli/actions/runs/1355353366
Test passed, removing temporary commit and draft status.

@nbusseneau nbusseneau marked this pull request as ready for review October 18, 2021 15:53
@nbusseneau nbusseneau temporarily deployed to ci October 18, 2021 15:53 Inactive

@pchaigno pchaigno left a comment


Immediate solution (implemented in next push): wait at least for the initial system nodepool to be deleted before installing our Helm chart.

Should we push this change to cilium/cilium as well for consistency?

Probably worth a comment in the code, since reading the rest, one would expect a --no-wait in the delete command as well.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 18, 2021
@pchaigno pchaigno removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 18, 2021
@pchaigno pchaigno added the dont-merge/bad-bot To prevent MLH from marking ready-to-merge. label Oct 18, 2021
@nbusseneau
Member Author

nbusseneau commented Oct 19, 2021

Should we push this change to cilium/cilium as well for consistency?

I would personally prefer not to. Arguments:

  • It is not necessary on cilium/cilium because we don't use in-cluster scripts there (cilium install runs from the workflow and thus cannot be interrupted), so it would only slow down cilium/cilium AKS testing (not by much, but we probably save around 30 seconds by having the Cilium components installed earlier).
  • Regarding consistency, I think it is more important to keep the test code consistent with the way users are supposed to install Cilium whenever possible, and that means using --no-wait as it currently appears in the Getting Started Guide.

Probably worth a comment in the code, since reading the rest, one would expect a --no-wait in the delete command as well.

Can do :)

@nbusseneau nbusseneau temporarily deployed to ci October 19, 2021 13:58 Inactive
@nbusseneau
Member Author

@pchaigno Added code comment linking to #589 for tracking purposes, PTAL.

@pchaigno pchaigno added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/bad-bot To prevent MLH from marking ready-to-merge. labels Oct 19, 2021
@nbusseneau
Member Author

https://github.com/cilium/cilium-cli/actions/workflows/aks.yaml can be re-enabled once this is merged :)

@aanm aanm merged commit db746f8 into master Oct 20, 2021
@aanm aanm deleted the pr/fix-aks-workflow branch October 20, 2021 14:00
Labels
area/CI Continuous Integration testing issue or flake ready-to-merge This PR has passed all tests and received consensus from code owners to merge.