
workflows: improvements to CI 3.0 workflows #15694

Merged: 7 commits merged into cilium:master from pr/platform-workflows-fixes on Apr 20, 2021
Conversation

nbusseneau
Member

Please see individual commits.

@nbusseneau added the area/CI-improvement and release-note/ci labels on Apr 14, 2021
@nbusseneau
Member Author

Links to working runs using the same version of the workflows as in this PR (can be checked by clicking on <name>.yaml):

@tklauser (Member) left a comment:

Nice 🚀

@nbusseneau
Member Author

Rebased on master following merge of #15669, which added encryption to the EKS workflow. I noticed that the pod-to-local-nodeport exception we used on EKS no longer seems necessary, so I've removed it.

Link to run of the latest version of the EKS workflow: https://github.com/cilium/cilium/actions/runs/751917453

@aanm (Member) left a comment:

Just some small comments. I like the resource usage decrease a lot. I think we should try to decrease it even further after a couple of days of testing.

@@ -82,10 +82,12 @@ jobs:
       gcloud container clusters create ${{ env.clusterName }} \
         --labels "usage=pr,owner=${{ steps.vars.outputs.owner }}" \
         --zone ${{ env.zone }} \
         --preemptible \
-        --image-type COS \
+        --image-type COS_CONTAINERD \
Member

The documentation says that COS is being deprecated in clusters <1.19. To guarantee consistency for the test runs, we should pin the channel here as well as the k8s version. For example:

--release-channel=regular
--cluster-version=1.19.8-gke.1600

This information can be retrieved with:

$ gcloud container get-server-config --zone us-west2-a
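
For illustration, a minimal sketch of what the pinned creation call could look like, using the channel and version values from the suggestion above (they are examples only, not necessarily what the workflow should use; `${CLUSTER_NAME}` is a placeholder):

    # Hypothetical sketch: pin both the release channel and the k8s version
    # at cluster creation time (values are examples only).
    gcloud container clusters create "${CLUSTER_NAME}" \
      --zone us-west2-a \
      --release-channel regular \
      --cluster-version 1.19.8-gke.1600 \
      --image-type COS_CONTAINERD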

Member Author

I'm not sure what you're referring to 🤔

As mentioned in my commit message, my reading of the documentation is that COS is deprecated for >1.19, hence why we should switch to COS_CONTAINERD:

Warning: Since GKE node version 1.19, the Container-Optimized OS with Docker (cos) variant is deprecated. Please migrate to the Container-Optimized OS with Containerd (cos_containerd) variant, which is now the default GKE node image.

As for version pinning, I think this is a bad idea. GCP constantly updates which versions are available (see here); pinning a specific version is a surefire way to have our CI break all the time 😅 As a reminder, this is something we already faced with our GKE clusters for Jenkins.

Member

> As mentioned in my commit message, my reading of the documentation is that COS is deprecated for >1.19, hence why we should switch to COS_CONTAINERD:

I understand that, but not pinning the k8s version means we will get the default k8s version provided by the default release channel, which is 1.18. That's why I suggested pinning the k8s version and the release channel.
[screenshot: default GKE versions per release channel]

@nebril (Member) commented Apr 16, 2021

@aanm unfortunately there is no reliable way to pin a k8s version in a release channel, because versions in release channels are retired all the time. If we pin now, we will have trouble provisioning clusters in a month or two and will need to update this workflow after we get some failing builds.

Edit: unless you want to pin only the major.minor k8s version and take whatever GKE has available for the patch version and gke.x variant?
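
For example, a rough sketch of how the workflow could pin only the minor version and resolve the newest matching patch release at cluster-creation time. This assumes the `validMasterVersions` field returned by `get-server-config` and the `;` separator used by gcloud's `value()` formatter; `${ZONE}` and `${CLUSTER_NAME}` are placeholders:

    # Hypothetical sketch: pin major.minor only, resolve the newest patch/gke
    # variant currently offered in the zone.
    K8S_MINOR="1.19"
    CLUSTER_VERSION=$(gcloud container get-server-config --zone "${ZONE}" \
      --format="value(validMasterVersions)" | tr ';' '\n' \
      | grep "^${K8S_MINOR}\." | sort -rV | head -n 1)
    gcloud container clusters create "${CLUSTER_NAME}" \
      --zone "${ZONE}" \
      --cluster-version "${CLUSTER_VERSION}"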

Member

@nebril would that be possible? I just want to avoid being caught by surprise. Also, it would be a precise way of coordinating which k8s versions are being tested by just looking at the table of truth.

Member

Don't worry, we will be monitoring updates when a new version is out.

Member

I see ready-to-merge, but was this discussion resolved?

Member Author

Yes: we'll tackle K8s version pinning at a later point, after we have wrapped up priorities for 1.10 CI 3.0 items. I have added it to the ever-growing list of workflow nice-to-haves :D

Member

For the record, where is that list kept?

Member Author

For now, in our private roadmap https://github.com/isovalent/roadmap/issues/2

@@ -89,8 +89,11 @@ jobs:
--name ${{ env.name }} \
--location ${{ env.location }} \
--network-plugin azure \
Member

same here for the k8s version
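
A possible AKS equivalent, assuming the workflow keeps using `az aks create` (the version value is an example; `az aks get-versions --location <location>` lists what is currently available, and `${RESOURCE_GROUP}`/`${NAME}`/`${LOCATION}` are placeholders):

    # Hypothetical sketch: pin the AKS k8s version explicitly.
    az aks create \
      --resource-group "${RESOURCE_GROUP}" \
      --name "${NAME}" \
      --location "${LOCATION}" \
      --network-plugin azure \
      --kubernetes-version 1.19.9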

@@ -107,9 +107,12 @@ jobs:
run: |
eksctl create nodegroup \
--cluster ${{ env.clusterName }} \
Member

same here for the k8s version.
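
For EKS, a similar sketch with eksctl: the k8s version is pinned on the cluster, and nodegroups then follow the control plane version by default (`${CLUSTER_NAME}` is a placeholder):

    # Hypothetical sketch: pin the k8s version when creating the EKS cluster.
    eksctl create cluster \
      --name "${CLUSTER_NAME}" \
      --version 1.19 \
      --without-nodegroup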

.github/workflows/eks.yaml (review thread resolved)
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
The goal is for the trigger to be displayed on the PR checks, so that
developers may retry failed runs easily, as they are used to on Jenkins.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
- Use `COS_CONTAINERD`, which is the new default recommended by GKE
  https://cloud.google.com/kubernetes-engine/docs/concepts/node-images#cos
- Switch to a cheaper `e2-custom-2-4096` machine type which should be
  sufficient for our use case and is consistent with 2vCPUs/4GB machines
  from other providers.
  https://cloud.google.com/compute/docs/machine-types
- Default node disk is standard persistent disk of 100GB, but the
  minimum allowed is 10GB and should be sufficient for short-lived
  tests. https://cloud.google.com/compute/docs/disks

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
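
For reference, a sketch of roughly what the GKE settings described in this commit amount to as creation flags (surrounding flags and node count are illustrative, not the exact workflow invocation):

    # Hypothetical sketch of the node settings described above.
    gcloud container clusters create "${CLUSTER_NAME}" \
      --zone "${ZONE}" \
      --image-type COS_CONTAINERD \
      --machine-type e2-custom-2-4096 \
      --disk-type pd-standard \
      --disk-size 10GB \
      --num-nodes 2 \
      --preemptible
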
- Default VM size is `Standard_DS2_v2`, but we can use as low as
  `Standard_B2s` while still meeting AKS requirements.
  https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-general
  https://docs.microsoft.com/en-us/azure/aks/use-system-pools#system-and-user-node-pools
- Default node disk size is 128GB, but the minimum allowed is 30GB and
  should be sufficient for short-lived tests.
  https://azure.microsoft.com/en-us/pricing/details/managed-disks/
- Default load balancer SKU is `standard`, but `basic` should be
  sufficient. https://docs.microsoft.com/en-us/azure/load-balancer/skus

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
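
Likewise, a sketch of the corresponding `az aks create` flags (other flags omitted; resource names and node count are placeholders):

    # Hypothetical sketch of the AKS node and load balancer settings described above.
    az aks create \
      --resource-group "${RESOURCE_GROUP}" \
      --name "${NAME}" \
      --node-vm-size Standard_B2s \
      --node-osdisk-size 30 \
      --load-balancer-sku basic \
      --node-count 2
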
- The default instance type can be anything, but we should be fine
  using only cheap `t3.medium` or `t3a.medium`.
  https://aws.amazon.com/ec2/instance-types/
- Default EBS volume is a `gp3` of 80GB, but we can use something lower
  (such as 10GB) which should be sufficient for short-lived tests.
  https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
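
And a sketch of the matching `eksctl create nodegroup` flags (cluster name and node count are placeholders; `--node-type` takes a single type, so only one of the two instance types is shown):

    # Hypothetical sketch of the EKS nodegroup settings described above.
    eksctl create nodegroup \
      --cluster "${CLUSTER_NAME}" \
      --node-type t3.medium \
      --node-volume-size 10 \
      --nodes 2
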
This was supposedly a side-effect of cilium/cilium-cli#57
and was bound to be re-enabled once NodePort is fixed, but from my
testing it seems we are no longer experiencing it on EKS.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
@nbusseneau
Member Author

Rebased on master to get the latest changes, changed the machine type on GKE, and made a more significant standardization/refactoring pass (see last commit).

Links to runs using the same version of the workflows as in this PR (can be checked by clicking on <name>.yaml):

@nbusseneau
Member Author

Question: does it make sense to keep the push trigger so that workflows run on merge? Now that the workflows are added to test-me-please, I think marking them as required will be sufficient to test all PRs. No need to "double dip" by also checking on merge, IMO.

@aanm
Member

aanm commented Apr 16, 2021

> Question: does it make sense to keep the push trigger so that workflows run on merge? Now that the workflows are added to test-me-please, I think marking them as required will be sufficient to test all PRs. No need to "double dip" by also checking on merge, IMO.

@nbusseneau test-me-please will only run the tests against the PR changes; if we don't also test master, there is no guarantee that two PRs "conflicting" with each other won't break the e2e tests.
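
In other words, the workflows would keep something like the following trigger block so that master is still exercised after merges. This is a simplified sketch: the `issue_comment` trigger only stands in for however test-me-please actually invokes these workflows, which may differ.

    # Hypothetical sketch: run on pushes to master in addition to the
    # PR-driven trigger, so post-merge breakage on master is still caught.
    on:
      push:
        branches:
          - master
      issue_comment:
        types:
          - created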

@nbusseneau
Member Author

nbusseneau commented Apr 17, 2021

Yes, that's true! We'll have to implement some sort of alerting for when workflows break on merge. Right now I don't know if anyone gets notified, but I definitely don't.

@nbusseneau added the ready-to-merge label on Apr 19, 2021
@nbusseneau
Member Author

Marking as ready to merge, we'll tackle K8s version pinning at a later point, after we have wrapped up priorities for 1.10 CI 3.0 items.

@pchaigno merged commit cd1bd7d into cilium:master on Apr 20, 2021
@nbusseneau deleted the pr/platform-workflows-fixes branch on April 20, 2021 16:06
Labels
- area/CI-improvement: Topic or proposal to improve the Continuous Integration workflow
- ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
- release-note/ci: This PR makes changes to the CI.
Projects
1.10.0 (Awaiting triage)

Development
Successfully merging this pull request may close these issues: none yet.

6 participants