workflows: improvements to CI 3.0 workflows #15694
Conversation
Links to working runs using the same version of the workflows as in this PR (can be checked by clicking on …)
Nice 🚀
Rebased on …. Link to run of the latest version of the EKS workflow: https://github.com/cilium/cilium/actions/runs/751917453
Just some small comments. I like the resource usage decrease a lot. I think we should try to decrease it even further after a couple of days of testing it out.
```diff
@@ -82,10 +82,12 @@ jobs:
           gcloud container clusters create ${{ env.clusterName }} \
             --labels "usage=pr,owner=${{ steps.vars.outputs.owner }}" \
             --zone ${{ env.zone }} \
+            --preemptible \
-            --image-type COS \
+            --image-type COS_CONTAINERD \
```
The documentation mentions that COS is being deprecated in clusters <1.19. To guarantee consistency for the test runs, we should pin the channel here as well as the k8s version. For example:

```
--release-channel=regular
--cluster-version=1.19.8-gke.1600
```

This information can be retrieved with:

```
$ gcloud container get-server-config --zone us-west2-a
```
I'm not sure what you're referring to 🤔

As mentioned in my commit message, my reading of the documentation is that COS is deprecated for >1.19, which is why we should switch to `COS_CONTAINERD`:

> Warning: Since GKE node version 1.19, the Container-Optimized OS with Docker (cos) variant is deprecated. Please migrate to the Container-Optimized OS with Containerd (cos_containerd) variant, which is now the default GKE node image.

As for version pinning, I think this is a bad idea. GCP constantly updates which versions are available (see here); pinning a specific version is a surefire way to have our CI break all the time 😅 As a reminder, this is something we already faced with our GKE clusters for Jenkins.
> As mentioned in my commit message, my reading of the documentation is that COS is deprecated for >1.19, which is why we should switch to `COS_CONTAINERD`:

I understand that, but not pinning the k8s version means we will select the default k8s version provided by the default release channel, which is 1.18. That's why I suggested pinning the k8s version and the release channel.
@aanm unfortunately there is no reliable way to pin the k8s version within a release channel, because versions in release channels are being retired all the time. If we pin now, we will have trouble provisioning clusters in a month or two and will need to update this workflow after we get some failing builds.

Edit: unless you want to pin only the major.minor k8s version and take whatever GKE has available for the micro version and gke.x variant? (Sketch below.)
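For illustration only: GKE's `--cluster-version` flag documents partial version aliases, so a major.minor-only pin might look like the sketch below. This is an assumption, not something tested against this workflow; `${CLUSTER_NAME}` and `${ZONE}` are hypothetical placeholders.

```sh
# Hypothetical sketch: pin only major.minor and let GKE resolve the
# newest matching patch release available in the chosen channel.
gcloud container clusters create "${CLUSTER_NAME}" \
  --zone "${ZONE}" \
  --release-channel regular \
  --cluster-version 1.19
```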
@nebril would that be possible? I just want to avoid being caught by surprise. It would also be a precise way of coordinating which k8s versions are being tested by just looking at the table of truth.
Don't worry, we will be monitoring updates when a new version is out.
I see `ready-to-merge`, but was this discussion resolved?
Yes: we'll tackle k8s version pinning at a later point, after we have wrapped up priorities for 1.10 CI 3.0 items. I have added it to the ever-growing list of workflow nice-to-haves :D
For the record, where is that list kept?
For now, in our private roadmap https://github.com/isovalent/roadmap/issues/2
```diff
@@ -89,8 +89,11 @@ jobs:
             --name ${{ env.name }} \
             --location ${{ env.location }} \
             --network-plugin azure \
```
same here for the k8s version
```diff
@@ -107,9 +107,12 @@ jobs:
         run: |
           eksctl create nodegroup \
             --cluster ${{ env.clusterName }} \
```
same here for the k8s version.
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
The goal is for the trigger to be displayed in the PR checks, so that developers may easily retry failed runs, as they are used to doing on Jenkins.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
- Use `COS_CONTAINERD`, which is the new default recommended by GKE: https://cloud.google.com/kubernetes-engine/docs/concepts/node-images#cos
- Switch to a cheaper `e2-custom-2-4096` machine type, which should be sufficient for our use case and is consistent with the 2 vCPUs / 4 GB machines from other providers: https://cloud.google.com/compute/docs/machine-types
- The default node disk is a standard persistent disk of 100 GB, but the minimum allowed is 10 GB and should be sufficient for short-lived tests: https://cloud.google.com/compute/docs/disks

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
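A minimal sketch of how these options map onto a `gcloud` invocation. The flag names are real `gcloud` flags; `${CLUSTER_NAME}` and `${ZONE}` are hypothetical placeholders, and the exact invocation in the workflow may differ.

```sh
# Sketch only: values taken from the commit message above.
gcloud container clusters create "${CLUSTER_NAME}" \
  --zone "${ZONE}" \
  --image-type COS_CONTAINERD \
  --machine-type e2-custom-2-4096 \
  --disk-type pd-standard \
  --disk-size 10
```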
- The default VM size is `Standard_DS2_v2`, but we can use as low as `Standard_B2s` while still meeting AKS requirements: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-general and https://docs.microsoft.com/en-us/azure/aks/use-system-pools#system-and-user-node-pools
- The default node disk size is 128 GB, but the minimum allowed is 30 GB and should be sufficient for short-lived tests: https://azure.microsoft.com/en-us/pricing/details/managed-disks/
- The default load balancer SKU is `standard`, but `basic` should be sufficient: https://docs.microsoft.com/en-us/azure/load-balancer/skus

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
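For illustration, the equivalent `az` CLI flags might look like the sketch below. The flag names are real `az aks create` flags; `${CLUSTER_NAME}` and `${RESOURCE_GROUP}` are hypothetical placeholders.

```sh
# Sketch only: values taken from the commit message above.
az aks create \
  --name "${CLUSTER_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --node-vm-size Standard_B2s \
  --node-osdisk-size 30 \
  --load-balancer-sku basic
```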
- The default instance type can be anything, but we should be fine using only cheap `t3.medium` or `t3a.medium` instances: https://aws.amazon.com/ec2/instance-types/
- The default EBS volume is a `gp3` of 80 GB, but we can use something smaller (such as 10 GB), which should be sufficient for short-lived tests: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
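A sketch of the corresponding `eksctl` flags, assuming the nodegroup-creation step shown in the diff above. The flag names are real `eksctl create nodegroup` flags; `${CLUSTER_NAME}` is a hypothetical placeholder.

```sh
# Sketch only: values taken from the commit message above.
eksctl create nodegroup \
  --cluster "${CLUSTER_NAME}" \
  --node-type t3.medium \
  --node-volume-size 10
```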
This was supposedly a side effect of cilium/cilium-cli#57 and was bound to be re-enabled once NodePort was fixed, but from my testing it seems we are no longer experiencing it on EKS.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Rebased on …. Links to runs using the same version of the workflows as in this PR (can be checked by clicking on …)
Question: does it make sense to keep …
@nbusseneau the …
Yes, that's true! We'll have to implement some sort of alerting system for when workflows break on merge. Right now I don't know if anyone gets notified, but I definitely don't.
Marking as ready to merge; we'll tackle k8s version pinning at a later point, after we have wrapped up priorities for 1.10 CI 3.0 items.
Please see individual commits.