
workflows: improvements to CI 3.0 workflows #15694

Merged: 7 commits merged into cilium:master from pr/platform-workflows-fixes on Apr 20, 2021
Conversation

nbusseneau
Member

Please see individual commits.

@nbusseneau added the area/CI-improvement and release-note/ci labels on Apr 14, 2021
@nbusseneau
Member Author

Links to working runs using the same version of the workflows as in this PR (can be checked by clicking on <name>.yaml):

@tklauser (Member) left a comment:

Nice 🚀

@nbusseneau
Member Author

Rebased on master following merge of #15669, which added encryption to the EKS workflow. I noticed that the pod-to-local-nodeport exception we used on EKS no longer seems necessary, so I've removed it.

Link to run of the latest version of the EKS workflow: https://github.com/cilium/cilium/actions/runs/751917453

@aanm (Member) left a comment:

Just some small comments. I like the resource usage decrease a lot. I think we should try to decrease it even further after a couple of days of testing.

@@ -82,10 +82,12 @@ jobs:
       gcloud container clusters create ${{ env.clusterName }} \
         --labels "usage=pr,owner=${{ steps.vars.outputs.owner }}" \
         --zone ${{ env.zone }} \
         --preemptible \
-        --image-type COS \
+        --image-type COS_CONTAINERD \
Member

The documentation says that COS is being deprecated in clusters <1.19. To guarantee consistency for the test runs, we should pin the channel here as well as the k8s version. For example:

--release-channel=regular
--cluster-version=1.19.8-gke.1600

This information can be retrieved with:

$ gcloud container get-server-config --zone us-west2-a
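
For illustration, a minimal sketch of what the pinned creation call could look like, using the channel and version values from the suggestion above (they are examples only, not necessarily what the workflow should use; `${CLUSTER_NAME}` is a placeholder):

    # Hypothetical sketch: pin both the release channel and the k8s version
    # at cluster creation time (values are examples only).
    gcloud container clusters create "${CLUSTER_NAME}" \
      --zone us-west2-a \
      --release-channel regular \
      --cluster-version 1.19.8-gke.1600 \
      --image-type COS_CONTAINERD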

Member Author

I'm not sure what you're referring to 🤔

As mentioned in my commit message, my reading of the documentation is that COS is deprecated for >1.19, hence why we should switch to COS_CONTAINERD:

Warning: Since GKE node version 1.19, the Container-Optimized OS with Docker (cos) variant is deprecated. Please migrate to the Container-Optimized OS with Containerd (cos_containerd) variant, which is now the default GKE node image.

As for version pinning, I think this is a bad idea. GCP constantly updates which versions are available (see here); pinning a specific version is a surefire way to have our CI break all the time 😅 As a reminder, this is something we already faced with our GKE clusters for Jenkins.

Member

> As mentioned in my commit message, my reading of the documentation is that COS is deprecated for >1.19, hence why we should switch to COS_CONTAINERD:

I understand that, but not pinning the k8s version means we will get the default k8s version provided by the default release channel, which is 1.18. That's why I suggested pinning the k8s version and the release channel.
[screenshot: default GKE versions per release channel]

@nebril (Member) commented Apr 16, 2021

@aanm unfortunately there is no reliable way to pin a k8s version in a release channel, because versions in release channels are retired all the time. If we pin now, we will have trouble provisioning clusters in a month or two and will need to update this workflow after we get some failing builds.

Edit: unless you want to pin only the major.minor k8s version and take whatever GKE has available for the patch version and gke.x variant?
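
For example, a rough sketch of how the workflow could pin only the minor version and resolve the newest matching patch release at cluster-creation time. This assumes the `validMasterVersions` field returned by `get-server-config` and the `;` separator used by gcloud's `value()` formatter; `${ZONE}` and `${CLUSTER_NAME}` are placeholders:

    # Hypothetical sketch: pin major.minor only, resolve the newest patch/gke
    # variant currently offered in the zone.
    K8S_MINOR="1.19"
    CLUSTER_VERSION=$(gcloud container get-server-config --zone "${ZONE}" \
      --format="value(validMasterVersions)" | tr ';' '\n' \
      | grep "^${K8S_MINOR}\." | sort -rV | head -n 1)
    gcloud container clusters create "${CLUSTER_NAME}" \
      --zone "${ZONE}" \
      --cluster-version "${CLUSTER_VERSION}"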

Member

@nebril would that be possible? I just want to avoid being caught by surprise. Also, it would be a precise way of coordinating which k8s versions are being tested by just looking at the table of truth.

Member

Don't worry, we will be monitoring updates when a new version is out.

Member

I see ready-to-merge, but was this discussion resolved?

Member Author

Yes: we'll tackle K8s version pinning at a later point, after we have wrapped up priorities for 1.10 CI 3.0 items. I have added it to the ever-growing list of workflow nice-to-haves :D

Member

For the record, where is that list kept?

Member Author

For now, in our private roadmap https://github.com/isovalent/roadmap/issues/2

@@ -89,8 +89,11 @@ jobs:
--name ${{ env.name }} \
--location ${{ env.location }} \
--network-plugin azure \
Member

same here for the k8s version
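
A possible AKS equivalent, assuming the workflow keeps using `az aks create` (the version value is an example; `az aks get-versions --location <location>` lists what is currently available, and `${RESOURCE_GROUP}`/`${NAME}`/`${LOCATION}` are placeholders):

    # Hypothetical sketch: pin the AKS k8s version explicitly.
    az aks create \
      --resource-group "${RESOURCE_GROUP}" \
      --name "${NAME}" \
      --location "${LOCATION}" \
      --network-plugin azure \
      --kubernetes-version 1.19.9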

@@ -107,9 +107,12 @@ jobs:
run: |
eksctl create nodegroup \
--cluster ${{ env.clusterName }} \
Member

same here for the k8s version.
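
For EKS, a similar sketch with eksctl: the k8s version is pinned on the cluster, and nodegroups then follow the control plane version by default (`${CLUSTER_NAME}` is a placeholder):

    # Hypothetical sketch: pin the k8s version when creating the EKS cluster.
    eksctl create cluster \
      --name "${CLUSTER_NAME}" \
      --version 1.19 \
      --without-nodegroup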

.github/workflows/eks.yaml (review thread resolved)
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
The goal is for the trigger to be displayed on the PR checks, so that
developers may retry failed runs easily, as they are used to on Jenkins.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
- Use `COS_CONTAINERD`, which is the new default recommended by GKE
  https://cloud.google.com/kubernetes-engine/docs/concepts/node-images#cos
- Switch to a cheaper `e2-custom-2-4096` machine type which should be
  sufficient for our use case and is consistent with 2vCPUs/4GB machines
  from other providers.
  https://cloud.google.com/compute/docs/machine-types
- Default node disk is standard persistent disk of 100GB, but the
  minimum allowed is 10GB and should be sufficient for short-lived
  tests. https://cloud.google.com/compute/docs/disks

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
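
For reference, a sketch of roughly what the GKE settings described in this commit amount to as creation flags (surrounding flags and node count are illustrative, not the exact workflow invocation):

    # Hypothetical sketch of the node settings described above.
    gcloud container clusters create "${CLUSTER_NAME}" \
      --zone "${ZONE}" \
      --image-type COS_CONTAINERD \
      --machine-type e2-custom-2-4096 \
      --disk-type pd-standard \
      --disk-size 10GB \
      --num-nodes 2 \
      --preemptible
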
- Default VM size is `Standard_DS2_v2`, but we can use as low as
  `Standard_B2s` while still meeting AKS requirements.
  https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-general
  https://docs.microsoft.com/en-us/azure/aks/use-system-pools#system-and-user-node-pools
- Default node disk size is 128GB, but the minimum allowed is 30GB and
  should be sufficient for short-lived tests.
  https://azure.microsoft.com/en-us/pricing/details/managed-disks/
- Default load balancer SKU is `standard`, but `basic` should be
  sufficient. https://docs.microsoft.com/en-us/azure/load-balancer/skus

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
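
Likewise, a sketch of the corresponding `az aks create` flags (other flags omitted; resource names and node count are placeholders):

    # Hypothetical sketch of the AKS node and load balancer settings described above.
    az aks create \
      --resource-group "${RESOURCE_GROUP}" \
      --name "${NAME}" \
      --node-vm-size Standard_B2s \
      --node-osdisk-size 30 \
      --load-balancer-sku basic \
      --node-count 2
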
- The default instance type can be anything, but we should be fine
  using only cheap `t3.medium` or `t3a.medium`.
  https://aws.amazon.com/ec2/instance-types/
- Default EBS volume is a `gp3` of 80GB, but we can use something lower
  (such as 10GB) which should be sufficient for short-lived tests.
  https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
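
And a sketch of the matching `eksctl create nodegroup` flags (cluster name and node count are placeholders; `--node-type` takes a single type, so only one of the two instance types is shown):

    # Hypothetical sketch of the EKS nodegroup settings described above.
    eksctl create nodegroup \
      --cluster "${CLUSTER_NAME}" \
      --node-type t3.medium \
      --node-volume-size 10 \
      --nodes 2
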
This was supposedly a side-effect of cilium/cilium-cli#57
and was bound to be re-enabled once NodePort is fixed, but from my
testing it seems we are no longer experiencing it on EKS.

Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
@nbusseneau
Member Author

Rebased on master to get the latest changes, changed the machine type on GKE, and made a more significant standardization/refactoring pass (see last commit).

Links to runs using the same version of the workflows as in this PR (can be checked by clicking on <name>.yaml):

@nbusseneau
Member Author

Question: does it make sense to keep the push trigger so that workflows run on merge? Now that the workflows are added to test-me-please, I think marking them as required will be sufficient to test all PRs. No need to "double dip" by also checking on merge, IMO.

@aanm
Member

aanm commented Apr 16, 2021

> Question: does it make sense to keep the push trigger so that workflows run on merge? Now that the workflows are added to test-me-please, I think marking them as required will be sufficient to test all PRs. No need to "double dip" by also checking on merge, IMO.

@nbusseneau test-me-please will only run the tests against the PR changes; if we don't also test master, there is no guarantee that two PRs "conflicting" with each other won't break the e2e tests.
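
In other words, the workflows would keep something like the following trigger block so that master is still exercised after merges. This is a simplified sketch: the `issue_comment` trigger only stands in for however test-me-please actually invokes these workflows, which may differ.

    # Hypothetical sketch: run on pushes to master in addition to the
    # PR-driven trigger, so post-merge breakage on master is still caught.
    on:
      push:
        branches:
          - master
      issue_comment:
        types:
          - created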

@nbusseneau
Member Author

nbusseneau commented Apr 17, 2021

Yes, that's true! We'll have to implement some sort of alerting for when workflows break on merge. Right now I don't know if anyone gets notified, but I definitely don't.

@nbusseneau added the ready-to-merge label on Apr 19, 2021
@nbusseneau
Member Author

Marking as ready to merge, we'll tackle K8s version pinning at a later point, after we have wrapped up priorities for 1.10 CI 3.0 items.

@pchaigno merged commit cd1bd7d into cilium:master on Apr 20, 2021
@nbusseneau deleted the pr/platform-workflows-fixes branch on April 20, 2021 16:06
Labels
- area/CI-improvement: Topic or proposal to improve the Continuous Integration workflow
- ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
- release-note/ci: This PR makes changes to the CI.
Projects
1.10.0 (Awaiting triage)

Development
Successfully merging this pull request may close these issues: none yet.

6 participants