Migrate to new ci runners #7082

Merged
merged 12 commits into from May 6, 2024

Member

jedevc commented Apr 11, 2024

This switches all our GitHub workflows to use the second generation of Dagger Runners. They are based on "Run Dagger on Amazon EKS with GitHub Actions Runner and Karpenter" and currently run:

  • AWS EKS v1.28.7
  • Karpenter v0.32.0

As of today, we expose the following runner options:

  • dagger-v0-10-1
  • dagger-v0-10-2
  • dagger-v0-10-3
  • dagger-v0-10-3-4c
  • dagger-v0-10-3-8c
  • dagger-v0-10-3-16c
  • dagger-v0-11-0
  • dagger-v0-11-0-4c
  • dagger-v0-11-0-8c
  • dagger-v0-11-0-16c
  • dagger-v0-11-1
  • dagger-v0-11-1-4c
  • dagger-v0-11-1-4c-nvme
  • dagger-v0-11-1-8c
  • dagger-v0-11-1-8c-nvme
  • dagger-v0-11-1-16c
  • dagger-v0-11-1-16c-nvme
  • dagger-v0-11-2
  • dagger-v0-11-2-4c
  • dagger-v0-11-2-4c-nvme
  • dagger-v0-11-2-8c
  • dagger-v0-11-2-8c-nvme
  • dagger-v0-11-2-16c
  • dagger-v0-11-2-16c-nvme

As soon as a new version of Dagger is released (e.g. v0.11.3), the runner variants for that version will appear automatically.
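The naming scheme in the list above can be sketched in a few lines of Python. This is only an illustration inferred from the labels listed in this PR: the helper names (`RunnerLabel`, `parse_label`, `labels_for_version`) are hypothetical, and the real variants are generated in the private infra repo, possibly with different rules. The sketch also assumes every new version gets the same seven variants as v0.11.1 and v0.11.2 (older versions in the list have fewer).

```python
import re
from typing import List, NamedTuple, Optional


class RunnerLabel(NamedTuple):
    """Hypothetical structured view of a runner label."""
    version: str          # e.g. "v0.11.2"
    cpus: Optional[int]   # None when the label has no CPU suffix
    nvme: bool            # True when the label ends in "-nvme"


# Pattern inferred from the labels listed in this PR (an assumption, not a spec).
_LABEL_RE = re.compile(r"^dagger-v(\d+)-(\d+)-(\d+)(?:-(\d+)c(-nvme)?)?$")


def parse_label(label: str) -> RunnerLabel:
    """Split a label such as 'dagger-v0-11-2-4c-nvme' into its parts."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised runner label: {label!r}")
    major, minor, patch, cpus, nvme = match.groups()
    return RunnerLabel(
        version=f"v{major}.{minor}.{patch}",
        cpus=int(cpus) if cpus else None,
        nvme=nvme is not None,
    )


def labels_for_version(version: str) -> List[str]:
    """List the variants we would expect for a version like 'v0.11.3'."""
    base = "dagger-" + version.replace(".", "-")
    labels = [base]
    for cpus in (4, 8, 16):
        labels.append(f"{base}-{cpus}c")
        labels.append(f"{base}-{cpus}c-nvme")
    return labels
```

For example, `labels_for_version("v0.11.3")` would start with `dagger-v0-11-3` and end with `dagger-v0-11-3-16c-nvme`, and any of those strings can be used as the `runs-on` value of a workflow job.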

This change is great because:

  1. Dagger Engine version can be bumped by changing the GitHub Workflow definition - no more coordination required!
  2. We can test different configurations (and versions!) in PRs and see which one performs best
  3. We are not forced to bump the Engine version for all workflows at the same time - it was an all-or-nothing proposition before

To learn more, see the (private) infra repo. @matipan and @gerhard have all the context.


Something worth mentioning separately is that this also introduces asynchronous triggers.

We are running a specific job - engine:testrace - in a production bare-metal K8s cluster. This is done for comparison only, since we know that it's going to fail (we only have 12 CPU threads available). See #7223 (comment) for more details.

The goal of this is to capture measurements - #6492 - and see how they change over time.

FTR, this job will fail until we merge into main since the new workflow will not be recognised:
[image]


Split out to avoid blocking the release:

matipan force-pushed the migrate-to-new-ci-runners branch 17 times, most recently from 9a08f48 to e87002c, on April 16, 2024 20:15
Member

gerhard left a comment

Let's see how well this works on main 🐿

matipan force-pushed the migrate-to-new-ci-runners branch 3 times, most recently from 23690ba to 8ff9744, on April 18, 2024 13:48
Member Author

jedevc commented Apr 29, 2024

Minor request: could we link, somewhere in our docs, to where the definition of the GitHub workers lives in https://github.com/dagger/dagger.io (even though it's private)?

Member

gerhard commented Apr 29, 2024

Todo:

  • Merge the open PRs that change runners in the dagger.io repo
  • Run all jobs async in legacy CI runners so that we can compare them for a few more days. We have 61 days' worth of metrics from previous runs; it will have to do.

Follow-up:

  • Migrate testdev to larger GitHub runners (easy) or set up Docker in the new CI runners (harder)

Member

gerhard commented Apr 30, 2024

I'm picking this one up again.

gerhard and others added 10 commits May 6, 2024 12:49
…engine

This one requires Docker with specific fixes that we don't yet have in
the new CI setup.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
The setup we want for production is:
- For all <LANG> SDK jobs, run them on the new CI only
- For testdev, run them on the docker-fix legacy CI
- For test/dagger-runner, run them on both legacy CI and new CI
- For all the rest, run them on the new CI and github runners for the really simple jobs

Signed-off-by: Matias Pan <matias@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Otherwise the workflows are too slow on the new CI runners and are
blocking the migration off the legacy CI runners.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard force-pushed the migrate-to-new-ci-runners branch from f72cc50 to cc7c562, on May 6, 2024 12:47
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Large GitHub Runners are failing consistently, not worth debugging at
this point since we know this works on a vanilla Ubuntu 24.04 instance
with Docker - must be an issue related to GitHub Large Runners.

FTR: dagger#7223 (comment)

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard merged commit 3929a6d into dagger:main on May 6, 2024
83 of 84 checks passed
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 8, 2024
* Use Dagger v0.11.2 via the new CI setup for all workflows except dev-engine

This one requires Docker with specific fixes that we don't yet have in
the new CI setup.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Setup CI for new, legacy and vertical scaling

The setup we want for production is:
- For all <LANG> SDK jobs, run them on the new CI only
- For testdev, run them on the docker-fix legacy CI
- For test/dagger-runner, run them on both legacy CI and new CI
- For all the rest, run them on the new CI and github runners for the really simple jobs

Signed-off-by: Matias Pan <matias@dagger.io>

* Rename concurrency group

Signed-off-by: Matias Pan <matias@dagger.io>

* Install curl on production vertical scaling runner

Signed-off-by: Matias Pan <matias@dagger.io>

* Add customizable runner for separate perf tests

Signed-off-by: Matias Pan <matias@dagger.io>

* Rename to _async_hack_make

Signed-off-by: Matias Pan <matias@dagger.io>

* Upgrade missing workflow to v0.11.1

Signed-off-by: Matias Pan <matias@dagger.io>

* Target nvme

Signed-off-by: Matias Pan <matias@dagger.io>

* CI: Default to 4CPUs & NVMe disks

Otherwise the workflows are too slow on the new CI runners and are
blocking the migration off the legacy CI runners.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Bump to v0.11.2 & capture extra details in comments

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Debug dagger-engine.dev in large GitHub Runner

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Continue running engine:testdev in dagger-runner-docker-fix runner

Large GitHub Runners are failing consistently, not worth debugging at
this point since we know this works on a vanilla Ubuntu 24.04 instance
with Docker - must be an issue related to GitHub Large Runners.

FTR: dagger#7223 (comment)

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Matias Pan <matias@dagger.io>