Migrate to new ci runners #7082

jedevc · 2024-04-11T14:59:03Z

This switches all our GitHub workflows to use the second generation of Dagger Runners. They are based on the "Run Dagger on Amazon EKS with GitHub Actions Runner and Karpenter" guide, currently running:

AWS EKS v1.28.7
Karpenter v0.32.0

As of today, we expose the following runner options:

dagger-v0-10-1
dagger-v0-10-2
dagger-v0-10-3
dagger-v0-10-3-4c
dagger-v0-10-3-8c
dagger-v0-10-3-16c
dagger-v0-11-0
dagger-v0-11-0-4c
dagger-v0-11-0-8c
dagger-v0-11-0-16c
dagger-v0-11-1
dagger-v0-11-1-4c
dagger-v0-11-1-4c-nvme
dagger-v0-11-1-8c
dagger-v0-11-1-8c-nvme
dagger-v0-11-1-16c
dagger-v0-11-1-16c-nvme
dagger-v0-11-2
dagger-v0-11-2-4c
dagger-v0-11-2-4c-nvme
dagger-v0-11-2-8c
dagger-v0-11-2-8c-nvme
dagger-v0-11-2-16c
dagger-v0-11-2-16c-nvme

As soon as a new version of Dagger gets released - e.g. v0.11.3 - a bunch of variants with that version will appear automatically.
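The naming pattern in the list above can be sketched in a few lines. This is only an illustration inferred from the labels shown (version with dots replaced by hyphens, an optional `-<N>c` CPU suffix, an optional `-nvme` suffix); `runner_labels` and `CPU_SIZES` are hypothetical names, and the real variants are generated by the private infra repo.

```python
from typing import List

# CPU sizes seen in the list above; an assumption, not an official contract.
CPU_SIZES = (4, 8, 16)


def runner_labels(version: str, with_nvme: bool = True) -> List[str]:
    """Return the runner labels we would expect for a version such as 'v0.11.3'."""
    base = "dagger-" + version.replace(".", "-")
    labels = [base]
    for cpus in CPU_SIZES:
        sized = f"{base}-{cpus}c"
        labels.append(sized)
        if with_nvme:
            # NVMe variants only show up from v0.11.1 onwards in the list above.
            labels.append(f"{sized}-nvme")
    return labels


if __name__ == "__main__":
    for label in runner_labels("v0.11.3"):
        print(label)
```

Under these assumptions, `runner_labels("v0.11.2")` reproduces the seven `dagger-v0-11-2*` entries listed above.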

This change is great because:

Dagger Engine version can be bumped by changing the GitHub Workflow definition - no more coordination required!
We can test different configurations (and versions!) in PRs and see which one performs best
We are not forced to bump the Engine version for all workflows at the same time - it was an all-or-nothing proposition before
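To make the first point concrete: if a workflow selects its runner by one of the labels listed earlier, bumping the Engine version is a matter of rewriting that label. The sketch below assumes the label format inferred from that list; `parse_label` and `bump_label` are hypothetical helpers, not part of Dagger or of the workflows in this PR.

```python
import re
from typing import NamedTuple, Optional

# Label format inferred from the runner list: dagger-vX-Y-Z[-<N>c][-nvme]
LABEL_RE = re.compile(r"^dagger-v(\d+)-(\d+)-(\d+)(?:-(\d+)c)?(-nvme)?$")


class RunnerLabel(NamedTuple):
    version: str         # e.g. "v0.11.2"
    cpus: Optional[int]  # None when the label has no CPU suffix
    nvme: bool


def parse_label(label: str) -> RunnerLabel:
    """Split a runner label into version, CPU count and NVMe flag."""
    m = LABEL_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognised runner label: {label!r}")
    major, minor, patch, cpus, nvme = m.groups()
    return RunnerLabel(
        version=f"v{major}.{minor}.{patch}",
        cpus=int(cpus) if cpus else None,
        nvme=nvme is not None,
    )


def bump_label(label: str, new_version: str) -> str:
    """Return the same runner shape (CPUs, NVMe) for a different Engine version."""
    old = parse_label(label)
    new = "dagger-" + new_version.replace(".", "-")
    if old.cpus is not None:
        new += f"-{old.cpus}c"
    if old.nvme:
        new += "-nvme"
    return new
```

For example, `bump_label("dagger-v0-11-1-4c-nvme", "v0.11.2")` keeps the 4-CPU NVMe shape and only changes the version part of the label.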

To learn more, see the (private) infra repo; @matipan and @gerhard have all the context.

Something worth mentioning separately is that this also introduces asynchronous triggers.

We are running a specific job - engine:testrace - in a production bare-metal K8s cluster. This is done for comparison only, since we know that it is going to fail (we only have 12 CPU threads available). See #7223 (comment) for more details.

The goal of this is to capture measurements - #6492 - and see how they change over time.

FTR, this job will fail until we merge into main since the new workflow will not be recognised:

Split out to avoid blocking the release:

Improve releasing during v0.11.0 #7018

gerhard

Let's see how well this works on main 🐿

.github/workflows/cfg-runner.yml

jedevc · 2024-04-29T10:23:45Z

Minor request - could we link, somewhere in our docs, to where the definitions for the GitHub workers live in https://github.com/dagger/dagger.io? (even though it's private)

gerhard · 2024-04-29T10:24:02Z

Todo:

Merge the open PRs that change runners in the dagger.io repo
~~Run all jobs async in legacy CI runners so that we can compare them for a few more days~~ We have 61 days' worth of metrics from previous runs; it will have to do.

Follow-up:

Migrate testdev to larger GitHub runners (easy) or set up Docker in the new CI runners (harder)

gerhard · 2024-04-29T10:25:53Z

@jedevc

gerhard · 2024-04-30T14:22:22Z

I'm picking this one up again.


* Use Dagger v0.11.2 via the new CI setup for all workflows except dev-engine

  This one requires Docker with specific fixes that we don't yet have in the new CI setup.

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Setup CI for new, legacy and vertical scaling

  The setup we want for production is:
  - For all <LANG> SDK jobs, run them on the new CI only
  - For testdev, run them on the docker-fix legacy CI
  - For test/dagger-runner, run them on both legacy CI and new CI
  - For all the rest, run them on the new CI and github runners for the really simple jobs

  Signed-off-by: Matias Pan <matias@dagger.io>

* Rename concurrency group

  Signed-off-by: Matias Pan <matias@dagger.io>

* Install curl on production vertical scaling runner

  Signed-off-by: Matias Pan <matias@dagger.io>

* Add customizable runner for separate perf tests

  Signed-off-by: Matias Pan <matias@dagger.io>

* Rename to _async_hack_make

  Signed-off-by: Matias Pan <matias@dagger.io>

* Upgrade missing workflow to v0.11.1

  Signed-off-by: Matias Pan <matias@dagger.io>

* Target nvme

  Signed-off-by: Matias Pan <matias@dagger.io>

* CI: Default to 4CPUs & NVMe disks

  Otherwise the workflows are too slow on the new CI runners and are blocking the migration off the legacy CI runners.

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Bump to v0.11.2 & capture extra details in comments

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Debug dagger-engine.dev in large GitHub Runner

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Continue running engine:testdev in dagger-runner-docker-fix runner

  Large GitHub Runners are failing consistently, not worth debugging at this point since we know this works on a vanilla Ubuntu 24.04 instance with Docker - must be an issue related to GitHub Large Runners.

  FTR: dagger#7223 (comment)

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Matias Pan <matias@dagger.io>
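The production setup described in the commit message above boils down to a small routing table from job to runner pool. The sketch below is one reading of it, assuming the last rule means "new CI by default, GitHub runners for the really simple jobs". The pool names `new-ci`, `legacy-ci` and `github-hosted`, the `sdk:` job prefix and the `simple` flag are all hypothetical; only `engine:testdev`, `test/dagger-runner` and `dagger-runner-docker-fix` come from the messages above.

```python
from typing import List

# Hypothetical pool names, except the docker-fix runner, which the commit
# messages above name explicitly.
NEW_CI = "new-ci"
LEGACY_CI = "legacy-ci"
LEGACY_DOCKER_FIX = "dagger-runner-docker-fix"
GITHUB_HOSTED = "github-hosted"


def pools_for(job: str, simple: bool = False) -> List[str]:
    """Return the runner pools a job should run on under the rules above."""
    if job.startswith("sdk:"):
        # All <LANG> SDK jobs: new CI only.
        return [NEW_CI]
    if job == "engine:testdev":
        # Needs Docker with specific fixes, so it stays on the docker-fix legacy runner.
        return [LEGACY_DOCKER_FIX]
    if job == "test/dagger-runner":
        # Run on both generations so the two can be compared.
        return [LEGACY_CI, NEW_CI]
    # Everything else: new CI, or GitHub runners for the really simple jobs.
    return [GITHUB_HOSTED] if simple else [NEW_CI]
```

Keeping the rules in one function like this makes the exceptions (testdev, dagger-runner) easy to spot and to remove once the legacy runners are retired.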

matipan force-pushed the migrate-to-new-ci-runners branch from 93bd8a4 to 412358b on April 11, 2024 18:09

gerhard mentioned this pull request Apr 15, 2024

feat(TSSdk): support OTEL #7074

Merged

matipan force-pushed the migrate-to-new-ci-runners branch 17 times, most recently from 9a08f48 to e87002c on April 16, 2024 20:15

gerhard approved these changes Apr 17, 2024


matipan force-pushed the migrate-to-new-ci-runners branch 3 times, most recently from 23690ba to 8ff9744 on April 18, 2024 13:48

gerhard reviewed Apr 18, 2024

.github/workflows/cfg-runner.yml (outdated, resolved)

gerhard reviewed Apr 18, 2024

.github/workflows/cfg-runner.yml (outdated, resolved)

matipan force-pushed the migrate-to-new-ci-runners branch 2 times, most recently from e6537ad to 3083e9e on April 19, 2024 17:27

gerhard force-pushed the migrate-to-new-ci-runners branch from 660ad83 to e31aa58 on April 24, 2024 18:21

This was referenced Apr 25, 2024

🐞 github.com/dagger/dagger/dagql/idtui.CollectSpan invalid memory address or nil pointer dereference #7192

Open

CI: Try running Engine & CLI workflow in larger GitHub runners #7197

Closed

gerhard mentioned this pull request Apr 26, 2024

Improve releasing during v0.11.2 #7205

Merged

gerhard mentioned this pull request May 3, 2024

ci: use nested dagger for testdev #7223

Draft

gerhard force-pushed the migrate-to-new-ci-runners branch from 7892bf0 to f72cc50 on May 3, 2024 19:27

gerhard and others added 10 commits May 6, 2024 12:49

Use Dagger v0.11.0 via the new CI setup for all workflows except dev-…

ecf7d35

…engine This one requires Docker with specific fixes that we don't yet have in the new CI setup. Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

Rename concurrency group

da25fef

Signed-off-by: Matias Pan <matias@dagger.io>

Install curl on production vertical scaling runner

e2d75c3

Signed-off-by: Matias Pan <matias@dagger.io>

Add customizable runner for separate perf tests

aac33ec

Signed-off-by: Matias Pan <matias@dagger.io>

Rename to _async_hack_make

d94a384

Signed-off-by: Matias Pan <matias@dagger.io>

Upgrade missing workflow to v0.11.1

a0943fd

Signed-off-by: Matias Pan <matias@dagger.io>

Target nvme

506f9df

Signed-off-by: Matias Pan <matias@dagger.io>

CI: Default to 4CPUs & NVMe disks

a60eb5d

Otherwise the workflows are too slow on the new CI runners and are blocking the migration off the legacy CI runners. Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

Bump to v0.11.2 & capture extra details in comments

cc7c562

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

gerhard force-pushed the migrate-to-new-ci-runners branch from f72cc50 to cc7c562 on May 6, 2024 12:47

gerhard added 2 commits May 6, 2024 15:33

Debug dagger-engine.dev in large GitHub Runner

0b6e1c6

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

gerhard merged commit 3929a6d into dagger:main May 6, 2024
83 of 84 checks passed

gerhard mentioned this pull request May 7, 2024

CLI: simplify and clarify usage text #7277

Open
