
ci: limit usage of large runners #3722

Merged
aluzzardi merged 1 commit into dagger:main from aluzzardi:ci-self-hosted on Nov 11, 2022

@aluzzardi
Contributor

aluzzardi commented Nov 8, 2022

/cc @gerhard @sipsma @vito

Low urgency. Experimented with moving our CI to self-hosted to harness buildkit cache.

aluzzardi force-pushed the ci-self-hosted branch 8 times, most recently from 671eb60 to 6174336 (November 8, 2022 02:57)
aluzzardi marked this pull request as draft (November 8, 2022 03:26)
@sipsma
Contributor

sipsma commented Nov 8, 2022

@aluzzardi Where is the self hosted runner hosted? Trying to rely fully on the local cache is pretty tough now because we want to rebuild the engine quite frequently, and engine now also encapsulates local cache, so trying to share local cache between different engine builds running in parallel is not really feasible at the moment.

Then there's remote cache, but last time we enabled full remote caching it seemed to cause as many problems as it solved (#2365).

Self-hosted runners might improve that situation though by allowing us to e.g. put the self hosted runner in AWS and use S3 remote caching w/ a vpc endpoint. I'd guess that gives us better performance, less throttling, etc. relative to free-tier GHA caching.

@aluzzardi
Contributor Author

> @aluzzardi Where is the self hosted runner hosted?

Currently, it's a DigitalOcean box (16 cores, 32GB ram). Flexible to move it anywhere else, this was just a test.

@sipsma
Contributor

sipsma commented Nov 8, 2022

> Currently, it's a DigitalOcean box (16 cores, 32GB ram). Flexible to move it anywhere else, this was just a test.

I think DO has some S3-api compatible service, so it actually may be usable

@sipsma
Contributor

sipsma commented Nov 8, 2022

Also fun related fact: I just saw someone in the buildkit slack channel say that the S3 backend is additive. It's not like registry where you overwrite tags; it just keeps growing every time you export cache.

Need to verify but pretty intriguing if true. Means we need to prune our cache somehow, but that's doable.
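
A minimal sketch of what age-based pruning could look like, assuming the cache objects can be listed along with their last-modified times. The names `CacheObject` and `select_stale` and the 14-day cutoff are hypothetical; nothing here reflects BuildKit's actual S3 layout, and whether deleting by age is safe for the cache's manifests would need to be verified first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CacheObject:
    """Hypothetical view of one object in the cache bucket."""
    key: str
    last_modified: datetime
    size: int  # bytes


def select_stale(objects, now, max_age):
    """Return the objects whose last-modified time is older than max_age."""
    cutoff = now - max_age
    return [obj for obj in objects if obj.last_modified < cutoff]


# Illustrative data only; keys and sizes are made up.
now = datetime(2022, 11, 8, tzinfo=timezone.utc)
objects = [
    CacheObject("blobs/a", now - timedelta(days=30), 1024),
    CacheObject("blobs/b", now - timedelta(days=2), 2048),
    CacheObject("blobs/c", now - timedelta(days=15), 512),
]
stale = select_stale(objects, now, max_age=timedelta(days=14))
stale_keys = [obj.key for obj in stale]
reclaimed_bytes = sum(obj.size for obj in stale)
```

The selection is kept separate from any deletion call so it can be dry-run against a bucket listing before anything is removed.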

@aluzzardi
Contributor Author

> Also fun related fact: I just saw someone in the buildkit slack channel say that the S3 backend is additive. It's not like registry where you overwrite tags; it just keeps growing every time you export cache.
>
> Need to verify but pretty intriguing if true. Means we need to prune our cache somehow, but that's doable.

Wow, that's amazing. We could switch to EC2/S3 then. Picked DO just because I could get it done in a few minutes.

@aluzzardi
Contributor Author

@sipsma I just rebased and I'm getting:

Error: failed to copy dagger-sdk-helper bin with command "docker cp dagger-engine-5d9fb9c65f9098d3:/usr/bin/dagger-sdk-helper-linux-amd64 /root/.cache/dagger/temp-dagger-sdk-helper-5d9fb9c65f9098d31103173820": Error: No such container:path: dagger-engine-5d9fb9c65f9098d3:/usr/bin/dagger-sdk-helper-linux-amd64
[12](https://github.com/dagger/dagger/actions/runs/3423495121/jobs/5702153940#step:4:13)

Haven't got a chance to debug yet -- could it be concurrency related?

@sipsma
Contributor

sipsma commented Nov 8, 2022

> Haven't got a chance to debug yet -- could it be concurrency related?

It's certainly possible but I can't currently see what would cause a race condition like that. The testing so far around that has been me manually invoking python and go tests that do provisioning side-by-side, so not exactly thorough.

My plan is to finish the switch we talked about earlier (more in helper, less in SDK), then to start automating as much testing of all of this as possible. So I'll be sure to cover this sort of case as part of that.

aluzzardi force-pushed the ci-self-hosted branch 3 times, most recently from fe8ac20 to 24f1714 (November 10, 2022 03:08)
aluzzardi changed the title from "ci: switch to self hosted" to "ci: limit usage of large runners" (Nov 10, 2022)
@aluzzardi
Contributor Author

aluzzardi commented Nov 10, 2022

Need more time to deal with this -- changed the PR to just limit large GH runners for engine:test and keep the default ones for lint & the other tests, since they're waiting anyway for engine:test to complete

aluzzardi marked this pull request as ready for review (November 10, 2022 03:10)
aluzzardi force-pushed the ci-self-hosted branch 7 times, most recently from 06a0cbb to 195cb24 (November 10, 2022 17:57)
@aluzzardi
Contributor Author

I don't know why readthedocs.org is failing /cc @helderco

Signed-off-by: Andrea Luzzardi <al@dagger.io>
aluzzardi merged commit 4598e4d into dagger:main (Nov 11, 2022)
aluzzardi deleted the ci-self-hosted branch (November 11, 2022 00:40)