ci: Better build caching for CI #2742
Conversation
For the cache to work properly, we need to derive a cache key from the build context (the files that affect the Dockerfile build) instead of having the cache key change with every commit SHA. We also need to avoid failing tests preventing a build from being cached, hence splitting the workflow into separate jobs. A first pass used `upload-artifact` and `download-artifact` to transfer the built image, but that has quite a bit of overhead and prevented multi-platform builds (without complicating the workflow further).
While `download-artifact` + `docker load` is a little faster than rebuilding the image from cached layers, `upload-artifact` takes about 2 minutes to upload the AMD64 (330MB) tar image export (likely due to compression during upload?). The `actions/cache` approach does not incur that hit and is very quick (<10 secs) to complete its post-upload work. The dependent job still gets a cache hit, and the build job properly supports multi-platform builds. I've added additional notes about the timing and size impact of including ARM builds.
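For reference, a minimal sketch of the two-job shape this lands on (illustrative only — the job names, action versions, paths and key inputs here are not the exact workflow):

```yaml
jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - name: Layer cache keyed on the build context, not the commit SHA
        uses: actions/cache@v3
        with:
          path: /tmp/.buildx-cache
          key: buildx-${{ hashFiles('Dockerfile', 'VERSION', 'target/**') }}
          restore-keys: buildx-
      - name: Build image (multi-platform capable; no tests in this job)
        uses: docker/build-push-action@v3
        with:
          context: .
          # nothing is pushed or loaded here; the result lives in the build cache
          platforms: linux/amd64,linux/arm64
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache-new,mode=max
      - name: Swap in the fresh cache
        # workaround for type=local accumulating stale layers indefinitely
        run: |
          rm -rf /tmp/.buildx-cache
          mv /tmp/.buildx-cache-new /tmp/.buildx-cache

  run-tests:
    needs: build
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - uses: actions/cache@v3
        with:
          path: /tmp/.buildx-cache
          key: buildx-${{ hashFiles('Dockerfile', 'VERSION', 'target/**') }}
      - name: Rebuild from cached layers and load for testing
        uses: docker/build-push-action@v3
        with:
          context: .
          load: true # load into the runner's docker; single-platform only
          cache-from: type=local,src=/tmp/.buildx-cache
      - run: make tests
```

Even on a test failure, the cache from the build job has already been saved, so a re-run only pays for the quick rebuild-from-cache.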
When the `ARG` changes due to the commit SHA, it invalidates all cache because of the `LABEL` layers at the start; every `RUN` layer afterwards is then implicitly invalidated, even when the `ARG` is not used. I introduced a basic multi-stage build and relocated the container config / metadata to the end of the build. This avoids needlessly invalidating the expensive layers (in both size and build time).
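The shape of that `Dockerfile` change, as a sketch (base image, packages and paths here are illustrative, not the actual `Dockerfile`):

```dockerfile
# Stage 1: the expensive layers, with no VCS ARGs anywhere before them,
# so they stay cache-stable across commits.
FROM debian:11-slim AS stage-base
RUN apt-get update && apt-get install -y postfix dovecot-core
COPY target/ /usr/local/

# Stage 2: cheap config/metadata at the very end. A new commit SHA now only
# invalidates these trivial layers instead of every RUN layer above.
FROM stage-base AS stage-final
ARG VCS_REVISION=unknown
ARG VCS_VERSION=edge
LABEL org.opencontainers.image.revision=${VCS_REVISION}
LABEL org.opencontainers.image.version=${VCS_VERSION}
EXPOSE 25 143 465 587 993
```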
Very nice and welcome changes / additions! Would adding another job (dependent on whether the AMD64 build was successful, maybe) that builds the ARM64 image in parallel to the BATS test suite cut down on the build time (the 13 minutes)? Is there something I am missing that would prevent this solution?
```
not ok 217 setup_file failed
# (from function `setup_file' in test file test/tests.bats, line 72)
#   `docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/existing-added.txt"' failed
# ba44729a0ebac54aa83742d2eb89408862474caaa88b64f4ee0f085f46b3b4ed
# [ INF ] mail.my-domain.com is up and running
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 053F2BDDDC
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 2F89ABDE21
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 56EC7BDE23
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 7E1E6BDE26
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 9FFBABDE28
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as CF2F3BDE21
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# mail
# bats warning: Executed 217 instead of expected 318 tests
make: *** [Makefile:43: tests] Error 1
```
You are not wrong, I am just really stretched for time right now. I am enrolled in a course until December with 40-80 hour weeks demanded from students, and need to be prepared before it starts in a few days. You're welcome to adjust the PR or add a follow-up change 👍 I mostly didn't want to copy/paste more of the same step, as we have similar jobs in other workflows. GitHub docs mention a thing called "re-usable workflows" that might be an appropriate way of keeping that DRY. Alternatively, we could also explore a …
Oh, and the test failure is probably related to the recent PR, tripping up more timing-sensitive tests :(
I will provide a follow-up PR :) I will have a look at the proposals you gave afterwards too!
LGTM 👍🏼
These may be helpful for most of the improvements discussed.
I didn't have time yet to look at this content-wise. Just some stylistic questions so far:
Probably just a different style preference between me and @georglauterbach, happy to be consistent with whatever preference 👍 I don't think it makes any difference in this case (at least for …
Fair 👍 In both cases I chose to prefix the name with the scope/type of value for added clarity, as they're a fair bit apart. I'm not sure how familiar other maintainers are with multi-job workflows or a multi-stage `Dockerfile` …
Happy to remove it if you like, …
Ack 👍
Fair. It's just that I've mostly seen …
If it's only me, keep it 😄
Good work! I haven't tested the CI part, but I think you know what you're doing 😉
Yeah, I'm fine with that too and don't mind doing so on my own Dockerfiles 👍 I've only been a bit more verbose since it's harmless and may help maintainers / contributors that aren't as familiar :) I'll merge as-is. If any maintainer wants to remove the …
Another question for clarity: the tests have been moved to a new job (job-run-tests). There the image is built again (from cache). In …
That's a good question! :) The TL;DR version: not that I know of. This is just a build cache; changes to the `Dockerfile` or content that …
When I wrote this improvement I did encounter one failure case. This was due to the `ARG` values (the commit SHA) invalidating the build cache early in the `Dockerfile`. The image would build successfully, and then upload the cache. Then any future build where the `ARG` values changed would ignore that cache entirely. So it rebuilt from scratch each time, and at the end of the job refused to update the cache for that key, because it was a successful cache hit (this only applies to an exact match for the cache key, not a `restore-keys` fallback).
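To illustrate the loop (key inputs as in this PR; the comments describe `actions/cache` behavior as I understand it):

```yaml
- uses: actions/cache@v3
  with:
    path: /tmp/.buildx-cache
    # Exact hit on this key => actions/cache skips the post-job upload entirely.
    # Before the Dockerfile fix, the SHA-derived ARGs still invalidated every
    # layer, so the "hit" cache was useless, the image rebuilt from scratch,
    # and the rebuilt layers were never saved back under this key.
    key: buildx-${{ hashFiles('Dockerfile', 'VERSION', 'target/**') }}
    # A restore-keys match is only a partial hit: the job is seeded with an
    # older cache, and the post-job step still uploads under the primary key.
    restore-keys: buildx-
```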
I was careful to handle that here, as I had the same concern. I did notice I messed up with the docs preview workflow: it uploads an artifact to a common name, and I can't recall if it's scoped to the workflow or gets overwritten by other PRs building doc previews in parallel. It provides the docs build result and PR metadata. I don't think I've seen it get mixed up yet, but it might be worth investigating 😅 Thankfully it's just for doc previews and shouldn't pose any risk.
Description
Fixing flaws to make the tests workflow and `Dockerfile` more cache-friendly.
Change Overview
In Sep 2019, a PR introduced the VCS `ARG` directives with acknowledgement that it invalidates the build cache. Every time the commit SHA changes, due to this placement all layers afterwards are invalidated and must be rebuilt, even though only `LABEL` directive values were being updated (even without those, `RUN` will implicitly be invalidated due to `ARG` changing, as it is cached as part of the build ENV).

I've shifted all directives related to image config that aren't necessary earlier to a 2nd stage at the end. The additional `FROM` introducing a 2nd stage is not necessary, but provides a basis for later refactoring the Dockerfile to split out ClamAV and Fail2Ban layers into their own stages to further improve build caching.

The original intent of this PR was to address the caching with a more stable cache key (instead of the commit SHA), such that if the build context (files involved in the image build process) has not changed since the last build, we can re-use the build cache.
Presently the existing cache key would:

- Change with every commit SHA, invalidating the build cache even when nothing in the build context had changed.

The new cache key:

- Remains stable while `Dockerfile`, `VERSION` and `target/**` have not changed between commits.
- Gets a `key` hit (or a `restore-key` fallback to an older build cache), reducing build time, and upon job success uploads a cache for the new derived key.

Complementary changes:
- The `type=local` cache mode needs a workaround to avoid accidentally accumulating cache wastefully, which adds to upload/download time and more evictions (330-810MB + 7-20 secs depending on supported platforms, vs the 2-3GB I've seen presently cached, with download times of approx 1 minute).
- A `type=gha` cache mode was made available earlier this year (still deemed experimental) that uses the `actions/cache` API (reverse engineered IIRC, not officially supported); I've chosen to avoid it for now and have not had time to try and compare it.
- The `Dockerfile` issue with `ARG` usage is an important one: while the derived cache key can be effective at caching, when the commit SHA changed the cache was completely ignored due to the builder invalidating its cache... which caused a problem, as that cache hit would not upload the new build cache (build args passed in aren't part of the derived cache key). Although the built image will still be slightly different due to these `ARG`s, we are only sharing the build layer cache between jobs, so there is minimal work to recreate the image from cache and append the `ARG` changes (avoiding wasteful cache uploads for `ARG` changes).
- `Dockerfile` and buildx `builder` instance.

Impact of platforms on cache storage + build time:
Additional details
In the first commit, I also experimented with `upload-artifact` and `download-artifact` to transfer the image across jobs, but this was much worse than just using `actions/cache`.

Future improvements could leverage the buildx `COPY --link` feature that has been available since March 2022 (see the sketch below). `type=gha` for the image cache might be interesting too, if it caches on step success instead of job success, allowing the tests to live in the same job. Similar workflows could also potentially be made more DRY with GitHub Actions support for re-usable workflows.

For now though, this PR should help, especially if a follow-up moves out the ARM64 build.
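A sketch of that `COPY --link` idea (paths illustrative; requires the Dockerfile 1.4 syntax frontend):

```dockerfile
# syntax=docker/dockerfile:1.4
FROM debian:11-slim
RUN apt-get update && apt-get install -y postfix  # expensive layer
# --link makes this COPY an independent layer: rebuilding the RUN layer above
# no longer invalidates the cached copy.
COPY --link target/ /usr/local/
```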