
ci: Better build caching for CI #2742

Conversation

polarathene
Member

Description

Fixes flaws to make the tests workflow and Dockerfile more cache-friendly.

Change Overview

In Sep 2019, a PR introduced the VCS ARG directives with the acknowledgement that they invalidate the build cache.

Due to this placement, every time the commit SHA changes all layers afterwards are invalidated and must be rebuilt, even though only the LABEL directive values were being updated (even without those, RUN would implicitly be invalidated by the changed ARG, as it is cached as part of the build environment).

I've shifted all directives related to image config that aren't needed until the end into a 2nd stage. The additional FROM introducing a 2nd stage is not strictly necessary, but provides a basis for later refactoring the Dockerfile to split the ClamAV and Fail2Ban layers into their own stages and further improve build caching.
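
As a rough illustration of the new layout (the stage names match this PR, but the base image, packages and ARG/LABEL names below are placeholders rather than the project's actual values):

```dockerfile
# Stage 1: all of the expensive layers, free of any ARG that changes per commit.
FROM debian:11-slim AS stage-base
RUN apt-get update && apt-get install --no-install-recommends -y postfix dovecot-core
COPY target/ /

# Stage 2: image config / metadata only. A new commit SHA now only rebuilds
# these cheap directives instead of invalidating everything above.
FROM stage-base AS stage-final
ARG VCS_REVISION=unknown
ARG VCS_VERSION=unknown
LABEL org.opencontainers.image.revision=${VCS_REVISION}
LABEL org.opencontainers.image.version=${VCS_VERSION}
```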

The original intent of this PR was to address the caching with a more stable cache key (instead of the commit SHA), such that if the build context (the files involved in the image build process) has not changed since the last build, we can re-use that build cache.


Presently, the existing cache key:

  • Is only valid when running tests for the same commit again, which is rarely helpful: if tests failed, the cache is never uploaded (cache upload only happens at the end of a successful workflow job).
  • Needlessly uploads a new cache per push (when the job is successful) even if the image didn't really change, e.g. a PR that only contributes tests, wasting cache storage (10GB storage limit; the least frequently used caches are evicted first).

The new cache key (sketched below):

  • Is valid as long as Dockerfile, VERSION and target/** have not changed between commits.
  • Uploads the cache even when tests fail (fine given the cache eviction policy and the open-source plan, provided cache uploads are not excessive), but only when there was no direct cache key hit (a restore-keys fallback restores an older build cache to reduce build time, and upon job success the cache is uploaded under the newly derived key).
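
For context, a minimal sketch of the actions/cache usage this describes (the step name, cache path and action version here are illustrative assumptions):

```yaml
      - name: 'Restore build layer cache'
        uses: actions/cache@v3
        with:
          path: /tmp/.buildx-cache
          # Key derived from the files that affect the image, not the commit SHA:
          key: cache-buildx-${{ hashFiles('Dockerfile', 'VERSION', 'target/**') }}
          # On a key miss, fall back to the most recent cache with this prefix;
          # the build then only recreates invalidated layers, and a fresh cache
          # is uploaded under the new key when the job succeeds:
          restore-keys: cache-buildx-
```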

Complementary changes:

  • The type=local cache mode needs a workaround (sketched after this list) to avoid accidentally accumulating cache wastefully, which adds to upload/download time and causes more evictions (330-810MB + 7-20 sec depending on supported platforms, vs the 2-3GB I've seen cached presently with download times of approx 1 minute).
    • A new type=gha cache mode (still deemed experimental) became available earlier this year; it uses the actions/cache API (reverse engineered IIRC, not officially supported). I've chosen to avoid it for now and have not had time to try and compare it.
  • Fixing the Dockerfile issue with ARG usage is important: while the derived cache key can be effective, whenever the commit SHA changed the cache was completely ignored because the builder invalidated it... which caused a problem, as that cache hit meant the new build cache was never uploaded (build args passed in aren't part of the derived cache key). Although the built image will still differ slightly due to these ARG values, we only share the build layer cache between jobs, so there is minimal work to recreate the image from cache and append the ARG changes (avoiding a wasteful cache upload for ARG-only changes).
  • The workflow is better documented. Removed unnecessary inputs like the default Dockerfile and buildx builder instance.
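
The type=local workaround mentioned above roughly follows the well-known cache rotation pattern below (action version, step names and paths are illustrative assumptions):

```yaml
      - name: 'Build image'
        uses: docker/build-push-action@v3
        with:
          cache-from: type=local,src=/tmp/.buildx-cache
          # Write to a fresh directory so stale layers don't accumulate across builds:
          cache-to: type=local,dest=/tmp/.buildx-cache-new,mode=max

      # Swap the fresh cache into place before actions/cache uploads it in its
      # post-job step:
      - name: 'Replace the old build cache'
        run: |
          rm -rf /tmp/.buildx-cache
          mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```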

Impact of platforms on cache storage + build time:

  • I also checked the build cache for each platform: approx 330MB per platform, but the two ARM platforms seem capable of sharing a fair bit, as when both are included with AMD64 the total size is 810MB (655MB with AMD64 and ARM64 only).
  • The ARM builds turned out to add 13 minutes to the build time, and they build in parallel (no worthwhile speed improvement with only ARM64), thus deprecating ARMv7 won't provide CI improvements (beyond reducing cache storage).
  • AMD64 alone seems warranted, as it can be built within 2 minutes without cache (30 seconds with cache). It might make sense to run ARM64 as a separate job in parallel (or, if available, on a native ARM64 runner), or as a dependent job after the more important AMD64 build and test.

Additional details

In the first commit, I also experimented with upload-artifact and download-artifact to transfer the image across jobs, but this was much worse than just using actions/cache.

Future improvements could leverage the buildx COPY --link feature that has been available since March 2022. type=gha for the image cache might be interesting too if it caches on step success instead of job success, which would allow the tests to live in the same job. Similar workflows could also potentially be made more DRY with GitHub Actions support for re-usable workflows.

For now though, this PR should help, especially if a follow-up moves out the ARM64 build.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change that does improve existing functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes

For the cache to work properly, we need to derive a cache key from the build context (the files that affect the Dockerfile build) instead of a cache key that changes with the commit SHA.

We also need to prevent failing tests from blocking the caching of a build, hence the split into separate jobs.

This first pass used `upload-artifact` and `download-artifact` to transfer the built image, but it has quite a bit of overhead and prevented multi-platform builds (without complicating the workflow further).
While `download-artifact` + `docker load` is a little faster than rebuilding the image from cached layers, `upload-artifact` takes about 2 minutes to upload the AMD64 (330MB) tar image export (likely due to compression during upload?).

The `actions/cache` approach however does not incur that hit and is very quick (<10 secs) to complete its post-upload work. The dependent job still gets a cache-hit, and the build job properly supports multi-platform builds.
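
Roughly, the resulting two-job layout looks like this (the job names are the ones used in this PR; the runner image and action version are assumptions):

```yaml
jobs:
  job-build-image:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # ... restore the build layer cache, then build the image with buildx ...
      # The cache is uploaded when this job succeeds, regardless of whether
      # the tests in the next job pass.

  job-run-tests:
    # Only starts once the image layers are built and cached:
    needs: job-build-image
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # ... restore the same cache, rebuild the image from cached layers,
      # then run the BATS test suite ...
```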

Added additional notes about the timing and size of including ARM builds.
When the ARG changes with the commit SHA, it invalidates all cache because the LABEL layers that consume it sit at the start. Any RUN layers afterwards are then implicitly invalidated too, even when the ARG is not used.

Introduced a basic multi-stage build and relocated the container config / metadata to the end of the build. This avoids needlessly invalidating cache layers that are expensive in size and build time.
@polarathene polarathene self-assigned this Aug 23, 2022
@polarathene polarathene added area/ci area/tests kind/improvement Improve an existing feature, configuration file or the documentation kind/bugfix labels Aug 23, 2022
@polarathene polarathene added this to the v11.2.0 milestone Aug 23, 2022
@georglauterbach
Member

georglauterbach commented Aug 23, 2022

Very nice and welcome changes / additions!

Would just adding another job (dependent on whether the AMD64 build was successful, maybe) that builds the ARM64 image in parallel to the BATS test suite cut down on the build time (the 13 minutes)? Is there something I am missing that would prevent this solution?


Warning: Latest Test Failures
In tests.bats line 72, setup_file

not ok 217 setup_file failed
# (from function `setup_file' in test file test/tests.bats, line 72)
#   `docker exec mail /bin/sh -c "nc 0.0.0.0 25 < /tmp/docker-mailserver-test/email-templates/existing-added.txt"' failed
# ba44729a0ebac54aa83742d2eb89408862474caaa88b64f4ee0f085f46b3b4ed
# [   INF   ]  mail.my-domain.com is up and running
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 053F2BDDDC
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 2F89ABDE21
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 56EC7BDE23
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 7E1E6BDE26
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as 9FFBABDE28
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# 250 mail.my-domain.com
# 250 2.1.0 Ok
# 250 2.1.5 Ok
# 354 End data with <CR><LF>.<CR><LF>
# 250 2.0.0 Ok: queued as CF2F3BDE21
# 221 2.0.0 Bye
# 220 mail.my-domain.com ESMTP
# mail
# bats warning: Executed 217 instead of expected 318 tests
make: *** [Makefile:43: tests] Error 1

@polarathene
Member Author

Is there something I am missing that would prevent this solution?

You are not wrong, I am just really stretched for time right now. I am enrolled in a course until December that demands 40-80 hour weeks from students, and I need to be prepared before it starts in a few days.

You're welcome to adjust the PR or add a follow-up change 👍

I mostly didn't want to copy/paste more of the same steps, as we have similar jobs in other workflows. The GitHub docs mention a feature called "re-usable workflows" that might be an appropriate way of keeping that DRY. Alternatively, we could explore a matrix job (sketched below), which would spin up two separate runners in parallel to perform the same job while letting us adjust the platform used. That would also keep the code minimal, but needs some condition so the test job only runs after the AMD64 build is done.
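
A rough sketch of that matrix idea (action versions and the runner image are assumptions; the condition for starting the tests as soon as the AMD64 build finishes is the open question mentioned above):

```yaml
jobs:
  job-build-image:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        platform: ['linux/amd64', 'linux/arm64']
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v3
        with:
          platforms: ${{ matrix.platform }}

  job-run-tests:
    # Waits for every matrix leg to finish; starting once only the AMD64
    # build is done would need an extra condition:
    needs: job-build-image
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # ... run the BATS test suite against the AMD64 image ...
```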

@polarathene
Member Author

Oh, and the test failure is probably related to the recent PR tripping up more timing-sensitive tests :(

@georglauterbach
Member

You're welcome to adjust the PR or add a follow-up change 👍🏼

I will provide a follow-up PR :) I will have a look at the proposals you gave afterwards too!

Member

@georglauterbach georglauterbach left a comment


LGTM 👍🏼

@polarathene
Member Author

I will have a look at the proposals you gave afterwards too!

These may be helpful for most of the improvements discussed.

The COPY --link improvement could be good for separate Fail2Ban and ClamAV stages, I think. ClamAV can probably just COPY --link over the large database files (and anything else important that [dive](https://github.com/wagoodman/dive) may reveal) from the official ClamAV image. But anyone building outside of CI would need a BuildKit release from March 2022 or newer, I think?
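
A minimal sketch of that idea, assuming the official clamav/clamav image keeps its databases under /var/lib/clamav (worth confirming with dive):

```dockerfile
# syntax=docker/dockerfile:1.4
FROM docker.io/clamav/clamav:latest AS clamav

FROM debian:11-slim AS stage-base
# --link copies into an independent layer, so changes to earlier layers of
# this stage no longer invalidate the copied database files:
COPY --link --from=clamav /var/lib/clamav /var/lib/clamav
```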

@casperklein
Member

casperklein commented Aug 24, 2022

I didn't have time yet to look at this content-wise. Just some stylistic questions so far:

  1. In this previous PR, quotes were removed (line 1). This PR adds quotes (line 27). Using quotes seems like a good practice.
  2. What is the rationale behind naming stages "stage" --> stage-base and jobs "job" --> job-build-image? Following that, the steps should be named "step" --> step-Checkout. On the other hand, we don't name our functions function_ etc. Unless there is a good reason to do so, I would stick with not including keywords in names.

@polarathene
Member Author

polarathene commented Aug 25, 2022

  1. Using quotes seems like a good practice.

Probably just a different style preference between me and @georglauterbach; happy to be consistent with whichever is preferred 👍

I don't think it makes any difference in this case (at least for name: value), but a quoted value stands out better by triggering different syntax highlighting.


What is the rationale behind naming stages
Unless there is a good reason to do so, I would stick with not including keywords in names.

Fair 👍

In both cases I chose to prefix the name with the scope/type of the value for added clarity, as the definition and the reference are a fair bit apart.

I'm not sure how familiar other maintainers are with multi-job workflows or a multi-stage Dockerfile, so it was to assist with review 😅

  • needs: job-build-image (makes it clear that needs is referencing a job dependency, whereas build-image alone might seem more ambiguous)
  • FROM stage-base AS stage-final (more evident that it's referencing stages, not images; FROM base AS final may seem a tad ambiguous)

Happy to remove it if you like, ctrl + f / grep works just as well to get the context 👍

@casperklein
Member

needs: job-build-image (makes it clear that needs is referencing a job dependency, whereas build-image alone might seem more ambiguous)

Ack 👍

FROM stage-base AS stage-final (more evident that it's referencing stages, not images; FROM base AS final may seem a tad ambiguous)

Fair. It's just that I've mostly seen FROM base AS final in the wild (and was fine with it).

Happy to remove it if you like, ctrl + f / grep works just as well to get the context 👍

If it's only me, keep it 😄

Member

@casperklein casperklein left a comment


Good work! I haven't tested the CI part, but I think you know what you're doing 😉

@polarathene
Member Author

Fair. It's just that I've mostly seen FROM base AS final in the wild (and was fine with it).

Yeah, I'm fine with that too and don't mind doing so in my own Dockerfiles 👍

I've only been a bit more verbose since it's harmless and may help maintainers / contributors who aren't as familiar :)

I'll merge as-is. If any maintainer wants to remove the job- or stage- prefixes as redundant, I'm happy to add approval to a PR for that.

@polarathene polarathene merged commit 21fbbfa into docker-mailserver:master Aug 27, 2022
@casperklein
Member

Another question for clarity: the tests have been moved to a new job (job-run-tests), where the image is built again (from cache).

In job-build-image the cache is replaced. Can this lead to problems/race conditions when more than one job-build-image is running? For example, when two or more PRs are opened at the same time?

@polarathene
Member Author

polarathene commented Aug 28, 2022

Can this lead to problems/race conditions when more than one job-build-image is running?

That's a good question! :)

The TL;DR version: not that I know of.

This is just a build cache; any change to the Dockerfile or content that would invalidate a local docker build invalidates it here too. We're only providing the previous building blocks to finish the image, not a cache of the final result that skips builds (the stage-final portion is likely to be invalidated often due to the ARG reflecting the git commit SHA).

  • I have the build layer cache assigned a cache key derived from the file content of target/ + Dockerfile + VERSION, so if nothing changes, the jobs use the same cache contents; otherwise they retrieve different caches.
  • It should be unlikely that we'd experience a cache eviction (AMD64 + ARM64 = ~660MB per cache stored; the least used caches are evicted first from the 10GB repo cache storage limit), but if it did happen, we'd just re-build from scratch at this step.
  • When there is no direct cache hit, we fall back to the last used cache key (anything matching the key name without the content hash suffix, via restore-keys). That may provide cache up to a certain point of the build; the build then re-creates any layers necessary and uploads the new cache afterwards (assuming the build was successful).

When I wrote this improvement I did encounter one failure case, due to the ARG usage previously being placed too early.

The image would build successfully and then upload the cache. On any future build where the ARG changed (basically any new commit), the cache hit was successful, but all of it was invalidated from the start due to the ARG...

So it rebuilt from scratch each time, and at the end of the job it refused to update the cache for that key because there had been a successful cache hit (this only applies to an exact match on the cache key; restore-keys fallbacks would still update it).

@polarathene
Member Author

For example, when two or more PRs are opened at the same time?

I was careful to handle that here, as I had the same concern. I did notice I messed up with the docs preview workflow, though.

It uploads an artifact to a common name; I can't recall if it's scoped to the workflow run or can be overwritten by other PRs building doc previews in parallel. It provides the docs build result and PR metadata. I don't think I've seen it get mixed up yet, but it might be worth investigating 😅

Thankfully it's just for doc previews and shouldn't cause any risks.
