
Use ghcr.io for Docker image caching #99

Merged: 1 commit into beeware:main on Feb 23, 2024

Conversation

@rmartin16 (Member) commented Feb 20, 2024

Changes

  • Use a ghcr.io package called "beeware-ci" to cache Docker image builds
  • This implementation is inspired by how caching is managed in the python-build-standalone project....so, there's at least some precedent for this approach
  • Images for different Briefcase builds are managed by tags on the package
  • Caching strategy (an illustrative sketch of the mechanism follows this list)
    • The cache is pulled from one of the following locations, in order of precedence:
      • The package from the BeeWare org, using the branch for the event
      • The package from the BeeWare org, using the main branch
    • The cache is pushed to the beeware-ci package for the BeeWare org
      • So, for a PR, this will be a tag using the branch name
      • For a merge into main, that will be a tag for the main branch
      • This push will likely fail except on the main branch, because the GITHUB_TOKEN for PRs doesn't allow writing to packages
      • However, that should be fine because the cache can still be pulled using the main tag
  • Closes Cache Docker images for app-build-verify #98
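
To make the mechanism concrete, here is a minimal sketch of the kind of workflow step this enables, assuming docker/login-action and docker/build-push-action with a registry-backed cache. The action versions, tag scheme, and exact wiring inside the composite action are illustrative, not a copy of the implementation:

```yaml
# Illustrative only: action versions and the exact tag scheme are assumptions.
- name: Log in to ghcr.io
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

- name: Build with a registry-backed cache
  uses: docker/build-push-action@v5
  with:
    context: .
    # Pull cache: prefer the tag for this branch, fall back to main.
    cache-from: |
      type=registry,ref=ghcr.io/beeware/beeware-ci:${{ github.ref_name }}
      type=registry,ref=ghcr.io/beeware/beeware-ci:main
    # Push cache: write back to the BeeWare org package. This is expected to
    # fail for fork PRs, where GITHUB_TOKEN only grants packages: read.
    cache-to: type=registry,ref=ghcr.io/beeware/beeware-ci:${{ github.ref_name }},mode=max
```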

Notes

PR Checklist:

  • All new features have been tested
  • All new features have been documented
  • I have read the CONTRIBUTING.md file
  • I will abide by the code of conduct

@rmartin16 (Member Author) commented:

@freakboy3742, now that I have all this working and before I complete the work, do you agree with this approach? Questions or concerns about the implementation and its implications?

Here are the PoC Briefcase changes as well.

@freakboy3742 (Member) commented:

The general approach makes sense; two questions:

  1. CI times seem to have increased as a result of this caching. The Verify Linux (toga) step in this PR's CI took 13:40; the most recent main CI build took 11:06 for the same step. Is this just a first-time build cache cost, or is there something else going on?
  2. This is going to create tagged Docker images for every PR - won't that space grow continuously? What process will purge the image cache?

@rmartin16 (Member Author) commented Feb 20, 2024

The general approach makes sense; two questions:

  1. CI times seem to have increased as a result of this caching. The Verify Linux (toga) step in this PR's CI took 13:40; the most recent main CI build took 11:06 for the same step. Is this just a first-time build cache cost, or is there something else going on?

Nothing is able to use caching so far; the caches haven't been seeded yet. In my testing, I was seeing as much as 5-7 minutes shaved off each run when using caching. Although, that seems to be particularly contingent on the Fedora package CDN not dropping to 150 kB/s for the dry-run install...

  2. This is going to create tagged Docker images for every PR - won't that space grow continuously? What process will purge the image cache?

Nothing; however, based on how an image registry works, that shouldn't really be necessary. AFAIU, a tag is mostly a manifest of image layer hashes. So, just because a new tag is created does not mean that wholly new images are being uploaded to the registry. For the vast majority of CI runs, they will use and create the exact same image layer hashes that already exist in the registry. So, all that really grows over time is the list of tags.

Additionally, I'm still debating whether we need to support uploading packages outside of the BeeWare namespace. I have a little more testing to do, but I think the GITHUB_TOKEN will be valid for members of the BeeWare org to push packages....and non-members' attempts to push will simply fail. That would simplify this a bit....especially since the vast majority of PRs won't have any impact on what's in these images.

To that end, @freakboy3742, can you check if the GITHUB_TOKEN for the BeeWare org has write privileges enabled?

https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/#setting-the-default-permissions-for-the-organization-or-repository

And if we're not really comfortable giving the GITHUB_TOKEN write permissions, then we can use a specific token....but that will definitely prevent uploading packages elsewhere. That said, only workflows that explicitly have a permissions section will be able to use the write permission anyway.
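
For reference, a workflow opts in to package writes with the built-in token via a permissions block like the following (a minimal sketch; everything else about the workflow is assumed):

```yaml
# Grant the built-in GITHUB_TOKEN write access to GitHub Packages for this
# workflow only; all other permissions stay at read or their defaults.
permissions:
  contents: read
  packages: write
```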

@freakboy3742 (Member) commented:

The general approach makes sense; two questions:

  1. CI times seem to have increased as a result of this caching. The Verify Linux (toga) step in this PR's CI took 13:40; the most recent main CI build took 11:06 for the same step. Is this just a first-time build cache cost, or is there something else going on?

Nothing is able to use caching so far; the caches haven't been seeded yet.

👍

  2. This is going to create tagged Docker images for every PR - won't that space grow continuously? What process will purge the image cache?

Nothing; however, based on how an image registry works, that shouldn't really be necessary. AFAIU, a tag is mostly a manifest of image layer hashes. So, just because a new tag is created does not mean that wholly new images are being uploaded to the registry. For the vast majority of CI runs, they will use and create the exact same image layer hashes that already exist in the registry. So, all that really grows over time is the list of tags.

Sure, but the contents of the hash still need to be stored - and in our case, those contents will include all the packages that have been installed, which will change every time the upstream repo pushes an update to each individual package. That's potentially a couple of hundred meg for each hash, isn't it?

To that end, @freakboy3742, can you check if the GITHUB_TOKEN for the BeeWare org has write privileges enabled?

I can confirm write permissions are enabled for the BeeWare org:

[Screenshot, 2024-02-21: organization package settings showing write permissions enabled]

@rmartin16 (Member Author) commented:

2. This is going to create tagged Docker images for every PR - won't that space grow continuously? What process will purge the image cache?

Nothing; however, based on how an image registry works, that shouldn't really be necessary. AFAIU, a tag is mostly a manifest of image layer hashes. So, just because a new tag is created does not mean that wholly new images are being uploaded to the registry. For the vast majority of CI runs, they will use and create the exact same image layer hashes that already exist in the registry. So, all that really grows over time is the list of tags.

Sure, but the contents of the hash still need to be stored - and in our case, those contents will include all the packages that have been installed, which will change every time the upstream repo pushes an update to each individual package. That's potentially a couple of hundred meg for each hash, isn't it?

The Docker build process will use the cached layer instead of re-running the step and creating a new layer....so, even if Debian uploads new packages, they will not be installed into the Docker image in CI.

This is actually quite similar to the problem we had with the Briefcase cache, where we added the year and month to the cache key to prevent it from becoming too stale. We'll probably need to do the same thing here.

Furthermore, though, my testing in #100 shows that even PRs from BeeWare org members will not be able to push to the package; so, for the most part, the only tags for the package will be from the main branch....which should be fine in practice, I think.

Is there a concern about uploading new tags that I'm missing, though? Want to make sure we're not talking past each other.

@rmartin16 (Member Author) commented Feb 21, 2024

Also, here are some timings using caching:

[Screenshot: CI job timings with caching enabled]

You can ignore PySide because it got stuck arguing with Fedora's packaging mirrors...

https://github.com/beeware/.github/actions/runs/7983571675/usage

@rmartin16 (Member Author) commented:

I updated the caching strategy to only attempt to push and pull from the BeeWare org package namespace. The use-case for pushing to a user's package namespace was mostly for forks that aren't committing back upstream...but I don't think anybody is really doing that.

Additionally, since the GITHUB_TOKEN apparently doesn't have write permissions for PRs from forks, only the main branch will ever actually push to the package without erroring.
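
If we ever want to make that explicit rather than relying on the fork token lacking write access, a guard along these lines would work (the step name and push command are illustrative, not what the action currently does):

```yaml
# Illustrative: only attempt the cache push for pushes to main;
# fork PRs never reach this step.
- name: Push Docker image cache
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  run: docker push ghcr.io/beeware/beeware-ci:main
```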

@rmartin16 force-pushed the docker-caching branch 2 times, most recently from 85ce773 to 132f93b on February 21, 2024 20:22
@freakboy3742 (Member) commented:

  2. This is going to create tagged Docker images for every PR - won't that space grow continuously? What process will purge the image cache?
    ...
    Is there a concern about uploading new tags that I'm missing, though? Want to make sure we're not talking past each other.

I can only assume I'm misunderstanding what is being cached/optimized here.

My understanding is that the process works broadly like this:

As a user, you define a Dockerfile. This describes a starting image (ubuntu/latest), and a series of instructions for building on that image.

Each instruction forms a new hash, which is uniquely addressable, so if you've got instructions A B C, and you modify the instructions to A B D, the first two can be reused.

The hash of each layer is a unique identifier, but for convenience, you can associate a tag with a hash - so you can refer to the 1deadbeefcafe hash as myproject/latest.

The hash is just an identifier. On disk, actual files are being downloaded and stored. However, it's efficient "delta-based" storage - if you have a big base image hash A, and add 1 tiny script file in B, the incremental storage from storing both the base image and the script file is A+B, not 2A + B.

The hash is also a unique identifier. If I define a docker file that adds an "echo helloworld" instruction on top of ubuntu/latest, and you do the same, they'll be the same hash.

This becomes especially useful when repositories get added to the mix. Every local Docker install has a repository, which is used as a local cache. When you build an image based on ubuntu/latest, the first step is downloading that image from an upstream repo, and caching it locally. When I write my Dockerfile, the layers, keyed by hash, are stored in my local repo.

If I have a Dockerfile that is particularly interesting, I can then publish that hash to an upstream repo (like ghcr.io), using a human-readable tag. So - if I build an image based on ubuntu/latest that adds layers A B C, I can publish that as rkm/latest. This will push the hashes for A B and C, and the deltas of content for A B and C, plus the rkm/latest label. Another user can then pull the rkm/latest image from ghcr.io and use it (and build on it, if they want).

The useful feature for our purposes here is that if I have published rkm/latest, and a second user pulls rkm/latest into their local repo, and then they try to run a Dockerfile that defines an rkm/latest + A B C configuration, that will evaluate to the same hash, and the image that has been pulled from ghcr.io can be used as-is, rather than needing to go through the process of actually running the commands to evaluate A, B and C (which, in the case of an apt update, could be time consuming). As long as the time required to download the layers for A B C is less than the time to build A B C from scratch (and it almost certainly will be for any apt install), you've got a net win.
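
(As a toy illustration of that reuse, not taken from this PR: building the same instructions twice resolves every layer from the local cache on the second run.)

```yaml
# Purely illustrative: two identical builds; the second one reports each
# layer as CACHED instead of re-running the instruction.
- name: Demonstrate layer reuse
  run: |
    cat > Dockerfile <<'EOF'
    FROM ubuntu:latest
    # layer A
    RUN apt-get update
    # layer B
    RUN apt-get install -y build-essential
    # layer C
    RUN echo "hello world" > /hello
    EOF
    docker build -t layer-demo .
    docker build -t layer-demo .
```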

So - based on that understanding: My concern is the storage associated with these tags. We're not talking about a single image - we're talking about publishing A B C, A B D, A B E, ... with each PR essentially producing a different hash to be pushed to ghcr.io; and while the storage is a delta (in that we won't be reproducing the storage of the whole of ubuntu with each image), we will be storing the C, D, E,... deltas - and in our case, those can be a whole bunch of apt installs of entire compiler suites - potentially a hundred megabytes or more each.

That isn't a problem in itself.... until we run the clock forward, and we've been pushing hundred-megabyte layer deltas on every pull request for a year. In my own experience, I need to purge my local docker cache regularly - docker system prune -a releasing dozens of gigabytes is disturbingly common. In the local case, this can be mitigated by remembering to use docker run --rm so that ephemeral images aren't retained; but if we're actually pushing images to a central repo, I would have thought storage on that repo would eventually become a problem, and we would need to consider how to purge images that are a month old and won't ever be used again.

Have I missed something in my understanding?

@rmartin16 (Member Author) commented Feb 22, 2024

Have I missed something in my understanding?

AFAICT, your understanding is correct...however, I think its application to our situation, and therefore the implications, may be somewhat over-inflated.

we're talking about publishing A B C, A B D, A B E, ... with each PR essentially producing a different hash to be pushed to ghcr.io

As you said above:

The hash is also a unique identifier. If I define a docker file that adds an "echo helloworld" instruction on top of ubuntu/latest, and you do the same, they'll be the same hash.

Therefore, all the PRs calculate the same hashes and skip the step altogether...so, the PRs don't need to upload anything except for the tag because all the hashes already exist in ghcr.io; and if they don't, this push will seed the cache.

To this end, I added the year and month to the tag so the tags may include new image layers once a month....unless of course the actual Dockerfiles change, since that would also create new image layers.
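
As a sketch, computing a date-scoped tag could look like this (the exact tag format used by the action isn't shown here, so treat the names as illustrative):

```yaml
# Roll the year and month into the cache tag so stale layers age out of use
# at most a month after they stop matching.
- name: Compute cache tag
  run: echo "CACHE_TAG=${GITHUB_REF_NAME}-$(date +%Y-%m)" >> "$GITHUB_ENV"
```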

would have thought storage on that repo would eventually become a problem, and we would need to consider how to purge images that are a month old and won't ever be used again.

It may; GitHub is allowing open source projects to use seemingly unlimited storage right now. From personal experience, I haven't had any problem; for instance, I am regularly producing hundreds of new tags (and quite likely with new hashes) for my docker-qbittorrent project.

I guess from my standpoint, if GitHub wants to change this storage situation, then they'll probably need to start pruning old image layers somehow.....or stop accepting new pushes. Either way, I expect some sort of notification with the time to react.

Alternatively, I'm not against something to delete the package regularly. I do find this line in the docs a bit interesting, though:

If the package is public and the package version has more than 5,000 downloads, you cannot delete the package version. In this scenario, contact GitHub support for further assistance.

https://docs.github.com/en/rest/packages/packages?apiVersion=2022-11-28#delete-a-package-version-for-the-authenticated-user
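
For completeness, a hedged sketch of what a periodic cleanup step could look like, using the org-level counterpart of that endpoint; PACKAGE_VERSION_ID is a placeholder that would come from listing versions first:

```yaml
# Illustrative cleanup step; PACKAGE_VERSION_ID is a placeholder, and the
# token used would need sufficient package permissions.
- name: Delete an old beeware-ci package version
  run: |
    gh api --method DELETE \
      "/orgs/beeware/packages/container/beeware-ci/versions/${PACKAGE_VERSION_ID}"
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```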

@freakboy3742 (Member) commented:

The hash is also a unique identifier. If I define a docker file that adds an "echo helloworld" instruction on top of ubuntu/latest, and you do the same, they'll be the same hash.

Therefore, all the PRs calculate the same hashes and skip the step altogether...so, the PRs don't need to upload anything except for the tag because all the hashes already exist in ghcr.io; and if they don't, this push will seed the cache.

To this end, I added the year and month to the tag so the tags may include new image layers once a month....unless of course the actual Dockerfiles change, since that would also create new image layers.

Ok - I think that's the source of my confusion. I was foreseeing a larger proliferation of images than we're going to see in practice. I was forgetting that the image for any given BeeWare CI build is essentially identical for all PR runs, with the only real discrepancies happening when either the Dockerfile changes, or a specific PR adds a new dependency to the system/system_requires list - both of which are fairly infrequent changes; and regardless, each would be a delta on top of the shared base hash which does the initial system update and Python install.

would have thought storage on that repo would eventually become a problem, and we would need to consider how to purge images that are a month old and won't ever be used again.

It may; GitHub is allowing open source projects to use seemingly unlimited storage right now.

With my grey hat on, I can see this as a way to get unlimited free archival disk storage...

Alternatively, I'm not against something to delete the package regularly.

The problem I was foreseeing is definitely a lot less acute given that we're going to be pushing a lot fewer images than I originally thought. If there's no in-practice issue right now, then I don't think we need to invent one. As long as we've got a reasonable potential strategy for purging if/when they introduce a limit, then I'm happy. Since this is entirely a cache/optimization approach, I guess "nuke the whole thing from orbit" is an entirely reasonable option.

I do find this line in the docs a bit interesting, though:

If the package is public and the package version has more than 5,000 downloads, you cannot delete the package version. In this scenario, contact GitHub support for further assistance.

https://docs.github.com/en/rest/packages/packages?apiVersion=2022-11-28#delete-a-package-version-for-the-authenticated-user

I'm going to guess this is protection against a left-pad style rug pull. Someone builds an image that lots of people integrate into their workflows, then yanks the image (either intentionally or accidentally), and everyone downstream has to scramble to fix their projects.

@rmartin16 marked this pull request as ready for review on February 23, 2024 02:17
@freakboy3742 (Member) commented:

Just giving this a final look over - is this 403 concerning?

@rmartin16 (Member Author) commented:

Just giving this a final look over - is this 403 concerning?

It's an outcome of the GITHUB_TOKEN permissions for a PR from a fork:

GITHUB_TOKEN Permissions
  Contents: read
  Metadata: read
  Packages: read

So, CI will only be able to actually push a tag on the main branch or for PRs opened on the repo itself, like CI for #100.

@freakboy3742 (Member) left a review comment:


As always, the twisty passages of GitHub permissions and "CI testing CI configurations" make this a bit of a headache to confirm, but I think this all makes sense.

@freakboy3742 merged commit b243305 into beeware:main on Feb 23, 2024
72 checks passed
@rmartin16 deleted the docker-caching branch on February 23, 2024 05:55