Use ghcr.io for Docker image caching #99
Conversation
@freakboy3742, now that I have all this working, and before I complete the work: do you agree with this approach? Any questions or concerns about the implementation and its implications? Here are the PoC Briefcase changes as well.
(force-pushed from 5d4571e to 83d8fe8)
The general approach makes sense; two questions:
Nothing is able to use caching so far; the caches haven't been seeded yet. In my testing, I was seeing as much as 5-7 minutes shaved off each run using caching. Although, this seems to be particularly contingent on the Fedora package CDN not dropping to 150 kB/s during the dry-run install...
Nothing; however, my expectation of how an image registry works means this shouldn't really be necessary. AFAIU, a tag is mostly a manifest of image layer hashes, so just because a new tag is created does not mean that wholly new images are being uploaded to the registry. For the vast majority of CI runs, they will use and create exactly the same image layer hashes that already exist in the registry; all that really grows over time is the list of tags. Additionally, I'm still debating whether we need to support uploading packages outside of the BeeWare namespace; I have a little more testing to do, but I think that won't be necessary. To that end, @freakboy3742, can you check whether write permissions for packages are enabled for the BeeWare org? And if we're not really comfortable giving the
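For concreteness, this is roughly the shape I have in mind (an illustrative sketch only - the action versions, cache image name, and tag below are assumptions, not the exact workflow in this PR):

```yaml
- uses: docker/setup-buildx-action@v3

- name: Log in to ghcr.io
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

- name: Build using the registry as a layer cache
  uses: docker/build-push-action@v5
  with:
    context: .
    # Pull any layer hashes that already exist under this (hypothetical) tag...
    cache-from: type=registry,ref=ghcr.io/beeware/beeware-ci:cache
    # ...and push back only the manifests and layer deltas that are actually new.
    cache-to: type=registry,ref=ghcr.io/beeware/beeware-ci:cache,mode=max
```

If the Dockerfile and its inputs haven't changed, a run just downloads the cached layers instead of re-running the apt/dnf steps.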
(force-pushed from 83d8fe8 to ae4232f)
👍
Sure, but the contents of the hash still need to be stored - and in our case, those contents will include all the packages that have been installed, which will change every time the upstream repo pushes an update to any individual package. That's potentially a couple of hundred megabytes for each hash, isn't it?
I can confirm write permissions are enabled for the BeeWare org:
The Docker build process will use the cached layer instead of re-running the step and creating a new layer...so, even if Debian uploads new packages, they will not be installed into the Docker image in CI. This is actually quite similar to the problem we had with the Briefcase cache, where we added the year and month to the cache key to prevent it from becoming too stale. We'll probably need to do the same thing here. Furthermore, my testing in #100 shows that even PRs from BeeWare org members will not be able to push to packages; so, for the most part, the only tags for the package will come from the main branch. Is there a concern about uploading new tags that I'm missing, though? Want to make sure we're not talking past each other.
Also, here are some timings using caching (you can ignore PySide because it got stuck arguing with Fedora's packaging mirrors...): https://github.com/beeware/.github/actions/runs/7983571675/usage
I updated the caching strategy to only attempt to push and pull from the BeeWare org package namespace. The use case for pushing to a user's package namespace was mostly for forks that aren't committing back upstream...but I don't think anybody is really doing that. Additionally, since the
(force-pushed from 85ce773 to 132f93b)
I can only assume I'm misunderstanding what is being cached/optimized here. My understanding is that the process works broadly like this:

- As a user, you define a Dockerfile. This describes a starting image (ubuntu/latest), and a series of instructions for building on that image. Each instruction forms a new hash, which is uniquely addressable; so if you've got instructions A B C, and you modify the instructions to A B D, the first two can be reused.
- The hash of each layer is a unique identifier, but for convenience, you can associate a tag with a hash - so you can refer to the layer stack by a human-readable name instead of by hash. The hash is just an identifier. On disk, actual files are being downloaded and stored. However, it's efficient "delta-based" storage - if you have a big base image hash A, and add 1 tiny script file in B, the incremental storage from storing both the base image and the script file is A+B, not 2A + B.
- The hash is also a unique identifier. If I define a Dockerfile that adds an "echo helloworld" instruction on top of ubuntu/latest, and you do the same, they'll be the same hash.
- This becomes especially useful when repositories get added to the mix. Every local Docker install has a repository, which is used as a local cache. When you build an image based on ubuntu/latest, the first step is downloading that image from an upstream repo, and caching it locally. When I write my Dockerfile, the layers, keyed by hash, are stored in my local repo. If I have a Dockerfile that is particularly interesting, I can then publish that hash to an upstream repo (like ghcr.io), using a human-readable tag.
- So - if I build an image based on ubuntu/latest that adds layers A B C, I can publish that as rkm/latest. This will push the hashes for A, B and C, and the deltas of content for A, B and C, plus the rkm/latest label. Another user can then pull the rkm/latest image from ghcr.io and use it (and build on it, if they want).
- The useful feature for our purposes here is that if I have published rkm/latest, and a second user pulls rkm/latest into their local repo, and then they try to run a Dockerfile that defines an image with the rkm/latest + A B C configuration, that will evaluate to the same hash, and the image that has been pulled from ghcr.io can be used as is, rather than needing to go through the process of actually running the commands to evaluate A, B and C (which, in the case of an apt update, could be time consuming). As long as the time required to download the layers for A B C is less than the time to build A B C from scratch (and it almost certainly will be for any apt install), you've got a net win.

So - based on that understanding: my concern is the storage associated with these tags. We're not talking about a single image - we're talking about publishing A B C, A B D, A B E, ... with each PR essentially producing a different hash to be pushed to ghcr.io; and while the storage is a delta (in that we won't be reproducing the storage of the whole of ubuntu with each image), we will be storing the C, D, E, ... deltas - and in our case, those can be a whole bunch of apt installs of entire compiler suites - potentially a hundred megabytes or more each. That isn't a problem in itself...until we run the clock forward, and we've been pushing hundred-megabyte layer deltas on every pull request for a year. In my own experience, I need to purge my local docker cache regularly.

Have I missed something in my understanding?
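(Purely as an illustration of that mental model - a throwaway CI step assuming a Dockerfile in the working directory, nothing to do with this PR - the reuse and per-layer deltas are easy to see:)

```yaml
- name: Demonstrate layer reuse
  run: |
    # The first build runs every instruction and stores one layer per step.
    docker build -t layer-demo .
    # The second build resolves every instruction to an existing layer hash,
    # so each step reports "Using cache" and no real work happens.
    docker build -t layer-demo .
    # One row per layer; SIZE is the delta that layer adds, not the full image size.
    docker image history layer-demo
```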
AFAICT, your understanding is correct...however, I think the way it applies to our situation, and therefore its implications, may be somewhat over-inflated.
As you said above:
Therefore, all the PRs calculate the same hashes and skip the step altogether...so, the PRs don't need to upload anything except the tag, because all the hashes already exist in ghcr.io; and if they don't, this push will seed the cache. To this end, I added the year and month to the tag, so the tags may include new image layers once a month...unless, of course, the actual Dockerfiles change, since that would also create new image layers.
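In sketch form, the month scoping is just something like this (illustrative names, not the literal implementation):

```yaml
- name: Derive a month-scoped cache tag
  id: cache
  run: echo "tag=cache-$(date +'%Y-%m')" >> "$GITHUB_OUTPUT"
  # e.g. "cache-2024-02"; this value is then used as the cache-from/cache-to ref,
  # so once a month the layers are rebuilt from scratch and the cache is re-seeded.
```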
It may; GitHub is allowing open source projects to use seemingly unlimited storage right now. From personal experience, I haven't had any problems; for instance, I am regularly producing hundreds of new tags (quite likely with new hashes) for my docker-qbittorrent project. From my standpoint, if GitHub wants to change this storage situation, then they'll probably need to start pruning old image layers somehow...or stop accepting new pushes. Either way, I expect some sort of notification and time to react. Alternatively, I'm not against something to delete the package regularly. I do find this line in the docs a bit interesting, though:
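If we do decide to prune proactively, a scheduled workflow along these lines would probably do it (a sketch; the schedule, retention count, and inputs are assumptions on my part):

```yaml
name: Prune CI image cache
on:
  schedule:
    - cron: "0 3 1 * *"   # once a month

jobs:
  prune:
    runs-on: ubuntu-latest
    permissions:
      packages: write
    steps:
      # Delete old container versions, keeping the most recent ten.
      - uses: actions/delete-package-versions@v5
        with:
          package-name: beeware-ci
          package-type: container
          min-versions-to-keep: 10
```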
Ok - I think that's the source of my confusion. I was foreseeing a larger proliferation of images than we're going to see in practice. I was forgetting that the image for any given BeeWare CI build is essentially identical across all PR runs, with the only real discrepancies happening when either the Dockerfile changes, or a specific PR adds a new dependency to the system/system_requires list - both of which are fairly infrequent changes; and regardless, each would be a delta on top of the shared base hash which does the initial system update and Python install.
With my grey hat on, I can see this as a way to get unlimited free archival disk storage...
The problem I was foreseeing is definitely a lot less acute given that we're going to be pushing a lot fewer images than I originally thought. If there's no in-practice issue right now, then I don't think we need to invent one. As long as we've got a reasonable potential strategy for purging if/when they introduce a limit, then I'm happy. Since this is entirely a cache/optimization approach, I guess "nuke the whole thing from orbit" is an entirely reasonable option.
I'm going to guess this is protection against a left-pad-style rug pull. Someone builds an image that lots of people integrate into their workflows, then yanks the image (either intentionally or accidentally), and everyone downstream has to scramble to fix their projects.
(force-pushed from 132f93b to 2cab956)
Just giving this a final look over - is this 403 concerning?
It's an outcome of the GITHUB_TOKEN permissions: the token for PRs isn't allowed to write to packages. So, CI will only be able to actually push a tag on the main branch.
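In workflow terms, the effective behavior is roughly this (illustrative only; the image name and steps are assumptions rather than the exact job in this PR):

```yaml
jobs:
  docker-cache:
    runs-on: ubuntu-latest
    permissions:
      packages: write   # PR-triggered runs still end up with a read-only token for packages (see #100)
    steps:
      - uses: actions/checkout@v4

      - name: Log in to ghcr.io
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build the cached image
        run: docker build -t ghcr.io/beeware/beeware-ci:main .

      - name: Push the cache tag
        # Only main actually publishes; attempting the push from a PR is what produces the 403.
        if: github.ref == 'refs/heads/main'
        run: docker push ghcr.io/beeware/beeware-ci:main
```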
As always, the twisty passages of GitHub permissions and "CI testing CI configurations" make this a bit of a headache to confirm, but I think this all makes sense.
Changes
- Example beeware-ci package from testing: https://github.com/rmartin16/.github-beeware-test/pkgs/container/beeware-ci
- Runs on the main branch create and update the beeware-ci package for the BeeWare org; a run for main produces a tag for the main branch
- Tags can only be pushed from the main branch because the GITHUB_TOKEN for PRs doesn't allow for writing to packages; PRs only pull from the main tag
- Verified with app-build-verify in #98

Notes
PR Checklist: