
Proposal: control cache sources during build with --cache-from #26065

Closed
stevvooe opened this issue Aug 26, 2016 · 6 comments · Fixed by #26839
Labels
area/distribution kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Comments

@stevvooe
Contributor

stevvooe commented Aug 26, 2016

This proposal was originally from #24711 (comment).

With several changes to the image format, the scope of build caching has been limited to local nodes. This is problematic for architectures that dispatch builds to arbitrary nodes, since pulling new images and data will not populate the build cache.

The main concern here is cache poisoning. The worst part about it is that it is not at all obvious whether you are affected or protected. It can only be mitigated by limiting the horizon of data that one trusts.

Anything that circumvents that protection, even docker save/load, is going to open your infrastructure up to injection of malicious content. The proposal in #20316 and previous proposals have not addressed this problem. While we all want fast builds (and I really do), introducing cache poisoning to the build step of the infrastructure must be avoided. Could you imagine the impact if someone could just inject a malicious layer into library/ubuntu or library/alpine?

The other aspect to this is the misapplied assumption about the idempotence of shell commands. apt-get update run twice is never guaranteed to have the same result. Ever. That is just not how it works. If you have a build cache that is never purged, you will never update your upstream software. That may or may not be the intent. Even worse, if this build cache gets filled in with remote data, you probably have no visibility into when that command was run.

The underlying problem here is that with 1.10 changes in the image format, we no longer restore the parent image chain when pulling from a registry. As such, a proper solution to this problem involves something that can control the level of trust for content to a distributed build cache.

Let's look at how we build an image, with FROM alpine at the top:

docker pull mysuperapp:v0
docker build -t mysuperapp:v1 .
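
For concreteness, a minimal Dockerfile for mysuperapp might look like the sketch below; the package and file names are illustrative, not from the original issue. Each instruction produces a layer, and those layers are what the build cache matches against.

```dockerfile
# Hypothetical Dockerfile for mysuperapp. Every instruction below is a
# potential cache hit when rebuilding v1 against the layers of v0.
FROM alpine
RUN apk add --no-cache python3
COPY . /app
CMD ["python3", "/app/main.py"]
```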

In this simple case, we cannot assume that a remote mysuperapp:v0 and the ongoing build are related, since that would possibly introduce the cache-poisoning scenario we need to avoid. However, one may have local registry infrastructure that they know they can trust. While we can infer parentage (despite other assertions, this is still possible), we may not be able to trust that parentage, from a build-caching perspective, for all registries. But this build environment is special.

What better way than to tell the build process that it can trust a related image?

docker build --cache-from mysuperapp:v0 -t mysuperapp:v1 .

The above would allow Dockerfile commands to be satisfied from the entries of mysuperapp:v0 in the build of mysuperapp:v1. Job done!

No! We still have a problem. Now my build system has to know tag lineage (mysuperapp:v0 precedes mysuperapp:v1). Let's give the tagless reference a slightly different meaning:

docker pull -a mysuperapp
docker build --cache-from mysuperapp -t mysuperapp:v1 .

In the above, we pull all the tags from mysuperapp, any layer of which can satisfy the build cache. In practice, this probably is a little wide for most scenarios, so we can allow multiple --cache-from directives on the command line:

docker build --cache-from mysuperapp:v0 --cache-from mysuperapp:v1 -t mysuperapp:v2 .

There are many possibilities here to make this more flexible, such as running a registry service dedicated to build caching, e.g. mybuildcache.internal/mysuperapp. Did you know that you can just run a registry and rsync the filesystem around without locking? You can also rsync from multiple sources and merge the result safely (kind of). Such a registry can be purged periodically (or someone could submit a PR to purge old data).
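
As a sketch, a dedicated cache registry could be wired together as follows. The host name mybuildcache.internal and the image names are illustrative assumptions, not part of the proposal's text; only docker run, tag, push, pull, and build with --cache-from are used.

```shell
# Run a registry dedicated to build caching (hypothetical host name).
docker run -d -p 5000:5000 --name buildcache registry:2

# After a successful build, push the result so later builds can trust it.
docker tag mysuperapp:v1 mybuildcache.internal/mysuperapp:v1
docker push mybuildcache.internal/mysuperapp:v1

# On another build node: pull the cache source, then build trusting it.
docker pull mybuildcache.internal/mysuperapp:v1
docker build --cache-from mybuildcache.internal/mysuperapp:v1 -t mysuperapp:v2 .
```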

We can take this even further, but I hope the point is brought home. This is probably less convenient than the original behavior, but it is a good trade-off. It leverages the existing infrastructure and can be extended as use cases change.

Closes #18924.

@tonistiigi
Member

Some notes from #26839

  • Specifying multiple --cache-from images is a bit problematic. If both images match, there is no way (without doing multiple passes) to figure out which image to use. So we pick the first one (letting the user control the priority), but that may not be the longest chain we could have matched in the end. If we allowed matching against one image for some commands and later switched to a different image that had a longer chain, we would risk leaking some information between images, as we only validate history and layers for the cache. Currently I left it so that once we get a match, we only use that target image for the rest of the commands.
  • If the FROM image changes, no cache is reused. This is kind of useful and actually important, because otherwise users could never detect that there are important security fixes in the base image: they would always use the cache. But it may also come as a surprise to some users. I think we should not try to hack around it; rather, users who want different behavior should just use immutable tags or digests as the FROM image.
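
Pinning the FROM image by digest, as suggested in the second note above, would look like the fragment below. The digest shown is a made-up placeholder, not a real alpine digest, and the RUN line is illustrative.

```dockerfile
# Pin the base image by digest so the cache key stays stable even when
# the tag is repointed upstream. The digest below is a placeholder.
FROM alpine@sha256:0000000000000000000000000000000000000000000000000000000000000000
RUN apk add --no-cache curl
```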

@stevvooe
Contributor Author

@tonistiigi Thanks for implementing this feature!

@jonasschneider

jonasschneider commented Jan 19, 2017

Edit: Turns out we were in fact using the cache wrong (the cache is not used when a RUN doesn't modify the filesystem). Sorry for the confusion :)

--cache-from seems to be broken for me in 1.13.0. Here is a Gist that reproduces the problem: https://gist.github.com/jonasschneider/9e0bf96429444c89da1749225d89750a

Setup: Two empty Docker-in-Docker instances are started. The first one builds an image and pushes it to a registry. The second instance pulls down the image and attempts to build the identical Dockerfile, referencing the repo name of the first image in --cache-from.
Expected: The second build uses the cache of the first build.
Actual: The second build starts from scratch.
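
Condensed to its essential commands, the scenario from the Gist looks roughly like the following; the registry address and image name are illustrative stand-ins for the values in the Gist.

```shell
# First builder: build and push the image.
docker build -t registry.local:5000/app:latest .
docker push registry.local:5000/app:latest

# Second builder (empty local cache): pull, then build with --cache-from.
docker pull registry.local:5000/app:latest
docker build --cache-from registry.local:5000/app:latest -t app:latest .
# Expected: "Using cache" for each step; actual: a full rebuild.
```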

Am I using it wrong or is this a bug?

@thaJeztah
Member

@jonasschneider could you open an issue with more details (output of docker version and docker info), so that it can be looked into / checked if it's a bug or not?

@jonasschneider

@thaJeztah Thanks for taking a look. I think we've tracked down the issue to a pretty surprising combination of cache invalidation rules and non-FS-modifying RUN commands not being cached, so there might not be a bug. I'll think it through again and open issues accordingly.

@thaJeztah
Member

Thanks @jonasschneider 👍
