
caching and “apt-get update” #3313

Closed
stapelberg opened this issue Dec 22, 2013 · 30 comments · Fixed by #5816
Comments

@stapelberg

This issue is about clarifying the following scenario:

I created a Docker container a couple of weeks ago. The Dockerfile can be found at https://index.docker.io/u/stapelberg/git-daemon/. As you can see, it uses RUN apt-get update, which gets cached, as it should.

Now, a couple of weeks later, the package lists have changed, and with the old lists I cannot install postgresql (I get 404s for files that are no longer on the Debian mirrors).

Obviously, when running docker build -no-cache -t=stapelberg/postgresql ., this is not a problem, because the cache does not get used.

But that implies that I need to run every build that is based on Debian with -no-cache and can never make use of the cache. I have a hard time believing that this is how it’s supposed to be used.

I then tried to run docker rmi on the cached image:

docker build -t=stapelberg/postgresql:9.3 .                                                      
Uploading context 10240 bytes           
Step 1 : FROM tianon/debian:sid
 ---> 6bd626a5462b
Step 2 : RUN apt-get update
 ---> Using cache
 ---> 3702cc3eb5c9
…

docker rmi 3702cc3eb5c9
Error: Conflict, 3702cc3eb5c9 wasn't deleted
2013/12/22 09:57:03 Error: failed to remove one or more images

That error message is horrible. It doesn’t tell me any details about the conflict, so I have no clue what’s going on. My guess is that the issue is that 3702cc3eb5c9 is still in the “image chain” for e.g. stapelberg/git-daemon, which I do want to keep.

So, how can one specify that a certain step should not be cached for longer than a day?

Or how is running “apt-get update” supposed to work in Docker?

Note that my images inherit from Debian testing, which makes the problem really obvious, but the problem exists with any Debian(-based) operating system. Even the stable release gets security updates when appropriate, plus point releases. So one needs up-to-date apt lists at all times.

Any clarification about what I’m doing wrong is appreciated. Thanks.

@dttocs

dttocs commented Dec 22, 2013

How are you trying to install postgresql? Are you logging in and running apt-get install from a shell within the container, or appending it as a later RUN command in an updated Dockerfile? As with normal Debian installs, if the package lists have changed between the last apt-get update and the apt-get install, the install may fail. The root cause is that apt-get update is non-deterministic: the original run and a later run produce different results. The Docker cache just reuses previous builds; Docker has no special ability to know that apt-get update is non-deterministic and thus should not be cached.

For Debian or Ubuntu, if you're appending to a Dockerfile and want to use a cached build, another apt-get update should be run before trying to install a new package. That way you can still take advantage of the previous cached build (and any packages installed or upgraded subsequent to the first update), but your APT cache will be up-to-date when installing the next package.
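To illustrate the appended-update pattern, a sketch (base image and package names are just examples matching this thread):

```dockerfile
FROM tianon/debian:sid
RUN apt-get update
RUN apt-get install -y git-daemon-run   # still served from the build cache
# Appended later: refresh the lists again before installing anything new,
# so the subsequent install doesn't 404 against stale mirror paths.
RUN apt-get update
RUN apt-get install -y postgresql
```

The earlier layers stay cached; only the appended lines execute on rebuild.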

There may be a better way to do this - other suggestions welcome. There's some discussion of this in #880

For the docker rmi error message you're seeing, I suspect it's caused by having containers which depend on 3702cc3eb5c9. See Can't remove images Issue #3258 for an explanation, but in summary, run docker ps -a and remove any dependent containers before you try to delete the image.

@stapelberg
Author

I’m installing it with another RUN command in the same Dockerfile, yeah.

With regards to your suggestion to run another apt-get update, how would I go about that? Just add the command again? That clutters up the Dockerfile, and the point at which another apt-get update is needed differs from machine to machine. It really does not sound like a good solution :).

What I’d suggest is having an expiration date for RUN instructions, so that e.g. RUN [1h] apt-get update can be used to still have reasonably fast development/experimentation/testing cycles (using the cache).

@Sjord
Contributor

Sjord commented Dec 23, 2013

I have made the following script to run apt-get update in a container and tag the resulting image.

CIDFILE="my.cid"
IMAGE="my-image"

docker run -cidfile="$CIDFILE" "$IMAGE" apt-get update
cid=$(cat "$CIDFILE")
docker commit "$cid" "$IMAGE"
rm "$CIDFILE"

If you run this on tianon/debian:sid, you update the package list in that image. If you then run docker build to build your git-daemon image, it will use the new debian image as its base, with the updated package list.

@tianon
Member

tianon commented Dec 24, 2013

My preferred method to combat this in a natural way that busts the cache only when necessary is to couple the update lines with the install lines, like so:

RUN apt-get update && apt-get install -yq ...

Then, if you ever change the list of packages, the cache is naturally and normally invalidated properly, causing apt-get update to properly be invoked again, regardless of how long it's been since you ran it.
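As a sketch, the coupled form in a full Dockerfile (the package list is illustrative):

```dockerfile
FROM debian:sid
# Coupling update and install in one RUN means that editing the package
# list below changes this instruction's text, which invalidates the cache
# for this layer, so apt-get update is re-run whenever the list changes.
RUN apt-get update && apt-get install -yq \
    git \
    postgresql
```

One design note: because Docker keys the cache on the full instruction text, any edit to the list, even reordering, busts this layer and everything after it.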

Also, with the changes coming from stackbrew/debian (that will become just "debian" soon) and stackbrew/ubuntu, apt-get update should also complete much faster, since with the debian image we've switched to using "http.debian.net" by default (which is an auto-switching mirror), and we've stopped Apt from downloading the apt-cache translation files that hardly anyone uses, especially inside containers.

@Sjord
Contributor

Sjord commented Dec 24, 2013

This solves the problem when you add a new package, but what about when you want to create a docker image that uses the same packages, but newer versions?

Say you build your image three months ago, and since a new version of some package was released that you want in your docker image. However, docker won't install it, because it gets the old package list and old package version from the cache.

@tianon
Member

tianon commented Dec 25, 2013

This is where version pinning comes in handy (especially if the version of said package is actually important to your image), like RUN apt-get update && apt-get install -yq bash=4.2*. When 4.3 finally comes out, I would modify that line, which would automatically bust the cache and install the new version (including automatically doing the necessary apt-get update, since they're together now).

@stapelberg
Author

tianon, thanks for sharing your way of doing this. I don’t feel like this is a satisfactory solution for my case, though, where the version number is not important to me. What’s important is just that the build file keeps working, ideally with some sort of caching. With my expiration time suggestion, that’d be the case.

@inthecloud247

Or you can add a line like this as the third line in your Dockerfiles (after the FROM and MAINTAINER lines):

ENV LAST_UPDATED 2013-12-20

To update the cache, simply change the LAST_UPDATED line and it'll invalidate everything below it, including any apt-get update lines.
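A sketch of that layout (the date value is arbitrary; bumping it invalidates every layer below it):

```dockerfile
FROM debian:sid
# Bump this date whenever you want the layers below rebuilt from scratch.
ENV LAST_UPDATED 2013-12-20
RUN apt-get update
RUN apt-get install -y postgresql
```

Changing the ENV value changes the instruction text, so Docker's cache lookup misses from that point onward.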

@tianon
Member

tianon commented Dec 27, 2013

@ydavid365 in that case, why not just use -no-cache when you want to rebuild?

@inthecloud247

@tianon I thought that if you select the -no-cache option it won't use the existing cache and also won't create any new cache for subsequent builds.

@tianon
Member

tianon commented Dec 27, 2013

Nope, your new layers are prime candidates for the next cache lookup (unless you specify -no-cache again, of course). You might see some odd behavior if you have multiple images that match the cache, but layers from a -no-cache run will be used for future caching.

@inthecloud247

Wow, that's not the behavior I expected. Hm. Thanks for pointing that out.


@Sjord
Contributor

Sjord commented Dec 28, 2013

Building with --no-cache seems to update the cache, but when building after that the most recent cache image is not always used. I filed issue #3375 for that.

@unclejack
Contributor

@Sjord Issue #3375 has been fixed and Docker 0.7.5 doesn't have that problem.

As for the rest of the problems described in this issue, using --no-cache during docker build is the right solution.

This should also be in the documentation.

@bfirsh
Contributor

bfirsh commented Jan 17, 2014

I wonder if there should be some "Dockerfile best practices" documentation which includes patterns like RUN apt-get update && apt-get install -yq bash=4.2*.

@inthecloud247

Could be a good idea to have a community wiki or something.

Not sure there's one 'best' way to do things at the moment; it depends on the project use case.


@cyphar
Contributor

cyphar commented May 15, 2014

Why don't we just add a NOCACHE directive for other directives (i.e. NOCACHE RUN apt-get update)? When that directive gets run, the cache is ignored for the rest of the Dockerfile (kind of like ADD if the context has changed).

@jokeyrhyme

I've had the same concerns and I have a few ideas...

package managers improved to support limiting updates to a specific date

  • upstream apt, yum, dnf, pacman, etc package managers and supporting services are updated to allow an upper limit to be set when updating package listings and available package versions
  • your Dockerfile states RUN apt-get update -y --no-later-than=2014-07-31 or something similar

The result is that your Dockerfile intrinsically documents the date that apt-get update -y was executed, and simultaneously allows the cache to be maximally exploited whilst that line remains unchanged. As soon as you change that date-limit, the cache will be naturally invalidated. Anyone armed with your Dockerfile can perfectly reproduce the same image.

Downsides:

  • might be difficult to implement this, as it has consequences not just on the package managers themselves, but also the online services that keep them fed.

maintain your own date-labelled base images

  • set a Dockerfile with the usual RUN apt-get update -y, build it, and tag it with the date you first ran it (e.g. "my/ubuntu:14.04-2014-07-31")
  • separately, for your other projects, just point their Dockerfiles at the base image from a specific date, e.g. FROM my/ubuntu:14.04-2014-07-31
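A minimal shell sketch of the tagging step (the image name and version are hypothetical; the docker commands are shown commented out, since they need a running daemon):

```shell
# Hypothetical image name and version, for illustration only.
BASE="my/ubuntu"
VERSION="14.04"
# Stamp the tag with today's date, e.g. my/ubuntu:14.04-2014-07-31
TAG="${BASE}:${VERSION}-$(date +%Y-%m-%d)"
echo "$TAG"
# docker build -t "$TAG" .   # build and tag (needs a Docker daemon)
# docker push "$TAG"         # optionally publish the dated base image
```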

Downsides:

  • every human and their dog now needs to keep their own set of date-labelled images
  • crazy proliferation of "base" images, with the community no longer able to rely on shared caches
  • if you lose all copies of any particular image, then you have to stop using it, because there'll never be a way to perfectly reproduce it from the Dockerfile alone

/shrug

@jokeyrhyme

Per my upstream package-manager suggestion, I found:

@jjbohn

jjbohn commented Sep 18, 2014

Ran into this. As a cheap, easy way to avoid invalidating everything via --no-cache, I've been dropping the date in as a comment, like so. You could really use anything there, obviously.

RUN apt-get update && apt-get install -y openjdk-7-jre # 2014-09-18

@jokeyrhyme

As containers are generally supposed to be focused on a particular process, I've given up trying to manage this particular aspect (although @jjbohn's comment idea is probably the most practical for now). I just try to make sure that specific versions of certain packages are installed by apt-get, yum, etc., and then tag the image to match.

@jjbohn

jjbohn commented Sep 18, 2014

Yeah, as I was writing it down and messing with it, I realized that by using a date comment I'm pretty much just making a temporal association, the same way I would with version numbers. Might as well stick with pinning to a specific version.

glogiotatidis pushed a commit to mozilla/fxoss that referenced this issue Oct 6, 2014
Running RUN apt-get update by itself causes it to use the cache when
rebuilding the Dockerfile. This can lead to apt-get install failures
when updating programs. The preferred method is RUN apt-get update &&
apt-get install -y <any apt-get installs you need>. This forces
apt-get update to run whenever any package is updated or new ones are
installed.

moby/moby#3313
@mitar

mitar commented Nov 25, 2015

In fact I think there are bigger issues here. One is that if there are security updates to Debian packages, my images are not rebuilt. A better approach would be some cloud service that triggers Docker Hub rebuilds when a package you installed gets a security update. That service could also re-trigger a no-cache rebuild of the layer where the Debian package database is cached. That way it would be granular, and images would be rebuilt only when necessary.

The other approach (probably not nice for Docker Hub) would be to have a base image with apt-get update and then daily trigger rebuilding of that image, which would then trigger rebuilding of all linked images.

@mitar

mitar commented Nov 25, 2015

Found this, a bit related: https://coreos.com/blog/vulnerability-analysis-for-containers/

@thaJeztah
Member

@mitar you may be interested in "Project Nautilus", which was announced during DockerCon EU. I don't think there's a product page for that yet, but here are some slides; http://www.slideshare.net/Docker/official-repos-and-project-nautilus

@cristofersousa

docker run -it -p 8080:80  ubuntu /bin/bash
echo "91.189.92.201 archive.ubuntu.com" >> /etc/hosts
cat /etc/hosts
apt-get update

@catamphetamine

RUN apt-get update && apt-get install -yq ...

This means that the result of the command is still stale: say a developer created an image in 2016 and then builds a similar image in 2017 – the cache from 2016 will be used, and the latest versions of the packages still won't be installed.
I'd say this is a bug.

@Sjord
Contributor

Sjord commented Jul 8, 2017

@halt-hammerzeit What would you say was the desired behavior? Do you have a solution?

@catamphetamine

catamphetamine commented Jul 8, 2017

@Sjord

https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#add-or-copy

That’s because it’s more transparent than ADD

while ADD has some features (like local-only tar extraction and remote URL support) that are not immediately obvious.

If you're advocating the software being more "transparent" and "obvious", then caching apt-get clearly contradicts that line.

The desired behaviour would be for the software to be more "obvious" and "transparent".
I'm not sure how the developers of Docker came to the conclusion that caching should be enabled by default.
If it's justified, then there's no way to make it "obvious" and "transparent", because it's something every user doesn't know until they encounter this bug.
Still, they could at least introduce something like a NOCACHE statement, as suggested by @cyphar.

@catamphetamine

catamphetamine commented Jul 8, 2017

@Sjord Ok, my suggestions then:

  • "opt-in caching" – introducing some kind of CACHE statement that should be put at the start of each Dockerfile that's supposed to be cached, or maybe prefixing each starting command with some kind of CACHE prefix until the parser encounters a line with no CACHE after which it won't cache the following layers
  • Disabling caching entirely and introducing a special --cache command line parameter (inverse of the present --no-cache command line parameter).
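Purely as an illustration of the opt-in idea, the hypothetical CACHE prefix from the first suggestion might look like this (this syntax does not exist in Docker):

```dockerfile
FROM debian:sid
CACHE RUN apt-get install -y git   # explicitly allowed to come from cache
RUN apt-get update                 # no CACHE prefix: this layer and all
RUN apt-get install -y postgresql  # layers below are always re-executed
```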

aslushnikov added a commit to aslushnikov/playwright that referenced this issue Jun 20, 2020
To avoid caching old package lists, every `apt-get install`
should be prefixed with `apt-get update`.

More info on the matter:
- https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get
- moby/moby#3313
aslushnikov added a commit to microsoft/playwright that referenced this issue Jun 22, 2020
To avoid caching old package lists, every `apt-get install`
should be prefixed with `apt-get update`.

More info on the matter:
- https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get
- moby/moby#3313