This issue was moved to a discussion.

buildah to use cache, layers for bash builds as well #1292

Closed
zvonkok opened this issue Jan 18, 2019 · 41 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. stale-issue

Comments

@zvonkok

zvonkok commented Jan 18, 2019

Description
Buildah can use a cache/layers when building with buildah bud (#767). It would be beneficial if buildah could use a cache/layers for bash builds as well. I am using bash builds extensively, as they are more convenient/useful for all of my use cases.

Dan mentioned that I could use commits in-between and base off the next step of them.
Nice idea, but I think this will clutter up the script and the signal-to-noise ratio would be rather low.

BTW using bash scripts to build the images is/was a great idea, this way I can use everything, really everything as a tool that I can install on my Linux and I am not tied to a DSL that I cannot extend or is limited.

Describe the results you received:
Executing the buildah script a second time, buildah executes each step again.

Describe the results you expected:
Skip steps that did not change compared to the last invocation of the script.

Output of rpm -q buildah or apt list buildah:

buildah-1.4-3.gitc8ed967.el7.x86_64

Output of buildah version:

Version:         1.5-dev
Go Version:      go1.10.2
Image Spec:      1.0.0
Runtime Spec:    1.0.0
CNI Spec:        0.4.0
libcni Version:  v0.7.0-alpha1
Git Commit:      c8ed967
Built:           Tue Nov  6 14:01:57 2018
OS/Arch:         linux/amd64
@zvonkok
Author

zvonkok commented Jan 18, 2019

/cc @jeremyeder

@TomasTomecek
Contributor

This is an awesome suggestion! I wrote a tool which utilizes buildah for building images (and ansible as a frontend) and I had to solve the same issue: caching layers.

It would be awesome if buildah had an API for caching. On the other hand, it will be really tricky, because you would need to tell buildah up front what the script is and how you are changing the filesystem, and then buildah would need to figure out whether there is a matching entry in the cache.

@rhatdan
Member

rhatdan commented Jan 18, 2019

@umohnani8 What do you think? Would this even be possible?
If we kept a record of all of the buildah commands, we could note that we have already executed them. But we would fail on anything that modified the content outside of the script.

We might be able to figure out that

buildah from fedora
buildah run
buildah config
buildah run
buildah mount
yum --installroot    # We would definitely fail here.
buildah config

Also how would we know that two buildah from fedora calls were related?

@arthurbarr

I guess that any of those buildah run commands could have a different effect if the bash script has (say) copied some different files into the mounted filesystems. I guess that docker build gets round this by having a fixed "context" directory, where all the source is located. They can then do a hash of that source, or look at time stamps (not sure which they do). So I wonder if buildah could have something which helped users hash and compare particular source files/directories, and skip the step if it hasn't changed. For example:

buildah --skip-if-unchanged=./src:./*.go run blah
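The --skip-if-unchanged flag above is only a proposal and does not exist in buildah. As a rough illustration of the idea, a script could do the hashing itself. The sketch below is a minimal version under stated assumptions: run_if_changed and inputs_digest are hypothetical helper names, the stamp file location is up to the caller, and GNU find, sort, xargs and sha256sum are available.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; nothing here is part of buildah.

# inputs_digest PATH...: print one digest covering every regular file
# under the given paths (file names and contents, in a stable order).
inputs_digest() {
    find "$@" -type f -print0 | sort -z | xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1
}

# run_if_changed STAMP PATH... -- COMMAND...
# Runs COMMAND only when the digest of PATH... differs from the one
# stored in STAMP. The stamp is updated only if COMMAND succeeds.
run_if_changed() {
    local stamp=$1; shift
    local -a paths=()
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        paths+=("$1"); shift
    done
    if [ "$#" -gt 0 ]; then shift; fi   # drop the "--" separator
    local digest
    digest=$(inputs_digest "${paths[@]}")
    if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$digest" ]; then
        echo "inputs unchanged, skipping: $*" >&2
        return 0
    fi
    "$@" && printf '%s\n' "$digest" > "$stamp"
}
```

A call might look like run_if_changed .build.stamp ./src -- buildah run "$ctr" -- make. Note that this only decides whether to rerun a command; it does not bring a fresh container back to the state the skipped step would have produced, which is the harder half of the problem discussed in this thread.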

@rhatdan
Member

rhatdan commented Mar 8, 2019

@umohnani8 Any update on this issue?

@rhatdan rhatdan assigned QiWang19 and unassigned umohnani8 Apr 12, 2019
@rhatdan
Member

rhatdan commented Apr 12, 2019

@QiWang19 Can you look into this?

@gcs278

gcs278 commented Jun 7, 2019

I'm very interested in this concept, great idea

@rhatdan
Member

rhatdan commented Jun 8, 2019

@zvonkok and @gcs278 PRs would make this move along faster...

@cben
Contributor

cben commented May 19, 2020

unbaked thought: could a design like redo be a good fit?

A core idea of redo that differs from make is that the dependencies of a build step are not declared up front; they are recorded by each target's build script (which could be in any language!) as it does its work, by shelling out to redo-ifchange dependency ... and/or redo-ifcreate dependency .... (And there is an implicit ifchange on the build script itself.)
The system is recursive. When a "do" script executes redo-ifchange, the given dependencies are themselves processed. (So a first run always runs all steps; later runs are partially cached.)

There are multiple implementations of redo, recording dependencies in different ways; each puts its own redo-ifchange / redo-ifcreate into PATH before running build scripts.

  • It seems redo is still focused on producing files, with names, so may not fit here as-is.

  • Not sure buildah wants recursion.
    Dockerfile caching is based on anonymous intermediates in a script-like sequence.
    Conceptually it's a chain: [[[step 1] <--dependency-- step 2] <--dependency-- step 3] ...
    but most of the time, writing it as a single flat script is more fun.

But maybe a contract like "I'm doing whatever I want with buildah mount, but I'm gonna run build ifchange ... to tell you what I depended on" is a good idea?

So if we want a flat script, how would a script with arbitrary bash steps skip those steps? Maybe a 2-directional contract, with a buildah command reporting info on cache hit/miss?

BTW, is it technically possible for buildah to "fast forward" the same container from its current state to an image found in cache? Or would we need container=$(buildah ...) after every step?

@rhatdan
Member

rhatdan commented Aug 6, 2020

@cben interested in opening a PR for this?

@dsonck92

I was also looking into this cache idea and actually spent some time playing about with redo (which I like enough to port my golang based projects to use it to generate things, indeed much more natural than make).

That said, I think for buildah there are 2 approaches:

  • Actually suggest redo and write a tutorial on some examples of how you would go about it
  • Implement some sort of state tracking like redo.

Using Redo

From the short time I used redo, I understand the following:

  • You have a file you want to build, for which you create a .do file
  • The do file explains both how to build it and when to rebuild it
  • redo does expect that this file is going to exist and wishes you to fill it by writing to $3

With this information, you can split a buildah script into discrete steps that should each be built as a unit. Considering it's impossible to resume a build halfway unless history was kept or it was committed, it's imperative that each build script ends with a commit. So a possibility:

NAME=stage-name # could be random, but keep in mind the lingering images
OUTPUT=$3
CONTAINER=$(buildah from fedora)
# do buildah steps
buildah commit "$CONTAINER" "$NAME"
echo "$NAME" > "$OUTPUT"

This, in itself, is a do file, though it could also be written generically:

  • default.buildah.sh.do:
    # Call the script and tell it to output its result to "$3"
    ./"$1$2" "$3"

and the accompanying build instructions:

  • initial-stage.buildah.sh:
    NAME=initial-stage
    CONTAINER=$(buildah from fedora)
    # do buildah steps
    buildah commit "$CONTAINER" "$NAME"
    # Only record the image name when an output file was requested
    if [ $# -gt 0 ]
    then echo "$NAME" > "$1"
    fi
  • second-stage.buildah.sh:
    NAME=second-stage
    redo-ifchange initial-stage
    # Start from the image name the initial stage wrote to its output file
    CONTAINER=$(buildah from "$(cat initial-stage)")
    # do further buildah steps
    buildah commit "$CONTAINER" "$NAME"
    echo "$NAME" > "$1"

A stage leaves behind a file containing the output it made, in buildah terms an image it can reuse. A cat of this output specifies a valid from name.

Built in support

Instead of relying on an external tool (although a tool that has many implementations and is not difficult to obtain), buildah could also attempt to do this kind of tracking itself. For which I propose the following idea:

  • To speed up builds, buildah can be instructed to use a cache file that tracks information on build steps.
    • This cache file can be specified by the user.
    • The cache file records for each instruction what its previous instruction was
    • The cache file is also accompanied by pointer files
      • Pointer files track which instructions have been done from a cache file
      • Each working container has its own pointer file
    • Some instructions cannot be cached and terminate a recipe
      Most notably any instruction that cannot tell what happened like mount
    • --add-history and commit additionally mention their container name

The process of how it works:

  1. When a cache file has been set and buildah from is executed, it creates a pointer file pointing to the first instruction matching the from. This first instruction is considered a "saved point".
  2. If this first instruction matches the cache file, it does nothing. If it doesn't match, it performs its action and writes this instruction to the cache file.
  3. When the next buildah invocation is made, it looks for a next instruction pointing to the current one:
     • If there is an instruction that matches, it does nothing.
     • If there is no matching instruction:
       1. It goes back in the cache file to the last "saved point".
       2. It loads that point and replays the cache file up to the point where the instructions diverged.
       3. It writes down the new arguments and performs its own action.
  4. If --add-history is used, the generated image is mentioned in the cache file; this is another "saved point".
  5. This process loops from step 3 until there is a cache-incompatible instruction:
     • At this point, the cache pointer is removed to indicate that the container is in a non-cached stage.
     • If any further commands are executed, they are applied directly without any cache information, since the pointer (which would have been created by from) no longer exists.
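The bookkeeping in the steps above can be sketched in plain bash. This is a deliberately simplified illustration, not anything buildah provides: cache_begin and cache_step are hypothetical names, the cache file stores one instruction per line in order, the pointer file stores how many instructions have been matched so far, and the "saved point" replay is left out entirely.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the proposed cache file and pointer file.

# cache_begin POINTER: reset the pointer at the start of a build
# (the role "buildah from" plays in the proposal).
cache_begin() { echo 0 > "$1"; }

# cache_step CACHE POINTER INSTRUCTION...
# Returns 0 on a cache hit and 1 on a miss; the pointer advances either way.
cache_step() {
    local cache=$1 pointer=$2; shift 2
    local pos expected
    pos=$(cat "$pointer")
    touch "$cache"
    # The instruction recorded right after the current position, if any.
    expected=$(sed -n "$((pos + 1))p" "$cache")
    echo $((pos + 1)) > "$pointer"
    if [ -n "$expected" ] && [ "$expected" = "$*" ]; then
        return 0
    fi
    # Diverged: drop everything recorded after the pointer, then record
    # the new instruction so the next run can match it.
    head -n "$pos" "$cache" > "$cache.tmp"
    printf '%s\n' "$*" >> "$cache.tmp"
    mv "$cache.tmp" "$cache"
    return 1
}
```

A wrapper could then run the real buildah command only when cache_step reports a miss. Because a miss truncates the recorded tail, every later step in that run also misses, which mirrors how a changed instruction invalidates the steps after it.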

So as an example:

export CACHEFILE=.mybuild
CONTAINER=$(buildah from fedora) # saves its details (or a hash of these details) and fedora. Also creates .mybuild.fedora-working-container
buildah run "$CONTAINER" -- yum install nginx # saves its details and advances the pointer
buildah copy "$CONTAINER" www/ /var/www # saves its details (including a hash of the content it copied) and advances the pointer
buildah commit "$CONTAINER" end-image # saves its details and also end-image

Now the same thing is executed again:

export CACHEFILE=.mybuild
CONTAINER=$(buildah from fedora) # recognizes the start of .mybuild
buildah run "$CONTAINER" -- yum install nginx # recognizes the command hasn't changed and moves the pointer
buildah copy "$CONTAINER" www/ /var/www # if www hasn't changed, recognizes this and moves the pointer
buildah commit "$CONTAINER" second-image # recognizes end-image and recommits it as second-image

It must be noted, if www changes, it will rebuild all the instructions starting with from fedora. In order to prevent this:

export CACHEFILE=.mybuild
CONTAINER=$(buildah from fedora)
buildah run --add-history "$CONTAINER" -- yum install nginx # now becomes a saved point
buildah copy "$CONTAINER" www/ /var/www # now if this changes, it can continue from after yum install nginx
buildah commit "$CONTAINER" third-image

With this system, even if instructions diverge, this creates multiple alternative paths that can still resolve to a saved image.

@dsonck92

Note that the redo script could possibly be improved by extracting the first buildah from instruction written in it and applying some logic to identify whether it's a local build step or a global one. Wait, I guess it's possible to detect the specific cat

@dsonck92

It might also be noted that for the buildah native method, there might be options that also cause 5 to happen. I have not spent much time with buildah to know which situations would cancel a cache run. But I think this would give the closest experience of the cache given the limitations of not tracking history.

If users want to speed up their rebuilds, they should also make strategic use of --add-history before anything that might cause diverging points, so buildah doesn't have to backtrack too many steps in order to get to a known state.

@dsonck92

Lastly, if this addition to buildah is something that is wanted @rhatdan I could attempt a PR for it. It might take a while though.

@rhatdan
Member

rhatdan commented Aug 18, 2020

Sounds good, I don't know when we can get someone else to look at it.

@dsonck92

I will then try to familiarize myself with the buildah code and see if my idea is somewhat feasible. I'm not sure when I can work on it but as I'm migrating away from docker and dockerfiles (appreciating the daemonless, rootless abilities of buildah/podman and the ability to use system tools during image steps of buildah and mount dirs (e.g. for extracting test results)), some form of buildah cache would be beneficial. While builds in general seem to be faster, nothing beats a "well I can skip this 10 minute build process as no code changed". I guess it's a bit of a bad example as that could also be fixed by keeping a dirty build directory around and others fixed by a cache mount, but for sake of repeatability, the cache can help without resorting to mounts and dirty build dirs that have their share of potential issues.

I predict it will take me some time, and I made some assumptions that I might run up against when trying to implement this, but I'll use this issue for any news I have on the topic.

@rhatdan
Member

rhatdan commented Oct 7, 2020

@dsonck92 Did you ever get a chance to work on this?

@rhatdan rhatdan added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 7, 2020
@dsonck92

dsonck92 commented Oct 8, 2020

Not yet as I'm pretty occupied with other things but I'm considering picking this up in December.

@kwshi
Contributor

kwshi commented Dec 14, 2020

I've tried playing around a little with the redo idea, and one major drawback to that approach is that, because it relies on buildah commit, its disk usage scales quite poorly. Consider, as a rough example, a build that consists of steps A, B, C, each of which adds 1GB of stuff to the container. Using a built-in cache:

  • Step A creates a container of size 1GB, total cache use is 1GB;
  • Step B creates a container of size 2GB, but since it's layered on top of A, total cache use is also 2GB;
  • Step C creates a container of size 3GB, total cache is 3GB.

Meanwhile, using a redo/make-like mechanism to reuse previously committed steps:

  • Step A creates a container of size 1GB and commits it, total size of image cache is 1GB;
  • Step B instantiates a new container from image A and creates a container of size 2GB, but AFAIK the buildah image cache is not layered, so each individual image is as large as its full contents, so total image cache is 1GB (A) + 2GB (B) = 3GB;
  • Step C instantiates a container from image B, creates a container of size 3GB, total image cache use is (1 + 2 + 3) GB = 6GB.
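The arithmetic in the two lists generalizes as follows. This is a small sketch under the comment's own premise, namely that committed images do not share layers; layered_total and flat_total are illustrative names, and sizes are whole gigabytes.

```shell
#!/usr/bin/env bash
# layered_total STEPS SIZE: each step adds SIZE GB and layers are shared,
# so the total is STEPS * SIZE.
layered_total() { echo $(( $1 * $2 )); }

# flat_total STEPS SIZE: every committed image stores its full contents,
# so the total is SIZE * (1 + 2 + ... + STEPS) = SIZE * STEPS * (STEPS + 1) / 2.
flat_total() { echo $(( $2 * $1 * ($1 + 1) / 2 )); }

layered_total 3 1   # prints 3
flat_total 3 1      # prints 6
```

For 10 steps of 1GB each, the same formulas give 10GB with shared layers against 55GB without, which is the "quadratic" growth described in this comment.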

My belief that buildah's image cache does not reuse layers arises from some experimentation I did myself with this approach (where I found that simply cloning an image, e.g. buildah commit "$(buildah from <image>)" still boosted my disk usage (as reported by df -h) by the size of <image>, despite no changes/layers being added). I don't fully understand how containers/image storage really works or is supposed to work, so maybe I'm doing something wrong here, but as far as I can tell, this means the redo approach is really not scalable, as disk usage scales "quadratically" with the layers, even if there is a performance time-save.

My understanding is that Docker's cache does not suffer from this issue because it stores things in a layered manner, unlike buildah commit. I may be wrong about this, though.

@jsirex

jsirex commented Jan 4, 2021

I recommend not over-engineering here. Ask yourself why you really need a cache or layers.
The volatile part is what you always want to build as a single layer.
And the cached/layered part is always something you want to use as a base image.

@dsonck92

Well, meanwhile I changed my personal build system quite a bit. I'm using GitLab's ability to run build steps directly in a container (other CIs probably have similar features). This made it possible to extract the different stages and reuse them later on, reducing the buildah step to essentially a "from, add, commit". This lowered the complexity of builds considerably and essentially runs the build steps through podman on kubernetes, which adds some parallelism. Too bad it doesn't (yet) have a generic container runner and is tied to either kubernetes or docker.

Now I know that this essentially evades the problem, but I find it relatively elegant. I did discuss this at work, where it was basically shot down with "but now it's not a single dockerfile that you can just execute" (or buildah-dependent shell script). Then again, you can put the stages in a shell script, execute them one after another in the gitlab file, and keep the buildah steps separate, though that will limit its usability somewhat. It's a complex matter.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan rhatdan closed this as completed May 6, 2021
@computator

I don't think this should be closed. The concern @kwshi had with disk usage is resolved, but the main topic of this issue still has no solution as far as I am aware.

@rhatdan rhatdan reopened this May 6, 2021
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 26, 2021

@flouthoc PTAL

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@flouthoc
Collaborator

@kwshi @dsonck92 I have a different perspective on a solution for this issue, so I am posting it here. Could you please go through it and share your thoughts?

  • Add a new subcommand, build-from-bash, which the end user would invoke like buildah build-from-bash -t <tag> -f my-build-file.sh

How build-from-bash would work

  • It creates an instruction stack (assume a hypothetical stack data structure).
  • It reads and evaluates the bash instructions line by line and converts every buildah command into a valid Containerfile instruction.

So
buildah from fedora --> transformed internally --> FROM fedora --> saved to the instruction stack
buildah run $CONTAINER -- yum install git --> transformed internally --> RUN yum install git --> saved to the instruction stack

  • Finally, we have an instruction stack holding the equivalent Dockerfile, which is then executed within the context of the bash file.

TLDR:
Read and parse the bash file into an equivalent Containerfile/Dockerfile.

CONS:
What if the bash file changes the directory context mid-run? However, there is something we could do with multi-stage builds here.
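To make the translation idea concrete, here is a minimal sketch of what such a converter might do for the two command shapes shown above. It is an assumption-laden illustration only: build-from-bash does not exist, buildah_to_containerfile is a hypothetical name, and the sed patterns recognize nothing beyond a literal buildah from line (with or without a VAR=$(...) wrapper) and a literal buildah run <container> -- <command> line.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: reads a build script on stdin and prints the
# Containerfile instructions for the lines it recognizes. Every other
# line (loops, conditionals, mounts, external tools) is silently dropped.
buildah_to_containerfile() {
    sed -E -n \
        -e 's/^[[:space:]]*([A-Za-z_][A-Za-z0-9_]*=\$\()?buildah from ([^)[:space:]]+)\)?[[:space:]]*$/FROM \2/p' \
        -e 's/^[[:space:]]*buildah run [^[:space:]]+ -- (.*)$/RUN \1/p'
}
```

Running buildah_to_containerfile < my-build-file.sh would print a FROM line and RUN lines for a flat script. The lines it has to drop are exactly the ones that make bash builds attractive in the first place, which limits how far a line-by-line translation can go.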

@dsonck92

If buildah currently caches bud, then a quick solution to this problem is to have some kind of containerfile generator. It could be quite elegant. However, it does prevent the use of volume mounts during builds. (Though, volume mounts should invalidate the cache anyways).

At least, that's what I'm getting from your build-from-bash mode, a containerfile generator.

For me, the original feature is not that important anymore, since I'm generating all my files inside a container inside CI, and only require a single copy into the final container, the CI has some caching capabilities. But I do think an intermediate containerfile generator could help.

@computator

Translating a bash script directly to Containerfile syntax is not feasible because of the complexity available with a full scripting language (if statements, for loops, external commands, etc...). If your bash script is simplistic enough that it can be translated to a Containerfile then you might as well write it in Containerfile syntax to start. The benefit to bash scripts is the additional flexibility not expressible using Containerfile syntax.

A Containerfile generator is a great in-between idea. Not as flexible as using a full bash script, but a lot more flexible than a static Containerfile. Something like that would be useful in a lot of cases. That said, it seems to me that a generator like that should probably be a separate project instead of part of buildah.

@MLNW

MLNW commented Feb 27, 2022

The original comment mentioned this:

Dan mentioned that I could use commits in-between and base off the next step of them.
Nice idea, but I think this will clutter up the script and the signal-to-noise ratio would be rather low.

What would this look like in practice?

@MLNW

MLNW commented Mar 9, 2022

@rhatdan I'm guessing you were the "Dan" mentioned in the original issue. Could you provide some more information on how to achieve this with

commits in-between and base off the next step of them?

@rhatdan
Member

rhatdan commented Mar 9, 2022

I was guessing something like:

$ ctr=$(buildah from fedora)
$ buildah run $ctr dnf -y install httpd
$ buildah run $ctr dnf -y clean all
$ buildah commit $ctr image_layer1
$ ctr=$(buildah from image_layer1)
...

@MLNW

MLNW commented Mar 10, 2022

I was guessing something like:

$ ctr=$(buildah from fedora)
$ buildah run $ctr dnf -y install httpd
$ buildah run $ctr dnf -y clean all
$ buildah commit $ctr image_layer1
$ ctr=$(buildah from image_layer1)

The problem with this, it seems to me, is that it does not use the cache at all when run sequentially, as in a bash script. It will always create a container from fedora, run the commands in it, and commit the container to image_layer1, no matter how often the script is run.

What I would be looking for is a mechanism to skip the run steps if image_layer1 already exists. At the moment I fear such a mechanism doesn't exist with buildah without falling back to Containerfiles.
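Absent built-in support, a script can approximate this by checking for the image before building it. The following is a minimal sketch, assuming buildah inspect --type image exits non-zero when no local image has that name; image_exists, build_once and build_layer1 are hypothetical helper names, not buildah features.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around the commands quoted above.

# image_exists NAME: succeed if a local image with this name exists.
image_exists() { buildah inspect --type image "$1" > /dev/null 2>&1; }

# build_once NAME BUILD_FUNCTION: call BUILD_FUNCTION NAME only when the
# image NAME is missing from local storage.
build_once() {
    local name=$1 build=$2
    if image_exists "$name"; then
        echo "reusing existing image $name" >&2
    else
        "$build" "$name"
    fi
}

# Example builder: the httpd steps from the comment above, committed as $1.
build_layer1() {
    local ctr
    ctr=$(buildah from fedora)
    buildah run "$ctr" dnf -y install httpd
    buildah run "$ctr" dnf -y clean all
    buildah commit "$ctr" "$1"
    buildah rm "$ctr"
}
```

A script would call build_once image_layer1 build_layer1 and then continue with ctr=$(buildah from image_layer1). The check is by name only, so it will not notice edited build steps or a newer fedora base; removing the image with buildah rmi forces a rebuild.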

@rhatdan
Member

rhatdan commented Mar 11, 2022

You are correct, this has never been implemented. Caching exists only for Containerfiles. This discussion has always been about a mechanism for building a generic bash version of caching.

@containers containers locked and limited conversation to collaborators Mar 11, 2022
@rhatdan rhatdan converted this issue into discussion #3819 Mar 11, 2022
