Skip to content

Make rebuilding the docker image for a new Accumulo snapshot faster (ver 2)#22

Closed
keith-turner wants to merge 2 commits intoapache:next-releasefrom
keith-turner:snapshot-build-reorg-2
Closed

Make rebuilding the docker image for a new Accumulo snapshot faster (ver 2)#22
keith-turner wants to merge 2 commits intoapache:next-releasefrom
keith-turner:snapshot-build-reorg-2

Conversation

@keith-turner
Copy link
Copy Markdown
Contributor

@keith-turner keith-turner commented Sep 19, 2022

See #20 for background. This PR is a replacement of that PR based on feedback from that PR.

This PR used a single docker file and broke the work into 2 run commands. With this structure in the docker file, when there is a new accumulo snapshot tarball it only rebuilds the accumulo and later layers. Takes 5 seconds. Two docker files as in #20 is a bit faster because it does not have to upload the hadoop and ZK tarballs for the accumulo only build. However that is only a few seconds and the extra complexity would not make that worthwhile.

Setting this as a draft again as I have not updated the README. @brianloss and @dlmarion how do these changes look?

@@ -0,0 +1,25 @@
set -eux;

ACCUMULO_VERSION="2.0.1"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These versions are wrong for Accumulo 2.1.0-SNAPSHOT

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah--looks like in the original Dockerfile, these versions were 2.1.0, 3.3.3, and 3.8.0

@dlmarion
Copy link
Copy Markdown
Contributor

dlmarion commented Sep 19, 2022

Is the intention for the user to run download.sh, and then build the docker image?

@brianloss
Copy link
Copy Markdown
Member

Is the intention for the user to run download.sh, and then build the docker image?

I was just having a side conversation with Keith, and that is the intent. I like that it makes the Dockerfile simpler/easier to read. However, it's also nice to be able to just build and change versions with a single command, which is good for new users or CI/CD. At a minimum, if it stays the way it is, the README file will need to be updated. At the same time Keith was doing that, I had taken a stab at this as well to see if it was going to work the way I thought. Here is what I have, for reference or if there's a preference to keep the download inside of the Dockerfile.

@dlmarion
Copy link
Copy Markdown
Contributor

I was curious what would happen in this version of the PR if you changed the Hadoop or ZooKeeper versions in download.sh. Does that layer get rebuilt if download.sh is run again with different versions?

@brianloss
Copy link
Copy Markdown
Member

That's a good catch. I think, since the downloaded files have no version info in them, the build args wouldn't change. Therefore, there'd really be nothing that changed and the cached layer would be used. That's my guess, but I suppose if docker is smart enough to say that the build context changed (since there are different files going in), then it's possible the cache would be invalidated. You'd have to test to see, but my guess is this will use the previously cached layer and not the new version of the downloaded file.

@keith-turner
Copy link
Copy Markdown
Contributor Author

I was curious what would happen in this version of the PR if you changed the Hadoop or ZooKeeper versions in download.sh. Does that layer get rebuilt if download.sh is run again with different versions?

If the file is different it seems to rebuiild. I ran maven to rebuild accumulo and copied it to the docker dir a few times. Each time it would rebuild even though the file name was the same.

if there's a preference to keep the download inside of the Dockerfile.

I don't have strong feeling about moving the docker file in or out. The main thing I wanted to do was speed up the build for a new accumulo snapshot and thats done w/ the two run commands.

@brianloss
Copy link
Copy Markdown
Member

If the file is different it seems to rebuiild. I ran maven to rebuild accumulo and copied it to the docker dir a few times. Each time it would rebuild even though the file name was the same.

Ah, good to know. Happy to be wrong about that. :)

@keith-turner
Copy link
Copy Markdown
Contributor Author

Here is what I have, for reference or if there's a preference to keep the download inside of the Dockerfile.

I would be happy to abandon this PR in favor of those changes if you have preference for keeping DL in the docker build file.

@brianloss
Copy link
Copy Markdown
Member

I would be happy to abandon this PR in favor of those changes if you have preference for keeping DL in the docker build file.

I don't have a strong opinion on it--there are tradeoffs either way.

@dlmarion
Copy link
Copy Markdown
Contributor

There may be value in breaking out the Hadoop and ZK into separate layers too. I could see the following layers:

  1. RockyLinux with dependencies and download script
  2. Hadoop
  3. ZooKeeper
  4. Accumulo

@brianloss
Copy link
Copy Markdown
Member

Yeah, you could experiment and see what the impact is on image size. The best practice is to minimize the number of layers, but there are definitely times when it makes sense to have more. It's likely that someone would want to iterate with multiple Accumulo builds in a cycle. I'm less sure about having separate layers for the other pieces, though I suppose it would speed up a CI/CD build that was building a matrix with different Accumulo, Hadoop, and Zookeeper versions. If there's a small space penalty, then it could be worth having the flexibility.

@dlmarion
Copy link
Copy Markdown
Contributor

This is what I was thinking when I originally stated multi-stage builds. Like you said though, likely not needed. My branch is not complete.

@keith-turner
Copy link
Copy Markdown
Contributor Author

I don't have a strong opinion on it--there are tradeoffs either way.

I have decided I like your version w/ download in docker because like you said its good for new users and it also satisfies what I was trying to achieve. I'll close this and you can submit a PR w/ your branch.

@dlmarion
Copy link
Copy Markdown
Contributor

FWIW, I got my previous example working, then I modified it to be a multi-stage. The size of the image goes from 1.81GB to 1.6GB.

@brianloss
Copy link
Copy Markdown
Member

FWIW, I got my previous example working, then I modified it to be a multi-stage. The size of the image goes from 1.81GB to 1.6GB.

@dlmarion I had originally tried to do each package in its own layer, but that wouldn't work with straight layers since invalidating a layer invalidates everything on top of it. However, multi-stage builds do work around that problem, so that's what I did. I also only installed the full JDK (which on Rocky Linux was pulling in X11 and other large deps) on the Accumulo build layer and then the java-11-openjdk-headless package (which only includes the JRE) in the main image. This did greatly reduce the size of the image. On my machine it went from 2.4 GB to 1.4 GB. This is all in pr #23.

@keith-turner
Copy link
Copy Markdown
Contributor Author

On my machine it went from 2.4 GB to 1.4 GB.

For the image I built using this PR, it came in at 1.7GB. 1.4G is nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants