This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Create base-image and minimize layer count #324

Conversation

johscheuer

What changes were proposed in this pull request?

Use a single layer for the RUN commands and tool installation. Also move all the common parts into one base image.

Fixes #323
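
For illustration, a minimal sketch of the pattern (package names and paths are hypothetical, not the exact contents of this PR): tool installation happens in one chained RUN, producing a single layer, inside a shared base image that the driver, executor, and init-container images build FROM.

FROM openjdk:8-alpine

# One chained RUN creates a single layer instead of one layer per step
RUN apk upgrade --no-cache && \
    apk add --no-cache bash && \
    mkdir -p /opt/spark

COPY jars /opt/spark/jars
COPY bin /opt/spark/bin

ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark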

How was this patch tested?

Tested with the Minikube integration test (https://github.com/apache-spark-on-k8s/spark/blob/branch-2.1-kubernetes/resource-managers/kubernetes/README.md#preserve-the-minikube-vm)

cc @ash211

@mccheah

mccheah commented Jun 2, 2017

Thanks for contributing this - I see no problems and think we can afford to put this in the next release. What do you think, @ash211?

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark
COPY examples /opt/spark/examples

What's the difference between ADD and COPY?


Looks like ADD supports URLs and extracting files out of tarballs, which COPY does not.

https://stackoverflow.com/questions/24958140/what-is-the-difference-between-the-copy-and-add-commands-in-a-dockerfile

Member

Good summary of ADD vs COPY: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#add-or-copy

ADD is somewhat deprecated because it has complex behaviors. The recommended best practice is to COPY tarballs and unpack them explicitly in a RUN.
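
To make the difference concrete, a small sketch (the tarball name is hypothetical): ADD unpacks a local archive implicitly, while COPY plus an explicit RUN makes the extraction visible in the Dockerfile.

# ADD auto-extracts local tarballs -- implicit behavior
ADD spark-dist.tgz /opt/

# COPY + an explicit RUN does the same, but the extraction is visible
COPY spark-dist.tgz /opt/
RUN tar -xzf /opt/spark-dist.tgz -C /opt && rm /opt/spark-dist.tgz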

Member

The support for unpacking tarballs is actually one of the side-effects of ADD that is considered harder to reason about. I think the idea is that, just looking at a Dockerfile, you may not be able to tell exactly what that ADD is doing. I don't have a real strong opinion on that; it's arguably more convenient too. But that's the current "official" position.

Author

Like @erikerlandson said, I would prefer to use COPY, since the files are only copied into the image (and no magic is needed).

@@ -15,25 +15,12 @@
# limitations under the License.
#

-FROM openjdk:8-alpine
+FROM spark-base

Should this have a tag, e.g. latest? How do we ensure that the child images are kept in sync with the right tag of the base?

Member

imo there should be a tag per spark release, something like spark-base:2.1.0, etc

I'd propose renaming it spark-kubernetes-base


How would we handle snapshots then? Or are we only going to push images on every release, at which point we update all of the Dockerfiles?


ideally these would have version tags like 2.1.0-kubernetes-0.3.0 rather than something like latest (I'm a fan of immutable artifacts)

We'd need to have some way of templating the Dockerfiles to do that, though, I think.
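
One possible approach (not something this PR implements, and it requires Docker 17.05+): build args are allowed in FROM, so each child Dockerfile could pin the base tag at build time without textual templating.

ARG BASE_TAG=latest
FROM spark-base:${BASE_TAG}

docker build --build-arg BASE_TAG=2.1.0-kubernetes-0.3.0 \
  -t spark-driver:2.1.0-kubernetes-0.3.0 -f dockerfiles/driver/Dockerfile .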

Member

latest is for snapshots

Member

I'm not familiar with how people treat Hadoop versions, but if we need multiple Hadoop versions as part of the build cross-product, then that would also be part of the tag.

Author

I think what we need is some string replacement in the release process, e.g. when version 2.1 is released, a placeholder in the files (like RELEASE) gets replaced by the actual value.
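
A sketch of that release step (the placeholder name and paths are hypothetical):

# Swap the RELEASE placeholder for the real version in every Dockerfile
sed -i 's/spark-base:RELEASE/spark-base:2.1.0-kubernetes-0.3.0/g' dockerfiles/*/Dockerfile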


as long as we build in the correct order (spark-base first) I think we're fine.

@erikerlandson do you feel strongly about renaming this to spark-kubernetes-base? The others are just driver executor etc

Member

My instinct is that it would be preferable to name them all spark-kubernetes-*, but I don't consider it a show stopper


I think we want spark-kubernetes because we might have a namespace conflict with Nomad in particular, which is another effort to run the driver and executors in Docker containers.

@mccheah

mccheah commented Jun 2, 2017

Actually the image tagging question stands as something that might make us want to reconsider this approach.

@ash211

ash211 commented Jun 5, 2017

@johscheuer I'm not sure what spark-base gives us that Docker doesn't already provide through its layer deduplication. When I pull the existing driver and executor images before this change, the docker daemon recognizes that the earlier layers have the same hash and reuses the layer, even without the separate tagging for the spark-base layer.

Can you explain more what spark-base does?

It seems most of the benefits are in squashing the RUN commands vs. the introduction of spark-base.
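
Either way, the layer savings are straightforward to check: docker history prints one row per layer, so the before/after layer counts can be compared directly (image name as used elsewhere in this thread).

docker history registry-host:5000/spark-driver:latest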

@mccheah

mccheah commented Jun 5, 2017

By using a base image we can avoid duplicating the Dockerfile code itself.

@mccheah

mccheah commented Jun 5, 2017

So for example when we start supporting Python we only have to include the Python installation in one Dockerfile and not two.
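
For example (hypothetical, since Python support does not exist yet), one line in the base image would then cover every child image:

# In the spark-base Dockerfile only; driver and executor inherit it
RUN apk add --no-cache python3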

@johscheuer
Author

The main idea is, like @mccheah said, to avoid duplicating Dockerfile code. Assume we need to adjust a library, or we want to include another one in all images; then we would need to modify all of the images.

If you want, I can separate these changes into two PRs: one for the merge of the RUN commands and another one for the base image (open for discussion).

docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
docker build -t registry-host:5000/spark-init:latest -f dockerfiles/init-container/Dockerfile .
docker push registry-host:5000/spark-base:latest
docker push registry-host:5000/spark-driver:latest
docker push registry-host:5000/spark-executor:latest
docker push registry-host:5000/spark-init:latest

I think we need a note after this that the build order matters -- spark-base must be built first, then the others can be built in any order.
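
Something like the following (assuming the base Dockerfile lives under dockerfiles/spark-base/, which is not shown in the quoted snippet):

docker build -t registry-host:5000/spark-base:latest -f dockerfiles/spark-base/Dockerfile .
# ...then driver, executor, and init-container in any order, since they all FROM spark-base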


@ash211

ash211 commented Jun 8, 2017

LGTM pending question for Erik (and updating the branch)

@ash211 ash211 merged commit 78baf9b into apache-spark-on-k8s:branch-2.1-kubernetes Jun 8, 2017
@ash211

ash211 commented Jun 8, 2017

Thanks @johscheuer!

foxish pushed a commit that referenced this pull request Jul 24, 2017
* Create base-image and minimize layer count

* Create running-on-kubernetes.md
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…-kubernetes

Rebase to upstream's version of Kubernetes support.
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
* Create base-image and minimize layer count

* Create running-on-kubernetes.md