The Beam portability effort aims to make it possible for any SDK to work with any runner. One aspect of the effort is the isolation of the SDK and user code execution environment from the runner execution environment using docker, as defined in the portability container contract.
This document describes how to build and push container images to that end. The push step generally requires an account with a public docker registry, such as bintray.io or Google Container Registry. These instructions assume familiarity with docker and a bintray account under the current username with a docker repository named "apache".
Prerequisites: install docker on your platform. You can verify that it works by running docker images or any other docker command.
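The check above can be scripted. The following is a minimal sketch, not part of the Beam build: docker_status is a hypothetical helper name, and it assumes that a failing docker images call means the daemon is not reachable.

```shell
#!/bin/sh
# Hypothetical helper (not part of Beam): reports whether the docker CLI
# is installed and whether a simple docker command succeeds.
docker_status() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "missing"        # docker CLI not found on PATH
  elif docker images >/dev/null 2>&1; then
    echo "ok"             # CLI found and the command succeeded
  else
    echo "unreachable"    # CLI found, but the command failed
  fi
}

docker_status
```

Anything other than "ok" suggests fixing the docker installation before running the Gradle targets.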
Run Gradle with the docker target:
$ pwd
[...]/beam
$ ./gradlew docker
[...]
> Task :sdks:python:container:docker
a571bb44bc32: Verifying Checksum
a571bb44bc32: Download complete
aa6d783919f6: Verifying Checksum
aa6d783919f6: Download complete
f2b6b4884fc8: Verifying Checksum
f2b6b4884fc8: Download complete
f2b6b4884fc8: Pull complete
74eaa8be7221: Pull complete
2d6e98fe4040: Pull complete
414666f7554d: Pull complete
bb0bcc8d7f6a: Pull complete
a571bb44bc32: Pull complete
aa6d783919f6: Pull complete
Digest: sha256:d9455be2cc68ded908084ec5b63a5cbb87f12ec0915c2f146751bd50b9aef01a
Status: Downloaded newer image for python:2
---> 2863c80c418c
Step 2/6 : MAINTAINER "Apache Beam <dev@beam.apache.org>"
---> Running in c787617f4af1
Removing intermediate container c787617f4af1
---> b4ffbbf94717
[...]
---> a77003ead1a1
Step 5/6 : ADD target/linux_amd64/boot /opt/apache/beam/
---> 4998013b3d63
Step 6/6 : ENTRYPOINT ["/opt/apache/beam/boot"]
---> Running in 30079dc4204b
Removing intermediate container 30079dc4204b
---> 4ea515403a1a
Successfully built 4ea515403a1a
Successfully tagged herohde-docker-apache.bintray.io/beam/python:latest
[...]
Note that the container images include built content, including the Go boot code. Some images, notably python, take a while to build, so building just the specific images needed can be a lot faster:
$ ./gradlew -p sdks/java/container docker
$ ./gradlew -p sdks/python/container docker
$ ./gradlew -p sdks/go/container docker
(Optional) Once the images are built, you can list, inspect and run them locally:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
herohde-docker-apache.bintray.io/beam/python latest 4ea515403a1a 3 minutes ago 1.27GB
herohde-docker-apache.bintray.io/beam/java latest 0103512f1d8f 34 minutes ago 780MB
herohde-docker-apache.bintray.io/beam/go latest ce055985808a 35 minutes ago 121MB
[...]
Despite the names, these container images live only on your local machine. While we will re-use the same tag "latest" for each build, the image IDs will change.
(Optional) The default setting for docker-repository-root specifies the above bintray location. You can override it by adding:
-Pdocker-repository-root=<location>
Similarly, if you want to specify a specific tag instead of "latest", such as a "2.3.0" version, you can do so by adding:
-Pdocker-tag=<tag>
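To show how the two properties combine with a per-SDK build, here is a small sketch that only prints the resulting Gradle command line. The helper name beam_docker_cmd and the registry location in the example are hypothetical; the Gradle flags and project paths are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Beam): prints the Gradle command for
# building one SDK container, with an optional repository root and tag.
# Usage: beam_docker_cmd <sdk> [repository-root] [tag]
beam_docker_cmd() (
  sdk=$1; root=$2; tag=$3
  cmd="./gradlew -p sdks/$sdk/container docker"
  # Append each override only when a value was supplied.
  [ -n "$root" ] && cmd="$cmd -Pdocker-repository-root=$root"
  [ -n "$tag" ] && cmd="$cmd -Pdocker-tag=$tag"
  printf '%s\n' "$cmd"
)

# Example: Python container with a placeholder location and the tag 2.3.0.
beam_docker_cmd python example.registry.invalid/beam-dev 2.3.0
```

Printing the command rather than running it makes it easy to review first; from the Beam root directory the output can be piped to sh to execute it.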
Not all dependencies are optional: without them some jobs simply won't run, and you can't sweet talk your way out of a tensorflow dependency. For Python users, dependencies can be installed automatically at run time on each container, but this slows worker startup and is a great way to find out what your system's timeout limits are. Whether you need a dependency that isn't being installed for you, or you just don't want to install tensorflow 1.6.0 every time a new worker starts, a custom container can help.
For Python we have a sample Dockerfile which will take the user-specified requirements and install them on top of your base image. If you're building from source, follow the directions above; otherwise, you can set the environment variable BASE_PYTHON_CONTAINER_IMAGE to the desired released version.
USER_REQUIREMENTS=~/my_req.txt ./sdks/python/scripts/add_requirements.sh
Once your custom container is built, remember to upload it to the registry of your choice.
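As an illustration of the requirements step, the sketch below writes a one-line pip requirements file and then invokes the script only if it is present. It assumes it is run from the root of a Beam checkout; the file location and the pinned package are examples, not requirements of the script.

```shell
#!/bin/sh
# Assumption: run from the root of a Beam checkout. The file location and
# the pinned package below are examples only.
req_file="${TMPDIR:-/tmp}/my_req.txt"

# Write the requirements file, one pip requirement per line.
cat > "$req_file" <<'EOF'
tensorflow==1.6.0
EOF

script=./sdks/python/scripts/add_requirements.sh
if [ -x "$script" ]; then
  # Build the custom image from the requirements file.
  USER_REQUIREMENTS="$req_file" "$script"
else
  # Outside a Beam checkout, only show what would be run.
  echo "would run: USER_REQUIREMENTS=$req_file $script"
fi
```

When building on a released image rather than from source, BASE_PYTHON_CONTAINER_IMAGE would be set in the same way as USER_REQUIREMENTS.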
If you build a custom container, you will need to specify it when you run your job instead of the default latest container. For example, Holden would specify:
--worker_harness_container_image=holden-docker-apache.bintray.io/beam/python-with-requirements
Prerequisites: obtain a docker registry account and ensure docker can push images to it, usually by running docker login with the appropriate information. The image you want to push must also be present in the local docker image repository.
For the Python SDK harness container image, run:
$ docker push $USER-docker-apache.bintray.io/beam/python:latest
The push refers to repository [herohde-docker-apache.bintray.io/beam/python]
723b66d57e21: Pushed
12d5806e6806: Pushed
b394bd077c6e: Pushed
ca82a2274c57: Pushed
de2fbb43bd2a: Pushed
4e32c2de91a6: Pushed
6e1b48dc2ccc: Pushed
ff57bdb79ac8: Pushed
6e5e20cbf4a7: Pushed
86985c679800: Pushed
8fad67424c4e: Pushed
latest: digest: sha256:86ad57055324457c3ea950f914721c596c7fa261c216efb881d0ca0bb8457535 size: 2646
The same applies to the Java and Go SDK harness container images. If you want to push the same image to multiple registries, you can retag the image using docker tag and push.
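The retag-and-push step can be sketched as a loop over target names. The helper retag_and_push_cmds and the image names in the example are hypothetical. It prints the docker tag and docker push commands instead of running them, so they can be reviewed before anything is uploaded.

```shell
#!/bin/sh
# Hypothetical helper (not part of Beam): prints the commands needed to
# retag one local image under several names and push each of them.
# Usage: retag_and_push_cmds <source-image> <target-image>...
retag_and_push_cmds() (
  src=$1; shift
  for target in "$@"; do
    printf 'docker tag %s %s\n' "$src" "$target"
    printf 'docker push %s\n' "$target"
  done
)

# Example with placeholder names: one local image, two target registries.
retag_and_push_cmds \
  example-docker-apache.bintray.io/beam/python:latest \
  gcr.io/example-project/beam/python:latest \
  registry.example.invalid/beam/python:latest
```

Piping the output to sh runs the commands, assuming docker login has already been done for each target registry.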
(Optional) On any machine, you can now pull the pushed container image:
$ docker pull $USER-docker-apache.bintray.io/beam/python:latest
latest: Pulling from beam/python
f2b6b4884fc8: Pull complete
4fb899b4df21: Pull complete
74eaa8be7221: Pull complete
2d6e98fe4040: Pull complete
414666f7554d: Pull complete
bb0bcc8d7f6a: Pull complete
a571bb44bc32: Pull complete
aa6d783919f6: Pull complete
7255d71dee8f: Pull complete
08274803455d: Pull complete
ef79fab5686a: Pull complete
Digest: sha256:86ad57055324457c3ea950f914721c596c7fa261c216efb881d0ca0bb8457535
Status: Downloaded newer image for herohde-docker-apache.bintray.io/beam/python:latest
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
herohde-docker-apache.bintray.io/beam/python latest 4ea515403a1a 35 minutes ago 1.27 GB
[...]
Note that the image IDs and digests match their local counterparts.
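The digest comparison can also be done mechanically. The sketch below pulls the sha256 digest out of a line of docker push or docker pull output and compares the two; extract_digest is a hypothetical helper, and the sample lines are copied from the transcripts above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Beam): reads docker output on stdin and
# prints the first sha256 digest found, or nothing if there is none.
extract_digest() {
  sed -n 's/.*\(sha256:[0-9a-f]\{64\}\).*/\1/p' | head -n 1
}

# Sample lines taken from the push and pull transcripts in this document.
push_line='latest: digest: sha256:86ad57055324457c3ea950f914721c596c7fa261c216efb881d0ca0bb8457535 size: 2646'
pull_line='Digest: sha256:86ad57055324457c3ea950f914721c596c7fa261c216efb881d0ca0bb8457535'

pushed=$(printf '%s\n' "$push_line" | extract_digest)
pulled=$(printf '%s\n' "$pull_line" | extract_digest)

if [ -n "$pushed" ] && [ "$pushed" = "$pulled" ]; then
  echo "digests match: $pulled"
else
  echo "digest mismatch" >&2
fi
```

In practice the two input lines would come from saved output of the actual push and pull commands rather than from literals.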