This repository contains the configuration and build files necessary to produce the
quay.io/geodocker/jupyter-geopyspark Docker image.
The Docker image allows easy use of GeoPySpark in a web browser via Jupyter and GeoNotebook without having to modify or configure the host computer (beyond what is needed to run Docker).
Using The Image
You will be prompted for a username and a password when you direct your web browser to the container: the username and password are both
One can use the image with or without making a clone of this repository.
Without A Clone
To use the image without (or from outside of) a clone of this repository, first make sure that you are in possession of the image. The command
docker pull quay.io/geodocker/jupyter-geopyspark
will pull the latest version of the image.
The container can then be started by typing
docker run -it --rm --name geopyspark \
   -p 8000:8000 -p 4040:4040 \
   quay.io/geodocker/jupyter-geopyspark
or
docker run -it --rm --name geopyspark \
   -p 8000:8000 -p 4040:4040 \
   -v $HOME/.aws:/home/hadoop/.aws:ro \
   quay.io/geodocker/jupyter-geopyspark
if you wish to have your AWS credentials available in the container (e.g. for pulling data from S3).
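The choice between the two commands above can be automated. The following is a minimal sketch and is not part of this repository: the function name geopyspark_run_cmd is hypothetical, and it only prints the docker command (adding the read-only credentials mount when $HOME/.aws exists) so that the command can be inspected before it is run.

```shell
# Hypothetical helper (not part of this repository): print the docker
# command from above, adding the AWS credentials mount only when
# $HOME/.aws exists on the host.
geopyspark_run_cmd() {
    aws_mount=""
    if [ -d "$HOME/.aws" ]; then
        aws_mount=" -v $HOME/.aws:/home/hadoop/.aws:ro"
    fi
    echo "docker run -it --rm --name geopyspark -p 8000:8000 -p 4040:4040${aws_mount} quay.io/geodocker/jupyter-geopyspark"
}

# Print the command for inspection; nothing is started here.
geopyspark_run_cmd
```

The sketch assumes that $HOME contains no spaces; if it does, the printed command would need quoting before being pasted into a shell.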
From A Clone
To use the image from within a clone of this repository, there are two useful targets in the Makefile: run and run-editable.
To use the run target, type something like
TAG=latest make run
or, to use the run target with some image other than the latest one, something like
TAG=a1b78b9 make run
will launch a container using the image
A run-editable target also exists, which attempts to map one's local clone of GeoPySpark into the container so that the code can be edited and iterated on in a fairly convenient fashion.
By default, it is assumed that the GeoPySpark code is present in ../geopyspark/geopyspark, but that assumption can be changed by passing in an alternate location through the GEOPYSPARK_DIR environment variable.
TAG=latest GEOPYSPARK_DIR=/tmp/geopyspark/geopyspark make run-editable
is an example of that.
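Because run-editable depends on a directory outside of this repository, it can be useful to confirm that the directory exists before launching. The following is a minimal sketch; the function name check_geopyspark_dir is hypothetical and is not something the Makefile provides.

```shell
# Hypothetical pre-flight check (not part of this repository): confirm
# that the directory the run-editable target will use actually exists.
check_geopyspark_dir() {
    # Fall back to the default location described above.
    dir="${GEOPYSPARK_DIR:-../geopyspark/geopyspark}"
    if [ -d "$dir" ]; then
        echo "using GeoPySpark code in $dir"
    else
        echo "no GeoPySpark code found in $dir" >&2
        return 1
    fi
}

# Report which directory would be used; a failure is not fatal here.
check_geopyspark_dir || true
```

One possible use is to chain the check with the target, as in check_geopyspark_dir && TAG=latest make run-editable.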
Both of those targets also pay attention to the EXTRA_FLAGS environment variable, which can be used to pass additional flags to docker.
Building The Image
To build the image, type
make image, or simply
to run the newly-built image.
The TAG environment variable is not set in that case, so by default the run target will use the tag of the new image.
Modifying The Image; or, Image Architecture
In this section we describe the structure of the repository and document how the various pieces interact as part of the build process.
- archives is an initially-empty directory that is populated with source code and built artifacts as part of the build process.
- blobs is an initially-empty directory that is populated with built artifacts from the archives directory. This directory exists because archives is listed in the .dockerignore file (which was done to reduce the size of the build context of the final image). Please see the README in that directory for more information.
- config contains the GeoNotebook configuration file and a list of python dependencies that GeoNotebook requires.
- emr-docker contains files useful for running the image on Amazon EMR (please see the README in that directory for more information).
- terraform-docker contains files useful for running the image on Amazon EMR using Terraform. Its remit is similar to that of the directory mentioned in the previous bullet-point, but it uses Terraform instead of shell scripts.
- kernels contains Jupyter kernel configuration files. The one most likely to be of interest is the one that enables GeoNotebook and GeoPySpark; the other two kernels are mostly vestigial/ceremonial.
- notebooks contains various sample notebooks.
- scratch is a scratch directory used during the build process. The files that are added under this directory during the build can be harmlessly deleted after the build is complete, but not doing so will accelerate subsequent builds.
- scripts contains various scripts used for building and installing artifacts.
  - netcdf.sh builds a jar from a particular branch of the Thredds project that provides support for reading NetCDF files.
  - build-python-blob1.sh runs in the context of the AWS build container. Its purpose is to acquire most of the python dependencies needed by GeoPySpark and GeoNotebook and package them together into a tarball for later installation.
  - build-python-blob2.sh runs in the context of the AWS build container. Its purpose is to package GeoPySpark and GeoPySpark-NetCDF into a tarball for later installation.
  - install-blob1.sh runs in the context of the final image build. Its purpose is to install the artifacts created earlier by
  - install-blob2.sh runs in the context of the final image build. Its purpose is to install the artifacts created earlier by
- Dockerfile specifies the final image, the output of the build process.
- Makefile coordinates the build process.
The build process can be divided into three stages: the bootstrap image creation phase, the EMR-compatible artifact creation stage, and the final image build stage.
When the all makefile target is invoked, the last two stages of the three-stage build process are done.
Stage 0: Build Bootstrap Images
The first of the three stages is done using the contents of the
Its results have already been pushed to the quay.io/geodocker docker repository, so unless the reader wishes to modify the bootstrap images, this stage can be considered complete.
To rebuild the bootstrap images, the reader should navigate into the rpms/build directory and run the
Stage 1: EMR-Compatible Artifacts
The purpose of this stage is to build python artifacts that need to be linked against those binary dependencies which have been built in a context that resembles EMR (because we want the image to be usable on EMR).
First, a tarball containing python code linked against the binary dependencies mentioned above is created. Then, another python tarball containing GeoPySpark is created. The reason that there are two python tarballs instead of one is simply that the contents of the two tarballs change at different rates; over repeated builds, the first tarball is built less frequently than the second one.
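The benefit of splitting the tarballs can be illustrated with a staleness check: an artifact only needs rebuilding when it is missing or older than one of its inputs, so a rarely-changing tarball is rarely rebuilt. The following is an illustration only, not the rule the Makefile actually uses; the function name needs_rebuild and the file names in the example call are hypothetical.

```shell
# Illustration only (the Makefile's real rules may differ): report
# whether a target file is missing or older than any of its inputs.
needs_rebuild() {
    target="$1"
    shift
    [ -e "$target" ] || return 0          # missing target: rebuild
    for input in "$@"; do
        # -nt is true when $input was modified more recently than $target
        [ "$input" -nt "$target" ] && return 0
    done
    return 1                              # target is up to date
}

# Example call with hypothetical file names.
if needs_rebuild blobs/example.tar.gz scripts/example.sh; then
    echo "example.tar.gz would be rebuilt"
fi
```

Make performs the same kind of timestamp comparison between a target and its prerequisites, which is why keeping slow-changing and fast-changing content in separate targets saves build time.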
Stage 2: Build Final Image
In the third of the three stages, the artifacts which were created earlier are brought together and installed into the final docker image.
Adding Binary Dependencies
As an example of how to make a meaningful modification to the image, in this section we will describe the process of adding new binary dependencies to the image.
Currently, all binary dependencies are located in the file gdal-and-friends.tar.gz, which comes in via the quay.io/geodocker/jupyter-geopyspark:base-2 image on which the final image is based.
If we want to add an additional binary dependency inside of that file,
then we only need to download or otherwise acquire the source code
and update the build script to build and package the additional code.
If we wish to add a binary dependency outside of the
gdal-and-friends.tar.gz file, then the process is slightly more involved,
but potentially faster because it is not necessary to rebuild bootstrap images.
The strategy for adding a new binary dependency, hypothetically libHelloWorld packaged in a file called
will be to mirror the process for
gdal-and-friends.tar.gz to the extent that we can.
The difference is that this time we will add the binary to the final image rather than to a bootstrap image.
- First, augment the Makefile to download or otherwise ensure the existence of the
- Next, we want to build and package libHelloWorld in the context of the AWS build image, so that it will be usable on EMR. This would probably be done by first creating a script analogous to the one for GDAL that builds, links, and archives the dependency.
- That script should run in the context of the AWS build container so that the created binaries are compiled and linked in an environment that resembles EMR.
- The resulting archived binary blob should then be added to the final image so that it can be distributed to the Spark executors.
That should probably be done by adding a COPY command to the Dockerfile to copy the new blob to the /blobs directory of the image.
- Finally, the image environment and the kernel should both be modified to make use of the new dependency.
The former will probably involve the addition of an ENV command to the Dockerfile to augment the LD_LIBRARY_PATH environment variable to be able to find any new shared libraries; the latter is described below.
The changes to the kernel described in the last bullet-point would probably look something like this:
@@ -14,6 +14,6 @@
     "PYTHONPATH": "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
     "GEOPYSPARK_JARS_PATH": "/opt/jars",
     "YARN_CONF_DIR": "/etc/hadoop/conf",
-    "PYSPARK_SUBMIT_ARGS": "--archives /blobs/gdal-and-friends.tar.gz,/blobs/friends-of-geopyspark.tar.gz,/blobs/geopyspark-sans-friends.tar.gz --conf spark.executorEnv.LD_LIBRARY_PATH=gdal-and-friends.tar.gz/lib --conf spark.executorEnv.PYTHONPATH=friends-of-geopyspark.tar.gz/:geopyspark-sans-friends.tar.gz/ --conf hadoop.yarn.timeline-service.enabled=false pyspark-shell"
+    "PYSPARK_SUBMIT_ARGS": "--archives /blobs/helloworld-and-friends.tar.gz,/blobs/gdal-and-friends.tar.gz,/blobs/friends-of-geopyspark.tar.gz,/blobs/geopyspark-sans-friends.tar.gz --conf spark.executorEnv.LD_LIBRARY_PATH=helloworld-and-friends.tar.gz/lib:gdal-and-friends.tar.gz/lib --conf spark.executorEnv.PYTHONPATH=friends-of-geopyspark.tar.gz/:geopyspark-sans-friends.tar.gz/ --conf hadoop.yarn.timeline-service.enabled=false pyspark-shell"
   }
 }
(The changes represented by the diff above have not been tested.)
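The LD_LIBRARY_PATH value in the diff above points at a lib directory inside the distributed archive, which suggests that the archive should unpack to a top-level lib directory. A minimal sketch of a packaging step that produces that layout follows; the function name package_blob and its arguments are hypothetical, and this is not the repository's actual build script.

```shell
# Hypothetical packaging step (not part of this repository): archive a
# directory of shared libraries so that the tarball has a top-level
# lib/ directory, the layout that a setting such as
# LD_LIBRARY_PATH=helloworld-and-friends.tar.gz/lib relies on.
package_blob() {
    prefix="$1"     # install prefix that contains a lib/ directory
    archive="$2"    # path of the tarball to create
    tar -C "$prefix" -czf "$archive" lib
}
```

It might be invoked as package_blob /usr/local archives/helloworld-and-friends.tar.gz, where the install prefix is an assumption about where the dependency was built.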
The process for adding new distributed python dependencies is analogous to the one above, except that changes to the LD_LIBRARY_PATH variable on the executors might not be required, and additions will most probably need to be made to the --conf spark.executorEnv.PYTHONPATH configuration passed in via PYSPARK_SUBMIT_ARGS in the kernel.
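The relationship between the archive list and the executor PYTHONPATH can be made explicit by generating both from the same list of names. The following is a minimal sketch that mirrors the shape of the PYSPARK_SUBMIT_ARGS value in the diff above; the function name python_archive_args is hypothetical, and the output has not been tested against a cluster.

```shell
# Hypothetical helper: given archive file names, print the --archives
# and executor PYTHONPATH fragments in the same shape as the
# PYSPARK_SUBMIT_ARGS value shown in the diff above.
python_archive_args() {
    archives=""
    pythonpath=""
    for name in "$@"; do
        # Comma-separated absolute paths under /blobs for --archives.
        archives="${archives:+$archives,}/blobs/$name"
        # Colon-separated archive names (with trailing slash) for PYTHONPATH.
        pythonpath="${pythonpath:+$pythonpath:}$name/"
    done
    echo "--archives $archives --conf spark.executorEnv.PYTHONPATH=$pythonpath"
}

python_archive_args friends-of-geopyspark.tar.gz geopyspark-sans-friends.tar.gz
```

Adding a new python archive then amounts to appending one more name to the argument list, which keeps the two settings in step.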
To build the RPMs, navigate into the
rpms/build directory and type
Terraform And AWS
To use the RPM-based deployment, navigate into the
The configuration in that directory requires Terraform version 0.10.6 or greater.
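The version requirement can be checked before running anything. The following is a minimal sketch; the function name version_ge is hypothetical, and it assumes that sort supports the -V (version sort) option, as GNU coreutils does.

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on "sort -V" (version sort) being available.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# "terraform version" prints a first line such as "Terraform v0.10.6";
# strip the prefix to get the bare version number. The check is skipped
# when terraform is not installed.
if command -v terraform >/dev/null 2>&1; then
    installed=$(terraform version | head -n 1 | sed 's/^Terraform v//')
    if version_ge "$installed" 0.10.6; then
        echo "Terraform $installed is new enough"
    else
        echo "Terraform $installed is older than 0.10.6" >&2
    fi
fi
```

A plain string comparison would not work here, because 0.9.11 sorts after 0.10.6 lexically even though it is the older release.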
If you want to use Google OAuth, GitHub OAuth, or some supported generic type of OAuth, then type
terraform init
terraform apply
and respond appropriately to the prompts.
Doing that will upload (or sync) the RPMs to the S3 location that you specify, and will also upload the terraform-nodocker/bootstrap.sh bootstrap script.
If you do not wish to use OAuth, then some modifications to the bootstrap script will be required.
With The Docker Image
In order to use OAuth for login, two things are necessary:
- It is necessary to set three environment variables inside of the container before the JupyterHub process is launched, and
- it is necessary to use a jupyterhub_config.py file that enables the desired OAuth setup.
The three environment variables that must be set are
The first of those three variables should be set to http://localhost:8000/hub/oauth_callback for local testing and something like http://$(hostname -f):8000/hub/oauth_callback for deployment.
The second and third are dependent on the OAuth provider.
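The two callback URL forms differ only in the host name, so they can be produced by one small function. The following is a minimal sketch; the function name oauth_callback_url is hypothetical, and the port and path are taken from the URLs given above.

```shell
# Hypothetical helper: print the callback URL for a given host,
# defaulting to localhost for local testing.
oauth_callback_url() {
    host="${1:-localhost}"
    echo "http://${host}:8000/hub/oauth_callback"
}

oauth_callback_url                    # local testing
oauth_callback_url "$(hostname -f)"   # deployment
```

If hostname -f cannot resolve a fully-qualified name and prints nothing, the function falls back to localhost, so the output is worth checking on a deployment host.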
There are three such files already included in the image: one for Google and related services, one for GitHub, and a generic one. There is some variability in the precise details of how OAuth providers work (e.g. some require variables to be passed in the URL of a POST request, whereas others require variables to be passed in the body of a POST request). For that reason, the generic configuration should be considered a starting point rather than something that is guaranteed to work in its unmodified state.
There are only two user accounts in the image:
All three of the configurations discussed above map all valid OAuth users to the
That is done because -- without additional configuration -- Spark jobs on EMR must come from a user named "
(The users inside of the container are separate and distinct from those on the host instance,
but the username is evidently part of a Spark job submission, so it must match that of the user that EMR is expecting submissions from.)
To use OAuth, launch a container with the three variables supplied and with the appropriate
docker run -it --rm --name geopyspark \
   -p 8000:8000 \
   -e OAUTH_CALLBACK_URL=http://localhost:8000/hub/oauth_callback \
   -e OAUTH_CLIENT_ID=xyz \
   -e OAUTH_CLIENT_SECRET=abc \
   quay.io/geodocker/jupyter-geopyspark:latest \
   jupyterhub \
   -f /etc/jupterhub/jupyterhub_config_github.py \
   --no-ssl --Spawner.notebook_dir=/home/hadoop/notebooks
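When the three values come from the host environment rather than being typed inline, it can help to confirm that none of them is empty before launching the container. The following is a minimal sketch; the function name missing_oauth_vars is hypothetical, and the variable names are the ones used in the command above.

```shell
# Hypothetical pre-flight check: print the names of any of the three
# OAuth variables used in the command above that are unset or empty.
missing_oauth_vars() {
    missing=""
    for name in OAUTH_CALLBACK_URL OAUTH_CLIENT_ID OAUTH_CLIENT_SECRET; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="${missing:+$missing }$name"
    done
    echo "$missing"
}

unset_vars=$(missing_oauth_vars)
if [ -n "$unset_vars" ]; then
    echo "set these before launching: $unset_vars" >&2
fi
```

Since a missing client ID or secret would otherwise only surface as a failed login in the browser, checking up front gives a clearer error.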
With The RPM-based Deployment
This was discussed earlier.