3. Docker hands on

Diego Garrido-Martín edited this page Nov 24, 2018 · 41 revisions

1. Hello Docker

Docker is software that performs OS-level lightweight virtualization (containerization). Docker allows you to package an application (i.e. a program) and its dependencies into an isolated, self-contained unit (a container) that can be run on any Linux computer. Before starting, make sure that Docker is installed, following these steps. Also, ensure you have root permissions, as Docker requires them. You can run sudo su for that. Then, you can check that the Docker Engine is running by typing:

docker run hello-world

It should print the following message:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Remember that you can have a look at your images and containers, respectively, using:

docker images  # shows images
docker ps -a   # shows running and stopped containers

You can review these and other concepts in our slides. Make sure that you clone my GitHub repository and change into it. Let's get started!

2. The problem

Many programs generate slightly different results depending on the OS, the version of the dependencies, etc. To simulate a situation in which this happens, imagine that you have written the following simple R script. Given a list of numbers as input, it prints the distribution quantiles to STDOUT and plots a histogram in the output PDF file. However, it adds to the input numbers a small quantity that depends on the R version. We will use Docker to make our result exactly reproducible for any user on any machine.

3. Build a Docker image from a Dockerfile

The first step is to create a Dockerfile that specifies the instructions to build our image.

Task 1: Write a Dockerfile with the content below. Substitute you and your_email with your actual name and email, and also add a description line instead of your_description. Note that apart from installing R and some other dependencies, we add a copy of script.R within the container.

# Start from debian linux image (DockerHub)
FROM debian:stable

# Add custom label
LABEL maintainer "you <your_email>" \
      version "0.1" \
      description "your_description"

# Install R (after apt-get update)
RUN apt-get update && apt-get install -y r-base=3.3.3*

# Install R package 'optparse'
RUN R -e 'install.packages("optparse", repos = "http://cloud.r-project.org/")'

# Make the folder '/scripts' in the container
RUN mkdir /scripts

# Copy 'scripts/script.R' to the folder '/scripts' in the container
ADD scripts/script.R /scripts

Now you can build your image.

Task 2: Build a Docker image using your Dockerfile. You can use the following command, substituting <image> with the name that you want to give to your image:

docker build -t <image> .

Hint: If we wanted to push our image to DockerHub we should have named it username/repository:tag. You can list your images using docker images. Note that the default tag is latest.

4. Run a container

Once the image is built, we can create a container from it and run any command inside.

Task 3: Run a ls command using the image you just built. What is the name of the container that this command creates? Hint: docker run creates a container from your image, runs your command and stops the container. If you add the --rm option, the container will be removed after stopping. This is good practice, so that you do not accumulate stopped containers. You can list your containers (running and stopped) using docker ps -a.

docker run <image> ls

Maybe it is more intuitive to run the container interactively in order to understand what is happening. This can be done by adding the options -i and -t. Then, you will be within the container!

Task 4: Run the container interactively. What's inside? Check the path of script.R. Hint: You can exit the container using exit. Remember that any changes you do will be discarded when exiting.

Easy, right? Now let's try to run our own script!

Task 5: Run script.R --help inside the container. Hint: Note that you need to grant execution permissions to script.R and add the scripts folder to the PATH environment variable.
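The two fixes from the hint can be sketched outside Docker too. The sketch below uses a stand-in script (the printf line is a hypothetical placeholder, not the real script.R) to show the execution-permission and PATH steps:

```shell
# Create a stand-in for /scripts/script.R (inside the container, the real
# script.R already sits in /scripts).
mkdir -p scripts
printf '#!/bin/sh\necho "usage: script.R [options]"\n' > scripts/script.R

# Fix 1: grant execution permissions.
chmod +x scripts/script.R

# Fix 2: add the scripts folder to PATH so the script is found from anywhere.
export PATH="$PATH:$(pwd)/scripts"

script.R --help   # prints: usage: script.R [options]
```

Inside the container, the same two commands target /scripts instead of a local folder.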

Again, when exiting the container these changes will be lost. But how to make them permanent? Just modify your Dockerfile and re-build your image!

Task 6: Include the following statements in your Dockerfile and build your image again. This will make script.R executable and accessible from any location within the container. Then, run the script.R --help command again within a container in a non-interactive way, naming the container as you prefer. Hint: Note that the previous build steps are cached. You can add a name to your container using the option --name.

# Give execution permissions to script.R
RUN chmod +x /scripts/script.R

# Add /scripts folder to the PATH environment variable
ENV PATH="$PATH:/scripts"           
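Putting the pieces together, the complete Dockerfile at this stage (the original content plus the two new statements) looks like this:

```dockerfile
# Start from debian linux image (DockerHub)
FROM debian:stable

# Add custom label
LABEL maintainer "you <your_email>" \
      version "0.1" \
      description "your_description"

# Install R (after apt-get update)
RUN apt-get update && apt-get install -y r-base=3.3.3*

# Install R package 'optparse'
RUN R -e 'install.packages("optparse", repos = "http://cloud.r-project.org/")'

# Make the folder '/scripts' in the container
RUN mkdir /scripts

# Copy 'scripts/script.R' to the folder '/scripts' in the container
ADD scripts/script.R /scripts

# Give execution permissions to script.R
RUN chmod +x /scripts/script.R

# Add /scripts folder to the PATH environment variable
ENV PATH="$PATH:/scripts"
```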

You can use docker history <image>, where <image> is the actual name of your image, to check the different commands that have been run to build the image. Also, remember that you can see all your containers by typing docker ps -a.

So far we have become familiar with the main Docker commands; however, we have not yet run our script!

Task 7: Run a container interactively from your image. Then, use R to write a file with a single column of ~1000 numbers, with the format shown below. This will be the input file for your script. Finally, run script.R and redirect the STDOUT to a file named summary.txt. Hint: If you enable the option --verbose, script.R will also print to the STDOUT the R version installed within the container.

-0.518790021063055
-0.613178474723641
0.441941540062767
-1.06824702625299
0.582597034289661
1.47676704975289
-0.158877331999312
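Inside the container you would use R to generate such a file; but as a quick host-side sketch (no R required), the classic sum-of-twelve-uniforms approximation to a standard normal can produce an input file of this shape in plain awk. The file name input.txt is just an example:

```shell
# Approximate N(0,1) draws: a sum of 12 uniform(0,1) values minus 6
# (Irwin-Hall approximation), one number per line.
awk 'BEGIN {
  srand(1)
  for (i = 0; i < 1000; i++) {
    s = 0
    for (j = 0; j < 12; j++) s += rand()
    printf "%.15g\n", s - 6
  }
}' > input.txt

wc -l < input.txt   # should report 1000 lines
```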

But how to run the script on any dataset located outside the container? How to get the output PDF and the summary file back into your filesystem? I bet you remember our slides about volumes!

5. Volumes

Remember that Docker has two main options for containers to store files in the host machine, so that the files are persisted even after the container stops: volumes and bind mounts.

First, let's try to save the input and output files of script.R, generated within a container, in a volume.

Task 8: Repeat task 7, now mounting a volume into the folder /data_vol within the container. Substitute <volume> with the name that you prefer. Save all input and output files (input + output PDF + redirected STDOUT) for script.R in /data_vol.

docker run -i -t -v <volume>:/data_vol <image>

Then, inspect the volume and have a look at its contents. Hint: You can guess the volume location using docker volume inspect and looking at "Mountpoint". You can list all your volumes using docker volume ls.
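To see what that "Mountpoint" field looks like without querying a running daemon, here is a sketch: the JSON below is a hand-written, abridged stand-in for the typical output of docker volume inspect for a hypothetical volume named mydata, and the grep/cut line extracts the host path from it:

```shell
# Abridged stand-in for `docker volume inspect mydata` output
# (the real command queries the Docker daemon and prints a JSON array).
cat > inspect.json <<'EOF'
[
    {
        "Driver": "local",
        "Mountpoint": "/var/lib/docker/volumes/mydata/_data",
        "Name": "mydata",
        "Scope": "local"
    }
]
EOF

# Pull out the Mountpoint value, i.e. where the volume lives on the host.
grep -o '"Mountpoint": "[^"]*"' inspect.json | cut -d '"' -f 4
# prints: /var/lib/docker/volumes/mydata/_data
```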

What would you expect to happen to the output files that you just generated if you run the next command?

docker run -i -t <image>

And what if you run the following?

docker run -i -t -v <volume>:/magicfolder <image>

At this point you already know how to store the files generated within the container into a volume stored in the host. Well done! However, how can we use data from the host as input for our software within the container? One option is to use bind mounts. Using them, we can also easily recover the output files.

Task 9: Employ a bind mount to use data/normal.txt as input for script.R within the container. Save the output files (PDF + redirected STDOUT) in the target folder. Do it in a non-interactive manner, that is, without the -i and -t options. Hint: Substitute the source (<path/to/bind_mount/host>) and target (<path/to/bind_mount/container>) paths with the proper ones in the host and the container. They may even be the same! You will also find the -w option useful to set the working directory within the container.

docker run -v <path/to/bind_mount/host>:<path/to/bind_mount/container> -w <working_dir> <image>

Now just try to run something interactively, but using a bind mount. Note that all the changes that you make in the host folder (e.g. creating an empty file using touch <file>) can be seen in the container folder in real time, and vice versa (you can check that by opening two terminal sessions, one within the container, one within the host).

Note that we have now solved our initial problem: we have our program, script.R, running in an isolated environment where we control the version of R and any other dependencies. Besides, with the help of bind mounts, we can use any dataset located on any Linux computer as input for script.R (in practice also Windows and Mac, thanks to Docker Engine versions for those operating systems that run lightweight Linux VMs) and generate 100% reproducible results.

6. DockerHub

Very often, we want to share our software with others, or use others' software. One way of doing so, while ensuring reproducibility, is sharing ready-to-use Docker images through DockerHub. DockerHub is a cloud-based registry service that allows you to store and distribute Docker images. In practice, you can search for images, and push and pull them from a local Docker daemon, among other operations. Although they are very different, you may find the analogy between Git/GitHub and Docker/DockerHub useful. Please make sure that you obtain a Docker ID before proceeding with the next tasks.

Task 10: Upload your image to DockerHub. Hint: first, use docker login to log in with your Docker ID. Then use docker push <image> to upload your image. Remember that you should rename it to username/repository:tag. You can do this with docker tag <oldname> <newname>. Name your DockerHub repo as you prefer. You can also personalize it through the browser by adding information about your image.

In addition, you can browse DockerHub to find images of interest for your research. Let's try to find some tools to perform sashimi plots! Hint: Sashimi plots are some of the most widely used representations in alternative splicing analyses; you can read more about them in these articles: sashimi-plot, ggsashimi.

Task 11: Search DockerHub for images with the keyword ggsashimi, and pull the one built by guigolab. Then, try to reproduce the example depicted in this GitHub repo. Hint: First clone the ggsashimi repo.

7. Exercises

Exercise 1:

  • Build a Docker image for your favourite tool in bioinformatics and upload it to DockerHub.