Dockstore New Developer Tutorial

Brian O'Connor edited this page Nov 22, 2016 · 32 revisions

Deprecated

Please note, this wiki tutorial was moved to the main Dockstore.org site and has been updated. See https://dockstore.org/docs/getting-started

Overview

This guide will walk you through a real example of adding a tool to the Dockstore and using it. I will also point out tips and tricks along the way to help you avoid common issues that make tools more difficult to use and less portable. A future tutorial will show you how to register a workflow (in CWL or WDL).

Prerequisites

You need accounts with several online services, along with some software installed on your development host.

Gathering Accounts

The first step is to establish accounts with key services, if you haven't already:

  • GitHub
  • Quay.io

You may alternatively (or additionally) sign up for accounts at the following:

  • Bitbucket
  • DockerHub

These aren't required since GitHub and Quay provide these same services. This tutorial will focus on GitHub and Quay.

Development Environment Setup

This tutorial assumes you are using a Linux host running Ubuntu 14.04 and have the following installed:

  • Python 2.7
  • Java 1.8
  • Docker
  • cwltool: install it with pip by running pip install setuptools==24.0.3 && pip install cwl-runner cwltool==1.0.20160316150250 schema-salad==1.7.20160316150109 avro==1.7.7. The Dockstore will direct you to do this when you register
  • Dockstore CLI: you will be prompted to download this when you register

The process you use to install these depends on the system you are using. See the bottom of this tutorial (Install Tips) for the specific commands I used on a fresh Ubuntu 14.04 VM.

To check that things are configured correctly, open a terminal and execute the following. You should see very similar output. If there are errors, make sure you followed the setup instructions carefully:

$> java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

$> docker --version
Docker version 1.11.1, build 5604cbe

$> cwltool --version
/usr/local/bin/cwltool 1.0.20160316150250

$> dockstore --version
Dockstore version 0.4-beta.1
You are running the latest stable version...
...
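
If you'd rather check all four prerequisites in one pass, a small shell helper along these lines can report whatever is missing. This is only a sketch and not part of the Dockstore tooling; the function name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper name, not a Dockstore command.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the prerequisites named in this tutorial.
missing=$(missing_tools java docker cwltool dockstore)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line
  echo "Missing from PATH:" $missing
else
  echo "All prerequisites found"
fi
```

This only tells you whether each command can be found; you still need the version checks above to confirm you have the releases this tutorial was written against.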

In addition to the tools mentioned above you will probably want an editor capable of syntax highlighting Dockerfiles, such as Atom.

Development Overview

Dockstore is a registry which means we don't actually store your source files, build your Docker images, or host them for the community. We are purely a place where you can register, and describe how to call, your Docker-based tools. For this reason we depend on GitHub (or Bitbucket) and Quay (or DockerHub) to provide these services.

Dockstore overview

Adding to Source Control

The first step is to establish the Git source repository you will work out of. This will contain your Dockerfile, which describes how to build a Docker image that has all your tools, reference files, configs, etc. installed in it. The source repository will also contain a Dockstore.cwl (or Dockstore.wdl) that describes the tool installed inside the Docker image and how to execute it.

For this tutorial I'm going to use my own personal git repository located at:

https://github.com/briandoconnor/dockstore-tool-bamstats

This will be used to create a Docker-based version of the bamstats command, a simple tool that provides statistics on BAM files.

The process of creating a git repository on GitHub is beyond the scope of this tutorial, but detailed directions can be found on GitHub's help page.

You can follow along on this tutorial by "forking" my repository above into your own GitHub account. Or you can create your own git repository for another tool that you want to share on Dockstore.

Creating a Dockerfile

With a repository established in GitHub, the next step is to create the Docker image with BAMStats correctly installed. To do this you need to create a Dockerfile, which contains the instructions necessary for creating a Docker image that contains all the dependencies of BAMStats along with the executable itself.

Here's my sample Dockerfile:

#############################################################
# Dockerfile to build a sample tool container for BAMStats
#############################################################

# Set the base image to Ubuntu
FROM ubuntu:14.04

# File Author / Maintainer
MAINTAINER Brian OConnor <briandoconnor@gmail.com>

# Setup packages
USER root
RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip

# get the tool and install it in /usr/local/bin
RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
RUN unzip BAMStats-1.25.zip && \
    rm BAMStats-1.25.zip && \
    mv BAMStats-1.25 /opt/
COPY bin/bamstats /usr/local/bin/
RUN chmod a+x /usr/local/bin/bamstats

# switch back to the ubuntu user so this tool (and the files written) are not owned by root
RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 ubuntu
USER ubuntu

# by default /bin/bash is executed
CMD ["/bin/bash"]

This Dockerfile has a lot going on in it. There are good tutorials online about the details of Dockerfile and its syntax. An excellent resource is the Docker website itself, including the Best practices for writing Dockerfiles webpage. I'll highlight some sections below:

FROM ubuntu:14.04

This uses the ubuntu 14.04 base distribution. How do I know to use ubuntu:14.04? This comes from either a search on Ubuntu's home page for their "official" Docker images or you can simply go to DockerHub or Quay and search for whatever base image you like. You can extend anything you find there, so if you come across an image that contains most of what you want you can use it as the base here. Just be aware of its source: I tend to stick with "official", basic images for security reasons.

MAINTAINER Brian OConnor <briandoconnor@gmail.com>

You should include your name and contact information.

USER root
RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip
RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
RUN unzip BAMStats-1.25.zip && \
    rm BAMStats-1.25.zip && \
    mv BAMStats-1.25 /opt/

This switches to the root user to perform software installs. It installs the required packages with apt-get, then downloads BAMStats, unzips it, and moves it to its install location, which here is /opt.

This is why Docker is so powerful. On HPC systems the above process might take days or weeks of working with a sys admin to install dependencies on all compute nodes. Here I can control and install whatever I like inside my Docker image, correctly configuring the environment for my tool and avoiding the time needed to set up these dependencies in the places I want to run. This also greatly simplifies the install process for other users that you share your tool with.

COPY bin/bamstats /usr/local/bin/
RUN chmod a+x /usr/local/bin/bamstats

This copies the local helper script bamstats from the git checkout directory to /usr/local/bin. This is an important example because it shows how to use COPY to copy files from the git directory structure into the Docker image. After copying to /usr/local/bin, the script is made executable by all users.

RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 ubuntu
USER ubuntu

# by default /bin/bash is executed
CMD ["/bin/bash"]

The user ubuntu is created and switched to in order to make file ownership easier, and the default command for this Docker image is set to /bin/bash, which is a typical default.

Note that this Dockerfile only scratches the surface. Take a look at Best practices for writing Dockerfiles for a really terrific in-depth look at writing Dockerfiles.

Building Docker Images

Now that you've created the Dockerfile the next step is to build the image. The docker command line is used for this:

$> docker build -t quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 .

The . is the path to the location of the Dockerfile, which is in the same directory here. The -t parameter is the "tag" that this Docker image will be called locally when it's cached on your host. A few things to point out: the quay.io part of the tag typically denotes that this was built on Quay.io (which we will see in a later section). I'm manually specifying this tag so it will match the Quay.io-built version. This allows me to build and test locally then, eventually, switch over to the Quay.io-built version. The next part of the tag, briandoconnor/dockstore-tool-bamstats, denotes the name of the tool, which is derived from the organization and repository name on GitHub. Finally, 1.25-3 denotes a version string; typically you want to sync that with releases on GitHub. In this case I'm working on release 1.25-3, so this is on a release branch. However, the most recent release via GitHub is the previous version, 1.25-2. The ramifications of this will come up in the Quay section below.

Release in Github

Really, you could use whatever you want for the tag but, practically, you want this to match what Quay will use, aka your next release, so that's what I'm doing here. The tool should build normally and should exit without errors. You should see something like:

Successfully built 01a7ccf55063

Check that the tool is now in your local Docker image cache:

$> docker images | grep bamstats  
quay.io/briandoconnor/dockstore-tool-bamstats   1.25-3  01a7ccf55063   2 minutes ago   538.3 MB

Great! This looks fine!
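
The value passed to -t in this section has four parts: registry host, organization, repository, and version. If you build several releases it can help to compose that string in one place. Here's a minimal sketch; image_ref is a hypothetical helper name of my own, not a Docker or Dockstore command.

```shell
#!/bin/sh
# Compose a Docker image reference from its parts.
# (image_ref is a hypothetical helper, used here only for illustration.)
image_ref() {
  # $1 = registry host, $2 = organization, $3 = repository, $4 = version
  printf '%s/%s/%s:%s\n' "$1" "$2" "$3" "$4"
}

IMAGE=$(image_ref quay.io briandoconnor dockstore-tool-bamstats 1.25-3)
echo "$IMAGE"

# The build command from this section would then be:
#   docker build -t "$IMAGE" .
```

Keeping the version in a single variable makes it less likely that the local tag drifts away from the release name you later use on GitHub and Quay.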

Testing the Docker Image Locally

OK, so you've built the image. Now what?!

The next step will be to test the tool directly via Docker to ensure that your Dockerfile is valid and correctly installed the tool. If you were developing a new tool there might be multiple rounds of docker build, followed by testing with docker run, before you get your Dockerfile right. Here I'm executing the Docker image, launching it as a container (make sure you launch on a host with at least 8GB of RAM and dozens of GB of disk space!):

$> docker run -it -v `pwd`:/home/ubuntu quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 /bin/bash

You'll be dropped into a bash shell which works just like the Linux environments you normally work in. I'll come back to what -v is doing in a bit. The goal now is to exercise the tool and make sure it works as you expect. BAMStats is a very simple tool and generates some reports and statistics for a BAM file. Let's run it on some test data from the 1000 Genomes project:

# this is inside the running Docker container
$> cd /home/ubuntu
$> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
# if the above doesn't work here's an alternative location
$> wget https://s3.amazonaws.com/oconnor-test-bucket/sample-data/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
$> /usr/local/bin/bamstats 4 NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam

What's really going on here? The bamstats command above is a simple script I wrote to make it easier to call BAMStats. This is what I used the COPY command to move into the Docker image via the Dockerfile. Here's the script's contents:

#!/bin/bash
set -euf -o pipefail

java -Xmx$1g -jar /opt/BAMStats-1.25/BAMStats-1.25.jar -i $2 -o bamstats_report.html -v html
zip -r bamstats_report.zip bamstats_report.html bamstats_report.html.data
rm -rf bamstats_report.html bamstats_report.html.data

You can see it just executes the BAMStats jar, passing in the GB of memory and the BAM file while collecting the output HTML report as a zip file followed by cleanup.

An important thing to note is that the output is written to whatever the current directory is. This is the correct place to put your output, since the CWL tool described later assumes that all outputs are located in the current working directory in which it executes your command.
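
The script shown here trusts its two arguments. If you write a similar wrapper for your own tool, you may want to validate them first so that a bad parameter fails with a clear message instead of a Java error. The following is a sketch of one way to do that, not something the real bamstats script does; check_bamstats_args is a hypothetical name.

```shell
#!/bin/sh
# Return success only if the memory argument consists solely of digits
# and the BAM argument names a readable file.
# (check_bamstats_args is a hypothetical helper, not part of the real script.)
check_bamstats_args() {
  case "$1" in
    ''|*[!0-9]*)
      echo "memory must be a whole number of GB" >&2
      return 1
      ;;
  esac
  if [ ! -r "$2" ]; then
    echo "cannot read BAM file: $2" >&2
    return 1
  fi
  return 0
}

# Inside a wrapper this would run before the java call, for example:
#   check_bamstats_args "$1" "$2" || exit 1
```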

The -v parameter used earlier mounts the current working directory on the host into /home/ubuntu, which was the directory we worked in when running /usr/local/bin/bamstats above. The net effect is that when you exit the Docker container you're left with a bamstats_report.zip file in the current directory. This is a key point because it shows you how files are retrieved from inside a Docker container.

You can now unzip and examine the bamstats_report.zip file on your computer to see what type of reports are created by this tool. For example, here's a snippet:

Sample report

You Could Stop Here!

At this point you have a working Docker image. You could use the docker push command to send that to Quay or DockerHub and share with others. However, what you lose is a standardized way to describe how to run your tool. That's what the CWL descriptor and Dockstore provide. We think it's valuable and there's an increasing number of tools designed to work with CWL, so there are benefits to not just stopping here.

CWL Tool Descriptor

At this point we have validated that the Docker image is good and the BAMStats tool works as expected. The next step is to describe how to call the BAMStats tool using the Common Workflow Language. This is a human- and machine-readable format that describes how tools can be called inside a Docker image.

Here's the Dockstore.cwl for BAMStats tool:

#!/usr/bin/env cwl-runner

class: CommandLineTool
id: "BAMStats"
label: "BAMStats tool"
cwlVersion: cwl:draft-3
description: |
    A Docker container for the BAMStats command. See the [BAMStats](http://bamstats.sourceforge.net/) website for more information.
    ```
    Usage:
    # fetch CWL
    $> dockstore cwl --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 > Dockstore.cwl
    # make a runtime JSON template and edit it (or use the content of sample_configs.json in this git repo)
    $> dockstore convert cwl2json --cwl Dockstore.cwl > Dockstore.json
    # run it locally with the Dockstore CLI
    $> dockstore launch --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 \
        --json Dockstore.json
    ```

dct:creator:
  "@id": "http://orcid.org/0000-0002-7681-6415"
  foaf:name: Brian O'Connor
  foaf:mbox: "mailto:briandoconnor@gmail.com"

requirements:
  - class: DockerRequirement
    dockerPull: "quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3"

hints:
  - class: ResourceRequirement
    coresMin: 1
    ramMin: 4092
    outdirMin: 512000
    description: "the process requires at least 4G of RAM"

inputs:
  - id: "#mem_gb"
    type: int
    default: 4
    description: "The memory, in GB, for the reporting tool"
    inputBinding:
      position: 1

  - id: "#bam_input"
    type: File
    description: "The BAM file used as input, it must be sorted."
    format: "http://edamontology.org/format_2572"
    inputBinding:
      position: 2

outputs:
  - id: "#bamstats_report"
    type: File
    format: "http://edamontology.org/format_3615"
    outputBinding:
      glob: bamstats_report.zip
    description: "A zip file that contains the HTML report and various graphics."

baseCommand: ["bash", "/usr/local/bin/bamstats"]

There's a lot going on here. Let's break it down. The CWL is actually recognized and parsed by Dockstore (when we register this later). By default it recognizes Dockstore.cwl but you can customize this if you need to. One of the most important items below is the CWL version: you should label your CWL with the version you are using so that tools that can't run this version can error out appropriately.

class: CommandLineTool
id: "BAMStats"
label: "BAMStats tool"
cwlVersion: cwl:draft-3
description: "A Docker container for the BAMStats command. See the BAMStats website for more information."

These items are recommended and the description is actually parsed and displayed in the Dockstore page. Here's an example:

Entry

In the code above you can see how to have an extended description which is quite useful.

dct:creator:
  "@id": "http://orcid.org/0000-0002-7681-6415"
  foaf:name: Brian O'Connor
  foaf:mbox: "mailto:briandoconnor@gmail.com"

This section includes the tool author referenced by Dockstore. It's open to your interpretation whether that is the person who registers the tool, the person who made the Docker image, or the developer of the original tool. I'm biased towards the person who registers the tool since that is likely to be the primary contact when asking questions about how the tool was set up.

requirements:
  - class: DockerRequirement
    dockerPull: "quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3"

This section links the Docker image used to this CWL. Notice it's exactly the same as the -t you used when building your image.

hints:
  - class: ResourceRequirement
    coresMin: 1
    ramMin: 4092
    outdirMin: 512000
    description: "the process requires at least 4G of RAM"

This may or may not be honoured by the tool calling this CWL but at least it gives you a place to declare computational requirements.

inputs:
  - id: "#mem_gb"
    type: int
    default: 4
    description: "The memory, in GB, for the reporting tool"
    inputBinding:
      position: 1

  - id: "#bam_input"
    type: File
    description: "The BAM file used as input, it must be sorted."
    format: "http://edamontology.org/format_2572"
    inputBinding:
      position: 2

These are the items from the inputs section. Notice a few things. First, the #bam_input id matches bam_input in the sample parameterization JSON. Also, you can control the position of each input on the command line, each input has a type (int or File here), and, for tools that require a prefix (--prefix) before a parameter, you can use the prefix key in the inputBinding section.

Also, I'm using the format field to specify a file format via the EDAM ontology.

outputs:
  - id: "#bamstats_report"
    type: File
    format: "http://edamontology.org/format_3615"
    outputBinding:
      glob: bamstats_report.zip
    description: "A zip file that contains the HTML report and various graphics."

Finally, the outputs section defines the output files. In this case it says that in the current working directory there will be a file called bamstats_report.zip. When running this tool with CWL tools the file will be copied out of the container to a location you specify in your parameter JSON file. We'll walk through an example in the next section.

Finally, the baseCommand is the actual command that will be executed, in this case it's the wrapper script I wrote for bamstats.

baseCommand: ["bash", "/usr/local/bin/bamstats"]

Testing Locally

So at this point you've created a Docker-based tool and have described how to call that tool using CWL. Let's test running BAMStats using the Dockstore command line and descriptor rather than just directly calling it via Docker. This will test that the CWL correctly describes how to run your tool.

First thing I'll do is create a completely local dataset and JSON parameterization file:

$> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
# alternative location if the above URL doesn't work
$> wget https://s3.amazonaws.com/oconnor-test-bucket/sample-data/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
$> mv NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam /tmp/

This downloads the file to my current directory and then moves it to /tmp. I could choose another location, it really doesn't matter, but we need the full path when dealing with the parameter JSON file. I'm using a sample I checked in already: sample_configs.local.json.

{
    "bam_input": {
        "class": "File",
        "path": "/tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
    },
    "bamstats_report": {
        "class": "File",
        "path": "/tmp/bamstats_report.zip"
    }
}

Tip: the Dockstore CLI can handle inputs at HTTPS, FTP, and S3 URLs but that's beyond the scope of this tutorial.

You can see in the above I give the full path to the input under bam_input and full path to the output bamstats_report.
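
If you move the test data around a lot, it can be handy to generate this parameter file rather than edit it by hand. Here's a minimal sketch that writes the same structure with the two paths filled in. The helper name write_params and the output file name my_params.json are my own inventions, and the paths are assumed not to contain quotes or backslashes, which would need JSON escaping.

```shell
#!/bin/sh
# Write a Dockstore parameter JSON file for the BAMStats tool.
# (write_params is a hypothetical helper; the keys match the CWL input and
# output ids, bam_input and bamstats_report.)
write_params() {
  # $1 = full path to the input BAM
  # $2 = full path where the output zip should end up
  # $3 = name of the JSON file to create
  cat > "$3" <<EOF
{
    "bam_input": {
        "class": "File",
        "path": "$1"
    },
    "bamstats_report": {
        "class": "File",
        "path": "$2"
    }
}
EOF
}

PARAMS="${TMPDIR:-/tmp}/my_params.json"
write_params /tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam \
             /tmp/bamstats_report.zip \
             "$PARAMS"
cat "$PARAMS"
```

The resulting file can be passed to the launch command with --json in place of sample_configs.local.json.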

At this point, let's run the tool with our local inputs and outputs via the JSON config file:

$> dockstore tool launch --entry Dockstore.cwl --local-entry --json sample_configs.local.json
Creating directories for run of Dockstore launcher at: ./datastore//launcher-1e43745b-3127-4c56-8204-1e56abb81df2
Provisioning your input files to your local machine
Downloading: #bam_input from /tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam into directory: /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/inputs/91155c9c-fd3b-4edf-871d-b31019ffa0f2
Calling out to cwltool to run your tool
cwltool stdout:
	{
	    "bamstats_report": {
	        "size": 32012,
	        "path": "/home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/bamstats_report.zip",
	        "checksum": "sha1$b3882afae65e54081727a2fef0d3b7bdb9aa22e6",
	        "class": "File"
	    }
	}
cwltool stderr:
	/usr/local/bin/cwltool 1.0.20160316150250
	[job 140138530869072] /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/$ docker run -i --volume=/tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam:/var/lib/cwl/job563598407_tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam:ro --volume=/home/ubuntu/gitroot/dockstore-tool-bamstats/datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs:/var/spool/cwl:rw --volume=/tmp/tmpZ8IdIg:/tmp:rw --workdir=/var/spool/cwl --read-only=true --user=1000 --rm --env=TMPDIR=/tmp quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 bash /usr/local/bin/bamstats 4 /var/lib/cwl/job563598407_tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
	Total time: 12 seconds
	  adding: bamstats_report.html (deflated 50%)
	  adding: bamstats_report.html.data/ (stored 0%)
	  adding: bamstats_report.html.data/20_Coverage_cumulativeHistogram.png (deflated 14%)
	  adding: bamstats_report.html.data/20_Coverage_boxAndWhisker.png (deflated 12%)
	  adding: bamstats_report.html.data/Coverage_boxAndWhisker.png (deflated 1%)
	  adding: bamstats_report.html.data/20_Coverage_histogram.png (deflated 13%)
	  adding: bamstats_report.html.data/20_Coverage.html (deflated 60%)
	Final process status is success

Saving copy of cwltool stdout to: /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/.cwltool.stdout.txt
Saving copy of cwltool stderr to: /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/.cwltool.stderr.txt

Provisioning your output files to their final destinations
Uploading: #bamstats_report from /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/bamstats_report.zip to : /tmp/bamstats_report.zip
[##################################################] 100%

So that's a lot of information but you can see the process was a success. We get output from the command we ran and also see the file being moved to the correct output location:

$> ls -lth /tmp/bamstats_report.zip
-rw-rw-r-- 1 ubuntu ubuntu 32K Jun 16 02:14 /tmp/bamstats_report.zip

The output looks fine, just what we'd expect.

So what's going on here? What's the Dockstore CLI doing? It can best be summed up with this image:

Lifecycle

The command line first provisions files. In our case, the files were local so no provisioning was needed. But as the Tip above mentioned, these can be various URLs. After provisioning, the Docker image is pulled and run via the cwltool command line. This uses the Dockstore.cwl and parameterization JSON file (sample_configs.local.json) to construct the underlying docker run command. Finally, the Dockstore CLI provisions files back. In this case it's just a file copy to /tmp/bamstats_report.zip but it could copy the result to a destination in S3, for example.

Tip: you can use --debug to get much more information during this run, including the actual call to cwltool (which can be super helpful in debugging):

cwltool --non-strict --enable-net --outdir /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-08852137-71c1-4b75-b2fc-16ab7ca3243b/outputs/ /home/ubuntu/gitroot/dockstore-tool-bamstats/Dockstore.cwl /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-08852137-71c1-4b75-b2fc-16ab7ca3243b/workflow_params.json

Tip: the dockstore CLI automatically creates a datastore directory in the current working directory where you execute the command and uses it for inputs/outputs. It can get quite large depending on the tool/inputs/outputs being used. Plan accordingly, e.g. execute the dockstore CLI in a directory located on a partition with sufficient storage.
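
One quick way to act on that tip is to look at the free space on the partition holding your current directory before launching. This is a sketch using standard df and awk; free_kb is a hypothetical helper name.

```shell
#!/bin/sh
# Print the free space, in kilobytes, on the partition holding a directory.
# df -P prints one POSIX-format line per filesystem; column 4 is "Available".
# (free_kb is a hypothetical helper name.)
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "Free space where ./datastore will be created: $(free_kb .) KB"

# Once a run has finished, the size of the datastore itself can be checked with:
#   du -sh ./datastore
```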

Releasing on GitHub

At this point we've successfully created our tool in Docker, tested it, written a CWL that describes how to run it, and tested running this via the Dockstore command line. All of this work has been done locally, so if we encounter problems along the way it's fast to perform debug cycles, fixing problems as we go. At this point we're confident that the tool is ready to share with others and bug free. It's time to release 1.25-3.

Releasing will tag your GitHub repository with a version tag so you always can get back to this particular release. I'm going to use the tag 1.25-3 which you can see referenced in my Docker image tag and also my CWL file. GitHub makes it very easy to release:

Release

I click on "releases" in my GitHub project page and then follow the directions to create a new release. Simple as that!
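
Since the release name, the Docker image tag, and the dockerPull line in the CWL all need to agree, it's worth a quick check before you publish the release. Here's a rough sketch that pulls the version out of a dockerPull line with sed and compares it to the release you're about to make. The helper name cwl_image_version is hypothetical, and the sed pattern assumes the image reference ends in :version as it does in this tutorial.

```shell
#!/bin/sh
# Print the version (the part after the last colon) of the dockerPull image
# reference found in a CWL file.
# (cwl_image_version is a hypothetical helper name.)
cwl_image_version() {
  sed -n '/dockerPull:/s/.*:\([^":]*\)"* *$/\1/p' "$1"
}

# Demonstrate on a two-line fragment shaped like the Dockstore.cwl in this tutorial.
sample=$(mktemp)
printf '%s\n' \
  '  - class: DockerRequirement' \
  '    dockerPull: "quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3"' > "$sample"

release=1.25-3
if [ "$(cwl_image_version "$sample")" = "$release" ]; then
  echo "CWL image version matches release $release"
else
  echo "CWL image version does not match release $release" >&2
fi
rm -f "$sample"
```

In a real checkout you would point cwl_image_version at Dockstore.cwl instead of the temporary sample file.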

Tip: HubFlow is an excellent way to manage the lifecycle of releases on GitHub. Take a look!

Building on Quay.io

Now that you've perfected the Dockerfile, built the image on your local host, tested running the Docker container and the tool packaged inside, and released this version on GitHub, it's time to push the image to a place where others can use it. For this you can use DockerHub but we prefer Quay.io since it integrates really nicely with Dockstore.

You can manually docker push the image you have already built but the most reliable and transparent thing you can do is link your GitHub repository (and the Dockerfile contained within) to Quay. This will cause Quay to automatically build the Docker image every time there is a change.

Log onto Quay now and setup a new repository (click the "+" icon).

New Quay Repo

You must match the name to what I was using previously, so in this case it's briandoconnor / dockstore-tool-bamstats. Also, Dockstore will only work with Public repositories currently. Notice I'm selecting "Link to a GitHub Repository Push". This is because we want Quay to automatically build our Docker image every time we update the repository on GitHub. Very slick!

It will automatically prompt you to setup a "build trigger" after GitHub authenticates you. Here I select the GitHub repo for briandoconnor/dockstore-tool-bamstats.

Build Trigger

It will then ask if there are particular branches you want to build, I typically just let it build everything:

Build Trigger

So every time you do a commit to your GitHub repo Quay automatically builds and tags a Docker image. If this is overkill for you, consider setting up particular build trigger regular expressions at this step.

It will then ask you where your Dockerfile is located. Since the Dockerfile is in the root directory of this GitHub repo you can just click next:

Build Trigger

At this point you can confirm your settings and "Create Trigger" followed by "Run Trigger Now" to actually perform the build of the Docker images.

Build Trigger

Build it for 1.25-3 and any or all other branches. Typically, I build each release; develop (aka latest) is built the next time I check in on that branch.

In my example I should see a 1.25-3 listed in the "tags" for this Quay Docker repository:

Build Tags

And I do, so this Docker image has been built successfully by Quay and is ready for sharing with the community.

Registering on Dockstore

So this is great, we've Docker-ized the BAMStats tool, described it with CWL, built and tested it locally, and hooked the GitHub repo up to Quay to have that service automatically build and host the Docker image. The next step is to register it on Dockstore to make finding and sharing this tool easier.

Log into the Dockstore now using your GitHub account: https://dockstore.org

Now click on "My Tools" in the upper-right corner.

Generally, you should hit the "Refresh All Tools" button to make sure Dockstore has examined your latest repositories on Quay. Do this especially if you created a new repository like we did here.

Refresh

Now select the briandoconnor/dockstore-tool-bamstats repository and click "Publish". The tool is now listed on Dockstore!

Publish

You can also click on the "Versions" tab and should notice 1.25-3 is present and Valid=Yes. If any versions are invalid it is likely due to a path issue to the Dockstore.cwl, Dockerfile, or Dockstore.wdl (if used) files. In BAMStats I used the default value of Dockstore.cwl and Dockerfile in the root repo directory so this wasn't an issue.

Sharing the Tool

This is the simple part. Now that we've successfully registered the tool on Dockstore you can just send around a link, for example to the BAMStats tool I just registered:

https://www.dockstore.org/containers/quay.io/briandoconnor/dockstore-tool-bamstats

And reproduced here below:

bamstats

This includes several useful items:

  • sample command usage and documentation
  • author information
  • links to GitHub and Quay
  • sharing links to email, tweet, etc
  • a discussion comments section
  • details about the versions of the tool available (click "Versions" tab)
  • the Dockerfile (click "Dockerfile" tab)
  • the CWL descriptor (click the "Descriptor" tab)

The last item gives users information on all the parameterization for this tool and the expected outputs.

Using Other Tools from Dockstore

Almost all tools on Dockstore follow the same model as BAMStats. They are docker-based, described with CWL (or in some cases WDL), and they can be run via the dockstore command line interface. Typically, the way one runs the tool follows the pattern shown for BAMStats:

# fetch CWL
$> dockstore tool cwl --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 > Dockstore.cwl
# make a runtime JSON template and edit it (or use the content of sample_configs.json in this git repo)
$> dockstore tool convert cwl2json --cwl Dockstore.cwl > Dockstore.json
# run it locally with the Dockstore CLI
$> dockstore tool launch --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 \
    --json Dockstore.json

Since CWL and Docker are both standards, the latter wildly successful and the former gaining adoption in the genomics community, we expect other tools and systems will become available that directly use tools available via Dockstore. Commercial platforms, such as Seven Bridges and DNAStack, are very exciting for us since they would open up Dockstore-based tools to a large audience and to platforms capable of running large-scale analysis.

Tips & Tricks

This tutorial was quite long and involved but, hopefully, the final outcome is simple and desirable to use. That being said, we only scratched the surface and there are many issues to explore still, in particular when wrapping more complex tools.

I've organized some of the most important tips we've learned in working with Docker in the PCAWG project which saw the use of many different Docker-based tools and workflows.

Docker Tips

  • set up the docker command on your system so you don't need sudo to run it
  • don't use sudo inside your Docker-based tools/scripts
  • try to use the default user in the container e.g. USER ubuntu when using Ubuntu
  • try to not run as USER root inside your container (it can make outputs unreadable)
  • don't call Docker-inside-Docker (it's possible but causes Docker client/server issues)
  • don't depend on changes to hostname or /etc/hosts, Docker will interfere with this
  • don't design your Docker container to take directories filled with files as inputs, be explicit about input and output files
  • keep your Docker images small
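
For the first tip, the usual approach on Ubuntu is to add your user to the docker group. Here's a small sketch that checks whether the current user is already in it; in_group is a hypothetical helper name, and the usermod command in the message is the standard Linux one rather than anything Dockstore-specific.

```shell
#!/bin/sh
# Succeed if the current user belongs to the named group.
# (in_group is a hypothetical helper name.)
in_group() {
  for g in $(id -Gn); do
    [ "$g" = "$1" ] && return 0
  done
  return 1
}

if in_group docker; then
  echo "docker group membership found; sudo should not be needed"
else
  echo "not in the docker group; 'sudo usermod -aG docker \$USER' followed by a fresh login is the usual fix"
fi
```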

CWL & Dockstore CLI Tips

  • cwltool (which we use to run tools) is restrictive and locks down much of / as read only, use the current working directory or $TMPDIR for file writes
  • the Dockstore CLI uses ./datastore for temp files so if you're processing large files make sure the partition hosting the current directory is large.
  • you need to "collect" output from your tools/workflows inside docker and drop them into the current working directory in order for CWL to "find" them and pull them back outside of the container
  • related to this, it's often times easiest to write a simple wrapper script that maps the command line arguments specified by CWL to however your tool expects to be parameterized. This script can handle moving output to the current working directory and renaming if need be
  • genomics workflows work with large data files, this can have a few ramifications:
    • do not "package" large data reference files in your Docker image. Instead, treat them as "inputs" so they can be staged outside and mounted into the running container
    • the $TMPDIR variable can be used as a scratch space inside your container. Make sure your host running Docker has sufficient scratch space for processing your genomics data.
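
To make the tips about $TMPDIR and collecting outputs concrete, here's a sketch of the pattern a wrapper script might follow: do the work in a scratch directory under $TMPDIR, then copy only the final result into the current working directory. The function name run_in_scratch and the file name result.txt are hypothetical, and the echo line stands in for whatever your real tool does.

```shell
#!/bin/sh
# Sketch of a wrapper body: work in a scratch directory under $TMPDIR, then
# copy only the final result into the current working directory, which is
# where the CWL outputBinding glob looks for it.
# (run_in_scratch and the file name result.txt are hypothetical.)
run_in_scratch() {
  scratch=$(mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX") || return 1
  # Stand-in for the real tool: write an output file into the scratch space.
  echo "report for $1" > "$scratch/result.txt"
  # Collect the output into the current working directory, then clean up.
  cp "$scratch/result.txt" ./result.txt
  rm -rf "$scratch"
}

# Example call from the directory where outputs are expected:
#   run_in_scratch input.bam
```

With this layout, the glob in the CWL outputs section only needs to name the one file left in the working directory.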

Advanced Tips

  • you can use a single Docker image with multiple tools, each of them registered via a different CWL
  • you can use a Git repository with multiple CWL files
  • related to the two above, you can use non-standard file paths if you customize your registrations in the Version tab of Dockstore
  • WDL files, we talked about CWL but WDL works too
  • workflows can be registered in Dockstore but were outside the scope of this tutorial. Since it doesn't involve Docker, this is considerably easier/simpler than what we focused on above

Install Tips

Here's a list of commands I used to install the various dependencies on a fresh Ubuntu 14.04 box:

# directions for a fresh Ubuntu 14.04 VM
# java setup
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

# docker setup, see https://docs.docker.com/engine/installation/linux/ubuntulinux/
sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
sudo vim /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install docker-engine
# make sure you follow directions to add ubuntu to docker group!!
docker run hello-world

# python/pip/cwltools setup
sudo apt-get install python-pip
sudo pip install setuptools==24.0.3
sudo pip install cwl-runner cwltool==1.0.20160316150250 schema-salad==1.7.20160316150109 avro==1.7.7
# WARNING! I had to install this too
sudo pip install typing

# dockstore CLI setup
wget https://github.com/ga4gh/dockstore/releases/download/0.4-beta.4/dockstore
sudo mv `pwd`/dockstore /usr/local/bin/
sudo chmod a+x /usr/local/bin/dockstore
mkdir ~/.dockstore
vim ~/.dockstore/config

# checkout the git repo which has the bamstats example
sudo apt-get install git
mkdir -p gitroot/briandoconnor/
cd gitroot/briandoconnor/

git clone https://github.com/briandoconnor/dockstore-tool-bamstats.git
cd dockstore-tool-bamstats