WDL, dx, and build updates (#732)
- update Docker base image from `docker.io/broadinstitute/viral-baseimage:0.1.5` to `quay.io/broadinstitute/viral-baseimage:0.1.6`
- Update the Dockerfile build to no longer rely on the large easy-deploy script, and instead invoke the simple `conda install` lines directly. Break up the monolithic `conda install` Docker layer into two roughly equal-sized layers, which seems to *sometimes* provide a faster `docker pull` for some clients. (I had previously also attempted to install only a minimal set of conda tools and let the `Tool` self-installer code dynamically install tools as needed, but this appears to no longer work in more complicated scenarios with multiple conda packages that depend on each other, e.g. bmtagger, so this effort was abandoned.)
- move production and dev docker repos to `quay.io`, which provides faster `dx-docker pull` for DNAnexus, has an all-around nicer web UI than DockerHub, and opens future possibilities to play with `rkt`, squashed images, and bittorrent pulls (though I am increasingly skeptical that those will matter).
- move DNAnexus Travis CI project to a new one created by the DNAnexus Science Team, which was the only way to make it a Public DNAnexus project.
- update documentation all around to reflect build process changes
- WDL pipelines: add a `refine_2x_and_plot` task, which is simply an optimization around the refine, refine, plot_coverage tasks at the end of all assembly workflows. The original atomic tasks are still available, but this saves a lot of staging time for routine analyses.
- WDL scaffolding step: implemented @notestaff's recent improvements to `order_and_orient`, which allow the user to provide multiple reference genomes for this stage. Also, the WDL task now extracts more metrics and outputs from this stage that might be useful.
- taxon_filter.py: some further cleanup and optimization around blastn read depletion. Also revert the default behavior of the WDL `deplete_taxa` step to use @tomkinsc's blastn parallelization again.
dpark01 committed Dec 6, 2017
1 parent 38d651e commit 2c420a2
Showing 37 changed files with 706 additions and 484 deletions.
17 changes: 8 additions & 9 deletions .travis.yml
@@ -52,8 +52,8 @@ jobs:
## DOCKER_PASS (for DockerHub)
#- secure: QYylIMLvn1op6d//5yBD7KpquNxK/+xxQxIJLXWFgIl08sdT/MvrI6edgm3k8CumS7735eSV6C+KGOkF9JqM12aGUK/3PPckFGY+h/j4zQX26taT+6221ozbzF6hqYk6qm86FT5QVkBFLxsoDvt0Sh+1FPeVsWJf0o9yrLTrj2E=
# DX_API_TOKEN (for DNAnexus builds)
- secure: WgrzGq3SjH5TG3jDwQxprNoyP7fyfAUZ2O/FqFaIOdr1aihZ45W+sZFfQsd74ny76dSe2I/w+Ly0gAl1WFLgmEBr9yAmxwA1p3mKDRx9NKaxzPrGZaF7hUmmZtBnkTx+hoJCiMjYu3kFVo/RBQ1IgQ1wHLvLb8KkQ6oAYAvd2YY=
- DX_PROJECT=project-F856jv809y3VkzyFGkKqX367
- secure: RENKuendRiAmWxn+lgGF+XXPUHzfRDHGUaP4uR5mpnKOqGKVZX3uadFv/w1POuyxIR7QEt0EcgtbjJYsvEioyQRwsYQe9f1ekQU1k6+cyPS/eDqvwciXA2M+yqjNdAafQu4drN+aNV7qC8ZEU0MZXk3IaItB4g8z5ZfxfM54+m0=
- DX_PROJECT=project-F8PQ6380xf5bK0Qk0YPjB17P
# $BUNDLE_SECRET (for testing GATK on Cromwell-local and DNAnexus)
- secure: KX7DwKRD85S7NgspxevgbulTtV+jHQIiM6NBus2/Ur/P0RMdpt0EQQ2wDq79qGN70bvvkw901N7EjSYd+GWCAM7StXtaxnLRrrZ3XI1gX7KMk8E3QzPf0zualLDs7cuQmL6l6WiElUAEqumLc7WGpLZZLdSPzNqFSg+CBKCmTI8=
install:
@@ -66,15 +66,16 @@ jobs:
- if [ -f "$CACHE_DIR/old-docker-tag.txt" ]; then OLD_DOCKER_TAG=$(cat $CACHE_DIR/old-docker-tag.txt); else OLD_DOCKER_TAG=$DOCKER_REPO_PROD; fi; echo "old docker tag = $OLD_DOCKER_TAG"
- if docker pull $OLD_DOCKER_TAG; then _CACHE_FROM="--cache-from $OLD_DOCKER_TAG"; else _CACHE_FROM=""; fi
# build new docker image
- travis_wait docker build -t local/viral-ngs:build $_CACHE_FROM .;
- docker build -t local/viral-ngs:build $_CACHE_FROM .;
# deploy docker image
- travis/deploy-docker.sh
# version and validate WDL code
- travis/version-wdl-runtimes.sh
- travis/validate-wdl.sh
# build DNAnexus pipelines, but only spend money on master and PR builds
# build DNAnexus pipelines and launch a few test executions
- travis/build-dx.sh
- if [ "$TRAVIS_BRANCH" = "master" ]; then travis/tests-dx.sh; fi
- travis/tests-dx.sh
#- if [ "$TRAVIS_BRANCH" = "master" ]; then travis/tests-dx.sh; fi
# test Cromwell local execution engine
- travis/tests-cromwell.sh
before_cache:
@@ -88,14 +89,12 @@ jobs:
# - secure: hYX8492Wqpq3yqv+eHBV9c6VY8JlUSS8mUDfT1eWNEZF2vw8WrnTt8SrLIPC8etQUk1ZaSI/8XbJQKEr9LRqgvuO07AIv6BxYCg9l9BPHCj6B2YTTPo2qPkPapfvtGVd7PUZcWDUvzvPxJqMveuKVkTnCuuQSWwR68Y/Khxj8UY=
# # DOCKER_PASS
# - secure: QYylIMLvn1op6d//5yBD7KpquNxK/+xxQxIJLXWFgIl08sdT/MvrI6edgm3k8CumS7735eSV6C+KGOkF9JqM12aGUK/3PPckFGY+h/j4zQX26taT+6221ozbzF6hqYk6qm86FT5QVkBFLxsoDvt0Sh+1FPeVsWJf0o9yrLTrj2E=
# # DOCKER_EMAIL (for broadinstitute/viral-ngs on Docker Hub)
# - secure: kKSA73w+i9MfIbLBx7FN85SImGn+Rbhke554q39DSkU0++Q8zyRj7862oOKX3XBlBfeFrZaDzuyk7D1WJvdOF48tQHTJVcoI5qvzFqn4qXjLN3oI2dsGWd154SmUH9uviNUKjNzVpzcdQ7nt+ntrR4FQHLf+gqn7PyCg2p0jD2w=
# install: travis/upgrade-docker.sh
# script:
# - set -e
# - if [ -f "$CACHE_DIR/old-docker-tag.txt" ]; then OLD_DOCKER_TAG=$(cat $CACHE_DIR/old-docker-tag.txt); else OLD_DOCKER_TAG=$DOCKER_REPO_PROD; fi; echo "old docker tag = $OLD_DOCKER_TAG"
# - docker pull $OLD_DOCKER_TAG
# - travis_wait docker build -t local/viral-ngs:build --cache-from $OLD_DOCKER_TAG .
# - if docker pull $OLD_DOCKER_TAG; then _CACHE_FROM="--cache-from $OLD_DOCKER_TAG"; else _CACHE_FROM=""; fi
# - travis_wait docker build -t local/viral-ngs:build $_CACHE_FROM .;
# - travis/deploy-docker.sh
# before_cache:
# - travis/list-docker-tags.sh | tail -1 > $CACHE_DIR/old-docker-tag.txt
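
The `--cache-from` fallback used in the script steps above can be sketched as a small shell function. This is a minimal illustration, not the project's code: `cache_from_arg` and the `PULL_CMD` override are hypothetical names introduced here so the logic can be tried without a Docker daemon.

```shell
#!/bin/bash
# Hypothetical helper (not part of viral-ngs): print the --cache-from
# argument for `docker build` only if the previous image can be pulled.
# PULL_CMD is an assumption added for illustration; it defaults to
# `docker pull` and can be replaced with a stub when experimenting.
cache_from_arg() {
    local tag="$1"
    local pull_cmd="${PULL_CMD:-docker pull}"
    # $pull_cmd is intentionally unquoted so "docker pull" splits into words
    if $pull_cmd "$tag" >/dev/null 2>&1; then
        echo "--cache-from $tag"
    fi
}
```

A build step could then run `docker build -t local/viral-ngs:build $(cache_from_arg "$OLD_DOCKER_TAG") .`, which falls back to an uncached build when the pull fails rather than aborting the job.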
6 changes: 4 additions & 2 deletions DEVELOPMENT_NOTES.md
@@ -26,9 +26,10 @@ A few notes on testing:
#### The Travis build matrix
Each commit on any branch, and any pull request, will trigger a build on Travis CI. Branch commits will test code from a specific commit hash. Pull requests will test the simulated result of merging a branch HEAD onto the target branch HEAD. For each build, the following Travis jobs are launched:
1. Docker & WDL
1. A docker image is built and deployed to DockerHub. Master branch images are pushed to `broadinstitute/viral-ngs:latest` and are also given a versioned tag. Non-master branch images and pull requests are pushed to `broadinstitute/viral-ngs-dev` with versioned tags. The docker build is preceded by a docker pull of `broadinstitute/viral-ngs:latest` in order to utilize layer caching. Note that our tool dependencies result in a very large docker image (2GB compressed). The Dockerfile builds the tool dependencies before incorporating the full viral-ngs source code. This means that most docker image builds will be extremely fast: usually 10-20 seconds. The docker push/deploy is similarly fast, since DockerHub already has most of the layers, and only the new source code layer needs to upload. The docker pull of the 2GB image takes about 5 minutes, so altogether this step takes about 6 minutes on Travis. However, if your code commit alters anything in `requirements-*.txt` or the easy deploy script, it will rebuild the heavy conda install layer, adding another 10 minutes or so to this build. The docker push requires login credentials for a docker registry (e.g. DockerHub or Quay.io), stored as an encrypted Travis variable.
1. A docker image is built and deployed to the Docker registry at quay.io. Master branch images are pushed to `quay.io/broadinstitute/viral-ngs:latest` and are also given a versioned tag. Non-master branch images and pull requests are pushed to `quay.io/broadinstitute/viral-ngs-dev` with versioned tags. The docker build is preceded by a docker pull of the docker image associated with the previous Travis build parental to this commit in order to utilize layer caching. Note that our tool dependencies result in a very large docker image (2GB compressed, this is about 10x the typical size for a docker image). The Dockerfile builds the tool dependencies before incorporating the full viral-ngs source code. This means that most docker image builds will be extremely fast: usually 10-20 seconds. The docker push/deploy is similarly fast, since the Docker registry already has most of the layers, and only the new source code layer needs to upload. The docker pull of the 2GB image takes about 5 minutes, so altogether this step takes about 6 minutes on Travis. However, if your code commit alters anything in `requirements-*.txt` or the easy deploy script, it will rebuild the heavy conda install layer, adding another 10 minutes or so to this build. The docker push requires login credentials for a docker registry (e.g. DockerHub, Quay.io, GCP, AWS), stored as an encrypted Travis variable.
2. After the docker image is deployed, WDL pipeline files are edited to reflect the version tag of the recently pushed docker image. A WDL validator is then run (using wdltool.jar) to ensure that all WDL files are still valid. This completes in seconds.
3. WDL pipelines are compiled to DNAnexus workflows using dxWDL.jar. These are deployed to a DNAnexus CI project using an API token stored as an encrypted Travis variable. This completes in under a minute.
4. A couple DNAnexus workflows are test executed in the CI project.
4. WDL pipelines are executed with test data using Cromwell on the local Travis instance. This is a bit slow (roughly 5 mins for a simple test).
1. Documentation is built automatically. It is not deployed to Read the Docs--this test only exists on Travis in order to bring the developer's attention to any auto build problems. Read the Docs has its own auto build process separate from Travis (see section below) but it does not notify anyone of its build failures. This usually completes in less than 1 minute.
1. The `viral-ngs` conda package is built and deployed to the `broad-viral` channel. This requires anaconda.org credentials stored as an encrypted Travis variable. This takes about 10 minutes.
@@ -37,12 +38,13 @@ Each commit on any branch, and any pull request, will trigger a build on Travis

Some TO DO improvements for the future:
- Separate out all tests of snakemake pipelines from unit and integration into a separate space & Travis job.
- DNAnexus workflows should eventually be launched, however we are awaiting the resolution of https://github.com/dnanexus-rnd/dxWDL/issues/69 in order to do this. The output should also be checked for correctness.
- DNAnexus workflow testing should check output for correctness.
- Cromwell workflow testing should check output for correctness.
- Utilize Travis build stages.
- All of the sub-steps of the first Docker & WDL Travis job should be broken out as separate jobs that wait for the Docker build and deploy.
- Unit tests for Python 3.6, and possibly the conda package build, should occur within the Docker container.
- Second-stage jobs that pull the docker image should utilize quay.io's torrent squashed image pull to reduce the time spent pulling our Docker image (currently about 5 minutes to pull from DockerHub).
- Alternatively, we can explore creating a minimal docker image that installs only the conda pip packages (and perhaps extremely common conda tools like samtools and Picard) and leaves the rest of the conda tools out, letting them dynamically install themselves as needed using our dynamic tool install code.


### Building documentation
28 changes: 16 additions & 12 deletions Dockerfile
@@ -1,6 +1,6 @@
FROM broadinstitute/viral-baseimage:0.1.5
FROM quay.io/broadinstitute/viral-baseimage:0.1.6

LABEL maintainer "Chris Tomkins-Tinch <tomkinsc@broadinstitute.org>"
LABEL maintainer "viral-ngs@broadinstitute.org"

# to build:
# docker build .
@@ -9,23 +9,32 @@ LABEL maintainer "Chris Tomkins-Tinch <tomkinsc@broadinstitute.org>"
# docker run --rm <image_ID> "<command>.py subcommand"
#
# to run interactively:
# docker run --rm -it <image_ID> bash
# docker run --rm -it <image_ID>
#
# to run with GATK and/or Novoalign:
# Download licensed copies of GATK and Novoalign to the host machine (for Linux-64)
# export GATK_PATH=/path/to/gatk/
# export NOVOALIGN_PATH=/path/to/novoalign/
# docker run --rm -v $GATK_PATH:/gatk -v $NOVOALIGN_PATH:/novoalign -v /path/to/dir/on/host:/user-data <image_ID> "<command>.py subcommand"

ENV INSTALL_PATH="/opt/viral-ngs" VIRAL_NGS_PATH="/opt/viral-ngs/source"
ENV \
INSTALL_PATH="/opt/viral-ngs" \
VIRAL_NGS_PATH="/opt/viral-ngs/source" \
MINICONDA_PATH="/opt/miniconda"
ENV \
PATH="$VIRAL_NGS_PATH:$MINICONDA_PATH/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
CONDA_DEFAULT_ENV=$MINICONDA_PATH \
CONDA_PREFIX=$MINICONDA_PATH \
JAVA_HOME=$MINICONDA_PATH

# Prepare viral-ngs user and installation directory
# Set it up so that this slow & heavy build layer is cached
# unless the requirements* files or the install scripts actually change
COPY requirements-conda.txt requirements-conda-tests.txt requirements-py3.txt $VIRAL_NGS_PATH/
COPY docker/install-viral-ngs.sh $VIRAL_NGS_PATH/docker/
COPY easy-deploy-script/easy-deploy-viral-ngs.sh $VIRAL_NGS_PATH/easy-deploy-script/
WORKDIR $INSTALL_PATH
COPY docker/install-viral-ngs.sh $VIRAL_NGS_PATH/docker/
COPY requirements-minimal.txt $VIRAL_NGS_PATH/
RUN $VIRAL_NGS_PATH/docker/install-viral-ngs.sh minimal
COPY requirements-conda.txt requirements-conda-tests.txt requirements-py3.txt $VIRAL_NGS_PATH/
RUN $VIRAL_NGS_PATH/docker/install-viral-ngs.sh

# Copy all of the source code into the repo
@@ -36,11 +45,6 @@ COPY . $VIRAL_NGS_PATH/
# Volume setup: make external tools and data available within the container
VOLUME ["/gatk", "/novoalign", "/user-data"]
ENV \
PATH="$VIRAL_NGS_PATH:/opt/miniconda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
MINICONDA_PATH="/opt/miniconda" \
CONDA_DEFAULT_ENV="/opt/miniconda" \
CONDA_PREFIX="/opt/miniconda" \
JAVA_HOME="/opt/miniconda" \
VIRAL_NGS_DOCKER_DATA_PATH="/user-data" \
NOVOALIGN_PATH="/novoalign" \
GATK_PATH="/gatk"
1 change: 1 addition & 0 deletions README.md
@@ -1,3 +1,4 @@
[![Docker Repository on Quay](https://quay.io/repository/broadinstitute/viral-ngs/status "Docker Repository on Quay")](https://quay.io/repository/broadinstitute/viral-ngs)
[![broad-viral-badge](https://img.shields.io/badge/install%20from-broad--viral-green.svg?style=flat-square)](https://anaconda.org/broad-viral/viral-ngs)
[![Build Status](https://travis-ci.org/broadinstitute/viral-ngs.svg?branch=master)](https://travis-ci.org/broadinstitute/viral-ngs)
[![Coverage Status](https://coveralls.io/repos/broadinstitute/viral-ngs/badge.png)](https://coveralls.io/r/broadinstitute/viral-ngs)
46 changes: 32 additions & 14 deletions docker/install-viral-ngs.sh
@@ -1,26 +1,44 @@
#!/bin/bash
#
# This script requires INSTALL_PATH (typically /opt/viral-ngs)
# and VIRAL_NGS_PATH (typically /opt/viral-ngs/source) to be set.
# This script requires INSTALL_PATH (typically /opt/viral-ngs),
# VIRAL_NGS_PATH (typically /opt/viral-ngs/source), and
# CONDA_DEFAULT_ENV (typically /opt/miniconda) to be set.
#
# A miniconda install must exist at /opt/miniconda
# A miniconda install must exist at $CONDA_DEFAULT_ENV
# and $CONDA_DEFAULT_ENV/bin must be in the PATH
#
# Otherwise, this only requires the existence of the following files:
# easy-deploy-script/easy-deploy-viral-ngs.sh
# requirements-conda.txt
# requirements-conda-tests.txt
# requirements-py3.txt
# requirements-minimal.txt
# requirements-conda.txt
# requirements-conda-tests.txt
# requirements-py3.txt

set -e -o pipefail

export VIRAL_CONDA_ENV_PATH=/opt/miniconda

mkdir -p $INSTALL_PATH/viral-ngs-etc
mkdir -p $VIRAL_NGS_PATH/.git # this is needed to make the setup script know we have/will have a git checkout
ln -s $VIRAL_NGS_PATH $INSTALL_PATH/viral-ngs-etc/viral-ngs
ln -s $VIRAL_CONDA_ENV_PATH $INSTALL_PATH/viral-ngs-etc/conda-env
ln $VIRAL_NGS_PATH/easy-deploy-script/easy-deploy-viral-ngs.sh $INSTALL_PATH
if [ ! -f $INSTALL_PATH/viral-ngs-etc/viral-ngs ]; then
ln -s $VIRAL_NGS_PATH $INSTALL_PATH/viral-ngs-etc/viral-ngs
fi
if [ ! -f $INSTALL_PATH/viral-ngs-etc/conda-env ]; then
ln -s $CONDA_DEFAULT_ENV $INSTALL_PATH/viral-ngs-etc/conda-env
fi

# setup/install viral-ngs directory tree and conda dependencies
sync
$INSTALL_PATH/easy-deploy-viral-ngs.sh setup-git-local

# manually install it ourselves instead of using easy-deploy
if [[ "$1" == "minimal" ]]; then
# a more minimal set of tools (smaller docker image?)
conda install --override-channels -y \
-q -c broad-viral -c bioconda -c conda-forge -c defaults -c r \
--file "$VIRAL_NGS_PATH/requirements-minimal.txt"
else
conda install --override-channels -y \
-q -c broad-viral -c bioconda -c conda-forge -c defaults -c r \
--file "$VIRAL_NGS_PATH/requirements-py3.txt" \
--file "$VIRAL_NGS_PATH/requirements-conda.txt" \
--file "$VIRAL_NGS_PATH/requirements-conda-tests.txt"
fi

# clean up
conda clean -y --all
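
One caveat about the `[ ! -f ... ]` guards in this script: `-f` is true only for regular files, so it stays false for a symlink that points at a directory, and the `ln -s` would still run on a second invocation. A possible alternative, sketched here with a hypothetical helper named `link_once` (not part of the commit), tests for the link itself:

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): create a symlink only if
# nothing, not even a dangling link, already exists at the link path.
link_once() {
    local target="$1" link="$2"
    # -L detects the symlink itself; -e detects any other existing file
    if [ ! -L "$link" ] && [ ! -e "$link" ]; then
        ln -s "$target" "$link"
    fi
}
```

Under this assumption, the two guarded blocks would reduce to calls such as `link_once "$VIRAL_NGS_PATH" "$INSTALL_PATH/viral-ngs-etc/viral-ngs"`, and calling the script twice (once with `minimal`, once without) would leave a single link in place.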
2 changes: 1 addition & 1 deletion docker/mem_in_gb_90.sh
@@ -1,3 +1,3 @@
#!/bin/bash

head -n1 /proc/meminfo | awk '{print int($2*0.9/1024)}'
head -n1 /proc/meminfo | awk '{print int($2*0.9/1024/1024)}'
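
The unit change above follows from `/proc/meminfo` reporting `MemTotal` in kB: one division by 1024 yields MB, two yield GB. Below is a small sketch of the same arithmetic, wrapped in a hypothetical function `mem_in_gb_90` that accepts an alternate meminfo file (an assumption added here for illustration) so it can be checked against a known value.

```shell
#!/bin/bash
# Hypothetical wrapper around the one-liner in docker/mem_in_gb_90.sh.
# The optional file argument is an assumption for illustration only.
mem_in_gb_90() {
    local meminfo="${1:-/proc/meminfo}"
    # first line looks like "MemTotal:  16777216 kB"; $2 is the kB value
    head -n1 "$meminfo" | awk '{print int($2*0.9/1024/1024)}'
}
```

For a machine reporting `MemTotal: 16777216 kB` (16 GiB), this prints 14, whereas the previous single division would print 14745, a value in MB rather than GB.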
17 changes: 3 additions & 14 deletions docs/install.rst
@@ -8,10 +8,7 @@ Cloud compute implementations
Docker Images
~~~~~~~~~~~~~

To facilitate cloud compute deployments, we have published a complete Docker
image with associated dependencies at
`DockerHub <https://hub.docker.com/r/broadinstitute/viral-ngs/>`_.
Simply ``docker pull broadinstitute/viral-ngs``.
To facilitate cloud compute deployments, we publish a complete Docker image with associated dependencies to the Docker registry at `quay.io <https://quay.io/repository/broadinstitute/viral-ngs>`_. Simply ``docker pull quay.io/broadinstitute/viral-ngs`` for the latest stable version.


DNAnexus
@@ -31,12 +28,12 @@ Google Cloud Platform: dsub
All of the command line functions in viral-ngs are accessible from the docker image_ and can be invoked directly using dsub_.

.. _dsub: https://cloud.google.com/genomics/v1alpha2/dsub
.. _image: https://hub.docker.com/r/broadinstitute/viral-ngs/
.. _image: https://quay.io/repository/broadinstitute/viral-ngs

Here is an example invocation of ``illumina.py illumina_demux`` (replace the project with your GCP project, and the input, output-recursive, and logging parameters with URIs within your GCS buckets)::

dsub --project broad-sabeti-lab --zones "us-east1-*" \
--image broadinstitute/viral-ngs \
--image quay.io/broadinstitute/viral-ngs \
--name illumina_demux-test \
--logging gs://sabeti-temp-30d/dpark/test-demux/logs \
--input FC_TGZ=gs://sabeti-sequencing/flowcells/broad-walkup/160907_M04004_0066_000000000-AJH8U.tar.gz \
@@ -223,14 +220,6 @@ For more information, see the following AWS pages:

Note that the EC2 instance created by the easy-deploy script is currently configured to be an m4.2xlarge, which costs ~$0.55/hour to run. It is suggested that the instance be terminated via the AWS web console once processing with viral-ngs is complete. See the `AWS page for current pricing <https://aws.amazon.com/ec2/pricing/>`_ .

Limitations
~~~~~~~~~~~

As viral-ngs does not currently build a depletion database for BMTagger or BLAST automatically,
it is the responsibility of the user to create a depletion database for use within the virtualized
viral-ngs environment. It can be created within the virtual machine (VM), or uploaded
after the fact via ``rsync``.

Running Easy Deploy
~~~~~~~~~~~~~~~~~~~
