
Running WDL workflows locally


Start with our scientific manuscript presenting IDseq’s bioinformatics pipeline for metagenomic classification of high-throughput sequencing datasets.

The pipeline receives FASTQ or FASTA inputs and processes them through numerous tools and reference databases. To enable external reproduction,

  1. reference databases are available for download from Amazon S3
  2. tools are packaged in a Docker image
  3. the analysis steps are wrapped in Workflow Description Language (WDL)

We’ll walk through each of these, first with small reference databases (representing viral sequences only), then the full-scale metagenomic versions.

Install & configure miniwdl and other dependencies

Prepare to run the workflow by setting up miniwdl, a local WDL runner. Follow its Getting Started guide to begin. Briefly, this entails (i) installing miniwdl using pip3 or conda, (ii) installing Docker and configuring it so that your Unix user can control it without sudo, and (iii) trying miniwdl run_self_test. We will also need git available.
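On a fresh Ubuntu machine, the setup might look like the following sketch (pip3 installation assumed; the Getting Started guide remains the authoritative reference):

# install miniwdl for the current user (conda is an alternative)
pip3 install miniwdl

# let the current user control Docker without sudo
# (log out and back in for the group change to take effect)
sudo usermod -aG docker $USER

# verify the installation end-to-end
miniwdl run_self_test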

Recommended: enable miniwdl download caching. Multiple invocations of the workflow will be much more efficient if we activate miniwdl’s download cache feature, so that reference databases are downloaded from S3 on first use only. To activate the cache, set and export environment variables:

export MINIWDL__DOWNLOAD_CACHE__PUT=true
export MINIWDL__DOWNLOAD_CACHE__GET=true
export MINIWDL__DOWNLOAD_CACHE__DIR=/mnt/miniwdl_download_cache

Here the last variable names a suitable local storage location for large cached files. This configuration can also be set in miniwdl's configuration file instead of transient environment variables.
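For example, the equivalent configuration file entries would be (a sketch, assuming the default file location ~/.config/miniwdl.cfg; each MINIWDL__SECTION__KEY environment variable maps to a key in the corresponding [section]):

[download_cache]
put = true
get = true
dir = /mnt/miniwdl_download_cache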

Fetch workflows

Change into some working directory (on the spacious scratch volume, if planning full-scale operations). Clone the chanzuckerberg/idseq-workflows git repository containing the WDL and Docker source files.

git clone https://github.com/chanzuckerberg/idseq-workflows.git

Configure scratch space

Test runs of the workflow require several gigabytes of scratch space. If running on an EC2 instance, you may want to expand your root EBS volume or mount instance storage, as well as relocate the Docker image storage directory. When running miniwdl, use the --dir DIR option to place run directories under DIR; for example, if your instance storage is mounted on /mnt (the Ubuntu default), use miniwdl run --dir /mnt to run there. If you plan to run the idseq-workflows test suite, also point temporary files there with export TMPDIR=/mnt.
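Concretely, a session might include the following (illustrative; the trailing ... stands in for the workflow arguments shown later):

# confirm the scratch volume has enough free space
df -h /mnt

# place miniwdl run directories on the scratch volume
miniwdl run --dir /mnt ...

# point the test suite's temporary files at the scratch volume
export TMPDIR=/mnt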

Pull or build Docker image

We can either pull the Docker image from a GitHub Packages registry, or build it locally from the Dockerfile (which downloads resources from numerous web locations).

To pull the existing image, see the Packages page for idseq-workflows. Browse to the image for the workflow you need to run, and note the current image tag. At the command line, first docker login to the GitHub packages registry using a personal access token, then docker pull the current tag. (You can use any GitHub account to log in, but you must log in.)
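The login-and-pull sequence looks roughly like this (a sketch: the registry host, image path, and tag are placeholders; read the actual values off the Packages page, and give the personal access token at least the read:packages scope):

# authenticate to GitHub Packages with a personal access token
echo "$GITHUB_TOKEN" | docker login docker.pkg.github.com -u YOUR_GITHUB_USERNAME --password-stdin

# pull the tag noted on the Packages page (placeholder shown)
docker pull docker.pkg.github.com/chanzuckerberg/idseq-workflows/idseq-short-read-mngs:<<TAG>>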

Building the image from the Dockerfile takes several minutes, but doesn’t require logging in anywhere. To build it,

docker build -t idseq-consensus-genome idseq-workflows/consensus-genome

or

docker build -t idseq-short-read-mngs idseq-workflows/short-read-mngs

and note the local tag (idseq-consensus-genome or idseq-short-read-mngs).

Test with synthetic input & small databases

First let’s run the workflow on small, synthetic FASTQ reads using small reference databases (containing only viral sequences).

consensus-genome

miniwdl run idseq-workflows/consensus-genome/run.wdl \
    docker_image_id=<<TAG>> \
    technology=ONT sample=test \
    fastqs_0=idseq-workflows/consensus-genome/test/Ct20K.fastq.gz \
    ref_accession_id=MN908947.3 \
    --input idseq-workflows/consensus-genome/test/local_test.yml --verbose

short-read-mngs

miniwdl run idseq-workflows/short-read-mngs/local_driver.wdl \
    docker_image_id=<<TAG>> \
    fastqs_0=idseq-workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R1.fastq.gz \
    fastqs_1=idseq-workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R2.fastq.gz \
    -i idseq-workflows/short-read-mngs/test/local_test_viral.yml --verbose

Breaking this down,

  • docker_image_id= should be set to the docker image tag you noted above (either the GitHub Packages tag, or idseq-short-read-mngs if you built the image locally)
  • short-read-mngs/local_driver.wdl is the top-level WDL for the metagenomics sequencing workflow
  • The pair of FASTQ files are small, synthetic read sets included for benchmarking
  • local_test_viral.yml supplies boilerplate workflow inputs, such as the S3 paths for the viral reference databases

The first attempt will take some time to download the reference databases (about 6 GiB total). Thereafter, if miniwdl download caching is enabled as suggested above, running this or other small samples should take just a few minutes.

When the run completes, miniwdl prints a large JSON structure with all the outputs and output file paths in the created run directory. The miniwdl documentation has more information about the run directory’s organization.
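The same outputs can be re-read later from the run directory; for instance (assuming jq is installed; miniwdl maintains a _LAST symlink to the most recent run directory):

# pretty-print the outputs of the most recent run
jq . _LAST/outputs.json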

The aggregated metrics for every taxon identified by the IDseq short-read-mngs workflow can be found in the output file out/postprocess.refined_taxon_count_out_assembly_refined_taxon_counts_with_dcr_json/refined_taxon_counts_with_dcr.json.

Each result entry from the IDseq sample report is recorded as a single entry in the .json file, as shown below:

{"tax_id": "37124", 
 "tax_level": 1, 
 "genus_taxid": "11019", 
 "family_taxid": "11018", 
 "count": 1394, 
 "nonunique_count": 1394, 
 "unique_count": 1394, 
 "dcr": 1.0, 
 "percent_identity": 96.56100000000093, 
 "alignment_length": 11601.0, 
 "e_value": -307.6526555685972, 
 "count_type": "NT"}

To interrogate read-level taxonomic hits for the NT and NR databases independently, the following two files may be used:

  • out/postprocess.refined_gsnap_out_assembly_gsnap_hitsummary2_tab/gsnap.hitsummary2.tab
  • out/postprocess.refined_rapsearch2_out_assembly_rapsearch2_hitsummary2_tab/rapsearch2.hitsummary2.tab

The hitsummary2.tab format is detailed in the IDseq documentation; its first few lines look like this:

MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000344/1      1       37124   NC_004162.2     37124   11019   11018   NODE_1_length_11589_cov_8.183024        NC_004162.2     37124   11019   11018
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000344/2      1       37124   NC_004162.2     37124   11019   11018   NODE_1_length_11589_cov_8.183024        NC_004162.2     37124   11019   11018
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000838/1      1       37124   NC_004162.2     37124   11019   11018   NODE_1_length_11589_cov_8.183024        NC_004162.2     37124   11019   11018
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000838/2      1       37124   NC_004162.2     37124   11019   11018   NODE_1_length_11589_cov_8.183024        NC_004162.2     37124   11019   11018
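Because these are plain tab-separated files, standard shell tools suffice for quick summaries; for example (assuming the layout shown above, where column 3 holds the species-level taxid):

# tally reads per species-level taxid in the NT hit summary
cut -f3 out/postprocess.refined_gsnap_out_assembly_gsnap_hitsummary2_tab/gsnap.hitsummary2.tab \
    | sort | uniq -c | sort -rn | head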

Full metagenomics run

To run the workflow on the full metagenomics databases used by IDseq, we recommend starting with an Amazon EC2 r5d.24xlarge or i3.8xlarge instance in us-west-2. Such powerful instance types are needed to hold the full-size databases on local disks for fast random access, using up to 3.5 TiB of scratch space during a large run. (To keep this tutorial self-contained, we run everything on one big compute node; in a production system like IDseq, one would distribute the WDL tasks and databases across several fit-for-purpose nodes.)

Launch the instance using an Ubuntu base image and install the packages python3-pip docker.io git-core mdadm. If, as on these EC2 instance types, the available scratch disk space is divided among several NVMe devices, perform steps like the following to stripe them into a RAID0 array, creating one large volume:

NVME_DISKS=(/dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS?????????????????)
mdadm --create /dev/md0 --force --auto=yes --level=0 --chunk=256 \
    --raid-devices=${#NVME_DISKS[@]} ${NVME_DISKS[@]}
mkfs.xfs /dev/md0
mount /dev/md0 /mnt
chown -R ubuntu /mnt
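Optionally, confirm that the RAID0 array is active with all member disks:

cat /proc/mdstat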

Check df -h to verify that /mnt has ≥ 3.5T space. Next, reconfigure Docker so that containers operate on the scratch volume, and so that the default ubuntu user can control it:

echo '{"data-root": "/mnt/docker"}' >> /etc/docker/daemon.json
service restart docker
usermod -aG docker ubuntu

Follow the steps above to set up miniwdl and its download cache. The download cache is practically required for the full metagenomics databases, as some databases are reused by different workflow steps. Since the full runs take some hours, you may also wish to set up byobu and/or mosh to avoid losing work to SSH timeouts.

Change into a working directory under /mnt and, as above, clone idseq-workflows and pull or build the Docker image. Then launch the same small synthetic FASTQ pair against the full databases:

miniwdl run idseq-workflows/short-read-mngs/local_driver.wdl \
    docker_image_id=<<TAG>> \
    fastqs_0=idseq-workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R1.fastq.gz \
    fastqs_1=idseq-workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R2.fastq.gz \
    -i idseq-workflows/short-read-mngs/test/local_test.yml --verbose

Historical note: before being packaged in WDL (2020), the short-read-mngs pipeline was orchestrated within a custom framework, idseq-dag. IDseq is currently in a transition phase in which the high-level pipeline is expressed in WDL while the logic for individual steps largely resides in idseq-dag modules, which the WDL tasks simply invoke. The idseq-dag portions should gradually recede as new and revised steps are implemented directly as WDL tasks.