This repository has been archived by the owner on Nov 17, 2021. It is now read-only.

1. Assembly of viral genomes

Chris Tomkins-Tinch edited this page Sep 9, 2019 · 2 revisions

Assembly of viral genomes of diverse species

The workflow is pre-configured to assemble the EBOV genome. To assemble a different viral species, you can still use the viral-ngs pipeline described in this document, but you will need one additional step to fetch the reference genomes for the species of interest. Follow this tutorial up to the "Run the workflow and monitor its progress" step, and then refer to the documentation of the utility applet written for this purpose.

Step-by-step: assembling an EBOV genome

Create a DNAnexus project

First, create a DNAnexus platform account using the signup form, which will take you through the usual flow of creating a username/password and confirming your e-mail address. Your new account will have free credits sufficient to process numerous samples, and DNAnexus may sponsor further usage for EBOV research upon request.

Log in and create a new project with a name of your choice. In this example, we've named the project My EBOV assemblies.

Upload sequencing data

You can upload paired-end reads in either of two formats: a pair of FASTQ files or an "unmapped BAM" file. FASTQ files may be gzipped, which makes them faster to upload, and their filenames must end in .fastq or .fastq.gz.
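Since the workflow is strict about FASTQ filename extensions, it can help to check names before uploading. The following is a minimal sketch; `is_fastq_name` is a hypothetical helper written for this example, not part of viral-ngs or the DNAnexus tools.

```shell
# Return success if the filename ends in .fastq or .fastq.gz,
# the two extensions this tutorial says are accepted for FASTQ input.
# (is_fastq_name is a hypothetical helper for illustration only.)
is_fastq_name() {
  case "$1" in
    *.fastq|*.fastq.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the two example files used in this tutorial.
for f in SRR1553554_1.fastq.gz SRR1553554_2.fastq.gz; do
  if is_fastq_name "$f"; then
    echo "ok: $f"
  else
    echo "rename before upload: $f" >&2
  fi
done
```

Only the name is checked here; the sketch does not inspect file contents.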

In this example, we'll use one of the samples from Gire et al. (2014), exported from SRA. Here are links to these files: SRR1553554_1.fastq.gz SRR1553554_2.fastq.gz

Click the Add Data button in your new project, then drag or choose the files to upload.

Press Add Data again, and the transfer will begin.

With DNAnexus, your data is transferred and stored with clinical-grade security controls. Other DNAnexus user accounts cannot access your data unless you share your project or make it public. DNAnexus also does not access your file contents without your permission, barring exigent circumstances.

Upload Novocraft and GATK software packages

The assembly workflow requires Novoalign and GATK, which aren't bundled due to their licensing restrictions. Instead, you'll need to drop in tarballs for these software packages. First, please ensure you're entitled to use them without commercial licenses (as in most not-for-profit projects), or else that you have the necessary licenses. Then, download the tarballs from the following websites, and upload them to your project:

| Package | Version | Tarball file | MD5 | Website |
|---|---|---|---|---|
| Novocraft | Programs V3.02.08 X86-64 Linux 3.0 Kernel | novocraftV3.02.08.Linux3.0.tar.gz | 05810e0da23340300482eba2e47bc45e | Link |
| GATK | v3.3-0-g37228af | GenomeAnalysisTK-3.3-0.tar.bz2 | e3d9d6e87825078d1a574c5bb469a1b4 | Link |

The workflow has been validated using these exact versions, but other recent versions will probably work too.
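Before uploading, you may want to confirm that the downloaded tarballs match the MD5 values listed above. The sketch below assumes GNU `md5sum` is available (on macOS, `md5 -q` is the usual equivalent) and that the tarballs are in the current directory; `verify_md5` is a hypothetical helper name.

```shell
# Compare a file's MD5 digest with an expected value.
# Assumes GNU coreutils md5sum is on PATH; verify_md5 is a hypothetical helper.
verify_md5() {
  actual=$(md5sum "$1" | cut -d ' ' -f 1)
  [ "$actual" = "$2" ]
}

# Expected digests copied from the table above.
# Files that are not present in the current directory are skipped.
while read -r expected file; do
  if [ ! -f "$file" ]; then
    echo "not found, skipping: $file" >&2
  elif verify_md5 "$file" "$expected"; then
    echo "OK: $file"
  else
    echo "MD5 mismatch: $file" >&2
  fi
done <<'EOF'
05810e0da23340300482eba2e47bc45e novocraftV3.02.08.Linux3.0.tar.gz
e3d9d6e87825078d1a574c5bb469a1b4 GenomeAnalysisTK-3.3-0.tar.bz2
EOF
```

If you use a different version of either package, the digests will differ, and a mismatch alone does not mean the download is corrupt.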

Copy the assembly workflow into your project

The assembly workflow has been published in a public DNAnexus project, Broad Inst Viral NGS. We'll make a copy of it in your new project to prepare it for use. The latest version of the viral-ngs workflows can be found here (note that they require a newer version of GATK).

Open the Broad Inst Viral NGS project, and find the viral-ngs-assembly workflow object in the assembly folder. Select it, and press the Copy button.

Navigate to your project and press Copy into this folder.

Run the workflow and monitor its progress

If you're working with a virus other than EBOV, refer to the documentation of the utility tool to fetch the reference genome files needed by the filter and scaffold stages of the pipeline.

Back in your project, click on the workflow you just copied in, opening the Run Analysis dialog, in which we'll supply the required inputs and then launch the workflow.

If you uploaded a pair of FASTQ files, click on the file input to the deplete stage, then select the first FASTQ file. Then, click on the paired_fastq input just below, and select the second FASTQ file.

If you uploaded an unmapped BAM file, click on the file input to the deplete stage, then select the BAM file. Leave the paired_fastq input empty.
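The same two cases apply if you launch the workflow from the command line (see "Command-line scripting" below). As a rough sketch, the helper below prints the `-i` arguments for each case, using the `file` and `paired_fastq` input names from this tutorial. `workflow_input_args` and `sample.bam` are hypothetical names used for illustration, and the sketch assumes an unmapped BAM has a .bam extension.

```shell
# Print the -i arguments for `dx run`, mirroring the two cases above:
# a FASTQ pair fills both "file" and "paired_fastq";
# an unmapped BAM fills only "file".
# (workflow_input_args is a hypothetical helper, not part of the dx toolkit.)
workflow_input_args() {
  case "$1" in
    *.bam)
      echo "-i file=$1" ;;
    *.fastq|*.fastq.gz)
      if [ -z "${2:-}" ]; then
        echo "a FASTQ input needs its mate as the second argument" >&2
        return 1
      fi
      echo "-i file=$1 -i paired_fastq=$2" ;;
    *)
      echo "unrecognized input type: $1" >&2
      return 1 ;;
  esac
}

# FASTQ pair from this tutorial, then an assumed unmapped BAM filename.
workflow_input_args SRR1553554_1.fastq.gz SRR1553554_2.fastq.gz
workflow_input_args sample.bam
```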

Next, scroll down to the workflow's scaffold stage, and supply the Novocraft and GATK tarballs you uploaded.

If you'd like, you can specify a sample name/ID to be used in the output filenames and headers. To set this, click on the gear icon of the deplete stage, fill in the sample_name field, and click Save. If you don't set a sample name, the workflow will derive something from the input filename. You may also wish to set the analysis name, in the upper left of the Run Analysis dialog, which can help in distinguishing different analyses running concurrently.

The workflow is now ready. Click the Run Analysis button, which will then take you to the Monitor project view, where you can watch its progress.

You'll also receive an e-mail notification of analysis completion, potentially after several hours.

Collect results

Upon completion, the workflow will output:

  • <sample_name>.fasta the assembly
  • <sample_name>.mapped.bam the reads remaining after human depletion, mapped back to the assembly, excluding any reads not mapping to the assembly
  • <sample_name>.all.bam the reads remaining after human depletion, mapped back to the assembly, with unmapped reads also included

Additionally, you'll find a new intermediates folder containing various by-products of the workflow stages. For example, the <sample_name>.cleaned.bam file is an unmapped BAM file containing the reads remaining after human depletion.
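If you prefer to fetch the results with the DNAnexus command-line interface, a sketch along these lines may help. It only prints the `dx download` commands (a dry run). The sample name `SRR1553554` is an assumption, `expected_outputs` is a hypothetical helper, and the sketch assumes the output files sit in your current project folder.

```shell
# List the three top-level output filenames for a given sample name,
# following the naming pattern described above.
# (expected_outputs is a hypothetical helper for illustration.)
expected_outputs() {
  for suffix in fasta mapped.bam all.bam; do
    echo "$1.$suffix"
  done
}

# Dry run: print one `dx download` command per expected output.
# "SRR1553554" is an assumed sample name; remove the leading `echo`
# to actually download once you are logged in with the dx toolkit.
expected_outputs SRR1553554 | while read -r name; do
  echo dx download "$name"
done
```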

From this point, you can view or download your results. For example, select the FASTA file and click Open in New Tab to view its contents, and perhaps take the sequence over to BLAT at the UCSC Ebola Portal.

Congratulations! You can now reuse the workflow object you copied into your project to run additional samples you upload, including running multiple samples in parallel if needed. Setting the sample name, analysis name, and output folder when you launch will help to keep your project tidy.

From time to time, updated versions of the workflow may be published in the Broad Inst Viral NGS public project. The version saved in your project is not automatically updated (ensuring reproducibility), so you'll need to copy the new version over to use it.

Sharing with others

You can share your project with another DNAnexus user by clicking the blue Share button. Enter their username or e-mail address and choose an appropriate permissions level. To share only some of your data - perhaps only the final products, for example - create a second project, copy the desired data into it, and share that project.

Another option is to make a project "public", meaning that any DNAnexus user can discover and view it. This option is found in the project settings view (the gear icon in the project toolbar). Do not make a project public if it contains proprietary software packages, such as the Novocraft and GATK tarballs.

Troubleshooting failures

Given properly formatted inputs, the workflow has two important failure modes:

  • The filter stage, which extracts EBOV reads from the cleaned metagenomic dataset based on a database of known EBOV genomes, fails if too few such reads/bases are found.
  • The scaffold stage, which builds the initial contiguous assembly of the EBOV genome, fails if the assembly doesn't meet certain quality thresholds. An example of this error message in the Monitor view is shown below.

Both of these errors indicate that the input reads contain too little EBOV data to proceed with the assembly. When the workflow fails in a certain stage, the results of the previous stages are still output to the project. These, and the logs (standard output and standard error) of each job, may provide additional useful information.

If you encounter other types of internal errors, please contact us and/or use the Send Failure Report option.

Advanced topics

Saving workflow modifications

You can modify the workflow's configuration and save the changes so that you don't have to repeat them each time you run a sample. For example, you could pre-set the Novocraft and GATK tarballs, so that you don't have to fill them in each time, or set a default output folder, to help keep your project tidy. To do this, select the workflow object and press Edit. This will take you to a workflow editor view that looks similar to the Run Analysis dialog, but the changes you make will be saved to the workflow and reflected each time you run a new analysis using that workflow.

You can always find an unmodified or up-to-date version of the workflow in the Broad Inst Viral NGS public project.

Skipping human read depletion

The first stage of the workflow depletes the input dataset of reads matching the human genome/transcriptome, so that the remaining "cleaned" data may be shared or published with reduced privacy risks. This tends to be the most time-consuming step, and you have the option to skip it if it's not needed for your purposes. To do so, flip the skip_depletion setting to True in the configuration of the deplete stage (accessed through its gear icon in the Run Analysis dialog).

Standalone human read depletion

You can also run the human read depletion stage independently of the remainder of the workflow. Copy the viral-ngs-human-depletion applet from the utilities folder of Broad Inst Viral NGS into your project. Then, click on this applet in your project to run it, and supply the FASTQs or unmapped BAM input as you did to the workflow. You can leave the applet's other inputs blank, and run it. The .cleaned.bam output file is an unmapped BAM containing the reads remaining after depletion.

Command-line scripting

You can upload data, run the workflow, or automate any other operation using the DNAnexus command-line interface. Once you have installed it and logged in to your project, you can upload data from your workstation using e.g. dx upload reads1.fastq.gz, and launch the workflow on data you've uploaded:

dx run "viral-ngs-assembly (Copy: Feb 8th 2015 5:26pm)" \
   -i file=reads1.fastq.gz -i paired_fastq=reads2.fastq.gz \
   -i novocraft_tarball=novocraftV3.02.08.Linux3.0.tar.gz \
   -i gatk_tarball=GenomeAnalysisTK-3.3-0.tar.bz2

Of course, substitute the exact name of your copy of the workflow and input filenames. (You could change the workflow object's name to make it easier to enter here.)

The command-line interface can be scripted to run large numbers of samples, or to automate data upload and analysis from your LIMS.
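As one possible starting point for such a script, the sketch below prints a launch command for every FASTQ pair in the current directory (a dry run). It assumes the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming of the SRA export used earlier, that the files were uploaded under the same names, and that your copy of the workflow has been renamed to `viral-ngs-assembly`. `mate_of` and `sample_of` are hypothetical helpers.

```shell
# Derive the second-mate filename from the first, assuming the
# <sample>_1.fastq.gz / <sample>_2.fastq.gz naming convention.
mate_of() {
  echo "${1%_1.fastq.gz}_2.fastq.gz"
}

# Sample name = first-mate filename without the _1.fastq.gz suffix.
sample_of() {
  echo "${1%_1.fastq.gz}"
}

WORKFLOW="viral-ngs-assembly"   # assumed name of your renamed workflow copy

# Dry run: print one launch command per FASTQ pair in the current directory.
# Remove the leading `echo` to actually launch; -y skips the confirmation
# prompt and --name sets the analysis name.
for r1 in *_1.fastq.gz; do
  [ -e "$r1" ] || continue      # the glob matched nothing
  echo dx run "\"$WORKFLOW\"" \
    -i "file=$r1" -i "paired_fastq=$(mate_of "$r1")" \
    -i novocraft_tarball=novocraftV3.02.08.Linux3.0.tar.gz \
    -i gatk_tarball=GenomeAnalysisTK-3.3-0.tar.bz2 \
    --name "$(sample_of "$r1")" -y
done
```

Printing the commands first makes it easy to review them before anything is launched and billed to your project.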

Combining multiple sequencing runs

If you have multiple pairs of FASTQ files from different sequencing runs of the same sample, a utility is available to combine them into an unmapped BAM file, which you can then use as the assembly workflow input.