Running Du Novo interactively from Galaxy | An ABL1 example
This page explains how to perform discovery of low frequency variants from duplex sequencing data. As an example we use the ABL1 dataset published by Schmitt and colleagues (SRA accession SRR1799908).
Calling low frequency variants from next generation sequencing (NGS) data is challenging due to significant amount of noise characteristic of these technologies. Duplex sequencing (DS) was designed to address this problem by increasing sequencing accuracy by over four orders of magnitude. DS uses randomly generated barcodes to uniquely tag each molecule in a sample. The tagged fragments are then PCR amplified prior to the preparation of a sequencing library, creating fragment families characterized by unique combination of barcodes at both 5’ and 3’ ends:
|The logic of duplex sequencing|
|From Schmitt et al. (2012)|
The computational analysis of DS data (Part
C in the figure above) produces two kinds of output:
- Single Strand Consensus Sequences (SSCS; panel
ivin the figure above);
- Duplex Consensus Sequences (DCS; panel
vin the figure above).
The DCSs have the ultimate accuracy, yet the SSCSs can also be very useful when ampliconic DNA is used as an input to a DS experiment. Let us illustrate the utility of SSCSs with the following example. Suppose one is interested in quantifying variants in a virus that has a very low titer in body fluids. Since DS procedure requires a substantial amount of starting DNA (between between 0.2 and 3 micrograms) the virus needs to be enriched. This can be done, for example, with a PCR designed to amplify the entire genome of the virus. Yet the problem is that during the amplification heterologous strands will almost certainly realign to some extent forming hetoroduplex molecules:
|Heteroduplex formation in ampliconic templates|
|Image by Barbara Arbeithuber from Stoler, Arbeithuber et al. (submitted). Here there are two distinct types of viral genomes: carrying
In the image above there are two alleles: green (A) and red (G). After PCR a fraction of molecules are in heteroduplex state. If this PCR-derived DNA is now used as the starting material for a DS experiment, the heteroduplex molecules will manifest themselves as having an
N base at this site (because Du Novo interprets disagreements as
Ns during consensus generation). So, DSCs produced from this dataset will have
N at the polymorphic site. Yet, SSCSs will only have
G. Thus SSCS will give a more accurate estimate of the allele frequency at this site in this particular case. In Du Novo SSCSs are generated when the Output single-strand consensus sequences option of Du Novo: Make consensus reads tool is set to
Yes (see here).
How to use this tutorial
If you are not familiar with Galaxy, here is a good place to start. The entire analysis described here is accessible as a Galaxy history (by clicking on this link you can create your own copy and play with it).
Each history item has a Rerun button:
Clicking this button will show you how this tool was run with all parameters filled in exactly.
This analysis (and consequently the Galaxy's history) can be divided into three parts
- Consensus generation from initial sequencing reads;
- Analysis of Duplex Consensus Sequences (DCS);
- Analysis of Single Strand Consensus Sequences (SSCS):
Start: Generating consensus sequences
The starting point of the analyses are sequencing reads (usually in fastq format) produced from a duplex sequencing library.
Getting data in and assessing quality
We uploaded Schmitt et al. (2015) data directly from SRA as shown in this screencast. This created two datasets in our galaxy history: one for forward reads and one for reverse. We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse) to obtain the following plots:
|A. Forward||B. Reverse|
One can see that these data are of excellent quality and no additional processing is required before we can start the actual analysis.
Generating Duplex Consensus Sequences (DCS)
From tool section NGS: Du Novo we ran:
Make families (
Tag length = 12;
Invariant sequence length = 5)
- Align families (This is the most time consuming step of the workflow. It may take multiple days to run. The ABL1 example took 34 hours and 7 minutes to finish. )
Make consensus reads (
Minimum reads per family = 3;
Minimum base quality = 20;
FASTQ format = Sanger;
Output single-strand consensus sequences = YesThis is particularly important as explained below; also see the following image)
This is the exact image of the Make consensus reads interface:
|Making DCS and SSCS|
Note that Output single-strand consensus sequences is set to
The Du Novo algorithm occasionally inserts
Nand/or IUPAC notations at sites where a definive base cannot be identified according to the major rule consensus. We however do not want such bases when we call variants. The tool Sequence Content Trimmer will help with filtering these out. Here are the parameters we used:
|Sequence Content Trimmer settings|
The previous step filters forward and reverse DCSs and reports them in FASTA format. Yet the downstream tools require fastq format. To address this we convert FASTA into fastq using Combine FASTA and QUAL from tool section NGS: QC and manipulation. In this case the quality values are filled in with the maximum allowed value of 93 (essentially we fake them here), which is fine as we will not rely on quality scores in the rest of the analysis.
|Combine FASTA and QUAL|
Note that here two datasets (#8 and #9) are selected simultaneously because we clicked the multiple datasets button the left of the FASTA File dropdown:
At this point we have trimmed DCSs in fastq format. We can now proceed to calling variants. This involves the following steps:
- Align against reference genome
- Merge results of multiple mappers This step is only useful if one uses multiple mappers (which we do here to show concordance. But this is not strictly necessary.)
- Left aligning indels
- Tabulate the differences
Align against genome with BWA and BWA-MEM
Here we use two mappers for added reliability (this is not necessary in most situations as long as you use the right mapper for input data). To differentiate between results produced by each mapper we assign readgroups (this is done by clicking on Set read groups information dropdown). For example, for BWA-MEM you would set parameters like this:
Note that we are comparing DCSs against human genome version
We then repeat essentially the same with BWA:
Note here we use
Since we have used two mappers - we have two BAM datasets. Yet because we have set readgroups we can now merge them into a single BAM dataset. This is because the individual reads will be labelled with readgroups (you will see how it will help later). To merge we use MergeSamFiles from tool section NGS: Picard:
|Merging BAM datasets|
Left Aligning indels
To normalize the positional distribution of indels we use Left Align utility (
NGS: Variant Analysis) from FreeBayes package. This is necessary to avoid erroneous polymorphisms flanking regions with indels (e.g., in low complexity loci):
|Left aligning indels|
Note here we use
Tabulating the differences
To identify sites containing variants we use Naive Variant Caller (NVC) (tool section NGS: Variant Analysis) which produces a simple count of differences given coverage and base quality per site (remember that our qualities were "faked" during the conversion from FASTA to fastq and cannot be used here). So in the case of ABL1 we set parameters as follow:
|Finding variants with NVC|
The NVC generates a VCF file that can be viewed at genome browsers such as IGV. Yet one rarely finds variants by looking at genome browsers. The next step is to generate a tab-delimited dataset of nucleotide counts using Variant Annotator from tool section NGS: Variant Analysis. We ran it with the following parameters:
|Annotating variable sites|
There are 3,264 lines in the output, which is clearly too much. Using Filter tool (tool section Filter and Sort) with expression
c16 >= 0.01(because column 16 contains minor allele frequency - MAF - and we are interested in those sites where MAF >= 1%):
|Filtering variable sites|
will get that number to only 4 (showing just some of the columns):
|Mapper||Position (chr9)||Major allele||Minor allele||MAF|
We can see that results of both mappers agree very well. The reason we see these numbers grouped by mappers is because we have set the readgroups while mapping.
The polymorphism we are interested in (and the one reported by [Schmitt and colleagues] (http://www.nature.com/nmeth/journal/v12/n5/full/nmeth.3351.html)) is at the position 130,872,141 and has a frequency of 1.3%. The other site (position 130,880,141) is a known common variant rs2227985, which is heterozygous in this sample.
Analysis of single strand consensus data
SSCSs are generated when the Output single-strand consensus sequences option of Du Novo: Make consensus reads tool is set to
Yes (see here). Analysis of SSCS data follows almost exactly the same trajectory. The only difference is that these do not come as forward and reverse. Instead Du Novo generates a single dataset. With this dataset we go through all the same steps:
- Aligning against genome (here the difference is that one needs to choose a single end option and use a single dataset as input)
- Left aligning indels
- Tabulating the differences
Repeating this analysis using workflows
The analysis described above can be rerun using a workflow. Workflow combined all steps into a single entity that only needs to be executed once. We provide two workflows:
- Du Novo analysis from reads (import from here). This workflow uses fastq reads as input. It should be used if you analyze data for first time.
- Du Novo analysis from aligned families (import from here). This workflow starts with aligned families. It should be used for re-analysis of already generated DCS and SSCS data.
|Starting from Reads|
|Starting from DCS/SSCS data|
If things don't work...
...you need to complain. Use Galaxy's BioStar Channel to do this.