Changing workflows around calling SNPs and indels
We are streamlining our recommended workflows by removing a preprocessing step.
As announced in the GATK v3.6 highlights, variant calling workflows that use HaplotypeCaller or MuTect2 now omit indel realignment. This change does not apply to workflows that call variants with UnifiedGenotyper or the original MuTect. We still recommend indel realignment for these legacy workflows. Recall that indel realignment uses RealignerTargetCreator and IndelRealigner and comes after duplicate marking and before base quality score recalibration (BQSR).
In light of these changes, let’s take a brisk stroll through the implications for variant detection. In particular, let’s focus on insertion and deletion events (indels).
Why am I blogging about this now?
I recently studied IndelRealigner and HaplotypeCaller, one after the other, to update a tutorial and to give workshop talks in the UK. This juxtaposition has brought into focus how our capabilities are changing with (i) improving sequencing technology and (ii) improving algorithmic methods. For a comprehensive overview of the state of different sequencing technologies, see the 2016 Goodwin et al. Nature Reviews article. The points in this blogpost pertain in particular to short read data from sequencing by synthesis. Longer short reads, deeper coverage and increased library complexity with PCR-free preparations give us higher quality sequence data. The graph in this external blog shows short read lengths varying from 25 bases in 2006 to 250 bases in 2014. Reads longer than 70 basepairs enable the mapping optimizations BWA-MEM offers, i.e. maximal exact matches. Adding to this, sophisticated haplotype-based variant detectors, such as HaplotypeCaller and FreeBayes, improve our ability to resolve variants. In particular, they are enabling better detection of indels.
Obviously, the added benefit from greater read lengths is more than just the proportional increase in bases. This external article recaps how coverage alone cannot compensate for the shortcomings of short read lengths in closing gaps. Longer reads reduce mapping ambiguity for regions with interspersed repetitive sequences, for some paralogous regions and for gapped alignments. In the case of gapped alignment, the aligner may split the read into multiple alignment records for distant or discordant events, resolve the split with an indel for local events and clip segments that it cannot map. All this splitting and clipping carries some burden of aligner artifact that confounds variant detection. For example, in the case of an insertion in the middle of a read, the inserted sequence distracts from global then local mapping.
In this way, short read mapping limits detection of variants that are dependent on gapped alignment. Towards overcoming these limitations, we added new algorithmic solutions to our workflows --local realignment in 2009 and local assembly in 2012. Illustration A describes a locus with a 20 basepair insertion that we cannot call with just aligner mapped reads. Realigning reads using IndelRealigner or assembling reads using HaplotypeCaller allows us to call the insertion. That indel realignment has been a part of pre-processing workflows for seven years and will continue to be a part of workflows still dependent on locus-based callers is a testament to the improvements it brings. And if you feel apprehensive about omitting it from your HaplotypeCaller and MuTect2 workflows, we empathize. These changes are about improving efficiency in the face of incremental returns. If you find substantial changes, then I encourage you to share details with us.
Let’s consider another illustration that showcases HaplotypeCaller's more robust capability. Illustration B shows how IndelRealigner is unable to realign reads around a 30 basepair insertion, despite inclusion of the region as a target and the presence of a read with the insertion. HaplotypeCaller reassembles and calls this insertion with or without indel realignment. In point of fact, for the example data, the two HaplotypeCaller workflows call identical genotypes for all sites. However, Illustration A points to some subtle differences in the variant annotation metrics. I will talk about the origins of these and their implications.
For those interested in an anecdote from MuTect2, which uses HaplotypeCaller’s approach but without ploidy assumptions to accommodate tumor data, see our MuTect2 workshop presentation. Slides 14–15 outline a case where MuTect2 resolves a 120 basepair deletion haplotype for a tumor.
Did indel realignment still have some value-add?
If indel realignment is redundant to HaplotypeCaller’s functions, why were we recommending both until now? Well, although HaplotypeCaller throws out realignment’s local placement of reads, indel realignment still influences outcomes. I’m told these influences should be subtle for high quality data but make a difference for lower quality data. Let’s talk about the influences conceptually and examine some initial differences for our example data snippet. Breaking down the effects into two categories gives us: (1) the effects of increasing MAPQ by ten for realigned reads and (2) the effects of the physical realignment.
(1) The effects of increasing MAPQ by 10 for realigned reads
(1a) Increases in MAPQ add previously ignored reads to the pool of primary alignment records BaseRecalibrator (BQSR) and HaplotypeCaller consider. BQSR gains previously MAPQ0 reads now MAPQ10 and HaplotypeCaller gains previously MAPQ10–19 reads now over the default MAPQ20 threshold of the HCMappingQualityFilter. (1b) HaplotypeCaller considers all indel realignment target regions, and except for regions that HaplotypeCaller cannot resolve cycles for in its graph assembly, e.g. a tandem repeat region, HaplotypeCaller acts on these regions. Indel realigned reads that cross above HaplotypeCaller's MAPQ threshold alter the allelic and total depth counts for a variant site and can affect genotyping and genotype probability and quality scores as well as other annotation metrics. If the added read(s) are informative towards haplotypes in a region with few informative reads, although they do not impact genotyping any more than uninformative reads, they may have a disproportionate impact on metrics that use exclusively informative reads, e.g. sample-level AD & DP (allelic depth and total depth), QD (QUAL by depth where QUAL is the phred-scaled score for the assertion made in ALT) and FS & SOR (Fisher strand bias and strand odds ratio). (1c) Variant sites that include these increased MAPQ reads directly increase MQ and MQRankSum annotation values (root mean square of mapping quality and ranksum test of ALT vs. REF read mapping qualities). (1d) Finally, these altered annotation metrics (QD, FS, SOR, MQ & MQRankSum for the snippet) impact cohort level annotation metrics, the variant quality score recalibration (VQSR) model and the resulting VQSLOD score for the site. These changes to the distribution of the variant scores impact variant site filtering.
For our data snippet, indel realignment adds zero reads to either pool. The lowest MAPQ of reads indel realignment then increases is at MAPQ26 and in fact over 94% of increases are for MAPQ60 reads. The 194 (~0.07% of total primary) indel realigned reads increase cumulative MAPQ by 0.012% and the impact on MQ for the 20 basepair insertion site in Illustration A is small (MQ 59.39 vs. 62.81). This site has plenty of informative reads. Interestingly, indel realignment adds four more reads that HaplotypeCaller considers for this locus and adds a total of five additional reads for two loci for our snippet's entire one megabasepair region. Of the four additional reads, only one is realigned. This brings us to the next section's topic.
(2) The effects of the physical realignment of reads
Indel realignment shifts read positions and this impacts the outcomes of both BQSR and HaplotypeCaller. (2a) Shifted reads and their mismatching positions around indels will change the bases BQSR modeling uses and potentially recalibrates. Indel realignment generally reduces false positive mismatches around indels. If these mismatches contribute to a covariate pattern, and indel realignment removes these mismatches, then indel realignment removes the support towards these patterns and in turn enables BQSR to calibrate using cleaner models. These contributing false mismatches avoid being unfairly penalized. However, BQSR looks for fairly obvious failures stemming from sequencer flowcell technology such as those this external blog describes neatly. Any discernible patterns from misaligned reads around indels will be small compared to that of flowcell failures. This is one area that could use more data on the extent of the impact of removing indel realignment. What I am told is that effects will focus more on SNPs than indels in that the new workflow lowers the confidence in mainly what will become reference matches after HaplotypeCaller realignment. For these instances, at the worst, you may end up calling a false heterozygous SNP instead of homozygous reference.
Additionally, changes to read positions for a locus impact (2b) the genomic window HaplotypeCaller considers for graph assembly and realignment, because the entropy distribution changes, and (2c) the reads the window includes. For Ilustration A's locus, indel realignment tightens the active region size from 208 to 152 basepairs and also adds four additional reads for consideration. Such occurences alter site annotation values and can influence downstream variant calling.
For our data snippet (again, high quality 30X coverage PCR-free 2x150 WGS), the effects of removing indel realignment are subtle. However, for data that deviate towards lower quality and for data that benefit greatly from BQSR, the effects could be more substantial. Given this, we ask researchers who are considering our new workflows to take into account properties of their data and consider the impact omitting indel realignment may have for their analyses. A first step towards this could be to examine the MAPQ distribution of the primary alignments.
Let's wrap up with some wider picture considerations.
I've covered some ground on the influences of removing indel realignment from a workflow. One side note is that called sites from different variant callers can have different allele variants. For example, IndelRealigner and HaplotypeCaller can differ in how they resolve the variation around a given locus. This raises the question to what extent differential representation, based on different underlying algorithmic solutions, contribute to the discordant calls observed between different workflows. If the extent is great, then, given the potential for differential phenotypic classification, the clinical impact can be significant. A recent article describes how the burgeoning movement to use sequencing in patient diagnostics and disease research is hampered by discordant classifications. By the same reasoning, then the ability to physically phase proximal variants, that determines their cis or trans relationship, has direct impact on phenotypic classification. For example, resolving phasing using tools such as ReadBackedPhasing can tell us whether a variant that shifts the coding frame is expressed on the same molecule as another variant that restores the coding frame.
You should care about resolving indels as best you can even if your analyses focus only on SNPs. Correctly resolving indels reduces the false positive mismatches from alignment that accrue in the vicinity of indels. Also, we care about resolving larger indels for the biological insights they will bring. One example of the biological significance of indels lies in the transposable elements that litter our genomes that in transposing create insertion and deletion events. These elements are currently considered structural given their large size. They are a part of our individual variation and have been proposed to drive speciation and evolution and are implicated in disease including cancer (see endogenous retrovirus). The RepeatMasker web resource catalogues our repetitive elements, to which transposable elements belong, that together comprise over half our human genome.
Finally, technological capabilities are changing the definition of an indel. Back yonder, detection of indels was limited to small indels of a couple to fewer than ten basepairs. GATK workflows assigned resolving anything larger to the realm of structural variant detection. Today, lengthening sequencing reads together with improving global and local alignment algorithms are revealing indels reaching tens to low hundreds of basepairs in our workflows.
Both IndelRealigner and HaplotypeCaller allow for better indel detection over UnifiedGenotyper alone.
Here I consider a 20 basepair insertion, at 10:96,424,673, for a 2x150 PCR-free human WGS library with deep coverage (~30x average) shown in Tutorial#7156's second IGV screenshot. The region’s sequence is of low complexity and the insertion site also carries a SNP in trans. To capture IndelRealigner’s (IR) capabilities, I used UnifiedGenotyper (UG) as the variant caller, as its genotyping engine is equivalent to that of HaplotypeCaller’s (HC). Below are the calls for the site for the different workflows.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 10 96424673 . G C 824.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.767;DP=44;Dels=0.00;ExcessHet=3.0103;FS=1.268;HaplotypeScore=13.7364;MLEAC=1;MLEAF=0.500;MQ=59.06;MQ0=0;MQRankSum=1.770;QD=18.74;ReadPosRankSum=-0.988;SOR=0.480 GT:AD:DP:GQ:PL 0/1:17,27:44:99:853,0,512
10 96424673 . G C 630.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.259;DP=40;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=18.8916;MLEAC=1;MLEAF=0.500;MQ=62.57;MQ0=0;MQRankSum=-3.725;QD=15.77;ReadPosRankSum=0.234;SOR=0.836 GT:AD:DP:GQ:PL 0/1:18,22:40:99:659,0,575 10 96424673 . G GTTTGTTTGTTTCTTTCTTTC 2415.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.390;DP=40;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=62.57;MQ0=0;MQRankSum=3.552;QD=10.35;ReadPosRankSum=-2.066;SOR=0.770 GT:AD:DP:GQ:PL 0/1:17,18:40:99:2453,0,2077
10 96424673 . G C,GTTTGTTTGTTTCTTTCTTTC 900.73 . AC=1,1;AF=0.500,0.500;AN=2;DP=40;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=59.39;QD=33.49;SOR=2.712 GT:AD:DP:GQ:PL 1/2:0,12,8:20:99:938,323,569,472,0,742
10 96424673 . G C,GTTTGTTTGTTTCTTTCTTTC 1028.73 . AC=1,1;AF=0.500,0.500;AN=2;DP=44;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=62.81;QD=33.49;SOR=1.981 GT:AD:DP:GQ:PL 1/2:0,13,9:22:99:1066,367,607,516,0,782
The first record shows UnifiedGenotyper cannot call the insertion with just the aligner mapped reads. The insertion is present in ~5 reads in the original alignment and after IndelRealigner+UnifiedGenotyper is supported by 18 total reads (see full context). HaplotypeCaller calls the insertion with or without indel realignment. The former uses nine supporting reads out of 22 informative reads while the latter uses eight supporting reads out of 20 informative reads. Additional minor differences manifest in the annotation metrics. For this particular site, these differences are in
SOR, and FORMAT’s
PL and the described
DP. Despite these minor differences, the two HaplotypeCaller workflows call identical alleles and genotypes for all sites in the one megabasepair interval.
HaplotypeCaller boosts indel detection further.
Here I consider a 30 basepair insertion for the same library in a locus with ~25x depth (shown in B.1). The IndelRealigner+UnifiedGenotyper workflow cannot resolve the 30 basepair insertion, despite inclusion of the locus as a target for realignment and a read with the full insertion placed by the aligner. HaplotypeCaller calls the insertion allele and the region shows five informative reads. Furthermore, as I downsample the coverage of the library, the 6% gap between the number of indels HaplotypeCaller and the IndelRealigner+UnifiedGenotyper workflows detect widens, and the numbers drop in parallel (B.2). I am cautious not to equate greater or matching number of indels with the validity or the concordance of calls. Measuring the full extent of differences requires further comparison of sites, alleles, genotypes and quality of calls. Even still, what is apparent from the illustration is that HaplotypeCaller pushes the envelope of indel detection, a step ahead of IndelRealigner.
HaplotypeCaller calls the insertion with or without indel realignment and the records are shown below. For this particular site, the annotation metrics are identical.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 10 96321641 . A ATATTCCTTCAATACCCTACCTATTGAAGGG 167.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.438;ClippingRankSum=0.000;DP=23;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=10.48;ReadPosRankSum=-1.190;SOR=0.301 GT:AD:DP:GQ:PL 0/1:11,5:16:99:205,0,2170
10 96321641 . A ATATTCCTTCAATACCCTACCTATTGAAGGG 167.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.438;ClippingRankSum=0.000;DP=23;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=10.48;ReadPosRankSum=-1.190;SOR=0.301 GT:AD:DP:GQ:PL 0/1:11,5:16:99:205,0,2170
Links to external references
- Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333-351. http://dx.doi.org/10.1038/nrg.2016.49.
- Blog In between lines of code's post titled "Developments in high throughput sequencing -- June 2015 edition" by Lex Nederbragt posted June 18, 2015. Accessed May 2016 at https://flxlexblog.wordpress.com/2015/06/17/developments-in-high-throughput-sequencing-june-2015-edition/.
- BioTechniques journal article titled Going to Great Read Lengths by Janelle Weaver and dated October 27, 2014. Accessed May 2016 at http://www.biotechniques.com/news/Going-to-Great-Read-Lengths/biotechniques-341722.html#.Vyz2DT8_4yN.
- GenomeWeb article titled "Genomic Variant Data Sharing Gains Support; Collaboration Seen as Key to Interpretation Challenge" by Turna Ray and dated May 2, 2016. Accessed May 2016 at https://www.genomeweb.com/informatics/genomic-variant-data-sharing-gains-support-collaboration-seen-key-interpretation.
- QCFAIL.com site blog titled "Position specific failures of flowcells" by Simon Andrews dated January 31, 2016. Accessed May 2016 at https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/.
- Repeatmasker web resource titled "Human [Homo sapiens] Genomic Dataset" for hg38 by the Institute for Systems Biology based on Repeat Library date-stamped db20140131. Accessed May 2016 at http://www.repeatmasker.org/species/hg.html.