
Blog post on updated comparisons of variant callers

1 parent c146ac4 commit f1492212b4ed8501fe4407fbd255b47b0ba9a48a @chapmanb committed Oct 16, 2013
Showing with 223 additions and 0 deletions.
  1. +223 −0 posts/
@@ -0,0 +1,223 @@
+#+BLOG: bcbio
+#+POSTID: 540
+#+DATE: [2013-10-16 Wed 05:45]
+#+TITLE: Updated comparison of variant detection methods: Ensemble, FreeBayes and minimal BAM preparation pipelines
+#+CATEGORY: variation
+#+TAGS: bioinformatics, variant, ngs, clinical
+#+OPTIONS: toc:nil num:nil
+* Variant evaluation overview
+I previously discussed our approach for [[eval-variant][evaluating variant detection methods]]
+using a [[giab-paper][highly confident set of reference calls]] for the
+[[na12878][NA12878 human HapMap genome]], provided by [[giab][NIST's Genome in a Bottle consortium]].
+The comparison utilizes [[bcbio-nextgen][bcbio-nextgen]], an automated open-source
+pipeline for variant calling and evaluation, coupled with the
+[[xprize-val][XPrize validation protocol]] to identify concordant and discordant
+variants. By having an automated validation workflow attached to a
+regularly updated, community developed variant calling pipeline, we
+can actively track progress of variant callers and provide updates as
+algorithms improve.
+Since the initial post, there have been two new GATK releases of
+[[gatk-ug][UnifiedGenotyper]] and [[gatk-hc][HaplotypeCaller]], as well as multiple improvements
+to [[freebayes][FreeBayes]]. Additionally, we've enhanced our [[ensemble][ensemble calling method]],
+which combines inputs from multiple callers into a single
+final set of calls, to better handle comparisons with inputs from
+three callers.
+The goal of this post is to re-evaluate these variant detection
+approaches and provide an updated set of recommendations:
+- The Ensemble calling method provides the best variant detection by
+ combining inputs from GATK UnifiedGenotyper, HaplotypeCaller and
+ FreeBayes.
+- FreeBayes performs slightly better than GATK methods for SNP and
+ indel resolution, including GATK's HaplotypeCaller method.
+- Post-alignment BAM processing steps like base quality recalibration and
+ realignment have little impact on the quality of variant calls with
+ variant callers that perform local realignment, including FreeBayes
+ and GATK HaplotypeCaller.
+These results enable re-evaluation of current best practice pipelines
+by avoiding the processing intensive post-alignment BAM processing
+steps. Additionally, combining FreeBayes and minimal BAM preparation
+allows for development of pipelines that can be freely used for
+academic, clinical and commercial work.
+#+LINK: eval-variant
+#+LINK: na12878
+#+LINK: giab-paper
+#+LINK: giab
+#+LINK: xprize-val
+#+LINK: freebayes
+#+LINK: gatk-ug
+#+LINK: gatk-hc
+#+LINK: ensemble
+#+LINK: bcbio-nextgen
+* Calling and evaluation methods
+We called variants on a NA12878 exome dataset
+from [[edge][EdgeBio's clinical pipeline]] and assessed them against the NIST Genome in a
+Bottle reference material. [[comparison-do][Full instructions for replicating the analysis]]
+are available from the bcbio-nextgen documentation site.
+Following alignment with [[bwa-mem][bwa-mem (0.7.5a)]], we post-processed the BAM
+files with two methods:
+- [[gatk-bp][GATK's best practices (2.7-2)]]: This involves de-duplication with
+ [[picard-md][Picard MarkDuplicates]], GATK base quality score recalibration and
+ GATK realignment around indels.
+- Minimal post-processing, with de-duplication using
+ [[samtools][samtools rmdup]] and no realignment or recalibration.
+We then called variants on the prepared BAM files with three general-purpose callers:
+- [[freebayes][FreeBayes (v0.9.9.2-18)]]: A haplotype-based Bayesian caller from
+ the Marth Lab.
+- [[gatk-ug][GATK UnifiedGenotyper (2.7-2)]]: GATK's widely used Bayesian caller.
+- [[gatk-hc][GATK HaplotypeCaller (2.7-2)]]: GATK's more recently developed
+ haplotype caller which provides local assembly around variant
+ regions.
+Finally, we evaluated the calls from each combination of BAM
+post-alignment preparation method and variant caller using the
+[[bcbio.variation][bcbio.variation]] framework. This provides a summary identifying
+concordant and discordant variants, separating SNPs and indels since
+they have different error profiles. Additionally, it classifies
+discordant variants, where the reference material and evaluation
+variants differ, into three different categories:
+- Extra variants, called in the evaluation data but not in the
+ reference. These are potential false positives or missing calls from
+ the reference materials.
+- Missing variants, found in the NA12878 reference but not in the
+ evaluation data set. These are potential false negatives.
+- Shared variants, called in both the evaluation and reference but
+ differently represented. This results from allele differences, such as
+ heterozygote versus homozygote calls, or variant identification
+ differences, such as indel start and end coordinates.
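+As a rough illustration of these categories, here is a minimal Python
+sketch. It is not the bcbio.variation implementation: the =Variant=
+record, the =classify= function and the example calls are all
+hypothetical, and matching is by chromosome and position only, so it
+would not catch indels that are represented with different start and
+end coordinates.

```python
from collections import namedtuple

# Hypothetical minimal variant record; a real comparison would parse VCF files.
Variant = namedtuple("Variant", ["chrom", "pos", "ref", "alt", "genotype"])


def classify(reference, evaluation):
    """Split calls into concordant, extra, missing and shared-discordant lists."""
    # Index both call sets by site (assumption: one call per chrom/pos).
    ref_by_site = {(v.chrom, v.pos): v for v in reference}
    eval_by_site = {(v.chrom, v.pos): v for v in evaluation}

    result = {"concordant": [], "extra": [], "missing": [], "shared": []}
    for site, call in sorted(eval_by_site.items()):
        truth = ref_by_site.get(site)
        if truth is None:
            # Called in the evaluation only: potential false positive.
            result["extra"].append(call)
        elif call == truth:
            result["concordant"].append(call)
        else:
            # Same site, but alleles or genotype differ.
            result["shared"].append(call)
    for site, truth in sorted(ref_by_site.items()):
        if site not in eval_by_site:
            # In the reference only: potential false negative.
            result["missing"].append(truth)
    return result


# Made-up example calls, purely for illustration.
reference = [
    Variant("chr1", 100, "A", "G", "0/1"),
    Variant("chr1", 200, "C", "T", "1/1"),
    Variant("chr1", 300, "G", "GA", "0/1"),
]
evaluation = [
    Variant("chr1", 100, "A", "G", "0/1"),   # concordant
    Variant("chr1", 200, "C", "T", "0/1"),   # shared: het versus hom
    Variant("chr1", 400, "T", "C", "0/1"),   # extra
]

result = classify(reference, evaluation)
counts = {category: len(calls) for category, calls in result.items()}
print(counts)
```

+In this toy example each category ends up with a single variant: the
+chr1:300 insertion is missing from the evaluation calls, and the
+chr1:200 call is shared but discordant because of the
+heterozygote/homozygote difference.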
+#+LINK: edge
+#+LINK: bwa-mem
+#+LINK: gatk-bp
+#+LINK: comparison-do
+#+LINK: samtools
+#+LINK: picard-md
+#+LINK: bcbio.variation
+* Variant caller comparison
+With this framework, we compared the three variant callers and the combined
+ensemble method:
+- The ensemble calling approach provides the best overall resolution
+ of both SNPs and indels. The one area where it lags slightly behind
+ is in identification of homozygote/heterozygote calls, especially in
+ indels. The higher discordant shared counts reflect this, and it
+ is due to positions where HaplotypeCaller and FreeBayes both call
+ variants but differ on whether it is a heterozygote or homozygote.
+- GATK HaplotypeCaller is better all around than the UnifiedGenotyper.
+ In the previous comparison, we found UnifiedGenotyper better on SNPs
+ and HaplotypeCaller better on indels, but the recent improvements in
+ GATK 2.7 have resolved the difference in SNP calling. UnifiedGenotyper
+ lags behind the realigning callers in resolving indels, so if you are
+ using a GATK pipeline I'd recommend using HaplotypeCaller.
+- FreeBayes outperforms the GATK callers on both SNP and indel
+ calling. The most recent versions of FreeBayes have improved
+ sensitivity and specificity which puts them on par with GATK
+ HaplotypeCaller. One area where FreeBayes performs better is in
+ correctly resolving heterozygote/homozygote calls, reflected in the
+ lower number of discordant shared variants.
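+To make the idea of combining callers more concrete, here is a
+deliberately simplified Python sketch. It is an assumption-laden
+illustration rather than our actual ensemble method: the
+=ensemble_calls= function, the =min_callers= threshold and the example
+calls are hypothetical. It keeps sites supported by at least two
+callers and flags those where the supporting callers disagree on
+genotype, the kind of heterozygote/homozygote disagreement that shows
+up as discordant shared variants.

```python
from collections import defaultdict


def ensemble_calls(callsets, min_callers=2):
    """Keep sites seen by at least min_callers callers (hypothetical rule)."""
    # Collect the genotype reported by each caller for each site.
    support = defaultdict(dict)
    for caller, calls in callsets.items():
        for site, genotype in calls.items():
            support[site][caller] = genotype

    kept = {}
    for site, by_caller in support.items():
        if len(by_caller) >= min_callers:
            kept[site] = {
                "callers": sorted(by_caller),
                # True when supporting callers disagree, e.g. het versus hom.
                "genotype_conflict": len(set(by_caller.values())) > 1,
            }
    return kept


# Made-up sites keyed by (chrom, pos, ref, alt), purely for illustration.
snp = ("chr1", 100, "A", "G")
deletion = ("chr1", 250, "CT", "C")
fb_only = ("chr1", 300, "G", "A")
ug_only = ("chr1", 400, "T", "C")

callsets = {
    "freebayes": {snp: "0/1", deletion: "1/1", fb_only: "0/1"},
    "haplotypecaller": {snp: "0/1", deletion: "0/1"},
    "unifiedgenotyper": {snp: "0/1", ug_only: "0/1"},
}

kept = ensemble_calls(callsets)
for site in sorted(kept):
    print(site, kept[site])
```

+Here the SNP passes with support from all three callers, while the
+deletion passes with two callers but is flagged because FreeBayes and
+HaplotypeCaller report different genotypes. The two single-caller
+sites are dropped.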
+<a href="">
+ <img src="" width="700"
+ alt="Comparison of variant callers, GATK best practice preparation">
+Beyond calling sensitivity and specificity, another
+factor to consider is the required processing time. Rough benchmarks
+on family-based calling of whole genome sequencing data indicate that
+HaplotypeCaller is roughly 7x slower than UnifiedGenotyper and
+FreeBayes is roughly 2x slower than UnifiedGenotyper. These estimates
+depend on the worst case areas with deeper coverage and longer
+uninterrupted regions to call, but give a rough sense of timing
+considerations.
+* Post-alignment BAM preparation comparison
+Given the improved accuracy of local realignment haplotype-based
+callers like FreeBayes and HaplotypeCaller, we explored the accuracy
+cost of removing the post-alignment BAM processing steps. The
+recommended GATK best-practice is to follow up alignment with
+identification of duplicate reads, followed by
+[[gatk-bqsr][base quality score recalibration]] and [[gatk-realign][realignment around indels.]]
+Based on [[bcbio-scale][whole genome benchmarking work]], these steps can take as long
+as the initial alignment and scale poorly due to high IO costs of
+manipulating large BAM files.
+To compare the quality impact of avoiding recalibration and
+realignment, we performed the identical alignment and variant calling
+steps as above, but did minimal post-alignment BAM preparation.
+Following alignment, the only step performed was deduplication using
+[[samtools][samtools rmdup]]. Unlike Picard MarkDuplicates, samtools rmdup allows
+streaming to avoid IO penalties which makes it more efficient. This is
+at the [[rmdup-v-markdup][cost of not handling some edge cases]]. Longer term, we'd like to
+explore [[biobambam][biobambam's markduplicates2]], which implements a more efficient
+streaming version of the Picard MarkDuplicates algorithm.
+Skipping base recalibration and indel realignment had little impact on
+the quality of resulting variant calls:
+<a href="">
+ <img src="" width="700"
+ alt="Comparison of variant callers, minimal post-alignment preparation">
+While GATK UnifiedGenotyper suffers in indel calling without
+recalibration and realignment, both HaplotypeCaller and FreeBayes
+perform as well or better without these steps. This allows us to save
+on processing time and complexity without sacrificing call quality.
+#+LINK: gatk-bqsr
+#+LINK: gatk-realign
+#+LINK: bcbio-scale
+#+LINK: biobambam
+#+LINK: rmdup-v-markdup
+* Caveats and conclusions
+Taken together, the improvements in FreeBayes and the ability to avoid
+post-alignment BAM processing allow use of a GATK-free pipeline with
+equal quality to current GATK best practices. Adding in GATK's two
+callers plus our ensemble method for combining them provides the most
+accurate overall calls, at the cost of additional processing time.
+It's also important to consider potential drawbacks of this analysis
+in designing future evaluations. The comparison is in exome regions
+for single sample variant calling. In future work it would be helpful
+to have population or family based inputs, and evaluate quality in
+whole genome regions. The reference callset prepared by the Genome in
+a Bottle consortium also makes extensive use of GATK tools during
+preparation. Evaluation of the reference materials with FreeBayes and
+other callers can help reduce potential GATK-specific biases and
+create an even better reference set moving forward.
+All of these pipelines are freely available, open-source, community
+developed projects and we welcome feedback and contributors. By
+integrating validation into a scalable analysis pipeline, we hope to
+build a community interested in widely available calling pipelines
+coupled with well-evaluated methods.
