Blog post on updated comparisons of variant callers

1 parent c146ac4 commit f1492212b4ed8501fe4407fbd255b47b0ba9a48a @chapmanb committed Oct 16, 2013
Showing with 223 additions and 0 deletions.
  1. +223 −0 posts/callers_compare_2.org
@@ -0,0 +1,223 @@
+#+BLOG: bcbio
+#+POSTID: 540
+#+DATE: [2013-10-16 Wed 05:45]
+#+TITLE: Updated comparison of variant detection methods: Ensemble, FreeBayes and minimal BAM preparation pipelines
+#+CATEGORY: variation
+#+TAGS: bioinformatics, variant, ngs, clinical
+#+OPTIONS: toc:nil num:nil
+
+* Variant evaluation overview
+
+I previously discussed our approach for [[eval-variant][evaluating variant detection methods]]
+using a [[giab-paper][highly confident set of reference calls]] for the
+[[na12878][NA12878 human HapMap genome]], provided by [[giab][NIST's Genome in a Bottle consortium]].
+
+The comparison utilizes [[bcbio-nextgen][bcbio-nextgen]], an automated open-source
+pipeline for variant calling and evaluation, coupled with the
+[[xprize-val][XPrize validation protocol]] to identify concordant and discordant
+variants. By having an automated validation workflow attached to a
+regularly updated, community developed variant calling pipeline, we
+can actively track progress of variant callers and provide updates as
+algorithms improve.
+
+Since the initial post, there have been two new GATK releases of
+[[gatk-ug][UnifiedGenotyper]] and [[gatk-hc][HaplotypeCaller]], as well as multiple improvements
+to [[freebayes][FreeBayes]]. Additionally, we've enhanced our [[ensemble][ensemble calling method]],
+which combines inputs from multiple callers into a single
+final set of calls, to better handle comparisons with inputs from
+three callers.
+
+The goal of this post is to re-evaluate these variant detection
+approaches and provide an updated set of recommendations:
+
+- The Ensemble calling method provides the best variant detection by
+ combining inputs from GATK UnifiedGenotyper, HaplotypeCaller and
+ FreeBayes.
+
+- FreeBayes performs slightly better than GATK methods for SNP and
+ indel resolution, including GATK's HaplotypeCaller method.
+
+- Post-alignment BAM processing steps like base quality recalibration and
+ realignment have little impact on the quality of variant calls with
+ variant callers that perform local realignment, including FreeBayes
+ and GATK HaplotypeCaller.
+
+These results suggest re-evaluating current best practice pipelines,
+since the processing-intensive post-alignment BAM steps can be
+avoided. Additionally, combining FreeBayes and minimal BAM preparation
+allows for development of pipelines that can be freely used for
+academic, clinical and commercial work.
+
+#+LINK: eval-variant http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/
+#+LINK: na12878 http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878
+#+LINK: giab-paper http://arxiv.org/abs/1307.4661
+#+LINK: giab http://www.genomeinabottle.org/
+#+LINK: xprize-val http://bcbio.wordpress.com/2012/09/17/genomics-x-prize-public-phase-update-variant-classification-and-de-novo-calling/
+#+LINK: freebayes https://github.com/ekg/freebayes
+#+LINK: gatk-ug http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html
+#+LINK: gatk-hc http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html
+#+LINK: ensemble http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/
+#+LINK: bcbio-nextgen https://github.com/chapmanb/bcbio-nextgen
+
+* Calling and evaluation methods
+
+We called variants on an NA12878 exome dataset
+from [[edge][EdgeBio's clinical pipeline]] and assessed them against the NIST Genome in a
+Bottle reference material. [[comparison-do][Full instructions for replicating the analysis]]
+are available from the bcbio-nextgen documentation site.
+Following alignment with [[bwa-mem][bwa-mem (0.7.5a)]], we post-processed the BAM
+files with two methods:
+
+- [[gatk-bp][GATK's best practices (2.7-2)]]: This involves de-duplication with
+ [[picard-md][Picard MarkDuplicates]], GATK base quality score recalibration and
+ GATK realignment around indels.
+
+- Minimal post-processing, with de-duplication using
+ [[samtools][samtools rmdup]] and no realignment or recalibration.
+
+We then called variants on the prepared BAM files with three general purpose callers:
+
+- [[freebayes][FreeBayes (v0.9.9.2-18)]]: A haplotype-based Bayesian caller from
+ the Marth Lab.
+
+- [[gatk-ug][GATK UnifiedGenotyper (2.7-2)]]: GATK's widely used Bayesian caller.
+
+- [[gatk-hc][GATK HaplotypeCaller (2.7-2)]]: GATK's more recently developed
+ haplotype caller which provides local assembly around variant
+ regions.
+
+Finally, we evaluated the calls from each combination of BAM
+post-alignment preparation method and variant caller using the
+[[bcbio.variation][bcbio.variation]] framework. This provides a summary identifying
+concordant and discordant variants, separating SNPs and indels since
+they have different error profiles. Additionally, it classifies
+discordant variants, where the reference material and evaluation
+variants differ, into three different categories:
+
+- Extra variants, called in the evaluation data but not in the
+ reference. These are potential false positives or missing calls from
+ the reference materials.
+
+- Missing variants, found in the NA12878 reference but not in the
+ evaluation data set. These are potential false negatives.
+
+- Shared variants, called in both the evaluation and reference but
+ differently represented. This results from allele differences, such as
+ heterozygote versus homozygote calls, or variant identification
+ differences, such as indel start and end coordinates.
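+
+As a rough illustration of these categories, here is a minimal Python
+sketch. It is not the bcbio.variation implementation, which also
+normalizes variant representations; it only compares simplified calls
+keyed by chromosome and position. The =Call= type and the =classify=
+and =summarize= helpers are hypothetical names introduced for this
+example.
+

```python
from collections import Counter
from typing import Dict, NamedTuple, Tuple


class Call(NamedTuple):
    """A simplified variant call; the fields are illustrative only."""
    ref: str
    alt: str
    genotype: str  # e.g. "0/1" (heterozygote) or "1/1" (homozygote)


Site = Tuple[str, int]  # (chromosome, 1-based position)


def classify(reference: Dict[Site, Call],
             evaluation: Dict[Site, Call]) -> Dict[Site, str]:
    """Label every site seen in either the reference or evaluation calls."""
    labels = {}
    for site in sorted(set(reference) | set(evaluation)):
        if site not in reference:
            labels[site] = "extra"       # potential false positive
        elif site not in evaluation:
            labels[site] = "missing"     # potential false negative
        elif reference[site] == evaluation[site]:
            labels[site] = "concordant"
        else:
            labels[site] = "shared"      # same site, different call
    return labels


def summarize(reference: Dict[Site, Call],
              evaluation: Dict[Site, Call]) -> Counter:
    """Count labels separately for SNPs and indels."""
    counts = Counter()
    for site, label in classify(reference, evaluation).items():
        call = evaluation[site] if site in evaluation else reference[site]
        kind = "indel" if len(call.ref) != len(call.alt) else "snp"
        counts[(kind, label)] += 1
    return counts


if __name__ == "__main__":
    # Made-up example calls, not taken from the NA12878 data.
    reference_calls = {
        ("chr1", 100): Call("A", "G", "0/1"),
        ("chr1", 200): Call("C", "T", "1/1"),
        ("chr1", 300): Call("AT", "A", "0/1"),
    }
    evaluation_calls = {
        ("chr1", 100): Call("A", "G", "0/1"),
        ("chr1", 200): Call("C", "T", "0/1"),
        ("chr1", 400): Call("G", "GA", "0/1"),
    }
    for key, count in sorted(summarize(reference_calls, evaluation_calls).items()):
        print(key, count)
```

+
+Because this sketch keys on exact position, it would report an indel
+called with shifted start coordinates as one extra plus one missing
+variant rather than as a shared discordant, which is one reason a real
+comparison needs to normalize representations first.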
+
+#+LINK: edge http://www.edgebio.com/
+#+LINK: bwa-mem http://bio-bwa.sourceforge.net/
+#+LINK: gatk-bp http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0
+#+LINK: comparison-do https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#exome-with-validation-against-reference-materials
+#+LINK: samtools http://samtools.sourceforge.net/
+#+LINK: picard-md http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
+#+LINK: bcbio.variation https://github.com/chapmanb/bcbio.variation
+
+* Variant caller comparison
+
+With this framework, we compared the three variant callers and the
+combined ensemble method:
+
+- The ensemble calling approach provides the best overall resolution
+ of both SNPs and indels. The one area where it lags slightly behind
+ is in identification of homozygote/heterozygote calls, especially in
+ indels. The higher discordant shared counts reflect this, and it
+ is due to positions where HaplotypeCaller and FreeBayes both call
+ variants but differ on whether it is a heterozygote or homozygote.
+
+- GATK HaplotypeCaller is better all around than UnifiedGenotyper.
+ In the previous comparison, we found UnifiedGenotyper better on SNPs
+ and HaplotypeCaller better on indels, but the recent improvements in
+ GATK 2.7 have resolved the difference in SNP calling. UnifiedGenotyper
+ still lags behind the realigning callers in resolving indels, so if
+ you use a GATK pipeline I'd recommend HaplotypeCaller.
+
+- FreeBayes slightly outperforms the GATK callers on both SNP and
+ indel calling. The most recent versions of FreeBayes have improved
+ sensitivity and specificity, which puts it on par with or ahead of
+ GATK HaplotypeCaller. One area where FreeBayes performs better is in
+ correctly resolving heterozygote/homozygote calls, reflected in the
+ lower number of discordant shared variants.
+
+#+BEGIN_HTML
+<a href="http://i.imgur.com/qz4Maf6.png">
+ <img src="http://i.imgur.com/qz4Maf6.png" width="700"
+ alt="Comparison of variant callers, GATK best practice preparation">
+</a>
+#+END_HTML
+
+Beyond calling sensitivity and specificity, another factor to
+consider is the required processing time. Rough benchmarks
+on family-based calling of whole genome sequencing data indicate that
+HaplotypeCaller is about 7x slower than UnifiedGenotyper and
+FreeBayes is about 2x slower. These estimates depend on the worst case
+areas with deeper coverage and longer uninterrupted regions to call,
+but give a sense of the timing considerations.
+
+* Post-alignment BAM preparation comparison
+
+Given the improved accuracy of local realignment haplotype-based
+callers like FreeBayes and HaplotypeCaller, we explored the accuracy
+cost of removing the post-alignment BAM processing steps. The
+recommended GATK best-practice is to follow up alignment with
+identification of duplicate reads, followed by
+[[gatk-bqsr][base quality score recalibration]] and [[gatk-realign][realignment around indels]].
+Based on [[bcbio-scale][whole genome benchmarking work]], these steps can take as long
+as the initial alignment and scale poorly due to high IO costs of
+manipulating large BAM files.
+
+To compare the quality impact of avoiding recalibration and
+realignment, we performed the same alignment and variant calling
+steps as above, but did minimal post-alignment BAM preparation.
+Following alignment, the only step performed was deduplication using
+[[samtools][samtools rmdup]]. Unlike Picard MarkDuplicates, samtools rmdup allows
+streaming to avoid IO penalties, which makes it more efficient. This comes
+at the [[rmdup-v-markdup][cost of not handling some edge cases]]. Longer term, we'd like to
+explore [[biobambam][biobambam's markduplicates2]], which implements a more efficient
+streaming version of the Picard MarkDuplicates algorithm.
+
+Skipping base recalibration and indel realignment had little impact on
+the quality of resulting variant calls:
+
+#+BEGIN_HTML
+<a href="http://i.imgur.com/w8g0HCv.png">
+ <img src="http://i.imgur.com/w8g0HCv.png" width="700"
+ alt="Comparison of variant callers, minimal post-alignment preparation">
+</a>
+#+END_HTML
+
+While GATK UnifiedGenotyper suffers in indel calling without
+recalibration and realignment, both HaplotypeCaller and FreeBayes
+perform as well or better without these steps. This allows us to save
+on processing time and complexity without sacrificing call quality.
+
+#+LINK: gatk-bqsr http://gatk.vanillaforums.com/discussion/44/base-quality-score-recalibration-bqsr
+#+LINK: gatk-realign http://gatk.vanillaforums.com/discussion/38/local-realignment-around-indels
+#+LINK: bcbio-scale http://bcbio.wordpress.com/2013/05/22/scaling-variant-detection-pipelines-for-whole-genome-sequencing-analysis/
+#+LINK: biobambam https://github.com/gt1/biobambam
+#+LINK: rmdup-v-markdup http://www.biostars.org/p/3917/#3985
+
+* Caveats and conclusions
+
+Taken together, the improvements in FreeBayes and the ability to avoid
+post-alignment BAM processing allow use of a GATK-free pipeline with
+equal quality to current GATK best practices. Adding in GATK's two
+callers plus our ensemble method for combining them provides the most
+accurate overall calls, at the cost of additional processing time.
+
+It's also important to consider potential drawbacks of this analysis
+in designing future evaluations. The comparison is in exome regions
+for single sample variant calling. In future work it would be helpful
+to have population or family based inputs, and evaluate quality in
+whole genome regions. The reference callset prepared by the Genome in
+a Bottle consortium also makes extensive use of GATK tools during
+preparation. Evaluation of the reference materials with FreeBayes and
+other callers can help reduce potential GATK-specific biases and
+create an even better reference set moving forward.
+
+All of these pipelines are freely available, open-source, community
+developed projects and we welcome feedback and contributors. By
+integrating validation into a scalable analysis pipeline, we hope to
+build a community interested in widely available calling pipelines
+coupled with well-evaluated methods.
