This is one of two posts announcing the imminent beta release of GATK4; for details about the open-source licensing, see this other post.
You've probably heard it by now: we are on the cusp of releasing GATK4 into beta status (targeting mid-June), and we plan to push out a general release shortly thereafter (targeting midsummer). That's great. So what's in the box?
Over two years of active development have gone into producing GATK4, and I'm happy to say we have plenty to show for it. Specifically, we've pushed the evolution of GATK on three fronts: (1) technical performance, i.e. speed and scalability; (2) new functionality and expanded scope of analysis, e.g. we can do CNVs now; and (3) openness to collaboration, through open-sourcing as well as general developer-friendliness (documented code! consistent APIs! clear contribution guidelines!).
Want more detail? Let me give you a tour of the highlights, using slides from the presentation I gave at Bio-IT earlier today (code reuse: it's not just for code anymore).
It's basically a Tesla now
Let's talk about the engine, i.e. the part of the GATK that does all the boring things like reading and writing files, understanding formats, applying multithreading, applying read filters, extracting data, doing basic math -- all the functionality that's common to many tools. The old ("classic"?) GATK engine was waaaay over-engineered, too complex for its own good really, and that's a big reason why GATK has always been a bit on the slow side.
So we rewrote it from scratch to be much more efficient, and it was faster, so that was cool. Then we gave it superpowers.
Like what? Well, how about some major speed enhancements through components like the Intel Genomics Kernel Library (GKL), which provides optimized code for things like file compression and decompression, as well as algorithms like PairHMM (a heavily-used component of the HaplotypeCaller's genotyping code). A hot new datastore component --also contributed by Intel-- called GenomicsDB that constitutes a quantum leap in scalability for joint genotyping (more on that in a bit). Built-in support for Apache Spark to take advantage of industry-standard, robust parallelism, including multi-threading that doesn't suck (good riddance to
-nt/-nct!). Functionality to submit jobs to Google's Dataproc spark execution service, and more generally to read/write files directly from Google Cloud Storage. And finally, increased flexibility in terms of how the engine can traverse the data, which opens up new types of analyses that weren't possible before.
Show me the numbers
We're putting together systematic benchmarks that we'll publish with the release to support all these wild claims of dramatically enhanced performance; and in the meantime let me just show you a few preliminary numbers to give you a sense of scale of the GATK3 vs GATK4 evolution. Keep in mind this is run on ordinary hardware with bog-standard parameters, so nothing fancy -- it's just meant to be a baseline comparison to illustrate what migrating to GATK4 gets you, at the very least, before you even start leveraging advanced speed-freak features like Spark.
The takeaway? Over a full pipeline we see up to 5x speedup. Not too shabby! Looking at a subset of tools that are traditionally long-running, and therefore have an outsize effect on overall runtime, we see that those that were ported over from GATK3 (BaseRecalibrator, HaplotypeCaller) ran about twice as fast on average from a combination of the engine enhancements and optimizations/cleanup made to the tools themselves, which is nice -- but almost pales in comparison to the massive 6x speedups we get from re-implementing key Picard tools in the pipeline, like MarkDuplicates and SortSam (which becomes generic Sort in GATK4).
Speed isn't everything of course; with collaborators like Daniel MacArthur and his crazy data aggregation projects (no offense, we love you Daniel) we find ourselves constantly scrambling to scale to larger cohort sizes. The latest incarnation of this, gnomAD, involved joint-calling 15,000 whole genomes. Now you may be thinking, meh, ExAC was what, 120,000 exomes? Yeah but a whole genome is about 100x larger than an exome, so you do the math. It's a lot of data. Anyway, we couldn't have done it with the existing GATK3 tools. During testing, we ran into a wall at about 3,000 WGS samples, which took 6 weeks to run and totally maxed out Eric Banks' credit cards. So it's a good thing we have friends in smart places, specifically our Intel collaborators who built us a new datastore called GenomicsDB. It allows GATK4 to run the joint genotyping computations far more efficiently than we could do in the old framework, and it's what enabled us to sidestep the wall and scale to the 15,000 WGS samples of gnomAD. Oh, and it only took 2 weeks to run. That was a nice touch.
Is it scope creep if it's on purpose?
In the GATK3 world, we have just one fully decked out Best Practices workflow (HaplotypeCaller GVCF for germline short variants) and maaaybe a couple of half-hearted wannabe-BPs (MuTect2-based and RNAseq) if you want to be generous and count them.
In GATK4, we still have our flagship HaplotypeCaller GVCF workflow of course, but we also have three new workflows that are either already Best Practices-grade (GATK CNV + ACNV for somatic CNVs and allelic CNVs) or getting there (MuTect2 for somatic short variants, much improved after the port, and GATK gCNV for germline CNVs which is brand new and still piping hot). Finally, coming down the pipe we also have tools and an eventual Best Practices workflow for Structural Variants. I guess this is what happens when you let your Pokemon get wet after midnight.
Share ALL THE THINGS (including workflows)
It's worth mentioning at this point that the GATK Best Practices have a dual nature. Their primary purpose is to be a “platonic ideal”; a set of guidelines that describe, for a particular use case like “discover germline SNPs and indels” what processing/analysis should be applied to get the best possible results. But they also serve a secondary purpose, which is to describe in practice how you actually implement that workflow correctly as a pipeline – Are there any tweaks to be made depending on the data type? What can you parallelize, and how? and from there, even more tricky questions arise, like what kind of hardware will you need, how much memory and so on. To assist the community in getting this stuff working, we're committing to sharing our own implementations as a reference, in the form of WDL scripts. We already have a few published in the WDL repository's scripts section, including the "raw reads to GVCF" single-sample workflow we use in our production pipeline to process the Broad's genomes.
We're now getting ready to tackle the task of whipping our GATK4 Best Practices workflows into a publishable state (document ALL THE THINGS), with the goal of having them ready for public consumption when GATK4 itself goes into general release. We encourage you to adopt them if they fit your use case and needs. When everyone uses methods that are the same or similar enough (i.e. functionally equivalent), that makes analysis results more comparable across studies, and as fellow humans, we all stand to benefit from that.
So that concludes our whirlwind tour of GATK4 highlights. There are various things I didn't cover, including what exactly is happening with Picard (we'll cover that separately, and soon) and whether there are changes at the command line level. To address the latter (spoiler: yes, yes there are), we are developing a comprehensive migration guide to help ease the transition. Make no mistake, this is a big change. But it's a change for the better at every level that matters.