Batch effects begone: Introducing the Functional Equivalence data processing pipeline spec
By Eric Banks, Director, Data Sciences Platform and original member of the GATK development team
Ever since the GATK started getting noticed by the research community (mainly as a result of our contribution to the 1000 Genomes Project), people have asked us to share the pipelines we use to process data for variant discovery. Historically we have shied away from providing our actual scripts, not because we didn't want to share, but because the scripts themselves were very specific to the infrastructure we were using at the Broad. Fortunately we've been able to move beyond that thanks to the development of WDL and Cromwell, which allow potentially limitless portability of our pipeline scripts.
We also held back because there is a fair amount of wiggle room in how you implement a pipeline that achieves correct results, depending on whether you care more about speed, cost or other factors. So instead we formulated "Best Practices", which I'll talk more about in a minute, to provide a blueprint of the key steps in the pipeline.
Today though we're taking that idea a step further: in collaboration with several other major genomics institutions, we defined a "Functional Equivalence" specification that is intended to standardize pipeline implementations, with the ultimate goal of eliminating batch effects and thereby promoting data interoperability. That means if you use a pipeline that follows this specification, you can rest assured that you will be able to analyze your results against all compatible datasets, including huge resources like gnomAD and TOPMed.
The GATK Best Practices were always meant to be just a set of guidelines for processing sequencing data and performing variant discovery, enumerating the key steps that we found produced the best results. To turn that into an actual pipeline, you have to make certain implementation choices -- how to chain the steps together, how to set command-line parameters, even which tools to use, since for some of the steps there are several valid options. This leaves room for different implementations depending on, for example, whether you care more about cost or about runtime. However, any difference between implementations has the potential to cause subtle batch effects in downstream analyses that compare datasets produced by those different pipelines. This is not a purely theoretical concern -- we have seen such batch effects occur in real analyses, with subtle but important consequences for the scientific results. In fact, this has been so prevalent that the Broad Institute has historically reprocessed from scratch any data that it received from other genome centers in order to avoid batch effects.
Clearly, with the amount of data in today’s large sequencing projects, that strategy of reprocessing everything is no longer feasible. So for the past year, we worked closely with several of the other large genome sequencing and analysis centers (New York Genome Center, Washington University, University of Michigan, Baylor College of Medicine) to develop a standardized pipeline specification that would promote compatibility among our respective institutions' pipelines. And I'm proud to say we accomplished our goal! It took a lot of testing and evaluation, but the consortium was able to define very precisely which components a pipeline implementation, from unmapped reads to an analysis-ready CRAM file, must have in order to be “functionally equivalent” to any other implementation that adheres to this standard specification. This means that any data produced through such functionally equivalent pipelines will be directly comparable without risk of batch effects.
The consortium has already published the specification itself on GitHub, and we're currently preparing a manuscript detailing the methodology we used as well as the consequences for downstream analysis, which we will submit for peer review and journal publication. We'll post updates and link to the preprint and final article as they become available.
The genome centers involved in this effort have all committed to using this standard to process all of our genomes, and together we’ve already processed over 150,000 human whole genomes with it. Since collectively we account for a substantial proportion of all genomes that get sequenced in the world, that should already simplify the lives of the large number of researchers who get their data from any one of our centers.
That being said, there are plenty of other genome sequencing facilities out in the world (yes, we're very aware -- and happy -- that there is a wonderful world beyond North America), and there are also many researchers who do their own processing and analysis. We hope that these service providers and individual analysts will consider adopting the Functional Equivalence pipeline specification in their own work to further increase the number of genomic datasets that will be fully compatible in this way. Based on our experience working with datasets of different provenance, we expect this will have far-reaching positive consequences for the ability of biomedical scientists to cross-analyze datasets.
As an example, major datasets like the next version of gnomAD (~70,000 genomes, plus countless exomes) and TOPMed (another ~70,000 genomes) are now being processed through pipelines that conform to this Functional Equivalence standard. These datasets constitute incredibly valuable resources for analyses that rely on, for example, known allele frequencies, yet we know that any cross-analysis you run against them is vulnerable to batch effects that skew results UNLESS your samples were also processed through a functionally equivalent pipeline.
That last point is really important, so let me give an example to illustrate it. Imagine you want to find a causal variant in a sample you really care about, so you run variant calling and then compare the resulting callset against gnomAD in order to find the population-based allele frequencies. And, behold, you find a SNP that’s not in gnomAD! This could be the rare variant that you’ve been searching for… or it could be an artifact that arose because you didn’t process your sample with a pipeline that’s functionally equivalent to the pipeline used to make gnomAD. In case you're curious, we've found that the most egregious batch effects arise from using different aligners (or even different versions of the same aligner), which is why the functional equivalence specification includes the requirement to run the exact same version of BWA. And we’ve found that even simple tasks like marking duplicate reads can have drastic effects on things like downstream calling of structural variation. More details to come in the paper!
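To make the aligner-version point a bit more concrete, here is a minimal Python sketch of one sanity check you could run before comparing two datasets: read the `@PG` (program) records from each file's SAM-format header text and see whether both were aligned with the same BWA version. This is only an illustration, not part of the specification. The function names are made up for this post, the version strings in the example headers are placeholders rather than the version the specification requires, and the check assumes the aligner recorded itself in the header with `PN` (or `ID`) starting with "bwa" and a `VN` version tag.

```python
from typing import Dict, List, Set


def parse_pg_records(sam_header: str) -> List[Dict[str, str]]:
    """Return one dict of TAG -> value for each @PG line in a SAM header."""
    records = []
    for line in sam_header.splitlines():
        if not line.startswith("@PG"):
            continue
        record = {}
        # Fields after the record type are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            tag, sep, value = field.partition(":")
            if sep:
                record[tag] = value
        records.append(record)
    return records


def bwa_versions(sam_header: str) -> Set[str]:
    """Collect the VN values of every @PG record that looks like BWA."""
    versions = set()
    for rec in parse_pg_records(sam_header):
        # Assumption: the program name (or, failing that, the ID) starts with "bwa".
        name = rec.get("PN", rec.get("ID", ""))
        if name.lower().startswith("bwa"):
            versions.add(rec.get("VN", "unknown"))
    return versions


def same_aligner_version(header_a: str, header_b: str) -> bool:
    """True only if both headers report exactly one BWA version and it matches."""
    versions_a = bwa_versions(header_a)
    versions_b = bwa_versions(header_b)
    return len(versions_a) == 1 and versions_a == versions_b


# Hypothetical header snippets; the version numbers are placeholders.
HEADER_A = "@HD\tVN:1.5\tSO:coordinate\n@PG\tID:bwa\tPN:bwa\tVN:0.7.15\tCL:bwa mem ref.fa reads.fq"
HEADER_B = "@HD\tVN:1.5\tSO:coordinate\n@PG\tID:bwa\tPN:bwa\tVN:0.7.17\tCL:bwa mem ref.fa reads.fq"

if __name__ == "__main__":
    print(same_aligner_version(HEADER_A, HEADER_A))
    print(same_aligner_version(HEADER_A, HEADER_B))
```

A matching aligner version is of course only one of the requirements, so a passing check like this does not by itself show that two pipelines are functionally equivalent. A mismatch, on the other hand, is a cheap early warning that a surprising variant may deserve a second look.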
In my next blog post (coming soooon) I will point you to the Broad’s actual implementation and describe how we’ve made it really cheap to run. Stay tuned!