Making GATK available on the cloud for everyone
Today, several members of our extended group are talking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements that are related to GATK. Among other communications there's a press release accompanied by a blog post on the Broad Institute blog, which unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.
These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.
Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.
Alright, so what's happening exactly? Read on to find out!
As discussed recently on this very blog, we've been migrating a substantial portion of the Broad's production genomic analysis pipelines to the cloud. This move was motivated in large part by a need for greater elasticity to deal with the onslaught of massive projects periodically hammering our datacenter (I'm looking at you, @dgmacarthur) as well as a drive toward increased cost-efficiency. But it was also a recognition that the mind-boggling rate at which genomic data is generated (roughly doubling every 8 months!) means we have to adapt how we share and interact with these frankly staggering amounts of data.
To that end, we've been working elbow to elbow with Google engineers for the past eighteen months; in short, they taught us how to cloud and we taught them how to genome. Together we built a system capable of operating our GATK Best Practices production pipelines at scale on the Google Cloud Platform (GCP), using Cromwell and WDL to define and execute the actual workflows. We've also been working closely with a team from the Intel Life Sciences division to solve some of the key challenges involved in scaling up to the next order of dataset magnitude, resulting in a new kind of database that will enable us to perform joint calling on tens, even hundreds of thousands of genomes at a time.
We're already running the Broad's whole genomes on this new platform, and eventually we plan to migrate most if not all our research pipelines (exomes, RNA etc) as well. As a corollary, all of the analysis results produced by the cloud-based pipeline are delivered to researchers through cloud-based workspaces within which they can kick off further analyses. That way, what happens on the cloud stays on the cloud, as far into the process as possible (in part to minimize egress charges).
From my perspective the most immediate upshot of this is that it finally puts us within reach of the holy grail of reproducibility: given the pipeline WDL scripts and resource datasets (both of which we plan to share freely) anyone will be able to reproduce our pipeline processing on their own instance of the Google Cloud with complete independence.
That being said, standing up and administering your own cloud-based service is not exactly trivial, and we know there's a lot of demand for push-button solutions, so we built our system to double as a Software as a Service (SaaS) platform that we can make publicly available for the convenience of the wider community. We plan to make this service accessible to everyone, Broadies and non-Broadies alike, including commercial/for-profit organizations, under the same conditions. Exact pricing has yet to be determined, but it will certainly include the cloud vendor's compute costs, and there will be no separate licensing cost for for-profit use.
We're also opening resale of GATK as a service to commercial SaaS vendors in order to maximize the options available to the community. Illumina has signed on as the first to offer GATK as a service through BaseSpace via Cromwell+WDL, and we're working with all the major cloud computing vendors mentioned above to ensure that the Cromwell+WDL pipelining solution will work as seamlessly and cost-effectively on their platforms as it does today on Google Cloud Platform.
Our ultimate goal here is to reduce the amount of effort that goes into standing up and maintaining implementations of GATK Best Practices worldwide, so that all those resources can be refocused on more interesting work. Personally, I expect that these new developments will contribute to making the GATK Best Practices more readily accessible and affordable to all, and I'm looking forward to being able to announce the availability of the new service later this year!