Wrap your head around the Cloud
In my last blog post, I introduced the Cromwell+WDL pipelining solution that we developed to make it easier to write and run sophisticated analysis pipelines on cloud infrastructure. I've also mentioned in the recent past that we're building the next generation of GATK (which will be GATK 4) to run efficiently on cloud-based analysis platforms.
So in this follow-up I want to explain why we care so much about building software that runs well in the cloud, which comes down to some key benefits of "The Cloud" over more traditional computing solutions like local servers and clusters (which I'll just refer to as clusters from now on -- the distinction is not really important here).
Old-school compute model
Most research institutions have some kind of locally-hosted computing server or cluster that is made available to their research community as a core service facility. The problem with this model is that when you rely on a local cluster, you're limited by the technical capabilities of the machines in the cluster, the number of machines you have access to, and the degree of control you have over their configuration.
Leaving aside configuration for now, the first two are entirely determined by the cash value that someone places on your ability to get analysis work done in a timely manner. More accurately, someone has the unenviable privilege of making a difficult trade-off between two poles. One is an expensive cluster that is beefy enough to let you and your colleagues run the most demanding software on the largest datasets you have in the shortest amount of time, knowing that most of the time you won't use the cluster at its maximum capacity (which means paying more for the equipment, maintenance and electricity than you're getting out of it). The other is a more affordable setup that is reasonably calibrated to the average amount of compute you'll need, is generally used at capacity and is therefore more cost-effective -- but completely unable to cope with sudden surges in compute needs (e.g. when you and all of your colleagues realize at the same time that you need your results back ASAP for that big grant proposal deadline).
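To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption made up for illustration: the cost per core-hour, the cluster sizes, the demand figures and the function names are hypothetical, and the surge estimate optimistically assumes the work parallelizes perfectly across the whole cluster.

```python
# Toy model of the owned-cluster trade-off. All figures are hypothetical.


def utilization(average_demand_cores: float, cluster_cores: int) -> float:
    """Fraction of the cluster that is busy under typical demand."""
    return min(average_demand_cores, cluster_cores) / cluster_cores


def cost_per_used_core_hour(
    cost_per_core_hour: float, average_demand_cores: float, cluster_cores: int
) -> float:
    """Owned cores cost money whether busy or idle, so idle capacity
    inflates the price of every core-hour that does useful work."""
    return cost_per_core_hour / utilization(average_demand_cores, cluster_cores)


def surge_turnaround_hours(surge_core_hours: float, cluster_cores: int) -> float:
    """Best-case wall-clock time to clear a burst of work, assuming it
    parallelizes perfectly and gets the whole cluster to itself."""
    return surge_core_hours / cluster_cores


if __name__ == "__main__":
    COST_PER_CORE_HOUR = 0.05  # hypothetical amortized cost of an owned core
    AVERAGE_DEMAND = 100       # hypothetical cores in use on a typical day
    SURGE = 12_000             # hypothetical core-hours due before the deadline

    for label, cores in [("sized for peak", 500), ("sized for average", 100)]:
        busy = utilization(AVERAGE_DEMAND, cores)
        price = cost_per_used_core_hour(COST_PER_CORE_HOUR, AVERAGE_DEMAND, cores)
        wait = surge_turnaround_hours(SURGE, cores)
        print(
            f"{label}: {cores} cores, {busy:.0%} busy, "
            f"${price:.2f} per useful core-hour, surge cleared in {wait:.0f} h"
        )
```

With these made-up inputs, the peak-sized cluster sits mostly idle and each useful core-hour costs several times more, while the average-sized cluster is cheap per useful hour but takes several times longer to dig out from under a deadline surge. The real numbers will differ, but the shape of the dilemma is the same.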
You may not realize that this balancing act is going on, and you may not ever be confronted directly with the costs of the compute you consume — but someone, somewhere has to foot the bill. In many academic institutions, computing and maintenance costs are typically paid for out of the so-called overhead on research grants. That's the money that keeps the lights on, the lab floors scrubbed clean and the CPUs whirring happily in their racks. If your institution is paying more for its compute infrastructure than the value it's getting out of it, that means research money is just being siphoned away towards global warming. Because between powering the CPUs and powering the fans that keep those CPUs cool... prepare your flip-flops and cocktail umbrellas, it's going to get tropical around here.
Enter the Cloud.
By now I'm sure you'll have heard that "The Cloud" is really just a cluster that is maintained by someone else and that you rent time on over the internet. This is true. And if you've ever had to answer those snide questions from a bank or car dealership about whether you rent or own your home, you know owning is better than renting, right? Right?
Well, not necessarily. One of the big advantages of renting is that you only pay for what you need at the time you need it, and it doesn't require long-term commitment. When you start a two-year postdoc in a new town, you wouldn't even think about buying a house, especially since you're single and you basically live in the lab, and there's no way it's going to take you more than two years to land that big glamor-mag paper that will be your ticket to the next step, amirite? Uh, sure. Well, give or take a few error bars... Renting is totally the way to go.
One cool thing about the cloud is that it offers you multiple options, and you can change your mind about what you want literally all the time. Specifically -- with the caveat that different cloud vendors may offer different services and pricing strategies, but the following is fairly widely applicable -- you can request different types of machines (or cores) with technical features that range from super-wimpy to super-beefy, with the pricing scaled accordingly. And you can request different numbers of cores depending on how much you can parallelize the execution of the work and how much you care about getting results back quickly -- or not. So if you have some boring processing jobs to run that aren't very compute-intensive and you're not in a rush, you can request a few cores that will run fairly slowly but will be super cheap. On the other hand, if you have some very compute-intensive jobs to run and you need results ASAP, you can request a whole bunch of machines that have e.g. faster CPUs, more RAM and faster-access storage drives, which will be more pricey but will allow you to meet your deadline. It's up to you to balance the cost vs. performance, and you can set a different trade-off point every time depending on your current circumstances.
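Here is one way to sketch that cost vs. performance dial in Python. This is a simplified model, not any vendor's actual pricing: the two machine types, their prices and relative speeds are hypothetical, `MachineType` and `estimate` are names invented for this example, and the model assumes the work scales perfectly with the number of cores, which real pipelines rarely do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineType:
    name: str
    price_per_core_hour: float  # hypothetical rental price, in dollars
    relative_speed: float       # 1.0 = baseline core; 2.0 = twice as fast


def estimate(work_core_hours: float, machine: MachineType, cores: int):
    """Return (wall_clock_hours, total_cost) for a job, assuming the work
    parallelizes perfectly across the requested cores."""
    wall_clock_hours = work_core_hours / (cores * machine.relative_speed)
    total_cost = cores * wall_clock_hours * machine.price_per_core_hour
    return wall_clock_hours, total_cost


# Hypothetical machine types; real offerings and prices vary by vendor.
ECONOMY = MachineType("economy", price_per_core_hour=0.01, relative_speed=1.0)
PERFORMANCE = MachineType("performance", price_per_core_hour=0.05, relative_speed=2.0)

if __name__ == "__main__":
    WORK = 1000  # hypothetical job size, in baseline core-hours
    for machine, cores in [(ECONOMY, 10), (ECONOMY, 100), (PERFORMANCE, 100)]:
        hours, cost = estimate(WORK, machine, cores)
        print(f"{cores:>4} x {machine.name:<12} {hours:6.1f} h  ${cost:.2f}")
```

One thing this toy model makes visible: under perfect scaling, renting ten times as many cores of the same type finishes the job ten times sooner for the same total cost, because you pay for core-hours rather than for owning the hardware. Switching to the faster, pricier machine type is where you actually pay a premium for speed.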
This ability to switch between different types of machines and increase the number of machines you can use almost limitlessly is called elasticity or sometimes burst capability, and it's one of the most powerful arguments I've heard for using cloud-based platforms instead of building out local infrastructure. In fact, at Broad we are currently migrating a substantial portion of our genomic analysis production pipelines to the cloud, in large part because we need the elasticity offered by cloud computing (it's either that or our datacenter completely takes over Allston like a giant space moussaka). But you don't have to be an analysis powerhouse to see the appeal; many small labs and companies (especially startups) whose computing needs are very intense for short periods of time and then remain minimal for extended periods are gravitating toward cloud-based solutions as an alternative to standing up their own local infrastructure. This is one of the ways that cloud computing helps make genome-scale analysis more widely accessible, rather than just to the big players.
There are other important benefits to using cloud rather than local infrastructure, of course. For example, it also makes it a lot easier to share data efficiently among collaborators. Rather than downloading endless copies of datasets from servers (or sometimes, physically transferring data by sneakernet), you can all simply point your software at a common "bucket" that hosts the data (as long as your compute is on the same cloud). Reading the data is generally free of transfer charges as long as it stays in the cloud, though you do typically have to pay so-called "egress charges" if you want to download a copy -- hence the incentive to keep everything in the cloud and move all your analysis work there.
This is not to say that the cloud is a panacea -- there are still plenty of use cases where it makes sense to maintain local high-performance computing capabilities, and we're not getting rid of our datacenter anytime soon (sorry Allston). But for anyone considering upgrading or setting up new compute infrastructure for the purpose of genomic analysis, the cloud is well worth a second look for the reasons above, plus some more I haven't covered here. We'll be talking about this more in the coming days and weeks.