The Art of the Pipeline: Introducing Cromwell + WDL
Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.
As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submit them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.
So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.
Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.
It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!
In a nutshell, WDL was designed to make it easy to write scripts that describe analysis tasks and chain those tasks into workflows, with built-in support for advanced features like scatter-gather parallelism. This came at the cost of some additional headaches for the engineering team when they built Cromwell, the execution engine that runs WDL scripts -- but that was a conscious design decision, to push off the complexity onto the engineers rather than leave it in the lap of pipeline authors.
You can learn more about WDL at https://software.broadinstitute.org/wdl/, a website we put together to provide detailed documentation on how to write WDL scripts and run them on Cromwell. It's further enriched by a number of tutorials and real GATK-focused workflows provided as examples, and in the near future we'll also share our GATK Best Practices production pipeline scripts.
Yes, I just said that, we're going to be sharing full-on production scripts! In the past we refused to share scripts because they weren't portable across systems and they were difficult to read, so we were concerned that sharing them would generate more support burden than we could handle. But now with that we have a solution that is both portable and user-friendly, we're not so worried -- and we're very excited about this as an opportunity to improve reproducibility among those who adopt our Best Practices in their work.
Of course it won't be the first time someone comes up with a revolutionary way to do things that ends up not working all that well in the real world. To that I say -- have you ever heard of the phrase "eating your own dog food", or "dogfooding" for short? It refers to a practice in software engineering where developers use their own product for running critical systems. I'm happy to say that we are now dogfooding Cromwell and WDL in our own production pipelines; a substantial part of the Broad's own analysis work, which involves processing terabytes worth of whole genome sequencing data, is now done on the cloud using Cromwell to execute pipelines written entirely in WDL.
That's how we know it works at scale, it's solid, and it does what we need it to do. That being said, there are a few additional features that are defined in WDL but not yet fully implemented in Cromwell, as well as a wishlist of convenience functions requested by our own methods development team, which the engineering team will pursue over the coming months. So its capabilities will keep growing as we add convenience functions to address our own and the wider community's pipelining needs.
Finally, I want to reiterate that WDL itself is meant to be very accessible to people who are not engineers or programmers; our own methods development team and various scientific collaborators have all commented that it's very easy to write and understand. So please try it out, and let us know what you think in the WDL forum!