Test drive GATK Best Practices workflows on Terra
Last week, I wrote about a new initiative we're kicking off to make it easier to get started with GATK. Part of that involves making it easier for anyone to try out the Best Practices workflows without having to do a ton of work up front. That's a pretty big can of worms, because for a long time the Best Practices were really meant to describe at a high level the key GATK (and related) tools/steps you need to run for a particular type of analysis (e.g. germline short variant discovery). They weren't intended to provide an exact recipe of commands and parameters… Yet that's what many of you have told us you want.
For the past couple of years we've been providing actual reference implementations in the form of workflows written in the Workflow Description Language, but that still leaves you with a big old learning curve to overcome before you can actually run them. And we know that for many of you, that learning curve can feel both overwhelming and unwarranted - especially when you're in the exploratory phase of a project and you're not even sure yet that you'll end up using GATK.
To address that problem, we've set up all the GATK Best Practices workflows in public workspaces on our cloud platform, Terra. These workspaces feature workflows that are fully configured with all commands and parameters, as well as resource files and example data you need to run them right out of the box. All it takes is a click of a button! (Almost. There's like three clicks involved, for real).
Let me show you one of these workspaces, and how you would use it to try out Best Practices pipelines. It should take about 15 mins if you follow along and actually click all the things. Or you can just read through to get a sense of what's involved.
GATK Best Practices workspaces live in the Terra Showcase library
Terra has a growing library of workspaces showcasing a variety of analysis use cases and tools, including GATK. You can get to it by clicking the "View Examples" button on the Terra landing page or selecting "Terra Library" then "Showcase" in the dropdown menu (top left icon, horizontal stripes) from any page.
If you go there now (go on, we'll wait for you) you'll be asked to log in with a Google identity. If you don't have one already you can create one, and choose to either create a new Gmail account for it or associate your new Google identity with your existing email address. See this article for step-by-step instructions on how to register. Once you've logged in, look for the big green banner at the top of the screen and click "Start trial" to take advantage of the free credits program. As a reminder, access to Terra is free but Google charges you for compute and storage; the credits (a $300 value!) will allow you to try out the Best Practices for free.
Let's try out the germline short variants pipeline
The Terra Showcase is organized in two major categories: "GATK4 Examples" are all the Best Practices workspaces, and "Featured Workspaces" are various others (including GATK workshop materials -- I'll cover that in an upcoming blog post in this series). Find the "Germline-SNPs-Indels-GATK4-hg38" card and click on it to access a read-only version of the workspace. If you want to be able to actually run things, you need to clone it. To do that, expand the workspace action menu (three-dot icon, top right) and select the "Clone" option. The resulting workspace clone belongs to you. See the animation below or this article for an exact step-by-step walkthrough.
You can find a detailed description of the workspace contents in the Dashboard tab, including instructions and links to relevant documentation. There's a lot of interesting info in there that we could go into, but let's zip straight over to the Data tab to look at the example data that we're providing as input for testing the pipeline.
Go to the Data tab of the workspace and click on "sample" in the left hand menu to see the table of input samples we provide. This is all metadata; the actual data files live in Google Cloud Storage. Later I'll point you to docs where you can learn more about how that works and how you can import your own data securely (it stays private unless you choose to share it). For now, I just want to point out that this workspace provides a full whole genome sequencing (WGS) input dataset in CRAM format for full-scale testing, as well as a "small" downsampled dataset in BAM format for running faster tests, typically as sanity checks.
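To give you a feel for how the metadata and the files connect: Terra data tables are typically populated from a tab-separated load file whose first column header names the entity type (the `entity:sample_id` convention), and whose other columns hold attributes such as `gs://` paths to the files in cloud storage. This is just an illustrative sketch; the column names and bucket paths here are made up, not the ones in this workspace.

```
entity:sample_id	input_bam	input_bam_index
NA12878_small	gs://your-bucket/NA12878_small.bam	gs://your-bucket/NA12878_small.bam.bai
```

When you import a file like this, each row becomes a "sample" you can select as workflow input, and the workflow resolves the `gs://` paths at runtime.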
There's also a table called "Workspace Data" that lists resource files like the reference genome, known variants files, interval lists and so on -- everything you need to run the pipeline. So let's do that now.
Pre-configured Best Practices workflows
Finally, we get to the good stuff! The workflows are set up in the Tools tab of your workspace. In this particular one, you should see three workflows corresponding to the pre-processing, single-sample calling and joint variant discovery portions of the Best Practices for germline SNP & Indel discovery, respectively:
- 1-Processing-For-Variant-Discovery takes raw data in unmapped BAM format and produces analysis-ready BAMs (we have conversion utilities for dealing with FASTQ data);
- 2-Haplotypecaller-GVCF takes the output of the first workflow and performs per-sample variant calling, producing a GVCF;
- 3-Joint-Discovery implements the joint calling and VQSR filtering portion to return a VCF file and its index.
The three workflows are designed to be run back-to-back. Each workflow's outputs will get added to the data table in the appropriate columns, so that the next workflow will find the right inputs automatically.
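If you're curious what's happening under the hood of those three stages, here is a rough sketch of the core GATK4 commands each one wraps. This is a simplified illustration with made-up file names; the actual WDL workflows add scatter/gather parallelization, interval lists, and many more options.

```shell
# Stage 1 (pre-processing): mark duplicates, then recalibrate base quality scores.
# (The real workflow also handles alignment of the unmapped BAM upstream of this.)
gatk MarkDuplicates -I mapped.bam -O dedup.bam -M duplicate_metrics.txt
gatk BaseRecalibrator -I dedup.bam -R ref.fasta \
    --known-sites known_variants.vcf.gz -O recal.table
gatk ApplyBQSR -I dedup.bam -R ref.fasta \
    --bqsr-recal-file recal.table -O analysis_ready.bam

# Stage 2 (per-sample calling): HaplotypeCaller in GVCF mode.
gatk HaplotypeCaller -R ref.fasta -I analysis_ready.bam \
    -O sample.g.vcf.gz -ERC GVCF

# Stage 3 (joint discovery): consolidate the GVCFs (e.g. with GenomicsDBImport),
# genotype the cohort, then filter with VQSR (VariantRecalibrator + ApplyVQSR).
gatk GenotypeGVCFs -R ref.fasta -V cohort_input -O cohort.vcf.gz
```

The point of running these as pre-configured workflows rather than one-off commands is exactly that you don't have to assemble this plumbing yourself.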
Click on the first tool to load up the details; the page will open at the inputs definition form, which is pre-filled for you. To launch the workflow, select some data to run it on, hit the "Run analysis" button then click "Launch" in the popup dialog, as shown in the animation below.
That's all it takes! Congratulations, especially if this is the first GATK pipeline you've ever run.
You can check its status in the Job History tab; as the system processes your request, the status label will change from "Queued" to "Submitted" to "Done" (remember to refresh the page to see the current status). Behind the scenes, Terra will interpret the workflow script, dispatch jobs for execution on Google Cloud virtual machines (with parallelization in all the right places), move data around as needed, and eventually write the results to your workspace storage bucket. The best part of all that? You don't have to worry about any of it :-)
The Dashboard lists the expected runtime and cost of each workflow for each input dataset provided for testing. For example, you'll see that you can run the complete pipeline on the 3 GB sample NA12878_24RG_small in about six hours, for less than the cost of a medium Dunkin's coffee.
At this point you should have a sense of what it's like to test drive GATK workflows on Terra. If you'd like to learn more about how you can take further advantage of these resources, e.g. by uploading your own data to evaluate how our pipelines behave on that, have a look at this quick start guide. You may also want to check out this handy utility workspace that contains preconfigured tools for converting between various input formats, or look at the other GATK Best Practices workspaces in the Terra Showcase.
Next week I'll walk you through using the workspaces that we use in workshops to teach the component steps of each pipeline within Jupyter notebooks. If you want a sneak peek, have a look at this tutorial workspace, where all the action is in the Notebooks tab.
And of course, we're always here to help
It's the same crack team that provides frontline support for both Terra and GATK, so whenever you're using Terra, you can expect the same speedy and caring support you're used to getting on the GATK forum. In fact, you can even write to the support team privately through the Terra Support helpdesk, which you can't do in the GATK forum.
Let us know how it goes; we'd love to hear from you on how we could make this even more useful.