Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Standalone variant calling steps + joint discovery #88
Conversation
vdauwera
changed the title from
Standalone variant calling steps to Standalone variant calling steps + joint discovery
Feb 9, 2017
|
|
Updating all VQSR steps to use calling intervals for potential efficiency. Shouldn't make a difference but just in case -- it can't hurt.
|
|
Done |
|
Could I get a documentation review from @sooheelee or @knoblett? Would like to know if it's clear what these do and what they require. |
|
Ok, will review the changes. Too bad we couldn't use the WDL scripts I wrote for the same processes. |
sooheelee
self-requested a review
Mar 6, 2017
sooheelee
reviewed
Mar 6, 2017
I'm making comments on the changed files because this is the view that github gives me. This works for new documentation but for changes to documentation, this review approach is not ideal given only changes will be visible. So we should think of a better way to review.
- I've only reviewed the hashtagged comments.
- If a comment applies to another section, I omit it so as to only make a particular point once.
- No comments on formatting.
| +## | ||
| +## LICENSING : | ||
| +## This script is released under the WDL source code license (BSD-3) (see LICENSE in | ||
| +## https://github.com/broadinstitute/wdl). Note however that the programs it calls may |
vdauwera
Mar 7, 2017
Collaborator
Actually my "correct" comment was an oversimplification. I should have clarified that I use these terms to distinguish two levels: programs are GATK, Picard, BWA, Cromwell, and tools, such as HaplotypeCaller, are what we invoke in programs that are toolkits, like GATK or Picard. So they're not interchangeable.
| + | ||
| +# TASK DEFINITIONS | ||
| + | ||
| +# Unzip GVCFs because GenotypeGVCFs is picky |
sooheelee
Mar 6, 2017
Collaborator
Change to "unzip block-compressed VCFs with .gz extension as GenotypeGVCFs currently does not handle compressed files."
| + Int disk_size | ||
| + String mem_size | ||
| + | ||
| + # HACK ALERT! Using .gvcf extension here to force IndexFeatureFile to make the right |
sooheelee
Mar 6, 2017
Collaborator
Why can't we name the gunzipped output to ${unzipped_basename}.g.vcf instead to start with? Then we can avoid the renaming.
vdauwera
Mar 6, 2017
Collaborator
Mainly because this should be a very temporary situation, so when it's fixed we can just change this bit. In fact I should check whether it's still even the case.
| + # Unzip GVCFs in parallel | ||
| + scatter (input_gvcf in input_gvcfs) { | ||
| + | ||
| + # Unzip GVCFs because GenotypeGVCFs is a cranky toddler |
sooheelee
Mar 6, 2017
Collaborator
I suggest stating only "Unzip GVCFs". Also, perhaps extended explanations, e.g. that GenotypeGVCFs does not handle block compressed files, should all be kept either to the task or to the workflow but not in both.
| + vcf_index = cohort_vcf_name + ".vcf.gz.tbi" | ||
| + } | ||
| + | ||
| + # Outputs that will be retained when execution is complete |
vdauwera
Mar 6, 2017
Collaborator
I'd rather continue to distinguish the workflow (the thing that gets executed) from its execution.
| + | ||
| +# TASK DEFINITIONS | ||
| + | ||
| +# HaplotypeCaller per-sample in GVCF mode |
sooheelee
Mar 6, 2017
Collaborator
A link to our documentation explaining GVCF mode may be helpful here.
vdauwera
Mar 6, 2017
Collaborator
Eh, that's kind of a slippery slope... By the same logic we could link every single tool call to docs. I don't have the stomach to take that on right now (though for the future it would be nice).
| +## | ||
| +## This WDL implements the joint discovery and VQSR filtering portion of the GATK | ||
| +## Best Practices (June 2016) for germline SNP and Indel discovery in human | ||
| +## whole-genome sequencing (WGS) data. |
sooheelee
Mar 6, 2017
Collaborator
Below you say WGS and exome files. However, here you only mention WGS.
| +## | ||
| +## Requirements/expectations : | ||
| +## - GVCFs produced by HaplotypeCaller in GVCF mode | ||
| +## - Bare minimum 1 WGS sample or 30 Exome samples. Gene panels are not supported. |
vdauwera
Mar 6, 2017
Collaborator
You mean you want to give it more thought, or I should give more details?
| +## - Bare minimum 1 WGS sample or 30 Exome samples. Gene panels are not supported. | ||
| +## | ||
| +## Outputs : | ||
| +## - A (typically but not necessarily multisample) VCF file and its index, filtered |
sooheelee
Mar 6, 2017
Collaborator
Incorporate the information in the parentheses into the main body of the text.
| + resource_indices = resource_indices | ||
| + } | ||
| + | ||
| + # Apply INDEL filter (first because INDEL model is usually done sooner) |
vdauwera
Mar 6, 2017
Collaborator
The two models are built in parallel, but then the recals are applied in series. With this wiring, we can already start applying Indel recalibration to the VCF while SNP model is still being built. By the time it finishes, the indel-recalibrated file is available to serve as input to apply the SNP recal. If we did it the other way around, we would have to wait until SNP recal file is available despite indel recal file being there already, then apply SNP, then apply Indel. Longer wall clock time for complete workflow execution. Switching the two apply calls bypasses this problem.
|
Thanks for the review, @sooheelee. I'm going to call this good. We can do some additional refinement (eg add doc links) in a future pass over all our workflows. |
vdauwera
merged commit 7e6f95b
into
develop
Mar 6, 2017
1 check was pending
vdauwera
deleted the
gvda_standalone_calling_steps branch
Mar 6, 2017
|
Btw I'm pretty sure I used your WDLs as basis to develop most of these, @sooheelee. It's been a while since I started this so I don't remember precisely but I try to avoid reinventing wheels. |
vdauwera commentedFeb 9, 2017
•
edited
Adding WDLs for various component steps of the variant calling pipeline