Standalone variant calling steps + joint discovery #88

Merged
merged 1 commit into from Mar 6, 2017

Conversation

Projects
None yet
2 participants
Collaborator

vdauwera commented Feb 9, 2017 edited

Adding WDLs for various component steps of the variant calling pipeline

    - HaplotypeCaller GVCF scatter-gathered across intervals
    - GenotypeGVCFs scatter-gathered across intervals
    - VQSR (no scatter)
    - Joint discovery (GGVCFs through VQSR, no genotype refinement)

vdauwera changed the title from Standalone variant calling steps to Standalone variant calling steps + joint discovery Feb 9, 2017

Collaborator

vdauwera commented Mar 5, 2017 edited

Collaborator

vdauwera commented Mar 5, 2017 edited

Updating all VQSR steps to use calling intervals for potential efficiency. Shouldn't make a difference but just in case -- it can't hurt.

Collaborator

vdauwera commented Mar 6, 2017

Done

Collaborator

vdauwera commented Mar 6, 2017

Could I get a documentation review from @sooheelee or @knoblett? Would like to know if it's clear what these do and what they require.

Collaborator

sooheelee commented Mar 6, 2017 edited

Ok, will review the changes.

Too bad we couldn't use the WDL scripts I wrote for the same processes.

sooheelee self-requested a review Mar 6, 2017

@sooheelee

I'm making comments on the changed files because this is the view that github gives me. This works for new documentation but for changes to documentation, this review approach is not ideal given only changes will be visible. So we should think of a better way to review.

  • I've only reviewed the hashtagged comments.
  • If a comment applies to another section, I omit it so as to only make a particular point once.
  • No comments on formatting.
+##
+## LICENSING :
+## This script is released under the WDL source code license (BSD-3) (see LICENSE in
+## https://github.com/broadinstitute/wdl). Note however that the programs it calls may
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Here, are programs tools?

@vdauwera

vdauwera Mar 6, 2017

Collaborator

correct

@sooheelee

sooheelee Mar 7, 2017

Collaborator

I'm going to start using programs instead of tools too then.

@vdauwera

vdauwera Mar 7, 2017

Collaborator

Actually my "correct" comment was an oversimplification. I should have clarified that I use these terms to distinguish two levels: programs are GATK, Picard, BWA, Cromwell, and tools, such as HaplotypeCaller, are what we invoke in programs that are toolkits, like GATK or Picard. So they're not interchangeable.

+
+# TASK DEFINITIONS
+
+# Unzip GVCFs because GenotypeGVCFs is picky
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Change to "unzip block-compressed VCFs with .gz extension as GenotypeGVCFs currently does not handle compressed files."

@vdauwera

vdauwera Mar 6, 2017

Collaborator

good point, thanks

@vdauwera

vdauwera Mar 6, 2017

Collaborator

done

+ Int disk_size
+ String mem_size
+
+ # HACK ALERT! Using .gvcf extension here to force IndexFeatureFile to make the right
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Why can't we name the gunzipped output to ${unzipped_basename}.g.vcf instead to start with? Then we can avoid the renaming.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

Mainly because this should be a very temporary situation, so when it's fixed we can just change this bit. In fact I should check whether it's still even the case.

+ # Unzip GVCFs in parallel
+ scatter (input_gvcf in input_gvcfs) {
+
+ # Unzip GVCFs because GenotypeGVCFs is a cranky toddler
@sooheelee

sooheelee Mar 6, 2017

Collaborator

I suggest stating only "Unzip GVCFs". Also, perhaps extended explanations, e.g. that GenotypeGVCFs does not handle block compressed files, should all be kept either to the task or to the workflow but not in both.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

Fair point -- this should probably be only at workflow level

@vdauwera

vdauwera Mar 6, 2017

Collaborator

done

+ vcf_index = cohort_vcf_name + ".vcf.gz.tbi"
+ }
+
+ # Outputs that will be retained when execution is complete
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Instead of "execution", use "workflow".

@vdauwera

vdauwera Mar 6, 2017

Collaborator

I'd rather continue to distinguish the workflow (the thing that gets executed) from its execution.

@sooheelee

sooheelee Mar 7, 2017

Collaborator

Ok Oliver.

@vdauwera

vdauwera Mar 7, 2017

Collaborator

Hey :)

+
+# TASK DEFINITIONS
+
+# HaplotypeCaller per-sample in GVCF mode
@sooheelee

sooheelee Mar 6, 2017

Collaborator

A link to our documentation explaining GVCF mode may be helpful here.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

Eh, that's kind of a slippery slope... By the same logic we could link every single tool call to docs. I don't have the stomach to take that on right now (though for the future it would be nice).

@sooheelee

sooheelee Mar 7, 2017

Collaborator

Ok

+##
+## This WDL implements the joint discovery and VQSR filtering portion of the GATK
+## Best Practices (June 2016) for germline SNP and Indel discovery in human
+## whole-genome sequencing (WGS) data.
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Below you say WGS and exome files. However, here you only mention WGS.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

good catch

@vdauwera

vdauwera Mar 6, 2017

Collaborator

done

+##
+## Requirements/expectations :
+## - GVCFs produced by HaplotypeCaller in GVCF mode
+## - Bare minimum 1 WGS sample or 30 Exome samples. Gene panels are not supported.
@sooheelee

sooheelee Mar 6, 2017

Collaborator

This section needs more thought.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

You mean you want to give it more thought, or I should give more details?

@vdauwera

vdauwera Mar 6, 2017

Collaborator

punting for later -- we can always come back and refine this

+## - Bare minimum 1 WGS sample or 30 Exome samples. Gene panels are not supported.
+##
+## Outputs :
+## - A (typically but not necessarily multisample) VCF file and its index, filtered
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Incorporate the information in the parentheses into the main body of the text.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

So make it a separate sentence?

@vdauwera

vdauwera Mar 6, 2017

Collaborator

done

+ resource_indices = resource_indices
+ }
+
+ # Apply INDEL filter (first because INDEL model is usually done sooner)
@sooheelee

sooheelee Mar 6, 2017

Collaborator

Why do we want to launch tasks that will finish sooner first?

@vdauwera

vdauwera Mar 6, 2017

Collaborator

The two models are built in parallel, but then the recals are applied in series. With this wiring, we can already start applying Indel recalibration to the VCF while SNP model is still being built. By the time it finishes, the indel-recalibrated file is available to serve as input to apply the SNP recal. If we did it the other way around, we would have to wait until SNP recal file is available despite indel recal file being there already, then apply SNP, then apply Indel. Longer wall clock time for complete workflow execution. Switching the two apply calls bypasses this problem.

@vdauwera

vdauwera Mar 6, 2017

Collaborator

Added a note to explain this

@sooheelee

sooheelee Mar 7, 2017

Collaborator

Good to know.

Collaborator

vdauwera commented Mar 6, 2017

Thanks for the review, @sooheelee. I'm going to call this good. We can do some additional refinement (eg add doc links) in a future pass over all our workflows.

@vdauwera vdauwera WDLs for segments of the germline short variant calling pipeline
    - HaplotypeCaller GVCF scatter-gathered across intervals
    - GenotypeGVCFs scatter-gathered across intervals
    - VQSR (no scatter)
    - Joint discovery (GGVCFs through VQSR, no genotype refinement)
2c56658

@vdauwera vdauwera merged commit 7e6f95b into develop Mar 6, 2017

1 check was pending

code-review/pullapprove Approval required by 2 of: cjllanwarne, geoffjentry, Horneth, katevoss, kcibul, kshakir, mcovarr, ruchim
Details

vdauwera deleted the gvda_standalone_calling_steps branch Mar 6, 2017

Collaborator

vdauwera commented Mar 6, 2017

Btw I'm pretty sure I used your WDLs as basis to develop most of these, @sooheelee. It's been a while since I started this so I don't remember precisely but I try to avoid reinventing wheels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment