PathSeq WDL overhaul #6536

mwalker174 · 2020-04-03T18:50:37Z

This new PathSeq WDL redesigns the workflow for improved performance in the cloud. Downsampling can be applied to BAMs with high microbial content (i.e. >10M reads), which normally cause performance issues.

Other improvements include:

Removed microbial fasta input, as only the sequence dictionary is needed.
Broke the pipeline down into smaller tasks. This helps reduce costs by a) provisioning fewer resources at the filter and score phases of the pipeline and b) reducing job wall time to minimize the likelihood of VM preemption.
Filter-only option, which can be used to cheaply estimate the number of microbial reads in the sample.
Metrics are now parsed so they can be fed as output to the Terra data model.
CRAM-to-BAM capability
Updated WDL readme
Deleted unneeded WDL json configuration, as the configuration can be provided in Terra
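
As a rough illustration of how a filter-only run on a downsampled BAM could be extrapolated to the full sample, here is a minimal Python sketch. It is not code from this PR: the function name, argument names, and read counts are hypothetical, and it assumes the downsampled reads are a uniform random subset of the input.

```python
import math


def estimate_nonhost_reads(total_reads, sampled_reads, nonhost_in_sample):
    """Extrapolate a non-host read count from a downsampled subset.

    Hypothetical helper for illustration; assumes uniform random downsampling.
    """
    if sampled_reads <= 0 or total_reads < sampled_reads:
        raise ValueError("sampled_reads must be positive and <= total_reads")
    fraction = nonhost_in_sample / sampled_reads
    # Binomial standard error of the estimated non-host fraction.
    std_err = math.sqrt(fraction * (1.0 - fraction) / sampled_reads)
    return {
        "fraction": fraction,
        "estimated_nonhost_reads": round(fraction * total_reads),
        "relative_error": std_err / fraction if fraction > 0 else float("inf"),
    }


# Made-up numbers: 500M input reads, 1M sampled, 1,000 of them non-host (~0.1%).
result = estimate_nonhost_reads(
    total_reads=500_000_000,
    sampled_reads=1_000_000,
    nonhost_in_sample=1_000,
)
print(result)
```

With these made-up numbers the subset contains about 1,000 non-host reads, so the relative sampling error on the estimate is roughly 3%.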

ldgauthier

Some questions. The big thing is that it would be great to update this to WDL 1.0.

ldgauthier · 2020-04-14T18:52:38Z

scripts/pathseq/wdl/README.md

- ``PathSeqPipelineWorkflow.min_clipped_read_length`` -- Minimum read length after quality trimming. You may need to reduce this if your input reads are shorter than the default value (default 60)
+- ``PathSeqPipelineWorkflow.estimate_filter_metrics_with_downsampling`` -- read filter metrics will be estimated using a downsampled bam (highly recommended) (default true)
+- ``PathSeqPipelineWorkflow.estimate_filter_metrics_reads`` -- number of reads to downsample to for filter metrics estimation, recommended 1M for samples with ~0.1% non-host reads (default 1M)
+- ``PathSeqPipelineWorkflow.min_clipped_read_length`` -- Minimum read length after quality trimming. Increasing may increase microbial classification specificity but may reduce sensitivity (default 31)


What made you change this from 60 to 31?

Some libraries have very short reads (e.g. older TCGA data with ~40bp). A lower default makes it less likely for users to get confused when the output comes up empty.
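
To make the empty-output scenario concrete, here is a small Python sketch. It is illustrative only: the helper name and read lengths are hypothetical, and it assumes a read is kept when its clipped length is at least `min_clipped_read_length`.

```python
def passes_length_filter(clipped_length, min_clipped_read_length):
    # Assumption: a read is kept if its length after trimming is >= the threshold.
    return clipped_length >= min_clipped_read_length


# Hypothetical library in which every read is ~40 bp after quality trimming.
read_lengths = [40, 40, 38, 41, 40]

kept_at_60 = [n for n in read_lengths if passes_length_filter(n, 60)]
kept_at_31 = [n for n in read_lengths if passes_length_filter(n, 31)]

print(len(kept_at_60), len(kept_at_31))
```

Under the old default of 60 every read in this toy library is dropped, which is the confusing empty result described above; with 31 all of them pass.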

ldgauthier · 2020-04-14T18:55:07Z

scripts/pathseq/wdl/pathseq_pipeline_template.json

-  "PathSeqPipelineWorkflow.sample_name": "sample",
-  "PathSeqPipelineWorkflow.input_bam": "gs://my-bucket/sample.bam",
+  "PathSeqThreeStageWorkflow.sample_name": "sample",
+  "PathSeqThreeStageWorkflow.input_bam_or_cram": "gs://my-bucket/sample.bam",


Can you talk to @bshifaw about what needs to happen to "feature" the PathSeq workspace? It's likely the only thing is to put real (mini) data here.

That's right, mini data would do fine for now. Here is an old JSON with mini data for PathSeq. It's NA12878_24RG contaminated with chicken reads.

@bshifaw Is it SOP to have these JSON files? I was considering deleting it - it seems like this kind of metadata should be made available on Terra instead, i.e. through featured workspaces.

It's also problematic that I can't provide a docker that would support this workflow until we cut a new release

Yes, having the JSON along with the WDL is SOP. Correct, the JSON would be available in Terra (which requires a Google account), but not everyone will be looking at the workflow via Terra. You may have visitors coming directly from Dockstore or to this repo looking for an example JSON.
I can see why not having a docker would be a problem. Maybe a placeholder can be added for the next release, or use ":latest". If neither seems appropriate, maybe hold off on the JSON for now but eventually add an example JSON with the WDL.

@bshifaw Okay thanks for clarifying, I will update the JSON then. Can we add an index to the chicken sample bam and put it at gs://gatk-best-practices/pathseq/contaminated-bam/NA12878_24RG_med.hg380.7chicken0.3.bam.bai?

ldgauthier · 2020-04-14T18:56:32Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/DetermineGermlineContigPloidy.java

@@ -108,7 +108,7 @@
 *     counts files, and all contigs appearing in the input counts files must have a corresponding entry in the priors
 *     table. The order of contigs is immaterial in the priors table. The highest ploidy state is determined by the
 *     prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability
- *     to 0. For example, the X contig in the above example can only assume 0 and 1 ploidy states.</p>
+ *     to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.</p>


Does this need a rebase? I thought I just merged a PR with this change.

Hm something strange happened here. I rebased but it's still showing up... maybe one of my commits changes it to X then another back to Y. The final result is correct though.

src/main/java/org/broadinstitute/hellbender/tools/spark/pathseq/PathSeqBwaSpark.java

ldgauthier · 2020-04-14T19:02:17Z

src/main/java/org/broadinstitute/hellbender/tools/spark/pathseq/PathSeqPipelineSpark.java

 * host k-mer file, and taxonomy file may also be copied to a single path on every worker node or to HDFS.</p>
 *
 * <h3>References</h3>
 * <ol>
+ *     <li>Walker, M. A., Pedamallu, C. S. et al. (2018). GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts. Bioinformatics. 34, 4287-4289.</li>


ldgauthier · 2020-04-14T20:51:38Z