Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PathSeq WDL overhaul #6536

Merged
merged 6 commits into from
Jun 1, 2020
Merged

PathSeq WDL overhaul #6536

merged 6 commits into from
Jun 1, 2020

Conversation

mwalker174
Copy link
Contributor

This new PathSeq WDL redesigns the workflow for improved performance in the cloud. Downsampling can be applied to BAMs with high microbial content (ie >10M reads) that normally cause performance issues.

Other improvements include:

  • Removed microbial fasta input, as only the sequence dictionary is needed.
  • Broke pipeline down to into smaller tasks. This helps reduce costs by a) provisioning fewer resources at the filter and score phases of the pipeline and b) reducing job wall time to minimize the likelihood of VM preemption.
  • Filter-only option, which can be used to cheaply estimate the number of microbial reads in the sample.
  • Metrics are now parsed so they can be fed as output to the Terra data model.
  • CRAM-to-BAM capability
  • Updated WDL readme
  • Deleted unneeded WDL json configuration, as the configuration can be provided in Terra

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions. The big thing is that it would be great to update this to WDL 1.0.

- ``PathSeqPipelineWorkflow.min_clipped_read_length`` -- Minimum read length after quality trimming. You may need to reduce this if your input reads are shorter than the default value (default 60)
- ``PathSeqPipelineWorkflow.estimate_filter_metrics_with_downsampling`` -- read filter metrics will be estimated using a downsampled bam (highly recommended) (default true)
- ``PathSeqPipelineWorkflow.estimate_filter_metrics_reads`` -- number of reads to downsample to for filter metrics estimation, recommended 1M for samples with ~0.1% non-host reads (default 1M)
- ``PathSeqPipelineWorkflow.min_clipped_read_length`` -- Minimum read length after quality trimming. Increasing may increase microbial classification specificity but may reduce sensitivity (default 31)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What made you change this from 60 to 31?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some libraries have very short reads (eg older TCGA data with ~40bp). This makes it less likely for users to get confused when the output comes up empty.

"PathSeqPipelineWorkflow.sample_name": "sample",
"PathSeqPipelineWorkflow.input_bam": "gs://my-bucket/sample.bam",
"PathSeqThreeStageWorkflow.sample_name": "sample",
"PathSeqThreeStageWorkflow.input_bam_or_cram": "gs://my-bucket/sample.bam",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you talk to @bshifaw about what needs to happen to "feature" the PathSeq workspace? It's likely the only this is to put real (mini) data here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats right, mini data would do fine for now. Here is an old json with mini data for pathseq here. Its NA12878_24RG contaminated with chicken reads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bshifaw Is it SOP to have these json files? I was considering deleting it - it seems like this kind of metadata should be made available on Terra instead, ie through featured workspaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also problematic that I can't provide a docker that would support this workflow until we cut a new release

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, having the JSON along with the WDL is SOP. Correct, the JSON would be available in Terra (which requires a google account) but not everyone will be looking at the workflow via Terra. You may have visitors directly from Dockstore or to this repo looking for an example JSON.
I can see why not having a docker would be a problem, maybe a place older can be added for the next release or use ":latest". If both don't seem inappropriate, maybe hold off on the JSON for now but eventually add an example JSON with the WDL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bshifaw Okay thanks for clarifying, I will update the JSON then. Can we add an index to the chicken sample bam and put it at gs://gatk-best-practices/pathseq/contaminated-bam/NA12878_24RG_med.hg380.7chicken0.3.bam.bai?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done

@@ -108,7 +108,7 @@
* counts files, and all contigs appearing in the input counts files must have a corresponding entry in the priors
* table. The order of contigs is immaterial in the priors table. The highest ploidy state is determined by the
* prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability
* to 0. For example, the X contig in the above example can only assume 0 and 1 ploidy states.</p>
* to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.</p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a rebase? I thought I just merged a PR with this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm something strange happened here. I rebased but it's still showing up... maybe one of my commits changes it to X then another back to Y. The final result is correct though.

* host k-mer file, and taxonomy file may also be copied to a single path on every worker node or to HDFS.</p>
*
* <h3>References</h3>
* <ol>
* <li>Walker, M. A., Pedamallu, C. S. et al. (2018). GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts. Bioinformatics. 34, 4287-4289.</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

command <<<
set -euo pipefail
if ${defined(input_bam_index_file)}; then
NUM_READS=`samtools idxstats ${input_bam_file} | awk '{s+=$3+$4} END {print s}'`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although you can't play this trick with a cram. Pros and cons.

echo $NUM_READS > num_reads.txt
java -Xmx2000m -jar /usr/gitc/picard.jar \
DownsampleSam \
INPUT=${input_bam_file} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To call this from GATK I think the args will go to -I, -O, --VALIDATON_STRINGENCY and maybe --P?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to use GATK instead of Picard

input:
input_bam_file=bam_file,
input_bam_index_file=bam_index,
downsampled_bam_filename="${sample_name}.downsampled.bam",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're here, it would be great to rev this to WDL 1.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

${if defined(filter_reads_per_partition) then "--filter-reads-per-partition ${filter_reads_per_partition}" else ""}

if [ ! -f "${paired_bam_output_path}" ]; then
echo "File ${paired_bam_output_path} not found, creating empty BAM"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you prefer to create an empty BAM rather than have a default output path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, the BAM will not be written if there are no reads. I believe this is a behavior of Spark tools in general and not something I can change in PathSeq.

Float frac_final_total = read_float("frac_final_total.txt")
Float frac_qual_cpx_filtered = read_float("frac_qual_cpx_filtered.txt")
Float frac_host_filtered = read_float("frac_host_filtered.txt")
Float frac_dup_filtered = read_float("frac_dup_filtered.txt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These floats are nice for Terra data model output. Thanks.

"PathSeqPipeline.PathSeqFilter.filter_bwa_image": "gs://gatk-best-practices/pathseq/resources/pathseq_host.fa.img",
"PathSeqPipeline.PathSeqScore.taxonomy_file": "gs://gatk-best-practices/pathseq/resources/meats.min2k.db",

"PathSeqPipeline.gatk_docker": "broadinstitute/gatk:latest"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you pin this version please? The provenance is easier to figure out if the docker is fixed in the input json.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


PathSeq reference files are available in the [GATK Resource Bundle](https://software.broadinstitute.org/gatk/download/bundle) (located [here](ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/beta/PathSeq)).
PathSeq reference files are available in the [GATK Resource Bundle](https://software.broadinstitute.org/gatk/download/bundle) at `gs://gatk-best-practices/pathseq`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That broad link is giving me a 404. Maybe just take it out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor fixes, otherwise good to go!

@gatk-bot
Copy link

gatk-bot commented May 28, 2020

Travis reported job failures from build 30432
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30432.11 logs
integration openjdk8 30432.2 logs
integration openjdk11 30432.11 logs
integration openjdk8 30432.2 logs

@mwalker174 mwalker174 merged commit d9bae88 into master Jun 1, 2020
@mwalker174 mwalker174 deleted the mw_ps_wdl branch June 1, 2020 15:47
jonn-smith pushed a commit that referenced this pull request Jul 14, 2020
* New pathseq wdl with downsampling; remove unneeded microbe reference fasta in alignment step
Co-authored-by: ldgauthier <gauthier@broadinstitute.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants