Adding filtering to Ultima Joint Genotyping pipeline #804
Conversation
Remember to squash merge!
(force-pushed from 2d1d44d to 125e324)
Awesome, looks great! I added a few minor suggestions, but overall, I say ship it.
tail -n +2 ~{thresholds} > ~{thresholds}.body
cat ~{thresholds}.body | awk 'BEGIN { FS=","} \
{print "-filter \"VARIANT_TYPE == \047" $1 "\047 && ~{score_key} \
< "$2 "\" --filter-name LOW_SCORE_"$1 " "}' | tr '\n' ' ' > tmpfile
I can't decide if this is fragile or not. What leapt out at me is how long I was staring at this, trying to unravel how this works. I don't think this needs to be rewritten, but in case it does become fragile, I suggest adding a comment or two to help unpack its intent.
Yes, let's comment and perhaps simplify. I don't think I've unraveled it myself!
For example, I'm not sure I understand the presence of LOW_SCORE_. Is this trying to construct the filter name used by ScoreVariantAnnotations? If so, note that the filter name there is also an exposed argument, and it's not necessarily given by LOW_ + score_key; it's just that the default key and filter names are SCORE and LOW_SCORE, respectively.
I attempted a rewrite of this that will hopefully be clearer (with an additional comment too).
The LOW_SCORE_ prefix is trying to construct a filter name from the type of variant (it's broken up by snp, hmer indel, non-hmer indel, etc.), which is not clear at all unless you know what's in the thresholds file. Maybe it makes more sense to name the filter "LOW_" + name_of_score_key + "_" + variant_type, but maybe it's cleaner to hard-code the filter name as "LOW_SCORE_" + variant_type? It's convenient/confusing that the annotation happens to be SCORE in this case.
Either way, I'm testing it now and will push the changes on Monday once it works.
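For readers trying to unpack the one-liner, here is a commented, standalone sketch of what it produces. The thresholds file and its column names are hypothetical stand-ins; the real file comes from the pipeline, and the WDL interpolates ~{thresholds} and ~{score_key} instead of the shell variables used here.

```shell
# Hypothetical thresholds CSV for illustration only (header row, then
# variant_type,threshold pairs); the real file comes from the pipeline.
printf 'variant_type,threshold\nSNP,0.2\nH-INDEL,0.3\n' > thresholds.csv
score_key="SCORE"   # stands in for the WDL's ~{score_key} interpolation

# Drop the header row, then turn each data row into one VariantFiltration
# argument pair. \047 is the octal escape for a single quote, so a row like
# "SNP,0.2" becomes:
#   -filter "VARIANT_TYPE == 'SNP' && SCORE < 0.2" --filter-name LOW_SCORE_SNP
tail -n +2 thresholds.csv \
  | awk -v key="$score_key" 'BEGIN { FS = "," } { printf "-filter \"VARIANT_TYPE == \047%s\047 && %s < %s\" --filter-name LOW_SCORE_%s ", $1, key, $2, $1 }' \
  > tmpfile
```

Using printf without newlines also makes the original's trailing `tr '\n' ' '` step unnecessary in this sketch.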
.../dna_seq/germline/joint_genotyping/UltimaGenomics/UltimaGenomicsJointGenotyping.changelog.md
Thanks @meganshand, looks great! Just a few minor comments and questions.
...nes/broad/dna_seq/germline/joint_genotyping/UltimaGenomics/UltimaGenomicsJointGenotyping.wdl
"UltimaGenomicsJointGenotyping.ref_fasta_sdf": "gs://broad-gotc-test-storage/tmp/jbx/reference_sdf.tar", | ||
"UltimaGenomicsJointGenotyping.runs_file": "gs://broad-gotc-test-storage/tmp/jbx/runs.conservative.bed", | ||
"UltimaGenomicsJointGenotyping.annotation_intervals": ["gs://broad-gotc-test-storage/tmp/jbx/LCR-hs38.bed", "gs://broad-gotc-test-storage/tmp/jbx/mappability.0.bed", "gs://broad-gotc-test-storage/tmp/jbx/exome.twist.bed"], | ||
"UltimaGenomicsJointGenotyping.use_allele_specific_annotations": true, |
Should you be using AS_ annotations above?
Incidentally, I wonder if we should fail if AS annotations are specified but the names do not start with AS_? This seems a bit restrictive, since I don't know that this naming convention is actually documented or enforced.
Also, should we go back and add this argument to the canonical WDL? What will be the strategy for keeping canonical WDLs in the GATK repo in sync with the ones here?
Good catch, we should be using AS_ annotations. I think failing on non-AS_ annotations is too restrictive, especially if we want to use a combination of AS_ and non-AS_ annotations (which should be OK, I think). Maybe a warning, but honestly I probably wouldn't have noticed a warning in this case either.
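As a sketch of the warning idea being discussed: the flag and annotation list below are made up for illustration, and this is not the pipeline's actual validation logic, just one way a task could flag names that lack the AS_ prefix when allele-specific mode is requested.

```shell
# Hedged sketch only: warn (rather than fail) on annotation names that do
# not start with AS_ when allele-specific annotations are requested.
# The annotation list below is hypothetical.
use_allele_specific_annotations=true
annotations="AS_FS AS_ReadPosRankSum TREE_SCORE"

warnings=""
if [ "$use_allele_specific_annotations" = true ]; then
  for ann in $annotations; do
    case "$ann" in
      AS_*) ;;  # matches the allele-specific naming convention
      *)
        warnings="$warnings $ann"
        echo "WARNING: annotation $ann does not start with AS_" >&2
        ;;
    esac
  done
fi
```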
For keeping the WDLs in sync: That is an excellent question. @jessicaway A version of this WDL also currently lives in GATK. It's good to have it in GATK so that it can be tested as GATK tools change rather than waiting for a GATK release to test it, but we also want it available as a production pipeline. Do you have any ideas on what the process should be for where these WDLs belong and/or how they should be tested?
I am looking back at a previous Slack conversation with @kbergin on this subject, and our conclusion was to put it in GATK first and then "cross that bridge when we come to it". I think we're at the bridge now though :)
@meganshand this is a very good question and something we have struggled with in the past. There is definitely a need for integration testing in our pipelines as GATK is updated. We often find that things look different when we go to update versions of GATK (or Picard, for that matter), and it is a painful process to capture all the changes and track down which commit caused the difference(s).
That being said, keeping the WDLs in 2 locations that do not sync automatically will almost certainly not be a sustainable solution.
I have not spent much time looking into a solution, but off the top of my head, my first idea is to add a test that runs on GATK PRs (maybe PRs to master?) that checks out WARP and runs the pipeline test with the new code. That is not as straightforward as it sounds since we would need to do work to change out the docker used in each task to point to the new version of GATK.
We are definitely open to ideas and suggestions for how to make this better :)
I think this is an ideal solution, but it would probably take some effort to set up in GATK. I think that's worth doing, but I'm also not sure what to do in the meantime with something like this Ultima JG WDL. To complicate the matter, it's only one subworkflow of the whole pipeline that currently resides in GATK.
Is there a way to have WARP pull the pipeline from GATK in the short term before we get tests set up in GATK that check out WARP? Or is that an overly complicated solution to a temporary problem (although to be fair, I'm not sure how temporary)?
It might be worth talking this through off of this PR with all relevant parties.
...a_seq/germline/joint_genotyping/UltimaGenomics/test_inputs/Scientific/scientific.inputs.json
conda activate genomics.py3
bcftools view -s ~{sample_name} ~{input_vcf} -o tmp.vcf.gz -O z
bcftools index -t tmp.vcf.gz
bcftools view -h tmp.vcf.gz | sed 's/COMBINED_TREE_SCORE,Number=A,/COMBINED_TREE_SCORE,Number=\.,/g' > hdr.fixed.txt
Parts of this WDL seem like they could be generically useful, no? It might be worth thinking about exposing arguments needed in this Ultima-specific use case and moving towards generic tasks/sub-WDLs at some point, if not now.
Thank you for commenting on this, because I realized that this line is outdated and no longer needed. I made the task more generic, but I'm not sure if you're suggesting that it live somewhere more generic as well? Should I move it to Utilities.wdl? Every time I touch anything in Utilities, I have to update the versions of every pipeline that uses it, which seems like a minor drawback.
--hpol_filter_length_dist 12 10 \
--input_prefix $(echo "~{input_vcf}" | sed 's/\(.vcf.gz\|.vcf\)$//') \
--output_file ~{input_vcf_name}.comp.h5 \
--gtr_vcf ~{gtr_vcf} \
Hmm OK, I guess gtr appears here.
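As a side note on the --input_prefix line above: the sed expression just strips a trailing .vcf or .vcf.gz to recover the file prefix. A standalone check (the helper name is made up for illustration):

```shell
# Hypothetical helper illustrating the suffix-strip from the quoted command.
# Note that \| alternation in sed's basic regex is a GNU extension.
strip_vcf_suffix() {
  echo "$1" | sed 's/\(.vcf.gz\|.vcf\)$//'
}

strip_vcf_suffix "sample1.vcf.gz"    # prints: sample1
strip_vcf_suffix "dir/sample2.vcf"   # prints: dir/sample2
```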
So sorry for the delayed response here. I believe I've addressed the comments on this PR and tried to fix all the tests. The outstanding issues that I need some help with are:
Looks good! A couple of minor comments/questions.
Array[String] annotations_to_divide = ["ASSEMBLED_HAPS", "FILTERED_HAPS", "TREE_SCORE"]
String docker = "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:UG_feature_branch_v4"
I assume this code is not in an official GATK release yet, right?
It is as of this week! However, this is the docker we've been using in all of the Ultima pipelines. It seems like it might be better to update them all at once in a separate PR? Happy to hear if you think I should update it here first.
String output_vcf_name = "~{base_file_name}" + ".vcf.gz"
command <<<
set -e
source ~/.bashrc
Just wondering why this line is necessary?
Huh, the .bashrc sets up conda, so I think it's needed for conda activate genomics.py3, but this task (and some of the others) don't use python, so I think it can be removed. I'll clean this up.
Actually, I was wrong: in this docker, gatk and bcftools are managed by conda. I'll add a comment in the code.
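For context on why the source ~/.bashrc line matters at all, here is a small self-contained demonstration using a fake rc file rather than conda: non-interactive shells (like a WDL task's command block) do not read ~/.bashrc, so anything it defines, such as conda's shell hooks, is invisible until the file is sourced explicitly.

```shell
# Stand-in for ~/.bashrc; in the real image this is where conda is set up.
echo 'MARKER=from_bashrc' > /tmp/fake_bashrc

# A non-interactive shell does not read rc files, so MARKER is unset...
bash -c 'echo "before: ${MARKER:-unset}"'                  # prints: before: unset
# ...until the rc file is sourced explicitly, as the task does for conda.
bash -c 'source /tmp/fake_bashrc; echo "after: $MARKER"'   # prints: after: from_bashrc
```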
(force-pushed from fd6aeb6 to 73a9780)
Retest this please
* update to new docker image
* Np get dev in sync with staging (#838)
* make dev and staging in sync
* version
* Np ultima cramonly test (#823)
* add validation and test wdl for cram only ug pipeline
* add testing wdl
* add to pipelineTestType
* add to changed workflows
* renaming
* update changed pipeline script syntax
* update changed pipeline script syntax
* add check to make sure bams are null and fail if they are not null
* Fix for large Ultima samples in Single Sample pipeline (#835)
* folding optional output bam into MarkDuplicates
* updating somatic test paths
* updating changelogs
* updating versions in wdl
* missed a few
* increase version for external whole genome reprocessing
Co-authored-by: jessicaway <jway@broadinstitute.org>
* add in load_tag fields
* update change logs for arrays and imputation internal
* update all other change logs
* update validate chip
* Force fragment size histogram file to exist in rnaseqc output (#840)
* force fragmentSizes histogram file to exist
* Adding filtering to Ultima Joint Genotyping pipeline (#804)
* Adding filtering to Ultima Joint Genotyping pipeline
Co-authored-by: jessicaway <jway@broadinstitute.org>
* Adding scientific test for CramOnly Ultima pipeline and adding it to dockstore (#844)
* GL-702: Update references to data stored in HCA buckets (#836)
* Update references to data stored in HCA buckets to copies made to broad-gotc-test-storage
* GL-1988: Update references to files in broad-gotc-test-storage/tmp/jbx (#848)
* Update references to files in broad-gotc-test-storage/tmp/jbx to new locations
* update docker images for internal arrays and internal imputation
* add lab_batch into internal arrays
* remove lab_batch
* Update Optimus Overview (#852) Optimus Overview now matches dev version
* Lk ss2 fix broken link (#853) Fixed broken link to python script.
* remove workspace bucket as input to ingestoutputstoTDR
* remove workspace bucket from wide table ingest
* recommit to refresh dockstore
* update versions for broad internal arrays and imputation and changelogs
* update changelogs
* Allow nested inputs in the imputation pipeline (#857)
* allow nested inputs in the imputation pipeline
* need to copy over pipeline text metrics to truth bucket (#856)
Co-authored-by: Sushma Chaluvadi <schaluva@broadinstitute.org>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: jessicaway <jway@broadinstitute.org>
Co-authored-by: Sushma Chaluvadi <43828038+schaluva@users.noreply.github.com>
Co-authored-by: kachulis <39926576+kachulis@users.noreply.github.com>
Co-authored-by: Tim Wang <31369366+timaeusx@users.noreply.github.com>
Co-authored-by: ekiernan <55763654+ekiernan@users.noreply.github.com>
Here is a draft of the Ultima Joint Genotyping pipeline that adds filtering. This needs a release of GATK to remove some test docker tags before being merged. I'm hoping for some initial review before the next GATK release though.
Tests will also currently fail since the expected outputs need to be updated.