Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding filtering to Ultima Joint Genotyping pipeline #804

Merged
merged 13 commits into from
Oct 18, 2022
Merged

Conversation

meganshand
Copy link
Contributor

Here is a draft of the Ultima Joint Genotyping pipeline that adds filtering. This needs a release of GATK to remove some test docker tags before being merged. I'm hoping for some initial review before the next GATK release though.

Tests will also currently fail since the expected outputs need to be updated.

@pullapprove pullapprove bot requested a review from samuelklee August 24, 2022 20:02
@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

Copy link

@gileshall gileshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, looks great! I added a few minor suggestions, but overall, I say ship it.

tasks/broad/UltimaGenomicsGermlineFilteringThreshold.wdl Outdated Show resolved Hide resolved
Comment on lines 323 to 326
tail -n +2 ~{thresholds} > ~{thresholds}.body
cat ~{thresholds}.body | awk 'BEGIN { FS=","} \
{print "-filter \"VARIANT_TYPE == \047" $1 "\047 && ~{score_key} \
< "$2 "\" --filter-name LOW_SCORE_"$1 " "}' | tr '\n' ' ' > tmpfile

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't decide if this is fragile or not. What leapt out at me is how long I was staring at this, trying to unravel how this works. I don't think this needs to be rewritten, but in case it does become fragile, I suggest adding a comment or two to help unpack its intent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's comment and perhaps simplify. I don't think I've unraveled it myself!

For example, not sure I understand the presence of LOW_SCORE_. Is this trying to construct the filter name used by ScoreVariantAnnotations? If so, note that the filter name there is also an exposed argument, and it's not necessarily given by LOW_ + score_key; it's just that the default key and filter names are SCORE and LOW_SCORE, respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted a rewrite of this that will hopefully be clearer (with an additional comment too).

The LOW_SCORE_ is trying to construct a filter name of the type of variant (it's broken up by snp, hmer indel, nonhmer indel, etc.) which is not clear at all unless you know what's in the thresholds file. Maybe it makes more sense to name the filter "LOW_" + name_of_score_key + "_" + variant_type, but maybe it's cleaner to hard code the filter name as "LOW_SCORE_" + variant_type? It's convenient/confusing that the annotation happens to be SCORE in this case.

Either way, I'm testing it now and will push the changes on Monday once it works.

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @meganshand, looks great! Just a few minor comments and questions.

"UltimaGenomicsJointGenotyping.ref_fasta_sdf": "gs://broad-gotc-test-storage/tmp/jbx/reference_sdf.tar",
"UltimaGenomicsJointGenotyping.runs_file": "gs://broad-gotc-test-storage/tmp/jbx/runs.conservative.bed",
"UltimaGenomicsJointGenotyping.annotation_intervals": ["gs://broad-gotc-test-storage/tmp/jbx/LCR-hs38.bed", "gs://broad-gotc-test-storage/tmp/jbx/mappability.0.bed", "gs://broad-gotc-test-storage/tmp/jbx/exome.twist.bed"],
"UltimaGenomicsJointGenotyping.use_allele_specific_annotations": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you be using AS_ annotations above?

Incidentally, I wonder if we should fail if AS annotations are specified but the names do not start with AS_? This seems a bit restrictive, since I don't know that this naming convention is actually documented or enforced.

Also, should we go back and add this argument to the canonical WDL? What will be the strategy for keeping canonical WDLs in the GATK repo in sync with the ones here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, we should be using AS_ annotations. I think failing on non AS_ annotations is too restrictive. Especially if we want to use a combination of AS_ and non AS_ annotations (which should be ok I think). Maybe a warning, but honestly I probably wouldn't have noticed a warning in this case either.

For keeping the WDLs in sync: That is an excellent question. @jessicaway A version of this WDL also currently lives in GATK. It's good to have it in GATK so that it can be tested as GATK tools change rather than waiting for a GATK release to test it, but we also want it available as a production pipeline. Do you have any ideas on what the process should be for where these WDLs belong and/or how they should be tested?

I am looking back at a previous slack conversation with @kbergin on this subject and our conclusion was to put it in GATK first and then "cross that bridge when we come to it". I think we're at the bridge now though :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meganshand this is a very good question and something we have struggled within the past. There is definitely a need for integration testing in our pipelines as GATK is updated. We often find that things look different when we go to update versions of GATK (or Picard for that matter) and it is a painful process to capture all the changes and track down which commit caused the difference(s).

That being said, keeping the WDLs in 2 locations that do not sync automatically will almost certainly not be a sustainable solution.

I have not spent much time looking into a solution, but off the top of my head, my first idea is to add a test that runs on GATK PRs (maybe PRs to master?) that checks out WARP and runs the pipeline test with the new code. That is not as straightforward as it sounds since we would need to do work to change out the docker used in each task to point to the new version of GATK.

We are definitely open to ideas and suggestions for how to make this better :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an ideal solution, but would probably take some effort to set up in GATK. I think that's worth doing, but I'm also not sure what to do in the meantime with something like this Ultima JG WDL. To complicate the matter it's only one subworkflow of the whole pipeline that currently resides in GATK.

Is there a way to have WARP pull the pipeline from GATK in the short term before we get tests set up in GATK that check out WARP? Or is that an overly complicated solution to a temporary problem (although to be fair, I'm not sure how temporary)?

It might be worth talking this through off of this PR with all relevant parties.

tasks/broad/UltimaGenomicsGermlineFilteringThreshold.wdl Outdated Show resolved Hide resolved
conda activate genomics.py3
bcftools view -s ~{sample_name} ~{input_vcf} -o tmp.vcf.gz -O z
bcftools index -t tmp.vcf.gz
bcftools view -h tmp.vcf.gz | sed 's/COMBINED_TREE_SCORE,Number=A,/COMBINED_TREE_SCORE,Number=\.,/g' > hdr.fixed.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parts of this WDL seem like they could be generically useful, no? It might be worth thinking about exposing arguments needed in this Ultima-specific use case and moving towards generic tasks/sub-WDLs at some point, if not now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for commenting this because I realized that this line is outdated and is no longer needed. I made the task more generic, but I'm not sure if you're suggesting that it live somewhere else more generic as well? Should I move it to Utilities.wdl? Every time I touch anything in Utilities I have to update the versions of every pipeline that uses them which seems like a minor drawback.

tasks/broad/UltimaGenomicsGermlineFilteringThreshold.wdl Outdated Show resolved Hide resolved
--hpol_filter_length_dist 12 10 \
--input_prefix $(echo "~{input_vcf}" | sed 's/\(.vcf.gz\|.vcf\)$//') \
--output_file ~{input_vcf_name}.comp.h5 \
--gtr_vcf ~{gtr_vcf} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK, I guess gtr appears here.

tasks/broad/UltimaGenomicsGermlineFilteringThreshold.wdl Outdated Show resolved Hide resolved
Comment on lines 323 to 326
tail -n +2 ~{thresholds} > ~{thresholds}.body
cat ~{thresholds}.body | awk 'BEGIN { FS=","} \
{print "-filter \"VARIANT_TYPE == \047" $1 "\047 && ~{score_key} \
< "$2 "\" --filter-name LOW_SCORE_"$1 " "}' | tr '\n' ' ' > tmpfile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's comment and perhaps simplify. I don't think I've unraveled it myself!

For example, not sure I understand the presence of LOW_SCORE_. Is this trying to construct the filter name used by ScoreVariantAnnotations? If so, note that the filter name there is also an exposed argument, and it's not necessarily given by LOW_ + score_key; it's just that the default key and filter names are SCORE and LOW_SCORE, respectively.

tasks/broad/UltimaGenomicsGermlineFilteringThreshold.wdl Outdated Show resolved Hide resolved
@github-actions
Copy link

Remember to squash merge!

1 similar comment
@github-actions
Copy link

Remember to squash merge!

@meganshand
Copy link
Contributor Author

meganshand commented Sep 12, 2022

So sorry for the delayed response here. I believe I've addressed the comments on this PR and tried to fix all the tests. The outstanding issues that I need some help with are:

  • Updating the truth data for the Ultima Joint Genotyping pipeline tests. We expect the output to change drastically with this PR since we're adding filtering. I want to make sure this is updated correctly so I don't break the tests on develop/main.
  • Have a plan for how to handle WDLs that belong in either GATK or WARP. As discussed with @jessicaway I'll set up a meeting to talk through this part in person.

@github-actions
Copy link

Remember to squash merge!

4 similar comments
@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

Copy link
Member

@jessicaway jessicaway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A couple minor comments/questions

Array[String] annotations_to_divide = ["ASSEMBLED_HAPS", "FILTERED_HAPS", "TREE_SCORE"]

String docker = "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:UG_feature_branch_v4"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this code is not in an official GATK release yet, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is as of this week! However, this is the docker we've been using in all of the Ultima pipelines. It seems like it might be better to update them all at once in a separate PR? Happy to hear if you think I should update it here first.

String output_vcf_name = "~{base_file_name}" + ".vcf.gz"
command <<<
set -e
source ~/.bashrc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering why this line is necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, the .bashrc sets up conda, so I think it's needed for conda activate genomics.py3, but this task (and some of the others) don't use python so I think it can be removed. I'll clean this up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I was wrong, in this docker gatk and bcftools are managed by conda. I'll add a comment in the code.

@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

@meganshand
Copy link
Contributor Author

Retest this please

@github-actions
Copy link

Remember to squash merge!

@meganshand meganshand merged commit f0f8203 into develop Oct 18, 2022
@meganshand meganshand deleted the ms_filtering branch October 18, 2022 14:48
@nikellepetrillo nikellepetrillo mentioned this pull request Nov 1, 2022
nikellepetrillo added a commit that referenced this pull request Nov 17, 2022
* update to new docker image

* Np get dev in sync with staging (#838)

* make dev and staging in sync

* version

* Np ultima cramonly test (#823)

* add validation and test wdl for cram only ug pipeline

* add testing wdl

* add to pipelineTestType

* add to changed workflows

* renaming

* update changed pipeline script syntax

* update changed pipeline script syntax

* add check to make sure bams are null and fail if they are not null

* Fix for large Ultima samples in Single Sample pipeline (#835)

* folding optional output bam into MarkDuplicates

* updating somatic test paths

* updating changelogs

* updating versions in wdl

* missed a few

* increase version for external whole genome reprocessing

Co-authored-by: jessicaway <jway@broadinstitute.org>

* add in load_tag fields

* update change logs for arrays and imputation internal

* update all other change logs

* update validate chip

* Force fragment size histogram file to exist in rnaseqc output (#840)

* force fragmentSizes histogram file to exist

* Adding filtering to Ultima Joint Genotyping pipeline (#804)

* Adding filtering to Ultima Joint Genotyping pipeline

Co-authored-by: jessicaway <jway@broadinstitute.org>

* Adding scientific test for CramOnly Ultima pipeline and adding it to dockstore (#844)

* GL-702: Update references to data stored in HCA buckets (#836)

* Update references to data stored in HCA buckets to copies made to broad-gotc-test-storage

* GL-1988: Update references to files in broad-gotc-test-storage/tmp/jbx (#848)

* Update references to files in broad-gotc-test-storage/tmp/jbx to new locations

* update docker images for internal arrays and internal imputation

* add lab_batch into internal arrays

* remove lab_batch

* Update Optimus Overview (#852)

Optimus Overview now matches dev version

* Lk ss2 fix broken link (#853)

Fixed broken link to python script.

* remove workspace bucket as input to ingestoutputstoTDR

* remove workspace bucket from wide table ingest

* recommit to refresh dockstore

* update versions for broad internal arrays and imputation and changelogs

* update changelogs

* Allow nested inputs in the imputation pipeline (#857)

* allow nested inputs in the imputation pipeline

* need to copy over pipeline text metrics to truth bucket (#856)

Co-authored-by: Sushma Chaluvadi <schaluva@broadinstitute.org>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: jessicaway <jway@broadinstitute.org>
Co-authored-by: Sushma Chaluvadi <43828038+schaluva@users.noreply.github.com>
Co-authored-by: kachulis <39926576+kachulis@users.noreply.github.com>
Co-authored-by: Tim Wang <31369366+timaeusx@users.noreply.github.com>
Co-authored-by: ekiernan <55763654+ekiernan@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants