Adding filtering to Ultima Joint Genotyping pipeline #804
Conversation
Remember to squash merge!
(force-pushed from 2d1d44d to 125e324)
Awesome, looks great! I added a few minor suggestions, but overall, I say ship it.
tail -n +2 ~{thresholds} > ~{thresholds}.body
cat ~{thresholds}.body | awk 'BEGIN { FS=","} \
{print "-filter \"VARIANT_TYPE == \047" $1 "\047 && ~{score_key} \
< "$2 "\" --filter-name LOW_SCORE_"$1 " "}' | tr '\n' ' ' > tmpfile
I can't decide if this is fragile or not. What leapt out at me is how long I was staring at this, trying to unravel how this works. I don't think this needs to be rewritten, but in case it does become fragile, I suggest adding a comment or two to help unpack its intent.
Yes, let's comment and perhaps simplify. I don't think I've unraveled it myself!
For example, I'm not sure I understand the presence of LOW_SCORE_. Is this trying to construct the filter name used by ScoreVariantAnnotations? If so, note that the filter name there is also an exposed argument, and it's not necessarily given by LOW_ + score_key; it's just that the default key and filter names are SCORE and LOW_SCORE, respectively.
I attempted a rewrite of this that will hopefully be clearer (with an additional comment too).
The LOW_SCORE_ prefix is trying to construct a filter name from the type of variant (it's broken up by snp, hmer indel, non-hmer indel, etc.), which is not clear at all unless you know what's in the thresholds file. Maybe it makes more sense to name the filter "LOW_" + name_of_score_key + "_" + variant_type, but maybe it's cleaner to hard-code the filter name as "LOW_SCORE_" + variant_type? It's convenient/confusing that the annotation happens to be SCORE in this case.
Either way, I'm testing it now and will push the changes on Monday once it works.
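For readers trying to unpack the one-liner, here is a commented, standalone sketch of what it produces. The thresholds file and its column names are hypothetical stand-ins; the real file comes from the pipeline, and the WDL interpolates ~{thresholds} and ~{score_key} instead of the shell variables used here.

```shell
# Hypothetical thresholds CSV for illustration only (header row, then
# variant_type,threshold pairs); the real file comes from the pipeline.
printf 'variant_type,threshold\nSNP,0.2\nH-INDEL,0.3\n' > thresholds.csv
score_key="SCORE"   # stands in for the WDL's ~{score_key} interpolation

# Drop the header row, then turn each data row into one VariantFiltration
# argument pair. \047 is the octal escape for a single quote, so a row like
# "SNP,0.2" becomes:
#   -filter "VARIANT_TYPE == 'SNP' && SCORE < 0.2" --filter-name LOW_SCORE_SNP
tail -n +2 thresholds.csv \
  | awk -v key="$score_key" 'BEGIN { FS = "," } { printf "-filter \"VARIANT_TYPE == \047%s\047 && %s < %s\" --filter-name LOW_SCORE_%s ", $1, key, $2, $1 }' \
  > tmpfile
```

Using printf without newlines also makes the original's trailing `tr '\n' ' '` step unnecessary in this sketch.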
.../dna_seq/germline/joint_genotyping/UltimaGenomics/UltimaGenomicsJointGenotyping.changelog.md
Thanks @meganshand, looks great! Just a few minor comments and questions.
...nes/broad/dna_seq/germline/joint_genotyping/UltimaGenomics/UltimaGenomicsJointGenotyping.wdl
"UltimaGenomicsJointGenotyping.ref_fasta_sdf": "gs://broad-gotc-test-storage/tmp/jbx/reference_sdf.tar", | ||
"UltimaGenomicsJointGenotyping.runs_file": "gs://broad-gotc-test-storage/tmp/jbx/runs.conservative.bed", | ||
"UltimaGenomicsJointGenotyping.annotation_intervals": ["gs://broad-gotc-test-storage/tmp/jbx/LCR-hs38.bed", "gs://broad-gotc-test-storage/tmp/jbx/mappability.0.bed", "gs://broad-gotc-test-storage/tmp/jbx/exome.twist.bed"], | ||
"UltimaGenomicsJointGenotyping.use_allele_specific_annotations": true, |
Should you be using AS_ annotations above?
Incidentally, I wonder if we should fail if AS annotations are specified but the names do not start with AS_? This seems a bit restrictive, since I don't know that this naming convention is actually documented or enforced.
Also, should we go back and add this argument to the canonical WDL? What will be the strategy for keeping canonical WDLs in the GATK repo in sync with the ones here?
Good catch, we should be using AS_ annotations. I think failing on non-AS_ annotations is too restrictive, especially if we want to use a combination of AS_ and non-AS_ annotations (which should be OK, I think). Maybe a warning, but honestly I probably wouldn't have noticed a warning in this case either.
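As a sketch of the warning idea being discussed: the flag and annotation list below are made up for illustration, and this is not the pipeline's actual validation logic, just one way a task could flag names that lack the AS_ prefix when allele-specific mode is requested.

```shell
# Hedged sketch only: warn (rather than fail) on annotation names that do
# not start with AS_ when allele-specific annotations are requested.
# The annotation list below is hypothetical.
use_allele_specific_annotations=true
annotations="AS_FS AS_ReadPosRankSum TREE_SCORE"

warnings=""
if [ "$use_allele_specific_annotations" = true ]; then
  for ann in $annotations; do
    case "$ann" in
      AS_*) ;;  # matches the allele-specific naming convention
      *)
        warnings="$warnings $ann"
        echo "WARNING: annotation $ann does not start with AS_" >&2
        ;;
    esac
  done
fi
```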
For keeping the WDLs in sync: That is an excellent question. @jessicaway A version of this WDL also currently lives in GATK. It's good to have it in GATK so that it can be tested as GATK tools change rather than waiting for a GATK release to test it, but we also want it available as a production pipeline. Do you have any ideas on what the process should be for where these WDLs belong and/or how they should be tested?
I am looking back at a previous Slack conversation with @kbergin on this subject, and our conclusion was to put it in GATK first and then "cross that bridge when we come to it". I think we're at the bridge now though :)
@meganshand this is a very good question and something we have struggled with in the past. There is definitely a need for integration testing in our pipelines as GATK is updated. We often find that things look different when we go to update versions of GATK (or Picard, for that matter), and it is a painful process to capture all the changes and track down which commit caused the difference(s).
That being said, keeping the WDLs in 2 locations that do not sync automatically will almost certainly not be a sustainable solution.
I have not spent much time looking into a solution, but off the top of my head, my first idea is to add a test that runs on GATK PRs (maybe PRs to master?) that checks out WARP and runs the pipeline test with the new code. That is not as straightforward as it sounds since we would need to do work to change out the docker used in each task to point to the new version of GATK.
We are definitely open to ideas and suggestions for how to make this better :)
I think this is an ideal solution, but it would probably take some effort to set up in GATK. I think that's worth doing, but I'm also not sure what to do in the meantime with something like this Ultima JG WDL. To complicate the matter, it's only one subworkflow of the whole pipeline that currently resides in GATK.
Is there a way to have WARP pull the pipeline from GATK in the short term before we get tests set up in GATK that check out WARP? Or is that an overly complicated solution to a temporary problem (although to be fair, I'm not sure how temporary)?
It might be worth talking this through off of this PR with all relevant parties.
...a_seq/germline/joint_genotyping/UltimaGenomics/test_inputs/Scientific/scientific.inputs.json
conda activate genomics.py3
bcftools view -s ~{sample_name} ~{input_vcf} -o tmp.vcf.gz -O z
bcftools index -t tmp.vcf.gz
bcftools view -h tmp.vcf.gz | sed 's/COMBINED_TREE_SCORE,Number=A,/COMBINED_TREE_SCORE,Number=\.,/g' > hdr.fixed.txt
Parts of this WDL seem like they could be generically useful, no? It might be worth thinking about exposing arguments needed in this Ultima-specific use case and moving towards generic tasks/sub-WDLs at some point, if not now.
Thank you for commenting on this, because I realized that this line is outdated and no longer needed. I made the task more generic, but I'm not sure if you're suggesting that it live somewhere more generic as well? Should I move it to Utilities.wdl? Every time I touch anything in Utilities, I have to update the versions of every pipeline that uses it, which seems like a minor drawback.
--hpol_filter_length_dist 12 10 \
--input_prefix $(echo "~{input_vcf}" | sed 's/\(.vcf.gz\|.vcf\)$//') \
--output_file ~{input_vcf_name}.comp.h5 \
--gtr_vcf ~{gtr_vcf} \
Hmm OK, I guess gtr appears here.
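As a side note on the --input_prefix line above: the sed expression just strips a trailing .vcf or .vcf.gz to recover the file prefix. A standalone check (the helper name is made up for illustration):

```shell
# Hypothetical helper illustrating the suffix-strip from the quoted command.
# Note that \| alternation in sed's basic regex is a GNU extension.
strip_vcf_suffix() {
  echo "$1" | sed 's/\(.vcf.gz\|.vcf\)$//'
}

strip_vcf_suffix "sample1.vcf.gz"    # prints: sample1
strip_vcf_suffix "dir/sample2.vcf"   # prints: dir/sample2
```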
So sorry for the delayed response here. I believe I've addressed the comments on this PR and tried to fix all the tests. The outstanding issues that I need some help with are:
Looks good! A couple of minor comments/questions.
Array[String] annotations_to_divide = ["ASSEMBLED_HAPS", "FILTERED_HAPS", "TREE_SCORE"]
String docker = "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:UG_feature_branch_v4"
I assume this code is not in an official GATK release yet, right?
It is as of this week! However, this is the docker we've been using in all of the Ultima pipelines. It seems like it might be better to update them all at once in a separate PR? Happy to hear if you think I should update it here first.
String output_vcf_name = "~{base_file_name}" + ".vcf.gz"
command <<<
set -e
source ~/.bashrc
Just wondering why this line is necessary?
Huh, the .bashrc sets up conda, so I think it's needed for conda activate genomics.py3, but this task (and some of the others) don't use python, so I think it can be removed. I'll clean this up.
Actually, I was wrong: in this docker, gatk and bcftools are managed by conda. I'll add a comment in the code.
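For context on why the source ~/.bashrc line matters at all, here is a small self-contained demonstration using a fake rc file rather than conda: non-interactive shells (like a WDL task's command block) do not read ~/.bashrc, so anything it defines, such as conda's shell hooks, is invisible until the file is sourced explicitly.

```shell
# Stand-in for ~/.bashrc; in the real image this is where conda is set up.
echo 'MARKER=from_bashrc' > /tmp/fake_bashrc

# A non-interactive shell does not read rc files, so MARKER is unset...
bash -c 'echo "before: ${MARKER:-unset}"'                  # prints: before: unset
# ...until the rc file is sourced explicitly, as the task does for conda.
bash -c 'source /tmp/fake_bashrc; echo "after: $MARKER"'   # prints: after: from_bashrc
```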
(force-pushed from fd6aeb6 to 73a9780)
Retest this please
* update to new docker image
* Np get dev in sync with staging (#838)
* make dev and staging in sync
* version
* Np ultima cramonly test (#823)
* add validation and test wdl for cram only ug pipeline
* add testing wdl
* add to pipelineTestType
* add to changed workflows
* renaming
* update changed pipeline script syntax
* update changed pipeline script syntax
* add check to make sure bams are null and fail if they are not null
* Fix for large Ultima samples in Single Sample pipeline (#835)
* folding optional output bam into MarkDuplicates
* updating somatic test paths
* updating changelogs
* updating versions in wdl
* missed a few
* increase version for external whole genome reprocessing
Co-authored-by: jessicaway <jway@broadinstitute.org>
* add in load_tag fields
* update change logs for arrays and imputation internal
* update all other change logs
* update validate chip
* Force fragment size histogram file to exist in rnaseqc output (#840)
* force fragmentSizes histogram file to exist
* Adding filtering to Ultima Joint Genotyping pipeline (#804)
* Adding filtering to Ultima Joint Genotyping pipeline
Co-authored-by: jessicaway <jway@broadinstitute.org>
* Adding scientific test for CramOnly Ultima pipeline and adding it to dockstore (#844)
* GL-702: Update references to data stored in HCA buckets (#836)
* Update references to data stored in HCA buckets to copies made to broad-gotc-test-storage
* GL-1988: Update references to files in broad-gotc-test-storage/tmp/jbx (#848)
* Update references to files in broad-gotc-test-storage/tmp/jbx to new locations
* update docker images for internal arrays and internal imputation
* add lab_batch into internal arrays
* remove lab_batch
* Update Optimus Overview (#852) Optimus Overview now matches dev version
* Lk ss2 fix broken link (#853) Fixed broken link to python script.
* remove workspace bucket as input to ingestoutputstoTDR
* remove workspace bucket from wide table ingest
* recommit to refresh dockstore
* update versions for broad internal arrays and imputation and changelogs
* update changelogs
* Allow nested inputs in the imputation pipeline (#857)
* allow nested inputs in the imputation pipeline
* need to copy over pipeline text metrics to truth bucket (#856)
Co-authored-by: Sushma Chaluvadi <schaluva@broadinstitute.org>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: jessicaway <jway@broadinstitute.org>
Co-authored-by: Sushma Chaluvadi <43828038+schaluva@users.noreply.github.com>
Co-authored-by: kachulis <39926576+kachulis@users.noreply.github.com>
Co-authored-by: Tim Wang <31369366+timaeusx@users.noreply.github.com>
Co-authored-by: ekiernan <55763654+ekiernan@users.noreply.github.com>
Here is a draft of the Ultima Joint Genotyping pipeline that adds filtering. This needs a release of GATK to remove some test docker tags before being merged. I'm hoping for some initial review before the next GATK release though.
Tests will also currently fail since the expected outputs need to be updated.