Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored JointVcfFiltering WDL and expanded tests. #8074

Merged
merged 10 commits into from Dec 14, 2022

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented Oct 26, 2022

Adds a WDL that replaces the "serial" SnpThenIndel joint filtering workflow added in #7932. This simplified replacement only runs one iteration of the extract-train-score toolchain, rather than running one iteration for SNPs followed by another for INDELs. The original SnpThenIndel workflow (used for Ultima) will be updated and moved to the WARP repo. (EDIT: I was originally confused here, the WDL that was replaced in this PR simply ran SNPs and indels separately, rather than serially. Curious that things still tied out, but I’m not sure it’s worth looking into at this point.)

Test files have also been subset to chr21-22 and slimmed down. A test for the positive-negative was also added, as well as tests of an empty shard.

The first commit contains the original workflow (JointVcfFilteringOriginal.wdl), as well as a reimplementation (JointVcfFilteringSnpThenIndel.wdl) that calls the simplified workflow (JointVcfFiltering.wdl). I've verified that both the original and reimplemented SnpThenIndel workflows tie out on the original test data.

The second commit then removes the original and the reimplementation, leaving only the simplified workflow. It may thus be easier to review the first commit, second commit, or the overall changes, depending on what you are looking at.

@meganshand can you take a look and let me know if there's any missing functionality, or if this otherwise won't work for Ultima and/or importing in WARP? Apologies for the delay!

@samuelklee samuelklee marked this pull request as draft October 26, 2022 13:07
@codecov
Copy link

codecov bot commented Oct 27, 2022

Codecov Report

Merging #8074 (41abfd6) into master (c5bccee) will increase coverage by 0.051%.
The diff coverage is 69.697%.

Additional details and impacted files
@@               Coverage Diff               @@
##              master     #8074       +/-   ##
===============================================
+ Coverage     86.593%   86.644%   +0.051%     
- Complexity     38899     38963       +64     
===============================================
  Files           2336      2336               
  Lines         182709    182730       +21     
  Branches       20060     20066        +6     
===============================================
+ Hits          158213    158325      +112     
+ Misses         17441     17365       -76     
+ Partials        7055      7040       -15     
Impacted Files Coverage Δ
.../scalable/data/LabeledVariantAnnotationsDatum.java 68.421% <45.455%> (-3.801%) ⬇️
...vqsr/scalable/LabeledVariantAnnotationsWalker.java 86.822% <46.154%> (+0.208%) ⬆️
...rs/vqsr/scalable/TrainVariantAnnotationsModel.java 77.778% <66.667%> (-2.991%) ⬇️
...lkers/vqsr/scalable/ExtractVariantAnnotations.java 92.188% <100.000%> (+1.116%) ⬆️
...walkers/vqsr/scalable/ScoreVariantAnnotations.java 76.250% <100.000%> (+0.149%) ⬆️
...r/scalable/data/LabeledVariantAnnotationsData.java 75.510% <100.000%> (+1.283%) ⬆️
.../hellbender/utils/genotyper/AlleleLikelihoods.java 84.201% <0.000%> (+0.186%) ⬆️
...lbender/utils/variant/GATKVariantContextUtils.java 88.242% <0.000%> (+0.224%) ⬆️
...llbender/tools/spark/sv/evidence/ReadMetadata.java 88.480% <0.000%> (+0.245%) ⬆️
... and 16 more

@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch from 1bd0d65 to 96b2d67 Compare October 31, 2022 19:53
@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch 4 times, most recently from 305781d to 929ab9a Compare November 1, 2022 11:12
@gatk-bot

This comment was marked as outdated.

@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch 4 times, most recently from 51dd3d8 to 4023009 Compare November 8, 2022 03:06
@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch from cff6900 to fefded5 Compare November 9, 2022 02:09
@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch 2 times, most recently from 21a93d4 to 9f5ab4f Compare November 21, 2022 22:39
@gatk-bot

This comment was marked as outdated.

@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch from 9f5ab4f to 9bfc359 Compare November 21, 2022 22:53
@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@samuelklee
Copy link
Contributor Author

I've added a few commits that clean up some of the code inherited from VQSR regarding the use of labeled resources when using allele-specific annotations. This should be ready for review and/or experimentation with importing into the WARP repo, @meganshand. There are a few unrelated failing tests, which I think others are seeing in their branches as well.

@meganshand meganshand self-assigned this Nov 30, 2022
Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My test case WDL ran! I just have a few comments about usability in the WDL.

# will output the same number of shards that are input.
# This portion of the filtering pipeline will assign a SCORE INFO field annotation to each site, but does not yet apply
# the filtering threshold to the final VCF.
# Workflow for scoring and optionally filtering a VCF based on site-level annotations using the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super clear from the WDL how to get a filtered VCF rather than just a scored VCF (I think you need to go to the tool doc to see the optional parameter that you'd add to extra_args). This feels important enough to elevate to its own argument rather than just being folded into the extra_args so it's clear how to get this feature straight from the WDL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I feel the same way about the allele specific argument. These feel important enough to the core of the tool to call out explicitly in the WDL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we could make allele-specific mode automatic---we just need to detect whether any annotations have Number=A. So this would get rid of the corresponding parameter. Added a TODO to the meta issue for this.

As for hard filtering, the fact that we need to specify this via extra_args is alluded to in the parameter meta documentation. I agree that it would be nice to elevate certain parameters, but then this raises the question of why those parameters were singled out. I think it's better from a development perspective to treat everything as equitably as possible----i.e., elevate everything or elevate nothing.

Currently, we do "elevate" resource_args, since this is shared by the extract and score tools and we want to ensure the same values are passed. We also elevate any file arguments that might need to be localized (e.g., python_ script, which in turn suggests that we elevate the corresponding model_backend argument). But then everything else is relegated to extra_args. This at least has the benefits of giving a more generic interface (in that using extra_args provides maximal freedom to change arguments in the future, etc.).

As for whether we should replicate documentation for e.g. hard filtering here----in a perfect world, yes, but in a world where replicated documentation quickly falls out of sync, I'd rather refer users to the original documentation. Hopefully reading those docs and looking at example JSONs where hard filtering is turned on should make it clear how this works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we shouldn't be replicating documentation here, but I disagree that if we elevate some inputs we must elevate them all. But perhaps for this use case of an example WDL nothing needs to be elevated. We can leave that for the WDLs that will run a specific use case.

docker: gatk_docker
cpu: select_first([runtime_attributes.cpu, 1])
memory: select_first([runtime_attributes.machine_mem_gb, 7]) + " GB"
disks: "local-disk " + select_first([runtime_attributes.disk_size_gb, 100]) + if select_first([runtime_attributes.use_ssd, false]) then " SSD" else " HDD"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the default disk sizes be generated dynamically from the size of the inputs instead of hardcoded? This could be done for all of the tasks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started an attempt at this, but a lot of the big files (e.g., the input VCFs) are only optionally localized, and I'm not sure how best to use size in these cases. I think it's also likely that e.g. large annotation files will cause memory issues before they cause disk issues, and for these and other output files, there's not a good way to estimate their resulting disk size. Will file something in the meta issue and punt for now.

runtime {
docker: gatk_docker
cpu: select_first([runtime_attributes.cpu, 1])
memory: select_first([runtime_attributes.machine_mem_gb, 7]) + " GB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth the default memory requirements were too small for both the Extract and the Train tasks for a callset of ~200 samples. Not sure if that means it's worth it to increase the defaults or not for this example WDL.

However, as we discussed in slack, I would personally vote for making the command_mem_gb related to machine_mem_gb (machine_mem_gb = command_mem_gb +1). It seems much simpler for the user to only have to worry about one of those arguments. If you're concerned that they should be untied due to needing extra memory on the machine outside of the java options, then maybe add a additional_machine_mem_gb or something similar (machine_mem_gb = command_mem_gb + additional_machine_mem_gb)? That way you'd still only have two memory arguments (command_mem_gb and additional_machine_mem_gb), but in most use cases the user only needs to update the command_mem_gb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with command_mem_gb and additional_mem_gb, thanks for the suggestion! I just left the default values as is, though---they're just meant to be an example.

@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch from 9bfc359 to 2d67bfb Compare December 14, 2022 13:31
@samuelklee
Copy link
Contributor Author

Thanks for the review @meganshand, and sorry for the delayed response. I addressed one of your comments and punted/semi-punted on the others. Tests should also be passing after a rebase.

Unless you have strong objections, I'll go ahead and merge---want to get this in before the break!

@samuelklee samuelklee marked this pull request as ready for review December 14, 2022 14:15
@samuelklee samuelklee force-pushed the sl_simple_joint_vcf_filtering_wdl branch from 6663c33 to 41abfd6 Compare December 14, 2022 14:17
@meganshand
Copy link
Contributor

@samuelklee Sounds good I think you can merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants