Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. #8049

Merged
merged 1 commit into from Oct 11, 2022

Conversation

samuelklee
Copy link
Contributor

Ticks off a few straggler issues noted in #7724.

@meganshand mind reviewing? Hopefully should be quick and we can get it in before @droazen cuts the next release. Note that this shouldn't change behavior in the Ultima pipeline, as the default toggle is still the same start-position resource-matching strategy inherited from VQSR, but we might want to explore the effect of choosing another strategy there.

…eous minor fixes to new annotation-based filtering tools.
@@ -46,7 +48,7 @@
* walker, performing the operations:
*
* - nthPassApply(n = 0)
* - if variant/alleles pass filters and variant-type/overlapping-resource checks, then:
* - if variant/alleles pass filters and variant-type/resource-match checks, then:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I just changed "overlapping" to "resource-matching" everywhere to distinguish what we're doing here from straightforward genomic overlapping.

@@ -283,7 +304,9 @@ VCFHeader constructVCFHeader(final List<String> sortedLabels) {
.collect(Collectors.toCollection(TreeSet::new));
hInfo.add(GATKVCFHeaderLines.getFilterLine(VCFConstants.PASSES_FILTERS_v4));
final SAMSequenceDictionary sequenceDictionary = getBestAvailableSequenceDictionary();
hInfo = VcfUtils.updateHeaderContigLines(hInfo, null, sequenceDictionary, true);
if (sequenceDictionary != null) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This avoids a complaint raised if the VCF is missing contig lines in the header.

switch (resourceMatchingStrategy) {
case START_POSITION:
return true;
case START_POSITION_AND_GIVEN_REPRESENTATION:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if these strategies add any noticeable overhead if there are a lot of multiallelics, but I haven't noticed anything out of the ordinary on my runs so far.

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if you could add some tests for this (even a unit test of isMatchingVariant), but you can make that in a separate PR if you're in a hurry to get this in now.

Even START_POSITION_AND_MINIMAL_RERPRESENTATION is not as thorough as RTG VCFEval, right? That does a more complete adjustment to resolve different representations? It would be nice to use their code somehow if it makes a big enough difference. Maybe that's easy enough to achieve through pipelining. Sorry, these are musings not requests. 👍

@samuelklee
Copy link
Contributor Author

Thanks for the lightning-quick review! And good thought to add tests---probably just adding them to the current suite of exact-match tests would hopefully suffice, since we rely on existing/library methods (which presumably already have correctness tests) to do the matching. I'll add it to the straggler issues 😆

And yes, I don't think the matching here is as sophisticated as that done by VCFEval, but it's probably good enough for the purposes of identifying training variants. I am actually curious how often the start-position strategy gets us into trouble (e.g., we hit an artifact at a multiallelic site, or annotations at multiallelic sites are somehow distributed differently even if all alleles are real, etc.)

@samuelklee samuelklee merged commit fd78250 into master Oct 11, 2022
@samuelklee samuelklee deleted the sl_lite_overlap branch October 11, 2022 18:39
rsasch pushed a commit that referenced this pull request Oct 17, 2022
…eous minor fixes to new annotation-based filtering tools. (#8049)
koncheto-broad added a commit that referenced this pull request Feb 2, 2023
* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>

* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049)

* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)

* Small changes to JointVCFFiltering WDL

* making default for use_allele_specific_annotations

* addressing comments

* first stab

* wire through WDL changes

* fixed typo

* set model_backend input value

* add gatk_override to JointVcfFiltering call

* typo in indel_annotations

* make model_backend optional

* tabs and spaces

* make all model_backends optional

* use gatk 4.3.0

* no point in changing the table names as this is a POC

* adding new branch to dockstore

* adding in branching logic for classic VQSR vs VQSR-Lite

* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files

* passing classic flag to indel run of CreateFilteringFiles

* Update GvsCreateFilterSet.wdl

cleaning up verbiage

* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)

* cleanup up sloppy comment

---------

Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Rebecca Asch <rasch@broadinstitute.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants