VariantFiltration not filtering correctly #5362

Closed
sooheelee opened this Issue Oct 25, 2018 · 10 comments

@sooheelee
Contributor

commented Oct 25, 2018

We need to help users help themselves either with better checks in tools or with better documentation to avoid the discrepancies observed in this thread, whose answer is recapitulated below.


Hi @obigriffith,

I am using GATK v4.0.11.0 and I also see what you are seeing. I've been taking an Android App development course since January (in my free time of course), and I've learned that with multiple expressions, sometimes the Java programming language needs help in parsing expressions. That is, we need to help the tool demarcate where an expression begins and ends.

1. No filtering is expected and none occurs (but this is misleading)

--filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRandSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0"

2. should be filtered based on SOR (at 0.608) but is not

--filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRandSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 0.5" 

3. Using parentheses around each expression allows SOR (and presumably other expressions) to be read correctly

--filter-expression "(QD < 2.0) || (FS > 60.0) || (MQ < 40.0) || (MQRankSum < -12.5) || (ReadPosRankSum < -8.0) || (SOR > 0.5)"

This will allow the tool to read the SOR expression unambiguously. Here are results from my testing:

4. Providing each expression as a separate parameter also allows SOR (and others) to be read correctly and also provides additional insight
Separate out each condition into individual filter expressions:

--filter-expression "QD < 2.0" --filter-name "QDlessthan2" \
--filter-expression "FS > 60.0" --filter-name "FSgreaterthan60" \
--filter-expression "MQ < 90.0" --filter-name "MQlessthan90" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSumlessthannegative12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSumlessthannegative8" \
--filter-expression "SOR > 0.5" --filter-name "SORgreaterthan0.5"

This gives you additional resolution into what was the condition that triggered the filtering. Here are the results from my testing:

So be sure to either use parentheses around each expression or to express conditions independently.

This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/comment/53310#Comment_53310

@sooheelee

Contributor Author

commented Oct 25, 2018

Assigning to @droazen to consider and delegate. Tagging @ldgauthier in case she is interested.

The tool should throw an exception if the filtering expression cannot be parsed unambiguously, instead of running the half-baked expression without any warning.

@sooheelee

Contributor Author

commented Oct 25, 2018

To be clear, when I learned to use VariantFiltration 1.5 years ago, I was told to use the && and || operators and no other formatting. This may have been all right in GATK3 (I did not check), but it seems that in GATK4 it is not.

@sooheelee self-assigned this Oct 25, 2018

@sooheelee

Contributor Author

commented Oct 25, 2018

Also assigning myself in case what we want is updated documentation only.

@sooheelee assigned cmnbroad and unassigned droazen Oct 29, 2018

@sooheelee

Contributor Author

commented Oct 29, 2018

Assigning @cmnbroad per his request.

@sooheelee

Contributor Author

commented Nov 2, 2018

@vdauwera The researcher who initially pointed out this issue says that many forum docs showcase the buggy usage example, and suggests we clarify the non-buggy usage.

Unassigning myself given my focus is workshop and gCNV tutorial writing.

@sooheelee assigned vdauwera and unassigned sooheelee Nov 2, 2018

@cmnbroad

Collaborator

commented Nov 5, 2018

It looks like in each of the cases that are failing, the filter expression references an attribute that isn't present in the variant, which the tool treats as "PASS" by default (this can be toggled using --missing-values-evaluate-as-failing true, in which case variants with missing values will be filtered). The parentheses are confounding but not relevant - the failing case above that uses no parens references MQRandSum (sic); the succeeding case references MQRankSum.

Separating the expression components into separate filter expression arguments does change the results because each expression is evaluated individually and serially, so a reference to a missing value in one expression results in a PASS, but the subsequent expression results in the filter being applied if it meets the criteria.

We should certainly update the doc to reinforce these subtleties. Another option would be to have a stricter "throw on missing" option as the default, but that could lead to warning fatigue. Any other suggestions?

@obigriffith


commented Nov 5, 2018

@cmnbroad - that was not the issue I reported. I was having cases of variants that did have SOR or DP values, but when I applied an SOR or DP filter, that filter was not being applied correctly. This issue did go away when I separated the conditions into individual filter expressions, whether or not I enclosed them in parentheses.

@cmnbroad

Collaborator

commented Nov 5, 2018

@obigriffith Right, but the expression you cited did reference attribute(s) that don't exist in the variants you cited as not being filtered (I'm not saying it's intuitive, just that this explains what's happening).

i.e. I think you cited this variant:

chr6 54228213 . A G 412.77 PASS AC=2;AF=1.00;AN=2;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=52.55;QD=28.71;SOR=3.442 GT:AD:DP:GQ:PL 1/1:0,11:11:33:441,33,0

as not being filtered when used with this expression:

"QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0"

Even though this has an SOR value that meets the filter criteria, the expression is short-circuited when applied to that variant because it has no MQRankSum attribute. This results in a PASS.
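The behavior described above can be illustrated with a small Python model. This is only a sketch of the semantics as explained in this thread, not GATK's actual implementation: the function names and the `missing_as_failing` parameter (standing in for `--missing-values-evaluate-as-failing`) are made up for illustration, and the model assumes terms are evaluated left to right with short-circuiting.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

# INFO values of the variant cited above (chr6:54228213).
# MQRankSum and ReadPosRankSum are absent from this record.
INFO = {"QD": 28.71, "FS": 0.0, "MQ": 52.55, "SOR": 3.442, "DP": 12}

# The compound expression, one (attribute, operator, threshold) tuple per OR term.
TERMS = [
    ("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0), ("SOR", ">", 3.0),
]

def term_matches(info, term):
    """Evaluate one term; raises KeyError if the attribute is missing."""
    name, op, threshold = term
    return OPS[op](info[name], threshold)

def compound_or_filters(info, terms, missing_as_failing=False):
    """Model of ONE filter expression joined by ||.

    Assumption: hitting a missing attribute aborts the whole expression,
    and the outcome is then decided by missing_as_failing alone.
    """
    try:
        return any(term_matches(info, t) for t in terms)
    except KeyError:
        return missing_as_failing

def separate_filters(info, terms, missing_as_failing=False):
    """Model of one filter expression PER term, each evaluated on its own."""
    return ["%s%s%s" % t for t in terms
            if compound_or_filters(info, [t], missing_as_failing)]

print(compound_or_filters(INFO, TERMS))        # compound expression, default handling
print(compound_or_filters(INFO, TERMS, True))  # compound expression, missing counts as failing
print(separate_filters(INFO, TERMS))           # one expression per condition
```

In this model the compound expression never reaches the SOR term because MQRankSum is missing, so the variant is left as PASS, while the per-term version still applies the SOR filter.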

If you have a counter example, or I'm missing something, please do let me know.

@obigriffith


commented Nov 5, 2018

Ah! I see. This is indeed an unintuitive behavior, especially in the case where a user is providing a compound expression with a series of filters separated by the OR operator. The user is expecting that any variant that meets any of the filter criteria will be marked for filtering. If I understand correctly, a variant could fail 4/5 filters, be NULL for 1/5 filter expressions and end up with a PASS status! This is rarely the desired behavior. In this use case, I might want an --ignore-missing-values option. I don't necessarily want to fail all variants just because they have a missing value for a feature. But I also don't want them to be evaluated as PASS if they fail filters for which they do have values. I guess a partial solution is to strongly encourage separate filter expressions for this, the most common use case: applying several filters combined with OR.

I guess if you have AND operators in the mix, it gets more complicated. If I want to filter a variant only if it fails filter_A AND filter_B it is less clear what the right behavior is when feature A or B is NULL. I guess it should pass?
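One way to make the question about OR, AND, and NULL concrete is three-valued (Kleene) logic, where a term is `True`, `False`, or `None` for a missing value. The sketch below is purely hypothetical: no `--ignore-missing-values` option is described anywhere in this thread as existing, and the function names are invented for illustration.

```python
def kleene_or(values):
    """OR over True/False/None: any True wins; otherwise None if anything is missing."""
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False

def kleene_and(values):
    """AND over True/False/None: any False wins; otherwise None if anything is missing."""
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True

def is_filtered(result):
    """Hypothetical policy: filter only when the expression is definitely True."""
    return result is True

# OR case: one term fails its threshold (True) and another is missing (None).
print(is_filtered(kleene_or([False, None, True])))

# AND case: filter_A matches but feature B is missing.
print(is_filtered(kleene_and([True, None])))
```

Under this policy a missing value can never hide a term that did match in an OR chain, and an AND with a missing operand stays undecided and therefore passes, which matches the guess above.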

I don't have enough understanding of why these values are missing. I'm using a pretty standard workflow (GATK HaplotypeCaller) to get these variants and I guess I naively assumed they would be complete for these features.

@ldgauthier

Contributor

commented Nov 6, 2018

Rank sum annotations are often missing for common variants in small cohorts. The rank sum statistic compares observations from the reference reads with those from the alternate reads, so if all the samples are homVar and there are no reference reads, the statistic can't be computed. See https://gatkforums.broadinstitute.org/gatk/discussion/4732/statistical-methods-used-by-gatk-tools for more detail.
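To see why the annotation ends up missing, here is a generic rank-sum z-score in Python. It is a textbook normal approximation without tie correction, not GATK's implementation, and the function name `rank_sum_z` is invented for illustration. The point is only that the statistic is undefined when one of the two groups of reads is empty.

```python
import math

def rank_sum_z(ref_values, alt_values):
    """Approximate rank-sum z-score comparing alt reads against ref reads.

    Returns None when either group is empty, because there is nothing
    to compare against (the homVar case with no reference reads).
    """
    n_ref, n_alt = len(ref_values), len(alt_values)
    if n_ref == 0 or n_alt == 0:
        return None
    # U statistic: count (alt, ref) pairs where alt ranks higher; ties count half.
    u = sum(1.0 if a > r else 0.5 if a == r else 0.0
            for a in alt_values for r in ref_values)
    mean_u = n_ref * n_alt / 2.0
    sd_u = math.sqrt(n_ref * n_alt * (n_ref + n_alt + 1) / 12.0)
    return (u - mean_u) / sd_u

# Site where every read supports the alternate allele: no annotation possible.
print(rank_sum_z([], [60, 60, 60]))

# Alt reads with lower mapping quality than ref reads give a negative score.
print(rank_sum_z([60, 60, 60], [20, 20, 20]))
```

A record like the one in the first call simply carries no MQRankSum or ReadPosRankSum value, which is what makes a compound expression referencing them evaluate as missing.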
