
MMQS filter is overly strict in fp_filter #972

Open · chrisamiller opened this issue Dec 2, 2020 · 1 comment

@chrisamiller (Collaborator):

Occasionally, SNVs near each other are removed inappropriately when together they push the MMQS value over the threshold. The filter does still remove plenty of "junk", but we may want to consider replacing it or tweaking its parameters.

@chrisamiller (Collaborator, Author) commented Aug 23, 2021

Copying old notes here for reference:

Problem: There are two point mutations. p.Q61 (chr12:25227341 T>G) is filtered as DOCM_ONLY; p.R68 (chr12:25227322 T>G) does not appear in the VCF outputs at all. VAFs are high, and both variants are mostly supported by the same reads.

[Screenshot: Screen Shot 2020-09-03 at 1 00 06 PM]

Read depth there is 83 ref, 34 var, for a VAF of 29%, but the site is called by DoCM only. The Mutect, Strelka, and VarScan calls were all filtered out with MMQSD50: "Difference in average mismatch quality sum between variant and reference supporting reads is greater than 50".

Zooming out confirms those reads carry no mutations beyond these two. Mutect also dropped them before the FP filter as "clustered events".

These are the bam-readcount values for that site:
MMQS ref (T) = 2.99
MMQS var (G) = 63.29


Here's a summary of the mismatch quality sum (MMQS) filter that runs as part of the false-positive filter.
It calculates the average MMQS for reads supporting the variant and reads supporting the reference, and if `$var_mmqs - $ref_mmqs > 50`, tosses the site.

You can read the definition of MMQS from Travis here: https://www.biostars.org/p/69910/#70336
Ignoring the adjacent-base stuff, because it doesn't apply here, the basic idea is: for each read, take the mismatched bases, and sum up their quality scores. Then, average that across all the reads supporting the variant, and those supporting the reference.

This is meant to remove mismappings due to paralogous sequence, which are a real problem. They result in mappings that look a lot like this: 2-10 high-quality mismatched bases present in lots of reads. Unfortunately, that's also what real phased events look like, and it's possible that with the advent of longer reads and higher quality scores, the MMQS threshold of 50 is too low.
I'm going to take a pass through this case's VCFs and look at some sites excluded only by this filter, to see if altering it might make sense.


Small sample size, but there are only two other sites in this sample removed solely by this filter, and they are both clearly garbage.

[Screenshot: Screen Shot 2020-09-03 at 3 09 50 PM]

[Screenshot: Screen Shot 2020-09-03 at 3 14 45 PM]

The two bad sites in this sample have MMQS differences of 95.47 and 78.4, so not miles above the 60.3 of the good KRAS site.

Okay, I've now reviewed 4 samples, each of which has only a fairly small number of sites (3-9 variants) that fail only this filter. For each such site, I calculated the mmqs_diff score and manually reviewed it to determine a) whether it was a good site and b) whether any of the later filters (mapq, llr) would have caught it. These are the results:

mmqs_diff   outcome (pass = real site; fail = garbage; mapq/llr = would be caught by that later filter)
95.47   fail
80.14   mapq
79.28   fail
78.40   fail
76.19   fail
76.16   mapq
76.05   fail
75.11   fail
70.72   fail - hla-region
69.84   llr
68.69   mapq
62.88   mapq
60.31   fail
60.30   pass  (the KRAS site)
56.51   mapq
55.32   fail
54.95   mapq
54.87   mapq
52.73   pass
52.55   fail
52.43   fail
51.01   fail
47.06   fail
44.54   llr

I did find one additional site that appeared real, so that's 2/24 = 8% false negatives for this filter, based on this limited sample size. Another 9 sites would have been caught by a subsequent filter, but dropping the MMQS filter entirely would still let 13 false positives through across these 4 samples. Not awful, but not great either.


Two common patterns in the sites:

  1. the clear paralogs - just mismapped reads

[Screenshot: Screen Shot 2020-09-03 at 4 37 22 PM]

[Screenshot: Screen Shot 2020-09-03 at 3 09 50 PM (1)]

  2. garbage regions with lots of errors

[Screenshot: Screen Shot 2020-09-03 at 3 14 45 PM (1)]

I feel like #2 could maybe be caught by messing with some of the other read/base quality params, but I don't have a clear idea of what to require to weed out #1 without also removing sites like the KRAS one above.

Open to ideas on how to alter this filter to rescue these sites without losing the specificity bump from using this filter.
