Mutect2 read depth estimates #3808
Comments
@jhl667 The ref/alt coverage in an M2 vcf may differ from that of IGV for the following reasons:
Do you know which, if any, of these causes you had in mind? Could you provide an example?
Example:
Also called in a different variant caller:
Yes, M2 also calls the neighboring C>G substitution; the two callers just represent these differently. You can see there is a discrepancy between the counts: 741 in the M2 call and almost double that (1327) in the other call. Looking directly at the aligned reads, I can see 1346 reads overlapping this location. Removing those with MAPQ<20 leaves about 1302 reads. Again, there are no reads shorter than 30bp. Looking at the distribution of start site counts, there is a spike at the beginning of the region, as we would expect. The largest number of reads tied to a single start site in this region is 343, so there should not be a downsampling effect. I don't think read position is an issue here, though I haven't fully quantified it. I can see this region in IGV and it looks clean. These examples are very easy to find in our data. The calls themselves seem really good; I am just trying to figure out how to deal with count estimation. Right now our solution is to use multiple variant callers. Thanks for looking at this, and please let me know if I can provide anything else.
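The start-site check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the reads are toy `(alignment_start, MAPQ)` pairs rather than data from this thread, the helper name `start_site_counts` is hypothetical, and whatever per-start cap a caller applies has to be looked up separately.

```python
from collections import Counter

def start_site_counts(reads, min_mapq=20):
    """Count reads per alignment start, keeping reads with MAPQ >= min_mapq."""
    return Counter(start for start, mapq in reads if mapq >= min_mapq)

# Toy data: (alignment_start, MAPQ) pairs, not taken from the thread.
reads = [(100, 60)] * 5 + [(100, 10)] * 2 + [(130, 60)] * 3 + [(131, 40)]

counts = start_site_counts(reads)
# The most crowded start site and how many reads share it.
busiest_start, busiest_count = counts.most_common(1)[0]
print(busiest_start, busiest_count)
```

If the largest per-start count stays below the caller's downsampling cap, downsampling by start position should not explain a depth discrepancy.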
Do the missing reads have deletions spanning the locus? There was a bug we recently patched that might be related to this (#3830). Can you try again with the latest GATK master branch?
@jhl667 That's definitely worth trying. Otherwise, there's nothing obvious from the vcf. If you make a mini bam around that call, eg
I could step through it in the IDE and hopefully figure things out.
@droazen Nope, no deletions spanning the locus. Also, this is not an isolated instance of this behavior. @davidbenjamin Excellent. I have created a headerless SAM around the region 1:12919587; hopefully this will work for you. Sorry this region differs from the one above, I completely neglected to track the sample name. Coverage according to the alternate variant caller is ~1320, while M2 lists ~820.
@jhl667 Thanks for the example file; I think I have the answer. The majority of read pairs in these data overlap at the variant; that is, both the forward and reverse strand reads cover it. Mutect, correctly, doesn't count a single fragment as two independent pieces of evidence, so it discards one of the reads before making and annotating the variant call. The coverage of 820 is, therefore, the number of sequenced fragments that cover the variant, not the number of reads. In sequencing with a lot of short fragments (i.e. less than twice the read length) this discrepancy occurs a lot. @droazen Note that this is purely a Mutect thing and has nothing to do with the GATK engine. You're off the hook!
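To make the read-versus-fragment distinction concrete, here is a minimal sketch. It is not Mutect2's actual code; the `Read` record, the `depth_at` helper, and the coordinates are hypothetical, and mates are assumed to share a template name.

```python
from collections import namedtuple

# Hypothetical minimal read record: template (fragment) name plus the
# half-open reference interval [start, end) that the read aligns to.
Read = namedtuple("Read", ["name", "start", "end"])

def depth_at(reads, pos):
    """Return (read_depth, fragment_depth) at reference position pos."""
    covering = [r for r in reads if r.start <= pos < r.end]
    read_depth = len(covering)
    # Mates share a template name, so a set collapses overlapping pairs.
    fragment_depth = len({r.name for r in covering})
    return read_depth, fragment_depth

reads = [
    Read("frag1", 100, 250), Read("frag1", 180, 330),  # both mates cover 200
    Read("frag2", 120, 270), Read("frag2", 300, 450),  # one mate covers 200
    Read("frag3", 150, 300), Read("frag3", 190, 340),  # both mates cover 200
]
print(depth_at(reads, 200))  # (5, 3)
```

A viewer that counts every read would report 5 at position 200, while a fragment-based count reports 3.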
@davidbenjamin You know what, I was just having this conversation about counting pairs vs directional reads with a colleague about an unrelated project! This must be the issue, since it is something that exists across all samples. Definitely a reasonable way to count; in fact, I prefer it over the example from the other variant caller I gave. In our process, this would happen even more often since we clip primer sequence, as well as unique molecular indices. Thanks so much for looking into this!
I have the same issue, and the difference is huge. I start with the same FASTQ file. The output from the BaseSpace somatic variant caller and the output from M2:
I checked this position in IGV, and the number is close to the BaseSpace result.
@wli1 Overlapping reads ought to account for at most a factor of 2, so it would have to be some other reason to go from 2064 to 51. If you provide a snippet bam file (maybe stretching 500 bases on each side of the variant) I will gladly take a look.
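The factor-of-2 bound is easy to check numerically. The sketch below assumes that every fragment contributes one or two reads at a site and that both callers otherwise apply similar filters, which is only approximately true; the helper name `overlap_can_explain` is hypothetical.

```python
def overlap_can_explain(read_depth, fragment_depth):
    """True if mate overlap alone could turn fragment_depth into read_depth.

    Each fragment contributes one or two reads at a site, so the read depth
    must lie between fragment_depth and 2 * fragment_depth.
    """
    if fragment_depth <= 0:
        return False
    return fragment_depth <= read_depth <= 2 * fragment_depth

print(overlap_can_explain(1320, 820))  # True: ratio is about 1.6
print(overlap_can_explain(2064, 51))   # False: ratio is about 40
```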
In looking at Mutect2 for clinical applications, one thing that always seems to come up is the big difference between the ref/alt coverage reported in the VCF file and what is seen in IGV. For clinical reporting, many labs will provide mutant allele depths, along with the VAF estimate. I understand the purpose of downsampling at stages of the M2 workflow, and I also understand this negatively affects amplicon-based studies. How viable is it to provide more exact estimates of coverage at variant loci (including reads that are high quality but not used during variant determination), while not substantially increasing runtime? It would be great to get some of our analysts away from always feeling as if they need to visualize calls in IGV...
Thanks,
John