Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutect2 read depth estimates #3808

Closed
jhl667 opened this issue Nov 8, 2017 · 9 comments
Closed

Mutect2 read depth estimates #3808

jhl667 opened this issue Nov 8, 2017 · 9 comments
Assignees

Comments

@jhl667
Copy link

jhl667 commented Nov 8, 2017

In looking at Mutect2 for clinical applications, one thing that always seems to come up has to do with the big difference between the ref/alt coverage denoted in the VCF file and what is seen in IGV. For clinical reporting, many labs will provide mutant allele depths, along with the VAF estimate. I understand the purpose of downsampling at stages of the m2 workflow, and I also understand this negatively affects amplicon-based studies. How viable is it to provide more exact (include reads that are high quality but not used during variant determination) estimates of coverage at variant loci, while not substantially increasing runtime? It would be great to get some of our analysts away from always feeling as if they need to visualize calls in IGV...

Thanks,
John

@davidbenjamin
Copy link
Contributor

@jhl667 The ref/alt coverage in an M2 vcf may differ from that of IGV for the following reasons:

  • Downsampling. This one ought to occur only at extremely high depths. By default M2 downsamples to 50 reads per alignment start, so if reads are 100 bp long downsampling only kicks in at coverage of 5000 and above. Usually such large coverage indicates a poorly-mappable region, but in targeted sequencing that legitimately has such high coverage you should adjust the -maxReadsPerAlignmentStart argument.
  • Read filters. M2 removes reads with mapping quality less than 20, which are overwhelmingly likely to be mapping errors, reads that are less than 30 bases, reads marked as duplicates etc. This filtering is done by the GATK engine before traversing the BAM and the M2 code never sees them. It would not be possible to output these in M2, but they are bad reads, anyway.
  • Reassembly / realignment. Like HaplotypeCaller, M2 builds a local assembly graph and realigns reads to candidate haplotypes. This can change the allele that a read is assigned to, particularly in the case of indels near STRs or near the end of reads. However, this doesn't lose the original information -- it corrects it.

Do you know which, if any, of these causes you had in mind? Could you provide an example?

@jhl667
Copy link
Author

jhl667 commented Nov 15, 2017

@davidbenjamin

  • Downsampling is definitely a concern for us, but as you have suggested, I have utilized this parameter to not limit ourselves to 50 reads per alignment start. After running a range of values, I landed on 500 as not being too much of a burden computationally, but still allowing us to fully digest alignments at each start site in a given region. For us, I have estimated a value of 500 would cover us to a depth of about 2000 or so, since we expect to see a bias at the projected amplicon start site.
  • The read filters you list don't have a huge effect on the majority of our regions. Generally, I would not expect to see more than a 5% loss based on the mapping quality filter. The read size filter should never be triggered, as we input only reads larger than 30 bases to M2. We perform duplicate removal, as we are working with UMIs, so this also should not be an issue.
  • I wouldn't expect the realignment stage to cause the type of effect I see, and as you say, it really just corrects the data anyways.

Example:
The following was called:

1	12919623	.	C	T	.	.	DP=741;ECNT=4;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=743.86
	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB	0/1:520,210:0.291:0,0:520,210:36:153,132:57:47:0
|1:12919623_C_T:0.232,0.283,0.288:0.835,2.995e-04,0.165

Also called in a different variant caller:

1	12919623	.	CC	TG	6989.54	.	AB=0.308968;ABP=423.639;AC=1;AF=0.5;AN=2;AO=410;CIGAR=2X;DP=1327;DPB=1327;DPRA=0;EPP=55.973;EPPR=139.678;GTI=0;LEN=2;MEANALT=8;MQM=50.4732;MQMR=56.3933;NS=1;NUMALT=1;ODDS=1609.4;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=13156;QR=30163;RO=900;RPL=282;RPP=128.617;RPPR=256.291;RPR=128;RUN=1;SAF=177;SAP=19.6194;SAR=233;SRF=377;SRP=54.4404;SRR=523;TYPE=mnp;technology.ILLUMINA=1	GT:DP:AD:RO:QR:AO:QA:GL	0/1:1327:900,410:900:30163:410:13156:-725.008,0,-2308.15

Yes, M2 also calls the neighboring C>G substitution, these are just being represented differently between the two callers.

You can see there is a discrepancy between the counts, 741 in the M2 call and almost double that (1327) in the other call. Looking directly at the aligned reads, I can see 1346 reads overlapping this location. Removing those with MAPQ<20, we are left with about 1302 reads. Again, there are no reads that are smaller than 30bp. Looking at the distribution of start site counts, there is a spike at the beginning of the region, as we would expect. The largest number of counts tied to a start site is 343 in this region. So, there should not be a downsampling effect since the
-maxReadsPerAlignmentStart parameter has been set to 500.

I don't think read position is an issue here, though I haven't fully quantified it. I can see this region in IGV and it looks clean. These examples are very easy to find in our data. The calls themselves seem really good, just trying to figure out how to deal with count estimation. Right now our solution is to use multiple variant callers.

Thanks for looking at this, please let me know if I can provide anything else.

@droazen
Copy link
Collaborator

droazen commented Nov 15, 2017

Do the missing reads have deletions spanning the locus? There was a bug we recently patched that might be related to this (#3830). Can you try again with the latest GATK master branch?

@davidbenjamin
Copy link
Contributor

@jhl667 That's definitely worth trying. Otherwise, there's nothing obvious from the vcf. If you make a mini bam around that call, eg

samtools view input.bam "1:12919323-12919923" > mini.bam

I could step through it in the IDE and hopefully figure things out.

@jhl667
Copy link
Author

jhl667 commented Nov 21, 2017

@droazen Nope, no deletions spanning the locus. Also, this is not an isolated instance of this behavior.

@davidbenjamin Excellent. I have created a headerless SAM, hopefully this will work for you around the region 1:12919587. Sorry this region differs from the one above, I completely neglected to track the sample name. Coverage according to alternate variant caller is ~1320, while M2 is listing ~820.

depth_ex.tar.gz

@davidbenjamin
Copy link
Contributor

@jhl667 Thanks for the example file and I think I have the answer. The majority of read pairs in these data are overlapping at the variant; that is, both the forward and reverse strand reads cover it. Mutect, correctly, doesn't count a single fragment as two independent pieces of evidence, so it discards one of the reads before making and annotating the variant call.

The coverage of 820 is, therefore, the number of sequenced fragments that cover the variant, not the number of reads. In sequencing with a lot of short fragments (i.e. less than twice the read length) this discrepancy occurs a lot.

@droazen Note that this is purely a Mutect thing and has nothing to do with the GATK engine. You're off the hook!

@jhl667
Copy link
Author

jhl667 commented Nov 22, 2017

@davidbenjamin You know what, I was just having this conversation about counting pairs vs directional reads with a colleague about an unrelated project! This must be the issue, since it is something that exists across all samples. Definitely a reasonable way to count, in fact I prefer it over the example from the other variant caller I gave. In our process, this would happen even more often since we clip primer sequence, as well as unique molecular indices. Thanks so much for looking in to this!

@jhl667 jhl667 closed this as completed Nov 22, 2017
@wli1
Copy link

wli1 commented May 1, 2018

I got the same issue, the difference is huge. I start with same fastq file.

the output from Basespace somatic variant caller,
chr4 55594258 rs121913523 T C 100 PASS DP=2064;cosmic=1|COSM12706;phyloP=5.257;CSQT=1|KIT|NM_000222.2|missense_variant GT:GQ:AD:DP:VF:NL:SB:NC 0/1:100:1636,428:2064:0.207:20:-100.0000:0.0005

The output from M2:

chr4 55594258 . T C . . DP=51;ECNT=2;POP_AF=5.000e-08;TLOD=40.63 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:38,13:0.257:19,6:19,7:38:109,110:60:20:0.222,0.232,0.255:0.018,0.017,0.965

I checked this position by IGV, the number close to the Basespace result.

@davidbenjamin
Copy link
Contributor

@wli1 Overlapping reads ought to account for at most a factor of 2, so it would have to be some other reason to go from 2064 to 51. If you provide a snippet bam file (maybe stretching 500 bases on each side of the variant) I will gladly take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants