Correctly represent MNPs in HC and M2, with option to split into SNPs as before #4650

davidbenjamin · 2018-04-12T05:14:53Z

Closes #4647.

@LeeTL1220 Would you mind reviewing this?

@ldgauthier Since this affects HaplotypeCaller you might want to take a look, too. Let me know in particular if making MNPs the default is too bold.

LeeTL1220 · 2018-04-12T13:29:08Z

@davidbenjamin Is it possible to make this the default for M2, but not the default for HC?

davidbenjamin · 2018-04-12T13:51:19Z

Sure. Whatever @ldgauthier wants.

ldgauthier · 2018-04-12T14:02:54Z

I'd rather not change the default HC behavior, but this is pretty exciting because we can lay the argument about porting ReadBackedPhasing to rest. It would be good to do a comparison with RBP -- can you take a look at the RBP integration tests from GATK3 to see if there was a MNP test there?

LeeTL1220

See my question. Might be no additional work or might be a lot.

LeeTL1220 · 2018-04-12T13:38:38Z

src/main/java/org/broadinstitute/hellbender/utils/haplotype/EventMap.java

-                                snpAlleles.add( Allele.create( altByte, false ) );
-                                proposedEvents.add(new VariantContextBuilder(sourceNameToAdd, refLoc.getContig(), refLoc.getStart() + refPos, refLoc.getStart() + refPos, snpAlleles).make());
-                            }
+                    final List<Integer> mismatchBlockStarts = new ArrayList<>();    //inclusive


Does this handle phasing? It doesn't look like it (though I could be wrong). We don't want to call a MNP if the reads supporting the first base are mutually exclusive from the alt reads on the next base.

This does the correct thing. The new code makes a MNP out of any haplotype with a MNP. It will not make a MNP out of two different haplotypes with adjacent SNPs.

In general we produce far more haplotypes than actually exist due to kmerization. For example, when k = 10 in our assembly and you have two SNPs 15 bases apart we will produce four different haplotypes for the 2 x 2 = 4 independent combinations. This is not an issue with adjacent SNPs because kmers span them.

But let's suppose that for unknown reasons there's a hole in this argument and that we produce a haplotype with two adjacent SNPs falsely in phase. In that case, every read will contradict the haplotype and so the MNP allele will have a very low LOD or QUAL, as the case may be.

Do we need to worry about that last case? Is there a way we can test it?

Would this generate multiallelics when the adjacent SNPs are not in phase?

It should be easy enough to make an integration test.

For the same reasons as above I wouldn't expect multiallelics, but an integration test will cover this, too. I just need to dig up or create a small bam with unphased SNPs.

LeeTL1220 · 2018-04-12T13:39:17Z

src/main/java/org/broadinstitute/hellbender/utils/haplotype/EventMap.java

+                    final List<Integer> mismatchBlockEnds = new ArrayList<>();      //exclusive
+                    boolean previousMismatch = false;
+                    for( int offset = 0; offset < elementLength; offset++ ) {
+                        final byte refByte = ref[refPos + offset ];


Extra space in bracket

davidbenjamin · 2018-04-12T16:34:44Z

@ldgauthier the GATK3 tests have two bams relevant to MNPs. One has two unphased SNPs 3 bases apart; the other has two phased adjacent SNPs ie a DNP. That's it as far as I can tell. I think I ought to cook up some synthetic reads for a nice test.

By the way, should add a MNP merging distance option as in ReadBackedPhasing? Currently, for example, the code I wrote can't make a MNP out of ACT -> GCA.

LeeTL1220 · 2018-04-12T16:48:19Z

I only want adjacent bases, so I am okay.

…

On Thu, Apr 12, 2018, 12:34 David Benjamin ***@***.***> wrote: @ldgauthier <https://github.com/ldgauthier> the GATK3 tests have two bams relevant to MNPs. One has two unphased SNPs 3 bases apart; the other has two phased adjacent SNPs ie a DNP. That's it as far as I can tell. I think I ought to cook up some synthetic reads for a nice test. By the way, should add a MNP merging distance option as in ReadBackedPhasing? Currently, for example, the code I wrote can't make a MNP out of ACT -> GCA. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#4650 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACDXk7itJIetc3iY3pptj9ljoscsExcqks5tn4IngaJpZM4TRJxM> .

codecov-io · 2018-04-12T17:27:48Z

Codecov Report

Merging #4650 into master will increase coverage by 0.306%.
The diff coverage is 97.143%.

@@               Coverage Diff               @@
##              master     #4650       +/-   ##
===============================================
+ Coverage     80.142%   80.448%   +0.306%     
- Complexity     17506     18605     +1099     
===============================================
  Files           1086      1085        -1     
  Lines          63305     68551     +5246     
  Branches       10221     11641     +1420     
===============================================
+ Hits           50734     55148     +4414     
- Misses          8575      9183      +608     
- Partials        3996      4220      +224

Impacted Files	Coverage Δ	Complexity Δ
...ecaller/AssemblyBasedCallerArgumentCollection.java	`100% <ø> (ø)`	`1 <0> (ø)`	⬇️
...hellbender/tools/walkers/mutect/Mutect2Engine.java	`91.736% <100%> (+0.691%)`	`78 <0> (+29)`	⬆️
.../tools/walkers/mutect/SomaticGenotypingEngine.java	`92.308% <100%> (+2.427%)`	`97 <0> (+31)`	⬆️
...tools/walkers/haplotypecaller/HaplotypeCaller.java	`93.939% <100%> (+0.391%)`	`18 <3> (+2)`	⬆️
...ols/walkers/haplotypecaller/AssemblyResultSet.java	`74.251% <100%> (+0.155%)`	`43 <1> (ø)`	⬇️
...der/tools/walkers/mutect/M2ArgumentCollection.java	`97.059% <100%> (+2.941%)`	`3 <0> (ø)`	⬇️
...plotypecaller/HaplotypeCallerGenotypingEngine.java	`77.698% <100%> (+0.162%)`	`32 <0> (ø)`	⬇️
...otypecaller/HaplotypeCallerArgumentCollection.java	`100% <100%> (ø)`	`2 <0> (ø)`	⬇️
...ypecaller/AssemblyBasedCallerGenotypingEngine.java	`91.515% <100%> (ø)`	`70 <0> (ø)`	⬇️
...walkers/haplotypecaller/HaplotypeCallerEngine.java	`78.652% <100%> (-0.315%)`	`66 <0> (-1)`
... and 125 more

ldgauthier · 2018-04-13T15:26:35Z

Actually ACT -> GCA would be useful because they could potentially be in the same codon depending on the reading frame. Is that an easy feature to add?

LeeTL1220 · 2018-04-13T17:29:00Z

@ldgauthier There are downstream tools that are going to choke on that for sure.

If we can add another flag to control this as well, I am okay.

ldgauthier · 2018-04-13T17:52:56Z

Flag all the things.

davidbenjamin · 2018-04-22T05:39:27Z

@ldgauthier @LeeTL1220 I put in an parameter for the MNP spacing and a bunch of tests.

davidbenjamin · 2018-04-22T05:40:30Z

. . . and one of the tests is sadistic.

LeeTL1220 · 2018-04-25T16:26:05Z

...dinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerGenotypingEngine.java

+     * @param maxMnpDistance Phased substitutions separated by this distance or less are merged into MNPs.  More than
+     *                       two substitutions occuring in the same alignment block (ie the same M/X/EQ CIGAR element)
+     *                       are merged until a substitution is separated from the previous one by a greater distance.
+     *                       That is, if maxMnpDistance = 1, substitutions at 10,11,12,14,15,17 are broken into a MNP


... substitutions at positions ...

LeeTL1220 · 2018-04-25T16:31:13Z

...main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyResultSet.java

@@ -501,13 +501,17 @@ private void updateReferenceHaplotype(final Haplotype newHaplotype) {
     *
     * <p/>
     * The result is sorted incrementally by location.
-     *
+     * @param maxMnpDistance Phased substitutions separated by this distance or less are merged into MNPs.  More than


This argument can't be <=0, right? If I'm right, can you update the docs? Any public methods should check the value with ParamUtils.isPositive(...) call.

LeeTL1220 · 2018-04-25T16:34:07Z

@davidbenjamin Back to you...

davidbenjamin · 2018-04-25T16:59:33Z

@LeeTL1220 back to you.

ldgauthier

Can you either add a test for or show me what happens when you run HC MNPs in GVCF mode?

I'm also trying to wrap my head around what will happen in GenotypeGVCFs for samples that have one of the SNPs but not the full MNP.

davidbenjamin · 2018-04-26T00:23:16Z

@ldgauthier I added an integration test for GVCF mode and it works fine: the alleles are as expected with the addition of <NON REF>. I'm now going to investigate how the MNPs interact with CombineGVCFs and GenotypeGVCFs.

ldgauthier · 2018-04-26T14:12:22Z

I started drawing this out on the whiteboard yesterday and I don't think MNP output is the best way to represent adjacent, phased SNPs for callsets generated from GVCFs.

Suppose one sample has two phased SNPs (A->T and C->G): one at position 5, one at position 6. Sample2 has a SNP only at position 5 and sample3 has a SNP only at position 6. I think the variants will be output as:
chr 6 AC TG,TC
chr 7 C G
such that position 6 gets a 2bp ref allele and position 7 overlaps that ref. Not elegant, but not in violation of the spec. I think the likelihood reporting is where the wheels come off the wagon, especially for sample3. Sample 3 only has likelihoods for the --AC-- haplotype and the --AG-- haplotype because it was called by itself. The sample3 likelihood assigned to TC and TG at position 6 will be the same non-ref likelihood from the ref block ending at position 5 but we know that for a read containing ---AG--- TG should have better likelihood than TC. For sample 2 we have likelihoods for AC and TC, but it's not obvious to me how to generate likelihoods for TG.

This is making me like the PID/PGT scheme a lot more. At least there we have better accuracy on the representation we provide, even if it require a lot more post-processing. Do you have a sense of how hard it would be to split the MNP events (e.g. for sample1) after genotyping, give them the same likelihoods and apply the PID and PGT?

davidbenjamin · 2018-04-27T01:28:58Z

@ldgauthier Some parts of taking splitting MNPs at the end of HaplotypeCaller are easy: breaking eg one DNP at position n into a SNP at n and a SNP at n + 1, letting the SNPs inherit the PLs, AF, and AD (okay, this isn't quite right because a read might end in the middle of the MNP, but close enough) of the parent MNP. . . but the general problem of splitting annotations seems like it might be too tricky.

I'm leaning toward instead just modifying AssemblyBasedCallerGenotypingEngine.phaseCalls(). It seems that this phasing relies very heavily on perfect phasing or anti-phasing and that even one questionable haplotype with incorrect phasing can spoil things. I would guess that we could improve the phasing by making some simple guess as to which haplotypes are real. Basically, the problem is that while HaplotypeCaller imposes ploidy on alleles, it does not do so on haplotypes, and so phasing information is diluted.

With your permission I would like to merge this PR and open a new issue for improving phaseCalls. After all, the issue is fixed in M2, and HC now has a perfectly good MNP mode, with the caveat that it doesn't interact nicely with GVCF mode.

davidbenjamin · 2018-04-27T01:29:41Z

Of course I would also be glad to be told that splitting the annotations isn't really so hard.

ldgauthier · 2018-04-30T16:42:01Z

What do you need to split? Can't you make the assumption that all reads span both sites (admittedly not perfect) and copy the annotations?

davidbenjamin · 2018-04-30T19:49:40Z

I think most annotations could just be copied, but let me list a few that would be non-trivial:

Anything involving base quality eg BaseQuality, BaseQualityRankSumTest.
Anything sensitive to genotypes eg one haplotype is DNP: chr20 100 CA -> GT, other haplotype is SNP: chr20 100 C -> G. You would have to recognize that your het alt in the MNP representation becomes a hom alt at 100 and a het at 101. While this is easy enough to do in the genotype it means ExcessHet and InbreedingCoef can't be copied.
In general, any annotation that is affected by other alt alleles would be tricky because some of the split SNPs would share a locus with other alt alleles and others wouldn't.
ReadPosition would have to be modified according to the position within the split MNP. Easy enough, but it's custom code for just one annotation that would have to live in HaplotypeCaller.

It's not much, but what would we do about them?

davidbenjamin · 2018-05-08T20:32:50Z

@ldgauthier What do you think?

ldgauthier · 2018-05-09T18:31:03Z

There's enough work with the annotation handling to justify this being a separate task for the HaplotypeCaller side. Let's just turn your new phasing off for HaplotypeCaller GVCF mode. I'm still interested in it being available for single-sample because it would be awesome for clinical.

davidbenjamin · 2018-05-09T20:32:50Z

@ldgauthier As the PR stands mnps are off by default for HC. Should I have it throw an error if they are turned on in GVCF mode, and should I turn it on by default in non-GVCF mode?

ldgauthier · 2018-05-10T14:59:50Z

Let's do off by default for all modes. Error if MNPs and GVCF mode.

davidbenjamin · 2018-05-10T19:41:47Z

Let's do off by default for all modes. Error if MNPs and GVCF mode.

@ldgauthier Done and done.

@LeeTL1220 I need for your sign-off as well.

davidbenjamin · 2018-05-15T18:15:29Z

@LeeTL1220 I ran this branch through M2 and oncotator on your MNPs bam, which had 9 DNPs, with and without --infer-onps in oncotator. Results were identical and the alleles were correct.

@ldgauthier Is this ready to merge?

LeeTL1220 · 2018-05-15T19:13:15Z

👍 Sounds good if it generates the identical results.

…tance threshold

tfenne · 2018-06-11T15:47:00Z

@davidbenjamin & @ldgauthier Sorry for commenting on a closed/merged PR but I wasn't sure where else to take the discussion. If there's a more appropriate place please redirect me!

First off, this is very cool and I'm so glad to see this making it's way into HC/M2! It's super helpful for functional annotation/clinical interpretation. Thanks for working on this!

I had two thoughts which maybe belong as separate issues, but I figured I'd raise them here first and see what you thought:

It would actually be useful to be able to combine this behavior with GVCF mode in some cases. I understand all the caveats about merging and joint-genotyping when this has been done, but there are use cases for single-sample calling where both GCVF and MNP mode combined would be useful. E.g. in a clinical setting it's very useful to have the GVCF with the reference blocks, and also call MNPs as MNPs. There would be no merging in this case. Any chance this could be allowed, perhaps with a warning or requirement that --unsafe be on?
IIRC RBP would also phase combinations of indel-SNP and indel-indel in addition to SNP-SNP. I'm curious how hard it would be to apply the same grouping logic across indels as well? I tried to read the code in the PR, but honestly I don't think I understand the ramifications of including indels sufficiently well. Would there be any conceptual objections or road-blocks to doing this?

davidbenjamin · 2018-06-11T18:38:34Z

@tfenne Thanks for commenting! 1) is up to Laura but I 'm willing to put in time to implement a useful feature that she approves of. This PR only tries to phase SNPs within a single Cigar element, which involved minimal change to the code. Personally I think once we're getting more complicated than that it's worth going more fundamental, that is, genotyping whole haplotypes. This is an experiment that will probably happen in Mutect2 within the next few months.

ldgauthier · 2018-06-11T20:26:06Z

@tfenne I was hesitant to output MNPs in GVCFs because (as you are aware) they're difficult to deal with in joint calling and I didn't want users to expect a MNP solution for multiple samples. An extra argument is an intriguing idea, maybe --not-for-joint-calling, which is sort of like "Check here to accept the user agreement." I'm not entirely sure what the expected behavior for CombineGVCFs or GenomicsDBImport should be if a user tried to combine GVCFs with MNPs. They may not fail gracefully as-is, but I suppose we could just throw a UserException in the middle of traversal if we find a MNP, which wouldn't take too much effort.

davidbenjamin added enhancement HaplotypeCaller Mutect labels Apr 12, 2018

davidbenjamin added this to the Mutect2 milestone Apr 12, 2018

davidbenjamin assigned LeeTL1220 Apr 12, 2018

davidbenjamin requested review from LeeTL1220 and ldgauthier April 12, 2018 05:14

LeeTL1220 reviewed Apr 12, 2018

View reviewed changes

LeeTL1220 assigned davidbenjamin and unassigned LeeTL1220 Apr 12, 2018

davidbenjamin force-pushed the db_mnps branch 2 times, most recently from 8e4f146 to cf3386a Compare April 22, 2018 05:37

davidbenjamin assigned LeeTL1220 and ldgauthier and unassigned davidbenjamin Apr 22, 2018

LeeTL1220 reviewed Apr 25, 2018

View reviewed changes

ldgauthier requested changes Apr 25, 2018

View reviewed changes

davidbenjamin force-pushed the db_mnps branch 2 times, most recently from a1532a7 to 8694279 Compare May 10, 2018 19:38

ldgauthier approved these changes May 16, 2018

View reviewed changes

davidbenjamin force-pushed the db_mnps branch from 06b4209 to c74b9f1 Compare May 22, 2018 15:27

Mutect2 and HaplotypeCaller can emit MNPs according to adjustable dis…

f3e8dd5

…tance threshold

davidbenjamin force-pushed the db_mnps branch from c74b9f1 to f3e8dd5 Compare May 22, 2018 16:50

davidbenjamin merged commit e078a57 into master May 22, 2018

davidbenjamin deleted the db_mnps branch May 22, 2018 19:11

davidbenjamin mentioned this pull request Jun 14, 2018

Create new tool to merge MNPs #1385

Closed

tfenne mentioned this pull request Sep 12, 2018

Allow MNP calling in GVCF mode with stern warnings about not trying … #5182

Merged

Correctly represent MNPs in HC and M2, with option to split into SNPs as before #4650

Correctly represent MNPs in HC and M2, with option to split into SNPs as before #4650

Conversation

davidbenjamin commented Apr 12, 2018

LeeTL1220 commented Apr 12, 2018

davidbenjamin commented Apr 12, 2018

ldgauthier commented Apr 12, 2018

LeeTL1220 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidbenjamin commented Apr 12, 2018

LeeTL1220 commented Apr 12, 2018 via email

codecov-io commented Apr 12, 2018 • edited

Codecov Report

ldgauthier commented Apr 13, 2018

LeeTL1220 commented Apr 13, 2018

ldgauthier commented Apr 13, 2018

davidbenjamin commented Apr 22, 2018

davidbenjamin commented Apr 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LeeTL1220 commented Apr 25, 2018

davidbenjamin commented Apr 25, 2018

ldgauthier left a comment

Choose a reason for hiding this comment

davidbenjamin commented Apr 26, 2018

ldgauthier commented Apr 26, 2018

davidbenjamin commented Apr 27, 2018

davidbenjamin commented Apr 27, 2018

ldgauthier commented Apr 30, 2018

davidbenjamin commented Apr 30, 2018

davidbenjamin commented May 8, 2018

ldgauthier commented May 9, 2018

davidbenjamin commented May 9, 2018

ldgauthier commented May 10, 2018

davidbenjamin commented May 10, 2018

davidbenjamin commented May 15, 2018

LeeTL1220 commented May 15, 2018

tfenne commented Jun 11, 2018

davidbenjamin commented Jun 11, 2018

ldgauthier commented Jun 11, 2018

codecov-io commented Apr 12, 2018 •

edited