Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly represent MNPs in HC and M2, with option to split into SNPs as before #4650

Merged
merged 1 commit into from May 22, 2018

Conversation

davidbenjamin
Copy link
Contributor

Closes #4647.

@LeeTL1220 Would you mind reviewing this?

@ldgauthier Since this affects HaplotypeCaller you might want to take a look, too. Let me know in particular if making MNPs the default is too bold.

@LeeTL1220
Copy link
Contributor

@davidbenjamin Is it possible to make this the default for M2, but not the default for HC?

@davidbenjamin
Copy link
Contributor Author

Sure. Whatever @ldgauthier wants.

@ldgauthier
Copy link
Contributor

I'd rather not change the default HC behavior, but this is pretty exciting because we can lay the argument about porting ReadBackedPhasing to rest. It would be good to do a comparison with RBP -- can you take a look at the RBP integration tests from GATK3 to see if there was a MNP test there?

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my question. Might be no additional work or might be a lot.

snpAlleles.add( Allele.create( altByte, false ) );
proposedEvents.add(new VariantContextBuilder(sourceNameToAdd, refLoc.getContig(), refLoc.getStart() + refPos, refLoc.getStart() + refPos, snpAlleles).make());
}
final List<Integer> mismatchBlockStarts = new ArrayList<>(); //inclusive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle phasing? It doesn't look like it (though I could be wrong). We don't want to call a MNP if the reads supporting the first base are mutually exclusive from the alt reads on the next base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does the correct thing. The new code makes a MNP out of any haplotype with a MNP. It will not make a MNP out of two different haplotypes with adjacent SNPs.

In general we produce far more haplotypes than actually exist due to kmerization. For example, when k = 10 in our assembly and you have two SNPs 15 bases apart we will produce four different haplotypes for the 2 x 2 = 4 independent combinations. This is not an issue with adjacent SNPs because kmers span them.

But let's suppose that for unknown reasons there's a hole in this argument and that we produce a haplotype with two adjacent SNPs falsely in phase. In that case, every read will contradict the haplotype and so the MNP allele will have a very low LOD or QUAL, as the case may be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about that last case? Is there a way we can test it?

Would this generate multiallelics when the adjacent SNPs are not in phase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be easy enough to make an integration test.

For the same reasons as above I wouldn't expect multiallelics, but an integration test will cover this, too. I just need to dig up or create a small bam with unphased SNPs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

final List<Integer> mismatchBlockEnds = new ArrayList<>(); //exclusive
boolean previousMismatch = false;
for( int offset = 0; offset < elementLength; offset++ ) {
final byte refByte = ref[refPos + offset ];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space in bracket

@LeeTL1220 LeeTL1220 assigned davidbenjamin and unassigned LeeTL1220 Apr 12, 2018
@davidbenjamin
Copy link
Contributor Author

@ldgauthier the GATK3 tests have two bams relevant to MNPs. One has two unphased SNPs 3 bases apart; the other has two phased adjacent SNPs ie a DNP. That's it as far as I can tell. I think I ought to cook up some synthetic reads for a nice test.

By the way, should add a MNP merging distance option as in ReadBackedPhasing? Currently, for example, the code I wrote can't make a MNP out of ACT -> GCA.

@LeeTL1220
Copy link
Contributor

LeeTL1220 commented Apr 12, 2018 via email

@codecov-io
Copy link

codecov-io commented Apr 12, 2018

Codecov Report

Merging #4650 into master will increase coverage by 0.306%.
The diff coverage is 97.143%.

@@               Coverage Diff               @@
##              master     #4650       +/-   ##
===============================================
+ Coverage     80.142%   80.448%   +0.306%     
- Complexity     17506     18605     +1099     
===============================================
  Files           1086      1085        -1     
  Lines          63305     68551     +5246     
  Branches       10221     11641     +1420     
===============================================
+ Hits           50734     55148     +4414     
- Misses          8575      9183      +608     
- Partials        3996      4220      +224
Impacted Files Coverage Δ Complexity Δ
...ecaller/AssemblyBasedCallerArgumentCollection.java 100% <ø> (ø) 1 <0> (ø) ⬇️
...hellbender/tools/walkers/mutect/Mutect2Engine.java 91.736% <100%> (+0.691%) 78 <0> (+29) ⬆️
.../tools/walkers/mutect/SomaticGenotypingEngine.java 92.308% <100%> (+2.427%) 97 <0> (+31) ⬆️
...tools/walkers/haplotypecaller/HaplotypeCaller.java 93.939% <100%> (+0.391%) 18 <3> (+2) ⬆️
...ols/walkers/haplotypecaller/AssemblyResultSet.java 74.251% <100%> (+0.155%) 43 <1> (ø) ⬇️
...der/tools/walkers/mutect/M2ArgumentCollection.java 97.059% <100%> (+2.941%) 3 <0> (ø) ⬇️
...plotypecaller/HaplotypeCallerGenotypingEngine.java 77.698% <100%> (+0.162%) 32 <0> (ø) ⬇️
...otypecaller/HaplotypeCallerArgumentCollection.java 100% <100%> (ø) 2 <0> (ø) ⬇️
...ypecaller/AssemblyBasedCallerGenotypingEngine.java 91.515% <100%> (ø) 70 <0> (ø) ⬇️
...walkers/haplotypecaller/HaplotypeCallerEngine.java 78.652% <100%> (-0.315%) 66 <0> (-1)
... and 125 more

@ldgauthier
Copy link
Contributor

Actually ACT -> GCA would be useful because they could potentially be in the same codon depending on the reading frame. Is that an easy feature to add?

@LeeTL1220
Copy link
Contributor

@ldgauthier There are downstream tools that are going to choke on that for sure.

If we can add another flag to control this as well, I am okay.

@ldgauthier
Copy link
Contributor

Flag all the things.

@davidbenjamin
Copy link
Contributor Author

@ldgauthier @LeeTL1220 I put in an parameter for the MNP spacing and a bunch of tests.

@davidbenjamin
Copy link
Contributor Author

. . . and one of the tests is sadistic.

* @param maxMnpDistance Phased substitutions separated by this distance or less are merged into MNPs. More than
* two substitutions occuring in the same alignment block (ie the same M/X/EQ CIGAR element)
* are merged until a substitution is separated from the previous one by a greater distance.
* That is, if maxMnpDistance = 1, substitutions at 10,11,12,14,15,17 are broken into a MNP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... substitutions at positions ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -501,13 +501,17 @@ private void updateReferenceHaplotype(final Haplotype newHaplotype) {
*
* <p/>
* The result is sorted incrementally by location.
*
* @param maxMnpDistance Phased substitutions separated by this distance or less are merged into MNPs. More than
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument can't be <=0, right? If I'm right, can you update the docs? Any public methods should check the value with ParamUtils.isPositive(...) call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@LeeTL1220
Copy link
Contributor

@davidbenjamin Back to you...

@davidbenjamin
Copy link
Contributor Author

@LeeTL1220 back to you.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you either add a test for or show me what happens when you run HC MNPs in GVCF mode?

I'm also trying to wrap my head around what will happen in GenotypeGVCFs for samples that have one of the SNPs but not the full MNP.

@davidbenjamin
Copy link
Contributor Author

@ldgauthier I added an integration test for GVCF mode and it works fine: the alleles are as expected with the addition of <NON REF>. I'm now going to investigate how the MNPs interact with CombineGVCFs and GenotypeGVCFs.

@ldgauthier
Copy link
Contributor

I started drawing this out on the whiteboard yesterday and I don't think MNP output is the best way to represent adjacent, phased SNPs for callsets generated from GVCFs.

Suppose one sample has two phased SNPs (A->T and C->G): one at position 5, one at position 6. Sample2 has a SNP only at position 5 and sample3 has a SNP only at position 6. I think the variants will be output as:
chr 6 AC TG,TC
chr 7 C G
such that position 6 gets a 2bp ref allele and position 7 overlaps that ref. Not elegant, but not in violation of the spec. I think the likelihood reporting is where the wheels come off the wagon, especially for sample3. Sample 3 only has likelihoods for the --AC-- haplotype and the --AG-- haplotype because it was called by itself. The sample3 likelihood assigned to TC and TG at position 6 will be the same non-ref likelihood from the ref block ending at position 5 but we know that for a read containing ---AG--- TG should have better likelihood than TC. For sample 2 we have likelihoods for AC and TC, but it's not obvious to me how to generate likelihoods for TG.

This is making me like the PID/PGT scheme a lot more. At least there we have better accuracy on the representation we provide, even if it require a lot more post-processing. Do you have a sense of how hard it would be to split the MNP events (e.g. for sample1) after genotyping, give them the same likelihoods and apply the PID and PGT?

@davidbenjamin
Copy link
Contributor Author

@ldgauthier Some parts of taking splitting MNPs at the end of HaplotypeCaller are easy: breaking eg one DNP at position n into a SNP at n and a SNP at n + 1, letting the SNPs inherit the PLs, AF, and AD (okay, this isn't quite right because a read might end in the middle of the MNP, but close enough) of the parent MNP. . . but the general problem of splitting annotations seems like it might be too tricky.

I'm leaning toward instead just modifying AssemblyBasedCallerGenotypingEngine.phaseCalls(). It seems that this phasing relies very heavily on perfect phasing or anti-phasing and that even one questionable haplotype with incorrect phasing can spoil things. I would guess that we could improve the phasing by making some simple guess as to which haplotypes are real. Basically, the problem is that while HaplotypeCaller imposes ploidy on alleles, it does not do so on haplotypes, and so phasing information is diluted.

With your permission I would like to merge this PR and open a new issue for improving phaseCalls. After all, the issue is fixed in M2, and HC now has a perfectly good MNP mode, with the caveat that it doesn't interact nicely with GVCF mode.

@davidbenjamin
Copy link
Contributor Author

Of course I would also be glad to be told that splitting the annotations isn't really so hard.

@ldgauthier
Copy link
Contributor

What do you need to split? Can't you make the assumption that all reads span both sites (admittedly not perfect) and copy the annotations?

@davidbenjamin
Copy link
Contributor Author

I think most annotations could just be copied, but let me list a few that would be non-trivial:

  • Anything involving base quality eg BaseQuality, BaseQualityRankSumTest.
  • Anything sensitive to genotypes eg one haplotype is DNP: chr20 100 CA -> GT, other haplotype is SNP: chr20 100 C -> G. You would have to recognize that your het alt in the MNP representation becomes a hom alt at 100 and a het at 101. While this is easy enough to do in the genotype it means ExcessHet and InbreedingCoef can't be copied.
  • In general, any annotation that is affected by other alt alleles would be tricky because some of the split SNPs would share a locus with other alt alleles and others wouldn't.
  • ReadPosition would have to be modified according to the position within the split MNP. Easy enough, but it's custom code for just one annotation that would have to live in HaplotypeCaller.

It's not much, but what would we do about them?

@davidbenjamin
Copy link
Contributor Author

@ldgauthier What do you think?

@ldgauthier
Copy link
Contributor

There's enough work with the annotation handling to justify this being a separate task for the HaplotypeCaller side. Let's just turn your new phasing off for HaplotypeCaller GVCF mode. I'm still interested in it being available for single-sample because it would be awesome for clinical.

@davidbenjamin
Copy link
Contributor Author

@ldgauthier As the PR stands mnps are off by default for HC. Should I have it throw an error if they are turned on in GVCF mode, and should I turn it on by default in non-GVCF mode?

@ldgauthier
Copy link
Contributor

Let's do off by default for all modes. Error if MNPs and GVCF mode.

@davidbenjamin davidbenjamin force-pushed the db_mnps branch 2 times, most recently from a1532a7 to 8694279 Compare May 10, 2018 19:38
@davidbenjamin
Copy link
Contributor Author

Let's do off by default for all modes. Error if MNPs and GVCF mode.

@ldgauthier Done and done.

@LeeTL1220 I need for your sign-off as well.

@davidbenjamin
Copy link
Contributor Author

@LeeTL1220 I ran this branch through M2 and oncotator on your MNPs bam, which had 9 DNPs, with and without --infer-onps in oncotator. Results were identical and the alleles were correct.

@ldgauthier Is this ready to merge?

@LeeTL1220
Copy link
Contributor

👍 Sounds good if it generates the identical results.

@tfenne
Copy link
Contributor

tfenne commented Jun 11, 2018

@davidbenjamin & @ldgauthier Sorry for commenting on a closed/merged PR but I wasn't sure where else to take the discussion. If there's a more appropriate place please redirect me!

First off, this is very cool and I'm so glad to see this making it's way into HC/M2! It's super helpful for functional annotation/clinical interpretation. Thanks for working on this!

I had two thoughts which maybe belong as separate issues, but I figured I'd raise them here first and see what you thought:

  1. It would actually be useful to be able to combine this behavior with GVCF mode in some cases. I understand all the caveats about merging and joint-genotyping when this has been done, but there are use cases for single-sample calling where both GCVF and MNP mode combined would be useful. E.g. in a clinical setting it's very useful to have the GVCF with the reference blocks, and also call MNPs as MNPs. There would be no merging in this case. Any chance this could be allowed, perhaps with a warning or requirement that --unsafe be on?
  2. IIRC RBP would also phase combinations of indel-SNP and indel-indel in addition to SNP-SNP. I'm curious how hard it would be to apply the same grouping logic across indels as well? I tried to read the code in the PR, but honestly I don't think I understand the ramifications of including indels sufficiently well. Would there be any conceptual objections or road-blocks to doing this?

@davidbenjamin
Copy link
Contributor Author

@tfenne Thanks for commenting! 1) is up to Laura but I 'm willing to put in time to implement a useful feature that she approves of. This PR only tries to phase SNPs within a single Cigar element, which involved minimal change to the code. Personally I think once we're getting more complicated than that it's worth going more fundamental, that is, genotyping whole haplotypes. This is an experiment that will probably happen in Mutect2 within the next few months.

@ldgauthier
Copy link
Contributor

@tfenne I was hesitant to output MNPs in GVCFs because (as you are aware) they're difficult to deal with in joint calling and I didn't want users to expect a MNP solution for multiple samples. An extra argument is an intriguing idea, maybe --not-for-joint-calling, which is sort of like "Check here to accept the user agreement." I'm not entirely sure what the expected behavior for CombineGVCFs or GenomicsDBImport should be if a user tried to combine GVCFs with MNPs. They may not fail gracefully as-is, but I suppose we could just throw a UserException in the middle of traversal if we find a MNP, which wouldn't take too much effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants