Correctly represent MNPs in HC and M2, with option to split into SNPs as before #4650
@davidbenjamin Is it possible to make this the default for M2, but not the default for HC? |
Sure. Whatever @ldgauthier wants. |
I'd rather not change the default HC behavior, but this is pretty exciting because we can lay the argument about porting ReadBackedPhasing to rest. It would be good to do a comparison with RBP -- can you take a look at the RBP integration tests from GATK3 to see if there was a MNP test there? |
See my question. Might be no additional work or might be a lot.
snpAlleles.add( Allele.create( altByte, false ) ); | ||
proposedEvents.add(new VariantContextBuilder(sourceNameToAdd, refLoc.getContig(), refLoc.getStart() + refPos, refLoc.getStart() + refPos, snpAlleles).make()); | ||
} | ||
final List<Integer> mismatchBlockStarts = new ArrayList<>(); //inclusive |
Does this handle phasing? It doesn't look like it (though I could be wrong). We don't want to call a MNP if the reads supporting the first base are mutually exclusive from the alt reads on the next base.
This does the correct thing. The new code makes a MNP out of any haplotype with a MNP. It will not make a MNP out of two different haplotypes with adjacent SNPs.
In general we produce far more haplotypes than actually exist due to kmerization. For example, when k = 10 in our assembly and you have two SNPs 15 bases apart we will produce four different haplotypes for the 2 x 2 = 4 independent combinations. This is not an issue with adjacent SNPs because kmers span them.
But let's suppose that for unknown reasons there's a hole in this argument and that we produce a haplotype with two adjacent SNPs falsely in phase. In that case, every read will contradict the haplotype and so the MNP allele will have a very low LOD or QUAL, as the case may be.
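A minimal sketch of the k-mer argument above, assuming the only question is whether a single window of length k can cover both variant sites. The class and method names are hypothetical and are not part of GATK's assembly code.

```java
// Hypothetical illustration, not GATK code: can one k-mer cover two variant sites?
final class KmerSpan {
    /**
     * A k-mer starting at s covers positions s .. s + k - 1, so two sites fit in
     * one k-mer exactly when they are at most k - 1 bases apart.
     */
    static boolean someKmerSpansBoth(final int pos1, final int pos2, final int k) {
        return Math.abs(pos2 - pos1) <= k - 1;
    }

    public static void main(final String[] args) {
        // Two SNPs 15 bases apart with k = 10: no single k-mer links them.
        System.out.println(someKmerSpansBoth(100, 115, 10));
        // Two adjacent SNPs with k = 10: every k-mer covering one can cover the other.
        System.out.println(someKmerSpansBoth(100, 101, 10));
    }
}
```

Under this simplification the first call returns false and the second returns true, which is the distinction the comment draws between distant and adjacent SNPs.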
Do we need to worry about that last case? Is there a way we can test it?
Would this generate multiallelics when the adjacent SNPs are not in phase?
It should be easy enough to make an integration test.
For the same reasons as above I wouldn't expect multiallelics, but an integration test will cover this, too. I just need to dig up or create a small bam with unphased SNPs.
Thanks!
final List<Integer> mismatchBlockEnds = new ArrayList<>(); //exclusive | ||
boolean previousMismatch = false; | ||
for( int offset = 0; offset < elementLength; offset++ ) { | ||
final byte refByte = ref[refPos + offset ]; |
Extra space in bracket
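For readers following the diff fragments above, here is a rough standalone sketch of the idea they suggest: scan one alignment block and record runs of consecutive mismatches with an inclusive start and an exclusive end. It is not the PR's code; `MismatchBlocks` and `find` are hypothetical names, and the arrays are assumed to be the same length and already aligned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's implementation.
final class MismatchBlocks {
    /** Returns {start, end} offset pairs (start inclusive, end exclusive) for each run of mismatches. */
    static List<int[]> find(final byte[] ref, final byte[] alt) {
        if (ref.length != alt.length) {
            throw new IllegalArgumentException("ref and alt must have the same length");
        }
        final List<int[]> blocks = new ArrayList<>();
        int blockStart = -1; // -1 means we are not currently inside a mismatch run
        for (int offset = 0; offset < ref.length; offset++) {
            final boolean mismatch = ref[offset] != alt[offset];
            if (mismatch && blockStart < 0) {
                blockStart = offset;                        // a new run begins here
            } else if (!mismatch && blockStart >= 0) {
                blocks.add(new int[] {blockStart, offset}); // the run ended at the previous base
                blockStart = -1;
            }
        }
        if (blockStart >= 0) {
            blocks.add(new int[] {blockStart, ref.length}); // run reaches the end of the block
        }
        return blocks;
    }

    public static void main(final String[] args) {
        for (final int[] block : find("ACGTAC".getBytes(), "TGGTAA".getBytes())) {
            System.out.println("[" + block[0] + ", " + block[1] + ")");
        }
    }
}
```

In this sketch a run of length one corresponds to a SNP and a longer run to an MNP; substitutions separated by a matching base end up in separate runs.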
@ldgauthier the GATK3 tests have two bams relevant to MNPs. One has two unphased SNPs 3 bases apart; the other has two phased adjacent SNPs, i.e. a DNP. That's it as far as I can tell. I think I ought to cook up some synthetic reads for a nice test. By the way, should I add a MNP merging distance option as in ReadBackedPhasing? Currently, for example, the code I wrote can't make a MNP out of ACT -> GCA. |
I only want adjacent bases, so I am okay.
Codecov Report
@@ Coverage Diff @@
## master #4650 +/- ##
===============================================
+ Coverage 80.142% 80.448% +0.306%
- Complexity 17506 18605 +1099
===============================================
Files 1086 1085 -1
Lines 63305 68551 +5246
Branches 10221 11641 +1420
===============================================
+ Hits 50734 55148 +4414
- Misses 8575 9183 +608
- Partials 3996 4220 +224
|
Actually ACT -> GCA would be useful because they could potentially be in the same codon depending on the reading frame. Is that an easy feature to add? |
@ldgauthier There are downstream tools that are going to choke on that for sure. If we can add another flag to control this as well, I am okay. |
Flag all the things. |
8e4f146 to cf3386a
@ldgauthier @LeeTL1220 I put in a parameter for the MNP spacing and a bunch of tests. |
. . . and one of the tests is sadistic. |
* @param maxMnpDistance Phased substitutions separated by this distance or less are merged into MNPs. More than | ||
* two substitutions occurring in the same alignment block (ie the same M/X/EQ CIGAR element) | ||
* are merged until a substitution is separated from the previous one by a greater distance. | ||
* That is, if maxMnpDistance = 1, substitutions at 10,11,12,14,15,17 are broken into a MNP |
... substitutions at positions ...
done
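The merging rule in the `maxMnpDistance` Javadoc above can be sketched as follows. This is an illustration of the documented rule only, not the PR's implementation: `MnpGrouping` and `group` are hypothetical names, positions are assumed to be sorted and to lie within one alignment block, and "distance" is taken to mean the difference between consecutive substitution positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the documented grouping rule, not GATK code.
final class MnpGrouping {
    /** Starts a new group whenever a substitution is more than maxMnpDistance from the previous one. */
    static List<List<Integer>> group(final List<Integer> sortedPositions, final int maxMnpDistance) {
        final List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (final int pos : sortedPositions) {
            final boolean tooFar = !current.isEmpty()
                    && pos - current.get(current.size() - 1) > maxMnpDistance;
            if (tooFar) {
                groups.add(current); // close the group that ended at the previous substitution
                current = new ArrayList<>();
            }
            current.add(pos);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(final String[] args) {
        // The positions from the Javadoc example, with maxMnpDistance = 1.
        System.out.println(group(Arrays.asList(10, 11, 12, 14, 15, 17), 1));
    }
}
```

With the Javadoc's example positions and `maxMnpDistance = 1`, this prints `[[10, 11, 12], [14, 15], [17]]`: two multi-base groups and one lone substitution.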
@@ -501,13 +501,17 @@ private void updateReferenceHaplotype(final Haplotype newHaplotype) { | |||
* | |||
* <p/> | |||
* The result is sorted incrementally by location. | |||
* | |||
* @param maxMnpDistance Phased substitutions separated by this distance or less are merged into MNPs. More than |
This argument can't be <=0, right? If I'm right, can you update the docs? Any public methods should check the value with a ParamUtils.isPositive(...) call.
done
@davidbenjamin Back to you... |
@LeeTL1220 back to you. |
Can you either add a test for or show me what happens when you run HC MNPs in GVCF mode?
I'm also trying to wrap my head around what will happen in GenotypeGVCFs for samples that have one of the SNPs but not the full MNP.
@ldgauthier I added an integration test for GVCF mode and it works fine: the alleles are as expected with the addition of |
I started drawing this out on the whiteboard yesterday and I don't think MNP output is the best way to represent adjacent, phased SNPs for callsets generated from GVCFs. Suppose sample1 has two phased SNPs (A->T and C->G): one at position 5, one at position 6. Sample2 has a SNP only at position 5 and sample3 has a SNP only at position 6. I think the variants will be output as:
This is making me like the PID/PGT scheme a lot more. At least there we have better accuracy on the representation we provide, even if it requires a lot more post-processing. Do you have a sense of how hard it would be to split the MNP events (e.g. for sample1) after genotyping, give them the same likelihoods and apply the PID and PGT? |
@ldgauthier Some parts of splitting MNPs at the end of HaplotypeCaller are easy: breaking e.g. one DNP at position n into a SNP at n and a SNP at n + 1, and letting the SNPs inherit the PLs, AF, and AD of the parent MNP (okay, this isn't quite right because a read might end in the middle of the MNP, but close enough). . . but the general problem of splitting annotations seems like it might be too tricky. I'm leaning toward instead just modifying With your permission I would like to merge this PR and open a new issue for improving |
Of course I would also be glad to be told that splitting the annotations isn't really so hard. |
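As a rough illustration of the "easy part" described above, the sketch below splits an MNP into per-base SNPs and copies the parent's allele depths to each one. Everything here is hypothetical: `MnpSplitter` and `Snp` are stand-ins rather than htsjdk or GATK types, only AD is carried over, and it makes the same simplifying assumption discussed in this thread, namely that every read spans the whole MNP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Snp is a stand-in record, not htsjdk's VariantContext.
final class MnpSplitter {
    static final class Snp {
        final int position;
        final char ref;
        final char alt;
        final int[] alleleDepths; // copied from the parent MNP (assumes reads span the whole MNP)

        Snp(final int position, final char ref, final char alt, final int[] alleleDepths) {
            this.position = position;
            this.ref = ref;
            this.alt = alt;
            this.alleleDepths = alleleDepths;
        }

        @Override
        public String toString() {
            return position + ":" + ref + ">" + alt;
        }
    }

    /** Emits one SNP per differing base; bases that match the reference are skipped. */
    static List<Snp> split(final int start, final String ref, final String alt, final int[] alleleDepths) {
        if (ref.length() != alt.length()) {
            throw new IllegalArgumentException("an MNP must have ref and alt of equal length");
        }
        final List<Snp> snps = new ArrayList<>();
        for (int i = 0; i < ref.length(); i++) {
            if (ref.charAt(i) != alt.charAt(i)) {
                snps.add(new Snp(start + i, ref.charAt(i), alt.charAt(i), alleleDepths.clone()));
            }
        }
        return snps;
    }

    public static void main(final String[] args) {
        // A DNP at position 5 (AC -> TG) with made-up allele depths for illustration.
        System.out.println(split(5, "AC", "TG", new int[] {12, 9}));
    }
}
```

Annotations that depend on which reads overlap each individual site would not be handled by a plain copy like this, which is the harder part raised in the discussion.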
What do you need to split? Can't you make the assumption that all reads span both sites (admittedly not perfect) and copy the annotations? |
I think most annotations could just be copied, but let me list a few that would be non-trivial:
It's not much, but what would we do about them? |
@ldgauthier What do you think? |
There's enough work with the annotation handling to justify this being a separate task for the HaplotypeCaller side. Let's just turn your new phasing off for HaplotypeCaller GVCF mode. I'm still interested in it being available for single-sample because it would be awesome for clinical. |
@ldgauthier As the PR stands, MNPs are off by default for HC. Should I have it throw an error if they are turned on in GVCF mode, and should I turn it on by default in non-GVCF mode? |
Let's do off by default for all modes. Error if MNPs and GVCF mode. |
a1532a7 to 8694279
@ldgauthier Done and done. @LeeTL1220 I need your sign-off as well. |
@LeeTL1220 I ran this branch through M2 and oncotator on your MNPs bam, which had 9 DNPs, with and without @ldgauthier Is this ready to merge? |
👍 Sounds good if it generates the identical results. |
@davidbenjamin & @ldgauthier Sorry for commenting on a closed/merged PR but I wasn't sure where else to take the discussion. If there's a more appropriate place please redirect me! First off, this is very cool and I'm so glad to see this making its way into HC/M2! It's super helpful for functional annotation/clinical interpretation. Thanks for working on this! I had two thoughts which maybe belong as separate issues, but I figured I'd raise them here first and see what you thought:
|
@tfenne Thanks for commenting! 1) is up to Laura but I'm willing to put in time to implement a useful feature that she approves of. This PR only tries to phase SNPs within a single CIGAR element, which involved minimal change to the code. Personally I think once we're getting more complicated than that it's worth going more fundamental, that is, genotyping whole haplotypes. This is an experiment that will probably happen in Mutect2 within the next few months. |
@tfenne I was hesitant to output MNPs in GVCFs because (as you are aware) they're difficult to deal with in joint calling and I didn't want users to expect a MNP solution for multiple samples. An extra argument is an intriguing idea, maybe |
Closes #4647.
@LeeTL1220 Would you mind reviewing this?
@ldgauthier Since this affects HaplotypeCaller you might want to take a look, too. Let me know in particular if making MNPs the default is too bold.