Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added trimming to the Allele removal code in the genotyping engine #6044

Merged
merged 6 commits into from
Aug 14, 2019

Conversation

jamesemery
Copy link
Collaborator

I have tested that this explicitly works on the users data. I decided it was simplest to just check for mis-trimming at the very last stage. I'm a little weary about the change of the locus for the ref context from being the culledVC to being the mergedVC.

Fixes #5994

if( call != null ) {

readAlleleLikelihoods = prepareReadAlleleLikelihoodsForAnnotation(readLikelihoods, perSampleFilteredReadList,
emitReferenceConfidence, alleleMapper, readAlleleLikelihoods, call);

final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call);
final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call, annotationEngine);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this uses the mergedVC rather than the culledVC which in turn is used to construct the reference context used for annotation.

@droazen droazen requested a review from ldgauthier July 17, 2019 14:26
@codecov
Copy link

codecov bot commented Jul 17, 2019

Codecov Report

Merging #6044 into master will increase coverage by 0.001%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #6044       +/-   ##
===============================================
+ Coverage     87.206%   87.207%   +0.001%     
- Complexity     32722     32724        +2     
===============================================
  Files           2011      2011               
  Lines         150967    150981       +14     
  Branches       16134     16134               
===============================================
+ Hits          131653    131666       +13     
  Misses         13702     13702               
- Partials        5612      5613        +1
Impacted Files Coverage Δ Complexity Δ
...aller/HaplotypeCallerGenotypingEngineUnitTest.java 81.776% <100%> (+1.083%) 19 <1> (+1) ⬆️
...plotypecaller/HaplotypeCallerGenotypingEngine.java 89.944% <100%> (+0.114%) 54 <0> (ø) ⬇️
...nder/utils/runtime/StreamingProcessController.java 67.299% <0%> (-0.474%) 33% <0%> (ø)

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery These changes fix the issue but if I understand correctly (and I might not) I think you may be able to rearrange things for clarity.

I noticed that you need to explicitly keep track of mergedAlleles more than you might want -- it's not final, you need mergedAlleles++ when added NON_REF_ALLELE, and you have to pass it to makeAnnotatedCall. My first instinct was "wait, why isn't this equal to mergedVC.getAlleles().size()", but I soon realized that this was the old buggy code. Nonetheless, it would be really nice if that old buggy code worked. That is, call.getAlleles().size() == mergedVC.getAlleles().size() really should be the criterion for trimming.

Now, if I understand right, the reason this fails is that about 20 lines upstream of makeAnnotatedCall we have mergedVC = removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC) and don't check whether trimming is necessary. I would prefer to check whether trimming is necessary inside removeAltAllelesIfTooManyGenotypes and inside makeAnnotatedCall in order to render both methods self-contained. The performance cost, if any, will be completely negligible.

@jamesemery
Copy link
Collaborator Author

Thank you for the speedy review @davidbenjamin. I agree with you that the obvious place to trim the alleles is in removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC) as it is the place where we actually edit the output. Indeed my first attempt at this fix was to make that change. Unfortunately, because the readAlleleLikelihoods object is constructed with the un-trimmed alleles in the alleleMapper was causing failures because the Liklihoods object would have mismatching alleles. To fix removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC) we would have to edit the alleleMapper object, which would be difficult given that I would prefer to just use the allele trimming library object.

Another proposal would have been to just hold onto the mergedVC object before we cull the extra alleles and then just compare the alleles at the end. Unfortunately due to engine code optimizations we have enabled an unsafe allele list copy for these alleles in the HaplotypeCaller (to save ourselves the cost of allocating dozens of identical ArrayLists to store Haplotypes every time we use the VariantContextBuilder).

To clarify, it is possible to move the check to the right place its likely to force me to write a non-library implementation of the trimming code that tracks what edits it made and I was trying to avoid doing that.

@davidbenjamin
Copy link
Contributor

@jamesemery In that case, I give a 👍 as far as correctness is concerned and I will leave it to the engine team to sort out the software engineering trade-offs. You can merge whenever you are satisfied.

@@ -167,6 +168,7 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp
loc);

VariantContext mergedVC = AssemblyBasedCallerUtils.makeMergedVariantContext(eventsAtThisLocWithSpanDelsReplaced);
int mergedAlleles = mergedVC.getAlleles().size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable name isn't helpful. Maybe something like "numberOfAllelesBeforeSubsetting" would be clearer

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I think there's a newly introduced NPE here. I have a bunch of comments about minor stuff too.

I might actually just pull the trimming out of this method and do it outside of the annotating.

@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC,
return vcb.make();
}

protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) {
@VisibleForTesting
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about the new variable name in this method

@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC,
return vcb.make();
}

protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) {
@VisibleForTesting
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
final SimpleInterval refLocInterval= new SimpleInterval(refLoc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we make a copy of the refloc?

@@ -167,6 +168,7 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp
loc);

VariantContext mergedVC = AssemblyBasedCallerUtils.makeMergedVariantContext(eventsAtThisLocWithSpanDelsReplaced);
int mergedAlleles = mergedVC.getAlleles().size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You deference mergedVC here, but immediately afterwards there is a check for mergedVC == null. I think you've introduced an NPE here.

@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC,
return vcb.make();
}

protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) {
@VisibleForTesting
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify this:

Suggested change
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
final SimpleInterval locus = new SimpleInterval(mergedVC);

final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
final SimpleInterval refLocInterval= new SimpleInterval(refLoc);
final ReferenceDataSource refData = new ReferenceMemorySource(new ReferenceBases(ref, refLocInterval), header.getSequenceDictionary());
final ReferenceContext referenceContext = new ReferenceContext(refData, locus, refLocInterval);

final VariantContext untrimmedResult = annotationEngine.annotateContext(call, tracker, referenceContext, readAlleleLikelihoods, a -> true);
return call.getAlleles().size() == mergedVC.getAlleles().size() ? untrimmedResult
return call.getAlleles().size() == mergedAllelesSize ? untrimmedResult
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be simpler if we passed in a boolean called trimAlleles. This is also fine but the mergedAllelesSize variable needs to be named something clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also want to use the untrimmedResult size instead of call since that would be more foolproof against changes in the future.

final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
final SimpleInterval refLocInterval= new SimpleInterval(refLoc);
final ReferenceDataSource refData = new ReferenceMemorySource(new ReferenceBases(ref, refLocInterval), header.getSequenceDictionary());
final ReferenceContext referenceContext = new ReferenceContext(refData, locus, refLocInterval);

final VariantContext untrimmedResult = annotationEngine.annotateContext(call, tracker, referenceContext, readAlleleLikelihoods, a -> true);
return call.getAlleles().size() == mergedVC.getAlleles().size() ? untrimmedResult
return call.getAlleles().size() == mergedAllelesSize ? untrimmedResult
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use call.getNAlleles() instead.

@lbergelson
Copy link
Member

Also, I would add a comment explaining why you can't trim after the code that removes alleles...

@jamesemery
Copy link
Collaborator Author

@lbergelson responded to your comments

@jamesemery jamesemery assigned lbergelson and unassigned jamesemery Aug 13, 2019
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
@VisibleForTesting
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesListSizeBeforePossibleTrimming, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's with this getContig()?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jamesemery jamesemery dismissed davidbenjamin’s stale review August 14, 2019 15:12

He approved the PR in his subsequent comment

@jamesemery jamesemery merged commit 45d9ecb into master Aug 14, 2019
@jamesemery jamesemery deleted the je_fixAccidentalUntrimmedAllelesInHaplotypeCaller branch August 14, 2019 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HaplotypeCaller emitting a MNP in GVCF mode when --max-mnp-distance is 0
3 participants