Added trimming to the Allele removal code in the genotyping engine #6044

jamesemery · 2019-07-16T20:45:59Z

I have explicitly tested that this works on the user's data. I decided it was simplest to just check for mis-trimming at the very last stage. I'm a little wary about changing the locus for the ref context from the culledVC to the mergedVC.

Fixes #5994

jamesemery · 2019-07-16T20:47:01Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

            if( call != null ) {

                readAlleleLikelihoods = prepareReadAlleleLikelihoodsForAnnotation(readLikelihoods, perSampleFilteredReadList,
                        emitReferenceConfidence, alleleMapper, readAlleleLikelihoods, call);

-                final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call);
+                final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call, annotationEngine);


Note: this uses the mergedVC rather than the culledVC which in turn is used to construct the reference context used for annotation.

codecov · 2019-07-17T14:56:11Z

Codecov Report

Merging #6044 into master will increase coverage by 0.001%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #6044       +/-   ##
===============================================
+ Coverage     87.206%   87.207%   +0.001%     
- Complexity     32722     32724        +2     
===============================================
  Files           2011      2011               
  Lines         150967    150981       +14     
  Branches       16134     16134               
===============================================
+ Hits          131653    131666       +13     
  Misses         13702     13702               
- Partials        5612      5613        +1

Impacted Files	Coverage Δ	Complexity Δ
...aller/HaplotypeCallerGenotypingEngineUnitTest.java	`81.776% <100%> (+1.083%)`	`19 <1> (+1)`	⬆️
...plotypecaller/HaplotypeCallerGenotypingEngine.java	`89.944% <100%> (+0.114%)`	`54 <0> (ø)`	⬇️
...nder/utils/runtime/StreamingProcessController.java	`67.299% <0%> (-0.474%)`	`33% <0%> (ø)`

davidbenjamin

@jamesemery These changes fix the issue but if I understand correctly (and I might not) I think you may be able to rearrange things for clarity.

I noticed that you need to explicitly keep track of mergedAlleles more than you might want: it's not final, you need mergedAlleles++ when adding NON_REF_ALLELE, and you have to pass it to makeAnnotatedCall. My first instinct was "wait, why isn't this equal to mergedVC.getAlleles().size()", but I soon realized that this was the old buggy code. Nonetheless, it would be really nice if that old buggy code worked. That is, call.getAlleles().size() == mergedVC.getAlleles().size() really should be the criterion for trimming.

Now, if I understand right, the reason this fails is that about 20 lines upstream of makeAnnotatedCall we have mergedVC = removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC) and don't check whether trimming is necessary. I would prefer to check whether trimming is necessary inside removeAltAllelesIfTooManyGenotypes and inside makeAnnotatedCall in order to render both methods self-contained. The performance cost, if any, will be completely negligible.
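To make the trimming concern concrete, here is a minimal sketch of why dropping an alt allele can leave the remaining alleles untrimmed. It works on plain strings and is not the library trimming code used in this PR; the class AlleleTrimSketch and the method trimSharedSuffix are hypothetical names chosen for illustration, and only trailing bases are trimmed.

```java
import java.util.ArrayList;
import java.util.List;

final class AlleleTrimSketch {
    /**
     * Removes trailing bases shared by every allele, always leaving at least
     * one base per allele. The first element is assumed to be the reference.
     */
    static List<String> trimSharedSuffix(final List<String> alleles) {
        int trim = 0;
        while (canTrimOneMore(alleles, trim)) {
            trim++;
        }
        final List<String> result = new ArrayList<>();
        for (final String allele : alleles) {
            result.add(allele.substring(0, allele.length() - trim));
        }
        return result;
    }

    private static boolean canTrimOneMore(final List<String> alleles, final int alreadyTrimmed) {
        if (alleles.isEmpty()) {
            return false;
        }
        final String first = alleles.get(0);
        for (final String allele : alleles) {
            final int remaining = allele.length() - alreadyTrimmed;
            if (remaining <= 1) {
                return false; // never trim an allele down to nothing
            }
            // Every allele must end in the same base as the first one.
            if (allele.charAt(remaining - 1) != first.charAt(first.length() - alreadyTrimmed - 1)) {
                return false;
            }
        }
        return true;
    }
}
```

With alleles CAA, C, CA nothing can be trimmed because C is already a single base. Once C is removed, the remaining pair CAA, CA shares a trailing A and reduces to CA, C, which is the situation where a final trimming pass is needed.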

jamesemery · 2019-07-17T16:29:04Z

Thank you for the speedy review @davidbenjamin. I agree that the obvious place to trim the alleles is in removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC), since that is where we actually edit the output. Indeed, my first attempt at this fix made exactly that change. Unfortunately, the readAlleleLikelihoods object is constructed with the untrimmed alleles in the alleleMapper, so trimming there caused failures because the likelihoods object would have mismatching alleles. To trim inside removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC) we would also have to edit the alleleMapper object, which would be difficult given that I would prefer to just use the allele trimming library object.

Another proposal would have been to just hold onto the mergedVC object before we cull the extra alleles and then just compare the alleles at the end. Unfortunately due to engine code optimizations we have enabled an unsafe allele list copy for these alleles in the HaplotypeCaller (to save ourselves the cost of allocating dozens of identical ArrayLists to store Haplotypes every time we use the VariantContextBuilder).

To clarify, it is possible to move the check to the right place, but it is likely to force me to write a non-library implementation of the trimming code that tracks what edits it made, and I was trying to avoid doing that.
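One way a shared, uncopied allele list could defeat the "hold onto the earlier object and compare later" idea is sketched below. This is an assumption about the general hazard, not a description of the actual engine code: Record is a hypothetical stand-in that keeps the caller's list without a defensive copy, so a later removal is visible through the earlier object too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SharedAlleleListSketch {
    /** Hypothetical record type that stores the caller's list as-is (no defensive copy). */
    static final class Record {
        private final List<String> alleles;

        Record(final List<String> alleles) {
            this.alleles = alleles;
        }

        int alleleCount() {
            return alleles.size();
        }
    }

    /** Returns {count captured before subsetting, count seen through the old record afterwards}. */
    static int[] demo() {
        final List<String> alleles = new ArrayList<>(Arrays.asList("CAA", "C", "CA"));
        final Record merged = new Record(alleles);

        // Capture the size as a plain int before anything mutates the list.
        final int countBeforeSubsetting = merged.alleleCount();

        // Subsetting edits the shared list in place.
        alleles.remove("C");

        return new int[] { countBeforeSubsetting, merged.alleleCount() };
    }
}
```

Under that assumption the old record no longer remembers how many alleles it had, which is consistent with the PR capturing the allele count in an int up front rather than comparing against a retained object.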

davidbenjamin · 2019-07-17T19:58:26Z

@jamesemery In that case, I give a 👍 as far as correctness is concerned and I will leave it to the engine team to sort out the software engineering trade-offs. You can merge whenever you are satisfied.

lbergelson · 2019-08-13T17:57:51Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

@@ -167,6 +168,7 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp
                            loc);

            VariantContext mergedVC = AssemblyBasedCallerUtils.makeMergedVariantContext(eventsAtThisLocWithSpanDelsReplaced);
+            int mergedAlleles = mergedVC.getAlleles().size();


This variable name isn't helpful. Maybe something like "numberOfAllelesBeforeSubsetting" would be clearer

lbergelson

@jamesemery I think there's a newly introduced NPE here. I have a bunch of comments about minor stuff too.

I might actually just pull the trimming out of this method and do it outside of the annotating.

lbergelson · 2019-08-13T19:00:49Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC,
        return vcb.make();
    }

-    protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) {
+    @VisibleForTesting
+    static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {


same comment about the new variable name in this method

lbergelson · 2019-08-13T19:20:07Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC,
        return vcb.make();
    }

-    protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) {
+    @VisibleForTesting
+    static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
        final SimpleInterval refLocInterval= new SimpleInterval(refLoc);


Why do we make a copy of the refloc?

lbergelson · 2019-08-13T19:24:39Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

@@ -167,6 +168,7 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp
                            loc);

            VariantContext mergedVC = AssemblyBasedCallerUtils.makeMergedVariantContext(eventsAtThisLocWithSpanDelsReplaced);
+            int mergedAlleles = mergedVC.getAlleles().size();


You dereference mergedVC here, but immediately afterwards there is a check for mergedVC == null. I think you've introduced an NPE here.
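The ordering problem can be sketched in isolation as follows. MergedRecord and totalAlleles are hypothetical names, and the loop only models the general shape of "build a per-locus record that may be null, then use it"; the point is that the null check has to come before the first dereference.

```java
import java.util.Arrays;
import java.util.List;

final class NullGuardSketch {
    /** Hypothetical stand-in for a merged variant record; only the allele list matters here. */
    static final class MergedRecord {
        final List<String> alleles;

        MergedRecord(final List<String> alleles) {
            this.alleles = alleles;
        }
    }

    /** Sums allele counts over loci, skipping loci where no merged record could be built. */
    static int totalAlleles(final List<MergedRecord> perLocus) {
        int total = 0;
        for (final MergedRecord merged : perLocus) {
            if (merged == null) {
                continue; // guard first: nothing to count at this locus
            }
            total += merged.alleles.size(); // safe: merged is known to be non-null here
        }
        return total;
    }
}
```

Reading the size before the null check, as in the hunk above, would throw on the first locus with no merged record.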

lbergelson · 2019-08-13T19:38:09Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC,
        return vcb.make();
    }

-    protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) {
+    @VisibleForTesting
+    static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());


You can simplify this:

Suggested change

-        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
+        final SimpleInterval locus = new SimpleInterval(mergedVC);

lbergelson · 2019-08-13T19:39:12Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
        final SimpleInterval refLocInterval= new SimpleInterval(refLoc);
        final ReferenceDataSource refData = new ReferenceMemorySource(new ReferenceBases(ref, refLocInterval), header.getSequenceDictionary());
        final ReferenceContext referenceContext = new ReferenceContext(refData, locus, refLocInterval);

        final VariantContext untrimmedResult =  annotationEngine.annotateContext(call, tracker, referenceContext, readAlleleLikelihoods, a -> true);
-        return call.getAlleles().size() == mergedVC.getAlleles().size() ? untrimmedResult
+        return call.getAlleles().size() == mergedAllelesSize ? untrimmedResult


It might be simpler if we passed in a boolean called trimAlleles. This is also fine but the mergedAllelesSize variable needs to be named something clearer.

You might also want to use the untrimmedResult size instead of call since that would be more foolproof against changes in the future.

lbergelson · 2019-08-13T19:40:24Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
        final SimpleInterval refLocInterval= new SimpleInterval(refLoc);
        final ReferenceDataSource refData = new ReferenceMemorySource(new ReferenceBases(ref, refLocInterval), header.getSequenceDictionary());
        final ReferenceContext referenceContext = new ReferenceContext(refData, locus, refLocInterval);

        final VariantContext untrimmedResult =  annotationEngine.annotateContext(call, tracker, referenceContext, readAlleleLikelihoods, a -> true);
-        return call.getAlleles().size() == mergedVC.getAlleles().size() ? untrimmedResult
+        return call.getAlleles().size() == mergedAllelesSize ? untrimmedResult


You can use call.getNAlleles() instead.

lbergelson · 2019-08-13T19:45:51Z

Also, I would add a comment explaining why you can't trim after the code that removes alleles...

jamesemery · 2019-08-13T20:12:25Z

@lbergelson responded to your comments

lbergelson · 2019-08-13T20:22:09Z

...broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

-        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd());
+    @VisibleForTesting
+    static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesListSizeBeforePossibleTrimming, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) {
+        final SimpleInterval locus = new SimpleInterval(mergedVC.getContig());


what's with this getContig()?

lbergelson

👍

He approved the PR in his subsequent comment

jamesemery added 2 commits July 16, 2019 16:18

added trimming to the Allele removal code in the genotyping engine

3c876b2

resolving any possible performance cost this may incur

a8cd6e1

jamesemery requested a review from davidbenjamin July 16, 2019 20:45

jamesemery commented Jul 16, 2019

jamesemery added 2 commits July 16, 2019 18:06

made some changes to check for possible performance regressions (and …

ca6bd60

…handled unsafe Allele list operations

fixed a pesky compiler warning issue

d2b77e5

droazen requested a review from ldgauthier July 17, 2019 14:26

davidbenjamin previously requested changes Jul 17, 2019

lbergelson reviewed Aug 13, 2019

lbergelson requested changes Aug 13, 2019

lbergelson assigned jamesemery Aug 13, 2019

responding to louis' comments

c092a62

jamesemery assigned lbergelson and unassigned jamesemery Aug 13, 2019

lbergelson reviewed Aug 13, 2019

whoops

e5f1436

lbergelson approved these changes Aug 13, 2019

jamesemery merged commit 45d9ecb into master Aug 14, 2019

jamesemery deleted the je_fixAccidentalUntrimmedAllelesInHaplotypeCaller branch August 14, 2019 15:14
