Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --include-non-variant-sites to GenotypeGVCFs. #5219

Merged
merged 7 commits into from Nov 5, 2018

Conversation

@cmnbroad
Copy link
Collaborator

commented Sep 25, 2018

Fixes #2865.

To match GATK without sacrificing performance for existing GenotypeGVCF modes, this adds a new traversal type VariantWalkerGroupedByLocus which can process variants one-at-a-time, or optionally grouped by Locus. GenotypeGVCFs itself then has to further filter the variants to emulate the "prioritize start location" behavior in GATK3. If we keep this traversal, I'll add some unit tests, and we should find a better name.

In order to match GATK3 output for sites where the only alt allele is a spanning deletion, I had to use OutputMode.EMIT_ALL_SITES when emitting nonvariant sites only (GATK3 uses OutputMode.EMIT_ALL_CONFIDENT_SITES in that case). Specifically, its necessary to to prevent GenotypingEngine from entering this recently added code block, and short circuiting out. @ldgauthier Any opinion on whether this is too permissive ? Another option would be to special-case this, perhaps by adding another OutputMode state.

@cmnbroad cmnbroad requested a review from ldgauthier Sep 25, 2018

@ldgauthier

This comment has been minimized.

Copy link
Contributor

commented Sep 25, 2018

@cmnbroad I haven't actually looked at the code yet, but we don't necessarily have to match GATK3 output. In the expected file at 20:10004769 where there's a deletion and then each position within the deletion reported out again with exactly the same attributes I'm not married to that representation. Furthermore, I don't know what SelectVariants will do if you ask for e.g. 20: 10004772 -- should you get the deletion? The deletion and the *-only context at the requested position without the previous positions in the deletion? Does it make things any simpler if you don't have to output *-only sites?

@cmnbroad

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2018

@ldgauthier Emitting the spanning-deletion-only sites completely would definitely make things simpler, since I could then use EMIT_ALL_CONFIDENT_SITES and GenotypingEngine would just naturally do the right thing. Also, the result would comport with my own naive expectations. Using the current PR, outputs could contain LowQual sites that GATK3 wouldn't have included. So if you're good with that, I'll update the PR and tests to reflect that before any more reviewing.

@codecov-io

This comment has been minimized.

Copy link

commented Sep 27, 2018

Codecov Report

Merging #5219 into master will increase coverage by 0.015%.
The diff coverage is 89.189%.

@@               Coverage Diff               @@
##              master     #5219       +/-   ##
===============================================
+ Coverage     86.798%   86.813%   +0.015%     
- Complexity     29711     29822      +111     
===============================================
  Files           1820      1821        +1     
  Lines         137406    137775      +369     
  Branches       15160     15198       +38     
===============================================
+ Hits          119265    119606      +341     
- Misses         12637     12658       +21     
- Partials        5504      5511        +7
Impacted Files Coverage Δ Complexity Δ
...institute/hellbender/engine/VariantWalkerBase.java 100% <ø> (ø) 12 <0> (-2) ⬇️
...ute/hellbender/utils/variant/GATKVCFConstants.java 83.333% <100%> (+3.333%) 4 <0> (ø) ⬇️
...nstitute/hellbender/engine/MultiVariantWalker.java 97.297% <100%> (+0.746%) 15 <2> (+2) ⬆️
...der/tools/walkers/variantutils/SelectVariants.java 79.646% <100%> (ø) 116 <0> (-3) ⬇️
...er/tools/walkers/GenotypeGVCFsIntegrationTest.java 79.021% <100%> (+3.411%) 19 <1> (ø) ⬇️
...roadinstitute/hellbender/engine/VariantWalker.java 93.333% <100%> (+2.424%) 12 <2> (+2) ⬆️
...lbender/utils/variant/GATKVariantContextUtils.java 86.063% <100%> (+0.037%) 254 <3> (+10) ⬆️
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 88.957% <78.431%> (-0.959%) 84 <33> (+35)
...hellbender/engine/VariantWalkerGroupedByLocus.java 90.741% <90.741%> (ø) 16 <16> (?)
...tools/walkers/haplotypecaller/HaplotypeCaller.java 84.211% <0%> (-15.789%) 23% <0%> (-1%)
... and 32 more
@cmnbroad

This comment has been minimized.

Copy link
Collaborator Author

commented Oct 1, 2018

Ok, based on internal discussions, I updated this to not emit the spanning-del only sites, but to include the LowQual sites.

@lbergelson
Copy link
Collaborator

left a comment

@cmnbroad Back to you, I have a bunch of comments, mostly aesthetic. I'm not totally sold on the new walker type, but I don't see a better way to implement it without having to do a much deeper re-write of GenotypeGVCFs so it's probably the best solution.

I think it might be worth pulling out another class between MultiVariantWalker and VariantWalkerBase but I can't figure out a good naming scheme which would make the distinction clear.

// Get all of the variants overlapping this locus, filter them and collect into a list
final Iterator<VariantContext> overlappingVariants = drivingVariants.query(locus);
final List<VariantContext> filteredVariants = getTransformedVariantStream(
Spliterators.spliteratorUnknownSize(overlappingVariants, 0), variantFilter)

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

I would just pass through the Iterator and then use Utils.stream() to get a stream from it. No need to mess around with Spliterators and StreamSupport explicitly.

*/
public abstract class VariantWalkerGroupedByLocus extends VariantWalkerBase {

// NOTE: using File rather than FeatureInput<VariantContext> here so that we can keep this driving source

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

It says file but it's a String

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Done.

private FeatureDataSource<VariantContext> drivingVariants;
private FeatureInput<VariantContext> drivingVariantsFeatureInput;

private boolean traverseByLocus = false;

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

As you mentioned, it's weird that this is off by default. I would invert it.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Done.

* Enable grouped of variants by locus. When true, the {@link #apply} method will be called with all overlapping
* variants at each each locus that has overlapping variants.
*/
protected void setGroupByLocus() { traverseByLocus = true; }

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

The naming here is a bit confusing. It sounds like it's a setter for the field groupByLocus which should take a parameter. Maybe calling it changeTraversalModeToGroupByLocus() would be better?

In any case, I think we should throw if this method is called after traversal has already started.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Inverted, and named changeTraversalModeToByVariant.


if (traverseByLocus) {
getLocusStream(getTraversalIntervals())
.forEach(locus -> {

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

We misuse this all over the place, but this should really be forEachOrdered

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Yeah, done.

* @return true if the only alternate allele for this VariantContext is a spanning deletion, otherwise false.
*/
public static boolean isSpanningDeletionOnly(final VariantContext vc){
return vc.getAlternateAlleles().size() == 1 && vc.getAlternateAllele(0).basesMatch(Allele.SPAN_DEL);

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

This should probably call GATKVCFConstants.isSpanningDeletion. It's not done uniformly but it should probably be made to be uniform.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Done.

{getTestFile(BASE_PAIR_GVCF), getTestFile( "expected/gvcf.basepairResolution.includeNonVariantSites.vcf"), Collections.singletonList("--" + GenotypeGVCFs.ALL_SITES_LONG_NAME), b37_reference_20_21 },
{getTestFile( "combine.single.sample.pipeline.1.vcf"),
getTestFile( "expected/combine.single.sample.pipeline.1.include_nonvariant.vcf"),
Arrays.asList( " --include-non-variant-sites -L 20:10,030,000-10,033,000 -L 20:10,386,000-10,386,500 "),

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

I would use the constant here instead of hardcoding --include-non-variant-sites. We're not consistent about it at all, but better to use the constants when adding new ones. I think -L is so embedded in gatk that I wouldn't bother constantizing it...

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Good point - done.

final List<Allele> newAlleles = new ArrayList<>();
// Only keep alleles that are not NON-REF
for ( final Allele allele : vc.getAlleles() ) {
if ( !allele.equals(GATKVCFConstants.NON_REF_SYMBOLIC_ALLELE) ) {

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

use the existing constant in Allele

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Done.

"AS_QD",
"RGQ",

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

Why do we need these new exclusions?

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Good question. GATK4 doesn't appear to use AN, nor have code that moves GQ to RGQ like GATK3 did. So these are removed from the comparison with GATK3 output. @ldguathier - is is correct that GATK4 doesn't do this ?

This comment has been minimized.

Copy link
@ldgauthier

ldgauthier Oct 9, 2018

Contributor

I'm not sure anyone actually used RGQ. We probably didn't port that feature.

I don't know why we're ignoring AN though. That should be part of the ChromosomeCounts annotation (also AC and AF) that's part of the standard group. What happens if you don't ignore AN?

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

@ldgauthier GATK4 isn't adding AN because ChromosomeCounts isn't Reducible (so its bypassed by the annotation engine here). Perhaps thats a separate problem ?

This comment has been minimized.

Copy link
@ldgauthier

ldgauthier Oct 10, 2018

Contributor

But in your testSpanningDeletion output, there's an AC=2 and AN=2 on the deletion, which must have come from ChromosomeCounts. Is it just not being called consistently? We don't need it at non-variant sites.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 10, 2018

Author Collaborator

I see your point. For most of the new --includeNonVariant tests, the expected output files came from GATK3, which emits AN annotations for all sites with qual scores (?), even non-variant ones. But GATK4 only emits them for variants, which is why I originally excluded it from verification. But for the tests where I implemented different behavior than GATK3 (like testSpanningDeletion and the LowQual test), my expected files came from GATK4 output after manually verifying them.

So, already long story short, I think the right thing to do is to remove AN from the exclusion list, and just update my test outputs for the GATK3 --includeNonVariant tests with the actual GATK4 output, which we already know matches GATK4 except for the AN.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 10, 2018

Author Collaborator

I updated the output files and removed the AN test exclusion.

* VariantWalkerGroupedByLocus authors must implement the {@link #apply} method to process each variant, and may optionally implement
* {@link #onTraversalStart}, {@link #onTraversalSuccess} and/or {@link #closeTool}.
*/
public abstract class VariantWalkerGroupedByLocus extends VariantWalkerBase {

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 2, 2018

Collaborator

Maybe VariantLocusWalker would be more accurate?

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Yeah, though VariantWalkerGroupedByLocus has symmetry with MultiVariantWalkerGroupedOnStart, for what thats worth. Happy to change it if you think VariantLocusWalker is better .

This comment has been minimized.

Copy link
@lbergelson

lbergelson Oct 9, 2018

Collaborator

I think I would change it. It's more clear that way.

@lbergelson lbergelson assigned cmnbroad and unassigned lbergelson and ldgauthier Oct 2, 2018

20 10012731 . C . 27.02 . DP=4 GT:AD:DP:RGQ 0/0:2,0:2:6 ./.:0,0:0 0/0:2,0:2:6
20 10012732 . A . 27.02 . DP=4 GT:AD:DP:RGQ 0/0:2,0:2:6 ./.:0,0:0 0/0:2,0:2:6
20 10012733 . G . 27.02 . DP=4 GT:AD:DP:RGQ 0/0:2,0:2:6 ./.:0,0:0 0/0:2,0:2:6
20 10012734 . G . 0.63 LowQual DP=4;ExcessHet=3.01;MQ=NaN GT:DP:RGQ 0/0:2:6 0/0:0:0 0/0:2:6

This comment has been minimized.

Copy link
@ldgauthier

ldgauthier Oct 9, 2018

Contributor

I looked at the input for this and there were PLs for all the samples (and a bunch more FORMAT annotations):
20 10012751 . T C,<NON_REF> . . DP=4;ExcessHet=3.01;RAW_MQ=14400.00 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:0,2,0:2:6:.:49,6,0,49,6,49:0,0,1,1 ./.:.:0:0:0:0,0,0,0,0,0 ./.:0,2,0:2:6:.:49,6,0,49,6,49:0,0,1,1
Specifically that first sample should be homVar (1/1) even though it's GQ is pretty crappy. If they're all actually homRef then it would be very strange to call a variant there, even if it's very low quality. Am I looking at the wrong input?

This comment has been minimized.

Copy link
@ldgauthier

ldgauthier Oct 9, 2018

Contributor

Missing ADs are also why the MQ=NaN.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

@ldgauthier 20 10012751 is outside the query interval for this test. I used a narrow interval that resulted in a filtered LowQual (20 10012734) being emitted by GATK4, and compared it against GATK3 - I had to manually force GATK3 to use EMIT_ALL_SITES in --includeNonVariants mode to do this), and restricted the test to a small interval around that. For 20 10012734 the GATK3/4 results are nearly identical:

GATK3: 20 10012734 . G . 0.63 LowQual AN=6;DP=4;ExcessHet=3.01 GT:DP:RGQ 0/0:2:6 0/0:0:0 0/0:2:6
GATK4: 20 10012734 . G . 0.63 LowQual DP=4;ExcessHet=3.01;MQ=NaN GT:DP:RGQ 0/0:2:6 0/0:0:0 0/0:2:6

This comment has been minimized.

Copy link
@ldgauthier

ldgauthier Oct 10, 2018

Contributor

Totally looking at the wrong input. Sorry. I do wonder about the ADs, though it looks like that was also an issue in the GATK3 output.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 10, 2018

Author Collaborator

Phew.

##source=GenotypeGVCFs
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-OXRP-0003 GTEX-QXCU-0004 GTEX-RVPV-0003
20 10096905 . TA T 69.37 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.69;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,2:2:6:1|1:10096899_G_T:90,6,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0
20 10096907 . A . 18.38 . DP=2 GT:AD:DP ./.:2,0:2 ./.:0,0:0 ./.:0,0:0

This comment has been minimized.

Copy link
@ldgauthier

ldgauthier Oct 9, 2018

Contributor

I'm really happy that all the positions requested have data (905 and 906 in the deletion and 907 subsequently). This does the right thing by dropping the *-only line in the combined GVCF. Can you add output for 20: 10624924 -10624926, which has a spanning deletion with another alt allele?

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 9, 2018

Author Collaborator

Yeah - good idea.

This comment has been minimized.

Copy link
@cmnbroad

cmnbroad Oct 10, 2018

Author Collaborator

Done.

@droazen droazen self-assigned this Oct 15, 2018

@droazen droazen requested review from droazen and removed request for droazen Oct 15, 2018

@droazen droazen removed their assignment Oct 15, 2018

@droazen

This comment has been minimized.

Copy link
Collaborator

commented Oct 17, 2018

@cmnbroad Looks like this needs a rebase

@cmnbroad cmnbroad force-pushed the cn_include_nonvariant_sites branch from fd70899 to a2e8393 Oct 17, 2018

@cmnbroad cmnbroad force-pushed the cn_include_nonvariant_sites branch from a2e8393 to 4f687d2 Oct 17, 2018

@ldgauthier
Copy link
Contributor

left a comment

Thanks for the new test data. This is good to go!

@lbergelson
Copy link
Collaborator

left a comment

This looks good to me.

@cmnbroad cmnbroad merged commit f11bbde into master Nov 5, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@cmnbroad cmnbroad deleted the cn_include_nonvariant_sites branch Nov 5, 2018

@droazen

This comment has been minimized.

Copy link
Collaborator

commented Nov 5, 2018

Woohoo! 🍾 🎆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.