New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fixing bugs in CalculateGenotypePosteriers #4352
Conversation
@ldgauthier Could you take a look at this? I changed a bunch of logic to bring it inline with modern tool design. I think it's right, but maybe I'm breaking something that I didn't understand? |
Codecov Report
@@ Coverage Diff @@
## master #4352 +/- ##
==============================================
+ Coverage 79.04% 79.055% +0.015%
- Complexity 16447 16449 +2
==============================================
Files 1047 1047
Lines 59189 59185 -4
Branches 9672 9670 -2
==============================================
+ Hits 46783 46789 +6
+ Misses 8644 8637 -7
+ Partials 3762 3759 -3
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You took out a weird-looking conditional that was a fix for a particular issue. I'm open to an alternate fix that's not so awkward.
While we're here, I'm uncomfortable with the level of reference checking. The supportVariants VCF has to be on the same reference as the input.
if ( header.hasInfoLine(GATKVCFConstants.MLE_ALLELE_COUNT_KEY) ) { | ||
final VCFInfoHeaderLine mleLine = header.getInfoHeaderLine(GATKVCFConstants.MLE_ALLELE_COUNT_KEY); | ||
if ( mleLine.getCountType() != VCFHeaderLineCount.A ) { | ||
throw new UserException("VCF does not have a properly formatted MLEAC field: the count type should be \"A\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no regression test for this, but it's from a particular issue: https://github.com/broadinstitute/gsa-unstable/issues/845
The issue was caused by a 4.0 VCF as the supplemental VCF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmn, I think I thought this was being checked somewhere else now, but I can't find any evidence that it is. Not sure why I took that out...
fixing 3 issues in CalclualteGenotypePosteriers * no longer emit duplicate variants if you have overlapping VCF records in the input * specifying vcf.gz output will produce a vcf.gz instead of an unzipped file called vcf.gz * fix potential NPE during shutdown
fa09fce
to
acdd17c
Compare
@ldgauthier I undid that change that made no sense and added some new tests so it doesn't happen again. |
@ldgauthier I believe that all of the features reference dictionaries are already checked against eachother for "compatibility". That check is currently kind of strange, but it catches bizarre inconsistencies. |
if (!skipPopulationPriors) { | ||
vc_bothPriors = PosteriorProbabilitiesUtils.calculatePosteriorProbs(vc_familyPriors, otherVCs, missing * numRefIfMissing, globalPrior, !ignoreInputSamples, defaultToAC, useACoff); | ||
} else { | ||
final VariantContextBuilder builder2 = new VariantContextBuilder(vc_familyPriors); | ||
VariantContextUtils.calculateChromosomeCounts(builder, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this should probably be builder2, right? All this time and no one ever said anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely looks like it should be. It's effectively the same thing, but it's very confusing and if anyone ever make changes it could lead to problems... I might start a separate PR to address that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, what am I talking about... the thing that gets output should be the the result of modifying builder2 and we modify builder. That seems super wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, i'm back to "it's dumb but fine"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new pr to fix this #4431
Take a look at the builder2 issue, but otherwise LGTM! 👍 |
fixing 3 issues in CalclualteGenotypePosteriers
no longer emit duplicate variants if you have overlapping VCF records in the input
specifying vcf.gz output will produce a vcf.gz instead of an unzipped file called vcf.gz
fix potential NPE during shutdown
fixes Duplicate records in VCF produced by CalculateGenotypePosteriors #4346