New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A couple of fixes for writing VCFs in GenotypeConcordance #810
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch @meganshand
a few comments
##contig=<ID=GL000203.1,length=37498> | ||
##contig=<ID=GL000246.1,length=38154> | ||
##contig=<ID=GL000249.1,length=38502> | ||
##contig=<ID=GL000196.1,length=38914> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for berevity, could you remove these GL*** contigs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
##INFO=<ID=HapNoVar,Number=1,Type=Integer,Description="Number of datasets for which HaplotypeCaller called a variant within 35bp and did not call a variant at this location"> | ||
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length"> | ||
##INFO=<ID=NoCG,Number=0,Type=Flag,Description="Present if no consensus reached, so looked at all datasets except Complete Genomics since it may have a different representation of complex variants"> | ||
##INFO=<ID=NoPLTot,Number=1,Type=Integer,Description="Number of datasets with likelihood ratio > 20 for a genotype different from the called genotype"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, could you remove all these unused header lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally, one would be able to see both VCFs at the same time in the github diff...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -432,14 +432,11 @@ private void writeVcfTuple(final VcfTuple tuple, final VariantContextWriter writ | |||
final List<Allele> truthAlleles = alleles.truthAlleles(); | |||
final List<Allele> callAlleles = alleles.callAlleles(); | |||
|
|||
// Get the alleles present at this site for both samples to use for the output variant context. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for extra newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
siteAlleles.addAll(callContext.getAlleles()); | ||
} | ||
siteAlleles.addAll(allAlleles); | ||
siteAlleles.remove(Allele.NO_CALL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a symbolic allele to your test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that GenotypeConcordance doesn't support symbolic alleles:
Note that only SNP and INDEL variants are considered, MNP, Symbolic, and Mixed classes of variants are not included.
That said, a symbolic allele does cause it to blow up if you are writing an output VCF. In this case what should the tool do? Skip that site or blow up with an appropriate error message?
genotypeConcordance.OUTPUT = new File(OUTPUT_DATA_PATH, "MultipleRefAlleles"); | ||
genotypeConcordance.OUTPUT_VCF = true; | ||
|
||
Assert.assertEquals(genotypeConcordance.instanceMain(new String[0]), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also read in the metrics file and validate the results in it, rather than just checking that the program didn't blowup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -327,11 +327,11 @@ private boolean indexExists(final File vcf) { | |||
final String condition = truthVariantContextType + " " + callVariantContextType; | |||
final Integer count = unClassifiedStatesMap.getOrDefault(condition, 0) + 1; | |||
unClassifiedStatesMap.put(condition, count); | |||
} else { | |||
// write to the output VCF | |||
writer.ifPresent(w -> writeVcfTuple(tuple, w, scheme)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that if we prevent writing unless stateClassified
is true, we will not write sites with no variation in either original VCF. This means that a site that is HOM REF in both the truth and call VCFs will not be output. (This is currently breaking other GenotypeConcordance tests.) Is the desired behavior to output no variation sites?
👍 Thanks for this @meganshand |
2834099
to
dd4dc88
Compare
Description
There are a couple of fixes when running GenotypeConcordance with
OUTPUT_VCF=true
:Fix in GenotypeConcordance for writing VCFs with no-call genotypes #785 ensured that the VCF builder used the original site alleles instead of the genotype alleles in order to prevent the writing of no calls. Unfortunately, this meant that the site alleles were not being normalized by
normalizeAlleles
, which is needed when you are merging two variant contexts with different REF alleles (due to indels). This is now fixed by using the normalized genotype alleles, but removing no calls from the set before making the builder.When a no call site is normalized the new genotype allele would add the surplus bases after the
.
. For example if the ref allele was changing fromA
toACCC
, the no call genotype would change from.
to.CCC
. This is fixed/tested in the last two commits.Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests