A couple of fixes for writing VCFs in GenotypeConcordance #810

Merged
meganshand merged 1 commit into master from ms_gcWriteMultiRefAlleles on Jun 1, 2017

@meganshand meganshand commented May 5, 2017

Description

This PR contains a couple of fixes for running GenotypeConcordance with OUTPUT_VCF=true:

  1. #785 ("Fix in GenotypeConcordance for writing VCFs with no-call genotypes") made the VCF builder use the original site alleles instead of the genotype alleles, in order to prevent no-calls from being written. Unfortunately, this meant that the site alleles were no longer normalized by normalizeAlleles, which is needed when merging two variant contexts with different REF alleles (due to indels). This is now fixed by using the normalized genotype alleles and removing no-calls from the set before making the builder.

  2. When a no-call site was normalized, the surplus bases were appended to the no-call allele after the ".". For example, if the REF allele changed from A to ACCC, the no-call genotype would change from . to .CCC. This is fixed and tested in the last two commits.
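
The two fixes above can be illustrated with a minimal sketch in plain Java. It models alleles as strings rather than htsjdk Allele objects, and the class and method names (AlleleNormalizationSketch, extendAllele, siteAlleles) are hypothetical; this is not the code in GenotypeConcordance, only an illustration of the intended behavior: a no-call stays "." when the REF is extended, and no-calls are dropped from the site allele set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not Picard's implementation.
class AlleleNormalizationSketch {
    static final String NO_CALL = ".";

    // Pad an allele with the bases that were appended to the REF allele.
    // A no-call carries no bases, so it must be returned unchanged
    // (the bug described in item 2 produced ".CCC" here).
    static String extendAllele(String allele, String surplusBases) {
        return NO_CALL.equals(allele) ? NO_CALL : allele + surplusBases;
    }

    // Collect the alleles present at the site from the already-normalized
    // genotype alleles, then drop the no-call so it is never a site allele.
    static Set<String> siteAlleles(String ref, List<String> truth, List<String> call) {
        Set<String> site = new LinkedHashSet<>();
        site.add(ref);
        site.addAll(truth);
        site.addAll(call);
        site.remove(NO_CALL);
        return site;
    }

    public static void main(String[] args) {
        // Truth VCF: REF=A, genotype T/. ; call VCF: REF=ACCC, genotype ACCC/A.
        // The shared REF is ACCC, so truth alleles are padded with "CCC".
        List<String> truth = new ArrayList<>();
        for (String allele : Arrays.asList("T", NO_CALL)) {
            truth.add(extendAllele(allele, "CCC"));
        }
        List<String> call = Arrays.asList("ACCC", "A");

        System.out.println(truth);                            // [TCCC, .]
        System.out.println(siteAlleles("ACCC", truth, call)); // [ACCC, TCCC, A]
    }
}
```

Using an insertion-ordered set keeps the REF allele first, which is a choice made for this sketch only.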


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@coveralls

Coverage increased (+0.007%) to 73.986% when pulling 10282e8 on ms_gcWriteMultiRefAlleles into da60c2b on master.

@yfarjoun yfarjoun left a comment

nice catch @meganshand

a few comments

##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>

Contributor

for brevity, could you remove these GL*** contigs?

Contributor Author

done

##INFO=<ID=HapNoVar,Number=1,Type=Integer,Description="Number of datasets for which HaplotypeCaller called a variant within 35bp and did not call a variant at this location">
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">
##INFO=<ID=NoCG,Number=0,Type=Flag,Description="Present if no consensus reached, so looked at all datasets except Complete Genomics since it may have a different representation of complex variants">
##INFO=<ID=NoPLTot,Number=1,Type=Integer,Description="Number of datasets with likelihood ratio > 20 for a genotype different from the called genotype">

Contributor

also, could you remove all these unused header lines?

Contributor

ideally, one would be able to see both VCFs at the same time in the github diff...

Contributor Author

done

@@ -432,14 +432,11 @@ private void writeVcfTuple(final VcfTuple tuple, final VariantContextWriter writ
final List<Allele> truthAlleles = alleles.truthAlleles();
final List<Allele> callAlleles = alleles.callAlleles();

// Get the alleles present at this site for both samples to use for the output variant context.

Contributor

no need for extra newline

Contributor Author

done

siteAlleles.addAll(callContext.getAlleles());
}
siteAlleles.addAll(allAlleles);
siteAlleles.remove(Allele.NO_CALL);

Contributor

please add a symbolic allele to your test.

Contributor Author

It seems that GenotypeConcordance doesn't support symbolic alleles:

Note that only SNP and INDEL variants are considered, MNP, Symbolic, and Mixed classes of variants are not included.

That said, a symbolic allele does cause it to blow up if you are writing an output VCF. In this case what should the tool do? Skip that site or blow up with an appropriate error message?
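
The two options raised in this comment (skip the site, or fail with an appropriate error message) can be sketched in plain Java as follows. This is an illustration only: alleles are modeled as strings, looksSymbolic is a simplified stand-in for a real symbolic-allele check (it only recognizes angle-bracketed IDs and breakend brackets), and the Policy enum and shouldWrite method are hypothetical names, not part of Picard or htsjdk. It does not say which behavior the tool actually adopted.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not Picard's implementation.
class SymbolicAllelePolicySketch {
    enum Policy { SKIP_SITE, FAIL }

    // Simplified assumption: an allele is symbolic if it is an
    // angle-bracketed ID such as <DEL>, or contains breakend brackets.
    static boolean looksSymbolic(String allele) {
        return (allele.startsWith("<") && allele.endsWith(">"))
                || allele.contains("[") || allele.contains("]");
    }

    // Returns true if the site can be written to the output VCF,
    // false if it should be skipped; throws under the FAIL policy.
    static boolean shouldWrite(String site, List<String> alleles, Policy policy) {
        for (String allele : alleles) {
            if (looksSymbolic(allele)) {
                if (policy == Policy.FAIL) {
                    throw new IllegalArgumentException(
                            "Symbolic allele " + allele + " at " + site
                                    + " is not supported when writing an output VCF");
                }
                return false; // SKIP_SITE: silently leave the site out
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> alleles = Arrays.asList("A", "<DEL>");
        System.out.println(shouldWrite("1:100", alleles, Policy.SKIP_SITE));
        try {
            shouldWrite("1:100", alleles, Policy.FAIL);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Skipping keeps the run alive but drops the site without notice, while failing makes the unsupported input explicit; which is preferable depends on how the output VCF is consumed.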

genotypeConcordance.OUTPUT = new File(OUTPUT_DATA_PATH, "MultipleRefAlleles");
genotypeConcordance.OUTPUT_VCF = true;

Assert.assertEquals(genotypeConcordance.instanceMain(new String[0]), 0);

Contributor

can you also read in the metrics file and validate the results in it, rather than just checking that the program didn't blow up?

Contributor Author

done

@@ -327,11 +327,11 @@ private boolean indexExists(final File vcf) {
final String condition = truthVariantContextType + " " + callVariantContextType;
final Integer count = unClassifiedStatesMap.getOrDefault(condition, 0) + 1;
unClassifiedStatesMap.put(condition, count);
} else {
// write to the output VCF
writer.ifPresent(w -> writeVcfTuple(tuple, w, scheme));

Contributor Author

Note that if we prevent writing unless stateClassified is true, we will not write sites with no variation in either original VCF. This means that a site that is HOM REF in both the truth and call VCFs will not be output. (This is currently breaking other GenotypeConcordance tests.) Is the desired behavior to output no-variation sites?

@yfarjoun
Contributor

👍 Thanks for this @meganshand

@coveralls
coveralls commented Jun 1, 2017

Coverage remained the same at 74.077% when pulling dd4dc88 on ms_gcWriteMultiRefAlleles into 1c54e16 on master.

@meganshand meganshand merged commit d4a632a into master Jun 1, 2017
@meganshand meganshand deleted the ms_gcWriteMultiRefAlleles branch June 1, 2017 19:46