
Port of GATK3 Variant Annotator Tool #3803

Merged
merged 5 commits into master from je_portVariantAnnotator on Feb 26, 2018

Conversation

jamesemery
Collaborator

@jamesemery commented Nov 7, 2017

Resolves #2493
Resolves #51

@codecov-io

codecov-io commented Nov 9, 2017

Codecov Report

Merging #3803 into master will increase coverage by 0.047%.
The diff coverage is 83.333%.

@@               Coverage Diff               @@
##              master     #3803       +/-   ##
===============================================
+ Coverage     79.116%   79.163%   +0.047%     
- Complexity     16472     16604      +132     
===============================================
  Files           1047      1049        +2     
  Lines          59199     59615      +416     
  Branches        9676      9776      +100     
===============================================
+ Hits           46836     47193      +357     
- Misses          8600      8635       +35     
- Partials        3763      3787       +24
Impacted Files Coverage Δ Complexity Δ
...nnotator/allelespecific/AS_ReadPosRankSumTest.java 100% <ø> (ø) 7 <0> (ø) ⬇️
...ender/tools/walkers/annotator/AnnotationUtils.java 85.714% <ø> (ø) 3 <0> (ø) ⬇️
...lbender/utils/variant/GATKVariantContextUtils.java 83.52% <ø> (ø) 215 <0> (ø) ⬇️
...titute/hellbender/utils/logging/OneShotLogger.java 72.727% <0%> (-27.273%) 2 <0> (-1)
...ellbender/tools/walkers/annotator/QualByDepth.java 97.297% <100%> (+2.853%) 17 <0> (+1) ⬆️
...r/tools/walkers/annotator/ClippingRankSumTest.java 100% <100%> (ø) 4 <1> (+1) ⬆️
...ools/walkers/annotator/DepthPerAlleleBySample.java 97.727% <100%> (+1.894%) 21 <5> (+9) ⬆️
...walkers/haplotypecaller/HaplotypeCallerEngine.java 77.528% <100%> (ø) 63 <0> (ø) ⬇️
...hellbender/tools/walkers/mutect/Mutect2Engine.java 88.961% <100%> (ø) 45 <0> (ø) ⬇️
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 90.083% <100%> (ø) 47 <0> (ø) ⬇️
... and 36 more

@jamesemery force-pushed the je_portVariantAnnotator branch 2 times, most recently from b9805a2 to 962c95e on November 27, 2017 21:07
@jamesemery assigned jamesemery and droazen and unassigned jamesemery Dec 5, 2017
@droazen
Collaborator

droazen commented Dec 8, 2017

@ldgauthier Would you have time to do an initial review pass on this?

Contributor

@ldgauthier left a comment

I had a couple questions, especially about indels and pileups. I haven't done all the tests, but I'll do them now.

return counts;
}

private int[] annotateWithLiklihoods(VariantContext vc, Genotype g, Set<Allele> alleles, ReadLikelihoods<Allele> likelihoods) {
Contributor

Liklihoods -> Likelihoods

Collaborator Author

done

* @param read a read containing the variant
* @return number of non-hard clipped, aligned bases (excluding low quality bases at either end)
*/
//TODO: this is bizarre -- this code counts hard clips, but then subtracts them from the read length, which already doesn't count hard clips
Contributor

This would not be the only ReadPosRankSum-related bug... Can you file an issue?

Collaborator Author

Right, so these comments are mostly pulled in from GATK3. I needed to pull in all of these classes in order to support annotating from the pileup, which doesn't affect the calculations we currently do based on the likelihoods. My goal was to match GATK3, bugs and all. I will file a ticket, for after this branch, to evaluate what should be done with these classes.

return Collections.singletonMap(getKeyNames().get(0), (Object)annotationString);
}

public String makeFinalizedAnnotationString(final VariantContext vc, final Map<Allele, Number> perAlleleData) {
int numOfReads = getNumOfReads(vc);
public String makeFinalizedAnnotationString(final int numOfReads, final Map<Allele, Number> perAlleleData) {
Contributor

Thanks for the refactor.

}

// Use the pileup to stratify otherwise
} else if (!(this instanceof AS_RankSumTest)) {
Contributor

I don't love that we're calling instanceof for every variant. My instinct says this is going to slow things down -- have you noticed?

Collaborator Author

I have not run extensive timing, but I will overload apply() in AS_RankSumTest to make this cleaner.
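For readers skimming this thread, the shape of that refactor is roughly the following (an illustrative sketch with simplified names, not the real GATK signatures; the author mentions overloading apply(), and a protected hook the subclass overrides achieves the same effect): the pileup-fallback decision moves into the subclass, so the base class no longer needs an instanceof check.

    // Illustrative sketch only; names and signatures are simplified, not the actual GATK classes.
    abstract class RankSumTestSketch {
        /** Hook the subclass overrides instead of the base class checking instanceof. */
        protected boolean allowPileupFallback() { return true; }

        public final String annotate(final boolean hasLikelihoods) {
            if (!hasLikelihoods && allowPileupFallback()) {
                return "stratified-from-pileup";
            }
            return "stratified-from-likelihoods";
        }
    }

    final class AS_RankSumTestSketch extends RankSumTestSketch {
        @Override
        protected boolean allowPileupFallback() { return false; } // the AS_ variants never take the pileup path
    }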

for (final PileupElement p : likelihoods.getStratifiedPileups(vc).values().stream().flatMap(Collection::stream).collect(Collectors.toList())) {
final OptionalDouble value = getElementForPileupElement(p, refLoc);
if (value.isPresent() && value.getAsDouble() != INVALID_ELEMENT_FROM_READ && isUsableBase(p)) {
if (vc.getReference().equals(Allele.create(p.getBase(), true))) {
Contributor

I haven't worked much with PileupElements -- what happens for deletions? It seems like for a deletion vc.getReference() will return multiple bases, but the pileup element will return one base or the deletion 'D'.

Collaborator Author

It will return the first base before the deletion site, which is not ideal, I imagine, but this is a bug carried over from GATK3. It seems like an edge case: the only time this could make a difference is if you are at a site that is not marked as a deletion but you have a read with a deletion on it that then happens to match one of the two alleles of the SNP... so a minor effect. Should I open a new ticket for this?
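If the follow-up ticket ever tightens this up, one hedged option is to treat deletion elements explicitly instead of comparing the base that precedes the deletion. A minimal sketch, assuming the standard htsjdk Allele factory and a pileup element that reports whether it is a deletion; this deliberately is not the GATK3-matching behavior the PR preserves.

    import htsjdk.variant.variantcontext.Allele;

    final class PileupRefCheckSketch {
        // Sketch only: a deleted base can never support the site's reference allele,
        // so deletion elements are excluded before the base comparison.
        static boolean supportsReference(final Allele ref, final boolean elementIsDeletion, final byte elementBase) {
            if (elementIsDeletion) {
                return false;
            }
            return ref.equals(Allele.create(elementBase, true));
        }
    }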


import static org.testng.Assert.*;

public class AnnotationUtilsUnitTest {
Contributor

I can check these later if you'd like, but I figured I'd get comments on the rest back to you faster if I don't do it now.

@@ -95,5 +90,24 @@ public void testBlowUp(){
new DepthPerAlleleBySample().annotate(null, vc, gAC, gb, likelihoods);
}

@Test
public void testEmptyLiklihoodsFallback(){
Contributor

search and replace

Collaborator Author

Spelling? What is that?

@@ -216,11 +216,12 @@ public void testUsingLikelihoods_Raw() {
Assert.assertNotNull(annotatedMapRaw, vc.toString());
final String actualStringRaw = (String) annotatedMapRaw.get(key);
Contributor

If you're going to replace key below you might as well do it here too and get rid of key.

Collaborator Author

done

@@ -23,13 +23,6 @@
public static final String RAW_DELIM = ",";
public static final String REDUCED_DELIM = ",";

@Override
public Map<String, Object> annotate(final ReferenceContext ref,
Contributor

So if you remove this, it's going to default to the non-allele-specific version, right? So if you run VariantAnnotator requesting an AS_RankSum annotation, you'll get the finalized form of RankSum classic? I think that at least deserves a warning.

Collaborator Author

That's true, BUT if you want raw annotations you simply need to instantiate the VariantAnnotator in rawAnnotations mode (set the boolean flag on construction); then it will call annotateRawData() instead. Indeed, the reason for removing this method was to allow annotate() to be used in the event that VariantAnnotator/HaplotypeCaller wants to output a VCF with an AS_ annotation.
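For context, the dispatch being described works roughly like this (a hypothetical sketch; the flag, class, and interface names are illustrative, not the actual engine API):

    // Hypothetical sketch of the raw-vs-finalized dispatch described above.
    final class AnnotationDispatchSketch {
        interface ReducibleAnnotationSketch {
            String annotateRawData(); // raw form, e.g. what HaplotypeCaller writes into a GVCF
            String annotate();        // finalized form, which is why plain VariantAnnotator emits the classic key
        }

        private final boolean useRawAnnotations;

        AnnotationDispatchSketch(final boolean useRawAnnotations) {
            this.useRawAnnotations = useRawAnnotations;
        }

        String annotateSite(final ReducibleAnnotationSketch annotation) {
            return useRawAnnotations ? annotation.annotateRawData() : annotation.annotate();
        }
    }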

Contributor

For the record, I feel weird about this tool default behavior. If you ask for an annotation expecting key "AS_FS" from VariantAnnotator then A) it won't be there and B) there will be an unexpected "FS" key in the output INFO field.

##LoF=Loss-of-function annotation (HC = High Confidence; LC = Low Confidence)
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|BIOTYPE|CANONICAL|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SIFT|PolyPhen|EXON|INTRON|DOMAINS|HGVSc|HGVSp|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF_info|LoF_flags|LoF_filter|LoF">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 26608873 rs1134583 T A,C 8647.13 VQSRTrancheSNP99.90to100.00 AAChange=uc001bly.3:c.A1120T:p.S374C;AC=35,5;AF=0.269,0.038;AN=130;AVSIFT=0;BaseQRankSum=-2.407;DB;DP=2659;Dels=0.00;ExonicFunc=nonsynonymousSNV;FS=27.067;Func=exonic;Gene=UBXN11;HaplotypeScore=11.0781;InbreedingCoeff=-0.3011;LJB_GERP++=0.0;LJB_LRT=0.825324;LJB_LRT_Pred=NA;LJB_PhyloP=0.887806;LJB_PhyloP_Pred=N;LJB_PolyPhen2=0.515975;LJB_PolyPhen2_Pred=NA;LJB_SIFT=1.0;LJB_SIFT_Pred=D;LRT_MutationTaster=7.8E-5;LRT_MutationTaster_Pred=N;MLEAC=35,3;MLEAF=0.269,0.023;MQ=48.46;MQ0=1;MQRankSum=-11.748;NEGATIVE_TRAIN_SITE;QD=5.85;ReadPosRankSum=-6.714;SNPEFF_AMINO_ACID_CHANGE=S251G;SNPEFF_CODON_CHANGE=Agt/Ggt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_26608773_26609060;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=UBXN11;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENST00000374223;VQSLOD=-4.019e+00;culprit=MQRankSum;dbSNP135=rs1134583;CSQ=A|ENSG00000158062|ENST00000496466|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2262|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000496466.1:n.2262A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000496466|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2262|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000496466.1:n.2262A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374217|Transcript|missense_variant|1748|1381|461|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000363334|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.01)|unknown(0)|15/15||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374217.2:c.1381A>T|ENSP00000363334.2:p.Ser461Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374217|Transcript|missense_variant|1748|1381|461|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000363334|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.03)|unknown(0)|15/15||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374217.2:c.1381A>G|ENSP00000363334.2:p.Ser461Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374222|Transcript|missense_variant|1945|1480|494|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding|YES|CCDS41288.1|ENSP00000363339|UBX11_HUMAN||UPI00004700E1|deleterious(0.01)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374222.1:c.1480A>T|ENSP00000363339.1:p.Ser494Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374222|Transcript|missense_variant|1945|1480|494|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding|YES|CCDS41288.1|ENSP00000363339|UBX11_HUMAN||UPI00004700E1|deleterious(0.03)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374222.1:c.1480A>G|ENSP00000363339.1:p.Ser494Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000357089|Transcript|missense_variant|1457|1381|461|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000349601|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.01)|unknown(0)|14/14||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000357089.4:c.1381A>T|ENSP00000349601.4:p.Ser461Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000357089|Transcript|missense_variant|1457|1381|461|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000349601|UBX11_HUMAN|Q5T130_HUMAN|UPI0000
4700E0|deleterious(0.03)|unknown(0)|14/14||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000357089.4:c.1381A>G|ENSP00000349601.4:p.Ser461Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000314675|Transcript|missense_variant|1200|1120|374|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41287.1|ENSP00000324721|UBX11_HUMAN||UPI000013F8CA|deleterious(0.01)|unknown(0)|11/11||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000314675.7:c.1120A>T|ENSP00000324721.7:p.Ser374Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000314675|Transcript|missense_variant|1200|1120|374|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41287.1|ENSP00000324721|UBX11_HUMAN||UPI000013F8CA|deleterious(0.03)|unknown(0)|11/11||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000314675.7:c.1120A>G|ENSP00000324721.7:p.Ser374Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000475591|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2442|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||13/13|||ENST00000475591.1:n.2442A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000475591|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2442|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||13/13|||ENST00000475591.1:n.2442A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000494942|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2227|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000494942.1:n.2227A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000494942|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2227|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000494942.1:n.2227A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374221|Transcript|missense_variant|1694|1480|494|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41288.1|ENSP00000363338|UBX11_HUMAN||UPI00004700E1|deleterious(0.01)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374221.3:c.1480A>T|ENSP00000363338.3:p.Ser494Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374221|Transcript|missense_variant|1694|1480|494|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41288.1|ENSP00000363338|UBX11_HUMAN||UPI00004700E1|deleterious(0.03)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374221.3:c.1480A>G|ENSP00000363338.3:p.Ser494Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000472155|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2545|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||14/14|||ENST00000472155.1:n.2545A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000472155|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2545|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||14/14|||ENST00000472155.1:n.2545A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374223|Transcript|missense_variant|1212|751|251|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding|||ENSP00000363340|||UPI00004700DA|deleterious(0.03)|unknown(0)|12/12||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374223.1:c.751A>T|ENSP00000363340.1:p.Ser251Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374223|Transcript|missense_variant|1212|751|251|
S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding|||ENSP00000363340|||UPI00004700DA|tolerated(0.06)|unknown(0)|12/12||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374223.1:c.751A>G|ENSP00000363340.1:p.Ser251Gly|||||||||0&1|||||||||,A||ENSR00000533083|RegulatoryFeature|regulatory_region_variant||||||rs1134583&COSM1602122|1||||||regulatory_region||||||||||||||||||||||0&1|||||||||,C||ENSR00000533083|RegulatoryFeature|regulatory_region_variant||||||rs1134583&COSM1602122|2||||||regulatory_region||||||||||||||||||||||0&1|||||||||
Contributor

Good lord, this is huge. Is this funcotator output? Do you really need all those transcripts to test what you want to test?

Collaborator Author

It was a ported GATK4 test. I'm guessing it was some pathological Oncotator output that caused an error in GATK3.

Contributor

@ldgauthier left a comment

I looked at the tests. I feel weird about running b36 data with a b37 reference, but I would probably feel less weird if we just changed the reference in the header to b37. (There's no sequence dictionary so we can play fast and loose.) I'm also curious if all the annotations are covered with the reduced VariantAnnotatorIntegrationTest. The Standard annotations are definitely covered and maybe it's not worth extending the runtime of our integration test suite to cover things no one uses. I'll let @droazen weigh in.

import java.util.stream.Collectors;

public class VariantAnnotatorIntegrationTest extends CommandLineProgramTest {
final static String STANDARD_ANNOTATIONS = " -G StandardAnnotation "; //-G StandardUG ?
Contributor

It appears we don't have StandardUG anymore. Good. Those were useless.

Collaborator Author

erased

public class VariantAnnotatorIntegrationTest extends CommandLineProgramTest {
final static String STANDARD_ANNOTATIONS = " -G StandardAnnotation "; //-G StandardUG ?
private static final List<String> ATTRIBUTES_TO_IGNORE = Arrays.asList(
"QD",//TODO QD has a cap value and anything that reaches that is randomized. It's difficult to reproduce the same random numbers across gatk3 -> 4
Contributor

Ha, I read that as "crap value". Over the cap it's crap... I'm really not a stickler for matching values exactly when we know there's some wiggle room.

"FS");//TODO There's some bug in either gatk3 or gatk4 fisherstrand that's making them not agree still, I'm not sure which is correct
private static final List<String> HEADER_LINES_TO_IGNROE = Arrays.asList(
"reference",
"HaplotypeScore",
Contributor

This is, incidentally, one of those StandardUG annotations.

Collaborator Author

I eventually gave up scrubbing it from all the headers; should I make a pass and remove it everywhere?

assertVariantContextsMatch(getTestFile("HCOutput.NoAnnotations.vcf"), new File(getToolTestDataDir() + "expected/integrationTest.vcf"), Arrays.asList("-G", "Standard", "-G", "AS_Standard", "-L", "20:10000000-10100000", "-I", NA12878_20_21_WGS_bam), b37_reference_20_21);
}

//TODO these tests must be modernized :-(
Contributor

I love the names too. :-/

private static final List<String> ATTRIBUTES_TO_IGNORE = Arrays.asList(
"QD",//TODO QD has a cap value and anything that reaches that is randomized. It's difficult to reproduce the same random numbers across gatk3 -> 4
"FS");//TODO There's some bug in either gatk3 or gatk4 fisherstrand that's making them not agree still, I'm not sure which is correct
private static final List<String> HEADER_LINES_TO_IGNROE = Arrays.asList(
Contributor

IGNROE -> IGNORE

Collaborator Author

done


//TODO these tests must be modernized :-(
@Test
public void testHasAnnotsNotAsking1() throws IOException {
Contributor

I don't know why there were two of these in GATK3. I'm fine with dropping #2.

Collaborator Author

Basically, there was one run of tests comparing a standard single-sample VCF and another comparing the outputs for a complicated 20-30 sample VCF, but they were primarily just testing input combinations to check the expected behavior with respect to recomputing annotations and keeping irrelevant annotations. They were all on old references, so I couldn't port them. I have condensed all of this down into one set of tests on one multisample VCF.



@Test
public void testNoAnnotsAsking1() throws IOException {
Contributor

I especially think it's good that you dropped the second test of this style because the comments claimed that the input genotype annotations were weird. I hate to keep around test data that's not representative of our usual output unless it's testing a specific error mode.

Collaborator Author

👍

b37_reference_20_21);
}
@Test
public void testOverwritingHeader() throws IOException {
Contributor

Does this use the withUpdatedContigs method? No, it must not because otherwise the results wouldn't match. I'm not sure this is necessary since the noAnnotsAsking tests will make sure the header gets updated. I can't help but notice that vcfexample4.vcf is allegedly on b36 according to its header. b37 happens to have the same reference base at that position, so the output is still valid, but honestly I feel weird with a test that overwrites the VCF reference.

Collaborator Author

It is using withUpdatedContigs actually; everything is, I believe. The reason I used it is that it was the best way to match the header to GATK3, since that is what GATK3 was using.

Contributor

What is our sequence dictionary validation policy for VCFs? As things are now, there are a lot of opportunities for mayhem. For example, CalculateGenotypePosteriors doesn't need a reference and there's no way to check that the input VCF and the supporting VCF (if specified) are on the same reference. Ditto VariantRecalibrator. I believe those tools both check alleles for overlapping variants, but that doesn't mean it's 100% secure against input mismatches.

Assert.assertFalse(lineIteratorAnn.hasNext());
}

}
Contributor

There's nothing specific that jumps out at me as being missing. That being said, in GATK3, this was pretty much the only way we got test coverage on most of the annotations. In GATK4 there are a lot more unit tests, so maybe that's less of an issue. How hard would it be to run test coverage over the classes in the walkers.annotator package? I knew how to do it in IntelliJ once, but it's been a while.

Collaborator Author

I have run coverage; there are a few annotations that could use better coverage. Notable among them are StrandArtifact, RMSMappingQuality (though that seems to be a fluke because it does have tests), OxoGCounts, MappingQuality, FragmentLength, and DepthPerAlleleBySample. I have added a test for --use-all-annotations that asserts everything at least ran and is outputting results where I think it should, but it's by no means comprehensive (it just tests for crashes).

Collaborator Author

Of which I have found a few in our seldom-used annotations, namely StrandArtifact and ReferenceBases. These have been resolved now.

Contributor

OxoGCounts is probably going to be obsolete soon. Not sure if @davidbenjamin still cares about MappingQuality (median MQ per allele). DepthPerAlleleBySample is AD, so that should be tested in a lot of places -- maybe that's a fluke too. I think annotation coverage is good enough as is.

Contributor

MappingQuality is a default annotation of Mutect2 and remains useful to us.

@jamesemery
Collaborator Author

@ldgauthier @droazen Responded to first round of comments

@jamesemery
Collaborator Author

A warning: in the last commit I ended up changing the behavior of LikelihoodRankSumTest, which used to crash when you ran it without likelihoods; it now just doesn't emit the annotation, in keeping with the other annotations. There may be a reason to keep its previous behavior.

@droazen
Collaborator

droazen commented Jan 19, 2018

@jamesemery Can you rebase this branch onto latest master to resolve the conflicts? Recommend doing a local squash first given the number of commits here to make it less painful (by "local squash" I mean first rebase -i onto the hash of the first commit in the git log history that's not your own, and then rebase onto origin/master).

@jamesemery
Collaborator Author

@droazen Rebased and resolved the conflicts

@vdauwera
Contributor

Hi guys, is this good to go?

Contributor

@davidbenjamin left a comment

Stylistic stuff.

final Cigar c = read.getCigar();
CigarElement last = c.getCigarElement(c.numCigarElements() - 1);

int numEndClippedBases = 0;
Contributor

I would avoid the non-final primitive by:

final int terminalHardClipCount = last.getOperator() == CigarOperator.H ? last.getLength() : 0;
. . .
for loop to calculate terminalLowQualCount
return terminalHardClipCount + terminalLowQualCount;
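Spelled out, that suggestion might look like the sketch below (the helper name is illustrative, and the hard-coded quality threshold of 20 simply echoes the GATK3-matching TODOs elsewhere in this PR):

    import htsjdk.samtools.Cigar;
    import htsjdk.samtools.CigarElement;
    import htsjdk.samtools.CigarOperator;

    final class EndClipCountSketch {
        private static final int LOW_QUAL_THRESHOLD = 20; // hard-coded to mirror GATK3, per the TODOs in this PR

        // Hard clips from the final CIGAR element plus the run of trailing low-quality bases.
        static int numClippedBasesAtEnd(final Cigar cigar, final byte[] unclippedQuals) {
            final CigarElement last = cigar.getCigarElement(cigar.numCigarElements() - 1);
            final int terminalHardClipCount = last.getOperator() == CigarOperator.H ? last.getLength() : 0;

            int terminalLowQualCount = 0;
            for (int i = unclippedQuals.length - terminalHardClipCount - 1; i >= 0; i--) {
                if (unclippedQuals[i] < LOW_QUAL_THRESHOLD) {
                    terminalLowQualCount++;
                } else {
                    break;
                }
            }
            return terminalHardClipCount + terminalLowQualCount;
        }
    }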

Collaborator Author

Cleaned up slightly to be more readable.

final Cigar c = read.getCigar();
final CigarElement first = c.getCigarElement(0);

int numStartClippedBases = 0;
Contributor

See comment below

readPos = numAlignedBases - (initialReadPosition + 1);
}
return readPos;

Contributor

extra whitespace

if (initialReadPosition > numAlignedBases / 2) {
readPos = numAlignedBases - (initialReadPosition + 1);
}
return readPos;
Contributor

return Math.min(initialReadPosition, numAlignedBases - (initialReadPosition + 1)) or something like that ought to work.

Collaborator Author

This would work, but I'm trying to preserve the bugged behavior, which means this is subtly different from what was there before.

//TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails
for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) {
if (unclippedReadQuals[i] < 20) { //TODO the 20 hard value here is in lieu of PairHMM.BASE_QUALITY_SCORE_THRESHOLD in order to directly match GATK3 output

Contributor

extra whitespace

if (refContext.getBases().length ==0 || BaseUtils.simpleBaseToBaseIndex(refContext.getBase()) != -1 ) {
//TODO remove this filter and update the tests, we limit reads starting within a spanning deletion to match GATK3 at the expense of matching haplotype caller
List<GATKRead> reads = new ArrayList<>();
readsContext.iterator().forEachRemaining(r -> {if (r.getStart()<=vc.getStart()) {reads.add(r);}});
Contributor

The if statement could be replaced by a filter, so you could have:

final List<GATKRead> reads = Utils.stream(readsContext).filter(r -> r.getStart() <= vc.getStart()).collect(Collectors.toList());

Collaborator Author

true, changed


VariantContext annotatedVC = annotatorEngine.annotateContext(vc, fc, refContext, likelihoods, a -> true);
vcfWriter.add(annotatedVC);

Contributor

extra whitespace

for ( final VAExpression expression : expressions ) {
List<VariantContext> variantContexts = features.getValues(expression.binding, vc.getStart());

if (!variantContexts.isEmpty()) {
Contributor

There is too much nesting here, but you can reduce it with if (variantContexts.isEmpty()) { continue; }

} else if (expression.fieldName.equals("ALT")) {
attributes.put(expression.fullName, expressionVC.getAlternateAllele(0).getDisplayString());
} else if (expression.fieldName.equals("FILTER")) {
if (expressionVC.isFiltered()) {
Contributor

final String filterString = expressionVC.isFiltered() ? expressionVC.getFilters().toString().replace("[", "").replace("]", "").replace(" ", "") : "PASS";
attributes.put(expression.fullName, filterString);

Collaborator Author

Yeah, like many places in this PR, I should have cleaned these things up.

// check that ref and alt alleles are the same
List<Allele> exprAlleles = biallelicVC.getAlleles();
boolean isAlleleConcordant = false;
int i = 0;
Contributor

There has to be a better way. This variable can be incremented in two different places, and it is related to the for loop without being syntactically tied to it, and you only need it for the check i == 0.

Collaborator Author

It's actually used in several places below to pull the ith element out of expressionValuesList.

@davidbenjamin
Contributor

@jamesemery I am well aware that everything I complained about is old code for which you are guiltless.

Member

@lbergelson left a comment

@jamesemery A bunch of comments, some trivial, some major but maybe can be pushed back. I haven't gotten into the tests at all yet.

@@ -46,7 +46,7 @@
private ReadFilter wellFormedFilter = null;

// Command line parser requires a no-arg constructor
public WellformedReadFilter() {
public WellformedReadFilter() {
Member

unnecessary space here

final int numAlignedBases = getNumAlignedBases(read);

int readPos = initialReadPosition;
//TODO: this doesn't work for the middle-right position if we index from zero
Member

this todo worries me

Collaborator Author

this method worries me but it was necessary to mimic gatk3


/**
*
* @param read a read containing the variant
Member

comment is confusing, what's "the variant" in this context?

Collaborator Author

done

*/
public static int getNumClippedBasesAtEnd(final GATKRead read) {
// check for hard clips (never consider these bases):
final Cigar c = read.getCigar();
Member

read.getCigar() copies the cigar; it's more efficient to use read.getCigarElement()

Collaborator Author

done

// and may leave a string of Q2 bases still hanging off the reads.
//TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails
for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) {
if (unclippedReadQuals[i] < 20) { //TODO the 20 hard value here is in lieu of PairHMM.BASE_QUALITY_SCORE_THRESHOLD in order to directly match GATK3 output
Member

should this be updated?

Collaborator Author

It breaks tests because a handful of reads have an 18-20 quality score at the ends with this annotation.

}

/**
* Is this container expected to have the per-allele likelihoods calculations filled in.
Member

comment should reflect that this is always false for this class

* @return
*/
public Map<String, List<PileupElement>> getStratifiedPileups(final Locatable loc) {
if (stratifiedPileups != null) {
Member

This seems like an awkward duplication of the functionality from ReadsPileup.splitBySample(), but this is probably more efficient. Maybe we can unify / speed that up? Save for future maybe.

/**
* Is this container expected to have the per-allele likelihoods calculations filled in.
*/
public boolean hasFilledLikelihoods() {
Member

missing @Override here and elsewhere
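Combining this with the doc-comment note above, the override might read as follows (a sketch; the constant false return is grounded in the "always false for this class" remark):

    /**
     * Is this container expected to have the per-allele likelihoods calculations filled in?
     * Always false for this class, which is backed by pileups rather than filled likelihoods.
     */
    @Override
    public boolean hasFilledLikelihoods() {
        return false;
    }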

import java.util.Map;

//TODO comment this off
public class UnfilledReadsLikelihoods<A extends Allele> extends ReadLikelihoods<A> {
Member

I don't like this as a subclass, I feel like it's almost certainly better as a second implementation of an abstract class

Collaborator Author

I will do this in another pass, I have opened a ticket.

* or {@code reads} is {@code null},
* or if they contain null values.
*/
public UnfilledReadsLikelihoods(SampleList samples, AlleleList<A> alleles, Map<String, List<GATKRead>> reads) {
Member

I suspect there are a lot of other methods on this class that should be overridden to throw. I feel like this whole thing is kind of problematic and doesn't result in clean code elsewhere. I don't have a better design in my head, since somewhere we essentially need a switch that does totally different things with totally different input data. Maybe the right thing is to have 2 different methods in annotate that accept different data? We have to get this in soon though, so maybe we should revisit.
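A rough sketch of the abstract-class shape being discussed (hypothetical names; the real ReadLikelihoods surface is much larger than this):

    // Hypothetical sketch: a shared abstract parent with two concrete containers,
    // instead of UnfilledReadsLikelihoods subclassing ReadLikelihoods and overriding methods to throw.
    abstract class AlleleLikelihoodsContainerSketch {
        /** Whether per-allele likelihood values are available in this container. */
        abstract boolean hasFilledLikelihoods();
    }

    final class FilledLikelihoodsSketch extends AlleleLikelihoodsContainerSketch {
        @Override boolean hasFilledLikelihoods() { return true; }
        // ... matrix-backed likelihood accessors live only here ...
    }

    final class PileupBackedLikelihoodsSketch extends AlleleLikelihoodsContainerSketch {
        @Override boolean hasFilledLikelihoods() { return false; }
        // ... pileup accessors live only here; there are no likelihood methods to override-and-throw ...
    }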

…to the annotation engine and client annotations to support annotating variants independent of the haplotype caller.
Member

@lbergelson left a comment

a few changes requested of test files

);

//TODO because of differences between how GATK3 and GATK4 handle capturing reads for spanning deletions (Namely 3 only looks for reads overlapping the first site, 4 gets all reads over the span)
//TODO then we want ot ignore affected attributes for concordance tests
Member

ot -> to

Member

also, this is unused

import java.util.stream.Collectors;

public class VariantAnnotatorIntegrationTest extends CommandLineProgramTest {
final static String STANDARD_ANNOTATIONS = " -G StandardAnnotation ";
Member

this is unused


final String testFile = getToolTestDataDir() + "vcfexample2.vcf";

public static String baseTestString() {
Member

so is this method

final VCFHeader expectedHeader = getHeaderFromFile(expected);
final VCFHeader inputHeader = getHeaderFromFile(input);

Iterator<VCFHeaderLine> inputItr = inputHeader.getMetaDataInSortedOrder().iterator();
Member

I think this could be simplified to something like:

        final Iterator<VCFHeaderLine> itr = expectedHeader.getMetaDataInSortedOrder().stream().filter(line -> !shouldFilterLine(line, linesToIgnore)).iterator();
        final Iterator<VCFHeaderLine> itr2 = inputHeader.getMetaDataInSortedOrder().stream().filter(line -> !shouldFilterLine(line, linesToIgnore)).iterator();
        Assert.assertEquals(itr, itr2);
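Here shouldFilterLine would be a small helper along these lines (a sketch; the name comes from the snippet above, and matching on the header line key is an assumption):

    import htsjdk.variant.vcf.VCFHeaderLine;
    import java.util.List;

    final class HeaderFilterSketch {
        // Drops header lines whose key (e.g. "reference", "HaplotypeScore") is in the ignore list
        // before the two headers are compared.
        static boolean shouldFilterLine(final VCFHeaderLine line, final List<String> linesToIgnore) {
            return linesToIgnore.contains(line.getKey());
        }
    }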

return header;
}

private static <T> void assertForEachElementInLists(final List<T> actual, final List<T> expected, final BiConsumer<T, T> assertion) {
Member

this shows up in 3 places in the code now, is there somewhere common it could live?

Collaborator Author

There is an issue to extract a common interface for this; it's just nontrivial enough that I haven't done it yet in any of the branches. So there it languishes...
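For reference, the duplicated helper is small enough that extracting it should be mechanical; a sketch of the shared version:

    import java.util.List;
    import java.util.function.BiConsumer;
    import org.testng.Assert;

    final class AssertionUtilsSketch {
        // Pairwise assertion over two lists that must have the same length,
        // as it might live in a shared test utility class.
        static <T> void assertForEachElementInLists(final List<T> actual, final List<T> expected, final BiConsumer<T, T> assertion) {
            Assert.assertEquals(actual.size(), expected.size(), "Lists differ in size");
            for (int i = 0; i < actual.size(); i++) {
                assertion.accept(actual.get(i), expected.get(i));
            }
        }
    }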

return VCs;
}

private static VCFHeader getHeaderFromFile(final File vcfFile) throws IOException {
Member

        try(VCFFileReader reader = new VCFFileReader(vcfFile)){
            return reader.getFileHeader();
        }

is cleaner, or

        try(FeatureDataSource<VariantContext> reader = new FeatureDataSource<>(vcfFile)){
            return (VCFHeader)reader.getHeader();
        }

if you want to be more GATK-idiomatic but more verbose

//==================================================================================================================

@Test
public void GATK3LargeConcoranceTest() throws IOException {
Member

typo, not concordant with Concordance

final File expected = new File(getToolTestDataDir() + "expected/testHasAnnotsAsking1.vcf");
final VCFHeader header = getHeaderFromFile(expected);
runVariantAnnotatorAndAssertSomething(getTestFile("vcfexamplemultisample.vcf"), new File(getToolTestDataDir() + "expected/testHasAnnotsAsking1.vcf"), Arrays.asList("-G", "Standard", "-I", largeFileTestDir + "CEUTrio.multisample.b37.1M-1M50k.bam"),
(a, e) -> {
Member

factor out this method since it's repeated 3 times

Member

also, IntelliJ suggested that instead of !anyMatch we might prefer noneMatch
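For instance (a generic illustration, not the exact assertion in this test):

    import java.util.Arrays;
    import java.util.List;

    final class NoneMatchSketch {
        static boolean hasNoEmptyValues(final List<String> values) {
            // reads more directly than !values.stream().anyMatch(String::isEmpty)
            return values.stream().noneMatch(String::isEmpty);
        }

        public static void main(final String[] args) {
            System.out.println(hasNoEmptyValues(Arrays.asList("QD", "FS"))); // true
        }
    }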

Member

actually, this might be a good candidate for a DataProvider, since there are a bunch of tests that call the same code but use different data
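A TestNG DataProvider would collapse these into one parameterized test, roughly like the sketch below (the row values are made up for illustration; the real provider would carry each existing test's input VCF, expected VCF, and extra arguments):

    import org.testng.annotations.DataProvider;
    import org.testng.annotations.Test;

    public class VariantAnnotatorConcordanceSketchTest {
        // Each row: input VCF, expected VCF, extra command-line arguments.
        @DataProvider(name = "concordanceCases")
        public Object[][] concordanceCases() {
            return new Object[][]{
                    {"vcfexample2.vcf", "expected/integrationTest.vcf", "-G Standard"},
                    {"vcfexamplemultisample.vcf", "expected/testHasAnnotsAsking1.vcf", "-G Standard"},
            };
        }

        @Test(dataProvider = "concordanceCases")
        public void testConcordance(final String input, final String expected, final String extraArgs) {
            // ... build the argument list and delegate to the shared assertion helper ...
        }
    }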

final LineIterator lineIterator = codec.makeSourceFromStream(new PositionalBufferedStream(s));
codec.readHeader(lineIterator);

final VCFCodec codecAnn = new VCFCodec();
Member

Use FeatureDataSource or VCFFileReader here instead of raw codecs and input streams I think. Also, be sure to put closeables in try()

//
//
// @Test(enabled = true)
// public void testChromosomeCountsPed() {
Member

these are waiting for the next PR?

@@ -41,4 +42,10 @@ public static OptionalDouble getReadBaseQuality(final GATKRead read, final int r
Utils.nonNull(read);
return OptionalDouble.of(read.getBaseQuality(ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, refLoc, ReadUtils.ClippingTail.RIGHT_TAIL)));
}

@Override
// When we have a pileupe element we only need its underlying read in order to com
Member

double typo! plileupE in order to com... pute?

this.binding = binding.get();
}

public void sethInfo(VCFInfoHeaderLine hInfo) {
Member

ah, my good friend Seth Info

@@ -46,7 +46,8 @@

private final String printFormat = "%.2f";

private static final OneShotLogger logger = new OneShotLogger(AS_RMSMappingQuality.class);
private static final OneShotLogger allele_logger = new OneShotLogger(AS_RMSMappingQuality.class);
Member

non-standard, maybe alleleNotFoundLogger? Same with below

* A container object for storing the objects necessary for carrying over expression annotations.
* It holds onto the source feature input as well as any relevant header lines in order to alter the vcfHeader.
*/
public static class VAExpression {
Member

can this be private now?

@lbergelson dismissed davidbenjamin’s stale review February 26, 2018 20:59

updates were made, this needs to go in

@lbergelson
Member

Most things were addressed here. A bunch of follow-on tickets were created to address the more complicated refactorings that we don't have time to hit now.

@lbergelson merged commit 8d929ed into master Feb 26, 2018
@lbergelson deleted the je_portVariantAnnotator branch February 26, 2018 23:39