TP considered FN #168

Closed
valeandri opened this issue Sep 15, 2023 · 5 comments

@valeandri

Hello,

I am benchmarking a VCF against the GIAB SV dataset and found some variants that are reported as FNs but look like TPs to me.

I am running truvari v4.1.0 with the following parameters:

    "extend": 0,
    "debug": false,
    "refdist": 500,
    "pctseq": 0.0,
    "minhaplen": 50,
    "pctsize": 0.0,
    "pctovl": 0.0,
    "typeignore": false,
    "chunksize": 1000,
    "bSample": "HG002",
    "cSample": "HG002_WGS_LC",
    "dup_to_ins": false,
    "sizemin": 1000,
    "sizefilt": 950,
    "sizemax": 5000,
    "passonly": false,
    "no_ref": false,
    "pick": "single",
    "check_monref": true,
    "check_multi": true

Here are two variants extracted from the base and comp vcf:

BASE
chr12 71962728 HG3_PB_SVrefine2PBcRDovetail_6275 TTTGTTTTGTTTTGAGATAGAGTATCTGTCATCCAAGCTGAGGTGCAGTGGCACGATCTCAGCTCACTGCAACCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCTCTATTAGCTGGAATTACAGGCACACGCCACCATGCCTGGTTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTTGACCCCCTGACCTCAGGTGATCCACCCGCCTCGGCCTCCCAGAGTGCCGGGATTACAGATGTGAGCCACCGTGCCTGGCCTGAACACCTCTGCATTCCAGGCTCCACCCTGGGTAGCAGGGATACAAAAATGAACACAGCAACAAAAATCTCTGCCTTCAAAAACTGTATATCCTAATAGGGAGACACATTTAGCAGAGTAATATTTTGAATATTTTACTTTTTGCTTTTTTGCCCCGGGCATTACTGATAATATTTTATCATAAATTTATTGAGTGTCCTTTATGGTCTGAATCCTGAAGTCTAATTTTATCAAACGCAGGACTCCCAGACTTTTCTTATGAATTCTTGGAAGTCAGTAAATAAAAATTACATTTGCTTTACGGTTAAACCTTAATTAGAATTCATCATCAACTCCATTTACTGTGTGTATATATATAATATATATATAGCTATATATATTATATATGTACTAATACTATGTATTAGTGTGTGTATATACATACATATATATATATATGTATGTATATACACACATATATGTACTGCCCAAGAGGCCACATGTTTTCGTTTTTAAAAAATCTGCTTAGTGGCTGGATACTTTGGCTCACACCTGTAATCCCAGCTCTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGGAGTTCAAGACCAGCCTGGCCAATATGGTGAAACCCTGTCTCTACTAAAAATACAAAAATTAGCCAGGTGTGGTGGCACGCGCCTATAGTCCCAGCTACTCAGGAGGCTGAGGCAGAAGAATTGCTTGAACCTGGGAGGCAGAAGTTGCAGTGAGCTGAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGTCAGATTCTGTCTCAAAAAAAAAAAAAAAAAAAATCTGCTTAGCAACTTCCATTTGACATAGGTGGTATGACACACAGGTTGAAAACAAGACCAATCGAATTAAGTGACTTGGGTTTTTAAAAAATTATTCTTCTCTTTACAGAAGTCTCAGCATATTTGAAATGCTGAGACTTCTTGTTGAAACCCAGCAAGCCTGAAAAGCAAAAAGGAGAAACCATTTCAGAGAAATTAAAGCATCAACAACTGGTGTTTATGCAGGAGTCACAAAGTACCCAAACAGCCACATATATCTCAGTTTCATTCAAAACCTGAGAAAGCAAATCAATGGAAGATAATGGAATGGGATTTAAAAAAATGTATATATATATGCACATACATACTGATAATATTCTGGGCATATATATACACTTATATATATTAGTATATATATGCTTATTATATATTATATATTAGTATATATAGCATTCTGGCAATATTCAGAAAAAATATATATATAG T 20 PASS 
ClusterIDs=HG2_10X_SVrefine210Xhap12_9300:HG4_Ill_Krunchall_18429:HG3_10X_allpass_2602:HG4_PB_assemblyticsfalcon_5728:HG3_PB_assemblyticsfalcon_5805:HG2_PB_assemblyticsfalcon_5810:HG2_PB_assemblyticsPBcR_5615:HG2_10X_allpass_2147:HG3_PB_pbsv_12983:HG2_PB_pbsv_13300:HG4_Ill_MetaSV_1137:HG4_Ill_GATKHCSBGrefine_9179:HG4_10X_allpass_2517:HG3_Ill_MetaSV_1167:HG3_Ill_GATKHCSBGrefine_8779:HG2_Ill_MetaSV_1065:HG2_Ill_GATKHCSBGrefine_8904:HG4_PB_pbsv_13306:HG4_PB_SVrefine2Falcon1Dovetail_8104:HG4_Ill_SVrefine2DISCOVARDovetail_9481:HG3_PB_SVrefine2PBcRDovetail_6275:HG3_PB_SVrefine2Falcon1Dovetail_8030:HG3_Ill_SVrefine2DISCOVARDovetail_9606:HG2_PB_SVrefine2PBcRplusDovetail_2413:HG2_PB_SVrefine2PB10Xhap12_9602:HG2_PB_SVrefine2Falcon2Bionano_6574:HG2_PB_SVrefine2Falcon1plusDovetail_2543:HG2_Ill_SVrefine2DISCOVARplusDovetail_2826;NumClusterSVs=28;ExactMatchIDs=HG2_10X_SVrefine210Xhap12_9300:HG4_PB_SVrefine2Falcon1Dovetail_8104:HG4_Ill_SVrefine2DISCOVARDovetail_9481:HG3_PB_SVrefine2PBcRDovetail_6275:HG3_PB_SVrefine2Falcon1Dovetail_8030:HG3_Ill_SVrefine2DISCOVARDovetail_9606:HG2_PB_SVrefine2PBcRplusDovetail_2413:HG2_PB_SVrefine2PB10Xhap12_9602:HG2_PB_SVrefine2Falcon2Bionano_6574:HG2_PB_SVrefine2Falcon1plusDovetail_2543:HG2_Ill_SVrefine2DISCOVARplusDovetail_2826;NumExactMatchSVs=11;ClusterMaxShiftDist=0.234021;ClusterMaxSizeDiff=0.234021;ClusterMaxEditDist=0.234021;PBcalls=14;Illcalls=10;TenXcalls=4;CGcalls=0;PBexactcalls=7;Illexactcalls=3;TenXexactcalls=1;CGexactcalls=0;HG2count=12;HG3count=8;HG4count=8;NumTechs=3;NumTechsExact=3;SVLEN=-1564;DistBack=6222;DistForward=17722;DistMin=6222;DistMinlt1000=FALSE;MultiTech=TRUE;MultiTechExact=TRUE;SVTYPE=DEL;sizecat=gt1000;DistPASSHG2gt49Minlt1000=FALSE;DistPASSMinlt1000=FALSE;MendelianError=FALSE;HG003_GT=1/1;HG004_GT=1/1;TRall=FALSE;TRgt100=FALSE;TRgt10k=FALSE;segdup=FALSE;REPTYPE=SIMPLEDEL;BREAKSIMLENGTH=174;REFWIDENED=12:72356509-72358246;PctSeqSimilarity=0;PctSizeSimilarity=1;PctRecOverlap=0.9565;SizeDiff=0;StartDistance=-68;EndDistance=-68;GTMatch=0;TruScore=65;MatchId=94.0.0 GT:GTcons1:PB_GT:PB_REF:PB_ALT:PBHP_GT:PB_REF_HP1:PB_ALT_HP1:PB_REF_HP2:PB_ALT_HP2:TenX_GT:TenX_REF_HP1:TenX_ALT_HP1:TenX_REF_HP2:TenX_ALT_HP2:ILL250bp_GT:ILL250bp_REF:ILL250bp_ALT:ILLMP_GT:ILLMP_REF:ILLMP_ALT:BNG_LEN_DEL:BNG_LEN_INS:nabsys_svm 1/1:./.:1/1:1:48:1/1:1:31:0:11:./.:6:12:12:11:0/1:13:24:0/1:10:85:1581:.:.

![Screenshot from 2023-09-15 14-55-44](https://github.com/ACEnglish/truvari/assets/63858464/31269b02-c6bb-4dc8-8041-167438a3c330)

COMP
chr12 71962796 DRAGEN:LOSS:chr12:71962797-71964360 N <DEL> 150 PASS SVLEN=-1564;SVTYPE=CNV;END=71964360;REFLEN=1564;OrigCnvPos=71962186;OrigCnvEnd=71964742;SVCLAIM=DJ;MatchSv=DRAGEN:DEL:183383:0:1:0:0:0 GT:SM:CN:BC:GC:CT:AC:PE 1/1:0.263257:1:2:0.375978:0.507825:0.483959:39,42

I also checked them in IGV and they look very similar.

Am I missing something?

Thanks,
Valentina

@ACEnglish
Owner

I would need to see more context, but one thing that could be happening is that this was run with --pick single. The --pick parameter controls how many matches a call is allowed to participate in (details).

Are there other calls in this region which were marked as true positives? If so, consider using --pick multi.

Otherwise I'm not sure. The base vcf entry you've provided shows the call was annotated:

PctSizeSimilarity=1;PctRecOverlap=0.9565;SizeDiff=0;StartDistance=-68;EndDistance=-68;GTMatch=0;TruScore=65;MatchId=94.0.0

So you could look through the tp-comp/fp VCFs for MatchId=94.0.* to see what it was compared to. (MatchId details)

Also, if you turn on --debug, the logging is pretty verbose about exactly why any match decision is made.
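As a rough sketch of that lookup, the snippet below scans VCF text lines and reports records whose MatchId starts with a given prefix. It uses only the standard library; the helper names (`parse_info`, `find_match_group`) and the second example record are hypothetical, made up for illustration, and this is not part of truvari.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def find_match_group(vcf_lines, prefix):
    """Yield (chrom, pos, id, MatchId) for records whose MatchId starts with prefix."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        match_id = parse_info(fields[7]).get("MatchId")
        if isinstance(match_id, str) and match_id.startswith(prefix):
            yield fields[0], int(fields[1]), fields[2], match_id


# The first record is abbreviated from this thread; the second is invented.
example = [
    "##fileformat=VCFv4.2\n",
    "chr12\t71962796\tDRAGEN:LOSS:chr12:71962797-71964360\tN\t<DEL>\t150\tPASS\t"
    "SVLEN=-1564;SVTYPE=CNV;END=71964360;MatchId=94.0.0\n",
    "chr12\t80000000\thypothetical_call\tN\t<DEL>\t150\tPASS\t"
    "SVLEN=-2000;SVTYPE=DEL;END=80002000;MatchId=95.0.0\n",
]
hits = list(find_match_group(example, "94.0."))
for hit in hits:
    print(hit)
```

For real output files, the same function could be fed from `gzip.open(path, "rt")` for each of the tp-comp and fp VCFs.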

If you're still stuck and can send me the input base/comp VCFs for the region, say chr12: 71959796-71965796, I can help investigate.

@valeandri
Author

Thank you @ACEnglish! I'll start with your suggestions and let you know the outcome.

@valeandri
Author

Here is the matched variant in fp.vcf.gz, which is actually the one I was expecting to match:

chr12 71962796 DRAGEN:LOSS:chr12:71962797-71964360 N <DEL> 150 PASS SVLEN=-1564;SVTYPE=CNV;END=71964360;REFLEN=1564;OrigCnvPos=71962186;OrigCnvEnd=71964742;SVCLAIM=DJ;MatchSv=DRAGEN:DEL:183383:0:1:0:0:0;PctSeqSimilarity=0;PctSizeSimilarity=1;PctRecOverlap=0.9565;SizeDiff=0;StartDistance=-68;EndDistance=-68;GTMatch=0;TruScore=65;MatchId=94.0.0 GT:SM:CN:BC:GC:CT:AC:PE 1/1:0.263257:1:2:0.375978:0.507825:0.483959:39,42

They are also annotated with a high overlap score (PctRecOverlap=0.9565), so I do not see a reason to exclude them. Moreover, there is no other variant in the region, so I presume --pick multi won't help.
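The reported annotations can be reproduced by hand from the two records. The sketch below is a generic reciprocal-overlap calculation, not truvari's own code. It assumes the base deletion spans POS to POS + |SVLEN| (the REF allele was not counted base by base) and that the comp call ends at its END value.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Overlapping length divided by the longer of the two spans."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a_end - a_start, b_end - b_start)


def size_similarity(a_size, b_size):
    """Smaller size divided by larger size; 1.0 means identical sizes."""
    largest = max(a_size, b_size)
    return 1.0 if largest == 0 else min(a_size, b_size) / largest


# Base record: POS=71962728, SVLEN=-1564 (end is an assumption, see above).
base_start = 71962728
base_end = base_start + 1564
# Comp record: POS=71962796, END=71964360.
comp_start = 71962796
comp_end = 71964360

start_distance = base_start - comp_start
end_distance = base_end - comp_end
pct_overlap = reciprocal_overlap(base_start, base_end, comp_start, comp_end)
pct_size = size_similarity(base_end - base_start, comp_end - comp_start)

print(start_distance, end_distance, round(pct_overlap, 4), pct_size)
```

Under these assumptions the two calls are the same size, shifted by 68 bp, which is consistent with StartDistance=-68, EndDistance=-68, PctSizeSimilarity=1 and PctRecOverlap=0.9565 in the annotation.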

@ACEnglish
Owner

Ah! I see what's happening now. There's another 'threshold' that I hadn't accounted for: typeignore. It is off (false) by default, so variant types are compared. These variants don't have matching types, so they're not passing the thresholds.

Now, you and I can see that they do have matching types (DEL), but if you look at how truvari determines svtype you'll also see that the SVTYPE=CNV in the comp vcf doesn't match the SVTYPE=DEL in the base.

I know that vcf v4.4 had a lot of changes recently and in it SVTYPE was deprecated in favor of symbolic alts. This was not my favorite move because it's obviously a breaking change.

Regardless, the quickest way to get these variants to match is for you to run bcftools annotate -x INFO/SVTYPE on the comp vcf.
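To see why dropping the tag helps, here is a simplified illustration of an INFO-first type lookup. It is a sketch under the assumption that the type comes from INFO/SVTYPE when present and otherwise from the symbolic ALT. The helper names are hypothetical and this is not truvari's actual implementation.

```python
import re


def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def drop_info_key(info_field, key):
    """Return the INFO string without the given key (mimics one effect of the bcftools command)."""
    kept = [item for item in info_field.split(";") if item.partition("=")[0] != key]
    return ";".join(kept) if kept else "."


def infer_svtype(info_field, alt):
    """Assumed lookup order: INFO/SVTYPE first, then a symbolic ALT such as <DEL>."""
    info = parse_info(info_field)
    if "SVTYPE" in info:
        return info["SVTYPE"]
    symbolic = re.fullmatch(r"<([^:>]+)(?::[^>]*)?>", alt)
    return symbolic.group(1) if symbolic else "UNK"


base_type = "DEL"  # SVTYPE=DEL in the base record
comp_info = "SVLEN=-1564;SVTYPE=CNV;END=71964360;REFLEN=1564"  # abbreviated comp INFO
comp_alt = "<DEL>"

before = infer_svtype(comp_info, comp_alt)
stripped_info = drop_info_key(comp_info, "SVTYPE")
after = infer_svtype(stripped_info, comp_alt)

print(before, before == base_type)  # type taken from INFO/SVTYPE
print(after, after == base_type)    # type taken from the symbolic ALT
```

With the INFO tag present the comp call reads as CNV and fails the type check against DEL. Once the tag is removed, the symbolic ALT supplies DEL and the types agree.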

@valeandri
Author

Ok, great, thanks for the help!!
Now it makes sense!

Have a nice day,
Valentina
