Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of symbolic variants without ref sequence #165

Closed
kcleal opened this issue Aug 20, 2023 · 1 comment
Closed

Handling of symbolic variants without ref sequence #165

kcleal opened this issue Aug 20, 2023 · 1 comment

Comments

@kcleal
Copy link

kcleal commented Aug 20, 2023

Hi @ACEnglish,

Im finding a number of variants called by dysgu end up being labelled as false-positives, but look to me like they should be true-positives. If I change to -p 0 they are labelled as true, which is a bit confusing as they are symbolic SVs without the reference sequence filled out. For example, using the HG002 GIAB dataset, the reference SV is:

1       36733251        HG4_PB_SVrefine2Falcon1Dovetail_252     GGAGGCCAGAGCAGATTCTTTTGTTATTTGTTAACTGATACTAGGGCTTGAAAGGAGGTAAGTCAAGAAGAGGTAGGATGGGC
TCCCTGCTGGATAGACGGGAGCCCATATTCCCCGCACCCACTCACAGAATACGGGGCATCCGGGGGCTTTCGGAGACTGAGTTCTGTTAGAAAAGGTCTGTTTTACAACTGGGAATCCAAAACATCCAAAGTGAATAGCAGTACTTT
TAGAAAGGAATTACCCTA        AAAATTAAAAAACTAACTC     20      PASS    ClusterIDs=HG4_PB_HySA_522:HG2_PB_HySA_661:HG2_Ill_Spiral_11:HG4_PB_SVrefine2PBcR
Dovetail_220:HG4_PB_assemblyticsfalcon_459:HG4_PB_SVrefine2Falcon1Dovetail_252:HG4_Ill_SVrefine2DISCOVARDovetail_220:HG3_PB_assemblyticsfalcon_468:
HG3_PB_SVrefine2PBcRDovetail_231:HG3_PB_SVrefine2Falcon1Dovetail_251:HG3_Ill_SVrefine2DISCOVARDovetail_241:HG2_PB_assemblyticsfalcon_482:HG2_PB_SVr
efine2PBcRplusDovetail_261:HG2_PB_SVrefine2PB10Xhap12_325:HG2_PB_SVrefine2Falcon1plusDovetail_266:HG2_10X_SVrefine210Xhap12_281:HG2_PB_PB10Xdip_215
3:HG2_PB_PB10Xdip_2152:HG4_Ill_Spiral_24:HG3_Ill_Spiral_265:HG3_Ill_MetaSV_79:HG4_10X_allpass_29:HG2_PB_assemblyticsPBcR_471:HG3_PB_HySA_494:HG4_PB
_pbsv_387:HG2_PB_pbsv_376:HG3_PB_pbsv_374:HG4_Ill_Krunchall_350;NumClusterSVs=28;ExactMatchIDs=HG2_10X_SVrefine210Xhap12_281:HG2_PB_SVrefine2Falcon
1plusDovetail_266:HG2_PB_SVrefine2PB10Xhap12_325:HG2_PB_SVrefine2PBcRplusDovetail_261:HG2_PB_assemblyticsfalcon_482:HG3_Ill_SVrefine2DISCOVARDoveta
il_241:HG3_PB_SVrefine2Falcon1Dovetail_251:HG3_PB_SVrefine2PBcRDovetail_231:HG3_PB_assemblyticsfalcon_468:HG4_Ill_SVrefine2DISCOVARDovetail_220:HG4
_PB_SVrefine2Falcon1Dovetail_252:HG4_PB_assemblyticsfalcon_459;NumExactMatchSVs=12;ClusterMaxShiftDist=0.310606;ClusterMaxSizeDiff=0.310606;Cluster
MaxEditDist=0.310606;PBcalls=19;Illcalls=7;TenXcalls=2;CGcalls=0;PBexactcalls=9;Illexactcalls=2;TenXexactcalls=1;CGexactcalls=0;HG2count=11;HG3coun
t=8;HG4count=9;NumTechs=3;NumTechsExact=3;SVLEN=-229;DistBack=-173;DistForward=-248;DistMin=-248;DistMinlt1000=TRUE;MultiTech=TRUE;MultiTechExact=T
RUE;SVTYPE=DEL;sizecat=100to299;DistPASSHG2gt49Minlt1000=FALSE;DistPASSMinlt1000=FALSE;MendelianError=FALSE;HG003_GT=1/1;HG004_GT=0/1;TRall=FALSE;T
Rgt100=FALSE;TRgt10k=FALSE;segdup=FALSE;REPTYPE=SUBSDEL;BREAKSIMLENGTH=0;REFWIDENED=1:36733251-36733498;PctSeqSimilarity=0.008;PctSizeSimilarity=0.
9825;PctRecOverlap=0.9113;SizeDiff=4;StartDistance=0;EndDistance=22;GTMatch=0;TruScore=63;MatchId=188.0.0  GT:GTcons1:PB_GT:PB_REF:PB_ALT:PBHP_GT:P
B_REF_HP1:PB_ALT_HP1:PB_REF_HP2:PB_ALT_HP2:TenX_GT:TenX_REF_HP1:TenX_ALT_HP1:TenX_REF_HP2:TenX_ALT_HP2:ILL250bp_GT:ILL250bp_REF:ILL250bp_ALT:ILLMP_
GT:ILLMP_REF:ILLMP_ALT:BNG_LEN_DEL:BNG_LEN_INS:nabsys_svm    0/1:0/1:0/1:18:23:0|1:17:3:0:22:0|1:53:0:0:31:0/1:52:18:./.:.:.:.:.:.

The dysgu call is:

1       36733251        29421   A       <DEL>   .       PASS    SVMETHOD=DYSGUv1.6.0;SVTYPE=DEL;END=36733476;CHR2=1;GRP=29421;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=225;CONTIGA=CTGGTATTACAGGCGTGAGCCACCGCGCCCTGCCTATAATAAGAATCTTAAATAATTTCTTCTGCAATTAATTTCAGAGTCACAGTGTATGTGTATGTGTGGGTGAAGGTGATTCTCCTTTAAAAGCAAAGTTTCTTTACCACCTAATTCCATTTTTTAGGACATGGCTGAAGTAGTAAAAAAGGAATGTTCCCCTCTATTCGTGTATAATTTTTAAGTTTTTTTGTCAGCAATACAGCTGTTCCGAATCATCATTTTACTGATGAAAAAATAGAGGTAAAAACACATGCACCAAATAAGAGTTTGTGGTTTATTCAGTAGGAGCCTTAGTTTTGGTAAATTTCTCTCATTTAGGGACCCTGAACACTTGTTTGGCAGCGGTTATGTTTTTCAGCTTCTTACCTACTAGTGTAGTGAGATCAGTTCCACCCAATTCCAGGGGTATTGATACTTTGTGGGGAGAAGAAAGAGGGAAGAAAGCAATTAGTAATAGTCAAACAAAAATTAAAAAACTAACTCGGCTGGGCGCGGTGGCTCATACCTGTAATCCCAGCACTTTCAGAGGCCGAGGCAGGTGGATCACCTGAGGCCAGGAGTTTGAGGCCAGCCTGGCCAATATGGTGAAACCCCATCTCTACTGACAATACAAGAATTAGTTGGGTGTGGTAGCACGCAGCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCAAGAGGTGAAGATTGTAGTGAGCTGAGATCATGTGCCACTGTACTC;KIND=extra-regional;GC=42.49;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=32;OL=0;SU=42;WR=21;PE=0;SR=0;SC=0;BND=0;LPREC=1;RT=pacbio   GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   0/1:198:60.0:42:21:0:0:0:0:52.11:1:8:13:0:0:0:0.615:0.571:0.929:0.951

This is labelled as FP. If I collect the reference sequence between POS and END, and put it in the ref column of the vcf, the variant is labelled as true-positive. Also there are no nearby calls from dysgu, or other reference SVs.

Here is another example, refernce SV is:

1       106969624       HG2_PB_SVrefine2Falcon1plusDovetail_495 TGTATATGTGTGTATATATATATACATATATATATACATATATATATATCTATATATATGTATATATACGTATATATACATAT
ATATGTATATATGTATATATACGTATATATGTATATATACGTATATATGTATATATACGTATATATACGTATCTATATCTAT   GAC     20      PASS    ClusterIDs=HG3_PB_pbsv_756:HG4_Ill_GAT
KHCSBGrefine_490:HG3_PB_HySA_1095:HG2_Ill_GATKHCSBGrefine_448:HG2_PB_pbsv_736:HG2_Ill_SpiralSDKrefine_637:HG4_PB_SVrefine2Falcon1Dovetail_467:HG4_I
ll_manta_170:HG4_Ill_SVrefine2DISCOVARDovetail_518:HG3_PB_SVrefine2PBcRDovetail_432:HG3_PB_SVrefine2Falcon1Dovetail_427:HG2_PB_SVrefine2PBcRplusDov
etail_492:HG2_PB_SVrefine2PB10Xhap12_613:HG2_PB_SVrefine2Falcon1plusDovetail_495:HG2_Ill_manta_188:HG2_Ill_SVrefine2DISCOVARplusDovetail_582:HG2_10
X_SVrefine210Xhap12_572:HG4_PB_SVrefine2PBcRDovetail_404:HG2_PB_PB10Xdip_69:HG2_PB_HySA_1396:HG4_Ill_svaba_838:HG4_Ill_scalpel_620:HG4_10X_allpass_
106:HG3_Ill_svaba_874:HG3_Ill_scalpel_718:HG2_Ill_svaba_840:HG2_Ill_scalpel_653:HG2_Ill_250bpfermikitraw_905:HG2_Ill_150bpfermikitraw_812:HG2_10X_a
llpass_105:HG3_Ill_Krunchall_931:HG2_Ill_Krunchall_900:HG2_Ill_Krunchall_899:HG4_PB_pbsv_742:HG4_Ill_MetaSV_205:HG3_Ill_MetaSV_184:HG2_Ill_MetaSV_2
01;NumClusterSVs=37;ExactMatchIDs=HG2_10X_SVrefine210Xhap12_572:HG2_Ill_SVrefine2DISCOVARplusDovetail_582:HG2_Ill_manta_188:HG2_PB_SVrefine2Falcon1
plusDovetail_495:HG2_PB_SVrefine2PB10Xhap12_613:HG2_PB_SVrefine2PBcRplusDovetail_492:HG3_PB_SVrefine2Falcon1Dovetail_427:HG3_PB_SVrefine2PBcRDoveta
il_432:HG4_Ill_SVrefine2DISCOVARDovetail_518:HG4_Ill_manta_170:HG4_PB_SVrefine2Falcon1Dovetail_467;NumExactMatchSVs=11;ClusterMaxShiftDist=0.165017
;ClusterMaxSizeDiff=0.165017;ClusterMaxEditDist=0.20598;PBcalls=13;Illcalls=21;TenXcalls=3;CGcalls=0;PBexactcalls=6;Illexactcalls=4;TenXexactcalls=
1;CGexactcalls=0;HG2count=19;HG3count=8;HG4count=10;NumTechs=3;NumTechsExact=3;SVLEN=-162;DistBack=-4014;DistForward=-165;DistMin=-4014;DistMinlt10
00=TRUE;MultiTech=TRUE;MultiTechExact=TRUE;SVTYPE=DEL;sizecat=100to299;DistPASSHG2gt49Minlt1000=FALSE;DistPASSMinlt1000=FALSE;MendelianError=FALSE;
HG003_GT=0/1;HG004_GT=0/1;TRall=TRUE;TRgt100=TRUE;TRgt10k=FALSE;segdup=FALSE;REPTYPE=SUBSDEL;BREAKSIMLENGTH=0;REFWIDENED=1:106969624-106969788;PctS
eqSimilarity=0.012;PctSizeSimilarity=1;PctRecOverlap=0.9879;SizeDiff=0;StartDistance=0;EndDistance=2;GTMatch=0;TruScore=66;MatchId=404.0.0 GT:GTcon
s1:PB_GT:PB_REF:PB_ALT:PBHP_GT:PB_REF_HP1:PB_ALT_HP1:PB_REF_HP2:PB_ALT_HP2:TenX_GT:TenX_REF_HP1:TenX_ALT_HP1:TenX_REF_HP2:TenX_ALT_HP2:ILL250bp_GT:
ILL250bp_REF:ILL250bp_ALT:ILLMP_GT:ILLMP_REF:ILLMP_ALT:BNG_LEN_DEL:BNG_LEN_INS:nabsys_svm    1/1:1/1:1/1:2:43:1/1:6:25:0:13:1/1:0:21:0:18:1/1:0:32:
./.:.:.:.:.:.

Dysgu call was:

1	106969624	67156	A	<DEL>	.	PASS	SVMETHOD=DYSGUv1.6.0;SVTYPE=DEL;END=106969786;CHR2=1;GRP=67156;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=162;CONTIGA=AAGCTGGAAACCATCATTCTCAGCAAACTATCGCAAGGACAAAAAACCAAACACCGCATGTTATCACTCATAAGTGGGAATTGAATAATGAGAACACTTGGACACAGGAAGGGAACATCACACACCGGGGCCTGTCATGGGGTGGGGGGAGTGGAGAGGGATAGCATTAGGAGAAATACCGAATGTAAATGACAAGTTAATCAGTGCAGCACACCAACATGACGCGTGTATACATATGTAACAAACCTGCACGTTGTGCACATGTACCCTAGAACTTAAAGTATAATAATAAAAAAATATATAAAATTAGCTGGGCATGGTGGCACATGTCTGTAATCCCAGCCACTCAGGAGGCTGAGGCAAGAGAATCGCTTGAACCCAGGAGGTGAAGGTTGTAGAGAGCTGAGATCACGCCATTGCACTCCAGATATATAGATAGATAGATAGATATACGATATCTATATATATATGGATATATACATACGTATCTATATATATAGACATATATATAGATACGTATCTATATATAGATATGTATAGATATGAGGGGAGCTATATTCTACAAAGCCACAGGTGCAGAGCTGCCCAAGGCTGTAGGAGCCCACCTCTTGCATTGGCATGACTTGGATGTGAGACATGGAGTCAAAAGAGATCATTTTGGAACTTCCAAGTTTAATGACTGCCCTGTTGAATTTTGGACGTTCATGGGGCCTGTAGCCCCTTTGTTTTGGCCAATTTCTCCCATTTGGAATGCGTATATTTACCCAATGGCTGTACCCCCATTGTATCAGGGAAGTAATTAACTTGCTTTTGATTTTACAAGCTCAGAGGCAGAAGA;KIND=extra-regional;GC=41.53;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=126;OL=0;SU=82;WR=41;PE=0;SR=0;SC=0;BND=0;LPREC=1;RT=pacbio	GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB	1/1:51:60.0:82:41:0:0:0:0:50.17:5:26:15:0:0:0:0.0:0.0:0.759:0.938

This is also labelled as FP.

Most other symbolic SVs are labelled as expected though. Do you know why these examples are not labelled as true? Is it anything to do with the reference SV having both REF and ALT seqs listed? Thanks!

@ACEnglish
Copy link
Owner

Hello,

Another user had a similar issue recently (#163). Truvari 4.1 (#164) has a change that will now warn users about using --pctsim != 0 and non-sequence-resolved calls (code). That post also has a script to help users fill in deletions, but I think you've already figured out that part. Needing to ignore sequence or altering the input VCFs isn't an ideal solution, but as callers (particularly those using long reads) are getting better at fully resolving SVs, and Truvari moves towards super accurate comparison (details) , and the VCF format specs are beginning to change the use of symbolic alts (details), the 'wild west' of SV representations seems to be shrinking.

Have a great day,
~/Adam English

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants