Duplication to Insertion doubt #188

poddarharsh15 · 2024-01-25T14:33:00Z

Hi, I have a question about benchmarking short-read SV calls with truvari. While benchmarking, I've observed that several pipelines (Manta, Delly, Smoove, Dysgu, nf-core/sarek) generate duplication SVs. However, the truth set I am using (GIABv0.6_HG002) does not include duplications. Is there a recommended way to convert these duplications to insertions for more accurate benchmarking? Thanks in advance.

ACEnglish · 2024-01-25T15:10:58Z

Hello. One possibility is to use truvari bench --dup-to-ins which will consider variants with SVTYPE==DUP as SVTYPE==INS. However, as duplications are not typically sequence resolved (i.e. ALT==<DUP> instead of a sequence), you'll need to turn off sequence similarity with truvari bench --pctseq 0. Alternatively, for some projects I've had success 'filling in' DUP sequences. For example, a DUP from chr1:1234-2234 presumably could be represented as an INS at chr1:1234-1235 with ALT equal to the reference sequence from chr1:1234-2234.
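The 'filling in' idea above can be sketched in Python. This is a minimal illustration rather than code from truvari: the function name dup_to_ins and the dict-based reference are hypothetical. It also assumes the usual VCF convention that POS of a symbolic allele is the anchor base immediately before the event, so the duplicated sequence spans POS+1 through END; adjust the slicing if your caller reports coordinates differently.

```python
def dup_to_ins(chrom, pos, end, reference):
    """Turn a symbolic DUP (1-based POS and END) into a sequence-resolved INS.

    `reference` is a hypothetical mapping of chromosome name -> sequence string.
    Assumes POS is the anchor base before the event (VCF convention), so the
    duplicated bases are POS+1..END inclusive.
    """
    seq = reference[chrom]
    if not (1 <= pos < end <= len(seq)):
        raise ValueError(f"invalid DUP interval {chrom}:{pos}-{end}")
    anchor = seq[pos - 1]      # 1-based POS -> 0-based index
    dup_seq = seq[pos:end]     # bases POS+1..END in 1-based coordinates
    return {
        "CHROM": chrom,
        "POS": pos,
        "REF": anchor,
        "ALT": anchor + dup_seq,  # anchor base followed by the duplicated copy
        "INFO": {"SVTYPE": "INS", "SVLEN": len(dup_seq)},
    }


# Toy reference so the example runs without any external files.
reference = {"chr1": "ACGTACGTTTGACCA"}
record = dup_to_ins("chr1", 4, 8, reference)
print(record["POS"], record["REF"], record["ALT"], record["INFO"])
```

With a real genome, the dict lookup would likely be replaced by an indexed FASTA reader (for example pysam's FastaFile.fetch, which takes 0-based half-open coordinates). Because the resulting records carry a sequence in ALT, sequence similarity would not need to be disabled for them.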

poddarharsh15 · 2024-01-25T15:54:22Z

Thank you for your previous response. I did try the suggested parameters in my pilot runs. However, I have a hypothetical question. If I extract only the duplications (ALT==<DUP>) from an SV caller's .vcf file and benchmark them against the GIABv0.6_HG002 truth set for insertions (INS), or alternatively, if I merge the INS and DUP .vcf files together, would either approach be considered correct?

ACEnglish · 2024-01-25T16:05:37Z

'Correct' in this context is subjective. I typically don't separate/subset types of variants, for many reasons, but some researchers may be interested in only DELs.

poddarharsh15 · 2024-01-26T08:35:49Z

I'm relatively new to variant analysis studies and will be working on a project soon. In preparation, I'm considering the approach to benchmarking. I've noticed in some papers that researchers benchmark variants separately (e.g., deletions and insertions only) and others benchmark all variants together.
Considering your experience, would you suggest separating variants to achieve a more detailed Precision-Recall Curve (PRC) or testing all variants together for a comprehensive analysis? Your insights would greatly assist me in planning the benchmark strategy for my upcoming project. Many thanks.

poddarharsh15 changed the title ~~Duplation to Insertion doubt~~ Duplication to Insertion doubt Jan 25, 2024

ACEnglish closed this as completed Jan 25, 2024
