Duplication to Insertion doubt #188

Closed
poddarharsh15 opened this issue Jan 25, 2024 · 4 comments
Comments

@poddarharsh15

Hi, I have a specific question about benchmarking short-read SV calls with truvari. While benchmarking, I've observed that several pipelines (Manta, Delly, Smoove, Dysgu, nf-core/sarek) report duplication SVs. However, the truth set I am using (GIABv0.6_HG002) does not include duplications. Is there a recommended way to convert these duplications to insertions for more accurate benchmarking? Thanks in advance.

@poddarharsh15 poddarharsh15 changed the title Duplation to Insertion doubt Duplication to Insertion doubt Jan 25, 2024
@ACEnglish
Owner

Hello. One possibility is to use truvari bench --dup-to-ins, which treats variants with SVTYPE==DUP as SVTYPE==INS. However, since duplications are typically not sequence resolved (i.e. ALT==<DUP> instead of a sequence), you'll need to turn off sequence similarity with truvari bench --pctseq 0. Alternatively, for some projects I've had success 'filling in' DUP sequences. For example, a DUP spanning chr1:1234-2234 could presumably be represented as an INS at chr1:1234-1235 with ALT equal to the reference sequence from chr1:1234-2234.
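The 'filling in' idea could be sketched roughly as below. This is only an illustration, not truvari's own code: the function names (`parse_info`, `format_info`, `dup_to_ins`) are hypothetical, the reference is passed in as a plain dict of sequences, and the coordinate handling assumes the usual VCF convention for symbolic alleles, where POS is the base before the event and the duplicated span is POS+1..END.

```python
from __future__ import annotations


def parse_info(info: str) -> dict[str, str | None]:
    """Split a VCF INFO string into key/value pairs (flags map to None)."""
    parsed: dict[str, str | None] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else None
    return parsed


def format_info(info: dict[str, str | None]) -> str:
    """Rebuild an INFO string, keeping the original key order."""
    return ";".join(k if v is None else f"{k}={v}" for k, v in info.items())


def dup_to_ins(vcf_line: str, reference: dict[str, str]) -> str:
    """Rewrite one symbolic <DUP> record as a sequence-resolved INS.

    Assumption: POS is the anchor base and the duplicated span is
    POS+1..END (1-based). Records that are not DUPs are returned unchanged.
    """
    line = vcf_line.rstrip("\n")
    fields = line.split("\t")
    info = parse_info(fields[7])
    end_value = info.get("END")
    if info.get("SVTYPE") != "DUP" or end_value is None:
        return line

    chrom, pos, end = fields[0], int(fields[1]), int(end_value)
    seq = reference[chrom]
    anchor = seq[pos - 1]   # 1-based POS -> 0-based index
    dup_seq = seq[pos:end]  # bases POS+1..END

    fields[3] = anchor            # REF is the single anchor base
    fields[4] = anchor + dup_seq  # ALT is the anchor plus the duplicated copy
    info.update({"SVTYPE": "INS", "END": str(pos), "SVLEN": str(len(dup_seq))})
    fields[7] = format_info(info)
    return "\t".join(fields)
```

In practice the reference would come from an indexed FASTA rather than a dict, and callers differ in how they report DUP coordinates, so the anchor and span logic should be checked against each caller's output before trusting the converted records.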

@poddarharsh15
Author

Thank you for your previous response. I did try the suggested parameters in my pilot runs. However, I have a hypothetical question. If I extract only the duplications (ALT==<DUP>) from an SV caller's .vcf file and benchmark them against the insertions (INS) in the GIABv0.6_HG002 truth set, or alternatively, if I merge the INS and DUP .vcf files together, would either approach be considered correct?

@ACEnglish
Owner

"Correct" in this context is subjective. I typically don't separate or subset variants by type, for many reasons. But some researchers may be interested only in DELs.

@poddarharsh15
Author

I'm relatively new to variant analysis studies and will be working on a project soon. In preparation, I'm considering the approach to benchmarking. I've noticed in some papers that researchers benchmark variants separately (e.g., deletions and insertions only) and others benchmark all variants together.
Considering your experience, would you suggest separating variants to achieve a more detailed Precision-Recall Curve (PRC) or testing all variants together for a comprehensive analysis? Your insights would greatly assist me in planning the benchmark strategy for my upcoming project. Many thanks.
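The two strategies do not have to be exclusive: if match counts are kept per SV type, both the per-type and the pooled metrics can be derived from the same run. The sketch below illustrates this; it is not a recommendation from the maintainers. The function names are hypothetical, and it simplifies by using a single true-positive count per type, whereas a benchmarking tool may count baseline and comparison true positives separately.

```python
from __future__ import annotations


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall and F1 from match counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def stratified_and_pooled(
    counts: dict[str, dict[str, int]],
) -> dict[str, tuple[float, float, float]]:
    """Report metrics for each SV type plus an 'ALL' row over the summed counts."""
    report = {
        svtype: precision_recall_f1(c["tp"], c["fp"], c["fn"])
        for svtype, c in counts.items()
    }
    pooled = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    report["ALL"] = precision_recall_f1(pooled["tp"], pooled["fp"], pooled["fn"])
    return report


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    example = {
        "DEL": {"tp": 8, "fp": 2, "fn": 2},
        "INS": {"tp": 2, "fp": 2, "fn": 6},
    }
    for svtype, (p, r, f) in stratified_and_pooled(example).items():
        print(f"{svtype}\tprecision={p:.3f}\trecall={r:.3f}\tf1={f:.3f}")
```

Note that the pooled row is computed from summed counts, not by averaging the per-type rows, so a type with many calls dominates it; that is one reason a pooled figure alone can hide weak performance on a rarer type.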
