
Question: Does truvari have an upper limit on the file size? How to speed up? #190

Closed
jyw-atgithub opened this issue Jan 31, 2024 · 2 comments

Comments


jyw-atgithub commented Jan 31, 2024

Dear @ACEnglish ,
I am processing a VCF file containing 62 individuals and 187691 entries. The genome is Drosophila melanogaster. The run time is unexpectedly long.

At the beginning, I tried the following settings of truvari collapse to merge SVs across individuals, but it has been running for longer than 2 days (the program is still running; first job in the screenshot):

truvari collapse --sizemax 200000000 -k common \
-i merge.asm-2.sort.vcf.gz \
-c truvari.asm-2.collapse.vcf.gz -f ${ref_genome} |\
bcftools sort --max-mem 4G |\
bgzip -@ 30 > truvari.asm-2.vcf.gz

Then, I split the merging process with the --bed option and executed truvari in parallel. It went well on chromosomes 4, X, and Y, but it still takes very long on the major chromosomes 2L, 2R, 3L, and 3R (the program is still running; 2nd to 5th jobs in the screenshot).
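For reference, the per-chromosome splitting looked roughly like this (a sketch, not my exact script; the BED file names and output paths are placeholders):

```shell
# Run truvari collapse per chromosome in parallel (sketch; paths are placeholders).
# Each regions.${chrom}.bed file covers a single chromosome.
for chrom in 2L 2R 3L 3R 4 X Y; do
    truvari collapse --sizemax 200000000 -k common \
        --bed regions.${chrom}.bed \
        -i merge.asm-2.sort.vcf.gz \
        -c truvari.asm-2.${chrom}.collapse.vcf.gz \
        -o truvari.asm-2.${chrom}.vcf.gz \
        -f ${ref_genome} &
done
wait  # block until all background jobs finish
```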


I suspected that the sequence comparison takes most of the computation, so I tried to run truvari collapse in two steps: first with --pctseq 0 and otherwise stricter thresholds, and then letting truvari collapse work on the smaller intermediate VCF file.

The first part took less than 4 hours (sorry, I did not use the time command) and yielded 24018 entries.

truvari collapse --sizemax 200000000 -k first \
-i merge.asm-2.sort.vcf.gz \
-c truvari.asm-2.noseq-2.collapse.vcf.gz -f ${ref_genome} \
--pctseq 0 --refdist 100 --minhaplen 30 --pctsize 0.98 
2024-01-28 23:04:11,897 [INFO] Zipped 187691 variants Counter({'base': 187691})
2024-01-28 23:04:11,897 [INFO] 7 chunks of 187691 variants Counter({'base': 187010, '__filtered': 681})
2024-01-28 23:04:13,915 [INFO] Wrote 91485 Variants
2024-01-28 23:04:13,916 [INFO] 96206 variants collapsed into 24018 variants
2024-01-28 23:04:13,916 [INFO] 124510 samples' FORMAT fields consolidated

However, the second step also takes a long time and is still running (last job in the screenshot):

truvari collapse --sizemax 200000000 -k first \
-i truvari.asm-2.noseq-2.vcf.gz \
-c truvari.asm-2.collapse.vcf.gz -f ${ref_genome} 


@ACEnglish
Owner

Using -f (--reference) is the problem here, because it requires fetching the reference sequence for every variant. The --reference parameter has been kept for backwards compatibility but is no longer recommended. The default 'unroll' sequence comparison technique (details) is faster and also more accurate (see supplementary figure 7).

Try without -f and --minhaplen, and it should run about as quickly as it did with --pctseq 0.
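Concretely, applying that suggestion to the second step above would look something like this (a sketch; file names are carried over from the commands earlier in the thread):

```shell
# Second collapse step without -f/--minhaplen: the default 'unroll'
# comparison needs no reference fasta, so variant-by-variant fetches are avoided.
truvari collapse --sizemax 200000000 -k first \
    -i truvari.asm-2.noseq-2.vcf.gz \
    -c truvari.asm-2.collapse.vcf.gz \
    -o truvari.asm-2.vcf.gz
```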

@ACEnglish
Owner

ACEnglish commented Jan 31, 2024

Also, I see that you're using --keep common, which requires checking variants' genotypes. pysam is pretty slow at accessing genotypes. I just committed a change to develop that reduces how often they need to be accessed. I'm working on a ~50 sample VCF right now, and this change is ~2x-5x faster with identical results. So if you'd like to install from the develop branch of the repo, that should help, too. There's also a change to how --gt is used which helps, but since you're not using that parameter, you won't see that speedup.
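Installing from the develop branch could look like this (a sketch; assumes git and pip are available, and installing into a virtual environment is advisable):

```shell
# Install truvari from the develop branch instead of a released version.
git clone https://github.com/ACEnglish/truvari.git
cd truvari
git checkout develop
pip install .
```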
