
Question: Does truvari have an upper limit on the file size? How to speed up? #190

Closed
jyw-atgithub opened this issue Jan 31, 2024 · 2 comments

Comments


jyw-atgithub commented Jan 31, 2024

Dear @ACEnglish ,
I am processing a VCF file containing 62 individuals and 187691 entries. The genome is Drosophila melanogaster. The run time is unexpectedly long.

At the beginning, I tried the following settings of truvari collapse to merge SVs across individuals, but it has been running for longer than 2 days (the program is still running; first job in the screenshot):

truvari collapse --sizemax 200000000 -k common \
-i merge.asm-2.sort.vcf.gz \
-c truvari.asm-2.collapse.vcf.gz -f ${ref_genome} |\
bcftools sort --max-mem 4G |\
bgzip -@ 30 > truvari.asm-2.vcf.gz

Then, I split the merging process with the --bed option and executed truvari in parallel. It went well on chromosomes 4, X, and Y, but it still takes very long on the major chromosomes 2L, 2R, 3L, and 3R (the program is still running; 2nd to 5th jobs in the screenshot).
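For reference, the per-chromosome splitting looked roughly like this (a sketch, not my exact script; the BED file names and output paths are placeholders):

```shell
# Run truvari collapse per chromosome in parallel (sketch; paths are placeholders).
# Each regions.${chrom}.bed file covers a single chromosome.
for chrom in 2L 2R 3L 3R 4 X Y; do
    truvari collapse --sizemax 200000000 -k common \
        --bed regions.${chrom}.bed \
        -i merge.asm-2.sort.vcf.gz \
        -c truvari.asm-2.${chrom}.collapse.vcf.gz \
        -o truvari.asm-2.${chrom}.vcf.gz \
        -f ${ref_genome} &
done
wait  # block until all background jobs finish
```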


I suspected that the sequence comparison takes most of the computation, so I tried to run truvari collapse in two steps: first with --pctseq 0 and otherwise stricter thresholds, and then letting truvari collapse work on the smaller intermediate VCF file.

The first part took less than 4 hours (sorry, I did not use the time command) and yielded 24018 entries.

truvari collapse --sizemax 200000000 -k first \
-i merge.asm-2.sort.vcf.gz \
-c truvari.asm-2.noseq-2.collapse.vcf.gz -f ${ref_genome} \
--pctseq 0 --refdist 100 --minhaplen 30 --pctsize 0.98 
2024-01-28 23:04:11,897 [INFO] Zipped 187691 variants Counter({'base': 187691})
2024-01-28 23:04:11,897 [INFO] 7 chunks of 187691 variants Counter({'base': 187010, '__filtered': 681})
2024-01-28 23:04:13,915 [INFO] Wrote 91485 Variants
2024-01-28 23:04:13,916 [INFO] 96206 variants collapsed into 24018 variants
2024-01-28 23:04:13,916 [INFO] 124510 samples' FORMAT fields consolidated

However, the second step also takes a long time and is still running (last job in the screenshot):

truvari collapse --sizemax 200000000 -k first \
-i truvari.asm-2.noseq-2.vcf.gz \
-c truvari.asm-2.collapse.vcf.gz -f ${ref_genome} 


@ACEnglish
Owner

Using -f (--reference) is the problem here, because it requires fetching the reference sequence for every variant. The --reference parameter has been kept for backwards compatibility but is no longer recommended. The default 'unroll' sequence comparison technique (details) is faster and also more accurate (see supplementary figure 7).

Try without -f and --minhaplen, and it should run about as quickly as it did with --pctseq 0.
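Concretely, applying that suggestion to the second step above would look something like this (a sketch; file names are carried over from the commands earlier in the thread):

```shell
# Second collapse step without -f/--minhaplen: the default 'unroll'
# comparison needs no reference fasta, so variant-by-variant fetches are avoided.
truvari collapse --sizemax 200000000 -k first \
    -i truvari.asm-2.noseq-2.vcf.gz \
    -c truvari.asm-2.collapse.vcf.gz \
    -o truvari.asm-2.vcf.gz
```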

@ACEnglish
Owner

ACEnglish commented Jan 31, 2024

Also, I see that you're using --keep common, which requires checking variants' genotypes. pysam is pretty slow at accessing genotypes. I just committed a change to develop that reduces how often they need to be accessed. I'm working on a ~50 sample VCF right now, and this change is ~2x-5x faster with identical results. So if you'd like to install from the develop branch of the repo, that should help, too. There's also a change to how --gt is used which helps, but since you're not using that parameter, you won't see that speedup.
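Installing from the develop branch could look like this (a sketch; assumes git and pip are available, and installing into a virtual environment is advisable):

```shell
# Install truvari from the develop branch instead of a released version.
git clone https://github.com/ACEnglish/truvari.git
cd truvari
git checkout develop
pip install .
```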
