Dear @ACEnglish ,
I am processing a VCF file containing 62 individuals and 187691 entries. The genome is Drosophila melanogaster. The run time is unexpectedly long.
At the beginning, I tried the following settings of truvari collapse to merge SVs across individuals. But it takes longer than 2 days (the program is still running; first job in the screenshot)
Then, I split the merging process with the --bed option and ran truvari in parallel. It went well on chromosomes 4, X and Y, but it still takes very long on the major chromosome arms 2L, 2R, 3L and 3R (the program is still running; 2nd to 5th job in the screenshot).
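For readers who want to reproduce the per-chromosome split, here is a minimal sketch, not the exact script used in this thread. It assumes the reference has a samtools faidx index (.fai), whose first two columns are the contig name and length. The function name `make_chrom_beds` and the file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: write one whole-contig BED file per sequence listed in a .fai index.
make_chrom_beds() {
    local fai="$1" outdir="$2"
    mkdir -p "$outdir"
    # BED is 0-based and half-open, so a whole contig is "name<TAB>0<TAB>length".
    awk -v dir="$outdir" 'BEGIN { OFS = "\t" }
        { out = dir "/" $1 ".bed"; print $1, 0, $2 > out; close(out) }' "$fai"
}

# Hypothetical usage (placeholder file names; one background job per contig):
#   make_chrom_beds genome.fa.fai beds
#   for bed in beds/*.bed; do
#       chrom=$(basename "$bed" .bed)
#       truvari collapse -i merge.asm-2.sort.vcf.gz --bed "$bed" \
#           -c "collapsed.$chrom.vcf" -o "merged.$chrom.vcf" &
#   done
#   wait
```

Running one job per contig only helps when the contigs are of similar size; as noted above, the large arms still dominate the wall time.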
I suspected that sequence comparison accounts for most of the computation, so I tried to run truvari collapse in two steps: first with --pctseq 0 and otherwise stricter thresholds, then a second truvari collapse on the smaller intermediate VCF file.
The first step took less than 4 hours (sorry, I did not use the time command) and yielded 24018 entries.
Using -f (--reference) is the problem here because it needs to fetch reference sequence for every variant. The --reference parameter has been kept for backwards compatibility, but is no longer recommended. The default 'unroll' sequence comparison technique (details) is faster and also more accurate (see supplementary figure 7).
Try without -f and --minhaplen, and it should run similarly to --pctseq 0.
Also, I see that you're using --keep common which requires checking variants' genotypes. pysam is pretty slow at accessing genotypes. I just committed a change to develop that reduces how often they need to be accessed. I'm working on a ~50 sample VCF right now and this change is ~2x-5x faster with identical results. So if you'd like to install from develop of the repo, that should help, too. There's also a change to how --gt is used which helps, but since you're not using that parameter, you won't see the speedup.
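As a rough illustration of the suggestion to drop -f and --minhaplen, the sketch below takes the first-step command from the original post and removes those two options. It also leaves out --pctseq 0, on the assumption that the default sequence comparison should stay enabled. The collapsed-output file name is a placeholder, and the command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: the posted first-step command minus -f/--reference and --minhaplen.
# --pctseq is left at its default so sequence comparison stays enabled (an assumption).
cmd=(
    truvari collapse --sizemax 200000000 -k first
    -i merge.asm-2.sort.vcf.gz
    -c truvari.collapsed.vcf
    --refdist 100 --pctsize 0.98
)

# Dry run: show the command that would be executed.
printf '%q ' "${cmd[@]}"; echo

# To actually run it, redirect the merged VCF from stdout, for example:
#   "${cmd[@]}" > truvari.merged.vcf
```

Prefixing the real run with `time` would make the comparison against the earlier runs easier.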
The command for the first step was:

    truvari collapse --sizemax 200000000 -k first \
        -i merge.asm-2.sort.vcf.gz \
        -c truvari.asm-2.noseq-2.collapse.vcf.gz -f ${ref_genome} \
        --pctseq 0 --refdist 100 --minhaplen 30 --pctsize 0.98

However, the second step still takes a long time and is still running (last job in the screenshot).