freebayes-parallel not fully utilising CPUs #479
How are you choosing the regions to run? You may want to ensure that you have approximately the same number of alignments in each region; otherwise you will find your runtime dominated by the longest-running segment. This could be scripted out pretty easily with bedtools coverage or something similar. Unfortunately, I don't think I ever pushed a script to do this to this repo, but contributions are more than welcome.

The warning about file handles is concerning. You should find out what is accessing the files. It doesn't make sense to me that it's freebayes or the parallel script, but it could be that the way the number of CPUs is specified is not being read, and it's trying to run all the jobs at once. That does seem like a strange possibility, and without access to the system it will be impossible for me to debug, so hopefully you can get some idea of what's going on. It seems like a likely cause of the problem, but only because it's an unusual error.
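As a rough illustration of the balancing idea (this is not an existing freebayes script): given per-region alignment counts, e.g. parsed from bedtools coverage output, a greedy pass can assign regions to chunks so that each chunk carries a similar share of the data. The region tuples, counts, and function name below are all invented for the example.

```python
# Hypothetical sketch: balance regions across chunks by alignment count.
def balance_regions(regions, n_chunks):
    """regions: list of (chrom, start, end, n_alignments); returns chunks and their loads."""
    chunks = [[] for _ in range(n_chunks)]
    loads = [0] * n_chunks
    # Greedy: hand each region (heaviest first) to the currently lightest chunk.
    for chrom, start, end, n in sorted(regions, key=lambda r: -r[3]):
        i = loads.index(min(loads))
        chunks[i].append((chrom, start, end))
        loads[i] += n
    return chunks, loads

# Made-up counts: one region is far deeper than the rest.
regions = [("chr1", 0, 100000, 5000), ("chr1", 100000, 200000, 90000),
           ("chr2", 0, 100000, 6000), ("chr2", 100000, 200000, 4000)]
chunks, loads = balance_regions(regions, 2)
print(loads)  # [90000, 15000]
```

Note that even with balancing, a single very deep region still dominates its chunk, which is one reason to cut regions smaller rather than larger.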
Hi Erik, thanks for getting back to me so quickly. To address your points:
I thought about this one. At first it made a lot of sense that skewed coverage could be the cause. My understanding is that running parallel with the -k parameter (--keep-order), as the freebayes-parallel script does, would require each process to buffer until it is its turn to flush to stdout. It is easy to imagine that one region with a particularly high number of reads would hold up all of the other processes that need to follow it. This makes a lot of sense; however, under this scenario I would expect to see 30 processes open in htop, but I don't. I only see one or two processes running in htop during the slowdown period, and no processes appeared to be blocked, from htop at least.
Also unlikely, if htop is to be believed. As I said, I'm running freebayes-parallel with ncpus=30.
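The ordered-gathering behavior of parallel -k can be mimicked in miniature with Python's concurrent.futures (an analogy only, not how freebayes-parallel is implemented): map() yields results in submission order, so fast tasks finish and their workers go idle while their buffered results wait behind a slow early task. The task names and durations here are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def job(args):
    name, duration = args
    time.sleep(duration)
    return name

# A slow first task plus two fast ones, mirroring one deep region.
jobs = [("slow-region", 0.2), ("fast-1", 0.01), ("fast-2", 0.01)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() emits results in submission order (like parallel -k):
    # the fast tasks complete almost at once, but their results are
    # buffered until slow-region, submitted first, has finished.
    results = list(pool.map(job, jobs))

print(results)  # ['slow-region', 'fast-1', 'fast-2']
```

This is consistent with the htop observation: the workers for finished tasks are gone, and only the slow job's process remains visible.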
Actually, you'll only see the parallel process and the freebayes processes that haven't finished. parallel has already gathered the results of the jobs that are done, so they won't be running anymore. I think this is exactly what we're seeing.

Generate a targets.bed with roughly even data content per region, then convert targets.bed to the "region" format using vcflib's script. Please let me know if this resolves the issue. You obviously need to choose an appropriate number of chunks; 10k may be too many.
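For the conversion step, the region strings consumed by freebayes-parallel look like chrom:start-end, the same shape that fasta_generate_regions.py emits. A minimal sketch of turning BED lines into that format, with a made-up BED snippet (the function name is hypothetical):

```python
# Rough sketch: convert BED lines to "chrom:start-end" region strings.
def bed_to_regions(bed_text):
    regions = []
    for line in bed_text.strip().splitlines():
        if line.startswith(("#", "track", "browser")):
            continue  # skip BED headers
        chrom, start, end = line.split("\t")[:3]
        regions.append(f"{chrom}:{start}-{end}")
    return regions

bed = "chr1\t0\t100000\nchr1\t100000\t154321\n"
print(bed_to_regions(bed))  # ['chr1:0-100000', 'chr1:100000-154321']
```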
As a caveat, note that this should be more deeply integrated into freebayes, but I haven't got the time to do it.
Thanks Erik. Very interesting that parallel doesn't start new processes while waiting for one to complete. I might also try running with the evened-out regions.
@MarkDunne well, it's specifically because of the -k (--keep-order) behavior. Regions with even numbers of reads in them will absolutely help in harmonizing runtimes. The same could be said for raw variant density (it's a good proxy for read depth and freebayes runtime), and you can use vcfevenregions from vcflib to generate that.

I'd be cautious about increasing the region size. That is more likely to make things worse, as now the longest-running region will take even longer. It'd be better to make the regions much smaller. Your total runtime is bounded by the longest-running chunk, unless the longest runtime is much shorter than the total.
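The "bounded by the longest chunk" point can be checked with a small scheduling simulation (illustrative only; the runtimes are invented): workers pull chunks from a queue, and wall-clock time is whenever the latest worker finishes. Splitting the same total work into smaller chunks shrinks the makespan.

```python
import heapq

def makespan(chunk_times, n_workers):
    """Wall-clock time if idle workers pull chunks from a shared queue."""
    workers = [0.0] * n_workers  # finish time of each worker
    heapq.heapify(workers)
    for t in chunk_times:
        # The next chunk goes to whichever worker frees up first.
        heapq.heappush(workers, heapq.heappop(workers) + t)
    return max(workers)

# One huge region dominates: 130 units of work on 4 workers.
coarse = [100, 10, 10, 10]
# The same total work split into 13 smaller chunks.
fine = [10] * 13

print(makespan(coarse, 4))  # 100.0 -- bounded by the longest chunk
print(makespan(fine, 4))    # 40.0
```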
@ekg if you want to subsample the BAM file, modern samtools can do that directly (samtools view -s subsamples reads by fraction). Also, other tools might be worth considering.
We've tackled this problem without needing to calculate genome coverage. The trick was to split the script in two. Step one is to run freebayes with parallel while dumping the many small VCFs to a temporary directory; this frees us from the -k (--keep-order) output buffering. Step two then concatenates the per-region VCFs, in region order, into the final output.
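A sketch of the concatenation step under that scheme (the function name and file layout are hypothetical, not the poster's actual script): stitch the per-region VCF fragments together in genome order, keeping the header lines from the first fragment only.

```python
import os, tempfile

def merge_vcf_fragments(paths, out_path):
    """Concatenate per-region VCFs; emit header (#-prefixed) lines from the first file only."""
    wrote_header = False
    with open(out_path, "w") as out:
        for path in paths:  # caller must supply paths in genome order
            with open(path) as f:
                for line in f:
                    if line.startswith("#"):
                        if not wrote_header:
                            out.write(line)
                    else:
                        out.write(line)
            wrote_header = True

# usage sketch (hypothetical list of per-region outputs):
# merge_vcf_fragments(region_vcfs_in_order, "combined.vcf")
```

One caveat: a plain lexical sort of filenames is not genome order, so the region list that drove step one should also drive the ordering here.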
@MarkDunne thanks for reporting, this is very helpful to know. I'll close the issue.
Hi @MarkDunne is there an exact implementation of splitting the script somewhere in the repo? I have this problem with freebayes currently. |
Hi @MarkDunne. How did you make the split? |
I'm using the freebayes-parallel script with freebayes version 1.2.0 and GNU Parallel version 20180522, and the CPU utilisation seems to randomly flip between all or almost none of the available cores.
After about 5 hours of running (screenshot omitted), the number of processes being run (checked via htop) dropped from 30 to just 1. From other runs where the same thing has happened, I know that after another day or two this will randomly change again and run with all 30 cores for another few hours. This is on a 32-core machine, running freebayes-parallel with ncpus=30.

The only indicative error I see in the output of GNU Parallel is a warning about running out of file handles.
But I'm only using 2 (albeit large) BAM files, so a file-handle limit seems like a weird issue. I have tried raising the ulimit while the script was running, but it didn't appear to have any effect. If it's relevant, the input for GNU Parallel is the output of fasta_generate_regions.py with region size 100000. Any help appreciated.