Running multiple diamond searches in parallel on the same machine #732
Comments
I will run some tests and see if I can reproduce the issue.
Thank you, appreciate it! Let me know if I can provide any more details that could be helpful.
Just for the record, @bbuchfink, here are some examples of full logs.
Frozen log:
Successful log:
I am getting this same problem, although in my case it happens during clustering, as I run multiple iterations in parallel (see #747). I have small datasets that reproduce it well when run in parallel: I launch 100 runs from a bash loop, and with 32 instances of diamond running in parallel I'd estimate about 0.001% of genes crash, which translates to about 10-20 crashes per 1,500,000 runs. Let me know if you want the datasets. Runtime is 5 to 15 minutes for the entire bash loop.
Hi, I have a set of 100,000 analyses (batched into 1000 jobs of 100 analyses each) which include running a diamond search. The searches involved are small (10-15 protein queries against a single proteome of ~10,000 proteins) so they won't benefit much from multithreading, but I figured I could get them running in parallel, with 1 thread each, and this way I could cut the overall job runtime significantly.
When I do that though, the problem is that sometimes searches will get stuck at "Computing alignments..." and won't proceed no matter what. What is weird is that this is non-deterministic - if I resubmit that same job with these same 100 analyses, the one or two that got stuck last time will proceed just fine, and a different one will hang...
My test set for this has 50 jobs of 100 searches each. Of those 50, anywhere between 1 and 3 jobs will always get stuck, with 1-3 of the 100 searches in them getting frozen.
Assorted details for more context:
- Jobs are run inside a Docker container, on Google Cloud VMs (8 CPU/32 GB RAM, 8 CPU/64 GB RAM, or 16 CPU/64 GB RAM)
- My analysis code is Python (run with a 3.11.4 interpreter) and it runs diamond directly via `os.system()`
- A job runs a single script with a list of searches to do; individual searches are parallelized through the standard Python multiprocessing library, with a Pool of 8/16 processes running at the same time
- The issue is not affected by trying other Python parallelization libraries such as joblib, pathos, or concurrent.futures
- The issue is not affected by CPU counts, available RAM, using fewer CPUs than the maximum available, or staggering individual searches by up to 20 seconds in order to reduce peak memory usage (I've seen ~10 GB peak usage, so well under what's available to the VM)
- The issue only happens with diamond v2.1.0 and later, up to v2.1.8; v2.0.15 runs through just fine. I realize that a lot of features were added in v2.1.x, so maybe one of them is causing this issue...
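For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual script: the database name, query file names, and pool size are hypothetical, and `subprocess.run` is shown in place of `os.system()` for clarity. The diamond flags used (`--db`, `--query`, `--out`, `--threads`) are standard blastp options.

```python
import multiprocessing
import shutil
import subprocess


def build_cmd(query_path, db="proteome.dmnd", threads=1):
    # Assemble a single-threaded diamond blastp invocation.
    # "proteome.dmnd" is a hypothetical pre-built diamond database.
    return [
        "diamond", "blastp",
        "--db", db,
        "--query", query_path,
        "--out", query_path + ".tsv",
        "--threads", str(threads),
    ]


def run_search(query_path):
    # One diamond search per worker process; returns the exit code.
    return subprocess.run(build_cmd(query_path)).returncode


if __name__ == "__main__" and shutil.which("diamond"):
    # Hypothetical batch of 100 small query files, run 8 at a time
    # with 1 thread each, mirroring the job layout described above.
    queries = [f"query_{i}.faa" for i in range(100)]
    with multiprocessing.Pool(processes=8) as pool:
        exit_codes = pool.map(run_search, queries)
```

In this layout each worker is fully independent (separate process, separate output file), so in principle nothing should deadlock; the hang appears to be inside diamond itself once multiple v2.1.x instances run concurrently.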
Any idea what might be going on there?