"no worker was ever available" when running on a slow slurm cluster #1016
Comments
Hi Florian,
@FlorianPommerening Did you check on the branch @benjamc mentioned above?
On my cluster, the fix works when I don't ask for
Hey folks, I started an example in #1064 in which workers are started manually. There's something not yet working in there, but you may use it as a starting point to achieve what you want. In case you get the example working, please consider updating the PR I made.
Sorry for being quiet for so long. Some deadlines and holidays got in the way. I now tried reproducing the problem again on the current dev branch but couldn't. The behavior was always a bit difficult to reproduce because it relies heavily on the timing. The patch in https://github.com/automl/SMAC3/tree/fix/daskworker makes sense to me, but since I couldn't reproduce the error, I'm also fine with just closing this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This came up in #998, I'll repeat the relevant parts here for easier reference.
Description
I want to parallelize SMAC on a slurm cluster. The cluster only schedules new jobs once every 15 seconds.
Steps/Code to Reproduce
An example is available at:
https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

- `benchmarks.py` contains the list of instances and their features.
- `gurobi.py` contains the model (configuration space and trial evaluation function).
- `run_smac.py` contains the actual call to SMAC, the dask client, and so on.
- `setup.sh` shows what software I installed: `gurobipy`, `dask_jobqueue`, `swig`, and SMAC on the development branch as of last week (the code needs "check if config in rh when storing state" #997, which I merged locally for previous tests, but it is now already on the dev branch). All of this is now in release 2.0.1.

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the `time.sleep(10)` in line 61, I get the following output:

Expected Results
I would expect the runner to wait until the workers are fully scheduled on the grid before giving up on them.
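The expected behavior could be approximated by blocking on worker registration before starting the optimization. Below is a minimal sketch, not SMAC's actual code: the helper name `wait_for_first_worker` is my own, and it assumes the client object exposes `dask.distributed.Client`'s `wait_for_workers(n_workers, timeout)` method.

```python
# Hypothetical helper (not part of SMAC): block until the dask scheduler
# reports at least one registered worker, instead of counting slowly
# scheduled workers as failed.
def wait_for_first_worker(client, timeout: float = 60.0) -> None:
    # On a slurm cluster that only schedules every 15 s, the timeout
    # should comfortably exceed one scheduling tick.
    client.wait_for_workers(n_workers=1, timeout=timeout)
```

One would presumably call this on the dask client right after creating it and before handing it to the facade, so that the runner's patience check never starts against an empty worker pool.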
Actual Results
The runner waits for some time (`_patience` in `DaskParallelRunner`) and then counts the worker as failed. If that happens to all workers, the optimization doesn't start and produces the error below. I suspect that if it happens to some but not all workers, the optimization will start but only use those workers that were ready in time.

Either adding a `time.sleep(10)` or setting `my_facade._runner._patience` to 15 before the optimization seemed to fix the issue for me. The bug is somewhat hard to observe because it is not perfectly reproducible. I assume this has to do with the 15-second scheduling frequency of our slurm cluster: if dask submits the workers just before the next "tick" of slurm, they will be scheduled quickly, but if the submission happens just after the tick, scheduling will take at least 15 seconds. All of this assumes grid resources are available at all; I have not tried on a busy grid.

Versions
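The timing race described above can be modeled in a few lines. This is a purely illustrative toy model under the assumptions stated in the report (none of these names come from SMAC or dask): a worker only starts at the next 15-second slurm tick, so a submission just after a tick waits almost a full interval, which can exceed the default patience.

```python
import math

# Toy model of the race (not SMAC code): a worker submitted at
# `submit_time` only starts at the next multiple of `tick` seconds;
# the runner gives up on it after `patience` seconds.
def worker_seen(submit_time: float, patience: float, tick: float = 15.0) -> bool:
    next_tick = math.ceil(submit_time / tick) * tick  # next slurm scheduling pass
    return next_tick - submit_time <= patience

# Submitted just before a tick: scheduled after 0.1 s, well within patience.
assert worker_seen(submit_time=14.9, patience=5.0)
# Submitted just after a tick: scheduled after 14.9 s, patience expires first.
assert not worker_seen(submit_time=15.1, patience=5.0)
# Raising patience to one full tick interval covers the worst case.
assert worker_seen(submit_time=15.1, patience=15.0)
```

This matches the observed workarounds: both `time.sleep(10)` (shifting the submission relative to the tick) and raising `_patience` to 15 change the outcome of the same race.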
I ran this on a version of the development branch with some feature branches merged. It should be equivalent to what is now release 2.0.1.