"no worker was ever available" when running on a slow slurm cluster #1016

Closed
FlorianPommerening opened this issue May 23, 2023 · 6 comments

@FlorianPommerening

This came up in #998, I'll repeat the relevant parts here for easier reference.

Description

I want to parallelize SMAC on a slurm cluster. The cluster only schedules new jobs once every 15 seconds.

Steps/Code to Reproduce

An example is available on:
https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

  • benchmarks.py contains the list of instances and their features.
  • gurobi.py contains the model (configuration space and trial evaluation function).
  • run_smac.py contains the actual call to smac, the dask client and so on.
  • setup.sh shows what software I installed: gurobipy, dask_jobqueue, swig, and SMAC on the development branch as of last week (the code needs #997, "check if config in rh when storing state", which I merged locally for earlier tests but which is now already on the dev branch). All of this is now in release 2.0.1.

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the time.sleep(10) in line 61, I get the error shown under Actual Results below.
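
For context, the relevant setup in run_smac.py looks roughly like the sketch below. This is a simplified reconstruction, not the exact script: the cluster resources, the choice of facade, and the imports from gurobi.py are assumptions.

# Simplified sketch of run_smac.py; resources and names are assumptions.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from smac import HyperparameterOptimizationFacade, Scenario

from gurobi import configspace, evaluate  # config space and target function from the MWE (names assumed)

scenario = Scenario(configspace, n_trials=200, n_workers=4)

# Start the dask workers as slurm jobs.
cluster = SLURMCluster(cores=1, memory="2GB", walltime="01:00:00")
cluster.scale(jobs=4)

# The sleep discussed below: give slurm's 15-second scheduling tick a
# chance to actually start the worker jobs before SMAC looks for them.
time.sleep(10)

client = Client(cluster)
smac = HyperparameterOptimizationFacade(scenario, evaluate, dask_client=client)
incumbent = smac.optimize()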

Expected Results

I would expect the runner to wait until the workers are fully scheduled on the grid before giving up on them.

Actual Results

The runner waits for some time (_patience in DaskParallelRunner) and then counts the worker as failed. If that happens to all workers, the optimization doesn't start and produces the error below. I suspect that if it happens to some but not all workers, the optimization will start but only use those workers that were ready in time.

Either adding a time.sleep(10) or setting my_facade._runner._patience to 15 before the optimization seemed to fix the issue for me. This is somewhat hard to verify, because the bug is not perfectly reproducible. I assume it has to do with the 15-second scheduling frequency of our slurm cluster: if dask submits the workers just before the next "tick" of slurm, they will be scheduled quickly, but if this happens just after the tick, it will take at least 15 seconds. All of this assumes grid resources are available at all; I have not tried on a busy grid.
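
For reference, the second workaround looks roughly like this, continuing the sketch above (whether _patience remains a private attribute with the same meaning in later versions is of course not guaranteed):

smac = HyperparameterOptimizationFacade(scenario, evaluate, dask_client=client)
# Raise the runner's patience so it covers slurm's 15-second scheduling interval.
smac._runner._patience = 15  # seconds
incumbent = smac.optimize()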

[WARNING][abstract_facade.py:192] Provided `dask_client`. Ignore `scenario.n_workers`, directly set `n_workers` in `dask_client`.
[INFO][abstract_initial_design.py:147] Using 0 initial design configurations and 1 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[WARNING][dask_runner.py:127] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/./run_smac.py", line 62, in <module>
    incumbent = smac.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/facade/abstract_facade.py", line 303, in optimize
    incumbents = self._optimizer.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/main/smbo.py", line 284, in optimize
    self._runner.submit_trial(trial_info=trial_info)
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/runner/dask_runner.py", line 130, in submit_trial
    raise RuntimeError(
RuntimeError: Tried to execute a job, but no worker was ever available.This likely means that a worker crashed or no workers were properly configured.

Versions

I ran this on a version of the development branch with some feature branches merged. It should be equivalent to what is now release 2.0.1.

@benjamc
Contributor

benjamc commented Jun 6, 2023

Hi Florian,
it is hard to reproduce the issue on our setup. However, I might have a fix.
You could try this branch: https://github.com/automl/SMAC3/tree/fix/daskworker
On that branch, dask waits for at least one worker to be scheduled.
Could you try it and report back whether that works for you?
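
For comparison, the generic way to block until at least one worker has registered, using plain dask.distributed, is sketched below. This only illustrates the idea and is not necessarily what the branch implements; resources are placeholders.

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, memory="2GB")
cluster.scale(jobs=4)

client = Client(cluster)
# Block until at least one worker is connected to the scheduler,
# instead of giving up after a fixed patience interval.
client.wait_for_workers(n_workers=1)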

@alexandertornede
Contributor

@FlorianPommerening Did you check on the branch @benjamc mentioned above?

@mens-artis

mens-artis commented Sep 16, 2023

On my cluster, the fix works when I don't ask for worker_extra_args=["--gpus-per-task=2"], which ends up in
/usr/bin/python3.10 -m distributed.cli.dask_worker tcp://..166.214:38861 --nthreads 1 --memory-limit 0.93GiB --name dummy-name --nanny --death-timeout 60 --gpus-per-task=2
When I use job_extra_directives=["--gres=gpu:2"] instead, however, no GPUs are ever allotted as far as I can tell. I think there is another argument through which a GPU request might be passed, but I also print cluster.job_script() and it is the same as when I write the job script myself (with #SBATCH --gres=gpu:2). When I write the job script myself, the GPUs are allotted, but obviously I need to use SMAC.
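
For reference, a sketch of the two variants described above (cores, memory, and other resources are placeholders):

from dask_jobqueue import SLURMCluster

# Variant 1: worker_extra_args is appended to the dask worker command line,
# so --gpus-per-task ends up as a worker option (as shown above) rather
# than a slurm resource request.
cluster = SLURMCluster(
    cores=1,
    memory="2GB",
    worker_extra_args=["--gpus-per-task=2"],
)

# Variant 2: job_extra_directives becomes an #SBATCH line in the job script.
cluster = SLURMCluster(
    cores=1,
    memory="2GB",
    job_extra_directives=["--gres=gpu:2"],
)
print(cluster.job_script())  # shows the generated #SBATCH directives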

@mfeurer
Contributor

mfeurer commented Sep 21, 2023

Hey folks, I started an example in #1064 in which workers are started manually. There's something not yet working in there, but you may use it as a starting point to achieve what you want. In case you get the example working, please consider updating the PR I made.
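
Independent of the exact example in #1064, the general pattern for manually started workers looks roughly like the sketch below; the scheduler address is a placeholder, and the facade and target function names are reused from the sketch in the issue description above.

# Start the scheduler and workers yourself, e.g. inside slurm jobs
# (the exact CLI invocation depends on the dask version):
#   dask scheduler --port 8786
#   dask worker tcp://<scheduler-host>:8786 --nthreads 1
from dask.distributed import Client

# Connect to the manually started scheduler and hand the client to SMAC.
client = Client("tcp://<scheduler-host>:8786")
smac = HyperparameterOptimizationFacade(scenario, evaluate, dask_client=client)
incumbent = smac.optimize()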

@FlorianPommerening
Author

Sorry for being quiet for so long; some deadlines and holidays got in the way. I have now tried reproducing the problem again on the current dev branch but couldn't. The behavior was always a bit difficult to reproduce because it depends heavily on timing. The patch in https://github.com/automl/SMAC3/tree/fix/daskworker makes sense to me, but since I couldn't reproduce the error, I'm also fine with just closing this issue.


stale bot commented Nov 30, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 30, 2023
stale bot closed this as completed Dec 15, 2023