"no worker was ever available" when running on a slow slurm cluster #1016

Closed
FlorianPommerening opened this issue May 23, 2023 · 6 comments

@FlorianPommerening

This came up in #998, I'll repeat the relevant parts here for easier reference.

Description

I want to parallelize SMAC on a slurm cluster. The cluster only schedules new jobs once every 15 seconds.

Steps/Code to Reproduce

An example is available on:
https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

  • benchmarks.py contains the list of instances and their features.
  • gurobi.py contains the model (configuration space and trial evaluation function).
  • run_smac.py contains the actual call to smac, the dask client and so on.
  • setup.sh shows what software I installed: gurobipy, dask_jobqueue, swig, and SMAC on the development branch as of last week (the code needs #997, "check if config in rh when storing state", which I merged locally for earlier tests but which is now already on the dev branch). All of this is now in release 2.0.1.

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the time.sleep(10) in line 61, I get the error shown under Actual Results below.
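
For context, the relevant setup in run_smac.py looks roughly like the sketch below. This is a simplified reconstruction, not the exact script: the cluster resources, the choice of facade, and the imports from gurobi.py are assumptions.

# Simplified sketch of run_smac.py; resources and names are assumptions.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from smac import HyperparameterOptimizationFacade, Scenario

from gurobi import configspace, evaluate  # config space and target function from the MWE (names assumed)

scenario = Scenario(configspace, n_trials=200, n_workers=4)

# Start the dask workers as slurm jobs.
cluster = SLURMCluster(cores=1, memory="2GB", walltime="01:00:00")
cluster.scale(jobs=4)

# The sleep discussed below: give slurm's 15-second scheduling tick a
# chance to actually start the worker jobs before SMAC looks for them.
time.sleep(10)

client = Client(cluster)
smac = HyperparameterOptimizationFacade(scenario, evaluate, dask_client=client)
incumbent = smac.optimize()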

Expected Results

I would expect the runner to wait until the workers are fully scheduled on the grid before giving up on them.

Actual Results

The runner waits for some time (_patience in DaskParallelRunner) and then counts the worker as failed. If that happens to all workers, the optimization doesn't start and produces the error below. I suspect that if it happens to some but not all workers, the optimization will start but only use those workers that were ready in time.

Either adding a time.sleep(10) or setting my_facade._runner._patience to 15 before the optimization seemed to fix the issue for me. This is somewhat hard to verify, because the bug is not perfectly reproducible. I assume it has to do with the 15-second scheduling frequency of our slurm cluster: if dask submits the workers just before the next "tick" of slurm, they will be scheduled quickly, but if this happens just after the tick, it will take at least 15 seconds. All of this assumes grid resources are available at all; I have not tried on a busy grid.
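
For reference, the second workaround looks roughly like this, continuing the sketch above (whether _patience remains a private attribute with the same meaning in later versions is of course not guaranteed):

smac = HyperparameterOptimizationFacade(scenario, evaluate, dask_client=client)
# Raise the runner's patience so it covers slurm's 15-second scheduling interval.
smac._runner._patience = 15  # seconds
incumbent = smac.optimize()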

[WARNING][abstract_facade.py:192] Provided `dask_client`. Ignore `scenario.n_workers`, directly set `n_workers` in `dask_client`.
[INFO][abstract_initial_design.py:147] Using 0 initial design configurations and 1 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[WARNING][dask_runner.py:127] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/./run_smac.py", line 62, in <module>
    incumbent = smac.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/facade/abstract_facade.py", line 303, in optimize
    incumbents = self._optimizer.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/main/smbo.py", line 284, in optimize
    self._runner.submit_trial(trial_info=trial_info)
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/runner/dask_runner.py", line 130, in submit_trial
    raise RuntimeError(
RuntimeError: Tried to execute a job, but no worker was ever available.This likely means that a worker crashed or no workers were properly configured.

Versions

I ran this on a version of the development branch with some feature branches merged. It should be equivalent to what is now release 2.0.1.

@benjamc
Contributor

benjamc commented Jun 6, 2023

Hi Florian,
it is hard to reproduce the issue on our setup. However, I might have a fix.
You could try this branch: https://github.com/automl/SMAC3/tree/fix/daskworker
On that branch, dask waits for at least one worker to be scheduled.
Could you try it and report back whether that works for you?
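
For comparison, the generic way to block until at least one worker has registered, using plain dask.distributed, is sketched below. This only illustrates the idea and is not necessarily what the branch implements; resources are placeholders.

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, memory="2GB")
cluster.scale(jobs=4)

client = Client(cluster)
# Block until at least one worker is connected to the scheduler,
# instead of giving up after a fixed patience interval.
client.wait_for_workers(n_workers=1)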

@alexandertornede
Contributor

@FlorianPommerening Did you check on the branch @benjamc mentioned above?

@mens-artis

mens-artis commented Sep 16, 2023

On my cluster, the fix works when I don't ask for worker_extra_args=["--gpus-per-task=2"], which ends up in
/usr/bin/python3.10 -m distributed.cli.dask_worker tcp://..166.214:38861 --nthreads 1 --memory-limit 0.93GiB --name dummy-name --nanny --death-timeout 60 --gpus-per-task=2
When I use job_extra_directives=["--gres=gpu:2"] instead, however, no GPUs are ever allotted as far as I can tell. I think there is another argument through which a GPU request might be passed, but I also print cluster.job_script() and it is the same as when I write the job script myself (with #SBATCH --gres=gpu:2). When I write the job script myself, the GPUs are allotted, but obviously I need to use SMAC.
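
For reference, a sketch of the two variants described above (cores, memory, and other resources are placeholders):

from dask_jobqueue import SLURMCluster

# Variant 1: worker_extra_args is appended to the dask worker command line,
# so --gpus-per-task ends up as a worker option (as shown above) rather
# than a slurm resource request.
cluster = SLURMCluster(
    cores=1,
    memory="2GB",
    worker_extra_args=["--gpus-per-task=2"],
)

# Variant 2: job_extra_directives becomes an #SBATCH line in the job script.
cluster = SLURMCluster(
    cores=1,
    memory="2GB",
    job_extra_directives=["--gres=gpu:2"],
)
print(cluster.job_script())  # shows the generated #SBATCH directives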

@mfeurer
Contributor

mfeurer commented Sep 21, 2023

Hey folks, I started an example in #1064 in which workers are started manually. There's something not yet working in there, but you may use it as a starting point to achieve what you want. In case you get the example working, please consider updating the PR I made.
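
Independent of the exact example in #1064, the general pattern for manually started workers looks roughly like the sketch below; the scheduler address is a placeholder, and the facade and target function names are reused from the sketch in the issue description above.

# Start the scheduler and workers yourself, e.g. inside slurm jobs
# (the exact CLI invocation depends on the dask version):
#   dask scheduler --port 8786
#   dask worker tcp://<scheduler-host>:8786 --nthreads 1
from dask.distributed import Client

# Connect to the manually started scheduler and hand the client to SMAC.
client = Client("tcp://<scheduler-host>:8786")
smac = HyperparameterOptimizationFacade(scenario, evaluate, dask_client=client)
incumbent = smac.optimize()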

@FlorianPommerening
Author

Sorry for being quiet for so long; some deadlines and holidays got in the way. I have now tried reproducing the problem again on the current dev branch but couldn't. The behavior was always a bit difficult to reproduce because it depends heavily on timing. The patch in https://github.com/automl/SMAC3/tree/fix/daskworker makes sense to me, but since I couldn't reproduce the error, I'm also fine with just closing this issue.


stale bot commented Nov 30, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 30, 2023
stale bot closed this as completed Dec 15, 2023