
Create example for custom dask client #998

Closed
benjamc opened this issue May 5, 2023 · 13 comments
Assignees
Labels
documentation Documentation is needed/added.

Comments

@benjamc
Contributor

benjamc commented May 5, 2023

No description provided.

@benjamc benjamc added the documentation Documentation is needed/added. label May 5, 2023
@benjamc benjamc self-assigned this May 5, 2023
@FlorianPommerening

I stumbled on this issue and saw that it was opened shortly after I had been looking for exactly this, what a lucky coincidence. I would be particularly interested in an example that uses Dask to run SMAC on a SLURM-based cluster. From looking at Dask, this seems like an option, but I don't know how to use it: https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html
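For reference, a minimal sketch of what such a setup might look like, assuming `dask_jobqueue` is installed. The queue name, resource values, and interface name are site-specific placeholders, and passing the client to a SMAC facade via `dask_client` is inferred from later comments in this thread, not verified against any particular SMAC version:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Describe one SLURM worker job; queue/cores/memory/walltime are
# placeholders that must match your cluster's configuration.
cluster = SLURMCluster(
    queue="normal",        # SLURM partition name (site-specific)
    cores=1,
    memory="4 GB",
    walltime="01:00:00",
    # As noted later in this thread: on clusters where the login node's
    # public IP does not accept connections, the network interface must
    # be set explicitly, e.g. interface="ib0".
)
cluster.scale(jobs=4)      # request four worker jobs

client = Client(cluster)

# The SMAC facades accept a ready Dask client (see PR #1001), e.g.:
# smac = HyperparameterOptimizationFacade(scenario, train, dask_client=client)
```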

@benjamc
Copy link
Contributor Author

benjamc commented May 9, 2023

Hi @FlorianPommerening,
I created a Dask client example for a SLURM cluster; you can find it in PR #1001 under examples/1_basics/7_parallelization_cluster.py.

@mfeurer
Contributor

mfeurer commented May 10, 2023

Hey @benjamc, it's great to see progress in this direction.

I would suggest also adding an example that does not require a custom client but rather a standard client, and shows how to connect manually spawned workers (in case someone doesn't have a SLURM cluster but still wants to do similar things). As a starting point, one could have a look at this example in Auto-sklearn, which can easily be adapted for SMAC.
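A sketch of that manual setup, assuming the standard Dask command-line tools (older releases name them `dask-scheduler`/`dask-worker`); host and port are placeholders:

```bash
# On the machine that will run SMAC: start a scheduler.
dask scheduler --port 8786

# On each worker machine (or inside each batch job): connect a worker
# to that scheduler.
dask worker tcp://<scheduler-host>:8786

# In the SMAC script, connect a standard client to the same address:
#   from dask.distributed import Client
#   client = Client("tcp://<scheduler-host>:8786")
```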

@FlorianPommerening

Thanks a lot @benjamc, that was super quick.

@FlorianPommerening

FlorianPommerening commented May 11, 2023

Unfortunately, the example doesn't work on our cluster. I changed the name of the queue, increased the number of trials to 1000, and then ran the process on the login node of our cluster. I can see worker jobs spawning on the cluster, but they don't seem to be doing anything. The work is all done on the login node instead (htop on the login node shows it under full load; htop on the node where the workers are running shows some activity initially as they start up, then nothing). After a while (the main thread on the login node is still running trials at this point), the workers stop.

When I look into the logs in tmp/smac_dask_slurm/*.err I see the following error. Any idea what I'm doing wrong?

2023-05-11 16:00:40,070 - distributed.nanny - INFO - Closing Nanny at 'tcp://[private IP removed]:37187'. Reason: nanny-close
2023-05-11 16:00:40,072 - distributed.dask_worker - INFO - End worker
...
OSError: Timed out trying to connect to tcp://[public IP removed]:41950 after 30 s
...
RuntimeError: Nanny failed to start.

(edit: simplified long log since it is no longer relevant, see below.)

@FlorianPommerening

Ok, I figured out that the nanny was not connecting to the workers because I had to specify the "interface" parameter. Otherwise, the public IP of the login node was used, which does not accept connections.

I now no longer see the error, but the work still seems to be done exclusively on the login node.

@FlorianPommerening

I managed to get it to work, but I had to make additional changes:

  • In the Scenario, I had to set n_workers to the number of workers spawned by Dask. I didn't see this in the example, and without it, SMAC used a TargetFunctionRunner instead of a DaskParallelRunner, so it ran locally.
  • I added a time.sleep(10) before the call to optimize(). Without it, the workers were not ready when the optimization started and the whole process failed with something like "no worker was ever available".
  • In the Intensifier, I had to increase retries a lot. Without this, I often got "Intensifier could not find any new trials." shortly after the optimization started. I don't understand exactly what happened, but it seems like the intensifier schedules some trials, but there are so many workers that all of them can be scheduled in parallel. While they are running, no new trials are scheduled and the queue runs empty.

Maybe some of those points are worth adding to the example.
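On the second point, a cleaner alternative to a fixed `time.sleep(10)` may be `Client.wait_for_workers`, a standard `dask.distributed` method that blocks until a given number of workers have registered (whether it fully avoids the race seen here is untested). The underlying pattern is just polling with a deadline, sketched below with only the standard library; `wait_until` and `worker_ready` are illustrative names, not SMAC or Dask API:

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout. Analogous to
    waiting for client.scheduler_info()["workers"] to be non-empty before
    calling smac.optimize().
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Toy usage: a "worker" that only becomes ready after a few polls.
state = {"polls": 0}

def worker_ready():
    state["polls"] += 1
    return state["polls"] >= 3

assert wait_until(worker_ready, timeout=5.0, interval=0.01)
```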

@benjamc
Contributor Author

benjamc commented May 16, 2023

Hi,
thank you for pointing out the issue with scenario.n_workers. We updated the PR to wrap the runner in a DaskParallelRunner when either scenario.n_workers > 1 or a Dask client is passed. This should be fine now.
As for the rest, the parallelization example runs fine on our machine.
If you still have trouble, could you provide a minimal working example (assuming you use a SLURM cluster)?

@FlorianPommerening

I could not reproduce the problem from the third point (the one about retries of the intensifier), but the second one (sleep before optimize) is reproducible for me. The script I used is available here:

https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

  • benchmarks.py contains the list of instances and their features.
  • gurobi.py contains the model (configuration space and trial evaluation function).
  • run_smac.py contains the actual call to smac, the dask client and so on.
  • setup.sh shows what software I installed: gurobipy, dask_jobqueue, swig, and SMAC on the development branch as of yesterday (the code needs #997, "check if config in rh when storing state", which I merged locally for previous tests, but it is now already on the dev branch).

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the time.sleep(10) in line 61, I get the following output:

[WARNING][abstract_facade.py:192] Provided `dask_client`. Ignore `scenario.n_workers`, directly set `n_workers` in `dask_client`.
[INFO][abstract_initial_design.py:147] Using 0 initial design configurations and 1 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[WARNING][dask_runner.py:127] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/./run_smac.py", line 62, in <module>
    incumbent = smac.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/facade/abstract_facade.py", line 303, in optimize
    incumbents = self._optimizer.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/main/smbo.py", line 284, in optimize
    self._runner.submit_trial(trial_info=trial_info)
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/runner/dask_runner.py", line 130, in submit_trial
    raise RuntimeError(
RuntimeError: Tried to execute a job, but no worker was ever available.This likely means that a worker crashed or no workers were properly configured.

The first warning about scenario.n_workers always shows up when using a Dask client, even when not specifying n_workers, but this shouldn't matter, right?

@benjamc
Contributor Author

benjamc commented May 23, 2023

Hi Florian,
it might be that the patience is too low. Currently this parameter is not accessible from the outside, but as a quick fix you can try setting it here. Maybe adding 10 s already suffices.

The warning is just informational: we use the number of workers specified in the Dask client, and scenario.n_workers is ignored.

@FlorianPommerening

Thanks. This seems to help, but it is somewhat complicated to test.

I'll open new issues for the two problems as you suggested by email.

@FlorianPommerening

For future reference: the new issues are #1016 and #1017.

@benjamc
Contributor Author

benjamc commented Jun 1, 2023

Thanks for the issues, I will close this one then. :)

@benjamc benjamc closed this as completed Jun 1, 2023
Projects
Status: Done
Development

No branches or pull requests

3 participants