
Create example for custom dask client #998

Closed
benjamc opened this issue May 5, 2023 · 13 comments
Assignees
Labels
documentation Documentation is needed/added.

Comments

@benjamc
Contributor

benjamc commented May 5, 2023

No description provided.

@benjamc benjamc added the documentation Documentation is needed/added. label May 5, 2023
@benjamc benjamc self-assigned this May 5, 2023
@FlorianPommerening

I stumbled on this issue and saw that it was opened shortly after I had been looking for exactly this, what a lucky coincidence. I would be particularly interested in an example that uses Dask to run SMAC on a SLURM-based cluster. From looking at Dask, this seems like an option, but I don't know how to use it: https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html
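For reference, a minimal sketch of what such a setup might look like, assuming `dask_jobqueue` is installed. The queue name, resource values, and interface name are site-specific placeholders, and passing the client to a SMAC facade via `dask_client` is inferred from later comments in this thread, not verified against any particular SMAC version:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Describe one SLURM worker job; queue/cores/memory/walltime are
# placeholders that must match your cluster's configuration.
cluster = SLURMCluster(
    queue="normal",        # SLURM partition name (site-specific)
    cores=1,
    memory="4 GB",
    walltime="01:00:00",
    # As noted later in this thread: on clusters where the login node's
    # public IP does not accept connections, the network interface must
    # be set explicitly, e.g. interface="ib0".
)
cluster.scale(jobs=4)      # request four worker jobs

client = Client(cluster)

# The SMAC facades accept a ready Dask client (see PR #1001), e.g.:
# smac = HyperparameterOptimizationFacade(scenario, train, dask_client=client)
```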

@benjamc
Copy link
Contributor Author

benjamc commented May 9, 2023

Hi @FlorianPommerening,
I created a Dask client example for a SLURM cluster; you can find it in PR #1001 under examples/1_basics/7_parallelization_cluster.py.

@mfeurer
Contributor

mfeurer commented May 10, 2023

Hey @benjamc, it's great to see progress in this direction.

I would suggest also adding an example that does not require a custom client but rather a standard client, and shows how to connect manually spawned workers (in case someone doesn't have a SLURM cluster but still wants to do similar things). As a starting point, one could have a look at this example in Auto-sklearn, which can easily be adapted for SMAC.
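A sketch of that manual setup, assuming the standard Dask command-line tools (older releases name them `dask-scheduler`/`dask-worker`); host and port are placeholders:

```bash
# On the machine that will run SMAC: start a scheduler.
dask scheduler --port 8786

# On each worker machine (or inside each batch job): connect a worker
# to that scheduler.
dask worker tcp://<scheduler-host>:8786

# In the SMAC script, connect a standard client to the same address:
#   from dask.distributed import Client
#   client = Client("tcp://<scheduler-host>:8786")
```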

@FlorianPommerening

Thanks a lot @benjamc, that was super quick.

@FlorianPommerening

FlorianPommerening commented May 11, 2023

Unfortunately, the example doesn't work on our cluster. I changed the name of the queue, increased the number of trials to 1000, and then ran the process on the login node of our cluster. I can see worker jobs spawning on the cluster, but they don't seem to be doing anything. The work is all done on the login node instead (htop on the login node shows it under full load; htop on the node where the workers are running shows some activity initially as they start up, then nothing). After a while (the main thread on the login node is still running trials at this point), the workers stop.

When I look into the logs in tmp/smac_dask_slurm/*.err I see the following error. Any idea what I'm doing wrong?

2023-05-11 16:00:40,070 - distributed.nanny - INFO - Closing Nanny at 'tcp://[private IP removed]:37187'. Reason: nanny-close
2023-05-11 16:00:40,072 - distributed.dask_worker - INFO - End worker
...
OSError: Timed out trying to connect to tcp://[public IP removed]:41950 after 30 s
...
RuntimeError: Nanny failed to start.

(edit: simplified long log since it is no longer relevant, see below.)

@FlorianPommerening

Ok, I figured out that the nanny was not connecting to the workers because I had to specify the "interface" parameter. Otherwise, the public IP of the login node was used, which does not accept connections.

I now no longer see the error, but the work still seems to be done exclusively on the login node.

@FlorianPommerening

I managed to get it to work, but I had to make additional changes:

  • In the Scenario, I had to set n_workers to the number of workers spawned by Dask. I didn't see this in the example, and without it, SMAC used a TargetFunctionRunner instead of a DaskParallelRunner, so it ran locally.
  • I added a time.sleep(10) before the call to optimize(). Without it, the workers were not ready when the optimization started and the whole process failed with something like "no worker was ever available".
  • In the Intensifier, I had to increase retries a lot. Without this, I often got "Intensifier could not find any new trials." shortly after the optimization started. I don't understand exactly what happened, but it seems like the intensifier schedules some trials, but there are so many workers that all of them can be scheduled in parallel. While they are running, no new trials are scheduled and the queue runs empty.

Maybe some of those points are worth adding to the example.
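On the second point, a cleaner alternative to a fixed `time.sleep(10)` may be `Client.wait_for_workers`, a standard `dask.distributed` method that blocks until a given number of workers have registered (whether it fully avoids the race seen here is untested). The underlying pattern is just polling with a deadline, sketched below with only the standard library; `wait_until` and `worker_ready` are illustrative names, not SMAC or Dask API:

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout. Analogous to
    waiting for client.scheduler_info()["workers"] to be non-empty before
    calling smac.optimize().
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Toy usage: a "worker" that only becomes ready after a few polls.
state = {"polls": 0}

def worker_ready():
    state["polls"] += 1
    return state["polls"] >= 3

assert wait_until(worker_ready, timeout=5.0, interval=0.01)
```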

@benjamc
Contributor Author

benjamc commented May 16, 2023

Hi,
thank you for pointing out the issue with scenario.n_workers. We updated the PR to wrap the runner in a DaskParallelRunner when either scenario.n_workers > 1 or a Dask client is passed. This should be fine now.
As for the rest, the parallelization example runs fine on our machine.
If you still have trouble, could you provide a minimal working example (assuming you use a SLURM cluster)?

@FlorianPommerening

I could not reproduce the problem from the third point (the one about retries of the intensifier), but the second one (sleep before optimize) is reproducible for me. The script I used is available here:

https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

  • benchmarks.py contains the list of instances and their features.
  • gurobi.py contains the model (configuration space and trial evaluation function).
  • run_smac.py contains the actual call to smac, the dask client and so on.
  • setup.sh shows what software I installed: gurobipy, dask_jobqueue, swig, and SMAC on the development branch as of yesterday (the code needs #997, "check if config in rh when storing state", which I merged locally for previous tests, but it is now already on the dev branch).

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the time.sleep(10) in line 61, I get the following output:

[WARNING][abstract_facade.py:192] Provided `dask_client`. Ignore `scenario.n_workers`, directly set `n_workers` in `dask_client`.
[INFO][abstract_initial_design.py:147] Using 0 initial design configurations and 1 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[WARNING][dask_runner.py:127] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/./run_smac.py", line 62, in <module>
    incumbent = smac.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/facade/abstract_facade.py", line 303, in optimize
    incumbents = self._optimizer.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/main/smbo.py", line 284, in optimize
    self._runner.submit_trial(trial_info=trial_info)
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/runner/dask_runner.py", line 130, in submit_trial
    raise RuntimeError(
RuntimeError: Tried to execute a job, but no worker was ever available.This likely means that a worker crashed or no workers were properly configured.

The first warning about scenario.n_workers always shows up when using a Dask client, even when not specifying n_workers, but this shouldn't matter, right?

@benjamc
Contributor Author

benjamc commented May 23, 2023

Hi Florian,
it might be that the patience is too low. Currently this parameter is not accessible from the outside, but as a quick fix you can try setting it here. Maybe adding 10 s already suffices.

The warning is just informational: we use the number of workers specified in the Dask client, and scenario.n_workers is ignored.

@FlorianPommerening

Thanks. This seems to help, but it is somewhat complicated to test.

I'll open new issues for the two problems as you suggested by email.

@FlorianPommerening

For future reference: the new issues are #1016 and #1017.

@benjamc
Contributor Author

benjamc commented Jun 1, 2023

Thanks for the issues, I will close this one then. :)

@benjamc benjamc closed this as completed Jun 1, 2023
Projects
Status: Done
Development

No branches or pull requests

3 participants