SLURMCluster doesn't spawn new workers when old ones timeout #611
There's a long issue discussion on this at #122 (which hopefully includes a solution for you!)
See #122 (comment).
Ah, that is now in the docs at https://jobqueue.dask.org/en/latest/advanced-tips-and-tricks.html#how-to-handle-job-queueing-system-walltime-killing-workers
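For convenience, the pattern described on that page looks roughly like this (a sketch; the exact values are illustrative, and in older dask-jobqueue releases the keyword was `extra` rather than `worker_extra_args`). The idea is to tell workers to retire shortly before the walltime so adaptive mode can replace them cleanly:

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=1,
    memory="4GB",
    walltime="01:00:00",
    # Ask workers to shut down cleanly before SLURM kills them, with a
    # stagger so they do not all leave at the same moment.
    worker_extra_args=["--lifetime", "55m", "--lifetime-stagger", "4m"],
)
cluster.adapt(minimum=0, maximum=10)
```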
@berkgercek, hopefully the links provided by @ocaisa give you at least a workaround. Other than that, I agree that in a simple case with adaptive mode, new workers should be started if some are lost. We should look at how this is handled in the distributed repository.
Just a note: you should be able to get the scheduler logs as shown below.
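A minimal sketch of the usual ways to pull those logs, assuming the standard distributed APIs (`Client.get_scheduler_logs`, `Client.get_worker_logs`, and `Cluster.get_logs`):

```python
from dask.distributated import Client if False else None  # placeholder line removed below
```

```python
from dask.distributed import Client

client = Client(cluster)            # cluster is the SLURMCluster from above
print(client.get_scheduler_logs())  # recent scheduler log records
print(client.get_worker_logs())     # per-worker logs, keyed by address
print(cluster.get_logs())           # cluster-level view of the same logs
```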
It seems that it should be possible to implement the respawning workaround with just:

```python
current = len(cluster.plan)                         # planned worker count
cluster.scale(jobs=len(cluster.scheduler.workers))  # shrink to the workers actually alive
cluster.scale(current)                              # grow back; dead jobs get resubmitted
```

However, in the code responsible for scaling there is a mismatch between worker names when using several worker processes per job. This mismatch also seems to be responsible for misclassifying pending workers during scale-down.
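Wrapped as a reusable helper, the workaround might look like this (a sketch only; `respawn_dead_jobs` is a hypothetical name, and it assumes `cluster.plan` and `cluster.scheduler_info` are available as in current distributed releases):

```python
def respawn_dead_jobs(cluster):
    """Scale down to the workers the scheduler still sees, then back up
    to the planned size, so dask-jobqueue resubmits jobs for the workers
    that were killed at their walltime."""
    planned = len(cluster.plan)
    alive = len(cluster.scheduler_info["workers"])
    cluster.scale(alive)    # drop the dead workers from the spec
    cluster.scale(planned)  # scale back up; new jobs replace the dead ones
```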
I'm running into exactly the same issue. I've linked the code that I'm using to call `adapt`, together with the log output from a recent run. If I then continue my work and end up submitting further computations, no replacement workers are spawned.
Spawning of new workers fails with:

```python
cluster.adapt(minimum_jobs=1,
              maximum_jobs=2,
              worker_key=lambda state: state.address.split(':')[0],
              interval='10s')
```

It works, however, when using the following:

```python
cluster.adapt(minimum=1,
              maximum=8,
              worker_key=lambda state: state.address.split(':')[0],
              interval='10s')
```

In my case each job spawns 4 workers, so `minimum_jobs=1, maximum_jobs=2` should translate to `minimum=4, maximum=8` (see the sketch below). Removing `worker_key` and `interval` did not change the behaviour.
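To make the job-to-worker translation concrete, here is a sketch (the cluster settings are hypothetical; dask-jobqueue multiplies the `*_jobs` arguments by the number of worker processes per job):

```python
from dask_jobqueue import SLURMCluster

# Hypothetical configuration: each job starts 4 worker processes.
cluster = SLURMCluster(cores=4, processes=4, memory="16GB", walltime="01:00:00")

# With 4 workers per job, these two calls should be equivalent:
cluster.adapt(minimum_jobs=1, maximum_jobs=2)  # counted in jobs
cluster.adapt(minimum=4, maximum=8)            # counted in workers
```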
@matrach you are right about the mismatch in the distributed code in the specific case where we want to scale down not-yet-launched workers. However, I'm not sure how this relates to this problem, where we want to respawn dead workers?

@maawoo the link to your code is dead for me. Concerning the second part, `cluster.adapt(minimum_jobs=1, maximum_jobs=2)` will be translated into `cluster.adapt(minimum=4, maximum=8)`, which probably causes the issue. It's important to stress that adaptive mode is known to have issues with dask-jobqueue when starting several worker processes per job.

Getting back to the original problem, I just tested the following code using dask 2023.6.0:

```python
import time
import numpy as np
from dask_jobqueue import SLURMCluster as Cluster
from dask import delayed
from dask.distributed import Client, as_completed

cluster = Cluster(walltime='00:01:00', cores=1, memory='4gb', account="campus")
cluster.adapt(minimum=2, maximum=4)  # FIX
client = Client(cluster)
```

I see new workers being created as soon as older ones die, without performing any computations. I'm going to close this issue, as the more specific problems are covered by other ones.
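To watch the respawning happen with that setup, a simple polling loop works (a sketch; the cadence and iteration count are arbitrary):

```python
import time

# With the one-minute walltime above, workers should be killed and
# replaced roughly every minute while adaptive mode is active.
for _ in range(30):
    print(len(cluster.scheduler_info["workers"]), "workers alive")
    time.sleep(10)
```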
I've never mentioned such a case. The issue was (is?) that the variable name `not_yet_launched` does not match what the code computes:

```python
not_yet_launched = set(self.worker_spec) - {
    v["name"] for v in self.scheduler_info["workers"].values()
}
while len(self.worker_spec) > n and not_yet_launched:
    del self.worker_spec[not_yet_launched.pop()]
```

But `self.worker_spec` is keyed by job name, while the scheduler reports one name per worker process, so with several processes per job the set difference is wrong.
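A small illustration of that mismatch (all names are hypothetical, assuming the `<job>-<process>` naming used when one job starts several workers):

```python
# The spec is keyed by job name; the scheduler sees one name per process.
worker_spec = {"SLURMCluster-0": {}, "SLURMCluster-1": {}}
scheduler_names = {"SLURMCluster-0-0", "SLURMCluster-0-1"}  # job 0 is running

not_yet_launched = set(worker_spec) - scheduler_names
print(not_yet_launched)  # {'SLURMCluster-0', 'SLURMCluster-1'}
# Job 0 is alive, yet it is classified as "not yet launched", so a
# scale-down could delete a live job's spec instead of a pending one.
```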
What I meant was: is this a separate issue, or is it not?
This is related when using several worker processes per job.
When spawning a SLURM cluster with dask-jobqueue, the cluster spawns workers as expected when

```python
cluster.adapt(minimum_jobs=6, maximum_jobs=100)
```

is called. These workers continue to function as expected until they time out. However, when the workers die (due to walltime limits on the associated job), the dask cluster does not spawn new workers to replace them.

For me this is a case of unknown unknowns: I don't know where to look for the dask scheduler logs, which would perhaps explain why the issue is happening. The worker logs are fine and simply show that the workers were killed due to the cluster timeout.
Environment: