Dask client connects to SLURM workers, then rapidly loses them #20
Comments
Given your logs, it looks like the workers are failing to start for some reason:
It's not clear to me why. Could you provide the SLURMCluster-generated job script, i.e. the output of cluster.job_script()? For the sake of log clarity, could you try to launch only one worker and give us the log output? These lines in the stack trace look weird to me:
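A minimal sketch of that request, mirroring the SLURMCluster arguments that appear later in this thread:
from dask_jobqueue import SLURMCluster

# Arguments taken from the session posted further down (project and death_timeout).
cluster = SLURMCluster(project='ewhite', death_timeout=100)

# Print the generated SBATCH submission script so it can be pasted into the thread.
print(cluster.job_script())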
|
Clients don't kill workers themselves. Nothing will kill them by default. What is the output of the following?
import distributed
print(distributed.__version__) |
Yeah, I agree that these lines are concerning:
We've seen this very intermittently in our CI tests and I haven't yet been able to figure out how it occurs. As a test, you might try setting the following in your dask config file:
# multiprocessing-method: forkserver
multiprocessing-method: spawn
I would be quite interested to learn if that fixes the problem. |
If you had time to test out dask/distributed#1848 that would also be welcome. It's hard for me to reproduce this error locally, which is one reason why it has lingered for a while. |
Okay, I'm up and will be working on this today. To summarize the plan of action above, I will add to this thread as I have answers to the following questions: the generated job script, the log output from a single worker, the installed distributed version, whether multiprocessing-method: spawn changes anything, and the result of testing the linked PR.
|
Job script, with formatted line breaks.
|
And to be clear, running that script breaks? |
Yes, posting now. Not sure of the best way to separate the thread. It breaks from within cluster.start_workers; I have not tried submitting that SLURM script separately. This is the dask.err file for that one worker. Note that the client has not been created yet.
|
Another possibility here is that it's just taking your system a long time to start a new process. Currently we restart the worker if we don't hear an initial hello from the forked worker process within five seconds. On a normal system this should happen in well under a second. I would not be very surprised, though, if systems with fancy file systems and OSes take far longer. |
Is that timeout customizable? If not, could we make it customizable? |
No and Yes. Ideally it also just wouldn't be necessary if we can find the root of the problem. |
One thing I'm noticing is that when I ask for one worker, I never seem to get connected to the client. But if I try two, I do get a connection, and then the dask.err file is different. I'm trying to separate this behavior from the death_timeout flag. If the client isn't responsible for killing the workers, what is the function of this argument?
Here is an example where requesting two workers gets synced up, albeit briefly.
At least one worker still existed when I killed it.
Error file
|
Maybe it's something about how resources are allocated, but just to drive home the point: the exact same call as above, except with just one worker, and I never get synced up to the client.
|
I see no reason why any worker would leave after it has connected. I'm as surprised as you are. Death timeout is used by the worker itself when it can't connect to the scheduler in time. |
Okay, the next thing is to try that PR. What's the best way to do that? Fork the PR, clone it on the cluster, and replace the installed package;
will that be sufficient? |
The easiest way is to pip install from a git branch
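For example, a hedged sketch of that approach (the branch name is a placeholder for the PR's source branch; --no-deps is suggested a few comments down):
pip install --upgrade --no-deps git+https://github.com/dask/distributed.git@<pr-branch-name>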
But I'll probably have a second PR for you to try shortly. |
You could also try this PR: dask/distributed#1852. I suspect that it will solve your problem, but it may introduce some other issues for us. It would be good to verify that nothing else goes wrong on your system, though.
|
Okay, starting one at a time with dask/distributed#1848. It causes an error in the msgpack package. Let me reinstall the conda env, then try the 2nd PR. |
Try adding --no-deps to the pip install line
|
Okay, reinstalled with a clean conda env. With this PR, requesting two workers does not sync up.
There is no change in dask.err. I will try the 2nd pull request. |
The 2nd PR causes workers to stay alive, but they never connect to the client.
The workers are still alive:
|
You'll want to wait a bit on dask.err; it's not done yet. Or, if it is done, then your system has trouble starting new processes. |
This is the sort of question that you might want to ask your IT staff. You could probably avoid all of this just by adding the --no-nanny flag. |
(not that that would resolve the underlying issue, but it might make you happier personally.) |
Note that the login nodes of many HPC systems disallow processes from forking or spawning new processes. |
Yes, sorry, I edited above; I was too jumpy with dask.err. I stepped away to chat with other HPC users at the university to see if anyone had leads. I am opening a ticket with IT to see if it has to do with the forking issue. If so, is this fatal? I was quite excited about dask. I never did change the config environment; where can I find the .dask folder? In the miniconda site-packages? By the way, I can't do no-nanny.
|
I am going to revert to the stable dask release from before the PRs. |
Can you post the worker logs again? So far it looks like the nanny is starting, but never the worker. |
To be clear, this is what a healthy log looks like. You can run this on your personal computer: scheduler
worker
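A hedged sketch of producing such a healthy local log, assuming the default scheduler port:
# terminal 1: start a local scheduler (listens on port 8786 by default)
dask-scheduler
# terminal 2: start a worker pointed at it
dask-worker tcp://127.0.0.1:8786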
|
I seem to have lost the ability to even get workers added to the client. If you scroll up, you can see that the same code had processes added, which then died. Now it's just 0. I went back and performed a reinstall
to make sure I got rid of the two pull requests. I agree with you on the error logs; they are just long lists of workers not starting.
I'm submitting the IT ticket now. |
Perhaps I've worn out my usefulness, but adding no-nanny on one process is quite weird.
At the same time, if I open up a 2nd terminal and check on those workers, they exist, but no error is given.
I'm trying to get back to the place where at least workers were added. |
Okay, last update until I hear from IT.
Running 1 worker with spawn multiprocessing: the process was added, then died. At least that is farther along than above. It's not clear to me what reinstalling did.
Error log
|
If it is a problem with the login nodes not being allowed to start new processes, one thing you could do, if possible with SLURM, is to launch your scheduler from within an interactive job. With PBS, this is done using qsub -I; I don't know the equivalent with SLURM. |
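A hedged guess at the SLURM equivalent, noting that the exact options vary by site:
srun --pty bash -i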
Yup, sorry not to have mentioned it: I am already there. I am sitting on an interactive node. Admin doesn't let you hang out too much on the login nodes. |
No word from IT. I woke up convinced that perhaps the workers didn't correctly inherit the dask environment, so I tried adding a call to the conda env from within the SLURM worker script. No change: worker processes are added and then die about a minute later. Just documenting here to keep track of attempts.
Error Log
|
Update for future reference. No change. The SLURM cluster distributes tasks based on the total number of requested resources, but does not control where those tasks are placed unless specifically told. Since dask is working through an MPI connection, must those 8 workers be on the same rack? To enforce that, I edited the submission script so that the (-n) --ntasks 8 are all performed on the same node (-N 1); see the snippet below.
I feel like the error message is slightly different at the end. I'm going to change back to forking from spawning in the .dask config yaml to see if it makes a difference.
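The relevant lines, as they appear in the generated job script quoted later in this comment:
#SBATCH -N 1    # place all tasks on a single node
#SBATCH -n 8    # eight tasks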
|
You might want to point your IT staff to these pages:
http://dask.pydata.org/en/latest/setup/cli.html
http://dask.pydata.org/en/latest/setup/hpc.html
…On Fri, Mar 23, 2018 at 2:19 PM, Ben Weinstein ***@***.***> wrote:
Update for future reference. No change.
The SLURM cluster distributes tasks based on the total number of requested
resources, but does not respect where those tasks are placed unless
specifically told. Since dask is working through a MPI connection, those 8
workers must be on the same rack? To do this, I edited the submission
script to enforce that (n) --ntasks 8 are all performed on the same node -N
1.
(pangeo) ***@***.*** dask-jobqueue]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>>
>>> cluster = SLURMCluster(project='ewhite',death_timeout=100)
>>> cluster.start_workers(1)
[2]
>>>
>>> print(cluster.job_script())
#!/bin/bash
#SBATCH -N 1
#SBATCH -J dask
#SBATCH -n 8
#SBATCH -A ewhite
#SBATCH -t 00:30:00
#SBATCH -e dask.err
#SBATCH -o dask.out
source /home/b.weinstein/miniconda3/bin/activate pangeo
export LANG="en_US.utf8"
export LANGUAGE="en_US.utf8"
export LC_ALL="en_US.utf8"
/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker tcp://172.16.192.24:41461 --nthreads 4 --nprocs 8 --memory-limit 7GB --name dask-3 --death-timeout 100
>>>
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>>
>>> client
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
>>> client
<Client: scheduler='tcp://172.16.192.24:41461' processes=1 cores=4>
>>> client
<Client: scheduler='tcp://172.16.192.24:41461' processes=1 cores=4>
>>> counter=0
>>> while counter < 10:
... print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
... print(client)
... sleep(20)
... counter+=1
...
Fri, 23 March 2018 02:08:29
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:08:49
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:09:09
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:09:29
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:09:49
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:10:09
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
Fri, 23 March 2018 02:10:29
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
Fri, 23 March 2018 02:10:49
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
I feel like the error message is slightly different at the end. I'm going
to change back to forking from spawning to see if it makes a different in
the .dash config yaml.
(pangeo) ***@***.*** dask-jobqueue]$ squeue -u b.weinstein
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18426188 hpg2-comp bash b.weinst R 49:21 1 c21b-s18
(pangeo) ***@***.*** dask-jobqueue]$ cat dask.err
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:36107'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:37161'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:41005'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:37816'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:33666'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:44045'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:38295'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.193.44:43426'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-ls69yrj9', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-ryfo6rlt', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-s6945kth', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-_tstz9nw', purging
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:45134
distributed.worker - INFO - Listening to: tcp://172.16.193.44:45134
distributed.worker - INFO - nanny at: 172.16.193.44:41005
distributed.worker - INFO - bokeh at: 172.16.193.44:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-ei5uiva3
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:32896
distributed.worker - INFO - Listening to: tcp://172.16.193.44:32896
distributed.worker - INFO - nanny at: 172.16.193.44:36107
distributed.worker - INFO - bokeh at: 172.16.193.44:38506
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-kbzqab1o
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:46276
distributed.worker - INFO - Listening to: tcp://172.16.193.44:46276
distributed.worker - INFO - nanny at: 172.16.193.44:38295
distributed.worker - INFO - bokeh at: 172.16.193.44:40615
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-3514b1ne
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:39987
distributed.worker - INFO - Listening to: tcp://172.16.193.44:39987
distributed.worker - INFO - nanny at: 172.16.193.44:43426
distributed.worker - INFO - bokeh at: 172.16.193.44:42895
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-s6mk0mle
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:42682
distributed.worker - INFO - Listening to: tcp://172.16.193.44:42682
distributed.worker - INFO - nanny at: 172.16.193.44:33666
distributed.worker - INFO - bokeh at: 172.16.193.44:39981
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-rg4nsaxo
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process. Restarting
distributed.nanny - INFO - Failed to start worker process. Restarting
distributed.nanny - INFO - Failed to start worker process. Restarting
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:40680
distributed.worker - INFO - Listening to: tcp://172.16.193.44:40680
distributed.worker - INFO - nanny at: 172.16.193.44:37816
distributed.worker - INFO - bokeh at: 172.16.193.44:36999
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8yixbaxu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:44216
distributed.worker - INFO - Listening to: tcp://172.16.193.44:44216
distributed.worker - INFO - nanny at: 172.16.193.44:37161
distributed.worker - INFO - bokeh at: 172.16.193.44:39797
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-aq19j5se
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.16.193.44:42203
distributed.worker - INFO - Listening to: tcp://172.16.193.44:42203
distributed.worker - INFO - nanny at: 172.16.193.44:44045
distributed.worker - INFO - bokeh at: 172.16.193.44:46418
distributed.worker - INFO - Waiting to connect to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 7.00 GB
distributed.worker - INFO - Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-84gxedtr
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.44:44216', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-aq19j5se'},)> after timeout
Traceback (most recent call last):
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.44:44216', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-aq19j5se'}
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.44:37161'
distributed.worker - INFO - Stopping worker at tcp://172.16.193.44:44216
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
result_list.append(f.result())
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
response = yield self.instantiate()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
self.process.start()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
self._wait_until_started())
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/
nanny.py", line 471, in _wait_until_started
assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.44:42203', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-84gxedtr'}
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker", line 11, in <module>
sys.exit(go())
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 252, in go
main()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 243, in main
loop.run_sync(run)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 582, in run_sync
return future_cell[0].result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 236, in run
yield [n._start(addr) for n in nannies]
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
result_list.append(f.result())
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
response = yield self.instantiate()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
self.process.start()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
self._wait_until_started())
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.44:40680', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8yixbaxu'}
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-10, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-11, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-7, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-8, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-5, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-1, started daemon)>
Exception in thread AsyncProcess SpawnProcess-7 watch process join:
Traceback (most recent call last):
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/process.py", line 216, in _watch_process
assert exitcode is not None
AssertionError
|
Yup, waiting to hear back. One thing I've been wondering: where is the scheduler.json? The docs suggest that creating a client should create a ~/scheduler.json in the /home/$USER/scheduler.json location, but I never seem to find such a file. The reason I ask is that when I set death_timeout to a long duration, I can go through the process of creating the jupyter lab notebook and can ssh tunnel just fine, but it doesn't sync up to that client. And I can't quite seem to find any such file.
|
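For what it's worth, a scheduler file typically only appears when it is explicitly requested; a hedged sketch of that pattern, with the path as a placeholder:
dask-scheduler --scheduler-file ~/scheduler.json    # writes the file on the scheduler node
dask-worker --scheduler-file ~/scheduler.json       # each worker reads it to find the scheduler
A client could then connect with something like Client(scheduler_file='~/scheduler.json').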
I don't know. I wonder if someone with more experience with this library
like @jhamman can help out here.
…On Fri, Mar 23, 2018 at 2:26 PM, Ben Weinstein ***@***.***> wrote:
Yup, waiting to hear back.
One thing i've been wondering. Where is the scheduler.json? In the docs it
suggests that creating a client should create a ~/scheduler.json in the
/home/$USER/scheduler.json location. I never seem to find such a file. The
reason I was wondering is that when I set death_timeout to a long duration,
I can go through the process to create the juypter lab notebook. I can ssh
tunnel just fine, but it doesn't sync up to that client. But I can't quite
seem to find any file.
(pangeo) ***@***.*** .dask]$ cd ~
(pangeo) ***@***.*** ~]$ ls
dask-jobqueue dask-worker-space DeepForest logs miniconda3
|
@bw4sz -
|
Thanks @jhamman. I should say that I'm not tied to this repo; whatever I can do to get dask up and running on the UF SLURM cluster. We have enormous computing resources here, and my lab has been itching to try dask for some time now. I took on the challenge because, as you can see from the 30 comments, I enjoy debugging. Perhaps I'm mixing and matching docs a bit, going from @mrocklin's screencast on connecting to HPC to this repo, as was originally suggested. I will try your workflow and let you know. After that, let me go back to the original video. |
It's a little awkward because srun on this resource sends you to a single interactive worker, but I think this all checks out. Again, I want to stress that if not using dask_jobqueue would be helpful, just give me a push in the direction of what to use. It was not clear to me that it was not using MPI, but rather a totally different network strategy. |
Okay, glad to see that works. That is test 0; dask-jobqueue just manages the setup/teardown of those steps, so this should work. The dask-mpi approach described in some detail here should also work, but from my perspective the dask-jobqueue approach is much more flexible and user friendly. The next thing to try is to start a |
Confirm this is what you meant.
That works |
Great. Can you incrementally add command line arguments until you figure out which ones are causing your workers to fail to start? My guess is that eventually you'll find that one of these options is breaking things for you:
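Presumably the options in question are the dask-worker arguments from the generated job script earlier in the thread:
--nthreads 4 --nprocs 8 --memory-limit 7GB --name dask-3 --death-timeout 100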
(this is copied from the job script above). If running with all of these arguments works, there must be an environment difference that occurs when submitting via SLURM. |
Got it. I'm not shy; I'll give you updates. I hope this is helpful for future users. Does the TCP protocol require that workers be physically on the same node? I had added
into cluster.start_workers to ensure that I got a single rack. This isn't needed? |
As far as starting a jupyter notebook, do this however you would typically do it on your cluster. Launching from an existing Client is one way but is also somewhat orthogonal to how we are doing things here. My typical workflow is:
On most clusters, this is not going to be needed. I routinely run on many dozens of compute nodes spread over a very large HPC cluster. I do not have any control over where the jobs start. |
Agreed on workflow. I'm tweaking the memory and processors per worker to see if it makes a difference. It's a bit of a funny thing to debug because I can't tell how long to wait; several times I've been deceived into thinking I'd solved the problem, but then workers die later. For example, I've been sitting here for 5 minutes and all workers are still there, for now! Once I get it to work, the key will be to break it again to confirm. Is there a way to get each worker to announce its IP or rank, or communicate with the client? I'm going solely off of the client saying that it has workers. It would be nice to get info from them. |
The dashboard may be a logical place to track this. @mrocklin may have other ideas too. |
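A hedged sketch of one way to poll the workers directly from a connected client (reusing the client object from the session above):
import socket
# Run a function on every currently connected worker; returns a dict keyed by worker address.
print(client.run(socket.gethostname))
# Or inspect the scheduler's view of its workers.
print(client.scheduler_info()['workers'])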
Yup, it looks like it sticks for the moment! I think we have a success with 1 thread per worker. It's not clear why, but @mrocklin was probably right yesterday about the spawning/forking. I'm going to play with this now and see if I can break it again to really understand it. How will a single thread on each process affect my performance? It's hard for me to conceptualize the tradeoff between the number of nodes, the number of workers, the number of processes, and the number of threads. Once I confirm it's a single-thread issue, I will close this thread. |
Is it possible SLURM is configured on your cluster in such a way that it won't let you overload your requested resources? In other words, if you ask for 1 process on one node and try to use two, would that fail? Glad to hear you're starting to get things working. |
|
Hmm, yes, you are right, it's something along these lines. It's not as simple as single-threaded: 4 processes with 2 threads per worker seems to work, while 8 processes with 4 threads per worker fails as described above. I think that's good enough to call it closed. Maybe IT will have an explanation for the balance among nodes/workers/threads/processes; it would have been nice if admin had a more explicit message to give us a sense. I really appreciate the willingness of the developers to assist. Let me know how I can contribute in the future. I am documenting this for my lab. |
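For reference, a rough sketch of the two dask-worker configurations being compared, with the scheduler address left as a placeholder:
# working: 4 processes x 2 threads = 8 cores per job
dask-worker <scheduler-address> --nprocs 4 --nthreads 2
# failing: 8 processes x 4 threads = 32 cores per job
dask-worker <scheduler-address> --nprocs 8 --nthreads 4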
Cool. Glad you have this (mostly) sorted out. It would be great to get your continued interaction here. @guillaumeeb and I seem to be the most engaged users/developers at the moment and we're both mainly using |
Hey all, I'm just jumping on this thread, as I seem to be having a similar problem. In my specific case I only get worker death under certain settings. Is anyone able to help me diagnose this? Happy to open a new issue if that's better, too. |
Summary
When adding workers to a SLURM dask client, workers are added as resources are provisioned by the scheduler, but then they quickly disappear, presumably killed by the client because of a lack of connection (the --death-timeout flag). It's not clear whether this is intended behavior. My goal is to add workers to a dask client and connect to that client from my local laptop using jupyter lab. By the time I ssh tunnel in from my laptop, all the workers have been killed.
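A hedged sketch of the ssh tunnel step referred to here, with host names and ports as placeholders:
ssh -N -L 8888:<compute-node>:8888 <user>@<hpc-login-node>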
Expected Behavior
Following this helpful screencast, I thought that once workers were added, they would remain available for computation. Either the client is very aggressive about pruning unused workers, or something else is wrong.
Comments
I can confirm that the workers that were once there are now gone,
presumably killed by the client.
Edited dask.err file produced, with many hundreds of duplicate lines removed.