
Dask client connects to SLURM workers, then rapidly loses them #20

Closed
bw4sz opened this issue Mar 22, 2018 · 60 comments
Comments

@bw4sz

bw4sz commented Mar 22, 2018

Summary

When adding workers to a SLURM dask cluster, workers appear as resources are provisioned by the scheduler, but then they quickly disappear. Presumably they are killed by the client because of a lack of connection (the death_timeout flag). It's not clear whether this is intended behavior. My goal is to add workers to a dask cluster and connect to it from my local laptop using jupyter lab. By the time I SSH tunnel in from my laptop, all the workers have been killed.

(pangeo) [b.weinstein@c30b-s1 ~]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(5)
[3, 5, 7, 9, 11]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
... 
Wed, 21 March 2018 10:57:19
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>
Wed, 21 March 2018 10:57:49
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:58:19
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:58:49
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:59:19
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:59:49
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 11:00:20
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 11:00:50
<Client: scheduler='tcp://172.16.194.66:35459' processes=5 cores=20>
Wed, 21 March 2018 11:01:20
<Client: scheduler='tcp://172.16.194.66:35459' processes=5 cores=20>
Wed, 21 March 2018 11:01:50
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>
Wed, 21 March 2018 11:02:20
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>
Wed, 21 March 2018 11:02:50
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>

Expected Behavior

Following this helpful screencast, I thought that once workers were added, they would remain available for computation. Either the client is very aggressive about pruning unused workers, or something else is wrong.
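
If it helps with debugging, here is a variant of the loop above that lists the individual worker addresses rather than just the counts (a minimal sketch; it assumes the client object from the session above and uses client.scheduler_info() only as a convenient way to list registered workers):

from datetime import datetime
from time import sleep

while True:
    info = client.scheduler_info()        # dict that includes a 'workers' mapping
    workers = sorted(info["workers"])     # addresses of currently registered workers
    print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"),
          len(workers), "workers:", workers)
    sleep(30)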

Comments

  • Possibly relevant, but unknown: once I quit the above Python session, I am greeted with several error messages saying the TCP stream is closed. Example:
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 200, in read
    convert_stream_closed_error(self, e)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed

I can confirm that the workers that were once there are now gone.

(pangeo) [b.weinstein@c30b-s1 ~]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18361726 hpg2-comp     bash b.weinst  R      22:14      1 c30b-s1

Presumably they were killed by the client.

Edited dask.err file produced, with many hundreds of duplicate lines removed.

(pangeo) [b.weinstein@c30b-s1 ~]$ cat dask.err 
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:36916'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:38675'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:32914'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:44959'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:44970'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:36426'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:35157'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:40539'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-r9gleghg', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.194.178:39384
distributed.worker - INFO -          Listening to: tcp://172.16.194.178:39384
distributed.worker - INFO -              nanny at:       172.16.194.178:44959
distributed.worker - INFO -              bokeh at:        172.16.194.178:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-l7jp7he1
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
...
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at: tcp://172.16.194.178:46551
distributed.worker - INFO -          Listening to: tcp://172.16.194.178:46551
distributed.worker - INFO -              nanny at:       172.16.194.178:38675
distributed.worker - INFO -              bokeh at:       172.16.194.178:41314
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-yjnhwx2e
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
...
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.192.163:44012'
...
distributed.worker - INFO -       Start worker at: tcp://172.16.194.178:35125
distributed.worker - INFO -          Listening to: tcp://172.16.194.178:35125
distributed.worker - INFO -              nanny at:       172.16.194.178:32914
distributed.worker - INFO -              bokeh at:       172.16.194.178:45869
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-w9w338nb
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.178:35125
distributed.worker - INFO -       Start worker at: tcp://172.16.194.184:46207
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:46207
distributed.worker - INFO -              nanny at:       172.16.194.184:45711
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-g0jjm93z
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:46207
distributed.worker - INFO -              nanny at:       172.16.194.178:38675
distributed.worker - INFO -              bokeh at:       172.16.194.178:45033
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-u2_1pu5v
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.178:35568
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.178:37068', 'dir': '/home/b.weinstein/dask-worker-space/worker-5ipv17oi'}
distributed.dask_worker - INFO - End worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
  File "/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker", line 6, in <module>
    sys.exit(distributed.cli.dask_worker.go())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 243, in main
    loop.run_sync(run)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 582, in run_sync
    return future_cell[0].result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 236, in run
    yield [n._start(addr) for n in nannies]
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.178:37661', 'dir': '/home/b.weinstein/dask-worker-space/worker-_b3fhcyu'}
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-25, started daemon)>
...
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:43288
distributed.worker - INFO -              nanny at:       172.16.194.184:43216
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-mo6fnfvo
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:43288
distributed.nanny - INFO - Failed to start worker process.  Restarting
...
distributed.worker - INFO -       Start worker at: tcp://172.16.194.184:36157
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:36157
distributed.worker - INFO -              nanny at:       172.16.194.184:42948
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-h0jj6gi3
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:36157
distributed.nanny - INFO - Failed to start worker process.  Restarting
...
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:46197
distributed.worker - INFO -              nanny at:       172.16.194.184:38837
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-lft5bgwk
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:46197
...
distributed.nanny - WARNING - Worker process 27618 was killed by unknown signal
...
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-392, started daemon)>
...
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process 13695 was killed by unknown signal
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
...
distributed.nanny - WARNING - Worker process 13712 was killed by unknown signal
...
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-303, started daemon)>
(pangeo) [b.weinstein@c30b-s1 ~]$ 

@guillaumeeb
Member

Given your logs, it looks like the workers are failing to start for some reason:

distributed.nanny - INFO - Failed to start worker process.  Restarting

It's not clear to me why. Could you provide the job script generated by SLURMCluster:

cluster.job_script()

For the sake of log clarity, could you try to launch only one worker and give us the log output?

Those lines in the stack trace look weird to me:

  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.178:37068', 'dir': '/home/b.weinstein/dask-worker-space/worker-5ipv17oi'}

@mrocklin
Member

Clients don't kill workers themselves. Nothing will kill them by default.

What is the output of the following:

import distributed
print(distributed.__version__)

@mrocklin
Member

Yeah, I agree that these lines are concerning:

  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.178:37068', 'dir': '/home/b.weinstein/dask-worker-space/worker-5ipv17oi'}

We've seen this very intermittently in our CI tests and I haven't yet been able to figure out how it occurs. As a test, you might try setting the following in your ~/.dask/config.yaml file:

# multiprocessing-method: forkserver
multiprocessing-method: spawn

I would be quite interested to learn whether that fixes the problem.
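
For anyone unfamiliar with the setting, the difference between the two start methods can be illustrated with the standard library alone (a sketch unrelated to dask internals; it only shows how forkserver and spawn differ in how the child process is created):

# Standard-library illustration of the two process start methods (not dask code).
import multiprocessing as mp

def hello(method):
    print("child started via", method)

if __name__ == "__main__":
    for method in ("forkserver", "spawn"):
        ctx = mp.get_context(method)              # choose the start method explicitly
        p = ctx.Process(target=hello, args=(method,))
        p.start()                                 # 'spawn' launches a fresh interpreter;
        p.join()                                  # 'forkserver' forks from a pre-started server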

@mrocklin
Member

If you had time to test out dask/distributed#1848, that would also be welcome.

It's hard for me to reproduce this error locally, which is one reason why it has lingered for a while.

@bw4sz
Author

bw4sz commented Mar 22, 2018

Okay, I'm up and will be working on this today. To summarize the above plan of action, I will add to this thread as I have answers to the following questions.

  1. Give version info and job script:
import distributed
print(distributed.__version__)
cluster.job_script()
  2. Run on a single worker (cluster.start_workers(1)) and report the entire log file.
  3. Fork dask/distributed and try PR #1848.
  4. Edit the dask config file ~/.dask/config.yaml:

# multiprocessing-method: forkserver
multiprocessing-method: spawn

@bw4sz
Author

bw4sz commented Mar 22, 2018

  1. Version and job scripts.
>>> import distributed
>>> print(distributed.__version__)
1.21.3

Job script, with line breaks added for readability.

>>> cluster.job_script()

#!/bin/bash
#SBATCH -J dask
#SBATCH -n 8
#SBATCH -A ewhite
#SBATCH -t 00:30:00
#SBATCH -e dask.err
#SBATCH -o dask.out

export LANG="en_US.utf8"
export LANGUAGE="en_US.utf8"
export LC_ALL="en_US.utf8"

"/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker" tcp://172.16.194.158:37289     --nthreads 4     --nprocs 8     --memory-limit 7GB     --name dask-4     --death-timeout 200

@mrocklin
Member

And to be clear, running that script breaks?

@bw4sz
Author

bw4sz commented Mar 22, 2018

Yes, posting now. I'm not sure of the best way to separate the thread. It breaks from within cluster.start_workers; I have not tried submitting that SLURM script separately.

This is the dask.err file for that one worker. Note that Client has not been called.

(pangeo) [b.weinstein@c31b-s21 dask-jobqueue]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='ewhite',death_timeout=30)
>>> cluster.start_workers(1)
[3]
>>> exit()
(pangeo) [b.weinstein@c31b-s21 dask-jobqueue]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18382054 hpg2-comp     bash b.weinst  R      18:04      1 c31b-s21
(pangeo) [b.weinstein@c31b-s21 dask-jobqueue]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:37297'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:35788'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:35218'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:44645'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:41522'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:44451'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:37657'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.51:41545'
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 31388 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31391 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31395 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31397 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 31435 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:37297'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:35788'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:35218'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:44645'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:41522'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:44451'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:37657'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:41545'
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 31537 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31535 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31540 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31543 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31547 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31553 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31556 was killed by unknown signal
distributed.nanny - WARNING - Worker process 31549 was killed by unknown signal
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-53, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-55, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-49, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-56, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-50, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-54, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-51, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-52, started daemon)>

@mrocklin
Member

Another possibility here is that it's just taking your system a long time to start a new process. Currently we restart the worker if we don't hear an initial hello from the forked worker process within five seconds. On a normal system this should happen in well under a second. I would not be very surprised, though, if systems with fancy file systems and OSes take far longer.
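
To make that concrete, here is a rough sketch of the kind of startup handshake being described (not the actual implementation; msg_queue stands in for whatever channel the child process reports on):

# Sketch of a start-up handshake with a fixed timeout (illustration only).
import asyncio

async def wait_until_started(msg_queue: asyncio.Queue, timeout: float = 5.0) -> bool:
    try:
        msg = await asyncio.wait_for(msg_queue.get(), timeout)   # the initial hello
        return msg == "started"
    except asyncio.TimeoutError:
        # Treated as a failed start, so the parent logs
        # "Failed to start worker process.  Restarting" and tries again,
        # even if a slow node would eventually have come up.
        return False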

@jakirkham
Member

Currently we restart the worker if we don't hear an initial hello from the forked worker process within five seconds.

Is that timeout customizable? If not, could we make it customizable?

@mrocklin
Member

No, and yes. Ideally it also just wouldn't be necessary if we can find the root of the problem.

@bw4sz
Author

bw4sz commented Mar 22, 2018

One thing I'm noticing is that when I ask for one worker, I never seem to get connected to the client. But if I try two, I do get a connection, and then the dask.err file is different. I'm trying to separate this behavior from the death_timeout flag. If the client isn't responsible for killing the workers, what is the function of this argument?

        death_timeout : float
            Seconds to wait for a scheduler before closing workers
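
My reading of that docstring is that the flag only covers a worker that never manages to reach the scheduler, roughly like this hypothetical sketch (connect_to_scheduler is made up for illustration; this is not the real worker code):

# Hypothetical illustration of death_timeout (not the real worker code).
import time

def run_worker(connect_to_scheduler, death_timeout=200):
    deadline = time.monotonic() + death_timeout
    while time.monotonic() < deadline:
        if connect_to_scheduler():      # True once registered with the scheduler
            return "registered"
        time.sleep(1)                   # keep retrying until the deadline passes
    return "closing: no scheduler reached within death_timeout"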

Here is an example where requesting two workers gets synced up, albeit briefly.

>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(2)
[3, 5]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
... 
Thu, 22 March 2018 01:02:01
<Client: scheduler='tcp://172.16.194.158:33354' processes=0 cores=0>
Thu, 22 March 2018 01:02:31
<Client: scheduler='tcp://172.16.194.158:33354' processes=2 cores=8>
Thu, 22 March 2018 01:03:01
<Client: scheduler='tcp://172.16.194.158:33354' processes=2 cores=8>
Thu, 22 March 2018 01:03:31
<Client: scheduler='tcp://172.16.194.158:33354' processes=2 cores=8>
Thu, 22 March 2018 01:04:01
<Client: scheduler='tcp://172.16.194.158:33354' processes=5 cores=20>
Thu, 22 March 2018 01:04:31
<Client: scheduler='tcp://172.16.194.158:33354' processes=5 cores=20>
Thu, 22 March 2018 01:05:01
<Client: scheduler='tcp://172.16.194.158:33354' processes=7 cores=28>
Thu, 22 March 2018 01:05:31
<Client: scheduler='tcp://172.16.194.158:33354' processes=6 cores=24>
Thu, 22 March 2018 01:06:01
<Client: scheduler='tcp://172.16.194.158:33354' processes=1 cores=4>
Thu, 22 March 2018 01:06:31
<Client: scheduler='tcp://172.16.194.158:33354' processes=1 cores=4>
>>> 
[1]+  Stopped                 python

At least one worker still existed when I killed it.

(pangeo) [b.weinstein@c31b-s21 ~]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18383642 hpg2-comp     dask b.weinst  R       4:50      1 c25a-s38
          18382054 hpg2-comp     bash b.weinst  R      32:27      1 c31b-s21

Error File

(pangeo) [b.weinstein@c31b-s21 ~]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:45179'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:37653'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:44663'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:34624'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:34534'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:36418'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:46417'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.37:44895'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-l7jp7he1', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-yjnhwx2e', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-5ipv17oi', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-oqjzke10', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-bm2z_ut8', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-_b3fhcyu', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:36876
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:36876
distributed.worker - INFO -              nanny at:        172.16.193.37:34534
distributed.worker - INFO -              bokeh at:         172.16.193.37:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-6el6x5a8
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 11616 was killed by unknown signal
distributed.nanny - WARNING - Worker process 11619 was killed by unknown signal
distributed.nanny - WARNING - Worker process 11621 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 11789 was killed by unknown signal
distributed.nanny - WARNING - Worker process 11790 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-enltse1k
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 6713 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.core - WARNING - Event loop was unresponsive in Nanny for 3.61s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:42822
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:42822
distributed.worker - INFO -              nanny at:        172.16.193.37:36418
distributed.worker - INFO -              bokeh at:        172.16.193.37:33011
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-yzma4wfe
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:42822', 'dir': '/home/b.weinstein/dask-worker-space/worker-yzma4wfe'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:42822', 'dir': '/home/b.weinstein/dask-worker-space/worker-yzma4wfe'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12298 was killed by unknown signal
distributed.nanny - WARNING - Worker process 12291 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12301 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:35731
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:35731
distributed.worker - INFO -              nanny at:        172.16.193.37:44663
distributed.worker - INFO -              bokeh at:        172.16.193.37:41203
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-btuqdh1h
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:35731', 'dir': '/home/b.weinstein/dask-worker-space/worker-btuqdh1h'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:35731', 'dir': '/home/b.weinstein/dask-worker-space/worker-btuqdh1h'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-btuqdh1h', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:37075
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:37075
distributed.worker - INFO -              nanny at:        172.16.193.37:34624
distributed.worker - INFO -              bokeh at:        172.16.193.37:44202
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-y1r323fs
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:37075', 'dir': '/home/b.weinstein/dask-worker-space/worker-y1r323fs'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:37075', 'dir': '/home/b.weinstein/dask-worker-space/worker-y1r323fs'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-y1r323fs', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:39963
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:39963
distributed.worker - INFO -              nanny at:        172.16.193.37:46417
distributed.worker - INFO -              bokeh at:        172.16.193.37:42135
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-ujdmow3a
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:33235
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:33235
distributed.worker - INFO -              nanny at:        172.16.193.37:44895
distributed.worker - INFO -              bokeh at:        172.16.193.37:39107
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-wh6nho8w
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:33235', 'dir': '/home/b.weinstein/dask-worker-space/worker-wh6nho8w'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:33235', 'dir': '/home/b.weinstein/dask-worker-space/worker-wh6nho8w'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-wh6nho8w', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:38405
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:38405
distributed.worker - INFO -              nanny at:        172.16.193.37:44895
distributed.worker - INFO -              bokeh at:        172.16.193.37:39540
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-_s2movv7
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:38405', 'dir': '/home/b.weinstein/dask-worker-space/worker-_s2movv7'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:38405', 'dir': '/home/b.weinstein/dask-worker-space/worker-_s2movv7'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-_s2movv7', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:36385
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:36385
distributed.worker - INFO -              nanny at:        172.16.193.37:37653
distributed.worker - INFO -              bokeh at:        172.16.193.37:34652
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-wqfzo61b
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:36385', 'dir': '/home/b.weinstein/dask-worker-space/worker-wqfzo61b'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:36385', 'dir': '/home/b.weinstein/dask-worker-space/worker-wqfzo61b'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-wqfzo61b', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:37970
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:37970
distributed.worker - INFO -              nanny at:        172.16.193.37:37653
distributed.worker - INFO -              bokeh at:        172.16.193.37:41849
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-8hr6m2l4
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:37970', 'dir': '/home/b.weinstein/dask-worker-space/worker-8hr6m2l4'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:37970', 'dir': '/home/b.weinstein/dask-worker-space/worker-8hr6m2l4'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-8hr6m2l4', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:42689
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:42689
distributed.worker - INFO -              nanny at:        172.16.193.37:45179
distributed.worker - INFO -              bokeh at:        172.16.193.37:38875
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-prqyjzpk
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:42689', 'dir': '/home/b.weinstein/dask-worker-space/worker-prqyjzpk'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:42689', 'dir': '/home/b.weinstein/dask-worker-space/worker-prqyjzpk'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.core - WARNING - Event loop was unresponsive in Nanny for 3.57s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 7779 was killed by unknown signal
distributed.nanny - WARNING - Worker process 7776 was killed by unknown signal
distributed.nanny - WARNING - Worker process 7773 was killed by unknown signal
distributed.nanny - WARNING - Worker process 7782 was killed by unknown signal
distributed.nanny - WARNING - Worker process 7785 was killed by unknown signal
distributed.nanny - WARNING - Worker process 7788 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 7824 was killed by unknown signal
distributed.nanny - WARNING - Worker process 7827 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 7913 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12710 was killed by unknown signal
distributed.nanny - WARNING - Worker process 12712 was killed by unknown signal
distributed.nanny - WARNING - Worker process 12707 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:33064
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:33064
distributed.worker - INFO -              nanny at:        172.16.193.37:34624
distributed.worker - INFO -              bokeh at:        172.16.193.37:43279
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-dshkgia7
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:38705
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:38705
distributed.worker - INFO -              nanny at:        172.16.193.37:45179
distributed.worker - INFO -              bokeh at:        172.16.193.37:43251
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-walv8bwc
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:38705', 'dir': '/home/b.weinstein/dask-worker-space/worker-walv8bwc'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:38705', 'dir': '/home/b.weinstein/dask-worker-space/worker-walv8bwc'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-walv8bwc', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:35667
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:35667
distributed.worker - INFO -              nanny at:        172.16.193.37:45179
distributed.worker - INFO -              bokeh at:        172.16.193.37:36515
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-59y3zgvs
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.37:35667', 'dir': '/home/b.weinstein/dask-worker-space/worker-59y3zgvs'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:35667', 'dir': '/home/b.weinstein/dask-worker-space/worker-59y3zgvs'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 8197 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-59y3zgvs', purging
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:39452'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:39527'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:42840'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:38489'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:35320'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:39109'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.51:33045'
distributed.worker - INFO - Stopping worker at tcp://172.16.193.51:41034
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:41538
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:41538
distributed.worker - INFO -              nanny at:        172.16.193.37:44663
distributed.worker - INFO -              bokeh at:        172.16.193.37:39385
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-502r_ai_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.193.37:41538
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:44847
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.37:37823
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:44847
distributed.worker - INFO -          Listening to:  tcp://172.16.193.37:37823
distributed.worker - INFO -              nanny at:        172.16.193.37:44895
distributed.worker - INFO -              nanny at:        172.16.193.37:45179
distributed.worker - INFO -              bokeh at:        172.16.193.37:46176
distributed.worker - INFO -              bokeh at:        172.16.193.37:38671
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-djymwodh
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-g1h_6kpl
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.193.37:44847
distributed.worker - INFO - Stopping worker at tcp://172.16.193.37:37823
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:33064', 'dir': '/home/b.weinstein/dask-worker-space/worker-dshkgia7'}
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:39963', 'dir': '/home/b.weinstein/dask-worker-space/worker-ujdmow3a'}
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker", line 6, in <module>
    sys.exit(distributed.cli.dask_worker.go())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 243, in main
    loop.run_sync(run)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 582, in run_sync
    return future_cell[0].result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 236, in run
    yield [n._start(addr) for n in nannies]
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.37:40506', 'dir': '/home/b.weinstein/dask-worker-space/worker-oom6rvvo'}
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-195, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-185, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-5, started daemon)>
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.51:45811
distributed.worker - INFO -          Listening to:  tcp://172.16.193.51:45811
distributed.worker - INFO -              nanny at:        172.16.193.51:39527
distributed.worker - INFO -              bokeh at:         172.16.193.51:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-aw1j2b4o
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.193.51:45811
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.51:41891
distributed.worker - INFO -          Listening to:  tcp://172.16.193.51:41891
distributed.worker - INFO -              nanny at:        172.16.193.51:42840
distributed.worker - INFO -              bokeh at:         172.16.193.51:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-tbm72j6l
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.193.51:41891
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.51:33100
distributed.worker - INFO -          Listening to:  tcp://172.16.193.51:33100
distributed.worker - INFO -              nanny at:        172.16.193.51:39452
distributed.worker - INFO -              bokeh at:         172.16.193.51:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.16.194.158:33354
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-s0t0xf97
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.193.51:33100

@bw4sz
Author

bw4sz commented Mar 22, 2018

Maybe it's something about how resources are allocated, but just to drive the point home: this is the exact same call as above, except with a single worker, and the client never syncs up with any workers.

(pangeo) [b.weinstein@c31b-s21 ~]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(1)
[3]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
... 
Thu, 22 March 2018 01:11:00
<Client: scheduler='tcp://172.16.194.158:40049' processes=0 cores=0>
Thu, 22 March 2018 01:11:30
<Client: scheduler='tcp://172.16.194.158:40049' processes=0 cores=0>
Thu, 22 March 2018 01:12:00
<Client: scheduler='tcp://172.16.194.158:40049' processes=0 cores=0>
Thu, 22 March 2018 01:12:30
<Client: scheduler='tcp://172.16.194.158:40049' processes=0 cores=0>
Thu, 22 March 2018 01:13:00
<Client: scheduler='tcp://172.16.194.158:40049' processes=0 cores=0>

@mrocklin
Member

I see no reason why any worker would leave after it has connected. I'm as surprised as you are.

The death timeout is used by the worker itself: if the worker can't connect to the scheduler within that time, it shuts itself down.
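
(For anyone following along, one way to watch workers register and drop is to poll the scheduler directly from the client; a minimal sketch, reusing the `cluster` object created above, with an arbitrary 30-second interval:)

# Sketch: poll the scheduler for the currently registered worker addresses.
# Client.scheduler_info() returns a dict whose 'workers' entry maps
# worker address -> metadata, so its length is the live worker count.
from time import sleep
from dask.distributed import Client

client = Client(cluster)   # `cluster` is the SLURMCluster created earlier
for _ in range(10):
    workers = client.scheduler_info()['workers']
    print(len(workers), sorted(workers))
    sleep(30)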

@bw4sz
Author

bw4sz commented Mar 22, 2018

Okay, the next thing is to try that PR. What's the best way to do that? Fork the PR, clone it on the cluster, and replace

'/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask/distributed.py'

Will that be sufficient?

@mrocklin
Member

The easiest way is to pip install from a git branch

pip install git+https://github.com/mrocklin/distributed@nanny-assert --upgrade

But I'll probably have a second PR for you to try shortly.

@mrocklin
Member

You could also try this PR: dask/distributed#1852

I suspect that it will solve your problem, but it may introduce some others for us. It would be good to verify that nothing else breaks on your system, though.

pip install git+https://github.com/mrocklin/distributed@remove-nanny-start-timeout --upgrade

@bw4sz
Author

bw4sz commented Mar 22, 2018

Okay, trying the PRs one at a time.

dask/distributed#1848

This one causes an error in the msgpack package.

Let me reinstall the conda env, then try the second PR.

@mrocklin
Member

mrocklin commented Mar 22, 2018 via email

@bw4sz
Author

bw4sz commented Mar 22, 2018

Okay, reinstalled with a clean conda env. With this PR, requesting two workers still never syncs up with the client.

[b.weinstein@c30a-s41 ~]$ pip install git+https://github.com/mrocklin/distributed@nanny-assert --upgrade --no-deps
Collecting git+https://github.com/mrocklin/distributed@nanny-assert
  Cloning https://github.com/mrocklin/distributed (to nanny-assert) to /scratch/local/18388940/pip-7ml412ne-build
Installing collected packages: distributed
  Found existing installation: distributed 1.18.1+285.gfb7e4c56
    Uninstalling distributed-1.18.1+285.gfb7e4c56:
      Successfully uninstalled distributed-1.18.1+285.gfb7e4c56
  Running setup.py install for distributed ... done
Successfully installed distributed-1.18.1+285.gfb7e4c56
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(2)
[3, 5]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
... 
Thu, 22 March 2018 01:54:43
<Client: scheduler='tcp://172.16.194.121:34635' processes=0 cores=0>
Thu, 22 March 2018 01:55:13
<Client: scheduler='tcp://172.16.194.121:34635' processes=0 cores=0>
Thu, 22 March 2018 01:55:43
<Client: scheduler='tcp://172.16.194.121:34635' processes=0 cores=0>
Thu, 22 March 2018 01:56:13
<Client: scheduler='tcp://172.16.194.121:34635' processes=0 cores=0>

No change in dask.err.

I will try the second pull request.

@bw4sz
Author

bw4sz commented Mar 22, 2018

dask/distributed#1852

causes the workers to stay alive, but they never connect to the client.

(pangeo) [b.weinstein@c30a-s41 dask-jobqueue]$ pip install git+https://github.com/mrocklin/distributed@remove-nanny-start-timeout --upgrade --no-deps
Collecting git+https://github.com/mrocklin/distributed@remove-nanny-start-timeout
  Cloning https://github.com/mrocklin/distributed (to remove-nanny-start-timeout) to /scratch/local/18388940/pip-78z73spx-build
Installing collected packages: distributed
  Found existing installation: distributed 1.21.4
    Uninstalling distributed-1.21.4:
      Successfully uninstalled distributed-1.21.4
  Running setup.py install for distributed ... done
Successfully installed distributed-1.18.1+285.g312a83e6
You are using pip version 9.0.1, however version 9.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

(pangeo) [b.weinstein@c30a-s41 dask-jobqueue]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(2)
[3, 5]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
... 
Thu, 22 March 2018 02:01:02
<Client: scheduler='tcp://172.16.194.121:43735' processes=0 cores=0>
Thu, 22 March 2018 02:01:32
<Client: scheduler='tcp://172.16.194.121:43735' processes=0 cores=0>
Thu, 22 March 2018 02:02:02
<Client: scheduler='tcp://172.16.194.121:43735' processes=0 cores=0>
Thu, 22 March 2018 02:02:32
<Client: scheduler='tcp://172.16.194.121:43735' processes=0 cores=0>
^Z
[6]+  Stopped                 python

The workers are still alive:

(pangeo) [b.weinstein@c30a-s41 dask-jobqueue]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18389915 hpg2-comp     dask b.weinst  R       1:39      2 c34a-s[22,28]
          18389916 hpg2-comp     dask b.weinst  R       1:39      2 c34a-s35,c34b-s27
          18388940 hpg2-comp     bash b.weinst  R      14:59      1 c30a-s41

[b.weinstein@gator4 dask-jobqueue]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:46392'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:36015'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:41564'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:38059'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:46388'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:46744'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:38011'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.195.95:39330'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:46392'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:36015'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:41564'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:38059'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:46388'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:46744'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:38011'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.195.95:39330'
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-8, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-6, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-4, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-2, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-7, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-5, started daemon)>

@mrocklin
Member

You'll want to wait a bit on dask.err; it's not done yet.

Or, if it is, then your system has trouble starting new processes.
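
(One quick way to test that hypothesis on a compute node, independent of dask, is a bare multiprocessing check; a minimal sketch, using the 'forkserver' start method since the ForkServerProcess entries in the logs above suggest that is what the nanny is using here:)

# Sketch: verify this node can start subprocesses with the 'forkserver'
# start method. If this hangs or raises, dask's nanny will struggle too.
import multiprocessing as mp

def child(queue):
    queue.put("started ok")

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")
    queue = ctx.Queue()
    proc = ctx.Process(target=child, args=(queue,))
    proc.start()
    print(queue.get(timeout=30))
    proc.join()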

@mrocklin
Member

This is the sort of question that you might want to ask of your IT staff.

You could probably avoid all of this just by adding the --no-nanny flag to the dask-worker command.

@mrocklin
Member

(not that that would resolve the underlying issue, but it might make you happier personally.)

@mrocklin
Member

Note that the login nodes of many HPC systems disallow forking or spawning new processes.

@bw4sz
Author

bw4sz commented Mar 22, 2018

Yes, sorry, I edited the comment above; I was too jumpy with dask.err. I stepped away to chat with other HPC users at the university to see if anyone had leads, and I am opening a ticket with IT to see whether this comes down to the forking issue. If so, is that fatal? I was quite excited about dask. I never did change the config environment; where can I find the .dask folder? In the miniconda site-packages?

By the way, I can't use --no-nanny:

(pangeo) [b.weinstein@c27a-s5 dask-jobqueue]$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200,extra='--no-nanny')
>>> cluster.start_workers(2)
[3, 5]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
...     
... 
Thu, 22 March 2018 02:44:25
<Client: scheduler='tcp://172.16.193.137:46024' processes=0 cores=0>
Thu, 22 March 2018 02:44:55
<Client: scheduler='tcp://172.16.193.137:46024' processes=0 cores=0>
Thu, 22 March 2018 02:45:25
<Client: scheduler='tcp://172.16.193.137:46024' processes=0 cores=0>
(pangeo) [b.weinstein@c27a-s5 dask-jobqueue]$ cat dask.err
distributed.dask_worker - ERROR - Failed to launch worker.  You cannot use the --no-nanny argument when nprocs > 1.
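
(The error above means the generated job script starts several worker processes per dask-worker invocation, and --no-nanny is only valid with one. A hedged sketch of the combination follows; the processes keyword is an assumption about this dask-jobqueue version's SLURMCluster signature, so check it before relying on it:)

# Sketch only: --no-nanny requires a single worker process per dask-worker.
# `processes=1` is an assumption about the SLURMCluster signature in this
# dask-jobqueue version; verify it against the actual docs/signature.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='ewhite', death_timeout=200,
                       processes=1, extra='--no-nanny')
cluster.start_workers(2)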

@bw4sz
Author

bw4sz commented Mar 22, 2018

I am going to revert to the stable dask release from before the PRs.

@mrocklin
Member

Can you post the worker logs again? So far it looks like the nanny is starting, but never the worker.

@mrocklin
Member

To be clear, this is what a healthy log looks like. You can run this on your personal computer:

scheduler

mrocklin@workstation:~$ dask-scheduler 
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://192.168.50.230:8786
distributed.scheduler - INFO -       bokeh at:                     :8787
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-m5i21qmg
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://127.0.0.1:34929
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:34929
distributed.scheduler - INFO - Register tcp://127.0.0.1:37295
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:37295

worker

mrocklin@workstation:~$ dask-worker localhost:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:36895'
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37295
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37295
distributed.worker - INFO -              nanny at:            127.0.0.1:36895
distributed.worker - INFO -              bokeh at:            127.0.0.1:45321
distributed.worker - INFO - Waiting to connect to:       tcp://localhost:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   16.69 GB
distributed.worker - INFO -       Local Directory: /home/mrocklin/dask-worker-space/worker-2p9b1k4u
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:       tcp://localhost:8786
distributed.worker - INFO - -------------------------------------------------
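
(Another way to reproduce the same healthy startup entirely from Python, for example inside an interactive SLURM job, is a LocalCluster; a minimal sketch:)

# Sketch: start a scheduler plus nanny-managed workers in one process tree.
# If this works on a laptop but fails on a compute node, the problem is the
# node's restrictions on spawning processes rather than dask-jobqueue itself.
from dask.distributed import LocalCluster, Client

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)
    print(client)  # should report 2 processes / 4 cores
    print(client.submit(sum, [1, 2, 3]).result())  # quick sanity check
    client.close()
    cluster.close()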

@bw4sz
Author

bw4sz commented Mar 22, 2018

I seem to have lost the ability to even get workers added to the client. If you scroll up, you can see that the same code previously had processes added, which then died; now it's just 0. I went back and ran

conda install dask distributed -n pangeo

to make sure I got rid of the two pull requests.

I agree with you about the error logs; they are just long lists of workers failing to start.

(pangeo) [b.weinstein@c27a-s5 dask-jobqueue]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:45058'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:34477'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:34700'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:33955'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:34399'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:36471'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:40401'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:41170'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:39640'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:33304'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:40812'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:45906'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:33588'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:42652'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:44831'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.177:33597'
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 11851 was killed by unknown signal
distributed.nanny - WARNING - Worker process 11852 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 11953 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12204 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12617 was killed by unknown signal
distributed.nanny - WARNING - Worker process 12623 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12721 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12826 was killed by unknown signal
distributed.nanny - WARNING - Worker process 12822 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 12900 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 13023 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13056 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 13131 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13126 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13121 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13127 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13120 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13136 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13132 was killed by unknown signal
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 13139 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 13210 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13217 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13222 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:45058'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:34700'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:33955'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:36471'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:34477'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:41170'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:33304'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:34399'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:45906'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:40401'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:39640'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:42652'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:40812'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:33588'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:44831'
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.177:33597'
distributed.nanny - WARNING - Worker process 13263 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13264 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13269 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13278 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13294 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13296 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13285 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13261 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13273 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13277 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13286 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13280 was killed by unknown signal
distributed.nanny - WARNING - Worker process 13290 was killed by unknown signal
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-154, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-158, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-153, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-160, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-159, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-155, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-156, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-157, started daemon)>
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-154, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-160, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-156, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-159, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-153, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-155, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-158, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-157, started daemon)>

I'm submitting an IT ticket now.

@bw4sz

bw4sz commented Mar 22, 2018

Perhaps I've worn out my usefulness, but adding --no-nanny with one process gives quite weird behavior.

>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200,processes=1,extra='--no-nanny')
>>> cluster.start_workers(2)
[3, 5]
>>> 
>>> print(cluster.job_script())
#!/bin/bash

#SBATCH -J dask
#SBATCH -n 1
#SBATCH -A ewhite
#SBATCH -t 00:30:00
#SBATCH -e dask.err
#SBATCH -o dask.out

export LANG="en_US.utf8"
export LANGUAGE="en_US.utf8"
export LC_ALL="en_US.utf8"

"/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker" tcp://172.16.193.137:36654     --nthreads 4     --nprocs 1     --memory-limit 7GB     --name dask-6     --death-timeout 200      --no-nanny

>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(20)
... 
Thu, 22 March 2018 03:31:57
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:32:17
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:32:37
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:32:57
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:33:17
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:33:37
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:33:57
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:34:17
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>
Thu, 22 March 2018 03:34:37
<Client: scheduler='tcp://172.16.193.137:36654' processes=0 cores=0>

At the same time, if I open up a 2nd terminal and check on those workers, they exist, but no error is given.

[b.weinstein@gator4 dask-jobqueue]$ cat dask.err
[b.weinstein@gator4 dask-jobqueue]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18392368 hpg2-comp     dask b.weinst  R       1:58      1 c35b-s18
          18392369 hpg2-comp     dask b.weinst  R       1:58      1 c23b-s7
          18391103 hpg2-comp     bash b.weinst  R      50:24      1 c27a-s5

I'm trying to get back to the point where at least the workers were added.
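
For what it's worth, the client repr above only counts workers that have actually registered with the scheduler, while squeue only shows that the SLURM jobs are running, so the two can disagree. A minimal sketch (my own check, assuming the same client object as above) of asking the scheduler directly what it sees:

# Minimal sketch: ask the scheduler which workers it currently knows about.
# Assumes the `client` object from the session above is still alive.
info = client.scheduler_info()
workers = info.get("workers", {})
print("Scheduler sees %d worker(s)" % len(workers))
for address, state in workers.items():
    # different distributed versions report the core count under different keys
    print(address, state.get("ncores", state.get("nthreads")))

If the SLURM jobs show as running but this comes back empty, the workers never registered with the scheduler (or registered and were immediately dropped), which matches the processes=0 cores=0 output above.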

@bw4sz

bw4sz commented Mar 22, 2018

Okay, last update until I hear from IT.

  • Reinstalled conda environment.
  • Recloned dask-jobqueue repo.
  • Edited the file to remove the --queue flag (see PR).

Running 1 worker: it spawned the multiprocessing workers and added the processes, which then died. At least that's farther along than above. Not clear to me what reinstalling did.
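
As an aside, a quick smoke test right after the workers register would tell whether they are actually usable before they disappear again. A minimal sketch (not something I ran in the session below, and assuming the same client object):

import socket
from dask.distributed import wait

# Minimal sketch: run a trivial task on the cluster to confirm that the
# registered workers actually execute work before they drop off.
futures = [client.submit(socket.gethostname, pure=False) for _ in range(8)]
wait(futures, timeout=60)
print(sorted({f.result() for f in futures}))

If this hangs or errors around the same time the client falls back to processes=0 cores=0, the workers were never really usable in the first place.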

(pangeo) [b.weinstein@c27a-s5 dask-jobqueue]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(1)
[2]
>>> 
>>> print(cluster.job_script())
#!/bin/bash

#SBATCH -J dask
#SBATCH -n 8
#SBATCH -A ewhite
#SBATCH -t 00:30:00
#SBATCH -e dask.err
#SBATCH -o dask.out

export LANG="en_US.utf8"
export LANGUAGE="en_US.utf8"
export LC_ALL="en_US.utf8"

/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker tcp://172.16.193.137:35695     --nthreads 4     --nprocs 8     --memory-limit 7GB     --name dask-3     --death-timeout 200      

>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(20)
... 
Thu, 22 March 2018 03:44:07
<Client: scheduler='tcp://172.16.193.137:35695' processes=1 cores=4>
Thu, 22 March 2018 03:44:27
<Client: scheduler='tcp://172.16.193.137:35695' processes=4 cores=16>
Thu, 22 March 2018 03:44:47
<Client: scheduler='tcp://172.16.193.137:35695' processes=4 cores=16>
Thu, 22 March 2018 03:45:07
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:45:27
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:45:47
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:46:07
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:46:27
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:46:47
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:47:07
<Client: scheduler='tcp://172.16.193.137:35695' processes=8 cores=32>
Thu, 22 March 2018 03:47:27
<Client: scheduler='tcp://172.16.193.137:35695' processes=0 cores=0>
Thu, 22 March 2018 03:47:47
<Client: scheduler='tcp://172.16.193.137:35695' processes=0 cores=0>
Thu, 22 March 2018 03:48:07
<Client: scheduler='tcp://172.16.193.137:35695' processes=0 cores=0>

Error log

(pangeo) [b.weinstein@c27a-s5 dask-jobqueue]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:40711'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:35203'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:40524'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:35274'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:34699'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:44285'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:38354'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.192.170:39839'
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:37927
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:37927
distributed.worker - INFO -              nanny at:       172.16.192.170:40524
distributed.worker - INFO -              bokeh at:        172.16.192.170:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-xz11ng15
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 26428 was killed by signal 15
distributed.nanny - WARNING - Worker process 26431 was killed by signal 15
distributed.nanny - WARNING - Worker process 26422 was killed by signal 15
distributed.nanny - WARNING - Worker process 26426 was killed by signal 15
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 26434 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:41719
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:41719
distributed.worker - INFO -              nanny at:       172.16.192.170:40711
distributed.worker - INFO -              bokeh at:       172.16.192.170:37913
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-k4tzpa2y
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:36966
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:36966
distributed.worker - INFO -              nanny at:       172.16.192.170:35203
distributed.worker - INFO -              bokeh at:       172.16.192.170:44147
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-bn9ksu5l
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 26487 was killed by signal 15
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:46148
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:46148
distributed.worker - INFO -              nanny at:       172.16.192.170:34699
distributed.worker - INFO -              bokeh at:       172.16.192.170:33261
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-qx_ahop_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:39590
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:39590
distributed.worker - INFO -              nanny at:       172.16.192.170:39839
distributed.worker - INFO -              bokeh at:       172.16.192.170:39765
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-iwmd6hgn
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:33791
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:37415
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:33791
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:37415
distributed.worker - INFO -              nanny at:       172.16.192.170:38354
distributed.worker - INFO -              bokeh at:       172.16.192.170:46212
distributed.worker - INFO -              nanny at:       172.16.192.170:35274
distributed.worker - INFO -              bokeh at:       172.16.192.170:44693
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:35930
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:35930
distributed.worker - INFO -              nanny at:       172.16.192.170:44285
distributed.worker - INFO -              bokeh at:       172.16.192.170:41969
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-b49rya10
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-34cn_po_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8rn4c74w
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:37415', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-34cn_po_'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:37415', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-34cn_po_'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:35930', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8rn4c74w'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:35930', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8rn4c74w'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:39590', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-iwmd6hgn'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:39590', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-iwmd6hgn'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:33791', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-b49rya10'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:33791', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-b49rya10'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8rn4c74w', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-b49rya10', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-34cn_po_', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:46766
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:46766
distributed.worker - INFO -              nanny at:       172.16.192.170:44285
distributed.worker - INFO -              bokeh at:       172.16.192.170:44148
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-cc3w01ca
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-iwmd6hgn', purging
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:46766', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-cc3w01ca'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:46766', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-cc3w01ca'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.core - WARNING - Event loop was unresponsive in Nanny for 3.18s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive in Nanny for 3.17s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive in Nanny for 3.17s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-cc3w01ca', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:41382
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:41382
distributed.worker - INFO -              nanny at:       172.16.192.170:35274
distributed.worker - INFO -              bokeh at:       172.16.192.170:38306
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-e7zdfvr6
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:45723
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:45723
distributed.worker - INFO -              nanny at:       172.16.192.170:38354
distributed.worker - INFO -              bokeh at:       172.16.192.170:37276
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-43a6xq6r
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:41382', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-e7zdfvr6'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:41382', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-e7zdfvr6'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:45723', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-43a6xq6r'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:45723', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-43a6xq6r'}
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:45585
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:45585
distributed.worker - INFO -              nanny at:       172.16.192.170:39839
distributed.worker - INFO -              bokeh at:       172.16.192.170:43366
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-x_9zk5r4
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:41819
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:41819
distributed.worker - INFO -              nanny at:       172.16.192.170:44285
distributed.worker - INFO -              bokeh at:       172.16.192.170:36603
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-kprzaus4
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:45585', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-x_9zk5r4'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:45585', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-x_9zk5r4'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:41819', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-kprzaus4'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:41819', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-kprzaus4'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-43a6xq6r', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-x_9zk5r4', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-e7zdfvr6', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:37132
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:37132
distributed.worker - INFO -              nanny at:       172.16.192.170:39839
distributed.worker - INFO -              bokeh at:       172.16.192.170:41064
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-tosc0yvi
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-kprzaus4', purging
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:37132', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-tosc0yvi'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:37132', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-tosc0yvi'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-tosc0yvi', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:38333
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:38333
distributed.worker - INFO -              nanny at:       172.16.192.170:44285
distributed.worker - INFO -              bokeh at:       172.16.192.170:36921
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-jz9ibwfd
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:38838
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:38838
distributed.worker - INFO -              nanny at:       172.16.192.170:38354
distributed.worker - INFO -              bokeh at:       172.16.192.170:40845
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-4pq1dj09
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:33455
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:33455
distributed.worker - INFO -              nanny at:       172.16.192.170:39839
distributed.worker - INFO -              bokeh at:       172.16.192.170:34880
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-me6n9a_q
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:38333', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-jz9ibwfd'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:38333', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-jz9ibwfd'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:38838', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-4pq1dj09'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:38838', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-4pq1dj09'}
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:33455', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-me6n9a_q'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:33455', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-me6n9a_q'}
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:35519
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:35519
distributed.worker - INFO -              nanny at:       172.16.192.170:35274
distributed.worker - INFO -              bokeh at:       172.16.192.170:32984
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-106tk858
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:35519', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-106tk858'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:35519', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-106tk858'}
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-jz9ibwfd', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-106tk858', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-4pq1dj09', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-me6n9a_q', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:40315
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:40315
distributed.worker - INFO -              nanny at:       172.16.192.170:38354
distributed.worker - INFO -              bokeh at:       172.16.192.170:46120
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-_cn133ys
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:32931
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:32931
distributed.worker - INFO -              nanny at:       172.16.192.170:44285
distributed.worker - INFO -              bokeh at:       172.16.192.170:37593
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-uxj42_ka
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:38071
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:38071
distributed.worker - INFO -              nanny at:       172.16.192.170:39839
distributed.worker - INFO -              bokeh at:       172.16.192.170:33648
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-qap7v1f5
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:40315', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-_cn133ys'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:40315', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-_cn133ys'}
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:38071', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-qap7v1f5'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:38071', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-qap7v1f5'}
distributed.worker - INFO -       Start worker at: tcp://172.16.192.170:37935
distributed.worker - INFO -          Listening to: tcp://172.16.192.170:37935
distributed.worker - INFO -              nanny at:       172.16.192.170:35274
distributed.worker - INFO -              bokeh at:       172.16.192.170:36390
distributed.worker - INFO - Waiting to connect to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-q941lm6n
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://172.16.193.137:35695
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.192.170:37935', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-q941lm6n'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:37935', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-q941lm6n'}
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.192.170:35274'
distributed.worker - INFO - Stopping worker at tcp://172.16.192.170:37935
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.192.170:38354'
distributed.worker - INFO - Stopping worker at tcp://172.16.192.170:40315
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.192.170:39839'
distributed.worker - INFO - Stopping worker at tcp://172.16.192.170:38071
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:32931', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-uxj42_ka'}
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker", line 11, in <module>
    sys.exit(go())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 243, in main
    loop.run_sync(run)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 582, in run_sync
    return future_cell[0].result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 236, in run
    yield [n._start(addr) for n in nannies]
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.192.170:41719', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-k4tzpa2y'}
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-50, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-10, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-9, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-19, started daemon)>

@guillaumeeb
Member

If the problem is that the login nodes are not allowed to start new processes, one thing you could try, if SLURM allows it, is to launch your scheduler from within an interactive job.

With PBS this is done using qsub -I; I don't know the SLURM equivalent.

@bw4sz
Author

bw4sz commented Mar 22, 2018

Yup, sorry for not mentioning that. I am already there: I am sitting on an interactive node. The admins don't let you hang out too long on the login nodes.

@bw4sz
Author

bw4sz commented Mar 23, 2018

No word from IT. I woke up convinced that perhaps the workers didn't correctly inherit the dask environment, so I tried activating the conda environment from within the SLURM job script. No change: worker processes are added and then die about a minute later. Just documenting here to keep track of attempts.

(pangeo) [b.weinstein@c25a-s5 dask-jobqueue]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(1)
[2]
>>> 
>>> print(cluster.job_script())
#!/bin/bash

#SBATCH -J dask
#SBATCH -n 8
#SBATCH -A ewhite
#SBATCH -t 00:30:00
#SBATCH -e dask.err
#SBATCH -o dask.out

source /home/b.weinstein/miniconda3/bin/activate pangeo

export LANG="en_US.utf8"
export LANGUAGE="en_US.utf8"
export LC_ALL="en_US.utf8"

/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker tcp://172.16.193.9:33887     --nthreads 4     --nprocs 8     --memory-limit 7GB     --name dask-3     --death-timeout 200      

>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(20)
... 
Fri, 23 March 2018 12:13:19
<Client: scheduler='tcp://172.16.193.9:33887' processes=0 cores=0>
Fri, 23 March 2018 12:13:39
<Client: scheduler='tcp://172.16.193.9:33887' processes=5 cores=20>
Fri, 23 March 2018 12:13:59
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:14:19
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:14:39
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:14:59
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:15:19
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:15:39
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:15:59
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:16:19
<Client: scheduler='tcp://172.16.193.9:33887' processes=8 cores=32>
Fri, 23 March 2018 12:16:39
<Client: scheduler='tcp://172.16.193.9:33887' processes=0 cores=0>
Fri, 23 March 2018 12:16:59
<Client: scheduler='tcp://172.16.193.9:33887' processes=0 cores=0>
Fri, 23 March 2018 12:17:19
<Client: scheduler='tcp://172.16.193.9:33887' processes=0 cores=0>

Error Log

[b.weinstein@gator4 dask-jobqueue]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:44038'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:41491'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:36755'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:34360'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:42966'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:40867'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:44366'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.70:36109'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-qx_ahop_', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-bn9ksu5l', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-uxj42_ka', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-k4tzpa2y', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-xz11ng15', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.194.70:40885
distributed.worker - INFO -          Listening to:  tcp://172.16.194.70:40885
distributed.worker - INFO -              nanny at:        172.16.194.70:40867
distributed.worker - INFO -              bokeh at:         172.16.194.70:8789
distributed.worker - INFO - Waiting to connect to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-fjl2m3tj
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:  tcp://172.16.194.70:38520
distributed.worker - INFO -          Listening to:  tcp://172.16.194.70:38520
distributed.worker - INFO -              nanny at:        172.16.194.70:34360
distributed.worker - INFO -              bokeh at:        172.16.194.70:46607
distributed.worker - INFO - Waiting to connect to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-fxsafo3m
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - WARNING - Worker process 21603 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at:  tcp://172.16.194.70:41634
distributed.worker - INFO -          Listening to:  tcp://172.16.194.70:41634
distributed.worker - INFO -              nanny at:        172.16.194.70:44038
distributed.worker - INFO -              bokeh at:        172.16.194.70:39651
distributed.worker - INFO - Waiting to connect to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-_tomj8s_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.194.70:39743
distributed.worker - INFO -          Listening to:  tcp://172.16.194.70:39743
distributed.worker - INFO -              nanny at:        172.16.194.70:36755
distributed.worker - INFO -              bokeh at:        172.16.194.70:35000
distributed.worker - INFO - Waiting to connect to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-7kdujkht
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://172.16.193.9:33887
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.194.70:39743', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-7kdujkht'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.70:39743', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-7kdujkht'}
...
...
etc. for each worker.

@bw4sz
Author

bw4sz commented Mar 23, 2018

Update for future reference. No change.

SLURM distributes tasks based on the total number of requested resources, but does not control where those tasks are placed unless specifically told. Since dask is working through an MPI connection, those 8 workers must all be on the same rack? To test this, I edited the submission script to enforce that all of the -n 8 (--ntasks) are placed on the same node with -N 1.
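
A minimal sketch of making the same change from Python instead of editing the generated script by hand, assuming a dask-jobqueue version that exposes the job_extra keyword (not verified for the version used in this thread):

from dask_jobqueue import SLURMCluster

# Assumption: job_extra adds raw #SBATCH directives to each generated job script.
cluster = SLURMCluster(project='ewhite',
                       death_timeout=200,
                       job_extra=['-N 1'])
print(cluster.job_script())   # check that "#SBATCH -N 1" now appears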

(pangeo) [b.weinstein@c21b-s18 dask-jobqueue]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=100)
>>> cluster.start_workers(1)
[2]
>>> 
>>> print(cluster.job_script())
#!/bin/bash

#SBATCH -N 1
#SBATCH -J dask
#SBATCH -n 8
#SBATCH -A ewhite
#SBATCH -t 00:30:00
#SBATCH -e dask.err
#SBATCH -o dask.out

source /home/b.weinstein/miniconda3/bin/activate pangeo

export LANG="en_US.utf8"
export LANGUAGE="en_US.utf8"
export LC_ALL="en_US.utf8"

/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker tcp://172.16.192.24:41461     --nthreads 4     --nprocs 8     --memory-limit 7GB     --name dask-3     --death-timeout 100      

>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> client
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
>>> client
<Client: scheduler='tcp://172.16.192.24:41461' processes=1 cores=4>
>>> client
<Client: scheduler='tcp://172.16.192.24:41461' processes=1 cores=4>
>>> counter=0
>>> while counter < 10:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(20)
...     counter+=1
... 
Fri, 23 March 2018 02:08:29
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:08:49
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:09:09
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:09:29
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:09:49
<Client: scheduler='tcp://172.16.192.24:41461' processes=8 cores=32>
Fri, 23 March 2018 02:10:09
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
Fri, 23 March 2018 02:10:29
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>
Fri, 23 March 2018 02:10:49
<Client: scheduler='tcp://172.16.192.24:41461' processes=0 cores=0>

I feel like the error message is slightly different at the end. I'm going to switch back from spawning to forking in the .dask config yaml to see if it makes a difference.

(pangeo) [b.weinstein@c21b-s18 dask-jobqueue]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18426188 hpg2-comp     bash b.weinst  R      49:21      1 c21b-s18
(pangeo) [b.weinstein@c21b-s18 dask-jobqueue]$ cat dask.err
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:36107'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:37161'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:41005'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:37816'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:33666'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:44045'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:38295'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.193.44:43426'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-ls69yrj9', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-ryfo6rlt', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-s6945kth', purging
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-_tstz9nw', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:45134
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:45134
distributed.worker - INFO -              nanny at:        172.16.193.44:41005
distributed.worker - INFO -              bokeh at:         172.16.193.44:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-ei5uiva3
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:32896
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:32896
distributed.worker - INFO -              nanny at:        172.16.193.44:36107
distributed.worker - INFO -              bokeh at:        172.16.193.44:38506
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-kbzqab1o
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:46276
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:46276
distributed.worker - INFO -              nanny at:        172.16.193.44:38295
distributed.worker - INFO -              bokeh at:        172.16.193.44:40615
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-3514b1ne
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:39987
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:39987
distributed.worker - INFO -              nanny at:        172.16.193.44:43426
distributed.worker - INFO -              bokeh at:        172.16.193.44:42895
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-s6mk0mle
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:42682
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:42682
distributed.worker - INFO -              nanny at:        172.16.193.44:33666
distributed.worker - INFO -              bokeh at:        172.16.193.44:39981
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-rg4nsaxo
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:40680
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:40680
distributed.worker - INFO -              nanny at:        172.16.193.44:37816
distributed.worker - INFO -              bokeh at:        172.16.193.44:36999
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8yixbaxu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:44216
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:44216
distributed.worker - INFO -              nanny at:        172.16.193.44:37161
distributed.worker - INFO -              bokeh at:        172.16.193.44:39797
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-aq19j5se
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:  tcp://172.16.193.44:42203
distributed.worker - INFO -          Listening to:  tcp://172.16.193.44:42203
distributed.worker - INFO -              nanny at:        172.16.193.44:44045
distributed.worker - INFO -              bokeh at:        172.16.193.44:46418
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-jobqueue/dask-worker-space/worker-84gxedtr
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.192.24:33114
distributed.worker - INFO - -------------------------------------------------
tornado.application - ERROR - Exception in Future <Future finished exception=AssertionError({'address': 'tcp://172.16.193.44:44216', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-aq19j5se'},)> after timeout
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 936, in error_callback
    future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.44:44216', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-aq19j5se'}
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.193.44:37161'
distributed.worker - INFO - Stopping worker at tcp://172.16.193.44:44216
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.44:42203', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-84gxedtr'}
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/bin/dask-worker", line 11, in <module>
    sys.exit(go())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 243, in main
    loop.run_sync(run)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 582, in run_sync
    return future_cell[0].result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 236, in run
    yield [n._start(addr) for n in nannies]
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.193.44:40680', 'dir': '/home/b.weinstein/dask-jobqueue/dask-worker-space/worker-8yixbaxu'}
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-10, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-11, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-7, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-8, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-5, started daemon)>
distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-1, started daemon)>
Exception in thread AsyncProcess SpawnProcess-7 watch process join:
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/process.py", line 216, in _watch_process
    assert exitcode is not None
AssertionError

@mrocklin
Member

mrocklin commented Mar 23, 2018 via email

@bw4sz
Author

bw4sz commented Mar 23, 2018

Yup, waiting to hear back.

One thing I've been wondering: where is the scheduler.json? The docs suggest that creating a client should create a scheduler.json at /home/$USER/scheduler.json, but I never find such a file. The reason I ask is that when I set death_timeout to a long duration, I can go through the process of creating the jupyter lab notebook and ssh tunnel in just fine, but it doesn't sync up to that client. And I can't seem to find any such file.

(pangeo) [b.weinstein@c21b-s18 .dask]$ cd ~
(pangeo) [b.weinstein@c21b-s18 ~]$ ls
dask-jobqueue  dask-worker-space  DeepForest  logs  miniconda3
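
For reference, the scheduler-file workflow those docs describe looks roughly like the sketch below; the path is only an example, and as noted in the replies that follow, dask-jobqueue does not write such a file, which is why it never shows up:

# Sketch of the scheduler-file workflow (not used by dask-jobqueue). The file only exists
# when the scheduler and workers are started with, e.g.:
#   dask-scheduler --scheduler-file /home/b.weinstein/scheduler.json
#   dask-worker    --scheduler-file /home/b.weinstein/scheduler.json
from dask.distributed import Client

client = Client(scheduler_file='/home/b.weinstein/scheduler.json')   # example path only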

@mrocklin
Member

mrocklin commented Mar 23, 2018 via email

@jhamman
Member

jhamman commented Mar 23, 2018

@bw4sz -

Where is the scheduler.json? In the docs it suggests that creating a client should create a ~/scheduler.json in the /home/$USER/scheduler.json location. I never seem to find such a file.

  • we don't utilize the scheduler.json file in dask-jobqueue. Which docs were you looking at?

Since dask is working through an MPI connection

  • MPI is not being used here. Each worker is started as a separate process and connects to the scheduler via a TCP address.

  • can you try this workflow and see if you can get workers to connect:

    • open two interactive sessions on two different compute nodes on your cluster
    • on the first, run dask-scheduler from the command line
    • on the second, run dask-worker ${SCHEDULER} from the command line, where ${SCHEDULER} is the TCP address printed by the dask-scheduler command (e.g. tcp://192.168.0.102:8786)
    • report back on whether this is successful
    • if that does work, can you, in a third session, start a client with that same address (e.g. Client('tcp://192.168.0.102:8786')), as in the sketch after this list
    • (basically, I'm not entirely convinced your cluster is configured in such a way as to allow this sort of communication)
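
A minimal sketch of that last step in Python, assuming the address printed by dask-scheduler in the first session (the address below is a placeholder):

from dask.distributed import Client

# In a third session, connect a client directly to the manually started scheduler.
client = Client('tcp://192.168.0.102:8786')   # replace with the address dask-scheduler printed
print(client)                                 # should report the manually started worker(s)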

@bw4sz
Author

bw4sz commented Mar 23, 2018

thanks @jhamman,

I should say that I'm not tied to this repo; I'll do whatever gets dask up and running on the UF SLURM cluster. We have enormous computing resources here, and my lab has been itching to try dask for some time now. I took on the challenge because, as you can see from the 30 comments, I enjoy debugging. Perhaps I'm mixing and matching docs a bit, going from @mrocklin's screencast on connecting to HPC to this repo, as was originally suggested. I will try your workflow and let you know. After that, let me go back to the original video.

@bw4sz
Author

bw4sz commented Mar 23, 2018

That worked without error. Easiest to see by screenshot.
[screenshot: 2018-03-23 at 12.26.05 pm]

@bw4sz
Author

bw4sz commented Mar 23, 2018

It's a little awkward because srun on this resource sends you to a single interactive worker, but I think this all checks out. Again, I want to stress that if not using dask_jobqueue would be easier, just give me a push in the direction of what to use instead. It was not clear to me that it wasn't using MPI but rather a totally different networking strategy.

@jhamman
Member

jhamman commented Mar 23, 2018

Okay, glad to see that works. That is test 0. dask-jobqueue just manages the setup/teardown of those steps. So this should work.

The dask-mpi approach described in some detail here should also work, but from my perspective the dask-jobqueue approach is much more flexible and user-friendly.

The next thing to try is to start a SLURMCluster in a notebook and see if you can manually add workers via the command line approach I described above.
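
A sketch of that hybrid test, using only calls already shown in this thread (the scheduler address is read from cluster.config, as in the session below):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(project='ewhite', death_timeout=100)   # starts the scheduler in this process
client = Client(cluster)

# Hand this address to a manually launched worker on another node:
#   dask-worker <address printed here>
print(cluster.config['scheduler'])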

@bw4sz
Author

bw4sz commented Mar 23, 2018

Confirming this is what you meant:

  1. Start a JupyterLab notebook from the interactive HPC node
[b.weinstein@c25a-s15 ~]$ source activate pangeo
(pangeo) [b.weinstein@c25a-s15 ~]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> import subprocess
>>> host = socket.gethostname()
>>> proc = subprocess.Popen(['jupyter', 'lab', '--ip', host, '--no-browser'])
>>> 
>>> print("ssh -N -L 8888:%s:8888 -l b.weinstein hpg2.rc.ufl.edu" % (host))
ssh -N -L 8888:c25a-s15.ufhpc:8888 -l b.weinstein hpg2.rc.ufl.edu
>>> [I 15:40:27.627 LabApp] JupyterLab beta preview extension loaded from /home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/jupyterlab
[I 15:40:27.627 LabApp] JupyterLab application directory is /home/b.weinstein/miniconda3/envs/pangeo/share/jupyter/lab
[I 15:40:28.482 LabApp] Serving notebooks from local directory: /home/b.weinstein
[I 15:40:28.483 LabApp] 0 active kernels
[I 15:40:28.483 LabApp] The Jupyter Notebook is running at:
[I 15:40:28.483 LabApp] http://c25a-s15.ufhpc:8888/?token=996cdc65afb0315d806c108d9c1642a87c691c4d1e74f16c
[I 15:40:28.483 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 15:40:28.484 LabApp] 
    
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://c25a-s15.ufhpc:8888/?token=996cdc65afb0315d806c108d9c1642a87c691c4d1e74f16c
  2. Tunnel in from my local computer
Last login: Fri Mar 23 12:23:46 on ttys001
MacBook-Pro:~ ben$ ssh -N -L 8888:c25a-s15.ufhpc:8888 -l b.weinstein hpg2.rc.ufl.edu
b.weinstein@hpg2.rc.ufl.edu's password: 
  3. Open a Jupyter notebook locally and start a SLURMCluster
from dask_jobqueue import SLURMCluster
from datetime import datetime
from time import sleep

cluster = SLURMCluster(project='ewhite',death_timeout=100)
  4. Find the TCP address
cluster.config
{'base_path': '/home/b.weinstein/miniconda3/envs/pangeo/bin',
 'death_timeout': 100,
 'extra': '',
 'memory': '7GB',
 'name': 'dask',
 'processes': 8,
 'project': 'ewhite',
 'scheduler': 'tcp://172.16.193.17:33896',
 'threads_per_worker': 4,
 'walltime': '00:30:00'}
  5. Back on the HPC node, get a worker to connect to that TCP address
(pangeo) [b.weinstein@gator3 ~]$ dask-worker tcp://172.16.193.17:33896
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.207.123:41695'
distributed.worker - INFO -       Start worker at: tcp://172.16.207.123:38606
distributed.worker - INFO -          Listening to: tcp://172.16.207.123:38606
distributed.worker - INFO -              nanny at:       172.16.207.123:41695
distributed.worker - INFO -              bokeh at:        172.16.207.123:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.193.17:33896
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         16
distributed.worker - INFO -                Memory:                   33.64 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-di4e4yjm
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.193.17:33896
distributed.worker - INFO - -------------------------------------------------
  6. Back on the local machine, create the client.
from dask.distributed import Client
client = Client(cluster)

client

That works

[screenshot: screen shot 2018-03-23 at 12 48 14 pm]
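
A quick note on steps 4 and 5: instead of reading the scheduler address off the printed config by eye, it can be pulled out programmatically. A minimal sketch, assuming only the cluster.config dict shown in step 4:

# Grab the scheduler address from the config so it can be passed to the
# manual `dask-worker <address>` command on the HPC node.
scheduler_address = cluster.config['scheduler']   # e.g. 'tcp://172.16.193.17:33896'
print(scheduler_address)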

@jhamman
Member

jhamman commented Mar 23, 2018

Great. Can you incrementally add command-line arguments until you figure out which ones are causing your workers to fail to start? My guess is that eventually you'll find that one of these options is breaking things for you:

%(base_path)s/dask-worker %(scheduler)s \
    --nthreads %(threads_per_worker)d \
    --nprocs %(processes)s \
    --memory-limit %(memory)s \
    --name %(name)s-%(n)d \
    --death-timeout %(death_timeout)s \
     %(extra)s

(this is copied from the SLURMCluster._template).

If running with all of these arguments works, there must be an environment difference that occurs when submitting via sbatch. That would be the next thing to incrementally debug...
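
One way to see exactly which flags end up in the submitted script is to print it from the cluster object and compare it against the manual dask-worker call that worked. A minimal sketch, assuming your dask_jobqueue version exposes job_script():

# Print the generated sbatch script, including the full dask-worker line,
# so individual options can be compared or removed one at a time.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(project='ewhite', death_timeout=100)
print(cluster.job_script())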

@bw4sz
Author

bw4sz commented Mar 23, 2018

Got it. I'm not shy. I'll give you updates. I hope this is helpful for future users.

Does the TCP protocol require that workers be physically on the same node? I had added

#SBATCH -N 1

into cluster.start_workers to ensure that I got a single node. This isn't needed?

@jhamman
Member

jhamman commented Mar 23, 2018

As far as starting a Jupyter notebook goes, do this however you would typically do it on your cluster. Launching from an existing Client is one way, but it is also somewhat orthogonal to how we are doing things here. My typical workflow is:

  • launch a notebook
  • open an ssh tunnel to that notebook
  • from within the notebook
    • start a Cluster -- I typically use the PBSCluster (e.g. cluster = PBSCluster(...))
    • create a client (e.g. client = Client(cluster))
    • start workers on the cluster (e.g. cluster.start_workers(2))

Does the TCP protocol require that workers be physically on the same node? I had added

On most clusters, this is not going to be needed. I routinely run on many dozens of compute nodes spread over a very large HPC cluster. I do not have any control over where the jobs start.

@bw4sz
Author

bw4sz commented Mar 23, 2018

Agreed on the workflow. I'm tweaking the memory and processors per worker to see if it makes a difference. It's a bit of a funny thing to debug because I can't tell how long to wait. Several times I've been deceived into thinking I'd solved the problem, but then the workers die later. For example, I've been sitting here for 5 minutes and all the workers are still there, for now!

[screenshot: screen shot 2018-03-23 at 1 51 15 pm]

Once I get it to work, the key will be to break it again to confirm.

Is there a way to get each worker to announce its IP or rank, or otherwise communicate with the client? I'm going solely off of the client saying that it has workers. It would be nice to get info from them.

@jhamman
Member

jhamman commented Mar 23, 2018

Is there a way to get each worker to announce its IP or rank or communicate with the client? I'm going solely off of the client saying that it has workers. It would be nice to get info from them.

The dashboard may be a logical place to track this. @mrocklin may have other ideas too.
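
For what it's worth, the client itself can also report on its workers. A minimal sketch using standard dask.distributed calls (scheduler_info and Client.run), assuming client is the Client created above:

import socket

# Addresses and metadata for every worker the scheduler currently knows about
workers = client.scheduler_info()['workers']
print(list(workers))                    # worker addresses, e.g. 'tcp://172.16.207.123:38606'

# Run a function on every worker, e.g. to learn which host each one landed on
print(client.run(socket.gethostname))   # {worker_address: hostname, ...}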

@bw4sz
Author

bw4sz commented Mar 23, 2018

Yup, it looks like it sticks for the moment! I think we have a success with 1 thread per worker. It's not clear why, but @mrocklin was probably right yesterday about the spawning/forking. I'm going to play with this now and see if I can break it again to really understand it.

How will a single thread on each process affect my performance? It's hard for me to conceptualize the tradeoff between the number of nodes, the number of workers, the number of processes, and the number of threads.

Once I confirm it's a single-thread issue, I will close this thread.

@jhamman
Member

jhamman commented Mar 23, 2018

Is it possible Slurm is configured on your cluster in such a way that it won't let you overload your requested resources? In other words, if you ask for 1 process on one node and try to use two, would that fail?

Glad to hear you're starting to get things working.

@mrocklin
Member

mrocklin commented Mar 23, 2018 via email

@bw4sz
Author

bw4sz commented Mar 23, 2018

Hmm, yes, you are right, there's something to this. It's not as simple as single-threaded vs. multi-threaded: 4 processes with 2 threads per worker seems to work, but 8 processes with 4 threads per worker fails as described above.
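
For reference, a minimal sketch of the two configurations being compared, with keyword names taken from the cluster.config shown earlier (total threads per job = processes x threads_per_worker):

# Works on this system: 4 worker processes x 2 threads each
cluster = SLURMCluster(project='ewhite', death_timeout=100,
                       processes=4, threads_per_worker=2)

# Fails as described above: 8 worker processes x 4 threads each
# cluster = SLURMCluster(project='ewhite', death_timeout=100,
#                        processes=8, threads_per_worker=4)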

I think that's good enough to call it closed. Maybe IT will have an explanation for the right balance among nodes/workers/threads/processes. It would have been nice if the admins had a more explicit message to give us a sense of what was failing.

I really appreciate the developers' willingness to assist. Let me know how I can contribute in the future. I am documenting this for my lab.

@bw4sz bw4sz closed this as completed Mar 23, 2018
@jhamman
Member

jhamman commented Mar 23, 2018

Cool. Glad you have this (mostly) sorted out. It would be great to get your continued interaction here. @guillaumeeb and I seem to be the most engaged users/developers at the moment and we're both mainly using PBS. The Slurm cluster will need some attention at some point soon (e.g. #19, #21).

@AlecThomson

Hey all, I'm just jumping on this thread, as I seem to be having a similar problem.

In my specific case, I only get worker death when using cluster.adapt and when the processes kwarg in SLURMCluster is > 1. I've put some details in an SO post.

Is anyone able to help me diagnose this? Happy to open a new issue if that's better, too.
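
For anyone following along, a minimal sketch of the adaptive setup being described, assuming dask_jobqueue's SLURMCluster and the standard cluster.adapt() interface (the cores/memory values here are placeholders):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Worker death is reported when processes > 1 and adaptive scaling is used
cluster = SLURMCluster(cores=8, memory='16GB', processes=2)
cluster.adapt(minimum=0, maximum=10)   # let the cluster scale worker jobs up and down
client = Client(cluster)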
