Initializing a local client crashes randomly #2234

@bluenote10

When running the following code

from dask.distributed import Client
if __name__ == "__main__":
    client = Client()

repeatedly, I observe that the client initialization crashes in roughly 10% of the runs. Example output:

$ for i in {1..100} ; do echo $i; python debug_distributed_client.py; done
1
Exception in thread IO loop (most likely raised during interpreter shutdown):Exception in thread Profile (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/profile.py", line 239, in _watch
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 359, in run_loop
  File "/usr/lib/python2.7/threading.py", line 585, in set
  File "/usr/lib/python2.7/threading.py", line 407, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
2
Exception in thread Profile (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/profile.py", line 239, in _watch
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread IO loop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 359, in run_loop
  File "/usr/lib/python2.7/threading.py", line 585, in set
  File "/usr/lib/python2.7/threading.py", line 407, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
3
Exception in thread Profile (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/profile.py", line 239, in _watch
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread IO loop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 359, in run_loop
  File "/usr/lib/python2.7/threading.py", line 585, in set
  File "/usr/lib/python2.7/threading.py", line 407, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
4
5
6
7
8
9
10
11
distributed.nanny - WARNING - Worker process still alive after 47 seconds, killing
distributed.nanny - WARNING - Worker process 30970 was killed by signal 15
Traceback (most recent call last):
  File "debug_distributed_client.py", line 3, in <module>
    client = Client()
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 608, in __init__
    self.start(timeout=timeout)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 731, in start
    sync(self.loop, self._start, **kwargs)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
    raise_exc_info(self._exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 794, in _start
    yield self.cluster
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
    raise_exc_info(self._exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1314, in _wrap_awaitable
    _y = _m(*_x)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 211, in __await__
    result = yield self
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1297, in _wrap_awaitable
    _s = yield _y
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
    raise_exc_info(self._exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/deploy/local.py", line 191, in _start
    yield [self._start_worker(**self.worker_kwargs) for i in range(n_workers)]
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
    raise_exc_info(self._exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
    raise_exc_info(self._exc_info)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
    raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
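
The shell loop above can also be written as a small Python harness that records the exit status and duration of each run, which makes the failure rate and the slow runs easier to quantify. This is only a minimal sketch: `run_repeatedly` and `failure_rate` are hypothetical helper names, and it assumes a failed initialization ends the process with a non-zero exit status. Exceptions printed by background threads during interpreter shutdown may not change the exit status, so they may not be counted.

```python
import subprocess
import time


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times; return a list of (exit_code, seconds) tuples."""
    results = []
    for _ in range(runs):
        started = time.time()
        # subprocess.call blocks until the child exits and returns its status.
        exit_code = subprocess.call(cmd)
        results.append((exit_code, time.time() - started))
    return results


def failure_rate(results):
    """Fraction of runs that exited with a non-zero status."""
    if not results:
        return 0.0
    failed = sum(1 for exit_code, _ in results if exit_code != 0)
    return failed / float(len(results))
```

For example, `failure_rate(run_repeatedly([sys.executable, "debug_distributed_client.py"], 100))` (after `import sys`) would give the fraction of crashing runs, and the recorded durations show which runs got stuck.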

In the cases where the initialization fails, it also takes a long time (~1 minute), which suggests an internal race condition. Interrupting the process with Ctrl+C (SIGINT) while it is stuck gives this traceback:

^CTraceback (most recent call last):
  File "debug_distributed_client.py", line 3, in <module>
    client = Client()
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 608, in __init__
    self.start(timeout=timeout)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 731, in start
    sync(self.loop, self._start, **kwargs)
  File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 275, in sync
    e.wait(10)
  File "/usr/lib/python2.7/threading.py", line 614, in wait
    self.__cond.wait(timeout)
  File "/usr/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
KeyboardInterrupt

Notes:

  • The log also shows a minor issue (the crashes in runs 1, 2, 3): when the main process terminates too quickly after initializing the client, background threads can raise exceptions during interpreter shutdown. Adding e.g. a time.sleep(0.1) after the initialization makes these exceptions go away.
  • The same issue occurs when using cluster = LocalCluster() directly instead of initializing the client, which indicates that the problem lies in the local cluster rather than in the client.
  • The problem seems to be Python 2 specific.
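
The "most likely raised during interpreter shutdown" messages in runs 1, 2, 3 are typical of Python 2 when a daemon thread is still running while the interpreter clears module globals to None. The sketch below illustrates the generic pattern for avoiding this: signal the thread to stop and join it before the main thread exits. It is not distributed's code; `Watcher` is a hypothetical stand-in for a background thread such as the profiler or IO loop.

```python
import threading
import time


class Watcher(object):
    """Hypothetical background thread that does periodic work."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.ticks = 0
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._watch)
        # Daemon threads do not block interpreter exit, which is why they
        # can still be running while the interpreter shuts down.
        self._thread.daemon = True

    def start(self):
        self._thread.start()

    def _watch(self):
        # Event.wait returns True once the event is set (Python 2.7+),
        # so the loop ends promptly after close() is called.
        while not self._stop_event.wait(self.interval):
            self.ticks += 1

    def close(self, timeout=1.0):
        """Ask the thread to stop and wait for it; True if it has exited."""
        self._stop_event.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

Calling `close()` before the script ends removes the window in which the thread can run during shutdown, whereas a fixed `time.sleep(0.1)` only makes that window less likely to be hit.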

Versions:

  • dask==0.19.0
  • distributed==1.23.0
  • tornado==5.1
  • Python 2.7.12
  • Ubuntu 16.04
