When running debug_distributed_client.py (which creates the client via client = Client()) repeatedly, I'm observing that client initialization fails in roughly 10% of runs. Example output:
$ for i in {1..100} ; do echo $i; python debug_distributed_client.py; done
1
Exception in thread IO loop (most likely raised during interpreter shutdown):Exception in thread Profile (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 754, in run
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/profile.py", line 239, in _watch
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 754, in run
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 359, in run_loop
File "/usr/lib/python2.7/threading.py", line 585, in set
File "/usr/lib/python2.7/threading.py", line 407, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
2
Exception in thread Profile (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 754, in run
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/profile.py", line 239, in _watch
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread IO loop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 754, in run
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 359, in run_loop
File "/usr/lib/python2.7/threading.py", line 585, in set
File "/usr/lib/python2.7/threading.py", line 407, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
3
Exception in thread Profile (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 754, in run
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/profile.py", line 239, in _watch
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread IO loop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 754, in run
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 359, in run_loop
File "/usr/lib/python2.7/threading.py", line 585, in set
File "/usr/lib/python2.7/threading.py", line 407, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
4
5
6
7
8
9
10
11
distributed.nanny - WARNING - Worker process still alive after 47 seconds, killing
distributed.nanny - WARNING - Worker process 30970 was killed by signal 15
Traceback (most recent call last):
File "debug_distributed_client.py", line 3, in <module>
client = Client()
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 608, in __init__
self.start(timeout=timeout)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 731, in start
sync(self.loop, self._start, **kwargs)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 794, in _start
yield self.cluster
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1314, in _wrap_awaitable
_y = _m(*_x)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 211, in __await__
result = yield self
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1297, in _wrap_awaitable
_s = yield _y
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/deploy/local.py", line 191, in _start
yield [self._start_worker(**self.worker_kwargs) for i in range(n_workers)]
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 883, in callback
result_list.append(f.result())
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/deploy/local.py", line 217, in _start_worker
raise gen.TimeoutError("Worker failed to start")
tornado.util.TimeoutError: Worker failed to start
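To put a number on the failure rate instead of reading it off the loop output, a small harness along these lines could be used. This is only a sketch under some assumptions: the helper name failure_rate and its parameters are hypothetical, it uses only the Python 3 standard library (the script under test can still run under any interpreter), and it counts a run as failed when the process exits non-zero or exceeds the timeout. The "raised during interpreter shutdown" thread tracebacks would not be counted unless they also change the exit code.

```python
import subprocess
import sys


def failure_rate(cmd, runs=100, timeout=120):
    """Run `cmd` `runs` times and return the fraction of failed runs.

    A run counts as failed if the process exits with a non-zero status
    or does not finish within `timeout` seconds.
    (Hypothetical helper, not part of distributed.)
    """
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            failed = True
        if failed:
            failures += 1
    return failures / runs
```

For example, failure_rate([sys.executable, "debug_distributed_client.py"]) would return the fraction of the 100 runs that ended with an error such as the TimeoutError above.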
When the initialization fails, it also takes a long time (~1 minute), which suggests an internal race condition. Interrupting the process with Ctrl-C (SIGINT) while it is stuck gives this traceback:
^CTraceback (most recent call last):
File "debug_distributed_client.py", line 3, in <module>
client = Client()
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 608, in __init__
self.start(timeout=timeout)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/client.py", line 731, in start
sync(self.loop, self._start, **kwargs)
File "/tmp/fresh_venv/local/lib/python2.7/site-packages/distributed/utils.py", line 275, in sync
e.wait(10)
File "/usr/lib/python2.7/threading.py", line 614, in wait
self.__cond.wait(timeout)
File "/usr/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt
Notes:
- [...] time.sleep(0.1).
- The problem also occurs when creating cluster = LocalCluster() directly instead of initializing the client, indicating that it is actually a problem of the local cluster.

Versions: