Sporadic concurrency issue leading to zero task progress #1176

Description

@adamklein

Versions:

- dask=0.14.3=py35_1
- distributed=1.16.3=py35_0
- tornado=4.5.1=py35_0

I'll try to create a repro but until I can do that, I'll just describe the setup here as a placeholder.

I am running a LocalCluster with 6 worker processes (the number of CPUs) and 1 thread per worker.

I am submitting tasks from within tasks using worker_client, and using an AsCompleted(loop=client.loop) inside the worker_client context to wait for them to complete. Occasionally a secondary task completes (i.e., the task-of-task has finished what it is supposed to do), but the original task that launched it keeps waiting on its queue for the completion notification, which halts the program. When I interrupt the program, the stack traces are as follows:

In worker_client context / original task:

^CTraceback (most recent call last):
  File "<string>", line 1, in <module>
  File "lib/python3.5/multiprocessing/forkserver.py", line 164, in main
    rfds = [key.fileobj for (key, events) in selector.select()]
  File "lib/python3.5/selectors.py", line 441, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt
  File "...", line 248, in run
    for f in queue:
  File "lib/python3.5/site-packages/distributed/client.py", line 2634, in __next__
    return self.queue.get()
  File "lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

Where the secondary task was running:

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3fbd31b730>, <tornado.concurrent.Future object at 0x7f3fbd2c9d68>)
Traceback (most recent call last):
  File "lib/python3.5/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "lib/python3.5/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "lib/python3.5/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "lib/python3.5/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "lib/python3.5/site-packages/distributed/client.py", line 2600, in track_future
    yield _wait(future)
  File "lib/python3.5/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "lib/python3.5/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "lib/python3.5/site-packages/distributed/client.py", line 2482, in _wait
    raise CancelledError(cancelled)
concurrent.futures._base.CancelledError: ['mytask-349a99bf-9ffa-4fc6-988e-d85345a654d6']
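
One possible reading of the two traces, which I have not confirmed against the distributed source: the tracking callback raises CancelledError while waiting on the future, so nothing is ever put on the queue, and the blocking get() in the original task never returns. The sketch below imitates that pattern with only the standard library. It is not distributed's code; `Cancelled`, `track`, `consume` and `run` are hypothetical names made up for illustration, and a timeout is used so the stall is observable instead of hanging.

```python
import queue
import threading


class Cancelled(Exception):
    """Stand-in for concurrent.futures.CancelledError (assumption for this sketch)."""


def track(result_queue, cancelled):
    """Hypothetical tracker: enqueue the finished key unless the wait fails."""
    try:
        if cancelled:
            # The wait raises, so the put below is skipped.
            raise Cancelled("mytask")
        result_queue.put("mytask")
    except Cancelled:
        # Swallowed here; in the traces above it is only logged on the
        # event loop, and the consumer is never told about it.
        pass


def consume(result_queue, timeout):
    """Return the next finished key, or None if nothing arrives in time."""
    try:
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        return None


def run(cancelled, timeout=0.2):
    """Run one tracker thread to completion, then try to read its result."""
    q = queue.Queue()
    t = threading.Thread(target=track, args=(q, cancelled))
    t.start()
    t.join()
    return consume(q, timeout)


print(run(cancelled=False))
print(run(cancelled=True))
```

With a bare get() and no timeout, the second call would block indefinitely, which would be consistent with the waiter.acquire() frame in the first trace.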
