I have recently started upgrading my environment to Python 3.9, and all of a sudden I started getting deadlocks. Since only Linux uses the fork start method for ProcessPoolExecutor by default, this is a Linux-specific problem.
Reproducing the deadlock
After tracking down the cause of the problem, I found that the following snippet reproduces it (I ran it via docker for an easy reference point):
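A minimal sketch along the lines of my repro (simplified here; foo is a placeholder task that just echoes its argument):

```python
import concurrent.futures
from celery import Celery

app = Celery('tasks')

@app.task(bind=True)
def foo(self, x):
    return x

if __name__ == '__main__':
    futures = {}
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        for x in range(10):
            futures[executor.submit(foo, x)] = x
        for fut in concurrent.futures.as_completed(futures):
            print(fut.result() + 1)  # counts 1..10 when nothing deadlocks
```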
You should see it count from 1 to 10, but it will most likely stop in the middle somewhere, as half the workers deadlock. If it doesn't hang, keep rerunning it until it deadlocks.
As this is a timing issue, it may be harder to reproduce on some machines, but every Linux machine I've tried deadlocks almost every time.
What I believe changed to cause this behavior
I've been able to track this down to Python 3.9.0a6: that release and every version after it has an extremely high chance of deadlocking when the number of workers is at least 4, while Python 3.6-3.8 and the 3.9 alphas up to 3.9.0a5 have never deadlocked for me. I suspect it has to do with https://bugs.python.org/issue40089, where _at_fork_reinit() was added.
It appears a thread in the parent process is holding a mutex at the moment another thread forks the worker processes, so the new workers start with locks already in the "locked" state and no thread left to release them.
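As a stripped-down illustration of that hazard (a hypothetical sketch of the general fork-while-locked problem, not the actual Celery/executor code):

```python
import os
import threading
import time

lock = threading.Lock()

def hold():
    with lock:
        time.sleep(1.0)  # hold the lock across the fork below

threading.Thread(target=hold).start()
time.sleep(0.1)  # give the helper thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # Child: the lock was copied in the "locked" state, and the thread
    # that held it doesn't exist here, so this acquire blocks forever.
    lock.acquire()
    os._exit(0)
os.waitpid(pid, 0)  # the parent hangs here, demonstrating the deadlock
```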
I have confirmed in the debugger that this is happening: the locks start out in the "locked" state in the forked workers. And this is the part that gets very confusing, as it seems to be the exact opposite of what the fix in 3.9.0a6 claimed to do:
Add a private _at_fork_reinit() method to :class:`_thread.Lock`, :class:`_thread.RLock`, :class:`threading.RLock` and
:class:`threading.Condition` classes: reinitialize the lock at fork in the child process, reset the lock to the
unlocked state. Rename also the private _reset_internal_locks() method of :class:`threading.Event` to _at_fork_reinit().
Is this a regression introduced by this fix?
Deep dive into the problem
It appears that the code is getting deadlocked on (at least) two locks of concern.
The first lock comes from Celery.tasks, a cached_property, which takes a lock here.
Depending on the timing, there is a second lock it will hang on here, during app.finalize.
There may be additional locks I was unable to track down, but at least these two cause issues.
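For context, the lock-guarded lazy-attribute pattern involved looks roughly like this (a paraphrase of the functools.cached_property shape from Python 3.8+, not Celery's exact source):

```python
import threading

class locked_cached_property:
    """Paraphrased lock-guarded lazy attribute (not Celery's exact code)."""

    def __init__(self, func):
        self.func = func
        self.attrname = func.__name__
        self.lock = threading.RLock()

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        try:
            return instance.__dict__[self.attrname]
        except KeyError:
            with self.lock:  # a fork here leaves the child's copy locked
                # double-checked: another thread may have computed it already
                if self.attrname not in instance.__dict__:
                    instance.__dict__[self.attrname] = self.func(instance)
                return instance.__dict__[self.attrname]
```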
ProcessPoolExecutor pickles the function in its "QueueFeederThread" while another thread is forking new workers. During pickling, app.tasks is cached (which also finalizes the Celery app, seen here).
Parent stack trace (QueueFeederThread)
This covers why and when the parent is accessing app.tasks. The QueueFeederThread serializes _CallItems, which include the function, and so it ends up pickling the Celery task.
Worker stack trace
This covers when the worker hits the lock. During unpickling, it accesses app.tasks, which also finalizes and caches app.tasks.
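The mechanism is easy to see in isolation: Celery tasks pickle by name, so a round trip forces a registry lookup on the other side (a hypothetical illustration, reusing foo from the sketch above):

```python
import pickle

data = pickle.dumps(foo)       # parent side: the QueueFeederThread does this
restored = pickle.loads(data)  # worker side: the lookup touches app.tasks,
                               # which finalizes the app and takes its locks
```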
Workarounds
Pre-finalize and pre-cache tasks
Adding something like:
```python
from celery import _state as celery_state

app = celery_state.get_current_app()
# Just trigger the cached_property
app.tasks
```
before entering the executor block will pre-cache tasks and prevent the deadlocks, at least in my tests. This is not really ideal; it feels too fragile and might fail in other ways down the road.
Don't pass a Celery task to ProcessPoolExecutor
While this works on the surface, it breaks how I use Celery tasks: I use the bind=True flag, so my tasks all have a self object that won't be there if I don't go through the task machinery:
```python
from concurrent.futures import ProcessPoolExecutor

futures = {}
with ProcessPoolExecutor(max_workers=10) as executor:
    for x in range(10):
        futures[executor.submit(foo.__wrapped__, x)] = x
```
Additional notes
The problem is the same if I replace @shared_task with a simple:
```python
app = Celery('tasks')

@app.task
```
I'm mainly using the latest version of Celery and all dependencies, but the behavior is the same on both master and the latest Celery 4 release.
Background
Why am I passing a Celery task to a ProcessPoolExecutor? In my library, Celery is only one of the possible executors I use; I also use ThreadPoolExecutor and ProcessPoolExecutor, selected by external configuration, as sketched below. And up until Python 3.9.0a6, this has worked without issue.
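Schematically, the selection looks something like this (a hypothetical sketch; make_executor and the config values are illustrative, not my library's real API):

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor

def make_executor(kind: str, max_workers: int = 10) -> Executor:
    # 'kind' comes from external configuration
    if kind == 'process':
        return ProcessPoolExecutor(max_workers=max_workers)
    if kind == 'thread':
        return ThreadPoolExecutor(max_workers=max_workers)
    raise ValueError(f'unknown executor kind: {kind!r}')
```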
Questions
Is this a regression in Python 3.9?
Did I do something wrong?
What is the best way to handle this and move forward?