Deadlock with Billiards #1218

Closed
AnonymousHelper opened this issue Mar 1, 2013 · 32 comments

Comments

@AnonymousHelper

Celery 3.0.13
Billiard 2.7.3.19

Workers are consuming 10,000 tasks a day.
We kill each worker after it has executed a single task (max tasks per child = 1).
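(For reference, a minimal sketch of how that recycling is usually configured in Celery 3.0; the reporter's actual invocation is not shown here:)

# celeryconfig.py -- sketch only, not the reporter's actual configuration
CELERYD_MAX_TASKS_PER_CHILD = 1   # replace each pool process after it has run one task
# or, equivalently, on the command line: celery worker --maxtasksperchild=1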

Over time, we end up with deadlocked workers.
I can't get a pystack for you.
The stack trace shows that the deadlock happens while billiard is trying to acquire a lock:
#0 0x0000003cff40c0ed in sem_wait () from /lib64/libpthread.so.0
#1 0x00002b180708129a in Billiard_semlock_acquire (self=0x3d165f9340, args=, kwds=) at Modules/_billiard/semaphore.c:312

Strace shows the processes in a futex wait state.

What sort of data can help you debug this problem?

@bear330

bear330 commented Mar 4, 2013

+1
I have the same problem: after running the celery worker for a while, a deadlock occurs.
In celeryd.log some tasks only show "TaskPool: Apply ..." with no subsequent "Task accepted: ..." (log level = DEBUG). When I attached gdb, I also saw Billiard_semlock_acquire.
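(A rough sketch, not from the thread, of listing those stuck task ids from the log; the uuid-based matching is an assumption about the DEBUG log format:)

import re

# task ids that reached "TaskPool: Apply" but never got a "Task accepted" line
uuid_re = re.compile(r'[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}')
applied, accepted = set(), set()

with open('celeryd.log') as log:
    for line in log:
        match = uuid_re.search(line)
        if not match:
            continue
        if 'TaskPool: Apply' in line:
            applied.add(match.group(0))
        elif 'Task accepted' in line:
            accepted.add(match.group(0))

for task_id in sorted(applied - accepted):
    print(task_id)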

Here are my command-line arguments:
python -m Wikiprice.Celery worker --pidfile=fetch.pid --autoscale 4 --loglevel DEBUG --logfile Fetch.log

If any data can help you debug this problem, please let me know.

@ask
Contributor

ask commented Mar 4, 2013

If you're using 3.0.13 then please upgrade to 3.0.15. That release fixes a deadlock that is more likely to occur with autoscale or time limits enabled (but may occur anyway).

@bear330

bear330 commented Mar 5, 2013

'AnonymousHelper' is using 3.0.13, but I am using 3.0.15 and the deadlock is still there.
Some of my processes seem to have crashed before the deadlock; I got these:

[2013-03-01 10:55:11,699: ERROR/MainProcess] Process 'PoolWorker-91' pid:12666 exited with exitcode 15

I am not sure whether this is related to the issue.
You can get the full log here:
http://chocobo.servehttp.com/static/celeryd.log

@AnonymousHelper
Author

I'm not using time limits or auto-scaling.

@ask
Contributor

ask commented Mar 6, 2013

@AnonymousHelper I said "more likely to occur", which means it may still occur.

@ask
Contributor

ask commented Mar 6, 2013

Btw, maxtasksperchild also increases the possibility; the contributing factor is processes that exit often (and therefore need to be replaced).

@ask
Contributor

ask commented Mar 6, 2013

@bear330 Are you using a transport other than amqp/redis? If so, try running with CELERYD_FORCE_EXECV=True.
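(A sketch of where that setting would go, assuming the usual celeryconfig.py module:)

# celeryconfig.py -- only relevant when the broker transport is not amqp or redis
CELERYD_FORCE_EXECV = True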

@bear330

bear330 commented Mar 7, 2013

I only use RabbitMQ 2.6.1 (broker) and Redis 2.4.10 (result backend).

Actually, this deadlock has been happening since 2.x (I think the first version I tried was around 2.2); it may not have the same cause in every version, but the deadlock is still there.
In the beginning I was on Windows (including RabbitMQ and Redis); after 3.0 I moved to Linux (CentOS 6, with RabbitMQ and Redis moved as well).
Every time celery releases a new version, I upgrade and try again.

Celery is great work! Thank you very much; if this issue can be solved, it will be even better.

@ask
Contributor

ask commented Mar 14, 2013

Please note that user code may also deadlock a worker, and this is the most common cause of deadlocks,
especially when they happen frequently.

If all the pool processes are executing tasks that never return, you have resource starvation;
time limits can be used to handle this case.
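(A sketch of global time limits in celeryconfig.py; the values are illustrative, and per-task limits can also be set on the task decorator:)

# sketch -- illustrative values
CELERYD_TASK_SOFT_TIME_LIMIT = 240   # raises SoftTimeLimitExceeded inside the task
CELERYD_TASK_TIME_LIMIT = 300        # hard limit: the pool process is killed and replaced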

So the deadlock @bear330 is experiencing may not be related.

I would love it if you could create an example project that reproduces the deadlock;
that would help me identify the issue. It's almost impossible for me to debug without being able to reproduce it.

@bear330

bear330 commented Mar 15, 2013

Here is my gdb output for all of the deadlocked workers:

PID - 13295:
#0  0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
    args=<value optimized out>, kwds=<value optimized out>)
    at Modules/_billiard/semaphore.c:312
#2  0x00007feb26132b24 in call_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3794
#3  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#4  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=0x2e088e8, kwcount=0, defs=
    0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5  0x00007feb26132be4 in fast_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3890
#6  call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at Python/ceval.c:3815
#7  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#8  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=0x2e02030, kwcount=0, defs=
    0x0, defcount=0, closure=
    (<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))

PID - 12688:
#0  0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
    args=<value optimized out>, kwds=<value optimized out>)
    at Modules/_billiard/semaphore.c:312
#2  0x00007feb26132b24 in call_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3794
#3  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#4  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=0x2e49bb8, kwcount=0, defs=
    0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5  0x00007feb26132be4 in fast_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3890
#6  call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at Python/ceval.c:3815
#7  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#8  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=0x2e499f0, kwcount=0, defs=
    0x0, defcount=0, closure=
    (<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))

PID - 12479:
#0  0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
    args=<value optimized out>, kwds=<value optimized out>)
    at Modules/_billiard/semaphore.c:312
#2  0x00007feb26132b24 in call_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3794
#3  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#4  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=0x2d75f08, kwcount=0, defs=
    0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5  0x00007feb26132be4 in fast_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3890
#6  call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at Python/ceval.c:3815
#7  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#8  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=0x2ec42c0, kwcount=0, defs=
    0x0, defcount=0, closure=
    (<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))

PID - 12159:
#0  0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
    args=<value optimized out>, kwds=<value optimized out>)
    at Modules/_billiard/semaphore.c:312
#2  0x00007feb26132b24 in call_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3794
#3  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#4  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=0x2c8fef8, kwcount=0, defs=
    0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5  0x00007feb26132be4 in fast_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3890
#6  call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at Python/ceval.c:3815
#7  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#8  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=0x2dd88c0, kwcount=0, defs=
    0x0, defcount=0, closure=
    (<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))

PID - 64742: 
#0  0x00007feb254f9d03 in select () from /lib64/libc.so.6
#1  0x00007feb1eb6b219 in floatsleep (self=<value optimized out>,
    args=<value optimized out>) at Modules/timemodule.c:910
#2  time_sleep (self=<value optimized out>, args=<value optimized out>)
    at Modules/timemodule.c:206
#3  0x00007feb26132b24 in call_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3794
#4  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#5  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x7feb26461300,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=2, kws=0x7feb10001380, kwcount=0,
    defs=0x7feb264740e8, defcount=1, closure=0x0) at Python/ceval.c:3044
#6  0x00007feb26132be4 in fast_function (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:3890
#7  call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at Python/ceval.c:3815
#8  PyEval_EvalFrameEx (f=<value optimized out>,
    throwflag=<value optimized out>) at Python/ceval.c:2453
#9  0x00007feb26134797 in PyEval_EvalCodeEx (co=0x26e0c60,
    globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=0x7feb10001118, kwcount=1,
    defs=0x26e29b0, defcount=2, closure=0x0) at Python/ceval.c:3044

The stacks show the processes are blocked in Billiard_semlock_acquire, so I don't think the deadlock is caused by my code. But:

If all the pool processes are executing tasks that never return, you have resource starvation;
time limits can be used to handle this case.

I had never thought about this; I will try it! Thank you!

If there are still problems, I will try to make a small example project for debugging. Thanks.

@ask
Contributor

ask commented Mar 15, 2013

Hey, could you please try running the worker with rate limits disabled? (CELERY_DISABLE_RATE_LIMITS=True)

@AnonymousHelper
Author

Turned off rate limits and increased max tasks per child to 10. Still seeing a large number of deadlocked workers.

@ask
Contributor

ask commented Apr 2, 2013

Why did you increase maxtasks? It would be better if you tried disabling that as I said earlier.

You can try adding a time limit to see whether the workers are stuck doing something.

@AnonymousHelper
Author

  1. Increased it from 1 to 10, which should reduce the chance of deadlock. I'm worried about the performance implications of completely disabling the MAX_TASK limit due to possible resource leaks.

  2. Will attempt time limits.

  3. Will try to give you a reproducible case.

@ask
Contributor

ask commented Apr 3, 2013

Patch celery/billiard@94f5623 seems to fix a very similar issue (#1266);
could you try patching your billiard with that, or installing the branch (pip install -U http://github.com/celery/billiard/zipball/2.7#egg=billiard)?

@ask
Contributor

ask commented Apr 10, 2013

Fixed in billiard 2.7.3.26

@ask ask closed this as completed Apr 10, 2013
@AnonymousHelper
Author

This is the pystack from the deadlocked workers:

lib/python2.7/site-packages/billiard/synchronize.py (95): __enter__
lib/python2.7/site-packages/billiard/queues.py (352): put
lib/python2.7/site-packages/billiard/pool.py (306): worker
lib/python2.7/site-packages/billiard/process.py (101): run
lib/python2.7/site-packages/billiard/process.py (253): _bootstrap
lib/python2.7/site-packages/billiard/forking.py (189): __init__
lib/python2.7/site-packages/billiard/process.py (118): start
lib/python2.7/site-packages/billiard/pool.py (912): _create_worker_process
lib/python2.7/site-packages/billiard/pool.py (1029): _repopulate_pool
lib/python2.7/site-packages/billiard/pool.py (1042): _maintain_pool
lib/python2.7/site-packages/billiard/pool.py (1048): maintain_pool
lib/python2.7/site-packages/celery/utils/timer2.py (155): _reschedules
lib/python2.7/site-packages/celery/utils/timer2.py (51): __call__
lib/python2.7/site-packages/celery/utils/timer2.py (98): apply_entry
lib/python2.7/site-packages/celery/worker/hub.py (154): fire_timers
lib/python2.7/site-packages/celery/worker/consumer.py (461): consume_messages
lib/python2.7/site-packages/celery/worker/consumer.py (394): start
lib/python2.7/site-packages/celery/worker/__init__.py (349): start
lib/python2.7/site-packages/celery/apps/worker.py (274): run_worker
lib/python2.7/site-packages/celery/apps/worker.py (185): run
lib/python2.7/site-packages/celery/bin/celeryd.py (153): run
lib/python2.7/site-packages/celery/bin/base.py (220): handle_argv
lib/python2.7/site-packages/celery/bin/base.py (180): execute_from_commandline
lib/python2.7/site-packages/celery/bin/celeryd.py (134): execute_from_commandline
lib/python2.7/site-packages/celery/bin/celeryd.py (202): main
lib/python2.7/site-packages/celery/__main__.py (20): _compat_worker
/bin/celeryd (8):

@ask
Contributor

ask commented Apr 12, 2013

Btw, the patch above only fixed it for SIGTERM; it was later extended to cover the other termination signals, but it's still not fixed for SIGKILL (i.e. if the process is killed while holding the semaphore).

@ask
Contributor

ask commented Apr 16, 2013

It seems the fix was later disabled by another change; this is now fixed in 2.7.3.28 (this time with tests).

@AnonymousHelper
Author

Deployed the latest celery and billiard.
Still deadlocking.
/billiard/synchronize.py (95): __enter__
/billiard/queues.py (341): get
/billiard/pool.py (257): poll
/billiard/pool.py (296): worker
/billiard/process.py (103): run
/billiard/process.py (255): _bootstrap
/billiard/forking.py (190): __init__
/billiard/process.py (120): start
/billiard/pool.py (927): _create_worker_process
/billiard/pool.py (858): __init__
/celery/concurrency/processes/__init__.py (81): on_start
/celery/concurrency/base.py (113): start
/celery/worker/__init__.py (364): start
/celery/apps/worker.py (279): run_worker
/celery/apps/worker.py (189): run
/celery/bin/celeryd.py (153): run
/celery/bin/base.py (227): handle_argv
/celery/bin/base.py (187): execute_from_commandline
/celery/bin/celeryd.py (134): execute_from_commandline
/celery/bin/celeryd.py (202): main
/celery/__main__.py (20): _compat_worker

@ask
Contributor

ask commented Apr 22, 2013

Since you're using maxtasks this could be a duplicate of #1310.

You can try using the instructions in the comment there to try out the fix:
#1310 (comment)

@AnonymousHelper
Author

I removed max tasks a few days ago.

@ask
Contributor

ask commented Apr 22, 2013

Ok, if the worker child processes never exit/terminate then it's unlikely #1310 will change anything. In that case you will need to come up with an example that reproduces the deadlock.

@AnonymousHelper
Author

So the entire fleet isn't deadlocked. However, since tasks are prefetched and assigned to the "deadlocked" workers, those tasks never complete.

What could be a mitigation for that scenario?

I'll attempt to recreate this issue tonight.

@AnonymousHelper
Author

Reproduced the bug. Will post shortly.

@AnonymousHelper
Author

celeryconfig.py

BROKER_URL = 'redis://localhost:6379/1'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'
CELERY_IMPORTS = ('tasks',)  # trailing comma so this is a tuple rather than a plain string
CELERY_ENABLE_UTC = True

CELERY_QUEUES = {
    "default": {
        "binding_key": "default",
        "exchange": "default",
    },
    "checks": {
        "binding_key": "check",
        "exchange": "direct",
    },
    "deployments": {
        "binding_key": "deployment",
        "exchange": "direct",
    },
    "re-deployments": {
        "binding_key": "re-deployment",
        "exchange": "direct",
    },
    "postchecks": {
        "binding_key": "postcheck",
        "exchange": "direct",
    },
    "rendering": {
        "binding_key": "rendering",
        "exchange": "direct",
    }
}
CELERY_DEFAULT_QUEUE = "default"
CELERY_DEFAULT_EXCHANGE_TYPE = "direct"
CELERY_DEFAULT_ROUTING_KEY = "default"
CELERY_DISABLE_RATE_LIMITS = True
CELERYD_PREFETCH_MULTIPLIER = 1

CELERY_ROUTES = {
    "tasks.check_device": {
        "queue": "checks",
        "routing_key": "check",
    },
    "tasks.deploy_to_device": {
        "queue": "deployments",
        "routing_key": "deployment",
    },
    "tasks.redeploy_to_device": {
        "queue": "re-deployments",
        "routing_key": "re-deployment",
    },
    "tasks.test_rendering": {
        "queue": "rendering",
        "routing_key": "rendering",
    },
    "tasks.device_postcheck": {
        "queue": "postchecks",
        "routing_key": "postcheck",
    }
}

# 3 day result expiry                                                                                                                                                                                               
CELERY_TASK_RESULT_EXPIRES = 259200

start_tasks.py

import celery
import sys
from celery import group

def run():
    from tasks import check_device
    task_list = []
    for x in range(1000):
        task_list.append(check_device.subtask())

    group_task = group(task_list)
    group_task()



sys.exit(
    run()
)

tasks.py

import celery

@celery.task
def check_device():
    return 1+1

celery worker -Q checks --concurrency 20 -E --logfile /tmp/celery --loglevel=DEBUG &
celery worker -Q deployments --concurrency 20 -E --logfile /tmp/celery --loglevel=DEBUG &

@AnonymousHelper
Author

Run that a few times and you'll notice a few workers going into a futex wait state.

@ask
Contributor

ask commented Apr 23, 2013

I've executed 157054 tasks now and have not been able to reproduce it so far.
It seems the wchan field of ps is not supported on OS X, so I'm unable to see whether the process is waiting on a lock (in any case, a futex wait state is normal if another process is currently reading from or writing to the socket).

  • Why are you starting the deployments worker? It never receives any tasks, so is that significant?
  • Why are you using --concurrency=20? 40 processes on a single machine sounds excessive, and is that
    necessary to reproduce the deadlock? (Deadlocks usually reproduce faster with lower concurrency values.)
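(On Linux the kernel wait channel can also be read directly from /proc, without needing ps support for wchan; a small sketch, not from the thread:)

import sys

def wchan(pid):
    # kernel function the process is sleeping in (e.g. futex_wait_queue_me), or '0' if runnable
    with open('/proc/%d/wchan' % pid) as f:
        return f.read().strip()

# usage: python wchan.py <worker pid> [<worker pid> ...]
for pid in map(int, sys.argv[1:]):
    print('%s %s' % (pid, wchan(pid)))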

@AnonymousHelper
Author

I reproduced it with just the 20 processes.

Kernel              Distro        Python
3.2.0-37-generic    Ubuntu 12.04  2.7.2
2.6.18-164.el5      RHEL 5        2.7.2
3.5.0-17-generic    Ubuntu 12.04  2.7.3

Our work is mostly network/IO-bound and not CPU-intensive.

The deployments worker was there to see if it had any impact on the deadlocks. I was able to reproduce without it as well.

One thing I noticed: when the worker pool starts up, one process appears to consume from redis (the consumer), while the others appear to be in select: "select(10, [9], NULL, NULL, {0, 936733}) = 0 (Timeout)".

After running tasks, not all workers issue select calls anymore.

I will double-check whether we can reproduce this on Mac OS X.

@mvikharev

AnonymousHelper, I recently fixed a similar problem in my project.

We also had workers that went into a futex wait state and no longer processed new tasks, and our workload is likewise mostly network/IO-bound rather than CPU-intensive. The problem was that we had no timeouts on network read operations.

You can run lsof -i -p PID to see all network connections opened by the hung worker. In my case there was one connection that had been marked as ESTABLISHED for a long time:

$lsof -i -p 19777
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
....
python 19777 www-data 19u IPv4 73761971 0t0 TCP ***.LOCALHOST.XX:47081->REMOTE.HOST.XX:https (ESTABLISHED)
....

So, when I forcibly closed that network connection (thanks to http://stackoverflow.com/questions/323146/how-to-close-a-file-descriptor-from-another-process-in-unix-systems):

gdb -p 19777
(gdb) p close(19)
$1 = 0
(gdb) c
Continuing.

The worker continued to work, i.e. it kept writing logs; it finished the task and went on to process new ones.

So the issue was fixed by changing

urllib2.urlopen(url, request).read()

to

urllib2.urlopen(url, request, READ_TIMEOUT).read()
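(For completeness, a self-contained sketch of the timed-out variant; the url, payload and READ_TIMEOUT value here are placeholders:)

import socket
import urllib2

READ_TIMEOUT = 30  # seconds -- pick something sensible for the remote service

def fetch(url, payload=None):
    try:
        return urllib2.urlopen(url, payload, READ_TIMEOUT).read()
    except (urllib2.URLError, socket.timeout):
        # the task now fails quickly instead of blocking a pool process forever
        raise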

Hope this helps

@AnonymousHelper
Author

In the example problem, I also observed this behavior. In that case, I don't open a single network connection besides the ones celery opens.

I'll post some more later tonight.

@j0hnsmith
Contributor

@mvikharev #1218 (comment) helped me find my problem, thanks.
