Deadlock with Billiards #1218
+1. Here are my command line arguments: Please let me know if any data can help you debug this problem. |
If you're using 3.0.13 then please upgrade to 3.0.15. This fixes a deadlock that is more likely to occur with autoscale or time limits enabled (but may occur anyway) |
'AnonymousHelper' uses 3.0.13, but I am using 3.0.15 and there is still a deadlock. [2013-03-01 10:55:11,699: ERROR/MainProcess] Process 'PoolWorker-91' pid:12666 exited with exitcode 15. I am not sure whether that is related to this issue. |
I'm not using time limits or auto-scaling. |
@AnonymousHelper I said "more likely to occur", which means it may still occur |
btw, maxtasksperchild also increases the possibility; the aggravating factor is processes that exit often (and need to be replaced) |
@bear330 Are you using a transport other than amqp/redis? If so try running with |
I only use RabbitMQ 2.6.1 (broker) and Redis 2.4.10 (result backend). Actually, this deadlock has been happening since 2.x (I think the first version I tried was about 2.2); it may not be the same cause across versions, but the deadlock is still there. Celery is great work! Thank you a lot; it would be even better if this issue could be solved. |
Please note that user code may also deadlock a worker, and this is the most common cause of deadlock. If all the pool processes execute tasks that never return, you have resource starvation, so the deadlock @bear330 is experiencing may not be related. I would love it if you could create an example project that is able to reproduce the deadlock. |
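As an illustration of the resource-starvation scenario described above, here is a minimal hypothetical sketch (not code from this project; the host name and task are made up) of a task that never returns. If every pool process picks up such a task, the worker looks deadlocked even though nothing in Celery itself is stuck:

import socket

import celery

@celery.task
def fetch_status():
    # Hypothetical task: connects to a device and reads a reply.
    sock = socket.create_connection(('device.example.com', 5000))
    # No timeout is set: if the peer never answers, recv() blocks forever
    # and this pool process is never returned to the pool.
    return sock.recv(4096)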
Here is my gdb output for all the deadlocked workers: PID - 13295:
#0 0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
args=<value optimized out>, kwds=<value optimized out>)
at Modules/_billiard/semaphore.c:312
#2 0x00007feb26132b24 in call_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3794
#3 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#4 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=0x2e088e8, kwcount=0, defs=
0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5 0x00007feb26132be4 in fast_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3890
#6 call_function (f=<value optimized out>, throwflag=<value optimized out>)
at Python/ceval.c:3815
#7 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#8 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=0x2e02030, kwcount=0, defs=
0x0, defcount=0, closure=
(<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))
PID - 12688:
#0 0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
args=<value optimized out>, kwds=<value optimized out>)
at Modules/_billiard/semaphore.c:312
#2 0x00007feb26132b24 in call_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3794
#3 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#4 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=0x2e49bb8, kwcount=0, defs=
0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5 0x00007feb26132be4 in fast_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3890
#6 call_function (f=<value optimized out>, throwflag=<value optimized out>)
at Python/ceval.c:3815
#7 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#8 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=0x2e499f0, kwcount=0, defs=
0x0, defcount=0, closure=
(<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))
PID - 12479:
#0 0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
args=<value optimized out>, kwds=<value optimized out>)
at Modules/_billiard/semaphore.c:312
#2 0x00007feb26132b24 in call_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3794
#3 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#4 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=0x2d75f08, kwcount=0, defs=
0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5 0x00007feb26132be4 in fast_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3890
#6 call_function (f=<value optimized out>, throwflag=<value optimized out>)
at Python/ceval.c:3815
#7 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#8 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=0x2ec42c0, kwcount=0, defs=
0x0, defcount=0, closure=
(<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))
PID - 12159:
#0 0x00007feb25e44720 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007feb1a6d0421 in Billiard_semlock_acquire (self=0x2bc3768,
args=<value optimized out>, kwds=<value optimized out>)
at Modules/_billiard/semaphore.c:312
#2 0x00007feb26132b24 in call_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3794
#3 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#4 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bbd990,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=0x2c8fef8, kwcount=0, defs=
0x0, defcount=0, closure=0x0) at Python/ceval.c:3044
#5 0x00007feb26132be4 in fast_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3890
#6 call_function (f=<value optimized out>, throwflag=<value optimized out>)
at Python/ceval.c:3815
#7 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#8 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x2bc07b0,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=0x2dd88c0, kwcount=0, defs=
0x0, defcount=0, closure=
(<cell at remote 0x2bc3910>, <cell at remote 0x2bc38d8>))
PID - 64742:
#0 0x00007feb254f9d03 in select () from /lib64/libc.so.6
#1 0x00007feb1eb6b219 in floatsleep (self=<value optimized out>,
args=<value optimized out>) at Modules/timemodule.c:910
#2 time_sleep (self=<value optimized out>, args=<value optimized out>)
at Modules/timemodule.c:206
#3 0x00007feb26132b24 in call_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3794
#4 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#5 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x7feb26461300,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=2, kws=0x7feb10001380, kwcount=0,
defs=0x7feb264740e8, defcount=1, closure=0x0) at Python/ceval.c:3044
#6 0x00007feb26132be4 in fast_function (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:3890
#7 call_function (f=<value optimized out>, throwflag=<value optimized out>)
at Python/ceval.c:3815
#8 PyEval_EvalFrameEx (f=<value optimized out>,
throwflag=<value optimized out>) at Python/ceval.c:2453
#9 0x00007feb26134797 in PyEval_EvalCodeEx (co=0x26e0c60,
globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=0x7feb10001118, kwcount=1,
defs=0x26e29b0, defcount=2, closure=0x0) at Python/ceval.c:3044
The stack shows the processes were blocked in Billiard_semlock_acquire, so I think the deadlock is not caused by my code. But I had never thought about that; I will try it! Thank you! If there are still problems, I will try to put together a small example project for debugging. Thanks. |
Hey, could you please try running the worker with rate limits disabled? ( |
Turned off rate limits and increased max tasks per child to 10. Still seeing a large number of workers deadlocked. |
Why did you increase maxtasksperchild? It would be better if you tried disabling it, as I said earlier. You can try adding a time limit to see if the workers are stuck doing something. |
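For reference, a sketch of what disabling rate limits and adding time limits can look like with Celery 3.0-era setting names (the numeric values are arbitrary examples):

# celeryconfig.py -- illustrative values only
CELERY_DISABLE_RATE_LIMITS = True    # rate limits off, as suggested above

# Hard limit: the pool terminates and replaces a child that runs longer than
# this, so a stuck task shows up in the logs instead of hanging silently.
CELERYD_TASK_TIME_LIMIT = 300        # seconds

# Soft limit: raises SoftTimeLimitExceeded inside the task first,
# giving it a chance to clean up before the hard limit hits.
CELERYD_TASK_SOFT_TIME_LIMIT = 240   # seconds

The equivalent worker command-line options are --time-limit and --soft-time-limit.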
Patch celery/billiard@94f5623 seems to fix a very similar issue (#1266), |
Fixed in billiard 2.7.3.26 |
This is the pystack from the deadlocked workers: lib/python2.7/site-packages/billiard/synchronize.py (95): enter |
Btw, the patch above only fixed it for SIGTERM; it was later extended to cover other termination signals, but it's still not fixed for SIGKILL (if the process is killed while holding the semaphore)
It seems the fix was later disabled by another change; fixed again in 2.7.3.28 (now with tests)
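To illustrate the failure mode described above (a standalone sketch, not Celery or billiard internals): a process killed with SIGKILL while it holds a shared POSIX semaphore never releases it, so every other process that later waits on that semaphore blocks in sem_wait() indefinitely, which is the state the gdb traces show.

import os
import signal
import time
import multiprocessing

sem = multiprocessing.Semaphore(1)

def hold_and_die():
    sem.acquire()      # take the only token
    time.sleep(60)     # "working"; killed before it can release

def waiter():
    sem.acquire()      # blocks in sem_wait(); the token never comes back
    sem.release()

if __name__ == '__main__':
    holder = multiprocessing.Process(target=hold_and_die)
    holder.start()
    time.sleep(1)                           # let it acquire the semaphore
    os.kill(holder.pid, signal.SIGKILL)     # the semaphore stays acquired
    w = multiprocessing.Process(target=waiter)
    w.start()
    w.join(5)
    print('waiter still blocked: %s' % w.is_alive())   # True: effectively deadlocked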
Deployed the latest celery and billiard. |
Since you're using maxtasks this could be a duplicate of #1310. You can try using the instructions in the comment there to try out the fix: |
I removed max tasks a few days ago. |
Ok, if the worker child processes never exit/terminate then it's unlikely #1310 will do anything. In that case you need to come up with an example that reproduces the deadlock. |
So the entire fleet isn't deadlocked. However, since tasks are pre-fetched and assigned to "deadlocked" workers, they do not complete. What could be a mitigation for that scenario? I'll attempt to recreate this issue tonight. |
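One possible mitigation (my suggestion, not something confirmed in this thread): keep the prefetch multiplier at 1 and enable late acknowledgements, so a task is only acked after it finishes and the broker can redeliver unacked tasks if the worker holding them dies. This only makes sense if the tasks are idempotent.

# celeryconfig.py -- sketch using Celery 3.0-style setting names
CELERYD_PREFETCH_MULTIPLIER = 1   # do not reserve extra tasks per process
CELERY_ACKS_LATE = True           # ack only after the task completes, so an
                                  # unacked task can be redelivered elsewhere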
Reproduced the bug. Will post shortly. |
celeryconfig.py:
BROKER_URL = 'redis://localhost:6379/1'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'
CELERY_IMPORTS = ('tasks',)
CELERY_ENABLE_UTC = True
CELERY_QUEUES = {
"default": {
"binding_key": "default",
"exchange": "default",
},
"checks": {
"binding_key": "check",
"exchange": "direct",
},
"deployments": {
"binding_key": "deployment",
"exchange": "direct",
},
"re-deployments": {
"binding_key": "re-deployment",
"exchange": "direct",
},
"postchecks": {
"binding_key": "postcheck",
"exchange": "direct",
},
"rendering": {
"binding_key": "rendering",
"exchange": "direct",
}
}
CELERY_DEFAULT_QUEUE = "default"
CELERY_DEFAULT_EXCHANGE_TYPE = "direct"
CELERY_DEFAULT_ROUTING_KEY = "default"
CELERY_DISABLE_RATE_LIMITS = True
CELERYD_PREFETCH_MULTIPLIER = 1
CELERY_ROUTES = {
"tasks.check_device": {
"queue": "checks",
"routing_key": "check",
},
"tasks.deploy_to_device": {
"queue": "deployments",
"routing_key": "deployment",
},
"tasks.redeploy_to_device": {
"queue": "re-deployments",
"routing_key": "re-deployment",
},
"tasks.test_rendering": {
"queue": "rendering",
"routing_key": "rendering",
},
"tasks.device_postcheck": {
"queue": "postchecks",
"routing_key": "postcheck",
}
}
# 3 day result expiry
CELERY_TASK_RESULT_EXPIRES = 259200

start_tasks.py:
import celery
import sys
from celery import group

def run():
    from tasks import check_device
    task_list = []
    for x in range(1000):
        task_list.append(check_device.subtask())
    group_task = group(task_list)
    group_task()

sys.exit(
    run()
)

tasks.py:
import celery

@celery.task
def check_device():
    return 1+1

Worker command: celery worker -Q checks --concurrency 20 -E --logfile /tmp/celery --loglevel=DEBUG & |
Run that a few times, and you'll notice a few workers start going into a futex wait state. |
I've executed 157054 tasks now and have not been able to reproduce it so far. |
I reproduced it with just the 20 processes. Kernel Distro Python. Our work is mostly network-bound/IO-bound and not CPU intensive. The deployment worker was there to see if that had an impact on deadlocks; I was able to reproduce without it as well. One thing I noted: when starting up the worker pool, one process appears to consume from Redis (the consumer), while the others appear to be in select: "select(10, [9], NULL, NULL, {0, 936733}) = 0 (Timeout)". After running tasks, not all workers run select calls anymore. Will double-check whether we can reproduce this on Mac OS X. |
AnonymousHelper, I recently fixed a similar problem in my project. We also had workers that went into a futex wait state and no longer processed new tasks, and our workload was likewise mostly network-bound/IO-bound and not CPU intensive. The problem was that we had no timeouts on network read operations. You can run lsof -i -p PID to see all network connections opened by the hung worker. In my case there was one connection that had been marked as established for a long time:
So, when I forcibly closed that network connection (thanks to http://stackoverflow.com/questions/323146/how-to-close-a-file-descriptor-from-another-process-in-unix-systems):
The worker continued to work, i.e. continued to write logs. It finished the task and went on to process new ones. So the issue was fixed by changing urllib2.urlopen(url, request).read() to urllib2.urlopen(url, request, READ_TIMEOUT).read(). Hope this helps |
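For completeness, the change being described boils down to passing the optional timeout argument to urllib2.urlopen (the URL and timeout value below are placeholders):

import urllib2

READ_TIMEOUT = 30                            # seconds; pick what suits your calls
url = 'http://device.example.com/status'     # placeholder URL
request = None                               # or an encoded POST body

# Without a timeout, read() can block forever if the remote end goes silent:
#     urllib2.urlopen(url, request).read()
# With a timeout, the call raises instead of hanging the pool process:
data = urllib2.urlopen(url, request, READ_TIMEOUT).read()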
In the example problem, I also observed this behavior. In that case, I don't open a single network connection besides the ones celery opens. I'll post some more later tonight. |
@mvikharev #1218 (comment) helped me find my problem, thanks. |
Celery 3.0.13
Billiard 2.7.3.19
Workers consuming 10,000 tasks a day.
We kill a worker after it executes 1 task (maxtasksperchild = 1).
Over time, we end up with deadlocked workers.
Can't get a pystack for you.
The stack trace shows that the deadlock happens while billiard is trying to acquire a lock.
#0 0x0000003cff40c0ed in sem_wait () from /lib64/libpthread.so.0
#1 0x00002b180708129a in Billiard_semlock_acquire (self=0x3d165f9340, args=, kwds=) at Modules/_billiard/semaphore.c:312
Strace shows the processes in a futex wait state.
What sort of data can help you debug this problem?