Celery Hanging with FUTEX_WAIT_PRIVATE #2429

Closed
maxsmythe opened this Issue Dec 20, 2014 · 20 comments

maxsmythe commented Dec 20, 2014

We are encountering a problem running Celery 3.1.9.

After running for a seemingly random amount of time, Celery will hang. There is nothing obviously wrong in the logs; the worker will continue to queue jobs locally, but they will not be processed.

After running strace, we found that all worker processes were hanging with the same message:

sudo strace -p 25389
Process 25389 attached - interrupt to quit
futex(0x44fb218, FUTEX_WAIT_PRIVATE, 2, NULL
Process 25389 detached
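
A loop like the following can spot-check every worker at once (a rough sketch; the pgrep pattern is an assumption about how the workers show up in the process list, and strace prints to stderr):

for pid in $(pgrep -f 'celery.*worker'); do
    echo "== worker $pid =="
    sudo timeout 2 strace -p "$pid" 2>&1 | tail -n 2
done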

This looks similar to #572

We were going to try enabling CELERYD_FORCE_EXECV, but that looks to be contraindicated:
http://celery.readthedocs.org/en/latest/changelog.html

After further research, it looks like you have addressed this bug by using async IO for consumer communication as of 3.1.0. We do appear to be using the new async prefork pool: RabbitMQ is our broker and Worker.use_eventloop is True.

Do you know why we might still be experiencing deadlocks even with the new event loop? Here are the options with which we are launching celery workers:

celery -A app worker -l info -E -Q default -Ofair --concurrency=5 --hostname=%n.ourworker.com --maxtasksperchild=50

leotsem commented Feb 4, 2015

I'm getting the same behaviour on 3.1.17 with a redis broker.

futex(0x2ccdc58, FUTEX_WAIT_PRIVATE, 2, NULL

Celery hangs after running for a random amount of time; jobs are queued locally but they are not processed. Nothing wrong in the logs.
They only resume after killing the hung process with kill -9.
Here's the command used to start the workers:

celery -A appname worker -B --loglevel=INFO

Any ideas on how to further debug this?

maxsmythe commented Feb 11, 2015

We were able to determine that the problem was not in Celery; it was a lock conflict in our system's SSL libraries.

The easiest way we found to debug the issue was to install a debug version of Python:

https://wiki.python.org/moin/DebuggingWithGdb

Then generate a venv using the debug executable; this allows Python to be analyzed with gdb.

Once you find a hanging process, you can attach to it in gdb using this command:

gdb <python executable with debug enabled> <pid of running process>
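
On Debian/Ubuntu the whole sequence looks roughly like this (a sketch; the package names and paths are assumptions for that distro, adjust for yours):

# debug build of the interpreter plus gdb
sudo apt-get install gdb python2.7-dbg

# venv on top of the debug interpreter; run the worker from it
virtualenv -p /usr/bin/python2.7-dbg celery-debug
. celery-debug/bin/activate
pip install -r requirements.txt

# once a worker hangs, attach to it
gdb /usr/bin/python2.7-dbg <pid of running process>

With the debug build, the python-gdb helpers described on that wiki page (py-bt, py-list) can show the Python-level frames in addition to the C frames.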

derega commented Mar 6, 2015

Similar problem here. Tasks might run fine for an hour, or a few days, and then suddenly everything stops. Below are stack traces from the running processes as they were at the time this happened. These are from production, so debug symbols weren't available. Next time with those :)

Even without debug symbols it can be seen that:

  • one thread is connecting to the PostgreSQL database and stops in a pthread mutex lock wait (FUTEX_WAIT_PRIVATE)
  • two threads are trying to receive data from a PostgreSQL command
  • one thread is in select()

Versions

python 2.7.3
django-celery 3.1.1
celery 3.1.17
amqp 1.4.6
kombu 3.0.24
librabbitmq 1.6.1

Command

python bin/django celery worker --maxtasksperchild=1000 -l info --autoscale=3,1 -B --settings=project.production

Quick look with strace

There are four processes running, and this is what strace tells us about them.

# strace -p 10853
Process 10853 attached - interrupt to quit
futex(0x49e6b78, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>
Process 10853 detached
# strace -p 10850
Process 10850 attached - interrupt to quit
read(10, ^C <unfinished ...>
Process 10850 detached
# strace -p 10804
Process 10804 attached - interrupt to quit
read(14, ^C <unfinished ...>
Process 10804 detached
# strace -p 6405
Process 6405 attached - interrupt to quit
select(0, NULL, NULL, NULL, {193, 283439}^C <unfinished ...>
Process 6405 detached

Process 10853

#0  0x00007ffa30ab089c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffa30aac065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffa30aabeba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffa2c138300 in ?? () from /usr/lib/libpq.so.5
#4  0x00007ffa301889c9 in X509_STORE_get_by_subject () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5  0x00007ffa3018924b in X509_STORE_CTX_get1_issuer () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x00007ffa30184edf in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x00007ffa304788d8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#8  0x00007ffa30456023 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#9  0x00007ffa3045a402 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x00007ffa2c1390db in ?? () from /usr/lib/libpq.so.5
#11 0x00007ffa2c129cce in PQconnectPoll () from /usr/lib/libpq.so.5
#12 0x00007ffa2c12ae9e in ?? () from /usr/lib/libpq.so.5
#13 0x00007ffa2c12b87f in PQconnectdb () from /usr/lib/libpq.so.5
#14 0x00007ffa2c358859 in ?? () from /usr/lib/python2.7/dist-packages/psycopg2/_psycopg.so
#15 0x00007ffa2c359cfa in ?? () from /usr/lib/python2.7/dist-packages/psycopg2/_psycopg.so
#16 0x00000000004df6d6 in ?? ()
#17 0x00000000004d91b6 in PyObject_Call ()
#18 0x000000000042cec5 in _PyObject_CallFunction_SizeT ()
#19 0x00007ffa2c353d62 in ?? () from /usr/lib/python2.7/dist-packages/psycopg2/_psycopg.so

Process 10850

#0  0x00007ffa30ab0d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffa280a82b6 in read (__nbytes=4, __buf=0x7fffefee089c, __fd=10) at /usr/include/x86_64-linux-gnu/bits/unistd.h:45
#2  _Billiard_conn_recvall (h=10, buffer=<optimized out>, length=4) at Modules/_billiard/socket_connection.c:76
#3  0x00007ffa280a8811 in Billiard_conn_recv_string (
    buffer=0x45aadb0 "\200\002K\002(M\225iNccelery.app.trace\n_fast_trace_task\nq\001(U#dreammoodle.users.tasks.update_userq\002U$27620fc7-9936-4256-afde-a67c17c3848bq\003cdjango.db.models.base\nmodel_unpickle\nq\004cdreamuserdb.models\nUser\nq\005]cdjang"..., buflength=131072,
    newbuffer=0x7fffefee08f8, maxlength=9223372036854775807, conn=<optimized out>) at Modules/_billiard/socket_connection.c:149
#4  0x00007ffa280a892c in Billiard_connection_recv_payload (self=0x45aad90) at Modules/_billiard/connection.h:424

Process 10804

#0  0x00007ffa30ab0d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffa280a82b6 in read (__nbytes=4, __buf=0x7fffefee089c, __fd=14) at /usr/include/x86_64-linux-gnu/bits/unistd.h:45
#2  _Billiard_conn_recvall (h=14, buffer=<optimized out>, length=4) at Modules/_billiard/socket_connection.c:76
#3  0x00007ffa280a8811 in Billiard_conn_recv_string (
    buffer=0x45edd60 "\200\002K\002(M\226iNccelery.app.trace\n_fast_trace_task\nq\001(U#dreammoodle.users.tasks.update_userq\002U$aabce684-28bf-4ba5-82b9-4befe98cf291q\003cdjango.db.models.base\nmodel_unpickle\nq\004cdreamuserdb.models\nUser\nq\005]cdjang"..., buflength=131072,
    newbuffer=0x7fffefee08f8, maxlength=9223372036854775807, conn=<optimized out>) at Modules/_billiard/socket_connection.c:149
#4  0x00007ffa280a892c in Billiard_connection_recv_payload (self=0x45edd40) at Modules/_billiard/connection.h:424

Process 6405

#0  0x00007ffa2f87cc03 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000561023 in ?? ()
#2  0x000000000054bb14 in PyEval_EvalFrameEx ()
#3  0x0000000000575d92 in PyEval_EvalCodeEx ()
#4  0x000000000054c028 in PyEval_EvalFrameEx ()
#5  0x0000000000575d92 in PyEval_EvalCodeEx ()

domenkozar commented Jul 13, 2015

@derega did you ever figure out the cause? We have very similar stacktraces that hang.

akuchling commented Jul 13, 2015

Here's one of our examples:

Versions:

python 2.7.5
celery[redis]==3.1.17
amqp==1.4.6
kombu==3.0.24
billiard==3.3.0.19

Strace results:

Process 1960 (this process shows some activity, exiting the epoll_wait()
every few seconds to do stuff):

epoll_ctl(9, EPOLL_CTL_ADD, 38, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=38, u64=17750533311001788454}}) = -1 EEXIST (File exists)
epoll_ctl(9, EPOLL_CTL_ADD, 39, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=39, u64=17750533311001788455}}) = -1 EEXIST (File exists)
gettimeofday({1436825712, 75017}, NULL) = 0
epoll_wait(9, ...

Process 2014 attached
read(23, ...

Process 2019 attached
futex(0x4983c78, FUTEX_WAIT_PRIVATE, 2, NULL ...

Process 2020 attached
read(15, ...

Process 19010 attached
read(19, ...

maxsmythe commented Jul 13, 2015

@derega If you install the debug version of Python and run gdb against it per my original post, you'll get more descriptive debug symbols, instead of just PyEval_EvalFrameEx (), etc.

Process 10853 appears to be connecting with Postgres (psycopg2).

Also note that it's not only the processes that are interesting, but threads as well. It looks like you may have 2 different celery manager processes running, each managing a single worker process. If that's the case, I wouldn't expect any interaction between the two processes as far as blocking goes.

Can you guys try running gdb per these instructions?

https://wiki.python.org/moin/DebuggingWithGdb

And look to see if the process has child threads? In our case it was two child threads deadlocking each other.

maxsmythe commented Jul 13, 2015

It's also worth noting that you can still use gdb without the debug symbols; it will just be harder to tell what Python is doing. But if the locks are occurring in an underlying C library, you might still get some good info.

The key here is that gdb will let you see the stack trace per thread.

You can see what threads are running via:

info thread

And switch context to a different thread to get that thread's backtrace via:

thread <ID #>
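
Put together, a session looks something like this (illustrative only; the pid and thread number are placeholders):

gdb <python executable> <pid>
(gdb) info threads
(gdb) thread 2
(gdb) bt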

domenkozar commented Jul 21, 2015

I think this is related to #1770; I'll report if it happens again.

tuky commented Jul 24, 2015

@maxsmythe May I ask you for details about what was wrong with your SSL libs and how you resolved these issues?

maxsmythe commented Jul 25, 2015

Sure. There were a number of places that were blocking. For us, we simply avoided creating a separate thread for the time being. Launching a new process or sending a new celery task is not ideal, but it does prevent thread locks. Here are some of the lock points (note that these tracebacks are very old by now):

#4 0x00007f7802c1622e in CRYPTO_lock (mode=9, type=18, file=0x7f7802d6b627 "md_rand.c", line=384) at cryptlib.c:604

#4 0x00007f7802c1622e in CRYPTO_lock (mode=9, type=18, file=0x7f7802d6b627 "md_rand.c", line=228) at cryptlib.c:604

#4 0x00007fcb43cd322e in CRYPTO_lock (mode=9, type=2, file=0x7fcb43e0d068 "ex_data.c", line=304) at cryptlib.c:604

#4 0x00007fcb43cd322e in CRYPTO_lock (mode=9, type=10, file=0x7fcb43e2a3b0 "x_pubkey.c", line=175) at cryptlib.c:604

#4 0x00007f7802c1622e in CRYPTO_lock (mode=9, type=11, file=0x7f7802d712a7 "x509_lu.c", line=349) at cryptlib.c:604

#4 0x00007f7802c1622e in CRYPTO_lock (mode=9, type=11, file=0x7f7802d712a7 "x509_lu.c", line=297) at cryptlib.c:604

The package being traced is OpenSSL.

maxsmythe commented Jul 25, 2015

Actually, launching a new process might not do it; the lock state could be copied.

maxsmythe commented Jul 29, 2015

After digging in a little deeper, it turns out the problem is with a callback that libpq (the Postgres C client library) registers with OpenSSL.

The hints were here (from gdb):

(gdb) bt
#0  0x00007f11f1d6d89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f11f1d69065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f11f1d68eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f11ea058e00 in pq_lockingcallback (mode=<optimized out>, n=<optimized out>, file=<optimized out>, line=<optimized out>) at fe-secure.c:872
#4  0x00007f11f1445c19 in X509_STORE_get_by_subject () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5  0x00007f11f144649b in X509_STORE_CTX_get1_issuer () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x00007f11f144212f in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x00007f11f1735aa8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#8  0x00007f11f1713053 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#9  0x00007f11f1717582 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x00007f11f17204f3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0

(gdb) frame 2
#2  0x00007f11f1d68eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) p *(pthread_mutex_t *) $r8
$3 = {__data = {__lock = 2, __count = 0, __owner = 13178, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000z3\000\000\001", '\000' <repeats 26 times>, __align = 2}
(gdb) info thread
  Id   Target Id         Frame 
  3    Thread 0x7f11e71b9700 (LWP 13176) "celery" 0x00007f11f1d6d89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 2    Thread 0x7f11e69b8700 (LWP 13178) "celery" 0x00007f11f1d6d89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  1    Thread 0x7f11f2191700 (LWP 13174) "celery" 0x00007f11f1d6d89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0

^^^ above steps taken from https://en.wikibooks.org/wiki/Linux_Applications_Debugging_Techniques/Deadlocks

Seeing that the thread was blocked on its own lock (the blocked thread is LWP 13178, and the lock's __owner is 13178) was weird. Researching how this could happen led me to these two pages:

psycopg/psycopg2#290
http://www.postgresql.org/message-id/871tlzrlkq.fsf@wulczer.org

I'm hopeful upgrading our version of psycopg2 will fix the problem, but only time will tell.
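
If anyone wants to check where they stand, something like this shows the installed version and pulls in the fixed release (a sketch; adjust for your virtualenv/pip setup):

python -c "import psycopg2; print(psycopg2.__version__)"
pip install --upgrade "psycopg2>=2.6.1"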

tuky commented Jul 30, 2015

Thank you for your input. I am specifically interested in this matter because our workers constantly hang when using third-party apps that use the requests package for SSL-protected API calls. Of course, that is not related to psql; that is why I suspect the SSL libraries. Sadly, I often have problems getting a trace out of gdb, and the problem is not very reproducible (after forcefully restarting the worker, the API calls typically work).

maxsmythe commented Jul 30, 2015

No problem!

Are you using Postgres in any way? Because that's exactly what was happening with us. Postgres registers a locking callback with OpenSSL, which all SSL requests use. Unfortunately, Postgres unloads the callback after closing its connections; that is fine for Postgres, but any other SSL consumer (requests included) is then unable to release the lock, allowing a thread to deadlock itself if it tries to reacquire its own lock.

In short: Postgres with psycopg2 < 2.6.1, threading, and any other SSL consumer => possible deadlock.
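
Spelled out as code, the failing shape is roughly this (a sketch of the pattern, not a guaranteed reproducer; the connection string and URL are placeholders):

import threading

import psycopg2   # < 2.6.1, built against an SSL-enabled libpq
import requests   # stands in for any other OpenSSL consumer

def touch_postgres():
    # libpq registers OpenSSL locking callbacks for its SSL connection and
    # unloads them again when the connection closes (worked around in psycopg2 2.6.1)
    conn = psycopg2.connect("host=db.example.com sslmode=require dbname=app user=app")
    conn.close()

def call_https_api():
    # this thread also goes through OpenSSL; if it races the connect/close above,
    # a lock can be taken with no callback left to release it, and the thread
    # ends up waiting forever on a mutex it already owns (FUTEX_WAIT_PRIVATE)
    requests.get("https://api.example.com/health")

threads = [threading.Thread(target=touch_postgres),
           threading.Thread(target=call_https_api)]
for t in threads:
    t.start()
for t in threads:
    t.join()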

tuky commented Jul 30, 2015

Unfortunately we don't use Postgres at all.

maxsmythe commented Jul 30, 2015

Closing, as this is both resolved for me and has nothing to do with Celery.

domenkozar commented Aug 10, 2015

@maxsmythe I think I'm seeing the same issue now:

(gdb) info threads
  Id   Target Id         Frame 
  2    Thread 0x7fefefd41700 (LWP 771) "celery" 0x00007ff002c40f7d in __lll_lock_wait () from /lib64/libpthread.so.0
* 1    Thread 0x7ff003427740 (LWP 765) "celery" 0x00007ff002c40f7d in __lll_lock_wait () from /lib64/libpthread.so.0

How did you resolve it?

maxsmythe commented Aug 10, 2015

Those two threads are definitely waiting for locks. Beyond that I can't say if it's exactly the same situation, as libpthread is a common Linux library invoked by many different packages. If you're interested in my specific resolution, please read the thread; I mention upgrading psycopg2 and why that worked for me.

To figure out if your situation is different, you need more information. Those instructions are also in this thread. I would highlight this link as the most helpful:

https://en.wikibooks.org/wiki/Linux_Applications_Debugging_Techniques/Deadlocks

At the very least you need to run a backtrace (bt in gdb) to figure out what is calling __lll_lock_wait.
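
The quickest way to capture that for every thread in one go:

gdb <python executable> <pid>
(gdb) thread apply all bt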

domenkozar commented Aug 10, 2015

It's the PostgreSQL/SSL trace; I'll try upgrading to psycopg2 2.6.1.

@maxsmythe

This comment has been minimized.

maxsmythe commented Aug 10, 2015

Awesome, that should do it. It shows up whenever you have Postgres (with SSL) in one thread, and an HTTPS connection in another.

I can confirm that 2.6.1 fixed the problem for us.
