Celery Node locking up and no more tasks are being processed! #1352

Closed
myusuf3 opened this Issue · 28 comments

5 participants

@myusuf3

I have successfully caused this to occur on an Ubuntu machine; it is reproducible.

  • I don't have CELERYD_MAX_TASKS_PER_CHILD set.
  • I am setting BROKER_POOL_LIMIT, though (a minimal config sketch is below).
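
For context, here is a minimal sketch of the relevant config; the setting names are the standard Celery 3.0 options mentioned above, but the values and broker URL are illustrative, not my exact setup:

    # celeryconfig.py -- illustrative sketch only; the values (and the broker URL)
    # are examples, not my actual configuration.
    BROKER_URL = "amqp://guest@127.0.0.1:5672//"   # assumed AMQP broker

    # Left unset in my case, so pool processes are never recycled:
    # CELERYD_MAX_TASKS_PER_CHILD = 100

    # This one I do set, capping the size of the broker connection pool:
    BROKER_POOL_LIMIT = 10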

I ran strace on one of the child processes:

write(8, "\27\3\1\0 \256f\215T\355k|\267\2538\250\0m*\4\246\2\36]\10\342\236`\r\255\261\340"..., 90) = 90
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=8, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(8, "\27\3\1\0 ", 5)                = 5
read(8, "/\332\324\212\202\357\323\371\\Lmc\324\236\222:\201\265\330\32\370x\211\375snKI\275\253\242\f", 32) = 32
read(8, "\27\3\1\0000", 5)              = 5
read(8, "\221\210\355\312\227R\215\3742\24\\\367^\366C\203\2{\377:\30\257\310^\343\302\203\257\231?A\36"..., 48) = 48
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0xb7769000, FUTEX_WAIT, 0, NULL

It seems to be locking up and waiting (for what, I don't know; nothing my tasks do should cause this). This issue is very similar to #1009, #896, and #1218.

Here are the versions of everything I am using.

billiard==2.7.3.28
celery==3.0.19
django-celery==3.0.17
kombu==2.5.10
@myusuf3

Here is some further debug information, gathered while debugging with @ionelmc:

lsof
strace
lsof-grep

@fellowshipofone

I have a similar issue that could be the same bug. This has been happening every day for the past 5 days, since we increased the load. We process 1,500,000 messages per day across 6 AWS nodes, and all of them stop.

I am not sure what the best way is to try to debug it. I have now set the following 3 parameters that were not set before:

CELERYD_TASK_TIME_LIMIT = None
CELERYD_TASK_SOFT_TIME_LIMIT = None
CELERYD_MAX_TASKS_PER_CHILD = 100

We are running with the following versions:

billiard==2.7.3.28
celery==3.0.19
django-celery==3.0.17
kombu==2.5.10
@fellowshipofone

Here I am 9h later with more details.

All 6 nodes have stopped processing tasks. The main process strace returns the following (looping):

select(5, [4], [], [], {1, 0})          = 0 (Timeout)
gettimeofday({1368426330, 181340}, NULL) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
gettimeofday({1368426330, 181806}, NULL) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
gettimeofday({1368426330, 182187}, NULL) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
gettimeofday({1368426330, 182488}, NULL) = 0
fstat(10, {st_mode=S_IFREG, st_size=0, ...}) = 0
wait4(-1, 0x7fff9df4b120, WNOHANG, NULL) = 0

And celeryctl inspect active returns empty although there are 250k messages waiting:

-> w1.prod1.eu: OK
    - empty -
-> w1.prod2.eu: OK
    - empty -
-> w1.prod5.eu: OK
    - empty -
-> w1.prod3.eu: OK
    - empty -
-> w1.prod6.eu: OK
    - empty -
-> w1.prod4.eu: OK
    - empty -
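
For what it's worth, the same check can also be run from Python; this is only a sketch and assumes the Celery/django-celery settings are already importable on the node:

    # Programmatic equivalent of `celeryctl inspect active` (sketch).
    from celery import current_app

    insp = current_app.control.inspect(timeout=5)   # reply timeout in seconds
    active = insp.active()   # {worker_name: [task info, ...]}, or None if nobody replied
    if not active:
        print("No workers replied in time")
    else:
        for worker, tasks in active.items():
            print("%s -> %d active tasks" % (worker, len(tasks)))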
@myusuf3

@fellowshipofone the way mine presents is that when I run celeryctl inspect active, it says no nodes responded in time.

@JeffPaine

+1, I'm having a very similar, equally confusing issue.

@ask
Owner

This is probably related to the issue I'm working on. The master branch already contains a fix, but it's too complex to include in a 3.0.x patch release. I would be happy if you tried out the master branch to see if the issue still persists there, but I understand that this can be difficult in production. I'm available if you need instructions or help trying out the development version.

@fellowshipofone

Thanks! I tried to run this on one of our nodes but it did not reconnect with the other ones. Maybe I did not install it properly, or do I have to upgrade all nodes?

@kmt

Having the same issue here. Tested with versions 3.0.16, 3.0.17, 3.0.18, and the current code in master, and they all exhibit the issue every now and then. Trying to find a pattern...

@myusuf3

@fellowshipofone what do you mean by reconnect? You should be able to upgrade one node without issue, I believe.

There is one known issue, though:

[11:58:20] asklol : I found a bug now that I need to fix, it will lock if the connection is lost

So @ask just found a bug where the worker locks up if it can't connect; that might be what you're seeing.

Installing from master can be risky; I am waiting for @ask to give me relatively stable sha1s (commits) for all the needed dependencies, so we can use those until celery 3.1 is released.

That way I can use it in production; the issue I mentioned above has indeed been solved by the new code.

@fellowshipofone

@myusuf3 I mean the node does not get any tasks and shuts down immediately.

I am sure this is not the kind of thing that is safe in production, but I'd love to get to the bottom of this.

I am running this on one node only. It seems it is not "compatible" with pre-3.1 nodes, but I am not sure why:

pip install https://github.com/celery/django-celery/zipball/master#egg=django-celery --upgrade
pip install https://github.com/celery/celery/zipball/master#egg=celery --upgrade
pip install https://github.com/celery/billiard/zipball/master#egg=billiard --upgrade
pip install https://github.com/celery/kombu/zipball/master#egg=kombu --upgrade
pip install https://github.com/celery/py-amqp/zipball/master#egg=amqp --upgrade

When restarting celeryd:

[2013-05-16 13:50:00,707: DEBUG/MainProcess] | Worker: Preparing bootsteps.
[2013-05-16 13:50:00,710: DEBUG/MainProcess] | Worker: Building graph...
[2013-05-16 13:50:00,710: DEBUG/MainProcess] | Worker: New boot order: {StateDB, Beat, Hub, Queues (intra), Pool, Autoscaler, Timer, Autoreloader, Consumer}
[2013-05-16 13:50:00,715: DEBUG/MainProcess] | Consumer: Preparing bootsteps.
[2013-05-16 13:50:00,715: DEBUG/MainProcess] | Consumer: Building graph...
[2013-05-16 13:50:00,717: DEBUG/MainProcess] | Consumer: New boot order: {Connection, Mingle, Events, Control, Tasks, Agent, Heart, Gossip, event loop}
[2013-05-16 13:50:00,719: DEBUG/MainProcess] | Worker: Starting Beat
[2013-05-16 13:50:00,739: DEBUG/MainProcess] ^-- substep ok
[2013-05-16 13:50:00,741: DEBUG/MainProcess] | Worker: Starting Hub
[2013-05-16 13:50:00,741: DEBUG/MainProcess] ^-- substep ok
[2013-05-16 13:50:00,742: DEBUG/MainProcess] | Worker: Starting Pool
[2013-05-16 13:50:00,746: INFO/Beat] beat: Starting...
[2013-05-16 13:50:00,824: DEBUG/MainProcess] ^-- substep ok
[2013-05-16 13:50:00,825: DEBUG/MainProcess] | Worker: Starting Autoscaler
[2013-05-16 13:50:00,825: DEBUG/MainProcess] ^-- substep ok
[2013-05-16 13:50:00,825: DEBUG/MainProcess] | Worker: Starting Consumer
[2013-05-16 13:50:00,826: DEBUG/MainProcess] | Consumer: Starting Connection
[2013-05-16 13:50:00,865: INFO/MainProcess] Connected to amqp://guest@127.0.0.1:5672//
[2013-05-16 13:50:00,865: DEBUG/MainProcess] ^-- substep ok
[2013-05-16 13:50:00,865: DEBUG/MainProcess] | Consumer: Starting Mingle
[2013-05-16 13:50:00,866: INFO/MainProcess] mingle: searching for neighbors
[2013-05-16 13:50:02,234: DEBUG/MainProcess] | Worker: Closing Beat...
[2013-05-16 13:50:02,234: DEBUG/MainProcess] | Worker: Closing Hub...
[2013-05-16 13:50:02,234: DEBUG/MainProcess] | Worker: Closing Pool...
[2013-05-16 13:50:02,235: DEBUG/MainProcess] | Worker: Closing Autoscaler...
[2013-05-16 13:50:02,235: DEBUG/MainProcess] | Worker: Closing Consumer...
[2013-05-16 13:50:02,235: DEBUG/MainProcess] | Worker: Stopping Consumer...
[2013-05-16 13:50:02,236: DEBUG/MainProcess] | Worker: Stopping Autoscaler...
[2013-05-16 13:50:02,236: DEBUG/MainProcess] | Worker: Stopping Pool...
[2013-05-16 13:50:02,248: DEBUG/MainProcess] | Worker: Stopping Hub...
[2013-05-16 13:50:02,248: DEBUG/MainProcess] | Worker: Stopping Beat...
[2013-05-16 13:50:02,249: INFO/MainProcess] beat: Shutting down...
[2013-05-16 13:50:02,249: DEBUG/MainProcess] | Consumer: Shutdown event loop...
[2013-05-16 13:50:02,249: DEBUG/MainProcess] | Consumer: Shutdown Gossip...
[2013-05-16 13:50:02,249: DEBUG/MainProcess] | Consumer: Shutdown Heart...
[2013-05-16 13:50:02,249: DEBUG/MainProcess] | Consumer: Shutdown Tasks...
[2013-05-16 13:50:02,250: DEBUG/MainProcess] | Consumer: Shutdown Control...
[2013-05-16 13:50:02,250: DEBUG/MainProcess] | Consumer: Shutdown Events...
[2013-05-16 13:50:02,250: DEBUG/MainProcess] | Consumer: Shutdown Mingle...
[2013-05-16 13:50:02,250: DEBUG/MainProcess] | Consumer: Shutdown Connection...
[2013-05-16 13:50:02,253: DEBUG/MainProcess] removing tasks from inqueue until task handler finished
@fellowshipofone

OK, I realize switching to 3.1 will take more steps than just installing these packages, so obviously discard my previous comment.

@myusuf3

@fellowshipofone that's all I did and everything worked. What other steps do you need to take?

Here are specific commits you can install from, which fixed my issue as far as I can tell. Add these to your requirements.txt:

-e git+git://github.com/celery/celery.git@d18b525e63b6a28600455012a8dd80d0d7fca14e#egg=celery
-e git+git://github.com/celery/billiard.git@fa5f9773c2d508aec30d679f2f09c6ee161a430d#egg=billiard
-e git+git://github.com/celery/kombu.git@cae2e296e2fb95f6d310cbd8465e9c27dbbe1534#egg=kombu
-e git+git://github.com/celery/py-amqp.git@3b215859b693610508b977cd24e835239314d6a3#egg=amqp
@fellowshipofone

I was running with django-celery too, and that seems to be affecting my app, but I think I can now skip it with celery 3.1.

I am trying a new setup now.

@myusuf3

@fellowshipofone can you get on irc? freenode #celery

@fellowshipofone

@myusuf3 thanks for the help, but apparently I have another issue.
@ask I have now updated one of my nodes and I can't get it to run properly:
https://gist.github.com/fellowshipofone/5592878

Here is my setup. Let me know how I can help debug that one:

Django==1.5.1
-e git://github.com/celery/py-amqp.git@3b215859b693610508b977cd24e835239314d6a3#egg=amqp-dev
amqplib==1.0.2
-e git://github.com/celery/billiard.git@e893760b2951862cd14a3985b34714eff750d0d3#egg=billiard-dev
-e git://github.com/celery/celery.git@2b30a673f288dfdb249be065aa3dc8c92e4e0985#egg=celery-dev
-e git://github.com/celery/django-celery.git@70adecec2d37b67610c5f49b4a1829b489bfd7e2#egg=django_celery-dev
-e git://github.com/celery/kombu.git@8aeb642ea154ba7a9223dfa473aadcbf7bc7a0c8#egg=kombu-dev
-e git+https://github.com/ShapeUp/libsaas.git@f689455b12479f91be51c4feee237a3747d26990#egg=libsaas-dev
@ask
Owner

Could be autoscale; I'm pretty sure it's not working at the moment.

@ask
Owner

Autoscale fixed by 4f7c239, please upgrade kombu, billiard and celery!

@ask
Owner

Just fixed a file descriptor leak, found later as I had not finished running the stress tests. Please pull from celery + billiard.

All stress tests now pass for --maxtasksperchild, --autoscale, and time limits.

@myusuf3

@ask is there a reason that celery throughput has gone down? When running tasks, it just seems a lot slower.

@ask
Owner

Seems I messed up maxtasksperchild with the recent commit; rerunning the stress test, it locked up and spun at 100% CPU. This code is incredibly tricky, so I'm not surprised; luckily this reproduces easily, so I will have a fix soon.

@myusuf3 I haven't looked at performance yet; I have a separate test for that. But if you post your settings, then I can take a look.

@ask
Owner

I made some changes and now I can't reproduce the maxtasksperchild hanging + 100% CPU issue anymore.
I'm not sure why it was fixed, but the change includes trying to write the payload immediately instead of waiting for the next poll iteration. I added an assertion so that the worker will crash loudly if the event loop continuously reports a process inqueue as writeable but the actual write operation always raises EINTR/EAGAIN (which is what was happening). If this happens to someone, we can find out what's going on, and possibly change it so that it terminates the worker to try to recover that way (by creating a new socket from scratch).
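
To illustrate the idea (a sketch only, not the actual celery/billiard code): keep a count of consecutive spurious write failures and assert before the loop can spin forever.

    # Sketch only -- NOT the real implementation. It illustrates failing loudly when
    # an inqueue fd is repeatedly reported writable but every write raises EINTR/EAGAIN.
    import errno
    import os

    MAX_SPURIOUS_WRITES = 100  # hypothetical threshold for the assertion

    class InqueueWriter(object):
        def __init__(self, fd):
            self.fd = fd
            self.spurious = 0

        def write(self, payload):
            try:
                written = os.write(self.fd, payload)
            except OSError as exc:
                if exc.errno in (errno.EINTR, errno.EAGAIN):
                    self.spurious += 1
                    # Crash loudly instead of spinning at 100% CPU forever.
                    assert self.spurious < MAX_SPURIOUS_WRITES, (
                        "fd %d reported writable but writes never succeed" % self.fd)
                    return 0
                raise
            self.spurious = 0
            return written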

@ask
Owner

Btw, I added new fields to inspect stats so that it now includes the number of writes to each process.
E.g. with 8 processes and after 80 tasks the stats report:

      "writes": {
            "all": "12.50%, 12.50%, 12.50%, 12.50%, 12.50%, 12.50%, 12.50%, 12.50%",
            "avg": "12.50%",
            "total": 80
        }

Here you can see that the tasks were divided evenly across the pool.
Since we're not doing manual scheduling (which is expensive), we rely on the processes to voluntarily pick up tasks: the master writes to whichever process socket is writeable, which makes it kind of like "work stealing". This is made harder by the fact that the sockets are always buffered, so I'm very happy to see such stable numbers, but more real-world testing needs to be done to say anything for certain.
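
If you want to pull these numbers programmatically rather than via celeryctl, something like the following should work; it's a sketch, and the exact nesting of the "writes" field in the stats payload is an assumption:

    # Fetching worker stats (including the new "writes" section) from Python -- sketch.
    from celery import current_app

    stats = current_app.control.inspect(timeout=5).stats() or {}
    for worker, info in stats.items():
        # The "writes" field may live at the top level or under "pool"; adjust as needed.
        writes = info.get("writes") or info.get("pool", {}).get("writes")
        print("%s: %r" % (worker, writes))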

Forgot to say that I'm still fixing a deadlock which happens when maxtasksperchild is used in combination with time limits (stresstest timelimits_soft with --maxtasksperchild=1 and --concurrency=2).

@myusuf3

@ask I am unsure which versions we are talking about; can you provide exact sha1s?

I am using the latest as of Thursday, and it has run through the weekend without issue (no lockups), albeit slower than before in terms of throughput.

@ask
Owner

HEAD of all repositories.

Waiting for your configuration information re: performance.

@ask
Owner

Running the benchmark, I only see a ~10% performance decrease using the benchmark settings.

@ask
Owner

Fixed in master. Please test so that we can release 3.1 sooner.

@ask ask closed this
@JeffPaine

If it helps, I've been running on the below (HEADs when retrieved) for 48 hours with no lockups. I was experiencing them every few hours previously.

-e git+git://github.com/celery/celery.git@1c9bb490fa79d5d59d4d32bc298b3f52ec4d8de0#egg=celery
-e git+git://github.com/celery/billiard.git@a294684a2826cb5f47252a0fa2f0890e247532ed#egg=billiard
-e git+git://github.com/celery/kombu.git@8aeb642ea154ba7a9223dfa473aadcbf7bc7a0c8#egg=kombu
-e git+git://github.com/celery/py-amqp.git@3b215859b693610508b977cd24e835239314d6a3#egg=amqp