Celery hangs with 100% CPU usage #3712

Closed
ashoksahara opened this issue Dec 21, 2016 · 29 comments · Fixed by celery/kombu#760

@ashoksahara

I am running Celery with Amazon SQS. Inside the Celery task, the task sends a PUT request to a server using Python's requests library. After the first request is processed successfully, Celery hangs with 100% CPU usage. I don't know what's going on.
The strace dump for the hanging PID is:
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1

Configuration:
celery == 4.0.2
kombu == 4

@farshidce

I have been facing a similar issue with the SQS broker, but it occurs after the workers pick up the messages from SQS. Celery stays at 100% and I don't even see any activity in strace. The last thing I see is the message that the task was picked up.

[2016-12-19 03:37:26,266: DEBUG/MainProcess] TaskPool: Apply

When I run lsof it seems normal as well:
celery 19304 farshid 17u IPv4 550364 0t0 TCP 172.30.0.57:41524->54.239.28.115:https (CLOSE_WAIT)
celery 19304 farshid 18u IPv4 550366 0t0 TCP 172.30.0.57:41525->54.239.28.115:https (CLOSE_WAIT)
celery 19304 farshid 19r FIFO 0,8 0t0 550359 pipe
celery 19304 farshid 20r FIFO 0,8 0t0 550360 pipe
celery 19304 farshid 21u IPv4 550369 0t0 TCP 172.30.0.57:41526->54.239.28.115:https (ESTABLISHED)

Would someone be able to help me add a log statement in Celery that I can use to see why the task is never invoked?

I have been using the same code base with AMQP without any issue for months. This happened after I transitioned to SQS; it was okay for the first day or so, but things stopped working afterwards.

@farshidce

farshidce commented Dec 22, 2016

I was finally able to get this to work by:
1. using -P gevent
2. setting CELERYD_PREFETCH_MULTIPLIER=0
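
Roughly, that workaround looks like this (a minimal sketch only; "myapp" is a placeholder and the setting uses the old-style name from this thread):

# Minimal sketch of the workaround above; "myapp" is a placeholder project name.
# Old-style setting, typically placed in the Celery/Django settings module; the
# new-style lowercase equivalent is worker_prefetch_multiplier.
CELERYD_PREFETCH_MULTIPLIER = 0

# Then start the worker with the gevent pool instead of the default prefork pool:
#   celery worker -A myapp -P gevent --loglevel=INFO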

I added a lot of log messages to find out where the message gets stuck, and it seems the AsyncPool class (via its on_apply method) never starts running the task.

I will spend more time on this to find out why AsyncPool does not run. This issue only occurred when the SQS queue already had thousands of messages queued; if I started with a fresh queue, things would work fine.

Python 3.5.1 (default, May 11 2016, 01:09:13)
celery - latest
kombu (4.0.2)
celery (4.0.0) (also tried with 4.0.1 and 4.0.2)

@ask
Contributor

ask commented Jan 3, 2017

Is this in the main process, or one of the child processes?

The private futex may imply a thread is being used, along with some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally, 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex.

@rafales

rafales commented Jan 13, 2017

@ask I ran into the same problem. With gdb I was able to narrow the problem down to code in SQS.py. Somehow this part of the code goes into an infinite loop:

        # (tail of _loop1, which re-schedules _schedule_queue on the hub)
        self.hub.call_soon(self._schedule_queue, queue)

    def _schedule_queue(self, queue):
        if queue in self._active_queues:
            if self.qos.can_consume():
                self._get_bulk_async(
                    queue, callback=promise(self._loop1, (queue,)),
                )
            else:
                # Prefetch limit reached: bounce straight back to _loop1,
                # which schedules _schedule_queue again via call_soon.
                self._loop1(queue)

Here is one of the backtraces from gdb:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  <built-in method isinstance of module object at remote 0x7fdeeaba55e8>
  File "/usr/local/lib/python3.5/site-packages/kombu/async/hub.py", line 193, in call_soon
    if not isinstance(callback, Thenable):
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/SQS.py", line 284, in _loop1
    self.hub.call_soon(self._schedule_queue, queue)
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/SQS.py", line 293, in _schedule_queue
    self._loop1(queue)
  File "/usr/local/lib/python3.5/site-packages/vine/promises.py", line 131, in __call__
    retval = self.fun(*final_args, **final_kwargs)
  File "/usr/local/lib/python3.5/site-packages/kombu/async/hub.py", line 282, in create_loop
    item()
  <built-in method next of module object at remote 0x7fdeeaba55e8>
  File "/usr/local/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 256, in run
    worker.start()
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 244, in __call__
    ret = self.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 221, in run_from_argv
    return self(*args, **options)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 412, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 480, in handle_argv
    return self.execute(command, argv)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 326, in main
    cmd.execute_from_commandline(argv)
  File "/usr/local/lib/python3.5/site-packages/celery/__main__.py", line 14, in main
    _main()
  File "/usr/local/bin/celery", line 11, in <module>
    sys.exit(main())

@rafales

rafales commented Jan 13, 2017

@ask This is also the case @farshidce ran into, where strace does not print anything.

@farshidce

> Is this in the main process, or one of the child processes?

The log messages were from the main process.

> The private futex may imply a thread is being used, and some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally, 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex.

I could not reproduce this after I flushed out all the messages using gevent (instead of the async pool).
The workers and tasks I have are straightforward, don't use any threads, and have been working fine without any issue on RabbitMQ for months.

I have not reproduced this issue again since switching to gevent, but as soon as I find time I plan to run some load testing again, queueing around 15k messages on SQS, to see if things get stuck again.

On a side note (sorry for the long response), has anyone asked the SQS team to implement what Celery needs to respond to "inspect active"? That could be quite helpful in debugging these sorts of issues.

@rafales

rafales commented Jan 13, 2017

Further investigation revealed some more information. The loop is indeed in the code I pasted above, in _loop1/_schedule_queue. The reason is that self.qos.can_consume() always returns False. I have the prefetch multiplier set to 8, and all 8 tasks get prefetched. The logs confirm it: before the problems begin I see 8 "received task" log lines from the main process. After that Celery hangs, and the received tasks are never executed.

@rafales

rafales commented Jan 13, 2017

I think I got to the bottom of this. To trigger this bug you need the prefetch multiplier set to X and you need to receive X messages at once. At that point qos.can_consume() returns False, a call to _loop1 is made, and _loop1 schedules _schedule_queue on the hub.

This means that whenever the prefetch limit is reached the CPU usage goes high (especially if you have long-running tasks), which I think is a bug on its own.

Now, the real bug is in the Hub.create_loop function in kombu/async/hub.py:

        while 1:
            for tick_callback in on_tick:
                tick_callback()

            while todo:
                item = todo.pop()
                if item:
                    item()

            poll_timeout = fire_timers(propagate=propagate) if scheduled else 1

The way call_soon() is implemented, if we schedule a function which schedules a function, which schedules a function, and so on, we create an infinite loop here. The loop never gets a chance to poll the readers and writers because it is always stuck processing "todo" items.
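
To make the starvation concrete, here is a standalone toy sketch of the pattern (illustrative only, not the actual kombu Hub code): a callback that keeps re-scheduling itself through call_soon keeps the inner drain busy forever, so the poll step is never reached.

# Toy reproduction of the starvation pattern (standalone sketch, not kombu code).
todo = []

def call_soon(fn):
    todo.append(fn)

def schedule_queue():
    # Stand-in for _schedule_queue when qos.can_consume() is False: it
    # immediately re-schedules itself instead of yielding to the poller.
    call_soon(schedule_queue)

call_soon(schedule_queue)

iterations = 0
while todo and iterations < 1000000:  # mirrors `while todo:` in create_loop
    item = todo.pop()
    item()                            # re-adds itself, so todo never empties
    iterations += 1

# Without the artificial cap this drain never exits, so the outer event loop
# never gets to poll readers/writers: tasks never run and the CPU spins at 100%.
print("callbacks executed:", iterations, "- still pending:", len(todo))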

@grantmcconnaughey

Good catch @rafales. So this appears to affect SQS only. Is that correct? Also, have you found any sort of workaround? Or is "revert to Celery 3.x" the answer for now?

@rafales

rafales commented Jan 14, 2017

@grantmcconnaughey it may affect other backends too. For SQS setting prefetch multiplier to 0 should fix this for now. I hope it will get fixed soon though.

@grantmcconnaughey

Cool, thanks a lot @rafales. I have CELERYD_PREFETCH_MULTIPLIER = 1, so I'll change that and see what happens.

@grantmcconnaughey

grantmcconnaughey commented Jan 15, 2017

Hey @rafales and @ask, I've done some more testing around this.

I'm using Celery/Kombu 4.0.2 and SQS. I have a small ECS cluster with 3 containers running. If I queue up 100 jobs all at once, Celery makes it through 6 before it completely stops processing and CPU utilization goes up to 100% per container. It hangs at 6 because each container runs 2 Celery processes (2 processes * 3 containers), so each process executes exactly one task before hanging.

I do have CELERYD_PREFETCH_MULTIPLIER set to 0, yet this is still happening.

This is what I'm running from the command line:

$ celery worker -A myapp.celery -Q myapp-builds --loglevel=INFO

Update: I bet this setting needs to be CELERY_WORKER_PREFETCH_MULTIPLIER now. I should really just switch to the lowercase setting names...

Setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 does fix this issue, which is great. Tasks no longer get stuck.
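
For anyone configuring this in code rather than through Django-style settings, a minimal sketch with the new-style lowercase name might look like the following ("myapp" and the broker URL are placeholders):

# Sketch only; "myapp" and the broker URL are placeholders for your project.
from celery import Celery

app = Celery('myapp', broker='sqs://')

# Lowercase equivalent of CELERY_WORKER_PREFETCH_MULTIPLIER. Setting it to 0
# disables the prefetch limit, so qos.can_consume() keeps returning True and
# the hub never gets pinned in the reschedule loop described above.
app.conf.worker_prefetch_multiplier = 0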

@joestump

joestump commented Feb 27, 2017

-P solo worked for me, while -P gevent starts up and immediately exits. I didn't have time to do much investigating since I was in 🔥 fighting mode, but a few other notes:

  • Same behavior: it would prefetch and then get stuck in a loop with 100% CPU usage.
  • Starting a single solo worker in the foreground exhibited the same behavior as multiple daemonized workers.
  • We had no issues with our SQS queues for 3 days on newly spun-up EC2 instances; all of a sudden it went kaput.
  • I have not tried setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 yet.

@etos

etos commented Mar 6, 2017

-P solo worked for me also; I've created a faux level of concurrency using supervisord's numprocs for now (rough config sketch at the end of this comment).

Setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 did unblock the queue, but ultimately dumped everything into the dead-letter queue, so instead I set it to 2 (default: 4) in order to distribute queue messages more evenly across the celeryd processes. While this works, it is definitely more memory-hungry (but functional!).
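
Roughly, the supervisord side looks like this (a sketch only; the program name, app module, queue, and process count are placeholders to adapt for your environment):

; Sketch: run several single-threaded (-P solo) workers under supervisord to
; fake concurrency; "myapp" and "myapp-builds" are placeholder names.
[program:celery-solo]
command=celery worker -A myapp.celery -P solo -Q myapp-builds --loglevel=INFO
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopwaitsecs=60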

@tiyberius

tiyberius commented Apr 8, 2017

I am experiencing this same issue. This seems like a pretty core piece of functionality; I'm surprised that it has gone several months without a fix. For what it's worth, thanks to @etos and @joestump I was able to work around the problem using -P solo as suggested.

@humungasaurus

humungasaurus commented Apr 9, 2017

Seeing the same issue. Is there a proposed fix at this point?

@thedrow
Member

thedrow commented Apr 15, 2017

We'd appreciate a PR that fixes this.

@HitDaCa

HitDaCa commented Jun 6, 2017

So, I'm having the same issue when trying to run Celery on designated AWS ECS container images (Docker).

I was able to fix my workers with CELERY_WORKER_PREFETCH_MULTIPLIER = 0 and -P solo; however, the solo pool option is not available for celery beat.

Does anybody have suggestions on how to solve this?

Broker: AWS SQS
Affected version: 4.0.x

@rubendura

@HitDaCa Unless I'm wrong, celery beat is just a simple process that pushes new tasks on the given schedule; it doesn't do any processing, and thus it doesn't need any worker pool configuration.
Someone correct me if I'm wrong.

@georgepsarakis
Contributor

@rubendura sounds right: celery beat is a producer for recurring tasks, so the solo concurrency pool setting should not apply there.

@HitDaCa

HitDaCa commented Jun 6, 2017

@rubendura & @georgepsarakis Thank you for the clarification; I must have gotten myself confused about who processes a beat-scheduled task.

Following this, I assume the non-functioning Celery worker was responsible for not processing the scheduled task. The beat schedule not appearing to fire in the first place most likely resulted from the worker using 100% of the available CPU.

@chrisspen

I'm also experiencing this with:

celery==4.0.2
astroid==1.1.1
kombu==4.0.2

Celery appears to run fine for a few minutes. The logs even show dozens of tasks completing, but then everything appears to stop: no more tasks are retrieved, even though the SQS queue lists hundreds of pending tasks. Celery's logs show nothing being processed, yet the workers are consuming 100% CPU.

@chrisspen

@etos What broker are you using? Would setting CELERY_WORKER_PREFETCH_MULTIPLIER = 0 cause tasks to be deleted and not processed with SQS?

@etos

etos commented Jun 9, 2017

Hi @chrisspen, yeah, I believe this issue only occurs with SQS.

@gabriel-amram

gabriel-amram commented Jul 3, 2017

Continuing with @rafales' debugging, we dug a little further. The problem is indeed that in kombu/async/hub.py the while todo: statement produces an infinite loop, because the "todo" item it calls immediately calls call_soon, which adds the same item again, so the loop never ends.

I.e., in _schedule_queue, if can_consume() is False, it calls _loop1, which calls call_soon with the _schedule_queue callback. call_soon just adds this callback to the todo set.
So once it's in the todo set, the loop starts and calls the _schedule_queue method, which immediately calls _loop1, which calls call_soon with _schedule_queue, which is added to todo again; the loop sees it has more items in the todo set, calls them again, and so on...

A possible solution (I'm not submitting a PR yet, because I'm not sure what other problems it might cause) is to "freeze" the todo set: let the loop call all of the items that are currently in it, let the rest of the code execute (the writers and readers), and then come back to the todo loop on the next iteration of the outer loop and let it do its magic again. So in kombu/async/hub.py, in create_loop:

todo = self._ready.copy()
self._ready.clear()
while todo:
    item = todo.pop()
    if item:
        item()

That way, if item() is the _schedule_queue callback it will be re-added to todo, but the loop will still let the rest of the code execute, potentially letting the worker clear some tasks and actually become able to consume more messages from the queue.

One caveat I can see here is that this problem is only related to SQS, but the code changed here is not.
Can anyone who knows the internals a bit better (@ask @thedrow) comment on this or a similar solution?

This seems to solve the issue at hand, but I'm still not sure it doesn't create other problems.
Will update.

@tpyo

tpyo commented Jul 3, 2017

> One caveat I can see here is that this problem is only related to SQS, but the code changed here is not.

@gabriel-amram I've been following this issue for a while because I believed it might be related to similar behavior we've been experiencing with RabbitMQ since switching to 4.x. Could it be that it's not limited to SQS, then?

@gabriel-amram

@tpyo it might be; can you point me to the issue you're talking about? Maybe the code path for RabbitMQ uses the same mechanism, which would explain the other issue as well.

@Marcelo-Theodoro

Marcelo-Theodoro commented Jul 7, 2017

One more here facing this issue:

celery==4.0.2
kombu==4.0.2
python 2.7
Amazon SQS

We have several workers running against several different queues. It looks like the problem only occurs when one queue has too many jobs on it; then the workers assigned to that queue stop working.

We're planning to temporarily move to Redis, but I can run some more tests before that. I'll keep you all updated if I learn something new. Thanks, everyone.

@thedrow
Member

thedrow commented Jul 9, 2017

@gabriel-amram I'm not sure I fully understand the solution, but I do understand the analysis and I think it is correct.
Could you please open a PR for us to review? Make sure to ping me so I won't miss it.

gabriel-amram added a commit to gabriel-amram/kombu that referenced this issue Jul 9, 2017
fixes celery/celery#3712 

Before handling the todo items we "freeze" them by copying them aside and clearing the list.
This way if an item in the todo list appends a new callable to the list itself it will be taken care of in the next iteration of the parent loop instead of producing an infinite loop by adding it to the list we're running on.
thedrow pushed a commit to celery/kombu that referenced this issue Jul 10, 2017
* Fix infinite loop in create_loop

fixes celery/celery#3712 

Before handling the todo items we "freeze" them by copying them aside and clearing the list.
This way if an item in the todo list appends a new callable to the list itself it will be taken care of in the next iteration of the parent loop instead of producing an infinite loop by adding it to the list we're running on.

* Changed the test to be aligned with the new implementation

* passing flake8

* Avoid copying results with each iteration of the async loop.

* Pop instead of slicing.

* fixed: todos -> todo, fixed test to use MagicMock so we can use the len() method

* MagicMock not supported in 2.7, implemented __len__ on Mock instead

* added entry to changelog
shapiromatron added a commit to shapiromatron/hawc that referenced this issue Jun 6, 2018