Celery hangs with 100% CPU usage #3712

Closed
ashoksahara opened this issue Dec 21, 2016 · 29 comments · Fixed by celery/kombu#760

@ashoksahara

I am running Celery with Amazon SQS. Inside the Celery task, the task sends a PUT request to a server using Python's requests library. After the first request is processed successfully, Celery hangs with 100% CPU usage. I don't know what's going on.
The strace dump for the hanging PID is:
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1

Configuration:
celery == 4.0.2
kombu == 4

@farshidce

I have been facing a similar issue with the SQS broker, but it occurs after the workers pick up the messages from SQS. Celery stays at 100% and I don't even see any activity in strace. The last thing I see is the message that the task was picked up.

[2016-12-19 03:37:26,266: DEBUG/MainProcess] TaskPool: Apply

When I run lsof it seems normal as well:
celery 19304 farshid 17u IPv4 550364 0t0 TCP 172.30.0.57:41524->54.239.28.115:https (CLOSE_WAIT)
celery 19304 farshid 18u IPv4 550366 0t0 TCP 172.30.0.57:41525->54.239.28.115:https (CLOSE_WAIT)
celery 19304 farshid 19r FIFO 0,8 0t0 550359 pipe
celery 19304 farshid 20r FIFO 0,8 0t0 550360 pipe
celery 19304 farshid 21u IPv4 550369 0t0 TCP 172.30.0.57:41526->54.239.28.115:https (ESTABLISHED)

Would someone be able to help me add a log statement in Celery that I can use to see why the task is never invoked?

I have been using the same code base with AMQP without any issue for months. This happened after I transitioned to SQS; it was okay for the first day or so, but things stopped working afterwards.

@farshidce

farshidce commented Dec 22, 2016

I was finally able to get this to work by:
1. using -P gevent
2. setting CELERYD_PREFETCH_MULTIPLIER=0
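
Roughly, that workaround looks like this (a minimal sketch only; "myapp" is a placeholder and the setting uses the old-style name from this thread):

# Minimal sketch of the workaround above; "myapp" is a placeholder project name.
# Old-style setting, typically placed in the Celery/Django settings module; the
# new-style lowercase equivalent is worker_prefetch_multiplier.
CELERYD_PREFETCH_MULTIPLIER = 0

# Then start the worker with the gevent pool instead of the default prefork pool:
#   celery worker -A myapp -P gevent --loglevel=INFO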

I added a lot of log messages to find out where the message gets stuck, and it seems the AsyncPool class (via its on_apply method) never starts running the task.

I will spend more time on this to find out why AsyncPool does not run. This issue only occurred when the SQS queue already had thousands of messages queued; if I started with a fresh queue, things would work fine.

Python 3.5.1 (default, May 11 2016, 01:09:13)
celery - latest
kombu (4.0.2)
celery (4.0.0) (also tried with 4.0.1 and 4.0.2)

@ask
Contributor

ask commented Jan 3, 2017

Is this in the main process, or one of the child processes?

The private futex may imply a thread is being used, along with some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally, 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex.

@rafales

rafales commented Jan 13, 2017

@ask I ran into the same problem. With gdb I was able to narrow the problem down to code in SQS.py. Somehow this part of the code goes into an infinite loop:

        # (tail of _loop1, which re-schedules _schedule_queue on the hub)
        self.hub.call_soon(self._schedule_queue, queue)

    def _schedule_queue(self, queue):
        if queue in self._active_queues:
            if self.qos.can_consume():
                self._get_bulk_async(
                    queue, callback=promise(self._loop1, (queue,)),
                )
            else:
                # Prefetch limit reached: bounce straight back to _loop1,
                # which schedules _schedule_queue again via call_soon.
                self._loop1(queue)

Here is one of the backtraces from gdb:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  <built-in method isinstance of module object at remote 0x7fdeeaba55e8>
  File "/usr/local/lib/python3.5/site-packages/kombu/async/hub.py", line 193, in call_soon
    if not isinstance(callback, Thenable):
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/SQS.py", line 284, in _loop1
    self.hub.call_soon(self._schedule_queue, queue)
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/SQS.py", line 293, in _schedule_queue
    self._loop1(queue)
  File "/usr/local/lib/python3.5/site-packages/vine/promises.py", line 131, in __call__
    retval = self.fun(*final_args, **final_kwargs)
  File "/usr/local/lib/python3.5/site-packages/kombu/async/hub.py", line 282, in create_loop
    item()
  <built-in method next of module object at remote 0x7fdeeaba55e8>
  File "/usr/local/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 256, in run
    worker.start()
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 244, in __call__
    ret = self.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 221, in run_from_argv
    return self(*args, **options)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 412, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 480, in handle_argv
    return self.execute(command, argv)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 326, in main
    cmd.execute_from_commandline(argv)
  File "/usr/local/lib/python3.5/site-packages/celery/__main__.py", line 14, in main
    _main()
  File "/usr/local/bin/celery", line 11, in <module>
    sys.exit(main())

@rafales

rafales commented Jan 13, 2017

@ask This is also the case @farshidce ran into, where strace does not print anything.

@farshidce

> Is this in the main process, or one of the child processes?

The log messages were from the main process.

> The private futex may imply a thread is being used, and some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally, 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex.

I could not reproduce this after I flushed out all the messages using gevent (instead of the async pool).
The workers and tasks I have are straightforward, don't use any threads, and have been working fine without any issue on RabbitMQ for months.

I have not reproduced this issue again since switching to gevent, but as soon as I find time I plan to run some load testing again, queueing around 15k messages on SQS, to see if things get stuck again.

On a side note (sorry for the long response), has anyone asked the SQS team to implement what Celery needs to respond to "inspect active"? That could be quite helpful in debugging these sorts of issues.

@rafales

rafales commented Jan 13, 2017

Further investigation revealed some more information. The loop is indeed in the code I pasted above, in _loop1/_schedule_queue. The reason is that self.qos.can_consume() always returns False. I have the prefetch multiplier set to 8, and all 8 tasks get prefetched. The logs confirm it: before the problems begin I see 8 "received task" log lines from the main process. After that Celery hangs, and the received tasks are never executed.

@rafales

rafales commented Jan 13, 2017

I think I got to the bottom of this. To trigger this bug you need the prefetch multiplier set to X and you need to receive X messages at once. At that point qos.can_consume() returns False, a call to _loop1 is made, and _loop1 schedules _schedule_queue on the hub.

This means that whenever the prefetch limit is reached the CPU usage goes high (especially if you have long-running tasks), which I think is a bug on its own.

Now, the real bug is in the Hub.create_loop function in kombu/async/hub.py:

        while 1:
            for tick_callback in on_tick:
                tick_callback()

            while todo:
                item = todo.pop()
                if item:
                    item()

            poll_timeout = fire_timers(propagate=propagate) if scheduled else 1

The way call_soon() is implemented, if we schedule a function which schedules a function, which schedules a function, and so on, we create an infinite loop here. The loop never gets a chance to poll the readers and writers because it is always stuck processing "todo" items.
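
To make the starvation concrete, here is a standalone toy sketch of the pattern (illustrative only, not the actual kombu Hub code): a callback that keeps re-scheduling itself through call_soon keeps the inner drain busy forever, so the poll step is never reached.

# Toy reproduction of the starvation pattern (standalone sketch, not kombu code).
todo = []

def call_soon(fn):
    todo.append(fn)

def schedule_queue():
    # Stand-in for _schedule_queue when qos.can_consume() is False: it
    # immediately re-schedules itself instead of yielding to the poller.
    call_soon(schedule_queue)

call_soon(schedule_queue)

iterations = 0
while todo and iterations < 1000000:  # mirrors `while todo:` in create_loop
    item = todo.pop()
    item()                            # re-adds itself, so todo never empties
    iterations += 1

# Without the artificial cap this drain never exits, so the outer event loop
# never gets to poll readers/writers: tasks never run and the CPU spins at 100%.
print("callbacks executed:", iterations, "- still pending:", len(todo))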

@grantmcconnaughey

Good catch @rafales. So this appears to affect SQS only. Is that correct? Also, have you found any sort of workaround? Or is "revert to Celery 3.x" the answer for now?

@rafales

rafales commented Jan 14, 2017

@grantmcconnaughey it may affect other backends too. For SQS setting prefetch multiplier to 0 should fix this for now. I hope it will get fixed soon though.

@grantmcconnaughey

Cool, thanks a lot @rafales. I have CELERYD_PREFETCH_MULTIPLIER = 1, so I'll change that and see what happens.

@grantmcconnaughey

grantmcconnaughey commented Jan 15, 2017

Hey @rafales and @ask, I've done some more testing around this.

I'm using Celery/Kombu 4.0.2 and SQS. I have a small ECS cluster with 3 containers running. If I queue up 100 jobs all at once, Celery makes it through 6 before it completely stops processing and CPU utilization goes up to 100% per container. It hangs at 6 because each container runs 2 Celery processes (2 processes * 3 containers), so each process executes exactly one task before hanging.

I do have CELERYD_PREFETCH_MULTIPLIER set to 0, yet this is still happening.

This is what I'm running from the command line:

$ celery worker -A myapp.celery -Q myapp-builds --loglevel=INFO

Update: I bet this setting needs to be CELERY_WORKER_PREFETCH_MULTIPLIER now. I should really just switch to the lowercase setting names...

Setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 does fix this issue, which is great. Tasks no longer get stuck.
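
For anyone configuring this in code rather than through Django-style settings, a minimal sketch with the new-style lowercase name might look like the following ("myapp" and the broker URL are placeholders):

# Sketch only; "myapp" and the broker URL are placeholders for your project.
from celery import Celery

app = Celery('myapp', broker='sqs://')

# Lowercase equivalent of CELERY_WORKER_PREFETCH_MULTIPLIER. Setting it to 0
# disables the prefetch limit, so qos.can_consume() keeps returning True and
# the hub never gets pinned in the reschedule loop described above.
app.conf.worker_prefetch_multiplier = 0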

@joestump

joestump commented Feb 27, 2017

-P solo worked for me, while -P gevent starts up and immediately exits. I didn't have time to do much investigating since I was in 🔥 fighting mode, but a few other notes:

  • Same behavior: it would prefetch and then get stuck in a loop with 100% CPU usage.
  • Starting a single solo worker in the foreground exhibited the same behavior as multiple daemonized workers.
  • We had no issues with our SQS queues for 3 days on newly spun-up EC2 instances; all of a sudden it went kaput.
  • I have not tried setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 yet.

@etos

etos commented Mar 6, 2017

-P solo worked for me also; I've created a faux level of concurrency using supervisord's numprocs for now (rough config sketch at the end of this comment).

Setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 did unblock the queue, but ultimately dumped everything into the dead-letter queue, so instead I set it to 2 (default: 4) in order to distribute queue messages more evenly across the celeryd processes. While this works, it is definitely more memory-hungry (but functional!).
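
Roughly, the supervisord side looks like this (a sketch only; the program name, app module, queue, and process count are placeholders to adapt for your environment):

; Sketch: run several single-threaded (-P solo) workers under supervisord to
; fake concurrency; "myapp" and "myapp-builds" are placeholder names.
[program:celery-solo]
command=celery worker -A myapp.celery -P solo -Q myapp-builds --loglevel=INFO
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopwaitsecs=60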

@tiyberius

tiyberius commented Apr 8, 2017

I am experiencing this same issue. This seems like a pretty core piece of functionality; I'm surprised that it has gone several months without a fix. For what it's worth, thanks to @etos and @joestump I was able to work around the problem using -P solo as suggested.

@humungasaurus

humungasaurus commented Apr 9, 2017

Seeing the same issue. Is there a proposed fix at this point?

@thedrow
Member

thedrow commented Apr 15, 2017

We'd appreciate a PR that fixes this.

@HitDaCa

HitDaCa commented Jun 6, 2017

So, I'm having the same issue when trying to run Celery on designated AWS ECS container images (Docker).

I was able to fix my workers with CELERY_WORKER_PREFETCH_MULTIPLIER = 0 and -P solo; however, the solo pool option is not available for celery beat.

Does anybody have suggestions on how to solve this?

Broker: AWS SQS
Affected version: 4.0.x

@rubendura

@HitDaCa Unless I'm wrong, celery beat is just a simple process that pushes new tasks on the given schedule; it doesn't do any processing, and thus it doesn't need any worker pool configuration.
Someone correct me if I'm wrong.

@georgepsarakis
Contributor

@rubendura sounds right: celery beat is a producer for recurring tasks, so the solo concurrency pool setting should not apply there.

@HitDaCa

HitDaCa commented Jun 6, 2017

@rubendura & @georgepsarakis Thank you for the clarification; I must have gotten myself confused about who processes a beat-scheduled task.

Following this, I assume the non-functioning Celery worker was responsible for not processing the scheduled task. The beat schedule not appearing to fire in the first place most likely resulted from the worker using 100% of the available CPU.

@chrisspen

I'm also experiencing this with:

celery==4.0.2
astroid==1.1.1
kombu==4.0.2

Celery appears to run fine for a few minutes. The logs even show dozens of tasks completing, but then everything appears to stop: no more tasks are retrieved, even though the SQS queue lists hundreds of pending tasks. Celery's logs show nothing being processed, yet the workers are consuming 100% CPU.

@chrisspen

@etos What broker are you using? Would setting CELERY_WORKER_PREFETCH_MULTIPLIER = 0 cause tasks to be deleted and not processed with SQS?

@etos

etos commented Jun 9, 2017

Hi @chrisspen, yeah, I believe this issue only occurs with SQS.

@gabriel-amram

gabriel-amram commented Jul 3, 2017

Continuing with @rafales' debugging, we dug a little further. The problem is indeed that in kombu/async/hub.py the while todo: statement produces an infinite loop, because the "todo" item it calls immediately calls call_soon, which adds the same item again, so the loop never ends.

I.e., in _schedule_queue, if can_consume() is False, it calls _loop1, which calls call_soon with the _schedule_queue callback. call_soon just adds this callback to the todo set.
So once it's in the todo set, the loop starts and calls the _schedule_queue method, which immediately calls _loop1, which calls call_soon with _schedule_queue, which is added to todo again; the loop sees it has more items in the todo set, calls them again, and so on...

A possible solution (I'm not submitting a PR yet, because I'm not sure what other problems it might cause) is to "freeze" the todo set: let the loop call all of the items that are currently in it, let the rest of the code execute (the writers and readers), and then come back to the todo loop on the next iteration of the outer loop and let it do its magic again. So in kombu/async/hub.py, in create_loop:

todo = self._ready.copy()
self._ready.clear()
while todo:
    item = todo.pop()
    if item:
        item()

That way, if item() is the _schedule_queue callback it will be re-added to todo, but the loop will still let the rest of the code execute, potentially letting the worker clear some tasks and actually become able to consume more messages from the queue.

One caveat I can see here is that this problem is only related to SQS, but the code changed here is not.
Can anyone who knows the internals a bit better (@ask @thedrow) comment on this or a similar solution?

This seems to solve the issue at hand, but I'm still not sure it doesn't create other problems.
Will update.

@tpyo

tpyo commented Jul 3, 2017

> One caveat I can see here is that this problem is only related to SQS, but the code changed here is not.

@gabriel-amram I've been following this issue for a while because I believed it might be related to similar behavior we've been experiencing with RabbitMQ since switching to 4.x. Could it be that it's not limited to SQS, then?

@gabriel-amram

@tpyo it might be; can you point me to the issue you're talking about? Maybe the code path for RabbitMQ uses the same mechanism, which would explain the other issue as well.

@Marcelo-Theodoro

Marcelo-Theodoro commented Jul 7, 2017

One more here facing this issue:

celery==4.0.2
kombu==4.0.2
python 2.7
Amazon SQS

We have several workers running against several different queues. It looks like the problem only occurs when one queue has too many jobs on it; then the workers assigned to that queue stop working.

We're planning to temporarily move to Redis, but I can run some more tests before that. I'll keep you all updated if I learn something new. Thanks, everyone.

@thedrow
Member

thedrow commented Jul 9, 2017

@gabriel-amram I'm not sure I fully understand the solution, but I do understand the analysis and I think it is correct.
Could you please open a PR for us to review? Make sure to ping me so I won't miss it.

gabriel-amram added a commit to gabriel-amram/kombu that referenced this issue Jul 9, 2017
fixes celery/celery#3712 

Before handling the todo items we "freeze" them by copying them aside and clearing the list.
This way if an item in the todo list appends a new callable to the list itself it will be taken care of in the next iteration of the parent loop instead of producing an infinite loop by adding it to the list we're running on.
thedrow pushed a commit to celery/kombu that referenced this issue Jul 10, 2017
* Fix infinite loop in create_loop

fixes celery/celery#3712 

Before handling the todo items we "freeze" them by copying them aside and clearing the list.
This way if an item in the todo list appends a new callable to the list itself it will be taken care of in the next iteration of the parent loop instead of producing an infinite loop by adding it to the list we're running on.

* Changed the test to be aligned with the new implementation

* passing flake8

* Avoid copying results with each iteration of the async loop.

* Pop instead of slicing.

* fixed: todos -> todo, fixed test to use MagicMock so we can use the len() method

* MagicMock not supported in 2.7, implemented __len__ on Mock instead

* added entry to changelog
shapiromatron added a commit to shapiromatron/hawc that referenced this issue Jun 6, 2018