Celery hangs with 100% CPU usage #3712
Comments
I have been facing a similar issue with the SQS broker, but it occurs after the workers pick up the messages from SQS. Celery stays at 100% and I don't even see any activity in strace; the last thing I see is the message that the task is picked up: [2016-12-19 03:37:26,266: DEBUG/MainProcess] TaskPool: Apply. When I run lsof it seems normal as well. Would someone be able to help me add a log statement in Celery that I can use to see why the task is never invoked? I have been using the same code base with amqp without any issue for months. This happened after I transitioned to SQS; it was okay for the first day or so, but things stopped working afterwards.
I was finally able to get this to work. I added a lot of log messages to find out where the message gets stuck, and it seems the AsyncPool class (whose on_apply method is called) never starts running the task. I will spend more time on this to find out why AsyncPool does not run. This issue only occurred when the SQS queue already had thousands of messages queued; if I started with a new queue, things would work fine. Python 3.5.1 (default, May 11 2016, 01:09:13)
Is this in the main process, or one of the child processes? The private futex may imply a thread is being used, and some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally, 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex.
@ask I ran into the same problem. With gdb I was able to narrow down the problem to code in kombu. Here is one of the backtraces from gdb:
@ask Also this is the case @farshidce ran into, where strace does not print anything.
I could not reproduce this after I flushed out all the messages and switched to gevent (instead of asyncpool). I have not reproduced the issue again since switching to gevent, but as soon as I find time I plan to run some load tests again with 15k messages on SQS and see if things get stuck. On a side note (sorry for the long response), has anyone asked the SQS team to implement what Celery needs to respond to "inspect active"? That could be quite helpful in debugging these sorts of issues...
Further investigation revealed some more information. The loop is indeed in the code I pasted above; it sits in the _loop1/_schedule_queue path. The reason for that is that the scheduling callback keeps adding itself back to the loop's todo list.
I think I got to the bottom of this. To trigger this bug you need the prefetch multiplier set to X and you need to receive X messages at once. At that point the prefetch limit is reached and the polling callback keeps rescheduling itself. This means that if the prefetch multiplier is reached, CPU usage will be high (especially if you have tasks which execute for a long time), which I think is a bug on its own. Now, the real bug is in the way the event loop handles its todo list.
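To make the mechanism concrete, here is a toy reproduction. This is a sketch only, not kombu source: the names schedule_queue and can_consume merely stand in for the _schedule_queue/_loop1 path and the prefetch check mentioned above.

```python
todo = []            # stands in for the event loop's todo list
can_consume = False  # prefetch limit reached, so fetching is not allowed
iterations = 0

def schedule_queue():
    global iterations
    iterations += 1
    if can_consume:
        pass  # here the real transport would issue an async receive to SQS
    else:
        # Prefetch limit reached: re-queue ourselves immediately.
        todo.append(schedule_queue)

todo.append(schedule_queue)

# The loop drains the todo list while callbacks keep appending to it,
# so it never gets back to polling the broker and spins at 100% CPU.
while todo and iterations < 100_000:
    todo.pop(0)()

print(iterations)  # hits the safety cap; without it this would never end
```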
Good catch @rafales. So this appears to affect SQS only. Is that correct? Also, have you found any sort of workaround? Or is "revert to Celery 3.x" the answer for now?
@grantmcconnaughey it may affect other backends too. For SQS, setting the prefetch multiplier to 0 should fix this for now. I hope it will get fixed soon though.
Cool, thanks a lot @rafales. I have applied that setting.
Hey @rafales and @ask, I've done some more testing around this. I'm using Celery/Kombu 4.0.2 and SQS. I have a small ECS cluster with 3 containers running. If I queue up 100 jobs all at once, Celery will make it through 6 before it completely stops processing and CPU utilization goes up to 100% per container. It hangs at 6 because each container is running 2 Celery processes (2 processes * 3 containers). That means each process executes one task before hanging. I do have the prefetch multiplier setting mentioned above. This is what I'm running from the command line:
Update: I bet this setting needs to be applied differently.
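For reference, a minimal sketch of the workaround discussed above, assuming a Celery 4 style configuration (the app name and broker URL are placeholders; double-check the setting name against your Celery version, since the old-style name is CELERYD_PREFETCH_MULTIPLIER):

```python
from celery import Celery

app = Celery("proj", broker="sqs://")

# 0 removes the prefetch limit entirely, so the "limit reached" code path
# that busy-loops is never entered.
app.conf.worker_prefetch_multiplier = 0
```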
Seeing the same issue. Is there a proposed fix at this point?
We'd appreciate a PR that fixes this.
So, I'm having the same issue when trying to run Celery on designated AWS ECS container images (Docker). I was able to fix my workers, but celery beat still doesn't get its tasks processed. Anybody got suggestions on how to solve this? Broker: AWS SQS
@HitDaCa Unless I'm wrong, celery beat is just a simple process that pushes new tasks on the given schedule; it doesn't do any processing, and thus it doesn't need any config for the worker pool.
@rubendura sounds right, beat is only a scheduler; the worker pool configuration doesn't apply to it.
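To illustrate the split being described, a rough sketch (the module, task, and schedule names are made up): beat only puts entries on the queue on schedule, and a separate worker process executes them.

```python
from celery import Celery

app = Celery("proj", broker="sqs://")

@app.task(name="proj.ping")
def ping():
    return "pong"

# beat pushes this task onto the queue every 30 seconds; it never runs it.
app.conf.beat_schedule = {
    "ping-every-30s": {"task": "proj.ping", "schedule": 30.0},
}

# Run `celery -A proj beat` for the scheduler and `celery -A proj worker`
# for the process that actually executes ping().
```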
@rubendura & @georgepsarakis Thank you for the clarification, I must have gotten myself confused about who processes a beat-scheduled task. Following this, I assume the non-functioning Celery worker was responsible for not processing the scheduled task. The issue of the beat task not being triggered in the first place most likely resulted from the worker using 100% of the available CPU.
I'm also experiencing this with:
Celery appears to run fine for a few minutes. The logs even show dozens of tasks completing, but then everything appears to stop: no more tasks are retrieved, even though the SQS queue lists hundreds of pending tasks. Celery's logs show nothing is being processed, yet the workers are consuming 100% CPU.
@etos What broker are you using? Would setting the prefetch multiplier to 0, as suggested above, help?
Hi @chrisspen, yeah, I believe this ticket only occurs with SQS.
Continuing from @rafales' debugging, we dug a little further. The problem is indeed the fact that in create_loop an item in the todo list can add new callables to the very list that is being iterated (e.g. the SQS polling callback re-scheduling itself). A possible solution (I'm not committing a PR yet, because I'm not sure what other problems it might cause) is to "freeze" the todo list before processing it:
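A minimal sketch of the "freeze" idea (not the exact kombu patch; names are illustrative): copy the queued callbacks aside and clear the shared list before running them, so a callback that re-schedules itself is deferred to the next pass instead of extending the list currently being iterated.

```python
def run_todo_once(todo):
    frozen = list(todo)  # copy the queued callbacks aside
    del todo[:]          # clear the shared list before running anything
    for callback in frozen:
        callback()

todo = []

def reschedules_itself():
    todo.append(reschedules_itself)

todo.append(reschedules_itself)
run_todo_once(todo)      # the callback runs exactly once per pass
assert len(todo) == 1    # the re-scheduled call waits for the next pass
```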
That way, if an item adds a new callable while the todo list is being handled, it will be picked up in the next iteration of the parent loop instead of extending the list we're running on. One caveat I can see here is the fact that this problem is only related to SQS, but the code changed here is not. This seems to solve the issue at hand, but I'm still not sure it doesn't create others.
@gabriel-amram I've been following this issue for a while because I believed it might be related to similar behavior we're experiencing with RabbitMQ since switching to 4.x. Could be that it's not limited to SQS then?
@tpyo it might be; can you refer me to the issue you are talking about? Maybe the code for RabbitMQ uses the same mechanism, which would explain the other issue as well.
One more here facing this issue:
We have several workers running several different queues. It looks like the problem only occurs when one queue has too many jobs on it; then the workers assigned to that queue stop working. We're planning to temporarily move to Redis, but I can run some more tests before that. I'll keep you all updated if I learn something new. Thanks everyone.
@gabriel-amram I'm not sure I fully understand the solution. I do understand the analysis and I think it is correct.
Fixes celery/celery#3712. Before handling the todo items we "freeze" them by copying them aside and clearing the list. This way, if an item in the todo list appends a new callable to the list itself, it will be taken care of in the next iteration of the parent loop instead of producing an infinite loop by adding it to the list we're running on.
* Fix infinite loop in create_loop (fixes celery/celery#3712). Before handling the todo items we "freeze" them by copying them aside and clearing the list. This way, if an item in the todo list appends a new callable to the list itself, it will be taken care of in the next iteration of the parent loop instead of producing an infinite loop by adding it to the list we're running on.
* Changed the test to be aligned with the new implementation
* Passing flake8
* Avoid copying results with each iteration of the async loop.
* Pop instead of slicing.
* Fixed: todos -> todo; fixed test to use MagicMock so we can use the len() method
* MagicMock not supported in 2.7, implemented __len__ on Mock instead
* Added entry to changelog
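The "pop instead of slicing" item above points to a later refinement of the same idea; a sketch of what that might look like (illustrative, not the merged code): take len() once, then pop that many items off the front, so no per-pass copy is made and anything appended while running stays queued for the next iteration.

```python
def run_todo_once(todo):
    # Only process the items that were queued when this pass started;
    # callbacks appended during the pass remain for the next one.
    for _ in range(len(todo)):
        todo.pop(0)()
```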
I am running Celery with Amazon SQS. In the Celery task, a PUT request is sent to a server using the requests library of Python. After the first request is processed successfully, Celery hangs with 100% CPU usage. I don't know what's going on.
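For context, the kind of task described here might look roughly like this (the function name, URL handling, and retry policy are guesses for illustration, not taken from the report):

```python
import requests
from celery import Celery

app = Celery("app", broker="sqs://")

@app.task(bind=True, max_retries=3)
def push_update(self, url, payload):
    # Send the PUT request; retry on network errors instead of failing hard.
    try:
        response = requests.put(url, json=payload, timeout=10)
        response.raise_for_status()
        return response.status_code
    except requests.RequestException as exc:
        raise self.retry(exc=exc, countdown=5)
```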
The strace dump for the hanging PID is:
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
Configuration:
celery == 4.0.2
kombu == 4