celery worker wedging, stops processing jobs #3487
Comments
Can you strace (Linux) / ktrace (BSD) / dtruss (macOS) the process at the point when it's using CPU? Also |
Now that's obnoxious. While I'm waiting for the deploy to add … So I guess it was a bad message? But I'm using the JSON serializer... IDK. I'll close this in a few hours, or Monday if it doesn't come back. |
I am intermittently getting #3486, even though I've done multiple deploys and even rebuilt (i.e., it's a whole new instance). |
Ok, yeah, it poofed. |
This still happens intermittently. |
Most recently, seemed to be triggered by a new deploy. |
Setting a redrive policy helps; the messages end up in the deadletter queue. |
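(For anyone who wants to try the same workaround: a redrive policy can be attached to the SQS queue with boto3 roughly as below. This is a minimal sketch; the queue URL, dead-letter queue ARN, and retry count are placeholders, not values from this project.)

```python
import json

import boto3

sqs = boto3.client("sqs")

# Placeholders: substitute your own queue URL and dead-letter queue ARN.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-celery-queue"
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:my-celery-dlq"

# After maxReceiveCount failed receives, SQS moves the message to the
# dead-letter queue instead of redelivering it to the worker forever.
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": DLQ_ARN, "maxReceiveCount": "20"}
        )
    },
)
```

The same policy can also be set from the SQS console; the point is just that a bad message stops being redelivered once the receive count is exhausted.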
To anyone else seeing this issue: Setting a redrive policy with a deadletter queue and a retry count of 20 seems to be working for me, but my application is extremely low-volume. This is hard to reproduce, but I suspect it's related to a bad message. One such message is:
Decoded and pretty-printed:

```json
{
  "headers": {
    "eta": null,
    "task": "health_check_celery3.tasks.add",
    "group": null,
    "parent_id": null,
    "argsrepr": "[4, 4]",
    "timelimit": [null, null],
    "kwargsrepr": "{}",
    "lang": "py",
    "retries": 0,
    "id": "69201ab9-bf4a-427d-b2b1-4f84690712af",
    "root_id": null,
    "origin": "gen13817@ip-172-31-53-11",
    "expires": "2016-10-13T11:34:18.671748"
  },
  "body": "W1s0LCA0XSwge30sIHsiZXJyYmFja3MiOiBudWxsLCAiY2FsbGJhY2tzIjogbnVsbCwgImNob3JkIjogbnVsbCwgImNoYWluIjogbnVsbH1d",
  "properties": {
    "body_encoding": "base64",
    "priority": 0,
    "reply_to": "b04f6ea0-0260-377e-a723-774424e1c270",
    "correlation_id": "69201ab9-bf4a-427d-b2b1-4f84690712af",
    "delivery_mode": 2,
    "delivery_tag": "b688bef6-926a-42d0-b756-1537fae68db7",
    "delivery_info": {
      "routing_key": "inno-beta-InnocenceJobQueue-NAX4OXV9F56W",
      "exchange": ""
    }
  },
  "content-encoding": "utf-8",
  "content-type": "application/json"
}
```

Body:

```json
[
  [4, 4],
  {},
  {
    "errbacks": null,
    "callbacks": null,
    "chord": null,
    "chain": null
  }
]
```
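(For reference, the decoded body above can be reproduced directly from the message's `body` field; a minimal sketch:)

```python
import base64
import json

# The "body" field from the message envelope above.
body_b64 = "W1s0LCA0XSwge30sIHsiZXJyYmFja3MiOiBudWxsLCAiY2FsbGJhY2tzIjogbnVsbCwgImNob3JkIjogbnVsbCwgImNoYWluIjogbnVsbH1d"

# A protocol-2 task body decodes to [args, kwargs, embed].
args, kwargs, embed = json.loads(base64.b64decode(body_b64))
print(args, kwargs, embed)  # [4, 4] {} {'errbacks': None, 'callbacks': None, ...}
```

The decoded args match the trivial add(4, 4) health-check task described in the original report.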
Never mind. The redrive policy helps sometimes. Sometimes rebuilding the environment works (because Elastic Beanstalk), sometimes rebooting the instance works. |
I've experienced a similar issue using SQS and boto3. I was never able to reproduce it consistently, but I did see it happen on multiple machines. The app worked fine before connecting it to SQS with boto3. Essentially, everything worked fine sometimes; other times it either wouldn't run the tasks or produced some error message, which is now forgotten. Rebooting machines, clearing caches, etc. could make it work again temporarily, but it always eventually quit. We had to change our stack to avoid a production issue. I hope this helps reaffirm that there is an issue somewhere, perhaps with boto3, perhaps here. |
I'm seeing the same issue here and it's driving me nuts. When running with …
It seems to happen at completely random times; some tasks work sometimes, until they don't. I am honestly running out of ideas. :( |
@rubendura per @ask:
(I haven't gotten the chance to, because I haven't seen the issue recently.) |
I can't now, as the issue is gone (for now). I noticed that for some reason timeouts were missing from this project's config, so I added some task timeouts and purged all messages in my SQS queue. So far I've only seen one timeout, apparently due to a Django DB exception. I've fixed that, but it kind of bugs me that when an exception was raised in one of my workers, instead of printing some logs or retrying the task it just... hung, and brought my CPU to full usage. I would've expected to see some logs, or Celery to capture the exception and maybe restart the worker (which means that after the SQS visibility timeout the task would've been retried). The only thing that could potentially be interfering is the Sentry integration, but I haven't investigated that yet. |
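(For anyone adding timeouts the same way: the relevant Celery 4 settings look roughly like this. A minimal sketch; the app name, broker URL, and limit values are illustrative, not the ones used above.)

```python
from celery import Celery

app = Celery("myproject", broker="sqs://")  # name and broker are placeholders

# Kill any task still running after task_time_limit seconds; raise
# SoftTimeLimitExceeded inside the task at task_soft_time_limit so it
# gets a chance to clean up first.
app.conf.task_time_limit = 300
app.conf.task_soft_time_limit = 270
```

With SQS it is usually worth keeping the queue's visibility timeout longer than the task time limit, otherwise messages can be redelivered while the original delivery is still being worked on.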
IDK. In my case the primary task is just a heartbeat that should never, ever throw an exception. |
Ok. I've got it again. It's some kind of weirdness with the processes cycling? The PIDs keep changing. |
It's … Attached are the strace results, but I'm not sure what's happening with them. |
It's back. And this time in production. :( Restarting workers is not fixing the issue. ps aux:
strace -yy -t -p 5:
strace -yy -t -p 74:
strace -yy -t -p 77:
lsof -p 5 (excluded REG and DIR types):
lsof -p 74 (excluded REG and DIR types):
lsof -p 77 (excluded REG and DIR types):
logs output:
logs after sending SIGUSR1 to PID 5:
(Log data can be slightly out of order due to how CloudWatch Logs captures log output from Docker.) It might just be me not understanding what's going on in all this data, but to me it looks like there is a failure when the worker processes try to communicate with the master process. I have no clue why, or even whether that reading is right, and even less how to fix it. Help. :( |
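(For context on the SIGUSR1 output above: Celery's worker installs a SIGUSR1 handler that logs a traceback for every active thread, which is one way to see where a wedged process is stuck. A sketch; the PID is hard-coded to match the ps output above and is otherwise arbitrary.)

```python
import os
import signal

# Ask the worker master process (PID 5 in the ps output above) to dump a
# traceback for each of its active threads into the worker log.
os.kill(5, signal.SIGUSR1)
```

The equivalent from a shell is simply sending USR1 to the worker's PID with kill.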
Since this looks like an issue with the worker processes communicating with the master, I tried running Celery with the "solo" worker pool, and tasks are now being run properly. Any help fixing this would be greatly appreciated. |
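(For anyone who wants to try the same workaround: the pool can be selected on the command line with `celery -A proj worker --pool=solo`, or in configuration. A minimal sketch of the latter; the app name and broker URL are placeholders.)

```python
from celery import Celery

app = Celery("myproject", broker="sqs://")  # placeholders

# "solo" runs tasks in the worker's main process instead of the prefork
# pool, which sidesteps the master<->child pipe communication entirely.
app.conf.worker_pool = "solo"
```

The trade-off is that the solo pool handles one task at a time, so per-worker concurrency is lost.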
I think I'm running into this as well. I tried the above setting, but it's not working for me.
I'm using these settings:
I note this seems to happen when there are retries (and in this case the retries are actually in the past). Also, I had a trace similar to the above individuals': in lsof I noticed:
|
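(For readers wondering what "retries" means here: Celery retries re-enqueue the message with an ETA in the future, roughly as below. This is a hypothetical task for illustration only; the name, delay, and exception handling are not from this project.)

```python
from urllib.request import urlopen

from celery import Celery

app = Celery("myproject", broker="sqs://")  # placeholders

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def fetch_url(self, url):
    """Hypothetical task illustrating how a retry is scheduled."""
    try:
        return len(urlopen(url, timeout=10).read())
    except OSError as exc:
        # self.retry() re-enqueues the message with an ETA about 60s ahead;
        # if the worker is wedged for longer than that, the ETA is already
        # in the past by the time the message is received again.
        raise self.retry(exc=exc)
```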
Using the |
The |
One of my workers has the same problem right now; the version is 4.0.1. strace doesn't output anything, but with ltrace I see a loop with:
|
Now this is odd. Under the … Messing around in GDB seems to indicate it's hanging out in a … Killing the process and letting
It does seem that the |
While the use of the (Usually restarting the worker, either with a redeploy or a |
@astronouth7303 #3712 |
This is a duplicate of #3712 and it will be fixed in the next released version. :) |
I'm using Django + Celery on AWS Elastic Beanstalk with the SQS broker. I've got a periodic health check issuing a trivial job (add two numbers). I'm currently on master, because 4.0.0rc4 has some critical bugs.
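(The health-check task is about as trivial as it gets; something along these lines, sketched from the `health_check_celery3.tasks.add` name that shows up in the message headers elsewhere in this thread. The module layout and broker URL are assumptions.)

```python
# health_check_celery3/tasks.py (assumed layout)
from celery import Celery

app = Celery("health_check_celery3", broker="sqs://")  # broker URL elided

@app.task
def add(x, y):
    # The periodic health check enqueues add(4, 4) and expects 8 back.
    return x + y
```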
The celery worker is maxing out the CPU to 100%, but no jobs are getting processed.
The log is:
c3b4de0f-d863-46df-b174-39fca55cffef is the only task I've noticed repeating. I can't find any other even slightly relevant log entries.
Really, all I know is that the queue isn't getting shorter, and the CPU is maxed out.
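(One way to confirm "the queue isn't getting shorter" without the console is to poll the approximate depth with boto3; a sketch, with the queue URL as a placeholder.)

```python
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-celery-queue"  # placeholder

while True:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # still waiting in the queue
            "ApproximateNumberOfMessagesNotVisible",  # delivered but not yet deleted
        ],
    )["Attributes"]
    print(attrs["ApproximateNumberOfMessages"], attrs["ApproximateNumberOfMessagesNotVisible"])
    time.sleep(30)
```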