celery worker wedging, stops processing jobs #3487

Closed
AstraLuma opened this issue Sep 30, 2016 · 27 comments

Comments

@AstraLuma

I'm using django+celery on AWS Elastic Beanstalk using the SQS broker. I've got a periodic healthcheck issuing a trivial job (add two numbers). I'm currently on master, because 4.0.0rc4 has some critical bugs.
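
For context, the healthcheck task is essentially the canonical "add two numbers" example; a rough sketch of what it looks like (the real module layout in health_check_celery3 may differ):

from celery import shared_task

@shared_task
def add(x, y):
    # trivial task used purely as a liveness probe for the worker
    return x + y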

The celery worker is maxing out the CPU to 100%, but no jobs are getting processed.

The log is:

 -------------- celery@ip-172-31-52-97 v4.0.0rc4 (0today8)
---- **** -----
--- * ***  * -- Linux-4.4.15-25.57.amzn1.x86_64-x86_64-with-glibc2.3.4 2016-09-30 15:04:21
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app:         innocence:0x7f1118595a58
- ** ---------- .> transport:   sqs://localhost//
- ** ---------- .> results:
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . health_check_celery3.tasks.add
  . innocence.celery.debug_task
  . maca.tasks.reconcile
  . wings.tasks.generate_quickbooks_import

celery --app=innocence.celery:app worker --loglevel=INFO
[2016-09-30 15:08:27,542: INFO/MainProcess] Connected to sqs://localhost//
[2016-09-30 15:08:27,612: WARNING/MainProcess] celery@ip-172-31-52-97 ready.
[2016-09-30 15:08:29,721: INFO/MainProcess] Received task: health_check_celery3.tasks.add[0f6f7b7f-ead1-4093-9c31-2e818615df9b]   expires:[2016-09-30 15:06:49.282238+00:00]
[2016-09-30 15:08:29,722: INFO/MainProcess] Discarding revoked task: health_check_celery3.tasks.add[0f6f7b7f-ead1-4093-9c31-2e818615df9b]
[2016-09-30 15:08:29,730: INFO/MainProcess] Received task: health_check_celery3.tasks.add[6c46f583-2dc4-4f68-8c8a-2cff732e4a57]   expires:[2016-09-30 15:07:19.295102+00:00]
[2016-09-30 15:08:29,730: INFO/MainProcess] Discarding revoked task: health_check_celery3.tasks.add[6c46f583-2dc4-4f68-8c8a-2cff732e4a57]
[2016-09-30 15:08:29,732: INFO/MainProcess] Received task: health_check_celery3.tasks.add[26e28e04-10f5-4e41-b1cb-a284b0eeca02]   expires:[2016-09-30 15:08:07.423346+00:00]
[2016-09-30 15:08:29,732: INFO/MainProcess] Discarding revoked task: health_check_celery3.tasks.add[26e28e04-10f5-4e41-b1cb-a284b0eeca02]
[2016-09-30 15:08:29,733: INFO/MainProcess] Received task: health_check_celery3.tasks.add[c3b4de0f-d863-46df-b174-39fca55cffef]   expires:[2016-09-30 14:02:01.752967+00:00]
[2016-09-30 15:08:29,733: INFO/MainProcess] Discarding revoked task: health_check_celery3.tasks.add[c3b4de0f-d863-46df-b174-39fca55cffef]

c3b4de0f-d863-46df-b174-39fca55cffef is the only task I've noticed repeating.

I can't find any other even slightly relevant log entries.

Really, all I know is that the queue isn't getting shorter, and the CPU is maxed out.

@ask
Contributor

ask commented Sep 30, 2016

Can you strace(linux)/ktrace(bsd)/dtruss(macos) the process at the point when it's using CPU?

Also kill -USR1 <pidof mainprocess> a few times, and collect the tracebacks emitted in the logs.
If you pip install setproctitle before starting the worker processes, it's easier to see the type of process in ps listings and find the pid of the main process (e.g. ps auxww | grep MainProcess).
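
A minimal sketch of that procedure in Python, assuming psutil is installed; it finds the worker's MainProcess (recognisable in process listings once setproctitle is installed) and sends it SIGUSR1 so the cry handler dumps tracebacks to the logs:

import os
import signal

import psutil  # assumption: pip install psutil

for proc in psutil.process_iter(attrs=['pid', 'cmdline']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    # with setproctitle installed, the parent shows up with "MainProcess" in its title
    if 'celery' in cmdline and 'MainProcess' in cmdline:
        os.kill(proc.info['pid'], signal.SIGUSR1)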

@AstraLuma
Author

Now that's obnoxious. While I was waiting for the deploy that adds setproctitle to go through, I cleared the queue (it had >800 healthchecks sitting in it and nobody needs that, and it had been fine in the past), and the issue cleared up.

So I guess it was a bad message? But I'm using the JSON serializer... IDK. I'll close this in a few hours or Monday if it doesn't come back.

@AstraLuma
Author

I am intermittently getting #3486, even though I've done multiple deploys and even rebuilt (i.e., it's a whole new instance).

@AstraLuma
Author

Ok, yeah, it poofed.

@AstraLuma
Author

This still happens intermittently.

@AstraLuma AstraLuma reopened this Oct 13, 2016
@AstraLuma
Author

Most recently, it seemed to be triggered by a new deploy.

@AstraLuma
Author

Setting a redrive policy helps; the messages end up in the deadletter queue.

@ask ask added this to the v4.0 milestone Oct 14, 2016
@AstraLuma
Author

To anyone else seeing this issue: Setting a redrive policy with a deadletter queue and a retry count of 20 seems to be working for me, but my application is extremely low-volume.
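
A rough sketch of how such a redrive policy can be attached with boto3 (the queue URL and dead-letter queue ARN below are placeholders, not the real ones from this deployment):

import json

import boto3

sqs = boto3.client('sqs')
sqs.set_queue_attributes(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/celery',  # placeholder
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': 'arn:aws:sqs:us-east-1:123456789012:celery-dlq',  # placeholder
            'maxReceiveCount': '20',  # give up after 20 deliveries
        }),
    },
)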

This is hard to reproduce, but I suspect it's related to a bad message. One such message is:

eyJoZWFkZXJzIjogeyJldGEiOiBudWxsLCAidGFzayI6ICJoZWFsdGhfY2hlY2tfY2VsZXJ5My50YXNrcy5hZGQiLCAiZ3JvdXAiOiBudWxsLCAicGFyZW50X2lkIjogbnVsbCwgImFyZ3NyZXByIjogIls0LCA0XSIsICJ0aW1lbGltaXQiOiBbbnVsbCwgbnVsbF0sICJrd2FyZ3NyZXByIjogInt9IiwgImxhbmciOiAicHkiLCAicmV0cmllcyI6IDAsICJpZCI6ICI2OTIwMWFiOS1iZjRhLTQyN2QtYjJiMS00Zjg0NjkwNzEyYWYiLCAicm9vdF9pZCI6IG51bGwsICJvcmlnaW4iOiAiZ2VuMTM4MTdAaXAtMTcyLTMxLTUzLTExIiwgImV4cGlyZXMiOiAiMjAxNi0xMC0xM1QxMTozNDoxOC42NzE3NDgifSwgImJvZHkiOiAiVzFzMExDQTBYU3dnZTMwc0lIc2laWEp5WW1GamEzTWlPaUJ1ZFd4c0xDQWlZMkZzYkdKaFkydHpJam9nYm5Wc2JDd2dJbU5vYjNKa0lqb2diblZzYkN3Z0ltTm9ZV2x1SWpvZ2JuVnNiSDFkIiwgInByb3BlcnRpZXMiOiB7ImJvZHlfZW5jb2RpbmciOiAiYmFzZTY0IiwgInByaW9yaXR5IjogMCwgInJlcGx5X3RvIjogImIwNGY2ZWEwLTAyNjAtMzc3ZS1hNzIzLTc3NDQyNGUxYzI3MCIsICJjb3JyZWxhdGlvbl9pZCI6ICI2OTIwMWFiOS1iZjRhLTQyN2QtYjJiMS00Zjg0NjkwNzEyYWYiLCAiZGVsaXZlcnlfbW9kZSI6IDIsICJkZWxpdmVyeV90YWciOiAiYjY4OGJlZjYtOTI2YS00MmQwLWI3NTYtMTUzN2ZhZTY4ZGI3IiwgImRlbGl2ZXJ5X2luZm8iOiB7InJvdXRpbmdfa2V5IjogImlubm8tYmV0YS1Jbm5vY2VuY2VKb2JRdWV1ZS1OQVg0T1hWOUY1NlciLCAiZXhjaGFuZ2UiOiAiIn19LCAiY29udGVudC1lbmNvZGluZyI6ICJ1dGYtOCIsICJjb250ZW50LXR5cGUiOiAiYXBwbGljYXRpb24vanNvbiJ9

Decoded and pretty-printed:

{
  "headers": {
    "eta": null,
    "task": "health_check_celery3.tasks.add",
    "group": null,
    "parent_id": null,
    "argsrepr": "[4, 4]",
    "timelimit": [
      null,
      null
    ],
    "kwargsrepr": "{}",
    "lang": "py",
    "retries": 0,
    "id": "69201ab9-bf4a-427d-b2b1-4f84690712af",
    "root_id": null,
    "origin": "gen13817@ip-172-31-53-11",
    "expires": "2016-10-13T11:34:18.671748"
  },
  "body": "W1s0LCA0XSwge30sIHsiZXJyYmFja3MiOiBudWxsLCAiY2FsbGJhY2tzIjogbnVsbCwgImNob3JkIjogbnVsbCwgImNoYWluIjogbnVsbH1d",
  "properties": {
    "body_encoding": "base64",
    "priority": 0,
    "reply_to": "b04f6ea0-0260-377e-a723-774424e1c270",
    "correlation_id": "69201ab9-bf4a-427d-b2b1-4f84690712af",
    "delivery_mode": 2,
    "delivery_tag": "b688bef6-926a-42d0-b756-1537fae68db7",
    "delivery_info": {
      "routing_key": "inno-beta-InnocenceJobQueue-NAX4OXV9F56W",
      "exchange": ""
    }
  },
  "content-encoding": "utf-8",
  "content-type": "application\/json"
}

Body:

[
  [4, 4],
  {},
  {
    "errbacks": null,
    "callbacks": null,
    "chord": null,
    "chain": null
  }
]
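
For anyone who wants to inspect one of these messages themselves, a minimal decoding sketch: the raw SQS payload is base64-encoded JSON, and its "body" field is base64-encoded JSON again (args, kwargs, and the embed dict):

import base64
import json

def decode_celery_sqs_message(raw):
    # outer envelope: headers, properties, content-type, base64-encoded body
    envelope = json.loads(base64.b64decode(raw).decode('utf-8'))
    # inner body: [args, kwargs, embed]
    body = json.loads(base64.b64decode(envelope['body']).decode('utf-8'))
    return envelope, body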

@AstraLuma
Author

Never mind. The redrive policy helps sometimes. Sometimes rebuilding the environment works (because Elastic Beanstalk), sometimes rebooting the instance works.

@rykener

rykener commented Oct 29, 2016

I've experienced a similar issue using SQS and boto3. I was never able to consistently reproduce it, but did see it happen on multiple machines. The app worked fine before connecting it to SQS with boto3.

Essentially, everything worked fine sometimes; other times it either wouldn't run the tasks or produced some error message I've since forgotten. Rebooting machines, clearing caches, etc. could make it work again temporarily, but it always eventually quit. We had to change our stack to avoid a production issue.

I hope this helps reaffirm that there is an issue somewhere, perhaps with boto3, perhaps here.

@ask ask modified the milestones: v4.1, v4.0 Nov 1, 2016
@rubendura

I'm seeing the same issue here and it's driving me nuts. When running with -l debug I can see celery properly initialising, fetching 4 tasks from the queue, trying to apply one... and that's it.

[2016-12-07 14:16:34,502: INFO/MainProcess] Received task: project.analytics.tasks.track[862c0a36-92f4-4a07-a402-620b2be105bb]
[2016-12-07 14:16:34,503: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7fb567fd5048> (args:('project.analytics.tasks.track', '862c0a36-92f4-4a07-a402-620b2be105bb', {'eta': None, 'reply_to': '52cfb82d-693b-3e91-8e30-912fd7a2d748', 'argsrepr': "(UUID('98a18947-63d2-4b8a-81e5-6a0ba09e4dbc'), 'comment')", 'origin': 'gen75@3032e033202a', 'lang': 'py', 'timelimit': [None, None], 'id': '862c0a36-92f4-4a07-a402-620b2be105bb', 'parent_id': None, 'retries': 0, 'task': 'project.analytics.tasks.track', 'group': None, 'kwargsrepr': "{'increment': {'comments': 1}, 'properties': {'time': 1481110822}}", 'delivery_info': {'priority': 0, 'routing_key': 'projectV2Dev-CeleryBroker-S5CA1VI4GOGA', 'exchange': '', 'redelivered': None}, 'root_id': '862c0a36-92f4-4a07-a402-620b2be105bb', 'correlation_id': '862c0a36-92f4-4a07-a402-620b2be105bb', 'expires': None}, b'[["98a18947-63d2-4b8a-81e5-6a0ba09e4dbc", "comment"], {"increment": {"comments": 1}, "properties": {"time":... kwargs:{})
[2016-12-07 14:16:34,504: INFO/MainProcess] Received task: project.analytics.tasks.track[8e500cb1-ee7a-420e-abcf-d7149028b079]
[2016-12-07 14:16:34,517: INFO/MainProcess] Received task: project.analytics.tasks.firehose[fb334773-9849-4199-9bdb-4e5b984ddf37]
[2016-12-07 14:16:34,519: INFO/MainProcess] Received task: project.analytics.tasks.track[6911d3ea-cdae-45a8-a541-2ad05f3e0ab8]

It seems to happen at completely random times; some tasks work for a while, until they don't. I am honestly running out of ideas. :(
Luckily for me this is only happening in our development environment and has never happened in production. Which raises the question: why?! I can't find any sensible answer to that question.

@AstraLuma
Author

@rubendura per @ask:

Can you strace(linux)/ktrace(bsd)/dtruss(macos) the process at the point when it's using CPU?

(I haven't gotten the chance to, because I haven't seen the issue recently.)

@rubendura

I can't now as the issue is gone (for now).

I noticed that for some reason timeouts were missing in this project's config, so I added some task timeouts and purged all messages in my SQS queue. So far I've only seen one timeout, apparently due to a Django DB exception.

I've fixed that, but it kind of bugs me that when an exception was raised in one of my workers, instead of printing some logs or retrying the task it just... hung. And brought my CPU to full usage. I would've expected to see some logs, or celery to capture the exception and maybe restart the worker (which means after the SQS visibility timeout the task would've been retried). The only thing that could potentially be messing with that is the Sentry integration, but I haven't investigated that yet.
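
A sketch of the kind of task timeouts I mean, assuming Celery 4 style lowercase settings (the values are illustrative only, not what we actually run):

from celery import Celery

app = Celery('config', broker='sqs://')
app.conf.task_soft_time_limit = 60  # raises SoftTimeLimitExceeded inside the task
app.conf.task_time_limit = 90       # hard-kills the pool worker child after 90 seconds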

@AstraLuma
Author

IDK. In my case the primary task is just a heartbeat that should never, ever throw an exception.

@AstraLuma
Author

Ok. I've got it again. It's some kind of weirdness with the processes cycling? The PIDs keep changing.

@AstraLuma
Author

It's [celeryd: celery@REDACTED:MainProcess] -active- (--app=innocence.celery:app worker --loglevel=DEBUG) that's eating all the CPU.

Attached are the strace results, but I'm not sure what to make of them.

trace.txt

@rubendura

It's back, and this time in production. :( Restarting workers is not fixing the issue.
I exec'd into one of my celery containers. This is all the info I can get for now.

ps aux:

PID   USER     TIME   COMMAND
    1 root       0:00 /sbin/tini -- /entrypoint.sh newrelic-admin run-program celery -A config worker --concurrency=2 --loglevel=INFO
    5 root       0:33 {celery} /venv/bin/python /venv/bin/celery -A config worker --concurrency=2 --loglevel=INFO
   74 root       0:00 {celery} /venv/bin/python /venv/bin/celery -A config worker --concurrency=2 --loglevel=INFO
   77 root       0:00 {celery} /venv/bin/python /venv/bin/celery -A config worker --concurrency=2 --loglevel=INFO
   80 root       0:00 sh
  146 root       0:00 ps aux

strace -yy -t -p 5:

strace: Process 5 attached
strace: [ Process PID=5 runs in x32 mode. ]

strace -yy -t -p 74:

strace: Process 74 attached
13:29:57 read(4<pipe:[222527138]>,

strace -yy -t -p 77:

strace: Process 77 attached
13:30:22 read(8<pipe:[222527140]>,

lsof -p 5 (excluded REG and DIR types):

celery    5 root    0r     FIFO      0,10      0t0 222525536 pipe
celery    5 root    1w     FIFO      0,10      0t0 222525537 pipe
celery    5 root    2w     FIFO      0,10      0t0 222525538 pipe
celery    5 root    3u  a_inode      0,11        0      5773 [eventpoll]
celery    5 root    4r     FIFO      0,10      0t0 222527138 pipe
celery    5 root    5w     FIFO      0,10      0t0 222527138 pipe
celery    5 root    6r     FIFO      0,10      0t0 222527139 pipe
celery    5 root    7w     FIFO      0,10      0t0 222527139 pipe
celery    5 root    8r     FIFO      0,10      0t0 222527140 pipe
celery    5 root    9w     FIFO      0,10      0t0 222527140 pipe
celery    5 root   10r     FIFO      0,10      0t0 222527141 pipe
celery    5 root   11w     FIFO      0,10      0t0 222527141 pipe
celery    5 root   12r     FIFO      0,10      0t0 222527142 pipe
celery    5 root   13r     FIFO      0,10      0t0 222527150 pipe
celery    5 root   14u     IPv4 222527212      0t0       TCP f1887ff8da62:54220->54.239.33.139:https (CLOSE_WAIT)
celery    5 root   15u     IPv4 222527376      0t0       TCP f1887ff8da62:54238->54.239.33.139:https (CLOSE_WAIT)
celery    5 root   16r     FIFO      0,10      0t0 222527142 pipe
celery    5 root   17r     FIFO      0,10      0t0 222527150 pipe
celery    5 root   18u     IPv4 222527463      0t0       TCP f1887ff8da62:54256->54.239.33.139:https (CLOSE_WAIT)

lsof -p 74 (excluded REG and DIR types):

celery   74 root    0r     FIFO   0,10      0t0 222525536 pipe
celery   74 root    1w     FIFO   0,10      0t0 222525537 pipe
celery   74 root    2w     FIFO   0,10      0t0 222525538 pipe
celery   74 root    3u  a_inode   0,11        0      5773 [eventpoll]
celery   74 root    4r     FIFO   0,10      0t0 222527138 pipe
celery   74 root    7w     FIFO   0,10      0t0 222527139 pipe
celery   74 root    8r     FIFO   0,10      0t0 222527140 pipe
celery   74 root    9w     FIFO   0,10      0t0 222527140 pipe
celery   74 root   10r     FIFO   0,10      0t0 222527141 pipe
celery   74 root   11w     FIFO   0,10      0t0 222527141 pipe
celery   74 root   12r      CHR    1,3      0t0 222525653 /dev/null
celery   74 root   13w     FIFO   0,10      0t0 222527142 pipe

lsof -p 77 (excluded REG and DIR types):

celery   77 root    0r     FIFO   0,10      0t0 222525536 pipe
celery   77 root    1w     FIFO   0,10      0t0 222525537 pipe
celery   77 root    2w     FIFO   0,10      0t0 222525538 pipe
celery   77 root    3u  a_inode   0,11        0      5773 [eventpoll]
celery   77 root    4r     FIFO   0,10      0t0 222527138 pipe
celery   77 root    5w     FIFO   0,10      0t0 222527138 pipe
celery   77 root    6r     FIFO   0,10      0t0 222527139 pipe
celery   77 root    7w     FIFO   0,10      0t0 222527139 pipe
celery   77 root    8r     FIFO   0,10      0t0 222527140 pipe
celery   77 root   11w     FIFO   0,10      0t0 222527141 pipe
celery   77 root   12r     FIFO   0,10      0t0 222527142 pipe
celery   77 root   13r      CHR    1,3      0t0 222525653 /dev/null
celery   77 root   14w     FIFO   0,10      0t0 222527150 pipe

Log output:

[2016-12-16 13:10:36,063: INFO/PoolWorker-1] New Relic Python Agent (2.74.0.54)
[2016-12-16 13:10:36,227: INFO/PoolWorker-2] New Relic Python Agent (2.74.0.54)
[2016-12-16 13:10:36,838: INFO/MainProcess] Connected to sqs://localhost//
[2016-12-16 13:10:37,136: INFO/MainProcess] celery@f1887ff8da62 ready.
[2016-12-16 13:10:37,382: INFO/PoolWorker-1] Successfully registered New Relic Python agent where app_name='Hypit (Celery)', pid=74, redirect_host='collector-248.newrelic.com' and agent_run_id='121175734131582946', in 1.24 seconds.
[2016-12-16 13:10:37,383: INFO/PoolWorker-1] Reporting to: https://rpm.newrelic.com/accounts/811448/applications/27690916
[2016-12-16 13:10:37,525: INFO/PoolWorker-2] Successfully registered New Relic Python agent where app_name='Hypit (Celery)', pid=77, redirect_host='collector-248.newrelic.com' and agent_run_id='121175734131582957', in 1.19 seconds.
[2016-12-16 13:10:37,526: INFO/PoolWorker-2] Reporting to: https://rpm.newrelic.com/accounts/811448/applications/27690916
[2016-12-16 13:10:38,229: INFO/MainProcess] Received task: hypit.notifications.tasks.process_new_follow[63e1c5f7-6d6e-4e99-afc5-23fbadf7bbd1]
[2016-12-16 13:10:38,234: INFO/MainProcess] Received task: hypit.analytics.tasks.track[281f6771-1b20-473f-a2a6-3b8344c7cbd0]
[2016-12-16 13:10:38,235: INFO/MainProcess] Received task: hypit.notifications.tasks.process_views[00e0cce5-4c9f-4557-8b5f-ec0f35e2eb2c]
[2016-12-16 13:10:38,236: INFO/MainProcess] Received task: hypit.notifications.tasks.process_views[07defc7f-7077-4cb4-81f5-1304f86484b2]
[2016-12-16 13:10:38,236: INFO/MainProcess] Received task: hypit.notifications.tasks.process_views[c2df896d-f169-4a00-8009-a66b245a0fd2]
[2016-12-16 13:10:38,237: INFO/MainProcess] Received task: hypit.users.tasks.subscribe_to_mailchimp[76f27135-a83c-4c67-8257-26606b900a49]
[2016-12-16 13:10:38,238: INFO/MainProcess] Received task: hypit.analytics.tasks.track[f4533190-5af2-455d-9c4d-f00de472e86b]
[2016-12-16 13:10:38,237: INFO/MainProcess] Received task: hypit.users.tasks.subscribe_to_mailchimp[87386f85-4dcf-4f08-8395-0966842f9ea0]

Logs after sending SIGUSR1 to PID 5:

MainThread
=================================================
  File "/venv/bin/celery", line 11, in <module>
    sys.exit(main())
  File "/venv/lib/python3.5/site-packages/celery/__main__.py", line 14, in main
    _main()
  File "/venv/lib/python3.5/site-packages/celery/bin/celery.py", line 326, in main
    cmd.execute_from_commandline(argv)
  File "/venv/lib/python3.5/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/venv/lib/python3.5/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/venv/lib/python3.5/site-packages/celery/bin/celery.py", line 480, in handle_argv
    return self.execute(command, argv)
  File "/venv/lib/python3.5/site-packages/celery/bin/celery.py", line 412, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/venv/lib/python3.5/site-packages/celery/bin/worker.py", line 221, in run_from_argv
    return self(*args, **options)
  File "/venv/lib/python3.5/site-packages/celery/bin/base.py", line 244, in __call__
    ret = self.run(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/celery/bin/worker.py", line 256, in run
    worker.start()
  File "/venv/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
    c.loop(*c.loop_args())
  File "/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/venv/lib/python3.5/site-packages/kombu/async/hub.py", line 282, in create_loop
    item()
  File "/venv/lib/python3.5/site-packages/vine/promises.py", line 136, in __call__
    retval = fun(*final_args, **final_kwargs)
  File "/venv/lib/python3.5/site-packages/kombu/transport/SQS.py", line 293, in _schedule_queue
    self._loop1(queue)
  File "/venv/lib/python3.5/site-packages/kombu/transport/SQS.py", line 284, in _loop1
    self.hub.call_soon(self._schedule_queue, queue)
  File "/venv/lib/python3.5/site-packages/kombu/async/hub.py", line 193, in call_soon
    if not isinstance(callback, Thenable):
  File "/usr/local/lib/python3.5/abc.py", line 188, in __instancecheck__
    subclass in cls._abc_negative_cache):
  File "/usr/local/lib/python3.5/_weakrefset.py", line 70, in __contains__
    def __contains__(self, item):
  File "/venv/lib/python3.5/site-packages/celery/apps/worker.py", line 345, in cry_handler
    safe_say(cry())
  File "/venv/lib/python3.5/site-packages/celery/utils/debug.py", line 193, in cry
    traceback.print_stack(frame, file=out)
=================================================
LOCAL VARIABLES
=================================================
{'P': functools.partial(<built-in function print>, file=<vine.five.WhateverIO object at 0x7f4376de95e8>),
 'frame': <frame object at 0x55ae1f8e9ff8>,
 'out': <vine.five.WhateverIO object at 0x7f4376de95e8>,
 'sep': '=================================================',
 'sepchr': '=',
 'seplen': 49,
 'thread': <_MainThread(MainThread, started 139927906708264)>,
 'threading': <module 'threading' from '/usr/local/lib/python3.5/threading.py'>,
 'tid': 139927906708264,
 'tmap': {139927906708264: <_MainThread(MainThread, started 139927906708264)>}}

(Log data can be slightly out of order due to how CloudWatch Logs captures log output from Docker.)

It might just be me not understanding what's going on in all this data, but to me it looks like there is a failure when the worker processes try to communicate with the master process. I have no clue why, or even whether that's right, and even less how to fix it. Help. :(

@rubendura

Since this seems to be an issue with the worker processes communicating with the master, I tried running celery with the "solo" worker pool, and tasks are now being run properly.
Although I don't think this is a good solution in any way, I'm more or less forced to run my workers with the solo pool in production so my tasks don't pile up in my queues.
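
A minimal sketch of the workaround, equivalent to starting the worker with --pool solo (the app name and broker here are placeholders):

from celery import Celery

app = Celery('config', broker='sqs://')
# run tasks directly in the consumer process instead of a prefork pool,
# sidestepping the master/child pipe communication that appears to be wedging
app.conf.worker_pool = 'solo'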

Any help trying to fix this will be greatly appreciated.

@eire1130

eire1130 commented Dec 16, 2016

I think I'm running into this as well. I tried the above setting, but it's not working for me.

[2016-12-16 12:45:54,741: INFO/MainProcess] Received task: accounts_send_subscription_email_task[0588aabc-2455-4db7-8a4c-83e77c0c74c1]  ETA:[2016-12-15 20:44:12.688502+00:00] 
[2016-12-16 12:45:54,741: INFO/MainProcess] Received task: accounts_send_recurring_notification_email_task[743c0246-8dcc-45bb-a7d4-05c09062f446]  ETA:[2016-12-15 22:41:00.658061+00:00] 
[2016-12-16 12:45:54,742: INFO/MainProcess] Received task: messaging_send_welcome_message_task[a64e2152-ce92-430f-8072-026507d7cec0]  
[2016-12-16 12:45:55,046: INFO/MainProcess] Task messaging_send_welcome_message_task[a64e2152-ce92-430f-8072-026507d7cec0] succeeded in 0.302867134567s: None
[2016-12-16 12:45:55,048: INFO/MainProcess] Received task: messaging_send_message_email_task[27194328-43da-4306-a3ff-f652d74cd344]  
[2016-12-16 12:45:55,146: INFO/MainProcess] Task messaging_send_message_email_task[27194328-43da-4306-a3ff-f652d74cd344] succeeded in 0.0971612338908s: None

I'm using these settings:

+WORKER_PREFETCH_MULTIPLIER = 1
+CELERY_WORKER_POOL = 'solo'

I note this seems to happen when there are retries (and in this case the retries are actually in the past).

Also, I saw a trace similar to the ones posted above.

In lsof I noticed:

python  27945 celery   12r  FIFO     0,8      0t0 8085168 pipe
python  27945 celery   13w  FIFO     0,8      0t0 8085168 pipe

@AstraLuma
Author

Using the solo pool does not fix this for me.

@AstraLuma
Author

The strace output when using the solo pool is the same as my prior attachment.

@knabben

knabben commented Dec 22, 2016

One of my workers has the same problem right now; the version is 4.0.1.

strace doesn't output anything, but with ltrace I see a loop of:

sem_post(0x1af45a0, 0x7fb757b41758, 0x961e78, 0x19ee0c0)                                                                = 0
sem_wait(0x1af45a0, 1, 1, 0)                                                                                            = 0
memset(0x7fb75bef42f0, '\0', 16)                                                                                        = 0x7fb75bef42f0
memset(0x7fb757c31b00, '\0', 120)                                                                                       = 0x7fb757c31b00
memset(0x7fb757be6a18, '\0', 24)                                                                                        = 0x7fb757be6a18
memset(0x7fb757bc3b28, '\0', 8)                                                                                         = 0x7fb757bc3b28
memset(0x7fb757c272f0, '\0', 16)                                                                                        = 0x7fb757c272f0
memset(0x7fb757bc3b28, '\0', 8)                                                                                         = 0x7fb757bc3b28
memset(0x7fb757c272f0, '\0', 16)                                                                                        = 0x7fb757c272f0
memset(0x7fb75be456a8, '\0', 8)                                                                                         = 0x7fb75be456a8
memset(0x7fb75be456a8, '\0', 8)                                                                                         = 0x7fb75be456a8
sem_post(0x1af45a0, 0x7ffe5b8d9d88, 0x961e78, 0x19ee0c0)                                                                = 0
sem_wait(0x1af45a0, 1, 1, 0)                                                                                            = 0
memset(0x7fb75bef42f0, '\0', 16)                                                                                        = 0x7fb75bef42f0
memset(0x7fb757b794a8, '\0', 8)                                                                                         = 0x7fb757b794a8
memset(0x7fb75bf5ffe0, '\0', 16)                                                                                        = 0x7fb75bf5ffe0
memset(0x7fb757be6a18, '\0', 24)                                                                                        = 0x7fb757be6a18
memset(0x7fb75be45ee8, '\0', 8)                                                                                         = 0x7fb75be45ee8
memset(0x7fb75be57f08, '\0', 16)                                                                                        = 0x7fb75be57f08
memset(0x7fb75be45ce8, '\0', 8)                                                                                         = 0x7fb75be45ce8
memset(0x7fb75be45ee8, '\0', 8)                                                                                         = 0x7fb75be45ee8
memset(0x7fb75be45ee8, '\0', 8)                                                                                         = 0x7fb75be45ee8
memset(0x7fb75be57f08, '\0', 16)                                                                                        = 0x7fb75be57f08
memset(0x7fb75be45ce8, '\0', 8)                                                                                         = 0x7fb75be45ce8
memset(0x7fb75bef42f0, '\0', 16)                                                                                        = 0x7fb75bef42f0
memset(0x7fb757b8d938, '\0', 120)                                                                                       = 0x7fb757b8d938
memset(0x7fb757be6a18, '\0', 24)                                                                                        = 0x7fb757be6a18
memset(0x7fb757b794a8, '\0', 8)                                                                                         = 0x7fb757b794a8
memset(0x7fb757c272f0, '\0', 16)                                                                                        = 0x7fb757c272f0
memset(0x7fb757b794a8, '\0', 8)                                                                                         = 0x7fb757b794a8
sem_post(0x1af45a0, 0x1ad5f48, 0x961e78, 0x19ee0c0)                                                                     = 0
sem_wait(0x1af45a0, 1, 1, 0)       

@AstraLuma
Author

AstraLuma commented Dec 28, 2016

Now this is odd. Under the solo pool, it's not processing jobs, but CPU is basically idle.

strace.txt

Messing around in GDB seems to indicate it's hanging out in a select() called from Python (not libcurl).

Killing the process and letting supervisord restart it seems to have worked.

@AstraLuma
Author

AstraLuma commented Dec 28, 2016

It does seem that the solo pool disables the kill -USR1 handler, or else I'm not sure where to find the output.

@AstraLuma
Author

AstraLuma commented Dec 29, 2016

While the use of the solo pool seems to prevent the CPU from maxing out when the system wedges, it does not prevent it from wedging and requiring a kick of some flavor.

(Restarting the worker, either with a redeploy or a kill, seems to work most often.)

@AstraLuma AstraLuma changed the title [master] celery worker pegging CPU celery worker wedging, stops processing jobs Dec 29, 2016
@rafales

rafales commented Jan 14, 2017

@astronouth7303 #3712

@thedrow
Member

thedrow commented Jul 11, 2017

This is a duplicate of #3712 and it will be fixed in the next released version. :)

@thedrow thedrow closed this as completed Jul 11, 2017