celery worker wedging, stops processing jobs #3487
Comments
Can you strace (Linux) / ktrace (BSD) / dtruss (macOS) the process at the point when it's using CPU? Also |
Now that's obnoxious. While I'm waiting for the deploy to add … So I guess it was a bad message? But I'm using the JSON serializer... IDK. I'll close this in a few hours, or Monday if it doesn't come back. |
I am intermittently getting #3486, even though I've done multiple deploys and even rebuilt (i.e., it's a whole new instance). |
Ok, yeah, it poofed. |
This still happens intermittently. |
Most recently, seemed to be triggered by a new deploy. |
Setting a redrive policy helps; the messages end up in the deadletter queue. |
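(For anyone who wants to try the same workaround: a redrive policy can be attached to the SQS queue with boto3 roughly as below. This is a minimal sketch; the queue URL, dead-letter queue ARN, and retry count are placeholders, not values from this project.)

```python
import json

import boto3

sqs = boto3.client("sqs")

# Placeholders: substitute your own queue URL and dead-letter queue ARN.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-celery-queue"
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:my-celery-dlq"

# After maxReceiveCount failed receives, SQS moves the message to the
# dead-letter queue instead of redelivering it to the worker forever.
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": DLQ_ARN, "maxReceiveCount": "20"}
        )
    },
)
```

The same policy can also be set from the SQS console; the point is just that a bad message stops being redelivered once the receive count is exhausted.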
To anyone else seeing this issue: Setting a redrive policy with a deadletter queue and a retry count of 20 seems to be working for me, but my application is extremely low-volume. This is hard to reproduce, but I suspect it's related to a bad message. One such message is:
Decoded and pretty-printed:

```json
{
  "headers": {
    "eta": null,
    "task": "health_check_celery3.tasks.add",
    "group": null,
    "parent_id": null,
    "argsrepr": "[4, 4]",
    "timelimit": [null, null],
    "kwargsrepr": "{}",
    "lang": "py",
    "retries": 0,
    "id": "69201ab9-bf4a-427d-b2b1-4f84690712af",
    "root_id": null,
    "origin": "gen13817@ip-172-31-53-11",
    "expires": "2016-10-13T11:34:18.671748"
  },
  "body": "W1s0LCA0XSwge30sIHsiZXJyYmFja3MiOiBudWxsLCAiY2FsbGJhY2tzIjogbnVsbCwgImNob3JkIjogbnVsbCwgImNoYWluIjogbnVsbH1d",
  "properties": {
    "body_encoding": "base64",
    "priority": 0,
    "reply_to": "b04f6ea0-0260-377e-a723-774424e1c270",
    "correlation_id": "69201ab9-bf4a-427d-b2b1-4f84690712af",
    "delivery_mode": 2,
    "delivery_tag": "b688bef6-926a-42d0-b756-1537fae68db7",
    "delivery_info": {
      "routing_key": "inno-beta-InnocenceJobQueue-NAX4OXV9F56W",
      "exchange": ""
    }
  },
  "content-encoding": "utf-8",
  "content-type": "application/json"
}
```

Body:

```json
[
  [4, 4],
  {},
  {
    "errbacks": null,
    "callbacks": null,
    "chord": null,
    "chain": null
  }
]
```
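(For reference, the decoded body above can be reproduced directly from the message's `body` field; a minimal sketch:)

```python
import base64
import json

# The "body" field from the message envelope above.
body_b64 = "W1s0LCA0XSwge30sIHsiZXJyYmFja3MiOiBudWxsLCAiY2FsbGJhY2tzIjogbnVsbCwgImNob3JkIjogbnVsbCwgImNoYWluIjogbnVsbH1d"

# A protocol-2 task body decodes to [args, kwargs, embed].
args, kwargs, embed = json.loads(base64.b64decode(body_b64))
print(args, kwargs, embed)  # [4, 4] {} {'errbacks': None, 'callbacks': None, ...}
```

The decoded args match the trivial add(4, 4) health-check task described in the original report.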
Never mind. The redrive policy helps sometimes. Sometimes rebuilding the environment works (because Elastic Beanstalk), sometimes rebooting the instance works. |
I've experienced a similar issue using SQS and boto3. I was never able to reproduce it consistently, but I did see it happen on multiple machines. The app worked fine before connecting it to SQS with boto3. Essentially, everything worked fine sometimes; other times it either wouldn't run the tasks or produced some error message, which is now forgotten. Rebooting machines, clearing caches, etc. could make it work again temporarily, but it always eventually quit. We had to change our stack to avoid a production issue. I hope this helps reaffirm that there is an issue somewhere, perhaps with boto3, perhaps here. |
I'm seeing the same issue here and it's driving me nuts. When running with …
It seems to happen at completely random times; some tasks work sometimes, until they don't. I am honestly running out of ideas. :( |
@rubendura per @ask:
(I haven't gotten the chance to, because I haven't seen the issue recently.) |
I can't now, as the issue is gone (for now). I noticed that for some reason timeouts were missing from this project's config, so I added some task timeouts and purged all messages in my SQS queue. So far I've only seen one timeout, apparently due to a Django DB exception. I've fixed that, but it kind of bugs me that when an exception was raised in one of my workers, instead of printing some logs or retrying the task it just... hung, and brought my CPU to full usage. I would've expected to see some logs, or Celery to capture the exception and maybe restart the worker (which means that after the SQS visibility timeout the task would've been retried). The only thing that could potentially be interfering is the Sentry integration, but I haven't investigated that yet. |
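(For anyone adding timeouts the same way: the relevant Celery 4 settings look roughly like this. A minimal sketch; the app name, broker URL, and limit values are illustrative, not the ones used above.)

```python
from celery import Celery

app = Celery("myproject", broker="sqs://")  # name and broker are placeholders

# Kill any task still running after task_time_limit seconds; raise
# SoftTimeLimitExceeded inside the task at task_soft_time_limit so it
# gets a chance to clean up first.
app.conf.task_time_limit = 300
app.conf.task_soft_time_limit = 270
```

With SQS it is usually worth keeping the queue's visibility timeout longer than the task time limit, otherwise messages can be redelivered while the original delivery is still being worked on.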
IDK. In my case the primary task is just a heartbeat that should never, ever throw an exception. |
Ok. I've got it again. It's some kind of weirdness with the processes cycling? The PIDs keep changing. |
It's … Attached are the strace results, but I'm not sure what's happening with them. |
It's back. And this time in production. :( Restarting workers is not fixing the issue. ps aux:
strace -yy -t -p 5:
strace -yy -t -p 74:
strace -yy -t -p 77:
lsof -p 5 (excluded REG and DIR types):
lsof -p 74 (excluded REG and DIR types):
lsof -p 77 (excluded REG and DIR types):
logs output:
logs after sending SIGUSR1 to PID 5:
(Log data can be slightly out of order due to how CloudWatch Logs captures log output from Docker.) It might just be me not understanding what's going on in all this data, but to me it looks like there is a failure when the worker processes try to communicate with the master process. I have no clue why, or even whether that reading is right, and even less how to fix it. Help. :( |
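(For context on the SIGUSR1 output above: Celery's worker installs a SIGUSR1 handler that logs a traceback for every active thread, which is one way to see where a wedged process is stuck. A sketch; the PID is hard-coded to match the ps output above and is otherwise arbitrary.)

```python
import os
import signal

# Ask the worker master process (PID 5 in the ps output above) to dump a
# traceback for each of its active threads into the worker log.
os.kill(5, signal.SIGUSR1)
```

The equivalent from a shell is simply sending USR1 to the worker's PID with kill.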
Since this looks like an issue with the worker processes communicating with the master, I tried running Celery with the "solo" worker pool, and tasks are now being run properly. Any help fixing this would be greatly appreciated. |
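(For anyone who wants to try the same workaround: the pool can be selected on the command line with `celery -A proj worker --pool=solo`, or in configuration. A minimal sketch of the latter; the app name and broker URL are placeholders.)

```python
from celery import Celery

app = Celery("myproject", broker="sqs://")  # placeholders

# "solo" runs tasks in the worker's main process instead of the prefork
# pool, which sidesteps the master<->child pipe communication entirely.
app.conf.worker_pool = "solo"
```

The trade-off is that the solo pool handles one task at a time, so per-worker concurrency is lost.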
I think I'm running into this as well. I tried the above setting, but it's not working for me.
I'm using these settings:
I note this seems to happen when there are retries (and in this case the retries are actually in the past). Also, I had a trace similar to the above individuals': in lsof I noticed:
|
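(For readers wondering what "retries" means here: Celery retries re-enqueue the message with an ETA in the future, roughly as below. This is a hypothetical task for illustration only; the name, delay, and exception handling are not from this project.)

```python
from urllib.request import urlopen

from celery import Celery

app = Celery("myproject", broker="sqs://")  # placeholders

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def fetch_url(self, url):
    """Hypothetical task illustrating how a retry is scheduled."""
    try:
        return len(urlopen(url, timeout=10).read())
    except OSError as exc:
        # self.retry() re-enqueues the message with an ETA about 60s ahead;
        # if the worker is wedged for longer than that, the ETA is already
        # in the past by the time the message is received again.
        raise self.retry(exc=exc)
```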
Using the |
The |
One of my workers has the same problem right now; the version is 4.0.1. strace doesn't output anything, but with ltrace I see a loop with:
|
Now this is odd. Under the … Messing around in GDB seems to indicate it's hanging out in a … Killing the process and letting
It does seem that the |
While the use of the (Usually restarting the worker, either with a redeploy or a |
@astronouth7303 #3712 |
This is a duplicate of #3712 and it will be fixed in the next released version. :) |
I'm using Django + Celery on AWS Elastic Beanstalk with the SQS broker. I've got a periodic health check issuing a trivial job (add two numbers). I'm currently on master, because 4.0.0rc4 has some critical bugs.
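(The health-check task is about as trivial as it gets; something along these lines, sketched from the `health_check_celery3.tasks.add` name that shows up in the message headers elsewhere in this thread. The module layout and broker URL are assumptions.)

```python
# health_check_celery3/tasks.py (assumed layout)
from celery import Celery

app = Celery("health_check_celery3", broker="sqs://")  # broker URL elided

@app.task
def add(x, y):
    # The periodic health check enqueues add(4, 4) and expects 8 back.
    return x + y
```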
The celery worker is maxing out the CPU to 100%, but no jobs are getting processed.
The log is:
c3b4de0f-d863-46df-b174-39fca55cffef is the only task I've noticed repeating. I can't find any other even slightly relevant log entries.
Really, all I know is that the queue isn't getting shorter, and the CPU is maxed out.
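(One way to confirm "the queue isn't getting shorter" without the console is to poll the approximate depth with boto3; a sketch, with the queue URL as a placeholder.)

```python
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-celery-queue"  # placeholder

while True:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # still waiting in the queue
            "ApproximateNumberOfMessagesNotVisible",  # delivered but not yet deleted
        ],
    )["Attributes"]
    print(attrs["ApproximateNumberOfMessages"], attrs["ApproximateNumberOfMessagesNotVisible"])
    time.sleep(30)
```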