Celery worker using 100% CPU around epoll w/ prefork+redis and not consuming tasks #1845

Closed
bpowers opened this issue Feb 3, 2014 · 68 comments

@bpowers
Contributor

commented Feb 3, 2014

Hello,

We're using Celery 3.1.6, and the master process for one of our queues is using 100% CPU. ps shows:

root      6960 59.2  0.7 160680 51628 ?        Rs   Feb01 2319:49 /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root     32727  0.3  1.1 266008 82860 ?        S    Feb02   6:29  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root     32729  0.2  1.1 265184 82040 ?        S    Feb02   5:54  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root     32730  0.2  1.4 282868 99788 ?        S    Feb02   5:29  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root       750  0.1  1.5 296584 113100 ?       S    Feb02   3:03  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root       917  0.2  1.1 262044 78888 ?        S    Feb02   6:00  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root      1873  0.0  1.2 269460 86276 ?        S    Feb02   1:34  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root      1875  0.0  1.4 285568 102384 ?       S    Feb02   0:36  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root      2347  0.0  0.8 245448 60984 ?        S    Feb02   1:47  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root      2348  0.0  0.8 247760 63356 ?        S    Feb02   1:33  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO
root      2362  0.0  1.0 257880 73416 ?        S    Feb02   1:09  \_ /$VENV/bin/python /$VENV/manage.py celery worker -Q ads -P prefork --autoscale=10,5 -n $NAME --settings=settings -l INFO

And strace is more interesting. It shows the following loop as the only thing happening in the parent:

gettimeofday({1391450996, 406286}, NULL) = 0
epoll_wait(4, {{EPOLLIN, {u32=6, u64=22205092589469702}}}, 1023, 1000) = 1
epoll_ctl(4, EPOLL_CTL_DEL, 34, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 67, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 68, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 38, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 10, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 53, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 26, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 59, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 30, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 63, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x4e9020, {u32=0, u64=22205092589469696}}) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_ADD, 51, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=51, u64=22205092589469747}}) = -1 EEXIST (File exists)
epoll_ctl(4, EPOLL_CTL_ADD, 52, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=52, u64=22205092589469748}}) = -1 EEXIST (File exists)
epoll_ctl(4, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=6, u64=22205092589469702}}) = -1 EEXIST (File exists)

Those FDs are pipes according to /proc/$PID/fd. (strace also shows heartbeat/worker chatter from other processes read via the redis socket).

Restarting the upstart job caused the worker to begin consuming tasks again. I think this is some corner case that isn't being handled well. Kombu's eventio.py has the following:

class _epoll(Poller):
    # ...

    def register(self, fd, events):
        try:
            self._epoll.register(fd, events)
        except Exception as exc:
            # fd already registered (EEXIST): ignore; re-raise anything else
            if get_errno(exc) != errno.EEXIST:
                raise

    def unregister(self, fd):
        try:
            self._epoll.unregister(fd)
        except (socket.error, ValueError, KeyError):
            pass
        except (IOError, OSError) as exc:
            # fd not registered (ENOENT): ignore; re-raise anything else
            if get_errno(exc) != errno.ENOENT:
                raise

We are specifically checking for ENOENT and EEXIST here and silently dropping the exception. It seems like this might be masking a problem further down the stack, but I'm not familiar with this part of celery.
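
For reference, a tiny standalone snippet (not kombu code, just an illustration) reproduces the two errno values being swallowed above: registering an fd twice fails with EEXIST, and unregistering an fd that is no longer in the epoll set fails with ENOENT:

import errno
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)

try:
    ep.register(r, select.EPOLLIN)   # second ADD for the same fd
except (IOError, OSError) as exc:
    assert exc.errno == errno.EEXIST

ep.unregister(r)
try:
    ep.unregister(r)                 # DEL for an fd no longer in the epoll set
except (IOError, OSError) as exc:
    assert exc.errno == errno.ENOENT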

Any help/pointers appreciated. Let me know what other info is useful as well.

@idealopamp

commented Feb 5, 2014

We are seeing the same issue, except that the CPU is not running hot for us. One thing we have observed that isn't mentioned in the OP: when malfunctioning, celery continues to send PUBLISH commands for celeryev, but does NOT send any BRPOP commands.

E.g. in a functioning celery strace, we see lines like
31543 sendto(116, "*6\r\n$5\r\nBRPOP\r\n$14\r\nts.tasks.voi"..., 115, 0, NULL, 0) = 115

However, in the malfunctioning celery processes, no such commands are issued. (We also see that this is missing from tcpdump and redis monitoring.)

Also, in case it's relevant, we are also seeing the issue documented in celery/kombu#292 , which also seems to be related to a problem with BRPOP registration.

@ask

Contributor

commented Feb 9, 2014

It will not send BRPOP commands if the prefetch limit is exceeded, so make sure there are no active tasks at that point.
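
A quick way to check that from Python is the inspect API; a minimal sketch, assuming an app instance and a Redis broker URL:

from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')  # assumed broker URL

insp = app.control.inspect(timeout=5)
print('active:  ', insp.active())    # tasks currently executing, per worker
print('reserved:', insp.reserved())  # tasks prefetched but not yet started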

@bpowers

Contributor Author

commented Feb 13, 2014

I have a feeling that commit d3674b6 fixed this. Hoping to test today.

@bpowers

Contributor Author

commented Feb 18, 2014

As an update, I don't think d3674b6 fixes this. I still saw the epoll_ctl DEL and ADD calls after applying it, although the upgrade from 3.1.6 to 3.1.9 introduced major memory leaks in the prefork MainProcess, so we had to roll back to 3.1.6. I hope to open a new, separate bug about that shortly.

@ask added the Redis label Feb 21, 2014

@rcrowley

commented Feb 22, 2014

I believe we've experienced this bug on 3.1.8. I haven't been able to reproduce it but I've seen it occur on two different boxen at approximately the same time.

Unhealthy:

epoll_ctl(15, EPOLL_CTL_DEL, 31, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 27, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 23, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
gettimeofday({1393110357, 781546}, NULL) = 0
epoll_wait(15, {{EPOLLIN, {u32=48, u64=22420390710083632}}}, 1023, 1058) = 1
epoll_ctl(15, EPOLL_CTL_DEL, 19, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 31, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 27, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)

Healthy:

epoll_ctl(15, EPOLL_CTL_DEL, 31, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 19, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 23, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
gettimeofday({1393112101, 47671}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {7668144, 767080453}) = 0
open("/proc/loadavg", O_RDONLY)         = 60
read(60, "0.01 0.07 0.14 6/1732 10640\n", 64) = 28
close(60)                               = 0
gettimeofday({1393112101, 48626}, NULL) = 0
sendto(52, "*3\r\n$7\r\nPUBLISH\r\n$8\r\nceleryev\r\n$"..., 706, 0, NULL, 0) = 706
recvfrom(52, ":1\r\n", 8192, 0, NULL, NULL) = 4
gettimeofday({1393112101, 50508}, NULL) = 0
gettimeofday({1393112101, 50658}, NULL) = 0
epoll_wait(15, {{EPOLLIN, {u32=58, u64=22420390710083642}}}, 1023, 1997) = 1
recvfrom(58, "*3\r\n$7\r\nmessage\r\n$8\r\nceleryev\r\n$"..., 8192, 0, NULL, NULL) = 706
epoll_ctl(15, EPOLL_CTL_DEL, 27, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 31, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_DEL, 19, {EPOLLRDBAND|EPOLLWRNORM|EPOLLHUP|0x4f7020, {u32=0, u64=22420390710083584}}) = -1 ENOENT (No such file or directory)
@mlcamilli

commented Feb 28, 2014

Having this problem as well, anyone made any progress?

@Jwpe

commented Feb 28, 2014

Just for context around @mlcamilli's comment, we are using 3.1.7 and experiencing exactly the same log output as @rcrowley. Restarting the offending 'stuck' worker through supervisor causes it to resume consuming tasks. We have our CELERYD_PREFETCH_MULTIPLIER set to 1, but we are now experimenting with the default multiplier to see if that makes a difference.

It seems like this is an underlying issue with the worker knowing when to continue consuming tasks. Does anyone have more insight into this?

@bpowers

Contributor Author

commented Feb 28, 2014

It is somewhat anecdotal, but to my knowledge I haven't seen stuck workers after upgrading to 3.1.9. What about you, @idealopamp?

@mlcamilli

commented Feb 28, 2014

@bpowers 3.1.9 did not fix our problem, which seems to be the same problem @idealopamp is having

@ionelmc

Contributor

commented Mar 1, 2014

Can you guys try to use the breakpoint signal to get a python stacktrace?
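
For anyone following along: the Celery worker installs a SIGUSR1 handler (cry_handler) that dumps a traceback for every thread into the log, so sending that signal to the stuck MainProcess is enough. A minimal sketch, with the pid taken from ps:

import os
import signal

worker_pid = 6960   # example: the MainProcess pid from the ps output above
os.kill(worker_pid, signal.SIGUSR1)   # the traceback shows up in the worker log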

@bpowers

Contributor Author

commented Mar 3, 2014

This is from cry() via SIGUSR1:

MainThread
=================================================
  File "/var/my_celery_app/manage.py", line 16, in <module>
    execute_from_command_line(sys.argv)
  File "/var/my_celery_app/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/var/my_celery_app/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/my_celery_app/lib/python2.7/site-packages/djcelery/management/commands/celery.py", line 22, in run_from_argv
    ['{0[0]} {0[1]}'.format(argv)] + argv[2:],
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/celery.py", line 768, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/base.py", line 308, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/celery.py", line 760, in handle_argv
    return self.execute(command, argv)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/celery.py", line 692, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/worker.py", line 175, in run_from_argv
    return self(*args, **options)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/base.py", line 271, in __call__
    ret = self.run(*args, **kwargs)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bin/worker.py", line 208, in run
    state_db=self.node_format(state_db, hostname), **kwargs
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
    self.blueprint.start(self)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bootsteps.py", line 373, in start
    return self.obj.start()
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/worker/consumer.py", line 278, in start
    blueprint.start(self)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/worker/consumer.py", line 796, in start
    c.loop(*c.loop_args())
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/worker/loops.py", line 72, in asynloop
    next(loop)
  File "/var/my_celery_app/lib/python2.7/site-packages/kombu/async/hub.py", line 267, in create_loop
    tick_callback()
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/concurrency/asynpool.py", line 655, in on_poll_start
    [hub_remove(fd) for fd in diff(active_writes)]
  File "/var/my_celery_app/lib/python2.7/site-packages/kombu/async/hub.py", line 166, in remove
    self._unregister(fd)
  File "/var/my_celery_app/lib/python2.7/site-packages/kombu/async/hub.py", line 230, in _unregister
    self.poller.unregister(fd)
  File "/var/my_celery_app/lib/python2.7/site-packages/kombu/utils/eventio.py", line 85, in unregister
    self._epoll.unregister(fd)
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/apps/worker.py", line 347, in cry_handler
    safe_say(cry())
  File "/var/my_celery_app/lib/python2.7/site-packages/celery/utils/__init__.py", line 223, in cry
    traceback.print_stack(frame, file=out)
=================================================
LOCAL VARIABLES
=================================================
{'P': <functools.partial object at 0x3070f70>,
 'frame': <frame object at 0x3022930>,
 'out': <celery.five.WhateverIO object at 0x317c1d0>,
 'sep': '=================================================',
 'sepchr': '=',
 'seplen': 49,
 'thread': <_MainThread(MainThread, started 140334795097856)>,
 'threading': <module 'threading' from '/usr/lib/python2.7/threading.pyc'>,
 'tid': 140334795097856,
 'tmap': {140334795097856: <_MainThread(MainThread, started 140334795097856)>}}

(turns out my app is totally still seeing this)

@cameronmaske

commented Mar 3, 2014

@bpowers Just trying to debug this on my end too. From the trace you posted above, what is the giveaway that the bug is still present?

@bpowers

Contributor Author

commented Mar 3, 2014

@cameronmaske Well, I sent SIGUSR1 multiple times and it was always in [hub_remove(fd) for fd in diff(active_writes)]. Also, the MainProcess/parent of that queue was using 100% CPU, and tasks either weren't being consumed or were being consumed extremely slowly.

@bpowers

Contributor Author

commented Mar 4, 2014

TL;DR: the redis server closed a connection (not sure why yet - either a timeout, or redis might have crashed and restarted itself). Celery/kombu (redis backend) somehow loses track of this FD, but it stays in cycle.fds. Because the socket is in the CLOSE_WAIT state, epoll_wait returns immediately, but for some reason the redis backend never attempts to read from this FD - effectively turning the asynpool into a busy-wait loop, running as fast as it can and using 100% CPU. @ionelmc Hopefully this helps.

I've been looking into this a lot. The trick is noticing a couple of things. First, epoll_wait returns an FD that we're also adding (this is a shortened version of the content in my topmost comment/original report):

epoll_wait(4, {{EPOLLIN, {u32=6, u64=22205092589469702}}}, 1023, 1000) = 1
...
epoll_ctl(4, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=6, u64=22205092589469702}}) = -1 EEXIST (File exists)

Now, when I see this happening on our machines, I can look in /proc/$PID/fd and see that fd 6 (in this case) is a socket:

05:52:20 (master) [bpowers@fina socktest]$ ls -l /proc/4259/fd/6
lrwx------ 1 bpowers bpowers 64 Mar  4 17:52 /proc/4259/fd/6 -> socket:[1090478]

Then I use netstat with the socket id listed above (1090478 here):

05:52:24 (master) [bpowers@fina socktest]$ netstat -ae | grep 1090478
tcp        0      0 localhost:35453         localhost:distinct      CLOSE_WAIT  bpowers    1090478   

And we can see that this socket is in CLOSE_WAIT - the remote server (in our application's case, the redis server) has sent a FIN, and the local kernel is waiting for us to close the FD.

What we all have been seeing in the strace output with epoll is a symptom, rather than the cause. A socket registered with epoll that is in the CLOSE_WAIT state will cause epoll_wait to return immediately - I have 2 python scripts to test this here: https://github.com/bpowers/epoll-close-wait-example

This causes the hub to tick, but as the multiple strace outputs above show, we never attempt to recv from this FD - if we did, the redis backend would raise Empty, probably in _brpop_start.
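
In the spirit of those two scripts, here is a self-contained sketch of the failure mode (a loopback TCP connection stands in for the redis connection):

import select
import socket

listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)
local = socket.create_connection(listener.getsockname())  # "our" end
remote, _ = listener.accept()                             # stands in for redis

ep = select.epoll()
ep.register(local.fileno(), select.EPOLLIN | select.EPOLLERR | select.EPOLLHUP)

remote.close()   # the server sends a FIN; "local" is now stuck in CLOSE_WAIT

for _ in range(3):
    print(ep.poll(1.0))   # returns [(fd, EPOLLIN)] immediately every time
    # a recv() here would return b'' and reveal the dead connection;
    # never reading from the fd is what turns the hub into a busy-wait.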

@bpowers

Contributor Author

commented Mar 4, 2014

Finally, there is another stack trace that shows up often under SIGUSR1:

MainThread
=================================================
  File "/var/myapp/manage.py", line 16, in <module>
    execute_from_command_line(sys.argv)
  File "/var/myapp/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/var/myapp/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/myapp/lib/python2.7/site-packages/djcelery/management/commands/celery.py", line 22, in run_from_argv
    ['{0[0]} {0[1]}'.format(argv)] + argv[2:],
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/celery.py", line 768, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/base.py", line 308, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/celery.py", line 760, in handle_argv
    return self.execute(command, argv)
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/celery.py", line 692, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/worker.py", line 175, in run_from_argv
    return self(*args, **options)
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/base.py", line 271, in __call__
    ret = self.run(*args, **kwargs)
  File "/var/myapp/lib/python2.7/site-packages/celery/bin/worker.py", line 208, in run
    state_db=self.node_format(state_db, hostname), **kwargs
  File "/var/myapp/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
    self.blueprint.start(self)
  File "/var/myapp/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/var/myapp/lib/python2.7/site-packages/celery/bootsteps.py", line 373, in start
    return self.obj.start()
  File "/var/myapp/lib/python2.7/site-packages/celery/worker/consumer.py", line 278, in start
    blueprint.start(self)
  File "/var/myapp/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/var/myapp/lib/python2.7/site-packages/celery/worker/consumer.py", line 796, in start
    c.loop(*c.loop_args())
  File "/var/myapp/lib/python2.7/site-packages/celery/worker/loops.py", line 72, in asynloop
    next(loop)
  File "/var/myapp/lib/python2.7/site-packages/kombu/async/hub.py", line 267, in create_loop
    tick_callback()
  File "/var/myapp/lib/python2.7/site-packages/kombu/transport/redis.py", line 893, in on_poll_start
    [add_reader(fd, on_readable, fd) for fd in cycle.fds]
  File "/var/myapp/lib/python2.7/site-packages/kombu/async/hub.py", line 201, in add_reader
    return self.add(fds, callback, READ | ERR, args)
  File "/var/myapp/lib/python2.7/site-packages/kombu/async/hub.py", line 152, in add
    self.poller.register(fd, flags)
  File "/var/myapp/lib/python2.7/site-packages/kombu/utils/eventio.py", line 76, in register
    def register(self, fd, events):
  File "/var/myapp/lib/python2.7/site-packages/celery/apps/worker.py", line 347, in cry_handler
    safe_say(cry())
  File "/var/myapp/lib/python2.7/site-packages/celery/utils/__init__.py", line 223, in cry
    traceback.print_stack(frame, file=out)
=================================================
LOCAL VARIABLES
=================================================
{'P': <functools.partial object at 0x2fbc890>,
 'frame': <frame object at 0x32e6380>,
 'out': <celery.five.WhateverIO object at 0x2fd4050>,
 'sep': '=================================================',
 'sepchr': '=',
 'seplen': 49,
 'thread': <_MainThread(MainThread, started 140422350595840)>,
 'threading': <module 'threading' from '/usr/lib/python2.7/threading.pyc'>,
 'tid': 140422350595840,
 'tmap': {140422350595840: <_MainThread(MainThread, started 140422350595840)>}}
@ask

Contributor

commented Mar 5, 2014

And just to confirm: is everyone using Redis? I will attempt to run the stress tests under Linux with redis to try to reproduce.

@ask

Contributor

commented Mar 5, 2014

Probably there is an outdated fd in the event loop that causes the poll to run in a tight loop because the fd is always ready. Also note that some of those experiencing this bug may not have the same problem, especially if not using 100% CPU and there are active tasks. I believe EPOLL_CTL_DEL is normal too since it needs to remove the fds after each poll iteration.

@cameronmaske

commented Mar 5, 2014

Great stuff @bpowers. @ask yup we are all using Redis (I work with @mlcamilli + @Jwpe)

@bpowers

Contributor Author

commented Mar 5, 2014

I work with @idealopamp - we're both using redis. So I think that's everyone :)

@ask I agree about EPOLL_CTL_DEL and EPOLL_CTL_ADD - they are normal. What isn't normal, and is part of the problem, is what you said: an outdated fd in the event loop, one the remote redis server has closed, that is causing celery to busy-wait.

@bpowers

Contributor Author

commented Mar 5, 2014

OK, I have more (and better) data, @ask. I ran one of our workers under strace from the beginning. I then see (after a few minutes):

sendto(6, "*10\r\n$5\r\nBRPOP\r\n$7\r\ndefault\r\n$6\r\ncelery\r\n$10\r\ndefault\6\0263\r\n$9\r\ncelery\6\0263\r\n$10\r\ndefault\6\0266\r\n$9\r\ncelery\6\0266\r\n$10\r\ndefault\6\0269\r\n$9\r\ncelery\6\0269\r\n$1\r\n1\r\n", 144, 0, NULL, 0) = 144
epoll_wait(4, {{EPOLLIN, {u32=6, u64=22205092589469702}}}, 1023, 1000) = 1
recvfrom(6, "*2\r\n$6\r\ncelery\r\n$726\r\n{\"body\": \"${BASE64_PICKLE_DATA_ABOUT_NEW_TASK}\", \"headers\": {}, \"content-type\": \"application/x-python-serialize\", \"properties\": {\"body_encoding\": \"base64\", \"correlation_id\": \"c056887b-1436-41e9-97eb-21fcbea06773\", \"reply_to\": \"27a2d9ea-8f3a-3330-af64-0871bcb72e55\", \"delivery_info\": {\"priority\": 0, \"routing_key\": \"celery\", \"exchange\": \"celery\"}, \"delivery_mode\": 2, \"delivery_tag\": 841}, \"content-encoding\": \"binary\"}\r\n", 8192, 0, NULL, NULL) = 750
...

And then no BRPOP is ever sent to redis again (nor any sendto/recvfrom on FD 6). After 5 minutes we hit the 300 second redis server timeout (from our server's redis.conf), and fd 6 starts showing up in the epoll_wait result set. So something is causing us to read a task from redis but never issue a new BRPOP, yet the FD continues to be in the set of FDs we add every tick of the async loop. Nothing useful shows up at INFO level logging - I'll try DEBUG in a few.

@ask

Contributor

commented Mar 5, 2014

@bpowers Is this something you can reproduce fairly easily? Would you be able to create a small project demonstrating the issue?

@ask

Contributor

commented Mar 5, 2014

@bpowers: And are you sure that one initial task is not still active?

@bpowers

Contributor Author

commented Mar 5, 2014

It actually is definitely possible that one task is still active and takes longer than the 5 minute redis timeout. That could explain a lot. @ask I see it often, but I don't know if I can reduce it to a small test case easily. I'm going to start out by increasing our redis timeout.

@bpowers

Contributor Author

commented Mar 5, 2014

@ask yes, great. This doesn't happen for me with the redis timeout set to 0 (infinite). So I think the 'fix' for this boils down to: unregister the FD from the poller when we don't have an outstanding BRPOP.
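
For anyone who wants to inspect or change that server setting without editing redis.conf, a quick sketch using redis-py (assuming redis on localhost at the default port):

import redis

r = redis.StrictRedis(host='localhost', port=6379)
print(r.config_get('timeout'))   # e.g. {'timeout': '300'}
r.config_set('timeout', 0)       # 0 disables the idle-client timeout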

@tracer0tong

commented Mar 18, 2014

Hi guys, I have the same problem as in #1912 with the Redis broker. Do you have a solution for this issue (other than setting the Redis timeout to an infinite value)?

@bpowers

Contributor Author

commented Mar 19, 2014

Setting redis to a high timeout value works very well for me. The code changes to avoid this are not straightforward to me, or I would attempt a patch myself.

@ask

Contributor

commented Mar 20, 2014

If the timeout value affects this, then I would guess that a socket error occurs but is not handled properly by removing the fd from the event loop. That can be hard to solve; we need to delve into the depths of the redis code to see if we can add a callback for when an error is handled internally and not propagated.
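
Illustratively, the missing handling would look something like this (a rough sketch, not kombu's actual code or API): a readable socket that returns zero bytes has been closed by the peer, so it has to be dropped from the poller instead of being left to spin:

def on_readable(ep, sock):
    data = sock.recv(4096)
    if data:
        return data              # normal traffic
    # zero bytes means the peer sent FIN (the CLOSE_WAIT case above):
    # remove the fd from the event loop and close it, otherwise epoll
    # keeps reporting it as readable and the loop spins at 100% CPU.
    ep.unregister(sock.fileno())
    sock.close()
    return None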

@WyseNynja

commented Jun 10, 2014

Any progress on this?

What are the downsides to setting a high timeout?

@shshank

commented Mar 23, 2015

-Ofair didn't help either.

@ask

Contributor

commented May 7, 2015

I notice you are revoking tasks here, but doing so in an extremely inefficient way.

Revoking several tasks at once like this:

[result.revoke() for result in results]

should be replaced by:

app.control.revoke([result.id for result in results])

or GroupResult.revoke()
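
For example, a minimal sketch (app and the add task here are assumed, not from this thread):

from celery import group

results = group(add.s(i, i) for i in range(100)).apply_async()

# one broadcast revoking every id, instead of 100 separate revoke messages:
app.control.revoke([r.id for r in results])

# or, equivalently, since applying a group returns a GroupResult:
results.revoke()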

@kakiezhang

commented Jun 17, 2015

I have a similar situation. When I use the prefork pool (5 queues, each with concurrency 8) to run celery workers on 8 machines, the redis machine's CPU goes up to 100% usage at peak times.

I shut down 1 machine, then it works well, and the redis machine's CPU seems to be normal (6% ~ 10%).

I also checked the slowlog info in redis and found that LPUSH and BRPOP execution times are up to 12 ~ 20ms.

The BRPOP timeout is 1, so I suspect that because I created that many workers, they compete for the key: some get it, some time out, and some even exceed the timeout and may then wait a long time until they can get the key again.

So I decided to cut down the number of processes so the resources aren't so stressed.

Is there anything I can do to help it work?

@chase-seibert

Contributor

commented Jul 30, 2015

Just started running into this as well. Switching to -P gevent fixes it for me.

@dcramer

Contributor

commented Oct 6, 2015

We're seeing this on some AWS infrastructure running Sentry:

  • Sentry 8.0.0.dev0
  • Celery 3.1.18
  • Kombu 3.0.26
  • Redis 2.10.x

We tried a huge number of things, and nothing seemed to help, but the symptoms are identical to what the original ticket describes (though this issue seems to have gone way off base, as it has nothing to do with revocation or gevent etc). The primary symptom we originally saw was workers stalling for 30 minutes every hour or so. We couldn't identify what the 30 minutes was, but now the best guess is it was due to the Redis timeout value. After several attempts we eventually got workers to stall far more frequently.

While deploying changes it was quickly recognized that workers would fail to restart (stuck in TERM); upon investigation we saw the symptoms described here. We saw idle clients in Redis that would eventually time out (per the setting), but the workers didn't seem to predictably recover. We also saw things successfully dispatching jobs in some cases, but not starting/finishing them.

Things attempted, and failed miserably at, in a number of combinations:

  • Set socket_timeout to 60 on the broker
  • Set timeout to 0 on redis (vs 1800)
  • Disable mingle, gossip, heartbeat (which, btw, seems to break celery entirely, not sure why this is an option)

Of note, Sentry defaults to disabling many options in Celery including rate limits, results, and we don't use any of the complex features.

Along the way we also increased worker concurrency from something like 8 to 32. This pointed out that Celery doesn't handle load balancing well. Unrelated, but something to note.

We aren't certain it's fully resolved, but here's what we're currently looking at, which at least got the queue caught up and was stable for 1h+:

  • Redis timeout of 1800
  • socket_timeout not set (whatever default is)
  • Worker running with -O fair --without-mingle --without-gossip
  • Soft timeout of 60s
  • Worker concurrency at 8, prefork model

After enabling the above in one go, we saw it churning through things as well as saw various timeouts in tasks. These timeouts were to the same Redis node in seemingly at least two situations:

  • Communication via Sentry's internal TSDB
  • (Unconfirmed due to empty stacktrace) a call to .delay()

My bet is that the resolution for us came from -O fair and lowering the prefetch. It's not clear to me how or why, unless we are inducing far too much load on a single Redis instance and that again is causing the symptoms shown in this ticket.

Update: the problem did not resolve itself with the given changes.

@domenkozar

commented Oct 27, 2015

Did you run strace on each worker? This is similar to #2606 (except that I didn't observe 100% CPU usage).

@ask

Contributor

commented Oct 27, 2015

@dcramer @domenkozar Do you have the fanout_prefix and fanout_patterns transport options set? http://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html#caveats
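
For reference, those options go into the broker transport options of the Celery configuration (Celery 3.1 setting name shown; option names as documented in the caveats page linked above):

BROKER_TRANSPORT_OPTIONS = {
    'fanout_prefix': True,     # prefix broadcast messages with the virtual host
    'fanout_patterns': True,   # workers only receive the events addressed to them
}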

@dcramer

Contributor

commented Oct 27, 2015

No. Even if we were, the issue of "literally everything stops working" is less a caveat than a critical bug.

We use very basic features: a dozen queues, all direct, and a few periodic tasks. We also use ETA and expires. Anything else we tested with and without, and it didn't help. We also disable rate limits, task results, and task events. The working theory is similar to this issue: it has to do w/ the epoll loop. The one thing we do do, often, is create a lot of additional tasks from within tasks (these are normal tasks; we don't use any of the Celery workflow tools).

@dcramer

Contributor

commented Oct 27, 2015

@ask if we can get a fix in Kombu for this can we get a bugfix release out that doesn't require the newer Django stuff? Sentry isn't upgrading past Django 1.6.

@ask

Contributor

commented Oct 28, 2015

You definitely should be setting those options, or the worker will be receiving multiple event messages for every task executed by every worker.

However, the 100% CPU issue is probably not related to that. I'm guessing that some socket is closed/broken by the remote end, but not removed from the event loop. When a socket is broken it is always readable, so the event loop spins on it.

It could be there is some place in redis-py/kombu where an exception is swallowed, and the connection automatically reconnected or something, but the event loop is never notified.

@ask

Contributor

commented Oct 28, 2015

@dcramer: celery 3.1 should be compatible with Django 1.6, if not it would be a bug

@mitsuhiko

Contributor

commented Nov 2, 2015

@ask I opened up a new ticket #2906 which covers the issue that @dcramer talked about here. From further investigation it looks like it might not be related to this one as I could never reproduce actual high CPU load.

@cralston0

commented Jan 22, 2016

For anyone who stumbles across this thread in the future: Upgrade kombu to 3.0.33 or newer.

ask added a commit to celery/billiard that referenced this issue Mar 1, 2016

popen_fork is closing random file descriptors.
This can lead to deadlocks, 100% CPU usage (with epoll), random connection
failures etc.

Closes celery/celery#1845
Could fix: celery/celery#2142 celery/celery#2606

ask added a commit that referenced this issue Mar 2, 2016

[Asynpool] Fixes 100% CPU loop in epoll (round 2)
The billiard patch celery/billiard@4f4759b
was initially submitted to fix this problem, but on closer investigation
we were only leaking file descriptors.

I monkey patched os.open/socket/os.close etc to track what was going on,
and I found no evidence of the code closing random sockets,
instead I found out that:

1) epoll_wait always returned an error state for a Popen pipe fd.
2) the worker was trying to unregister this fd from epoll, but
3) ``epoll.unregister`` refused to do so giving an IOError(ENOENT)
   error.

So turns out this is an epoll quirk, and the solution is to duplicate the pipe fd
so that we can carefully control when it's removed from the process
file descriptor table.

Closes #1845
Could fix: #2142 #2606
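
A toy illustration of the dup() trick described above (not the actual patch): registering a private duplicate of the pipe fd lets the code control exactly when that fd leaves the process file descriptor table, so the later EPOLL_CTL_DEL still has a valid target:

import os
import select

r, w = os.pipe()
ep = select.epoll()

dup_r = os.dup(r)                    # private duplicate under our control
ep.register(dup_r, select.EPOLLIN)

# ... event loop runs ...

ep.unregister(dup_r)                 # the fd is still open here, so DEL succeeds
os.close(dup_r)                      # now it can safely leave the fd table
os.close(r)
os.close(w)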

@markddavidoff

commented Sep 24, 2018

still experiencing this issue with celery 3.1.25 and kombu 3.0.33...

@auvipy

Member

commented Sep 26, 2018

This issue should be fixed in celery 4.2.1+. The version you are using is not supported anymore and doesn't contain most of the improvements and bug fixes made in the past 2+ years.

@PavanKulkarni

commented Oct 1, 2018

@auvipy Most of the bug fixes referenced here are present in 4.1.0. Any specific reason why you mention >4.2.1?

@auvipy

Member

commented Oct 1, 2018

It's a general practice to point people to the latest stable version.
