Celery work process consumes 100% CPU after running for several days #1558
After running the Celery worker process for some days, we found that the Python process consumes 100% CPU. I used strace to dump the runtime stack trace and saw that the process at 100% CPU continuously polls and reads. It must be related to Consumer.consume_message. Do you have any suggestions on how to debug this problem?
What version of Celery is this?
I should give you more details: Python 2.7.3. We have more than 10 servers, each running around 20 Celery workers. We don't start our Celery workers directly from celeryd; instead, our Worker class wraps the Celery Worker object, see below:

class OurWorker(object):
    def __init__(self, hostname=None, loglevel=None, logfile=None, autoscale=None):
        self.hostname = hostname if hostname is not None else "our.worker.name"
        if not self.hostname.startswith("our.worker.name"):
            self.hostname = "our.worker.name.%s" % self.hostname
        self.hostname = "%s.%s" % (self.hostname, SOCKET_HOSTNAME)
        self.loglevel = loglevel if loglevel is not None \
            else ("INFO", "DEBUG")[conf.PUSHD_DEBUG]
        self.logfile = logfile if logfile is not None else self.hostname
        self._worker = Worker(app=celery,
                              hostname=self.hostname,
                              include=["push.task.task"],
                              loglevel=self.loglevel,
                              queues=[conf.PUSHD_DISPATCH_QUEUE_NAME,
                                      conf.PUSHD_SCHEDULED_DISPATCH_QUEUE_NAME],
                              autoscale=autoscale)
        self._worker.logfile = "%s/%s.log" % (conf.PUSHD_LOG_PATH, self.logfile)
        # This line is dangerous, do not use and do not delete
        # worker_ready.connect(self.dispatch_worker_ready)

    def start(self):
        """
        Start dispatch worker
        """
        self._worker.run()

After running for some weeks, some of the Celery workers start consuming 100% CPU; it seems the worker cannot break out of the following while loop:

while connection.more_to_read:
    try:
        events = poll(poll_timeout)
    except ValueError:  # Issue 882
        return
    if not events:
        on_poll_empty()
    # Dispatch the registered read/write callbacks for every ready fd.
    for fileno, event in events or ():
        try:
            if event & READ:
                readers[fileno](fileno, event)
            if event & WRITE:
                writers[fileno](fileno, event)
            if event & ERR:
                for handlermap in readers, writers:
                    try:
                        handlermap[fileno](fileno, event)
                    except KeyError:
                        pass
        except (KeyError, Empty):
            continue
        except socket.error:
            if self._state != CLOSE:  # pragma: no cover
                raise
    if keep_draining:
        drain_nowait()
        poll_timeout = 0
    else:
        connection.more_to_read = False
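One way to reproduce the suspected failure mode (an assumption, not confirmed in this thread): once the peer has closed the connection, poll() keeps reporting the fd as readable while recv() returns an empty string, so a loop that never treats the empty read as a disconnect will spin at 100% CPU. A minimal demonstration on Linux:

import select
import socket

# Create a connected socket pair and close the "broker" side.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.create_connection(server.getsockname())
peer, _ = server.accept()
peer.close()  # simulate the broker going away

poller = select.poll()
poller.register(client.fileno(), select.POLLIN)

for _ in range(3):
    events = poller.poll(100)   # the dead fd is reported readable every time
    data = client.recv(4096)    # ...but there is nothing left to read: ''
    print(events, repr(data))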
@ask any updates? We rely heavily on Celery now.
I'm not sure what causes this, but the loop in question is now also rewritten. It could be interesting to know what file descriptor 57 is in this case (the number is likely to change between runs). You can use
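One generic way to see what a file descriptor points to on Linux (a sketch, not necessarily the tool hinted at above; the PID and fd number are placeholders):

import os

def describe_fd(pid, fd):
    # Returns e.g. 'socket:[12345]' for a socket, or a file path for a regular file.
    return os.readlink("/proc/%d/fd/%d" % (pid, fd))

print(describe_fd(12345, 57))  # replace 12345 with the spinning worker's PID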
Found a similar problem. Usually the CPU usage increases after the worker reconnects to RabbitMQ after some network issues. Celery version is 3.0.23 and
I'm also having some issues with CPU usage. I have celery 3.0.24, django-celery 3.0.23, and kombu 2.5.16 installed. After seeing the following lines in the log file, I'm getting a Python process (from celeryd) with 100% CPU usage:
celeryd is started by the following command:
I've been testing the celery dev version for a week or so, and today I tried to stop/kill and start the rabbitmq service, after which the workers with the default pool class had high CPU usage and truss (running on Solaris) showed only a lot of pollsys() calls.
Thanks @dn0. Do you people use py-amqp or librabbitmq? Celery uses poll/select/epoll to see if the socket is readable.
Currently I have no idea how this happens; maybe there is some way to detect that the socket is broken.
It seems it happens because the socket is disconnected, so I think I may have a solution for this.
I'm using the latest amqp from GitHub.
I had amqp 1.0.13 installed.
We are using librabbitmq. @ask, what's the solution? Would you share your thoughts? Currently we are using a separate monitor script to restart our workers when we find the CPU at 100%.
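For reference, a minimal sketch of such a watchdog, assuming psutil is available and that a process supervisor (e.g. supervisord) restarts terminated workers; the "celery" name filter and the 95% threshold are assumptions, not taken from this thread:

import psutil

CPU_THRESHOLD = 95.0  # percent; assumed cut-off for "spinning"

def find_spinning_workers(name_hint="celery"):
    spinning = []
    for proc in psutil.process_iter():
        try:
            if name_hint not in " ".join(proc.cmdline()):
                continue
            # Sample CPU over one second; a busy poll loop sits near 100%.
            if proc.cpu_percent(interval=1.0) >= CPU_THRESHOLD:
                spinning.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return spinning

if __name__ == "__main__":
    for proc in find_spinning_workers():
        print("restarting worker pid=%d" % proc.pid)
        proc.terminate()  # the supervisor is expected to bring it back up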
Thanks, I have tried to improve disconnection detection in the development version. There is a second way to fix this, and that is to simply count the number of errors and reconnect if it
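A rough sketch of that second approach, counting consecutive errors around a kombu-style drain loop (the threshold, the reconnect strategy, and the omission of consumer setup are all assumptions, not the actual celery/kombu internals):

import socket
from kombu import Connection  # assuming the kombu API used in this thread

MAX_CONSECUTIVE_ERRORS = 100  # assumed threshold

def consume_forever(url="amqp://guest:guest@localhost//"):
    conn = Connection(url)
    errors = 0
    while True:
        try:
            conn.drain_events(timeout=1)
            errors = 0              # a successful iteration resets the counter
        except socket.timeout:
            continue                # nothing to read this second; not an error
        except (socket.error, IOError):
            errors += 1
            if errors >= MAX_CONSECUTIVE_ERRORS:
                # Too many failures in a row: assume the socket is dead and rebuild it.
                conn.close()
                conn = Connection(url)
                conn.connect()
                errors = 0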
Also found a discussion that seems relevant here: http://trac.wxwidgets.org/ticket/7504
@dafang: You should also upgrade to librabbitmq 1.0.2 then, as the ChannelError('bad frame read') is now a ConnectionError, which is necessary in celery 3.1 for the connection to be re-established.
I think I may have found a bug where this could happen: the select eventio implementation stores the raw socket objects when registering, but the poller works in filenos. That would not explain the problem when using epoll/kqueue (linux/bsd), but the latest kombu will now check that the socket is connected before continuing. There is no safe way to verify a socket, but the latest amqp keeps a 'connected' flag that is reset whenever a connection-related error occurs while reading/writing to the socket. librabbitmq already implements Connection.connected, so no change is required there (you just have to upgrade kombu).
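A simplified illustration of the 'connected' flag idea (not the actual amqp/kombu code): any connection-level error or empty read flips the flag, and the surrounding loop checks it instead of polling a dead socket forever:

import socket

class FlaggedConnection(object):
    def __init__(self, sock):
        self.sock = sock
        self.connected = True        # reset on any connection-related error

    def read(self, n):
        try:
            data = self.sock.recv(n)
        except socket.error:
            self.connected = False
            raise
        if not data:                 # peer closed the connection
            self.connected = False
            raise IOError("connection lost")
        return data

def drain(conn, handle):
    # The event loop checks the flag before each iteration, so a broken
    # socket triggers a reconnect rather than a 100% CPU spin.
    while conn.connected:
        handle(conn.read(4096))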
Anyone managed to test this yet?
@ask We will upgrade kombu to the latest stable version in production and see whether this fixes the issue. Will post back after we monitor the server resources for some days.
It seems that I've got the same issue. I'll try it with 3.1 in the next few days. Any idea when 3.1 will be released?
@ask I updated celery/kombu/... from master 2 or 3 days ago, and today I tried to stop/start rabbitmq and everything went fine; after reconnecting, the CPU usage stayed low.
I have the same issue. The server runs for some days while the memory usage of the 31 Erlang processes grows (slowly, but it grows); suddenly the celery process consumes 100% CPU. We are using just the default configuration for RabbitMQ with a single queue, only two different processes that run every 10 seconds, and two processes that run every day.
I am happy to provide more info to help fix this issue. Update: same issue after updating.
pip freeze | grep rabbit
@ask We have updated kombu to the latest version, and the code has been running in production for a month and a half with no CPU issues. I think we can close this issue now.
I have upgraded all software that is required for celery, but CPU consumption has become high.
@rahul16101989 Hey, can you open a new issue with details that would help in reproducing it? Celery settings, package versions, logs, and strace output can all help.
I had the same issue with Celery 3.1.25 / Kombu 3.0.37.
I hit the issue with Celery 4.0.0 / Kombu 4.0.0. I didn't see this before upgrading celery from 3.x to 4, and I was running with RabbitMQ as the broker. My CPU usage on the celery worker went from 3% to 11% in 24 hours after the upgrade (and got close to 40% after 3 days). However, during the migration I switched to Redis instead. I'm going to check with RabbitMQ to see whether the issue is linked to the broker.
Yesterday I restarted Celery using RabbitMQ as the broker to check whether the increasing CPU usage comes from the Redis broker. The result this morning is that I have the same issue with RabbitMQ. As I said in my previous post, I didn't have the issue before the upgrade to Celery 4.x. I'm using Celery beat to schedule periodic tasks, if that may be useful.
FYI: The issue is gone with Kombu 4.0.1
After being in production for several days or weeks, some of the Celery worker processes end up consuming 100% CPU. The following is what we found through the stack dump:
It seems that there is an EAGAIN error, but celery didn't handle it, so it continuously polls and reads.
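For context, a small sketch of what handling EAGAIN on a non-blocking socket usually looks like (a generic example, not the celery/kombu code itself):

import errno
import socket

def read_nonblocking(sock, bufsize=4096):
    """Read whatever is available; treat EAGAIN/EWOULDBLOCK as 'no data yet'."""
    try:
        data = sock.recv(bufsize)
    except socket.error as exc:
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None   # socket not actually readable; go back to poll()
        raise             # a real error: let the caller reconnect
    if not data:
        # Empty read means the peer closed; without this check poll() would spin.
        raise IOError("peer closed the connection")
    return data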