
Continuous memory leak #4843

Open · marvelph opened this issue Jun 23, 2018 · 69 comments · May be fixed by #5870

@marvelph (Author) commented Jun 23, 2018

There is a memory leak in the parent process of Celery's worker, not in the child processes that execute tasks.
It starts suddenly every few days.
Unless Celery is stopped, it consumes the server's memory within tens of hours.

The problem occurs at least in Celery 4.1, and it also occurs in Celery 4.2.
Celery is running on Ubuntu 16 and the broker is RabbitMQ.

[memory usage graph]

@georgepsarakis (Member) commented Jun 23, 2018

Are you using Canvas workflows? Maybe #4839 is related.

Also, I assume you are using the prefork pool for worker concurrency?

@marvelph (Author) commented Jun 23, 2018

Thanks georgepsarakis.

I am not using workflows.
I use the prefork pool with concurrency 1 on a single server.

@georgepsarakis (Member) commented Jun 23, 2018

The increase rate seems quite linear, which is quite weird. Is the worker processing tasks during this period? Also, could you add a note with the complete command you are using to start the worker?

@marvelph (Author) commented Jun 23, 2018

Yes. The worker continues to process tasks normally.

The worker is started with the following command:

/xxxxxxxx/bin/celery worker --app=xxxxxxxx --loglevel=INFO --pidfile=/var/run/xxxxxxxx.pid

@marvelph (Author) commented Jun 23, 2018

This problem occurs in both the production environment and the test environment.
I can add memory profiling and test output in the test environment.
If there is anything I can do, please let me know.

@georgepsarakis (Member) commented Jun 23, 2018

We need to understand what the worker is running during the time that the memory increase is observed. Any information and details you can possibly provide would definitely help. It is also good that you can reproduce this.

@marvelph (Author) commented Jun 23, 2018

Although this case occurred at a different time than the one in the graph, the following log was output at the moment the memory leak started.

[2018-02-24 07:50:52,953: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 293, in create_loop
poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 136, in fire_timers
entry()
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 68, in __call__
return self.fun(*self.args, **self.kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 127, in _reschedules
return fun(*args, **kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/connection.py", line 290, in heartbeat_check
return self.transport.heartbeat_check(self.connection, rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/transport/pyamqp.py", line 149, in heartbeat_check
return connection.heartbeat_tick(rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 696, in heartbeat_tick
self.send_heartbeat()
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 647, in send_heartbeat
self.frame_writer(8, 0, None, None, None)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/method_framing.py", line 166, in write_frame
write(view[:offset])
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/transport.py", line 258, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-02-24 08:49:12,016: INFO/MainProcess] Connected to amqp://xxxxxxxx:**@xxx.xxx.xxx.xxx:5672/xxxxxxxx

It seems that it occurred when the connection to RabbitMQ was temporarily lost.

@georgepsarakis (Member) commented Jun 24, 2018

@marvelph so it occurs during RabbitMQ reconnections? Perhaps these issues are related:

@marvelph (Author) commented Jun 24, 2018

Yes.
It seems that reconnection triggers it.

@jxltom commented Jun 25, 2018

It looks like I'm having the same issue... It is hard for me to find out what triggers it and why there is a memory leak. It has been bothering me for at least a month. I fell back to Celery 3 and everything is fine.

For the memory leak issue, I'm using Ubuntu 16 and Celery 4.1.0 with RabbitMQ, deployed via Docker.

The memory leak is in the MainProcess, not the ForkPoolWorker processes. The memory usage of the ForkPoolWorker processes is normal, but the memory usage of the MainProcess keeps increasing: roughly 0.1 MB is leaked every five seconds. The leak does not start immediately after the worker starts, but perhaps after one or two days.

I used gdb and pyrasite to inject into the running process and call gc.collect(), but nothing is collected.

I checked the log; the "consumer: Connection to broker lost. Trying to re-establish the connection..." message did happen, but for now I'm not sure this is when the memory leak starts.

Any hints for debugging this issue and finding out what really happens? Thanks.
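
For reference, a minimal sketch of the kind of pyrasite payload described above; the file name and report path are made up here, and this is only an illustration of the technique, not the exact script that was used:

# payload.py -- hypothetical script, injected into the worker's MainProcess
# with something like: pyrasite <pid> payload.py
import gc

unreachable = gc.collect()  # force a full collection
with open('/tmp/celery-gc-report.txt', 'w') as fh:
    fh.write('unreachable objects freed: %d\n' % unreachable)
    fh.write('generation counts: %r\n' % (gc.get_count(),))
    fh.write('objects tracked by gc: %d\n' % len(gc.get_objects()))

If gc.collect() frees nothing and the tracked-object count stays flat while RSS keeps growing, the leak is more likely data held alive by reachable objects (or C-level allocations) than uncollected garbage.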

@jxltom commented Jun 25, 2018

Since @marvelph mentioned that it may be related to RabbitMQ reconnection, I tried stopping my RabbitMQ server. Memory usage did increase after each reconnection attempt; the log follows. So I can confirm the celery/kombu#843 issue.

But after the connection is re-established, memory usage stops gradually increasing. So I'm not sure this is the cause of the memory leak.

I will try using Redis to figure out whether this memory leak is related to RabbitMQ or not.

[2018-06-25 02:43:33,456: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 491, in drain_events
    while not self.blocking_read(timeout):
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 496, in blocking_read
    frame = self.transport.read_frame()
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 243, in read_frame
    frame_header = read(7, True)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 418, in _read
    s = recv(n - len(rbuf))
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:43:33,497: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 2.00 seconds...

[2018-06-25 02:43:35,526: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 4.00 seconds...

[2018-06-25 02:43:39,560: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 6.00 seconds...

[2018-06-25 02:43:45,599: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 8.00 seconds...

[2018-06-25 02:43:53,639: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 10.00 seconds...

[2018-06-25 02:44:03,680: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 12.00 seconds...

[2018-06-25 02:44:15,743: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 14.00 seconds...

[2018-06-25 02:44:29,790: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 16.00 seconds...

[2018-06-25 02:44:45,839: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 18.00 seconds...

[2018-06-25 02:45:03,890: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 20.00 seconds...

[2018-06-25 02:45:23,943: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 22.00 seconds...

[2018-06-25 02:45:46,002: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 24.00 seconds...

[2018-06-25 02:46:10,109: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,212: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:10,291: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 40, in start
    self.sync(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 44, in sync
    replies = self.send_hello(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 57, in send_hello
    replies = inspect.hello(c.hostname, our_revoked._data) or {}
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 132, in hello
    return self._request('hello', from_node=from_node, revoked=revoked)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 84, in _request
    timeout=self.timeout, reply=True,
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 439, in broadcast
    limit, callback, channel=channel,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 315, in _broadcast
    serializer=serializer)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 290, in _publish
    serializer=serializer,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py", line 1732, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 275, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:46:10,375: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,526: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:11,764: INFO/MainProcess] mingle: all alone

@marvelph (Author) commented Jun 25, 2018

I checked the logs. I found a reconnection log entry at the time a memory leak started, but there was also a case where a memory leak started without any reconnection occurring.
I agree with jxltom's view.

Also, when I was using Celery 3.x, I did not encounter this problem.

@dmitry-kostin commented Jun 25, 2018

Same problem here.
[screenshot 2018-06-25 11 09 22]
Every few days I have to restart the workers because of this problem.
There are no significant clues in the logs, but I suspect reconnects are involved, since I have reconnect log entries around the time memory starts growing constantly.
My setup is Ubuntu 17, 1 server with 1 worker at concurrency 3; RabbitMQ as broker and Redis as backend; all packages are the latest versions.

@georgepsarakis (Member) commented Jun 25, 2018

@marvelph @dmitry-kostin could you please provide your exact configuration (omitting sensitive information of course) and possibly a task, or sample, that reproduces the issue? Also, do you have any estimate of the average uptime interval that the worker memory increase starts appearing?

@dmitry-kostin commented Jun 25, 2018

The config is close to the default:

imports = ('app.tasks',)
result_persistent = True
task_ignore_result = False
task_acks_late = True
worker_concurrency = 3
worker_prefetch_multiplier = 4
enable_utc = True
timezone = 'Europe/Moscow'
broker_transport_options = {'visibility_timeout': 3600, 'confirm_publish': True, 'fanout_prefix': True, 'fanout_patterns': True}

screenshot 2018-06-25 11 35 17

Basically this is a newly deployed node; it was deployed on 06/21 at 18:50, memory started to grow on 06/23 around 05:00, and it finally crashed on 06/23 around 23:00.

The task is pretty simple and there is no complex logic in it. I think I can reproduce the whole situation in a clean temporary project, but I have no free time right now; if I'm lucky I will try to put together a full example on the weekend.

UPD
As you can see, the task itself consumes some memory (visible as the spikes on the graph), but at the time the memory started to leak no tasks were being produced and there was no other activity.

@georgepsarakis (Member) commented Jun 25, 2018

@marvelph @dmitry-kostin @jxltom I noticed you use Python 3. Would you mind enabling tracemalloc for the process? You may need to patch the worker process to log memory allocation traces though; let me know if you need help with that.

@jxltom commented Jun 25, 2018

@georgepsarakis You mean enabling tracemalloc in the worker and logging stats, such as the top 10 files by memory usage, at a specific interval such as every 5 minutes?

@georgepsarakis (Member) commented Jun 25, 2018

@jxltom I think something like that would help locate the part of the code that is responsible. What do you think?
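
A rough sketch of what such a patch could look like from application code, without modifying Celery itself; the worker_ready hook, the 5-minute interval, and the frame depth are assumptions for illustration:

# tracemalloc_dump.py -- hypothetical module imported by the Celery app,
# logging the MainProcess's top 10 allocation sites every 5 minutes.
import threading
import time
import tracemalloc

from celery.signals import worker_ready
from celery.utils.log import get_logger

logger = get_logger(__name__)

@worker_ready.connect
def start_memory_tracing(**kwargs):
    tracemalloc.start(25)  # keep up to 25 frames per allocation

    def dump_top_stats():
        while True:
            time.sleep(300)
            snapshot = tracemalloc.take_snapshot()
            for stat in snapshot.statistics('lineno')[:10]:
                logger.warning('tracemalloc: %s', stat)

    threading.Thread(target=dump_top_stats, daemon=True).start()

Comparing successive dumps (or using snapshot.compare_to, as in the snippet further down) should show which allocation sites keep growing.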

@jxltom commented Jun 25, 2018

@georgepsarakis I've tried using gdb and https://github.com/lmacken/pyrasite to inject into the leaking process and debug via tracemalloc. Here are the top 10 files with the highest memory usage.

I used resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 to confirm that memory usage is indeed gradually increasing.

>>> import tracemalloc
>>> 
>>> tracemalloc.start()
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/kombu/utils/eventio.py:84: size=12.0 KiB, count=1, average=12.0 KiB
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=3520 B, count=8, average=440 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=3264 B, count=12, average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=3060 B, count=10, average=306 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=2816 B, count=12, average=235 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=2816 B, count=8, average=352 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=2672 B, count=6, average=445 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=2592 B, count=8, average=324 B

Here is the difference between two snapshots after around 5 minutes.

>>> snapshot2 = tracemalloc.take_snapshot()
>>> top_stats = snapshot2.compare_to(snapshot, 'lineno')
>>> print("[ Top 10 differences ]")
[ Top 10 differences ]

>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=220 KiB (+216 KiB), count=513 (+505), average=439 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=211 KiB (+208 KiB), count=758 (+748), average=285 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=210 KiB (+206 KiB), count=789 (+777), average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=190 KiB (+187 KiB), count=530 (+522), average=366 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=186 KiB (+183 KiB), count=524 (+516), average=363 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=185 KiB (+182 KiB), count=490 (+484), average=386 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=182 KiB (+179 KiB), count=528 (+520), average=353 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=179 KiB (+176 KiB), count=786 (+774), average=233 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=165 KiB (+163 KiB), count=525 (+517), average=323 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/async/hub.py:293: size=157 KiB (+155 KiB), count=255 (+251), average=632 B

@jxltom commented Jun 25, 2018

Any suggestions for how to continue debugging this? I have no clue how to proceed. Thanks.

@marvelph (Author) commented Jun 26, 2018

@georgepsarakis

I need a little time to extract a project that reproduces the issue.

Here is the Celery configuration:

BROKER_URL = [
    'amqp://xxxxxxxx:yyyyyyyy@aaa.bbb.ccc.ddd:5672/zzzzzzzz'
]
BROKER_TRANSPORT_OPTIONS = {}

The scheduler has the following settings.

CELERYBEAT_SCHEDULE = {
    'aaaaaaaa_bbbbbbbb': {
        'task': 'aaaa.bbbbbbbb_cccccccc',
        'schedule': celery.schedules.crontab(minute=0),
    },
    'dddddddd_eeeeeeee': {
        'task': 'dddd.eeeeeeee_ffffffff',
        'schedule': celery.schedules.crontab(minute=0),
    },
}

On EC2, I run it under supervisord.

@marvelph (Author) commented Jun 26, 2018

@georgepsarakis
Since my test environment can tolerate performance degradation, I can use tracemalloc there.
Could you make a patched Celery that dumps memory usage?

@dmitry-kostin commented Jun 26, 2018

@jxltom I bet tracemalloc at 5-minute intervals won't help locate the problem.
For example, I have 5 nodes and only 3 of them have had this problem over the last 4 days, while 2 have worked fine the whole time, so it will be very tricky to pin down.
It feels as if there is some toggle that switches on and then memory starts to grow; until that switch flips, memory consumption looks perfectly fine.

@marvelph (Author) commented Jun 26, 2018

I tried to find out whether similar problems occur on other systems we run.
The frequency varies, but a memory leak has occurred on three systems using Celery 4.x and has not happened on one system.
The systems with the memory leak run Python 3.5.x; the system without it runs Python 2.7.x.

@jxltom commented Jun 26, 2018

@dmitry-kostin What's the difference from the other two normal nodes? Are they all using the same RabbitMQ broker?

Since our discussion suggested it may be related to RabbitMQ, I started another new node with the same configuration except that it uses Redis instead. So far this node has no memory leak after running for 24 hours. I will post here if it starts leaking later.
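
For reference, the configuration difference being tested is just the broker URL; the hostnames below are placeholders:

broker_url = 'redis://redis-host:6379/0'                      # node under test
# broker_url = 'amqp://user:password@rabbit-host:5672/vhost'  # original nodes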

@jxltom commented Jun 26, 2018

@marvelph So do you mean that the three systems with the memory leak are using Python 3, while the one that is fine is using Python 2?

@dmitry-kostin commented Jun 26, 2018

@jxltom No difference at all, and yes, they are on Python 3 with RabbitMQ as the broker and Redis as the backend.
I set up a test example to reproduce this; if it succeeds within a couple of days I will give credentials to these servers to somebody who knows how to track down this bug.

@marvelph (Author) commented Jun 26, 2018

@jxltom
Yes.
As far as my environments are concerned, the problem does not occur on Python 2.

@pembo13 commented Feb 21, 2019

Not sure how relevant this is, but Celery is exhausting my 2 GB of swap space in production. Stopping Flower didn't free the memory, but stopping Celery did.

@auvipy (Member) commented Feb 21, 2019

Could anyone try Celery 4.3rc1?

@ldav1s commented Feb 22, 2019

@auvipy I installed Celery 4.3.0rc1 + gevent 1.4.0; pip upgraded billiard to 3.6.0.0 and kombu to 4.3.0.

Kind of puzzled that vine 1.2.0 wasn't also required by the rc1 package, given that #4839 is fixed by that upgrade.

Anyway, Celery 4.3.0 rc1 seems to run OK.

@georgepsarakis (Member) commented Feb 22, 2019

@ldav1s thanks a lot for the feedback. So, vine is actually declared as a dependency of py-amqp. In new installations the latest vine version will be installed, but this might not happen in existing ones.

@thedrow perhaps we should declare the dependency in Celery requirements too?

@thedrow (Member) commented Feb 24, 2019

Let's open an issue about it and discuss it there.

@ldav1s commented Feb 25, 2019

Celery 4.3.0rc1 + gevent 1.4.0 has been running a couple of days now. Looks like it's leaking in the same fashion as Celery 4.2.1 + gevent 1.4.0.

@yogevyuval commented Mar 2, 2019

image

Having the same leak with Celery 4.2.1, Python 3.6.

Any updates on this?

@bilalbayasut commented Mar 28, 2019

Having the same problem here.

@davidedeangelismdb commented Mar 29, 2019

Greetings,

I'm experiencing a similar issue, but I'm not sure it is the same.

After I migrated our Celery app to a different environment/network, the Celery workers started to leak. Previously, the Celery application and the RabbitMQ instance were in the same environment/network.

My configuration is on Python 3.6.5:

amqp (2.4.2)
billiard (3.5.0.5)
celery (4.1.1)
eventlet (0.22.0)
greenlet (0.4.15)
kombu (4.2.1)
vine (1.3.0)

celeryconfig

broker_url = rabbitmq
result_backend = mongodb
task_acks_late = True
result_expires = 0
task_default_rate_limit = 2000
task_soft_time_limit = 120
task_reject_on_worker_lost = True
loglevel = 'INFO'
worker_pool_restarts = True
broker_heartbeat = 0
broker_pool_limit = None

The application is composed of several workers with the eventlet pool, started via supervisord with commands like:

[program:worker1]
command={{ celery_path }} worker -A celery_app --workdir {{ env_path }} -l info -E -P eventlet -c 250 -n worker1@{{ hostname }} -Q queue1,queue2

The memory leak behaviour looks like this: every ~10 hours, usually 1 worker (2 at most) starts leaking:
image

So I created a broadcast message that is executed on each worker to run tracemalloc (a rough sketch of this approach appears at the end of this comment). This is the output of the top command on the machine; only 1 worker is leaking, at 1464m:

217m   1%   2   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   379
189m   1%   0   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   377     
1464m   9%   1   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   378
218m   1%   0   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   376 
217m   1%   2   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   375
217m   1%   3   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   394
163m   1%   0   0% /usr/bin/python3 -m celery beat -A celery_app --workdir /app

tracemalloc TOP 10 results on the leaking worker

[2019-03-29 07:18:03,809: WARNING/MainProcess] [ Top 10: worker5@hostname ]

[2019-03-29 07:18:03,809: WARNING/MainProcess] /usr/lib/python3.6/site-packages/eventlet/greenio/base.py:207: size=17.7 MiB, count=26389, average=702 B

[2019-03-29 07:18:03,810: WARNING/MainProcess] /usr/lib/python3.6/site-packages/kombu/messaging.py:203: size=16.3 MiB, count=44422, average=385 B

[2019-03-29 07:18:03,811: WARNING/MainProcess] /usr/lib/python3.6/site-packages/celery/worker/heartbeat.py:49: size=15.7 MiB, count=39431, average=418 B

[2019-03-29 07:18:03,812: WARNING/MainProcess] /usr/lib/python3.6/site-packages/celery/events/dispatcher.py:156: size=13.0 MiB, count=40760, average=334 B

[2019-03-29 07:18:03,812: WARNING/MainProcess] /usr/lib/python3.6/site-packages/eventlet/greenio/base.py:363: size=12.9 MiB, count=19507, average=695 B

[2019-03-29 07:18:03,813: WARNING/MainProcess] /usr/lib/python3.6/site-packages/amqp/transport.py:256: size=12.7 MiB, count=40443, average=328 B

[2019-03-29 07:18:03,814: WARNING/MainProcess] /usr/lib/python3.6/site-packages/celery/events/dispatcher.py:138: size=12.4 MiB, count=24189, average=539 B

[2019-03-29 07:18:03,814: WARNING/MainProcess] /usr/lib/python3.6/site-packages/amqp/transport.py:256: size=12.3 MiB, count=19771, average=655 B

[2019-03-29 07:18:03,815: WARNING/MainProcess] /usr/lib/python3.6/site-packages/amqp/connection.py:505: size=11.9 MiB, count=39514, average=317 B

[2019-03-29 07:18:03,816: WARNING/MainProcess] /usr/lib/python3.6/site-packages/kombu/messaging.py:181: size=11.8 MiB, count=61362, average=201 B

TOP 1 with 25 frames

TOP 1

[2019-03-29 07:33:05,787: WARNING/MainProcess] [ TOP 1: worker5@hostname ]

[2019-03-29 07:33:05,787: WARNING/MainProcess] 26938 memory blocks: 18457.2 KiB

[2019-03-29 07:33:05,788: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 207

[2019-03-29 07:33:05,788: WARNING/MainProcess] mark_as_closed=self._mark_as_closed)

[2019-03-29 07:33:05,789: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 328

[2019-03-29 07:33:05,789: WARNING/MainProcess] timeout_exc=socket_timeout('timed out'))

[2019-03-29 07:33:05,790: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 357

[2019-03-29 07:33:05,790: WARNING/MainProcess] self._read_trampoline()

[2019-03-29 07:33:05,790: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 363

[2019-03-29 07:33:05,791: WARNING/MainProcess] return self._recv_loop(self.fd.recv, b'', bufsize, flags)

[2019-03-29 07:33:05,791: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 440

[2019-03-29 07:33:05,791: WARNING/MainProcess] s = recv(n - len(rbuf))

[2019-03-29 07:33:05,792: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 256

[2019-03-29 07:33:05,792: WARNING/MainProcess] frame_header = read(7, True)

[2019-03-29 07:33:05,792: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/connection.py", line 505

[2019-03-29 07:33:05,793: WARNING/MainProcess] frame = self.transport.read_frame()

[2019-03-29 07:33:05,793: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/connection.py", line 500

[2019-03-29 07:33:05,793: WARNING/MainProcess] while not self.blocking_read(timeout):

[2019-03-29 07:33:05,793: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 103

[2019-03-29 07:33:05,794: WARNING/MainProcess] return connection.drain_events(**kwargs)

[2019-03-29 07:33:05,794: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/connection.py", line 301

[2019-03-29 07:33:05,794: WARNING/MainProcess] return self.transport.drain_events(self.connection, **kwargs)

[2019-03-29 07:33:05,795: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/pidbox.py", line 120

[2019-03-29 07:33:05,795: WARNING/MainProcess] connection.drain_events(timeout=1.0)

I hope this helps. There are no errors in the logs, other than the missed heartbeats between the workers. Now I'm trying to use the exact versions of the libraries we were using in the old environment.

UPDATE: Using the exact same dependency versions and a broker heartbeat every 5 minutes, the application looked stable for a longer time (more than 2 days), then it leaked again.

From time to time there were small spikes lasting about an hour, but they were "absorbed/collected"; the last one appears to be what started the ramp.

After the 1st spike (1st ramp), I restarted the leaking worker. As you can see, another worker started to leak after it, or probably it was already leaking (2nd ramp).

image

I'm going to test without heartbeat.

UPDATE: Without heartbeat it leaked again after 2 days; same behaviour.

440m   3%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker1@ -Q p_1_queue,p_2_queue
176m   1%   0   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker2@ -Q p_1_queue,p_2_queue
176m   1%   2   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker5@ -Q p_1_queue,p_2_queue
176m   1%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker3@ -Q p_1_queue,p_2_queue
176m   1%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker4@ -Q p_1_queue,p_2_queue
171m   1%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 20 -n worker_p_root@ -Q p_root_queue
157m   1%   0   0% /usr/bin/python3 -m celery beat -A celery_app --workdir /app --schedule /app/beat.db -l info

image

UPDATE:
With Celery 4.3.0 the problem seems to be resolved, and it has been stable for a week.
image

amqp (2.4.2)
billiard (3.6.0.0)
celery (4.3.0)
eventlet (0.24.1)
greenlet (0.4.15)
kombu (4.5.0)
vine (1.3.0)

Please let me know if I can help somehow, e.g. by instrumenting the code. If necessary, please provide links and an example.

Thank you
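
As an aside, the per-worker broadcast described at the top of this comment can be sketched as a custom remote-control command. The inspect_command decorator is assumed from Celery 4.x's celery.worker.control, and the memdump name is made up; this is an illustration, not the exact command used here:

# memdump_command.py -- hypothetical sketch; the module must be imported
# by the worker so the command gets registered.
import tracemalloc

from celery.worker.control import inspect_command

tracemalloc.start(25)  # start tracing when the worker imports this module

@inspect_command()
def memdump(state):
    """Return this worker's top 10 allocation sites as strings."""
    snapshot = tracemalloc.take_snapshot()
    return [str(stat) for stat in snapshot.statistics('lineno')[:10]]

Replies from all workers could then be gathered with something like app.control.broadcast('memdump', reply=True, timeout=5).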

@yevhen-m commented Apr 24, 2019

I'm also having a memory leak, and it looks like I've managed to find the cause:
https://github.com/celery/celery/blob/master/celery/events/dispatcher.py#L75
I can see that this buffer starts to grow after connection issues with RabbitMQ. I don't understand why it fails to clear the events eventually; it continues to grow over time and consumes more and more RAM. Passing buffer_while_offline=False here https://github.com/celery/celery/blob/master/celery/worker/consumer/events.py#L43 seems to fix the leak for me. Can someone please check whether this is related?
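
For anyone who wants to test this without patching Celery itself, a hedged sketch of forcing that flag from application code; it assumes the buffer_while_offline keyword argument of EventDispatcher.__init__ linked above (Celery 4.x) and should be treated as a diagnostic workaround, not the proper fix:

# Hypothetical workaround sketch: make every EventDispatcher the worker
# creates drop events while the broker connection is down instead of
# buffering them. Put this in a module the worker imports at startup.
from celery.events.dispatcher import EventDispatcher

_original_init = EventDispatcher.__init__

def _init_without_offline_buffering(self, *args, **kwargs):
    kwargs['buffer_while_offline'] = False  # never queue events while disconnected
    _original_init(self, *args, **kwargs)

EventDispatcher.__init__ = _init_without_offline_buffering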

@auvipy (Member) commented Apr 24, 2019

vinayinvicible added a commit to Coverfox/celery that referenced this issue May 8, 2019
This is causing memory leak in case of gevent workers
Fix for celery#4843
@auvipy auvipy modified the milestones: 4.4.0, 4.5 Jun 14, 2019
@slavpetroff commented Sep 4, 2019

@yevhen-m thank you a lot! That helped us solve the memory leak!

@auvipy auvipy self-assigned this Sep 4, 2019
@thedrow (Member) commented Sep 10, 2019

It's good that we have a workaround, but can we please find a proper fix?

@lxkaka commented Dec 26, 2019

Continuing to follow this memory leak issue.

image

@yoonnoon commented Dec 27, 2019

celery-pod-screencshot-lastweek

I'm using Celery in a production environment, deployed via Docker.
As the screenshot shows, we are having the same problem.
Our production config is shown below.

Docker parent image: python 3.6.8-buster
Celery version: 4.2.0
Command Options:

  • concurrency 4
  • prefetch-multiplier 8
  • No result_backend
  • acks_late and reject_on_worker_lost

I wonder if upgrading Celery to 4.3.0 would solve the memory leak issue.

Thank you!

@auvipy (Member) commented Dec 27, 2019

Celery 4.4.0 is the latest stable release.

@auvipy auvipy modified the milestones: 4.5, 4.4.x Dec 27, 2019
@auvipy auvipy removed their assignment Dec 27, 2019