Continuous memory leak #4843

Open
marvelph opened this issue Jun 23, 2018 · 173 comments · Fixed by #5870

Comments

@marvelph

marvelph commented Jun 23, 2018

There is a memory leak in the parent process of Celery's worker.
It is not in a child process executing a task.
It starts suddenly every few days.
Unless Celery is stopped, it consumes the server's memory within tens of hours.

This problem happens at least in Celery 4.1, and it also occurs in Celery 4.2.
Celery is running on Ubuntu 16 and the broker is RabbitMQ.

[image: memory]

@georgepsarakis
Contributor

georgepsarakis commented Jun 23, 2018

Are you using Canvas workflows? Maybe #4839 is related.

Also I assume you are using prefork pool for worker concurrency?

@marvelph
Author

Thanks georgepsarakis.

I am not using workflows.
I use the prefork pool with concurrency 1 on a single server.

@georgepsarakis
Contributor

The increase rate seems quite linear, quite weird. Is the worker processing tasks during this time period? Also, can you add a note with the complete command you are using to start the worker?

@marvelph
Author

Yes. The worker continues to process the task normally.

The worker is started with the following command.

/xxxxxxxx/bin/celery worker --app=xxxxxxxx --loglevel=INFO --pidfile=/var/run/xxxxxxxx.pid

@marvelph
Author

This problem is occurring in both the production environment and the test environment.
I can add memory profile and test output to the test environment.
If there is anything I can do, please say something.

@georgepsarakis
Contributor

We need to understand what the worker is running during the time that the memory increase is observed. Any information and details you can possibly provide would definitely help. It is also good that you can reproduce this.

@marvelph
Author

This is from a different occurrence than the one shown in the graph, but the following log was output at the time the memory leak started.

[2018-02-24 07:50:52,953: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 293, in create_loop
poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 136, in fire_timers
entry()
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 68, in __call__
return self.fun(*self.args, **self.kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 127, in _reschedules
return fun(*args, **kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/connection.py", line 290, in heartbeat_check
return self.transport.heartbeat_check(self.connection, rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/transport/pyamqp.py", line 149, in heartbeat_check
return connection.heartbeat_tick(rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 696, in heartbeat_tick
self.send_heartbeat()
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 647, in send_heartbeat
self.frame_writer(8, 0, None, None, None)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/method_framing.py", line 166, in write_frame
write(view[:offset])
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/transport.py", line 258, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-02-24 08:49:12,016: INFO/MainProcess] Connected to amqp://xxxxxxxx:**@xxx.xxx.xxx.xxx:5672/xxxxxxxx

It seems that it occurred when the connection with RabbitMQ was temporarily cut off.

@georgepsarakis
Contributor

georgepsarakis commented Jun 24, 2018

@marvelph so it occurs during RabbitMQ reconnections? Perhaps these issues are related:

@marvelph
Author

Yes.
It seems that reconnection triggers it.

@jxltom

jxltom commented Jun 25, 2018

It looks like I'm having the same issue... It is so hard for me to find out what triggers it and why there is a memory leak. It has been annoying me for at least a month. I fell back to Celery 3 and everything is fine.

For the memory leak issue, I'm using ubuntu 16, celery 4.1.0 with rabbitmq. I deployed it via docker.

The memory leak is in MainProcess, not ForkPoolWorker. The memory usage of ForkPoolWorker is normal, but the memory usage of MainProcess keeps increasing. About 0.1 MB of memory is leaked every five seconds. The leak doesn't start immediately after the worker starts, but maybe after one or two days.

I used gdb and pyrasite to inject the running process and try to gc.collect(), but nothing is collected.
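
When gc.collect() frees nothing, the growing objects are still reachable, so one possible next step is to compare counts of live objects by type at two points in time. The following is a minimal sketch, not part of Celery; the helper names type_census and census_growth are hypothetical, and it assumes the code is run inside the leaking process (for example from an injected shell).

```python
import gc
from collections import Counter


def type_census():
    """Count live, GC-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def census_growth(before, after, top=10):
    """Return up to `top` (type name, delta) pairs, largest growth first."""
    grown = [
        (name, after[name] - before.get(name, 0))
        for name in after
        if after[name] > before.get(name, 0)
    ]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    class Leaky:
        """Stand-in for whatever object type is accumulating."""

    before = type_census()
    held = [Leaky() for _ in range(1000)]  # simulate objects that stay referenced
    after = type_census()
    for name, delta in census_growth(before, after):
        print(f"{name}: +{delta}")
```

Only objects tracked by the garbage collector are counted, so growth in plain strings or bytes would not show up here and would need a tool such as tracemalloc instead.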

I checked the log, and the "consumer: Connection to broker lost. Trying to re-establish the connection..." message did appear, but for now I'm not sure this is when the memory leak starts.

Any hints for debugging this issue and to find out what really happens? Thanks.

@jxltom

jxltom commented Jun 25, 2018

Since @marvelph mentioned it may be related to RabbitMQ reconnection, I tried stopping my RabbitMQ server. The memory usage did increase after each reconnection; the log follows. So I can confirm the celery/kombu#843 issue.

But after the connection is re-established, the memory usage stops gradually increasing. So I'm not sure this is the reason for the memory leak.

I will try using Redis to figure out whether this memory leak is related to RabbitMQ or not.

[2018-06-25 02:43:33,456: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 491, in drain_events
    while not self.blocking_read(timeout):
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 496, in blocking_read
    frame = self.transport.read_frame()
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 243, in read_frame
    frame_header = read(7, True)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 418, in _read
    s = recv(n - len(rbuf))
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:43:33,497: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 2.00 seconds...

[2018-06-25 02:43:35,526: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 4.00 seconds...

[2018-06-25 02:43:39,560: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 6.00 seconds...

[2018-06-25 02:43:45,599: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 8.00 seconds...

[2018-06-25 02:43:53,639: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 10.00 seconds...

[2018-06-25 02:44:03,680: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 12.00 seconds...

[2018-06-25 02:44:15,743: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 14.00 seconds...

[2018-06-25 02:44:29,790: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 16.00 seconds...

[2018-06-25 02:44:45,839: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 18.00 seconds...

[2018-06-25 02:45:03,890: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 20.00 seconds...

[2018-06-25 02:45:23,943: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 22.00 seconds...

[2018-06-25 02:45:46,002: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 24.00 seconds...

[2018-06-25 02:46:10,109: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,212: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:10,291: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 40, in start
    self.sync(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 44, in sync
    replies = self.send_hello(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 57, in send_hello
    replies = inspect.hello(c.hostname, our_revoked._data) or {}
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 132, in hello
    return self._request('hello', from_node=from_node, revoked=revoked)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 84, in _request
    timeout=self.timeout, reply=True,
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 439, in broadcast
    limit, callback, channel=channel,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 315, in _broadcast
    serializer=serializer)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 290, in _publish
    serializer=serializer,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py", line 1732, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 275, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:46:10,375: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,526: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:11,764: INFO/MainProcess] mingle: all alone

@marvelph
Author

I checked the logs and found a reconnection logged at the time a memory leak started, but there was also a case where a memory leak started when no reconnection had occurred.
I agree with jxltom's idea.

Also, when I was using Celery 3.x, I did not encounter such a problem.

@dmitry-kostin

dmitry-kostin commented Jun 25, 2018

Same problem here.
[image: screenshot 2018-06-25 11 09 22]
Every few days I have to restart the workers due to this problem.
There are no significant clues in the logs, but I suspect that reconnects may play a part, since I have reconnect log entries around the time when memory starts growing constantly.
My setup is Ubuntu 17, 1 server with 1 worker at concurrency 3; RabbitMQ as broker and Redis as backend; all packages are the latest versions.

@georgepsarakis
Contributor

@marvelph @dmitry-kostin could you please provide your exact configuration (omitting sensitive information of course) and possibly a task, or sample, that reproduces the issue? Also, do you have any estimate of the average uptime interval that the worker memory increase starts appearing?

@dmitry-kostin

dmitry-kostin commented Jun 25, 2018

The config is close to the default:

imports = ('app.tasks',)
result_persistent = True
task_ignore_result = False
task_acks_late = True
worker_concurrency = 3
worker_prefetch_multiplier = 4
enable_utc = True
timezone = 'Europe/Moscow'
broker_transport_options = {'visibility_timeout': 3600, 'confirm_publish': True, 'fanout_prefix': True, 'fanout_patterns': True}

[image: screenshot 2018-06-25 11 35 17]

Basically this is a newly deployed node; it was deployed on 06/21 at 18:50, started to grow on 06/23 around 05:00, and finally crashed on 06/23 around 23:00.

The task is pretty simple and there is no complicated logic in it. I think I can reproduce the whole situation in a clean temp project, but I have no free time for now; if I'm lucky I will try to put together a full example on the weekend.

UPD
As you can see from the spikes on the graph, the task itself consumes some memory, but at the time the memory started to leak no tasks were being produced and there was no other activity.

@georgepsarakis
Contributor

@marvelph @dmitry-kostin @jxltom I noticed you use Python3. Would you mind enabling tracemalloc for the process? You may need to patch the worker process though to log memory allocation traces, let me know if you need help with that.

@jxltom

jxltom commented Jun 25, 2018

@georgepsarakis You mean enable tracemalloc in worker and log stats, such as the top 10 memory usage files, at a specific interval such as 5 minutes?

@georgepsarakis
Contributor

@jxltom I think something like that would help locate the part of code that is responsible. What do you think?
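
The idea discussed above (enable tracemalloc in the worker and log the biggest allocators at a fixed interval) could be sketched roughly as follows. This is an illustration rather than a Celery feature: the names top_growth and log_growth_periodically are hypothetical, and where to call it from (for example a handler connected to Celery's worker_init signal) is an assumption that would need checking against the deployment.

```python
import threading
import tracemalloc


def top_growth(old, new, limit=10):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    stats = new.compare_to(old, "lineno")
    return [stat for stat in stats if stat.size_diff > 0][:limit]


def log_growth_periodically(interval=300.0, limit=10, emit=print):
    """Every `interval` seconds, emit the lines that allocated the most new memory.

    Runs in a daemon thread; set the returned Event to stop it.
    """
    stop = threading.Event()
    if not tracemalloc.is_tracing():
        tracemalloc.start()

    def run():
        previous = tracemalloc.take_snapshot()
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            current = tracemalloc.take_snapshot()
            for stat in top_growth(previous, current, limit):
                emit(str(stat))
            previous = current

    threading.Thread(target=run, name="tracemalloc-logger", daemon=True).start()
    return stop
```

Passing a logger method as emit (for instance logging.getLogger(__name__).warning) would route the output to the worker log. tracemalloc adds noticeable CPU and memory overhead, so this is better suited to a test environment.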

@jxltom

jxltom commented Jun 25, 2018

@georgepsarakis I've tried using gdb and https://github.com/lmacken/pyrasite to inject into the leaking process and start debugging via tracemalloc. Here are the top 10 files with the highest memory usage.

I use resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 and the memory usage is indeed gradually increasing.

>>> import tracemalloc
>>> 
>>> tracemalloc.start()
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/kombu/utils/eventio.py:84: size=12.0 KiB, count=1, average=12.0 KiB
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=3520 B, count=8, average=440 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=3264 B, count=12, average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=3060 B, count=10, average=306 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=2816 B, count=12, average=235 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=2816 B, count=8, average=352 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=2672 B, count=6, average=445 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=2592 B, count=8, average=324 B

Here is the difference between two snapshots after around 5 minutes.

>>> snapshot2 = tracemalloc.take_snapshot()
>>> top_stats = snapshot2.compare_to(snapshot, 'lineno')
>>> print("[ Top 10 differences ]")
[ Top 10 differences ]

>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=220 KiB (+216 KiB), count=513 (+505), average=439 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=211 KiB (+208 KiB), count=758 (+748), average=285 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=210 KiB (+206 KiB), count=789 (+777), average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=190 KiB (+187 KiB), count=530 (+522), average=366 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=186 KiB (+183 KiB), count=524 (+516), average=363 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=185 KiB (+182 KiB), count=490 (+484), average=386 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=182 KiB (+179 KiB), count=528 (+520), average=353 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=179 KiB (+176 KiB), count=786 (+774), average=233 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=165 KiB (+163 KiB), count=525 (+517), average=323 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/async/hub.py:293: size=157 KiB (+155 KiB), count=255 (+251), average=632 B
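
A per-line diff like the one above shows where memory is allocated but not which call path leads there. One way to narrow it down, sketched here under assumptions, is to restrict a snapshot to the packages of interest and group by full traceback. The helper name library_stats and the default path patterns are hypothetical; tracebacks are only as deep as the frame count given to tracemalloc.start(), and only allocations made after tracing started are recorded.

```python
import tracemalloc


def library_stats(snapshot, patterns=("*/celery/*", "*/kombu/*", "*/amqp/*"), limit=5):
    """Top allocation sites grouped by traceback, limited to files matching `patterns`."""
    # Several inclusive filters act as an OR: a trace is kept if any of them matches.
    filters = [tracemalloc.Filter(True, pattern) for pattern in patterns]
    return snapshot.filter_traces(filters).statistics("traceback")[:limit]


if __name__ == "__main__":
    tracemalloc.start(25)  # keep up to 25 frames per allocation
    payload = [bytes(1000) for _ in range(100)]
    snapshot = tracemalloc.take_snapshot()
    # "*" matches every file so that this demo has something to show.
    for stat in library_stats(snapshot, patterns=("*",)):
        print(f"{stat.size} bytes in {stat.count} blocks")
        print("\n".join(stat.traceback.format()))
```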

@jxltom

jxltom commented Jun 25, 2018

Any suggestions for how to continue to debug this? I have no clue for how to proceed. Thanks.

@marvelph
Author

marvelph commented Jun 26, 2018

@georgepsarakis

I need a little time to extract a project for reproduction.

These are the Celery settings.

BROKER_URL = [
    'amqp://xxxxxxxx:yyyyyyyy@aaa.bbb.ccc.ddd:5672/zzzzzzzz'
]
BROKER_TRANSPORT_OPTIONS = {}

The scheduler has the following settings.

CELERYBEAT_SCHEDULE = {
    'aaaaaaaa_bbbbbbbb': {
        'task': 'aaaa.bbbbbbbb_cccccccc',
        'schedule': celery.schedules.crontab(minute=0),
    },
    'dddddddd_eeeeeeee': {
        'task': 'dddd.eeeeeeee_ffffffff',
        'schedule': celery.schedules.crontab(minute=0),
    },
}

On EC2, I am using supervisord to run it.

@marvelph
Author

@georgepsarakis
Since my test environment can tolerate performance degradation, tracemalloc can be used there.
Can you make a patched Celery that dumps memory usage?

@dmitry-kostin

dmitry-kostin commented Jun 26, 2018

@jxltom I bet tracemalloc at 5-minute intervals won't help locate the problem.
For example, I have 5 nodes and only 3 of them had this problem in the last 4 days, while 2 worked fine all this time, so it will be very tricky to locate the problem.
I feel like there is some toggle that switches on and then memory starts to grow; until that switch, memory consumption looks fine.

@marvelph
Author

I tried to find out whether similar problems occurred in other running systems.
The frequency of occurrence varies, but a memory leak has occurred on three systems using Celery 4.x, and it has not happened on one system.
The systems with a memory leak run Python 3.5.x, and the system with no memory leak runs Python 2.7.x.

@jxltom

jxltom commented Jun 26, 2018

@dmitry-kostin What's the difference with the other two normal nodes, are they both using same rabbitmq as broker?

Since our discussion mentioned it may be related to RabbitMQ, I started another new node with the same configuration except that it uses Redis instead. So far, this node has shown no memory leak after running for 24 hours. I will post here if it leaks later.

@jxltom

jxltom commented Jun 26, 2018

@marvelph So do you mean that the three systems with the memory leak are using Python 3 while the one that is fine is using Python 2?

@dmitry-kostin

dmitry-kostin commented Jun 26, 2018

@jxltom no difference at all, and yes, they are on Python 3 with RabbitMQ as broker and Redis as backend.
I made a test example to reproduce this; if it succeeds in a couple of days I will give credentials for these servers to somebody who knows how to locate this bug.

@marvelph
Author

@jxltom
Yes.
As far as my environment is concerned, problems do not occur in Python 2.

@pawl
Contributor

pawl commented Dec 24, 2021

The pull request I made today with the fix for the Redis broker leaking memory (when connections to the broker fail) was just merged.

I'm not aware of any other ways to reproduce memory leaks for #4843 at the moment.

Here's a summary of the fixes so far:

These fixes should completely prevent leaks due to disconnected connections to the broker:

And, if there are still some scenarios where that doesn't work... There's also these fixes that make Connections and Transports use ~150kb less memory each (making some potential leaks much less severe):

Thank you @auvipy for all the feedback and help with getting this stuff reviewed and merged.

@auvipy
Member

auvipy commented Dec 24, 2021

@pawl thanks to you and your team mates for the great collaboration & contributions. will push point releases with other merged changes next Sunday if not swallowed by family/holiday vibes. but next week for sure

@caleb15

caleb15 commented Jan 6, 2022

@auvipy Just to double-check, version 5.2.3 of celery that you pushed recently has the memory leak fixes, right?

@pawl
Contributor

pawl commented Jan 6, 2022

@caleb15 Celery 5.2.3 does have a minor leak fix I didn't mention in my comment above: #7187 But, I'm not sure that one is the main one that is generating the complaints in this thread.

I think the main leak fixes are going to come from upgrading kombu to 5.2.3 (if you're using the redis broker) and py-amqp to 5.0.9 (if you're using py-amqp for connecting to rabbitmq).

For more details, see: #4843 (comment)

You may also want to check out this new section of the docs about handling memory leaks: https://docs.celeryproject.org/en/stable/userguide/optimizing.html#memory-usage
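
To check whether an environment already has the versions named in this comment (kombu 5.2.3 and py-amqp 5.0.9), something like the following sketch could be used. The helper names parse and outdated are hypothetical, the version comparison is deliberately simplistic (leading numeric components only), and "amqp" is used as py-amqp's distribution name on PyPI.

```python
from importlib import metadata
from itertools import takewhile

# Minimum versions taken from the comment above.
MINIMUMS = {"kombu": "5.2.3", "amqp": "5.0.9"}


def parse(version):
    """Turn '5.2.3' into (5, 2, 3), stopping at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def outdated(minimums=None, installed_version=metadata.version):
    """Return {package: installed version} for packages older than their minimum."""
    result = {}
    for name, minimum in (minimums or MINIMUMS).items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            continue  # package not installed, nothing to compare
        if parse(installed) < parse(minimum):
            result[name] = installed
    return result


if __name__ == "__main__":
    print(outdated() or "kombu and amqp meet the stated minimums (or are not installed)")
```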

@Kludex
Contributor

Kludex commented Oct 3, 2022

@auvipy Were you able to confirm that the issue was solved? If you don't know, I'll spend time checking.

Please let me know. 🙏

@auvipy
Member

auvipy commented Oct 15, 2022

> @auvipy Were you able to confirm that the issue was solved? If you don't know, I'll spend time checking.
>
> Please let me know. 🙏

It was partially fixed, but another attempt to fix or figure out the remaining leaks would be very helpful. Sorry for the late reply, I took a week's break.

@Kludex
Contributor

Kludex commented Oct 17, 2022

I've created this repository: https://github.com/Kludex/celery-leak

In my observations, the memory grows until a certain point and then remains constant. It took around 2k tasks to reach that point.

Can someone point out how to reproduce it, or what I should try in order to reproduce it?

@harshita01398

harshita01398 commented Jan 16, 2023

Seeing this on Celery-4.3.1, Kombu-4.6.11, Redis-4.1.2

Below is the average memory chart. The available memory increases when the service is restarted during deployment twice a day (Mon-Fri).

During weekends, available memory keeps decreasing until the service is restarted.

[image]

@auvipy Any suggestion/fix for this? Does upgrading resolve this issue?

@auvipy
Member

auvipy commented Jan 16, 2023

First of all, we really can't say much about an unsupported version that was released almost 5 years ago. Using the latest version usually provides more stability in general, and if any issues are raised, they are generally easier to reproduce and fix.

@norbertcyran
Contributor

In our case, what we thought was a memory leak, actually turned out to be eta tasks accumulated in the workers. Over a period of a few days, our RAM usage was increasing by 30GB. I hope it might be useful for some of you.

More info:
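
The accumulation of eta tasks described in this comment can be checked from outside the worker, since workers fetch eta/countdown tasks early and hold them in memory until they are due. The sketch below summarizes the reply of Celery's inspect API; the helper name summarize_scheduled is hypothetical, and the reply shape (worker name mapped to a list of entries with a "request" dict containing "name") is assumed from Celery's documented output and may differ between versions.

```python
from collections import Counter


def summarize_scheduled(reply):
    """Count eta/countdown tasks currently held by each worker, per task name."""
    summary = {}
    for worker, entries in (reply or {}).items():
        names = Counter(
            entry.get("request", {}).get("name", "<unknown>") for entry in entries
        )
        summary[worker] = dict(names)
    return summary


# With a running application the input would come from the remote-control API
# (`app` being your Celery instance, not defined in this sketch):
#   reply = app.control.inspect().scheduled()
#   print(summarize_scheduled(reply))
```

A worker holding many thousands of such entries would point to prefetched eta tasks rather than a leak in Celery itself.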

@oleks-popovych

I'm experiencing a memory leak in a forked worker. Essentially, not all memory is freed after consecutive task executions.
What approaches could I use to minimize or fix the memory leak, other than limiting the number of tasks or the amount of memory a worker may consume?

@some1ataplace

some1ataplace commented Mar 31, 2023

General tips and guidance on how to approach fixing memory leaks in Python, which can be applied to the Celery project.

  1. Identify the leak source: Use memory profiling tools like memory_profiler or objgraph to identify the objects that are causing the memory leak. This will help you pinpoint the part of the code that needs fixing.
from memory_profiler import profile

@profile
def your_function():
    # Your code here
    pass
  1. Use weak references: If the memory leak is caused by circular references between objects, you can use Python's weakref module to create weak references that don't prevent garbage collection.
import weakref

class MyClass:
    def __init__(self, other_instance=None):
        self.other_instance = weakref.ref(other_instance) if other_instance else None

instance1 = MyClass()
instance2 = MyClass(instance1)
instance1.other_instance = weakref.ref(instance2)

Another example:

import weakref
from celery import Celery

app = Celery('tasks', broker='pyamqp://guest@localhost//')

class ResourceHolder:
    def __init__(self, data):
        self.data = data

# Create a weak reference dictionary for resources
resources = weakref.WeakValueDictionary()

@app.task
def process_resource(resource_id):
    resource_holder = resources.get(resource_id)
    if resource_holder is not None:
        # Process your resource_holder.data here
        pass

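def load_resources():
    # Hypothetical placeholder (not in the original): return the raw data to process.
    return []

def main():
    # Load all resources
    # Caveat: `resources` exists only in the process that runs main(). A worker
    # in another process has its own empty dictionary, and id() values are not
    # meaningful across processes, so the lookup in the task only succeeds when
    # tasks execute in this same process.
    for resource_data in load_resources():
        resource_holder = ResourceHolder(resource_data)
        resources[id(resource_holder)] = resource_holder
        process_resource.apply_async((id(resource_holder),))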

if __name__ == "__main__":
    main()

This example assumes that you have resources that need to be processed. Instead of passing the actual resource object to the Celery task, you maintain a weak reference dictionary, and only pass the id. This way, once the resource is no longer needed, it can be garbage collected, preventing a memory leak.

  1. Properly close resources: Ensure that you're properly closing resources like file handles, sockets, and database connections. Use context managers (with statement) whenever possible.
with open('file.txt', 'r') as f:
    content = f.read()
  1. Clear caches and buffers: If you're using caches or buffers, make sure to clear them periodically or when they're no longer needed.

cache.clear()

  1. Use garbage collection: In some cases, you may need to manually call Python's garbage collector to clean up unused objects. Be cautious when using this approach, as it can impact performance.
import gc

gc.collect()
  1. Optimize data structures: Sometimes, memory leaks can be caused by inefficient data structures. Consider using more memory-efficient data structures like array.array, classes with __slots__, or namedtuple, depending on your use case.
from collections import namedtuple

MyTuple = namedtuple('MyTuple', ['field1', 'field2'])
  1. Limit task results: In the case of Celery, you may want to limit how long task results are stored in the backend by setting the result expiration time (result_expires, the current name of CELERY_TASK_RESULT_EXPIRES).

app.conf.update(result_expires=3600)

  1. Monitor and profile: Continuously monitor the memory usage of your application and profile it regularly to identify any potential memory leaks early on.

@KyeRussell

That...kind of reads like a ChatGPT answer.

@FabriQuinteros

I have a memory leak in a process that sends emails. There are 50 Celery tasks executed in parallel at a certain interval (eta); that is, one sending task does not need to finish before the next starts. I do this with Celery's group().

What this process mainly does is open and close the connection to the mail server many times to send mails, and generate records in the database (around 1000 records in 45 minutes). At some point my memory usage climbs to the maximum available. I suppose there is a memory leak and the memory is never recovered: no matter how long ago the function ended, that memory is not recovered until the worker is restarted. What can you recommend to avoid this leak?

django 3.2.18
celery 5.2.7
vine 5.0.0
kombu 5.2.3

@norbertcyran
Contributor

@FabriQuinteros if you use eta tasks, you might find this comment useful: #4843 (comment)

@FabriQuinteros

@norbertcyran I checked it, but my problem is short term, not long term. I have many other tasks scheduled besides these. The problem appears when they start to run, not at the moment I send them to the task queue.

@hadpro24

hadpro24 commented Jul 16, 2023

Hi guys, I advise you to use jemalloc. It has helped us to considerably reduce memory consumption.

Here's my Dockerfile configuration


FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

RUN groupadd -r app && useradd -r -g app app

# Install build dependencies and jemalloc in one layer; -y is needed for a non-interactive build.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       build-essential gcc libpq-dev libc-dev libmagic1 libpq5 libjemalloc2 \
    && rm -rf /var/lib/apt/lists/*

ENV LD_PRELOAD /usr/lib/x86_64-linux-gnu/libjemalloc.so.2

WORKDIR /app
COPY requirements.txt .
RUN pip install --upgrade Cython && pip install -r requirements.txt

COPY . .
RUN chown -R app:app /app
USER app

CMD ["/bin/bash", "./entrypoint.sh"]

https://github.com/jemalloc/jemalloc
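
Because LD_PRELOAD fails quietly when the library path is wrong (the loader prints a warning and carries on with the default allocator), it can be worth confirming from inside the container that jemalloc is actually mapped into the worker process. The sketch below is Linux-specific and assumes /proc is available; the helper names are hypothetical.

```python
from pathlib import Path


def preloaded_allocators(maps_text, needles=("jemalloc", "tcmalloc")):
    """Return the allocator names that appear in a /proc/<pid>/maps dump."""
    lines = maps_text.splitlines()
    return sorted(name for name in needles if any(name in line for line in lines))


def current_process_allocators():
    """Inspect this process; returns [] where /proc/self/maps does not exist."""
    maps = Path("/proc/self/maps")
    return preloaded_allocators(maps.read_text()) if maps.exists() else []


if __name__ == "__main__":
    print(current_process_allocators() or "default allocator")
```

Running this with the same interpreter and environment as the worker should list jemalloc if the preload took effect.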

@adalyuf

adalyuf commented Aug 5, 2023

For anyone running into this on Django, this helped my memory leak.

Most answers online mention setting CELERYD_MAX_TASKS_PER_CHILD. This is the right idea, but the naming needs to be updated for new Django/Celery projects.

Celery has switched the naming of certain configuration options.
You would therefore expect CELERYD_MAX_TASKS_PER_CHILD to become worker_max_tasks_per_child. However, that is not what should be used in a Django settings file: there, the name needs to be uppercased and prefixed with CELERY_.

Celery has a command to make this conversion easy:
celery upgrade settings <project>/settings.py --django

This then will change CELERYD_MAX_TASKS_PER_CHILD to CELERY_WORKER_MAX_TASKS_PER_CHILD

To troubleshoot whether this is working or not, run flower and on the Flower -> Pool tab you should see
Max tasks per child | 200

If this approach doesn't work, you can add it to the worker invocation as
celery ... worker ... --max-tasks-per-child=200
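
As an illustration of the naming described in this comment, a Django settings fragment might look like the sketch below. The values are placeholders rather than recommendations, and the second setting (the uppercased form of worker_max_memory_per_child, given in KiB) is an addition not mentioned in the comment.

```python
# settings.py (Django); values are illustrative only.

# Recycle each pool process after it has executed this many tasks.
CELERY_WORKER_MAX_TASKS_PER_CHILD = 200

# Recycle a pool process once its resident memory exceeds this many KiB.
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 200_000

# These names are only picked up if the Celery app loads Django settings with
# the CELERY namespace, as in the usual celery.py module:
#   app.config_from_object("django.conf:settings", namespace="CELERY")
```

Both settings recycle pool (child) processes, so they help with leaks inside tasks; they would not be expected to affect growth in the worker's main process, which is what the original report describes.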

@Robin528919

[image]
With -P eventlet, memory keeps rising steadily. Is there a solution?

@Robin528919

Memory keeps rising steadily, is there a solution?

@Nusnus modified the milestones: 5.3.x, 5.5 (Jun 25, 2024)