
Celery[redis] bunch of TID.reply.celery.pidbox objects taking a lot of memory #6089

Open · dejlek opened this issue May 13, 2020 · 16 comments

@dejlek
Contributor

dejlek commented May 13, 2020

Since there is no guidance/advice request option I had to file a bug report, and I apologise for that in advance. Recently we have started having memory issues with our Redis (ElastiCache) server. Using RedisInsight I found that we have hundreds of list objects that look like 70e68057-de21-3ed6-9798-26cd42ad8456.reply.celery.pidbox, take between 50 MB and 150 MB of RAM each, and have TTL = -1 (in other words, they never expire!).

The question is: how do we prevent this from happening? Is it a bug? Is there a way to maintain these keys (periodic cleanup of some kind)? Any constructive advice is welcome!
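
For anyone trying to confirm the same symptom, here is a minimal redis-py sketch (the connection details are placeholders, not taken from this report; MEMORY USAGE needs Redis 4.0+) that lists the reply pidbox keys together with their TTL and memory usage:

import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

for key in r.scan_iter(match="*.reply.celery.pidbox"):
    ttl = r.ttl(key)                 # -1 means the key never expires
    size = r.memory_usage(key) or 0  # approximate bytes used by the key
    print(f"{key.decode():60} ttl={ttl} size={size / 1024 / 1024:.1f} MB")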

@auvipy
Member

auvipy commented May 13, 2020

Are you profiling your code? If so, you should also be able to find the root cause of the memory leak and a possible solution.

@dejlek
Contributor Author

dejlek commented May 13, 2020

Erm... my code does not create those keys - it is Celery that does. I am guessing they are related to some of the chords we run every day. Why do they have TTL = -1? Our code does not touch the Celery broker (the ElastiCache instance).

@auvipy
Member

auvipy commented May 13, 2020

I was trying to suggest that, since it is being used in your production system, it would be wise to profile the leaky part of Celery :)

@dejlek
Contributor Author

dejlek commented May 13, 2020

Well, this is not a regular "memory leak" - Celery creates enormous objects in Redis with TTL=-1, and if it fails to delete them for whatever reason, we are in trouble...

@auvipy
Member

auvipy commented May 17, 2020

https://stackoverflow.com/questions/141351/how-do-i-find-what-is-using-memory-in-a-python-process-in-a-production-system/61260839#61260839

@dejlek
Contributor Author

dejlek commented May 21, 2020

Did you read what I wrote above? :) - It is not a regular memory leak!
Celery leaves large objects in Redis with TTL=-1, and eventually Redis runs out of memory! We found this out "the hard way" a few days ago and wrote a maintenance script that runs every week to clean up...

I humbly believe setting TTL=-1 for those keys is a bug.

@jheld
Contributor

jheld commented May 24, 2020

Anything related to the result_expires setting? Perhaps you set it to None, so all the tasks for that worker keep their results in the backend (until TTL or eviction).

Can you provide more info?

@dejlek
Contributor Author

dejlek commented May 26, 2020

No, we do not use the result_expires setting, so it should be at its default (1 day, if I remember correctly). I checked anyway with inspect conf and there was no result_expires in the output.
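
For reference, result_expires defaults to 86400 seconds (1 day). A minimal sketch of checking the effective value programmatically, which is roughly what inspect conf does (the import path for the Celery app is hypothetical):

from myproject.celery import app  # hypothetical import path for the Celery app

# Ask the running workers for their effective configuration,
# including defaulted values such as result_expires.
replies = app.control.inspect().conf(with_defaults=True)
for worker, conf in (replies or {}).items():
    print(worker, conf.get("result_expires"))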

@gautamp8

gautamp8 commented Jan 17, 2021

This seems to be the same as celery/kombu#294.
We're also experiencing the same issue.

@auvipy
Member

auvipy commented Jan 17, 2021

> Did you read what I wrote above? :) - It is not a regular memory leak!
> Celery leaves large objects in Redis with TTL=-1, and eventually Redis runs out of memory! We found this out "the hard way" a few days ago and wrote a maintenance script that runs every week to clean up...
>
> I humbly believe setting TTL=-1 for those keys is a bug.

Sorry for not being more helpful! Do you have the bandwidth to come up with a workable solution to the problem? Once you come up with a PR, we can share the effort with you.

@linar-jether

Our solution to this issue is to periodically remove keys that have been idle for more than X days

red = app.backend.client  # redis-py client from the Celery result backend
for key in red.scan_iter("*", 100000):
    if key.decode().startswith('_kombu'):
        continue
    if red.object('idletime', key) >= (24 * 7 * 3600):  # idle for more than 7 days
        red.delete(key)
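
A narrower variant (illustrative only; the match pattern and the one-hour idle threshold are assumptions) that only touches the reply pidbox keys:

red = app.backend.client  # assumes a Redis result backend is configured
for key in red.scan_iter(match="*.reply.celery.pidbox", count=1000):
    # Delete reply pidbox keys that have been idle for more than an hour.
    if red.object("idletime", key) >= 3600:
        red.delete(key)

Note that app.backend.client only exists when a Redis result backend is configured; without one, a plain redis-py client pointed at the broker URL does the same job (as the readthedocs workaround further down ended up doing).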

@humitos
Contributor

humitos commented Feb 7, 2023

We are hitting this issue as well. I've been trying to debug it but I haven't been able to find the root cause yet. Here is some of the data I have access to:

In [94]: import math
    ...: from readthedocs.worker import app
    ...: total_memory = 0
    ...: keys = app.backend.client.keys('*reply.celery.pidbox*')
    ...: print(f'{"Key":60} | {"idletime (secs)":15} | {"ttl":3} | {"Memory (mb)":12} | {"refcounts":3}')
    ...: for key in keys:
    ...:     idletime = app.backend.client.object('idletime', key)
    ...:     refcounts = app.backend.client.object('refcount', key)
    ...:     ttl = app.backend.client.ttl(key)
    ...:     memory = math.ceil(app.backend.client.memory_usage(key) / 1024 / 1024)
    ...:     total_memory += memory
    ...:     print(f'{str(key):60} | {idletime:15} | {ttl:3} | {memory:12} | {refcounts:3}')
    ...: print('Total memory (mb):', total_memory)
    ...: print('Total keys:', len(keys))
Key                                                          | idletime (secs) | ttl | Memory (mb)  | refcounts
b'132fe875-4f6e-3764-b532-ac1e7ea20444.reply.celery.pidbox'  |            8118 |  -1 |            3 |   1
b'bcd95240-5a6f-3b6a-9aa2-12986e69e244.reply.celery.pidbox'  |            8196 |  -1 |           13 |   1
b'2c3ef1f1-eb4d-3b05-b761-540df20703f7.reply.celery.pidbox'  |            8116 |  -1 |            2 |   1
b'4cad48f6-1770-3202-bcd8-2bab73f66a0e.reply.celery.pidbox'  |            8114 |  -1 |           13 |   1
b'c6de8817-13e1-3743-bbf9-c2160bbf2717.reply.celery.pidbox'  |            8196 |  -1 |            7 |   1
b'4aaa5579-290b-3bd1-b9fb-451a73e090e6.reply.celery.pidbox'  |            8115 |  -1 |           13 |   1
b'0731b964-2e3a-36d0-9f17-6b271efa85a8.reply.celery.pidbox'  |            8118 |  -1 |            3 |   1
b'de88f2d1-8f05-3e8d-a633-98795918e1ad.reply.celery.pidbox'  |            8115 |  -1 |           26 |   1
b'1ae2d2f2-e2c2-3718-bcdf-1d627b0dda9a.reply.celery.pidbox'  |            8195 |  -1 |            2 |   1
b'd204af83-83a3-3b81-b65e-1e3f3dd7ac58.reply.celery.pidbox'  |            8197 |  -1 |           32 |   1
b'e0254a72-cde8-3b67-9b9b-d1b2f26b3bff.reply.celery.pidbox'  |            8118 |  -1 |            7 |   1
b'8fb8a143-f86f-3bd8-8809-26f7e8839d3b.reply.celery.pidbox'  |            8118 |  -1 |            7 |   1
b'aaf678d5-4415-3a56-beac-0beec765784a.reply.celery.pidbox'  |            8116 |  -1 |           13 |   1
b'5ec473cd-7f2e-3ea9-89e6-bf86b0442aa5.reply.celery.pidbox'  |            8116 |  -1 |           44 |   1
b'a9aca4e1-1c62-3ab5-af65-8386476b9a9f.reply.celery.pidbox'  |            2104 |  -1 |           26 |   1
b'b160558a-ac56-3ce7-8031-c83a95c90cb0.reply.celery.pidbox'  |            8116 |  -1 |            3 |   1
b'd3203913-9791-3fd5-af45-5f8e28cdde9c.reply.celery.pidbox'  |            8196 |  -1 |            2 |   1
b'79d8881b-8eaf-3adb-a1c4-40f0a32ae9d3.reply.celery.pidbox'  |            8121 |  -1 |           51 |   1
b'39317dbe-f93f-31cb-8980-3107ba788411.reply.celery.pidbox'  |            8196 |  -1 |            7 |   1
b'e099525f-9b7b-3436-9515-fe5e69d4bd7f.reply.celery.pidbox'  |            8119 |  -1 |            2 |   1
b'5d03f9af-a894-3eb8-9499-63b113505564.reply.celery.pidbox'  |            8120 |  -1 |            7 |   1
Total memory (mb): 283
Total keys: 21


These keys are never cleaned up, so we are deleting them manually once in a while - more frequently each week, though.

Not sure if it is related or not, but it seems our Redis runs out of memory (OOM) when we receive a bunch of tasks at the same time:

[Screenshot from 2023-02-07 showing the Redis memory usage spike]

We are using:

celery==5.2.7
kombu==5.2.4

Is there any other important info I should provide to help? I'm happy to keep debugging this, but I'd probably need more direction here. Note that we are only noticing this in production.

Edit: after 1 hour it went from 283 MB to 726 MB of pidbox keys.

auvipy added this to the 5.3 milestone Feb 7, 2023
@humitos
Contributor

humitos commented Feb 7, 2023

If I understand correctly, this is the Celery/Kombu code involved in this:

I suppose that's why (it is not supported on Redis) we have all these messages with TTL=-1, right?

humitos added a commit to readthedocs/readthedocs.org that referenced this issue Feb 8, 2023
Simple task to remove `pidbox` keys older than 15 minutes.
This is a workaround to avoid Redis OOM for now.

We will need to find out a better solution here. There is an upstream issue
opened that we should check in the near future and probably remove this
workaround: celery/celery#6089
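
A rough sketch of what such a periodic cleanup task could look like (the app name, broker URL and schedule are assumptions, not taken from the readthedocs commit):

from celery import Celery
from redis import Redis

app = Celery("myapp", broker="redis://localhost:6379/0")  # hypothetical app/broker

@app.task
def cleanup_pidbox_keys(max_idle_seconds=15 * 60):
    # Delete reply pidbox keys that have been idle for more than ~15 minutes.
    client = Redis.from_url(app.conf.broker_url)
    for key in client.scan_iter(match="*.reply.celery.pidbox", count=1000):
        if client.object("idletime", key) >= max_idle_seconds:
            client.delete(key)

# Run the cleanup every 15 minutes via Celery beat.
app.conf.beat_schedule = {
    "cleanup-pidbox-keys": {"task": cleanup_pidbox_keys.name, "schedule": 15 * 60},
}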
@dejlek
Contributor Author

dejlek commented Feb 8, 2023

Redis has supported key expiration since the beginning.

@humitos
Contributor

humitos commented Feb 8, 2023

Yeah, but reading the docstring of the method, it seems the Celery integration with Redis does not support it: https://github.com/celery/kombu/blob/main/kombu/entity.py#L464-L473

humitos added a commit to readthedocs/readthedocs.org that referenced this issue Feb 13, 2023
* Celery: cleanup `pidbox` keys

Simple task to remove `pidbox` keys older than 15 minutes.
This is a workaround to avoid Redis OOM for now.

We will need to find out a better solution here. There is an upstream issue
opened that we should check in the near future and probably remove this
workaround: celery/celery#6089

* Celery: use `redis` to get the client

`app.backend.client` is no longer working since we are not using a result backend anymore.
Nusnus modified the milestones: 5.3, Future Feb 19, 2023
@harniruthwik

We are facing a similar issue. Can someone help with a final workable solution?

DeD1rk added a commit to svthalia/concrexit that referenced this issue Mar 6, 2024
There was a strange bug where over a few months, celery's
idle CPU usage kept increasing. It seems this may have been related
to the healthcheck being killed by docker, without cleaning up after
itself. This led to hundreds of thousands of 'celery.pidbox' keys
being left behind on redis, which slowed down redis.

See celery/celery#6089
DeD1rk added a commit to svthalia/concrexit that referenced this issue Apr 12, 2024
There was a strange bug where over a few months, celery's
idle CPU usage kept increasing. It seems this may have been related
to the healthcheck being killed by docker, without cleaning up after
itself. This led to hundreds of thousands of 'celery.pidbox' keys
being left behind on redis, which slowed down redis.

See celery/celery#6089