
Celery[redis] bunch of TID.reply.celery.pidbox objects taking a lot of memory #6089

Open · dejlek opened this issue May 13, 2020 · 16 comments

@dejlek
Contributor

dejlek commented May 13, 2020

Since there is no guidance/advice request option I had to file a bug report, and I apologise for that in advance. Recently we have started having memory issues with our Redis (ElastiCache) server. Using RedisInsight I found that we have hundreds of list objects that look like 70e68057-de21-3ed6-9798-26cd42ad8456.reply.celery.pidbox, take between 50 MB and 150 MB of RAM each, and have TTL = -1 (in other words, they never expire!).

The question is: how do we prevent this from happening? Is it a bug? Is there a way to maintain these keys (periodic cleanup of some kind)? Any constructive advice is welcome!
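
For anyone trying to confirm the same symptom, here is a minimal redis-py sketch (the connection details are placeholders, not taken from this report; MEMORY USAGE needs Redis 4.0+) that lists the reply pidbox keys together with their TTL and memory usage:

import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

for key in r.scan_iter(match="*.reply.celery.pidbox"):
    ttl = r.ttl(key)                 # -1 means the key never expires
    size = r.memory_usage(key) or 0  # approximate bytes used by the key
    print(f"{key.decode():60} ttl={ttl} size={size / 1024 / 1024:.1f} MB")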

@auvipy
Member

auvipy commented May 13, 2020

Are you profiling your code? If so, you should also be able to find the root cause of the memory leak and a possible solution.

@dejlek
Contributor Author

dejlek commented May 13, 2020

Erm... my code does not create those keys - it is Celery that does. I am guessing they are related to some of the chords we run every day. Why do they have TTL = -1? Our code does not touch the Celery broker (the ElastiCache instance).

@auvipy
Member

auvipy commented May 13, 2020

I was trying to suggest that, since it is being used in your production system, it would be wise to profile the leaky part of Celery :)

@dejlek
Contributor Author

dejlek commented May 13, 2020

Well, this is not a regular "memory leak" - Celery creates enormous objects in Redis with TTL=-1, and if it fails to delete them for whatever reason, we are in trouble...

@auvipy
Member

auvipy commented May 17, 2020

https://stackoverflow.com/questions/141351/how-do-i-find-what-is-using-memory-in-a-python-process-in-a-production-system/61260839#61260839

@dejlek
Contributor Author

dejlek commented May 21, 2020

Did you read what I wrote above? :) - It is not a regular memory leak!
Celery leaves large objects in Redis with TTL=-1, and eventually Redis runs out of memory! We found this out "the hard way" a few days ago and wrote a maintenance script that runs every week to clean up...

I humbly believe setting TTL=-1 for those keys is a bug.

@jheld
Contributor

jheld commented May 24, 2020

Anything related to the result_expires setting? Perhaps you set it to None, so all the tasks for that worker keep their results in the backend (until TTL or eviction).

Can you provide more info?

@dejlek
Contributor Author

dejlek commented May 26, 2020

No, we do not use the result_expires setting, so it should be at its default (1 day, if I remember correctly). I checked anyway with inspect conf and there was no result_expires in the output.
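
For reference, result_expires defaults to 86400 seconds (1 day). A minimal sketch of checking the effective value programmatically, which is roughly what inspect conf does (the import path for the Celery app is hypothetical):

from myproject.celery import app  # hypothetical import path for the Celery app

# Ask the running workers for their effective configuration,
# including defaulted values such as result_expires.
replies = app.control.inspect().conf(with_defaults=True)
for worker, conf in (replies or {}).items():
    print(worker, conf.get("result_expires"))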

@gautamp8

gautamp8 commented Jan 17, 2021

This seems to be the same as celery/kombu#294.
We're also experiencing the same issue.

@auvipy
Member

auvipy commented Jan 17, 2021

> Did you read what I wrote above? :) - It is not a regular memory leak!
> Celery leaves large objects in Redis with TTL=-1, and eventually Redis runs out of memory! We found this out "the hard way" a few days ago and wrote a maintenance script that runs every week to clean up...
>
> I humbly believe setting TTL=-1 for those keys is a bug.

Sorry for not being more helpful! Do you have the bandwidth to come up with a workable solution to the problem? Once you come up with a PR, we can share the effort with you.

@linar-jether

Our solution to this issue is to periodically remove keys that have been idle for more than X days

red = app.backend.client  # redis-py client from the Celery result backend
for key in red.scan_iter("*", 100000):
    if key.decode().startswith('_kombu'):
        continue
    if red.object('idletime', key) >= (24 * 7 * 3600):  # idle for more than 7 days
        red.delete(key)
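
A narrower variant (illustrative only; the match pattern and the one-hour idle threshold are assumptions) that only touches the reply pidbox keys:

red = app.backend.client  # assumes a Redis result backend is configured
for key in red.scan_iter(match="*.reply.celery.pidbox", count=1000):
    # Delete reply pidbox keys that have been idle for more than an hour.
    if red.object("idletime", key) >= 3600:
        red.delete(key)

Note that app.backend.client only exists when a Redis result backend is configured; without one, a plain redis-py client pointed at the broker URL does the same job (as the readthedocs workaround further down ended up doing).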

@humitos
Contributor

humitos commented Feb 7, 2023

We are hitting this issue as well. I've been trying to debug it but I haven't been able to find the root cause yet. Here is some of the data I have access to:

In [94]: import math
    ...: from readthedocs.worker import app
    ...: total_memory = 0
    ...: keys = app.backend.client.keys('*reply.celery.pidbox*')
    ...: print(f'{"Key":60} | {"idletime (secs)":15} | {"ttl":3} | {"Memory (mb)":12} | {"refcounts":3}')
    ...: for key in keys:
    ...:     idletime = app.backend.client.object('idletime', key)
    ...:     refcounts = app.backend.client.object('refcount', key)
    ...:     ttl = app.backend.client.ttl(key)
    ...:     memory = math.ceil(app.backend.client.memory_usage(key) / 1024 / 1024)
    ...:     total_memory += memory
    ...:     print(f'{str(key):60} | {idletime:15} | {ttl:3} | {memory:12} | {refcounts:3}')
    ...: print('Total memory (mb):', total_memory)
    ...: print('Total keys:', len(keys))
Key                                                          | idletime (secs) | ttl | Memory (mb)  | refcounts
b'132fe875-4f6e-3764-b532-ac1e7ea20444.reply.celery.pidbox'  |            8118 |  -1 |            3 |   1
b'bcd95240-5a6f-3b6a-9aa2-12986e69e244.reply.celery.pidbox'  |            8196 |  -1 |           13 |   1
b'2c3ef1f1-eb4d-3b05-b761-540df20703f7.reply.celery.pidbox'  |            8116 |  -1 |            2 |   1
b'4cad48f6-1770-3202-bcd8-2bab73f66a0e.reply.celery.pidbox'  |            8114 |  -1 |           13 |   1
b'c6de8817-13e1-3743-bbf9-c2160bbf2717.reply.celery.pidbox'  |            8196 |  -1 |            7 |   1
b'4aaa5579-290b-3bd1-b9fb-451a73e090e6.reply.celery.pidbox'  |            8115 |  -1 |           13 |   1
b'0731b964-2e3a-36d0-9f17-6b271efa85a8.reply.celery.pidbox'  |            8118 |  -1 |            3 |   1
b'de88f2d1-8f05-3e8d-a633-98795918e1ad.reply.celery.pidbox'  |            8115 |  -1 |           26 |   1
b'1ae2d2f2-e2c2-3718-bcdf-1d627b0dda9a.reply.celery.pidbox'  |            8195 |  -1 |            2 |   1
b'd204af83-83a3-3b81-b65e-1e3f3dd7ac58.reply.celery.pidbox'  |            8197 |  -1 |           32 |   1
b'e0254a72-cde8-3b67-9b9b-d1b2f26b3bff.reply.celery.pidbox'  |            8118 |  -1 |            7 |   1
b'8fb8a143-f86f-3bd8-8809-26f7e8839d3b.reply.celery.pidbox'  |            8118 |  -1 |            7 |   1
b'aaf678d5-4415-3a56-beac-0beec765784a.reply.celery.pidbox'  |            8116 |  -1 |           13 |   1
b'5ec473cd-7f2e-3ea9-89e6-bf86b0442aa5.reply.celery.pidbox'  |            8116 |  -1 |           44 |   1
b'a9aca4e1-1c62-3ab5-af65-8386476b9a9f.reply.celery.pidbox'  |            2104 |  -1 |           26 |   1
b'b160558a-ac56-3ce7-8031-c83a95c90cb0.reply.celery.pidbox'  |            8116 |  -1 |            3 |   1
b'd3203913-9791-3fd5-af45-5f8e28cdde9c.reply.celery.pidbox'  |            8196 |  -1 |            2 |   1
b'79d8881b-8eaf-3adb-a1c4-40f0a32ae9d3.reply.celery.pidbox'  |            8121 |  -1 |           51 |   1
b'39317dbe-f93f-31cb-8980-3107ba788411.reply.celery.pidbox'  |            8196 |  -1 |            7 |   1
b'e099525f-9b7b-3436-9515-fe5e69d4bd7f.reply.celery.pidbox'  |            8119 |  -1 |            2 |   1
b'5d03f9af-a894-3eb8-9499-63b113505564.reply.celery.pidbox'  |            8120 |  -1 |            7 |   1
Total memory (mb): 283
Total keys: 21


These keys are never cleaned up, so we are deleting them manually once in a while - more frequently each week, though.

Not sure if it is related or not, but it seems our Redis runs out of memory (OOM) when we receive a bunch of tasks at the same time:

[Screenshot from 2023-02-07 showing the Redis memory usage spike]

We are using:

celery==5.2.7
kombu==5.2.4

Is there any other important info I should provide to help? I'm happy to keep debugging this, but I'd probably need more direction here. Note that we are only noticing this in production.

Edit: after 1 hour it went from 283 MB to 726 MB of pidbox keys.

auvipy added this to the 5.3 milestone Feb 7, 2023
@humitos
Contributor

humitos commented Feb 7, 2023

If I understand correctly, this is the Celery/Kombu code involved in this:

I suppose that's why (it is not supported on Redis) we have all these messages with TTL=-1, right?

humitos added a commit to readthedocs/readthedocs.org that referenced this issue Feb 8, 2023
Simple task to remove `pidbox` keys older than 15 minutes.
This is a workaround to avoid Redis OOM for now.

We will need to find out a better solution here. There is an upstream issue
opened that we should check in the near future and probably remove this
workaround: celery/celery#6089
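
A rough sketch of what such a periodic cleanup task could look like (the app name, broker URL and schedule are assumptions, not taken from the readthedocs commit):

from celery import Celery
from redis import Redis

app = Celery("myapp", broker="redis://localhost:6379/0")  # hypothetical app/broker

@app.task
def cleanup_pidbox_keys(max_idle_seconds=15 * 60):
    # Delete reply pidbox keys that have been idle for more than ~15 minutes.
    client = Redis.from_url(app.conf.broker_url)
    for key in client.scan_iter(match="*.reply.celery.pidbox", count=1000):
        if client.object("idletime", key) >= max_idle_seconds:
            client.delete(key)

# Run the cleanup every 15 minutes via Celery beat.
app.conf.beat_schedule = {
    "cleanup-pidbox-keys": {"task": cleanup_pidbox_keys.name, "schedule": 15 * 60},
}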
@dejlek
Contributor Author

dejlek commented Feb 8, 2023

Redis has supported key expiration since the beginning.

@humitos
Contributor

humitos commented Feb 8, 2023

Yeah, but reading the docstring of the method, it seems the Celery integration with Redis does not support it: https://github.com/celery/kombu/blob/main/kombu/entity.py#L464-L473

humitos added a commit to readthedocs/readthedocs.org that referenced this issue Feb 13, 2023
* Celery: cleanup `pidbox` keys

Simple task to remove `pidbox` keys older than 15 minutes.
This is a workaround to avoid Redis OOM for now.

We will need to find out a better solution here. There is an upstream issue
opened that we should check in the near future and probably remove this
workaround: celery/celery#6089

* Celery: use `redis` to get the client

`app.backend.client` is no longer working since we are not using a result backend anymore.
Nusnus modified the milestones: 5.3, Future Feb 19, 2023
@harniruthwik

We are facing a similar issue. Can someone help with a final workable solution?

DeD1rk added a commit to svthalia/concrexit that referenced this issue Mar 6, 2024
There was a strange bug where over a few months, celery's
idle CPU usage kept increasing. It seems this may have been related
to the healthcheck being killed by docker, without cleaning up after
itself. This led to hundreds of thousands of 'celery.pidbox' keys
being left behind on redis, which slowed down redis.

See celery/celery#6089
DeD1rk added a commit to svthalia/concrexit that referenced this issue Apr 12, 2024
There was a strange bug where over a few months, celery's
idle CPU usage kept increasing. It seems this may have been related
to the healthcheck being killed by docker, without cleaning up after
itself. This led to hundreds of thousands of 'celery.pidbox' keys
being left behind on redis, which slowed down redis.

See celery/celery#6089