Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening #3339

filwaitman · 2016-07-25T18:48:22Z

Hey @ask! How are you doing?

I'm getting a weird behavior on celery working over Heroku.

Sounds like I'm having a memory leak. Actually it behaves like a memory leak, but I don't think this is the case - just don't know what's this about, though.

For some reason seems like my scheduled tasks are not releasing memory after they're finished.
It doesn't seem to be related to scheduled tasks code, since I removed all code running on it and this "leak" is still present (see details below).

I know this "memory releasing" part is Python's responsibility, and I don't expect the memory to be released right after task is executed. But I'm getting my celery machines on a memory usage rate of 170% (by using swap and getting a bunch of R14 errors). Check it out:

(I have restarted celery at 14pm UTC. That's why memory was released.)

My pip requirements:

celery==3.1.20
django-celery==3.1.17

How I'm using celery on this machine (Procfile):

python manage.py celery worker -E -B -l INFO  # -E because I'm using celerycam

My celery configs:

BROKER_URL = env('BROKER_URL', default='redis://127.0.0.1:6379')  # Being defined on heroku settings var
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = TIME_ZONE
CELERY_RESULT_BACKEND = 'djcelery.backends.database:DatabaseBackend'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'

CELERYBEAT_SCHEDULE = {}

(let me know if you need any additional info)

I used to have a bunch of things on my celery scheduler. For debugging purposes I erased them (there's no scheduled tasks running). Without a single scheduled task running the "leak" has vanished.
After this I created a (stupid?) scheduled task just for testing purposes:

# settings.py
CELERYBEAT_SCHEDULE = {
    'Nothing at all (test for memory leak)': {
        'task': 'main.tasks.nothing_at_all',
        'schedule': timedelta(minutes=10)
    },
}

# tasks.py
@shared_task
def nothing_at_all():
    a = range(1000000)
    print a[0]

Random notes:

I don't really think it's a celery issue since I used it many times before and it worked flawlessly.
I can't reproduce it locally. At least I tried and it worked as it should.
On a desperate measure I tried setting CELERYD_MAX_TASKS_PER_CHILD on heroku machines. Not even this stopped the "leak".
I tried disabling the celerycam (just in case). Didn't solve the issue.
I'm able to flush memory by restarting the celery service (this may be obvious, just wanted to mention that)

As I mentioned before I'm pretty convinced this is not an issue on your side. But maybe you know what's happening here. I confess I'm a bit lost now. 😆

Let me know if you have any clue what's going on.

Thanks!

The text was updated successfully, but these errors were encountered:

codingjoe · 2016-08-03T09:40:20Z

We're experiencing the same issue recently, but using an amqp backend.

ask · 2016-08-04T00:38:38Z

I'm not sure, but from my experience Python never releases memory back to the OS once allocated. Apparently the rationale is that releasing memory is expensive and Python will want to use it again.

Try to call the task multiple times to see if the number keeps on growing, if the number grows then add a import gc; gc.collect() before the task returns.

ask · 2016-08-04T00:39:54Z

Btw, maxtasksperchild should be releasing the memory since that will kill the child process. Are you sure it's the child process here that is consuming the memory?

filwaitman · 2016-08-05T15:27:24Z

@codingjoe somewhat good to know I'm not the only one with this issue. 😫

@ask I tried the gc.collect() and it didn't solve. Actually I had tried this before but I forgot to mention on description. Sorry for that.

Also, I'm sure its related to a child process. What I did to ensure that:

Disabled all tasks and left the celery machine "useless" for a while.
Enabled one single task:

@shared_task
def nothing_at_all():
    a = range(3000000)
    print a[0]

    import gc
    gc.collect()

As a result, the celery machine was stable during (1) (its memory usage was on 171MB for a long time), and the leak behavior re-appeared when I made (2). See attached image.

And in fact, I think gc.collect() and maxtasksperchild would have the same effect. And none of these solved the issue... =/

codingjoe · 2016-08-06T18:38:56Z

@filwaitman sorry to disappoint you. It was actually a task, that was leaking memory and it ran until the entire machine was killed by heroku.

You should get new relic to profile you tasks. The problem might just be there.

filwaitman · 2016-08-07T14:46:19Z

@codingjoe got it. On my case I'm using a test task doing basically nothing. =(
Also, I'm already using NewRelic, but I didn't find anything useful for my debugging there. Not this time. 😆

ask · 2016-08-08T23:34:05Z

That you only generate a list of integers in this task, would suggest to me that this behavior is intrinsic to Python. Celery itself won't hold onto these numbers.

The reason I suggested trying gc.collect would be so data is collected fast enough for memory to be reused, as I've seen the following behavior before:

task1 alloc 1000 objects -> garbage: 1,000
task2 alloc 1000 objects -> garbage: 2,000
task3 alloc 1000 objects -> garbage: 3,000
..
task1000 alloc 1000 objects -> garbage: 1,000,000
<implicit gc.collect>
task10001 alloc 1000 objects -> garbage: 1000

which meant the gc collect cycles were too slow for the memory allocated by python
to be reused, and the process RSS size grew unbounded, even with the occasional collection cycle.

With explicit gc.collect between tasks you'd see the expected:

task1 alloc 1000 objects -> garbage: 1000
task1 gc.collect
...
task1000 alloc 1000 objects -> garbage: 1000

and process RSS usage is constant.

I was under the impression that doing an explicit gc.collect would only help objects
with cyclic references, and that scalar objects like numbers will be collected as soon
as they go out of scope. I may be wrong about that, but then I believe
you'd have to force these numbers to go out of scope:

@shared_task
def nothing_at_all():
    a = range(3000000)
    print a[0]
    del(a)  # < -- remove reference count.

    import gc
    gc.collect()

filwaitman · 2016-08-09T22:52:03Z

Oh, man. Of course, the del a should be there. 😄
Well, I added this but it didn't solve the issue as well. "Leak" is still there.

Anyway, I'm 99.9% sure this is not a celery issue (sounds like a infrastructure one). With this on mind and since you guys have a lot of real issues to solve: Do you want me to close it?

I mean, I opened this as a hope you've seen this before, and I don't wanna bother you (anymore) if this is not the case 😆

auvipy · 2016-08-10T20:33:40Z

closing for now

vesterbaek · 2016-11-09T12:17:08Z

@filwaitman: did you find a resolution to this? I'm seeing the same with Celery on Heroku

filwaitman · 2016-11-09T13:50:52Z

@vesterbaek nope. I'm still facing this. I'm ignoring this because the project owner doesn't let me debug it decently ("dev environment is too busy to let it stuck to debug this"). 🙃

jjzhangg · 2016-12-20T13:51:18Z

This blog might help debugging: http://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html

FabioFleitas · 2016-12-26T06:51:26Z

I'm also facing the same issue using Celery on Heroku. I end up restarting that dyno when it happens, but that's not ideal.

greenberge · 2017-01-18T04:15:40Z

Ditto. It's a nightmare. We have to restart twice a day.

surkova · 2017-02-02T15:13:29Z

Yeah, same here. Restarting every other day.

greenberge · 2017-02-02T15:20:16Z

Since my last post, we tried @ask's approach as detailed above. So far so good. Memory usage barely creeps up at all any more.

GDay · 2017-07-24T10:09:52Z

I had this exact problem. My Celery tasks run every 1 minute and every 20 minutes.

Eventually, I found that adding CELERYD_MAX_TASKS_PER_CHILD = 1 to the settings help to prevent this. It basically creates a new worker now every time a task has been completed. It might not be the cleanest option, but it works perfectly for me.

(that line was added in my settings with the second deployment.)

DomHudson · 2017-09-12T10:37:01Z

Hi @GDay and @filwaitman did you ever get to the bottom of this? I've got some very tasks that need access to a very large object (almost 5gb in memory) and am experiencing the same issue. The usage starts off fine, but over approximately 20 tasks it exhausts all my ram on a very high powered machine (32 core, 64gb ram) and is killed by the kernel. If I set the max tasks to 1 it does seem to help but unsure if it's simply mitigating the issue... I'd be very grateful to know if you did deduce more information on this.
Many thanks

filwaitman · 2017-09-12T10:44:11Z

@DomHudson nope. =(
Actually I'm still with this and created some workarounds to deal with leak.

GDay · 2017-09-12T10:46:39Z

@DomHudson The small trick I explained still works fine for me. It never exceeds the limits and does the job perfectly. Never had any issues ever since.

vesterbaek · 2017-09-12T10:49:04Z

Facing the same issue - with no solution yet. Bad things happen when I start getting R14 because the system seriously slows down - and is not restarted until R15 is raised (swap exceeded). I'm considering monitoring the logs for R14 and restart the dyno on first sight.

DomHudson · 2017-09-12T10:53:14Z

Okay thanks both very much . I'll stick with the CELERYD_MAX_TASKS_PER_CHILD for now and report back if I find any more information.
Edit:
@vesterbaek did the above setting make any difference in your environment?

vesterbaek · 2017-09-12T11:01:18Z

@DomHudson haven't tried it actually. At peak I'm running quite a few tasks per second and I'm concerned with the performance implications of having to spawn a new child thread or process (not sure which is used) for each task.

I'm running my workers with --max-memory-per-child=100000 but that has not stopped the memory usage increasing over time.

fjsj · 2017-09-12T13:47:48Z

Had similar bugs in a self-hosted deployment. Using RabbitMQ instead of Redis as a broker solved my problem.

DomHudson · 2017-09-12T15:46:37Z

@vesterbaek agreed - I was also worried about this and indeed it does seem to be having an adverse effect, slowing down my task processing noticeably.

Interesting @fjsj , did you find that the kernel was listing the RabbitMQ service as consuming the memory or the celery processes?

fjsj · 2017-09-12T16:37:23Z

@DomHudson I was using Redis as a broker, but the memory leaks were in Celery processes. When I've changed to RabbitMQ, the leak was gone. I guess there's something broken in Celery-Redis integration. I'm now using RabbitMQ as a broker and Redis as a result backend. It's fine now.

DomHudson · 2017-09-12T17:23:05Z

Sorry yes, Rabbit was a typo - okay great thanks for the insight! I will investigate if the same occurs on my end.

vesterbaek · 2017-09-13T06:53:48Z

I'm on Celery 4.1.0 and using RabbitMQ as broker with no result backend. There had the same experience with leaks on Celery 3.x

vinitkumar · 2018-04-03T06:25:52Z

Anybody found a solution to this issue? We are facing the same issues with celery 4.1.0 on Heroku.

ijames · 2018-06-20T18:02:28Z

In case anybody is coming into this at 4.0+, CELERYD_MAX_TASKS_PER_CHILD has been renamed:

https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-worker_max_tasks_per_child

So the above suggestion documented by @GDay would be:

worker_max_tasks_per_child = 1

Which I'm going to be trying on my next push.

SHxKM · 2018-08-16T16:05:45Z

Facing the same issue with Heroku. My task is actually a long running one but I don't know if that should affect the celery worker instance.

filwaitman changed the title ~~Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening here.~~ Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening Jul 25, 2016

auvipy added the Status: Feedback Needed ✘ label Aug 6, 2016

auvipy closed this as completed Aug 10, 2016

auvipy removed the Status: Feedback Needed ✘ label Aug 10, 2016

vesterbaek mentioned this issue Sep 12, 2017

Delay memory leak #3279

Closed

thedrow mentioned this issue Dec 11, 2018

memory leak with celery.chord (A tiny/simple example with docker-compose) #5230

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening #3339

Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening #3339

filwaitman commented Jul 25, 2016

codingjoe commented Aug 3, 2016

ask commented Aug 4, 2016

ask commented Aug 4, 2016

filwaitman commented Aug 5, 2016

codingjoe commented Aug 6, 2016

filwaitman commented Aug 7, 2016

ask commented Aug 8, 2016 •

edited

filwaitman commented Aug 9, 2016

auvipy commented Aug 10, 2016

vesterbaek commented Nov 9, 2016

filwaitman commented Nov 9, 2016

jjzhangg commented Dec 20, 2016

FabioFleitas commented Dec 26, 2016

greenberge commented Jan 18, 2017

surkova commented Feb 2, 2017

greenberge commented Feb 2, 2017

GDay commented Jul 24, 2017

DomHudson commented Sep 12, 2017

filwaitman commented Sep 12, 2017

GDay commented Sep 12, 2017

vesterbaek commented Sep 12, 2017

DomHudson commented Sep 12, 2017

vesterbaek commented Sep 12, 2017

fjsj commented Sep 12, 2017

DomHudson commented Sep 12, 2017

fjsj commented Sep 12, 2017

DomHudson commented Sep 12, 2017

vesterbaek commented Sep 13, 2017

vinitkumar commented Apr 3, 2018 •

edited

ijames commented Jun 20, 2018 •

edited

SHxKM commented Aug 16, 2018

Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening #3339

Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening #3339

Comments

filwaitman commented Jul 25, 2016

codingjoe commented Aug 3, 2016

ask commented Aug 4, 2016

ask commented Aug 4, 2016

filwaitman commented Aug 5, 2016

codingjoe commented Aug 6, 2016

filwaitman commented Aug 7, 2016

ask commented Aug 8, 2016 • edited

filwaitman commented Aug 9, 2016

auvipy commented Aug 10, 2016

vesterbaek commented Nov 9, 2016

filwaitman commented Nov 9, 2016

jjzhangg commented Dec 20, 2016

FabioFleitas commented Dec 26, 2016

greenberge commented Jan 18, 2017

surkova commented Feb 2, 2017

greenberge commented Feb 2, 2017

GDay commented Jul 24, 2017

DomHudson commented Sep 12, 2017

filwaitman commented Sep 12, 2017

GDay commented Sep 12, 2017

vesterbaek commented Sep 12, 2017

DomHudson commented Sep 12, 2017

vesterbaek commented Sep 12, 2017

fjsj commented Sep 12, 2017

DomHudson commented Sep 12, 2017

fjsj commented Sep 12, 2017

DomHudson commented Sep 12, 2017

vesterbaek commented Sep 13, 2017

vinitkumar commented Apr 3, 2018 • edited

ijames commented Jun 20, 2018 • edited

SHxKM commented Aug 16, 2018

ask commented Aug 8, 2016 •

edited

vinitkumar commented Apr 3, 2018 •

edited

ijames commented Jun 20, 2018 •

edited