
Memory leak on celery over Heroku (actually I don't think it's a celery issue), just can't figure out what's happening #3339

Closed
filwaitman opened this issue Jul 25, 2016 · 31 comments

Comments

@filwaitman

Hey @ask! How are you doing?

I'm seeing some weird behavior with Celery running on Heroku.

It looks like a memory leak. It certainly behaves like one, but I don't think that's actually the case - I just don't know what it is.

For some reason my scheduled tasks don't seem to release memory after they finish.
It doesn't seem to be related to the tasks' code either, since I removed all the code they run and the "leak" is still present (see details below).

I know releasing memory is Python's responsibility, and I don't expect it to be released right after a task finishes. But my Celery machines are sitting at around 170% memory usage (dipping into swap and raising a bunch of R14 errors). Check it out:

[screenshot from 2016-07-25 15-33-28: memory usage graph]

(I restarted Celery at 14:00 UTC; that's why the memory dropped.)

My pip requirements:

celery==3.1.20
django-celery==3.1.17

How I'm using celery on this machine (Procfile):

python manage.py celery worker -E -B -l INFO  # -E because I'm using celerycam

My celery configs:

BROKER_URL = env('BROKER_URL', default='redis://127.0.0.1:6379')  # defined in Heroku config vars
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = TIME_ZONE
CELERY_RESULT_BACKEND = 'djcelery.backends.database:DatabaseBackend'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'

CELERYBEAT_SCHEDULE = {}

(let me know if you need any additional info)

I used to have a bunch of entries in my Celery scheduler. For debugging purposes I removed them all (so no scheduled tasks were running), and without a single scheduled task running the "leak" vanished.
After that I created a (stupid?) scheduled task just for testing purposes:

# settings.py
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'Nothing at all (test for memory leak)': {
        'task': 'main.tasks.nothing_at_all',
        'schedule': timedelta(minutes=10),
    },
}

# tasks.py
from celery import shared_task


@shared_task
def nothing_at_all():
    a = range(1000000)
    print a[0]

Random notes:

  • I don't really think it's a Celery issue, since I've used it many times before and it worked flawlessly.
  • I can't reproduce it locally. At least, I tried and it worked as it should.
  • As a desperate measure I tried setting CELERYD_MAX_TASKS_PER_CHILD on the Heroku machines. Not even that stopped the "leak".
  • I tried disabling the celerycam (just in case). It didn't solve the issue.
  • I'm able to flush the memory by restarting the celery service (this may be obvious, I just wanted to mention it).

As I mentioned before, I'm pretty convinced this is not an issue on your side, but maybe you know what's happening here. I confess I'm a bit lost now. 😆

Let me know if you have any clue what's going on.

Thanks!

@codingjoe

We've recently been experiencing the same issue, but with an amqp backend.

@ask
Contributor

ask commented Aug 4, 2016

I'm not sure, but in my experience Python never releases memory back to the OS once it has been allocated. Apparently the rationale is that releasing memory is expensive and Python will want to use it again.

Try calling the task multiple times to see if the number keeps growing; if it does, add an import gc; gc.collect() before the task returns.
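
(Not part of the original comment - a sketch added for illustration, reusing the test task from above and the standard Celery task logger; it is Linux-specific because it reads /proc.) One way to watch that number from inside the worker is to log the process RSS at the end of every run:

# Sketch: log the worker process's current RSS after each task run, so you can
# see whether it keeps growing across invocations (Linux-only: reads /proc).
from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


def _rss_kb():
    # Current resident set size of this process, in kB.
    with open('/proc/self/status') as fh:
        for line in fh:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])


@shared_task
def nothing_at_all():
    a = range(1000000)
    logger.info(a[0])
    del a

    import gc
    gc.collect()

    logger.info('worker RSS after task: %s kB', _rss_kb())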

@ask
Contributor

ask commented Aug 4, 2016

Btw, maxtasksperchild should be releasing the memory since that will kill the child process. Are you sure it's the child process here that is consuming the memory?
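
(For reference, not part of the original comment: on Celery 3.1 that recycling can be enabled either in the Django settings or on the worker command line; the value below is only a placeholder.)

# settings.py - recycle each prefork child after it has executed 100 tasks (placeholder value)
CELERYD_MAX_TASKS_PER_CHILD = 100

# or, equivalently, on the command line:
# python manage.py celery worker -E -B -l INFO --maxtasksperchild=100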

@filwaitman
Author

@codingjoe it's somewhat good to know I'm not the only one with this issue. 😫

@ask I tried the gc.collect() and it didn't solve it. Actually I had tried this before, but forgot to mention it in the description. Sorry about that.

Also, I'm sure it's related to a child process. Here's what I did to confirm that:

  1. Disabled all tasks and left the celery machine "useless" for a while.
  2. Enabled one single task:
from celery import shared_task


@shared_task
def nothing_at_all():
    a = range(3000000)
    print a[0]

    import gc
    gc.collect()

As a result, the Celery machine was stable during (1) (its memory usage sat at 171MB for a long time), and the leak behavior reappeared when I did (2). See the attached image.

[screenshot from 2016-08-05 11-29-54: memory usage graph]

In fact, I think gc.collect() and maxtasksperchild would have much the same effect - and neither of them solved the issue... =/

@codingjoe

@filwaitman sorry to disappoint you. In our case it was actually a task that was leaking memory, and it ran until the entire machine was killed by Heroku.

You should get New Relic to profile your tasks. The problem might just be there.

@filwaitman
Author

@codingjoe got it. In my case I'm using a test task that does basically nothing. =(
Also, I'm already using New Relic, but I didn't find anything useful for debugging there. Not this time. 😆

@ask
Contributor

ask commented Aug 8, 2016

The fact that you only generate a list of integers in this task suggests to me that this behavior is intrinsic to Python. Celery itself won't hold on to these numbers.

The reason I suggested trying gc.collect would be so data is collected fast enough for memory to be reused, as I've seen the following behavior before:

task1 alloc 1000 objects -> garbage: 1,000
task2 alloc 1000 objects -> garbage: 2,000
task3 alloc 1000 objects -> garbage: 3,000
...
task1000 alloc 1000 objects -> garbage: 1,000,000
<implicit gc.collect>
task1001 alloc 1000 objects -> garbage: 1,000

which meant the gc collect cycles were too slow for the memory allocated by python
to be reused, and the process RSS size grew unbounded, even with the occasional collection cycle.

With explicit gc.collect between tasks you'd see the expected:

task1 alloc 1000 objects -> garbage: 1000
task1 gc.collect
...
task1000 alloc 1000 objects -> garbage: 1000

and process RSS usage is constant.
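
(An aside, not in the original comment: you can watch this accumulation directly with the gc module's counters; the list-of-lists allocation below is only an example.)

# Sketch: watch objects pile up in generation 0 between collection cycles.
import gc

print gc.get_threshold()  # defaults to (700, 10, 10): gen0 runs roughly every 700 allocations
print gc.get_count()      # objects tracked since the last collection, per generation

a = [[] for _ in range(100000)]  # container objects are tracked by the cyclic gc
print gc.get_count()

del a
print gc.collect()        # force a full collection; returns the number of unreachable objects found
print gc.get_count()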

I was under the impression that doing an explicit gc.collect would only help objects
with cyclic references, and that scalar objects like numbers will be collected as soon
as they go out of scope. I may be wrong about that, but then I believe
you'd have to force these numbers to go out of scope:

from celery import shared_task


@shared_task
def nothing_at_all():
    a = range(3000000)
    print a[0]
    del a  # <-- drop the reference so the list can be freed

    import gc
    gc.collect()

@filwaitman
Author

Oh, man. Of course the del a should be there. 😄
Well, I added it, but that didn't solve the issue either. The "leak" is still there.

Anyway, I'm 99.9% sure this is not a Celery issue (it sounds like an infrastructure one). With that in mind, and since you guys have a lot of real issues to solve: do you want me to close it?

I mean, I opened this in the hope that you'd seen it before, and I don't want to bother you (any more) if that's not the case. 😆

@auvipy
Member

auvipy commented Aug 10, 2016

closing for now

@vesterbaek

@filwaitman: did you find a resolution to this? I'm seeing the same with Celery on Heroku

@filwaitman
Author

@vesterbaek nope, I'm still facing this. I'm ignoring it because the project owner won't let me debug it properly ("the dev environment is too busy to leave it stuck debugging this"). 🙃

@FabioFleitas

I'm also facing the same issue using Celery on Heroku. I end up restarting that dyno when it happens, but that's not ideal.

@greenberge

Ditto. It's a nightmare. We have to restart twice a day.

@surkova

surkova commented Feb 2, 2017

Yeah, same here. Restarting every other day.

@greenberge

Since my last post, we tried @ask's approach as detailed above. So far so good. Memory usage barely creeps up at all any more.

@GDay

GDay commented Jul 24, 2017

I had this exact problem. My Celery tasks run every minute and every 20 minutes.

Eventually I found that adding CELERYD_MAX_TASKS_PER_CHILD = 1 to the settings helps prevent this. It basically spawns a new worker process every time a task has been completed. It might not be the cleanest option, but it works perfectly for me.
Here is how my worker performs right now
(that line was added to my settings in the second deployment).

@DomHudson

Hi @GDay and @filwaitman, did you ever get to the bottom of this? I've got some tasks that need access to a very large object (almost 5GB in memory) and am experiencing the same issue. The usage starts off fine, but over approximately 20 tasks it exhausts all the RAM on a very high-powered machine (32 cores, 64GB RAM) and is killed by the kernel. If I set the max tasks to 1 it does seem to help, but I'm unsure whether that simply mitigates the issue... I'd be very grateful to know if you did deduce anything more about this.
Many thanks

@filwaitman
Author

@DomHudson nope. =(
I'm actually still seeing this, and I've created some workarounds to deal with the leak.

@GDay

GDay commented Sep 12, 2017

@DomHudson The small trick I described still works fine for me. It never exceeds the limits and does the job perfectly. I've had no issues ever since.

@vesterbaek

Facing the same issue - with no solution yet. Bad things happen when I start getting R14, because the system seriously slows down - and it is not restarted until R15 is raised (swap exceeded). I'm considering monitoring the logs for R14 and restarting the dyno on first sight.
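
Something along these lines might do it (a rough, untested sketch; the app and dyno names are placeholders, and it assumes an authenticated Heroku CLI):

# Rough sketch: watch the Heroku log stream for R14 and restart the worker dyno.
# APP and DYNO are placeholders; requires the Heroku CLI to be installed and logged in.
import subprocess
import time

APP = 'my-app'
DYNO = 'worker.1'
COOLDOWN = 300  # don't restart more than once every 5 minutes

last_restart = 0
logs = subprocess.Popen(['heroku', 'logs', '--tail', '--app', APP],
                        stdout=subprocess.PIPE)
for line in iter(logs.stdout.readline, ''):
    if 'Error R14' in line and time.time() - last_restart > COOLDOWN:
        subprocess.call(['heroku', 'ps:restart', DYNO, '--app', APP])
        last_restart = time.time()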

@DomHudson

Okay, thanks both very much. I'll stick with CELERYD_MAX_TASKS_PER_CHILD for now and report back if I find any more information.
Edit:
@vesterbaek did the above setting make any difference in your environment?

@vesterbaek

@DomHudson I haven't tried it, actually. At peak I'm running quite a few tasks per second, and I'm concerned about the performance implications of having to spawn a new child thread or process (not sure which is used) for each task.

I'm running my workers with --max-memory-per-child=100000, but that has not stopped memory usage from increasing over time.
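
(Side note, added here: in Celery 4 that flag corresponds to the worker_max_memory_per_child setting, given in kilobytes, and a child is only replaced after its current task completes - so a single task that balloons past the limit will still peak first. A minimal sketch with a placeholder app name:)

# Sketch, Celery 4+: the setting behind --max-memory-per-child (value in kB).
from celery import Celery

app = Celery('proj')                           # placeholder app name
app.conf.worker_max_memory_per_child = 100000  # ~100 MB; the child is swapped out
                                               # only after its current task finishes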

@fjsj

fjsj commented Sep 12, 2017

Had similar bugs in a self-hosted deployment. Using RabbitMQ instead of Redis as a broker solved my problem.

@DomHudson

@vesterbaek agreed - I was also worried about this and indeed it does seem to be having an adverse effect, slowing down my task processing noticeably.

Interesting @fjsj, did you find that the kernel was listing the RabbitMQ service as consuming the memory, or the celery processes?

@fjsj

fjsj commented Sep 12, 2017

@DomHudson I was using Redis as a broker, but the memory leaks were in the Celery processes. When I changed to RabbitMQ, the leak was gone. I guess there's something broken in the Celery-Redis integration. I'm now using RabbitMQ as the broker and Redis as the result backend. It's fine now.
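
(Added for reference, not the exact config from this comment: for anyone who wants to try the same split, the settings look roughly like this, using the 3.x-style names from earlier in the thread and placeholder URLs.)

# Placeholder URLs: RabbitMQ as the broker, Redis kept only for the result backend.
BROKER_URL = 'amqp://user:password@localhost:5672//'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'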

@DomHudson

Sorry, yes - Rabbit was a typo. Okay, great, thanks for the insight! I will investigate whether the same occurs on my end.

@vesterbaek

I'm on Celery 4.1.0 and using RabbitMQ as the broker with no result backend. I had the same experience with leaks on Celery 3.x.

@vinitkumar

vinitkumar commented Apr 3, 2018

Has anybody found a solution to this issue? We are facing the same issue with Celery 4.1.0 on Heroku.

@ijames

ijames commented Jun 20, 2018

In case anybody is coming into this at 4.0+, CELERYD_MAX_TASKS_PER_CHILD has been renamed:

https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-worker_max_tasks_per_child

So the above suggestion documented by @GDay would be:

worker_max_tasks_per_child = 1

Which I'm going to be trying on my next push.
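
(Added note: where that line goes depends on how the app is configured; the project name below is a placeholder.)

# Sketch, Celery 4+: set the renamed option directly on the app...
from celery import Celery

app = Celery('proj')                     # placeholder app name
app.conf.worker_max_tasks_per_child = 1  # recycle each prefork child after every task

# ...or, if the app loads its config from Django settings via
# app.config_from_object('django.conf:settings', namespace='CELERY'),
# put CELERY_WORKER_MAX_TASKS_PER_CHILD = 1 in settings.py instead.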

@SHxKM

SHxKM commented Aug 16, 2018

Facing the same issue with Heroku. My task is actually a long-running one, but I don't know whether that should affect the Celery worker instance.
