New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leak of AsyncResult #4839

Open
wants to merge 1 commit into
base: master
from

Conversation

@monsterxx03

monsterxx03 commented Jun 22, 2018

#4131 fixed a memory leak issue by remove weakref to a bound method. But when I upgraded my prod celery to 4.2.0, I come to this memory leak issue again. The test script in the issue also report memory leak in 4.2.0 : #4131 (comment)

The root case is AsyncResult defined __del__ method, but the self.on_ready created a ref cycle as well. __del__ will prevent gc to collect ref cycle objects.

A simple script to test:

import resource
import gc

from celery import Celery

app = Celery()
app.conf.update(BROKER_URL='redis://localhost:6379/1')

@app.task
def dummy():
    return '1'


def print_mem():
    print 'Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def run():
    for i in range(10000):
        dummy.delay()
        if i % 1000 == 0:
            print_mem()

run()
print gc.garbag

Before remove __del__:

Memory usage: 31736 (kb)
Memory usage: 32916 (kb)
Memory usage: 34628 (kb)
Memory usage: 36432 (kb)
Memory usage: 38292 (kb)
Memory usage: 39812 (kb)
Memory usage: 41668 (kb)
Memory usage: 42608 (kb)
Memory usage: 43576 (kb)
Memory usage: 44592 (kb)

[<AsyncResult: 9ad49b99-73e3-41de-a972-3ab196874f91>, <AsyncResult: 7638fc5a-6c14-4fe5-bbe7-1d19ab992582>, <AsyncResult: 95a158b6-9c1b-408e-9364-679b01e3d36f>, <AsyncResult: 6bd7cac9-4a65-47ae-97ee-8a0bb0b24cee>.......]

After remove __del__:

Memory usage: 31828 (kb)
Memory usage: 31960 (kb)
Memory usage: 31960 (kb)
Memory usage: 31960 (kb)
Memory usage: 31960 (kb)
Memory usage: 31960 (kb)
Memory usage: 31960 (kb)
Memory usage: 31960 (kb)
Memory usage: 32188 (kb)
Memory usage: 32188 (kb)

[]
Remove __del__ from AsyncResult to make gc work.
Objects have `__del__()` defined, will be gc uncollectable if they
are part of a reference cycle.

@monsterxx03 monsterxx03 referenced this pull request Jun 22, 2018

Open

Memory leak in celery.app.events.State? #4832

1 of 2 tasks complete
@monsterxx03

This comment has been minimized.

@auvipy

any test to avoid regression?

@georgepsarakis

This comment has been minimized.

Member

georgepsarakis commented Jun 22, 2018

@monsterxx03 did you by any chance find how the reference cycle is created in the first place?

@monsterxx03

This comment has been minimized.

monsterxx03 commented Jun 23, 2018

The problem is here: https://github.com/celery/celery/blob/4.2/celery/result.py#L102

AsyncResult -> self.on_ready -> promise -> self._on_fulfilled

since _on_fulfilfilled is an instance method of AsyncResult, it refs back to AsyncResult, the cycle complete.

At first, I try to set weak=True for promise(self._on_fulfilled), actually, it works, the memory leak issue disappear, but later, I found this pr #4131 , weak ref not work on bound method, I think it makes sense, what really need is WeakMethod, but it's only available after python3.4. And I agree the point in #4131 (comment) ref cycle is not a problem, python gc can handle it. BTW, the problem in 4131 is the func passed to promise never called if weak=True

Then I'm confusing why memory leak still happen(I thought gc can handle ref cycle), then I checked history of celery/result.py, found the __del__ method is introduced in 2636251 , okay, ref cycle is not a problem, unless we didn't define a __del__ for it, it will cause object uncollectable https://docs.python.org/2/library/gc.html#gc.garbage

To clean up:

AsyncResult has a ref cycle, and define __del__ which caused the memory leak, and it's 100% reproducible under python2.7.

In my test, add weak=True for promise or remove the __del__ can both fix the memory leak of AsyncResult, but weak=True will make _on_fulfilled never called, so I remove __del__.

I think the best way is resolve the ref cycle and keep __del__, but I didn't see an obvious way so far, my quick fix just solve the memory issue(__del__ won't be called in current code as well, since it won't be gc) @georgepsarakis

@georgepsarakis

This comment has been minimized.

Member

georgepsarakis commented Jun 23, 2018

@monsterxx03 thanks for the detailed explanation.

I think that the best way to handle this is to actually backport WeakMethod for Python versions that do not support it. This should probably be introduced in vine. promise objects should detect whether the passed function is bound, and use WeakMethod instead of ref.

How does that sound?

@georgepsarakis georgepsarakis self-assigned this Jun 23, 2018

@monsterxx03

This comment has been minimized.

monsterxx03 commented Jun 23, 2018

@georgepsarakis that would be great, the promise's weak parameter is easy to be mis-used if developer don't be careful.

BTW I tested this patch in my prod env for one day, the memory leak issue is fixed, at least for me

screen shot 2018-06-23 at 2 39 57 pm

It's a service always adding tasks to celery, calling apply_async. The celery worker processes don't have leak issue.

Waiting for a better release with vine upgrade, or merged as a tmp fix is both okay for me, thank you 👍

@auvipy auvipy added this to the v4.3 milestone Jul 5, 2018

sloria added a commit to CenterForOpenScience/osf.io that referenced this pull request Jul 19, 2018

Use fork of celery 4.2.1 with celery/celery#4839 merged
This is an attempt to fix the OOM issues with the task pods
on prod

Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 23, 2018

Merge branch 'master' of https://github.com/CenterForOpenScience/osf.io
… into remove-support-page

* 'master' of https://github.com/CenterForOpenScience/osf.io: (37 commits)
  Don't transact to avoid deadlocks
  Add script to fix poor mapping logic -h/t @caseyrollins
  Use timezone.now() when checking for failed registrations
  Remove duplicate psyarxiv custom subject.
  Use fork of celery 4.2.1 with celery/celery#4839 merged
  Upgrade celery and kombu
  Pin to an exact version of amqp
  Fix script
  Remove script
  Add one-time mapping script
  Update CHANGELOG, package.json
  Add a management command to check for unapplied migrations.
  Update collectionSubmission ES mapping
  Add merge migration
  Fix migration
  Prevent superfluous queries
  fix sizing for logo
  Add provider-based collection search filtering
  Fix asset url retrieval
  Add "Other" sorting tests
  ...

Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 24, 2018

Merge branch 'develop' of https://github.com/CenterForOpenScience/osf.io
 into get-delete-external-ids

* 'develop' of https://github.com/CenterForOpenScience/osf.io: (98 commits)
  Bump version and update changelog
  fix tests for non-dict based client attributes
  Update gitlab serializers for fangorn config js
  Add enforce_csrf waffle switch so that this can be merged earlier
  Update markdown-it-video with infinite loop fix
  Fix indentation
  Improve tests by using user cookie method and faster user creation
  Tests to ensure API requests with Session auth fail without CSRF protection
  Travis and Test Speed ups
  Rename --quick to --all
  Don't transact to avoid deadlocks
  Add script to fix poor mapping logic -h/t @caseyrollins
  Add same enforce_csrf method as drf's SessionAuthentication
  replace bad reverse migration with warning
  Use timezone.now() when checking for failed registrations
  Remove duplicate psyarxiv custom subject.
  Use fork of celery 4.2.1 with celery/celery#4839 merged
  Upgrade celery and kombu
  Set TIME_ZONE in settings
  add scheme to match UI
  ...

Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 24, 2018

Merge branch 'develop' of https://github.com/CenterForOpenScience/osf.io
 into jobs-and-schools-v2

* 'develop' of https://github.com/CenterForOpenScience/osf.io: (87 commits)
  Bump version and update changelog
  fix tests for non-dict based client attributes
  Update gitlab serializers for fangorn config js
  Add enforce_csrf waffle switch so that this can be merged earlier
  Update markdown-it-video with infinite loop fix
  Fix indentation
  Improve tests by using user cookie method and faster user creation
  Tests to ensure API requests with Session auth fail without CSRF protection
  Travis and Test Speed ups
  Rename --quick to --all
  Don't transact to avoid deadlocks
  Add script to fix poor mapping logic -h/t @caseyrollins
  Add same enforce_csrf method as drf's SessionAuthentication
  replace bad reverse migration with warning
  Use timezone.now() when checking for failed registrations
  Remove duplicate psyarxiv custom subject.
  Use fork of celery 4.2.1 with celery/celery#4839 merged
  Upgrade celery and kombu
  Set TIME_ZONE in settings
  add scheme to match UI
  ...

Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 24, 2018

Merge branch 'feature/storage-i18n' of https://github.com/CenterForOp…
…enScience/osf.io into storage-i18n-tests

* 'feature/storage-i18n' of https://github.com/CenterForOpenScience/osf.io: (83 commits)
  fix mocking for gitlab
  Bump version and update changelog
  fix tests for non-dict based client attributes
  Update gitlab serializers for fangorn config js
  Add enforce_csrf waffle switch so that this can be merged earlier
  Update markdown-it-video with infinite loop fix
  Fix indentation
  Improve tests by using user cookie method and faster user creation
  Tests to ensure API requests with Session auth fail without CSRF protection
  Travis and Test Speed ups
  Rename --quick to --all
  Don't transact to avoid deadlocks
  Add script to fix poor mapping logic -h/t @caseyrollins
  Add same enforce_csrf method as drf's SessionAuthentication
  replace bad reverse migration with warning
  Use timezone.now() when checking for failed registrations
  Remove duplicate psyarxiv custom subject.
  Use fork of celery 4.2.1 with celery/celery#4839 merged
  Upgrade celery and kombu
  Set TIME_ZONE in settings
  ...

Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 25, 2018

Merge branch 'master' of https://github.com/CenterForOpenScience/osf.io
… into datacite-schema

* 'master' of https://github.com/CenterForOpenScience/osf.io: (72 commits)
  fix mocking for gitlab
  Bump version and update changelog
  fix tests for non-dict based client attributes
  Update gitlab serializers for fangorn config js
  Add enforce_csrf waffle switch so that this can be merged earlier
  Update markdown-it-video with infinite loop fix
  Fix indentation
  Improve tests by using user cookie method and faster user creation
  Tests to ensure API requests with Session auth fail without CSRF protection
  Travis and Test Speed ups
  Rename --quick to --all
  Don't transact to avoid deadlocks
  Add script to fix poor mapping logic -h/t @caseyrollins
  Add same enforce_csrf method as drf's SessionAuthentication
  replace bad reverse migration with warning
  Use timezone.now() when checking for failed registrations
  Remove duplicate psyarxiv custom subject.
  Use fork of celery 4.2.1 with celery/celery#4839 merged
  Upgrade celery and kombu
  Set TIME_ZONE in settings
  ...

Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 25, 2018

Merge branch 'develop' of https://github.com/CenterForOpenScience/osf.io
 into v2-email-prefs

* 'develop' of https://github.com/CenterForOpenScience/osf.io: (75 commits)
  Add write scopes instead of read scopes to write-only user endpoints.
  fix mocking for gitlab
  Bump version and update changelog
  fix tests for non-dict based client attributes
  Update gitlab serializers for fangorn config js
  Add enforce_csrf waffle switch so that this can be merged earlier
  Update markdown-it-video with infinite loop fix
  Fix indentation
  Improve tests by using user cookie method and faster user creation
  Tests to ensure API requests with Session auth fail without CSRF protection
  Travis and Test Speed ups
  Rename --quick to --all
  Don't transact to avoid deadlocks
  Add script to fix poor mapping logic -h/t @caseyrollins
  Add same enforce_csrf method as drf's SessionAuthentication
  replace bad reverse migration with warning
  Use timezone.now() when checking for failed registrations
  Remove duplicate psyarxiv custom subject.
  Use fork of celery 4.2.1 with celery/celery#4839 merged
  Upgrade celery and kombu
  ...

Signed-off-by: John Tordoff <john@cos.io>

# Conflicts:
#	api/users/serializers.py
#	api/users/views.py
@fernandotakai

This comment has been minimized.

fernandotakai commented Aug 1, 2018

I was having the same issue after upgrading to celery 4.2 (kombu 4.2.1) -- my task producers where exponentially leaking memory while the consumers were quite ok.

After applying this patch, memory usage went back to stable.

image

(those drops on the graph are restarts, while the constant line at the end is after the patch was applied)

@JioHooney

This comment has been minimized.

JioHooney commented Oct 4, 2018

Has this problem been resolved?

@thedrow

This comment has been minimized.

Member

thedrow commented Oct 8, 2018

I don't think so.
See #4839 (comment) and celery/vine#21
None of us have found the time to properly resolve this issue.
@JioHooney PRs are always welcome.

@auvipy auvipy self-assigned this Oct 8, 2018

@codingEnzo

This comment has been minimized.

codingEnzo commented Oct 9, 2018

@monsterxx03 i think you can try the test unit on py3
i found it's ok on my env (fedora27, python3.6.6, celery4.2.1), with your code

@monsterxx03

This comment has been minimized.

monsterxx03 commented Oct 10, 2018

@codingEnzo this bug only effect python < 3.4.

When I posting this pr, my prod env is running on python2.7, will 100% come to this memory leak issue. And this PR is a workable workaround.

Recently, I migrate my prod env to python 3.6, celery won't come to memory leak any more, I'm using celery 4.2.1, without this patch.

I'm not sure merge this patch is good for python3.6, seems the behave is hard to test in unit test.

@auvipy auvipy removed this from the v4.3 milestone Oct 10, 2018

thedrow referenced this pull request Oct 16, 2018

Do not subscribe to Redis channels when results are ignored (#4709)
* Add ignored property to AsyncResult

* Set AsyncResult as ignored on task creation & bypass backend callback

* Cancel subscription when calling AsyncResult.forget

* Cancel pending result operations on AsyncResult destruction

* Integration tests for Redis channel unsubscriptions (#4707)

* Handle KeyError when resetting Redis client objects

https://travis-ci.org/celery/celery/jobs/375637202
@austintrose

This comment has been minimized.

austintrose commented Oct 25, 2018

Confirming what others have already said: We're running Python 2.7, and recently developed a memory leak after upgrading from Celery 4.1.1 to 4.2.1. Applying the patch suggested by @monsterxx03 fixed the leak:

screen shot 2018-10-25 at 10 42 03 am

@jsh2134

This comment has been minimized.

jsh2134 commented Nov 9, 2018

Can confirm this same memory leak with Celery 4.2.1 and python2.7.

Until it is merged upstream, is there a way to create an AsyncResult subclass that will fix this behavior?

@clokep

This comment has been minimized.

Contributor

clokep commented Nov 9, 2018

@jsh2134 We (@austintrose and I) fixed this with something like this:

from celery.result import AsyncResult

# AsyncResult objects have a memory leak in them in Celery 4.2.1.
# See https://github.com/celery/celery/pull/4839/
delattr(AsyncResult, '__del__')

We put this in the same file where we create our Celery application. (Hacky yes, but it seems to work.)

@georgepsarakis

This comment has been minimized.

Member

georgepsarakis commented Nov 24, 2018

We also need to revert the changes in #4131 in order to take advantage of the fix in celery/vine#24 , as mentioned above #4839 (comment) .

@georgepsarakis

This comment has been minimized.

Member

georgepsarakis commented Nov 30, 2018

@thedrow @auvipy @xirdneh any ideas on how to move forward? Please see also my comment above.

@georgepsarakis

This comment has been minimized.

Member

georgepsarakis commented Dec 17, 2018

Can someone please run some tests with:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment