Be selective about how file descriptors are removed since they may be reused for a different purpose. #353

Closed · wants to merge 1 commit into base: master

Conversation

rogerhu (Contributor) commented May 8, 2014

Kombu was removing file descriptors after they had already been reused by another worker process.

I could repro by using the test program mentioned in celery/celery#1785 (comment) with 8 concurrent workers. I put logging in front of every single hub_remove() call I could find, until I came upon this place, which showed that file descriptor 12 was being removed after it had already been reclaimed by another worker that was using it for read purposes, triggering the assertion error. The KeyError in #353 causes the file descriptor to be removed completely, which causes the assertion failure.

[2014-05-08 00:21:02,644: ERROR/MainProcess] on_process_up proc.outqR_fd=12, proc.inqW_fd=11
[2014-05-08 00:21:02,652: ERROR/MainProcess] hub_remove=104
[2014-05-08 00:21:02,652: ERROR/MainProcess] kombu KeyError hub_remove
[2014-05-08 00:21:02,653: ERROR/MainProcess] hub_remove=12

The AssertionError on proc.outqR_fd=12 inside on_process_alive() was triggered by the hub_remove=12 call made inside this section.
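
To make the shape of the fix concrete, here is a hedged sketch of the general pattern (illustrative only, not the actual diff in this PR): only discard a registration this component actually owns, since the kernel may already have handed the same fd number to another worker's pipe.

# Hypothetical sketch, not the actual kombu change.  owned_readers is an
# illustrative bookkeeping set; the real code relies on the hub's own
# reader/writer registries and on remove_reader()/remove_writer().
def discard_our_reader(hub, fd, owned_readers):
    """Drop fd's read registration only if we registered it ourselves."""
    if fd in owned_readers and fd in hub.readers:
        # Only the read side is discarded; remove_reader() re-adds any
        # writer registered on the same fd (see the discussion below).
        hub.remove_reader(fd)
    # If fd is not in owned_readers, the number was reclaimed by another
    # worker process for a different purpose -- leave it alone.
    owned_readers.discard(fd)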

@ionelmc @ask

rogerhu (Contributor) commented May 8, 2014

Related to the "assert proc.outqR_fd in hub.readers" AssertionError errors we were seeing.
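
For context, a paraphrased sketch of where that check lives (names other than outqR_fd and hub.readers are assumptions, not the verbatim celery code):

# Paraphrased from the behavior described in this thread -- not verbatim celery code.
def on_process_alive(pid):
    proc = find_process_by_pid(pid)   # hypothetical lookup helper
    # This is the check that fails once the fd has already been removed
    # from the hub by the over-eager removal described in this PR:
    assert proc.outqR_fd in hub.readers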

@rogerhu rogerhu changed the title from Be selective about how file descriptors are removed since they may be reused for a different repose. to Be selective about how file descriptors are removed since they may be reused for a different purpose. May 8, 2014

Be selective about how file descriptors are removed since they may be reused for a different purpose.

Kombu was removing them after they were being reused by another worker process.

coveralls commented May 8, 2014

Coverage Status

Coverage increased (+13.83%) when pulling e0c15ae on rogerhu:fix_kombu_async into d1624d9 on celery:master.

rogerhu (Contributor) commented May 8, 2014

Disregard previous concerns -- I noticed that remove_reader() and remove_writer() first discard the fd and then add back the remaining writer or reader, so they appear to be OK to use.

This issue is related to: celery/celery#1785
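
A sketch of the remove_reader() behavior described above, paraphrased under the assumption that the hub keeps separate readers/writers mappings of fd -> (callback, args); this is not the verbatim kombu implementation:

# Paraphrased sketch -- not the verbatim kombu code.
def remove_reader(self, fd):
    """Discard fd entirely, then restore any write-side registration."""
    writer = self.writers.get(fd)   # assumed mapping: fd -> (callback, args)
    self.remove(fd)                 # drops both read and write interest
    if writer is not None:
        callback, args = writer
        self.add_writer(fd, callback, *args)   # add the write side back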

rogerhu added a commit to rogerhu/celery that referenced this pull request May 8, 2014

Be more selective about how file descriptors get removed from Kombu's hub.

Given that file descriptor changes appear to get triggered on the Kombu side, these changes
may not make a material impact.  However, to make things more consistent with the
changes introduced in celery/kombu#353, the changes have been
updated here.

This change would also help allow refactoring for remove_reader()/remove_writer() to
be smarter about how file descriptors get managed in the future (i.e. using a counter
instead of removes() to avoid possible race conditions with file descriptors being reused)
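
The counter idea mentioned in parentheses could look roughly like this (a hedged sketch of the suggested future refactoring, not code from this PR): keep a per-fd reference count and only deregister from the hub when the last interested party releases the fd.

from collections import Counter

class RefCountedReaders:
    """Illustrative only: deregister an fd when the last user releases it,
    instead of on the first remove(), to avoid racing with fd reuse."""

    def __init__(self, hub):
        self.hub = hub
        self.refs = Counter()

    def acquire(self, fd, callback, *args):
        if self.refs[fd] == 0:
            self.hub.add_reader(fd, callback, *args)  # register on first use
        self.refs[fd] += 1

    def release(self, fd):
        if self.refs[fd] > 0:
            self.refs[fd] -= 1
            if self.refs[fd] == 0:
                self.hub.remove_reader(fd)  # deregister only on last release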

rogerhu (Contributor) commented May 12, 2014

Can someone take a look at this PR? Celery v3.1 in its current state has workers crashing quite often. I believe this fix helps the situation.

felixsan commented May 13, 2014

+1

ask (Member) commented May 16, 2014

I'm testing this with the stresstests, but I really don't feel all that comfortable without having a stresstest case that reproduces the problem in the first place. Do you think it can be reproduced remotely by calling tasks?

rogerhu (Contributor) commented May 16, 2014

As I mentioned in this PR:

  1. Take the code from celery/celery#1785 (comment) (call it celery_test.py).

  2. Run celery_test.py loop in one process.

  3. Run celery_test.py worker --concurrency=8 in another terminal.

It fails with the assertion failure in under a minute.

Using Ubuntu 12.04 and an Amazon EC2 instance.
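
The actual test program lives in celery/celery#1785 (comment); purely for orientation, a minimal hypothetical stand-in with a loop mode and a worker mode might look like the following (the broker URL and the task are assumptions, not the real reproduction code):

# celery_test.py -- hypothetical minimal stand-in; the real reproduction
# code is in celery/celery#1785 (comment).
import sys
from celery import Celery

app = Celery('celery_test', broker='amqp://guest@localhost//')

@app.task
def noop(i):
    return i

if __name__ == '__main__':
    if sys.argv[1:2] == ['loop']:
        # `python celery_test.py loop`: flood the queue to provoke fd churn.
        i = 0
        while True:
            noop.delay(i)
            i += 1
    else:
        # e.g. `python celery_test.py worker --concurrency=8`
        app.start(argv=['celery'] + sys.argv[1:])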

adepue commented May 16, 2014

+1 for the crash issue as well

ask added a commit to celery/celery that referenced this pull request May 20, 2014

Be more selective about how file descriptors get removed from Kombu's hub.

Given that file descriptor changes appear to get triggered on the Kombu side, these changes
may not make a material impact.  However, to make things more consistent with the
changes introduced in celery/kombu#353, the changes have been
updated here.

This change would also help allow refactoring for remove_reader()/remove_writer() to
be smarter about how file descriptors get managed in the future (i.e. using a counter
instead of removes() to avoid possible race conditions with file descriptors being reused)

ask (Member) commented May 20, 2014

merged into master and 3.0

ask closed this May 20, 2014

rogerhu (Contributor) commented May 21, 2014

@ask I was able to repro with your stress test suite but only by doing a few things:

  1. Disabling events reporting (commenting out the call to self.app.control.enable_events()) -- events reporting seems to introduce enough delay to prevent the issue from occurring.
  2. Running via app.start() instead -- see https://github.com/rogerhu/celery/blob/678bce5f1f6edebb84e7b3f9efb851d0d49671a8/funtests/stress/go.py. This appears to skip all the bootsteps startup work -- any reason why this would cause issues?

The commits that I used are here: https://github.com/rogerhu/celery/compare/celery:3.1...suite_fun2?expand=1

The error does not always happen so I am still trying to understand why. The code from celery/celery#1785 (comment) seems to be a more consistent way to repro these problems than using the test suite. You can see in the commits that I basically tried to put the code back into the test suite to try to find the differences.

rogerhu (Contributor) commented May 21, 2014

Also added rogerhu/celery@4e9baff just now, and it seems that piling a bunch of tasks into the queue and removing all the delays in the test suite loop triggers the assertion errors much more consistently.

rogerhu (Contributor) commented May 22, 2014

@ask I created celery/celery#2056 that helps to trigger the behavior.

rogerhu (Contributor) commented May 30, 2014

Wondering when you might consider releasing an update? I am trying to see whether this fix will resolve issues with Celery workers that stop taking on jobs from the MainProcess when MAX_TASKS_PER_CHILD is set.
