Pool gets stuck if hard timeout hits on one or more workers #104

Closed
noxdafox opened this issue Jan 15, 2014 · 5 comments

Comments

@noxdafox

Greetings,

I developed a service which runs heavy tasks implemented in C across a Pool of processes. As the tasks' complexity might explode in some cases, the hard timeout feature is really useful for my use case, since any signal other than SIGKILL would be ignored.

While running, I noticed that sometimes the service gets silently stuck: CPU usage drops to 0 and all the workers seem to be idling.

Using gdb I managed to observe that the pool logic seems unable to read or write anything from the Pipe (the SimpleQueues). This is odd: if the Pipe were stuffed, the writes would block, but the reads would definitely succeed.

My guess is the following:
When a SIGTERM or (especially) a SIGKILL is delivered while the process is accessing the Pipe, the shared Lock used to protect the pipe might remain in a "locked" state.

This would explain why both put() and get() on the SimpleQueues block, and why I can see this in gdb:

    def __enter__(self):
        return self._semlock.__enter__()
It is clear that something is unable to acquire a lock.
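
For what it's worth, the core of this hypothesis can be reproduced outside the pool with a short, self-contained sketch (a hypothetical illustration, not billiard or pebble code): a multiprocessing.Lock acquired by a child that is then SIGKILLed is never released, so anyone else trying to take it blocks forever, just like the SimpleQueue's internal lock here.

```python
import os
import signal
import time
import multiprocessing as mp

def hold_lock(lock):
    # The child grabs the shared POSIX semaphore and never releases it,
    # simulating a worker killed halfway through a SimpleQueue put()/get().
    lock.acquire()
    time.sleep(60)

if __name__ == '__main__':
    lock = mp.Lock()
    child = mp.Process(target=hold_lock, args=(lock,))
    child.start()
    time.sleep(1)                          # give the child time to acquire the lock
    os.kill(child.pid, signal.SIGKILL)     # hard kill while the lock is held
    child.join()
    # The semaphore stays acquired forever; this times out and prints False.
    print("lock re-acquired:", lock.acquire(timeout=5))
```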

At this point the service is stuck in a deadlock from which I cannot easily recover, as:

  • A terminate will not work: the sentinels won't be delivered to the workers, and there seems to be no timeout on that mechanism.
  • Trying to call release() on the SimpleQueue's lock blocks as well, probably on some other lock.
  • Calling restart() on the pool likewise seems useless.

This bug is quite critical as, at the moment, the only way out seems to be for the service to kill itself through an unhandled exception.

@noxdafox
Author

I forgot to mention:

The issue shows up only when a worker timeout occurs, and it seems to appear randomly.
That's why I suspect the problem is linked to the delivery of SIGKILL.

@ask
Contributor

ask commented Jan 15, 2014

This is a known issue and is exactly why the Celery worker has moved to using async I/O and one socket/pipe per child process...

A posix semaphore will not be released if the process is killed, and I don't know of any easy workaround to fix this.
The new pool in Celery is great: it performs better and is more flexible, but the implementation is very complex and you would need some boilerplate to use it. I would like to merge it back into billiard, but currently it requires the kombu.async event loop. The goal is for everything in Celery (including kombu.async) to be compatible with the new tulip/asyncio API in Python 3.4, but for now this is all a mess.

https://github.com/celery/celery/blob/master/celery/concurrency/asynpool.py

The Pool bootstep is responsible for creating, starting and stopping the pool:
https://github.com/celery/celery/blob/master/celery/worker/components.py#L112-L185

You need a kombu.async.Hub object and call AsynPool.register_with_eventloop(hub).
Then you need to start the event loop: `loop = hub.create_loop(); for _ in loop: pass`
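
Put together, the wiring looks roughly like the sketch below. It follows the steps above literally; the AsynPool constructor arguments (just a worker count here) and the exact method spelling are assumptions, so check asynpool.py for the authoritative API of your Celery version.

```python
# Rough sketch of the steps described above -- names and arguments not
# verified against a specific Celery release, treat them as assumptions.
from kombu.async import Hub                       # renamed kombu.asynchronous in later releases
from celery.concurrency.asynpool import AsynPool

hub = Hub()
pool = AsynPool(4)                   # e.g. 4 worker processes (assumed signature)
pool.register_with_eventloop(hub)    # hook the pool's fds into the hub (name as given above)

loop = hub.create_loop()
for _ in loop:                       # drive the event loop
    pass
```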

@noxdafox
Author

The problem is that Celery uses amqp and lots of other dependencies which I don't want to pull into the project.

Plus, I have to stick to Debian stable, so backporting all those libraries would be quite painful. I think the only feasible solution is to let the service die every time the issue is spotted.

I wish there were an easy way to set up a pool of workers without all of this workaround pain (the Python standard pool is too simple to be of any use).

@ask
Contributor

ask commented Jan 15, 2014

Right, but the problem it attempts to solve is not simple. The stdlib pool may work in some cases but falls short if you need a reliable service to execute arbitrary code.

@ask ask closed this as completed Jan 15, 2014
@noxdafox
Author

A solution I implemented works by acquiring the multiprocessing Lock whenever a SIGTERM/SIGKILL must be delivered to a process.

https://github.com/noxdafox/pebble/blob/master/pebble/process/pool.py#L151

The library is currently used within a production system and hasn't shown any issue since we migrated to this solution (with billiard it was happening quite frequently).
The assumption is that the main process acquires the resources (Lock and Pipe) before killing its children, therefore preventing them from being left in an inconsistent state.
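
In other words, the idea is roughly the following (a minimal sketch of the approach, not the actual pebble code; the lock handle passed in is assumed to be the one guarding the workers' shared channel):

```python
import os
import signal

def stop_worker(worker, channel_lock):
    # Sketch: the parent takes the lock protecting the shared pipe first,
    # so the child cannot be holding it at the instant SIGKILL lands.
    with channel_lock:
        os.kill(worker.pid, signal.SIGKILL)
        worker.join()
    # The lock is released here in a consistent state, so the remaining
    # workers' put()/get() calls keep working.
```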

I could provide a patch if needed, but I guess your attention is now focused elsewhere :)
