Pool gets stuck if hard timeout hits on one or more workers #104

Closed
noxdafox opened this issue Jan 15, 2014 · 5 comments

Comments

@noxdafox

Greetings,

I developed a service which runs heavy tasks implemented in C across a Pool of processes. As the tasks' complexity might explode in some cases, the hard timeout feature is really useful for my use case, since any signal other than SIGKILL would be ignored.

While running, I noticed that sometimes the service gets silently stuck: CPU usage drops to 0 and all the workers seem to be idling.

Using gdb I managed to observe that the pool logic seems unable to read or write anything from the Pipe (the SimpleQueues). This is odd: if the Pipe were stuffed, the writes would block, but the reads would definitely succeed.

My guess is the following:
When a SIGTERM or (especially) a SIGKILL is delivered while the process is accessing the Pipe, the shared Lock used to protect the pipe might remain in a "locked" state.

This would explain why both put() and get() on the SimpleQueues block, and why I can see this in gdb:

    def __enter__(self):
        return self._semlock.__enter__()
It is clear that something is unable to acquire a lock.
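
For what it's worth, the core of this hypothesis can be reproduced outside the pool with a short, self-contained sketch (a hypothetical illustration, not billiard or pebble code): a multiprocessing.Lock acquired by a child that is then SIGKILLed is never released, so anyone else trying to take it blocks forever, just like the SimpleQueue's internal lock here.

```python
import os
import signal
import time
import multiprocessing as mp

def hold_lock(lock):
    # The child grabs the shared POSIX semaphore and never releases it,
    # simulating a worker killed halfway through a SimpleQueue put()/get().
    lock.acquire()
    time.sleep(60)

if __name__ == '__main__':
    lock = mp.Lock()
    child = mp.Process(target=hold_lock, args=(lock,))
    child.start()
    time.sleep(1)                          # give the child time to acquire the lock
    os.kill(child.pid, signal.SIGKILL)     # hard kill while the lock is held
    child.join()
    # The semaphore stays acquired forever; this times out and prints False.
    print("lock re-acquired:", lock.acquire(timeout=5))
```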

At this point the service is stuck in a deadlock from which I cannot easily recover, as:

  • A terminate will not work: the sentinels won't be delivered to the workers, and there seems to be no timeout on that mechanism.
  • Trying to call release() on the SimpleQueue's lock blocks as well, probably on some other lock.
  • Calling restart() on the pool likewise seems useless.

This bug is quite critical as, at the moment, the only way out seems to be for the service to kill itself through an unhandled exception.

@noxdafox
Author

I forgot to mention:

The issue shows up only when a worker timeout occurs, and it seems to appear randomly.
That's why I suspect the problem is linked to the delivery of SIGKILL.

@ask
Contributor

ask commented Jan 15, 2014

This is a known issue and is exactly why the Celery worker has moved to using async I/O and one socket/pipe per child process...

A posix semaphore will not be released if the process is killed, and I don't know of any easy workaround to fix this.
The new pool in Celery is great: it performs better and is more flexible, but the implementation is very complex and you would need some boilerplate to use it. I would like to merge it back into billiard, but currently it requires the kombu.async event loop. The goal is for everything in Celery (including kombu.async) to be compatible with the new tulip/asyncio API in Python 3.4, but for now this is all a mess.

https://github.com/celery/celery/blob/master/celery/concurrency/asynpool.py

The Pool bootstep is responsible for creating, starting and stopping the pool:
https://github.com/celery/celery/blob/master/celery/worker/components.py#L112-L185

You need a kombu.async.Hub object and call AsynPool.register_with_eventloop(hub).
Then you need to start the event loop: `loop = hub.create_loop(); for _ in loop: pass`
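
Put together, the wiring looks roughly like the sketch below. It follows the steps above literally; the AsynPool constructor arguments (just a worker count here) and the exact method spelling are assumptions, so check asynpool.py for the authoritative API of your Celery version.

```python
# Rough sketch of the steps described above -- names and arguments not
# verified against a specific Celery release, treat them as assumptions.
from kombu.async import Hub                       # renamed kombu.asynchronous in later releases
from celery.concurrency.asynpool import AsynPool

hub = Hub()
pool = AsynPool(4)                   # e.g. 4 worker processes (assumed signature)
pool.register_with_eventloop(hub)    # hook the pool's fds into the hub (name as given above)

loop = hub.create_loop()
for _ in loop:                       # drive the event loop
    pass
```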

@noxdafox
Author

The problem is that Celery uses amqp and lots of other dependencies which I don't want to pull into the project.

Plus, I have to stick to Debian stable, so backporting all those libraries would be quite painful. I think the only feasible solution is to let the service die every time the issue is spotted.

I wish there were an easy way to set up a pool of workers without all of this workaround pain (the Python standard pool is too simple to be of any use).

@ask
Contributor

ask commented Jan 15, 2014

Right, but the problem it attempts to solve is not simple. The stdlib pool may work in some cases but falls short if you need a reliable service to execute arbitrary code.

@ask ask closed this as completed Jan 15, 2014
@noxdafox
Author

A solution I implemented works by acquiring the multiprocessing Lock whenever a SIGTERM/SIGKILL must be delivered to a process.

https://github.com/noxdafox/pebble/blob/master/pebble/process/pool.py#L151

The library is currently used within a production system and hasn't shown any issue since we migrated to this solution (with billiard it was happening quite frequently).
The assumption is that the main process acquires the resources (Lock and Pipe) before killing its children, therefore preventing them from being left in an inconsistent state.
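
In other words, the idea is roughly the following (a minimal sketch of the approach, not the actual pebble code; the lock handle passed in is assumed to be the one guarding the workers' shared channel):

```python
import os
import signal

def stop_worker(worker, channel_lock):
    # Sketch: the parent takes the lock protecting the shared pipe first,
    # so the child cannot be holding it at the instant SIGKILL lands.
    with channel_lock:
        os.kill(worker.pid, signal.SIGKILL)
        worker.join()
    # The lock is released here in a consistent state, so the remaining
    # workers' put()/get() calls keep working.
```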

I could provide a patch if needed, but I guess your attention is now focused elsewhere :)
