Pool gets stuck if hard timeout hits on one or more workers #104
Comments
I forgot to mention: the issue only shows up when a worker timeout occurs, and it seems to appear randomly.
This is a known issue, and it is exactly why the Celery worker has moved to using async I/O and one socket/pipe per child process... A POSIX semaphore will not be released if the process holding it is killed, and I don't know of any easy workaround for this. https://github.com/celery/celery/blob/master/celery/concurrency/asynpool.py The Pool bootstep is responsible for creating, starting and stopping the pool: you need a kombu.async.Hub object and call …
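The failure mode is easy to reproduce outside billiard. Here is a minimal sketch using only the stdlib multiprocessing module: a child is SIGKILLed while holding a Lock (which is backed by a POSIX semaphore on Unix), and since the semaphore has no owner-death semantics, it is never released and the parent blocks forever.

```python
# Minimal reproduction: kill a child while it holds a multiprocessing.Lock.
import os
import signal
import time
import multiprocessing as mp

def hold(lock):
    lock.acquire()   # child takes the lock...
    time.sleep(60)   # ...and is killed before it can release it

if __name__ == '__main__':
    lock = mp.Lock()
    p = mp.Process(target=hold, args=(lock,))
    p.start()
    time.sleep(1)                    # give the child time to acquire the lock
    os.kill(p.pid, signal.SIGKILL)   # "hard timeout": kill while locked
    p.join()
    print('child is dead, trying to acquire the lock...')
    lock.acquire()                   # deadlock: the semaphore is still held
    print('never reached')
```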
The problem is that Celery uses amqp and lots of other dependencies which I don't want to pull into the project. Plus, I have to stick to Debian stable, so backporting all those libraries would be quite painful. I think the only feasible solution is to let the service die every time the issue is spotted. I wish there were an easy way to set up a pool of workers without all of this workaround pain (the Python standard pool is too simple to be of any use).
Right, but the problem it attempts to solve is not simple. The stdlib pool may work in some cases but falls short if you need a reliable service to execute arbitrary code.
A solution I implemented works by acquiring the multiprocessing.Lock whenever a SIGTERM/SIGKILL must be delivered to a process. https://github.com/noxdafox/pebble/blob/master/pebble/process/pool.py#L151 The library is currently used within a production system and hasn't shown any issue since we migrated to this solution (with billiard it was happening quite frequently). I could provide a patch if needed, but I guess your attention is now concentrated elsewhere :)
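The pattern, roughly, is the following (a simplified sketch; `stop_worker` and `channel_lock` are illustrative names, not pebble's actual API): the killer takes the same Lock that guards the shared channel before delivering the signal, so the victim can never die while it is inside the critical section.

```python
# Sketch of the workaround: never kill a worker that might be holding
# the channel lock. Since the workers only touch the shared pipe while
# holding channel_lock, acquiring it here guarantees the victim is not
# mid-read or mid-write when the signal lands.
import os
import signal

def stop_worker(pid, channel_lock):
    with channel_lock:
        os.kill(pid, signal.SIGKILL)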
Greetings,
I developed a service which runs heavy C-code tasks across a Pool of processes. As the tasks' complexity might explode in some cases, the hard timeout feature is really useful for my use case, since any signal other than SIGKILL would be ignored.
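For context, a minimal sketch of this setup, assuming billiard's extended apply_async signature with soft_timeout/timeout keyword arguments (these exist in billiard 3.x but may vary by version; the task body is a hypothetical stand-in for a heavy C call):

```python
import hashlib
from billiard import Pool

def heavy_c_task(data):
    # A real C extension that never re-enters the interpreter would
    # ignore SIGTERM (Python-level signal handlers cannot run while
    # it executes), which is why the hard timeout escalates to SIGKILL.
    h = hashlib.sha256()
    for _ in range(10 ** 7):
        h.update(data)
    return h.hexdigest()

if __name__ == '__main__':
    pool = Pool(processes=4)
    # soft_timeout raises an exception inside the worker; timeout (the
    # hard limit) terminates the worker process outright.
    res = pool.apply_async(heavy_c_task, (b'payload',),
                           soft_timeout=30, timeout=60)
    pool.close()
    pool.join()
```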
While running, I noticed that the service sometimes gets silently stuck: CPU usage drops to 0 and all the workers seem to be idling.
Using gdb, I managed to observe that the pool logic seems unable to read or write anything through the Pipes (the SimpleQueues). This is odd: if a Pipe were full, writes would block, but reads would definitely succeed.
My guess is the following:
When a SIGTERM or (especially) a SIGKILL is delivered while a process is accessing the Pipe, the shared Lock used to protect the pipe might remain in a "locked" state.
This would explain why both put() and get() on the SimpleQueues block, and why I can spot this in gdb:
93    def __enter__(self):
It is clear that something is unable to acquire a lock.
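This matches how SimpleQueue is implemented. Paraphrased (and simplified; the Windows branch, where the write lock is None, is omitted) from CPython's multiprocessing/queues.py:

```python
class SimpleQueue:
    def get(self):
        with self._rlock:                    # blocks forever if a killed
            res = self._reader.recv_bytes()  # worker still "holds" the lock
        return _ForkingPickler.loads(res)

    def put(self, obj):
        obj = _ForkingPickler.dumps(obj)
        with self._wlock:                    # same failure mode on the
            self._writer.send_bytes(obj)     # write side
```

A worker SIGKILLed inside either `with` block leaves that lock permanently acquired, blocking every other reader or writer of the queue.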
At this point the service is stuck in a deadlock from which I cannot easily recover.
This bug is quite critical: at the moment, the only way out seems to be for the service to kill itself through an unhandled exception.