
Better handle of crashed processes #659

Merged
merged 6 commits into geopython:main from gschwind:new-handle-process
Sep 5, 2022

Conversation

gschwind
Collaborator

@gschwind gschwind commented Jun 28, 2022

Overview

Improve the management of sub-processes on the Linux platform.

As described in #493, a process may crash while the database still records it as running. At some point the server stops accepting new requests because it has reached the maximum parallel request limit. This patch series handles this situation more gracefully on Linux. As a preliminary requirement, the series must fix an issue with the MultiProcessing module, which can keep zombie processes around forever; here is an example of what I can get with ps -f -u apache:

apache     21711   21702  0 14:57 ?        00:00:00 /usr/sbin/apache2 -D INFO -D LANGUAGE -D WSGI -D PROXY -DFOREGROUND
apache     21712   21702  0 14:57 ?        00:00:00 /usr/sbin/apache2 -D INFO -D LANGUAGE -D WSGI -D PROXY -DFOREGROUND
apache     21758   21704  0 14:57 ?        00:00:00 [apache2] <defunct>
apache     21884   21702  0 14:58 ?        00:00:00 /usr/sbin/apache2 -D INFO -D LANGUAGE -D WSGI -D PROXY -DFOREGROUND

I think the issue is not limited to Apache. To fix it, this patch series provides a new DetachProcessing mode that actually detaches the processing of a request, ensuring that the new process becomes a child of pid 1. This guarantees that processes will not end up as zombies. DetachProcessing should be suitable for any use case on Linux, i.e. for servers other than Apache as well.
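The detach step described above is the classic Unix double-fork: an intermediate child exits immediately, so the worker is adopted by pid 1 and can never become a zombie of the server. A minimal sketch of that technique follows; the function name `run_detached` and its signature are illustrative, not the actual pywps API.

```python
import os


def run_detached(target, *args):
    """Run target(*args) in a process reparented to pid 1 (illustrative sketch).

    Double fork: the intermediate child exits at once, so the grandchild
    is adopted by init and the calling server never has to reap it.
    """
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)   # reap the short-lived intermediate child
        return
    # Intermediate child: start a new session, detached from the parent.
    os.setsid()
    if os.fork() > 0:
        os._exit(0)          # intermediate child exits; grandchild reparents to pid 1
    # Grandchild: do the actual work, then exit without unwinding the caller's stack.
    try:
        target(*args)
    finally:
        os._exit(0)
```

The parent only ever waits on the intermediate child, which terminates immediately, so the long-running worker leaves no `<defunct>` entry behind.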

Note that for the patch series to work, the configuration must use the detachprocessing mode; otherwise terminated processes cannot be detected. The heuristic used in the patch series is essentially to check whether a process with the stored pid still exists: if the pid is gone, we are sure the process is no longer running. If the pid is still there, we cannot be sure, because Linux may have reused the pid for another process. This makes the heuristic safe but not 100% accurate.

I tried several more accurate heuristics, but they sometimes fail and otherwise make things much more complex.

Moreover, the patch checks and cleans up sub-processes only when pywps reaches the maximum parallel process limit, which means a process may be considered unfinished for a long time, i.e. until that limit is reached. It would be better to also run the check at every status request, but implementing that requires addressing #354 first.
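The pid-existence check described above can be sketched with `os.kill(pid, 0)`, which sends no signal and only reports whether the pid exists. This is a minimal illustration of the heuristic, not the code in the patch series:

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this pid currently exists.

    Linux may reuse a pid, so True does not guarantee it is still the
    same process -- this mirrors the "safe but not 100% accurate"
    heuristic described above. False, however, is definitive.
    """
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only, nothing is sent
    except ProcessLookupError:
        return False     # no such process: it has definitely terminated
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

A cleanup pass would then mark any stored job whose pid fails this check as crashed, since that direction of the heuristic cannot produce a false positive.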

Best regards

Related Issue / Discussion

This is related to #493

Additional Information

Contribution Agreement

(as per https://github.com/geopython/pywps/blob/master/CONTRIBUTING.rst#contributions-and-licensing)

  • I'd like to contribute [feature X|bugfix Y|docs|something else] to PyWPS. I confirm that my contributions to PyWPS will be compatible with the PyWPS license guidelines at the time of contribution.
  • I have already previously agreed to the PyWPS Contributions and Licensing Guidelines

@gschwind gschwind marked this pull request as ready for review June 28, 2022 13:39
@coveralls

coveralls commented Jun 28, 2022

Coverage Status

Coverage decreased (-0.3%) to 81.053% when pulling 57b6301 on gschwind:new-handle-process into 7839f52 on geopython:main.

@gschwind gschwind changed the title New handle process Better handle of crashed processes Jun 28, 2022
@gschwind
Collaborator Author

Hello,

While the patch improves the situation, it leaves one issue: the status file of a crashed process is not updated.

This may be difficult to solve because of how storage is currently handled. In my case, where status documents are stored as files, I may be able to solve it, but in the general case #354 will be the solution.

Best regards

DetachProcessing is intended to replace the broken MultiProcessing. Detach
processing creates a sub-process that is detached from its parent
process, i.e. its parent pid becomes 1. This prevents sub-processes from
lingering as zombies, as they do with the default MultiProcessing.

This mode is crafted for the Linux platform. To use it, add the following
configuration:

[processing]
mode=detachprocessing
In the current code the sub-process PID is initially set to a wrong
value; this function allows the sub-process to update the PID to the
actual value.
dblog.cleanup_crashed_process implements a heuristic to detect
crashed sub-processes, i.e. processes that did not terminate in the
expected way. This may happen if a user kills the process with SIGKILL,
the process is killed by the OOM killer, the process exits on its own
with exit(0), or in other cases I have not thought of. In such cases the
process entry in the database is not updated to reflect the fact that
the process has finished. These entries stack up and block new requests
from being processed. This function prevents that by checking whether a
process with the stored pid is still alive; if no such process is found,
we are sure it is no longer running, so we update its status to failed.

Currently this function is called when the parallel request limit is
reached, to check whether some processes have crashed.
@cehbrecht cehbrecht self-requested a review September 5, 2022 14:01
@cehbrecht cehbrecht merged commit 85ca819 into geopython:main Sep 5, 2022
@gschwind gschwind deleted the new-handle-process branch September 14, 2023 10:01