New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Euthanize unresponsive ipdevpoll workers #2129
Euthanize unresponsive ipdevpoll workers #2129
Conversation
Because: - We need the functionality of Deferred.addTimeout(), introduced in Twisted 16.5. - Since the requirements specified Twisted versions up to 17 are ok for both Python versions, the most common installation methods for NAV will already have Twisted 17 installed, regardless of whether it is running on Python 3 or 2. - The next NAV feature release will drop support for Python 2 anyway.
Because: - We want to be able to verify that the worker processes are actually responsive.
Because: - We've thus far been unable to pinpoint the problem that sometimes causes worker processes to freeze under NAV 5.0. - Killing an unresponsive worker process will cause all the jobs in the worker to fail and the master will reschedule them automatically. - Until the actual issue be identified, an option to monitor and kill unresponsive workers should suffice as a workaround.
Because: - It's nice to be able to change the values when debugging. These aren't necessarily options that normal users will ever need to touch, so I'm not documenting these options just yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might not be pertinent to this but will the ipdevpoll workers be properly killed by SIGTERM? I read recently that a common problem in python on Docker is that the python programs are commonly set up to be explicitly killed by SIGINT (KeyboardInterrupt) and that such programs should either also explicitly listen for SIGTERM, or docker should send SIGINT instead of SIGTERM. With explicit I mean catch in try:/except:
, run a cleanup procedure and then sys.exit()
. Not that we're running only in docker, and not that I can see that we need to explicitly catch anything in order to clean anything up, just something to keep in mind.
They should be, unless they are properly f***ed and won't even process signals. nav/python/nav/ipdevpoll/daemon.py Lines 106 to 107 in 8b26198
nav/python/nav/ipdevpoll/daemon.py Lines 209 to 213 in 8b26198
|
Since the cause of #2105 still has not been found, I'm taking the time to produce a potential workaround (which may still be useful if the cause of #2105 is found):
This PR implements a
Ping
command to the ipdevpoll worker AMP protocol, and uses this to regularly ping all running workers. Any worker who is not known to be done, and does not respond to a ping request will be euthanized using aSIGTERM
signal. The existing ipdevpoll scheduler will automatically reschedule the jobs that are lost/frozen as a consequence.Please see comments in individual commits for various rationales.