Euthanize unresponsive ipdevpoll workers #2129

lunkwill42 · 2020-03-06T14:46:35Z

Since the cause of #2105 still has not been found, I'm taking the time to produce a potential workaround (which may still be useful if the cause of #2105 is found):

This PR implements a Ping command to the ipdevpoll worker AMP protocol, and uses this to regularly ping all running workers. Any worker who is not known to be done, and does not respond to a ping request will be euthanized using a SIGTERM signal. The existing ipdevpoll scheduler will automatically reschedule the jobs that are lost/frozen as a consequence.

Please see comments in individual commits for various rationales.

Because: - We need the functionality of Deferred.addTimeout(), introduced in Twisted 16.5. - Since the requirements specified Twisted versions up to 17 are ok for both Python versions, the most common installation methods for NAV will already have Twisted 17 installed, regardless of whether it is running on Python 3 or 2. - The next NAV feature release will drop support for Python 2 anyway.

Because: - We want to be able to verify that the worker processes are actually responsive.

Because: - We've thus far been unable to pinpoint the problem that sometimes causes worker processes to freeze under NAV 5.0. - Killing an unresponsive worker process will cause all the jobs in the worker to fail and the master will reschedule them automatically. - Until the actual issue be identified, an option to monitor and kill unresponsive workers should suffice as a workaround.

Because: - It's nice to be able to change the values when debugging. These aren't necessarily options that normal users will ever need to touch, so I'm not documenting these options just yet.

hmpf

Might not be pertinent to this but will the ipdevpoll workers be properly killed by SIGTERM? I read recently that a common problem in python on Docker is that the python programs are commonly set up to be explicitly killed by SIGINT (KeyboardInterrupt) and that such programs should either also explicitly listen for SIGTERM, or docker should send SIGINT instead of SIGTERM. With explicit I mean catch in try:/except:, run a cleanup procedure and then sys.exit(). Not that we're running only in docker, and not that I can see that we need to explicitly catch anything in order to clean anything up, just something to keep in mind.

lunkwill42 · 2020-03-09T07:00:11Z

Might not be pertinent to this but will the ipdevpoll workers be properly killed by SIGTERM?

They should be, unless they are properly f***ed and won't even process signals.

nav/python/nav/ipdevpoll/daemon.py

Lines 106 to 107 in 8b26198

    
           signal.signal(signal.SIGTERM, self.sigterm_handler) 
        
           signal.signal(signal.SIGINT, self.sigterm_handler)

nav/python/nav/ipdevpoll/daemon.py

Lines 209 to 213 in 8b26198

    
           def sigterm_handler(self, signum, _frame): 
        
               """Cleanly shuts down logging system and the reactor.""" 
        
               self._logger.warning("%s received: Shutting down", signame(signum)) 
        
               self._shutdown_start_time = time.time() 
        
               reactor.callFromThread(reactor.stop)

lunkwill42 added 4 commits March 6, 2020 12:44

Add a Ping command to the ipdevpoll AMP protocol

7b1399e

Because: - We want to be able to verify that the worker processes are actually responsive.

Make worker ping sweep details configurable

878f22a

Because: - It's nice to be able to change the values when debugging. These aren't necessarily options that normal users will ever need to touch, so I'm not documenting these options just yet.

lunkwill42 added this to the 5.0.5 milestone Mar 6, 2020

lunkwill42 requested a review from hmpf March 6, 2020 14:46

lunkwill42 self-assigned this Mar 6, 2020

hmpf approved these changes Mar 9, 2020

View reviewed changes

lunkwill42 added the enhancement label Mar 9, 2020

lunkwill42 merged commit 4d5bf5d into Uninett:5.0.x Mar 9, 2020

lunkwill42 deleted the debug/ipdevpoll-worker-ping branch March 9, 2020 09:43

lunkwill42 mentioned this pull request Jul 25, 2022

[BUG] ipdevpoll worker processes freeze up #2105

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Euthanize unresponsive ipdevpoll workers #2129

Euthanize unresponsive ipdevpoll workers #2129

lunkwill42 commented Mar 6, 2020

hmpf left a comment

lunkwill42 commented Mar 9, 2020

Euthanize unresponsive ipdevpoll workers #2129

Euthanize unresponsive ipdevpoll workers #2129

Conversation

lunkwill42 commented Mar 6, 2020

hmpf left a comment

Choose a reason for hiding this comment

lunkwill42 commented Mar 9, 2020