Explicitly test for ipdevpoll subprocess pool timeouts #2548

lunkwill42 · 2023-01-06T14:51:34Z

We are seeing intermittent coverage failures on pull requests, specifically related to the nav.ipdevpoll.pool module. Codecov fails these PRs because they appear to reduce the coverage of the pool module, yet none of them touch any code that could reduce the coverage of this file.

The code lines that have "lost" coverage all deal with the handling of process pool workers that time out and need to be euthanized.

Working theory: There is probably no test that explicitly exercises an ipdevpoll process pool worker timeout/euthanization incident. Instead, such an incident happens at random during some test runs, likely due to timing issues. Whenever it occurs, the timeout handlers run and ensure that a new worker gets the job, so the tests are otherwise unaffected, and the timeout handlers get covered. When a later test suite run does not hit any timeouts, those lines go uncovered and the coverage appears to have been reduced.

One of the latest PRs where this happened was #2545. The link to the coverage report is here (but it may be outdated by the time the PR is merged): https://app.codecov.io/gh/Uninett/nav/pull/2545


lunkwill42 · 2023-01-06T14:55:44Z

The "uncovered" methods appear to be

nav/python/nav/ipdevpoll/pool.py

Lines 291 to 321 in 2d7b2a0

    
    @inlineCallbacks
    def _euthanize_unresponsive_worker(self, timeout=10):
        """Sends the ping command to the worker. If the ping command does not succeed
        within the configured timeout, the worker is killed using the SIGTERM signal,
        under the assumption the process has frozen somehow.
        """
        is_alive = not self.done()  # assume the best
        if not self.done():
            try:
                is_alive = yield self.responds_to_ping(timeout)
            except twisted.internet.defer.TimeoutError:
                self._logger.warning("PING: Timed out for %r", self)
                is_alive = False
            except Exception:  # pylint: disable=broad-except
                self._logger.exception(
                    "PING: Unhandled exception while pinging %r", self
                )
                is_alive = None
        # check again; no need to kill worker if its status became 'done'while waiting
        if not self.done():
            try:
                if not is_alive:
                    self._logger.warning(
                        "PING: Not responding, attempting to kill: %r", self
                    )
                    os.kill(self.pid, signal.SIGTERM)
            except Exception:  # pylint: disable=broad-except
                self._logger.exception(
                    "PING: Ignoring unhandled exception when killing worker %r", self
                )

and

nav/python/nav/ipdevpoll/pool.py

Lines 342 to 356 in 2d7b2a0

    
    @inlineCallbacks
    def responds_to_ping(self, timeout=10):
        """Verifies that this worker is alive.

        :param timeout: The maximum allowable number of seconds for the worker to
                        respond
        :type timeout: int
        :return: A Deferred whose result will be True if the worker process responded
                 correctly and within the set timeout.
        """
        self._logger.debug("PING: %r", self)
        deferred = self.process.callRemote(Ping)
        response = yield deferred.addTimeout(timeout, clock=reactor)
        self._logger.debug("PING: Response from %r: %r", self, response)
        returnValue(response.get("result") == "pong")

Explicit coverage of the worker ping and euthanization methods did not exist. Instead, these methods would "accidentally" receive coverage in some test runs, and not at all in others. This would cause coverage stats to flip back and forth, and codecov.io would fail seemingly random PRs for decreasing coverage of these methods. Fixes Uninett#2548
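For illustration, here is a minimal sketch of what a deterministic test of the euthanization path could look like. It does not use the real nav.ipdevpoll.pool classes or Twisted: `FakeWorker`, `PingTimeout`, `kill_calls_for` and the ping callables are hypothetical stand-ins that mirror the decision logic of `_euthanize_unresponsive_worker` synchronously, and `os.kill` is patched so no real process is signalled. The actual tests added in #2599 may be structured quite differently.

```python
import os
import signal
from unittest import mock


class PingTimeout(Exception):
    """Hypothetical stand-in for twisted.internet.defer.TimeoutError."""


class FakeWorker:
    """Hypothetical synchronous stand-in that mirrors the kill decision logic."""

    def __init__(self, pid, ping):
        self.pid = pid
        self._ping = ping  # callable(timeout) -> bool, may raise
        self._done = False

    def done(self):
        return self._done

    def euthanize_unresponsive_worker(self, timeout=10):
        is_alive = not self.done()  # assume the best
        if not self.done():
            try:
                is_alive = self._ping(timeout)
            except PingTimeout:
                is_alive = False
            except Exception:  # mirrors the broad except in the original
                is_alive = None
        # check again; the worker may have finished while we waited
        if not self.done() and not is_alive:
            os.kill(self.pid, signal.SIGTERM)


def timing_out_ping(timeout):
    """Simulates a frozen worker: the ping never answers in time."""
    raise PingTimeout()


def crashing_ping(timeout):
    """Simulates an unexpected error while pinging."""
    raise RuntimeError("connection lost")


def kill_calls_for(ping, pid=12345):
    """Runs the euthanization check with os.kill patched; returns recorded calls."""
    worker = FakeWorker(pid=pid, ping=ping)
    with mock.patch("os.kill") as kill:
        worker.euthanize_unresponsive_worker(timeout=1)
    return kill.call_args_list


def test_unresponsive_worker_is_killed():
    assert kill_calls_for(timing_out_ping) == [mock.call(12345, signal.SIGTERM)]


def test_worker_with_failing_ping_is_killed():
    assert kill_calls_for(crashing_ping) == [mock.call(12345, signal.SIGTERM)]


def test_responsive_worker_is_left_alone():
    assert kill_calls_for(lambda timeout: True) == []
```

Because the timeout is forced by the test rather than left to chance, the timeout and kill branches are exercised on every run, which is what should keep the coverage figure for these lines stable.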

lunkwill42 added the tests label Jan 6, 2023

lunkwill42 changed the title ~~Explicitly test for ipdevpool subprocess pool timeouts~~ Explicitly test for ipdevpoll subprocess pool timeouts Jan 10, 2023

johannaengland self-assigned this Mar 1, 2023

lunkwill42 mentioned this issue Mar 23, 2023

Add unit tests for ipdevpoll worker euthanization #2599

Merged

lunkwill42 self-assigned this Mar 23, 2023

johannaengland removed their assignment Mar 27, 2023

lunkwill42 closed this as completed in #2599 Mar 28, 2023
