Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly test for ipdevpoll subprocess pool timeouts #2548

Closed
lunkwill42 opened this issue Jan 6, 2023 · 1 comment · Fixed by #2599
Closed

Explicitly test for ipdevpoll subprocess pool timeouts #2548

lunkwill42 opened this issue Jan 6, 2023 · 1 comment · Fixed by #2599
Assignees
Labels

Comments

@lunkwill42
Copy link
Member

lunkwill42 commented Jan 6, 2023

We are seeing intermittent coverage failures on pull requests, specifically related to the nav.ipdevpoll.pool module. Codecov fails the PRs because it seems that the PR has reduced the coverage of the pool module, yet we never see any of these PRs actually touching any relevant code that would reduce the coverage of this file.

The code lines that have "lost" coverage all deal with the handling of process pool workers that time out and need to be euthanized.

Working theory: It is likely that what is happening is that there is no test that tests explicitly for a ipdevpoll process pool worker timeout/euthanization incident. This just happens randomly on some tests, likely due to timing issues. Whenever this occurs in a test suite run, the timeout handlers are run, ensuring that a new worker gets the job, but the tests are otherwise unaffected - and the timeout handlers get covered. However, when the test suite later runs and the timeouts do not happen, the coverage is apparently reduced.

One of the latest PRs this explicitly happened was #2545. The link to the coverage report is here (but this may become outdated by the time the PR is merged): https://app.codecov.io/gh/Uninett/nav/pull/2545

@lunkwill42
Copy link
Member Author

The "uncovered" methods appear to be

@inlineCallbacks
def _euthanize_unresponsive_worker(self, timeout=10):
"""Sends the ping command to the worker. If the ping command does not succeed
within the configured timeout, the worker is killed using the SIGTERM signal,
under the assumption the process has frozen somehow.
"""
is_alive = not self.done() # assume the best
if not self.done():
try:
is_alive = yield self.responds_to_ping(timeout)
except twisted.internet.defer.TimeoutError:
self._logger.warning("PING: Timed out for %r", self)
is_alive = False
except Exception: # pylint: disable=broad-except
self._logger.exception(
"PING: Unhandled exception while pinging %r", self
)
is_alive = None
# check again; no need to kill worker if its status became 'done'while waiting
if not self.done():
try:
if not is_alive:
self._logger.warning(
"PING: Not responding, attempting to kill: %r", self
)
os.kill(self.pid, signal.SIGTERM)
except Exception: # pylint: disable=broad-except
self._logger.exception(
"PING: Ignoring unhandled exception when killing worker %r", self
)

and

@inlineCallbacks
def responds_to_ping(self, timeout=10):
"""Verifies that this worker is alive.
:param timeout: The maximum allowable number of seconds for the worker to
respond
:type timeout: int
:return: A Deferred whose result will be True if the worker process responded
correctly and within the set timeout.
"""
self._logger.debug("PING: %r", self)
deferred = self.process.callRemote(Ping)
response = yield deferred.addTimeout(timeout, clock=reactor)
self._logger.debug("PING: Response from %r: %r", self, response)
returnValue(response.get("result") == "pong")

@lunkwill42 lunkwill42 changed the title Explicitly test for ipdevpool subprocess pool timeouts Explicitly test for ipdevpoll subprocess pool timeouts Jan 10, 2023
@johannaengland johannaengland self-assigned this Mar 1, 2023
lunkwill42 added a commit to lunkwill42/nav that referenced this issue Mar 23, 2023
Explicit coverage of the worker ping and euthanization methods did
not exist.  Instead, these methods would "accidentally" receive
coverage in some test runs, and not at all in others.  This would cause
coverage stats to flip back and forth, and codecov.io would fail
seemingly random PRs for decreasing coverage of these methods.

Fixes Uninett#2548
@lunkwill42 lunkwill42 self-assigned this Mar 23, 2023
@johannaengland johannaengland removed their assignment Mar 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants