Improve robustness of PipInstall plugin #7111

Conversation
I'm not sure how much effort we should put into testing this, given that the testing is entirely artificial and based on mocks at the moment.
distributed/diagnostics/plugin.py (outdated)

```python
logger.error(
    "Pip install failed with '%s'", stderr.decode().strip()
)
return
```
I think that in addition to logging the error, we may want to raise a `RuntimeError` here instead of returning. This way, we can propagate the error to the user.
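For concreteness, a minimal sketch of the suggested change (the traceback further down shows the shape it ultimately took; the surrounding `proc`/`stderr` variables are assumed from the plugin's existing `subprocess.Popen` usage):

```python
returncode = proc.wait()
if returncode != 0:
    # Log and raise instead of silently returning, so the failure
    # propagates back to the client that registered the plugin.
    msg = f"Pip install failed with '{stderr.decode().strip()}'"
    logger.error(msg)
    raise RuntimeError(msg)
```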
Agreed. This would result in the worker failing to start, which, I believe, is a good default.
In fact, this will not result in the worker failing to start, due to `catch_errors` defaulting to `True` in `Worker.plugin_add`, but it will raise an error on the client-side:
distributed/worker.py, lines 1811 to 1816 in aaab17c:

```python
async def plugin_add(
    self,
    plugin: WorkerPlugin | bytes,
    name: str | None = None,
    catch_errors: bool = True,
) -> dict[str, Any]:
```
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 client.register_worker_plugin(plugin)
File ~/projects/dask/distributed/distributed/client.py:4632, in Client.register_worker_plugin(self, plugin, name, nanny)
   4628     name = _get_plugin_name(plugin)
   4630 assert name
-> 4632 return self.sync(
   4633     self._register_worker_plugin, plugin=plugin, name=name, nanny=nanny
   4634 )
File ~/projects/dask/distributed/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    337     return future
    338 else:
--> 339     return sync(
    340         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    341     )
File ~/projects/dask/distributed/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
    404 if error:
    405     typ, exc, tb = error
--> 406     raise exc.with_traceback(tb)
    407 else:
    408     return result
File ~/projects/dask/distributed/distributed/utils.py:379, in sync.<locals>.f()
    377         future = asyncio.wait_for(future, callback_timeout)
    378     future = asyncio.ensure_future(future)
--> 379     result = yield future
    380 except Exception:
    381     error = sys.exc_info()
File /opt/homebrew/Caskroom/mambaforge/base/envs/dask-distributed-py3.9/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
    759     exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()
File ~/projects/dask/distributed/distributed/client.py:4558, in Client._register_worker_plugin(self, plugin, name, nanny)
   4554 if response["status"] == "error":
   4555     _, exc, tb = clean_exception(
   4556         response["exception"], response["traceback"]
   4557     )
-> 4558     raise exc.with_traceback(tb)
   4559 return responses
File ~/projects/dask/distributed/distributed/diagnostics/plugin.py:304, in setup()
    302     msg = f"Pip install failed with '{stderr.decode().strip()}'"
    303     logger.error(msg)
--> 304     raise RuntimeError(msg)
    305 else:
    306     logger.info(
    307         "The following packages have already been installed: %s",
    308         self.packages,
    309     )
RuntimeError: Pip install failed with 'ERROR: Could not find a version that satisfies the requirement pyarrov (from versions: none)
ERROR: No matching distribution found for pyarrov'
```
> `catch_errors` defaulting to `True` in `Worker.plugin_add` but it will raise an error on the client-side

That's unfortunate, but still OK.
```diff
@@ -282,34 +278,74 @@ def __init__(self, packages, pip_options=None, restart=False):
         self.packages = packages
         self.restart = restart
         self.pip_options = pip_options or []
         self.id = f"pip-install-{uuid.uuid4()}"
```
I think we should use the `worker.id` instead. I think that there is a one-to-one relationship between `PipInstall` plugin and `Worker` instances, but either way, the uniqueness property should link to the worker, shouldn't it?
This ID is unique per `PipInstall` instance. It's mainly there to avoid conflicts between multiple `PipInstall` plugins all trying to install a bunch of packages on the worker.
See `PipInstall._compose_{installed|restarted}_key()` for the namespacing. We achieve the per-worker uniqueness of `PipInstall._compose_restarted_key()` using `worker.nanny`, which should be the correct key, but we may want to add the PID to that.
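A hypothetical sketch of what that namespacing could look like (the helper names come from this PR, but their exact bodies are assumed here, not quoted):

```python
def _compose_installed_key(self) -> str:
    # One marker per PipInstall instance: the packages only need to be
    # installed once per plugin registration.
    return f"{self.id}-installed"

def _compose_restarted_key(self, worker) -> str:
    # worker.nanny (the nanny's address) makes the key unique per worker,
    # so each worker tracks its own restart state.
    return f"{self.id}-{worker.nanny}-restarted"
```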
`worker.nanny` is not perfect, but I believe it's good enough. I don't think we have the PID of the nanny, and I don't think it's necessary. Let's not overdo it here.
TIL: We tear down plugins that are overwritten, which allows us to nicely remove their metadata. We can also register plugins under custom names, ignoring their `name` attribute. However, we do not tell those plugins the name they are registered by. For `PipInstall` plugins, `plugin.name` will remain `pip`. We also do not pass that information on setup or teardown. Otherwise, we would be able to drop the ID and just use their name.
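To illustrate the point about custom names (the plugin arguments and registered name below are example values, not from the PR):

```python
from distributed import Client
from distributed.diagnostics.plugin import PipInstall

client = Client()  # local cluster, for illustration only
plugin = PipInstall(packages=["requests"])

# Registered under a custom name, ignoring plugin.name:
client.register_worker_plugin(plugin, name="pip-requests")

# Inside setup()/teardown() the plugin still sees plugin.name == "pip";
# it is never told the name it was registered under, hence the separate ID.
```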
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files +1, 15 suites +1, 6h 33m 35s ⏱️ +56m 40s

For more details on these failures, see this check. Results for commit d34c5c9. ± Comparison against base commit 6ac679e.

♻️ This comment has been updated with latest results.
distributed/diagnostics/plugin.py (outdated)

```python
if not await self._is_installed(worker):
    logger.info("Pip installing the following packages: %s", self.packages)
    await self._set_installed(worker)
    proc = subprocess.Popen(
```
To keep the `PipInstall` plugin from hogging the GIL if the installation of packages takes some time, we could switch to using asynchronous subprocesses here. One thing I'm a bit concerned about in that scenario is that we have no way of blocking a restart of the worker (triggered for some other reason) while the subprocess is running.
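A minimal sketch of the asynchronous-subprocess variant, assuming the same pip command the plugin builds today (the function name and error handling are illustrative, not the PR's code):

```python
import asyncio
import sys

async def _pip_install_async(packages: list[str], pip_options: list[str]) -> None:
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-m", "pip", "install", *pip_options, *packages,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # Unlike subprocess.Popen.communicate(), this awaits without
    # blocking the worker's event loop.
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"Pip install failed with '{stderr.decode().strip()}'")
```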
Let's take one step at a time. This is only relevant on startup, and during plugin initialization the worker is not even connected to the scheduler yet. I don't think anything will happen if we block here.

(Also, does `Popen` really block the GIL or just the event loop? Same effect really, but I'm still curious.)
Sorry, I was not careful with the wording here. `subprocess.communicate()` hogs the event loop. After a small expedition down the GIL rabbit hole, it looks like `subprocess.communicate()` does indeed release the GIL:
https://github.com/python/cpython/blob/e39ae6bef2c357a88e232dcab2e4b4c0f367544b/Modules/posixmodule.c#L8489-L8491

The visible effect remains the same: the worker might become unresponsive for a while, which will likely be exacerbated in #7103, since `conda install` doesn't feel like the fastest thing in the world.
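A related alternative to fully async subprocesses would be keeping `Popen` but offloading the blocking `communicate()` call to a thread; a hypothetical sketch (requires Python 3.9+ for `asyncio.to_thread`):

```python
import asyncio
import subprocess
import sys

async def _pip_install_threaded(packages: list[str]) -> None:
    proc = subprocess.Popen(
        [sys.executable, "-m", "pip", "install", *packages],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() releases the GIL but would still block the event loop
    # if called directly; running it in a thread keeps the worker responsive.
    _, stderr = await asyncio.to_thread(proc.communicate)
    if proc.returncode != 0:
        raise RuntimeError(f"Pip install failed with '{stderr.decode().strip()}'")
```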
distributed/tests/test_worker.py (outdated)

```python
@gen_cluster(client=True, nthreads=[("", 1), ("", 1)])
async def test_pip_install_multiple_workers(c, s, a, b):
    with captured_logger(
        "distributed.diagnostics.plugin", level=logging.INFO
    ) as logger:
        mocked = mock.Mock()
        mocked.configure_mock(
            **{"communicate.return_value": (b"", b""), "wait.return_value": 0}
        )
        with mock.patch(
            "distributed.diagnostics.plugin.subprocess.Popen", return_value=mocked
        ) as Popen:
            await c.register_worker_plugin(
                PipInstall(packages=["requests"], pip_options=["--upgrade"])
            )

    args = Popen.call_args[0][0]
    assert "python" in args[0]
    assert args[1:] == ["-m", "pip", "install", "--upgrade", "requests"]
    assert Popen.call_count == 1
    logs = logger.getvalue()
    assert "Pip installing" in logs
    assert "already been installed" in logs
```
This test should ensure that the worker is restarted, shouldn't it?
Restarting only happens if

- we use a `Nanny`
- we set `restart=True`

This test does not restart since it is just using your average `Worker`, which can't restart. I've been trying to build a test that uses a `Nanny` and restarts, but I'm running into problems with mocking out the calls to `pip` and the subprocessing happening inside a `Nanny`.
> I've been trying to build a test that uses a `Nanny` and restarts, but I'm running into problems with mocking out the calls to `pip` and the subprocessing happening inside a `Nanny`.

I've been able to resolve this and added another test.
I'm surprised to see codecov complaining about all the missed lines. Just mocking the `Popen` call should be sufficient to test the entire plugin, shouldn't it?
Force-pushed from 3817177 to 74f71c5 (compare).
Note that without throwing another `SchedulerPlugin` into the mix to manage metadata, we do not remove metadata upon removal of the plugin.

That's perfectly fine.
Thanks @hendrikmakait
Closes #7102
Closes #7037
Supersedes #7086

`pre-commit run --all-files`