
Close state machine and add-ins first in Worker.close #8066

Merged · 4 commits · Aug 4, 2023

Conversation

@hendrikmakait (Member) commented on Aug 3, 2023:

Partially closes #8062

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait changed the title from "Restructure Worker.close" to "Reorder operations in Worker.close" on Aug 3, 2023
@github-actions bot (Contributor) commented on Aug 3, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

20 files ±0 · 20 suites ±0 · 11h 29m 24s ⏱️ +17m 52s
3 752 tests −2: 3 644 ✔️ +2 · 104 💤 −2 · 3 ❌ −3 · 1 🔥 +1
36 477 runs +163: 34 809 ✔️ +252 · 1 663 💤 −85 · 4 ❌ −5 · 1 🔥 +1
(✔️ passed · 💤 skipped · ❌ failed · 🔥 errored)

For more details on these failures and errors, see this check.

Results for commit 567bbcd (comparison against base commit 84e1984).

This pull request removes 2 tests:
  • distributed.protocol.tests.test_numpy
  • distributed.shuffle.tests.test_rechunk

♻️ This comment has been updated with latest results.

@hendrikmakait (Member, Author) commented:

CI doesn't seem to be any unhappier than usual. test_high_unmanaged_memory_warning is a new flake, but it appears to be #8034.

@hendrikmakait changed the title from "Reorder operations in Worker.close" to "Close state machine and add-ins first in Worker.close" on Aug 4, 2023
@hendrikmakait hendrikmakait marked this pull request as ready for review August 4, 2023 05:23
@fjetter (Member) left a comment:

What I would do/try:

  • I think everything in this PR is good up until the preload teardown
  • Close clients afterwards
  • Stop the periodic callbacks
  • Stop the listener
  • Close the RPC
  • Close the stream
  • Shut down the ThreadPoolExecutor
  • ...

This way we're drawing a hard line for external communication on all channels; a sketch of this ordering follows below.
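
To make that ordering concrete, here is a minimal sketch of what the tail of Worker.close could look like under this proposal. All attribute names below (periodic_callbacks, listeners, rpc, batched_stream, executor) are approximations of distributed's worker internals, not a verbatim excerpt:

```python
# Sketch only: illustrates the proposed shutdown ordering, not the actual code.
async def close_tail(self):
    # Preload teardown has already run at this point (see above).

    # Close clients that still hold a connection to this worker.
    # ...

    # Stop periodic callbacks so nothing fires mid-shutdown.
    for pc in self.periodic_callbacks.values():
        pc.stop()

    # Stop listeners: refuse new inbound connections.
    for listener in self.listeners:
        listener.stop()

    # Close the RPC/connection pool: no new outbound connections.
    await self.rpc.close()

    # Close the batched stream to the scheduler.
    await self.batched_stream.close()

    # From here on, only internal housekeeping: shut down the thread pool executor.
    self.executor.shutdown(wait=False)
```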

```python
teardowns = [
    plugin.teardown(self)
    for plugin in self.plugins.values()
    if hasattr(plugin, "teardown")
]

await asyncio.gather(*(td for td in teardowns if isawaitable(td)))
```

Comment (Member):

Ohhh, this is bad. This was here before but we should fix this regardless

Suggested change:

```diff
-await asyncio.gather(*(td for td in teardowns if isawaitable(td)))
+for res in await asyncio.gather(
+    *(td for td in teardowns if isawaitable(td)), return_exceptions=True
+):
+    if isinstance(res, Exception):
+        logger.error("Encountered exception during teardown of plugin %s", res)
```

or something like that; see the sketch below.
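
For readers unfamiliar with the pattern, a small self-contained sketch (hypothetical teardown coroutines) of how return_exceptions=True keeps one failing teardown from cancelling the rest:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def good_teardown():
    await asyncio.sleep(0)  # pretend cleanup work

async def bad_teardown():
    raise RuntimeError("plugin blew up during teardown")

async def main():
    teardowns = [good_teardown(), bad_teardown()]
    # With return_exceptions=True, exceptions come back as results instead of
    # propagating, so every other teardown still runs to completion.
    for res in await asyncio.gather(*teardowns, return_exceptions=True):
        if isinstance(res, Exception):
            logger.error("Encountered exception during teardown of plugin %s", res)

asyncio.run(main())
```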

Comment (Member):

ah, ok, just saw #8072

Comment on lines +1565 to +1569
```python
for preload in self.preloads:
    try:
        await preload.teardown()
    except Exception:
        logger.exception("Failed to tear down preload")
```

Comment (Member):

Preloads are an interesting case. They are processed right after ServerNode is initialized (i.e., not yet started). Therefore, during startup, preloads don't have access to comms, and there could be a case for delaying this teardown even further.

I don't think this is super relevant, though.

Reply (Member, Author):

Agreed, this placement feels a bit shaky. I'd leave it here for now until it becomes relevant.
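
For background on the preload lifecycle discussed above: a worker preload is a small module with optional setup/teardown hooks. A minimal sketch (hypothetical module name, following dask's documented dask_setup/dask_teardown convention):

```python
# my_preload.py -- hypothetical preload module, registered via e.g.
# `dask worker --preload my_preload.py`.

async def dask_setup(worker):
    # Runs right after the worker object is created, before it starts
    # serving -- which is why comms are not available here.
    worker.custom_state = {}

async def dask_teardown(worker):
    # Runs during Worker.close; the teardown placement discussed above.
    worker.custom_state.clear()
```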

@@ -1597,8 +1597,6 @@ async def close( # type: ignore

await self.scheduler.close_rpc()

Comment (Member):

This is misleading: it's actually a no-op, since Worker.scheduler is of type PooledRPCCall (whose close_rpc does nothing; the underlying comms are owned by the ConnectionPool) and not rpc.

Suggested change:

```diff
-await self.scheduler.close_rpc()
```

For reference:

```python
class PooledRPCCall:
    """The result of ConnectionPool()('host:port')

    See Also:
        ConnectionPool
    """

    def __init__(self, addr, pool, serializers=None, deserializers=None):
        self.addr = addr
        self.pool = pool
        self.serializers = serializers
        self.deserializers = deserializers if deserializers is not None else serializers

    @property
    def address(self):
        return self.addr

    def __getattr__(self, key):
        async def send_recv_from_rpc(**kwargs):
            if self.serializers is not None and kwargs.get("serializers") is None:
                kwargs["serializers"] = self.serializers
            if self.deserializers is not None and kwargs.get("deserializers") is None:
                kwargs["deserializers"] = self.deserializers
            comm = await self.pool.connect(self.addr)
            prev_name, comm.name = comm.name, "ConnectionPool." + key
            try:
                return await send_recv(comm=comm, op=key, **kwargs)
            finally:
                self.pool.reuse(self.addr, comm)
                comm.name = prev_name

        return send_recv_from_rpc

    async def close_rpc(self):
        # No-op: the pooled comms are owned and closed by the ConnectionPool.
        pass
```

@@ -1649,10 +1647,11 @@ def _close(executor, wait):
```python
await self.rpc.close()

self.status = Status.closed
setproctitle("dask worker [closed]")

await ServerNode.close(self)

self.__exit_stack.__exit__(None, None, None)
```

Comment (Member):

Just a side comment: the cleanest way would probably be to make everything context-aware so that we can throw everything into the exit stack (sketch below).
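
As an illustration of that idea, a minimal runnable sketch using contextlib.AsyncExitStack (DummyResource is a hypothetical stand-in for a listener, RPC pool, or stream):

```python
import asyncio
import contextlib

class DummyResource:
    """Hypothetical stand-in for a listener, RPC pool, or stream."""

    def __init__(self, name):
        self.name = name

    async def close(self):
        print(f"closed {self.name}")

async def main():
    stack = contextlib.AsyncExitStack()
    # Register each resource's async cleanup as it is acquired...
    for name in ("listener", "rpc", "stream"):
        stack.push_async_callback(DummyResource(name).close)
    # ...then one aclose() tears everything down in reverse (LIFO) order:
    await stack.aclose()  # prints: closed stream, closed rpc, closed listener

asyncio.run(main())
```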

Comment on lines +1571 to +1574
```python
for pc in self.periodic_callbacks.values():
    pc.stop()

await asyncio.gather(*(td for td in teardowns if isawaitable(td)))
self.stop()
```

Comment (Member):

My gut feeling tells me to close the clients before stopping

@@ -1649,10 +1647,11 @@ def _close(executor, wait):
```python
await self.rpc.close()
```

Comment (Member):

My gut feeling would tell me to put the listener stops, RPC close, and stream close together. This would basically mark the point where we're severing all outside communication and will only do some internal housekeeping afterwards.
As it is right now, these events are still spread out a lot, aren't they?

@fjetter (Member) commented on Aug 4, 2023:

Generally speaking, I'm happy if CI is happy, but the above order feels more sensible to me. Either way, this already looks like an improvement.

@hendrikmakait (Member, Author) commented:

@fjetter: I'm working on a follow-up PR that would provide a full ordering. The point of this PR is to break it down into smaller pieces and make P2P happy with the initial step.

@fjetter (Member) left a comment:

I'm fine with the code changes. I haven't cross-referenced the failing tests to see if something new came in. The shuffle problems seem to be resolved, so that's a win.

@hendrikmakait (Member, Author) commented:

I don't recall test_gather_on_worker_bad_recipient failing. This appears to be unrelated, as it fails before any worker closes.

@hendrikmakait (Member, Author) commented:

test_crashed_worker_during_transfer still failed once, but it was flaky before #7698.

@hendrikmakait merged commit 8cc2be4 into dask:main on Aug 4, 2023 · 22 of 27 checks passed