
Fixup cluster_info sync handling #5488

Merged: 1 commit into dask:main on Nov 5, 2021
Conversation

@jcrist (Member) commented Nov 3, 2021

Previously a network blip would cause this periodic callback to log an
error condition to the user every second. This fixes that in the
following way:

  • We now catch error conditions and only log a warning after a few
    consecutive errors (so a single network blip will go unnoticed).
  • In the case of an error, we back off a bit before retrying (see the
    sketch below).

Fixes #5472.
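
A minimal sketch of the retry loop described above (not the code in this PR: sync_cluster_info and its send parameter are illustrative stand-ins for the real method and scheduler RPC, and the constants are made up):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def sync_cluster_info(send, interval=1.0, warn_after=3, backoff=1.5):
    consecutive_errors = 0
    while True:
        try:
            await send()  # stand-in for the actual scheduler RPC
            consecutive_errors = 0
            delay = interval
        except Exception:
            consecutive_errors += 1
            # Warn only after several consecutive failures, so a single
            # network blip goes unnoticed
            if consecutive_errors >= warn_after:
                logger.warning(
                    "Failed to sync cluster info %d times in a row",
                    consecutive_errors,
                )
            # Back off a bit before retrying
            delay = interval * backoff
        await asyncio.sleep(delay)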

@jcrist (Member, Author) commented Nov 3, 2021

Test failures are unrelated.

@jrbourbeau (Member) left a comment

Thanks for the patch @jcrist -- I can confirm this fixed the issue in #5472. Interestingly, this async task + while-loop approach is similar to what @jacobtomlinson initially started with in #5033.

@jacobtomlinson out of curiosity, is this periodic syncing behavior actually being used anywhere? Would it be sufficient to have a single message sent during cluster startup?

cc @jacobtomlinson @fjetter who may have thoughts on the changes here

While I left some comments, I don't mean for them to be blocking. This PR could be merged as is, and it would be good to have it included in the release tomorrow (xref dask/community#197).

self._sync_cluster_info, self._sync_interval * 1000
)
# Start a background task for syncing cluster info with the scheduler
self._sync_cluster_info_task = asyncio.ensure_future(self._sync_cluster_info())
Nit: I think asyncio.create_task is preferred over asyncio.ensure_future for Python 3.7+

Suggested change
self._sync_cluster_info_task = asyncio.ensure_future(self._sync_cluster_info())
self._sync_cluster_info_task = asyncio.create_task(self._sync_cluster_info())
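
For context beyond what's in the thread: for a plain coroutine the two calls behave the same, but asyncio.create_task (Python 3.7+) only accepts coroutines and requires a running event loop, while asyncio.ensure_future also passes Futures and other awaitables through. A small runnable comparison:

import asyncio

async def work():
    await asyncio.sleep(0)
    return 42

async def main():
    # create_task: coroutine-only, needs a running loop (Python 3.7+)
    t1 = asyncio.create_task(work())
    # ensure_future: also accepts Futures and other awaitables
    t2 = asyncio.ensure_future(work())
    print(await t1, await t2)

asyncio.run(main())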

Comment on lines +145 to +146
self.status = Status.closing

Nice


async def error(*args, **kwargs):
nonlocal error_called
await asyncio.sleep(0.001)
Why is this sleep needed?

@jacobtomlinson (Member) left a comment

This looks great, thanks for fixing this up @jcrist.

@jrbourbeau I haven't opened a PR that makes use of this yet, but it's on my backlog. We need to periodically sync because state on the cluster object changes. If you call scale, for instance, the number of workers changes, and when reconstructing this object we need to know about that.
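
A hedged illustration of that point, using a hypothetical payload rather than the actual cluster_info schema: mutable state drifts after startup, so a one-shot message at cluster creation would leave the scheduler's copy stale.

# Hypothetical payload; the real keys live in distributed's Cluster class
cluster_info = {"name": "my-cluster", "type": "KubeCluster", "workers": 2}

def scale(n):
    # State changes after startup; a single startup sync would miss this
    cluster_info["workers"] = n

scale(10)
# Without periodic syncing, a client reconstructing the cluster from the
# scheduler's copy would still see workers == 2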

@jrbourbeau (Member) left a comment

Thanks @jcrist

@jrbourbeau jrbourbeau merged commit 06ba74b into dask:main Nov 5, 2021
Successfully merging this pull request may close these issues:

  • Client spews errors in JupyterLab during compute (#5472)