Overhaul transitions for the resumed state #6699
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   6h 30m 37s ⏱️ -35m 18s
For more details on these failures and errors, see this check.
Results for commit f190b4b. ± Comparison against base commit 1d0701b.
♻️ This comment has been updated with latest results.
Force-pushed from 64100c4 to 04cc7ac.
There are two failing tests.
distributed/worker_state_machine.py
Outdated
ts.coming_from = None
ts.state = "released"
ts.done = False
ts.previous = None
ts.next = None
return {ts: "waiting"}, []
I think we're missing an in_flight_tasks.remove/discard(ts) here.
I moved them all into _gather_dep_done_common.
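For context, here is a minimal, self-contained sketch of what such a shared cleanup helper could look like; the class and event names below are simplified stand-ins, not the actual implementation in distributed/worker_state_machine.py.

```python
from dataclasses import dataclass
from collections.abc import Iterator

# Simplified stand-ins for the real TaskState / WorkerState / event classes.
@dataclass(eq=False)
class TaskState:
    key: str
    done: bool = False

@dataclass
class GatherDepDoneEvent:
    keys: set[str]

class WorkerStateSketch:
    def __init__(self) -> None:
        self.tasks: dict[str, TaskState] = {}
        self.in_flight_tasks: set[TaskState] = set()

    def _gather_dep_done_common(self, ev: GatherDepDoneEvent) -> Iterator[TaskState]:
        """Cleanup shared by all gather_dep termination paths (sketch)."""
        for key in ev.keys:
            ts = self.tasks[key]
            self.in_flight_tasks.discard(ts)  # the remove/discard discussed above
            ts.done = True
            yield ts
```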
There are a couple of small things but overall this looks great. I think spelling out the resumed transitions explicitly was long overdue
so we're entering cancelled state and waiting until it completes.
"""
assert ts.state in ("executing", "long-running")
ts.previous = cast(Literal["executing", "long-running"], ts.state)
I'm curious, what does this literal cast give us?
ts.state is a Literal[executing, long-running, flight, <all other states>], while ts.previous is a Literal[executing, long-running, flight, None]. mypy is not smart enough to realise that the assertion on the line above guarantees the intersection of the two domains.
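To make the point concrete, here is a minimal sketch of the narrowing problem; the type aliases are simplified assumptions, not the actual annotations in worker_state_machine.py.

```python
from typing import Literal, Optional, cast

# Simplified stand-ins for the real state annotations.
TaskStateState = Literal["executing", "long-running", "flight", "memory", "released"]
PreviousState = Optional[Literal["executing", "long-running", "flight"]]

class TaskState:
    state: TaskStateState
    previous: PreviousState = None

def remember_previous(ts: TaskState) -> None:
    assert ts.state in ("executing", "long-running")
    # mypy still sees ts.state as the full TaskStateState union here, so without
    # the cast it rejects the assignment to the narrower ts.previous type.
    ts.previous = cast(Literal["executing", "long-running"], ts.state)
```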
distributed/worker_state_machine.py
Outdated
ts.state = "released"
msg = error_message(e)
recs = {ts: tuple(msg.values())}
recs2 = {ts: tuple(msg.values())}
instr2: Instructions = []
IIUC this would create a released->error transition, which is not defined. By not setting ts.state = "released" we'd get an ordinary executing->error transition. The serialization failure should be picked up as a task compute / user error, which I think is OK.
released->error is defined, and we need it anyway in case the serialization problem is caused by scatter (although I'd bet the use case is untested).
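For illustration, a hypothetical example of a task whose result cannot be serialized, which is the kind of failure being discussed here (names and values are made up):

```python
import threading
from distributed import Client

client = Client()  # assumes a running cluster

def returns_unpicklable():
    # A lock cannot be pickled, so shipping the result back fails to serialize.
    return threading.Lock()

future = client.submit(returns_unpicklable)
# future.result() would typically raise, surfacing the serialization problem
# as a task / user error rather than crashing the worker.
```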
Force-pushed from 03554df to fbbeb29.
@fjetter e.g. in this case, there's going to be a transition executing->memory while ts.done is False, followed by a transition memory->memory when execute() completes. You'll end up with the task permanently in WorkerState.executing, permanently one fewer thread to run other tasks on, and permanently reduced resources. I'm deleting
distributed/worker_state_machine.py
Outdated
if ts.previous in ("executing", "long-running"):
    # Task dependencies may have been released;
    # can't call _validate_task_executing
    assert not ts.waiting_for_data
If you call _validate_task_executing() from here, only one test in the whole suite becomes flaky, test_cancel_fire_and_forget (dependencies can be released if the dependent is cancelled, but they must be in memory if it's executing), and on my computer only 0.1% of the time (more frequently on CI). I added test_cancel_with_dependencies_in_memory to reproduce the use case deterministically. Still, I find it very scary how little coverage we have for cancelled/resumed tasks with dependencies.
> I find it very scary how little coverage we have for cancelled/resumed tasks with dependencies.

I do not disagree. At the time we started patching up the worker we barely had any coverage at all for any of these internals. We've come a long way and there is still a lot to do to reach 100% coverage. We have to be a bit pragmatic here; I do not believe it is feasible or effective for us right now to push for absolute coverage.
Follow-up: #6893
Why is scattering even transitioning via executing? That sounds wrong. I remember concretely having a released->memory transition for this at some point in time.

There's nothing that stops a user from scattering an object while a task with the same key is already running. This is most likely to happen while prototyping in a Jupyter notebook.
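For illustration, a hypothetical client-side sequence that can produce this situation (the key name and values are made up):

```python
import time
from distributed import Client

client = Client()  # assumes a running cluster

# A task with key "x" starts executing on a worker...
future = client.submit(lambda: time.sleep(10) or 123, key="x")

# ...and while it is still running, data for the same key arrives via scatter,
# e.g. from a notebook cell that was re-run while prototyping.
client.scatter({"x": 123})
```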
class BrokenWorker(Worker):
    block_get_data = True

def test_flight_cancelled_error(ws):
The previous test did not trip on #6877. I chose not to bother investigating why and just rewrote it.
why not keep both?
Because a test that does not test its declared use case is useless and misleading
@@ -533,7 +514,7 @@ async def release_all_futures():

     await lock_compute.release()

-    if not raise_error:
+    if not wait_for_processing and not raise_error:
test_resumed_cancelled_handle_compute defeated me. I could not make any sense of it. I found it so unfathomable that it originally made me can this PR and restart from scratch with #6716, and then again with #6844. This change makes it green again, but I am not sure that the tested stories are not highlighting any problems.
I opened #6905 with some explanations. I admit the test is a bit convoluted but I believe it is valuable
Looks like the test is flaky now. I'll try investigating.
My best guess for the flakiness is that the ordering of wait_for_state(f3.key, "processing", s) and lock_compute.release is very timing-sensitive. Previously, the state machine was wired in such a way that this wouldn't matter, but that is no longer the case.
You're right: if I add a sleep(1) just before await lock_compute.release(), the two tests with wait_for_processing=False fail deterministically. It's a race condition between client.submit (client->scheduler comms) and distributed.Lock release (client->scheduler->worker). Pushing a fix.
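Roughly, the timing experiment described above looks like this (a sketch, not the actual test code; lock_compute and wait_for_processing come from the test under discussion):

```python
import asyncio

async def release_after_delay(lock_compute, delay: float = 1.0):
    # Delaying the release lets the client->scheduler submit message win the race
    # against the client->scheduler->worker release path; with this delay in place
    # the wait_for_processing=False variants fail deterministically.
    await asyncio.sleep(delay)
    await lock_compute.release()
```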
@@ -1902,41 +1911,6 @@ def _transition_waiting_ready(

        return self._ensure_computing()

    def _transition_cancelled_error(
Now uses _transition_cancelled_released
@@ -1960,7 +1934,7 @@ def _transition_generic_error(

        return {}, [smsg]

    def _transition_executing_error(
Now uses _transition_generic_error
            return {}, []
        else:
            assert ts.previous == "flight"
            ts.state = "resumed"
            ts.next = "waiting"
            return {}, []

    def _transition_cancelled_forgotten(
This is impossible
        ts.done = False
        return {}, []

    def _transition_generic_memory(
Moved below and refactored
        return self._transition_generic_fetch(ts, stimulus_id=stimulus_id)

    def _transition_flight_error(
Now uses _transition_generic_error
        for ts in self.in_flight_tasks:
            assert ts.state == "flight" or (
                ts.state in ("cancelled", "resumed") and ts.previous == "flight"
            ), ts
These assertions can actually fail with a poorly-timed scatter()
@fjetter ready for final review and merge.
Looks great. Most comments are informative and should not block.
    elif not wait_for_processing and raise_error:
        assert await f4 == 4 + 2

    assert_story(
Note: instead of asserting on the story, we could just as well assert on the events triggered. I consider the events much more readable and concise.
In fact, I'm actually very curious what kind of events we previously ignored. It makes sense that the worker should not behave identically with different scheduler timings but apparently we ignored something before this change
            (f3.key, "ready", "executing", "executing", {}),
            (f3.key, "executing", "released", "cancelled", {}),
            (f3.key, "cancelled", "fetch", "resumed", {}),
            (f3.key, "resumed", "error", "released", {f3.key: "fetch"}),
This is actually a change in behavior. Previously we stayed in the error state. That was arguably a false behavior given the definitions of the cancelled and resumed states, but I don't think this will matter in practice. Just pointing it out; I'm fine with the new behavior.
            (f3.key, "resumed", "waiting", "executing", {}),
            (f3.key, "executing", "memory", "memory", {}),
+1 looks much cleaner. I vaguely remember that I intentionally did not reset to executing for some reason but I'm glad if we can do it properly. This makes the states much more deterministic
        self._release_resources(ts)
        ts.previous = None
        ts.done = False
This is not necessary. The generic_release triggers a purge_state, which resets this.
ts.done = False
Yes, this was a conscious decision. I'd like to be as explicit as possible in the specific code.
distributed/worker_state_machine.py
Outdated
# Note: this is not the same as recommending {ts: "released"} on the
# previous line, as it would instead transition the task to cancelled - but
# a task that raised the Reschedule() exception is finished!
Is this comment still relevant? I don't really understand it and it seems to reference a line that no longer exists
The comment is still valid. transition_executing_released and transition_resumed_released contain assert not ts.done. I could of course change them with a special code path that is exclusive to the rescheduled state, but I thought that keeping the logic within transition_*_rescheduled was a lot more readable.
follow-up: #6685
_transition_from_resumed contains legacy code and documentation #6693
AssertionError in WorkerState._transition_cancelled_error #6877