Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log all invalid worker transitions to scheduler #6134

Merged
merged 4 commits into from Apr 18, 2022

Conversation

mrocklin
Copy link
Member

Also incorporate this into the testing framework so that we fail loudly
in this occurs in the test suite, even while hidden in a coroutine.

Also incorporate this into the testing framework so that we fail loudly
in this occurs in the test suite, even while hidden in a coroutine.
@github-actions
Copy link
Contributor

github-actions bot commented Apr 15, 2022

Unit Test Results

       16 files  ±  0         16 suites  ±0   7h 11m 26s ⏱️ - 42m 1s
  2 745 tests +  2    2 665 ✔️ +  3       80 💤  - 1  0 ±0 
21 845 runs  +16  20 810 ✔️ +16  1 035 💤 ±0  0 ±0 

Results for commit c130caa. ± Comparison against base commit f0e9f89.

♻️ This comment has been updated with latest results.

@mrocklin
Copy link
Member Author

@gjoseph92 you might also find this one of interest (although no rush on it).

@@ -864,6 +864,7 @@ async def start_cluster(
await asyncio.sleep(0.01)
if time() > start + 30:
await asyncio.gather(*(w.close(timeout=1) for w in workers))
assert not s.events["invalid-worker-transition"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be a little concerned about erroring out within this cleanup logic? Maybe pull the events off the scheduler, await s.close(), then assert? Same for the other one. Or maybe it doesn't matter, not sure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to test (in test_utils_test.py) that this works as expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. See recent commits. I also invested a little in nicely formatted outputs. Dumping the commit message below:

Now we get errors like the following for InvalidTransition

Worker: tcp://127.0.0.1:36443
InvalidTransition: inc-03d935909bba38f9a49655e867cbf56a :: released->foo
  Story:
    ('inc-03d935909bba38f9a49655e867cbf56a', 'compute-task', 'compute-task-1650136473.0490825', 1650136473.049759)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'released', 'waiting', 'waiting', {'inc-03d935909bba38f9a49655e867cbf56a': 'ready'}, 'compute-task-1650136473.0490825', 1650136473.0498002)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'waiting', 'ready', 'ready', {'inc-03d935909bba38f9a49655e867cbf56a': 'executing'}, 'compute-task-1650136473.0490825', 1650136473.0498285)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'ready', 'executing', 'executing', {}, 'compute-task-1650136473.0490825', 1650136473.0498428)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'put-in-memory', 'compute-task-1650136473.0490825', 1650136473.0505066)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'executing', 'memory', 'memory', {}, 'compute-task-1650136473.0490825', 1650136473.0505476)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'ensure-task-exists', 'memory', 'compute-task-1650136473.0510707', 1650136473.0558414)
    ('free-keys', ('inc-03d935909bba38f9a49655e867cbf56a',), 'memory-released-1650136473.05692', 1650136473.0609956)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'release-key', 'memory-released-1650136473.05692', 1650136473.0610087)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'memory', 'released', 'released', {}, 'memory-released-1650136473.05692', 1650136473.061037)

We also test that this shows up in gen_cluster tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be a little concerned about erroring out within this cleanup logic? Maybe pull the events off the scheduler, await s.close(), then assert? Same for the other one. Or maybe it doesn't matter, not sure.

Did this too. Good catch.

Now we get errors like the following for InvalidTransition

```
Worker: tcp://127.0.0.1:36443
InvalidTransition: inc-03d935909bba38f9a49655e867cbf56a :: released->foo
  Story:
    ('inc-03d935909bba38f9a49655e867cbf56a', 'compute-task', 'compute-task-1650136473.0490825', 1650136473.049759)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'released', 'waiting', 'waiting', {'inc-03d935909bba38f9a49655e867cbf56a': 'ready'}, 'compute-task-1650136473.0490825', 1650136473.0498002)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'waiting', 'ready', 'ready', {'inc-03d935909bba38f9a49655e867cbf56a': 'executing'}, 'compute-task-1650136473.0490825', 1650136473.0498285)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'ready', 'executing', 'executing', {}, 'compute-task-1650136473.0490825', 1650136473.0498428)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'put-in-memory', 'compute-task-1650136473.0490825', 1650136473.0505066)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'executing', 'memory', 'memory', {}, 'compute-task-1650136473.0490825', 1650136473.0505476)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'ensure-task-exists', 'memory', 'compute-task-1650136473.0510707', 1650136473.0558414)
    ('free-keys', ('inc-03d935909bba38f9a49655e867cbf56a',), 'memory-released-1650136473.05692', 1650136473.0609956)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'release-key', 'memory-released-1650136473.05692', 1650136473.0610087)
    ('inc-03d935909bba38f9a49655e867cbf56a', 'memory', 'released', 'released', {}, 'memory-released-1650136473.05692', 1650136473.061037)
```

We also test that this shows up in gen_cluster tests
assert a.address in str(s.events["invalid-worker-transition"])
assert ts.key in str(s.events["invalid-worker-transition"])

del s.events["invalid-worker-transition"] # for test cleanup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in why this is necessary. As far as I can tell Scheduler.events contains nested, simple types. Could there by more complex types containing references that prevent garbage collection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to stop the check in gen_cluster from triggering and causing the test to err.

@@ -879,6 +894,7 @@ async def end_worker(w):
await asyncio.gather(*(end_worker(w) for w in workers))
await s.close() # wait until scheduler stops completely
s.stop()
check_invalid_worker_transitions(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that transition failures are catastrophic events, but this makes me wonder whether there are other events that should be checked for during testing fixture teardown. Why not test for them here too? Should we also be serialising them in Scheduler.events as a matter of policy and reconstituting them here? Is this a once-off special-case handling strategy for transition failures?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but for now I'd like to focus on these events. If we find that we want to do this a couple more times then we'll probably generalize things. For now I'd like to focus on a concrete problem for which we have an immediate need of a solution. If we try to do something general then we'll be guessing I think about what the right structure is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as exploratory code to tease out failure. I'm concerned that it becomes a forgotten special case. Do we have a process for keeping track of this kind of thing (another github issue for instance?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not terribly concerned about us forgetting to track things in gen_cluster as they come up. This already has dozens of checks for various inconsistencies (see functions like clean). We're pretty good at adding to this as things arise. If there is something in particular that you've run into and would like to see checked that isn't then yes, a github issue would be a great place to raise that.

@sjperkins sjperkins self-requested a review April 18, 2022 15:32
while not s.events["invalid-worker-transition"]:
await asyncio.sleep(0.01)

with pytest.raises(Exception) as info:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.raises(Exception) as info:
with pytest.raises(ValueError) as info:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you motivate this change? Do you care in particular that it is a ValueError, or are you just trying to match the behavior as it exists today?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, to make my thinking explicit here, I don't care if this is a ValueError or some other kind of error. I personally like using ValueErrors but sometimes folks here like to use very specific things. If they did that in the future I wouldn't want to hold them back. I would want to hold them back if relevant data wasn't in the error message (as is enforced below).

I've chosen to use Exception here because I don't have a strong enough preference on type to want to constrain future developers too much.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you motivate this change?

Just for the normal reason of catching the specific exception type, instead of any Exception: don't accidentally catch something else. I saw that check_invalid_worker_transitions raises ValueError currently, which is what you're hoping to catch.

If, say, the cluster actually deadlocked and timed out, the assert "invalid" in str(info).lower() should fail the test anyway, so it's not that big of a deal. It's just a little easier to read the intention of the test this way.

I've chosen to use Exception here because I don't have a strong enough preference on type to want to constrain future developers too much

If future developers want to change the exception type, then can't they can make a one-line change to the test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If, say, the cluster actually deadlocked and timed out, the assert "invalid" in str(info).lower() should fail the test anyway, so it's not that big of a deal. It's just a little easier to read the intention of the test this way.

Yeah, I don't think that we're going to catch unwanted things here. I would also want to be way more specific in actual code.

If future developers want to change the exception type, then can't they can make a one-line change to the test?

Yeah, in this case it's pretty cheap. In general though I like to test only things that I care about. Otherwise we end up asking devs to change all sorts of unrelated tests that they don't know anything about, which error prone. In general my practice here is to "test things you care about, don't test to match current behavior" which is something that I'd love to socialize a bit across the project (historically we didn't do this, and it resulted in significant pain). In this case though I can't come up with any way to trigger this test except in that one line, so I don't care that much. I'm just sticking to my default policy.

@mrocklin mrocklin merged commit bbe9e9d into dask:main Apr 18, 2022
@mrocklin mrocklin deleted the log-invalid-transition branch April 18, 2022 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants