Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resource allocation for tasks with dependencies #6676

Merged
merged 25 commits into from
Jul 9, 2022

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Jul 6, 2022

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±    0         15 suites  ±0   5h 48m 12s ⏱️ - 47m 16s
  2 950 tests +  11    2 861 ✔️ +  10    85 💤 +1  3  - 1  1 🔥 +1 
21 047 runs   - 734  20 057 ✔️  - 727  986 💤  - 7  3  - 1  1 🔥 +1 

For more details on these failures and errors, see this check.

Results for commit 086178b. ± Comparison against base commit 40867c7.

♻️ This comment has been updated with latest results.

@@ -1888,6 +1888,9 @@ def _transition_from_resumed(
assert next_state in {"waiting", "fetch"}, next_state
assert ts._previous in {"executing", "flight"}, ts._previous

if ts._previous == "executing":
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Write test that ensures that we do in fact release IFF the task is finished.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and line 1888 are missing the use case of 'long-running'. I'll open a follow-up PR about it, no point adding scope creep here.

I'd like to have the test you mention within this PR please.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ts._previous should never happen to begin with in this method. I'll clean up the code in here in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref: #6693

@hendrikmakait hendrikmakait self-assigned this Jul 6, 2022
@crusaderky
Copy link
Collaborator

Please merge from main. Also there are failing tests

Copy link
Member Author

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased onto the latest master to reduce the diff. Two tests have been XFAILed to be tackled in independent PRs. The corresponding issues are #6565 and #6682.

assert ws.available_resources == {"R": 1}


@pytest.mark.xfail(reason="distributed#6565")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XFAIL to be tackled in a follow-up PR. #6565 is already tracking this issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already green

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, somehow missed that.

assert ws.available_resources == {"R": 1}


@pytest.mark.xfail(reason="distributed#6682")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XFAIL to be tackled in a follow-up PR. #6682 has been created to tackle this issue.

@hendrikmakait hendrikmakait marked this pull request as ready for review July 6, 2022 18:53
@hendrikmakait hendrikmakait changed the title Fix resource allocation to tasks Fix resource allocation for tasks with dependencies Jul 6, 2022
@hendrikmakait
Copy link
Member Author

Failing tests are known flakes.

stimulus_id="compute",
)
)
assert ws.tasks["x"].state == "resumed"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you review this?
From reading the code (see _transition_from_resumed), "resumed" should be exclusively on one of the following loops:

  • executing or long-running -> cancelled -> fetch
  • flight -> cancelled -> waiting

in the executing -> cancelled -> waiting loop that you implemented here, I expect ts.state to be 'executing'.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not create an executing -> cancelled -> waiting loop, but an executing -> cancelled -> fetch loop by cancelling x and then gathering it as a dependency to y.

Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@crusaderky
Copy link
Collaborator

It's probably a good idea to park this momentarily and write two separate PRs,

  • all_running_tasks plus unit tests
  • dummy methods for the two events

@@ -2462,7 +2462,6 @@ def ws_with_running_task(ws, request):
SecedeEvent(key="x", compute_duration=1.0, stimulus_id="secede")
)
assert ws.tasks["x"].state == request.param
assert ws.available_resources == {"R": 0}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this into the dedicated test_ws_with_running_task

@@ -1033,6 +1036,98 @@ def test_gather_priority(ws):
]


@pytest.mark.parametrize("state", ["executing", "long-running"])
def test_running_constrained_task_acquires_resources(state, ws):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating logic from ws_with_running_task and test_ws_with_running_task to have an explicit test focused on resource restrictions that is resilient to changes to those functions.

@hendrikmakait
Copy link
Member Author

CI issues are being caused by #6692. I'd move _validate_resources to a separate PR that will be blocked by #6692 in order to unblock this PR. Thoughts, @crusaderky?

assert ws.available_resources == {"R": 1}

instructions = ws.handle_stimulus(
GatherDepSuccessEvent("gather-dep-done", "127.0.0.1:1235", 8, {"x": 1.0})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unreadable, please never go full-positional

@@ -1963,6 +1964,8 @@ def _transition_from_resumed(

if ts._previous in ("executing", "long-running"):
self._release_resources(ts)
self.executing.discard(ts)
self.long_running.discard(ts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually superfluous as we're invoking purge_state afterwards. But I think it's a good idea to have it for cleanliness' sake.

@crusaderky
Copy link
Collaborator

See code review: 74defe4
There's an xfail in test_resumed_task_releases_resources:

E       AssertionError: assert {'R': 0} == {'R': 1}
E         Differing items:
E         {'R': 0} != {'R': 1}
E         Full diff:
E         - {'R': 1}
E         ?       ^
E         + {'R': 0}
E         ?       ^

The culprits are, again, #6689 + #6693.
_transition_from_resumed is not hit in case of error, so the change in this PR does not have the effect you hoped.

I'm going to merge the PR as-is, but it means it doesn't close #6663.
I'll resolve with #6689, #6693, and #6663 in a single PR before I leave.

@hendrikmakait
Copy link
Member Author

There is a separate issue open around the issue with failing resumed tasks: #6682. This PR should solve the reproducer in #6663 (it works just fine on my machine), so I'd close that issue with merging and keep #6682 open.

@hendrikmakait
Copy link
Member Author

@crusaderky: Thanks for the thorough review!

@crusaderky crusaderky merged commit 6765e6e into dask:main Jul 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SerializeableLock fails when used with resources
2 participants