
Revert idle classification when worker-saturation is set #7278

Merged · 3 commits · Nov 10, 2022

Conversation

@fjetter (Member) commented Nov 9, 2022

This restores the behavior of the idle set to what it was before #6614 whenever worker-saturation is set.

All code paths in the new queuing path use a separate idle set, based on a different definition of idleness.

Closes #7085

ref #7191
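For context, the pre-#6614 occupancy-based classification can be sketched roughly as follows. This is an illustrative standalone function, not the scheduler's actual API: `classify_worker` and its parameter names are made up here, while the thresholds mirror the `check_idle_saturated` snippet quoted later in this review thread.

```python
def classify_worker(p: int, nc: int, occ: float, avg: float) -> str:
    """Sketch of the occupancy-based idle/saturated classification.

    p:   number of tasks processing on the worker
    nc:  the worker's thread count (nthreads)
    occ: the worker's occupancy (estimated seconds of pending work)
    avg: cluster-wide average occupancy per thread

    Illustrative names/signature only, not the scheduler's API.
    """
    # Idle: fewer tasks than threads, or well under half the average load.
    if p < nc or occ < nc * avg / 2:
        return "idle"
    # Saturated: noticeably more pending work than the cluster average.
    if p > nc:
        pending = occ * (p - nc) / (p * nc)
        if 0.4 < pending > 1.9 * avg:
            return "saturated"
    return "neither"


print(classify_worker(p=2, nc=4, occ=1.0, avg=0.5))    # idle (p < nc)
print(classify_worker(p=10, nc=4, occ=20.0, avg=1.0))  # saturated
```

The point of this PR is that when worker-saturation is set, the queuing code paths classify idleness differently (based on `_worker_full`), so the occupancy-based set above is restored for the non-queuing consumers such as work stealing.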

@fjetter fjetter mentioned this pull request Nov 9, 2022
@github-actions bot commented Nov 9, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

```
    15 files  (+12)          15 suites (+12)       6h 26m 31s ⏱️ (+5h 32m 51s)
 3 175 tests  (+ 1 907)   3 089 ✔️ (+ 1 857)    83 💤 (+ 47)    3 ❌ (+3)
23 492 runs   (+19 691)  22 585 ✔️ (+18 892)   903 💤 (+795)    4 ❌ (+4)
```

For more details on these failures, see this check.

Results for commit 7215427. ± Comparison against base commit 88515db.

♻️ This comment has been updated with latest results.

@gjoseph92 (Collaborator) left a comment:

We should update test_queued_paused_new_worker and test_queued_paused_unpaused in test_scheduler.py, and any other tests which make explicit assertions about Scheduler.idle.

distributed/scheduler.py (outdated; resolved)

Comment on lines +3073 to 3076:

```python
saturated.discard(ws)
if self.is_unoccupied(ws, occ, p):
    if ws.status == Status.running:
        idle[ws.address] = ws
```
@gjoseph92 (Collaborator) commented:
Suggested change:

```diff
-saturated.discard(ws)
-if self.is_unoccupied(ws, occ, p):
-    if ws.status == Status.running:
-        idle[ws.address] = ws
+if self.is_unoccupied(ws, occ, p):
+    if ws.status == Status.running:
+        idle[ws.address] = ws
+saturated.discard(ws)
```

This is more consistent with the previous behavior. Notice that before, if a worker was occupied but not saturated, it wouldn't be removed from the saturated set. That was probably neither intentional nor correct, but we're trying to match previous behavior here.

@fjetter (Member, Author) replied:

1. `saturated.discard` was always called unless the worker was truly classified as saturated, see:

   ```python
   idle = self.idle
   saturated = self.saturated
   if p < nc or occ < nc * avg / 2:
       idle[ws.address] = ws
       saturated.discard(ws)
   else:
       idle.pop(ws.address, None)
       if p > nc:
           pending: float = occ * (p - nc) / (p * nc)
           if 0.4 < pending > 1.9 * avg:
               saturated.add(ws)
               return
       saturated.discard(ws)
   ```

   so my behavior is consistent with what it was before.
2. Other than dashboard visuals, saturated is only used in stealing to avoid sorting over all workers (https://github.com/fjetter/distributed/blob/a5d686572e3289e9d7ce71c063205cc35d4a06c2/distributed/stealing.py#L422-L431), and I'm not too concerned about this since stealing is a bit erratic either way.

distributed/scheduler.py (outdated; resolved)

```python
    else not _worker_full(ws, self.WORKER_SATURATION)
):
    saturated.discard(ws)
    if self.is_unoccupied(ws, occ, p):
```
@gjoseph92 (Collaborator) commented:

I'm quite concerned that we're now calling `is_unoccupied` every time, even when queuing is enabled. This significantly slows down the scheduler: #7256. The urgency of fixing that had been diminished by queuing being on by default, which let us skip that slow code path.

I'm not sure that a known and large scheduler performance degradation is worth avoiding hypothetical small changes to work-stealing behavior due to the changed definition of idle when queuing is on.

If we can fix #7256 before a release, then I'm happy with this change, otherwise I'd be concerned by this tradeoff.

@gjoseph92 (Collaborator) commented Nov 10, 2022:

After running some benchmarks, it looks like occupancy might not have as much of an effect on end-to-end runtime as I'd expected: #7256 (comment). So I'm happy with this if we want to go with it.

For performance reasons and practicality though, I'd like to consider #7280 as another solution to #7085.

Edit: that uses occupancy too, so there's a similar performance cost. I think doing both PRs would be a good idea.

gjoseph92 added a commit to gjoseph92/snakebench that referenced this pull request Nov 10, 2022
@fjetter (Member, Author) commented Nov 10, 2022

The performance regression around occupancy was already introduced in release 2022.10.0. I'm looking into it right now, but I won't hold off on merging this PR because of it.

@fjetter fjetter merged commit 27a91dd into dask:main Nov 10, 2022
@fjetter fjetter deleted the revert_idle_classification_rootish_tasks branch November 10, 2022 16:31
Successfully merging this pull request may close these issues.

worker-saturation impacts balancing in work-stealing
2 participants