Occupancy is a measure of expected work on a worker, given a couple of assumptions: if there were one thread on that worker and if network communication and compute could not overlap, occupancy would give us the expected time this worker requires to compute all of its keys. It is an estimate of busyness.
Occupancy is currently used for two mechanisms:

- `Scheduler.worker_objective`, to sort workers from least busy to most busy. Typically we select the least busy worker for a new task assignment.
- Work balancing / work stealing, a background rebalancing mechanism that reassigns tasks to ensure that occupancy is uniformly distributed across all workers.
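As a toy illustration of the first mechanism, here is a standalone sketch (a hypothetical `Worker` model, not the actual `distributed` classes) of picking the least busy worker by occupancy:

```python
# Hypothetical standalone model of least-busy worker selection,
# mimicking how Scheduler.worker_objective ranks candidates.
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    occupancy: float = 0.0  # estimated seconds of queued work


def least_busy(workers: list[Worker]) -> Worker:
    # A new task goes to the worker with the smallest occupancy
    return min(workers, key=lambda ws: ws.occupancy)


workers = [Worker("a", 3.0), Worker("b", 0.5), Worker("c", 7.2)]
print(least_busy(workers).name)  # → b
```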
This measure currently has a couple of flaws:

1. The `worker_objective` tries to compute an "earliest start time" for a given key and to minimize it. Considering runtimes exclusively is not sufficient, since it neglects task priorities: a task assigned to a worker with very high occupancy and thousands of tasks queued up should still have a chance to cut the line. In fact, we rely on this mechanism heavily to reduce memory pressure, and looking only at the expected start time may even work against this goal. See "Respect priority in earliest start time heuristic" #5253 for a deeper discussion.
2. The way occupancy is calculated intentionally allows occupancies to be inaccurate for a short amount of time, since the current implementation does not allow a fast / near constant-time way of propagating the effects of updated timing measurements to the cluster. It also does not account for live changes to tasks held in memory, i.e. both compute and network occupancy drift apart. This drift has to be corrected manually in `reevaluate_occupancy`. Therefore, drastic changes to the number of workers and/or timing measurements take time to propagate to the entire cluster, which, until corrected, can cause very poor scheduling (and stealing) decisions, e.g. "Root-ish tasks all schedule onto one worker" #6573 (comment).
3. Occupancy is subject to errors for unknown tasks. Particularly together with 2.), this can cause oscillating behavior with stealing.
To address problem 1.) we'd need to fundamentally rework how we schedule. Some of the possibilities are discussed in #5253 without a conclusion, due to performance concerns.

Problems 2.) and 3.) are mostly about the implementation itself, and there are already small incremental improvements available (see #7020). I believe we can overcome these limitations entirely (though not for free) while paving the way for a solution to 1.) at the same time.
Drop-in replacement?
Assuming that the variance of durations within a `TaskGroup` is not too large, a suitable approximation of CPU occupancy can be computed by counting the number of tasks per group assigned to a worker.

Using the `duration_average` we can calculate an accurate occupancy in O(TG) instead of O(T), where TG is the number of TaskGroups currently assigned to a worker and T is the number of tasks assigned to it. I expect TG to typically be <10 and consider iterating over this collection to be essentially constant time.

(Note: this entire argument can be made for `TaskPrefix`es as well if performance doesn't play out, but I believe we should start measuring `duration_average` on the TaskGroup level. That's a bit off-topic.)
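A minimal standalone sketch of this counting argument, with made-up group names and duration averages:

```python
# Sketch of the O(TG) occupancy estimate: instead of summing a duration
# for every one of T processing tasks, keep a count per TaskGroup and
# multiply by that group's duration_average. All names are illustrative.
duration_average = {"inc": 0.5, "sum": 2.0}   # seconds per task, per group
task_group_counts = {"inc": 1000, "sum": 3}   # tasks of each group on this worker

occupancy = sum(
    duration_average[group] * count for group, count in task_group_counts.items()
)
print(occupancy)  # 1000 * 0.5 + 3 * 2.0 = 506.0
```

Two dictionary entries are iterated here instead of 1003 tasks, which is the whole point of the approximation.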
Pseudo code
```python
from distributed.scheduler import TaskState, WorkerState, TaskGroup


class WorkerState:
    ...
    # Counts per TaskGroup name, referencing the scheduler's task_groups
    task_groups: dict[str, int]
    needs_what: dict[TaskState, int]

    # Replace all scheduler updates of ws.processing with this.
    def add_to_processing(self, ts: TaskState) -> None:
        self.task_groups[ts.group.name] += 1
        self.processing[ts] = None
        for dts in ts.dependencies:
            if self not in dts.who_has:
                self.need_replica(dts)

    # This will effectively replace Scheduler.get_comm_cost.
    # Comm cost is always called in
    # - move_task_request
    # - _set_duration_estimate
    # - add_to_processing
    # - _reevaluate_occupancy_worker
    def need_replica(self, ts: TaskState) -> None:
        if ts not in self.needs_what:
            self.needs_what[ts] = 1
            self._network_occ += ts.get_nbytes()
        else:
            self.needs_what[ts] += 1

    def dont_need_replica(self, ts: TaskState) -> None:
        if ts in self.needs_what:
            self.needs_what[ts] -= 1
            if not self.needs_what[ts]:
                del self.needs_what[ts]
                self._network_occ -= ts.get_nbytes()

    def remove_from_processing(self, ts: TaskState) -> None:
        self.task_groups[ts.group.name] -= 1
        del self.processing[ts]
        # TODO: loop over the smaller set of the two, see Scheduler.get_comm_cost
        for dts in ts.dependencies:
            if dts in self.needs_what:
                self.dont_need_replica(dts)

    @property
    def occupancy(self):
        # `task_groups` below is the scheduler's name -> TaskGroup mapping
        return sum(
            task_groups[group_name].prefix.duration_average * count
            for group_name, count in self.task_groups.items()
        ) + self._network_occ


class Scheduler:
    ...
    task_groups: dict[str, TaskGroup]

    def add_replica(self, ts: TaskState, ws: WorkerState):
        ws.dont_need_replica(ts)

    def _set_duration_estimate(self, ts: TaskState, ws: WorkerState) -> None:
        # TODO: duration while not completed is not measured by the task groups.
        # I'm sure we can find a way to accommodate this shortcoming.
        old = ws.occupancy
        ws.add_to_processing(ts)
        self.total_occupancy += ws.occupancy - old
```
This pseudo code would effectively allow us to:

- Accurately track which dependencies a worker needs to fetch. This alone could be a valuable piece of information for scheduling decisions.
- Calculate an accurate occupancy by iterating only over task groups instead of tasks.
- Properly differentiate network vs. CPU cost and not double-count dependencies.
- Inspect which task groups are already scheduled on that worker.
I believe this implementation would serve as a drop-in replacement for our existing logic and would allow us to remove `Scheduler.reevaluate_occupancy`, so occupancy calculations could no longer drift. The cost we'd pay for this is twofold:
- We'd always loop over all task dependencies when we assign a task to a worker. However, this already happens in `Scheduler._set_duration_estimate` to calculate the comm cost, so this will not add any computational complexity but rather shift it.
- We would also need to iterate over all dependencies when removing a task from a worker. This is also already a requirement, at least for tasks finishing successfully (see `Scheduler._add_to_memory`), to compute follow-up transitions.

The biggest hit is the occupancy itself, since it would be computed on demand. The most performance-critical section is `worker_objective`, which already isn't constant time since it loops over a task's dependencies to compute transfer cost. Is the number of task groups on a worker significantly larger than the number of dependencies of a task? Even if it is, I would be surprised if the removal of `Scheduler.reevaluate_occupancy` did not amortize this additional cost.
If we are willing to allow some drift in occupancy, we could also cache the workers' occupancies. The cache would be invalidated whenever any task group measurement changes significantly, or by a similar trigger.
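A hypothetical sketch of that caching idea (an illustrative standalone class, not dask code): recompute occupancy lazily and drop the cached value whenever the per-group bookkeeping changes.

```python
# Illustrative cache-on-read occupancy: any mutation of the per-group
# counts invalidates the cached sum, so reads between mutations are O(1).
class CachedOccupancy:
    def __init__(self) -> None:
        self.task_groups: dict[str, int] = {}       # tasks per group
        self.duration_average: dict[str, float] = {}  # seconds per task
        self._occupancy_cache: float | None = None

    def add_task(self, group: str, duration_average: float) -> None:
        self.duration_average[group] = duration_average
        self.task_groups[group] = self.task_groups.get(group, 0) + 1
        self._occupancy_cache = None  # invalidate on any change

    @property
    def occupancy(self) -> float:
        if self._occupancy_cache is None:
            # O(TG) recomputation, only when the cache is stale
            self._occupancy_cache = sum(
                self.duration_average[g] * n for g, n in self.task_groups.items()
            )
        return self._occupancy_cache
```

In the real scheduler the invalidation trigger would also have to cover updated `duration_average` measurements, which is where the tolerated drift comes in.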
`worker_objective` also includes a loop over the dependencies, which could leverage the newly introduced `WorkerState.needs_what`.
Further progress towards problem 1.)

With this implementation we could more easily investigate scheduling based on task groups / prefixes, and could pivot towards a "uniform number of tasks per group" scheduling approach that no longer accounts for durations at all, removing the need for `unknown_durations`. For instance, if we could define an ordering of task groups, we could infer whether a task may cut the line and make progress towards 1.)
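As an illustrative sketch of such an objective (a hypothetical helper, not dask code), a task could go to the worker currently running the fewest tasks of its group, ignoring durations entirely:

```python
# "Uniform number of tasks per group" selection: balance counts per
# TaskGroup across workers instead of estimated runtimes.
def pick_worker(group: str, per_worker_counts: dict[str, dict[str, int]]) -> str:
    # Choose the worker with the fewest tasks of this group assigned
    return min(per_worker_counts, key=lambda w: per_worker_counts[w].get(group, 0))


counts = {"w1": {"inc": 5}, "w2": {"inc": 2}, "w3": {"inc": 9}}
print(pick_worker("inc", counts))  # → w2
```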
cc @hendrikmakait @gjoseph92 @wence- @crusaderky