
Eagerly update aggregate statistics for TaskPrefix instead of calculating them on-demand #8681

Merged
hendrikmakait merged 10 commits into dask:main on Jun 10, 2024

Conversation

hendrikmakait (Member)

Closes #8680

  • Tests added / passed
  • Passes pre-commit run --all-files

github-actions bot (Contributor) commented on Jun 7, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files ±0       29 suites ±0         11h 3m 47s ⏱️ +1h 19m 38s
 4 057 tests -5       3 953 ✅ +5          97 💤 -9       7 ❌ +3
55 883 runs  +7 589  53 712 ✅ +7 354   2 163 💤 +249    8 ❌ +3

For more details on these failures, see this check.

Results for commit c84d42d. Comparison against base commit 5708bdf.

This pull request removes 14 tests and adds 9. Note that renamed tests count towards both.

Removed tests (truncated):
distributed.protocol.tests.test_arrow
distributed.protocol.tests.test_collection
distributed.protocol.tests.test_highlevelgraph
distributed.protocol.tests.test_numpy
distributed.protocol.tests.test_pandas
distributed.shuffle.tests.test_graph
distributed.shuffle.tests.test_merge
distributed.shuffle.tests.test_merge_column_and_index
distributed.shuffle.tests.test_metrics
distributed.shuffle.tests.test_rechunk
…

Added tests:
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[report_args0]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[1]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[True]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[report_args0]
distributed.tests.test_scheduler ‑ test_task_group_and_prefix_statistics

♻️ This comment has been updated with latest results.

Code context:

        like ``{"memory": 10, "processing": 3, "released": 4, ...}``
        """
        return merge_with(sum, [tg.states for tg in self.groups])

    @_deprecated(use_instead="groups")  # type: ignore[misc]

hendrikmakait (Member, Author) commented on Jun 7, 2024:
The fact that we remove groups from prefixes once all their tasks have been forgotten in

    if ts.state == "forgotten" and tg.name in self.task_groups:
        # Remove TaskGroup if all tasks are in the forgotten state
        if all(v == 0 or k == "forgotten" for k, v in tg.states.items()):
            ts.prefix.groups.remove(tg)
            del self.task_groups[tg.name]

means that the additional filtering applied to active in

    @property
    def active(self) -> list[TaskGroup]:
        return [
            tg
            for tg in self.groups
            if any(k != "forgotten" and v != 0 for k, v in tg.states.items())
        ]

is superfluous.
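To make the invariant concrete, here is a minimal, hypothetical sketch (simplified stand-in classes, not the actual scheduler code): because a group is dropped from its prefix as soon as all of its tasks are forgotten, every group still present has at least one non-forgotten task, so active could simply return the groups as-is.

    # Hypothetical, simplified classes to illustrate the invariant; the real
    # TaskGroup/TaskPrefix in distributed.scheduler carry much more state.
    from __future__ import annotations


    class Group:
        def __init__(self) -> None:
            # Per-state task counts, e.g. {"memory": 3, "forgotten": 0, ...}
            self.states: dict[str, int] = {}


    class Prefix:
        def __init__(self) -> None:
            self.groups: list[Group] = []

        def maybe_remove(self, group: Group) -> None:
            # Mirrors the scheduler snippet above: drop the group once every
            # task in it is forgotten.
            if all(v == 0 or k == "forgotten" for k, v in group.states.items()):
                self.groups.remove(group)

        @property
        def active(self) -> list[Group]:
            # Under the eager-removal invariant no extra filtering is needed:
            # any remaining group has at least one non-forgotten task.
            return list(self.groups)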

hendrikmakait marked this pull request as ready for review on June 7, 2024, 09:25.
Code context:

    class TaskPrefix(TaskCollection):
        """Collection tracking all tasks within a prefix

        # FIXME: This comment belongs to the TaskGroup

hendrikmakait (Member, Author) commented:
TODO: I need to adjust this

fjetter (Member) left a review:
LGTM modulo minor nits

Comment on lines 1469 to 1470:

    group.add(self)
    group.prefix.add(self)

fjetter (Member) commented:
Intuitively, I would expect group.add(self) to be sufficient and not require the caller to also do group.prefix.add(self).

Why can't TaskGroup.add be extended to call TaskPrefix.add?

Same question for transition below

hendrikmakait (Member, Author) replied:
Fair point, this is now aligned.
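For illustration, a minimal sketch of the forwarding pattern agreed on here, using hypothetical stand-in classes rather than the PR's actual TaskGroup/TaskPrefix: the group's add notifies its prefix, so callers only call group.add(self) once.

    class Collection:
        """Simplified stand-in for a task collection such as a group or prefix."""

        def __init__(self) -> None:
            self.size = 0

        def add(self, task: object) -> None:
            self.size += 1


    class Prefix(Collection):
        pass


    class Group(Collection):
        def __init__(self, prefix: Prefix) -> None:
            super().__init__()
            self.prefix = prefix

        def add(self, task: object) -> None:
            # Forward to the prefix so the caller writes group.add(task) once
            # instead of group.add(task) followed by group.prefix.add(task).
            super().add(task)
            self.prefix.add(task)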

Code context:

    self._groups.pop(tg)
    for state, count in tg.states.items():
        self.states[state] -= count
    self.duration -= tg.duration

fjetter (Member) commented:
I guess we've had this problem before, but I strongly suspect we'll have to deal with floating-point arithmetic precision here.

Historically, we've occasionally encountered negative occupancy issues, and maybe this goes back to this diff.

To avoid the issue entirely, we may want to track duration as an integer with ms accuracy. That would require a bit of refactoring and is likely out of scope for this PR.

A quick fix would be to add self.duration = max(self.duration, 0) below this.

hendrikmakait (Member, Author) replied:
I think the duration tracking is only loosely coupled with other durations in the system, so internally tracking this in ms resolution and exposing the seconds variant should be pretty quick to do without far-reaching consequences.

hendrikmakait (Member, Author) commented on Jun 7, 2024:
2306ae7 stores the duration in microseconds internally. (I think that's the coarsest resolution we can get away with: milliseconds are too coarse-grained and cause test failures.)
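For reference, a rough sketch of that general approach (hypothetical code, not what 2306ae7 actually does): keep the running total as an integer number of microseconds and expose a float-seconds view, so additions and subtractions stay exact and the total cannot drift below zero the way accumulated floats can.

    class DurationTracker:
        """Hypothetical illustration of integer-microsecond bookkeeping."""

        def __init__(self) -> None:
            self._duration_us = 0  # running total in integer microseconds

        def add(self, seconds: float) -> None:
            self._duration_us += round(seconds * 1_000_000)

        def subtract(self, seconds: float) -> None:
            self._duration_us -= round(seconds * 1_000_000)

        @property
        def duration(self) -> float:
            # Exposed in seconds to match the existing float-based attribute.
            return self._duration_us / 1_000_000

Because the same rounded integer is added and later subtracted, removing a group's contribution returns the total to exactly what it was before, whereas float accumulation can leave a small negative residue.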

Comment on lines 1490 to 1493:

    self.group.transition(self._state, value)
    self.prefix.transition(self._state, value)
    self._state = value
    self.prefix.state_counts[value] += 1

fjetter (Member) commented:
Something obvious: I think this is the only truly performance-critical section, since it is effectively called on every task state transition. The duration and nbytes updates are only triggered when a task completes, so at least five times more rarely. Since this essentially performs the same operations as before (plus a couple of additional function indirections/calls), it should perform similarly to the old version.

fjetter (Member) commented:
Similarly, the nbytes and duration updates should now be twice as expensive, but that's a constant factor. Given that we previously iterated over everything and performed this computation every 100 ms, I assume this is a net positive.
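For context, a plausible, simplified shape of such a per-transition update (hypothetical code, not the PR's): each call touches only two counter entries, which is why performing it on both the group and the prefix for every transition stays cheap compared to re-aggregating all groups every 100 ms.

    from __future__ import annotations

    from collections import defaultdict


    class StateCounts:
        """Hypothetical stand-in for a collection tracking per-state task counts."""

        def __init__(self) -> None:
            self.states: defaultdict[str, int] = defaultdict(int)

        def transition(self, old: str | None, new: str) -> None:
            # O(1) per task transition: decrement the old state's counter and
            # increment the new one, instead of recomputing all counts later.
            if old is not None:
                self.states[old] -= 1
            self.states[new] += 1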

Comment on lines 2060 to 2065:

    if ts.state == "forgotten" and tg.name in self.task_groups:
        # Remove TaskGroup if all tasks are in the forgotten state
        if all(v == 0 or k == "forgotten" for k, v in tg.states.items()):
-           ts.prefix.groups.remove(tg)
+           ts.prefix.remove_group(tg)
            del self.task_groups[tg.name]

fjetter (Member) commented:
I wonder if this cannot be encapsulated in the TaskState.transition method

fjetter (Member) followed up:
Well, I guess not entirely. The del self.task_groups[tg.name] would still have to be done here. Never mind, then.

hendrikmakait merged commit 490b696 into dask:main on Jun 10, 2024.
27 of 34 checks passed.
hendrikmakait deleted the eagerly-updated-prefix-stats branch on June 10, 2024, 09:01.
Successfully merging this pull request may close these issues.

Update aggregate statistics for TaskPrefix instead of calculating them on demand
2 participants