Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add idle time to fine performance metrics #7938

Merged
merged 3 commits into from Jun 22, 2023
Merged

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Jun 21, 2023

Add to fine performance metrics the delta between (end-to-end-runtime * nthreads) and the time spent by workers on tasks.

Such delta does not increase when there are no tasks running anywhere on the cluster.

If you are observing multiple spans at once, e.g. all calls to a certain library, do not double-count overlapping time and do not count time when none of the selected spans are executing.
If you are cherry-picking specific spans, this delta may be in part caused by work stolen by other tasks.

Demo

After running the ML preprocessing notebook already featured in dask/community#301:

Screenshot from 2023-06-21 15-02-06

(I added 3 lines to dask/dask to isolate the I/O time: crusaderky/dask@0f36901)

Summarized insights I obtained from the dashboard:

Activity CPU time notes
I/O (avoidable) 65m Measure partitions
I/O (read/write dataset) 55m This is good
thread-cpu (this is good) 49m This is good
thread-noncpu 48m probably GIL contention
idle workers 66m - Can't saturate cluster
- scheduler is at 100% CPU
- poorly pipelined network transfers
- other?
everything else 37m
TOTAL 320m This is your AWS and Coiled bill

The workflow currently features a whopping 67% waste in runtime.

Known issues

  • The end-to-end runtime does not count time spent before the first task appears on the scheduler for any given burst of activity, e.g. time spent optimizing, serializing, transferring and deserializing the dask graph.
  • The "idle or other spans" metric also includes unfinished tasks: Fine performance metrics: Meter currently-executing tasks #7677

Note

I noticed that keeping the Fine Performance Metrics dashboard open while the computation is running is very CPU intensive for the scheduler. However, this seems to be a problem specific to Bokeh rendering; calling FinePerformanceMetrics.update(), which is invoked every 500ms, costs a modest ~2.5ms.

CC @ntabris @milesgranger

@crusaderky crusaderky self-assigned this Jun 21, 2023
@crusaderky crusaderky marked this pull request as ready for review June 21, 2023 14:22
@crusaderky crusaderky requested a review from fjetter as a code owner June 21, 2023 14:22
distributed/scheduler.py Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       20 files  ±  0         20 suites  ±0   12h 45m 56s ⏱️ + 18m 52s
  3 698 tests +  9    3 589 ✔️ +  8     106 💤 ±0  3 +1 
35 766 runs  +88  34 014 ✔️ +87  1 748 💤  - 1  4 +2 

For more details on these failures, see this check.

Results for commit 092e9d7. ± Comparison against base commit 3de722a.

Copy link
Contributor

@milesgranger milesgranger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, take it or leave it. Otherwise looks great. 👍

Comment on lines 3629 to 3630
# Custom metrics can provide any hashable as the label
activity = str(activity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Know this comment was here before, but think it's not helpful/misleading(?). str doesn't need its input to be hashable.

Suggested change
# Custom metrics can provide any hashable as the label
activity = str(activity)
activity = str(activity)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort function shortly afterwards will break if you don't wrap that arbitrary hashable into a string. That comment line is explaining that activity is not necessarily a string, and as such it's important. I'm amending the comment not to mention hashable.

@crusaderky crusaderky merged commit 429ef8c into dask:main Jun 22, 2023
21 of 24 checks passed
@crusaderky crusaderky deleted the spans_idle branch June 22, 2023 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants