Add idle time to fine performance metrics #7938

crusaderky · 2023-06-21T14:06:55Z

Add idle time to the fine performance metrics, measured as the delta between (end-to-end runtime * nthreads) and the time spent by workers on tasks.

This delta does not increase when there are no tasks running anywhere on the cluster.

When observing multiple spans at once, e.g. all calls to a certain library, overlapping time is not double-counted, and time when none of the selected spans is executing is not counted.
If you are cherry-picking specific spans, this delta may be in part caused by work stolen by other tasks.
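
The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `merge_intervals` and `idle_seconds` are hypothetical names, span activity is assumed to be available as (start, stop) wall-clock intervals, and `nthreads` is assumed to be constant for the whole run.

```python
from __future__ import annotations


def merge_intervals(
    intervals: list[tuple[float, float]],
) -> list[tuple[float, float]]:
    """Merge overlapping (start, stop) intervals so shared time counts once."""
    merged: list[tuple[float, float]] = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it
            prev_start, prev_stop = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop))
        else:
            # Gap before this interval: time inside the gap is not counted
            merged.append((start, stop))
    return merged


def idle_seconds(
    span_intervals: list[tuple[float, float]],
    nthreads: int,
    task_seconds: float,
) -> float:
    """(active wall-clock time * nthreads) minus time spent by workers on tasks."""
    active = sum(stop - start for start, stop in merge_intervals(span_intervals))
    return active * nthreads - task_seconds


# Hypothetical example: two overlapping spans, then a gap, then a third span
spans = [(0, 10), (5, 20), (30, 40)]
idle = idle_seconds(spans, nthreads=4, task_seconds=90.0)
```

Here the overlap between the first two spans is counted once and the gap between 20 and 30 is ignored, so only 30 seconds of wall-clock time are multiplied by the thread count.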

Demo

After running the ML preprocessing notebook already featured in dask/community#301:

(I added 3 lines to dask/dask to isolate the I/O time: crusaderky/dask@0f36901)

Summarized insights I obtained from the dashboard:

Activity	CPU time	notes
I/O (avoidable)	65m	Measure partitions
I/O (read/write dataset)	55m	This is good
thread-cpu (this is good)	49m	This is good
thread-noncpu	48m	probably GIL contention
idle workers	66m	- Can't saturate cluster
		- scheduler is at 100% CPU
		- poorly pipelined network transfers
		- other?
everything else	37m
TOTAL	320m	This is your AWS and Coiled bill

The workflow currently features a whopping 67% waste in runtime.
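
One way to reproduce the 67% figure from the table, assuming that only the two rows noted as "This is good" count as useful work (that reading is an assumption, though the arithmetic matches):

```python
# Minutes per activity, copied from the table above
minutes = {
    "I/O (avoidable)": 65,
    "I/O (read/write dataset)": 55,
    "thread-cpu": 49,
    "thread-noncpu": 48,
    "idle workers": 66,
    "everything else": 37,
}

# Assumption: only the rows noted as "This is good" are useful work
useful = {"I/O (read/write dataset)", "thread-cpu"}

total = sum(minutes.values())
waste = sum(v for k, v in minutes.items() if k not in useful)
waste_fraction = waste / total

print(f"total={total}m waste={waste}m ({waste_fraction:.1%})")
```

Under that assumption, 216 of the 320 minutes are waste.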

Known issues

The end-to-end runtime does not count time spent before the first task appears on the scheduler for any given burst of activity, e.g. time spent optimizing, serializing, transferring and deserializing the dask graph.
The "idle or other spans" metric also includes unfinished tasks; see "Fine performance metrics: Meter currently-executing tasks" #7677.

Note

I noticed that keeping the Fine Performance Metrics dashboard open while the computation is running is very CPU intensive for the scheduler. However, this seems to be a problem specific to Bokeh rendering; calling FinePerformanceMetrics.update(), which is invoked every 500ms, costs a modest ~2.5ms.

CC @ntabris @milesgranger

github-actions · 2023-06-21T15:36:38Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      20 files ±  0       20 suites ±0 12h 45m 56s ⏱️ + 18m 52s
  3 698 tests +  9   3 589 ✔️ +  8   106 💤 ±0 3 ❌ +1
35 766 runs +88 34 014 ✔️ +87 1 748 💤 - 1 4 ❌ +2

For more details on these failures, see this check.

Results for commit 092e9d7. ± Comparison against base commit 3de722a.

milesgranger

One nit, take it or leave it. Otherwise looks great. 👍

milesgranger · 2023-06-22T07:54:41Z

distributed/dashboard/components/scheduler.py

+            # Custom metrics can provide any hashable as the label
+            activity = str(activity)


Know this comment was here before, but think it's not helpful/misleading(?). str doesn't need its input to be hashable.

Suggested change

-            # Custom metrics can provide any hashable as the label
-            activity = str(activity)
+            activity = str(activity)

crusaderky · 2023-06-22

The sort function shortly afterwards will break unless that arbitrary hashable is wrapped in a string. The comment line explains that activity is not necessarily a string, so it is important. I'm amending the comment so that it no longer mentions hashable.
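
A small illustration of why the conversion matters, assuming the labels end up in a plain `sorted()` call (the actual dashboard code may sort differently). The label values here are made up:

```python
# Hypothetical activity labels: custom metrics may use non-string labels
labels = ["thread-cpu", ("my-lib", "parse"), 42]

try:
    sorted(labels)
    mixed_sort_ok = True
except TypeError:
    # str, tuple and int do not define an ordering between each other
    mixed_sort_ok = False

# Converting every label to str first makes them mutually comparable
as_strings = sorted(str(label) for label in labels)
```

Sorting the mixed list raises `TypeError`, while the stringified labels sort without error.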

Show idle time in fine performance metrics

810ac98

crusaderky self-assigned this Jun 21, 2023

crusaderky marked this pull request as ready for review June 21, 2023 14:22

crusaderky requested a review from fjetter as a code owner June 21, 2023 14:22

crusaderky added the diagnostics label Jun 21, 2023

Update distributed/scheduler.py

092e9d7

This was referenced Jun 21, 2023

Fine performance metrics: Break down idle time on the Worker #7671

Open

Fine performance metrics meta-issue #7665

Open

milesgranger approved these changes Jun 22, 2023

Update distributed/dashboard/components/scheduler.py

b561c47

crusaderky merged commit 429ef8c into dask:main Jun 22, 2023
21 of 24 checks passed

crusaderky deleted the spans_idle branch June 22, 2023 09:50
