Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dashboard: Fine Performance Metrics #7725

Merged

Conversation

milesgranger
Copy link
Contributor

@milesgranger milesgranger commented Mar 30, 2023

Closes #7679

  • Tests added / passed
  • Passes pre-commit run --all-files

I'm not very satisfied with the implementation itself, open to suggestions. Maybe after we hammer out a data schema / dataclasses for the metrics the parsing will become a bit more ergonomic.


CURRENT STATE:

(I'll update these GIF(s) / comments as the PR progresses)

There is a function selector box to view/keep specific functions, otherwise the functions will be removed from the chart by function after a set amount of time (currently 10s)

So please let me have your input and here is an example of other data which is to be added as well to this page later:

Example data
{'latency': 3.71983003616333,
 'profile-duration': 2.245496988296509,
 'tick-duration': 34.589913845062256,
 'transfer-bandwidth': 1884128.5683220457,
 'transfer-duration': 10.240396499633789,
 ('execute', 'do_work', 'deserialize', 'seconds'): 5.4306990932673216e-05,
 ('execute', 'do_work', 'other', 'seconds'): 0.0006036278791725636,
 ('execute', 'do_work', 'thread-cpu', 'seconds'): 2.276999999999992e-06,
 ('execute', 'do_work', 'thread-noncpu', 'seconds'): 1.4889137695312508e-05,
 ('execute', 'getitem', 'deserialize', 'seconds'): 0.0015358439995907247,
 ('execute', 'getitem', 'memory-read', 'bytes'): 10386398260.0,
 ('execute', 'getitem', 'memory-read', 'count'): 91.0,
 ('execute', 'getitem', 'other', 'seconds'): 5.236524414503947,
 ('execute', 'getitem', 'thread-cpu', 'seconds'): 2.1170829520000005,
 ('execute', 'getitem', 'thread-noncpu', 'seconds'): 2.387235046886108,
 ('execute', 'make-timeseries', 'deserialize', 'seconds'): 0.7757770742173307,
 ('execute', 'make-timeseries', 'other', 'seconds'): 5.269970175053459,
 ('execute', 'make-timeseries', 'thread-cpu', 'seconds'): 28.740306391999994,
 ('execute', 'make-timeseries', 'thread-noncpu', 'seconds'): 30.920829996778693,
 ('execute', 'series-groupby-count-chunk', 'deserialize', 'seconds'): 0.002064074680674821,
 ('execute', 'series-groupby-count-chunk', 'memory-read', 'bytes'): 6254721966.0,
 ('execute', 'series-groupby-count-chunk', 'memory-read', 'count'): 66.0,
 ('execute', 'series-groupby-count-chunk', 'other', 'seconds'): 2.1850595266441815,
 ('execute', 'series-groupby-count-chunk', 'thread-cpu', 'seconds'): 5.864710648999999,
 ('execute', 'series-groupby-count-chunk', 'thread-noncpu', 'seconds'): 6.863055864824464,
 ('execute', 'series-groupby-count-combine', 'deserialize', 'seconds'): 0.00040181202348321676,
 ('execute', 'series-groupby-count-combine', 'memory-read', 'bytes'): 162853.0,
 ('execute', 'series-groupby-count-combine', 'memory-read', 'count'): 48.0,
 ('execute', 'series-groupby-count-combine', 'other', 'seconds'): 0.4925127071910538,
 ('execute', 'series-groupby-count-combine', 'thread-cpu', 'seconds'): 0.011580911999999666,
 ('execute', 'series-groupby-count-combine', 'thread-noncpu', 'seconds'): 0.058887513750732756,
 ('execute', 'series-groupby-sum-chunk', 'deserialize', 'seconds'): 0.001760255021508783,
 ('execute', 'series-groupby-sum-chunk', 'memory-read', 'bytes'): 6065340352.0,
 ('execute', 'series-groupby-sum-chunk', 'memory-read', 'count'): 64.0,
 ('execute', 'series-groupby-sum-chunk', 'other', 'seconds'): 2.655588552413974,
 ('execute', 'series-groupby-sum-chunk', 'thread-cpu', 'seconds'): 5.599028371999999,
 ('execute', 'series-groupby-sum-chunk', 'thread-noncpu', 'seconds'): 5.118410087396363,
 ('execute', 'series-groupby-sum-combine', 'deserialize', 'seconds'): 0.00034545507514849305,
 ('execute', 'series-groupby-sum-combine', 'memory-read', 'bytes'): 135802.0,
 ('execute', 'series-groupby-sum-combine', 'memory-read', 'count'): 40.0,
 ('execute', 'series-groupby-sum-combine', 'other', 'seconds'): 0.22763406857848167,
 ('execute', 'series-groupby-sum-combine', 'thread-cpu', 'seconds'): 0.008895635999999874,
 ('execute', 'series-groupby-sum-combine', 'thread-noncpu', 'seconds'): 0.0017752643448487593,
 ('gather-dep', 'decompress', 'seconds'): 0.0006368309259414673,
 ('gather-dep', 'deserialize', 'seconds'): 0.02663757384289056,
 ('gather-dep', 'network', 'seconds'): 9.711461509461515,
 ('gather-dep', 'other', 'seconds'): 0.9370217787800357,
 ('get-data', 'compress', 'seconds'): 0.0006392610375769436,
 ('get-data', 'memory-read', 'bytes'): 258869.0,
 ('get-data', 'memory-read', 'count'): 76.0,
 ('get-data', 'network', 'seconds'): 4.732347336539533,
 ('get-data', 'serialize', 'seconds'): 0.01716823095921427}

Current state:

ezgif-3-1a9a06bfef

@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from 6f13c00 to 83045dc Compare March 31, 2023 13:33
@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from 83045dc to a3965b0 Compare April 11, 2023 12:28
@milesgranger milesgranger changed the title [WIP] Dashboard: Fine Performance Metrics Dashboard: Fine Performance Metrics Apr 11, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Apr 11, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       26 files  ±  0         26 suites  ±0   16h 1m 40s ⏱️ + 7m 36s
  3 582 tests +  1    3 472 ✔️ +  4     105 💤 ±0  5  - 3 
45 351 runs  +11  43 184 ✔️ +15  2 162 💤 ±0  5  - 4 

For more details on these failures, see this check.

Results for commit a4e1def. ± Comparison against base commit f4f4718.

♻️ This comment has been updated with latest results.

@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch 2 times, most recently from 0d25a90 to b93fe0b Compare April 12, 2023 08:27
@milesgranger milesgranger marked this pull request as ready for review April 13, 2023 06:52
@milesgranger
Copy link
Contributor Author

Ready for a roasting. :)
@fjetter noted this shouldn't be a 'perfect' impl the first round since it's hidden a bit and ought to be improved later on after schema/dataclasses for the fine-grained perf metrics is developed later.

cc @jrbourbeau @j-bennet @hendrikmakait

@milesgranger
Copy link
Contributor Author

@crusaderky maybe you'd have some time to review. 🙏

Copy link
Collaborator

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very small comments


idx = self.data["operation"].index(operation)
while idx >= len(self.data["operation_val"]):
self.data["operation_val"].append(float("NaN"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to model these two keys as a NamedTuple to protect against unequal lengths?

def init_root(self):
def handle_toggle(_toggle):
self.toggle.label = (
"Bytes (Toggle for Values)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings look like something that could get out of sync very easily, maybe use variables?

@crusaderky crusaderky requested review from crusaderky and removed request for fjetter April 26, 2023 14:19
Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I must say I'm very excited about this new view.

Must have

Bokeh 3.1 specific issues

The page is broken on Bokeh 3.1 (everything is fine on Bokeh 2):

  • The plot looks off and I can't see the bar chart for execute.
  • The plot doesn't seem to update itself in realtime. I have to reload the page every time.
  • Selecting a prefix makes the plot go completely blank.
  • Every few seconds, the plot goes blank. On my stderr I get:
bokeh.core.serialization.DeserializationError: can't resolve reference 'p278815'
2023-04-26 16:20:39,524 - bokeh.server.protocol_handler - ERROR - error handling message
 message: Message 'PATCH-DOC' content: {'events': [{'kind': 'ModelChanged', 'model': {'id': 'p582495'}, 'attr': 'inner_height', 'new': 409}]} 
 error: DeserializationError("can't resolve reference 'p582495'")

There's also a stack trace, but it's completely internal to bokeh.

Other must have

  • Please change unit selection to a dynamically populated dropdown
  • Changing the unit only impacts the execute-by-prefix chart. It should impact all.
  • I think it would make more sense for get_data to be a pie chart
  • Can't find the plot in dask-labextension
  • The tooltip for execute by prefix needs polishing:
    image
  • Please add unit tests. Typically (see other plots) these unit tests run the rendering in a couple of relevant use cases (e.g. no tasks / with tasks) and simply verify that they don't raise.
  • The subtitles are incoherent/confusing. Please change to:
    • Task execution, by prefix
    • Task execution, by activity
    • Send data, by activity (confusingly, get_data is what is called through RPC to get data for yourself; but when it runs locally it's sending the data to a peer)

Bonus

These critiques don't block this PR from merging and can be delivered at a later date.

  • Please add a plot for gather_dep (title: "Receive data, by activity"). It's not quite redundant with get_data, as for example it can highlight deserialization-specific issues.
  • The bytes plot, as it is, is not very useful. The reason is that memory-read is going to eclipse everything else. The purpose of that metric is to compare it to disk-read and calculate cache hit ratios; nothing else. It would become instantly more useful if there was a filter by activity so that it can be excluded (just like the filter by prefix).
  • The stacked bar charts could use a vertical axis
  • The pie charts could use an indication of percentage in the tooltip

@crusaderky
Copy link
Collaborator

I'd be happy to merge this PR with only the "Other must have" section done, plus a check that hides it with bokeh 3, and then have a follow-up issue to introduce bokeh 3 support. I don't know the effort involved in making it work for bokeh 3; the overhead may not be worth it.

@milesgranger
Copy link
Contributor Author

Thanks @crusaderky for the thorough review!
What you suggest sounds great, I'll address the 'Other must have' and follow up with the fixes/adjustments. Maybe start on this tomorrow if not early next week. Much appreciated!

@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from 5e7947c to 884262d Compare May 4, 2023 13:31
@milesgranger
Copy link
Contributor Author

milesgranger commented May 4, 2023

Please change unit selection to a dynamically populated dropdown

Maybe I misunderstand, but your'e talking about the same toggle button, yes? If it was dynamic those plots are explicitly collecting "bytes" or "seconds"; using thins like format_time/bytes, I don't know how it ought to react with arbitrary other types.

Can't find the plot in dask-labextension

I don't know what this is, and grepping thru the codebase I'm not sure how other plots are connected to this.

@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from d3caebd to f5ee251 Compare May 9, 2023 10:29
@crusaderky
Copy link
Collaborator

crusaderky commented May 9, 2023

Please change unit selection to a dynamically populated dropdown

Maybe I misunderstand, but your'e talking about the same toggle button, yes? If it was dynamic those plots are explicitly collecting "bytes" or "seconds"; using thins like format_time/bytes, I don't know how it ought to react with arbitrary other types.

It should react to any arbitrary units it finds. For example, somebody may want to track time that doesn't add up to the wall time:

@context_meter.meter("some_stuff", unit="gpu-seconds", func=my_video_card.get_elapsed)
def runs_on_gpu():
    ...

client.submit(runs_on_gpu, key="foo").result()

The above will appear in Scheduler.cumulative_worker_metrics as ("execute", "foo", "some_stuff", "gpu-seconds"): <elapsed>} and, importantly, it will not detract from ("execute", "foo", "other", "seconds") for the purpose of calculating the end-to-end wall time.

Another example: track number of iterations needed to converge:

def converging():
    n = 0
    while not converged():
        n += 1
        ...
    context_meter.digest_metric("blah", n, "iterations")

client.submit(converging, key="c-123").result()

The above will appear in Scheduler.cumulative_worker_metrics as ("execute", "c-123", "blah", "iterations"): <n>}.

I would suggest you keep it simple: use ad-hoc formatting for seconds and bytes and just print everything else unformatted.

Can't find the plot in dask-labextension

I don't know what this is, and grepping thru the codebase I'm not sure how other plots are connected to this.

https://github.com/dask/dask-labextension

@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from 604f902 to e994cfc Compare May 10, 2023 08:35
@milesgranger
Copy link
Contributor Author

Okay, @crusaderky another look when you have time, have a drop down dynamic unit selection and I don't know about the dask-labextension, seems to work fine for me. Maybe you need to re-install the extension or something? Not that familiar with jupyter lab. 🤷‍♂️

image

@fjetter
Copy link
Member

fjetter commented May 10, 2023

It should react to any arbitrary units it finds. For example, somebody may want to track time that doesn't add up to the wall time:

@crusaderky you are asking for functionality that is not even used at the moment. I consider this entire feature still experimental and I explicitly stated already that this thing does not need to be perfect.

This PR has been open now for more than a month now. I do not want to see any more features being added to this but want this to be wrapped up asap. If anything is missing, please create a follow up ticket for it.

@crusaderky
Copy link
Collaborator

crusaderky commented May 10, 2023

See code review commit. Mostly renamed stuff to align it to metrics.py.

Sizes are off. The top-left panel is the only one that resizes dynamically; the other two are fixed.
The result is that, in anything but a very specific window size, the top-left panel looks either gigantic or extremely small.
Reloading the page once you pick the window size does not fix the problem.
If you fix this one issue, I'm happy to merge (I'll be away until the 22nd, don't wait for me).

Peek.2023-05-11.00-33.mp4

@crusaderky
Copy link
Collaborator

@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from 4ae732d to 290e5d3 Compare May 11, 2023 07:59
@milesgranger milesgranger force-pushed the 7679-perf-metrics-bokeh-dashboard branch from 290e5d3 to a4e1def Compare May 11, 2023 08:00
@hendrikmakait hendrikmakait self-requested a review May 16, 2023 10:06
@fjetter
Copy link
Member

fjetter commented May 16, 2023

I'm still observing an artifact with the plots. If I am observing the dashboard while the computation is going on, the metrics are added to the chart as I'd expect.
Once the computation finishes, a couple of tasks are dropped from the dashboard. A couple seconds/a minute later some other tasks show up again.

If I refresh the page, I get accurate results for a short time but the same process of "drop a couple of tasks" happens again.

Considering how long this is open now I'm inclined to merge regardless. There is value in this dashboard and I'd like to get this out asap and prefer polishing it later.

from dask.datasets import timeseries
from distributed import Client
client = Client()
ddf = timeseries("2000", "2020", partition_freq="1w")
ddf.groupby("id").x.mean().compute()

Screen Recording 2023-05-16 at 11 53 46

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't reviewed the core logic. Given that this is an optional and non-critical component, I'd be fine with merging. @hendrikmakait you intended to review as well. I'll leave it up to you to merge.
If during review anything major comes up, please move this to a follow up ticket


futures = c.map(slowinc, range(10), delay=0.001)
await wait(futures)
await asyncio.sleep(1) # wait for metrics to arrive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (not worthy of holding off the merge): I'd like us to be very conservative with this kind of sleeping since it artificially slows down our test suite.

Ideally, there was a hook we could listen to to know when we can check our assertions. This hook often doesn't exist but we can still implement the test more forgivingly, e.g.

start = time()
while True:
    try:
        assert not cl.task_exec_data

        cl.update()
        assert cl.task_exec_data
        assert cl.task_exec_data["functions"] == ["slowinc"]
        break
    except AssertionError:
        if time() - start > 1:
            raise
        await asyncio.sleep(0.01)

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I haven't review the core logic, but focused on the requested changes. Another small issue I observed is that the charts don't scale if they have no data in them.

I'm fine with merging this and getting it out there. We can iron out the remaining issues in follow-up PRs. This dashboard is already very useful as-is.

@hendrikmakait hendrikmakait merged commit 7f272d9 into dask:main May 16, 2023
27 of 33 checks passed
@milesgranger milesgranger deleted the 7679-perf-metrics-bokeh-dashboard branch May 16, 2023 10:58
@fjetter
Copy link
Member

fjetter commented May 16, 2023

🎉

@crusaderky
Copy link
Collaborator

To recap, these are the follow-ups:

I think at least the first one should be delivered before we can properly advertise this feature to the wider public.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fine performance metrics: Bokeh dashboard
5 participants