Shuffle metrics 4/4: Remove bespoke diagnostics #8367
Conversation
@@ -4326,16 +4326,12 @@ def __init__(self, scheduler, **kwargs):
         "comm_memory": [],
         "comm_memory_limit": [],
         "comm_buckets": [],
-        "comm_avg_duration": [],
-        "comm_avg_size": [],
We're losing a little bit of functionality here.
IMHO it's not a big deal. Worth noting that we still have the information under the fine performance metrics (you'll have to calculate seconds/count and bytes/count yourself).
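For illustration, here is a minimal sketch of recovering those averages from cumulative fine performance metrics. The dict layout and key names below are assumptions for the sake of the example, not the actual metric keys:

```python
# Hypothetical cumulative fine performance metrics, keyed by tuples
# ending in a unit string (layout assumed for illustration only).
cumulative = {
    ("p2p", "comm", "seconds"): 12.5,          # total time spent in comms
    ("p2p", "comm", "bytes"): 4_200_000_000,   # total bytes transferred
    ("p2p", "comm", "count"): 350,             # number of comm operations
}

count = cumulative[("p2p", "comm", "count")]
avg_duration = cumulative[("p2p", "comm", "seconds")] / count  # seconds/count
avg_size = cumulative[("p2p", "comm", "bytes")] / count        # bytes/count
print(f"avg duration: {avg_duration:.3f}s, avg size: {avg_size / 2**20:.1f} MiB")
```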
I don't think I agree that this is not a big deal. The decaying averages of `comm_avg_*` gave an (admittedly very crude) way of understanding distributions over time, which is helpful for understanding performance. (See also #8364 (comment).) For end-user analytics, total averages should be enough to hint at problems, but I'm wondering if we should have a second set of metrics, focused on debugging/performance optimization, that goes into more detail.
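To make the distinction concrete, here is a generic sketch of an exponentially decaying average (not the removed implementation, just an illustration): unlike a total average, it weights recent samples more heavily, so a shift in the distribution late in a run stays visible:

```python
# Generic exponentially weighted average; an illustration of why decaying
# averages reveal distribution shifts over time, not the actual code.
def decaying_average(samples, alpha=0.1):
    avg = None
    for x in samples:
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
    return avg

# 90 fast comms followed by 10 slow ones:
samples = [0.1] * 90 + [1.0] * 10
print(sum(samples) / len(samples))  # 0.19 -- total average hides the slowdown
print(decaying_average(samples))    # ~0.69 -- decaying average surfaces it
```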
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
27 files ±0, 27 suites ±0, 11h 45m 9s ⏱️ +8m 29s
For more details on these failures, see this check.
Results for commit 5311963. ± Comparison against base commit 9273186.
♻️ This comment has been updated with latest results.
Force-pushed d74aa0a to 4bd63e2.
Force-pushed cff1192 to b5a6821.
Force-pushed ed27fa6 to 77beba9.
I'd prefer to keep the existing bespoke metrics around for now. Some of them are more detailed and give us per-worker information, which has been extremely helpful in the past, in particular when viewed in real time. Keeping them will also help us compare the information we get from the new approach and iterate if we identify gaps.
distributed/shuffle/_core.py
Outdated
# Normalize the label and drop the "shuffle-" prefix before building the metric name
label = (label,)
if isinstance(label[0], str) and label[0].startswith("shuffle-"):
    label = (label[0][len("shuffle-") :], *label[1:])
name = ("shuffle", self.span_id, where, *label, unit)
General nit: I'd store these metrics under `p2p`, not `shuffle`. IMO this should be clearer, as there are other shuffle implementations and some P2P-based algorithms are not necessarily what would be called a shuffle by the respective end users (e.g., `rechunk`). This is a general grievance I have with the P2P codebase, but this feels like a good starting point to change things.
Renamed "shuffle" tag to "p2p" everywhere.
Force-pushed 4fc4fe4 to 4993e5a.
Force-pushed 4993e5a to 982a6e6.
@hendrikmakait I've reinstated all metrics that are visible from the dashboard, as discussed. This is ready for review again.
Thanks, @crusaderky, this entire series of changes looks great!
Please read: #7943 (comment)
There are four commits in this PR. All but the last are the previous PRs in the chain.