
feat: profiling v2 [MD-27] #9032

Merged · 23 commits merged from profiling-v2 into main on Mar 26, 2024
Conversation

@azhou-determined (Contributor) commented Mar 20, 2024

Description

Profiling V2 (Project Doc)

Individual commits have already been reviewed; this is the final feature-branch-to-main PR.

Major changes:

  • Timing-metrics functionality was dropped from the Determined profiler; the profiler is now responsible only for system metrics.
  • The Determined profiler now lives in the Core API.
  • The profiler's system metrics are now reported through the backend's generic metrics framework.

Test Plan

This feature should be tested across a few different Trial APIs:

PyTorch

For mnist_pytorch:

  • Change determined/examples/tutorials/mnist_pytorch/train.py to add profiling_enabled (a sketch of the modified call follows this list):
trainer.fit(max_length=max_length, latest_checkpoint=latest_checkpoint, profiling_enabled=True)
  • Submit the experiment. Go to the "Profiler" tab for that experiment in the Web UI. Verify that "System Metrics" is rendered with metric values.
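For reference, a minimal sketch of where the flag lands in the tutorial's entrypoint. The surrounding Trainer setup is paraphrased from the tutorial rather than from this PR, so the class and variable names here are assumptions:

# hypothetical excerpt of train.py; profiling_enabled is the only change under test
from determined import pytorch

def run(max_length, latest_checkpoint, hparams):
    with pytorch.init() as train_context:
        # MNistTrial stands in for the tutorial's PyTorchTrial subclass
        trial = MNistTrial(train_context, hparams=hparams)
        trainer = pytorch.Trainer(trial, train_context)
        trainer.fit(
            max_length=max_length,
            latest_checkpoint=latest_checkpoint,
            profiling_enabled=True,  # turn on system-metrics profiling
        )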

TF Keras

  • In determined/examples/computer_vision/iris_tf_keras/distributed.yaml, enable profiling (see the config sketch after this list):
profiling:
  enabled: true
  • Submit the experiment. Go to the "Profiler" tab for that experiment in the Web UI. Verify that "System Metrics" is rendered with metric values.
    [screenshot: Profiler tab rendering System Metrics]
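For context, profiling is a top-level section of the experiment config; a minimal sketch of the relevant part of distributed.yaml (the name field is assumed, other fields elided):

name: iris_tf_keras_distributed
profiling:
  enabled: true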

Core API

Create a Core API script and expconf.

Experiment Config:

name: profiling
entrypoint: python3 profiling.py

searcher:
  name: single
  metric: x
  max_length: 1

max_restarts: 0

profiling.py

import logging
import time

import determined as det
from determined import core


def main(core_context):
    # Start system-metrics collection with the Core API profiler.
    core_context.profiler.on(sampling_interval=0.1, samples_per_report=10)
    # Run for ~5 minutes, periodically reporting metrics so the trial has data.
    for batch in range(60 * 5):
        steps_completed = batch + 1
        if steps_completed % 5 == 0:
            core_context.train.report_training_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        if steps_completed % 10 == 0:
            core_context.train.report_validation_metrics(
                steps_completed=steps_completed, metrics={"x": batch}
            )
        time.sleep(1)
    core_context.profiler.off()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
    with core.init() as core_context:
        main(core_context=core_context)
  • Submit the experiment. Go to the "Profiler" tab for that experiment in the Web UI. Verify that "System Metrics" is rendered with metric values.
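To submit, the usual CLI flow applies (the config filename here is arbitrary; it assumes the experiment config above was saved as profiling.yaml in the current directory):

det experiment create profiling.yaml .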


Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket: MD-27

@azhou-determined azhou-determined requested review from a team as code owners March 20, 2024 21:22
@cla-bot cla-bot bot added the cla-signed label Mar 20, 2024
@determined-ci determined-ci requested a review from a team March 20, 2024 21:22
@determined-ci determined-ci added the documentation label Mar 20, 2024
netlify bot commented Mar 20, 2024

Deploy Preview for determined-ui canceled.

Latest commit: 9806379
Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66033597fe907b00080a46b8

@gt2345 (Contributor) left a comment

WebUI stamp


System Metrics record agent-level metrics, so when there are multiple experiments on the same
agent, it is difficult to analyze. It is recommended that profiling is done with only a single
experiment per agent.
Contributor:

This could be a place to reference how this can be configured.

Contributor Author:

this can't really be configured. i left this warning from the previous doc because it's still relevant, but it has to do with being aware that there could be other experiments using the same agent as your experiment.

Contributor:

I think it would be frustrating to me to read this note and then be unable to figure out how to do the thing it recommends. Let's chat about it.

Contributor Author:

removed this recommendation

Optional. Profiling is supported for all frameworks, though timings are only collected for
``PyTorchTrial``. Profiles are collected for a maximum of 5 minutes, regardless of the settings
below.
Optional. Defaults to false.
Contributor:

What does it do?

Contributor:

Not being purposefully dense, I can't differentiate between:

profiling:
  profiler: <val>
  enabled: <val>

Contributor Author:

removed this section


Supports up to 1 level of nesting. Returns a single merged dictionary where the values are
averaged across all dictionaries in the given list by key.
# TODO (anda): find a cleaner way to do this.
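Aside: a minimal sketch of the merge-and-average behavior that docstring describes. The helper and its names are hypothetical, for illustration only, not the PR's implementation:

def merge_and_average(samples):
    # Average values by key across all dicts, supporting one level of nesting.
    sums, counts = {}, {}
    for sample in samples:
        for k, v in sample.items():
            if isinstance(v, dict):  # nested case: accumulate per subkey
                sums.setdefault(k, {})
                counts.setdefault(k, {})
                for sk, sv in v.items():
                    sums[k][sk] = sums[k].get(sk, 0) + sv
                    counts[k][sk] = counts[k].get(sk, 0) + 1
            else:
                sums[k] = sums.get(k, 0) + v
                counts[k] = counts.get(k, 0) + 1
    return {
        k: {sk: v[sk] / counts[k][sk] for sk in v} if isinstance(v, dict) else v / counts[k]
        for k, v in sums.items()
    }

# merge_and_average([{"gpu_util": {"0": 1.0}}, {"gpu_util": {"0": 3.0}}])
# -> {"gpu_util": {"0": 2.0}}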
Contributor:

If this is important to you, could you please create a ticket and reference the ticket in the TODO instead of yourself? And if it's not important enough for that, then please remove the TODO.

Contributor Author:

i've created the ticket https://hpe-aiatscale.atlassian.net/browse/MD-338. i will add it to the comment, but don't want to commit right now because i'm waiting for a long-running CI to finish.

for sample in metric_samples:
    for k, v in sample.items():
        if isinstance(v, dict):
            aggregated_metrics[k] = aggregated_metrics.get(k, {})
Contributor:

nit: a little clearer with a defaultdict, I think.

Contributor:

aggregated_metrics = defaultdict(int)
aggregated_metrics = defaultdict(lambda: defaultdict(int))  # for the nested case

Contributor Author:

eh, don't love the way this reads. .get(k, {}) is easily readable IMO. i do think defaultdict is best-practices way of doing this, but since i have a ticket to refactor this method anyway, i don't think it's worth it.



class _Network(_MetricGroupCollector):
    group = "network"
Contributor:

I'm a little surprised this works. I'd have expected that it had to look like

@property
def group(self) -> str:
    return "network"

My typechecker doesn't like it, either.

Contributor:

I'm a little unsettled by this. I don't think the implemented code works, either, and that it hasn't resulted in failed tests makes me wonder if something is wrong with tests.

currently implemented:

    def group(self) -> str:
        return "network"

Contributor Author:

fixed
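The fix itself isn't quoted in this thread; presumably it makes group either a plain class attribute or a proper property. A sketch of the property form, assumed rather than copied from the PR:

class _Network(_MetricGroupCollector):
    @property
    def group(self) -> str:
        return "network"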


codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 48.89241%, with 323 lines in your changes missing coverage. Please review.

Project coverage is 47.79%. Comparing base (1992c97) to head (9806379).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9032      +/-   ##
==========================================
+ Coverage   47.70%   47.79%   +0.08%     
==========================================
  Files        1166     1166              
  Lines      143876   143603     -273     
  Branches     2379     2377       -2     
==========================================
- Hits        68636    68630       -6     
+ Misses      75081    74814     -267     
  Partials      159      159              
Flag Coverage Δ
backend 42.83% <10.42%> (-0.14%) ⬇️
harness 64.34% <61.87%> (+0.59%) ⬆️
web 40.74% <80.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
harness/determined/_trial_controller.py 83.33% <100.00%> (+5.15%) ⬆️
harness/determined/core/__init__.py 100.00% <100.00%> (ø)
harness/determined/keras/callbacks.py 91.50% <100.00%> (+0.35%) ⬆️
...determined/pytorch/deepspeed/_deepspeed_context.py 80.62% <100.00%> (-0.19%) ⬇️
harness/tests/experiment/pytorch_utils.py 95.58% <ø> (ø)
master/pkg/model/metrics.go 66.66% <ø> (ø)
...es/ExperimentDetails/ExperimentSingleTrialTabs.tsx 85.55% <100.00%> (-0.08%) ⬇️
...react/src/pages/TrialDetails/Profiles/Profiler.tsx 70.31% <ø> (+1.19%) ⬆️
...bui/react/src/pages/TrialDetails/Profiles/utils.ts 62.85% <100.00%> (-1.04%) ⬇️
harness/determined/keras/_tf_keras_trial.py 82.83% <85.71%> (+1.46%) ⬆️
... and 16 more

... and 10 files with indirect coverage changes

[
    (conf.fixtures_path("mnist_pytorch"), True),
],
"model_def",
Contributor:

nit: I don't understand the parameterization. I think it's cleaner without it.

Contributor Author:

done

@azhou-determined azhou-determined force-pushed the profiling-v2 branch 2 times, most recently from 961dbf5 to f591b31, March 26, 2024 16:59
azhou-determined and others added 16 commits March 26, 2024 10:32
generic_metrics:
- DB schema changes
- Changes to backend ReportTrialMetrics APIs
remove throughput and timing metric views on web UI for profiler tab
* chore: aggregate profiling metrics before reporting
…MD-301] (#8970)

* Migrate existing profiler metrics:
- historical data migration `trial_profiler_metrics` -> `metrics`
- shim existing trial profiler metrics APIs to fetch from `metrics`
* chore: remove profiling not enabled check in web UI
@azhou-determined azhou-determined merged commit 74fe16b into main Mar 26, 2024
92 of 97 checks passed
@azhou-determined azhou-determined deleted the profiling-v2 branch March 26, 2024 22:07
azhou-determined added a commit that referenced this pull request Mar 27, 2024
optimize migrations on metrics table (landed as part of #9032)
dai-release bot pushed a commit that referenced this pull request Mar 27, 2024
optimize migrations on metrics table (landed as part of #9032)

(cherry picked from commit a07f0fb)