Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Manage runs

## Table of contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Start, monitor and complete a run](#Start,-monitor-and-complete-a-run)
1. [Add properties and tags](#Add-properties-and-tags)
1. [Query properties and tags](#Query-properties-and-tags)
1. [Start and query child runs](#Start-and-query-child-runs)
1. [Cancel or fail runs](#Cancel-or-fail-runs)
1. [Reproduce a run](#Reproduce-a-run)
1. [Next steps](#Next-steps)

## Introduction

When you're building enterprise-grade machine learning models, it is important to track, organize, monitor and reproduce your training runs. For example, you might want to trace the lineage behind a model deployed to production, and re-run the training experiment to troubleshoot issues. 

This notebook shows examples of how to use the Azure Machine Learning service to manage your training runs.

## Setup

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, if you haven't already, go through the [configuration](../../../configuration.ipynb) notebook first to establish your connection to the Azure ML workspace. If you're new to Azure ML, we also recommend that you go through [the tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml) first to learn the basic concepts.

Let's first import the required packages, check the Azure ML SDK version, connect to your workspace, and create an Experiment to hold the runs.

In [1]:
import azureml.core
from azureml.core import Workspace, Experiment, Run
from azureml.core import ScriptRunConfig

print(azureml.core.VERSION)

1.18.0
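If later cells depend on a minimum SDK version, a small check can fail early instead of partway through the notebook. The following is a minimal sketch, assuming plain dotted numeric version strings like the one printed above. The names `version_tuple` and `MINIMUM` are hypothetical, the minimum value is chosen only for illustration, and pre-release suffixes are not handled.

```python
def version_tuple(version):
    """Convert a dotted numeric version string such as '1.18.0' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


MINIMUM = "1.0.0"  # hypothetical minimum, chosen only for illustration
installed = "1.18.0"  # in the notebook this would be azureml.core.VERSION

# Tuples compare element by element, so (1, 18, 0) > (1, 9, 0) as expected.
if version_tuple(installed) < version_tuple(MINIMUM):
    raise RuntimeError(f"SDK {installed} is older than required {MINIMUM}")
print("SDK version OK:", installed)
```

Comparing integer tuples avoids the trap of comparing the raw strings, where "1.9.0" would sort after "1.18.0".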


In [2]:
ws = Workspace.from_config()

In [3]:
exp = Experiment(workspace=ws, name="explore-runs")

## Start, monitor and complete a run

A run is a unit of execution, typically to train a model, but also for other purposes such as loading or transforming data. Runs are tracked by the Azure ML service and can be instrumented with metrics and artifact logging.

The simplest way to start a run in your interactive Python session is to call the *Experiment.start_logging* method. You can then log metrics from within the run.

In [4]:
notebook_run = exp.start_logging()

notebook_run.log(name="message", value="Hello from run!")

print(notebook_run.get_status())

Running


Use the *get_status* method to get the status of the run.

In [5]:
print(notebook_run.get_status())

Running


You can also evaluate the run object on its own in a cell to display a summary with a link to its details page in Azure Machine Learning studio.

In [6]:
notebook_run

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| explore-runs | ccd7d9a0-8fff-4591-aee8-576e5a778e22 | | Running | Link to Azure Machine Learning studio | Link to Documentation |


The *get_details* method returns more details about the run.

In [7]:
notebook_run.get_details()

{'runId': 'ccd7d9a0-8fff-4591-aee8-576e5a778e22',
 'target': 'local',
 'status': 'Running',
 'startTimeUtc': '2020-11-14T07:48:35.052863Z',
 'properties': {'ContentSnapshotId': '5888d6a8-3fb3-4558-9bd7-1b3d2e91638e'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {}}

Use the *complete* method to end the run.

In [8]:
notebook_run.complete()
print(notebook_run.get_status())

Completed


You can also use Python's *with...as* pattern. The run completes automatically when execution leaves the *with* block, so you don't need to complete it manually.

In [9]:
with exp.start_logging() as notebook_run:
    notebook_run.log(name="message", value="Hello from run!")
    print("Is it still running?", notebook_run.get_status())

print("Has it completed?", notebook_run.get_status())

Is it still running? Running
Has it completed? Completed
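To see why the run above completes without an explicit call, it helps to look at Python's context manager protocol. The following is a toy sketch only: `ToyRun` is a hypothetical stand-in, not the SDK's `Run` class. It tracks nothing but a status string, and it does not model how the SDK reacts to an exception raised inside the block.

```python
class ToyRun:
    """Toy stand-in for a run; it only tracks a status string."""

    def __init__(self):
        self.status = "Running"

    def get_status(self):
        return self.status

    def complete(self):
        self.status = "Completed"

    def __enter__(self):
        # The value returned here is bound to the name after "as".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Python calls this when the with block ends.
        self.complete()
        return False  # do not swallow exceptions


with ToyRun() as toy_run:
    inside = toy_run.get_status()

print(inside, "->", toy_run.get_status())
```

The status changes as soon as the block ends because Python calls `__exit__` at that point, which is what removes the need for a manual `complete()` call.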


Next, let's look at submitting a run as a separate Python process. To keep the example simple, we submit the run on the local computer. Other targets could include remote VMs and Machine Learning Compute clusters in your Azure ML workspace.

We use the *hello.py* script as an example. To perform logging, we need to get a reference to the Run instance from within the scope of the script. We do this using the *Run.get_context* method.

In [10]:
!more hello.py

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

from azureml.core import Run

submitted_run = Run.get_context()
submitted_run.log(name="message", value="Hello from run!")


Submitted runs take a snapshot of the *source_directory* to use when executing. You can control which files are available to the run by using an *.amlignore* file.

In [11]:
%%writefile .amlignore
# Exclude the outputs directory automatically created by our earlier runs.
/outputs

Writing .amlignore
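An *.amlignore* file uses the same pattern syntax as *.gitignore*. To illustrate how a pattern such as `/outputs` narrows the snapshot, here is a deliberately simplified sketch. It is not the SDK's implementation, and the helper names `load_patterns` and `is_ignored` are hypothetical. It supports only two cases: patterns with a leading slash, anchored at the root of the source directory, and unanchored glob patterns matched against each path component.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def load_patterns(text):
    """Drop blank lines and # comments from ignore-file text."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(path, patterns):
    """Return True if a relative POSIX path matches any simplified ignore pattern."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.startswith("/"):
            # Anchored pattern: compare against the leading path components.
            anchored = PurePosixPath(pattern.lstrip("/")).parts
            if parts[: len(anchored)] == anchored:
                return True
        elif any(fnmatch(part, pattern) for part in parts):
            # Unanchored pattern: match any single path component.
            return True
    return False


patterns = load_patterns("# Exclude the outputs directory\n/outputs\n")
files = ["hello.py", "outputs/model.pkl", "src/outputs/notes.txt"]
kept = [f for f in files if not is_ignored(f, patterns)]
print(kept)
```

In this sketch, the leading slash means only the top-level `outputs` directory is excluded, while a nested `src/outputs` directory is kept.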


Let's submit the run on a local computer. A standard pattern in the Azure ML SDK is to create a run configuration, and then use the *Experiment.submit* method.

In [12]:
run_config = ScriptRunConfig(source_directory='.', script='hello.py')

local_script_run = exp.submit(run_config)

You can view the status of the run as before.

In [13]:
print(local_script_run.get_status())
local_script_run

Preparing


| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| explore-runs | explore-runs_1605340191_c5915ce9 | azureml.scriptrun | Preparing | Link to Azure Machine Learning studio | Link to Documentation |


Submitted runs have additional log files you can inspect using *get_details_with_logs*.

In [14]:
local_script_run.get_details_with_logs()

{'runId': 'explore-runs_1605340191_c5915ce9',
 'target': 'local',
 'status': 'Preparing',
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': 'e6566ec7-cd07-46ee-b568-f015d4871e2d'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'hello.py',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'local',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'environment': {'name': 'Experiment explore-runs Environment',
   'version': 'Autosave_2020-11-14T07:49:52Z_64c97215',
   'python': {'interpreterPath': 'python',
    'userManagedDependencies': False,
    'condaDependencies': {'channels': ['anaconda', 'conda-forge'],
     'dependencies': ['python=3.6.2', {'pip': ['azureml-defaults']}],
     'name': 'azureml_da3e97fcb51801118b8e80207f3e01ad'},
  

Use the *wait_for_completion* method to block local execution until the submitted run is complete.

In [15]:
local_script_run.wait_for_completion(show_output=True)
print(local_script_run.get_status())

RunId: explore-runs_1605340191_c5915ce9
Web View: https://ml.azure.com/experiments/explore-runs/runs/explore-runs_1605340191_c5915ce9?wsid=/subscriptions/888519c8-2387-461a-aff3-b31b86e2438e/resourcegroups/aml-quickstarts-126067/workspaces/quick-starts-ws-126067

Streaming azureml-logs/60_control_log.txt

[2020-11-14T07:49:56.250529] Using urllib.request Python 3.0 or later
Streaming log file azureml-logs/60_control_log.txt
Running: ['/bin/bash', '/tmp/azureml_runs/explore-runs_1605340191_c5915ce9/azureml-environment-setup/conda_env_checker.sh']
Starting the daemon thread to refresh tokens in background for process with pid = 10184
Materialized conda environment not found on target: /home/azureuser/.azureml/envs/azureml_da3e97fcb51801118b8e80207f3e01ad


[2020-11-14T07:49:56.366451] Logging experiment preparation status in history service.
Running: ['/bin/bash', '/tmp/azureml_runs/explore-runs_1605340191_c5915ce9/azureml-environment-setup/conda_env_builder.sh']
Running: ['conda', '--ve
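If you would rather poll with your own interval and timeout than block on *wait_for_completion*, a loop over *get_status* is one option. The following is a minimal sketch under stated assumptions: the set of terminal statuses contains only the names that appear in this notebook, `wait_for_terminal_status` is a hypothetical helper, and `StubRun` is a stand-in that replays fixed statuses so the example runs without a workspace.

```python
import time

# Terminal statuses that appear in this notebook (an assumption of this sketch).
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(run, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll run.get_status() until a terminal status or the timeout is reached."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = run.get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


class StubRun:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


final_status = wait_for_terminal_status(
    StubRun(["Preparing", "Running", "Completed"]), poll_seconds=0
)
print(final_status)
```

With a real run you would pass the object returned by *Experiment.submit* and choose a non-zero polling interval, since each status check presumably involves a call to the service.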

## Add properties and tags

Properties and tags help you organize your runs. You can use them to describe, for example, who authored the run, what the results were, and what machine learning approach was used. As you'll see later, properties and tags can also be used to query the history of your runs to find the important ones.

For example, let's add an "author" property to the run:

In [16]:
local_script_run.add_properties({"author":"azureml-user"})
print(local_script_run.get_properties())

{'_azureml.ComputeTargetType': 'local', 'ContentSnapshotId': 'e6566ec7-cd07-46ee-b568-f015d4871e2d', 'author': 'azureml-user'}


Properties are immutable. Once you assign a value it cannot be changed, making them useful as a permanent record for auditing purposes.

In [17]:
try:
    local_script_run.add_properties({"author":"different-user"})
except Exception as e:
    print(e)

ServiceException:
	Code: 400
	Message: (UserError) Cannot modify existing values in Properties
	Details:

	Headers: {
	    "Date": "Sat, 14 Nov 2020 07:51:31 GMT",
	    "Content-Type": "application/json; charset=utf-8",
	    "Content-Length": "582",
	    "Connection": "keep-alive",
	    "Request-Context": "appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d",
	    "x-ms-response-type": "error",
	    "x-ms-client-request-id": "876d8b53-0283-4174-a5b2-82fe78fb1805",
	    "x-ms-client-session-id": "",
	    "X-Content-Type-Options": "nosniff",
	    "x-request-time": "0.063",
	    "Strict-Transport-Security": "max-age=15724800; includeSubDomains; preload"
	}
	InnerException: {
    "additional_properties": {},
    "error": {
        "additional_properties": {
            "debugInfo": null
        },
        "code": "UserError",
        "severity": null,
        "message": "Cannot modify existing values in Properties",
        "message_format": null,
        "message_parameters": null,
        

Tags, on the other hand, can be changed:

In [18]:
local_script_run.tag("quality", "great run")
print(local_script_run.get_tags())

{'quality': 'great run'}


In [19]:
local_script_run.tag("quality", "fantastic run")
print(local_script_run.get_tags())

{'quality': 'fantastic run'}


You can also add a simple string tag. It appears in the tag dictionary with a value of None.

In [20]:
local_script_run.tag("worth another look")
print(local_script_run.get_tags())

{'quality': 'fantastic run', 'worth another look': None}
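The difference between write-once properties and editable tags can be summarized with a toy model. This is only a sketch of the behavior shown above, not how the service stores metadata: `ToyRunMetadata` is a hypothetical class, and rejecting any already-used property key (even with an unchanged value) is an assumption of the sketch.

```python
class ToyRunMetadata:
    """Toy model: write-once properties and freely editable tags."""

    def __init__(self):
        self._properties = {}
        self._tags = {}

    def add_properties(self, properties):
        # Validate every key first so a rejected call changes nothing.
        for key in properties:
            if key in self._properties:
                raise ValueError(f"Cannot modify existing property {key!r}")
        self._properties.update(properties)

    def tag(self, key, value=None):
        # Overwriting an existing tag is allowed.
        self._tags[key] = value

    def get_properties(self):
        return dict(self._properties)

    def get_tags(self):
        return dict(self._tags)


meta = ToyRunMetadata()
meta.add_properties({"author": "azureml-user"})

rejected = False
try:
    meta.add_properties({"author": "different-user"})
except ValueError as error:
    rejected = True
    print(error)

meta.tag("quality", "great run")
meta.tag("quality", "fantastic run")
meta.tag("worth another look")
print(meta.get_properties(), meta.get_tags())
```

A practical consequence of this split is to record facts that must not drift, such as who authored a run, as properties, and to keep evolving judgments, such as run quality, as tags.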


## Query properties and tags

You can query runs within an experiment that match specific properties and tags.

In [21]:
list(exp.get_runs(properties={"author":"azureml-user"},tags={"quality":"fantastic run"}))

[Run(Experiment: explore-runs,
 Id: explore-runs_1605340191_c5915ce9,
 Type: azureml.scriptrun,
 Status: Completed)]

In [22]:
list(exp.get_runs(properties={"author":"azureml-user"},tags="worth another look"))

[Run(Experiment: explore-runs,
 Id: explore-runs_1605340191_c5915ce9,
 Type: azureml.scriptrun,
 Status: Completed)]
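If you already hold run objects in memory, a similar filter can be applied locally with *get_properties* and *get_tags*. The following is a sketch under stated assumptions: a dictionary requires each key to be present with an equal value, and a bare string tag matches when that tag key is present. This mirrors the two queries above but is not confirmed to be identical to the service's matching rules. The names `matches` and `StubRun` are hypothetical.

```python
def matches(run, properties=None, tags=None):
    """Return True if a run-like object carries the requested properties and tags."""
    run_properties = run.get_properties()
    run_tags = run.get_tags()
    for key, value in (properties or {}).items():
        if key not in run_properties or run_properties[key] != value:
            return False
    if isinstance(tags, str):
        # A bare string is treated as "this tag key is present".
        return tags in run_tags
    for key, value in (tags or {}).items():
        if key not in run_tags or run_tags[key] != value:
            return False
    return True


class StubRun:
    """Hypothetical stand-in exposing only what matches() needs."""

    def __init__(self, run_id, properties, tags):
        self.id = run_id
        self._properties = properties
        self._tags = tags

    def get_properties(self):
        return dict(self._properties)

    def get_tags(self):
        return dict(self._tags)


runs = [
    StubRun("run-a", {"author": "azureml-user"},
            {"quality": "fantastic run", "worth another look": None}),
    StubRun("run-b", {"author": "azureml-user"}, {"quality": "great run"}),
    StubRun("run-c", {}, {}),
]

by_value = [r.id for r in runs
            if matches(r, {"author": "azureml-user"}, {"quality": "fantastic run"})]
by_key = [r.id for r in runs
          if matches(r, {"author": "azureml-user"}, "worth another look")]
print(by_value, by_key)
```

Filtering through *Experiment.get_runs* remains preferable when the runs have not been fetched yet, because it avoids retrieving every run just to discard most of them.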

## Start and query child runs

You can use child runs to group together related runs, for example, different hyperparameter tuning iterations.

Let's use the *hello_with_children.py* script to create a batch of 5 child runs from within a submitted run.

In [23]:
!more hello_with_children.py

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

from azureml.core import Run

run = Run.get_context()

child_runs = run.create_children(count=5)
for c, child in enumerate(child_runs):
    child.log(name="Hello from child run ", value=c)
    child.complete()


In [24]:
run_config = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

local_script_run = exp.submit(run_config)
local_script_run.wait_for_completion(show_output=True)
print(local_script_run.get_status())

RunId: explore-runs_1605340334_8879f805
Web View: https://ml.azure.com/experiments/explore-runs/runs/explore-runs_1605340334_8879f805?wsid=/subscriptions/888519c8-2387-461a-aff3-b31b86e2438e/resourcegroups/aml-quickstarts-126067/workspaces/quick-starts-ws-126067

Streaming azureml-logs/70_driver_log.txt

[2020-11-14T07:52:17.169713] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['hello_with_children.py'])
Script type = None
Starting the daemon thread to refresh tokens in background for process with pid = 11454
Entering Run History Context Manager.
Current directory:  /tmp/azureml_runs/explore-runs_1605340334_8879f805
Preparing to call script [ hello_with_children.py ] with arguments: []
After variable expansion, calling script [ hello_with_children.py ] with arguments: [

You can start child runs one by one. Note that this is less efficient than submitting a batch of runs, because each creation results in a network call.

Child runs also complete automatically when they move out of scope.

In [25]:
with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)

To query the child runs belonging to a specific parent, use the *get_children* method.

In [26]:
list(parent_run.get_children())

[Run(Experiment: explore-runs,
 Id: f078812b-6c3a-4ae9-bc2d-c75adfd635a9,
 Type: None,
 Status: Completed),
 Run(Experiment: explore-runs,
 Id: a087fe50-0887-481f-a91c-4fbcbeeac12c,
 Type: None,
 Status: Completed),
 Run(Experiment: explore-runs,
 Id: a143a947-758e-4ccf-bab9-d047323b377b,
 Type: None,
 Status: Completed),
 Run(Experiment: explore-runs,
 Id: 05ad4402-87a4-4a5e-ba36-7ed04a52155c,
 Type: None,
 Status: Completed),
 Run(Experiment: explore-runs,
 Id: 1bc6cbd4-292e-40c7-b80d-ed85543c2515,
 Type: None,
 Status: Completed)]
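With many children, a status summary is often easier to read than the full list. The following is a small sketch that counts run-like objects by the string returned from *get_status*. Here `summarize_statuses` is a hypothetical helper, and `StubChild` is a stand-in so the example runs without a workspace.

```python
from collections import Counter


def summarize_statuses(runs):
    """Count run-like objects by the string returned from get_status()."""
    return Counter(run.get_status() for run in runs)


class StubChild:
    """Hypothetical stand-in with a fixed status."""

    def __init__(self, status):
        self._status = status

    def get_status(self):
        return self._status


children = [StubChild("Completed") for _ in range(4)] + [StubChild("Failed")]
summary = summarize_statuses(children)
print(dict(summary))
```

In the notebook, the same helper could be called as `summarize_statuses(parent_run.get_children())`.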

## Cancel or fail runs

Sometimes, you realize that the run is not performing as intended, and you want to cancel it instead of waiting for it to complete.

As an example, let's use a Python script with a delay in the middle.

In [27]:
!more hello_with_delay.py

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import time

print("Wait for 10 seconds..")
time.sleep(10)
print("Done waiting")


You can use the *cancel* method to cancel a run.

In [28]:
run_config = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')

local_script_run = exp.submit(run_config)
print("Did the run start?", local_script_run.get_status())
local_script_run.cancel()
print("Did the run cancel?", local_script_run.get_status())

Did the run start? Running
Did the run cancel? Canceled
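The same call can be applied to several runs at once, for example to clean up everything that is still active. The following is a sketch under stated assumptions: the set of active statuses contains only the names that appear in this notebook (the service may report others), `cancel_active` is a hypothetical helper, and `StubRun` is a stand-in so the example runs without a workspace.

```python
# Active statuses seen in this notebook; the service may report others.
ACTIVE_STATUSES = {"Preparing", "Running"}


def cancel_active(runs):
    """Cancel every run-like object that is still active and return their ids."""
    canceled_ids = []
    for run in runs:
        if run.get_status() in ACTIVE_STATUSES:
            run.cancel()
            canceled_ids.append(run.id)
    return canceled_ids


class StubRun:
    """Hypothetical stand-in that flips its status when canceled."""

    def __init__(self, run_id, status):
        self.id = run_id
        self._status = status

    def get_status(self):
        return self._status

    def cancel(self):
        self._status = "Canceled"


stubs = [
    StubRun("run-1", "Running"),
    StubRun("run-2", "Completed"),
    StubRun("run-3", "Preparing"),
]
canceled = cancel_active(stubs)
print(canceled, [s.get_status() for s in stubs])
```

Checking the status first leaves runs that have already finished untouched.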


You can also mark an unsuccessful run as failed.

In [29]:
local_script_run = exp.submit(run_config)
local_script_run.fail()
print(local_script_run.get_status())

Failed


## Reproduce a run

When updating or troubleshooting a model deployed to production, you sometimes need to revisit the original training run that produced the model. To help you with this, the Azure ML service by default creates snapshots of your scripts at the time of run submission.

You can use *restore_snapshot* to obtain a zip package of the latest snapshot of the script folder. 

In [30]:
local_script_run.restore_snapshot(path="snapshots")

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/notebook126067/code/Users/odl_user_126067/manage-runs/snapshots/0af23864-c62d-4a8c-ba7d-d27f9e9ee3e8.zip'

You can then extract the zip package, examine the code, and submit your run again.
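The extraction step can be done with Python's standard `zipfile` module. The following is a minimal sketch: `extract_snapshot` is a hypothetical helper, and the demo builds a small stand-in archive in a temporary directory so that it runs without a workspace. In the notebook you would pass the path returned by *restore_snapshot* instead.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_snapshot(zip_path, destination):
    """Extract a snapshot zip and return the sorted relative paths of its files."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
    return sorted(
        p.relative_to(destination).as_posix()
        for p in destination.rglob("*")
        if p.is_file()
    )


# Demo with a stand-in archive so the sketch runs without a workspace.
with tempfile.TemporaryDirectory() as workdir:
    demo_zip = Path(workdir) / "snapshot.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("hello.py", "print('Hello from run!')\n")
    extracted = extract_snapshot(demo_zip, Path(workdir) / "restored")
    print(extracted)
```

After extraction, the restored folder can serve as the *source_directory* of a new *ScriptRunConfig* to submit the run again.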

## Next steps

 * To learn more about logging APIs, see the [logging API notebook](./logging-api/logging-api.ipynb).
 * To learn more about remote runs, see the [train on AML compute notebook](./train-on-amlcompute/train-on-amlcompute.ipynb).