- Set tracking URI
- Create experiment
- Create train and test sets
- Train while logging metrics and artifacts.
    - The train script should use MLflow to log the model hyperparameter (HP) values and errors, and to save D_init, W and the embedding at the end of each epoch
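
A minimal sketch of the per-epoch logging described above, not the actual train script. It assumes the tracker is passed in as an object (the `mlflow` module itself offers `log_params`, `log_metric` and `log_artifact`), and that `train_one_epoch` is a hypothetical callable returning the epoch error plus a dict holding D_init, W and the embedding as JSON-serialisable lists. The function name, file names and the `checkpoints` artifact folder are illustrative assumptions.

```python
import json
import os
import tempfile


def run_training(hparams, n_epochs, tracker, train_one_epoch):
    """Run n_epochs of training, logging to `tracker` after every epoch.

    tracker: any object exposing log_params, log_metric and log_artifact
             with MLflow's signatures (the mlflow module itself qualifies).
    train_one_epoch: hypothetical callable, epoch -> (error, state), where
             state is a dict of JSON-serialisable values keyed by name.
    """
    tracker.log_params(hparams)  # hyperparameters are logged once per run
    out_dir = tempfile.mkdtemp(prefix="checkpoints_")
    for epoch in range(n_epochs):
        error, state = train_one_epoch(epoch)
        tracker.log_metric("error", error, step=epoch)
        # Save D_init, W and the embedding for this epoch as one artifact.
        path = os.path.join(out_dir, f"state_epoch_{epoch}.json")
        with open(path, "w") as fh:
            json.dump(state, fh)
        tracker.log_artifact(path, artifact_path="checkpoints")
    return out_dir


# Usage with MLflow (assumes a tracking URI and experiment are already set):
#   import mlflow
#   with mlflow.start_run():
#       run_training(hparams, n_epochs, tracker=mlflow, train_one_epoch=step_fn)
```

Passing the tracker in as an argument keeps the loop runnable without a tracking server, which may help when checking that logging works before submitting a remote run.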

TODO:
- Modify train.py if necessary
- Make sure train.py runs and logs parameters with the current formulation
- Run with different sets of hyperparameters
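
For the train.py items above, one possible sketch of the argument handling is shown here. It assumes train.py takes its hyperparameters as command-line flags named `--layers`, `--hidden_size`, `--iterations` and `--batch_size` (the names used later in this notebook); the defaults and the helper names `parse_args` and `as_hparams` are assumptions, not the existing script.

```python
import argparse


def parse_args(argv=None):
    """Parse the four hyperparameters passed to the training script."""
    parser = argparse.ArgumentParser(description="Train with logged hyperparameters")
    # Defaults mirror the values used in the single submitted run; an assumption.
    parser.add_argument("--layers", type=int, default=2)
    parser.add_argument("--hidden_size", type=int, default=128)
    parser.add_argument("--iterations", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=8)
    return parser.parse_args(argv)


def as_hparams(args):
    """Convert parsed arguments to a plain dict, e.g. for mlflow.log_params."""
    return dict(vars(args))
```

Calling `as_hparams(parse_args())` at the start of training gives a single dict that can be logged once per run, so each hyperparameter set tried is recorded next to its metrics.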

In [1]:
import sys, os
import mlflow
import mlflow.azureml

import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()

print("SDK version:", azureml.core.VERSION)
print("MLflow version:", mlflow.version.VERSION)
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

SDK version: 1.28.0
MLflow version: 1.17.0
reactionmodelling
researchproj
uksouth
4ba7b086-969d-41c4-a647-2784cde6af4b


In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# reuse the cluster if it already exists, otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace = ws, name = cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [3]:
# set tracking URI
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

# create mlflow experiment
experiment_name = "train-tsgen"
mlflow.set_experiment(experiment_name)

# create backend config object
backend_config = {"COMPUTE": "cpucluster", "USE_CONDA": False}

In [4]:
# submit run
remote_mlflow_run = mlflow.projects.run(uri=".", 
                                    parameters={"layers": 2, "hidden_size": 128, "iterations": 3, "batch_size": 8},
                                    backend = "azureml",
                                    backend_config = backend_config,
                                    synchronous=True)

Class AzureMLProjectBackend: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
2021/05/26 06:42:49 INFO mlflow.projects.utils: === Created directory /tmp/tmp1s81niet for downloading remote URIs passed to arguments of type 'path' ===
No Python version provided, defaulting to "3.6.2"
'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.
Submitting /mnt/batch/tasks/shared/LS_root/mounts/clusters/cpu-core2-ram8/code/Users/rmhavij/ts_gen/azure-experiments directory for run. The size of the directory >= 25 MB, so it can take a few minutes.
Class AzureMLSubmittedRun: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Submitted Run failed with exception ActivityFailedException:
	Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "Im

RunId: train-tsgen_1622011371_4cc25371
Web View: https://ml.azure.com/runs/train-tsgen_1622011371_4cc25371?wsid=/subscriptions/4ba7b086-969d-41c4-a647-2784cde6af4b/resourcegroups/ResearchProj/workspaces/ReactionModelling&tid=1faf88fe-a998-4c5b-93c9-210a11d9a5c2

Streaming azureml-logs/20_image_build_log.txt

2021/05/26 06:43:25 Downloading source code...
2021/05/26 06:43:26 Finished downloading source code
2021/05/26 06:43:27 Creating Docker network: acb_default_network, driver: 'bridge'
2021/05/26 06:43:27 Successfully set up Docker network: acb_default_network
2021/05/26 06:43:27 Setting up Docker configuration...
2021/05/26 06:43:28 Successfully set up Docker configuration
2021/05/26 06:43:28 Logging in to registry: a27013b3703d4978a7b549edc2126d58.azurecr.io
2021/05/26 06:43:29 Successfully logged into a27013b3703d4978a7b549edc2126d58.azurecr.io
2021/05/26 06:43:29 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2021/05/26 06

ExecutionException: Run (ID 'train-tsgen_1622011371_4cc25371') failed

In [None]:
# view metrics and artifacts in your workspace
# (`run` is not defined in the cells above; these calls are methods of the azureml.core.Run for the submitted job)
run.get_metrics()

# once the run is complete, register the model folder it produced;
# this includes the MLmodel file, model.pkl and conda.yaml
run.register_model(model_name = 'my-model', model_path = 'model')

# then view the registered model in the workspace with AML studio

In [11]:
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform

In [12]:
ps = RandomParameterSampling(
    {
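        '--batch_size': choice(), # empty: needs candidate values (the submitted run used 8)
        '--hidden_size': choice(), # empty: needs candidate values (the submitted run used 128)
        '--layers': choice(), # empty: needs candidate values (the submitted run used 2)
        '--iterations': choice() # empty: needs candidate values (the submitted run used 3)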
    }
)

# BanditPolicy checks each job every `evaluation_interval` iterations and terminates it
# if its primary metric falls outside the `slack_factor` range of the best-performing run
early_term_policy = BanditPolicy(evaluation_interval = 2, slack_factor = 0.1)

# HyperDriveConfig (`est` must be defined beforehand; it is not created in this notebook)
hdc = HyperDriveConfig(estimator = est, hyperparameter_sampling = ps, 
                       policy = early_term_policy, primary_metric_name = 'Accuracy',
                       primary_metric_goal = PrimaryMetricGoal.MAXIMIZE, 
                       max_total_runs = 20, max_concurrent_runs = 4)


HyperDriveConfigException: HyperDriveConfigException:
	Message: Please specify an input for choice.
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Please specify an input for choice.",
        "details_uri": "https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py",
        "target": "options",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "ArgumentBlankOrEmpty"
            }
        }
    }
}
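
The error above is raised because every `choice()` call was left empty; each one needs at least one candidate value. The sketch below illustrates the idea in plain Python rather than with the Azure ML SDK: a search space with non-empty option lists, a check that flags empty ones, and a random draw of one configuration. The candidate values are hypothetical, apart from the fact that each list includes the value used in the earlier single run (8, 128, 2, 3); `validate_space` and `sample` are illustrative helpers, not SDK functions.

```python
import random

# Candidate values are hypothetical; each list includes the value from the
# earlier single run (batch_size 8, hidden_size 128, layers 2, iterations 3).
SEARCH_SPACE = {
    "--batch_size": [8, 16, 32],
    "--hidden_size": [64, 128, 256],
    "--layers": [1, 2, 3],
    "--iterations": [3, 5, 10],
}


def validate_space(space):
    """Raise ValueError for any parameter that has no candidate values."""
    empty = [name for name, options in space.items() if len(options) == 0]
    if empty:
        raise ValueError(f"Please specify options for: {', '.join(empty)}")


def sample(space, rng):
    """Draw one configuration at random from the discrete search space."""
    validate_space(space)
    return {name: rng.choice(options) for name, options in space.items()}


# With the Azure ML SDK the same space could be written as, for example:
#   RandomParameterSampling(
#       {name: choice(*options) for name, options in SEARCH_SPACE.items()}
#   )
```

For instance, `sample(SEARCH_SPACE, random.Random(0))` returns one dict with a value for each of the four flags, which is roughly what each HyperDrive child run would receive as its arguments.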