# Set up Azure ML and submit a training run

## Set up Git
First, set up the git repository. Our Data Scientists work with two git repositories: one on [GitHub](https://github.com/Welthungerhilfe/cgm-ml) and one in [Azure DevOps](https://dev.azure.com/cgmwhh/ChildGrowthMonitor/_git/cgm-ml-service).
Clone the Azure DevOps repository by running the following cell. If you do not yet have access to the Azure DevOps repository "cgm-ml-service", contact one of our Azure DevOps Project Admins: Ankit, Sanket or Markus.

In [None]:
!git clone https://cgmwhh@dev.azure.com/cgmwhh/ChildGrowthMonitor/_git/cgm-ml-service

## Connect to Azure ML
Next, connect to the Azure ML Workspace. The cell below authenticates explicitly against the WHH tenant and passes the subscription ID, which is required if you were invited as a Guest (external). If your account is registered as a member in the Azure Active Directory, the commented-out `Workspace.from_config()` line is sufficient.

In [1]:
from azureml.core import Workspace
# Users from the WHH tenant can simply use:
# workspace = Workspace.from_config()

# The following code is only needed to specify the tenant we are authenticating against.

from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication(tenant_id="006dabd7-456d-465b-a87f-f7d557e319c8")
workspace = Workspace(subscription_id="9b82ecea-6780-4b85-8acf-d27d79028f07",
                      resource_group="cgm-ml-prod",
                      workspace_name="cgm-azureml-prod",
                      auth=interactive_auth)
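
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file that it looks for in the current directory or in a `.azureml` subfolder. The following is a minimal sketch of how such a file could be created; the helper name `write_workspace_config` is hypothetical, and the values in the usage comment are the ones already used in this notebook.

```python
import json
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write <directory>/.azureml/config.json and return its path.

    Hypothetical helper for illustration; it only writes the three keys
    that Workspace.from_config() needs to locate the workspace.
    """
    config_dir = Path(directory) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


# Usage (not executed here):
# write_workspace_config(".", "9b82ecea-6780-4b85-8acf-d27d79028f07",
#                        "cgm-ml-prod", "cgm-azureml-prod")
# workspace = Workspace.from_config()
```

Guest users can still pass `auth=interactive_auth` to `Workspace.from_config()` so that the login targets the right tenant.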
              

## Access Data in Azure ML
Now we will access our data. In the Azure ML prod Workspace you have access to anonymized data only. The rgb and pcd data from the Storage Account and the training targets from the PostgreSQL database are both registered as Datasets in Azure ML.
A list of available Datasets can be seen [here](https://ml.azure.com/data?wsid=/subscriptions/9b82ecea-6780-4b85-8acf-d27d79028f07/resourcegroups/cgm-ml-prod/workspaces/cgm-azureml-prod&tid=006dabd7-456d-465b-a87f-f7d557e319c8).
The code snippet under "Consume" connects you to the Workspace and downloads the Dataset to the target path. This might take a while.

First, update your azureml packages if needed. The following cell lists all outdated azureml packages and upgrades them:

In [2]:
!pip list --outdated --format=freeze | grep '^azureml' | cut -d = -f 1 | xargs -n1 pip install -U

Collecting azureml-accel-models
  Downloading azureml_accel_models-1.13.0-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 456 kB/s  eta 0:00:01
Collecting azureml-core~=1.13.0
  Downloading azureml_core-1.13.0-py3-none-any.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 9.4 MB/s eta 0:00:01
Installing collected packages: azureml-core, azureml-accel-models
  Attempting uninstall: azureml-core
    Found existing installation: azureml-core 1.12.0.post1
    Uninstalling azureml-core-1.12.0.post1:
      Successfully uninstalled azureml-core-1.12.0.post1
  Attempting uninstall: azureml-accel-models
    Found existing installation: azureml-accel-models 1.12.0
    Uninstalling azureml-accel-models-1.12.0:
      Successfully uninstalled azureml-accel-models-1.12.0
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend

Now you should be able to load a Tabular Dataset into a pandas DataFrame. You may get an SSL error if your packages are not up to date.
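
Since outdated packages are a likely cause of that error, it can help to check the installed versions before loading the data. The sketch below compares them against the minimum versions named in the comments of the loading cell (1.0.72 for `azureml-core`, 1.1.34 for `azureml-dataprep`). The helper names are hypothetical, and the version parsing is deliberately simplified: it only looks at the leading numeric components and ignores suffixes such as `post1`.

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "post1" or "0rc1"
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that "1.13" equals "1.13.0"
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Return (installed_version_or_None, ok) for an installed distribution."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, meets_minimum(installed, minimum)


for name, minimum in [("azureml-core", "1.0.72"), ("azureml-dataprep", "1.1.34")]:
    installed, ok = check_package(name, minimum)
    print(f"{name}: installed={installed}, required>={minimum}, ok={ok}")
```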

In [None]:
# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = '9b82ecea-6780-4b85-8acf-d27d79028f07'
resource_group = 'cgm-ml-prod'
workspace_name = 'cgm-azureml-prod'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='measure_table')
dataset.to_pandas_dataframe()

There are two types of Datasets in Azure ML: File Datasets and Tabular Datasets. The example above loads a Tabular Dataset (`measure_table`). The File Datasets hold the anonymized rgb and pcd data from the prod Storage Account, and the Tabular Datasets provide the training targets for the File Datasets you will be training your neural nets with.
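
The two Dataset types are consumed differently: a Tabular Dataset offers `to_pandas_dataframe()`, while a File Dataset is downloaded (or mounted) as files. The following sketch shows one way to handle both with a single function. The name `load_dataset` and the default target folder `data` are hypothetical choices for illustration.

```python
def load_dataset(dataset, target_path="data"):
    """Load a Dataset according to its type.

    Returns a pandas DataFrame for a Tabular Dataset, or the list of
    downloaded file paths for a File Dataset.
    """
    if hasattr(dataset, "to_pandas_dataframe"):
        # Tabular Dataset: rows and columns, e.g. the training targets.
        return dataset.to_pandas_dataframe()
    # File Dataset: download the referenced files (e.g. rgb and pcd data).
    return dataset.download(target_path=target_path, overwrite=True)


# Usage (not executed here):
# dataset = Dataset.get_by_name(workspace, name="measure_table")
# df = load_dataset(dataset)
```

Downloading a large File Dataset can take a while, so for training on a remote Compute Target it is often preferable to pass the Dataset as an input to the run instead.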

## Training on Azure ML
Training on Azure ML is normally performed on remote Compute Targets, which you can create either via the Python SDK or from the Azure ML Portal. You create an Environment with the dependencies for your training and push it to the remote Compute Target, where the training is performed. The output you specify in your training script is uploaded to the Azure ML Service, where you can access it after the training has finished.

The following cell checks whether the Compute Target "cpu-cluster" already exists in the workspace and creates it if it does not. The variable `compute_target` holds the ComputeTarget object representing the remote Compute Target.

In [7]:
from azureml.core import Experiment
from azureml.core import Workspace, Run

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"

try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_D2_v2',
        max_nodes=1)
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
compute_target

Found existing compute target


AmlCompute(workspace=Workspace.create(name='cgm-azureml-prod', subscription_id='9b82ecea-6780-4b85-8acf-d27d79028f07', resource_group='cgm-ml-prod'), name=cpu-cluster, id=/subscriptions/9b82ecea-6780-4b85-8acf-d27d79028f07/resourceGroups/cgm-ml-prod/providers/Microsoft.MachineLearningServices/workspaces/cgm-azureml-prod/computes/cpu-cluster, type=AmlCompute, provisioning_state=Succeeded, location=westeurope, tags=None)

## Create an Experiment
The Experiment in Azure ML is the object that holds your training Runs. Each time you submit a training to a remote Compute Target, it is encapsulated in a Run object. That object holds information about the Environment, Dataset and scripts used for training, and it also holds your output.

In [9]:
from azureml.core import Experiment
experiment_name = "My-first-Experiment"
experiment = Experiment(workspace=workspace, name=experiment_name)
experiment

| Name | Workspace | Report Page | Docs Page |
| --- | --- | --- | --- |
| My-first-Experiment | cgm-azureml-prod | Link to Azure Machine Learning studio | Link to Documentation |


## Get Data for Training

Now that you have a Compute Target, you also need a Dataset to train on. You can select your Dataset by name through the Workspace object like this:

In [10]:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = '9b82ecea-6780-4b85-8acf-d27d79028f07'
resource_group = 'cgm-ml-prod'
workspace_name = 'cgm-azureml-prod'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='measure_table')


## Submit a Training
Next, specify the conda and pip packages your training script depends on. You could create and register the Environment yourself, but for now we will use the Estimator object to submit the training. We pass the dependencies to the Estimator so that Azure ML can check the existing Environments and either select one that includes all your dependencies or create a new Environment from them.
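
For the alternative mentioned above, creating and registering the Environment yourself, one option is to describe the dependencies in a conda specification file. The sketch below only writes such a file. The helper name `write_conda_spec`, the environment name `training-env`, the Python version and the inclusion of `azureml-defaults` are assumptions made for illustration.

```python
from pathlib import Path


def write_conda_spec(path, pip_packages, python_version="3.7"):
    """Write a minimal conda environment file and return its path.

    The environment name and Python version are illustrative assumptions.
    """
    lines = [
        "name: training-env",
        "dependencies:",
        f"  - python={python_version}",
        "  - pip",
        "  - pip:",
    ]
    # Each pip package becomes one entry in the nested pip list.
    lines += [f"    - {package}" for package in pip_packages]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n")
    return path


# Usage (not executed here). azureml-defaults is added on the assumption
# that the remote run needs the Azure ML runtime packages.
# write_conda_spec("environment.yml",
#                  ["azureml-defaults", "azureml-dataprep[fuse,pandas]",
#                   "scikit-learn", "joblib"])
# env = Environment.from_conda_specification(name="training-env",
#                                            file_path="environment.yml")
# env.register(workspace)
```

The Estimator used in the rest of this section takes care of this step for you, so the file is only needed if you want to manage the Environment explicitly.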

In [12]:
!mkdir -p code

In [21]:
%%writefile ./code/train.py

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
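from azureml.core import Run

import math
import os
import joblib

# Make sure the local output folder exists before any model is written to it.
os.makedirs("outputs", exist_ok=True)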

run = Run.get_context()

print("Running in online mode...")
experiment = run.experiment
workspace = experiment.workspace
dataset_ref = run.input_datasets["dataset"]




x_df = dataset_ref.to_pandas_dataframe()[['weight', 'height', 'muac']].dropna()
y_df = x_df.pop("muac")

X_train, X_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)



alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

for alpha in alphas:
    # Track each alpha in its own child run instead of overwriting the submitted run.
    child_run = run.child_run(name="alpha_" + str(alpha))
    child_run.log("alpha_value", alpha)

    model = Ridge(alpha=alpha)
    model.fit(X=X_train, y=y_train)
    y_pred = model.predict(X=X_test)
    rmse = math.sqrt(mean_squared_error(y_true=y_test, y_pred=y_pred))
    child_run.log("rmse", rmse)

    model_name = "model_alpha_" + str(alpha) + ".pkl"
    filename = "outputs/" + model_name

    # Save the fitted model locally and attach it to the child run.
    joblib.dump(value=model, filename=filename)
    child_run.upload_file(name=model_name, path_or_stream=filename)
    child_run.complete()


Overwriting ./code/train.py


In the Estimator object you specify the training script and the source directory that holds all scripts for the training.

In [22]:
from azureml.train.estimator import Estimator

pip_packages = [
    "azureml-dataprep[fuse,pandas]",
    "glob2",
    "scikit-learn",
    "joblib"
]

# Create the estimator.
estimator = Estimator(
    source_directory="./code",
    compute_target=compute_target,
    entry_script="train.py",
    inputs=[dataset.as_named_input("dataset")],
    pip_packages=pip_packages
)

# Set compute target.
estimator.run_config.target = compute_target

# Run the experiment.
run = experiment.submit(estimator)

# Show outputs.
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…
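
Once the training has finished, the logged metrics can be compared to find the alpha with the lowest error. The following is a minimal sketch that walks over the runs of the Experiment, including child runs, and picks the one with the smallest `rmse`. The function name `best_run` is hypothetical.

```python
def best_run(experiment, metric="rmse"):
    """Return (run, value) for the run with the lowest logged metric.

    Returns None if no run has logged a single numeric value for `metric`.
    """
    best = None
    for run in experiment.get_runs(include_children=True):
        value = run.get_metrics().get(metric)
        # Skip runs without the metric or with a list of repeated values.
        if not isinstance(value, (int, float)):
            continue
        if best is None or value < best[1]:
            best = (run, value)
    return best


# Usage (not executed here):
# result = best_run(experiment)
# if result is not None:
#     run, rmse = result
#     print(run.id, rmse, run.get_metrics().get("alpha_value"))
```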