Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Train, tune hyperparameters, and deploy with PyTorch Lightning

## Introduction:

## Prerequisites:
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* Install the Azure ML SDK
* Create a workspace and download its configuration file (`config.json`)

In [None]:
%matplotlib inline
import numpy as np
import os
import matplotlib.pyplot as plt

import azureml
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core import ScriptRunConfig
from azureml.core import Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Environment
from azureml.telemetry import set_diagnostics_collection
from azureml.widgets import RunDetails
from azureml.train.hyperdrive import RandomParameterSampling, BayesianParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, loguniform, uniform

In [None]:
# check core SDK version number
print('This notebook was confirmed to run with SDK version 1.19.0')
print("Azure ML SDK Version: ", azureml.core.VERSION)

set_diagnostics_collection(send_diagnostics=True)

## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [None]:
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

exp = Experiment(workspace=ws, name='ImageClassification-PyTorchLightning')

## Connect dataset
This notebook trains on the `cat_dogs` dataset, which must already be registered in the workspace via the Azure ML SDK or the portal. Retrieve it by name.

In [None]:
dataset = Dataset.get_by_name(ws, name='cat_dogs')

# Set training cluster - AmlCompute
You need a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. The code below retrieves an existing AmlCompute cluster named `gpucluster6` from the workspace.

In [None]:
compute_target = ComputeTarget(workspace=ws, name='gpucluster6')
compute_target

## Calculate the GPU count

This is useful for **multi-GPU** scenarios, such as the NC12 and NC24 series.

For the list of GPU VM sizes, see
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-gpu

In [None]:
import json
import re

comp_dict = compute_target.get_status().serialize()
vm_size = comp_dict['vmSize']
print(vm_size)

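def get_gpu_count(vm_size):
    # Heuristic for NC-series sizes (NC6/NC12/NC24): one GPU per 6 vCPUs.
    # The first number in the size name is the vCPU count.
    s = re.search(r'\d{1,2}', vm_size)
    if s is None:
        raise ValueError('No vCPU count found in vm_size: ' + vm_size)
    return int(s.group()) // 6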

n_gpu = get_gpu_count(vm_size)
print('gpu_count:' + str(n_gpu))
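The heuristic above divides the first number in the VM size name by 6, which matches the NC6, NC12, and NC24 sizes but can give wrong answers for other GPU families. As a hedged alternative, the sketch below uses an explicit lookup table. The names `KNOWN_GPU_COUNTS` and `lookup_gpu_count` are hypothetical, and the table covers only a few NC-series sizes, so extend it with the sizes you actually use.

```python
# Hypothetical lookup table; it covers only a few NC-series sizes.
KNOWN_GPU_COUNTS = {
    "STANDARD_NC6": 1,
    "STANDARD_NC12": 2,
    "STANDARD_NC24": 4,
    "STANDARD_NC6S_V3": 1,
    "STANDARD_NC12S_V3": 2,
    "STANDARD_NC24S_V3": 4,
}


def lookup_gpu_count(vm_size):
    """Return the GPU count for vm_size, or raise if the size is not in the table."""
    key = vm_size.upper()
    if key not in KNOWN_GPU_COUNTS:
        # Failing loudly is safer than silently passing a wrong --gpus value.
        raise ValueError("Unknown VM size, add it to KNOWN_GPU_COUNTS: " + vm_size)
    return KNOWN_GPU_COUNTS[key]


print(lookup_gpu_count("Standard_NC12"))
```

Raising on an unknown size is a design choice: a wrong GPU count would otherwise only surface later, inside the training job.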

## Create an environment
Define a conda environment YAML file with your training script dependencies and create an Azure ML environment.

References:

* Documentation: https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-use-environments#use-a-prebuilt-docker-image
* Base images: https://github.com/Azure/AzureML-Containers

In [None]:
env = Environment.from_conda_specification('my_pl', 'environment.yml')
# specify a GPU base image
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04"
)

# Prepare training script

Now you will need to create your training script. In this tutorial, the training script is already provided for you at `train.py`. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.

However, if you would like to use Azure ML's tracking and metrics capabilities, you will have to add a small amount of Azure ML code inside your training script.

In `train.py`, we will log some metrics to our Azure ML run. To do so, we will access the Azure ML Run object within the script:
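A minimal sketch of what that access could look like is shown here. The contents of `train.py` are not reproduced in this notebook, so the helper `log_epoch_accuracy` and the stand-in class `_PrintRun` are hypothetical. The metric name `Accuracy` is the one used later as the HyperDrive primary metric.

```python
try:
    from azureml.core.run import Run

    # Inside a submitted job this returns the active run;
    # outside Azure ML it returns an offline run object.
    run = Run.get_context()
except ImportError:

    class _PrintRun:
        """Hypothetical stand-in for local testing without the azureml SDK."""

        def log(self, name, value):
            print("{}: {}".format(name, value))

    run = _PrintRun()


def log_epoch_accuracy(run, accuracy):
    """Log one accuracy value under the name HyperDrive will look for."""
    run.log("Accuracy", float(accuracy))


# In a real training script this would be called once per epoch.
log_epoch_accuracy(run, 0.5)
```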


In [None]:
import shutil

script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

# the training logic is in the train.py file.
shutil.copy('./train.py', script_folder)

# Configure the single job

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. The following code will configure a single-node PyTorch job.

In [None]:
arguments = [
    '--data-folder', dataset.as_mount(),
    '--batch-size', 50,
    '--epoch', 1,
    '--learning-rate', 0.001,
    '--momentum', 0.9,
    '--model-name', 'resnet',
    '--optimizer', 'Adagrad',
    '--criterion', 'cross_entropy',
    '--gpus', n_gpu,
    '--feature_extract', True
]

'''
        '--optimizer': choice('SGD', 'Adagrad', 'Adadelta', 'Adam', 'AdamW', 'SparseAdam', 'Adamax', 'ASGD', 'LBFGS', 'RMSprop', 'Rprop'),
        '--criterion': choice('cross_entropy', 'binary_cross_entropy', 'binary_cross_entropy_with_logits', 'poisson_nll_loss', 'hinge_embedding_loss', 'kl_div', 'l1_loss', 'mse_loss', 'margin_ranking_loss', 'multilabel_margin_loss', 'multilabel_soft_margin_loss', 'multi_margin_loss', 'nll_loss', 'smooth_l1_loss', 'soft_margin_loss')
'''

config = ScriptRunConfig(
                source_directory=script_folder,
                script='train.py',
                arguments=arguments,
                compute_target=compute_target, 
                max_run_duration_seconds=600, # 10 minutes
                environment=env
                )
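The arguments above reach `train.py` as command-line strings. The sketch below shows one way the script might parse them with `argparse`. It is an assumption about `train.py`, not its actual contents: the defaults and the helper `str2bool` are hypothetical. The helper matters because `type=bool` in `argparse` would treat the string `'False'` as true.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' style strings into a real boolean."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def build_parser():
    # Flag names mirror the arguments list in this notebook; defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str)
    parser.add_argument("--batch-size", type=int, default=50)
    parser.add_argument("--epoch", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--model-name", type=str, default="resnet")
    parser.add_argument("--optimizer", type=str, default="Adagrad")
    parser.add_argument("--criterion", type=str, default="cross_entropy")
    parser.add_argument("--gpus", type=int, default=1)
    parser.add_argument("--feature_extract", type=str2bool, default=True)
    return parser


# Simulate the command line that Azure ML would build from the arguments list.
args = build_parser().parse_args(
    ["--data-folder", "/tmp/data", "--batch-size", "50", "--gpus", "2",
     "--feature_extract", "False"]
)
print(args.batch_size, args.gpus, args.feature_extract)
```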

## Submit job to run
Submit the `ScriptRunConfig` to the Azure ML experiment to kick off the execution.

In [None]:
run = exp.submit(config)

### Monitor the Run
As the Run is executed, it will go through the following stages:
1. Preparing: A Docker image is created matching the Python environment specified by the Azure ML environment, and it is uploaded to the workspace's Azure Container Registry. This step only happens once for each Python environment; the container is then cached for subsequent runs. Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.

2. Scaling: If the compute needs to be scaled up (i.e. the AmlCompute cluster requires more nodes to execute the run than are currently available), the cluster will attempt to scale up in order to make the required number of nodes available. Scaling typically takes about **5 minutes**.

3. Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the training script (`train.py`) is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.

4. Post-Processing: The `./outputs` folder of the run is copied over to the run history.

There are multiple ways to check the progress of a running job. We can use a Jupyter notebook widget. 

**Note: The widget automatically updates every 10-15 seconds, always showing you the most up-to-date information about the run.**

In [None]:
RunDetails(run).show()

We can also periodically check the status of the run object, and navigate to Azure portal to monitor the run.

In [None]:
%%time
run.wait_for_completion(show_output=True)

## Download the saved model

In the training script, the PyTorch model is saved into two files, `model.dist` and `model.pt`, in the `outputs/model` folder on the AmlCompute node. Azure ML automatically uploads anything written to the `./outputs` folder into the run history file store. Subsequently, we can use the `run` object to download the model files. They are under the `outputs/model` folder in the run history file store and are downloaded into a local folder named `model`.

In [None]:
# create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)

# Hyperparameter tuning by HyperDrive
We have trained the model with one set of hyperparameters; now let's see how we can do hyperparameter tuning by launching multiple runs on the cluster. First let's define the parameter space using random sampling.

First pass: use **Random Sampling** to find a rough range for each hyperparameter.

Second pass and later: use **Bayesian Sampling** to search more efficiently, based on the results of the first job.

In [None]:
# BayesianParameterSampling doesn't support early termination policies
# https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-tune-hyperparameters

# Use Random Sampling as 1st Phase
ps = RandomParameterSampling(
    {
        '--batch-size': choice(50, 100),
        '--epoch': choice(1, 2),
        '--learning-rate': loguniform(-4, -1),
        '--momentum': loguniform(-3, -1),
        '--model-name': choice('resnet', 'alexnet', 'vgg', 'squeezenet', 'densenet', 'inception'),
        '--optimizer': choice('SGD','Adagrad','Adadelta','Adam','AdamW','Adamax', 'ASGD', 'RMSprop', 'Rprop'),
        '--criterion': choice('cross_entropy')
    }
)

# After random sampling has finished, search for better hyperparameters using Bayesian sampling
'''
ps = BayesianParameterSampling(
    {
        '--batch-size': choice(50, 100, 150, 200, 250, 300),
        '--epoch': choice(20, 25, 30, 35),
        '--learning-rate': loguniform(-4, -1),
        '--momentum': loguniform(-2, -1),
        '--model-name': choice('resnet', 'alexnet', 'vgg', 'squeezenet', 'densenet', 'inception'),
        '--optimizer': choice('SGD','Adagrad','Adadelta','Adam','AdamW','SparseAdam', 'Adamax', 'ASGD', 'RMSprop', 'Rprop'),
        '--criterion': choice('cross_entropy', 'binary_cross_entropy', 'binary_cross_entropy_with_logits', 'poisson_nll_loss', 'hinge_embedding_loss', 'kl_div', 'l1_loss', 'mse_loss', 'margin_ranking_loss', 'multilabel_margin_loss', 'multilabel_soft_margin_loss', 'multi_margin_loss','nll_loss', 'smooth_l1_loss', 'soft_margin_loss')
    }
)

'''
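To read the `loguniform` ranges above, recall that `loguniform(low, high)` returns `exp(uniform(low, high))`, so the bounds are exponents of e rather than the values themselves. The sketch below reproduces that behaviour with the standard library only; `sample_loguniform` is a hypothetical helper, not part of the Azure ML SDK.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw exp(u) with u ~ Uniform(low, high), so log(value) is uniform."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_loguniform(-4, -1, rng) for _ in range(1000)]

# Bounds of the learning-rate range used in this notebook.
print("bounds: {:.4f} to {:.4f}".format(math.exp(-4), math.exp(-1)))
print("observed: {:.4f} to {:.4f}".format(min(samples), max(samples)))
```

So `loguniform(-4, -1)` searches learning rates between roughly 0.018 and 0.368. By the same reasoning, `loguniform(-3, -1)` for momentum covers roughly 0.05 to 0.368 and never reaches the 0.9 used in the single job, which may or may not be intended.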

Now we will define an early termination policy. With `evaluation_interval=2` and `delay_evaluation=10`, the `BanditPolicy` is applied at every second interval, starting at interval 10. With `slack_factor=0.15`, Azure ML terminates any run whose primary metric (defined later) falls below the best run's metric divided by 1.15. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.

In [None]:
policy = BanditPolicy(slack_factor=0.15, evaluation_interval=2, delay_evaluation=10)
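The slack-factor rule can be illustrated with a small calculation. This is a sketch of the comparison for a metric that is being maximized, not the service's implementation; `bandit_should_terminate` and the metric values are hypothetical.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """For a maximized metric: terminate when the run is below best / (1 + slack_factor)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best = 0.92  # hypothetical best accuracy reported so far
print("threshold: {:.3f}".format(best / (1 + 0.15)))
print(bandit_should_terminate(0.75, best, 0.15))  # below the threshold
print(bandit_should_terminate(0.85, best, 0.15))  # within the allowed slack
```

With a best accuracy of 0.92, the cut-off is about 0.80, so a run at 0.75 would be stopped while a run at 0.85 would continue.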

Now we are ready to configure a HyperDrive run configuration object and specify the primary metric `Accuracy` that's recorded in your training runs. If you go back to the training script, you will notice that this value is logged after every epoch (a full pass over all batches). We also tell the service that we want to maximize this value. We set the maximum number of runs to 180, the maximum duration of the whole tuning job to 20 minutes, and the maximum number of concurrent runs to 4, which is the same as the number of nodes in our compute cluster.

In [None]:
from azureml.core import ScriptRunConfig

arguments = [
    '--data-folder', dataset.as_mount(),
    '--gpus', n_gpu,
    '--feature_extract', True
]

config = ScriptRunConfig(
                source_directory=script_folder,
                script='train.py',
                arguments=arguments,
                compute_target=compute_target, 
                max_run_duration_seconds=600, # 10 minutes
                environment=env
                )

hdc = HyperDriveConfig(run_config=config, 
                       hyperparameter_sampling=ps, 
                       policy=policy, # Comment out this line for Bayesian Sampling
                       primary_metric_name='Accuracy', 
                       primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                       max_total_runs=180,
                       max_concurrent_runs=4,
                       max_duration_minutes=20)
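The three limits in `HyperDriveConfig` interact: `max_duration_minutes` caps the whole tuning job, so `max_total_runs` may never be reached. The rough estimate below assumes every run takes the same time and ignores image build and scaling overhead; `max_runs_within_budget` and the 5-minute run time are hypothetical.

```python
import math


def max_runs_within_budget(max_total_runs, max_concurrent_runs,
                           max_duration_minutes, minutes_per_run):
    """Upper bound on runs that can finish before the job-level time limit."""
    # Each concurrent slot can complete this many runs back to back.
    rounds = math.floor(max_duration_minutes / minutes_per_run)
    return min(max_total_runs, rounds * max_concurrent_runs)


# Settings from this notebook, with an assumed 5 minutes per run.
print(max_runs_within_budget(180, 4, 20, 5))
```

Under these assumptions only 16 of the 180 configured runs would complete, so consider raising `max_duration_minutes` if you want the sampler to explore more of the search space.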

Finally, let's launch the hyperparameter tuning job.

In [None]:
hdr = exp.submit(config=hdc)

We can use a run history widget to show the progress. Be patient as this might take a while to complete.

In [None]:
from azureml.widgets import RunDetails
RunDetails(hdr).show()

In [None]:
hdr.wait_for_completion(show_output=True)

## Find and register best model
When all the jobs finish, we can find the run with the highest accuracy.

In [None]:
best_run = hdr.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])
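The arguments print as a flat list of alternating flags and values, which is hard to scan. The sketch below pairs them into a dictionary. It assumes every flag is followed by exactly one value, as in this notebook; `arguments_to_dict` and the example values are hypothetical.

```python
def arguments_to_dict(arguments):
    """Pair up a flat ['--name', value, ...] list into {'name': value}."""
    pairs = zip(arguments[0::2], arguments[1::2])
    return {name.lstrip("-"): value for name, value in pairs}


# Hypothetical example; in the notebook, pass
# best_run.get_details()['runDefinition']['arguments'] instead.
example = ["--data-folder", "/mnt/data", "--batch-size", "100",
           "--learning-rate", "0.02"]
print(arguments_to_dict(example))
```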

Now let's list the model files uploaded during the run.

In [None]:
print(best_run.get_file_names())

We can then register the folder (and all files in it) as a model named `dog_cats_imageclassification_pytourch` under the workspace for deployment.

In [None]:
model = best_run.register_model(model_name='dog_cats_imageclassification_pytourch', model_path='outputs/model')