# PyTorch Elastic Training Example Using Azure Machine Learning Service
This notebook contains an end-to-end walkthrough of the ImageNet example using the Azure Machine Learning service.

Steps:
* Get user credentials using Service Principal
* Create Resource Group
* Create IP Address
* Create Vnet and Subnet
* Create NIC
* Create VM
* Set up etcd on VM
* Initialize an AzureML workspace
* Register a datastore
* Create an experiment
* Provision a compute target
* Create an Estimator
* Configure and Run

## Prerequisites
* Azure Subscription
* Azure Machine Learning workspace
* Azure Management SDK
* Azure Machine Learning SDK

If you are using [Azure Compute Instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), no additional setup is required. Otherwise, you need to manually install the required SDKs:
* pip install azure-mgmt-resource
* pip install azure-mgmt-network
* pip install azure-mgmt-compute
* pip install --upgrade azureml-sdk
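
Before running the rest of the notebook, it can help to confirm that the SDKs are importable. The following is a minimal sketch that uses only the standard library. The helper name `missing_modules` is hypothetical, and the module names are taken from the imports used in this notebook.

```python
import importlib.util


def missing_modules(module_names):
    """Return the subset of module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises ModuleNotFoundError when a parent package is absent
            found = False
        if not found:
            missing.append(name)
    return missing


# Module names imported by the cells in this notebook
required = [
    "azure.mgmt.resource",
    "azure.mgmt.network",
    "azure.mgmt.compute",
    "azureml.core",
]
print("Missing modules:", missing_modules(required) or "none")
```

Any module reported as missing can be installed with the corresponding `pip install` command listed above.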

## Library Import

In [None]:
# Regular python libraries
import os
import requests
import sys

# Azure libraries
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.compute.models import DiskCreateOption
from utils import *


## Azure Account Information

A user's credentials are required to create the Azure network and compute resources for PyTorch Elastic training. Instructions for generating the tenant ID, client ID, and secret are available for the [portal](https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal), the [CLI](https://docs.microsoft.com/en-us/cli/azure/create-an-azure-service-principal-azure-cli?view=azure-cli-latest), and [PowerShell](https://docs.microsoft.com/en-us/powershell/azure/create-azure-service-principal-azureps).
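
Hard-coding a client secret in a notebook makes it easy to leak. As a minimal sketch, the service principal values could instead be read from environment variables. The variable names (`AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`) and the helper `load_service_principal` are assumptions for illustration; neither this notebook nor `utils` requires them.

```python
import os


def load_service_principal(env=None):
    """Read service principal settings from a mapping (defaults to os.environ).

    The environment variable names below are an assumed convention.
    """
    env = os.environ if env is None else env
    names = {
        "tenant": "AZURE_TENANT_ID",
        "client_id": "AZURE_CLIENT_ID",
        "secret": "AZURE_CLIENT_SECRET",
    }
    # Collect every unset or empty variable so the error lists them all at once
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned `tenant`, `client_id`, and `secret` values could then be passed to `get_credentials` in place of string literals.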

In [None]:
subscription_id = "<subscription_id>"
resource_group = "<Resource group name>"
region = "<Resource group region>"

tenant = "<tenant id>"
client_id = "<client id>"
secret = "<secret>"


In [None]:
# Get the user's service principal credentials
credentials = get_credentials(tenant, client_id, secret)

## Create Resource Group
Creates a resource group with the specified name if one doesn't already exist.

In [None]:
# Create a resource group if one doesn't exist
resource_group_client = ResourceManagementClient(
    credentials,
    subscription_id
)

create_resource_group(resource_group_client, resource_group, region)

## Network Setup
Create a Public IP Address, Vnet, Subnet and NIC.

In [None]:
network_client = NetworkManagementClient(
    credentials,
    subscription_id
)

ip_name = "pet-test-ip"
vnet_name = "pet-test-vnet"
subnet_name = "pet-test-subnet"
nic_name = "pet-test-nic"
ipconfig_name = "pet-test-ipconfig"

Skip the following cell if you are using existing network resources. In that case, make sure the `ip_name`, `vnet_name`, `subnet_name`, `nic_name`, and `ipconfig_name` fields in the cell above match those resources.

In [None]:
create_public_ip_address(network_client, resource_group, region, ip_name)

create_vnet(network_client, resource_group, region, vnet_name)

create_subnet(network_client, resource_group, vnet_name, subnet_name)

create_nic(network_client, resource_group, region, vnet_name, subnet_name, ip_name, ipconfig_name, nic_name)

## VM setup
Creates an Ubuntu VM in the VNet created above and sets up etcd to listen on port 2379.
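
Once etcd is set up, a quick TCP check can confirm that something is listening on port 2379. The following is a minimal sketch using only the standard library. The helper name `is_port_open` is hypothetical, and the check only shows that the port accepts connections, not that etcd is healthy.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, `is_port_open("10.0.0.4", 2379)` uses the private address that this notebook later passes as `RDZV_ENDPOINT`. Because that address is private to the VNet, the check would need to run from a machine inside the same VNet.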

In [None]:
compute_client = ComputeManagementClient(
    credentials,
    subscription_id
)

vm_name = "pet-test-vm"
vm_size = "<Azure VM Size>"

# Create a VM for etcd
create_vm(network_client, compute_client, resource_group, region, nic_name, vm_name, vm_size)

# Run a custom script extension to set up etcd
setup_etcd(compute_client, resource_group, vm_name)

## Azure Machine Learning Library Import

In [None]:
# AzureML libraries
import azureml.core
from azureml.core import Experiment, Workspace, Datastore, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.container_registry import ContainerRegistry
from azureml.core.runconfig import MpiConfiguration, RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## AzureML Workspace setup
If you are not running on [Azure Compute Instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), refer to the [Configuration Notebook](https://github.com/Azure/MachineLearningNotebooks/blob/56e0ebc5acb9614fac51d8b98ede5acee8003820/configuration.ipynb) for establishing a connection to the AzureML workspace.

In [None]:
workspace_name = "<Workspace name>"

try:
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    # Optionally write the workspace details to a configuration file
    #ws.write_config()
    print("Workspace configuration succeeded.")
except Exception as e:
    # Catch Exception rather than using a bare except, so Ctrl+C still interrupts
    print("Workspace not accessible. Change your parameters or create a new workspace.")
    print(e)

## Datastore registration
The following code assumes that the training data has already been copied to Azure Blob storage with the directory structure shown below. It is recommended to retain this structure so that the notebook runs without code changes. If your directory structure differs, modify the PyTorch estimator constructor where the datastore is mounted.

    data
    |
    |__train

In [None]:
# Register the datastore with the workspace
ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name='<Blob store name>',
                                             container_name='<container name>',
                                             account_name="<Storage account name>", 
                                             account_key="<Storage account key>"
                                            )

In [None]:
# Print the datastore attributes
print('Datastore name: ' + ds.name, 
      'Container name: ' + ds.container_name, 
      'Datastore type: ' + ds.datastore_type, 
      'Workspace name: ' + ds.workspace.name, sep = '\n')

## Create an Experiment
An Experiment is a logical container in an AzureML workspace. It hosts run records, which can include run metrics and output artifacts from your experiments. More information on Experiment can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py).

In [None]:
# Create an experiment
experiment_name = 'pet-imagenet'
pet_experiment = Experiment(ws, name=experiment_name)

## Provision Training cluster
Create an AzureML training cluster in the VNet created above. For information on AzureML compute, see [this](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute).
* Azure VM Size: VM family of nodes provisioned by AmlCompute.
* min_nodes: Minimum number of nodes while running a job on AmlCompute
* max_nodes: Maximum nodes to autoscale while running a job on AmlCompute
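
This notebook later reuses the same `min_nodes` and `max_nodes` values for the `MIN_SIZE` and `MAX_SIZE` environment variables. As a minimal sketch, the bounds could be validated once and converted for that use. The helper name `elastic_size_env` is hypothetical, and returning string values is an assumption based on environment variables conventionally being strings.

```python
def elastic_size_env(min_nodes, max_nodes):
    """Validate node bounds and return them as string-valued environment variables."""
    if not (isinstance(min_nodes, int) and isinstance(max_nodes, int)):
        raise TypeError("min_nodes and max_nodes must be integers")
    if min_nodes < 0 or min_nodes > max_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    # String values are an assumption; adjust if your setup expects integers
    return {"MIN_SIZE": str(min_nodes), "MAX_SIZE": str(max_nodes)}


print(elastic_size_env(1, 4))
```

Validating early surfaces a swapped or negative bound before the cluster is provisioned.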

In [None]:
# Create the compute cluster
pet_cluster_name = "pet-test-cluster" 

# Check whether the cluster already exists
try:
    pet_compute_target = ComputeTarget(workspace=ws, name=pet_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='<Azure VM Size>',
                                                          min_nodes=<min_nodes>,
                                                          max_nodes=<max_nodes>,
                                                          vnet_name=vnet_name,
                                                          vnet_resourcegroup_name=resource_group,
                                                          subnet_name=subnet_name)
    
    # create the cluster
    pet_compute_target = ComputeTarget.create(ws, pet_cluster_name, compute_config)
    pet_compute_target.wait_for_completion(show_output=True)

# Use the 'status' property to get a detailed status for the current cluster. 
#print(pet_compute_target.status.serialize())

## Estimator definition and run submission
The estimator uses a custom Docker image and `main.py` as the entry script.
For more information on the Estimator, see [the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch).

In [None]:
# Define the project folder
project_folder = '.' # This is to allow the libraries stored under pytorch/ to be loaded

# Use a public image published on Azure
image_name = 'mcr.microsoft.com/azureml/elastic:pytorch-elastic-openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

# Define the PyTorch estimator
pet_estimator = PyTorch(source_directory=project_folder,
                    # Compute configuration
                    compute_target=pet_compute_target,
                    node_count=1, 
                    use_gpu=True,
                    
                    # Docker image
                    use_docker=True,
                    custom_docker_image=image_name,
                    user_managed=True,
                    
                    # Training script parameters
                    script_params = {
                        # Required Params
                        "--input_path" : ds.path('data/train/').as_mount()
                    },
                    
                    entry_script='main.py',
                    inputs=[ds.path('data/').as_mount()]
                   )

In [None]:
pet_estimator.run_config.environment.environment_variables = {"RDZV_ENDPOINT":"10.0.0.4:2379", "ETCD_PROTOCOL":"http","MIN_SIZE":<min_nodes>, "MAX_SIZE":<max_nodes>}

In [None]:
# Submit the run
pet_run = pet_experiment.submit(pet_estimator)
RunDetails(pet_run).show()