# PyTorch Elastic Training Example Using Azure Machine Learning Service
This notebook contains an end-to-end walkthrough of the ImageNet example using the Azure Machine Learning service.

Steps:
* Get user credentials using Service Principal
* Create Resource Group
* Create IP Address
* Create Vnet and Subnet
* Create NIC
* Create VM
* Setup etcd on VM
* Initialize an AzureML workspace
* Register a datastore
* Create an experiment
* Provision a compute target
* Create an Estimator
* Configure and Run

## Prerequisites
* Azure Subscription
* Azure Machine Learning workspace
* Azure Management SDK
* Azure Machine Learning SDK

If you are using [Azure Compute Instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), no additional setup is required. Otherwise, you need to install the required SDKs manually:
* pip install azure-mgmt-network
* pip install azure-mgmt-compute
* pip install --upgrade azureml-sdk

## Library Import

In [None]:
# Regular python libraries
import os
import requests
import sys

# Azure libraries
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.compute.models import DiskCreateOption
from utils import ElasticRun


## Azure Account Information

User credentials are required to create the Azure network and compute resources for PyTorch Elastic training. Instructions for generating the tenant ID, client ID and secret are available for the [portal](https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal), the [CLI](https://docs.microsoft.com/en-us/cli/azure/create-an-azure-service-principal-azure-cli?view=azure-cli-latest) and [PowerShell](https://docs.microsoft.com/en-us/powershell/azure/create-azure-service-principal-azureps).

In [None]:
SUBSCRIPTION_ID = "<subscription_id>"
RESOURCE_GROUP = "<Resource group name>"
REGION = "<Resource group region>"

TENANT = "<tenant id>"
CLIENT_ID = "<client id>"
SECRET = "<secret>"
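
Hardcoding a service principal secret in a notebook makes it easy to leak. As a minimal sketch of one alternative, the helper below reads the three values from environment variables. The function name `load_service_principal` and the variable names are assumptions for illustration and are not part of `utils.ElasticRun`.

```python
import os

# Assumed environment variable names; adjust them to your own convention.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def load_service_principal(env=None):
    """Return (tenant, client_id, secret) read from environment variables.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error lists them all.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)
```

With the variables exported in the shell that starts the notebook, the cell above could then use `TENANT, CLIENT_ID, SECRET = load_service_principal()` instead of literal strings.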


In [None]:
# Create an ElasticRun object to track Azure resources and the AzureML experiment
elastic_run = ElasticRun(TENANT, CLIENT_ID, SECRET)

## Create Resource Group
Create a resource group with the specified name if one doesn't already exist.

In [None]:
# Resource group name and region for Elastic runs
elastic_run.init_resource_group(RESOURCE_GROUP, REGION)

# Create a resource group if one doesn't exist
elastic_run.create_resource_group()

## Network Setup
Create a Public IP Address, Vnet, Subnet and NIC.

In [None]:
IP_NAME = "pet-test-ip"
VNET_NAME = "pet-test-vnet"
NSG_NAME = "pet-test-nsg"
SUBNET_NAME = "pet-test-subnet"
NIC_NAME = "pet-test-nic"
IPCONFIG_NAME = "pet-test-ipconfig"

Network resource creation can be skipped if you are using existing resources. Make sure to correctly populate IP_NAME, VNET_NAME, NSG_NAME, SUBNET_NAME, NIC_NAME and IPCONFIG_NAME in the cell above. Vnet requirements for AmlCompute can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-enable-virtual-network#mlcports).

In [None]:
elastic_run.init_network_resources(IP_NAME, VNET_NAME, NSG_NAME, SUBNET_NAME, NIC_NAME, IPCONFIG_NAME)
elastic_run.create_network_resources()

## VM setup
Create an Ubuntu VM in the vnet created above and set up etcd to listen on port 2379.

In [None]:
ETCD_VM_NAME = "pet-test-vm"
ETCD_VM_SIZE = "<Azure VM Size>"

elastic_run.init_etcd_vm(ETCD_VM_NAME, ETCD_VM_SIZE)
elastic_run.create_setup_etcd_vm()

# verify etcd
elastic_run.verify_etcd()
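
How `verify_etcd()` performs its check is not shown in this notebook. As a rough, independent sketch, etcd serves a `/health` endpoint on its client port, so a manual check from a machine that can reach the VM might look like the following. The helper names are hypothetical, and the check assumes the endpoint answers with a JSON body containing a `health` field.

```python
import json
import urllib.request


def etcd_health_url(host, port=2379, protocol="http"):
    """Build the URL of etcd's health endpoint on the client port."""
    return f"{protocol}://{host}:{port}/health"


def is_healthy(body):
    """Interpret the response body of /health (assumed to be JSON)."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False
    if not isinstance(payload, dict):
        return False
    # etcd reports the flag as the string "true"; accept a boolean as well.
    return str(payload.get("health", "")).lower() == "true"


def check_etcd(host, port=2379, timeout=5):
    """Return True if etcd at host:port answers and reports itself healthy."""
    try:
        url = etcd_health_url(host, port)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8"))
    except OSError:
        # Covers connection errors, HTTP errors and timeouts.
        return False
```

For example, `check_etcd("<etcd VM IP address>")` returns `False` rather than raising when the VM is unreachable, which keeps the check usable in a retry loop.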

## Azure Machine Learning Library Import

In [None]:
# AzureML libraries
import azureml.core
from azureml.core import Datastore
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.container_registry import ContainerRegistry
from azureml.core.runconfig import MpiConfiguration, RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## AzureML Workspace setup
If you are not running on [Azure Compute Instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), refer to the [Configuration Notebook](https://github.com/Azure/MachineLearningNotebooks/blob/56e0ebc5acb9614fac51d8b98ede5acee8003820/configuration.ipynb) to establish a connection to the AzureML workspace.

If the workspace doesn't exist, a new one can be created using elastic_run.create_workspace(WORKSPACE_NAME).

In [None]:
WORKSPACE_NAME = "<Workspace name>"

ws = elastic_run.get_workspace(WORKSPACE_NAME)

# ws = elastic_run.create_workspace(WORKSPACE_NAME)

## Datastore registration
The following code assumes that the training data has already been copied to Azure Blob storage with the directory structure shown below. It is recommended to retain this directory structure so that the notebook runs without code changes. If the directory structure is different, modify the constructor of the PyTorch estimator, where the datastore is mounted.

    data
    |
    |__train
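
Before registering the datastore, it can help to confirm that the container really follows this layout. The sketch below is a plain-Python check over a list of blob names. The function name `has_training_data` is hypothetical, and obtaining the names (for instance by listing the container with a storage client) is left to the reader.

```python
def has_training_data(blob_names, prefix="data/train/"):
    """Return True if at least one blob lives under the expected prefix."""
    return any(
        # Require something after the prefix so an empty folder marker
        # named exactly "data/train/" does not count as training data.
        name.startswith(prefix) and len(name) > len(prefix)
        for name in blob_names
    )
```

A container holding `data/train/n01440764/img_0001.JPEG` passes the check, while one holding only `train/img_0001.JPEG` does not.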

In [None]:
# Register the datastore with the workspace
ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name='<Blob store name>',
                                             container_name='<container name>',
                                             account_name="<Storage account name>", 
                                             account_key="<Storage account key>"
                                            )

In [None]:
# Print the datastore attributes
print('Datastore name: ' + ds.name, 
      'Container name: ' + ds.container_name, 
      'Datastore type: ' + ds.datastore_type, 
      'Workspace name: ' + ds.workspace.name, sep = '\n')

## Provision Training cluster
Create an AzureML training cluster in the vnet created above. For more information on AzureML compute, see [this](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute).
* COMPUTE_SIZE: Azure VM size of the nodes provisioned by AmlCompute.
* MIN_NODES: Minimum number of nodes while running a job on AmlCompute.
* MAX_NODES: Maximum number of nodes to autoscale to while running a job on AmlCompute.

In [None]:
# Create the compute cluster
PET_CLUSTER_NAME = "pet-test-cluster" 
MIN_NODES = <> # minimum number of nodes to provision in training cluster
MAX_NODES = <> # maximum number of nodes to provision in training cluster
COMPUTE_SIZE = "<Azure VM size for AmlCompute cluster>"

elastic_run.create_compute_target(PET_CLUSTER_NAME, MIN_NODES, MAX_NODES, COMPUTE_SIZE)
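
Because the same node limits are later passed to the training job, a mistake such as swapping the two values is cheaper to catch before the cluster is provisioned. The following is a minimal sketch of such a check. `validate_node_range` is a hypothetical helper, not part of `ElasticRun` or the AzureML SDK, and the lower bounds it enforces are assumptions.

```python
def validate_node_range(min_nodes, max_nodes):
    """Check that the autoscale bounds are integers and consistently ordered."""
    for name, value in (("min_nodes", min_nodes), ("max_nodes", max_nodes)):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {value!r}")
    if min_nodes < 0:
        raise ValueError("min_nodes must be zero or greater")
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    return min_nodes, max_nodes
```

Calling `validate_node_range(MIN_NODES, MAX_NODES)` just before `create_compute_target` raises early with a readable message instead of failing during provisioning.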

## Estimator definition and run submission
The estimator uses a custom Docker image and main.py as the entry script for execution.
For more information on the Estimator, refer to [this guide](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch).

In [None]:
# Define the project folder
project_folder = '.' # This is to allow the libraries stored under pytorch/ to be loaded

# Use a public image published on Azure
image_name = 'mcr.microsoft.com/azureml/elastic:pytorch-elastic-openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

# Define the PyTorch estimator
pet_estimator = PyTorch(source_directory=project_folder,
                    # Compute configuration
                    compute_target=pet_compute_target,
                    node_count=1, 
                    use_gpu=True,
                    
                    #Docker image
                    use_docker=True,
                    custom_docker_image=image_name,
                    user_managed=True,
                    
                    # Training script parameters
                    script_params = {
                        # Required Params
                        "--input_path" : ds.path('data/train/').as_mount()
                    },
                    
                    entry_script='main.py',
                    inputs=[ds.path('data/').as_mount()]
                   )

In [None]:
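# RDZV_ENDPOINT is expected to point at the etcd VM inside the vnet (etcd listens on port 2379).
# MIN_SIZE and MAX_SIZE reuse the node limits of the training cluster, passed as strings.
pet_estimator.run_config.environment.environment_variables = {
    "RDZV_ENDPOINT": "10.0.0.4:2379",
    "ETCD_PROTOCOL": "http",
    "MIN_SIZE": str(MIN_NODES),
    "MAX_SIZE": str(MAX_NODES),
    "JOB_ID": "<Unique ID>",
}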
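
The environment variables above carry the rendezvous settings to the training job. As a sketch of one way to assemble them, the helper below builds the same dictionary from its parts and generates a job ID when none is supplied. `build_rendezvous_env` is a hypothetical name, and using a random UUID as the unique ID is an assumption, since the notebook only asks for a unique value.

```python
import uuid


def build_rendezvous_env(etcd_host, min_nodes, max_nodes,
                         job_id=None, port=2379, protocol="http"):
    """Return the environment variables used for the etcd rendezvous.

    All values are strings, as environment variables are at process level.
    """
    return {
        "RDZV_ENDPOINT": f"{etcd_host}:{port}",
        "ETCD_PROTOCOL": protocol,
        "MIN_SIZE": str(min_nodes),
        "MAX_SIZE": str(max_nodes),
        # Generate the ID once, before submission, so that every node
        # of the same run receives the same value.
        "JOB_ID": job_id or uuid.uuid4().hex,
    }
```

The result can be assigned directly to `pet_estimator.run_config.environment.environment_variables`. Passing an explicit `job_id` is useful when a run needs to be identified later in the etcd keyspace.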

## Create an Experiment
An Experiment is a logical container in an AzureML workspace. It hosts run records, which can include run metrics and output artifacts from your experiments. More information on Experiment can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py).

In [None]:
# Create an experiment
EXPERIMENT_NAME = 'pet-imagenet'
elastic_run.create_experiment(EXPERIMENT_NAME)

In [None]:
# Submit the run
elastic_run.submit_job(pet_estimator)