## Setup
This tutorial assumes you have a conda environment with the dependencies described in [env.yaml](env.yaml).  The following cell creates this environment. You then need to activate it in your IDE.

In [None]:
!conda env create -f env.yaml

## Prerequisites
1. An Azure subscription. If you don't have one, you can [create a free account](https://aka.ms/AMLFree).
2. An active Azure Resource Group.
3. A working Azure ML Workspace.

The following steps help you set up items 2 and 3. If these resources already exist, you can skip this cell. Visit [this link](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli?tabs=createnewresources) for more information on how to manage your AML workspace using Azure CLI.

In the following example, a new workspace named `mnist-pytorch-lit-ws` is created in the new resource group `your_name-rg`, located in `eastus`. The shell used is PowerShell, but the same commands can be executed in bash on macOS and Linux. To install Azure CLI, [click here](https://learn.microsoft.com/en-us/cli/azure/) and follow the instructions.


```powershell
# add the "azure-cli-ml" extension
!az extension add --name azure-cli-ml
# login to azure (will open a browser for authentication)
!az login
# create a resource group named "your_name-rg"
!az group create --name your_name-rg --location eastus --tags env=tutorial
# create a workspace named "mnist-pytorch-lit-ws" attached to resource group "your_name-rg"
!az ml workspace create -w mnist-pytorch-lit-ws -g your_name-rg --tags env=tutorial
```

> **Important**:
> When you deploy an Azure Machine Learning workspace, various [other services are created by default](https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace#associated-resources). These services include an [Azure Storage Account](https://azure.microsoft.com/en-us/products/category/storage/) used to store models, checkpoints and, most importantly, our training data. Depending on your needs, you might want to create the storage account ahead of time and link it to the workspace. For ML workflows, Microsoft recommends [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/), the "massively scalable and secure object storage". Visit the [official documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-datastore?tabs=sdk-identity-based-access%2Ccli-adls-identity-based-access%2Ccli-azfiles-account-key%2Ccli-adlsgen1-identity-based-access) to learn how to programmatically create datastores.

## Connect to your Azure Workspace

Interactions with your workspace are abstracted through an `azure.ai.ml.MLClient` object. It requires an authentication method, the ID of your subscription and the names of your resource group and workspace. We use `DefaultAzureCredential` to gain access to the workspace, but `InteractiveBrowserCredential` can be used instead if you prefer to authenticate through a browser.

In [None]:
# Workspace client
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

cred = DefaultAzureCredential()
# uncomment to use the InteractiveBrowserCredential instead
# from azure.identity import InteractiveBrowserCredential
# cred = InteractiveBrowserCredential()

Using your subscription ID and the names of your resource group and workspace, we can now create a workspace handler. We store privileged information using environment variables and the `python-dotenv` package. The environment variables are defined in a file named `.env` that you must create. The following cell copies the basic template `.env.example`.

In [None]:
!cp .env.example .env
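
The `cp` command above is not available in every shell, and it overwrites an existing `.env`. As a minimal sketch, the same copy can be done from Python. The helper name `copy_env_template` is hypothetical and not part of the tutorial's code, and the demonstration runs in a scratch directory so no real file is touched.

```python
import shutil
import tempfile
from pathlib import Path


def copy_env_template(template=".env.example", target=".env"):
    """Copy the template to the target unless the target already exists.

    Returns True if a copy was made, False if the target was left untouched.
    """
    target_path = Path(target)
    if target_path.exists():
        # Keep the existing file so values already entered are not lost.
        return False
    shutil.copyfile(str(template), str(target_path))
    return True


# Demonstration in a scratch directory (file contents are placeholders).
with tempfile.TemporaryDirectory() as scratch:
    template = Path(scratch) / ".env.example"
    target = Path(scratch) / ".env"
    template.write_text("AZURE_SUBSCRIPTION_ID=\n")
    first = copy_env_template(template, target)   # copies the template
    target.write_text("AZURE_SUBSCRIPTION_ID=edited\n")
    second = copy_env_template(template, target)  # leaves the edited file alone
    preserved = target.read_text()

print(first, second)  # True False
```

In the notebook, calling `copy_env_template()` with no arguments would act on `.env.example` and `.env` in the current folder.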

Set your subscription ID in `.env`, then load the environment variables:

In [None]:
from dotenv import load_dotenv
import os

# change the values in .env if needed
load_dotenv()
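
`os.getenv` returns `None` for a variable that is not set, so a typo in `.env` only surfaces later when the client is used. The sketch below checks the three variable names read in the next cell before connecting. The helper `require_env` is hypothetical, and the example values are placeholders rather than real identifiers.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for every name, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


REQUIRED = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE")

# Example with an explicit mapping instead of the real environment.
example_env = {
    "AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "AZURE_RESOURCE_GROUP": "your_name-rg",
    "AZURE_WORKSPACE": "mnist-pytorch-lit-ws",
}
settings = require_env(REQUIRED, env=example_env)
print(settings["AZURE_WORKSPACE"])  # mnist-pytorch-lit-ws
```

Calling `require_env(REQUIRED)` after `load_dotenv()` would read from the actual environment and fail early with the list of missing names.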

Connect to your workspace

In [None]:
sub_id = os.getenv("AZURE_SUBSCRIPTION_ID") # set the value in `.env`
azure_RG = os.getenv("AZURE_RESOURCE_GROUP") # matches the name of the resource group we created
azure_WS = os.getenv("AZURE_WORKSPACE")  # matches the name of the workspace we created

ml_client = MLClient(
    credential=cred,
    subscription_id=sub_id,
    resource_group_name=azure_RG,
    workspace_name=azure_WS
)

## Create compute resources to run jobs
Cloud-based tasks are specified by jobs that run on various VMs. In this tutorial, we'll need two clusters: a basic CPU cluster to upload our local data and a GPU cluster to run the training job. Many VM sizes [are available here](https://azure.microsoft.com/en-us/pricing/details/machine-learning/), but beware of subscription restrictions that can limit your options to basic VMs. Free accounts are limited to single-GPU VMs and 6 vCPUs.

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

cpu_compute_target = "cpu-cluster"

# Check whether the resource exists and create it if not.
# An `if` statement would be cleaner, but since there is currently no `exists` method,
# we rely on this try/except statement.
try:
    # raises an exception if the resource does not exist
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(f"Found existing cluster {cpu_compute_target}.")
except ResourceNotFoundError:
    print("Creating a new cpu compute target...")
    cpu_cluster = AmlCompute(
        # Name of the cluster
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Maximum number of nodes in the cluster
        max_instances=4,
        # Seconds a node stays idle after a job ends before the cluster scales down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )
    # Create the cluster
    cpu_cluster = ml_client.begin_create_or_update(cpu_cluster).result()

print(f"AMLCompute with name {cpu_cluster.name} is used with compute size {cpu_cluster.size}")
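
The get-or-create pattern used in this cell can be factored into a small helper so that the CPU and GPU cells share one code path. This is only a sketch: `get_or_create` is a hypothetical name, and a plain dictionary stands in for the workspace so the example runs without Azure access.

```python
def get_or_create(get, create, not_found=LookupError):
    """Return (resource, created); `created` is True only if `create` was called."""
    try:
        return get(), False
    except not_found:
        return create(), True


# Stand-in for the workspace: a dict plays the role of the compute registry.
registry = {}


def fake_get(name):
    return registry[name]  # raises KeyError (a LookupError) when absent


def fake_create(name):
    registry[name] = {"name": name}
    return registry[name]


first, created_first = get_or_create(
    lambda: fake_get("cpu-cluster"), lambda: fake_create("cpu-cluster")
)
second, created_second = get_or_create(
    lambda: fake_get("cpu-cluster"), lambda: fake_create("cpu-cluster")
)
print(created_first, created_second)  # True False
```

With the SDK objects from this notebook, `get` would wrap `ml_client.compute.get(cpu_compute_target)`, `create` would wrap `ml_client.begin_create_or_update(cpu_cluster).result()`, and `not_found` would be `ResourceNotFoundError`.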

For GPUs, we create the cluster from the smallest possible VM family.

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except ResourceNotFoundError:
    # only a missing cluster triggers creation; other errors (e.g. authentication) propagate
    print("Creating a new gpu compute target...")

    gpu_cluster = AmlCompute(
        name=gpu_compute_target,
        type="amlcompute",
        size="STANDARD_NC6",  # 1 x NVIDIA Tesla K80 ($0.90 USD per hour)
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=180,
        tier="Dedicated",
    )

    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created with the compute size {gpu_cluster.size}"
)
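
Both clusters are declared with `max_instances=4`, but the number of nodes that can actually start is bounded by the vCPU quota. The following back-of-the-envelope sketch assumes the 6-vCPU limit mentioned above applies to each VM family, which may not match your subscription. The NC6 count is the 6 vCPUs used in this tutorial, and the DS3_v2 count of 4 vCPUs is taken from the Azure VM size tables.

```python
# vCPUs per node for the two VM sizes used in this tutorial.
VCPUS_PER_NODE = {"STANDARD_DS3_V2": 4, "STANDARD_NC6": 6}


def max_runnable_nodes(vm_size, vcpu_quota, max_instances):
    """Nodes that can run at once: limited by the quota and by the cluster maximum."""
    by_quota = vcpu_quota // VCPUS_PER_NODE[vm_size]
    return min(by_quota, max_instances)


QUOTA = 6  # assumption: free-account limit of 6 vCPUs per VM family

cpu_nodes = max_runnable_nodes("STANDARD_DS3_V2", QUOTA, max_instances=4)
gpu_nodes = max_runnable_nodes("STANDARD_NC6", QUOTA, max_instances=4)
print(cpu_nodes, gpu_nodes)  # 1 1
```

Under these assumptions each cluster can only run a single node, whatever the value of `max_instances`.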

> **Important**:
> During the writing of this tutorial, I learned that Microsoft intends to decommission their NC, NCv2 and ND-series VMs in favor of new hardware by the end of 2023. Their [migration guide](https://learn.microsoft.com/en-us/azure/virtual-machines/n-series-migration) recommends upgrading to `Standard_NC4as_T4_v3` or `Standard_NC8as_T4_v3` VMs. However, these options are unavailable to new accounts. You need to [request a quota increase](https://learn.microsoft.com/en-us/azure/quotas/per-vm-quota-requests) and have access to at least 20 cores.

### 1. Upload the data

We train a basic PyTorch Lightning MLP classifier on the well-known MNIST dataset. We could download the dataset from a public URL during training, but this would add to the running time of our most expensive hardware. Instead, we download the data locally, pack it into an archive, and submit a job on the CPU cluster that extracts it into the workspace datastore. In production, we would probably skip this step entirely, since the data would already be stored somewhere in Azure.

In [None]:
# download training and test data 
from torchvision.datasets.mnist import MNIST
import os

# create local folder to store data
os.makedirs("./data", exist_ok=True)

# download training data
train_data = MNIST(
    "./data/train",
    train=True, download=True,
)
# download test data
test_data = MNIST(
    "./data/test",
    train=False, download=True
)

In [None]:
# create a gzip-compressed archive (the -z flag compresses it even though the file is named .tar)
!tar -czvf mnist.tar data
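
If the `tar` command is not available in your shell, the archive can also be built with Python's standard `tarfile` module. The sketch below is an alternative under that assumption, not the tutorial's own method. `make_archive` is a hypothetical helper, demonstrated on a scratch folder with a dummy file instead of the real MNIST download.

```python
import tarfile
import tempfile
from pathlib import Path


def make_archive(source_dir, archive_path):
    """Create a gzip-compressed tar archive of `source_dir` and return its member names."""
    source_dir = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps only the top-level folder name (e.g. "data") inside the archive
        archive.add(str(source_dir), arcname=source_dir.name)
    with tarfile.open(str(archive_path), "r:gz") as archive:
        return sorted(archive.getnames())


# Demonstration on a scratch folder that mimics the ./data layout.
with tempfile.TemporaryDirectory() as scratch:
    data_dir = Path(scratch) / "data"
    (data_dir / "train").mkdir(parents=True)
    (data_dir / "train" / "sample.bin").write_bytes(b"\x00\x01")
    members = make_archive(data_dir, Path(scratch) / "mnist.tar")

print(members)  # ['data', 'data/train', 'data/train/sample.bin']
```

In the notebook, `make_archive("data", "mnist.tar")` would produce an archive with the same file name that the upload job expects as input.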

In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes

experiment_name = "train_mnist_pytorch-lit"

# we define the command here and submit it in the next cell
upload_dataset_job = command(
    display_name="upload_mnist",
    command="tar xvfm ${{inputs.archive}} --no-same-owner -C ${{outputs.images}}",
    inputs={
        "archive": Input(
            type=AssetTypes.URI_FILE,
            path="mnist.tar"
        )
    },
    outputs={
        "images": Output(
            type=AssetTypes.URI_FOLDER,
            mode="upload",
            # path must begin with `azureml://` and must point to a cloud path
            path="azureml://datastores/workspaceblobstore/paths/mnist-pytorch-lit-tutorial"
        )
    },
    # an existing environment with pre-installed libraries
    # we could create our own, but for the current purpose we can reuse an existing one
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    # the cpu compute resource we just created
    compute=cpu_compute_target,
    # assemble jobs under the same experiment
    experiment_name=experiment_name
)

In [None]:
import webbrowser

# submit the command
returned_job = ml_client.create_or_update(
    upload_dataset_job,
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.studio_url)
# open the browser with this url
webbrowser.open(returned_job.studio_url)

# print the job name
print(
    f"The job details can be accessed programmatically using identifier: {returned_job.name}"
)

This task might take a few minutes to complete. You can track its progress in Azure Machine Learning Studio using the URL printed above. By default, Azure will have created four different datastores. The files should be stored in `workspaceblobstore`. You can browse this datastore by accessing `Assets > Data > Datastores` in your [Microsoft Azure Machine Learning Studio](https://ml.azure.com/) interface. Your output should be similar to this:

![expected result](media/azure-upload-job-result.png )
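
The training job reads the uploaded files, so the upload must finish first. As a minimal sketch, a polling loop like the one below can wait for a terminal status. `wait_for_job` is a hypothetical helper, the set of terminal status strings is an assumption about what the service reports, and a scripted sequence stands in for a real job so the example runs offline.

```python
import time

# Assumed terminal status strings for a job.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Call `get_status` until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in status {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a scripted sequence of statuses instead of a real job.
scripted = iter(["Preparing", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(scripted), poll_seconds=0.0, sleep=lambda _: None)
print(final_status)  # Completed
```

In the notebook, `get_status` could be `lambda: ml_client.jobs.get(returned_job.name).status`. Alternatively, `ml_client.jobs.stream(returned_job.name)` blocks until the job ends while printing its logs.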


### 2. Create a custom training environment
When developing locally, we generally use a Python virtual environment to pin the dependencies of our project. Azure provides [environments](https://docs.microsoft.com/azure/machine-learning/concept-environments) for the same purpose. They can be created in three different ways:

1. By using a curated environment.
2. By creating a custom environment from a Docker image.
3. By creating a custom environment from a conda specification and a pre-existing Docker image.

AzureML provides many curated or ready-made environments, which are useful for common training and inference scenarios. However, these contain only basic dependencies. A list of existing curated environments can be found [here](https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments) or from the Environments tab in Azure Machine Learning Studio. You can also create your own custom environments using a Docker image or a conda configuration. Regular Docker images don't have access to hardware acceleration, so you must select specific images provided by NVIDIA and [Azure](https://learn.microsoft.com/en-us/azure/machine-learning/concept-prebuilt-docker-images-inference).

In this example, we use a custom conda environment for the training job, using a conda yaml file. More information about environment management can be found [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python).

In [None]:
from azure.ai.ml.entities import Environment
import os 

custom_env_name = "mnist-pytorch-lit-env"

pipeline_job_env = Environment(
    # name of the environment
    name=custom_env_name,
    # short description of the environment
    description="Custom environment for the MNIST PytorchLit tutorial",
    # tags 
    tags={"pytorch": "1.12", "cuda": "11.6"},
    # path to the conda environment file
    conda_file="env.yaml",
    # URI of the base Docker image, here we use an existing image with GPU support
    image="mcr.microsoft.com/azureml/curated/acpt-pytorch-1.12-py38-cuda11.6-gpu:4",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"""Environment with name {pipeline_job_env.name} is registered to workspace,
     the environment version is {pipeline_job_env.version}"""
)

### 3. Create and submit the training command job
Using the environment created in the previous step, we can now define the training job.

In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

training_job = command(
    # local path to the source code
    code="./src/",
    # the commands to run the python script
    command="""python train.py \
        --path_to_data ${{inputs.path_to_data}} \
        --batch_size ${{inputs.batch_size}} \
        --max_epochs ${{inputs.max_epochs}} \
        --num_workers ${{inputs.num_workers}} \
        --hidden_dim ${{inputs.hidden_dim}} \
    """,
    # inject variables to the command above
    inputs={
        "path_to_data": Input(
            type="uri_folder",
            path="azureml://datastores/workspaceblobstore/paths/mnist-pytorch-lit-tutorial/data",
            mode="download" # use `download` to make access faster, `mount` if the dataset is larger than the VM's disk
        ),
        "batch_size": 32,
        "max_epochs": 5,
        "num_workers": 5,
        "hidden_dim": 3,
    },
    distribution={
        "type": "PyTorch",
        # set process count to the number of gpus on the node
        # NC6 has only one
        "process_count_per_instance": 1
    },
    # you can create multiple versions of the same environment, use @latest to fetch the latest one
    environment=f"{custom_env_name}@latest",
    # the name of compute infrastructure needed
    compute=gpu_compute_target,
    # set instance count to the number of nodes you want to use (1 * 6vCPUs)
    # ***to use more resources, you will need to increase your quotas***
    # https://learn.microsoft.com/en-us/azure/quotas/per-vm-quota-requests
    # for the moment we leave it to the smallest possible value so that new accounts do not encounter quota limits
    instance_count=1,
    # assemble training job with upload job
    experiment_name=experiment_name,
    # friendly name displayed in tables (optional)
    display_name="MNIST PyTorchLit",
    # job description
    description="Training an MLP classifier on the MNIST dataset"
)
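
The command string passes five options to `train.py`, whose contents are not shown in this section. As a hypothetical sketch of the receiving side, the script could parse them with `argparse` as below. The option names mirror the command above, while the types, defaults and help texts are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the options passed by the command job (hypothetical)."""
    parser = argparse.ArgumentParser(description="Train an MLP classifier on MNIST")
    parser.add_argument("--path_to_data", type=str, required=True,
                        help="folder where the job input is made available")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--max_epochs", type=int, default=5)
    parser.add_argument("--num_workers", type=int, default=5)
    parser.add_argument("--hidden_dim", type=int, default=3)
    return parser


# Parse an explicit argument list instead of sys.argv so the example runs anywhere.
# "/tmp/mnist/data" is a placeholder for the path injected by the job at run time.
args = build_parser().parse_args(
    ["--path_to_data", "/tmp/mnist/data", "--batch_size", "64", "--max_epochs", "2"]
)
print(args.path_to_data, args.batch_size, args.max_epochs, args.num_workers, args.hidden_dim)
# /tmp/mnist/data 64 2 5 3
```

At run time, each `${{inputs.<name>}}` placeholder in the command is replaced by the corresponding entry of `inputs`, so the script receives ordinary command-line values.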

### Submit the job

In [None]:
import webbrowser

# submit the command
returned_job = ml_client.jobs.create_or_update(
    training_job,
    experiment_name=experiment_name
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.studio_url)
# open the browser with this url
webbrowser.open(returned_job.studio_url)

# print the job name
print(
    f"The job details can be accessed programmatically using identifier: {returned_job.name}"
)

> **Important**:
> Your default subscription might be limited to 6 vCPUs. To increase your quotas, follow the official instructions [here](https://learn.microsoft.com/en-us/azure/quotas/per-vm-quota-requests).

![expected result](media/azure-training-job-preparing.png)

As the job is executed, it goes through the following stages:

**Preparing**: A Docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the job history and can be viewed to monitor progress. If a curated environment is specified, the cached image backing that curated environment will be used.

**Scaling**: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.

**Running**: All scripts in the script folder *src* are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from *stdout* and the *./logs* folder are streamed to the job history and can be used to monitor the job.

## Cleanup

The following cell removes the created workspace and resource group. We will not delete your subscription since you would likely also lose your free credits. Deleting a resource group also removes its child resources (workspaces, storage accounts, key vaults, container registries, etc.). To help you familiarize yourself with Azure CLI, we delete the workspace before deleting its parent resource group.

In [None]:
# delete the workspace
!az ml workspace delete -w mnist-pytorch-lit-ws -g your_name-rg

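# delete the resource group created earlier (--yes skips the confirmation prompt, which a notebook cell cannot answer)
!az group delete --name your_name-rg --yes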