# README.

This notebook is the entry point for Azure ML-enabled training.
In essence, it connects to Azure ML, makes sure that everything is ready there, and starts the training.
To that end, this notebook gathers all necessary source code in a temp folder, which is then pushed to Azure ML for training.

# Imports.

In [10]:
%reload_ext autoreload
%autoreload 2

import os
import shutil

from pathlib import Path


from azureml.core import Dataset, Experiment, Environment, Run, ScriptRunConfig, Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

from src.constants import REPO_DIR
from src.train_util import copy_dir
from src.config import CONFIG, DATASET_MODE_MOUNT, DATASET_MODE_DOWNLOAD

# Settings.

In [14]:
dataset_name = "anon-depthmap-95k"
experiment_name = "q3-cnndepthmap-resnet-height-95k"
tags = {}

# Create temp folder and copy code.

Be very precise about which code to copy.
And, most importantly, about which code NOT to copy: everything in the temp folder is uploaded to Azure ML.
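To make the copy step more concrete, here is a minimal sketch of what a helper like `copy_dir` might do. The real implementation lives in `src/train_util.py` and is not shown in this notebook, so the function name `copy_dir_sketch` and its behavior (keeping relative paths, touching `__init__.py` next to each copied file) are assumptions for illustration only.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dir_sketch(src, tgt, glob_pattern, should_touch_init=False):
    """Copy files under `src` that match `glob_pattern` into `tgt`, keeping relative paths."""
    src, tgt = Path(src), Path(tgt)
    copied = []
    for file_path in sorted(src.glob(glob_pattern)):
        if not file_path.is_file():
            continue
        destination = tgt / file_path.relative_to(src)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(file_path, destination)
        if should_touch_init:
            # Assumption: make each target folder importable as a package.
            (destination.parent / "__init__.py").touch()
        copied.append(destination.relative_to(tgt).as_posix())
    return copied


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "src"
    (src / "data").mkdir(parents=True)
    (src / "train.py").write_text("print('training')\n")
    (src / "notes.txt").write_text("not code\n")
    (src / "data" / "big.bin").write_text("should stay local\n")

    copied_files = copy_dir_sketch(src, Path(root) / "temp_train", glob_pattern="*.py")
    print(copied_files)  # only the top-level .py file matches the pattern
```

A narrow glob pattern such as `*.py` is what keeps notes, data files, and nested folders out of the upload.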

In [3]:
code_dir = Path("src")
temp_path = Path("temp_train")
copy_dir(src=code_dir, tgt=temp_path, glob_pattern='*.py')
copy_dir(src=REPO_DIR / "src/common", tgt=temp_path / "temp_common", glob_pattern='*/*.py', should_touch_init=True)

2021-04-27 11:35:34,626 - INFO - Creating temp folder - /mnt/resource/batch/tasks/shared/LS_root/mounts/clusters/jzcomp/code/Users/jziegler/cgm-ml/src/models/CNNDepthMap/CNNDepthMap-height/q3-cnndepthmap-resnet-height/src/train_util.py: line 10
2021-04-27 11:35:36,416 - INFO - Copying to temp_train the following files: [PosixPath('src/config.py'), PosixPath('src/constants.py'), PosixPath('src/model.py'), PosixPath('src/preprocessing.py'), PosixPath('src/train.py'), PosixPath('src/train_util.py'), PosixPath('src/utils.py'), PosixPath('src/__init__.py')] - /mnt/resource/batch/tasks/shared/LS_root/mounts/clusters/jzcomp/code/Users/jziegler/cgm-ml/src/models/CNNDepthMap/CNNDepthMap-height/q3-cnndepthmap-resnet-height/src/train_util.py: line 18
2021-04-27 11:35:37,510 - INFO - Creating temp folder - /mnt/resource/batch/tasks/shared/LS_root/mounts/clusters/jzcomp/code/Users/jziegler/cgm-ml/src/models/CNNDepthMap/CNNDepthMap-height/q3-cnndepthmap-resnet-height/src/train_util.py: line 10
2021-

# Connect to Azure workspace.

Make sure that you have a config.json file with the keys subscription_id, resource_group, and workspace_name (for example cgm-ml-dev). The file can live here (not so nice), in a parent folder (okay but not perfect), or in the root folder of this repo (way to go).
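The lookup described above can be checked before connecting. The sketch below walks from a start folder up through its parents until it finds a `config.json`, then reports which keys are missing. It is an illustration, not the SDK's own lookup code. The helper names `find_config` and `missing_keys` are hypothetical, and the third required key is assumed to be `workspace_name`.

```python
import json
import tempfile
from pathlib import Path

# Assumption: the three keys a workspace config file is expected to contain.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def find_config(start_dir, filename="config.json"):
    """Return the first `filename` found in `start_dir` or one of its parents, else None."""
    start_dir = Path(start_dir).resolve()
    for folder in [start_dir, *start_dir.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


def missing_keys(config_path):
    """Return the required keys that are absent from the JSON file, sorted."""
    config = json.loads(Path(config_path).read_text())
    return sorted(REQUIRED_KEYS - set(config))


# Demo: config.json sits in the (temporary) repo root, the notebook two levels below.
with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp).resolve()
    notebook_dir = repo_root / "src" / "models"
    notebook_dir.mkdir(parents=True)
    (repo_root / "config.json").write_text(json.dumps({
        "subscription_id": "<your-subscription-id>",
        "resource_group": "<your-resource-group>",
    }))

    found = find_config(notebook_dir)
    found_in_root = found == repo_root / "config.json"
    missing = missing_keys(found)
    print(found_in_root, missing)
```

Running a check like this first gives a clearer error message than a failed connection attempt.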

In [4]:
workspace = Workspace.from_config()
workspace

2021-04-27 11:35:52,254 - INFO - Found the config file in: /config.json - /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/core/workspace.py: line 287


Workspace.create(name='cgm-ml-prod-we-azml', subscription_id='9b5bbfae-d5d1-4aae-a2ca-75159c0c887d', resource_group='cgm-ml-prod-we-rg')

# Get the experiment.

- You should always arrange all your runs in an experiment.
- Create at least one experiment per sprint.
- Make sure that the name of the experiment reflects the sprint number.
- You can also add other tokens to the name, for example the network architecture, the dataset name, and/or the targets.
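The naming convention above can be encoded in a small helper so that every run follows it. This is a sketch only: `build_experiment_name` is a hypothetical function, and treating `q3` as the sprint (or period) label is an assumption based on the experiment name used in this notebook.

```python
import re


def build_experiment_name(sprint, *tokens):
    """Join a sprint label and optional tokens into a lowercase, hyphen-separated name."""
    parts = [sprint, *tokens]
    # Replace every run of characters other than a-z and 0-9 with a single hyphen.
    cleaned = [re.sub(r"[^a-z0-9]+", "-", str(part).lower()).strip("-") for part in parts]
    return "-".join(part for part in cleaned if part)


name = build_experiment_name("q3", "cnndepthmap", "resnet", "height", "95k")
print(name)
```

Building the name from its parts keeps the sprint label first and makes it harder to forget a token such as the dataset size.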

In [5]:
experiment = Experiment(workspace=workspace, name=experiment_name)
experiment

2021-04-27 11:55:07,947 - INFO - Created a worker pool for first use - /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/_restclient/clientbase.py: line 192


| Name | Workspace | Report Page | Docs Page |
| --- | --- | --- | --- |
| q3-cnndepthmap-resnet-height-95k | cgm-ml-prod-we-azml | Link to Azure Machine Learning studio | Link to Documentation |


# Find/create a compute target.

Connects to a compute cluster on Azure ML.
If the compute cluster does not exist, it will be created.

Note: Compute clusters usually autoscale. This means that new nodes are created when necessary, and unused VMs are shut down.

In [6]:
cluster_name = "gpu-cluster"

try:
    # Compute cluster exists. Just connect to it.
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    # Compute cluster does not exist. Create one.
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC6',
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

compute_target

Found existing compute target.


AmlCompute(workspace=Workspace.create(name='cgm-ml-prod-we-azml', subscription_id='9b5bbfae-d5d1-4aae-a2ca-75159c0c887d', resource_group='cgm-ml-prod-we-rg'), name=gpu-cluster, id=/subscriptions/9b5bbfae-d5d1-4aae-a2ca-75159c0c887d/resourceGroups/cgm-ml-prod-we-rg/providers/Microsoft.MachineLearningServices/workspaces/cgm-ml-prod-we-azml/computes/gpu-cluster, type=AmlCompute, provisioning_state=Succeeded, location=westeurope, tags=None)

# Get the dataset for training.

Here you specify which dataset to use.

Note: Double-check on Azure ML that you are using the right one.

In [7]:
dataset = workspace.datasets[dataset_name]
dataset

{
  "source": [
    "('omdena_datasets', '95k_depthmap_trainingdata/**')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "8c6604b8-d248-410a-9424-09b0bc883369",
    "name": "anon-depthmap-95k",
    "version": 1,
    "description": "A depthmap based dataset containing 95k artifacts, to be used for training.",
    "workspace": "Workspace.create(name='cgm-ml-prod-we-azml', subscription_id='9b5bbfae-d5d1-4aae-a2ca-75159c0c887d', resource_group='cgm-ml-prod-we-rg')"
  }
}

# Push the training source code to Azure.

Creates a ScriptRunConfig (a description of one training run: source directory, entry script, arguments, environment, and compute target) and submits it to the compute cluster.

In [8]:
script_params = {f"--{k}": v for k, v in CONFIG.items()}
script_params

{'--DATASET_MODE': 'dataset_mode_download',
 '--DATASET_NAME': 'anon-depthmap-95k',
 '--DATASET_NAME_LOCAL': 'anon-depthmap-mini',
 '--SPLIT_SEED': 0,
 '--IMAGE_TARGET_HEIGHT': 240,
 '--IMAGE_TARGET_WIDTH': 180,
 '--EPOCHS': 1000,
 '--BATCH_SIZE': 256,
 '--SHUFFLE_BUFFER_SIZE': 2560,
 '--NORMALIZATION_VALUE': 7.5,
 '--LEARNING_RATE': 0.0007,
 '--USE_ONE_CYCLE': True,
 '--USE_DROPOUT': False,
 '--USE_WANDB': False,
 '--TARGET_INDEXES': [0]}
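When the run is submitted, this dictionary is flattened into a list of command-line strings, so every value reaches `train.py` as text (`True` becomes `"True"`, `[0]` becomes `"[0]"`). The sketch below shows that flattening on a small, made-up subset of the config, plus one possible way to recover typed values on the receiving side. How `train.py` actually parses its arguments is not shown in this notebook, so `parse_value` is purely an assumption.

```python
import ast

# Hypothetical subset of CONFIG; the real values live in src/config.py.
config = {
    "DATASET_MODE": "dataset_mode_download",
    "EPOCHS": 1000,
    "USE_ONE_CYCLE": True,
    "TARGET_INDEXES": [0],
}

script_params = {f"--{k}": v for k, v in config.items()}

# Flatten {"--KEY": value} into ["--KEY", "value", ...]; every item becomes a string.
arguments = [str(item) for pair in script_params.items() for item in pair]
print(arguments)


def parse_value(text):
    """Turn a command-line string back into a Python literal, or keep it as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


# Pair up flags and values again and restore the original types.
recovered = {
    arguments[i]: parse_value(arguments[i + 1])
    for i in range(0, len(arguments), 2)
}
print(recovered == script_params)
```

The round trip only works for values whose string form is a valid Python literal or a plain string, which is worth keeping in mind when adding new config entries.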

In [11]:
curated_env_name = "cgm-env"

ENV_EXISTS = True
if ENV_EXISTS:
    cgm_env = Environment.get(workspace=workspace, name=curated_env_name)
else:
    cgm_env = Environment.from_conda_specification(name=curated_env_name, file_path=REPO_DIR / "environment_train.yml")
    cgm_env.docker.enabled = True
    cgm_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
    # cgm_env.register(workspace)  # Please be careful not to overwrite existing environments

In [12]:
if CONFIG.DATASET_MODE == DATASET_MODE_MOUNT:
    dataset_argument = dataset.as_named_input('cgm_dataset').as_mount()
elif CONFIG.DATASET_MODE == DATASET_MODE_DOWNLOAD:
    dataset_argument = dataset.as_named_input('cgm_dataset').as_download()
else:
    raise ValueError(f"Please specify a valid DATASET_MODE, got: {CONFIG.DATASET_MODE!r}")

In [15]:
# Flatten {"--KEY": value} pairs into ["--KEY", "value", ...].
script_arguments = [str(item) for pair in script_params.items() for item in pair]

# Create the ScriptRunConfig
script_run_config = ScriptRunConfig(
    source_directory=temp_path,
    compute_target=compute_target,
    script='train.py',
    arguments=[dataset_argument] + script_arguments,
    environment=cgm_env,
)

# Set compute target.
script_run_config.run_config.target = compute_target

# Run the experiment.
run = experiment.submit(config=script_run_config, tags=tags)

# Show run.
run

2021-04-27 11:58:50,073 - INFO - ScriptRunSubmit - /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/data/_loggerfactory.py: line 151


| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| q3-cnndepthmap-resnet-height-95k | q3-cnndepthmap-resnet-height-95k_1619524730_4da2df3b | azureml.scriptrun | Preparing | Link to Azure Machine Learning studio | Link to Documentation |


# Delete temp folder.

After all code has been pushed to Azure ML, the temp folder will be removed.

In [16]:
shutil.rmtree(temp_path)
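If a cell between copying and submitting fails, the cleanup cell above never runs and the temp folder stays behind. One way to guard against that, sketched here with a hypothetical `temporary_code_dir` context manager that is not part of this repository, is to tie the folder's lifetime to a `with` block.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_code_dir(path):
    """Create `path`, yield it, and remove it again even if the body raises."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


# Demo: the folder disappears although the simulated submission fails.
demo_root = Path(tempfile.mkdtemp())
demo_path = demo_root / "temp_train"
try:
    with temporary_code_dir(demo_path) as temp_path:
        (temp_path / "train.py").write_text("print('training')\n")
        raise RuntimeError("simulated submission failure")
except RuntimeError as error:
    print(f"Submission failed: {error}")

print(demo_path.exists())
shutil.rmtree(demo_root, ignore_errors=True)  # tidy up the demo's parent folder
```

In a notebook this means putting the copy and submit steps into a single cell, which trades some interactivity for a guaranteed cleanup.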
