## High Performance ML Algorithms (HPMLA)

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](https://github.com/Azure/BatchAI/tree/master/recipes) to install all dependencies and create a configuration file.
Use `utilities.py` and `configuration.json.template` from [here](https://github.com/Azure/BatchAI/tree/master/recipes). Fill in the values in `configuration.json` using the template and the linked instructions.
Make sure `utilities.py` is in the current directory or on the `PYTHONPATH` of the running Jupyter notebook.
Make sure the `configuration.json` that you create is in the current directory, or give its full path in the cell below.

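Before running the notebook, it can help to check that no template field was left blank. The sketch below is not part of `utilities.py`; the helper names `find_empty_fields` and `check_config_file` are hypothetical, and the code makes no assumption about which keys the template contains.

```python
import json


def find_empty_fields(config, prefix=''):
    """Return dotted paths of every empty-string value in a nested dict."""
    empty = []
    for key, value in config.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into nested sections of the configuration
            empty.extend(find_empty_fields(value, path + '.'))
        elif value == '':
            empty.append(path)
    return empty


def check_config_file(path='configuration.json'):
    """Load the JSON file at `path` and list its unfilled fields."""
    with open(path) as f:
        return find_empty_fields(json.load(f))
```

Calling `check_config_file()` from the notebook directory returns an empty list when every string field has a value. Some fields may legitimately stay empty, so treat the result as a hint rather than an error.
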
### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

from datetime import datetime
import sys

from azure.storage.file import FileService
from azure.storage.blob import BlockBlobService
import azure.mgmt.batchai.models as models

sys.path.append('../../')
import utilities as utils

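# Load settings from configuration.json and create the Batch AI client
cfg = utils.config.Configuration('configuration.json')
client = utils.config.create_batchai_client(cfg)

# Job sizing and storage names used throughout the notebook
nodeCount = 3
threadPerNode = 1
datasetblobpath = 'criteo-libsvm-uniform'  # blob container mounted for the dataset
datasetsharepath = 'criteo-libsvm-uniform-share'  # file share used for job output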


Create the Batch AI workspace if it does not already exist:

In [None]:
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

### Create File Share

For this example we will create a new file share to use for output data.

**Note:** You don't need to create a share for every cluster or job you deploy.

In [None]:
azure_file_share_name = datasetsharepath
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)
print('Done')

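Share creation fails if the name breaks the Azure Files naming rules (3 to 63 characters; lowercase letters, digits and hyphens; starting and ending with a letter or digit; no consecutive hyphens). A minimal sketch of a local check follows; `is_valid_share_name` is a hypothetical helper, not part of the Azure SDK or `utilities.py`.

```python
import re

# 3-63 characters, lowercase alphanumerics separated by single hyphens
_SHARE_NAME = re.compile(r'^(?=.{3,63}$)[a-z0-9]+(-[a-z0-9]+)*$')


def is_valid_share_name(name):
    """Return True if `name` satisfies the share naming rules above."""
    return _SHARE_NAME.match(name) is not None
```

The share name used in this notebook, `criteo-libsvm-uniform-share`, passes this check.
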
## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC24S_V3` nodes. The number of nodes in the cluster is set by the `nodes_count` variable.
- We will call the cluster `symsgdaicluster_nc24s_3`. Replace this name and `nodes_count` with the name and node count of your own cluster.

In [None]:
nodes_count = nodeCount
cluster_name = 'symsgdaicluster_nc24s_3'

parameters = models.ClusterCreateParameters(
    vm_size='STANDARD_NC24S_V3',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

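The `cfg.admin_password or None` and `cfg.admin_ssh_key or None` expressions above rely on Python's `or` to turn an empty value into `None`. A small illustration, assuming an unused credential is left as an empty string in the configuration (`empty_to_none` is a hypothetical name used only here):

```python
def empty_to_none(value):
    """Mirror the `value or None` idiom used in the cluster parameters."""
    return value or None


print(empty_to_none(''))        # None
print(empty_to_none('s3cret'))  # s3cret
```
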
### Create Compute Cluster

In [None]:
# Run this step only if you do not already have a cluster. If you do, skip it, but still mount the volumes.
client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

In [None]:
client.config.credentials.id


### Monitor Cluster Creation

`utilities.py` contains a helper function that prints the status of the cluster. The cluster is available once all nodes are allocated and have finished preparation.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)

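The cell above prints the status once. To block until the cluster is ready, a generic polling loop can be used. This is a sketch under stated assumptions: `wait_until` is a hypothetical helper, and the readiness test is passed in as a callable because the attributes to inspect on the cluster object depend on the SDK version.

```python
import time


def wait_until(predicate, timeout_s=600, interval_s=10, sleep=time.sleep):
    """Call `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the predicate succeeded, False on timeout.
    """
    waited = 0
    while True:
        if predicate():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

In this notebook the predicate could re-fetch the cluster with `client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)` and compare the reported node counts against `nodes_count`.
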
## 3. Run Azure Batch AI Training Job

### Configure Input Directories

We configure input and output directories for the job. They are not strictly required, because the full path to `supersgd` is given on the command line, and they may be removed in a future revision.

### Configure Output Directories
We will store the job's standard output and standard error in a directory on the mounted file share:

In [None]:
print(nodeCount)
print(threadPerNode)
input_directories = []
output_directories = []

In [None]:
# Mount points (relative to $AZ_BATCHAI_JOB_MOUNT_ROOT) reuse the share and container names
azure_file_share_mount_path = azure_file_share_name
azure_blob_work_mount_path = datasetblobpath

parameters = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=nodes_count,
    output_directories=[
        models.OutputDirectory(
            id='MODEL',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(
                azure_file_share_mount_path),
            path_suffix='models')
    ],
    input_directories=[
        models.InputDirectory(
            id='DATASET',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_work_mount_path, 'part-')),
        models.InputDirectory(
            id='MODEL',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_work_mount_path, 'models')),
    ],
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path),
    mount_volumes=models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_mount_path),
                relative_mount_path=azure_file_share_mount_path)
        ],
        azure_blob_file_systems=[
            models.AzureBlobFileSystemReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                container_name=azure_blob_work_mount_path,
                relative_mount_path=azure_blob_work_mount_path),
        ],
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(
            image='msmadl/symsgd:0.0.2')),
    job_preparation=models.JobPreparation(
        command_line="ls"),
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line="mpirun --allow-run-as-root -mca btl_tcp_if_exclude docker0 --hostfile $AZ_BATCHAI_MPI_HOST_FILE -np 3 /parasail/supersgd -l 1e-4 -k 32 -mc 1e-2 -e 10 -r 10 -f $AZ_BATCHAI_INPUT_DATASET -t 1 -gl 1 -glDir $AZ_BATCHAI_OUTPUT_MODEL -mem -bd $AZ_BATCHAI_INPUT_DATASET"
    )
)

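In the job definition, the directory IDs `DATASET` and `MODEL` are what make `$AZ_BATCHAI_INPUT_DATASET` and `$AZ_BATCHAI_OUTPUT_MODEL` available to the command line, and `-np 3` is hard-coded to match the three-node cluster. The sketch below derives the command from variables instead. `build_command` is a hypothetical helper; the `supersgd` flags are copied verbatim from the cell above, and mapping `-np` to the node count and `-t` to `threadPerNode` is an assumption.

```python
def build_command(node_count, threads_per_node,
                  input_id='DATASET', output_id='MODEL'):
    """Assemble the mpirun command line used in the job definition."""
    # Batch AI exposes each directory as AZ_BATCHAI_INPUT_<id> / AZ_BATCHAI_OUTPUT_<id>
    dataset = '$AZ_BATCHAI_INPUT_{0}'.format(input_id)
    model_dir = '$AZ_BATCHAI_OUTPUT_{0}'.format(output_id)
    return (
        'mpirun --allow-run-as-root -mca btl_tcp_if_exclude docker0 '
        '--hostfile $AZ_BATCHAI_MPI_HOST_FILE -np {np} '
        '/parasail/supersgd -l 1e-4 -k 32 -mc 1e-2 -e 10 -r 10 '
        '-f {dataset} -t {threads} -gl 1 -glDir {model_dir} '
        '-mem -bd {dataset}'
    ).format(np=node_count, threads=threads_per_node,
             dataset=dataset, model_dir=model_dir)
```

With `build_command(nodeCount, threadPerNode)` the result is identical to the literal string above for the values set at the top of the notebook, and it stays consistent if the cluster size changes.
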

### Create a Training Job


In [None]:
experiment_name = 'parasail_experiment'
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
job_name = datetime.utcnow().strftime('tf_%m_%d_%Y_%H%M%S')
job = client.jobs.create(cfg.resource_group, cfg.workspace, experiment_name, job_name, parameters).result()
print('Created Job {0} in Experiment {1}'.format(job.name, experiment.name))

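The job name is built from the current UTC time so that repeated runs do not collide. A short illustration of the format with a fixed timestamp (`make_job_name` is a hypothetical helper; the `tf` prefix is the one used in the cell above):

```python
from datetime import datetime


def make_job_name(now, prefix='tf'):
    """Format a timestamp the same way as the job-creation cell."""
    return now.strftime(prefix + '_%m_%d_%Y_%H%M%S')


print(make_job_name(datetime(2018, 5, 1, 12, 30, 45)))  # tf_05_01_2018_123045
```
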
### Wait for Job to Finish
The job starts running once the cluster has enough idle nodes. The following code waits for the job to start, printing the cluster state in the meantime. While the job runs, the code prints the current content of `stdout-wk-0.txt` (the output of the worker running on the first node).

In [None]:
utils.job.wait_for_job_completion(client, cfg.resource_group, cfg.workspace, 
                                  experiment_name, job_name, cluster_name, 'stdouterr', 'stdout-wk-0.txt')

### List stdout.txt and stderr.txt files for the Job

In [None]:
files = client.jobs.list_output_files(cfg.resource_group, cfg.workspace, experiment_name, job_name,
                                      models.JobsListOutputFilesOptions(outputdirectoryid='stdouterr')) 
for f in list(files):
    print(f.name, f.download_url or 'directory')
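
The loop above prints `'directory'` for entries without a download URL. The same test can separate files from directories. This is a minimal sketch: `Entry` is a stand-in that models only the two attributes used above, and `split_entries` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the listing entries; only `name` and `download_url` are modelled
Entry = namedtuple('Entry', ['name', 'download_url'])


def split_entries(entries):
    """Return (files, directories) based on whether download_url is set."""
    files = [e.name for e in entries if e.download_url]
    dirs = [e.name for e in entries if not e.download_url]
    return files, dirs
```

In the notebook, `split_entries(list(files))` would give the file names to download and the directory names to list further.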