# Part 2. Remote Training

1. Set up the "Environment" image
2. Run it remotely on a cluster
3. Run hyperparameter search on a cluster


## Workspace

In [1]:
import os

subscription_id = os.getenv("SUBSCRIPTION_ID")
resource_group = os.getenv("RESOURCE_GROUP")
workspace_name = os.getenv("WORKSPACE_NAME")
workspace_region = os.getenv("WORKSPACE_REGION")
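
`os.getenv` returns `None` for an unset variable, so a missing setting only surfaces later, when the `Workspace` is constructed. A minimal sketch of an early check follows; `require_env` is a hypothetical helper (not part of the Azure ML SDK) and the values are placeholders.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the given variables; raise if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {n: environ[n] for n in names}


# Example with an explicit mapping (placeholder values, not real credentials)
example = require_env(
    ["SUBSCRIPTION_ID", "RESOURCE_GROUP"],
    environ={"SUBSCRIPTION_ID": "sub-placeholder", "RESOURCE_GROUP": "rg-placeholder"},
)
print(example)
```

In the notebook this could be called with the four variable names above before building the `Workspace`.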

In [2]:
from azureml.core import Workspace

try:
    ws = Workspace(subscription_id=subscription_id,
                   resource_group=resource_group,
                   workspace_name=workspace_name)
    # Write the workspace details to a local config file so that later
    # sessions can reconnect with: ws = Workspace.from_config()
    ws.write_config()
    print("Workspace configuration succeeded.")
except Exception:
    print("Workspace not accessible. Change your parameters or create a new workspace below.")

Workspace configuration succeeded.


## Compute

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cpu_cluster_name = "cpu-d2-cluster"

# You can create it via the UI, or via code.
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cpu-cluster")
except ComputeTargetException:
    print("Creating new cpu-cluster")
    
    # We provision Azure ML compute (AmlCompute) - these are cheap shared resources.
    # Note: VM must be D1 or higher. See https://azureprice.net/
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_D2_v2",
        vm_priority='lowpriority', # <- Like spot instances
        min_nodes=0,
        max_nodes=16)

    # Create the cluster with the specified name and configuration
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
    # Wait for the cluster to complete, show the output log
    cpu_cluster.wait_for_completion(show_output=True)

Found existing cpu-cluster


Note: by default, the quota allows up to 12 low-priority CPU VMs. I requested a quota increase but haven't heard back yet.

## Dataset

"Azure Blob Storage" is like S3. Our workspace comes with a default storage. We'll upload our data there and download it to the VMs.

In [4]:
ds = ws.get_default_datastore()
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)

workspaceblobstore AzureBlob testmlworkspac1857784133 azureml-blobstore-4626ff12-7ea7-4adf-ac2e-5e8595e1aa4f


In [5]:
ds.upload_files(['data/train.csv'], target_path='data', overwrite=True)

Uploading an estimated of 1 files
Uploading data/train.csv
Uploaded data/train.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_790c22c7f7234c8da52c611e96895e0b

A **Dataset** object can reference one or many files from various locations in various formats. Dataset provides you with the ability to download or mount the files to your compute.

In [6]:
# initialize file dataset 
from azureml.core import Dataset
ds_paths = [(ds, 'data/')] # load all files in there
dataset = Dataset.File.from_files(path = ds_paths)

In [7]:
# list the files referenced by the dataset
dataset.to_path()

array(['/train.csv'], dtype=object)

## Code

Extract the training procedure into a Python file (or an entire `src` directory). This will be **copied** onto each VM.

**Try it locally:**

`python scripts/train.py --data-folder data`

In [8]:
script_folder = './scripts'
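
The notebook does not show `train.py` itself. Judging by the flags passed to it in this notebook (`--data-folder`, `--n-estimators`, and the Hyperdrive parameters `max-depth` and `min-samples-split`), its command-line interface might look roughly like the sketch below. The defaults and help texts are assumptions, not the author's actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the notebook passes to train.py (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Train a model on the CSV files in --data-folder")
    parser.add_argument("--data-folder", type=str, required=True,
                        help="Folder containing train.csv")
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=None)
    parser.add_argument("--min-samples-split", type=int, default=2)
    return parser.parse_args(argv)


# Equivalent to: python scripts/train.py --data-folder data --n-estimators 50
args = parse_args(["--data-folder", "data", "--n-estimators", "50"])
print(args.data_folder, args.n_estimators, args.max_depth, args.min_samples_split)
```

`argparse` turns the dashes into underscores, so `--data-folder` is read as `args.data_folder`.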

## Environment

There are many ways to create your env, with `conda`, `pip`, or your own custom Docker image.

Remember to **register** the environment so you can reuse it in the future.

In [9]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Azure ML also ships with a set of prebuilt environments. We'll use our own, but list them here for reference.
envs = Environment.list(workspace=ws)
envs.keys()

dict_keys(['my-sklearn-env', 'AzureML-Tutorial', 'AzureML-Minimal', 'AzureML-Chainer-5.1.0-GPU', 'AzureML-PyTorch-1.2-CPU', 'AzureML-TensorFlow-1.12-CPU', 'AzureML-TensorFlow-1.13-CPU', 'AzureML-PyTorch-1.1-CPU', 'AzureML-TensorFlow-1.10-CPU', 'AzureML-PyTorch-1.0-GPU', 'AzureML-TensorFlow-1.12-GPU', 'AzureML-TensorFlow-1.13-GPU', 'AzureML-Chainer-5.1.0-CPU', 'AzureML-PyTorch-1.0-CPU', 'AzureML-Scikit-learn-0.20.3', 'AzureML-PyTorch-1.2-GPU', 'AzureML-PyTorch-1.1-GPU', 'AzureML-TensorFlow-1.10-GPU', 'AzureML-PyTorch-1.3-GPU', 'AzureML-TensorFlow-2.0-CPU', 'AzureML-PyTorch-1.3-CPU', 'AzureML-TensorFlow-2.0-GPU', 'AzureML-PySpark-MmlSpark-0.15'])

In [10]:
try:
    conda_env = Environment.get(workspace=ws, name="my-sklearn-env", version="1")
except Exception:
    conda_env = Environment("my-sklearn-env")
    conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['scikit-learn',
                                                                                 'azureml-sdk',
                                                                                 'azureml-dataprep[pandas,fuse]>=1.1.21'])

    # Other ways to create an env:
    
    # myenv = Environment.from_conda_specification(name = "myenv",
    #                                              file_path = "path-to-conda-specification-file")

    # myenv = Environment.from_pip_requirements(name = "myenv",
    #                                           file_path = "path-to-pip-requirements-file")

    # myenv = Environment.from_existing_conda_environment(name = "myenv",
    #                                                     conda_environment_name = "mycondaenv")

    # Can also install private wheels


    # Register it for reuse later
    conda_env.register(workspace=ws)

## RunConfig

There are many types of RunConfig. The one we'll use allows us to run a custom script.

In [13]:
from azureml.core import ScriptRunConfig
from uuid import uuid4

src = ScriptRunConfig(source_directory=script_folder,
                      script='train.py',
                      arguments=['--data-folder', dataset.as_named_input('data').as_download('/tmp/{}'.format(uuid4())),
                                 '--n-estimators', 100])

src.run_config.framework = "python"
src.run_config.environment = conda_env
# Note: if you comment out the following line, the script runs locally instead (in a custom conda env set up for it)
src.run_config.target = cpu_cluster.name

## Experiment

In [14]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name='titanic-22')

In [15]:
run = exp.submit(config=src)

from azureml.widgets import RunDetails
RunDetails(run).show()

# run.wait_for_completion(show_output=True)

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

# Run it many times!

In [16]:
configs = []

for _ in range(10):
    for n_estimators in [10, 50, 100, 500]:
        src = ScriptRunConfig(source_directory=script_folder,
                              script='train.py',
                              arguments=['--data-folder', dataset.as_named_input('data').as_download('/tmp/{}'.format(uuid4())),
                                         '--n-estimators', n_estimators])

        src.run_config.framework = "python"
        src.run_config.environment = conda_env
        src.run_config.target = cpu_cluster.name
        
        configs.append(src)

In [None]:
#configs

In [17]:
from tqdm import tqdm
for src in tqdm(configs):
    exp.submit(config=src)

100%|██████████| 40/40 [04:40<00:00,  7.01s/it]
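
The 40 in the progress bar is 10 repeats times 4 values of `n_estimators`. The same grid can be sketched without any Azure objects; the variable names here are illustrative only.

```python
from collections import Counter
from itertools import product

repeats = range(10)
n_estimators_grid = [10, 50, 100, 500]

# One entry per submitted run: every grid value is repeated 10 times
runs = [n_estimators for _, n_estimators in product(repeats, n_estimators_grid)]
counts = Counter(runs)

print(len(runs))
print(counts)
```

Repeating each setting gives several results per value of `n_estimators`, which makes it possible to compare settings on more than a single run.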


# Hyperparam Search (Hyperdrive)

It's roughly the same set of ideas, but configured a little differently.

See the [documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters) for more details.

In [18]:
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import BayesianParameterSampling
from azureml.train.hyperdrive import HyperDriveConfig
from azureml.train.hyperdrive import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice

In [19]:
# This looks a lot like configuring a ScriptRunConfig
estimator = Estimator(
    source_directory = script_folder, 
    compute_target = cpu_cluster,
    entry_script = 'train.py',
    script_params = {
        "--data-folder": dataset.as_named_input('data').as_mount()
    },
    environment_definition = conda_env)



In [20]:
param_sampling = BayesianParameterSampling({
    # choice := choose one from the list
    "n-estimators": choice(10, 50, 100, 500, 750, 1000),
    "max-depth": choice(*range(4, 25)),  # 4, 5, ..., 24
    "min-samples-split": choice(2, 3, 4),
})
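
To get a feel for the size of this search space, the number of distinct combinations can be counted directly. This is a plain-Python sketch that mirrors the `choice` lists above with ordinary lists; it does not use Hyperdrive.

```python
from math import prod

# Same options as the sampling definition, as plain lists
search_space = {
    "n-estimators": [10, 50, 100, 500, 750, 1000],
    "max-depth": list(range(4, 25)),
    "min-samples-split": [2, 3, 4],
}

n_combinations = prod(len(options) for options in search_space.values())
print(n_combinations)
```

That is 6 x 21 x 3 = 378 combinations, so a budget of 100 runs covers only part of the grid and the sampler has to pick which combinations to try.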

In [21]:
hyperdrive_run_config = HyperDriveConfig(
    estimator = estimator,
    hyperparameter_sampling = param_sampling,
    policy = None,
    primary_metric_name = "roc_auc",
    primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
    max_total_runs = 100,
    max_concurrent_runs = 10)
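
Hyperdrive compares child runs by a metric that each run logs under the name given in `primary_metric_name`, so `train.py` presumably logs `roc_auc` somewhere. A minimal sketch of that step follows, assuming the script uses `Run.get_context()` and `run.log(...)` from `azureml.core`. `PrintRun`, `get_run`, and `report_metric` are hypothetical helpers, and the value `0.5` is a placeholder, not a result.

```python
class PrintRun:
    """Hypothetical stand-in for an Azure ML run, used when the SDK is not installed."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        self.logged[name] = value
        print("{} = {}".format(name, value))


def get_run():
    """Return the current Azure ML run if the SDK is available, else a local stand-in."""
    try:
        from azureml.core import Run
        return Run.get_context()
    except ImportError:
        return PrintRun()


def report_metric(run, roc_auc):
    # The name must match primary_metric_name in the HyperDriveConfig
    run.log("roc_auc", float(roc_auc))


local_run = PrintRun()
report_metric(local_run, 0.5)  # placeholder value, not a real result
```

If the logged name and `primary_metric_name` differ, Hyperdrive has no metric with which to rank the runs.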

In [22]:
exp = Experiment(workspace=ws, name='titanic-hyperdrive-22')
hyperdrive_run = exp.submit(hyperdrive_run_config)

In [23]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [27]:
hyperdrive_run

Experiment,Id,Type,Status,Details Page,Docs Page
titanic-hyperdrive-22,titanic-hyperdrive-22_1573488932805900,hyperdrive,Running,Link to Azure Machine Learning studio,Link to Documentation


In [28]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
titanic-hyperdrive-22,titanic-hyperdrive-22_1573488932805900_26,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [29]:
file_names = best_run.get_file_names()
file_names

['azureml-logs/55_azureml-execution-tvmps_1f47327aef9b4d27cf7182be70207001040c7a84a66ed58ccdedf97ccbf78c11_p.txt',
 'azureml-logs/65_job_prep-tvmps_1f47327aef9b4d27cf7182be70207001040c7a84a66ed58ccdedf97ccbf78c11_p.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_1f47327aef9b4d27cf7182be70207001040c7a84a66ed58ccdedf97ccbf78c11_p.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/137_azureml.log',
 'logs/azureml/azureml.log',
 'outputs/rf-1000-19-3.pkl']
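
Only the last entry in this list is the model; the rest are logs. A small sketch of picking out the pickled model(s) before downloading follows; `model_files` is a hypothetical helper and the example list is a shortened copy of the output above.

```python
def model_files(file_names):
    """Keep only pickled models stored under the run's outputs/ folder."""
    return [f for f in file_names if f.startswith("outputs/") and f.endswith(".pkl")]


example_names = [
    "azureml-logs/70_driver_log.txt",
    "logs/azureml/azureml.log",
    "outputs/rf-1000-19-3.pkl",
]

print(model_files(example_names))
```

Filtering first would avoid downloading the log files when only the model is needed.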

In [30]:
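import joblib  # `sklearn.externals.joblib` is deprecated and was removed in scikit-learn 0.23

model_folder = "./model"
os.makedirs(model_folder, exist_ok=True)

# Download every file from the best run (logs as well as the pickled model)
for f in file_names:
    best_run.download_file(f, model_folder)

# Note: unpickling is only reliable when the local scikit-learn version matches
# the version used for training; a mismatch can fail, as in the output below.
best_model_name_ = [f for f in os.listdir(model_folder) if f.endswith('.pkl')][0]
best_model_ = joblib.load(os.path.join(model_folder, best_model_name_))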

Trying to unpickle estimator DecisionTreeClassifier from version 0.21.3 when using version 0.20.3. This might lead to breaking code or invalid results. Use at your own risk.


KeyError: 0
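
The warning above says the model was pickled with scikit-learn 0.21.3 but loaded with 0.20.3. One way to avoid such a mismatch is to pin the remote environment to the locally installed version when listing `pip_packages`. The sketch below builds such a requirement string; `pinned_requirement` is a hypothetical helper, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def pinned_requirement(package):
    """Return 'package==<installed version>', or the bare name if it is not installed."""
    try:
        return "{}=={}".format(package, metadata.version(package))
    except metadata.PackageNotFoundError:
        return package


# Prints e.g. 'scikit-learn==<local version>' when scikit-learn is installed locally
print(pinned_requirement("scikit-learn"))
```

The resulting string could replace the unpinned `'scikit-learn'` entry in the environment definition, so that training and loading use the same version.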

In [None]:
best_model_