# Part 1. Local Training

1. Train a sklearn model locally
2. Set up Workspace
3. Track Runs of the Experiment



# 1.1 Train a sklearn model locally

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

In [2]:
df = pd.read_csv('data/train.csv')

In [3]:
print(df.shape)
df.head()

(891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
# We'll skip feature eng and just use the numerical columns.
X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
y = df['Survived'].copy()

X['Age'] = X['Age'].fillna(X['Age'].median())
assert not X.isna().any().any()  # no missing values left in any column

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
m = RandomForestClassifier(n_estimators=10)
m.fit(X_train, y_train)
roc_auc_score(y_test, m.predict(X_test))

0.6564558629776022
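
One caveat on the score above: `m.predict` returns hard 0/1 labels, while ROC AUC measures how well a model ranks positives above negatives. It is therefore usually computed from scores, e.g. `m.predict_proba(X_test)[:, 1]`. The sketch below uses plain Python and made-up numbers (the `auc` helper and the toy data are illustrative assumptions, not part of this notebook) to show how thresholding at 0.5 can throw ranking information away.

```python
def auc(y_true, y_score):
    """Rank-based ROC AUC: the fraction of (positive, negative) pairs
    in which the positive gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.45, 0.8, 0.3, 0.9]          # predicted P(class 1)
labels = [1 if p >= 0.5 else 0 for p in proba]   # what predict() would return

auc_from_proba = auc(y_true, proba)
auc_from_labels = auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

Here the probabilities rank every positive above every negative, but the positive scored 0.45 falls below the threshold and ties with the negatives once the scores are turned into labels, so the label-based AUC comes out lower.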

Let's wrap this up in a function:

In [7]:
def train_model(csv_fname, n_estimators):
    df = pd.read_csv(csv_fname)
    
    # We'll skip feature eng and just use the numerical columns.
    X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
    y = df['Survived'].copy()
    X['Age'] = X['Age'].fillna(X['Age'].median())
    assert not X.isna().any().any()  # no missing values left in any column
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    m = RandomForestClassifier(n_estimators=n_estimators)
    m.fit(X_train, y_train)
    return roc_auc_score(y_test, m.predict(X_test))

train_model('data/train.csv', n_estimators=10)

0.6860095389507155

# 1.2 Set up Workspace

On the Azure portal:

1. Create resource group
2. Create a Machine Learning workspace

This automatically creates a few other resources for you as well.

# 1.3 Track Runs of the Experiment

Base:
```
pip install azureml-sdk
```

Optional extras:

```
pip install azureml-sdk[notebooks,tensorboard,explain,automl]
pip install azureml-dataprep[pandas]
pip install azureml-contrib-explain-model
```

Note: the `automl` extra installs all of the AutoML code locally on your computer, so it takes a while to install.

In [8]:
import azureml.core
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

You are currently using version 1.0.72 of the Azure ML SDK


In [9]:
import os

subscription_id = os.getenv("SUBSCRIPTION_ID")
resource_group = os.getenv("RESOURCE_GROUP")
workspace_name = os.getenv("WORKSPACE_NAME")
workspace_region = os.getenv("WORKSPACE_REGION")
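
`os.getenv` returns `None` when a variable is unset, so a typo in a variable name only shows up later as a failed workspace lookup. A minimal sketch of failing fast instead is shown here; the helper name `workspace_settings` and the `REQUIRED_VARS` tuple are hypothetical, not part of the Azure ML SDK.

```python
import os

# Hypothetical list of the variables this notebook needs.
REQUIRED_VARS = ("SUBSCRIPTION_ID", "RESOURCE_GROUP", "WORKSPACE_NAME")

def workspace_settings(env=os.environ):
    """Return the workspace settings as a dict, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "subscription_id": env["SUBSCRIPTION_ID"],
        "resource_group": env["RESOURCE_GROUP"],
        "workspace_name": env["WORKSPACE_NAME"],
    }
```

The dictionary keys match the keyword arguments passed to `Workspace(...)` in this notebook, so the result could be unpacked as `Workspace(**workspace_settings())`.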

In [10]:
from azureml.core import Workspace

try:
    ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)
    # Write the workspace details to a local config file, so that later
    # sessions can reload it with: ws = Workspace.from_config()
    ws.write_config()
    print("Workspace configuration succeeded.")
except Exception:
    print("Workspace not accessible. Change your parameters or create a new workspace below.")

Workspace configuration succeeded.


In [11]:
from azureml.core import Experiment, Run

experiment = Experiment(workspace = ws, name = "titanic-3")

In [15]:
# This uploads the code your git repo is tracking; if your repo is too big, it will fail.
run = experiment.start_logging()

In [16]:
run.log('n_estimators', 10)

roc_auc = train_model('data/train.csv', n_estimators=10)

run.log('roc_auc', roc_auc)

In [17]:
run.complete()
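
If `train_model` raises between `start_logging()` and `run.complete()`, the run is never closed. One way to guard against that is a small context manager; this is a sketch, and the helper name `tracked_run` is hypothetical. It relies only on `experiment.start_logging()`, `run.complete()` and `run.fail()`.

```python
from contextlib import contextmanager

@contextmanager
def tracked_run(experiment):
    """Start a run, then mark it completed or failed when the block exits."""
    run = experiment.start_logging()
    try:
        yield run
    except Exception:
        run.fail()  # record the failure, then let the error propagate
        raise
    else:
        run.complete()

# Usage, assuming `experiment` and `train_model` from the cells above:
#
# with tracked_run(experiment) as run:
#     run.log('n_estimators', 10)
#     run.log('roc_auc', train_model('data/train.csv', n_estimators=10))
```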

A Run can carry tags and log much more than scalars, for example lists, tables and images. [See documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py)

Now's a good time to view this on the Azure portal.

## Track many experiments and compare them

In [18]:
# Note: every run in this loop is logged to the same experiment.
experiment = Experiment(workspace = ws, name = "titanic-1")

for n_estimators in tqdm([1,5,10,50,100,200,300,400,500]):
    run = experiment.start_logging()
    run.log('n_estimators', n_estimators)
    roc_auc = train_model('data/train.csv', n_estimators=n_estimators)
    run.log('roc_auc', roc_auc)
    run.complete()

100%|██████████| 9/9 [01:23<00:00,  9.23s/it]


## Note: **You can't actually delete any runs or experiments**. No idea why...

Instead, we can fetch all the runs and filter them ourselves.
It is probably good practice to keep each experiment small.

In [19]:
%%time
run_metrics = []
for run in experiment.get_runs():
    if run.status == 'Completed':
        run_metrics.append(run.get_metrics())
run_metrics = pd.DataFrame(run_metrics)

CPU times: user 1.91 s, sys: 163 ms, total: 2.08 s
Wall time: 29.5 s


Alternatively we can filter like so:

```
list(experiment.get_runs(properties={"author": "azureml-user"}, tags="worth another look"))
```
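
The `tags` filter only matches runs that have been tagged. A minimal sketch of attaching such a tag with `run.tag` follows; the helper name `tag_if_promising` and the 0.73 threshold are hypothetical choices for illustration.

```python
def tag_if_promising(run, roc_auc, threshold=0.73, tag="worth another look"):
    """Attach `tag` to the run when its score reaches the threshold.

    Returns True if the run was tagged, False otherwise.
    """
    if roc_auc >= threshold:
        run.tag(tag)
        return True
    return False
```

Calling `tag_if_promising(run, roc_auc)` right after `run.log('roc_auc', roc_auc)` would mark the runs that the tag filter should pick up later.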


In [20]:
# I accidentally called run.log('n_estimators', ...) twice in one run. That stores a list of values instead of overwriting.
# So I'm just converting the list back into a single value.
run_metrics['n_estimators'] = run_metrics['n_estimators'].map(lambda x: x[0] if isinstance(x, list) else x)

In [21]:
run_metrics.groupby('n_estimators') \
        .roc_auc.agg(['mean', 'std']) \
        .rename(columns={'mean': 'roc_auc-mean', 'std': 'roc_auc-std'})

| n_estimators | roc_auc-mean | roc_auc-std |
|---|---|---|
| 1 | 0.688588 | 0.04779 |
| 5 | 0.70641 | 0.027006 |
| 10 | 0.667017 | 0.039594 |
| 50 | 0.73473 | 0.013213 |
| 100 | 0.723661 | 0.017606 |
| 200 | 0.720627 | 0.006344 |
| 300 | 0.731175 | 0.004351 |
| 400 | 0.733854 | 0.007632 |
| 500 | 0.737002 | 0.007094 |


# Additional Resources

https://github.com/Azure/MachineLearningNotebooks/tree/69d4344dff6de3633773b89b818f20bd630cf40c/how-to-use-azureml