## Training a model

In the previous notebook we prepared our dataset and registered it as Tabular dataset so that it is ready to use. In this notebook we will build and train a ML model and run it as an Experiment. An Experiment in Azure ML is a <i>container of trials that reprepresent multiple model runs</i>. In an Experiment you can track metrics which will allow you to quickly compare multiple model runs.

To run an Experiment, configuration settings are required. The easiest way to do this is by using an Estimator, which combines RunConfiguration and ScriptRunConfig. In the Estimator you get to define basic configurations like python environment, input data, and input script, but also more advanced settings like (hyper)parameters and compute targets (other than local).

We will now write and run a quick xgboost model to get to know these features. Once you're familiar with the Azure ML setup, it is time to take advantage of the features and make your way to the best steelplate fault prediction model!

In [61]:
import os
import numpy as np
import pandas as pd
from azureml.core import Workspace, Experiment, Dataset, Environment, Model
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

# Load the workspace
ws = Workspace.from_config()

# Load and view default datastore
# Datastores are references to storage locations such as Azure Storage blob containers
default_ds = ws.get_default_datastore()
steelplate_ds = ws.datasets.get("steelplate training dataset")

# Create folder for experiment files
experiment_folder = "steelplate_training"
os.makedirs(experiment_folder, exist_ok=True)

### Write experiment file

Although it is possible to run an Experiment inline (see 01-Getting_Started_with_Azure_ML), it is more convenient to write a training script as it will also be easier to re-use.

In [58]:
%%writefile $experiment_folder/steelplate_xgboost.py
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from xgboost import XGBClassifier

run = Run.get_context()

print("Loading Data..")
steelplate_df = run.input_datasets['steelplate'].to_pandas_dataframe()
X, y = steelplate_df.drop(columns = ['Column1', 'Healthy']).values, steelplate_df.loc[:, steelplate_df.columns == 'Healthy'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print("Training an xgboost model..")
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

print("Evaluate predictions..")
accuracy = np.average(y_pred == y_test)
run.log('Accuracy', np.float(accuracy))
f1_score = f1_score(y_test, y_pred, average='weighted')
run.log('F1 score', np.float(f1_score))
AUC = roc_auc_score(y_test, y_pred)
run.log('AUC', np.float(AUC))

print(f"Accuracy: {round((accuracy * 100.0),2)}%") 
print(f"F1 score: {round(f1_score,2)}")
print(f"AUC: {round(AUC,2)}")

print("Saving model..")
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/steelplate_model.pkl')

run.complete()

Overwriting steelplate_training/steelplate_xgboost.py


### Create Environment and Estimator, run Experiment

Before we can run the Experiment using `steelplate_xgboost.py`, we will have to define our python environment and Estimator. Since everyone will be working from the same workspace this hackathon, it is important to track your work by using personalized names.

In [54]:
# Create a Python environment for the experiment
steelplate_env = Environment("steelplate-env")
steelplate_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
steelplate_env.docker.enabled = True # Use a docker container

# Create a set of package dependencies (conda or pip as required)
steelplate_packages = CondaDependencies.create(conda_packages=['scikit-learn'],
                                          pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]', 'xgboost'])

# Add the dependencies to the environment
steelplate_env.python.conda_dependencies = steelplate_packages
print(steelplate_env.name, 'defined.')

steelplate-env defined.


In [60]:
# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      inputs=[steelplate_ds.as_named_input('steelplate')],
                      compute_target = 'local',
                      environment_definition = steelplate_env,
                      entry_script='steelplate_xgboost.py')


# Create an experiment
experiment_name = 'steelplate_training_emma' # change this name to create your own Experiment 
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator, tags={'model': 'xgboost'})

# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 'â€¦

{'runId': 'steelplate_training_emma_1600245260_7228eb3e',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2020-09-16T08:34:24.612857Z',
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': '93b9689d-54a7-47cf-af6b-755df2219866',
  'azureml.git.repository_uri': 'https://github.com/emmavandelaar/azureml-pdm-hackathon.git',
  'mlflow.source.git.repoURL': 'https://github.com/emmavandelaar/azureml-pdm-hackathon.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'af595b64a7f45b29ae5080c998d3a7d2d64bbe97',
  'mlflow.source.git.commit': 'af595b64a7f45b29ae5080c998d3a7d2d64bbe97',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [{'dataset': {'id': 'c5bb6a9f-e0c6-46f4-b413-0281b66b4e96'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'steelplate', 'mechanism': 'Direct'}}],
 'runDefinition': {'script': 'steelplate_xgboost.py',
  'scriptType': None,
  'useAbsolutePath': False,
  'arguments': []

In [70]:
# Register the model
model_name = 'steelplate_model_emma' # change this name to track your own model
run.register_model(model_path='outputs/steelplate_model.pkl', model_name=model_name,
                   tags={'Model':'XGBoost'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy'],
                              'F1 score': run.get_metrics()['F1 score']})

# List registered models
models = [model_name] # update this list with models you want to track
model_list = Model.list(ws)

for model in Model.list(ws):
    for model_name in models:
        if model.name == model_name:
            print(model.name, 'version:', model.version)
            for tag_name in model.tags:
                tag = model.tags[tag_name]
                print ('\t',tag_name, ':', tag)
            for prop_name in model.properties:
                prop = model.properties[prop_name]
                print ('\t',prop_name, ':', prop)
            print('\n')

steelplate_model_emma version: 3
	 Model : XGBoost
	 AUC : 0.8225557461406519
	 Accuracy : 0.8208471894735466
	 F1 score : 0.9486098009425065


steelplate_model_emma version: 2
	 Model : XGBoost
	 AUC : 0.8225557461406519
	 Accuracy : 0.8208471894735466


steelplate_model_emma version: 1
	 Model : XGBoost
	 AUC : 0.8225557461406519
	 Accuracy : 0.8208471894735466




### Try it yourself!

Now that you know how to use Azure ML studio to train your models and track their performance, it is time to start training yourself. Maybe you want to try out a different ML model or tune by configuring different hyperparameter settings. You could even consider picking a different Healthy/Unhealthy distinction and create a new training dataset. The sky is the limit!