# Tutorial #1: Model training with Azure Machine Learning

In this tutorial, you will train a machine learning model on local and Azure compute resources. You will explore the Azure Machine Learning service and the Azure ML SDK for Python. 
This notebook serves as a quick, hands-on introduction to the Azure Machine Learning service. 

Before you start this tutorial, create a workspace in the Azure portal:
[Create and manage Azure Machine Learning workspaces in the Azure portal](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace)

This tutorial covers the following:
* Extract data from Azure Search Service
* Train an XGBoost model on your local machine and on Azure compute resources
* Register the model in Azure Machine Learning Workspace

If you are trying out this tutorial for the first time, please run the code cells in this tutorial sequentially.
Tutorial #2 will cover the basics of deploying a model. 

## References

[Azure Machine Learning documentation](https://github.com/leekokhow/azureml/blob/master/predict-employee-retention-part1-training.ipynb).
                                                                

## Set up your development environment

### Dependencies required on your local machine to use the Azure ML SDK

Step 1. Create a [free Azure account](https://azure.microsoft.com/en-gb/free/). This tutorial uses Anaconda on your local machine to connect to your Azure account.

Step 2. This notebook was tested in Anaconda Jupyter Notebook. 
Once you have installed Anaconda on your machine, run the following conda and pip commands to install the required packages:
    
+ conda install anaconda-client
+ conda update anaconda
+ pip install azureml-sdk[notebooks,automl]
+ pip install azureml-dataprep[pandas]
+ conda update conda

**Note: If you need to upgrade the azureml components, uninstall the old versions before installing the new ones.**

Alternatively, if you don't have Anaconda, you can run this notebook in a [free Microsoft Azure Notebooks](https://notebooks.azure.com/) project.

### Import Azure Machine Learning SDK for Python 

This step checks that the Azure Machine Learning SDK for Python is installed.

**Note: If you encounter ModuleNotFoundError, try uninstalling all the azureml components and then reinstalling them.**

In [1]:
import azureml.core

# check core SDK version number (need Python 3.6 kernel if you run this in Microsoft Azure Notebooks)
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.24.0
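If the import fails, it can help to see which modules are missing before reinstalling anything. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` and the list of modules are illustrative assumptions, not part of the Azure ML SDK.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (for example "azureml") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Modules imported in this tutorial; adjust the list to your own setup.
print(missing_packages(["azureml.core", "pandas", "xgboost", "sklearn", "joblib"]))
```

An empty list means every module in the list can be located by the current kernel.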


### Connect Azure Machine Learning Workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

If you see this message:
"Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code `<token>` to authenticate."
    
Open the link and enter the `<token>` you were given to authenticate. Once authenticated, run the cell again to load the workspace.

In [2]:
# Load workspace configuration from the config.json file in the current folder.
from azureml.core import Workspace
workspace = Workspace.from_config()
print(workspace.name, workspace.location, workspace.resource_group, sep='\t')

csidmlws	southeastasia	cmt-202011001
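For reference, the config.json file read by `Workspace.from_config()` holds three fields: the subscription id, the resource group, and the workspace name. The sketch below only builds the file content with placeholder values (the angle-bracket strings are assumptions to be replaced with your own details); you can also download a ready-made config.json from the workspace page in the Azure portal.

```python
import json

# Placeholder values: replace them with the details of your own workspace.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Serialize with indentation so the file is easy to inspect and edit.
config_text = json.dumps(config, indent=4)
print(config_text)

# Uncomment to write the file next to this notebook:
# with open("config.json", "w") as f:
#     f.write(config_text)
```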


### Create an Experiment

An Experiment tracks the runs in your workspace. A workspace can have multiple experiments. 

In [3]:
from azureml.core import Experiment

experiment_name = 'predict-emailservice-xgboost'
exp = Experiment(workspace=workspace, name=experiment_name)

## Extract data

This example reads data from the Azure Search Service API. It requires an api_config.json file that holds the Azure Search Service credentials, and a Python script, **azure_search_client.py**, which provides helper functions for extracting data from the Azure Search Service.

In [4]:
from azure_search_client import azure_search_client as azs_client 
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import pandas as pd
import json
import concurrent
import datetime
from itertools import chain
import random
import numpy as np
from random import sample
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GroupShuffleSplit
import xgboost as xgb
from sklearn.model_selection import GroupKFold
from tqdm import tqdm

### *get_search_results* 

This function sends a query to the Azure Search Service and returns the results in JSON format.

Inputs:
1. service : class, an instance of the client class defined in azure_search_client.py 
2. query : string, the query to send to the Azure Search Service

Output: JSON dictionary

### *retrieve_from_search* 

This function calls *get_search_results* and flattens the JSON output into a pandas dataframe. It then adds randomly generated ratings, a session id column, and a query column.

Inputs:
1. query : string, the query to send to the Azure Search Service
2. sessionid : int, the id shared by all results of one query
3. azs_service : class, an instance of the client class defined in azure_search_client.py 

Output: pandas dataframe

In [5]:
def get_search_results(service, query):
    search_request_body = {
        "search": query,
        "featuresMode": "enabled",
        "scoringStatistics": "global",
        "count": "true"
    }
    return service.search(search_request_body)

def retrieve_from_search(query, sessionid, azs_service):
    
    ## Call the api service to retrieve json format data
    json_search_results = get_search_results(azs_service, query)
    
    ## Flatten the json format data into pandas dataframe
    search_results = json_normalize(json_search_results).fillna(0)
    search_results = search_results.sort_values(['@search.score'], ascending=False)
    search_results['query'] = query.lower()
    search_results['sessionid'] = sessionid
    print('{} rows for query : {}'.format(search_results.shape[0], query))
    
    # Produce random placeholder ratings; remove this step in production, once real ratings are available
    rows = search_results.shape[0]
    first = random.randint(1, rows)
    second = random.randint(1, rows-first) if first < rows else 0
    third = rows-first-second
    sequence = [5]* first + [3]* second+[1]* third
    random.shuffle(sequence)
    search_results['rating'] = sequence
    
    return search_results
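The random rating step above can be isolated and made reproducible. The following is a sketch under two assumptions that are not in the original code: the helper name `random_ratings` is hypothetical, and an empty result set returns an empty list (the original logic would raise `ValueError` from `random.randint(1, 0)` when a query returns no rows).

```python
import random


def random_ratings(rows, rng=random):
    """Return `rows` placeholder ratings drawn from {5, 3, 1} in random order."""
    if rows == 0:
        # random.randint(1, 0) would raise ValueError, so handle empty results.
        return []
    first = rng.randint(1, rows)                                   # number of 5s
    second = rng.randint(1, rows - first) if first < rows else 0   # number of 3s
    third = rows - first - second                                  # number of 1s
    sequence = [5] * first + [3] * second + [1] * third
    rng.shuffle(sequence)
    return sequence


rng = random.Random(0)  # fixed seed so repeated runs give the same labels
ratings = random_ratings(20, rng)
print(len(ratings), sorted(set(ratings), reverse=True))
```

Passing a seeded `random.Random` instance keeps the placeholder labels stable between runs, which makes the accuracy figures later in the notebook comparable.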

This code connects to the API service using the api_config.json file that we created. Refer to *Guide for Azure Search Service.pdf* to see where to find the required credentials.

The api_config.json file looks like this:


{"service_name": "xxxxx", 
    "endpoint": "xxxxx", 
    "api_version": "2020-06-30-preview", 
    "api_key": "xxxxx", 
    "index_name": "xxxxx"}

In [6]:
azs_service = azs_client.from_json('api_config.json')
azs_service

<azure_search_client.azure_search_client at 0x7f69d0661fd0>
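A missing or empty field in api_config.json may otherwise surface only as a failed request later on. Below is a minimal sketch of an early check, assuming the five keys shown in the sample file above; `load_api_config` and `REQUIRED_KEYS` are hypothetical names and are not part of azure_search_client.py.

```python
import json

# Keys taken from the sample api_config.json shown above.
REQUIRED_KEYS = {"service_name", "endpoint", "api_version", "api_key", "index_name"}


def load_api_config(path):
    """Load the JSON config and fail early if a required key is missing or empty."""
    with open(path) as f:
        config = json.load(f)
    missing = sorted(key for key in REQUIRED_KEYS if not config.get(key))
    if missing:
        raise ValueError("api config is missing values for: " + ", ".join(missing))
    return config


# Example call, once api_config.json exists in the current folder:
# config = load_api_config("api_config.json")
```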

The following code defines hardcoded queries that are sent to the search service using the function *retrieve_from_search*.

In [37]:
# Create the necessary queries to create dataset
query_input = ['thank you', 'clarify']

query_dataset = pd.DataFrame()
sessionid =1
for query in query_input:
    query_dataset = pd.concat([query_dataset, retrieve_from_search(query, sessionid, azs_service)])
    sessionid+=1
    
query_dataset

20 rows for query : thank you
3 rows for query : clarify


Unnamed: 0,@search.score,AzureSearch_DocumentKey,@search.features.BODY_Q.uniqueTokenMatches,@search.features.BODY_Q.similarityScore,@search.features.BODY_Q.termFrequency,@search.features.BODY_R.uniqueTokenMatches,@search.features.BODY_R.similarityScore,@search.features.BODY_R.termFrequency,query,sessionid,rating
0,2.936771,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,2.892284,5.0,1.0,0.044488,5.0,thank you,1,5
1,2.564289,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,1.0,0.752885,1.0,2.0,1.811404,6.0,thank you,1,5
2,2.246215,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,2.208126,2.0,1.0,0.03809,2.0,thank you,1,5
3,2.246215,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,2.208126,2.0,1.0,0.03809,2.0,thank you,1,5
4,1.962676,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,0.0,0.0,0.0,2.0,1.962676,4.0,thank you,1,5
5,1.804916,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,1.0,0.849617,2.0,2.0,0.955299,12.0,thank you,1,5
6,1.551757,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,0.0,0.0,0.0,2.0,1.551757,8.0,thank you,1,5
7,1.457604,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,1.416613,2.0,1.0,0.040991,10.0,thank you,1,5
8,1.355074,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,0.0,0.0,0.0,2.0,1.355074,7.0,thank you,1,5
9,1.105296,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,1.064417,2.0,1.0,0.040879,6.0,thank you,1,5


### Saving the dataset into blob storage 
We can save the dataframe query_dataset to a file and upload that file to our datastore. Other team members then only need the file's path in the datastore to read it.

In [35]:
import os
from azureml.core import Workspace, Dataset
os.makedirs('data', exist_ok=True)
local_path = 'data/query_dataset.csv'
query_dataset.to_csv(local_path)

# get the datastore to upload prepared data
datastore = workspace.get_default_datastore()

# upload the local file from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='data', overwrite=True)

# reading the dataset referencing from datastore, 
# use .from_delimited_files as we are reading from csv
reading_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/query_dataset.csv'))])

query_dataset = reading_dataset.to_pandas_dataframe()

Uploading an estimated of 1 files
Uploading data/query_dataset.csv
Uploaded data/query_dataset.csv, 1 files out of an estimated total of 1
Uploaded 1 files


### Register the file as a dataset in the AzureML workspace
The code __Dataset.Tabular.from_delimited_files__ creates a Dataset object. The data must be a Dataset object, not a pandas dataframe, before it can be registered in the AzureML workspace. To convert a Dataset object back to a pandas dataframe, run __reading_dataset.to_pandas_dataframe()__

+ Set create_new_version = True to allow updates to an already registered dataset

In [8]:
from azureml.core import Dataset

## This code has been run above to read as a dataset object from file name in datastore
# reading_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/query_dataset.csv'))])
reading_dataset.register(workspace=workspace, name='query_dataset_tabular',create_new_version=True)

### Check the registered dataset by reading it back from the workspace and comparing it with query_dataset

In [38]:
from azureml.core import Workspace, Dataset

dataset = Dataset.get_by_name(workspace, name='query_dataset_tabular')
query_dataset = dataset.to_pandas_dataframe()

query_dataset

Unnamed: 0,@search.score,AzureSearch_DocumentKey,@search.features.BODY_Q.uniqueTokenMatches,@search.features.BODY_Q.similarityScore,@search.features.BODY_Q.termFrequency,@search.features.BODY_R.uniqueTokenMatches,@search.features.BODY_R.similarityScore,@search.features.BODY_R.termFrequency,query,sessionid,rating
0,2.936771,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,2.892284,5.0,1.0,0.044488,5.0,thank you,1,5
1,2.564289,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,1.0,0.752885,1.0,2.0,1.811404,6.0,thank you,1,5
2,2.246215,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,2.208126,2.0,1.0,0.03809,2.0,thank you,1,5
3,2.246215,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,2.208126,2.0,1.0,0.03809,2.0,thank you,1,5
4,1.962676,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,0.0,0.0,0.0,2.0,1.962676,4.0,thank you,1,5
5,1.804916,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,1.0,0.849617,2.0,2.0,0.955299,12.0,thank you,1,5
6,1.551757,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,0.0,0.0,0.0,2.0,1.551757,8.0,thank you,1,5
7,1.457604,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,1.416613,2.0,1.0,0.040991,10.0,thank you,1,5
8,1.355074,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,0.0,0.0,0.0,2.0,1.355074,7.0,thank you,1,5
9,1.105296,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2.0,1.064417,2.0,1.0,0.040879,6.0,thank you,1,5


The next step is to create a simple XGBRanker model from the xgboost package and run predictions on query_dataset. The scores indicate how the results within each group rank relative to each other; a higher score means a higher predicted rank.

In [26]:
X, y = query_dataset, query_dataset.rating

#### Query labels for each document in our dataset.
query_ids = X['sessionid'].to_numpy()

params = {'objective': 'rank:ndcg', 'learning_rate': 0.5
          ,'min_child_weight': 0.1
#           , 'reg_alpha': 0.5
          ,'max_depth': 10, 'n_estimators': 200
         }

ranker = xgb.XGBRanker(**params)

# Keep only the numeric feature columns
features = X.drop(columns=['query','rating', 'sessionid', 'AzureSearch_DocumentKey', 'keyphrases'])

# group holds the number of rows per query, in row order
ranker.fit(features, y, group=np.unique(query_ids, return_counts=True)[1],
           eval_metric='ndcg',
           verbose=False)

xgb_scores = ranker.predict(features)
xgb_scores

array([-0.44844714, -0.44844714, -0.44844714, -0.44844714, -0.44844714,
       -0.44844714, -0.44844714, -0.44844714, -0.44844714, -0.44844714,
       -0.44844714, -0.44844714, -0.44844714, -0.44844714, -0.44844714,
       -0.44844714, -0.44844714, -0.44844714, -0.44844714, -0.44844714,
        1.4972795 , -0.44844714, -0.44844714], dtype=float32)
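The group sizes passed to `ranker.fit` come from `np.unique(query_ids, return_counts=True)[1]`, which returns one count per unique id in sorted order. That matches the row layout only because the rows here are already ordered by increasing sessionid. As an illustration of what the ranker expects, the sketch below counts consecutive runs of the same id using only the standard library; `group_sizes` is a hypothetical helper name.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Count consecutive rows that share a query id, in row order."""
    return [len(list(rows)) for _, rows in groupby(query_ids)]


# Two sessions, matching the 20 and 3 rows retrieved above.
query_ids = [1] * 20 + [2] * 3
print(group_sizes(query_ids))  # [20, 3]
```

The sizes must add up to the number of training rows, and each group must occupy a contiguous block of rows.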

In [27]:
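output_score = pd.DataFrame(xgb_scores)
output_score.columns=['score']
# Sort from the lowest to the highest score; reset_index keeps the original row number in an 'index' column
output_score_sorted = output_score.sort_values('score', ascending = True).reset_index()
num_rows = output_score_sorted.shape[0]
rate_3 = int(num_rows*0.7)
rate_1 = int(num_rows*0.2)
# Default rating is 5; the lowest 20% of scores become 1 and the band up to the 70% mark becomes 3
output_score_sorted['pred_rating'] = 5
output_score_sorted.loc[0: rate_1, 'pred_rating'] = 1
# .loc slices include both end points, so row rate_1 is overwritten with 3 here
output_score_sorted.loc[rate_1: rate_3, 'pred_rating'] = 3
# Join the predicted ratings back to the original rows through the row number
query_dataset['index'] = query_dataset.index
check_acc = query_dataset.merge(output_score_sorted, on = ['index'])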
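To make the thresholding easier to follow, here is a pandas-free sketch of the same rule: rank the scores from lowest to highest, give the lowest 20% a rating of 1, the band up to and including the 70% position a rating of 3, and the rest a rating of 5. The helper name `assign_ratings` is hypothetical, and ties may be ordered differently than in the pandas sort.

```python
from collections import Counter


def assign_ratings(scores):
    """Map scores to ratings 1, 3, 5 by rank position, lowest scores first."""
    n = len(scores)
    rate_1 = int(n * 0.2)
    rate_3 = int(n * 0.7)
    # Row numbers ordered from the lowest score to the highest.
    order = sorted(range(n), key=lambda i: scores[i])
    ratings = [5] * n
    for position, row in enumerate(order):
        if position < rate_1:
            ratings[row] = 1
        elif position <= rate_3:  # inclusive upper bound, like .loc slicing
            ratings[row] = 3
    return ratings


# 282 rows is the size of the training set in the logged run later in this tutorial.
counts = Counter(assign_ratings(list(range(282))))
print(counts[1], counts[3], counts[5])  # 56 142 84
```

The counts agree with the 'Predicted value counts' metric logged by the training run.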

A self-defined score metric is used to calculate the accuracy of the model:
+ If pred_rating and rating are the same, a score of 1 is awarded
+ If pred_rating and rating differ:
  + pred_rating is 5 and rating is 1, or vice versa: a score of 0 is awarded
  + pred_rating is 5 and rating is 3, or vice versa: a score of 0.5 is awarded
  + pred_rating is 3 and rating is 1: a score of 0 is awarded
  + pred_rating is 1 and rating is 3: a score of 0.5 is awarded

In [23]:
def score_metric(x):
    
    if (x.pred_rating == 5 and x.rating ==1) or (x.pred_rating == 1 and x.rating ==5):
        return 0
    
    if (x.pred_rating == 5 and x.rating == 3) or (x.pred_rating == 3 and x.rating == 5):
        return 0.5
    
    if x.pred_rating == 1 and x.rating == 3:
        return 0.5
    
    if x.pred_rating == 3 and x.rating == 1:
        return 0
    
    if (x.pred_rating == x.rating):
        return 1

print('Exact Accuracy is', round((check_acc['pred_rating']==check_acc['rating']).sum()/check_acc.shape[0], 2))
print('Accuracy is', round(sum(check_acc.apply(lambda x: score_metric(x), axis = 1))/check_acc.shape[0], 2))

Exact Accuracy is 0.51
Accuracy is 0.69
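The partial-credit rules can also be written as a lookup table, which makes the asymmetric cases (3 predicted for 1 versus 1 predicted for 3) easy to see at a glance. This is an alternative sketch, not the code used in the training script; `PARTIAL_CREDIT`, `score_pair` and `mean_score` are hypothetical names.

```python
# Partial credit keyed by (pred_rating, rating); exact matches score 1.
PARTIAL_CREDIT = {
    (5, 3): 0.5, (3, 5): 0.5,
    (1, 3): 0.5,
    (5, 1): 0.0, (1, 5): 0.0,
    (3, 1): 0.0,
}


def score_pair(pred_rating, rating):
    """Score one prediction against its true rating."""
    if pred_rating == rating:
        return 1.0
    return PARTIAL_CREDIT[(pred_rating, rating)]


def mean_score(pairs):
    """Average score over (pred_rating, rating) pairs."""
    return sum(score_pair(p, r) for p, r in pairs) / len(pairs)


pairs = [(5, 5), (5, 3), (1, 5), (1, 3), (3, 1)]
print(mean_score(pairs))  # 0.4
```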


## Model Training

For this task, submit a job that trains the model on your local machine. To submit a job, you:

* Create training scripts
* Create training environment
* Submit a run

The training results will be stored in your Azure ML workspace.

## Step 1. Create training scripts

Create the training scripts in the current working directory. Notice how the script saves the model:
    
+ The training script saves your model into a directory named outputs. <br>
`joblib.dump(value=xgb_ranker, filename='outputs/predict-emailservice-xgboostmodel.pkl')`<br>
Anything written to this directory is automatically uploaded to your workspace. You'll access your model from this directory later in the tutorial. <br>
 <br>
+ The first script (train_email_xgboost.py) reads the credentials, connects to the search service, and saves the trained model. The second script (predict_emailservice_xgboost.py) contains the same data retrieval and model building code that we used above to build and train our model. <br>
 <br>
+ run.log(name, value) records a named value on the run when we submit it, much like an entry in a log file. The first argument is the name and the second is the value.

[Run class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py)

[Joblib](https://joblib.readthedocs.io/en/latest/)

In [37]:
%%writefile train_email_xgboost.py

# import argparse
import os
from azureml.core import Run
from predict_emailservice_xgboost import generate_model
from joblib import dump
from azure_search_client import azure_search_client as azs_client 
import random
import numpy as np
from random import sample
import pandas as pd

# get hold of the current run
run = Run.get_context()

# Connecting to the api service
azs_service = azs_client.from_json('api_config.json')
run.log('Connecting to api service', azs_service)

# Generate model
xgb_ranker = generate_model(azs_service, run)

# note file saved in the outputs folder is automatically uploaded into experiment record
os.makedirs('outputs', exist_ok=True)
dump(value=xgb_ranker, filename='outputs/predict-emailservice-xgboostmodel.pkl')
run.log('End of run','Training completed')

Overwriting train_email_xgboost.py


In [62]:
%%writefile predict_emailservice_xgboost.py

from azureml.core import Run
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import pandas as pd
import json
import concurrent
import datetime
from itertools import chain
import random
import numpy as np
from random import sample
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GroupShuffleSplit
import xgboost as xgb
from sklearn.model_selection import GroupKFold
from tqdm import tqdm

def get_search_results(service, query):
    search_request_body = {
        "search": query,
        "featuresMode": "enabled",
        "scoringStatistics": "global",
        "count": "true"
    }
    
    return service.search(search_request_body)

def retrieve_from_search(query, sessionid, azs_service):
    
    ## Call the api service to retrieve json format data
    json_search_results = get_search_results(azs_service, query)
    
    ## Flatten the json format data into pandas dataframe
    search_results = json_normalize(json_search_results).fillna(0)
    search_results = search_results.sort_values(['@search.score'], ascending=False)
    search_results['query'] = query.lower()
    search_results['sessionid'] = sessionid
    print('{} rows for query : {}'.format(search_results.shape[0], query))
    
    # Produce random placeholder ratings; remove this step in production, once real ratings are available
    rows = search_results.shape[0]
    first = random.randint(1, rows)
    second = random.randint(1, rows-first) if first < rows else 0
    third = rows-first-second
    sequence = [5]* first + [3]* second+[1]* third
    random.shuffle(sequence)
    search_results['rating'] = sequence
    
    return search_results


def score_metric(x):
    
    if (x.pred_rating == 5 and x.rating ==1) or (x.pred_rating == 1 and x.rating ==5):
        return 0
    
    if (x.pred_rating == 5 and x.rating == 3) or (x.pred_rating == 3 and x.rating == 5):
        return 0.5
    
    if x.pred_rating == 1 and x.rating == 3:
        return 0.5
    
    if x.pred_rating == 3 and x.rating == 1:
        return 0
    
    if (x.pred_rating == x.rating):
        return 1
    
def generate_model(azs_service, run):
    
    # Create the necessary queries to create dataset
    query_input = ['how to do child relief', 'income tax ', 'transfer account', 'claim',
                   'whats child relief', 'do claims', 'PTR claim', 'how to do Course Fees Relief', 
                   'reduce tax', 'tax filing', 'reduce personal relief', 'duplicate relief', 
                   'state unavailability', 'university relief', 'auto-inclusion scheme (AIS)', 'e-Service availablility', 
                   'child relief', 'tax claim', 'relief', 'income']

    query_dataset = pd.DataFrame()
    sessionid =1
    for query in query_input:
        query_dataset = pd.concat([query_dataset, retrieve_from_search(query, sessionid, azs_service)])
        sessionid+=1

    query_dataset = query_dataset.fillna(0).reset_index(drop =True)

    X, y = query_dataset, query_dataset.rating

    # Query labels for each document in our dataset.
    query_ids = X['sessionid'].to_numpy()


    params = {'objective': 'rank:ndcg', 'learning_rate': 0.5
          ,'min_child_weight': 0.1
#           , 'reg_alpha': 0.5
          ,'max_depth': 10, 'n_estimators': 200
         }

    xgb_ranker = xgb.XGBRanker(**params)
    run.log('Setting up xgb params', params)

    # Keep only the numeric feature columns
    features = X.drop(columns=['query','rating', 'sessionid', 'AzureSearch_DocumentKey', 'keyphrases'])

    # group holds the number of rows per query, in row order
    xgb_ranker.fit(features, y, group=np.unique(query_ids, return_counts=True)[1],
                   eval_metric='ndcg',
                   verbose=False)

    xgb_scores = xgb_ranker.predict(features)

    # Create the output dataset and also the accuracy
    output_score = pd.DataFrame(xgb_scores)
    output_score.columns=['score']
    output_score_sorted = output_score.sort_values('score', ascending = True).reset_index()
    num_rows = output_score_sorted.shape[0]
    rate_3 = int(num_rows*0.7)
    rate_1 = int(num_rows*0.2)
    output_score_sorted['pred_rating'] = 5
    output_score_sorted.loc[0: rate_1, 'pred_rating'] = 1
    output_score_sorted.loc[rate_1: rate_3, 'pred_rating'] = 3
    
    run.log('Predicted value counts', output_score_sorted.pred_rating.value_counts())
        
    query_dataset['index'] = query_dataset.index
    check_acc = query_dataset.merge(output_score_sorted, on = ['index'])
    
    run.log('Exact Accuracy is', round((check_acc['pred_rating']==check_acc['rating']).sum()/check_acc.shape[0], 2))
    run.log('Accuracy is', round(sum(check_acc.apply(lambda x: score_metric(x), axis = 1))/check_acc.shape[0], 2))
    
    return xgb_ranker

Overwriting predict_emailservice_xgboost.py


## Step 2. Create training environment in local machine

This step creates a local training environment, for example one that uses the Anaconda installation on your local machine. You can also run this "locally" in Microsoft Azure Notebooks.

Details are provided at https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#local
    
[What are Azure Machine Learning environments?](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments)

### Define a user-managed environment
By default, Azure Machine Learning service will build a Conda environment with dependencies you specified, and will execute the run in that environment instead of using any Python libraries that you installed on the base image.
A later example will demonstrate the use of Environment when training the model on Azure. 

In some situations, your custom base image may already contain a Python environment with packages that you want to use.

When using a user-managed environment for local training, you are responsible for ensuring that all the necessary packages are available in the Python environment you choose to run the script in.

+ Create and attach: There's no need to create or attach a compute target to use your local computer as the training environment.
+ Configure: When you use your local computer as a compute target, the training code is run in your development environment. If that environment already has the Python packages you need, use the user-managed environment.

To use your own installed packages, set the parameter Environment.python.user_managed_dependencies = True. Ensure that the base image contains a Python interpreter, and has the packages your training script needs.

In [51]:
from azureml.core import Environment

# Create a user-managed environment.
user_managed_env = Environment("user-managed-env-xgboost")

user_managed_env.python.user_managed_dependencies = True

# You can choose a specific Python environment by pointing to a Python path 
#user_managed_env.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'
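If you want to set `interpreter_path` explicitly, one way to find a suitable value is to ask the running kernel for its own interpreter. This is a small sketch; whether this path is the right one for your runs depends on which environment holds the packages your training script needs.

```python
import sys

# Absolute path of the Python interpreter running this notebook kernel.
interpreter = sys.executable
print(interpreter)
```

The printed path can then be assigned to `user_managed_env.python.interpreter_path` in place of the commented example above.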

### Create ScriptRunConfig

However you manage your environment, you need to use the ScriptRunConfig class. ScriptRunConfig identifies the training script to run in the experiment and the environment in which to run it. 

ScriptRunConfig includes
+ source_directory: The source directory that contains your training script
+ script: Identify the training script
+ run_config: The run configuration, which in turn defines where the training will occur

Note: In this tutorial, no dataset is passed to the training script through ScriptRunConfig; the script retrieves its own data from the search service.


In [28]:
import os

directory = os.getcwd()

In [53]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory= directory, script='train_email_xgboost.py')
src.run_config.environment = user_managed_env

## Step 3. Submit a run

After you create a run configuration, you use it to run your experiment. An experiment is a logical container in an Azure ML Workspace. It contains a series of trials called Runs. As such, it hosts run records such as run metrics, logs, and other output artifacts from your experiments.

The code pattern to submit a training run is the same for all types of compute targets:
+ Create an experiment to run.
+ Submit the run.
+ Wait for the run to complete.



### Submit the run

In [54]:
run = exp.submit(src)
run

Experiment,Id,Type,Status,Details Page,Docs Page
predict-emailservice-xgboost,predict-emailservice-xgboost_1618551593_b1da1b58,azureml.scriptrun,Running,Link to Azure Machine Learning studio,Link to Documentation


### Wait for the run to complete

After you submit the run, you can immediately execute this code to watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10 to 15 seconds until the job finishes:

In [55]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Get log results upon completion

Model training and monitoring happen in the background. Wait until the model has finished training before you run more code. Use wait_for_completion to show when the model training is finished:

In [56]:
run.wait_for_completion(show_output=False)  # specify True for a verbose log

{'runId': 'predict-emailservice-xgboost_1618551593_b1da1b58',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2021-04-16T05:39:58.687472Z',
 'endTimeUtc': '2021-04-16T05:40:14.123717Z',
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': 'fc3ece24-ba9a-4631-907f-c2d9b8c962ea'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train_email_xgboost.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'local',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'priority': None,
  'credentialPassthrough': False,
  'identity': None,
  'environment': {'name': 'user-managed-env-xgboost',
   'version': 'Autosave_2021-03-26T03:58:54Z_11eca9d9',
   'python': {'interpreterPath': 'python',
    'userManagedDependencies': True,
    'condaDepe

Note: All these calculations were run on your local machine, in the Python environment you configured above. You can find the results in:

+ ~/.azureml/envs/azureml_xxxx for a conda environment created by Azure ML, if any (a user-managed environment reuses your existing Python environment)
+ ~/AppData/Local/Temp/azureml_runs/predict-emailservice-xgboost_xxxx for the machine learning models you trained (this path may differ depending on the platform you use). This folder also contains
    - Logs (under azureml_logs/)
    - Output pickled files (under outputs/)
    - The configuration files (credentials, local and docker image setups)
    - The train_email_xgboost.py and predict_emailservice_xgboost.py scripts
    - The current notebook

Take a few minutes to examine the output of the cell above. It shows the content of some of the log files, and extra information on the conda environment used.


### Display run results

Display the information captured by run.log(). Results appear only after the run has completed.

In [57]:
print(run.get_metrics())

{'Connecting to api service': '<azure_search_client.azure_search_client object at 0x7fdbc03af2b0>', 'Setting up xgb params': "{'objective': 'rank:ndcg', 'learning_rate': 0.5, 'min_child_weight': 0.1, 'max_depth': 10, 'n_estimators': 200}", 'Predicted value counts': '3    142\n5     84\n1     56\nName: pred_rating, dtype: int64', 'Exact Accuracy is': 0.5, 'Accuracy is': 0.67, 'End of run': 'Training completed'}
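Note that the 'Setting up xgb params' value in the output above is a string, because the training script logged a dict and it was stored as text. If you need the parameters as a dict again, for example to rebuild a ranker with the same settings, one option is `ast.literal_eval` from the standard library, sketched here on the logged value copied from the output:

```python
import ast

# The value exactly as returned by run.get_metrics() above: a string, not a dict.
logged = "{'objective': 'rank:ndcg', 'learning_rate': 0.5, 'min_child_weight': 0.1, 'max_depth': 10, 'n_estimators': 200}"

params = ast.literal_eval(logged)  # parses Python literals only, never runs code
print(params["max_depth"], params["learning_rate"])  # 10 0.5
```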


In [58]:
run.get_metrics('Setting up xgb params').get('Setting up xgb params')

"{'objective': 'rank:ndcg', 'learning_rate': 0.5, 'min_child_weight': 0.1, 'max_depth': 10, 'n_estimators': 200}"

## Register Model

The last step in the training script wrote the file outputs/predict-emailservice-xgboostmodel.pkl into a directory named outputs on the machine where the job ran (here, your local machine). "outputs" is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace, so the model file is now also available in your workspace.

You can see files associated with that run:

In [59]:
print(run.get_file_names())

['azureml-logs/60_control_log.txt', 'azureml-logs/70_driver_log.txt', 'logs/azureml/4279_azureml.log', 'outputs/predict-emailservice-xgboostmodel.pkl']
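When a run produces many files, it can be convenient to pick out the model file programmatically before registering it. A minimal sketch, using the file list printed above as literal input; in the notebook the list would come from `run.get_file_names()`.

```python
file_names = [
    "azureml-logs/60_control_log.txt",
    "azureml-logs/70_driver_log.txt",
    "logs/azureml/4279_azureml.log",
    "outputs/predict-emailservice-xgboostmodel.pkl",
]

# Keep only pickled files written to the special outputs/ directory.
model_files = [name for name in file_names
               if name.startswith("outputs/") and name.endswith(".pkl")]
print(model_files)  # ['outputs/predict-emailservice-xgboostmodel.pkl']
```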


Register the model in the workspace, so that you or other collaborators can later query, examine, and deploy this model. You can store the metrics you captured as "tags" on the Model object. 

In [60]:
# Add metrics to tags so that this information can be used for model comparison.
# metrics = ['Accuracy','Precision','Recall','F1-score']
metrics = ['Setting up xgb params', 'Predicted value counts', 'Exact Accuracy is', 'Accuracy is']

tags = {}
for key in metrics:
    tags[key] = run.get_metrics(key).get(key)

# register model, note the metric values are stored in "tags".
model = run.register_model(model_name='predict-emailservice-xgboostmodel',
                           model_path='outputs/predict-emailservice-xgboostmodel.pkl',
                           tags=tags
                          )
print(model.name, model.id, model.version, model.tags, sep='\t')

predict-emailservice-xgboostmodel	predict-emailservice-xgboostmodel:2	2	{'Setting up xgb params': "{'objective': 'rank:ndcg', 'learning_rate': 0.5, 'min_child_weight': 0.1, 'max_depth': 10, 'n_estimators': 200}", 'Predicted value counts': '3    142\n5     84\n1     56\nName: pred_rating, dtype: int64', 'Exact Accuracy is': '0.5', 'Accuracy is': '0.67'}


Once you have registered the model, you can proceed to Tutorial #2 to deploy the model.