# Track model training in notebooks with MLflow

You can use MLflow in a notebook to track any models you train. As you'll run this notebook with an Azure Machine Learning compute instance, you don't need to set up MLflow: it's already installed and integrated. 

You'll prepare some data and train a model to predict diabetes. You'll use both autologging and custom logging to explore how MLflow works in notebooks.

## Before you start

You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cell below to verify that it is installed.

> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [1]:
pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.22.4
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: c:\users\alienware\miniconda3\envs\py310\lib\site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, opencensus-ext-logging, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.




## Connect to your workspace

With the required SDK packages installed, you're ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. On a compute instance managed by Azure Machine Learning, you can use the default values to connect. The cell below instead reads service principal credentials from a local `config.ini` file, exports them as environment variables so that `DefaultAzureCredential` can pick them up, and then creates the client.

In [2]:
import configparser
import os
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient


# Load environment variables from config.ini file
config = configparser.ConfigParser()

# Use an absolute path to avoid path issues
config_file_path = "G:/My Drive/Ingegneria/Data Science GD/My-Practice/my models/Azure ML/config.ini"
config.read(config_file_path)

# Export the IDs and secrets needed to sign in as environment variables
os.environ['AZURE_CLIENT_ID'] = config['azure']['client_id']
os.environ['AZURE_CLIENT_SECRET'] = config['azure']['client_secret']
os.environ['AZURE_TENANT_ID'] = config['azure']['tenant_id']
os.environ['AZURE_SUBSCRIPTION_ID'] = config['azure']['subscription_id']
os.environ['AZURE_STORAGE_KEY'] = config['azure']['storage_key']


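# Build the credential and request a management token up front,
# so that a misconfigured login fails here rather than later
credential = DefaultAzureCredential()
credential.get_token("https://management.azure.com/.default")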

# Initialize MLClient with the obtained credential
ml_client = MLClient(
    credential=credential,
    subscription_id=os.environ['AZURE_SUBSCRIPTION_ID'],
    resource_group_name="rg-dp100-labs",
    workspace_name="mlw-dp100-labs"
)
# ml_client
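
The cell above expects a `config.ini` file with an `[azure]` section. As a minimal sketch of the layout it assumes, the following parses an in-memory sample and checks that every key the cell reads is present. All values are placeholders, and the `REQUIRED_KEYS` list and the check itself are illustrative additions, not part of the original notebook.

```python
import configparser

# Hypothetical file contents; every value is a placeholder, not a real credential.
SAMPLE_CONFIG = """
[azure]
client_id = <service-principal-app-id>
client_secret = <service-principal-secret>
tenant_id = <tenant-id>
subscription_id = <subscription-id>
storage_key = <storage-account-key>
"""

# Keys that the connection cell looks up in the [azure] section.
REQUIRED_KEYS = ["client_id", "client_secret", "tenant_id", "subscription_id", "storage_key"]

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

# Collect any keys that are absent so the error message names them all at once.
missing = [key for key in REQUIRED_KEYS if key not in config["azure"]]
if missing:
    raise KeyError(f"config.ini is missing keys in [azure]: {missing}")

print(sorted(config["azure"]))
```

Note that `config.read(path)` silently skips files it can't find and returns the list of files it did read, so checking that return value is a cheap way to catch a wrong path before the first key lookup fails.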

## Configure MLflow

If running this notebook on a compute instance in the Azure Machine Learning studio, you don't need to configure MLflow. 

Still, it's good to verify that the necessary library is indeed installed.

> **Note**:
> If the **mlflow** library is not installed, run `pip install mlflow` to install it.

In [3]:
# pip show mlflow

## Prepare the data

You'll train a diabetes classification model. The training data is stored in the **data** folder as **diabetes.csv**. 

First, let's read the data:

In [4]:
import pandas as pd

print("Reading data...")
df = pd.read_csv('./data/diabetes.csv')
df.head()

Reading data...


Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0


Next, you'll split the data into features and the label (`Diabetic`):

In [5]:
print("Splitting data...")
X, y = df[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, df['Diabetic'].values

Splitting data...


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

You now have four NumPy arrays:

- `X_train`: The training dataset containing the features.
- `X_test`: The test dataset containing the features.
- `y_train`: The label for the training dataset.
- `y_test`: The label for the test dataset.

You'll use these arrays to train and evaluate the models.
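
To make the 70/30 split concrete, here is a small sketch on synthetic data. The array names with the `_demo` suffix and the random values are assumptions for illustration only; they stand in for the diabetes features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data: 100 rows, 8 feature columns, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.integers(0, 2, size=100)

# Same call shape as in the notebook: hold out 30% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

print(type(X_tr).__name__)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With 100 rows and `test_size=0.30`, the split yields 70 training rows and 30 test rows. Because the notebook passes `.values` rather than the DataFrame itself, the outputs are plain NumPy arrays without column names.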

## Create an MLflow experiment

Now that you're ready to train machine learning models, you'll first create an MLflow experiment. By creating the experiment, you can group all runs within one experiment and make it easier to find the runs in the studio.

### Use MLflow on a local device

You can also use MLflow when you work in notebooks on a local device. In that case, configure MLflow to point to the Azure Machine Learning workspace by setting the workspace's tracking URI, as the following cell does.

In [7]:
import mlflow
from azureml.core import Workspace

ws = Workspace.get(name="mlw-dp100-labs",
                   subscription_id=os.environ['AZURE_SUBSCRIPTION_ID'],
                   resource_group="rg-dp100-labs")

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

experiment_name = "mlflow-experiment-diabetes"
mlflow.set_experiment(experiment_name)
mlflow

Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (urllib3 1.26.15 (c:\users\alienware\miniconda3\envs\py310\lib\site-packages), Requirement.parse('urllib3<3.0.0,>1.26.17')).


<module 'mlflow' from 'C:\\Users\\Alienware\\miniconda3\\envs\\py310\\lib\\site-packages\\mlflow\\__init__.py'>
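
The cell above relies on the older `azureml.core` package, which is also where the dependency warning in the output comes from. As a hedged alternative, the tracking URI can be read from the `MLClient` created earlier. The helper name `get_tracking_uri` is hypothetical, and the stand-in client below exists only so that the sketch runs without Azure access. This approach assumes the `azureml-mlflow` package is installed so that MLflow understands `azureml://` URIs.

```python
from types import SimpleNamespace


def get_tracking_uri(ml_client):
    """Return the MLflow tracking URI of the workspace that ml_client points to."""
    workspace = ml_client.workspaces.get(ml_client.workspace_name)
    return workspace.mlflow_tracking_uri


# Stand-in objects (assumption for this sketch) that mimic only the two
# attributes used above. In the notebook, pass the real `ml_client` instead.
fake_workspace = SimpleNamespace(mlflow_tracking_uri="azureml://<placeholder-tracking-uri>")
fake_client = SimpleNamespace(
    workspace_name="mlw-dp100-labs",
    workspaces=SimpleNamespace(get=lambda name: fake_workspace),
)

tracking_uri = get_tracking_uri(fake_client)
print(tracking_uri)
```

In the notebook, the result would then be passed to `mlflow.set_tracking_uri(...)` before calling `mlflow.set_experiment(...)`.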

## Train and track models

To track a model you train, you can use MLflow and enable autologging. The following cell trains a classification model using logistic regression. Notice that the cell doesn't calculate any metrics itself: with autologging enabled, MLflow computes and logs the parameters, training metrics, and the model for you.

In [8]:
from sklearn.linear_model import LogisticRegression

with mlflow.start_run():
    mlflow.sklearn.autolog()

    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)


🏃 View run mango_energy_yhxstz1p at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55/runs/e2a83878-9208-4ebf-9e43-6de5894fce06
🧪 View experiment at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55


You can also use custom logging with MLflow. You can add custom logging to autologging, or you can use only custom logging.

Let's train a few more models with scikit-learn. Since you ran the `mlflow.sklearn.autolog()` command before, MLflow will now automatically log any model trained with scikit-learn. To disable the autologging, run the following cell:

In [9]:
mlflow.sklearn.autolog(disable=True)

Now, you can train and track models using only custom logging. 

When you run the following cell, you'll only log one parameter and one metric.

In [10]:
from sklearn.linear_model import LogisticRegression
import numpy as np

with mlflow.start_run():
    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    mlflow.log_param("regularization_rate", 0.1)
    mlflow.log_metric("Accuracy", acc)

🏃 View run keen_endive_52hg0vbq at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55/runs/f846c1bc-d784-4e61-b28c-c1ae2596f010
🧪 View experiment at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55
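
The accuracy in the cell above comes from `np.average(y_hat == y_test)`. As a brief sketch of why that works, the comparison produces a boolean array, and averaging it gives the fraction of correct predictions. The two small arrays below are made-up values for illustration.

```python
import numpy as np

# Made-up labels and predictions: four of the five predictions are correct.
y_true_demo = np.array([0, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives True where the prediction matches the label.
matches = y_pred_demo == y_true_demo

# Averaging booleans treats True as 1 and False as 0.
accuracy = np.average(matches)

# The same value, computed as correct predictions divided by total predictions.
manual = int(matches.sum()) / len(matches)

print(accuracy, manual)
```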


One reason to track models is to compare the results of models trained with different hyperparameter values.

For example, you just trained a logistic regression model with a regularization rate of 0.1. Now, train another model, but this time with a regularization rate of 0.01. Since you're also tracking the accuracy, you can compare the two runs and decide which rate results in a better-performing model.

In [11]:
from sklearn.linear_model import LogisticRegression
import numpy as np

with mlflow.start_run():
    model = LogisticRegression(C=1/0.01, solver="liblinear").fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    mlflow.log_param("regularization_rate", 0.01)
    mlflow.log_metric("Accuracy", acc)

🏃 View run modest_line_l54q6cqc at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55/runs/3cd05a1e-0492-4c4b-b9ae-a268a9e90bfc
🧪 View experiment at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55
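
Both cells log `regularization_rate` but pass `C=1/rate` to `LogisticRegression`, because scikit-learn's `C` is the inverse of the regularization strength. The sketch below spells out that conversion. The helper name `inverse_strength` is hypothetical and not part of scikit-learn or the notebook.

```python
def inverse_strength(regularization_rate: float) -> float:
    """Convert a regularization rate into the C value LogisticRegression expects."""
    if regularization_rate <= 0:
        raise ValueError("regularization_rate must be positive")
    return 1.0 / regularization_rate


# The two rates used in the notebook and the C values they translate to.
c_values = {rate: inverse_strength(rate) for rate in (0.1, 0.01)}

for rate, c in c_values.items():
    print(f"regularization_rate={rate} -> C={c}")
```

A smaller rate gives a larger `C`, which means weaker regularization. The second run therefore lets the model fit the training data more closely than the first.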


Another reason to track results is to compare different estimators. All the models you've trained so far use logistic regression.

Run the following cell to train a model with the decision tree classifier estimator and review whether the accuracy is higher compared to the other runs.

In [12]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

with mlflow.start_run():
    model = DecisionTreeClassifier().fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    mlflow.log_param("estimator", "DecisionTreeClassifier")
    mlflow.log_metric("Accuracy", acc)

🏃 View run sleepy_tent_rk8bjrsh at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55/runs/aea2c2cc-6851-464b-9cb6-e2fb5cc8cff4
🧪 View experiment at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55


Finally, let's try to log an artifact. An artifact can be any file. For example, you can plot the ROC curve and store the plot as an image. The image can be logged as an artifact. 

Run the following cell to log a parameter, a metric, and an artifact.

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import numpy as np

with mlflow.start_run():
    model = DecisionTreeClassifier().fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    # plot ROC curve
    y_scores = model.predict_proba(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
    fig = plt.figure(figsize=(6, 4))
    # Plot the diagonal 50% line
    plt.plot([0, 1], [0, 1], 'k--')
    # Plot the FPR and TPR achieved by our model
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.savefig("ROC-Curve.png")

    mlflow.log_param("estimator", "DecisionTreeClassifier")
    mlflow.log_metric("Accuracy", acc)
    mlflow.log_artifact("ROC-Curve.png")

🏃 View run amiable_sand_mpx5j6n8 at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55/runs/0673808b-5710-40ee-bded-fb3268907373
🧪 View experiment at: https://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/a90ed0cd-b0b9-4e3a-bd85-67272a44de15/resourceGroups/rg-dp100-labs/providers/Microsoft.MachineLearningServices/workspaces/mlw-dp100-labs/#/experiments/9575aac6-a3d1-4b15-bf28-a90955705d55
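
A plot is easy to read but hard to compare across many runs. One option, sketched here as a suggestion rather than as part of the original exercise, is to also summarize the ROC curve as a single number, the area under the curve (AUC). The labels and scores below are made-up values, and the `_demo` names are hypothetical.

```python
from sklearn.metrics import auc, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Same call as in the notebook: false and true positive rates per threshold.
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)

# Area under the ROC curve, computed from those points with the trapezoidal rule.
roc_auc = auc(fpr_demo, tpr_demo)

print(roc_auc)

# Inside an active run, this value could be logged next to the image, for example:
# mlflow.log_metric("AUC", roc_auc)
```

Logging the value as a metric would make it sortable and chartable in the studio alongside `Accuracy`, while the image remains available as an artifact.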


Review the model's results on the **Jobs** page of the Azure Machine Learning studio.

- You'll find the parameters under **Params** in the **Overview** tab.
- You'll find the metrics under **Metrics** in the **Overview** tab, and in the **Metrics** tab.
- You'll find the artifacts in the **Outputs + logs** tab.

![Screenshot of outputs and logs tab on the Jobs page.](./images/output-logs.png)