<img src="./images/logo.png" alt="Drawing" style="width: 500px;"/>

# **Exercise 5:** Tracking, Registering and Inferencing Models in MLflow

You've trained a model. What's next? You could gather more produce data and continue training the model. You could serve the model through an endpoint so that other people and frontend applications can use it. Or perhaps you'd want to use it as a base model to train a whole new model on a whole new dataset.

Doing any of these requires a deeper dive into MLflow, one of the most popular and actively contributed-to open-source machine learning platforms, which comes natively installed with **HPE Ezmeral Unified Analytics**. In this exercise, we'll do just that.

In this exercise, you will learn how to perform the following on MLflow:

- Manage artifacts & metrics on MLflow
- Register the model
- Manage models, including moving them to and from Production staging
- Inference the model using an MLflow endpoint

By the end of this exercise, you will have a firm understanding of the art of Machine Learning Operations (MLOps) with MLflow.

Let's dive in!

### **Prerequisites**

As instructed in the [Introductory notebook](./00.introduction.ipynb), ensure that you have run `pip install -r requirements.txt` in a Terminal window, located in the same working directory, prior to running this notebook. 

<div class="alert alert-block alert-danger">
<b>Important:</b> This exercise requires the completion of Exercise 4: Building an Image Classification Model with TensorFlow and MLflow.</div>

## **1. Declaring Variables and Importing Libraries**

Let's re-declare the variables related to our MLflow experiment so that we can access them in this exercise.

In [1]:
# Experiment variables for MLflow
experiment_name = "retail-experiment"
model_name = "produce-detection"
artifact_path = "model"


Next, we'll import the necessary libraries. To learn more about these libraries, check out Section 1 of [Exercise 4](./04.model_training.ipynb).

Ignore any warnings that appear.

In [2]:
import mlflow
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
from IPython.display import display
from PIL import Image
import numpy as np
from tensorflow.keras.preprocessing.image import load_img,img_to_array
from io import BytesIO

2025-03-07 10:19:44.892476: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-07 10:19:44.896261: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-07 10:19:44.907740: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741342784.926713    2179 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741342784.932720    2179 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-07 10:19:44.953531: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU ins

In [3]:
%update_token

Token successfully refreshed.


## **2. Checking In**

In the last exercise, we trained a model and saved it as an MLflow artifact. Let's go inspect it in MLflow.

1. Navigate back to the Unified Analytics dashboard.
1. In the sidebar navigation menu, select `Data Science` > `Experiments`.
1. The **MLflow Experiments** page will open in a new tab.

Here, you will be able to see all of your **Experiments**. Click on any Experiment (such as our `retail-experiment`) to see all of the **runs** you have executed under it.

4. Click the `Columns` dropdown in the middle-right of the options row above the Run table.

This is where you can add more quick bits of information presented with each Run, making them easy to compare. 

5. Check `accuracy`, which is the accuracy of the model on our test set, and `val_loss`, the loss on the validation set during the training run.

<img src="./images/exercise5/mlflow1.png" alt="Drawing" style="width: 80%;"/>

6. Click on the Run Name (`produce-detection-...`) to explore it further.

In this pane, we can see the **parameters** we set up for this model, the **metrics**, all of the **artifacts** associated with the run (including the model file and data), and the **Full Path** to the model, which we can use to load the model from this specific run.

<img src="./images/exercise5/model.png" alt="Drawing" style="width: 80%;"/>

## **3. Registering a Model in the Model Registry**

Back here in the notebook, we're going to learn how to add the model to the MLflow **Model Registry**. A Model Registry is a specialized library to store, track, and manage all the different versions of your models.

Storing your models in a Model Registry has many major advantages, including:

**Model Storage**: Just like a library stores books, a model registry stores trained machine learning models. These models are like the recipes you create to solve specific problems.

**Version Control**:  As you experiment and improve your models, you create new versions. A model registry keeps track of all these different versions, allowing you to compare them, see which one performs best, and roll back if future experiments give an undesirable output.

**Documentation**:  In addition to the models themselves, a registry can store important information about each model, like the data it was trained on, its performance metrics, and who created it. This documentation helps everyone understand what the model does and how it was built.

**Collaboration**:  A model registry acts as a central hub for data scientists and engineers working on the same project. They can all access and use the models stored there, making collaboration smoother.

**Deployment**:  Once you've chosen the best model version, the registry can help you deploy it into production, meaning you can use it to make real-world predictions.

Overall, a model registry helps organizations manage the lifecycle of their machine learning models, from creation to deployment. It ensures everyone's on the same page, models are well-documented, and the best versions are easily accessible.

First, let's bring up the runs associated with our `retail-experiment` experiment and get the ID of the most recent run.

In [4]:
# Search for runs in the specified experiment, ordering by start time in descending order
runs = mlflow.search_runs(experiment_ids=[mlflow.get_experiment_by_name(experiment_name).experiment_id],
                          order_by=["start_time desc"],
                          filter_string="")

# Check if there are any runs
if not runs.empty:
    # Get the run ID of the last active run
    last_run_id = runs.iloc[0]["run_id"]
    print("Last active run ID for experiment '{}': {}".format(experiment_name, last_run_id))
else:
    print("No runs found for experiment '{}'.".format(experiment_name))

Last active run ID for experiment 'retail-experiment': dcb9bf579a2449e88cb47e697b81af39


We'll then create a URI for our model, which specifies where it is stored, using the run ID and the artifact path (declared above as just `model`).

Using this URI, we can **register the model** in the **MLflow Model Registry** under a given model name (declared above as `produce-detection`).

In [5]:
# set parameters model_uri and model details and register model in mlflow
model_uri = "runs:/{run_id}/{artifact_path}".format(run_id=last_run_id, artifact_path=artifact_path)
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

Successfully registered model 'produce-detection'.
2025/03/07 10:20:18 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: produce-detection, version 1
Created version '1' of model 'produce-detection'.
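
The `runs:/` URI built above follows a fixed pattern, and MLflow also accepts a `models:/` scheme for addressing a model that is already in the registry, either by version number or by stage name. The sketch below only assembles these strings, so it runs without an MLflow server. The helper names `runs_uri` and `registry_uri` are made up for illustration, and the run ID is the one printed earlier in this notebook.

```python
def runs_uri(run_id: str, artifact_path: str) -> str:
    """URI for an artifact that was logged under a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_uri(name: str, version_or_stage) -> str:
    """URI for a registered model, addressed by version number or stage name."""
    return f"models:/{name}/{version_or_stage}"


# Hypothetical usage with the values from this exercise
print(runs_uri("dcb9bf579a2449e88cb47e697b81af39", "model"))
print(registry_uri("produce-detection", 1))
print(registry_uri("produce-detection", "Production"))
```

A `runs:/` URI pins the model to one training run, while a `models:/` URI follows whatever the registry currently holds under that version or stage.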


Then, we're going to make a new **Version** of our model. You will usually do this after training the same model again, perhaps with different parameters.

We can differentiate between versions by giving them a different **Description** and/or specifying a **Version** number.

In [6]:
# Create a second Version of the model
model_uri = "runs:/{run_id}/{artifact_path}".format(run_id=last_run_id, artifact_path=artifact_path)
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

# Update the version we just registered with a description.
# Using model_details.version avoids hardcoding the version number,
# so the cell still works if the notebook is run more than once.
client = MlflowClient()

client.update_model_version(
    name=model_details.name,
    version=model_details.version,
    description="Fruit & Vegetables Cashierless Store",
)

# Get the details of the model version
model_version_details = client.get_model_version(
    name=model_details.name,
    version=model_details.version,
)

# Convert the status to a readable string and print it
status = ModelVersionStatus.from_string(model_version_details.status)
print("Model status: %s" % ModelVersionStatus.to_string(status))

Registered model 'produce-detection' already exists. Creating a new version of this model...
2025/03/07 10:20:19 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: produce-detection, version 2


Model status: READY


Created version '2' of model 'produce-detection'.
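
The log lines above show MLflow waiting for the new version to finish creation before it reports `READY`. If you ever need to wait for that status yourself, a polling loop is one option. This is a minimal sketch, not MLflow's own implementation: `wait_until_ready` is a hypothetical helper that takes any zero-argument function returning the status string. In this notebook that function could be `lambda: client.get_model_version(name, version).status`; here a fake one is used so the example runs on its own.

```python
import time


def wait_until_ready(get_status, timeout_s=300.0, poll_s=1.0):
    """Poll get_status() until it returns "READY"; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "READY":
            return status
        if status == "FAILED_REGISTRATION":
            raise RuntimeError("model version registration failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_s} seconds")
        time.sleep(poll_s)


# Stand-in for the registry: reports PENDING_REGISTRATION twice, then READY.
fake_statuses = iter(["PENDING_REGISTRATION", "PENDING_REGISTRATION", "READY"])
print(wait_until_ready(lambda: next(fake_statuses), poll_s=0.01))
```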


### Model Staging

Model Staging is an important step in the machine learning workflow, as it allows for complete model governance and visibility between multiple parties (e.g. what models are currently in deployment, being worked on, older archived versions, etc). Staging also ensures that only specific versions of any given model are being inferenced.

There are four Stages under which you can classify a model (and/or a specific version of a model) in MLflow:
- `None`: No action has yet been taken on this model version.
- `Staging`: This model/version is currently under active development and being prepared for Production.
- `Production`: This model/version is live and being used/inferenced by external applications.
- `Archived`: This model/version has been retired and is kept for reference or rollback.

If you check the model's page in the MLflow UI, all versions of our model currently have a *Stage* status of `None`.

Given our previous test result, I'd say we're good to send the **latest** version of our model to the `Production` stage!

In [7]:
# create an instance of the MlflowClient
client = MlflowClient()

# Get the latest model created for our experiment
latest_versions = client.get_latest_versions(name=model_name, stages=["None"])
latest_version = latest_versions[-1]

# Transition the desired model version to production stage
client.transition_model_version_stage(
    name=model_name,
    version=latest_version.version,
    stage='Production',
)

print(f"Model: {latest_version.name}, version: {latest_version.version} has been moved to Production")

Model: produce-detection, version: 2 has been moved to Production



As for our older versions, we will move them to the `Archived` stage.

In [8]:
# Find every registered version of our model
model_versions = client.search_model_versions(f"name='{model_name}'")

# Archive every version except the one we just promoted to Production
for mv in model_versions:
    if mv.version != latest_version.version:
        client.transition_model_version_stage(
            name=mv.name,
            version=mv.version,
            stage="Archived"
        )
        print(f"Model: {mv.name}, version: {mv.version} has been moved to Archived")

        # Update Model Version Description
        client.update_model_version(
            name=mv.name,
            version=mv.version,
            description="Model Moved to Archived"
        )

Model: produce-detection, version: 1 has been moved to Archived
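
The two cells above amount to a simple rule: promote one version to `Production` and move every other version to `Archived`. A minimal sketch of that rule as a pure function, which can be handy for checking a plan before calling `transition_model_version_stage`. `plan_stages` is a hypothetical helper, not part of MLflow, and the versions are written as strings to match how they appear in the log output above.

```python
def plan_stages(versions, promote):
    """Return {version: target_stage}: `promote` goes to Production, the rest to Archived."""
    if promote not in versions:
        raise ValueError(f"version {promote!r} is not registered")
    return {v: ("Production" if v == promote else "Archived") for v in versions}


# Hypothetical usage with the two versions registered in this exercise
plan = plan_stages(versions=["1", "2"], promote="2")
for version, stage in plan.items():
    print(f"version {version} -> {stage}")
```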



Now, we can see all versions of our model have an updated *Stage* status. 

<img src="./images/exercise5/registry2.png" alt="Drawing" style="width: 80%;"/>

## **4. Model Testing**

Similar to the previous exercise, we will now test our model - but instead of calling the model from a saved variable in the notebook's memory, this time we will call it from the MLflow Model Registry.

In [9]:
%update_token

Token successfully refreshed.


In [10]:
# Get the source URI or location of the model version
logged_model = latest_version.source

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

2025-03-07 10:22:43.799619: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
  saveable.load_own_variables(weights_store.get(inner_path))


We'll redeclare our labels from the previous notebook so we can better interpret the answer our model gives us.

In [11]:
labels = {'apple': 0, 'banana': 1, 'carrot': 2, 'cucumber': 3, 'lemon': 4, 'orange': 5}
labels = dict((v, k) for k, v in labels.items())
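
To make the role of the inverted dictionary concrete, here is a small worked example of turning one row of model scores into a label. It is a sketch in plain Python: the scores are made up, and `class_indices`, `index_to_label` and `decode` are illustrative names chosen so they do not overwrite the notebook's own `labels` variable.

```python
class_indices = {'apple': 0, 'banana': 1, 'carrot': 2, 'cucumber': 3, 'lemon': 4, 'orange': 5}
index_to_label = {index: name for name, index in class_indices.items()}


def decode(scores):
    """Return the label whose score is highest in a single prediction row."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return index_to_label[best_index]


# Made-up scores for one image; the last entry (index 5, orange) is the largest.
example_scores = [0.02, 0.01, 0.03, 0.01, 0.08, 0.85]
print(decode(example_scores))  # prints "orange"
```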

Now, as in the last exercise, search Google Images for a **different** image of an **orange**, an **apple** or a **lemon**. Copy the image link into the `online_url` variable.

If you are in an offline/proxy environment, upload your file to the `images` directory next to this notebook and name it **test_image.jpg**. An image has been supplied if you just wish to run the cell without finding your own.

In [12]:
def predict(location, model):
    """Classify the image at a web URL or local path and return its label."""
    import requests  # imported here because it is only needed for web URLs

    # Check whether a web URL or a local file path was passed in
    if location.startswith("http"):
        response = requests.get(location, timeout=30)
        response.raise_for_status()
        # Match the local branch: RGB, resized to the model's 224x224 input
        img = Image.open(BytesIO(response.content)).convert("RGB").resize((224, 224))
    else:
        img = load_img(location, target_size=(224, 224))

    # Scale pixel values to [0, 1] and add a batch dimension
    img = img_to_array(img) / 255
    img = np.expand_dims(img, axis=0)

    # Run inference on the image
    answer = model.predict(img)

    # Map the highest-scoring class index back to its label
    y = int(np.asarray(answer).argmax(axis=-1)[0])
    return labels[y]

In [13]:
import os

# Load the path for our image, be it a web URL or a local file
online_url = ""
local_url = os.path.join(os.getcwd(), "images", "test_image.jpg")

if online_url:
    image_url = online_url
else:
    image_url = local_url

# Pass in our loaded_model, which is the latest version of our model pulled from MLflow.
prediction = predict(image_url, loaded_model)
print("The model predicts: " + prediction)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 894ms/step
The model predicts: orange


Did the model correctly guess what was in your image? If so, great! If not... *still* also great!

As you have now observed through using a Model Registry, **incorrect predictions** from models provide vital feedback that can help you to understand the model's behaviour and improve either the training dataset or the model. Using the MLflow Model Registry, we can compare multiple versions of the same model and understand how tweaking the dataset and model parameters can affect performance.

# **Conclusion**

In this exercise, you got to explore the ins and outs of the machine learning workflow platform, MLflow. You learned how to register models from experiment runs into the MLflow Model Registry, update versions of models in the registry from existing runs, and modify the staging of models in the registry to allow teams working on this model to keep track of what models are deployed into production. 

Lastly, you learned how to load a model from the saved artifacts associated with a model in the registry, using the model's MLflow URI.

In the next exercise, you will use that URI to **serve** the latest version of your model from the repository using **KServe** - making it callable from your own retail application!

For more on managing models in the MLflow Model Registry, see the official <a href="https://mlflow.org/docs/latest/model-registry.html">MLflow documentation</a>.

In [18]:
import mlflow

# Set the experiment name to 'Default'
experiment_name = "Default"

# Set the run ID
run_id = "b27f6141bb8c432a818255ee2d9df0f8"

# Get the run details
run = mlflow.get_run(run_id)

# Fetch the artifact location from the run's metadata
artifact_uri = run.info.artifact_uri
print(f"Artifact URI: {artifact_uri}")

# Define the model path within the artifact (if you know it; assume it's 'model')
model_path = f"{artifact_uri}/model"


Artifact URI: s3://mlflow.ddpcai/0/b27f6141bb8c432a818255ee2d9df0f8/artifacts


In [19]:
# Register the model with MLflow
from mlflow.tracking.client import MlflowClient

model_name = "fine_tuned_fb125_model"  # Set your desired model name

# Register the model to the model registry; this returns a ModelVersion object
model_version = mlflow.register_model(model_path, model_name)

# Optionally, set the stage (e.g., "Production", "Staging").
# The top-level mlflow module has no set_model_version_stage function,
# so the transition goes through MlflowClient instead.
MlflowClient().transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",
)

print(f"Model registered: {model_version.name}, version {model_version.version}")

Successfully registered model 'fine_tuned_fb125_model'.
2025/03/08 17:07:57 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: fine_tuned_fb125_model, version 1
Created version '1' of model 'fine_tuned_fb125_model'.