<div style="align: center;">
    <br>
    <img src="https://www.nyc.gov/assets/tlc/images/content/hero/MRP-Closing-Week.jpg" style="display:block; margin:auto; width:65%; height:250px;">
</div><br><br> 

<div style="letter-spacing:normal; opacity:1.;">
<!--   https://xkcd.com/color/rgb/   -->
  <p style="text-align:center; background-color: lightsalmon; color: Jaguar; border-radius:10px; font-family:monospace; 
            line-height:1.4; font-size:32px; font-weight:bold; text-transform: uppercase; padding: 9px;">
            <strong>TLC Trip Record Data</strong></p>  
  
  <p style="text-align:center; background-color:romance; color: Jaguar; border-radius:10px; font-family:monospace; 
            line-height:1.4; font-size:22px; font-weight:normal; text-transform: capitalize; padding: 5px;"
     >Machine Learning Module: MLFLOW - Ride Duration Prediction using Regression Analysis<br>( MLFLOW )</p>    
</div>

- https://mlflow.org/docs/0.7.0/index.html

**Dataset Info**


**Context**

Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

For-Hire Vehicle (“FHV”) trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases. Note: The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness. Therefore, this may not represent the total amount of trips dispatched by all TLC-licensed bases. The TLC performs routine reviews of the records and takes enforcement actions when necessary to ensure, to the extent possible, complete and accurate information.


**ATTENTION!**

On 05/13/2022, we are making the following changes to trip record files:

- All files will be stored in the PARQUET format. Please see the ‘Working With PARQUET Format’ under the Data Dictionaries and MetaData section.
- Trip data will be published monthly (with two months delay) instead of bi-annually.
- HVFHV files will now include 17 more columns (please see High Volume FHV Trips Dictionary for details). Additional columns will be added to the old files as well. The earliest date to include additional columns: February 2019.
- Yellow trip data will now include 1 additional column (‘airport_fee’, please see Yellow Trips Dictionary for details). The additional column will be added to the old files as well. The earliest date to include the additional column: January 2011.


**Download the data for January, February and March 2022**

Dataset: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page


**Data Dictionaries and MetaData**

- We'll use the same `NYC taxi dataset`, but instead of "Yellow Taxi Trip Records", we'll use `"Green Taxi Trip Records"`.

> `Green Trips Data Dictionary`: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf

**TASK**

The goal of this homework is to get familiar with tools like MLflow for experiment tracking and model management.<br>

Questions: https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/02-experiment-tracking/homework.md


**Table of Content**


1. Import Libraries and Ingest Data
    - Q1: Install the package.<br>    
2. Recognizing and Understanding Data
    - Q2. Download and preprocess the data<br>
    

<div style="letter-spacing:normal; opacity:1.;">
  <h1 style="text-align:center; background-color: lightsalmon; color: Jaguar; border-radius:10px; font-family:monospace; border-radius:20px;
            line-height:1.4; font-size:32px; font-weight:bold; text-transform: uppercase; padding: 9px;">
            <strong>1. Import Libraries & Ingest Data</strong></h1>   
</div>

**pip freeze**

- https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf

In [1]:
# %%capture cap --no-stderr

# !conda create --name "exp-tracking-env-py39" python=3.9  jupyter -y

In [2]:
# cap.show()

In [3]:
# check enviroment
# !conda info | grep 'active env'
# !conda info -e
# !conda env list

In [4]:
%%writefile requirements.txt

# To get started with MLflow you'll need to install the appropriate Python package.
pandas==2.0.2
fastparquet==2023.4.0
# pyarrow==11.0.0
seaborn==0.12.2
scikit-learn==1.2.2
xgboost==1.7.5
hyperopt==0.2.7

jupyter==1.0.0
mlflow==2.3.2
boto3==1.26.144
setuptools==67.7.2

Overwriting requirements.txt


In [5]:
import os, sys, platform
print("Python  :", sys.version)
print("Platform:", platform.platform())
print("Actv Env:", os.environ['CONDA_DEFAULT_ENV'])

!{sys.executable} -m pip install -r requirements.txt -Uq

Python  : 3.9.16 (main, Mar  8 2023, 14:00:05) 
[GCC 11.2.0]
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Actv Env: exp-tracking-env-py39


In [6]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import stats

import pickle
from glob import glob
from tqdm import tqdm
tqdm._instances.clear()

# memory management performs garbage collection 
import gc
gc.collect()

0

## Q1. Install the package

To get started with MLflow you'll need to install the appropriate Python package.

For this we recommend creating a separate Python environment, for example, you can use [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-envs), 
and then install the package there with `pip` or `conda`.

Once you installed the package, run the command `mlflow --version` and check the output.

- **What's the version that you have?**

In [7]:
import mlflow; print("mlflow.__version__: ", mlflow.__version__)
from mlflow.tracking import MlflowClient
from mlflow.exceptions import MlflowException

client = MlflowClient()

mlflow.__version__:  2.3.2


<div style="letter-spacing:normal; opacity:1.;">
  <h1 style="text-align:center; background-color: lightsalmon; color: Jaguar; border-radius:10px; font-family:monospace; border-radius:20px;
            line-height:1.4; font-size:32px; font-weight:bold; text-transform: uppercase; padding: 9px;">
            <strong>2. Recognizing and Understanding Data</strong></h1>   
</div>

## Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip. 

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

Use the script `preprocess_data.py` located in the folder [`homework`](https://github.com/DataTalksClub/mlops-zoomcamp/tree/main/cohorts/2023/02-experiment-tracking/homework) to preprocess the data.

The script will:

* load the data from the folder `<TAXI_DATA_FOLDER>` (the folder where you have downloaded the data),
* fit a `DictVectorizer` on the training set (January 2022 data),
* save the preprocessed datasets and the `DictVectorizer` to disk.


**Your task is to download the datasets and then execute this command:**

```
python preprocess_data.py --raw_data_path <TAXI_DATA_FOLDER> --dest_path ./output
```

Tip: go to `02-experiment-tracking/homework/` folder before executing the command and change the value of `<TAXI_DATA_FOLDER>` to the location where you saved the data.

- **So what's the size of the saved `DictVectorizer` file?**

## Ingest Data [wget](https://linuxways.net/centos/linux-wget-command-with-examples/) or [curl](https://daniel.haxx.se/blog/2020/09/10/store-the-curl-output-over-there/)

In [8]:
# "Green Taxi Trip Records" Download the data for January, February and March 2022
!wget -q -N -P "./trip-data" https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
!wget -q -N -P "./trip-data" https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-02.parquet
!wget -q -N -P "./trip-data" https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-03.parquet

In [9]:
glob('trip-data/green*.parquet')

['trip-data/green_tripdata_2022-01.parquet',
 'trip-data/green_tripdata_2022-02.parquet',
 'trip-data/green_tripdata_2022-03.parquet']

In [10]:
%%writefile preprocess_data.py

# Source: https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/02-experiment-tracking/homework/preprocess_data.py

import os
import pickle
import click
import pandas as pd

from sklearn.feature_extraction import DictVectorizer

# import warnings
# # Ignore all warnings
# # warnings.filterwarnings("ignore")

# # Filter the specific warning message
# warnings.filterwarnings("ignore", category=UserWarning, module="setuptools")
# warnings.filterwarnings("ignore", category=UserWarning, message="Setuptools is replacing distutils.")


def dump_pickle(obj, filename: str):
    with open(filename, "wb") as f_out:
        return pickle.dump(obj, f_out)


def read_dataframe(filename: str):
    df = pd.read_parquet(filename)

    df['duration'] = df['lpep_dropoff_datetime'] - df['lpep_pickup_datetime']
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)
    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df


def preprocess(df: pd.DataFrame, dv: DictVectorizer, fit_dv: bool = False):
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    categorical = ['PU_DO']
    numerical = ['trip_distance']
    dicts = df[categorical + numerical].to_dict(orient='records')
    if fit_dv:
        X = dv.fit_transform(dicts)
    else:
        X = dv.transform(dicts)
    return X, dv


@click.command()
@click.option(
    "--raw_data_path",
    help="Location where the raw NYC taxi trip data was saved"
)
@click.option(
    "--dest_path",
    help="Location where the resulting files will be saved"
)
def run_data_prep(raw_data_path: str, dest_path: str, dataset: str = "green"):
    # Load parquet files
    df_train = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2022-01.parquet")
    )
    df_val = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2022-02.parquet")
    )
    df_test = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2022-03.parquet")
    )

    # Extract the target
    target = 'tip_amount'
    y_train = df_train[target].values
    y_val = df_val[target].values
    y_test = df_test[target].values

    # Fit the DictVectorizer and preprocess data
    dv = DictVectorizer()
    X_train, dv = preprocess(df_train, dv, fit_dv=True)
    X_val, _ = preprocess(df_val, dv, fit_dv=False)
    X_test, _ = preprocess(df_test, dv, fit_dv=False)

    # Create dest_path folder unless it already exists
    os.makedirs(dest_path, exist_ok=True)

    # Save DictVectorizer and datasets
    dump_pickle(dv, os.path.join(dest_path, "dv.pkl"))
    dump_pickle((X_train, y_train), os.path.join(dest_path, "train.pkl"))
    dump_pickle((X_val, y_val), os.path.join(dest_path, "val.pkl"))
    dump_pickle((X_test, y_test), os.path.join(dest_path, "test.pkl"))


if __name__ == '__main__':
    run_data_prep()

Overwriting preprocess_data.py


In [11]:
TAXI_DATA_FOLDER = "./trip-data"

!python preprocess_data.py --raw_data_path $TAXI_DATA_FOLDER --dest_path ./output

In [12]:
# So what's the size of the saved DictVectorizer file?
!stat "./output/dv.pkl"  | grep Size:       | awk '{print $2}'
!du -h "./output/dv.pkl" | awk '{print $1}'

153660
152K


## Q3. Train a model with autolog

We will train a `RandomForestRegressor` (from Scikit-Learn) on the taxi dataset.

We have prepared the training script `train.py` for this exercise, which can be also found in the folder [`homework`](https://github.com/DataTalksClub/mlops-zoomcamp/tree/main/cohorts/2023/02-experiment-tracking/homework). 

The script will:

* load the datasets produced by the previous step,
* train the model on the training set,
* calculate the RMSE score on the validation set.

Your task is to modify the script to enable [**autologging**](https://mlflow.org/docs/latest/tracking.html#automatic-logging) with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked. 

Tip 1: don't forget to wrap the training code with a `with mlflow.start_run():` statement as we showed in the videos.

Tip 2: don't modify the hyperparameters of the model to make sure that the training will finish quickly.

Additional code:
- Run Terminal: `mlflow ui --backend-store-uri sqlite:///mlflow.db`


<br>

- **What is the value of the `max_depth` parameter?**

In [13]:
# !cd path/.ipynb
# kill $(lsof -ti :5000)
# mlflow ui --backend-store-uri sqlite:///mlflow.db --port 8000
# !mlflow ui --backend-store-uri sqlite:///mlflow.db

In [13]:
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-experiment")

2023/06/04 23:31:45 INFO mlflow.tracking.fluent: Experiment with name 'nyc-taxi-experiment' does not exist. Creating a new experiment.


<Experiment: artifact_location='/mnt/c/Users/clk/Jupyter_Notebook/mlops-zoomcamp-2023/cohorts/2023/02-experiment-tracking/mlruns/1', creation_time=1685910705834, experiment_id='1', last_update_time=1685910705834, lifecycle_stage='active', name='nyc-taxi-experiment', tags={}>

In [14]:
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

tracking URI: 'sqlite:///mlflow.db'


In [15]:
%%writefile train.py

import os
import click
import pickle
import mlflow

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import warnings
# Ignore all warnings
# warnings.filterwarnings("ignore")

# Filter the specific warning message
warnings.filterwarnings("ignore", category=UserWarning, module="setuptools")
warnings.filterwarnings("ignore", category=UserWarning, message="Setuptools is replacing distutils.")


def load_pickle(filename: str):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


@click.command()
@click.option(
    "--data_path",
    default="./output",
    help="Location where the processed NYC taxi trip data was saved"
)


def run_train(data_path: str):
    
    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    mlflow.set_experiment("nyc-taxi-experiment")
    
    # before your training code to enable automatic logging of sklearn metrics, params, and models
    mlflow.sklearn.autolog()
    
    with mlflow.start_run():
        mlflow.set_tag("developer", "muce")
        mlflow.log_param("train-data-path", './output/train.pkl')
        mlflow.log_param("valid-data-path", './output/val.pkl')
        mlflow.log_param("test-data-path",  './output/test.pkl')
        

        X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
        X_val, y_val     = load_pickle(os.path.join(data_path, "val.pkl"))
        
        params = {"max_depth": 10, "random_state": 0}
        mlflow.log_params(params)
        
        rf     = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        # autolog_run = mlflow.last_active_run()


        rmse = mean_squared_error(y_val, y_pred, squared=False)
        mlflow.log_metric("rmse", rmse)
        
        folder_path = 'models'
        os.makedirs(folder_path, exist_ok=True)
        
        # Log Model two options
        # save model, preprocessor or pipeline
        with open('models/ride_duration_rf_model.bin', 'wb') as f_out:
            pickle.dump(rf, f_out)
        
        # as artifact, save model, preprocessor or pipeline
        mlflow.log_artifact(local_path="models/ride_duration_rf_model.bin", artifact_path="models_pickle")
        
        # as model, save model
        mlflow.sklearn.log_model(sk_model = rf, artifact_path = "models_mlflow")
        
        print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'")


if __name__ == '__main__':
    run_train()

Overwriting train.py


In [16]:
DATA_FOLDER = "./output/"

!python train.py --data_path $DATA_FOLDER

default artifacts URI: '/mnt/c/Users/clk/Jupyter_Notebook/mlops-zoomcamp-2023/cohorts/2023/02-experiment-tracking/mlruns/1/fdb8534b432146acb89a8d4f0ae2e90b/artifacts'


# Q4. Tune model hyperparameters

Now let's try to reduce the validation error by tuning the hyperparameters of the RandomForestRegressor using optuna. We have prepared the script hpo.py for this exercise.

Your task is to modify the script hpo.py and make sure that the validation RMSE is logged to the tracking server for each run of the hyperparameter optimization (you will need to add a few lines of code to the objective function) and run the script without passing any parameters.

After that, open UI and explore the runs from the experiment called random-forest-hyperopt to answer the question below.

Note: Don't use autologging for this exercise.

The idea is to just log the information that you need to answer the question below, including:

the list of hyperparameters that are passed to the objective function during the optimization,
the RMSE obtained on the validation set (February 2022 data).

- **What's the best validation RMSE that you got?**

In [29]:
%%writefile hpo.py

import os
import pickle
import click
import mlflow
import optuna

from optuna.samplers import TPESampler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# import warnings
# # Ignore all warnings
# # warnings.filterwarnings("ignore")

# # Filter the specific warning message
# warnings.filterwarnings("ignore", category=UserWarning, module="setuptools")
# warnings.filterwarnings("ignore", category=UserWarning, message="Setuptools is replacing distutils.")


mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("random-forest-hyperopt")


def load_pickle(filename):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


@click.command()
@click.option(
    "--data_path",
    default="./output",
    help="Location where the processed NYC taxi trip data was saved"
)
@click.option(
    "--num_trials",
    default=10,
    help="The number of parameter evaluations for the optimizer to explore"
)
def run_optimization(data_path: str, num_trials: int):

    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))

    def objective(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 10, 50, 1),
            'max_depth': trial.suggest_int('max_depth', 1, 20, 1),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 10, 1),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 4, 1),
            'random_state': 42,
            'n_jobs': -1
        }

        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        rmse = mean_squared_error(y_val, y_pred, squared=False)

        # Log the validation RMSE to the tracking server
        with mlflow.start_run():
            mlflow.log_params(params)
            mlflow.log_metric('val_rmse', rmse)

        return rmse

    sampler = TPESampler(seed=42)
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=num_trials)


if __name__ == '__main__':
    run_optimization()

Overwriting hpo.py


In [30]:
DATA_FOLDER = "./output/"

!python hpo.py --data_path $DATA_FOLDER

[32m[I 2023-06-04 23:48:33,183][0m A new study created in memory with name: no-name-a7cf2c7b-d47c-4ea4-a7f7-029516ebeaf1[0m
[32m[I 2023-06-04 23:48:36,945][0m Trial 0 finished with value: 2.451379690825458 and parameters: {'n_estimators': 25, 'max_depth': 20, 'min_samples_split': 8, 'min_samples_leaf': 3}. Best is trial 0 with value: 2.451379690825458.[0m
[32m[I 2023-06-04 23:48:37,419][0m Trial 1 finished with value: 2.4667366020368333 and parameters: {'n_estimators': 16, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 0 with value: 2.451379690825458.[0m
[32m[I 2023-06-04 23:48:40,626][0m Trial 2 finished with value: 2.449827329704216 and parameters: {'n_estimators': 34, 'max_depth': 15, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 2 with value: 2.449827329704216.[0m
[32m[I 2023-06-04 23:48:41,756][0m Trial 3 finished with value: 2.460983516558473 and parameters: {'n_estimators': 44, 'max_depth': 5, 'min_samples_split': 3, '

# Q5. Promote the best model to the model registry

The results from the hyperparameter optimization are quite good. So, we can assume that we are ready to test some of these models in production. In this exercise, you'll promote the best model to the model registry. We have prepared a script called register_model.py, which will check the results from the previous step and select the top 5 runs. After that, it will calculate the RMSE of those models on the test set (March 2022 data) and save the results to a new experiment called random-forest-best-models.

Your task is to update the script register_model.py so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.

Tips for MLflow:

you can use the method search_runs from the MlflowClient to get the model with the lowest RMSE,
to register the model you can use the method mlflow.register_model and you will need to pass the right model_uri in the form of a string that looks like this: "runs:/<RUN_ID>/model", and the name of the model (make sure to choose a good one!).

- **What is the test RMSE of the best model?**

In [31]:
# logged_model = 'runs:/94bb075609a44506bfa9fc0da24327fc/models_mlflow'

# # Load model as a PyFuncModel.
# loaded_model = mlflow.pyfunc.load_model(logged_model)
# loaded_model

In [32]:
%%writefile register_model.py

import os
import pickle
import click
import mlflow

from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import warnings
# Ignore all warnings
# warnings.filterwarnings("ignore")

# Filter the specific warning message
warnings.filterwarnings("ignore", category=UserWarning, module="setuptools")
warnings.filterwarnings("ignore", category=UserWarning, message="Setuptools is replacing distutils.")


HPO_EXPERIMENT_NAME = "random-forest-hyperopt"
EXPERIMENT_NAME = "random-forest-best-models"
RF_PARAMS = ['max_depth', 'n_estimators', 'min_samples_split', 'min_samples_leaf', 'random_state', 'n_jobs']

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.sklearn.autolog()


def load_pickle(filename):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def train_and_log_model(data_path, params):
    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))
    X_test, y_test = load_pickle(os.path.join(data_path, "test.pkl"))

    with mlflow.start_run():
        for param in RF_PARAMS:
            params[param] = int(params[param])

        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)

        # Evaluate model on the validation and test sets
        val_rmse = mean_squared_error(y_val, rf.predict(X_val), squared=False)
        mlflow.log_metric("val_rmse", val_rmse)
        test_rmse = mean_squared_error(y_test, rf.predict(X_test), squared=False)
        mlflow.log_metric("test_rmse", test_rmse)

        # Log model parameters and artifacts
        for param, value in params.items():
            mlflow.log_param(param, value)
        mlflow.sklearn.log_model(rf, "random_forest_model")


@click.command()
@click.option(
    "--data_path",
    default="./output",
    help="Location where the processed NYC taxi trip data was saved"
)
@click.option(
    "--top_n",
    default=5,
    type=int,
    help="Number of top models that need to be evaluated to decide which one to promote"
)
def run_register_model(data_path: str, top_n: int):

    client = MlflowClient()

    # Retrieve the top_n model runs and log the models
    experiment = client.get_experiment_by_name(HPO_EXPERIMENT_NAME)
    runs = client.search_runs(
        experiment_ids=experiment.experiment_id,
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=top_n,
        order_by=["metrics.rmse ASC"]
    )
    for run in runs:
        train_and_log_model(data_path=data_path, params=run.data.params)

    # Select the model with the lowest test RMSE
    experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
    best_run = client.search_runs(
        experiment_ids=experiment.experiment_id,
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=["metrics.test_rmse ASC"]
    )[0]

    # Register the best model
    model_uri = f"runs:/{best_run.info.run_id}/random_forest_model"
    model_name = "best_random_forest_model"
    mlflow.register_model(model_uri, name=model_name)

    print("Test RMSE of the best model: {:.4f}".format(best_run.data.metrics["test_rmse"]))


if __name__ == '__main__':
    run_register_model()


Overwriting register_model.py


In [33]:
DATA_FOLDER = "./output/"

!python register_model.py --data_path $DATA_FOLDER

Registered model 'best_random_forest_model' already exists. Creating a new version of this model...
2023/06/04 23:49:31 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: best_random_forest_model, version 3
Created version '3' of model 'best_random_forest_model'.
Test RMSE of the best model: 2.2914


# Q6. Model metadata

**Now explore your best model in the model registry using UI. What information does the model registry contain about each model?**

- Version number
- Source experiment
- Model signature
- All the above answers are correct

- Model Name: The name assigned to the model during registration.

- Model Version: The version number assigned to the model during registration. Each version represents a unique iteration or update of the model.

- Source: The experiment from which the model was registered. It indicates the origin and context of the model.

- Creation Time: The timestamp indicating when the model was registered in the model registry.

- Last Modified Time: The timestamp indicating the most recent modification or update made to the model.

- User: The username or identity of the user who registered the model.

- Run ID: The ID of the MLflow run associated with the model registration.

- Artifact Location: The storage location of the model artifacts, such as the saved model files.

- Model Signature: The signature of the model, which describes the input and output schema or structure of the model.

- Tags: Additional tags or metadata associated with the model. Tags can provide further information or categorization for easy search and organization.

# End of The Project