
# Experiment Tracking and Model Registry Lab

## Overview

In this lab, each of you will download a new dataset and try to train a good model. You will use MLflow to track your experiments, log your metrics, artifacts, and models, and then register a final set of models for "deployment" (we won't actually deploy them anywhere yet).

## Goal

Your goal is **not** to become a master of MLflow; this is not a course on all of its ins and outs. Instead, your goal is to understand when and why it is important to track your model development process (experiments, artifacts, and models), to get into the habit of doing so, and to learn enough of the basics of MLflow that you can compare it with other available tools.

## Instructions

Once you have selected a dataset, create a brand new experiment in MLflow and begin exploring your data. Do some EDA, clean up, and learn about your data. You do not need to begin tracking anything yet, but you can if you want to (e.g. you can log different versions of your data as you clean it up and do any feature engineering). Do not spend a ton of time on this part. Your goal isn't really to build a great model, so don't spend hours on feature engineering, missing data imputation, and things like that.

Once your data is clean, begin training models and tracking your experiments. If you intend to use this same dataset for your final project, then start thinking about what your model might look like when you actually deploy it. For example, when you engineer new features, be sure to save the code that does this, as you will need this in the future. If your final model has 1000 complex features, you might have a difficult time deploying it later on. If your final model takes 15 minutes to train, or takes a long time to score a new batch of data, you may want to think about training a less complex model.

Now, when tracking your experiments, at a *minimum*, you should:

1. Try at least 3 different ML algorithms (e.g. linear regression, decision tree, random forest).
2. Do hyperparameter tuning for **each** algorithm.
3. Do some very basic feature selection, and repeat the above steps with these reduced sets of features.
4. Identify the top 3 best models and note these down for later.
6. Choose the **final** "best" model that you would deploy or use on future data, stage it (in MLflow), and run it on the test set to get a final measure of performance. Don't forget to log the test set metric.
7. Make sure you have logged the exact training, validation, and testing datasets for the 3 best models, as well as their hyperparameter values and metric values.
8. Push your code to GitHub. No need to track the mlruns folder, the images folder, any datasets, or the SQLite database in git.
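
Step 7 asks for the exact datasets behind each of your best models. One way to make that checkable is to record a content hash of every split next to the data file itself. The sketch below is only an illustration, not a required part of the lab: `dataset_fingerprint` is a hypothetical helper (it is not part of MLflow), and the toy DataFrame stands in for your own splits.

```python
import hashlib

import pandas as pd


def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest of a DataFrame's columns, index, and values.

    Hypothetical helper for illustration; not an MLflow function.
    """
    hasher = hashlib.sha256()
    # Column names are hashed too, so renaming a column changes the digest.
    hasher.update(",".join(map(str, df.columns)).encode("utf-8"))
    # hash_pandas_object yields one uint64 per row, covering index and values.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    hasher.update(row_hashes.to_numpy().tobytes())
    return hasher.hexdigest()


# Toy data standing in for a real training split.
train = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "y": [0, 1, 0]})
train_hash = dataset_fingerprint(train)
print(train_hash[:12])
```

Inside an active run, the digest could then be recorded with something like `mlflow.set_tag("train_sha256", train_hash)`, and the saved file itself with `mlflow.log_artifact`.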

### Turning It In

In the MLflow UI, next to the refresh button, you should see three vertical dots. Click the dots and download your experiments as a CSV file. Open the CSV file in Excel, highlight the rows for your top 3 models from step 4 and the row for the run where you applied your best model to the test set, and save it as an Excel file. Take a screenshot of the Models page in the MLflow UI showing the model you staged in step 6 above. Submit the Excel file and the screenshot to Canvas.

In [26]:
import mlflow
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials



In [27]:
mlflow.set_tracking_uri('sqlite:///mlflow.db')
mlflow.set_experiment('lab2_better_exp')

<Experiment: artifact_location='/Users/adamgent/ml_ops/local_mlflow/mlruns/4', creation_time=1742877804338, experiment_id='4', last_update_time=1742877804338, lifecycle_stage='active', name='lab2_better_exp', tags={}>

In [28]:
df = pd.read_csv('~/ml_ops/transformed_df.csv')
y = df['y'].copy().values


X = df.drop(columns='y').copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42,
                                                    stratify=y)



In [29]:
import mlflow
import mlflow.sklearn
import itertools
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define hyperparameter space
param_grid = {
    'n_estimators': [200, 400],
    'max_depth': [2, 6],
    'min_samples_split': [60, 100],
    'min_samples_leaf': [30, 60],
    'max_features': ['sqrt', 'log2']
}

# Create all possible combinations of hyperparameters
param_combinations = list(itertools.product(*param_grid.values()))
param_keys = list(param_grid.keys())

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

with mlflow.start_run(run_name="RF_ManualGridCV"):
    for combo in param_combinations:
        params = dict(zip(param_keys, combo))
        model = RandomForestClassifier(
            **params, 
            random_state=42, 
            class_weight='balanced'
        )
        
        # Cross-val score on training data (scoring on recall)
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='recall')
        mean_cv_recall = cv_scores.mean()

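        # Refit on the full training set, then score the held-out test set.
        # These test metrics are logged for every combination; compare
        # combinations on recall_cv, not on the test metrics.
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)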

        # Log as nested run
        with mlflow.start_run(nested=True):
            mlflow.log_params(params)
            mlflow.log_metric("recall_cv", mean_cv_recall)
            mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
            mlflow.log_metric("precision", precision_score(y_test, y_pred))
            mlflow.log_metric("recall", recall_score(y_test, y_pred))
            mlflow.log_metric("f1", f1_score(y_test, y_pred))
            mlflow.sklearn.log_model(model, "rf_model")
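
The loop above scores the test set for every hyperparameter combination. A common alternative, in line with step 6 of the instructions, is to choose hyperparameters on cross-validated recall alone and touch the test set only once at the end. Here is a minimal sketch of that pattern, assuming synthetic data from `make_classification` in place of the real dataset and a deliberately tiny grid; MLflow logging is left out to keep the selection logic visible.

```python
import itertools

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the real data.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {"n_estimators": [25, 50], "max_depth": [2, 6]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

results = []
for combo in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), combo))
    model = RandomForestClassifier(**params, class_weight="balanced", random_state=42)
    # Selection uses training folds only; the test set is not touched here.
    cv_recall = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall").mean()
    results.append((cv_recall, params))

# Pick the combination with the highest mean CV recall.
best_cv_recall, best_params = max(results, key=lambda r: r[0])

# Refit once on the full training set and evaluate on the test set a single time.
final_model = RandomForestClassifier(**best_params, class_weight="balanced", random_state=42)
final_model.fit(X_train, y_train)
test_recall = recall_score(y_test, final_model.predict(X_test))
print(best_params, round(best_cv_recall, 3), round(test_recall, 3))
```

With this layout, the test metric is logged in exactly one run, which is also the row you would highlight when turning the lab in.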




In [30]:
import itertools
import mlflow
import mlflow.sklearn

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Define hyperparameter space
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.3],
    'estimator__max_depth': [2, 4, 6]
}

# Expand param grid
param_combos = list(itertools.product(*param_grid.values()))
param_keys = list(param_grid.keys())

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

with mlflow.start_run(run_name="AdaBoost_ManualGridCV"):

    best_recall = -1
    best_model = None
    best_params = None

    for combo in param_combos:
        params = dict(zip(param_keys, combo))

        base_estimator = DecisionTreeClassifier(
            max_depth=params.pop('estimator__max_depth'),
            class_weight='balanced',
            random_state=42
        )

        model = AdaBoostClassifier(
            estimator=base_estimator,
            algorithm='SAMME',  # to suppress future warnings
            random_state=42,
            **params
        )

        # Cross-validation recall
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='recall')
        mean_cv_recall = cv_scores.mean()

        # Train on full train set
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Log this run
        with mlflow.start_run(nested=True):
            mlflow.log_params({**params, "estimator__max_depth": base_estimator.max_depth})
            mlflow.log_metric("recall_cv", mean_cv_recall)
            mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
            mlflow.log_metric("precision", precision_score(y_test, y_pred))
            mlflow.log_metric("recall", recall_score(y_test, y_pred))
            mlflow.log_metric("f1", f1_score(y_test, y_pred))
            mlflow.sklearn.log_model(model, "adaboost_model")

        # Track best model
        if mean_cv_recall > best_recall:
            best_recall = mean_cv_recall
            best_model = model
            best_params = {**params, "estimator__max_depth": base_estimator.max_depth}

    # Log best model at top level
    mlflow.log_params(best_params)
    mlflow.log_metric("best_recall_cv", best_recall)
    mlflow.sklearn.log_model(best_model, "best_adaboost_model")


1 fits failed out of a total of 3.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/mlops/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/anaconda3/envs/mlops/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlops/lib/python3.12/site-packages/sklearn/ensemble/_weight_boosting.py", line 169, in fit
    sample_weight, estimator_weight, estimator_error = sel

In [31]:
import mlflow
import mlflow.sklearn
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

cs = [.001, .01, .1, 1, 10, 100]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

with mlflow.start_run(run_name="Lasso_Logistic_C_Search"):

    best_recall = -1
    best_model = None
    best_c = None

    for c in cs:
        model = LogisticRegression(
            penalty='l1',
            solver='saga',
            C=c,
            class_weight='balanced',
            random_state=42,
            max_iter=1000
        )

        # Cross-validate
        cv_results = cross_validate(
            model, X_train, y_train,
            cv=cv,
            scoring=['accuracy', 'precision', 'recall', 'f1'],
            return_train_score=False
        )

        mean_accuracy = cv_results['test_accuracy'].mean()
        mean_precision = cv_results['test_precision'].mean()
        mean_recall = cv_results['test_recall'].mean()
        mean_f1 = cv_results['test_f1'].mean()

        # Fit to full train set for metrics and coef count
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        num_nonzero_coefs = np.sum(model.coef_ != 0)

        with mlflow.start_run(nested=True):
            mlflow.log_param("C", c)
            mlflow.log_metric("cv_accuracy", mean_accuracy)
            mlflow.log_metric("cv_precision", mean_precision)
            mlflow.log_metric("cv_recall", mean_recall)
            mlflow.log_metric("cv_f1", mean_f1)
            mlflow.log_metric("test_accuracy", accuracy_score(y_test, y_pred))
            mlflow.log_metric("test_precision", precision_score(y_test, y_pred))
            mlflow.log_metric("test_recall", recall_score(y_test, y_pred))
            mlflow.log_metric("test_f1", f1_score(y_test, y_pred))
            mlflow.log_metric("n_nonzero_coefs", num_nonzero_coefs)
            mlflow.sklearn.log_model(model, "lasso_logistic_model")

        if mean_recall > best_recall:
            best_recall = mean_recall
            best_model = model
            best_c = c

    # Log best model at the top level
    mlflow.log_param("best_C", best_c)
    mlflow.log_metric("best_cv_recall", best_recall)
    mlflow.sklearn.log_model(best_model, "best_lasso_logistic_model")
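
The `n_nonzero_coefs` metric logged above hints at a cheap route to the feature selection asked for in step 3: the L1 penalty drives some coefficients exactly to zero, and the surviving columns form a reduced feature set. The sketch below shows the idea on synthetic data. The column names `f0` to `f11` and the value `C=0.05` are made up for illustration and are not taken from the notebook's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 12 features, only 4 of them informative.
X_arr, y = make_classification(
    n_samples=500,
    n_features=12,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)
X = pd.DataFrame(
    StandardScaler().fit_transform(X_arr),
    columns=[f"f{i}" for i in range(12)],
)

# A small C means a strong L1 penalty, which zeroes out weak coefficients.
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=5000, random_state=42)
lasso.fit(X, y)

# Keep only the columns whose coefficient survived the penalty.
mask = lasso.coef_.ravel() != 0
selected = X.columns[mask].tolist()
X_reduced = X[selected]
print(f"kept {len(selected)} of {X.shape[1]} features: {selected}")
```

The reduced frame can then be fed back through the same search loops as the full feature set, with the list of selected columns logged as a parameter or artifact so the run stays reproducible.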




In [None]:
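# NOTE: `best_model` is whatever the most recently executed search cell left
# behind (the L1 logistic regression if the cells ran in order), not
# necessarily the best model across all three algorithms.
with mlflow.start_run(run_name="Final Best Model Registration"):
    # Log test metrics
    y_pred = best_model.predict(X_test)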
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("precision", precision_score(y_test, y_pred))
    mlflow.log_metric("recall", recall_score(y_test, y_pred))
    mlflow.log_metric("f1", f1_score(y_test, y_pred))

    # Log the model
    mlflow.sklearn.log_model(best_model, "model")

    # Register the model
    result = mlflow.register_model(
        model_uri=f"runs:/{mlflow.active_run().info.run_id}/model",
        name="best_overall_model"
    )
