
# Experiment Tracking and Model Registry Lab

## Overview

In this lab you will each download a new dataset and attempt to train a good model, and use mlflow to keep track of all of your experiments, log your metrics, artifacts and models, and then register a final set of models for "deployment", though we won't actually deploy them anywhere yet.

## Goal

Your goal is **not** to become a master at MLFlow - this is not a course on learning all of the ins and outs of MLFlow. Instead, your goal is to understand when and why it is important to track your model development process (tracking experiments, artifacts and models) and to get into the habit of doing so, and then learn at least the basics of how MLFlow helps you do this so that you can then compare with other tools that are available.

## Data

You can choose your own dataset to use here. It will be helpful to choose a dataset that is already fairly clean and easy to work with. You can even use a dataset that you've used in a previous course. We will do a lot of labs where we do different things with datasets, so if you can find one that is interesting enough for modeling, it should work for most of the rest of the course. 

There are tons of places where you can find open public datasets. Choose something that interests you, but don't overthink it.

[Kaggle Datasets](https://www.kaggle.com/datasets)  
[HuggingFace Datasets](https://huggingface.co/docs/datasets/index)  
[Dagshub Datasets](https://dagshub.com/datasets/)  
[UCI](https://archive.ics.uci.edu/ml/datasets.php)  
[Open Data on AWS](https://registry.opendata.aws/)  
[Yelp](https://www.yelp.com/dataset)  
[MovieLens](https://grouplens.org/datasets/movielens/)  
And so many more...

## Instructions

Once you have selected a set of data, create a brand new experiment in MLFlow and begin exploring your data. Do some EDA, clean up, and learn about your data. You do not need to begin tracking anything yet, but you can if you want to (e.g. you can log different versions of your data as you clean it up and do any feature engineering). Do not spend a ton of time on this part. Your goal isn't really to build a great model, so don't spend hours on feature engineering and missing data imputation and things like that.

Once your data is clean, begin training models and tracking your experiments. If you intend to use this same dataset for your final project, then start thinking about what your model might look like when you actually deploy it. For example, when you engineer new features, be sure to save the code that does this, as you will need this in the future. If your final model has 1000 complex features, you might have a difficult time deploying it later on. If your final model takes 15 minutes to train, or takes a long time to score a new batch of data, you may want to think about training a less complex model.

Now, when tracking your experiments, at a *minimum*, you should:

1. Try at least 3 different ML algorithms (e.g. linear regression, decision tree, random forest, etc.).
2. Do hyperparameter tuning for **each** algorithm.
3. Do some very basic feature selection, and repeat the above steps with these reduced sets of features.
4. Identify the top 3 best models and note these down for later.
6. Choose the **final** "best" model that you would deploy or use on future data, stage it (in MLFlow), and run it on the test set to get a final measure of performance. Don't forget to log the test set metric.
7. Be sure you logged the exact training, validation, and testing datasets for the 3 best models, as well as hyperparameter values, and the values of your metrics.  
8. Push your code to Github. No need to track the mlruns folder, the images folder, any datasets, or the sqlite database in git.

### Turning It In

In the MLFlow UI, next to the refresh button you should see three vertical dots. Click the dots and then download your experiments as a csv file. Open the csv file in Excel and highlight the rows for your top 3 models from step 4, highlight the run where you applied your best model to the test set, and then save as an excel file. Take a snapshot of the Models page in the MLFLow UI showing the model you staged in step 6 above. Submit the excel file and the snapshot to Canvas.

# Table of Contents

1. [3 different ML algorithms]()
2. [Hyperparameter Tuning]()
3. [Feature Selection]()
4. [Identify Top 3 best model]()
5. [Final Best model]()
6. [Training Validation and Testing Models]()
7. [Github]()

Command to Start `mlflow ui --backend-store-uri sqlite:///mlflow.db`

In [None]:
!pip install dataset

In [2]:
import mlflow
import mlflow.sklearn
from datasets import load_dataset
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from joblib import dump, load
import os

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
mlflow.__version__

'2.15.1'

In [3]:
# # Create new experiment
# experiment_name = "Amazon_Polarity_Experiment"
# try:
#     experiment_id = mlflow.create_experiment(experiment_name)
# except:
#     experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("amazon-experiment")

<Experiment: artifact_location='/Users/drewhoang/Desktop/mlops-course/notebooks/mlruns/1', creation_time=1742579036555, experiment_id='1', last_update_time=1742579036555, lifecycle_stage='active', name='amazon-experiment', tags={}>

In [4]:
# Load dataset from Hugging Face
print("Loading dataset from Hugging Face...")
dataset = load_dataset("fancyzhx/amazon_polarity")


Loading dataset from Hugging Face...


In [5]:
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

In [6]:
print(f"Train dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")

Train dataset shape: (3600000, 3)
Test dataset shape: (400000, 3)


In [7]:
# Combine title and content for feature extraction
train_df['text'] = train_df['title'] + " " + train_df['content']
test_df['text'] = test_df['title'] + " " + test_df['content']

In [8]:
train_data, val_data, train_labels, val_labels = train_test_split(
    train_df['text'], 
    train_df['label'], 
    test_size=0.2, 
    random_state=42,
    stratify=train_df['label']
)

In [53]:
tfidf_vectorizer = TfidfVectorizer(max_features=500, min_df=5, max_df=0.7)
X_train = tfidf_vectorizer.fit_transform(train_data)
X_val = tfidf_vectorizer.transform(val_data)
X_test = tfidf_vectorizer.transform(test_df['text'])

In [10]:
y_train = train_labels
y_val = val_labels
y_test = test_df['label']

In [11]:
print("\nClass distribution:")
print(train_df['label'].value_counts())


Class distribution:
label
1    1800000
0    1800000
Name: count, dtype: int64


In [12]:
os.makedirs('processed_data', exist_ok=True)
os.makedirs('models', exist_ok=True)

In [13]:
dump(tfidf_vectorizer, 'processed_data/tfidf_vectorizer.joblib')
pickle.dump((X_train, y_train), open('processed_data/train_data.pkl', 'wb'))
pickle.dump((X_val, y_val), open('processed_data/val_data.pkl', 'wb'))
pickle.dump((X_test, y_test), open('processed_data/test_data.pkl', 'wb'))


# Log the processed data

In [14]:
with mlflow.start_run():
    mlflow.log_artifacts('processed_data', artifact_path='processed_data')
    mlflow.set_tag("process", "preprocessing")
    mlflow.log_params({
        "vectorizer": "TF-IDF",
        "max_features": 5000,
        "min_df": 5,
        "max_df": 0.7
    })
mlflow.end_run()
print("Data preprocessing completed and artifacts logged.")

Data preprocessing completed and artifacts logged.


In [14]:
import numpy as np
def feature_selection(model, X_train, y_train, X_val, y_val, n_features=1000):
    model.fit(X_train, y_train)
    
    if hasattr(model, 'coef_'):
        importances = np.abs(model.coef_[0])
    elif hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    else:
        return X_train, X_val
    
    feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
    sorted_idx = np.argsort(importances)[::-1][:n_features]
    
    # Get the selected feature names
    selected_features = feature_names[sorted_idx]
    
    # Create a new vectorizer with only selected features
    new_vectorizer = TfidfVectorizer(vocabulary=selected_features)
    new_vectorizer.fit(train_data)
    
    X_train_selected = new_vectorizer.transform(train_data)
    X_val_selected = new_vectorizer.transform(val_data)
    
    return X_train_selected, X_val_selected, new_vectorizer

In [16]:
lr_selector = LogisticRegression(max_iter=1000)


In [17]:
with mlflow.start_run():
    # Top 1000 features
    X_train_1000, X_val_1000, vectorizer_1000 = feature_selection(lr_selector, X_train, y_train, X_val, y_val, n_features=1000)
    dump(vectorizer_1000, 'processed_data/tfidf_vectorizer_1000.joblib')
    pickle.dump((X_train_1000, y_train), open('processed_data/train_data_1000.pkl', 'wb'))
    pickle.dump((X_val_1000, y_val), open('processed_data/val_data_1000.pkl', 'wb'))
    
    # Top 500 features
    X_train_500, X_val_500, vectorizer_500 = feature_selection(lr_selector, X_train, y_train, X_val, y_val, n_features=500)
    dump(vectorizer_500, 'processed_data/tfidf_vectorizer_500.joblib')
    pickle.dump((X_train_500, y_train), open('processed_data/train_data_500.pkl', 'wb'))
    pickle.dump((X_val_500, y_val), open('processed_data/val_data_500.pkl', 'wb'))
    
    mlflow.log_artifacts('processed_data', artifact_path='processed_data')
    mlflow.set_tag("process", "feature_selection")
    mlflow.log_params({
        "feature_selector": "LogisticRegression",
        "feature_set_sizes": "5000, 1000, 500"
    })
mlflow.end_run()
print("Feature selection completed and artifacts logged.")

Feature selection completed and artifacts logged.


In [17]:
X_train_500, X_val_500, vectorizer_500 = feature_selection(lr_selector, X_train, y_train, X_val, y_val, n_features=500)
dump(vectorizer_500, 'processed_data/tfidf_vectorizer_500.joblib')
pickle.dump((X_train_500, y_train), open('processed_data/train_data_500.pkl', 'wb'))
pickle.dump((X_val_500, y_val), open('processed_data/val_data_500.pkl', 'wb'))
    

### Algorithm 1: Logistic Regression with Hyperparameter Tuning


In [18]:
# 500 features
lr_params_500 = [
    {"C": 0.1, "solver": "liblinear", "max_iter": 1000},
    {"C": 1.0, "solver": "liblinear", "max_iter": 1000},
    {"C": 10.0, "solver": "liblinear", "max_iter": 1000},
]

for i, params in enumerate(lr_params_500):
    with mlflow.start_run():
        # Create the model with parameters
        lr = LogisticRegression(**params)
        
        # Log parameters
        mlflow.set_tags({"Model": "LogisticRegression", "Feature Set": "500features"})
        mlflow.log_params(params)
        mlflow.log_param("feature_set", "500features")
        mlflow.log_param("feature_count", X_train_500.shape[1])

        # Train model
        lr.fit(X_train_500, y_train)

        # Evaluate on validation data
        y_pred = lr.predict(X_val_500)
        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)

        # Log metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

        # Save model
        model_path = f"models/LogisticRegression_500features_{i+1}.pkl"
        with open(model_path, "wb") as f:
            pickle.dump(lr, f)

        # Create a sample input for model signature
        input_example = X_train_500[:1]  # Just use the first training example

        # Log model with signature
        mlflow.sklearn.log_model(
            lr, f"LogisticRegression_500features", input_example=input_example
        )

        mlflow.log_artifact(model_path)
    
    mlflow.end_run()



In [None]:
# Define parameter grid
lr_param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "solver": ["liblinear", "saga"],
    "max_iter": [1000],
    "class_weight": [None, "balanced"],
}

# Define feature set name
feature_set_name = "500features"

# Initialize storage for runs
all_runs = []

# Run with MLflow tracking
with mlflow.start_run():
    # Log basic parameters
    mlflow.log_param("feature_set", feature_set_name)
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("feature_count", X_train.shape[1])
    
    # Log the full parameter grid
    mlflow.log_param("param_grid", str(lr_param_grid))

    # Hyperparameter tuning
    lr_grid = GridSearchCV(
        LogisticRegression(random_state=42),
        lr_param_grid,
        cv=3,
        scoring="f1",
        verbose=1,
    )
    lr_grid.fit(X_train, y_train)

    # Get best model and parameters
    best_lr = lr_grid.best_estimator_
    best_params = lr_grid.best_params_
    
    # Log best parameters from grid search
    mlflow.log_params(best_params)
    mlflow.log_param("best_cv_score", lr_grid.best_score_)
    
    # Set tags for easy filtering in MLflow UI
    mlflow.set_tags({"Model": "LogisticRegression", "Feature Set": feature_set_name})
    
    # Get predictions on validation set
    y_pred = best_lr.predict(X_val)
    
    # Calculate metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred, average='binary')
    recall = recall_score(y_val, y_pred, average='binary')
    f1 = f1_score(y_val, y_pred, average='binary')
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)
    
    # Save model as pickle file
    model_path = f"models/LogisticRegression_{feature_set_name}.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(best_lr, f)
    
    # Log model with MLflow
    mlflow.sklearn.log_model(
        best_lr, 
        f"LogisticRegression_{feature_set_name}", 
        input_example=X_train[:1]
    )
    
    # Log the pickle file as an artifact
    mlflow.log_artifact(model_path)
    
    # Track this run's information
    all_runs.append(
        {
            "run_id": mlflow.active_run().info.run_id,
            "model": "LogisticRegression",
            "feature_set": feature_set_name,
            "params": best_params,
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1_score": f1,
        }
    )
mlflow.end_run()


# Print results
print(f"Best LogisticRegression model (feature set: {feature_set_name}):")
print(f"Best parameters: {best_params}")
print(f"Validation metrics:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

### Algorithm 2: Random Forest with Hyperparameter Tuning

In [None]:
rf_params_500 = [
    {"n_estimators": 100, "max_depth": 10, "random_state": 42},
    {"n_estimators": 200, "max_depth": 15, "random_state": 42},
    {"n_estimators": 300, "max_depth": None, "random_state": 42},
]


for i, params in enumerate(rf_params_500):
    with mlflow.start_run(run_name=f"RandomForest_500_{i+1}"):
        # Create the model with parameters
        rf = RandomForestClassifier(**params)
        
        # Log parameters
        mlflow.set_tags({"Model": "RandomForest", "Feature Set": "500features"})
        mlflow.log_params(params)
        mlflow.log_param("feature_set", "500features")
        mlflow.log_param("feature_count", X_train_500.shape[1])

        # Train model
        rf.fit(X_train_500, y_train)

        # Evaluate on validation data
        y_pred = rf.predict(X_val_500)
        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)

        # Log metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

        # Save model
        model_path = f"models/RandomForest_500features_{i+1}.pkl"
        with open(model_path, "wb") as f:
            pickle.dump(rf, f)

        # Log model with signature
        mlflow.sklearn.log_model(
            rf, f"RandomForest_500features", input_example=X_train_500[:1]
        )

        mlflow.log_artifact(model_path)
        
    mlflow.end_run()



### Algorithm 3: Decision Tree with Hyperparameter Tuning

In [18]:
# Decision Tree with 500 features
dt_params_500 = [
    {"max_depth": 5, "min_samples_split": 2, "random_state": 42},
    {"max_depth": 10, "min_samples_split": 5, "random_state": 42},
    {"max_depth": 15, "min_samples_split": 10, "random_state": 42},
]

for i, params in enumerate(dt_params_500):
    with mlflow.start_run():
        # Create the model with parameters
        dt = DecisionTreeClassifier(**params)
        
        # Log parameters
        mlflow.set_tags({"Model": "DecisionTree", "Feature Set": "500features"})
        mlflow.log_params(params)
        mlflow.log_param("feature_set", "500features")
        mlflow.log_param("feature_count", X_train_500.shape[1])

        # Train model
        dt.fit(X_train_500, y_train)

        # Evaluate on validation data
        y_pred = dt.predict(X_val_500)
        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)

        # Log metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

        # Save model
        model_path = f"models/DecisionTree_500features_{i+1}.pkl"
        with open(model_path, "wb") as f:
            pickle.dump(dt, f)

        # Log model with signature
        mlflow.sklearn.log_model(
            dt, f"DecisionTree_500features", input_example=X_train_500[:1]
        )

        mlflow.log_artifact(model_path)
    mlflow.end_run()




In [20]:
def find_top_models(n=3, experiment_name="amazon-experiment", metric="f1_score"):
    """
    """
    experiment = mlflow.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"Experiment '{experiment_name}' not found.")
    
    # Search all runs in the experiment
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
    
    # Sort by the specified metric
    metric_col = f"metrics.{metric}"
    if metric_col not in runs.columns:
        raise ValueError(f"Metric '{metric}' not found in run data.")
    
    # Find the top n runs
    top_runs = runs.sort_values(metric_col, ascending=False).head(n)
    
    # Prepare the results for display
    results = []
    for i, (idx, run) in enumerate(top_runs.iterrows()):
        model_info = {
            'rank': i + 1,
            'run_id': run.run_id,
            'model_type': run.get('tags.Model', 'Unknown'),
            'feature_set': run.get('params.feature_set', 'Unknown'),
            'metric_value': run[metric_col],
            'accuracy': run.get('metrics.accuracy', 0),
            'precision': run.get('metrics.precision', 0),
            'recall': run.get('metrics.recall', 0),
            'f1_score': run.get('metrics.f1_score', 0)
        }
        
        # Extract hyperparameters
        params = {k.replace('params.', ''): v for k, v in run.items() if k.startswith('params.')}
        model_info['hyperparameters'] = params
        
        results.append(model_info)
    
    return pd.DataFrame(results)

In [32]:
top_models = find_top_models(n=3, experiment_name="amazon-experiment", metric="f1_score")
    

In [33]:
print("\nTop 3 models:")
for i, (_, model) in enumerate(top_models.iterrows()):
    print(f"{model['run_id']}")
    print(f"{i+1}. {model['model_type']} with feature set '{model['feature_set']}'")
    print(f"   F1 Score: {model['f1_score']:.4f}, Accuracy: {model['accuracy']:.4f}")
    print(f"   Key hyperparameters: ", end="")
    hyperparams = model['hyperparameters']
    # Display a few key hyperparameters
    for param in list(hyperparams.keys())[:3]:  # Show first 3 params
        if param not in ['feature_set', 'feature_count']:
            print(f"{param}={hyperparams[param]}", end=", ")
    print("...")


Top 3 models:
050dcc296f1146cdb6a3714865040b96
1. LogisticRegression with feature set '500features'
   F1 Score: 0.8785, Accuracy: 0.8784
   Key hyperparameters: random_state=None, max_depth=None, ...
0ea4cf72cba7488183a269081da261c5
2. LogisticRegression with feature set '500features'
   F1 Score: 0.8783, Accuracy: 0.8782
   Key hyperparameters: random_state=None, max_depth=None, ...
b206a94ace464b9684bc937580247c2c
3. LogisticRegression with feature set '500features'
   F1 Score: 0.8783, Accuracy: 0.8782
   Key hyperparameters: random_state=None, max_depth=None, ...


In [36]:
def feature_selection(model, X_train, y_train, X_val, X_test, y_val, 
                      n_features=1000, tfidf_vectorizer=None, 
                      train_data=None, val_data=None, test_data=None):
    """
    """
    # Check if tfidf_vectorizer is None
    if tfidf_vectorizer is None:
        print("Warning: tfidf_vectorizer is None. Cannot perform feature selection.")
        return X_train, X_val, X_test, None
    
    # Fit the model if it's not already fitted
    if not hasattr(model, 'coef_') and not hasattr(model, 'feature_importances_'):
        try:
            model.fit(X_train, y_train)
        except Exception as e:
            print(f"Error fitting model: {e}")
            return X_train, X_val, X_test, tfidf_vectorizer
    
    # Get feature importances based on model type
    if hasattr(model, 'coef_'):
        importances = np.abs(model.coef_[0])
    elif hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    else:
        print("Model doesn't have coef_ or feature_importances_ attribute. Returning original features.")
        return X_train, X_val, X_test, tfidf_vectorizer
    
    # Safety check for feature names method
    try:
        # For scikit-learn >= 1.0
        feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
    except AttributeError:
        try:
            # For scikit-learn < 1.0
            feature_names = np.array(tfidf_vectorizer.get_feature_names())
        except AttributeError:
            print("Could not get feature names from vectorizer. Returning original features.")
            return X_train, X_val, X_test, tfidf_vectorizer
    
    # Ensure we don't request more features than available
    n_features = min(n_features, len(feature_names))
    
    # Sort features by importance
    sorted_idx = np.argsort(importances)[::-1][:n_features]
    
    # Get the selected feature names
    selected_features = feature_names[sorted_idx]
    
    # Check if we have the original text data
    if train_data is None or val_data is None or test_data is None:
        print("Original text data is required for creating a new vectorizer. Returning original features.")
        return X_train, X_val, X_test, tfidf_vectorizer
    
    # Create a new vectorizer with only selected features
    new_vectorizer = TfidfVectorizer(vocabulary=dict(zip(selected_features, range(len(selected_features)))))
    
    try:
        # Transform all datasets with the new vectorizer
        X_train_selected = new_vectorizer.fit_transform(train_data)
        X_val_selected = new_vectorizer.transform(val_data)
        X_test_selected = new_vectorizer.transform(test_data)
    except Exception as e:
        print(f"Error transforming data with new vectorizer: {e}")
        return X_train, X_val, X_test, tfidf_vectorizer
    
    return X_train_selected, X_val_selected, X_test_selected, new_vectorizer

In [37]:
X_train_500, X_val_500, X_test_500, vectorizer_500 = feature_selection(lr_selector, X_train, y_train, X_val, X_test, y_val, n_features=500)



In [38]:
datasets = {
        "500features": (X_train_500, X_val_500, X_test_500, y_train, y_val, y_test),
}
top_models_df = pd.DataFrame([
        {
            'rank': 1,
            'run_id': '050dcc296f1146cdb6a3714865040b96',
            'model_type': 'LogisticRegression',
            'feature_set': '500features'
        },
        {
            'rank': 2,
            'run_id': '0ea4cf72cba7488183a269081da261c5',
            'model_type': 'LogisticRegression',
            'feature_set': '500features'
        },
        {
            'rank': 3,
            'run_id': 'sb206a94ace464b9684bc937580247c2c',
            'model_type': 'LogisticRegression',
            'feature_set': '500features'
        }
    ])
mlflow.set_experiment('best-model-experiment')
for _, model_info in top_models_df.iterrows():
        run_id = model_info['run_id']
        feature_set = model_info['feature_set']
        model_type = model_info['model_type']
        
        # Get the datasets for this feature set
        if feature_set not in datasets:
            print(f"Warning: No datasets found for feature set {feature_set}")
            continue
            
        X_train, X_val, X_test, y_train, y_val, y_test = datasets[feature_set]
        
        # Create a new run linked to the original
        with mlflow.start_run():
            mlflow.set_tag("parent_run_id", run_id)
            mlflow.set_tag("content", "datasets")
            mlflow.set_tag("model_type", model_type)
            mlflow.set_tag("feature_set", feature_set)
            
            # Log dataset shapes
            mlflow.log_param("X_train_shape", str(X_train.shape))
            mlflow.log_param("X_val_shape", str(X_val.shape))
            mlflow.log_param("X_test_shape", str(X_test.shape))
            mlflow.log_param("y_train_shape", str(y_train.shape))
            mlflow.log_param("y_val_shape", str(y_val.shape))
            mlflow.log_param("y_test_shape", str(y_test.shape))
            
            # Save datasets as pickle files for reproducibility
            dataset_path = f"datasets_{model_type}_{feature_set}.pkl"
            with open(dataset_path, "wb") as f:
                pickle.dump({
                    "X_train_shape": X_train.shape,
                    "X_val_shape": X_val.shape,
                    "X_test_shape": X_test.shape,
                    "y_train_shape": y_train.shape,
                    "y_val_shape": y_val.shape,
                    "y_test_shape": y_test.shape,
                    "X_train_sample": X_train[:5] if X_train.shape[0] >= 5 else X_train,
                    "X_val_sample": X_val[:5] if X_val.shape[0] >= 5 else X_val,
                    "X_test_sample": X_test[:5] if X_test.shape[0] >= 5 else X_test,
                }, f)
            mlflow.log_artifact(dataset_path)
        mlflow.end_run()

2025/03/21 12:11:16 INFO mlflow.tracking.fluent: Experiment with name 'best-model-experiment' does not exist. Creating a new experiment.


<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 129516152 stored elements and shape (2880000, 5000)>

In [55]:
import mlflow
import pandas as pd
import pickle
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

# Set the MLflow experiment
mlflow.set_experiment('best-model-selection')

datasets = {
    "500features": (X_train_500, X_val_500, X_test_500, y_train, y_val, y_test),
}


best_model_info = {
    'run_id': '050dcc296f1146cdb6a3714865040b96',  
    'model_type': 'LogisticRegression',
    'feature_set': '500features',
    'hyperparameters': {"C": 10.0, "solver": "liblinear", "max_iter": 1000} 
}

# Unpack the information
best_run_id = best_model_info['run_id']
model_type = best_model_info['model_type']
feature_set = best_model_info['feature_set']

# Unpack test data
X_train_500, X_val_500, X_test_500, vectorizer_500 = feature_selection(lr_selector, X_train, y_train, X_val, X_test, y_val, n_features=500)

# Create a client for model registry operations
client = mlflow.tracking.MlflowClient()

# Set a name for the registered model - this will be our production model
model_name = f"production-{model_type}-{feature_set}-b"

# Load the model from the run
run_path = f"runs:/{best_run_id}/{model_type}_{feature_set}"
best_model = mlflow.sklearn.load_model(run_path)

# Register the model (create if it doesn't exist)
try:
    client.create_registered_model(model_name)
    print(f"Created new registered model: {model_name}")
except mlflow.exceptions.RestException:
    print(f"Model {model_name} already exists, will create a new version.")

# Register a new version
model_details = mlflow.register_model(
    model_uri=run_path,
    name=model_name
)

print(f"Registered model version: {model_details.version}")

# Transition directly to Production (since this is our final best model)
client.transition_model_version_stage(
    name=model_name,
    version=model_details.version,
    stage="Production",
    archive_existing_versions=True  # Archive any existing production versions
)

print(f"Model {model_name} version {model_details.version} is now in Production stage")

# Create a new run to log final test results
with mlflow.start_run():
    # Set parent run ID for lineage tracking
    mlflow.set_tag("parent_run_id", best_run_id)
    mlflow.set_tag("model_type", model_type)
    mlflow.set_tag("feature_set", feature_set)
    mlflow.set_tag("evaluation", "final_test")
    mlflow.set_tag("stage", "Production")
    
    # Log hyperparameters
    if 'hyperparameters' in best_model_info and best_model_info['hyperparameters']:
        for param_name, param_value in best_model_info['hyperparameters'].items():
            if param_name not in ['feature_set', 'feature_count']:  # Avoid duplicates
                mlflow.log_param(param_name, param_value)
                
    mlflow.log_param("model_name", model_name)
    mlflow.log_param("model_version", model_details.version)
    
    # Get predictions - both class labels and probabilities if available
    y_pred = best_model.predict(X_test_500)

    
    # Calculate comprehensive test metrics
    test_accuracy = accuracy_score(y_test, y_pred)
    test_precision = precision_score(y_test, y_pred, average='binary')
    test_recall = recall_score(y_test, y_pred, average='binary')
    test_f1 = f1_score(y_test, y_pred, average='binary')
    
    
    # Log all test metrics
    mlflow.log_metric("test_accuracy", test_accuracy)
    mlflow.log_metric("test_precision", test_precision)
    mlflow.log_metric("test_recall", test_recall)
    mlflow.log_metric("test_f1_score", test_f1)
    
    # Log test data for reproducibility
    test_data_path = f"final_test_data_{model_type}_{feature_set}.pkl"
    with open(test_data_path, "wb") as f:
        pickle.dump({
            "X_test": X_test_500, 
            "y_test": y_test,
            "test_metrics": {
                "accuracy": test_accuracy,
                "precision": test_precision,
                "recall": test_recall,
                "f1": test_f1
            }
        }, f)
    mlflow.log_artifact(test_data_path)
    
    # Print results
    print("\nFinal Test Evaluation Results:")
    print(f"Accuracy: {test_accuracy:.4f}")
    print(f"Precision: {test_precision:.4f}")
    print(f"Recall: {test_recall:.4f}")
    print(f"F1 Score: {test_f1:.4f}")

    # Log the final model for deployment
    mlflow.sklearn.log_model(
        sk_model=best_model,
        artifact_path="final_model",
        registered_model_name=model_name
    )

    # Return test metrics summary
    test_metrics = {
        "accuracy": test_accuracy,
        "precision": test_precision,
        "recall": test_recall,
        "f1": test_f1
    }

# End the run
mlflow.end_run()

print(f"\nFinal model {model_name} (version {model_details.version}) is now in Production stage and ready for deployment")
print(f"All evaluation metrics and artifacts have been logged to MLflow")

Created new registered model: production-LogisticRegression-500features-b
Registered model version: 1
Model production-LogisticRegression-500features-b version 1 is now in Production stage


Registered model 'production-LogisticRegression-500features-b' already exists. Creating a new version of this model...
Created version '1' of model 'production-LogisticRegression-500features-b'.
  client.transition_model_version_stage(



Final Test Evaluation Results:
Accuracy: 0.4951
Precision: 0.4859
Recall: 0.1694
F1 Score: 0.2512


Registered model 'production-LogisticRegression-500features-b' already exists. Creating a new version of this model...
Created version '2' of model 'production-LogisticRegression-500features-b'.



Final model production-LogisticRegression-500features-b (version 1) is now in Production stage and ready for deployment
All evaluation metrics and artifacts have been logged to MLflow


<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 22283 stored elements and shape (500, 5000)>

In [None]:
# Get the best model (rank 1)
best_model_info = top_models_df.iloc[0]
best_run_id = best_model_info['run_id']
model_type = best_model_info['model_type']
feature_set = best_model_info['feature_set']
    
# Unpack test data
#X_train, X_val, X_test, y_train, y_val, y_test = datasets[feature_set]
    
# Create a client for model registry operations
client = mlflow.tracking.MlflowClient()
    
# Set a name for the registered model
model_name = f"best-{model_type}-{feature_set}"
    
# Load the model from the run
run_path = f"runs:/{best_run_id}/{model_type}_{feature_set}"
best_model = mlflow.sklearn.load_model(run_path)

# Register the model (create if it doesn't exist)
try:
    client.create_registered_model(model_name)
except mlflow.exceptions.RestException:
    print(f"Model {model_name} already exists, will create a new version.")
    
# Register a new version
model_details = mlflow.register_model(
    model_uri=run_path,
    name=model_name
)

# Transition to staging
client.transition_model_version_stage(
    name=model_name,
    version=model_details.version,
    stage="Staging"
)

# Create a new run to log test results
with mlflow.start_run():
    # Set parent run ID for lineage tracking
    mlflow.set_tag("parent_run_id", best_run_id)
    mlflow.set_tag("model_type", model_type)
    mlflow.set_tag("feature_set", feature_set)
    mlflow.set_tag("evaluation", "test")
    
    # Log hyperparameters
    for param_name, param_value in best_model_info['hyperparameters'].items():
        if param_name not in ['feature_set', 'feature_count']:  # Avoid duplicates
            mlflow.log_param(param_name, param_value)
            
    mlflow.log_param("model_name", model_name)
    mlflow.log_param("model_version", model_details.version)
    
    # Predict on test data
    y_pred = best_model.predict(X_test)
    
    # Calculate test metrics
    test_accuracy = accuracy_score(y_test, y_pred)
    test_precision = precision_score(y_test, y_pred, average='binary')
    test_recall = recall_score(y_test, y_pred, average='binary')
    test_f1 = f1_score(y_test, y_pred, average='binary')
    
    # Log test metrics
    mlflow.log_metric("test_accuracy", test_accuracy)
    mlflow.log_metric("test_precision", test_precision)
    mlflow.log_metric("test_recall", test_recall)
    mlflow.log_metric("test_f1_score", test_f1)
            
    # Log test data for reproducibility
    test_data_path = f"test_data_{model_type}_{feature_set}.pkl"
    with open(test_data_path, "wb") as f:
        pickle.dump({"X_test_shape": X_test.shape, "y_test_shape": y_test.shape}, f)
    mlflow.log_artifact(test_data_path)
    
    # Return test metrics
    test_metrics = {
        "accuracy": test_accuracy,
        "precision": test_precision,
        "recall": test_recall,
        "f1": test_f1
    }

mlflow.end_run()     

In [24]:
def stage_best_model(top_models_df, test_data):
    """
    """
    # Get the best model (rank 1)
    best_model_info = top_models_df.iloc[0]
    best_run_id = best_model_info['run_id']
    model_type = best_model_info['model_type']
    feature_set = best_model_info['feature_set']
    
    # Unpack test data
    X_test, y_test = test_data
    
    # Create a client for model registry operations
    client = mlflow.tracking.MlflowClient()
    
    # Set a name for the registered model
    model_name = f"best-{model_type}-{feature_set}"
    
    # Load the model from the run
    run_path = f"runs:/{best_run_id}/{model_type}_{feature_set}"
    best_model = mlflow.sklearn.load_model(run_path)
    
    # Register the model (create if it doesn't exist)
    try:
        client.create_registered_model(model_name)
    except mlflow.exceptions.RestException:
        print(f"Model {model_name} already exists, will create a new version.")
        
    # Register a new version
    model_details = mlflow.register_model(
        model_uri=run_path,
        name=model_name
    )
    
    # Transition to staging
    client.transition_model_version_stage(
        name=model_name,
        version=model_details.version,
        stage="Staging"
    )
    
    # Create a new run to log test results
    with mlflow.start_run():
        # Set parent run ID for lineage tracking
        mlflow.set_tag("parent_run_id", best_run_id)
        mlflow.set_tag("model_type", model_type)
        mlflow.set_tag("feature_set", feature_set)
        mlflow.set_tag("evaluation", "test")
        
        # Log hyperparameters
        for param_name, param_value in best_model_info['hyperparameters'].items():
            if param_name not in ['feature_set', 'feature_count']:  # Avoid duplicates
                mlflow.log_param(param_name, param_value)
                
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("model_version", model_details.version)
        
        # Predict on test data
        y_pred = best_model.predict(X_test)
        
        # Calculate test metrics
        test_accuracy = accuracy_score(y_test, y_pred)
        test_precision = precision_score(y_test, y_pred, average='binary')
        test_recall = recall_score(y_test, y_pred, average='binary')
        test_f1 = f1_score(y_test, y_pred, average='binary')
        
        # Log test metrics
        mlflow.log_metric("test_accuracy", test_accuracy)
        mlflow.log_metric("test_precision", test_precision)
        mlflow.log_metric("test_recall", test_recall)
        mlflow.log_metric("test_f1_score", test_f1)
                
        # Log test data for reproducibility
        test_data_path = f"test_data_{model_type}_{feature_set}.pkl"
        with open(test_data_path, "wb") as f:
            pickle.dump({"X_test_shape": X_test.shape, "y_test_shape": y_test.shape}, f)
        mlflow.log_artifact(test_data_path)
        
        # Return test metrics
        test_metrics = {
            "accuracy": test_accuracy,
            "precision": test_precision,
            "recall": test_recall,
            "f1": test_f1
        }
    mlflow.end_run()
    return best_model, test_metrics, model_name, model_details.version


In [None]:
top_models = pd.DataFrame([{
        'rank': 1,
        'run_id': 'sample_run_id',
        'model_type': 'LogisticRegression',
        'feature_set': '500features',
        'hyperparameters': {'C': 1.0, 'solver': 'liblinear'}
    }])
    
# Get test data for the best model's feature set
best_feature_set = top_models.iloc[0]['feature_set']


'500features'

In [None]:
print("\nStaging best model and evaluating on test set...")
best_feature_set = top_models.iloc[0]['feature_set']
test_data = (datasets[best_feature_set][2], datasets[best_feature_set][5])  # X_test, y_test
    
best_model, test_metrics, model_name, model_version = stage_best_model(top_models, test_data)
    