# **Machine Learning Operations Final Project**
Group Members:
- **Bradley Stoller**
- **Samuel Martinez Koss**
- **Xigang Zhang**
- **Zhiwei Guo**

## **General Notebook Configurations**

In [0]:
RANDOM_SEED = 42

print('General configuration complete')

General configuration complete


## **AWS Configurations**
AWS setup: credentials, S3 buckets, IAM roles, and service clients

In [0]:
# AWS Credentials
AWS_ACCESS_KEY = 'REMOVEDRWIMQ423ZX2EREMOVED'
AWS_SECRET_KEY = 'REMOVED5KRZgS8+oMPZCbNm3/0yBs0BwFlbx1yHREMOVED'

# S3 Configuration
BUCKET_NAME = 'ml-ops-fp'
PREFIX = 'pokemon-classification'

# IAM Role
ROLE = 'REMOVED84002890:role/service-role/AmazonSageMaker-ExecutionRole-20241120TREMOVED'
ROLE_ARN = 'arn:aws:iam::116527261367:role/SageMakerExecutionRole'

# Initialize SageMaker Session and Boto3 Clients
import sagemaker
import boto3

SESSION = sagemaker.Session()
REGION = SESSION.boto_region_name

SM_CLIENT = boto3.client('sagemaker')
S3_CLIENT = boto3.client('s3')

print(f"AWS Configuration loaded")
print(f"- Region: {REGION}")
print(f"- Bucket: {BUCKET_NAME}")
print(f"- Role ARN: {ROLE_ARN}")

sagemaker.config INFO - Fetched defaults config from location: /etc/xdg/sagemaker/config.yaml


sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.Session.DefaultS3Bucket


sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.Session.DefaultS3ObjectKeyPrefix


AWS Configuration loaded
- Region: us-east-2
- Bucket: ml-ops-fp
- Role ARN: arn:aws:iam::116527261367:role/SageMakerExecutionRole
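
Hard-coded keys in a notebook are easy to leak. The following is a minimal sketch of one alternative, assuming the keys are exported as the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper name `read_aws_credentials` is hypothetical and not part of this project.

```python
import os

def read_aws_credentials(env=None):
    """Return (access_key, secret_key) read from environment variables.

    Hypothetical helper: raises if either variable is missing so the
    notebook fails early instead of continuing with empty credentials.
    """
    env = os.environ if env is None else env
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Example with a stand-in mapping (placeholder values, not real keys)
fake_env = {"AWS_ACCESS_KEY_ID": "EXAMPLE_ID", "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET"}
access_key, secret_key = read_aws_credentials(fake_env)
print(access_key)
```

Inside SageMaker Studio, `boto3.client(...)` normally resolves credentials from the execution role on its own, so explicit keys may not be needed at all.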


## **Import Library**

### Install Non-Default SageMaker Studio Packages
- `imbalanced-learn`
- `mlflow`
- `statsmodels`
- `evidently`

In [0]:
# Install import libraries
import subprocess
import sys

# Install imbalanced-learn, mlflow, statsmodels, and evidently
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "imbalanced-learn"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "mlflow"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "statsmodels"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "evidently"])

print("All packages installed")

All packages installed
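
Re-running the install cell calls `pip` four times even when everything is already present. As a rough sketch, the check below installs only what is missing. The names `REQUIRED`, `missing_packages`, and `install_missing` are hypothetical, and unlike the cell above this version does not pass `--upgrade` for `evidently`.

```python
import importlib.util
import subprocess
import sys

# Import name -> pip name (the two differ for imbalanced-learn)
REQUIRED = {
    "imblearn": "imbalanced-learn",
    "mlflow": "mlflow",
    "statsmodels": "statsmodels",
    "evidently": "evidently",
}

def missing_packages(import_to_pip):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in import_to_pip.items()
        if importlib.util.find_spec(import_name) is None
    ]

def install_missing(import_to_pip):
    """Install each missing package with the current interpreter's pip."""
    for pip_name in missing_packages(import_to_pip):
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name])

# Usage (commented out so the sketch has no side effects):
# install_missing(REQUIRED)
```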


### Import all necessary packages

In [0]:
# Core Libraries
import pandas as pd
import numpy as np
import json
import time
import joblib
import tarfile
import warnings
from io import StringIO
from collections import Counter
from itertools import product
import os
from pathlib import Path

# Data Processing & ML
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_curve, auc,
    precision_recall_curve, accuracy_score, precision_score,
    recall_score, f1_score
)

# Imbalanced Learning
from imblearn.over_sampling import SMOTE

# Statistical Analysis
import scipy.stats as stats
from scipy.stats import randint, uniform
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# XGBoost
import xgboost as xgb

# MLflow
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature

# SageMaker
import sagemaker
import boto3
from sagemaker.automl.automl import AutoML
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.xgboost import XGBoostModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer, CSVDeserializer
from sagemaker.model_monitor import DefaultModelMonitor, MonitoringExecution
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Evidently
import base64
from evidently.legacy.test_suite import TestSuite
from evidently.legacy.tests import TestNumberOfDriftedColumns

warnings.filterwarnings('ignore')

print("All libraries imported")

All libraries imported


## **Function Library**

In [0]:
def clean_data(data):
    """Clean the data to be used for modeling. """

    # Copy data to avoid modifying original
    data_copy = data.copy()

    # Drop columns irrelevant to mega evolutions
    data_copy = data_copy.drop(columns=[
        'Name',
        'Number',
        'hasGender',
        'Pr_Male',
    ])

    # Create binary indicators for optional secondary attributes
    data_copy['Has_Type_2'] = data_copy['Type_2'].notna().astype(int)
    data_copy['Has_Egg_Group_2'] = data_copy['Egg_Group_2'].notna().astype(int)

    # Bin Catch_Rate into difficulty categories (higher rate = easier to catch)
    data_copy['Catch_Difficulty'] = pd.cut(
        data_copy['Catch_Rate'],
        bins=3,
        labels=['Hard', 'Medium', 'Easy']
    ).astype(str)  # Convert to string immediately to avoid categorical issues

    # Separate features and target (drop original Catch_Rate, keep binned version)
    features = data_copy.drop(['Catch_Rate', 'hasMegaEvolution'], axis=1)
    target = data_copy['hasMegaEvolution']

    return features, target

def scale_and_encode(X_train, X_test):
    """Scale and encode the data for modeling. """

    # Identify integer columns with more than 10 distinct values and cast them to float
    num_cols = [
        c for c in X_train.select_dtypes('integer').columns
        if len(X_train[c].unique()) > 10
    ]
    X_train = X_train.astype({c: 'float32' for c in num_cols})
    X_test = X_test.astype({c: 'float32' for c in num_cols})

    # Standardize interval features
    num_cols = X_train.select_dtypes(exclude=['bool', 'object', 'string', 'int64']).columns
    for col in num_cols:
        scaler = StandardScaler()
        X_train[col] = scaler.fit_transform(X_train[[col]])
        X_test[col] = scaler.transform(X_test[[col]])

    # Encode categorical features using a label encoder
    cat_cols = X_train.select_dtypes(include=['object', 'bool']).columns
    for col in cat_cols:
        X_train[col] = X_train[col].astype(str).replace('nan', 'missing')
        X_test[col] = X_test[col].astype(str).replace('nan', 'missing')

        le = LabelEncoder()
        X_train[col] = le.fit_transform(X_train[col])

        # Encode categories not seen during training as -1
        test_x_col = []
        for val in X_test[col]:
            if val in le.classes_:
                test_x_col.append(le.transform([val])[0])
            else:
                test_x_col.append(-1)
        X_test[col] = test_x_col

    return X_train, X_test

def prepare_data(data, verbose=False):
    """Clean the data, split the data, and apply SMOTE for modeling. """

    # Remove unnecessary cols, feature engineering
    X, y = clean_data(data)

    # Split data into train and test
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify= y
    )

    # Scale interval features and encode categorical features
    X_train, X_test = scale_and_encode(X_train_raw, X_test_raw)

    # Apply SMOTE to handle class imbalance
    smote = SMOTE(random_state=42)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

    if verbose:
        # Check class distribution before SMOTE
        print("\nClass distribution before SMOTE:")
        print(Counter(y_train))

        # Check class distribution after SMOTE
        print("\nClass distribution after SMOTE:")
        print(Counter(y_train_resampled))

    return X_train_resampled, X_test, y_train_resampled, y_test, X_train, y_train

def log_model_metrics(y_actual, y_pred):
    """Log accuracy, precision, recall, and F1 to the active MLflow run. """

    mlflow.log_metric('accuracy', accuracy_score(y_actual, y_pred))
    mlflow.log_metric('precision', precision_score(y_actual, y_pred))
    mlflow.log_metric('recall', recall_score(y_actual, y_pred))
    mlflow.log_metric('f1', f1_score(y_actual, y_pred))

def confusion_matrix_plot(y_actual, y_pred):
    """Plot a confusion matrix of the results for MLflow. """

    conf_matrix = confusion_matrix(y_actual, y_pred)
    
    fig, ax = plt.subplots(figsize=(8, 6))

    # Use a color palette that clearly differentiates cells (custom or built-in)
    cmap = sns.color_palette("RdYlBu_r", as_cmap=True)

    # Draw heatmap with square cells and no colorbar for cleaner look
    sns.heatmap(
        conf_matrix,
        annot=True,
        fmt='d',
        cmap=cmap,
        xticklabels=['No Mega', 'Has Mega'],
        yticklabels=['No Mega', 'Has Mega'],
        cbar=False,
        square=True,
        linewidths=1,
        linecolor='gray',
        ax=ax
    )

    ax.set_xlabel('Predicted', fontsize=12, fontweight='bold')
    ax.set_ylabel('Actual', fontsize=12, fontweight='bold')
    ax.set_title('Mega Evolution Prediction Confusion Matrix', fontsize=14, pad=20)

    # Add refined annotations inside cells - positions adjusted for better spacing
    # Get bounding box coordinates of each cell for placement reference
    for i in range(2):
        for j in range(2):
            text = ""
            if i == 0 and j == 0:
                text = 'True Negatives\n(Correctly predicted\nno Mega Evolution)'
                xytext = (j + 0.5, i + 0.7)
            elif i == 0 and j == 1:
                text = 'False Positives\n(Incorrectly predicted\nMega Evolution)'
                xytext = (j + 0.5, i + 0.7)
            elif i == 1 and j == 0:
                text = 'False Negatives\n(Missed actual\nMega Evolution)'
                xytext = (j + 0.5, i + 0.7)
            elif i == 1 and j == 1:
                text = 'True Positives\n(Correctly predicted\nMega Evolution)'
                xytext = (j + 0.5, i + 0.7)

            ax.text(
                xytext[0], xytext[1], text,
                ha='center',
                va='center',
                fontsize=9,
                color='black',
                bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.7)
            )

    fig.tight_layout()
    return fig

def roc_curve_plot(y_actual, y_prob):
    """Plot a ROC Curve plot of the results for MLflow. """

    # Compute ROC curve and AUC
    fpr, tpr, _ = roc_curve(y_actual, y_prob)
    roc_auc = auc(fpr, tpr)

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(fpr, tpr, lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
    ax.plot([0, 1], [0, 1], lw=2, linestyle='--')

    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('Receiver Operating Characteristic')

    ax.legend(loc="lower right")
    return fig

def precision_recall_plot(y_actual, y_prob):
    """Plot a precision-recall plot of the results for MLflow. """

    # Compute precision-recall values
    precision, recall, _ = precision_recall_curve(y_actual, y_prob)

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(recall, precision, lw=2)
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_title("Precision-Recall Curve")

    return fig

def feature_importance_plot(model, features):    
    """Plot feature importances for MLflow. """

    importances = model.feature_importances_
    feature_importance_dict = {name: imp for name, imp in zip(features, importances)}
    sorted_features = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

    # Prepare top 10 features
    top_n = min(10, len(sorted_features))
    feature_names_plot = [name for name, _ in sorted_features[:top_n]]
    importance_values = [imp for _, imp in sorted_features[:top_n]]

    fig, ax = plt.subplots(figsize=(12, 8))
    ax.set_title("Feature Importances")

    ax.barh(range(top_n), importance_values, align="center")
    ax.set_yticks(range(top_n))
    ax.set_yticklabels(feature_names_plot)
    ax.invert_yaxis()  # Highest importance at the top
    ax.set_xlabel("Importance")

    fig.tight_layout()
    return fig

def log_model_plots(y_actual, y_pred, y_prob, model, features):
    """Set up logging for plots used by MLflow. """

    fig = confusion_matrix_plot(y_actual, y_pred)
    mlflow.log_figure(fig, 'confusion_matrix.png')
    plt.close(fig)

    fig = roc_curve_plot(y_actual, y_prob)
    mlflow.log_figure(fig, 'roc_curve.png')
    plt.close(fig)

    fig = precision_recall_plot(y_actual, y_prob)
    mlflow.log_figure(fig, 'precision_recall.png')
    plt.close(fig)

    fig = feature_importance_plot(model, features)
    mlflow.log_figure(fig, 'feature_importance.png')
    plt.close(fig)

def mlflow_pipeline(features, target, random_seed= RANDOM_SEED, run_name= 'test_run', experiment_name= 'Default', **model_params):
    """Create the MLflow experiment pipeline. """

    warnings.filterwarnings('ignore')
    np.random.seed(random_seed)
    mlflow.set_tracking_uri('file:./mlruns')
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=run_name):
        gb_model = GradientBoostingClassifier(random_state=random_seed, **model_params)

        gb_model.fit(features, target)
        y_pred = gb_model.predict(features)
        y_prob = gb_model.predict_proba(features)[:, 1]

        mlflow.log_params(model_params)
        log_model_metrics(target, y_pred)
        log_model_plots(target, y_pred, y_prob, gb_model, features.columns)

        signature = infer_signature(features, gb_model.predict(features))
        mlflow.sklearn.log_model(
            gb_model, name=run_name[:5], signature=signature
        )

def sample_params(param_dist, n_samples=5, random_state=None):
    """Draw n_samples random hyperparameter sets from param_dist. """

    rng = np.random.default_rng(random_state)
    samples = []
    for _ in range(n_samples):
        sample = {}
        for param, dist in param_dist.items():
            if hasattr(dist, 'rvs'):
                val = dist.rvs(random_state=rng)
                if dist.dist.name == 'randint':
                    val = int(val)
                sample[param] = val
            else:
                sample[param] = rng.choice(dist)
        samples.append(sample)
    return samples

def upload_mlflow(run_id, experiment_id, bucket_name=BUCKET_NAME, s3_prefix="MLflow/best-run"):
    """Upload the best MLflow run files to S3. """
    
    # Path to the specific run
    local_run_path = f"./mlruns/{experiment_id}/{run_id}"
    
    if not os.path.exists(local_run_path):
        print(f"Run directory not found: {local_run_path}")
        return None
    
    s3_path = f"s3://{bucket_name}/{s3_prefix}/"
    print(f"Uploading best run {run_id} to {s3_path}...")
    
    # Upload all files in the run directory
    file_count = 0
    for root, dirs, files in os.walk(local_run_path):
        for file in files:
            local_file = os.path.join(root, file)
            relative_path = os.path.relpath(local_file, local_run_path)
            s3_key = f"{s3_prefix}/{relative_path}"
            
            S3_CLIENT.upload_file(local_file, bucket_name, s3_key)
            file_count += 1
    
    print(f"Successfully uploaded {file_count} files to {s3_path}")

def detect_model_drift(test_data, y_pred, report_s3_key, predictor, baseline_data, bucket_name, s3_client):
    """Detect model drift by comparing test data predictions against baseline. """

    print("="*60)
    print("MODEL MONITORING: DRIFT DETECTION")
    print("="*60)
    
    # Add predictions to the baseline data using the deployed model
    print("Generating baseline predictions from the deployed model...")
    baseline_predictions = []

    for i in range(len(baseline_data)):
        sample = baseline_data.iloc[i:i+1].values
        pred = predictor.predict(sample)
        pred_value = float(np.array(pred).flatten()[0])
        baseline_predictions.append(1 if pred_value > 0.5 else 0)
    
    baseline_with_predictions = baseline_data.copy()
    baseline_with_predictions['prediction'] = baseline_predictions
    
    current_with_predictions = test_data.copy()
    current_with_predictions['prediction'] = y_pred
    
    # Create Evidently AI test suite for drift detection
    test_suite = TestSuite(tests=[
        TestNumberOfDriftedColumns(lt=3),
    ])
    
    # Run drift detection
    test_suite.run(reference_data=baseline_with_predictions, current_data=current_with_predictions)
    
    # Get results
    results = test_suite.as_dict()
    
    # Print the drift summary
    print("\nDrift Detection Results:")
    print(f"Total features analyzed: {baseline_with_predictions.shape[1]}")
    
    # Extract the test results
    for test in results['tests']:
        print(f"\nTest: {test['name']}")
        print(f"Status: {test['status']}")
        if 'description' in test:
            print(f"Description: {test['description']}")
        if 'parameters' in test:
            print(f"Parameters: {test['parameters']}")
    
    # Save the Drift Report
    s3_client.put_object(
        Bucket=bucket_name,
        Key=report_s3_key,
        Body=test_suite.get_html(),
        ContentType='text/html'
    )
    
    print(f"\nReport saved to s3://{bucket_name}/{report_s3_key}")
    
    return results
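
The unseen-category handling in `scale_and_encode` calls `le.transform` once per test row. A pure-Python sketch of the same idea, using a dictionary lookup instead, is shown below. The function names and the toy values are illustrative only, and the sorted ordering is assumed to mirror how `LabelEncoder` orders its classes.

```python
def fit_label_map(train_values):
    """Map each training category to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}

def encode_with_unseen(values, label_map, unseen_code=-1):
    """Encode values, sending any category absent from training to unseen_code."""
    return [label_map.get(value, unseen_code) for value in values]

# Toy example: 'Dragon' never appears in the training values
train_types = ["Grass", "Fire", "Water", "Fire"]
test_types = ["Fire", "Dragon"]

label_map = fit_label_map(train_types)
print(encode_with_unseen(test_types, label_map))  # [0, -1]
```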

## **1. Choose a dataset that has an outcome (predictive) variable**
For our dataset, we chose a collection of attributes of pokemon characters, with `hasMegaEvolution` as the target variable to predict.

In [0]:
DATA_KEY = 'pokemon-data/pokemon.csv'
S3_PATH = f's3://{BUCKET_NAME}/{DATA_KEY}'
df = pd.read_csv(S3_PATH)
print(f"Dataset Shape: {df.shape}\n")
df.info()
df.head()

Dataset Shape: (721, 23)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 721 entries, 0 to 720
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Number            721 non-null    int64  
 1   Name              721 non-null    object 
 2   Type_1            721 non-null    object 
 3   Type_2            350 non-null    object 
 4   Total             721 non-null    int64  
 5   HP                721 non-null    int64  
 6   Attack            721 non-null    int64  
 7   Defense           721 non-null    int64  
 8   Sp_Atk            721 non-null    int64  
 9   Sp_Def            721 non-null    int64  
 10  Speed             721 non-null    int64  
 11  Generation        721 non-null    int64  
 12  isLegendary       721 non-null    bool   
 13  Color             721 non-null    object 
 14  hasGender         721 non-null    bool   
 15  Pr_Male           644 non-null    float64
 16  Egg_Group_1       

| | Number | Name | Type_1 | Type_2 | Total | HP | Attack | Defense | Sp_Atk | Sp_Def | Speed | Generation | isLegendary | Color | hasGender | Pr_Male | Egg_Group_1 | Egg_Group_2 | hasMegaEvolution | Height_m | Weight_kg | Catch_Rate | Body_Style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | Green | True | 0.875 | Monster | Grass | False | 0.71 | 6.9 | 45 | quadruped |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | Green | True | 0.875 | Monster | Grass | False | 0.99 | 13.0 | 45 | quadruped |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | Green | True | 0.875 | Monster | Grass | True | 2.01 | 100.0 | 45 | quadruped |
| 3 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | Red | True | 0.875 | Monster | Dragon | False | 0.61 | 8.5 | 45 | bipedal_tailed |
| 4 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False | Red | True | 0.875 | Monster | Dragon | False | 1.09 | 19.0 | 45 | bipedal_tailed |


An EDA of the dataset can be found in the `eda.ipynb` file in the GitHub repository. In this notebook, we move forward with the dataset as loaded above.

## **2. Split that into train and test**

In [0]:
X_train, X_test, y_train, y_test, X_train_raw, y_train_raw = prepare_data(df)

print(f"Number of Features: {len(X_train.columns)}")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Number of Features: 20
X_train shape: (1078, 20)
X_test shape: (145, 20)
y_train shape: (1078,)
y_test shape: (145,)


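As we can see, the data has been split with a 20% test size: 145 of the 721 rows are held out for testing. The training set has 1,078 rows rather than the remaining 576 because SMOTE oversampled the minority class. As such, we can move forward with the modeling steps.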
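
The shapes printed above can be reproduced with a little arithmetic. This is a sketch under two assumptions: that scikit-learn rounds a fractional test size up, and that SMOTE with its default settings oversamples the minority class until both classes have the same count.

```python
import math

n_rows = 721        # rows in the dataset
test_size = 0.2

n_test = math.ceil(test_size * n_rows)   # assumed: the test share is rounded up
n_train = n_rows - n_test                # rows left for training before SMOTE

n_train_after_smote = 1078               # X_train rows reported above
n_majority = n_train_after_smote // 2    # assumed: SMOTE balances the two classes
n_minority = n_train - n_majority        # minority rows before oversampling

print(n_test, n_train, n_majority, n_minority)  # 145 576 539 37
```

Under these assumptions, only 37 of the 576 training rows had a mega evolution before oversampling, which is why some form of class balancing is applied.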

## **3. Define a metric to evaluate a machine learning model**
Since our model will be a binary classifier, we will use `accuracy` as the primary metric to evaluate it. After applying SMOTE, the two classes in the training data are balanced, so accuracy weighs both classes evenly. It is also easy to interpret as the share of pokemon whose mega evolution status is predicted correctly, and it is appropriate here because false positives and false negatives carry equal cost in this use case.

In MLflow, however, the `log_model_metrics()` function logs `accuracy`, `precision`, `recall`, and `F1-score` for every run.
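
For reference, the four logged metrics can all be derived from the counts in a binary confusion matrix. The sketch below uses made-up counts purely for illustration; they are not results from this project, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)
```

With a rare positive class, accuracy can stay high even when precision and recall are modest, which is one reason to log all four.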

## **4. Build a pipeline using Airflow or MLflow or your platform pipeline to train a machine learning model using the train dataset (use AutoML to refine the category of algorithms).**
For our workflow, we will first use AWS AutoML to determine the optimal algorithm category, and then MLflow as the pipeline for training our machine learning model.

### 4a. Use AWS AutoML to determine the optimal algorithm category
To use AWS AutoML, we first upload our training data to S3 and then pass it to the AutoML job.

In [0]:
# Combine the training data
train_df = pd.concat([
    pd.DataFrame(X_train, columns= X_train.columns if hasattr(X_train, 'columns') else None),
    pd.Series(y_train, name= 'hasMegaEvolution')
], axis= 1)

# Save it locally and then upload it to S3
train_df.to_csv('pokemon_train.csv', index= False)

S3_CLIENT.upload_file('pokemon_train.csv', BUCKET_NAME, 'automl/input/pokemon_train.csv')

print("Data uploaded to S3")
print(f"- s3://{BUCKET_NAME}/automl/input/pokemon_train.csv")

Data uploaded to S3
- s3://ml-ops-fp/automl/input/pokemon_train.csv


With AWS AutoML, we will try up to seven candidates, allotting only 3 minutes (180 seconds) to each, or 21 minutes (1,260 seconds) in total, to minimize cost. This should still provide enough signal to determine the best algorithm category.

In [0]:
# Set the training data location
TRAIN_S3_PATH = f's3://{BUCKET_NAME}/automl/input/pokemon_train.csv'
OUTPUT_S3_PATH = f's3://{BUCKET_NAME}/automl/output'

# Create an AWS AutoML job
timestamp = time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())  # uses the `time` module imported above
automl = AutoML(
    role= ROLE_ARN,
    target_attribute_name= 'hasMegaEvolution',
    output_path= OUTPUT_S3_PATH,
    base_job_name= 'pokemon-automl',
    sagemaker_session= SESSION,
    problem_type= 'BinaryClassification',
    max_candidates= 7,
    max_runtime_per_training_job_in_seconds= 180,
    total_job_runtime_in_seconds= 1260,
    job_objective= {'MetricName': 'Accuracy'},
    mode= 'HYPERPARAMETER_TUNING'
)

# Start the AutoML job
print("Starting AutoML job...")
automl.fit(
    inputs= TRAIN_S3_PATH,
    wait= False,
    logs= False
)

job_name = automl.current_job_name
print(f"AutoML job started: {job_name}")

Starting AutoML job...


AutoML job started: pokemon--2025-12-08-19-02-49-681


In [0]:
# Check AutoML status
status = automl.describe_auto_ml_job()

print(f"Job Name: {status['AutoMLJobName']}")
print(f"Status: {status['AutoMLJobStatus']}")

Job Name: pokemon--2025-12-08-19-02-49-681
Status: Completed
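
Because the job is started with `wait=False`, the status cell has to be re-run by hand until the job finishes. A polling loop is one way to automate that. The sketch below is an assumption-laden illustration: `wait_for_automl` is a hypothetical helper, the set of terminal statuses is assumed, and a stand-in function replaces `automl.describe_auto_ml_job` so the example runs offline.

```python
import time

# Assumed terminal states for an AutoML job
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_automl(describe_fn, poll_seconds=30, max_polls=100, sleep=time.sleep):
    """Call describe_fn() repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = describe_fn()["AutoMLJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("AutoML job did not finish within the polling budget")

# Stand-in for automl.describe_auto_ml_job
responses = iter(["InProgress", "InProgress", "Completed"])

def fake_describe():
    return {"AutoMLJobStatus": next(responses)}

print(wait_for_automl(fake_describe, sleep=lambda seconds: None))  # Completed
```

In the notebook this would be called as `wait_for_automl(automl.describe_auto_ml_job)`.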


In [0]:
# Get best candidate only
best = automl.best_candidate()

print("="*70)
print("BEST MODEL HYPERPARAMETERS")
print("="*70)

name = best['CandidateName']
metric = best['FinalAutoMLJobObjectiveMetric']['Value']

print(f"\nModel: {name}")
print(f"Accuracy: {metric:.4f} ({metric*100:.2f}%)")
print(f"\n{'='*70}")

# Get training job details
for step in best.get('CandidateSteps', []):
    if step['CandidateStepType'] == 'AWS::SageMaker::TrainingJob':
        training_job_name = step['CandidateStepArn'].split('/')[-1]
        
        try:
            training_job = SM_CLIENT.describe_training_job(
                TrainingJobName= training_job_name
            )
            
            hyperparams = training_job.get('HyperParameters', {})
            
            # Extract algorithm details
            print(f"\nHyperparameters:")
            
            # Look for key parameters that indicate the algorithm
            key_params = [
                'predictor_type', 'algorithm', 'estimator', 
                'max_depth', 'n_estimators', 'learning_rate',
                'booster', 'tree_method', 'model_type', 'eta', 'num_round'
            ]
            
            algorithm_found = False
            for param in key_params:
                if param in hyperparams:
                    print(f"  {param}: {hyperparams[param]}")
                    algorithm_found = True
            
            # If specific params found, print all
            if algorithm_found:
                print(f"\n  All Hyperparameters:")
                for key, value in sorted(hyperparams.items()):
                    if key not in key_params:
                        print(f"    {key}: {value}")
            else:
                # Print everything if we didn't find specific indicators
                for key, value in sorted(hyperparams.items()):
                    print(f"  {key}: {value}")
                    
        except Exception as e:
            print(f"  Error getting training job: {e}")

BEST MODEL HYPERPARAMETERS

Model: pokemon--2025-12-08-19-02-49-62u-003-c9d165ab
Accuracy: 0.9731 (97.31%)


Hyperparameters:
  processor_module: candidate_data_processors.dpp4
  sagemaker_program: candidate_data_processors.trainer
  sagemaker_submit_directory: /opt/ml/input/data/code

Hyperparameters:
  max_depth: 4
  eta: 0.6641367908850111
  num_round: 364

  All Hyperparameters:
    _kfold: 5
    _tuning_objective_metric: validation:accuracy
    alpha: 1.5323566593501716e-06
    colsample_bytree: 0.9597047988623544
    eval_metric: accuracy,f1_binary,auc,balanced_accuracy,precision,recall,logloss
    gamma: 0.00035887647058464487
    lambda: 0.9645996772792648
    min_child_weight: 0.002811123203178802
    objective: binary:logistic
    subsample: 0.7488351445742566


As seen in the AutoML output above, the best candidate's hyperparameters (`eta`, `num_round`, `max_depth`, and the `binary:logistic` objective) are those of a gradient boosting algorithm. Since AutoML points to gradient boosting (XGBoost) as the best algorithm category, we can move on to searching for good gradient boosting hyperparameters with MLflow.
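
The reasoning above (reading the algorithm family off the hyperparameter names) can be sketched as a simple lookup. This is a hypothetical heuristic, not something AutoML provides: `FAMILY_HINTS` and `guess_family` are illustrative names, and the hint sets are a small hand-picked sample rather than a complete list.

```python
# Hypothetical hint sets: parameter names that suggest an algorithm family
FAMILY_HINTS = {
    "xgboost": {"eta", "num_round", "colsample_bytree", "min_child_weight"},
    "linear-learner": {"predictor_type", "l1", "wd"},
}

def guess_family(hyperparams):
    """Return the family whose hint names overlap most with the given keys."""
    keys = set(hyperparams)
    scores = {family: len(hints & keys) for family, hints in FAMILY_HINTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# A few of the values printed in the AutoML output above
best_params = {
    "max_depth": "4",
    "eta": "0.6641367908850111",
    "num_round": "364",
    "objective": "binary:logistic",
}
print(guess_family(best_params))  # xgboost
```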

### 4b. Use MLflow to create a pipeline for training a machine learning model
Since AWS AutoML points to gradient boosting (XGBoost) as the best algorithm category, we will run experiments through an MLflow pipeline to find a good set of hyperparameters. The pipeline trains scikit-learn's `GradientBoostingClassifier`. Note: because this notebook was created in SageMaker for AWS AutoML use, MLflow is run locally with a file-based tracking store (`file:./mlruns`) rather than against a tracking server.

In [0]:
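PARAM_DIST = {
    'n_estimators': [2, 3, 5, 7],
    'learning_rate': uniform(0.01, 0.1),   # uniform(loc, scale): values in [0.01, 0.11]
    'max_depth': randint(2, 4),            # upper bound is exclusive: 2 or 3
    'min_samples_split': randint(5, 20)    # upper bound is exclusive: 5 to 19
}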
random_grid = sample_params(
    PARAM_DIST, 
    n_samples= 10, 
    random_state= RANDOM_SEED
)
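
The ranges these `scipy.stats` distributions actually cover are easy to misread. A quick check of their supports, shown as a small sketch, makes the bounds explicit:

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) spans [loc, loc + scale], not [loc, scale]
low, high = uniform(0.01, 0.1).support()
print(low, high)

# randint(low, high) excludes the upper bound
depth_support = randint(2, 4).support()
split_support = randint(5, 20).support()
print(depth_support, split_support)
```

This is consistent with the sampled values in the training log, where one `learning_rate` slightly exceeds 0.1 and `max_depth` is only ever 2 or 3.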

In [0]:
# Set up MLflow
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment('Pokemon')

# Verify MLflow tracking
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('Pokemon')}\n")

# Run the experiments to find the best model
EXPERIMENT_NAME = 'Pokemon'
for i, params in enumerate(random_grid):
    print(f"[{i+1}/{len(random_grid)}] Training model with params: {params}")
    mlflow_pipeline(
        features= X_train,
        target= y_train,
        random_seed= RANDOM_SEED,
        run_name= f'run_{i + 1}',
        experiment_name= EXPERIMENT_NAME,
        **params
    )

print("\nAll experiments logged to ./mlruns/")

Tracking URI: file:./mlruns
Experiment: <Experiment: artifact_location='file:///home/sagemaker-user/mlruns/504153010705164362', creation_time=1765139956628, experiment_id='504153010705164362', last_update_time=1765139956628, lifecycle_stage='active', name='Pokemon', tags={}>

[1/10] Training model with params: {'n_estimators': 2, 'learning_rate': 0.053887843975205234, 'max_depth': 3, 'min_samples_split': 11}


[2/10] Training model with params: {'n_estimators': 7, 'learning_rate': 0.07973680290593639, 'max_depth': 2, 'min_samples_split': 6}


[3/10] Training model with params: {'n_estimators': 5, 'learning_rate': 0.0861139701990353, 'max_depth': 3, 'min_samples_split': 15}


[4/10] Training model with params: {'n_estimators': 7, 'learning_rate': 0.02281136326755459, 'max_depth': 3, 'min_samples_split': 11}


[5/10] Training model with params: {'n_estimators': 5, 'learning_rate': 0.10267649888486018, 'max_depth': 2, 'min_samples_split': 16}


[6/10] Training model with params: {'n_estimators': 5, 'learning_rate': 0.092276161327083, 'max_depth': 3, 'min_samples_split': 11}


[7/10] Training model with params: {'n_estimators': 3, 'learning_rate': 0.06545847870158349, 'max_depth': 2, 'min_samples_split': 18}


[8/10] Training model with params: {'n_estimators': 2, 'learning_rate': 0.09276311719925821, 'max_depth': 2, 'min_samples_split': 14}


[9/10] Training model with params: {'n_estimators': 2, 'learning_rate': 0.04545259681298684, 'max_depth': 3, 'min_samples_split': 6}


[10/10] Training model with params: {'n_estimators': 7, 'learning_rate': 0.09931211213221977, 'max_depth': 3, 'min_samples_split': 16}



All experiments logged to ./mlruns/


In [0]:
# Get all runs from the MLflow experiments
experiment = mlflow.get_experiment_by_name('Pokemon')
runs = mlflow.search_runs(experiment_ids= [experiment.experiment_id])


# Print the results
print("Experiment Results:")
print(runs[[
    'metrics.accuracy', 
    'params.n_estimators',
    'params.learning_rate',
    'params.max_depth',
    'params.min_samples_split',
]].head(10).to_string())

# Show best run hyperparameters
best_run = runs.sort_values('metrics.accuracy', ascending= False).iloc[0]
print("\nBest Model Parameters:")
print(best_run)

Experiment Results:
   metrics.accuracy params.n_estimators  params.learning_rate params.max_depth params.min_samples_split
0          0.933210                   7   0.09931211213221977                3                       16
1          0.912801                   2   0.04545259681298684                3                        6
2          0.846939                   2   0.09276311719925821                2                       14
3          0.857143                   3   0.06545847870158349                2                       18
4          0.916512                   5     0.092276161327083                3                       11
5          0.880334                   5   0.10267649888486018                2                       16
6          0.895176                   7   0.02281136326755459                3                       11
7          0.916512                   5    0.0861139701990353                3                       15
8          0.879406                   7   0.

In [0]:
# Upload best MLflow run artifacts to S3
upload_mlflow(best_run['run_id'], experiment.experiment_id)

Uploading best run 72bc39cf86b14d86964e6c0ab6a32c64 to s3://ml-ops-fp/MLflow/best-run/...


Successfully uploaded 18 files to s3://ml-ops-fp/MLflow/best-run/
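
The key-building step inside `upload_mlflow` can be tested without touching S3. The sketch below isolates it in a hypothetical helper, `build_s3_keys`, and also normalizes path separators, on the assumption that the function might one day run on an operating system whose separator is not `/`.

```python
import os
import tempfile

def build_s3_keys(local_dir, s3_prefix):
    """Return {local_file: s3_key} for every file under local_dir."""
    keys = {}
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_file = os.path.join(root, name)
            relative = os.path.relpath(local_file, local_dir)
            # S3 keys use forward slashes regardless of the local OS
            keys[local_file] = f"{s3_prefix}/{relative.replace(os.sep, '/')}"
    return keys

# Build a tiny fake run directory and list the keys it would produce
with tempfile.TemporaryDirectory() as run_dir:
    os.makedirs(os.path.join(run_dir, "metrics"))
    with open(os.path.join(run_dir, "metrics", "accuracy"), "w") as handle:
        handle.write("0.93\n")
    print(sorted(build_s3_keys(run_dir, "MLflow/best-run").values()))
    # ['MLflow/best-run/metrics/accuracy']
```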


As seen above, the best model is a gradient boosting model with 7 estimators, a learning rate of about 0.099, a max depth of 3, and a minimum of 16 samples to split. In addition to recording the parameter sets tested and logging performance metrics, we also generated diagnostic plots for each experiment that are relevant to our classification task.

Below, we can see these diagnostic plots for our best run. Specifically, we can begin by looking at the confusion matrix:

![confusion matrix](https://raw.githubusercontent.com/bhstoller/ml-ops-fp/main/mlflow-images/confusion_matrix.png)

As seen in the confusion matrix, we have high numbers of true positives and true negatives, the counts of which represent the SMOTE-augmented data used for model training. Overall, the model is clearly classifying the majority of characters correctly. Next, we can examine the feature importances:

![feature importances](mlflow-images/feature_importance.png)

As seen in the feature importance plots, predictions are most impacted by the special defense statistic (`Sp_Def`) of certain pokemon characters. `Generation`, `Attack`, `Height_m`, and `Type_2` are the next most important features. After these features, the remaining importances are relatively minimal. Next we can examine the precision-recall curve:

![precision recall](mlflow-images/precision_recall.png)

As seen in the precision-recall curve, the model performs well on both precision and recall, with a good balance between the two. This indicates that the model's predictions are not skewed towards one class (e.g., always predicting that a character has a mega evolution). Lastly, we can examine the ROC curve:

![roc curve](mlflow-images/roc_curve.png)

As seen in the ROC curve, the area under the curve of 0.98 is quite high. This reflects the ability of the model to distinguish between pokemon which have and do not have mega evolutions.
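To make the AUC figure more concrete, here is a minimal sketch of one way to read it: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name and the toy labels and scores are made up for illustration; they are not the project's data or the code that produced the plot.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (illustrative only, not from the pokemon dataset)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"Toy AUC: {toy_auc:.2f}")
```

Under this reading, an AUC of 0.98 would mean that in roughly 98% of such pairs the pokemon with a mega evolution is scored higher than the one without.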

Overall, we can be very confident that this model is highly effective (headlined by the high accuracy of 93%), and can move forward with deploying it for inference.

### 5a. Convert the best MLflow model to a native XG-Boost model
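To deploy an XG-Boost model via AWS, the model must be a native XG-Boost model. Thus, we will reuse the hyperparameters from the best MLflow run above to train a native XG-Boost model via the `xgboost` library. Note that `min_samples_split` has no exact XG-Boost counterpart, so it is mapped to `min_child_weight` as the closest analogue.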

In [0]:
best_run = runs.sort_values('metrics.accuracy', ascending= False).iloc[0]

# Extract hyperparameters
best_params = {
    'n_estimators': int(best_run['params.n_estimators']),
    'learning_rate': float(best_run['params.learning_rate']),
    'max_depth': int(best_run['params.max_depth']),
    'min_samples_split': int(best_run['params.min_samples_split'])
}

# Map sklearn params to XGBoost params
# (min_child_weight is only an approximate analogue of min_samples_split)
xgb_params = {
    'n_estimators': best_params['n_estimators'],
    'learning_rate': best_params['learning_rate'],
    'max_depth': best_params['max_depth'],
    'min_child_weight': best_params['min_samples_split'],
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'use_label_encoder': False,
    'random_state': RANDOM_SEED
}

print("Training XG-Boost model with the optimal hyperparameters...")

# Train XGBoost model
xgb_model = xgb.XGBClassifier(**xgb_params)
xgb_model.fit(X_train, y_train)

# Evaluate
y_pred = xgb_model.predict(X_test)
xgb_accuracy = accuracy_score(y_test, y_pred)
xgb_f1 = f1_score(y_test, y_pred)

print(f"\nXG-Boost model trained")
print(f"- Accuracy: {xgb_accuracy:.4f}")

Training XG-Boost model with the optimal hyperparameters...

XG-Boost model trained
- Accuracy: 0.8828


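As we can see here, the native XG-Boost model reaches an accuracy of 88.3%, somewhat below the 93.3% logged for the best MLflow run. Some gap is not surprising: although we reuse the same hyperparameter values, XG-Boost is a different implementation from the model tracked in MLflow, and `min_child_weight` is only an approximate substitute for `min_samples_split`. Now that we have our native XG-Boost model, we can deploy it for inference via AWS.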

### 5b. Deploy the native XG-Boost model to AWS for inference

In [0]:
# Set S3 configurations for model saving
MODEL_KEY = 'model/model.tar.gz'
MODEL_S3_PATH = f's3://{BUCKET_NAME}/{MODEL_KEY}'
ENDPOINT_NAME = 'pokemon-model'

# Save the model
xgb_model.save_model('xgboost-model')

# Convert to tar.gz format (SageMaker requirement)
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('xgboost-model')
print("Model packaged as model.tar.gz")

# Upload to S3
S3_CLIENT.upload_file(
    Filename='model.tar.gz',
    Bucket=BUCKET_NAME,
    Key= MODEL_KEY
)
print(f"Model uploaded to: {MODEL_S3_PATH}")

xgb_model_sm = XGBoostModel(
    model_data= MODEL_S3_PATH,
    role= ROLE_ARN,
    framework_version= '1.7-1',
    sagemaker_session= SESSION
)

print("\nDeploying model to SageMaker endpoint...")

# Deploy to endpoint
try:
    predictor = xgb_model_sm.deploy(
        initial_instance_count= 1,
        instance_type= 'ml.m5.large',
        endpoint_name= ENDPOINT_NAME
    )
    print("\nModel successfully deployed")
    print(f"- Endpoint name: {ENDPOINT_NAME}")
    print(f"- Model location: {MODEL_S3_PATH}")
    print(f"- Instance type: ml.m5.large")

except Exception as e:
    if 'Cannot create already existing endpoint' in str(e) or 'already exists' in str(e):
        print(f"\nModel already deployed")
        print(f"- Endpoint name: {ENDPOINT_NAME}")
        print(f"- Model location: {MODEL_S3_PATH}")
        print(f"- Instance type: ml.m5.large")
    else:
        raise

Model packaged as model.tar.gz
Model uploaded to: s3://ml-ops-fp/model/model.tar.gz



Deploying model to SageMaker endpoint...



Model already deployed
- Endpoint name: pokemon-model
- Model location: s3://ml-ops-fp/model/model.tar.gz
- Instance type: ml.m5.large


Now, by using `predictor`, we can run inference against the model just as we would if it were defined in the notebook, even though the model is actually deployed in the cloud behind an AWS endpoint (specifically `pokemon-model`). We can now move forward with model monitoring.
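As a sanity check on the packaging step in the cell above, the archive layout can be inspected before it is uploaded. The sketch below assumes, as that cell implies, that the serving container looks for a file named `xgboost-model` at the root of `model.tar.gz`. The helper names are hypothetical, and the file written here is a placeholder rather than a real model.

```python
import os
import tarfile
import tempfile


def package_model(model_path, archive_path):
    """Write model_path into a gzip tarball, placed at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_path, arcname=os.path.basename(model_path))


def archive_members(archive_path):
    """Return the member names stored in the tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo with a placeholder file standing in for the saved model
with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "xgboost-model")
    with open(model_file, "wb") as fh:
        fh.write(b"placeholder bytes, not a real model")

    archive = os.path.join(tmp, "model.tar.gz")
    package_model(model_file, archive)
    members = archive_members(archive)

print(f"Archive members: {members}")
```

Passing `arcname` explicitly keeps the file at the archive root even when the model was saved under a different directory.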

## **6. Set up model monitoring (if there is a monitoring dashboard show that)**
While trying to set up model monitoring via AWS, we encountered a known limitation involving data type errors. As such, we received express permission from Professor Bose to use Evidently AI for model monitoring and to sync the drift reports to S3 to stay within our AWS ecosystem.

For the baseline in our data input drift detection, we will use the training data. Importantly, this is the training data *before* SMOTE was applied (`X_train_raw`), so the baseline reflects the original feature distributions rather than synthetic samples.

In [0]:
# Prepare baseline data for drift comparison
baseline_data = X_train_raw.copy()

print(f"Baseline data prepared:")
print(f"  - Shape: {baseline_data.shape}")
print(f"  - Features: {baseline_data.shape[1]}")
print(f"  - Samples: {baseline_data.shape[0]}")

print("\nMonitoring configuration complete")
print("  - Monitoring tool: Evidently AI")
print("  - Baseline: Original training dataset (pre-SMOTE)")
print("  - Drift detection enabled: True")

Baseline data prepared:
  - Shape: (576, 20)
  - Features: 20
  - Samples: 576

Monitoring configuration complete
  - Monitoring tool: Evidently AI
  - Baseline: Original training dataset (pre-SMOTE)
  - Drift detection enabled: True


As we can see, we have now set up model monitoring via Evidently AI and AWS, and can proceed to run inference on the `X_test` data and monitor drift.

## **7. Use the test data with the deployed model and validate the results (metric) and model monitoring**
First, we will inference the deployed model via the AWS endpoint. From this, we will obtain the predictions and accuracy (evaluation metric).

In [0]:
# Connect to the model endpoint
predictor = Predictor(
    endpoint_name= ENDPOINT_NAME,
    sagemaker_session= SESSION,
    serializer= CSVSerializer(),
    deserializer= CSVDeserializer()
)

print("Testing model using test (X_test) data...")

# Get predictions
y_pred_original = []
for i in range(len(X_test)):
    sample = X_test.iloc[i:i+1].values  # Get sample values
    pred = predictor.predict(sample)  # Get predictions from deployed model
    pred_value = float(np.array(pred).flatten()[0])  # Flatten and convert to float
    y_pred_original.append(1 if pred_value > 0.5 else 0)  # Convert to binary classification

accuracy_original = accuracy_score(y_test, y_pred_original)

print(f"\nResults:")
print(f"- Accuracy: {accuracy_original:.3f}")

Testing model using test (X_test) data...



Results:
- Accuracy: 0.883


As seen above, we were able to successfully inference the deployed model. The endpoint's accuracy on `X_test` (88.3%) matches the accuracy the same model achieved on `X_test` locally in the notebook (88.28%), which confirms that the deployed model behaves the same way as the local one. Now, we can monitor these results in context with Evidently AI:

In [0]:
REPORT_TYPE = 'original'
REPORT_S3_PATH = f"reports/monitoring_report_{REPORT_TYPE}.html"

# Monitor Model Predictions: X_test
results_original = detect_model_drift(
    test_data= X_test,
    y_pred= y_pred_original,
    report_s3_key= REPORT_S3_PATH,
    predictor= predictor,
    baseline_data= baseline_data,
    bucket_name= BUCKET_NAME,
    s3_client= S3_CLIENT
)

MODEL MONITORING: DRIFT DETECTION
Generating baseline predictions from the deployed model...



Drift Detection Results:
Total features analyzed: 21

Test: Number of Drifted Features
Status: SUCCESS
Description: The drift is detected for 2 out of 21 features. The test threshold is lt=3.
Parameters: {'condition': {'lt': 3}, 'features': {'prediction': {'stattest': 'Z-test p_value', 'score': 0.163, 'threshold': 0.05, 'detected': False}, 'Attack': {'stattest': 'K-S p_value', 'score': 0.466, 'threshold': 0.05, 'detected': False}, 'Body_Style': {'stattest': 'K-S p_value', 'score': 0.858, 'threshold': 0.05, 'detected': False}, 'Catch_Difficulty': {'stattest': 'chi-square p_value', 'score': 0.184, 'threshold': 0.05, 'detected': False}, 'Color': {'stattest': 'K-S p_value', 'score': 0.549, 'threshold': 0.05, 'detected': False}, 'Defense': {'stattest': 'K-S p_value', 'score': 0.056, 'threshold': 0.05, 'detected': False}, 'Egg_Group_1': {'stattest': 'K-S p_value', 'score': 0.999, 'threshold': 0.05, 'detected': False}, 'Egg_Group_2': {'stattest': 'K-S p_value', 'score': 0.545, 'threshold': 0

We can see the drift report findings above, and via the drift report that was uploaded to S3:
![X_test Evidently Report](evidently-images/x_test_report.png)

From the report, we see that there is no significant drift detected when inferencing `X_test`. While two columns (`Type_2` and `Has_Type_2`) are flagged as drifted, this does not reflect actual drift in the data; rather, it is a stratification issue between the train and test data. Specifically, the feature `Type_2` has a large number of possible values, so the balance of classes likely differs due to randomization in the creation of the train and test sets.

However, these two columns alone do not exceed the overall threshold for input data drift (fewer than three drifted features), and the prediction p-value in the output above (0.163) is well above the 0.05 threshold, so no drift is detected in the predictions. This is expected, since `X_test` should not be materially different from `X_train`: both come from the same original data.
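The pass/fail rule reported in the output above can be sketched in a few lines. This illustrates the rule as it appears in the log (a feature counts as drifted when its p-value falls below its threshold, and the test succeeds while fewer than three features drift); it is not Evidently's actual implementation. The function name and the two `placeholder_*` entries are hypothetical, while the other scores are copied from the output above.

```python
def summarize_drift(features, max_drifted=3):
    """Return the drifted feature names and the overall test status."""
    drifted = sorted(
        name
        for name, result in features.items()
        if result["score"] < result["threshold"]  # p-value below threshold
    )
    status = "SUCCESS" if len(drifted) < max_drifted else "FAIL"
    return drifted, status


feature_results = {
    # Scores taken from the drift output above
    "prediction": {"score": 0.163, "threshold": 0.05},
    "Attack": {"score": 0.466, "threshold": 0.05},
    "Defense": {"score": 0.056, "threshold": 0.05},
    # Hypothetical entries standing in for two drifted columns
    "placeholder_a": {"score": 0.01, "threshold": 0.05},
    "placeholder_b": {"score": 0.02, "threshold": 0.05},
}

drifted, status = summarize_drift(feature_results)
print(f"Drifted features: {drifted} -> {status}")
```

Note that `Defense` (p-value 0.056) sits just above the 0.05 cutoff, so it is close to being flagged even on unmodified data.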

Now we can change some of the columns in X_test and see if we observe drift from those changes.

## **8. Change at least 2 feature values of the test dataset (you can put in random values or swap 2 features)**
To change the data, we made four main modifications:
1. Swapping column 0 and column 1
2. Swapping column 2 and column 7
3. Randomizing column 2
4. Randomizing column 4

In [0]:
print("Modifying test data...")
X_test_modified = X_test.copy()  # Create a copy for modification

# Change 1
print("Change 1: Swap column 0 and column 1")
X_test_modified.iloc[:, [0, 1]] = X_test_modified.iloc[:, [1, 0]].values

# Change 2
print("Change 2: Swap column 2 and column 7")
X_test_modified.iloc[:, [2, 7]] = X_test_modified.iloc[:, [7, 2]].values

# Change 3
print("Change 3: Randomizing column 2")
X_test_modified.iloc[:, 2] = np.random.randint(50, 150, size=len(X_test_modified))

# Change 4
print("Change 4: Randomizing column 4")
X_test_modified.iloc[:, 4] = np.random.randint(50, 150, size=len(X_test_modified))

print("Test data modified successfully")
print(f"\nBefore and After Comparison: Row 0:")
print(f"Original: {X_test.iloc[0, :10].values}")
print(f"\nModified: {X_test_modified.iloc[0, :10].values}")

Modifying test data...
Change 1: Swap column 0 and column 1
Change 2: Swap column 2 and column 7
Change 3: Randomizing column 2
Change 4: Randomizing column 4
Test data modified successfully

Before and After Comparison: Row 0:
Original: [ 6.         18.         -1.57698178 -1.10222852 -1.2190696  -1.06975269
  0.02599269 -1.12151659 -1.71020758  2.        ]

Modified: [ 1.80000000e+01  6.00000000e+00  1.01000000e+02 -1.10222852e+00
  5.20000000e+01 -1.06975269e+00  2.59926897e-02 -1.57698178e+00
 -1.71020758e+00  2.00000000e+00]


Now that we have modified `X_test`, we can send `X_test_modified` to the deployed model and check whether drift is detected, which it should be since we changed the data.
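The prediction loops in this notebook send one row per request. A possible alternative, sketched below, is to group rows into multi-row CSV payloads and threshold the returned scores afterwards. This assumes the endpoint accepts several comma-separated rows per request and returns one score per row, which has not been verified against the deployed endpoint. The helper names and sample values are hypothetical.

```python
def to_csv_batches(rows, batch_size=50):
    """Yield CSV payloads (no header) holding up to batch_size rows each."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield "\n".join(",".join(str(value) for value in row) for row in chunk)


def to_labels(scores, threshold=0.5):
    """Convert scores (numbers or numeric strings) to 0/1 labels."""
    return [1 if float(score) > threshold else 0 for score in scores]


# Illustrative rows, not actual test data
sample_rows = [[6.0, 18.0, -1.58], [1.0, 2.0, 0.5], [3.0, 4.0, 0.1]]
payloads = list(to_csv_batches(sample_rows, batch_size=2))

# Illustrative scores in the string form a CSV deserializer might return
labels = to_labels(["0.91", "0.12", "0.5"])

print(payloads)
print(labels)
```

The threshold comparison mirrors the `pred_value > 0.5` rule used in the notebook, so a score of exactly 0.5 maps to class 0.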

## **9. Use the "changed" test data with the deployed model and validate the results (metric) and verify observation with model monitoring.**

In [0]:
print("Testing model using modified (X_test_modified) data...")

# Get predictions
y_pred_modified = []
for i in range(len(X_test_modified)):
    sample = X_test_modified.iloc[i:i+1].values  # Get sample values
    pred = predictor.predict(sample)  # Get predictions from deployed model
    pred_value = float(np.array(pred).flatten()[0])  # Flatten and convert to float
    y_pred_modified.append(1 if pred_value > 0.5 else 0)  # Convert to binary classification

accuracy_modified = accuracy_score(y_test, y_pred_modified)

print(f"\nResults:")
print(f"- Accuracy: {accuracy_modified:.3f}")

Testing model using modified (X_test_modified) data...



Results:
- Accuracy: 0.738


As seen above, we were able to successfully inference the deployed model with `X_test_modified`, but this time with a much lower accuracy than with `X_test`. Specifically, after changing the data, the accuracy is 73.8%, down from 88.3% on the unmodified test data. This is expected since we made significant changes to the data. Now, we can monitor these results in context with Evidently AI:

In [0]:
REPORT_TYPE = 'modified'
REPORT_S3_PATH = f"reports/monitoring_report_{REPORT_TYPE}.html"

# Monitor Model Predictions: X_test_modified
results_modified = detect_model_drift(
    test_data= X_test_modified,
    y_pred= y_pred_modified,
    report_s3_key= REPORT_S3_PATH,
    predictor= predictor,
    baseline_data= baseline_data,
    bucket_name= BUCKET_NAME,
    s3_client= S3_CLIENT
)

MODEL MONITORING: DRIFT DETECTION
Generating baseline predictions from the deployed model...



Drift Detection Results:
Total features analyzed: 21

Test: Number of Drifted Features
Status: FAIL
Description: The drift is detected for 7 out of 21 features. The test threshold is lt=3.
Parameters: {'condition': {'lt': 3}, 'features': {'prediction': {'stattest': 'Z-test p_value', 'score': 0.0, 'threshold': 0.05, 'detected': True}, 'Attack': {'stattest': 'K-S p_value', 'score': 0.0, 'threshold': 0.05, 'detected': True}, 'Body_Style': {'stattest': 'K-S p_value', 'score': 0.858, 'threshold': 0.05, 'detected': False}, 'Catch_Difficulty': {'stattest': 'chi-square p_value', 'score': 0.184, 'threshold': 0.05, 'detected': False}, 'Color': {'stattest': 'K-S p_value', 'score': 0.549, 'threshold': 0.05, 'detected': False}, 'Defense': {'stattest': 'K-S p_value', 'score': 0.056, 'threshold': 0.05, 'detected': False}, 'Egg_Group_1': {'stattest': 'K-S p_value', 'score': 0.999, 'threshold': 0.05, 'detected': False}, 'Egg_Group_2': {'stattest': 'K-S p_value', 'score': 0.545, 'threshold': 0.05, 'det

We can see the drift report findings above, and via the drift report that was uploaded to S3:
![X_test_modified Evidently Report](evidently-images/x_test_modified_report.png)

From the report, we now see that there is significant drift detected when inferencing `X_test_modified`, for seven columns (five new columns and the original two that were detected). Specifically, we see that the drift is detected in the exact columns that we swapped and randomized. Furthermore, the prediction drift p-value is now 0.0, confirming that not only the input data but also the predictions themselves are significantly different from the baseline.
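To illustrate why the randomized columns are flagged so decisively, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions. It computes only the statistic, not the p-value that Evidently reports, and it uses synthetic samples: standard-normal values stand in for a standardized feature, and integers from 50 to 149 mimic Changes 3 and 4. The function name and sample sizes are assumptions.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


rng = random.Random(42)
baseline = [rng.gauss(0, 1) for _ in range(500)]        # stand-in for a standardized feature
unchanged = [rng.gauss(0, 1) for _ in range(200)]       # same distribution as the baseline
randomized = [rng.randint(50, 149) for _ in range(200)]  # mimics np.random.randint(50, 150)

ks_unchanged = ks_statistic(baseline, unchanged)
ks_randomized = ks_statistic(baseline, randomized)
print(f"K-S statistic, unchanged:  {ks_unchanged:.3f}")
print(f"K-S statistic, randomized: {ks_randomized:.3f}")
```

Because the randomized values do not overlap with the standardized range at all, the two distribution functions are maximally separated, which is consistent with the p-value of 0.0 reported for `Attack`.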


While these changes were obviously intentional, they underscore the importance of setting up proper model monitoring so that these types of issues can be detected as populations change over time and predictions are made long after the model was created.