# 🤖 Notebook 3: Model Training & Comparison

**Author:** Amey Talkatkar | **Course:** MLOps with Agentic AI

## 🎯 Learning Objectives
- Train multiple models (Linear Regression, Random Forest, XGBoost)
- Compare model performance systematically
- Understand evaluation metrics (RMSE, MAE, R²)
- Select best model based on business needs
- Save models for production

## 🔥 The Problem
A data scientist (DS) trained only one model:
- Chose Linear Regression (simple)
- Never compared with other algorithms
- Never tuned hyperparameters
- Deployed to production
- Boss: "Why is accuracy only 70%? Could we do better?"
- DS: "I don't know... let me try other models"

**Solution: Always compare multiple models!**

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import time
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported")

## Step 1: Load Processed Data

In [None]:
X_train = pd.read_csv('../data/processed/X_train.csv')
X_test = pd.read_csv('../data/processed/X_test.csv')
y_train = pd.read_csv('../data/processed/y_train.csv').squeeze()
y_test = pd.read_csv('../data/processed/y_test.csv').squeeze()

print("✅ Data loaded")
print(f"   Train: {len(X_train):,} samples, {len(X_train.columns)} features")
print(f"   Test:  {len(X_test):,} samples")

## Step 2: Define Evaluation Function

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """Evaluate model and return metrics"""
    y_pred = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
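    # MAPE is undefined where the actual value is 0, so those rows are excluded
    actual = np.asarray(y_test, dtype=float)
    nonzero = actual != 0
    mape = np.mean(np.abs((actual[nonzero] - y_pred[nonzero]) / actual[nonzero])) * 100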
    
    return {
        'model': model_name,
        'rmse': rmse,
        'mae': mae,
        'r2': r2,
        'mape': mape,
        'predictions': y_pred
    }
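
To make the four metrics concrete, here is a small hand-checkable sketch. The four actual and predicted values are made up for illustration and do not come from the sales dataset; the variable names are likewise only for this example.

```python
import numpy as np

# Illustrative values only (not from the sales dataset)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 380.0])

errors = y_true - y_hat                           # [-10, 10, -10, 20]

rmse = np.sqrt(np.mean(errors ** 2))              # squares errors, so large misses count more
mae = np.mean(np.abs(errors))                     # average absolute miss
ss_res = np.sum(errors ** 2)                      # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1 - ss_res / ss_tot                          # 1.0 is perfect, 0.0 matches "always predict the mean"
mape = np.mean(np.abs(errors / y_true)) * 100     # average miss as a percentage of the actual value

print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R2: {r2:.3f}  MAPE: {mape:.2f}%")
```

Here RMSE (about 13.23) is larger than MAE (12.50) because the single error of 20 is weighted more heavily once it is squared. That gap is a quick signal of how much a few large misses dominate.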

## Step 3: Train Linear Regression (Baseline)

In [None]:
print("🔹 Training Linear Regression...")
start_time = time.time()

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

train_time = time.time() - start_time
lr_results = evaluate_model(lr_model, X_test, y_test, 'Linear Regression')
lr_results['train_time'] = train_time

print(f"✅ Linear Regression trained in {train_time:.2f}s")
print(f"   RMSE: ${lr_results['rmse']:,.2f}")
print(f"   MAE:  ${lr_results['mae']:,.2f}")
print(f"   R²:   {lr_results['r2']:.4f}")
print(f"   MAPE: {lr_results['mape']:.2f}%")
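
As a rough sketch of what fitting a linear regression amounts to, ordinary least squares can be solved directly with NumPy. The data below is synthetic and the names (`X`, `X_design`, `w`) are hypothetical; this illustrates the idea rather than how scikit-learn implements it internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): y = 3*x1 - 2*x2 + 5 + small noise
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is learned as an extra weight
X_design = np.column_stack([X, np.ones(len(X))])

# Least-squares solution minimizing ||X_design @ w - y||^2
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coef, intercept = w[:2], w[2]

print("coefficients:", np.round(coef, 2))
print("intercept:", round(float(intercept), 2))
```

The recovered weights should land close to the values used to generate the data. Because the model is a single weighted sum, it trains quickly but cannot represent interactions or curves unless those are engineered into the features.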

## Step 4: Train Random Forest

In [None]:
print("🌲 Training Random Forest...")
start_time = time.time()

rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)
rf_model.fit(X_train, y_train)

train_time = time.time() - start_time
rf_results = evaluate_model(rf_model, X_test, y_test, 'Random Forest')
rf_results['train_time'] = train_time

print(f"✅ Random Forest trained in {train_time:.2f}s")
print(f"   RMSE: ${rf_results['rmse']:,.2f}")
print(f"   MAE:  ${rf_results['mae']:,.2f}")
print(f"   R²:   {rf_results['r2']:.4f}")
print(f"   MAPE: {rf_results['mape']:.2f}%")
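
The hyperparameters above are fixed by hand. One possible way to tune them, sketched here under assumptions, is a cross-validated grid search on the training data so the test set stays untouched. The synthetic data, the small grid, and the names `X_demo`, `y_demo`, and `search` are all stand-ins chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption, not the course data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Small, hypothetical grid; widen it for real tuning
param_grid = {
    "max_depth": [3, 10],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, so RMSE is negated
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.2f}")
```

Selecting hyperparameters by their score on `X_test` would make the final test metrics optimistic, which is why the search uses cross-validation folds instead.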

## Step 5: Train XGBoost

In [None]:
print("⚡ Training XGBoost...")
start_time = time.time()

xgb_model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    verbosity=0
)
xgb_model.fit(X_train, y_train)

train_time = time.time() - start_time
xgb_results = evaluate_model(xgb_model, X_test, y_test, 'XGBoost')
xgb_results['train_time'] = train_time

print(f"✅ XGBoost trained in {train_time:.2f}s")
print(f"   RMSE: ${xgb_results['rmse']:,.2f}")
print(f"   MAE:  ${xgb_results['mae']:,.2f}")
print(f"   R²:   {xgb_results['r2']:.4f}")
print(f"   MAPE: {xgb_results['mape']:.2f}%")
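
To build intuition for `n_estimators` and `learning_rate`, here is a teaching sketch of boosting for squared error using one-split trees (stumps) on a single synthetic feature. It is not XGBoost's actual algorithm, which adds regularization, second-order gradient information, and row/column subsampling; the functions `fit_stump` and `boost` are hypothetical helpers written for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) for the best single split on x."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = (((residual[left] - left_value) ** 2).sum()
               + ((residual[~left] - right_value) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def boost(x, y, n_estimators, learning_rate):
    """Boosting for squared error: each stump is fit to the current residuals."""
    prediction = np.full_like(y, y.mean())  # start from the mean of the target
    for _ in range(n_estimators):
        threshold, left_value, right_value = fit_stump(x, y - prediction)
        step = np.where(x <= threshold, left_value, right_value)
        prediction = prediction + learning_rate * step  # shrink each stump's contribution
    return prediction

# Synthetic non-linear data (assumption)
rng = np.random.default_rng(42)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

for n in (1, 10, 100):
    mse = np.mean((y - boost(x, y, n, learning_rate=0.1)) ** 2)
    print(f"n_estimators={n:>3}  training MSE={mse:.4f}")
```

Training error keeps falling as rounds are added, which is exactly why it is a poor guide on its own: the number of rounds should be judged on held-out data.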

## Step 6: Compare Models

In [None]:
# Create comparison dataframe
comparison = pd.DataFrame([
    {k: v for k, v in lr_results.items() if k != 'predictions'},
    {k: v for k, v in rf_results.items() if k != 'predictions'},
    {k: v for k, v in xgb_results.items() if k != 'predictions'}
])

comparison = comparison.sort_values('rmse')

print("\n📊 MODEL COMPARISON")
print("="*80)
display(comparison)

best_model_name = comparison.iloc[0]['model']
print(f"\n🏆 BEST MODEL: {best_model_name}")
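
Lowest RMSE is one selection rule, but business needs may weigh other factors such as training cost. The sketch below shows one possible rule: among models within a tolerance of the best RMSE, prefer the one that trains fastest. The metric values, the generic model names, and the `select_model` helper are hypothetical and only illustrate the logic; in the notebook the inputs would come from the `comparison` table.

```python
# Hypothetical metrics for illustration only (not results from this notebook)
candidates = [
    {"model": "model_a", "rmse": 1200.0, "train_time": 0.1},
    {"model": "model_b", "rmse": 905.0, "train_time": 3.0},
    {"model": "model_c", "rmse": 900.0, "train_time": 12.0},
]

def select_model(candidates, rmse_tolerance=0.02):
    """Pick the fastest-training model whose RMSE is within `rmse_tolerance` of the best RMSE."""
    best_rmse = min(c["rmse"] for c in candidates)
    close_enough = [c for c in candidates if c["rmse"] <= best_rmse * (1 + rmse_tolerance)]
    return min(close_enough, key=lambda c: c["train_time"])

chosen = select_model(candidates)
print(chosen["model"])  # model_b: within 2% of the best RMSE and 4x faster to train
```

Setting `rmse_tolerance=0.0` reduces this to the pure lowest-RMSE rule used above.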

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# RMSE comparison
axes[0].bar(comparison['model'], comparison['rmse'], color=['skyblue', 'lightgreen', 'coral'])
axes[0].set_title('RMSE Comparison (Lower is Better)', fontweight='bold')
axes[0].set_ylabel('RMSE ($)')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(axis='y', alpha=0.3)

# R² comparison
axes[1].bar(comparison['model'], comparison['r2'], color=['skyblue', 'lightgreen', 'coral'])
axes[1].set_title('R² Score Comparison (Higher is Better)', fontweight='bold')
axes[1].set_ylabel('R² Score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim([0, 1])

# Training time comparison
axes[2].bar(comparison['model'], comparison['train_time'], color=['skyblue', 'lightgreen', 'coral'])
axes[2].set_title('Training Time Comparison', fontweight='bold')
axes[2].set_ylabel('Time (seconds)')
axes[2].tick_params(axis='x', rotation=45)
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Step 7: Prediction Analysis

In [None]:
# Actual vs Predicted plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (results, model_name) in enumerate([
    (lr_results, 'Linear Regression'),
    (rf_results, 'Random Forest'),
    (xgb_results, 'XGBoost')
]):
    axes[idx].scatter(y_test, results['predictions'], alpha=0.3, s=10)
    axes[idx].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                   'r--', lw=2, label='Perfect Prediction')
    axes[idx].set_title(f'{model_name}\nRMSE: ${results["rmse"]:,.0f}', fontweight='bold')
    axes[idx].set_xlabel('Actual Sales ($)')
    axes[idx].set_ylabel('Predicted Sales ($)')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 8: Feature Importance (Best Model)

In [None]:
# Get feature importance from best model
if best_model_name == 'Random Forest':
    best_model = rf_model
    importances = best_model.feature_importances_
elif best_model_name == 'XGBoost':
    best_model = xgb_model
    importances = best_model.feature_importances_
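else:
    # Linear Regression has no feature_importances_; use coefficient magnitudes as a proxy
    # (only comparable across features if the features share a similar scale)
    best_model = lr_model
    importances = np.abs(lr_model.coef_)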

# Create feature importance dataframe
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': importances
}).sort_values('importance', ascending=False)

# Plot top 15 features
plt.figure(figsize=(12, 6))
plt.barh(feature_importance.head(15)['feature'], 
         feature_importance.head(15)['importance'])
plt.title(f'Top 15 Feature Importances ({best_model_name})', fontsize=14, fontweight='bold')
plt.xlabel('Importance')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
display(feature_importance.head(10))
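
Coefficient magnitudes and tree-based importances are not measured on the same footing, and raw coefficients depend on feature scale. A model-agnostic alternative, sketched here on synthetic data, is scikit-learn's `permutation_importance`, which measures how much the score drops when one feature is shuffled. The data and the names `X_demo` and `y_demo` are assumptions made for this example.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data (assumption): feature 0 drives most of the target,
# feature 1 matters a little but is recorded on a tiny scale, feature 2 is noise
X_demo = rng.normal(size=(500, 3))
X_demo[:, 1] *= 0.001
y_demo = 5.0 * X_demo[:, 0] + 1000.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X_demo, y_demo)

# Shuffle each feature several times and record the average drop in R²
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=42)

print("abs(coef):             ", np.round(np.abs(model.coef_), 2))
print("permutation importance:", np.round(result.importances_mean, 3))
```

The coefficient ranking puts feature 1 first purely because of its units, while permutation importance ranks feature 0 first. In practice it is usually computed on held-out data such as `X_test` rather than on the training set.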

## Step 9: Save Best Model

In [None]:
import os

# Create models directory
os.makedirs('../models', exist_ok=True)

# Save all models
joblib.dump(lr_model, '../models/linear_regression.joblib')
joblib.dump(rf_model, '../models/random_forest.joblib')
joblib.dump(xgb_model, '../models/xgboost.joblib')

# Save best model separately
joblib.dump(best_model, '../models/best_model.joblib')

# Save comparison metrics
comparison.to_csv('../models/model_comparison.csv', index=False)

print("✅ Models saved!")
print("   Location: ../models/")
print(f"   Best model: {best_model_name}")
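
Before relying on a saved artifact, it can be worth checking that it reloads and predicts identically. The sketch below does a round trip through a temporary directory with a tiny stand-in model; the synthetic data and the names `X_demo`, `y_demo`, and `reloaded` are assumptions, and in the notebook the object being checked would be `best_model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (assumption); in the notebook this would be `best_model`
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 3.0
model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Round-trip predictions match:", same)
```

Files written by `joblib.dump` are pickles, so they are generally only safe to load with the same library versions that created them. Recording those versions next to the artifact makes later reloads less fragile.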

## ✅ Summary

### Models Trained:
1. ✅ **Linear Regression** - Fast baseline
2. ✅ **Random Forest** - Handles non-linearity
3. ✅ **XGBoost** - State-of-the-art gradient boosting

### Key Insights:
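- XGBoost often achieves the lowest RMSE on tabular data, but confirm this against the comparison table from your own run
- Random Forest is usually a close second
- Linear Regression is a good baseline but cannot capture non-linear relationships on its own
- Typical training time: LR < XGB < RF (actual timings depend on data size and hardware)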

### Why This Matters for MLOps:
- 🎯 **Model Selection**: Data-driven, not guesswork
- 📊 **Reproducibility**: All models saved
- 🔄 **A/B Testing Ready**: Can compare in production
- 📈 **Baseline Established**: Track improvements over time

### The Problem with This Approach:
**We just trained 3 models... but:**
- How do we track all these experiments?
- What if we try 50 hyperparameter combinations?
- How do we reproduce results from 3 months ago?
- Where are the model artifacts?

**This is why we need MLflow!** 👉

---

**Next:** `04_MLflow_Tracking_Demo.ipynb` - Track experiments systematically

**© 2024 Amey Talkatkar**