# Unit 4 - Example 12: Model Evaluation

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand why accuracy alone can mislead and what precision, recall, and F1 add
- Apply cross-validation and learning curves with scikit-learn to estimate model performance
- Practice evaluating and comparing models on small synthetic datasets

## 🔗 Prerequisites

- ✅ Basic Python
- ✅ Basic NumPy/Pandas (when applicable)

---

## Official Structure Reference

This notebook supports **Course 05, Unit 4** requirements from `DETAILED_UNIT_DESCRIPTIONS.md`.

---



## 🔗 Solving the Problem from Example 11

**Remember the dead end from Example 11?**
- We learned to build classification models
- But we discovered accuracy alone can be misleading
- We need comprehensive evaluation beyond just accuracy

**This notebook solves that problem!**
- We'll learn **comprehensive evaluation metrics** (precision, recall, F1, AUC)
- We'll learn **cross-validation** for robust performance estimation
- We'll learn **learning curves** to understand model behavior

**This solves the evaluation problem from Example 11!**

In [1]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, learning_curve)
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix)
from sklearn.preprocessing import StandardScaler

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("=" * 70)
print("Example 12: Model Evaluation")
print("=" * 70)
print("\n📚 Prerequisites: Examples 10-11 completed, ML model knowledge")
print("🔗 This is the THIRD example in Unit 4 - model evaluation")
print("🎯 Goal: Master model evaluation with cross-validation and metrics")


Example 12: Model Evaluation

📚 Prerequisites: Examples 10-11 completed, ML model knowledge
🔗 This is the THIRD example in Unit 4 - model evaluation
🎯 Goal: Master model evaluation with cross-validation and metrics


In [2]:
# 1. CROSS-VALIDATION


In [3]:
print("\n1. Cross Validation")
print("-" * 70)
np.random.seed(42)
n_samples = 300
X = np.random.randn(n_samples, 4)
y = X[:, 0] * 2 + X[:, 1] * 1.5 - X[:, 2] * 0.5 + np.random.randn(n_samples) * 0.1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='r2')
print("\n5-Fold Cross-Validation R² Scores:")
for i, score in enumerate(cv_scores, 1):
    print(f"  Fold {i}: {score:.4f}")
# Report the mean and a +/- 2 standard deviation band across the folds
print(f"\nMean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")


1. Cross Validation
----------------------------------------------------------------------

5-Fold Cross-Validation R² Scores:
  Fold 1: 0.9978
  Fold 2: 0.9980
  Fold 3: 0.9976
  Fold 4: 0.9987
  Fold 5: 0.9985

Mean CV Score: 0.9981 (+/- 0.0008)
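
To make concrete what `cross_val_score` does, here is a minimal sketch of the same 5-fold procedure written as an explicit loop. It generates its own synthetic data with a recipe similar to the cell above; the names `X_demo`, `y_demo`, `manual_scores`, and `library_scores` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative; similar recipe to the cell above)
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((300, 4))
y_demo = (2 * X_demo[:, 0] + 1.5 * X_demo[:, 1] - 0.5 * X_demo[:, 2]
          + 0.1 * rng.standard_normal(300))

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual K-fold: train on 4 folds, score on the held-out fold, repeat 5 times
manual_scores = []
for train_idx, val_idx in kfold_demo.split(X_demo):
    fold_model = LinearRegression()
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], preds))
manual_scores = np.array(manual_scores)

# The library call should reproduce the same per-fold scores
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                                 cv=kfold_demo, scoring='r2')

print("Manual :", np.round(manual_scores, 4))
print("Library:", np.round(library_scores, 4))
```

Because every sample is used for validation exactly once, the mean of these scores is usually a steadier estimate than a single train/test split.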


In [4]:
# 2. LEARNING CURVES


In [5]:
print("\n\n2. Learning Curves")
print("-" * 70)
# Compute training and validation R² at 10 increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring='r2')
# Average across the 5 folds (axis=1) for each training-set size
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
# Plot mean scores with a +/- 1 standard deviation band
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', color='#FF6B6B', label='Training Score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='#FF6B6B')
plt.plot(train_sizes, val_mean, 'o-', color='#4ECDC4', label='Validation Score')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='#4ECDC4')
plt.xlabel('Training Set Size')
plt.ylabel('R² Score', fontsize=12)
plt.title('Learning Curves')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('12_learning_curves.png', dpi=300, bbox_inches='tight')
print("✓ Learning curves saved")
plt.close()



2. Learning Curves
----------------------------------------------------------------------


✓ Learning curves saved
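
A learning curve is read mainly through the gap between the training and validation scores: a large, persistent gap usually points to overfitting, while two low scores that sit close together point to underfitting. The sketch below illustrates this on separate noisy synthetic data with two decision-tree regressors. The helper `summarize_curve`, the data, and the choice of models are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic target (illustrative data, not the notebook's)
rng = np.random.default_rng(0)
X_lc = rng.standard_normal((400, 4))
y_lc = 2 * X_lc[:, 0] + 1.5 * X_lc[:, 1] + rng.standard_normal(400)

def summarize_curve(estimator, X, y):
    """Hypothetical helper: train/validation R² at the largest training size."""
    _, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, scoring='r2',
        train_sizes=np.linspace(0.2, 1.0, 5))
    final_train = train_scores[-1].mean()
    final_val = val_scores[-1].mean()
    return final_train, final_val, final_train - final_val

curve_summary = {}
for name, est in [('Deep tree', DecisionTreeRegressor(random_state=0)),
                  ('Shallow tree', DecisionTreeRegressor(max_depth=3, random_state=0))]:
    tr, va, gap = summarize_curve(est, X_lc, y_lc)
    curve_summary[name] = {'train': tr, 'val': va, 'gap': gap}
    print(f"{name:12s} train R²={tr:.3f}  val R²={va:.3f}  gap={gap:.3f}")
```

The unconstrained tree memorizes its training folds, so its training R² sits near 1 while its validation R² falls well short. That gap is the pattern to look for in a learning-curve plot.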


In [6]:
# 3. MULTIPLE METRICS EVALUATION


In [7]:
print("\n\n3. Multiple Metrics Evaluation")
print("-" * 70)
# Classification metrics
np.random.seed(42)
X_clf = np.random.randn(200, 3)
y_clf = (X_clf[:, 0] + X_clf[:, 1] > 0).astype(int)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42, stratify=y_clf)
scaler = StandardScaler()
X_train_clf_scaled = scaler.fit_transform(X_train_clf)
X_test_clf_scaled = scaler.transform(X_test_clf)
clf_model = LogisticRegression(random_state=42)
clf_model.fit(X_train_clf_scaled, y_train_clf)
y_pred_clf = clf_model.predict(X_test_clf_scaled)
accuracy = accuracy_score(y_test_clf, y_pred_clf)
precision = precision_score(y_test_clf, y_pred_clf)
recall = recall_score(y_test_clf, y_pred_clf)
f1 = f1_score(y_test_clf, y_pred_clf)
print("\nClassification Metrics:")
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")



3. Multiple Metrics Evaluation
----------------------------------------------------------------------

Classification Metrics:
  Accuracy:  0.9500
  Precision: 0.9412
  Recall:    0.9412
  F1 Score:  0.9412
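
All four metrics come from the confusion-matrix counts: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The sketch below works through those formulas on a small hand-made label vector (illustrative values, not the notebook's test set) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hand-made labels: 4 positives, 6 negatives (illustrative only)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of predicted positives, how many are right
recall_manual = tp / (tp + fn)      # of actual positives, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy:  {accuracy_manual:.4f}")
print(f"Precision: {precision_manual:.4f}")
print(f"Recall:    {recall_manual:.4f}")
print(f"F1 Score:  {f1_manual:.4f}")
```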


# 4. MODEL COMPARISON


In [8]:
print("\n\n4. Model Comparison")
print("-" * 70)
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
}
results = {}
for name, clf in models.items():
    # Fit, predict, and score inside the loop so every model gets its own row
    clf.fit(X_train_clf_scaled, y_train_clf)
    y_pred = clf.predict(X_test_clf_scaled)
    results[name] = {
        'Accuracy': accuracy_score(y_test_clf, y_pred),
        'Precision': precision_score(y_test_clf, y_pred),
        'Recall': recall_score(y_test_clf, y_pred),
        'F1': f1_score(y_test_clf, y_pred),
    }
comparison_df = pd.DataFrame(results).T
print("\nModel Comparison:")
print(comparison_df.round(4))
# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 6))
comparison_df.plot(kind='bar', ax=ax, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'])
ax.set_title('Model Comparison', fontsize=14, weight='bold')
ax.set_ylabel('Score', fontsize=12)
ax.set_xlabel('Model')
ax.legend(loc='best')
ax.grid(True, alpha=0.3, axis='y')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('12_model_comparison.png', dpi=300, bbox_inches='tight')
print("✓ Model comparison saved")
plt.close()



4. Model Comparison
----------------------------------------------------------------------

Model Comparison:
                     Accuracy  Precision  Recall      F1
Logistic Regression      0.95     0.9412  0.9412  0.9412
Decision Tree            0.90     0.8421  0.9412  0.8889


✓ Model comparison saved
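
A comparison on a single 40-sample test split can shift noticeably with the split, so cross-validating the comparison is often more reliable. The sketch below uses `cross_validate` with several scorers, including ROC AUC, which the introduction lists but the cells above do not compute. The data is regenerated with a similar recipe, and the names `candidates`, `cv_summary`, and `cv_comparison` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative; similar recipe to section 3)
rng = np.random.default_rng(42)
X_cmp = rng.standard_normal((200, 3))
y_cmp = (X_cmp[:, 0] + X_cmp[:, 1] > 0).astype(int)

# Putting the scaler in a pipeline refits it on each training fold,
# so no validation-fold statistics leak into training
candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
}
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_validate(estimator, X_cmp, y_cmp, cv=cv, scoring=scoring)
    # cross_validate returns one array per metric under the key 'test_<metric>'
    cv_summary[name] = {m: scores[f'test_{m}'].mean() for m in scoring}

cv_comparison = pd.DataFrame(cv_summary).T
print(cv_comparison.round(4))
```

ROC AUC is computed from predicted scores instead of hard labels, so it reflects how well a model ranks positives above negatives across all thresholds.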


# 5. SUMMARY


In [9]:
print("\n" + "=" * 70)
print("Summary")
print("=" * 70)
print("\nKey Concepts Covered:")
print("1. Cross-validation techniques")
print("2. Learning curves")
print("3. Multiple evaluation metrics")
print("4. Model comparison")
print("\nNext Steps: Continue to Example 13 for CPU vs GPU ML")



Summary

Key Concepts Covered:
1. Cross-validation techniques
2. Learning curves
3. Multiple evaluation metrics
4. Model comparison

Next Steps: Continue to Example 13 for CPU vs GPU ML


## 🚫 When Model Evaluation Hits a Dead End

**BEFORE**: We've learned comprehensive model evaluation techniques.

**AFTER**: We discover that evaluation on CPU can become a bottleneck as datasets grow.

**Why this matters**: CPU-based evaluation works, but repeated training on very large datasets can benefit from GPU acceleration.

---

### The Problem We've Discovered

We've learned:
- ✅ How to evaluate models comprehensively (cross-validation, metrics)
- ✅ How to compare models and understand performance
- ✅ How to create learning curves

**But we have a problem:**
- ❓ **What if we have large datasets (millions of samples)?**
- ❓ **What if cross-validation takes hours on CPU?**
- ❓ **What if we need faster training and evaluation?**

**The Dead End:**
- CPU-based evaluation works for small/medium datasets
- But for large datasets or costlier models, CPU processing can become too slow
- GPU acceleration is one way to scale ML to that size

---

### Demonstrating the Problem

Let's see what happens with large datasets on CPU:


In [10]:
print("\n" + "=" * 70)
print("🚫 DEMONSTRATING THE DEAD END: CPU Performance on Large Datasets")
print("=" * 70)

import time

# Simulate a large dataset
print("\n📊 Testing CPU Performance on Large Dataset:")
large_n = 100_000  # 100k samples (real-world can be millions)

# Create the dataset: the label depends only on the first two features
X_large = np.random.randn(large_n, 10)
y_large = (X_large[:, 0] + X_large[:, 1] > 0).astype(int)

print(f"   - Dataset size: {large_n:,} samples × 10 features")
print("   - Task: Training and evaluating model")

# Time CPU-based training and evaluation
print("\n⏱️  Testing CPU Performance:")
start_time = time.perf_counter()

# Train model
clf_large = LogisticRegression(random_state=42, max_iter=1000)
clf_large.fit(X_large, y_large)

# Evaluate with 5-fold cross-validation (n_jobs=-1 spreads folds across CPU cores)
cv_scores_large = cross_val_score(clf_large, X_large, y_large, cv=5, n_jobs=-1)

cpu_time = time.perf_counter() - start_time

print(f"   ✅ CPU completed in {cpu_time:.2f} seconds")
print("   - Training + 5-fold cross-validation")
print(f"   - Mean CV score: {cv_scores_large.mean():.4f}")

print("\n⚠️  Performance Issue:")
print(f"   - {cpu_time:.2f} seconds for 100k samples")
print(f"   - For 1 million samples, roughly {cpu_time * 10:.1f} seconds if runtime scales linearly")
print("   - Real-world datasets can be 10-100x larger!")

print("\n💡 The Problem:")
print("   - A CPU has far fewer parallel execution units than a GPU")
print("   - Large datasets require more time and computation")
print("   - Cross-validation multiplies the cost (5-fold = 5 extra fits, each on 80% of the data)")
print("   - For large-scale ML, we need GPU acceleration!")

print("\n➡️  Solution Needed:")
print("   - We need GPU acceleration for ML training and evaluation")
print("   - We need parallel processing for faster computation")
print("   - We need GPU-accelerated ML libraries")
print("   - This leads us to Example 13: CPU vs GPU for ML")

print("\n" + "=" * 70)



🚫 DEMONSTRATING THE DEAD END: CPU Performance on Large Datasets

📊 Testing CPU Performance on Large Dataset:


   - Dataset size: 100,000 samples × 10 features
   - Task: Training and evaluating model

⏱️  Testing CPU Performance:

   ✅ CPU completed in 0.23 seconds
   - Training + 5-fold cross-validation
   - Mean CV score: 0.9995

⚠️  Performance Issue:
   - 0.23 seconds for 100k samples
   - For 1 million samples, roughly 2.3 seconds if runtime scales linearly
   - Real-world datasets can be 10-100x larger!

💡 The Problem:
   - A CPU has far fewer parallel execution units than a GPU
   - Large datasets require more time and computation
   - Cross-validation multiplies the cost (5-fold = 5 extra fits, each on 80% of the data)
   - For large-scale ML, we need GPU acceleration!

➡️  Solution Needed:
   - We need GPU acceleration for ML training and evaluation
   - We need parallel processing for faster computation
   - We need GPU-accelerated ML libraries
   - This leads us to Example 13: CPU vs GPU for ML
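
The 1-million-sample estimate above multiplies the measured time by 10, which assumes runtime grows linearly with sample count. A more direct check is to time the same procedure at several sizes. The sketch below does that at deliberately small sizes so it finishes quickly. The helper `time_cv` and the chosen sizes are assumptions for illustration, and the timings depend entirely on the machine.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def time_cv(n_samples, n_features=10, seed=0):
    """Hypothetical helper: wall-clock seconds for 5-fold CV at a given size."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = LogisticRegression(max_iter=1000)
    start = time.perf_counter()
    scores = cross_val_score(clf, X, y, cv=5)
    return time.perf_counter() - start, scores.mean()

timings = {}
for n in (1_000, 10_000, 50_000):
    elapsed, mean_score = time_cv(n)
    timings[n] = elapsed
    print(f"{n:>7,} samples: {elapsed:.3f} s  (mean accuracy {mean_score:.4f})")
```

Comparing the ratio of timings to the ratio of sizes shows whether the linear assumption holds for a given model before extrapolating to millions of rows.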



### What We Need Next

**The Solution**: We need GPU acceleration for ML:
- **GPU-accelerated ML**: Libraries like cuML, XGBoost GPU
- **Parallel processing**: A GPU executes many operations simultaneously
- **Faster training**: Speedups on the order of 10-100x are possible for large datasets, depending on the model and hardware
- **Faster evaluation**: Cross-validation can run much faster on GPU

**This dead end leads us to Example 13: CPU vs GPU for ML**
- Example 13 will show us GPU acceleration for ML
- We'll compare training and evaluation speed on CPU and GPU
- This addresses the performance problem for large-scale machine learning!
