# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 10: Classification Model Zoo
**Instructor:** Amir Charkhi | **Goal:** Compare All Classification Algorithms

### Learning Objectives
- Train and compare 8 different classification algorithms
- Understand strengths and weaknesses of each
- Learn when to use which algorithm
- Make data-driven model selection decisions

### The 8 Models We'll Compare:
1. **Logistic Regression** - Simple baseline
2. **Ridge Classifier** - Regularized linear model
3. **Linear SVC** - Maximum margin (linear)
4. **SVM with RBF Kernel** - Non-linear patterns
5. **K-Nearest Neighbors (KNN)** - Instance-based learning
6. **Decision Tree** - Interpretable but overfits
7. **Random Forest** - Ensemble of trees
8. **Gradient Boosting** - Sequential boosting

---

## 1. Import Libraries

We'll need all our classification algorithms and metrics.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time
warnings.filterwarnings('ignore')

print("✅ Core libraries imported")

In [None]:
# Scikit-learn: Data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print("✅ Data preparation tools imported")

In [None]:
# Scikit-learn: Classification models
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

print("✅ All 8 classification algorithms imported")

In [None]:
# Scikit-learn: Evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

print("✅ Evaluation metrics imported")

In [None]:
# Visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("\n" + "="*70)
print("🎯 CLASSIFICATION MODEL ZOO - All libraries ready!")
print("="*70)

---
## 2. Load and Prepare Dataset

We'll use the **Online Shoppers Purchasing Intention** dataset.

**Goal:** Predict if a visitor will make a purchase (Revenue: True/False)

In [None]:
# Load dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv'

print("📥 Loading Online Shoppers dataset...")
df_raw = pd.read_csv(url)

print(f"✅ Dataset loaded: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
print(f"\n📊 Target distribution:")
print(df_raw['Revenue'].value_counts())
print(f"\n💡 Class imbalance: {df_raw['Revenue'].value_counts()[True]/len(df_raw)*100:.1f}% positive class")

In [None]:
# Preprocessing
print("🧹 Preprocessing data...")

df = df_raw.copy()

# Convert target to binary
df['Revenue'] = df['Revenue'].astype(int)

# Encode Month
month_map = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'June': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['Month'] = df['Month'].map(month_map)

# One-hot encode VisitorType
visitor_dummies = pd.get_dummies(df['VisitorType'], prefix='Visitor', drop_first=True)
df = pd.concat([df, visitor_dummies.astype(int)], axis=1)
df = df.drop(columns=['VisitorType'])

# Convert Weekend
df['Weekend'] = df['Weekend'].astype(int)

print("✅ Preprocessing complete!")

In [None]:
# Prepare X and y
feature_cols = [col for col in df.columns if col != 'Revenue']
X = df[feature_cols].copy()
y = df['Revenue'].copy()

print(f"📊 Features: {len(feature_cols)} columns")
print(f"📊 Total samples: {len(X):,}")

---
## 3. Train-Test Split

⚠️ **Important:** Use stratification to maintain class balance!

In [None]:
# Split data with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("✂️ Data Split:")
print(f"   Training:   {len(X_train):>6,} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"   Testing:    {len(X_test):>6,} samples ({len(X_test)/len(X)*100:.0f}%)")
print(f"\n✅ Class balance maintained:")
print(f"   Train: {y_train.mean():.3f} positive class")
print(f"   Test:  {y_test.mean():.3f} positive class")
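
What `stratify=y` does can be sketched by hand: split each class separately at the same ratio, then merge the pieces. The snippet below is a minimal pure-Python illustration, not scikit-learn's implementation; the helper name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with every class split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same ratio for each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 160 positives out of 1,000 (illustrative, not the real dataset)
labels = [1] * 160 + [0] * 840
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Both splits keep the same positive rate, which a purely random split would only match approximately.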

---
## 4. Feature Scaling

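**Required for:** Logistic Regression, Ridge, SVMs (linear and RBF), KNN  
**Not required for:** Tree-based models

We'll fit the scaler on the training set only and pass the scaled features to the scale-sensitive models; the tree-based models below are trained on the unscaled features.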
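
The key detail in scaling is that the statistics are learned from the training split only and then reused on the test split. Here is a minimal pure-Python sketch for a single hypothetical feature column; `fit_scaler` and `transform` are illustrative helpers, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population std (ddof=0) from training values."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical feature values, e.g. a duration column in seconds
train_values = [10.0, 20.0, 30.0, 40.0, 50.0]
test_values = [25.0, 60.0]

mean, std = fit_scaler(train_values)              # statistics from train only
train_scaled = transform(train_values, mean, std)
test_scaled = transform(test_values, mean, std)   # reuse the train statistics

print(f"train mean={fmean(train_scaled):.3f}, std={pstdev(train_scaled):.3f}")
print("test scaled:", [round(v, 3) for v in test_scaled])
```

The test values are not centered on zero after scaling, and that is expected: refitting the scaler on the test set would leak information about it into the evaluation.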

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("⚖️ Features scaled using StandardScaler")
print(f"   Mean ≈ {X_train_scaled.mean():.6f}")
print(f"   Std  ≈ {X_train_scaled.std():.6f}")
print("\n✅ Scaled features ready for all models!")

---
## 5. Helper Function: Evaluate Models

We'll create a reusable function to evaluate all models consistently.
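
The threshold metrics reported in this section (accuracy, precision, recall, F1) all derive from the four confusion-matrix counts. Here is a small worked example with made-up labels; `confusion_counts` is an illustrative helper, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)            # of predicted positives, how many are right
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

In this toy case accuracy is 0.70 even though half of the positives are missed, which is why accuracy alone can flatter a model on imbalanced data.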

In [None]:
def evaluate_model(model, X_train, y_train, X_test, y_test, model_name):
    """
    Train and evaluate a classification model.
    Returns metrics dictionary.
    """
    # Measure training time (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start_time
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Get probabilities if available
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    elif hasattr(model, 'decision_function'):
        y_decision = model.decision_function(X_test)
        auc = roc_auc_score(y_test, y_decision)
    else:
        auc = None
    
    # Calculate metrics
    metrics = {
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': auc,
        'Train Time (s)': train_time
    }
    
    # Print results
    print(f"\n{'='*70}")
    print(f"📊 {model_name}")
    print(f"{'='*70}")
    print(f"Accuracy:  {metrics['Accuracy']:.4f}")
    print(f"Precision: {metrics['Precision']:.4f}")
    print(f"Recall:    {metrics['Recall']:.4f}")
    print(f"F1-Score:  {metrics['F1-Score']:.4f}")
    if auc is not None:
        print(f"ROC-AUC:   {metrics['ROC-AUC']:.4f}")
    print(f"Train Time: {train_time:.3f}s")
    
    return metrics

print("✅ Evaluation function defined!")

---
## 6. Train All Models

Now we'll train all 8 classifiers one by one!

### Model 1: Logistic Regression
**Type:** Linear model  
**Pros:** Fast, interpretable, probability estimates  
**Best for:** Baseline, interpretability needed

In [None]:
# Logistic Regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_metrics = evaluate_model(lr_model, X_train_scaled, y_train, X_test_scaled, y_test, 
                             "Logistic Regression")

### Model 2: Ridge Classifier
**Type:** Regularized linear model  
**Pros:** Very fast, handles multicollinearity  
**Cons:** No probability estimates  
**Best for:** Large datasets, speed matters
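
Lacking `predict_proba` does not rule out a ROC-AUC score. AUC depends only on how the scores rank positives against negatives, so raw `decision_function` margins serve as well as probabilities. The sketch below uses the pairwise (Mann-Whitney) definition with hypothetical margins; `roc_auc` is an illustrative helper, not the scikit-learn function.

```python
import math

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 0, 1, 1]
margins = [-2.0, -0.5, -0.1, 0.3, 0.8, 2.5]   # hypothetical decision_function outputs

# Squash margins into (0, 1) with a sigmoid: the ranking is unchanged
pseudo_probs = [1.0 / (1.0 + math.exp(-m)) for m in margins]

auc_from_margins = roc_auc(y_true, margins)
auc_from_probs = roc_auc(y_true, pseudo_probs)
print(f"AUC from margins: {auc_from_margins:.4f}")
print(f"AUC from squashed scores: {auc_from_probs:.4f}")
```

Any strictly increasing transform of the scores leaves the AUC unchanged, so the two values match.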

In [None]:
# Ridge Classifier
ridge_model = RidgeClassifier(random_state=42)
ridge_metrics = evaluate_model(ridge_model, X_train_scaled, y_train, X_test_scaled, y_test,
                                "Ridge Classifier")

### Model 3: Linear SVC
**Type:** Support Vector Machine (linear)  
**Pros:** Maximum margin, good for high dimensions  
**Cons:** Sensitive to C parameter  
**Best for:** Linearly separable data

In [None]:
# Linear SVC
linear_svc_model = LinearSVC(random_state=42, max_iter=2000, dual=False)
linear_svc_metrics = evaluate_model(linear_svc_model, X_train_scaled, y_train, 
                                     X_test_scaled, y_test, "Linear SVC")

### Model 4: SVM with RBF Kernel ⭐
**Type:** Support Vector Machine (non-linear)  
**Pros:** Handles non-linear patterns, powerful  
**Cons:** Slower than linear, needs tuning  
**Best for:** Complex decision boundaries
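
For intuition, the RBF kernel scores the similarity of two points as `K(x, z) = exp(-gamma * ||x - z||^2)`: identical points score 1 and the score decays with distance. With `gamma='scale'`, scikit-learn sets gamma to `1 / (n_features * X.var())`. A minimal sketch on a tiny made-up matrix follows; the helper names are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gamma_scale(X):
    """gamma='scale' heuristic: 1 / (n_features * variance of all entries)."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * var)

# Tiny made-up feature matrix: 4 samples, 2 features
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
gamma = gamma_scale(X)

print(f"gamma = {gamma:.4f}")
print(f"K(x0, x0) = {rbf_kernel(X[0], X[0], gamma):.4f}")   # identical points
print(f"K(x0, x1) = {rbf_kernel(X[0], X[1], gamma):.4f}")   # close neighbours
print(f"K(x0, x3) = {rbf_kernel(X[0], X[3], gamma):.4f}")   # far apart
```

A larger gamma makes the similarity fall off faster, which yields a more flexible (and more overfit-prone) decision boundary.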

💡 **Reminder:** See `theory_svm.ipynb` for kernel trick explanation!

In [None]:
# SVM with RBF Kernel
svm_rbf_model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_rbf_metrics = evaluate_model(svm_rbf_model, X_train_scaled, y_train, 
                                  X_test_scaled, y_test, "SVM (RBF Kernel)")

print(f"\n💡 Support vectors: {len(svm_rbf_model.support_vectors_)}/{len(X_train)} "
      f"({len(svm_rbf_model.support_vectors_)/len(X_train)*100:.1f}%)")

### Model 5: K-Nearest Neighbors ⭐
**Type:** Instance-based learning  
**Pros:** Simple, no training, non-parametric  
**Cons:** Slow prediction, memory intensive  
**Best for:** Small datasets, as baseline
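
The whole algorithm fits in a few lines, which also shows why prediction is the expensive step: every query is compared against every stored training point. This is a brute-force pure-Python sketch on toy 2-D points; `knn_predict` is an illustrative helper, and scikit-learn can use tree-based indexes instead of a full scan.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    """Majority vote among the k training points closest to the query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: class 0 clustered near (0, 0), class 1 near (5, 5)
X_train = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

print(knn_predict(X_train, y_train, (0.5, 0.5), k=3))
print(knn_predict(X_train, y_train, (5.5, 5.5), k=3))
```

"Training" here is only storing `X_train` and `y_train`; all the work happens inside `knn_predict`.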

💡 **Reminder:** See `theory_knn.ipynb` for choosing K!

In [None]:
# K-Nearest Neighbors
# Using K=5 as a good starting point
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_metrics = evaluate_model(knn_model, X_train_scaled, y_train, 
                              X_test_scaled, y_test, "K-Nearest Neighbors (K=5)")

print(f"\n💡 Using K=5 neighbors for each prediction")

### Model 6: Decision Tree
**Type:** Tree-based (single tree)  
**Pros:** Interpretable, no scaling needed  
**Cons:** Prone to overfitting  
**Best for:** Understanding feature interactions
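
A tree grows by choosing splits that make the child nodes purer than the parent. `DecisionTreeClassifier` measures this with Gini impurity by default. Here is a small worked example on made-up labels; the helper names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up node with 3 positives and 5 negatives, and one candidate split
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(f"parent impurity: {gini(parent):.5f}")
print(f"split impurity:  {weighted_gini(left, right):.5f}")
```

Left unconstrained, a tree keeps splitting until every leaf is pure, which is where the overfitting risk comes from. Limits such as `max_depth` and `min_samples_split` stop it earlier.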

In [None]:
# Decision Tree (with some regularization)
dt_model = DecisionTreeClassifier(max_depth=10, min_samples_split=20, random_state=42)
dt_metrics = evaluate_model(dt_model, X_train, y_train, X_test, y_test, 
                             "Decision Tree")

print(f"\n💡 Tree depth: {dt_model.get_depth()}, Leaves: {dt_model.get_n_leaves()}")

### Model 7: Random Forest 🌟
**Type:** Ensemble (bagging)  
**Pros:** Excellent performance, robust, feature importance  
**Cons:** Less interpretable, larger model  
**Best for:** Strong out-of-box performance
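
Bagging rests on two ideas: each tree sees a bootstrap sample (rows drawn with replacement), and the trees' outputs are combined. The sketch below illustrates both in pure Python with hypothetical helper names. It uses a hard majority vote for simplicity, whereas scikit-learn's forest averages the trees' predicted class probabilities.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def majority_vote(predictions):
    """Combine per-tree class predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_indices(1000, rng)
unique_fraction = len(set(sample)) / 1000

# Roughly 63% of rows appear in a bootstrap sample; the rest are "out of bag"
print(f"unique rows in bootstrap sample: {unique_fraction:.1%}")

# Hypothetical predictions from five trees for a single visitor
print("ensemble prediction:", majority_vote([1, 0, 1, 1, 0]))
```

Because every tree trains on a different sample, their individual errors are partly uncorrelated, and combining them reduces variance compared with a single deep tree.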

In [None]:
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_metrics = evaluate_model(rf_model, X_train, y_train, X_test, y_test, 
                             "Random Forest")

print(f"\n💡 Ensemble of {rf_model.n_estimators} decision trees")

### Model 8: Gradient Boosting 🌟
**Type:** Ensemble (boosting)  
**Pros:** Often best performance, powerful  
**Cons:** Slower training, more hyperparameters  
**Best for:** When you need maximum accuracy
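
"Sequential" means each new tree is fitted to the errors left by the trees before it, and its output is added with a small learning rate. The sketch below shows that loop on a toy 1-D regression problem with squared error and one-split stumps, which keeps it short. `GradientBoostingClassifier` follows the same pattern but fits its trees to the gradient of the log-loss. All names and data are illustrative.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds, learning_rate=0.1):
    """Return the training MSE after each boosting round."""
    pred = [sum(y) / len(y)] * len(y)              # start from the mean
    history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)            # fit what is still wrong
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return history

# Toy data with a step between x=4 and x=5
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

mse_history = boost(x, y, n_rounds=50)
print(f"MSE after round 1:  {mse_history[0]:.4f}")
print(f"MSE after round 50: {mse_history[-1]:.4f}")
```

The training error shrinks round by round. `n_estimators` and `learning_rate` trade off against each other: a smaller learning rate usually needs more rounds.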

In [None]:
# Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 
                                      max_depth=5, random_state=42)
gb_metrics = evaluate_model(gb_model, X_train, y_train, X_test, y_test, 
                             "Gradient Boosting")

print(f"\n💡 Sequential ensemble with {gb_model.n_estimators} iterations")

---
## 7. Complete Comparison

Let's see all models side-by-side!

In [None]:
# Create comparison DataFrame
all_metrics = [
    lr_metrics,
    ridge_metrics,
    linear_svc_metrics,
    svm_rbf_metrics,
    knn_metrics,
    dt_metrics,
    rf_metrics,
    gb_metrics
]

comparison_df = pd.DataFrame(all_metrics)

# Sort by F1-Score (more informative than accuracy for imbalanced data)
comparison_df = comparison_df.sort_values('F1-Score', ascending=False)

print("\n" + "="*90)
print("🏆 CLASSIFICATION MODEL COMPARISON - TEST SET PERFORMANCE")
print("="*90)
print(comparison_df.to_string(index=False))
print("="*90)

# Highlight best model
best_model = comparison_df.iloc[0]
print(f"\n🥇 BEST MODEL: {best_model['Model']}")
print(f"   F1-Score: {best_model['F1-Score']:.4f}")
print(f"   ROC-AUC:  {best_model['ROC-AUC']:.4f}")
print(f"   Accuracy: {best_model['Accuracy']:.4f}")

---
## 8. Visual Comparison

Let's visualize the performance of all models.

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. F1-Score comparison
axes[0, 0].barh(comparison_df['Model'], comparison_df['F1-Score'], color='purple', alpha=0.7)
axes[0, 0].set_xlabel('F1-Score', fontsize=11)
axes[0, 0].set_title('F1-Score Comparison (Higher is Better)', fontsize=12, pad=15)
axes[0, 0].invert_yaxis()
axes[0, 0].grid(True, alpha=0.3, axis='x')

# 2. ROC-AUC comparison
comparison_with_auc = comparison_df[comparison_df['ROC-AUC'].notna()]
axes[0, 1].barh(comparison_with_auc['Model'], comparison_with_auc['ROC-AUC'], 
                color='seagreen', alpha=0.7)
axes[0, 1].set_xlabel('ROC-AUC', fontsize=11)
axes[0, 1].set_title('ROC-AUC Comparison', fontsize=12, pad=15)
axes[0, 1].invert_yaxis()
axes[0, 1].grid(True, alpha=0.3, axis='x')

# 3. Precision vs Recall
axes[1, 0].scatter(comparison_df['Recall'], comparison_df['Precision'], 
                   s=200, alpha=0.6, c=range(len(comparison_df)), cmap='viridis')
for idx, row in comparison_df.iterrows():
    axes[1, 0].annotate(row['Model'], 
                        (row['Recall'], row['Precision']),
                        fontsize=8, ha='right', va='bottom')
axes[1, 0].set_xlabel('Recall', fontsize=11)
axes[1, 0].set_ylabel('Precision', fontsize=11)
axes[1, 0].set_title('Precision vs Recall Tradeoff', fontsize=12, pad=15)
axes[1, 0].grid(True, alpha=0.3)

# 4. Training Time
sorted_by_time = comparison_df.sort_values('Train Time (s)')
axes[1, 1].barh(sorted_by_time['Model'], sorted_by_time['Train Time (s)'], 
                color='coral', alpha=0.7)
axes[1, 1].set_xlabel('Training Time (seconds)', fontsize=11)
axes[1, 1].set_title('Training Time Comparison', fontsize=12, pad=15)
axes[1, 1].invert_yaxis()
axes[1, 1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   - Tree-based ensembles (RF, GB) typically perform best")
print("   - KNN trains almost instantly but is slow at prediction time")
print("   - Linear models are among the fastest to train")
print("   - Precision-Recall tradeoff varies by model")

---
## 9. Model Categories

Let's group models by type and compare categories.

In [None]:
# Add model categories
categories = {
    'Logistic Regression': 'Linear',
    'Ridge Classifier': 'Linear',
    'Linear SVC': 'Linear',
    'SVM (RBF Kernel)': 'Non-Linear',
    'K-Nearest Neighbors (K=5)': 'Instance-Based',
    'Decision Tree': 'Tree-Based',
    'Random Forest': 'Ensemble (Bagging)',
    'Gradient Boosting': 'Ensemble (Boosting)'
}

comparison_df['Category'] = comparison_df['Model'].map(categories)

# Average performance by category
category_performance = comparison_df.groupby('Category').agg({
    'Accuracy': 'mean',
    'F1-Score': 'mean',
    'ROC-AUC': 'mean',
    'Train Time (s)': 'mean'
}).round(4)

print("\n📊 AVERAGE PERFORMANCE BY MODEL CATEGORY:")
print("="*70)
print(category_performance)
print("="*70)

In [None]:
# Visualize category performance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# F1-Score by category
category_performance['F1-Score'].sort_values().plot(kind='barh', ax=axes[0], 
                                                      color='steelblue', alpha=0.7)
axes[0].set_xlabel('Average F1-Score', fontsize=11)
axes[0].set_title('Average F1-Score by Model Category', fontsize=12, pad=15)
axes[0].grid(True, alpha=0.3, axis='x')

# Training time by category
category_performance['Train Time (s)'].sort_values().plot(kind='barh', ax=axes[1], 
                                                            color='coral', alpha=0.7)
axes[1].set_xlabel('Average Training Time (s)', fontsize=11)
axes[1].set_title('Average Training Time by Category', fontsize=12, pad=15)
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

---
## 10. When to Use Each Model

### 🎯 Decision Framework:

#### **Need Interpretability?**
- ✅ **Logistic Regression** - Clear coefficients
- ✅ **Decision Tree** - Visual rules
- ❌ Avoid: SVM, Neural Networks, Ensembles

#### **Need Speed (Real-time)?**
- ✅ **Linear Models** (Logistic, Ridge, Linear SVC)
- ✅ **Decision Tree** (single tree)
- ❌ Avoid: KNN, Large Ensembles

#### **Need Best Performance?**
- ✅ **Gradient Boosting** - Often wins competitions
- ✅ **Random Forest** - Great out-of-box
- ✅ **SVM (RBF)** - For smaller datasets

#### **Small Dataset (<10K samples)?**
- ✅ **SVM** - Works well with limited data
- ✅ **KNN** - Simple baseline
- ✅ **Logistic Regression** - Reliable

#### **Large Dataset (>100K samples)?**
- ✅ **Linear Models** - Scale well
- ✅ **Random Forest** (with subsampling)
- ❌ Avoid: KNN, SVM with RBF

#### **High-Dimensional Data (many features)?**
- ✅ **Linear SVC** - Good for sparse data
- ✅ **Logistic Regression** with regularization
- ✅ **Random Forest** with feature selection
- ❌ Avoid: KNN (curse of dimensionality)

#### **Imbalanced Classes?**
- ✅ **Random Forest** - Use `class_weight='balanced'`
- ✅ **Gradient Boosting** - Often copes well; it has no `class_weight` option, so pass `sample_weight` to `fit` if needed
- ✅ **Logistic Regression** with class weights
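
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`, so the rarer class counts for more in the loss. Here is a minimal sketch with hypothetical counts close to this dataset's imbalance; `balanced_class_weights` is an illustrative helper.

```python
from collections import Counter

def balanced_class_weights(y):
    """The 'balanced' heuristic: n_samples / (n_classes * count per class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical labels: 840 non-purchases, 160 purchases
y = [0] * 840 + [1] * 160
weights = balanced_class_weights(y)

for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.4f}")
```

With these weights both classes contribute the same total weight to the loss (here 500 each), so the minority class is no longer drowned out.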

#### **Non-Linear Patterns?**
- ✅ **SVM (RBF)** - Kernel trick
- ✅ **Tree-based models** - Natural non-linearity
- ❌ Avoid: Linear models

---

## 11. Model Selection Summary

### 📋 Quick Reference:

| Model | Best For | Avoid When |
|-------|----------|------------|
| **Logistic Regression** | Baseline, interpretability, linear data | Non-linear patterns |
| **Ridge Classifier** | Speed, multicollinearity | Need probabilities |
| **Linear SVC** | High dimensions, linear separation | Non-linear data |
| **SVM (RBF)** | Complex patterns, small data | Large datasets, need speed |
| **KNN** | Simple baseline, no assumptions | Large data, high dimensions |
| **Decision Tree** | Interpretability, feature interactions | Production (overfits) |
| **Random Forest** | Strong performance, robustness | Need interpretability |
| **Gradient Boosting** | Maximum accuracy, competitions | Need speed, interpretability |

---

### 🏆 For This Dataset (Online Shoppers):

**Top 3 Models:**

In [None]:
# Show top 3
top_3 = comparison_df.head(3)[['Model', 'F1-Score', 'ROC-AUC', 'Accuracy']]

print("\n🏆 TOP 3 MODELS FOR THIS DATASET:")
print("="*70)
print(top_3.to_string(index=False))
print("="*70)

print("\n💡 Recommendation:")
print(f"   Deploy: {top_3.iloc[0]['Model']}")
print(f"   Reason: Best F1-Score ({top_3.iloc[0]['F1-Score']:.4f}) for imbalanced classes")
print(f"   Backup: {top_3.iloc[1]['Model']} (runner-up, F1-Score {top_3.iloc[1]['F1-Score']:.4f})")

---
## 12. Feature Importance (Tree-Based Models)

Let's see which features matter most for the three tree-based models, which expose `feature_importances_`.

In [None]:
# Get feature importance from tree-based models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models_to_plot = [
    ('Decision Tree', dt_model),
    ('Random Forest', rf_model),
    ('Gradient Boosting', gb_model)
]

for idx, (name, model) in enumerate(models_to_plot):
    importance_df = pd.DataFrame({
        'Feature': feature_cols,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False).head(10)
    
    axes[idx].barh(importance_df['Feature'], importance_df['Importance'], alpha=0.7)
    axes[idx].set_xlabel('Importance', fontsize=10)
    axes[idx].set_title(f'{name}\nTop 10 Features', fontsize=11, pad=15)
    axes[idx].invert_yaxis()
    axes[idx].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Common Important Features:")
print("   - PageValues: Higher values = more likely to purchase")
print("   - ProductRelated_Duration: Time on product pages matters")
print("   - ExitRates/BounceRates: Negative indicators")

---
## 13. Key Takeaways

### ✅ What We Learned:

**1. Model Diversity Matters**
- 8 different approaches to the same problem
- Each has unique strengths and weaknesses
- No single "best" algorithm for all problems

**2. Performance Patterns**
- **Ensemble methods** (RF, GB) typically top performers
- **Linear models** fast but limited for complex patterns
- **SVM** powerful but slower
- **KNN** simple but computationally expensive

**3. Trade-offs to Consider**
- **Accuracy vs Speed:** GB best accuracy, Linear fastest
- **Accuracy vs Interpretability:** DT interpretable, RF accurate
- **Training vs Prediction:** KNN instant training, slow prediction

**4. For This Problem (E-Commerce Purchase Prediction)**
- Tree-based models work best (non-linear patterns)
- PageValues is the most important feature
- Class imbalance (~16% purchases) handled well by RF/GB

**5. General Guidelines**
- Start with **Logistic Regression** (baseline)
- Try **Random Forest** (strong out-of-box)
- If you need max performance: **Gradient Boosting**
- If you need speed: **Linear models**
- If you need interpretability: **Logistic Regression** or **Decision Tree**

---

### 🎯 Model Selection Flowchart:

```
START
  |
  ├─ Need interpretability? → Logistic Regression or Decision Tree
  |
  ├─ Need real-time speed? → Linear models (Logistic, Ridge, SVC)
  |
  ├─ Small dataset (<10K)? → SVM, KNN, or Logistic Regression
  |
  ├─ Large dataset (>100K)? → Linear models or Random Forest
  |
  ├─ High dimensions? → Linear SVC or Logistic Regression (regularized)
  |
  ├─ Need best accuracy? → Gradient Boosting or Random Forest
  |
  └─ Unsure? → Start with Random Forest (reliable)
```

---

### 📚 Resources:

**Theory Notebooks:**
- `classification_fundamentals.ipynb` - Encoding & probabilities
- `theory_svm.ipynb` - SVM concepts & kernel trick
- `theory_knn.ipynb` - KNN intuition & choosing K

**What's Next:**
- Hyperparameter tuning (Week 10 Session 2)
- Threshold optimization
- Model deployment (Week 11)
- Neural networks (Week 11)

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*