# Tutorial 08: Model Selection Strategies

## Module 4: Model Development

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Establish appropriate baselines** for any ML problem
2. **Progress from simple to complex models** systematically
3. **Understand model trade-offs** across multiple dimensions
4. **Apply ensemble methods** to improve performance
5. **Make informed model selection decisions** based on requirements

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time
from dataclasses import dataclass
import pickle

from sklearn.datasets import load_breast_cancer, fetch_california_housing, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier

np.random.seed(42)
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
print("Libraries imported successfully!")

## 1. Introduction to Model Selection

Model selection depends on:
- **Problem type**: Classification, regression, ranking
- **Data characteristics**: Size, dimensionality, noise
- **Performance requirements**: Accuracy, latency, throughput
- **Operational constraints**: Interpretability, deployment

In [None]:
# Load datasets
cancer_data = load_breast_cancer()
X_cancer = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
y_cancer = cancer_data.target

X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

print(f"Classification Dataset: {len(X_train_clf)} training, {len(X_test_clf)} test samples")

In [None]:
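# Load the California housing data and keep only the first 5,000 rows
# (a plain slice of the dataset, not a random sample)
housing_data = fetch_california_housing()
X_housing = pd.DataFrame(housing_data.data[:5000], columns=housing_data.feature_names)
y_housing = housing_data.target[:5000]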

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

print(f"Regression Dataset: {len(X_train_reg)} training, {len(X_test_reg)} test samples")

## 2. Establishing Baselines

| Baseline Type | Description | When to Use |
|--------------|-------------|-------------|
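| **Random** | Uniformly random predictions | Sanity check |
| **Majority/Mean** | Always predict the most common class (classification) or the mean target (regression) | Minimum bar to beat |
| **Stratified** | Random predictions drawn from the training class distribution | Imbalanced data |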
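
To see why the majority baseline is the minimum bar, here is a minimal sketch on a hypothetical imbalanced label vector (950 negatives, 50 positives; these numbers are invented for illustration and are not part of the tutorial's datasets). A model that always predicts the majority class scores high on accuracy while finding none of the positives.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 950 negatives, 50 positives
y_imb = np.array([0] * 950 + [1] * 50)
X_imb = np.zeros((len(y_imb), 1))  # dummy estimators ignore the features

majority = DummyClassifier(strategy='most_frequent')
majority.fit(X_imb, y_imb)
y_pred_imb = majority.predict(X_imb)

majority_accuracy = accuracy_score(y_imb, y_pred_imb)   # 0.95
minority_recall = recall_score(y_imb, y_pred_imb)       # 0.0, no positive is ever predicted

print(f"Majority baseline: Accuracy={majority_accuracy:.4f}, Recall={minority_recall:.4f}")
```

Any real model on such data has to beat 0.95 accuracy before its accuracy means anything, which is why a second metric such as recall or F1 is worth tracking alongside it.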

In [None]:
@dataclass
class ModelResult:
    name: str
    accuracy: float = 0.0
    f1: float = 0.0
    auc: float = 0.0
    train_time: float = 0.0
    rmse: float = 0.0
    r2: float = 0.0

def evaluate_classifier(model, X_train, X_test, y_train, y_test, name):
    """Fit a classifier, time the fit, and score it on the test split."""
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    y_pred = model.predict(X_test)
    result = ModelResult(
        name=name,
        accuracy=accuracy_score(y_test, y_pred),
        f1=f1_score(y_test, y_pred, average='weighted'),
        train_time=train_time
    )
    # AUC is only computed for binary problems with probability outputs;
    # otherwise result.auc keeps its default of 0.0
    if hasattr(model, 'predict_proba'):
        try:
            y_proba = model.predict_proba(X_test)
            if y_proba.shape[1] == 2:
                result.auc = roc_auc_score(y_test, y_proba[:, 1])
        except Exception:
            pass
    return result

def evaluate_regressor(model, X_train, X_test, y_train, y_test, name):
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return ModelResult(
        name=name, rmse=np.sqrt(mse), r2=r2_score(y_test, y_pred), train_time=train_time
    )

In [None]:
# Classification baselines
all_clf_results = []
baselines = {
    'Random': DummyClassifier(strategy='uniform', random_state=42),
    'Most Frequent': DummyClassifier(strategy='most_frequent'),
    'Stratified': DummyClassifier(strategy='stratified', random_state=42)
}

print("Classification Baselines")
for name, model in baselines.items():
    result = evaluate_classifier(model, X_train_clf_scaled, X_test_clf_scaled, y_train_clf, y_test_clf, name)
    all_clf_results.append(result)
    print(f"{name}: Accuracy={result.accuracy:.4f}, F1={result.f1:.4f}")

In [None]:
# Regression baselines
all_reg_results = []
reg_baselines = {
    'Mean': DummyRegressor(strategy='mean'),
    'Median': DummyRegressor(strategy='median')
}

print("Regression Baselines")
for name, model in reg_baselines.items():
    result = evaluate_regressor(model, X_train_reg_scaled, X_test_reg_scaled, y_train_reg, y_test_reg, name)
    all_reg_results.append(result)
    print(f"{name}: RMSE={result.rmse:.4f}, R2={result.r2:.4f}")
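
The mean baseline has a convenient interpretation: on the data it was fitted to, its R2 is exactly 0 and its RMSE equals the standard deviation of the target, because R2 = 1 - SS_res / SS_tot and predicting the mean makes SS_res equal to SS_tot. A small sketch with made-up target values (the array below is purely illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative targets; the mean is 3.0
y_true = np.array([1.0, 2.0, 3.0, 6.0])
y_mean_pred = np.full_like(y_true, y_true.mean())

r2_mean = r2_score(y_true, y_mean_pred)
rmse_mean = np.sqrt(np.mean((y_true - y_mean_pred) ** 2))

print(f"R2 of mean predictor:   {r2_mean:.4f}")
print(f"RMSE of mean predictor: {rmse_mean:.4f}")
print(f"Std of target:          {np.std(y_true):.4f}")
```

On a held-out test split the R2 of the mean baseline is typically slightly below 0, since the training mean differs a little from the test mean. A model with R2 near 0 is therefore doing no better than a constant prediction.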

## 3. Simple Models

| Model | Pros | Cons |
|-------|------|------|
| Logistic Regression | Interpretable, fast | Linear boundary |
| Decision Tree | No scaling needed | Overfits easily |
| Naive Bayes | Very fast | Independence assumption |
| KNN | Simple | Slow prediction |

In [None]:
simple_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Naive Bayes': GaussianNB(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5)
}

print("Simple Models")
for name, model in simple_models.items():
    result = evaluate_classifier(model, X_train_clf_scaled, X_test_clf_scaled, y_train_clf, y_test_clf, name)
    all_clf_results.append(result)
    print(f"{name}: Accuracy={result.accuracy:.4f}, F1={result.f1:.4f}")

In [None]:
# Feature importance from Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_clf_scaled, y_train_clf)

importance = np.abs(lr.coef_[0])
sorted_idx = np.argsort(importance)[::-1][:10]

plt.figure(figsize=(10, 5))
plt.barh(range(10), importance[sorted_idx][::-1], color='steelblue')
plt.yticks(range(10), [cancer_data.feature_names[i] for i in sorted_idx[::-1]])
plt.xlabel('Absolute Coefficient')
plt.title('Top 10 Feature Importance (Logistic Regression)')
plt.tight_layout()
plt.show()

In [None]:
# Overfitting analysis for Decision Tree
depths = range(1, 15)
train_scores, test_scores = [], []

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train_clf_scaled, y_train_clf)
    train_scores.append(dt.score(X_train_clf_scaled, y_train_clf))
    test_scores.append(dt.score(X_test_clf_scaled, y_test_clf))

plt.figure(figsize=(10, 5))
plt.plot(depths, train_scores, 'b-', label='Training', linewidth=2)
plt.plot(depths, test_scores, 'r-', label='Test', linewidth=2)
plt.fill_between(depths, train_scores, test_scores, alpha=0.2)
plt.xlabel('Tree Depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree Overfitting Analysis')
plt.legend()
plt.show()
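# Note: this depth is chosen on the test set, so the matching test accuracy is optimistic
print(f"Depth with highest test accuracy: {depths[np.argmax(test_scores)]}")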
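
The cell above picks the depth by comparing test accuracy, which means the test set has influenced a modelling decision. One common alternative, sketched below, is to choose the depth with cross-validation on the training split only and touch the test split once at the end. The variable names (`cv_depths`, `cv_means`, `best_depth`) are illustrative, and the sketch reloads the data so it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Score each depth with 5-fold cross-validation on the training split only
cv_depths = list(range(1, 15))
cv_means = []
for depth in cv_depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X_tr, y_tr, cv=5, scoring='accuracy')
    cv_means.append(scores.mean())

best_depth = cv_depths[int(np.argmax(cv_means))]

# Refit on the full training split and evaluate on the test split once
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_tree.fit(X_tr, y_tr)
test_accuracy = final_tree.score(X_te, y_te)

print(f"Depth chosen by CV: {best_depth}")
print(f"Test accuracy at that depth: {test_accuracy:.4f}")
```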

In [None]:
# Simple regression models
simple_reg = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Decision Tree Reg': DecisionTreeRegressor(max_depth=5, random_state=42)
}

print("Simple Regression Models")
for name, model in simple_reg.items():
    result = evaluate_regressor(model, X_train_reg_scaled, X_test_reg_scaled, y_train_reg, y_test_reg, name)
    all_reg_results.append(result)
    print(f"{name}: RMSE={result.rmse:.4f}, R2={result.r2:.4f}")

## 4. Complex Models

- **Random Forest**: Ensemble of decision trees
- **Gradient Boosting**: Sequential error correction
- **SVM**: Non-linear boundaries with kernels
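
"Sequential error correction" can be made visible with `staged_predict`, which yields the ensemble's predictions after each boosting stage. The following is a rough sketch on the breast cancer data (reloaded here so the block is self-contained; `gb_demo`, `train_curve`, and `test_curve` are illustrative names):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields one prediction array per boosting stage (1..n_estimators trees)
train_curve = [accuracy_score(y_tr, pred) for pred in gb_demo.staged_predict(X_tr)]
test_curve = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:3d} trees: train={train_curve[n_trees - 1]:.4f}, "
          f"test={test_curve[n_trees - 1]:.4f}")
```

Training accuracy generally keeps climbing as trees are added, while test accuracy tends to level off; the gap between the two curves is one way to judge how many stages are worth keeping.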

In [None]:
complex_models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42)
}

print("Complex Models")
for name, model in complex_models.items():
    result = evaluate_classifier(model, X_train_clf_scaled, X_test_clf_scaled, y_train_clf, y_test_clf, name)
    all_clf_results.append(result)
    print(f"{name}: Accuracy={result.accuracy:.4f}, F1={result.f1:.4f}, AUC={result.auc:.4f}")

In [None]:
complex_reg = {
    'Random Forest Reg': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting Reg': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

print("Complex Regression Models")
for name, model in complex_reg.items():
    result = evaluate_regressor(model, X_train_reg_scaled, X_test_reg_scaled, y_train_reg, y_test_reg, name)
    all_reg_results.append(result)
    print(f"{name}: RMSE={result.rmse:.4f}, R2={result.r2:.4f}")

## 5. Ensemble Methods

| Method | Description | Best For |
|--------|-------------|----------|
| Bagging | Bootstrap samples | Reducing variance |
| Boosting | Sequential correction | Reducing bias |
| Stacking | Meta-learner | Maximum performance |

In [None]:
ensemble_models = {
    'Bagging': BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5), n_estimators=50, random_state=42),
    'AdaBoost': AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3), n_estimators=100, random_state=42),
    'Voting': VotingClassifier(
        estimators=[
            ('lr', LogisticRegression(max_iter=1000, random_state=42)),
            ('rf', RandomForestClassifier(n_estimators=50, random_state=42))  # seeded for reproducibility
        ],
        voting='soft'
    )
}

print("Ensemble Methods")
for name, model in ensemble_models.items():
    result = evaluate_classifier(model, X_train_clf_scaled, X_test_clf_scaled, y_train_clf, y_test_clf, name)
    all_clf_results.append(result)
    print(f"{name}: Accuracy={result.accuracy:.4f}, F1={result.f1:.4f}")
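
The `voting='soft'` setting averages the predicted class probabilities of the base models, whereas hard voting counts predicted labels. The two can disagree, as this small NumPy sketch shows. The probabilities are invented for illustration and do not come from the models above.

```python
import numpy as np

# Hypothetical class probabilities from three models for two samples
# Shape: (n_models, n_samples, n_classes)
probas = np.array([
    [[0.90, 0.10], [0.20, 0.80]],  # model A
    [[0.45, 0.55], [0.30, 0.70]],  # model B
    [[0.45, 0.55], [0.60, 0.40]],  # model C
])

# Soft voting: average the probabilities, then take the most likely class
soft_vote = probas.mean(axis=0).argmax(axis=1)

# Hard voting: each model casts one vote per sample, majority wins
hard_labels = probas.argmax(axis=2)  # shape (n_models, n_samples)
hard_vote = np.array([np.bincount(votes, minlength=2).argmax() for votes in hard_labels.T])

print("Soft vote:", soft_vote)  # [0 1]
print("Hard vote:", hard_vote)  # [1 1]
```

For the first sample, two models lean weakly towards class 1 but model A is strongly confident in class 0, so the averaged probability favours class 0 while the label count favours class 1. Soft voting only makes sense when the base models produce reasonably calibrated probabilities.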

In [None]:
# Stacking
stacking = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000, random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),  # seeded for reproducibility
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=LogisticRegression(),
    cv=5
)

result = evaluate_classifier(stacking, X_train_clf_scaled, X_test_clf_scaled, y_train_clf, y_test_clf, 'Stacking')
all_clf_results.append(result)
print(f"Stacking: Accuracy={result.accuracy:.4f}, F1={result.f1:.4f}")

## 6. Hyperparameter Tuning

In [None]:
# Grid Search
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 10]
}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_clf_scaled, y_train_clf)

print("Grid Search Results")
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test_clf_scaled, y_test_clf):.4f}")

In [None]:
# Random Search
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 10)
}

random_search = RandomizedSearchCV(rf, param_dist, n_iter=20, cv=5, scoring='accuracy', random_state=42, n_jobs=-1)
random_search.fit(X_train_clf_scaled, y_train_clf)

print("\nRandom Search Results")
print(f"Best params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
print(f"Test score: {random_search.score(X_test_clf_scaled, y_test_clf):.4f}")
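
Both searches above run on data that was scaled once using the whole training split, so each validation fold has already contributed to the scaler's statistics. For random forests this has no practical effect, but for scale-sensitive models it is a mild form of leakage. A possible remedy, sketched here with logistic regression as a stand-in model and an illustrative grid over `C`, is to put the scaler inside a `Pipeline` so it is refit within every fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is a pipeline step, so it is fitted on the training folds only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
c_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, c_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)  # raw, unscaled features go in

print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
print(f"Test score: {search.score(X_te, y_te):.4f}")
```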

## 7. Model Comparison

In [None]:
# Classification comparison
clf_df = pd.DataFrame([{'Model': r.name, 'Accuracy': r.accuracy, 'F1': r.f1} for r in all_clf_results])
clf_df = clf_df.sort_values('Accuracy', ascending=False)
print("Classification Model Comparison")
print(clf_df.to_string(index=False))

In [None]:
# Visualize results
fig, ax = plt.subplots(figsize=(12, 8))
top_models = clf_df.head(12)
colors = ['green' if acc > 0.95 else 'steelblue' if acc > 0.9 else 'coral' for acc in top_models['Accuracy']]
ax.barh(range(len(top_models)), top_models['Accuracy'], color=colors)
ax.set_yticks(range(len(top_models)))
ax.set_yticklabels(top_models['Model'])
ax.set_xlabel('Accuracy')
ax.set_title('Top Models by Accuracy')
ax.axvline(x=0.95, color='green', linestyle='--', alpha=0.7)
ax.axvline(x=0.90, color='orange', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Regression comparison
reg_df = pd.DataFrame([{'Model': r.name, 'RMSE': r.rmse, 'R2': r.r2} for r in all_reg_results])
reg_df = reg_df.sort_values('R2', ascending=False)
print("\nRegression Model Comparison")
print(reg_df.to_string(index=False))

## 8. Hands-on Exercise

In [None]:
# Create synthetic dataset
X_ex, y_ex = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=42)
X_train_ex, X_test_ex, y_train_ex, y_test_ex = train_test_split(X_ex, y_ex, test_size=0.2, random_state=42)

scaler_ex = StandardScaler()
X_train_ex = scaler_ex.fit_transform(X_train_ex)
X_test_ex = scaler_ex.transform(X_test_ex)

print("Exercise: Progressive Model Selection")
print(f"Dataset: {len(X_train_ex)} training, {len(X_test_ex)} test samples")

In [None]:
# Step 1: Baseline
baseline = DummyClassifier(strategy='stratified', random_state=42)
baseline.fit(X_train_ex, y_train_ex)
print(f"Step 1 - Baseline: {baseline.score(X_test_ex, y_test_ex):.4f}")

# Step 2: Simple Model
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_ex, y_train_ex)
print(f"Step 2 - Logistic Regression: {lr.score(X_test_ex, y_test_ex):.4f}")

# Step 3: Complex Model
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train_ex, y_train_ex)
print(f"Step 3 - Gradient Boosting: {gb.score(X_test_ex, y_test_ex):.4f}")

# Step 4: Tuning
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5], 'learning_rate': [0.05, 0.1]}
tuned_gb = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
tuned_gb.fit(X_train_ex, y_train_ex)
print(f"Step 4 - Tuned GB: {tuned_gb.score(X_test_ex, y_test_ex):.4f}")
print(f"Best params: {tuned_gb.best_params_}")

## 9. Summary

### Key Takeaways

1. **Always start with baselines** to establish minimum performance thresholds
2. **Progress from simple to complex** models systematically
3. **Consider trade-offs** between accuracy, speed, interpretability, and memory
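4. **Ensemble methods** often achieve the highest accuracy, at the cost of longer training and reduced interpretability
5. **Hyperparameter tuning** can improve performance, provided parameters are chosen by cross-validation and the test set is used only for the final check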

### Model Selection Guidelines

| Scenario | Recommended Approach |
|----------|---------------------|
| Quick prototype | Logistic Regression / Decision Tree |
| Interpretability needed | Logistic Regression / Small Decision Tree |
| Maximum accuracy | Gradient Boosting / Stacking |
| Large dataset | Random Forest / XGBoost |
| Low latency required | Simple models or compressed ensembles |
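
The speed and memory trade-offs mentioned in the takeaways can be estimated directly. Below is a rough sketch that times `predict` and measures the pickled size of two candidate models; the `candidates` and `profile` names are illustrative, timings depend on the machine, and pickled size is only a proxy for in-memory footprint.

```python
import pickle
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

profile = {}
n_repeats = 20
for name, model in candidates.items():
    model.fit(X, y)

    # Average wall-clock time of a full-batch predict call
    start = time.perf_counter()
    for _ in range(n_repeats):
        model.predict(X)
    latency_ms = (time.perf_counter() - start) / n_repeats * 1000

    # Serialized size as a proxy for the deployment footprint
    size_kb = len(pickle.dumps(model)) / 1024

    profile[name] = {'latency_ms': latency_ms, 'size_kb': size_kb}
    print(f"{name}: predict={latency_ms:.2f} ms, size={size_kb:.1f} KB")
```

Numbers like these, placed next to the accuracy table from Section 7, make it easier to decide whether a small accuracy gain justifies a slower or larger model.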