# Session 8: Model Selection, Pipelines & Interpretability

In this notebook we'll work through the practical tools that turn individual ML algorithms into a **reliable, reproducible workflow**:

1. **Cross-Validation Strategies** — Stratified K-Fold
2. **Hyperparameter Tuning** — GridSearchCV & RandomizedSearchCV
3. **Sklearn Pipelines** — preventing data leakage
4. **Feature Selection** — SelectKBest & RFECV
5. **Support Vector Machines** — a new classifier with tunable hyperparameters
6. **Model Interpretability** — SHAP values

We'll use the **Breast Cancer Wisconsin** dataset throughout so we can focus on the tools rather than data wrangling.

In [None]:
# ============================================================
# Setup: Imports
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data & preprocessing
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold,
    cross_val_score, GridSearchCV, RandomizedSearchCV
)
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Pipelines & feature selection
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, RFECV

# Metrics
from sklearn.metrics import (
    accuracy_score, classification_report,
    confusion_matrix, ConfusionMatrixDisplay
)

# Interpretability
!pip install shap -q
import shap

# Reproducibility
np.random.seed(42)

print("All imports loaded successfully.")

In [None]:
# ============================================================
# Load and explore the dataset
# ============================================================

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')  # 0 = malignant, 1 = benign

print(f"Dataset shape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"\nClass distribution:")
print(y.value_counts().rename({0: 'malignant', 1: 'benign'}))
print(f"\nClass balance: {y.mean():.1%} benign")

In [None]:
# Hold out a final test set — we won't touch this until the end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")
print(f"Train class balance: {y_train.mean():.1%} benign")
print(f"Test class balance:  {y_test.mean():.1%} benign")

---
## Section 1: Cross-Validation Strategies

We introduced cross-validation in Session 4. Now we'll look under the hood and see **why the choice of CV strategy matters** — especially for classification.

### K-Fold vs. Stratified K-Fold

Regular K-Fold splits the data into K chunks without regard for the target variable. If the classes are imbalanced, some folds may end up with very different class proportions than the full dataset — leading to unreliable score estimates.

**Stratified K-Fold** preserves the class distribution in every fold.

In [None]:
# ============================================================
# Compare K-Fold vs Stratified K-Fold: class proportions
# ============================================================

kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Regular K-Fold — % benign in each fold's TEST set:")
for i, (train_idx, test_idx) in enumerate(kf.split(X_train)):
    pct = y_train.iloc[test_idx].mean()
    print(f"  Fold {i+1}: {pct:.1%}")

print(f"\nOverall training set: {y_train.mean():.1%}")

print("\nStratified K-Fold — % benign in each fold's TEST set:")
for i, (train_idx, test_idx) in enumerate(skf.split(X_train, y_train)):
    pct = y_train.iloc[test_idx].mean()
    print(f"  Fold {i+1}: {pct:.1%}")

In [None]:
# ============================================================
# Impact on model scores: K-Fold vs Stratified K-Fold
# ============================================================

model = LogisticRegression(max_iter=5000, random_state=42)

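# Scale for logistic regression. Caveat: fitting the scaler on all of X_train
# before cross-validating lets each validation fold influence the scaling.
# Section 3 shows how a Pipeline avoids this.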
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

scores_kf = cross_val_score(model, X_train_scaled, y_train, cv=kf, scoring='accuracy')
scores_skf = cross_val_score(model, X_train_scaled, y_train, cv=skf, scoring='accuracy')

print("Logistic Regression accuracy scores per fold:")
print(f"  K-Fold:           {scores_kf}")
print(f"    Mean: {scores_kf.mean():.4f}  Std: {scores_kf.std():.4f}")
print(f"  Stratified K-Fold: {scores_skf}")
print(f"    Mean: {scores_skf.mean():.4f}  Std: {scores_skf.std():.4f}")
print("\nCompare the Std values: stratified folds often, but not always, vary less across folds.")

**Key takeaway:** For classification, prefer `StratifiedKFold`. Conveniently, `cross_val_score` and `GridSearchCV` already use stratified (but unshuffled) splitting when `cv` is an integer or `None` and the estimator is a classifier.
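
The default splitter can be inspected directly. The sketch below uses `sklearn.model_selection.check_cv`, the helper that resolves an integer `cv` into a splitter object. The toy labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Hypothetical toy labels: 20 samples, 25% positive
y_toy = np.array([1] * 5 + [0] * 15)

# What an integer cv resolves to for a classifier vs. a non-classifier
cv_clf = check_cv(5, y_toy, classifier=True)
cv_reg = check_cv(5, y_toy, classifier=False)

print(type(cv_clf).__name__)
print(type(cv_reg).__name__)
print("shuffle:", cv_clf.shuffle)
```

Because the resolved splitter does not shuffle, passing an explicit `StratifiedKFold(shuffle=True, random_state=...)` is the safer choice when rows may be ordered by class or by time.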

### Exercise 1: Cross-Validation

1. Create a `StratifiedKFold` with **10 folds** (shuffle=True, random_state=0).
2. Run `cross_val_score` using a `DecisionTreeClassifier(random_state=42)` on the **unscaled** `X_train` with `scoring='f1'`.
3. Print the mean and standard deviation of the F1 scores.
4. Now repeat with `cv=5`. Does using more folds reduce variance in the scores?

In [None]:
# Exercise 1: Your code here


In [None]:
#@title Click to reveal solution

# 1. Create StratifiedKFold with 10 folds
skf_10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# 2. Cross-validate a decision tree with F1 scoring
dt = DecisionTreeClassifier(random_state=42)
scores_10 = cross_val_score(dt, X_train, y_train, cv=skf_10, scoring='f1')

# 3. Print mean and std
print("10-Fold Stratified CV (F1):")
print(f"  Scores: {scores_10.round(4)}")
print(f"  Mean:   {scores_10.mean():.4f}")
print(f"  Std:    {scores_10.std():.4f}")

# 4. Repeat with 5 folds
skf_5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_5 = cross_val_score(dt, X_train, y_train, cv=skf_5, scoring='f1')

print("\n5-Fold Stratified CV (F1):")
print(f"  Scores: {scores_5.round(4)}")
print(f"  Mean:   {scores_5.mean():.4f}")
print(f"  Std:    {scores_5.std():.4f}")

print(f"\nStd across folds: {scores_10.std():.4f} (10-fold) vs {scores_5.std():.4f} (5-fold).")
print("More folds give each model more training data, but each validation fold is smaller,")
print("so per-fold scores are often noisier, and the total computation grows.")

---
## Section 2: Hyperparameter Tuning

Model **parameters** (weights, coefficients) are learned during training. **Hyperparameters** (max_depth, C, n_estimators) are set *before* training and control the model's capacity.

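Hyperparameters are not learned by the fitting procedure itself, and many of them (tree depth, kernel choice) are discrete, so gradient-based training cannot optimize them. We need to **search** for good values.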
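
To make the distinction concrete, here is a minimal sketch on synthetic data (the dataset and the value `C=0.5` are arbitrary choices for illustration). Hyperparameters are readable through `get_params()` before fitting, while learned parameters such as `coef_` only exist afterwards.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

clf = LogisticRegression(C=0.5, max_iter=1000)

# Hyperparameters: chosen by us, available before any data is seen
print("C before fit:", clf.get_params()["C"])
print("has coef_ before fit:", hasattr(clf, "coef_"))

clf.fit(X_toy, y_toy)

# Parameters: learned from the data, available only after fit
print("coef_ after fit:", clf.coef_.round(3))
print("C after fit:", clf.get_params()["C"])  # unchanged by training
```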

### GridSearchCV: Exhaustive Search

Grid search tries **every combination** of the hyperparameter values you specify, evaluating each with cross-validation.
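
As a quick illustration of what "every combination" means, scikit-learn's `ParameterGrid` expands a grid dictionary into the list of candidates that a grid search would evaluate. The grid values below are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative decision-tree grid
toy_grid = {
    "max_depth": [2, 3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

candidates = list(ParameterGrid(toy_grid))
print(len(candidates))   # 6 * 3 * 3 combinations
print(candidates[0])     # each candidate is a plain dict of settings
```

The candidate count is the product of the list lengths, so adding one more hyperparameter with four values would quadruple the number of fits.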

In [None]:
# ============================================================
# GridSearchCV on a Decision Tree
# ============================================================

param_grid = {
    'max_depth': [2, 3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

total_combos = 6 * 3 * 3
print(f"Total combinations to evaluate: {total_combos}")
print(f"With 5-fold CV, that's {total_combos * 5} model fits.\n")

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
    n_jobs=-1,           # use all CPU cores
    return_train_score=True
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV F1 score: {grid_search.best_score_:.4f}")

In [None]:
# ============================================================
# Inspect Grid Search results
# ============================================================

results = pd.DataFrame(grid_search.cv_results_)

# Top 10 configurations
cols = ['param_max_depth', 'param_min_samples_split', 'param_min_samples_leaf',
        'mean_test_score', 'std_test_score', 'mean_train_score', 'rank_test_score']

print("Top 10 configurations by CV F1 score:")
results[cols].sort_values('rank_test_score').head(10)

### RandomizedSearchCV: Sampling the Search Space

When the search space is large — especially with continuous hyperparameters — trying every combination is impractical. `RandomizedSearchCV` randomly samples a fixed number of combinations.
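
The sampling step can be seen in isolation with `ParameterSampler`, which draws candidate settings from lists and SciPy distributions. This is a small sketch with illustrative distributions, not a full search.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

toy_distributions = {
    "n_estimators": randint(50, 300),     # integers 50..299 (upper bound excluded)
    "max_depth": [3, 5, 7, 10, 15, None],
    "min_samples_split": randint(2, 20),  # integers 2..19
}

samples = list(ParameterSampler(toy_distributions, n_iter=5, random_state=42))
for s in samples:
    print(s)

# Size of the full discrete space, for comparison with the 5 draws
space_size = 250 * 6 * 18
print(space_size)
```

Five draws cover a tiny fraction of the space, which is the point: the budget is fixed by `n_iter`, not by the size of the space.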

In [None]:
# ============================================================
# RandomizedSearchCV on a Random Forest
# ============================================================
from scipy.stats import randint, uniform

# Sampling from distributions instead of short fixed lists makes the space far too
# large to enumerate (about 729,000 combinations here)
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': [3, 5, 7, 10, 15, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,          # try 50 random combinations
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
    n_jobs=-1,
    random_state=42,
    return_train_score=True
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV F1 score: {random_search.best_score_:.4f}")

In [None]:
# ============================================================
# Compare: Grid Search timing vs Random Search timing
# ============================================================
import time

# Grid search over a moderately-sized space
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}
total_grid = 3 * 4 * 3

start = time.time()
gs = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid_rf,
    cv=5, scoring='f1', n_jobs=-1
)
gs.fit(X_train, y_train)
grid_time = time.time() - start

start = time.time()
rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_distributions,
    n_iter=total_grid, cv=5, scoring='f1', n_jobs=-1, random_state=42
)
rs.fit(X_train, y_train)
rand_time = time.time() - start

print(f"Grid Search:   {total_grid} combos → best F1 = {gs.best_score_:.4f} ({grid_time:.1f}s)")
print(f"Random Search: {total_grid} combos → best F1 = {rs.best_score_:.4f} ({rand_time:.1f}s)")
print("\nSame budget, but random search covers a much larger space.")

---
## Section 3: Sklearn Pipelines

### The Data Leakage Problem

A subtle but critical mistake: if you scale (or impute, or encode) **before** splitting, statistics from the test set leak into your preprocessing. The model sees information it shouldn't, and your evaluation is too optimistic.

In [None]:
# ============================================================
# Data leakage demo: WRONG vs RIGHT
# ============================================================

model = LogisticRegression(max_iter=5000, random_state=42)

# --- WRONG: scale everything first, then split ---
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(X)  # fit on ALL data including test!
X_tr_wrong, X_te_wrong, y_tr, y_te = train_test_split(
    X_all_scaled, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_tr_wrong, y_tr)
acc_wrong = accuracy_score(y_te, model.predict(X_te_wrong))

# --- RIGHT: split first, then scale on train only ---
X_tr_right, X_te_right, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler_right = StandardScaler()
X_tr_right_scaled = scaler_right.fit_transform(X_tr_right)   # fit on train
X_te_right_scaled = scaler_right.transform(X_te_right)       # transform test
model.fit(X_tr_right_scaled, y_tr)
acc_right = accuracy_score(y_te, model.predict(X_te_right_scaled))

print(f"Accuracy with leakage:    {acc_wrong:.4f}")
print(f"Accuracy without leakage: {acc_right:.4f}")
print("\nThe difference may be small (or even zero) here, but on smaller or noisier datasets")
print("it can be substantial, and a leaky estimate cannot be trusted as a measure of generalization.")

### Pipelines: The Clean Solution

A `Pipeline` chains preprocessing and modeling into a single object. When you call `.fit()`, each step fits on the training data and transforms it before passing to the next step. When you call `.predict()`, it only transforms (no re-fitting).

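As long as every data-dependent preprocessing step lives inside the pipeline, and the pipeline is what gets cross-validated, this prevents preprocessing leakage **by construction**.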
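
The fit/predict behaviour described above can be reproduced by hand. The following is a minimal sketch on synthetic data (all variable names are illustrative) that checks a two-step pipeline against its manual equivalent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0, stratify=y_toy)

# Pipeline version
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual equivalent
scaler = StandardScaler().fit(X_tr)                     # statistics from training rows only
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
manual_pred = model.predict(scaler.transform(X_te))     # transform only, no refit

print(np.array_equal(pipe_pred, manual_pred))
```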

In [None]:
# ============================================================
# Building a Pipeline
# ============================================================

# Simple pipeline: scale → classify
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=5000, random_state=42))
])

# Fit and predict — scaler fits on train only, automatically
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)

print(f"Pipeline accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nPipeline steps: {pipe_lr.named_steps}")

In [None]:
# ============================================================
# Cross-validation with a Pipeline — leakage-free by design
# ============================================================

# The pipeline is re-fit from scratch in each fold
scores = cross_val_score(
    pipe_lr, X_train, y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1'
)

print(f"Pipeline + Stratified 5-Fold CV")
print(f"  F1 scores: {scores.round(4)}")
print(f"  Mean: {scores.mean():.4f} ± {scores.std():.4f}")
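
Leakage is not limited to scalers, and it can be much larger than a scaling effect. As a hedged illustration on pure-noise data (made up here, with labels unrelated to the features), the sketch below compares feature selection done before cross-validation with the same selection done inside a pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(50, 5000))   # pure noise: no real signal
y_noise = np.array([0, 1] * 25)         # labels unrelated to X_noise

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leaky: pick the 20 "best" features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(clf, X_leaky, y_noise, cv=cv).mean()

# Honest: selection is refit inside each training fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("model", clf)])
honest_acc = cross_val_score(pipe, X_noise, y_noise, cv=cv).mean()

print(f"leaky: {leaky_acc:.2f}, honest: {honest_acc:.2f}")
```

Since the data contain no signal, any accuracy well above chance in the leaky variant comes from the selection step having seen the validation labels.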

### Pipelines + Grid Search

The real power: you can tune hyperparameters of **any step** in the pipeline. Use the naming convention `stepname__parameter`.
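
The names a search can tune are exactly the keys returned by `get_params()`. A short sketch (the step names `scaler` and `model` are our own choice) shows how to list them and set them with the same double-underscore syntax.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every nested parameter name of the form step__parameter
tunable = sorted(k for k in pipe.get_params() if "__" in k)
print(tunable[:5])

# set_params accepts the same names that a search grid would use
pipe.set_params(model__C=0.1, scaler__with_mean=False)
print(pipe.named_steps["model"].C, pipe.named_steps["scaler"].with_mean)
```

A misspelled key raises an error, so printing the available names first is a cheap way to debug a grid.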

In [None]:
# ============================================================
# GridSearchCV with a Pipeline
# ============================================================

pipe_dt = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=42))
])

# Note the double-underscore notation: step_name__param_name
param_grid_pipe = {
    'model__max_depth': [2, 3, 5, 7, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

grid_pipe = GridSearchCV(
    pipe_dt, param_grid_pipe,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1', n_jobs=-1
)

grid_pipe.fit(X_train, y_train)

print(f"Best params: {grid_pipe.best_params_}")
print(f"Best CV F1:  {grid_pipe.best_score_:.4f}")
print(f"\nTest set F1: {grid_pipe.score(X_test, y_test):.4f}")

### Exercise 2: Build and Tune a Pipeline

1. Create a pipeline with two steps: `StandardScaler` (named `'scaler'`) and `RandomForestClassifier(random_state=42)` (named `'model'`).
2. Define a parameter grid that searches over:
   - `model__n_estimators`: [50, 100, 200]
   - `model__max_depth`: [3, 5, 10, None]
3. Run `GridSearchCV` with `scoring='f1'` and 5-fold stratified CV.
4. Print the best parameters and the test set accuracy of the best model.

In [None]:
# Exercise 2: Your code here


In [None]:
#@title Click to reveal solution

# 1. Create the pipeline
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# 2. Define parameter grid
param_grid_rf = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [3, 5, 10, None]
}

# 3. Run GridSearchCV
grid_rf = GridSearchCV(
    pipe_rf, param_grid_rf,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1', n_jobs=-1
)
grid_rf.fit(X_train, y_train)

# 4. Print results
print(f"Best parameters: {grid_rf.best_params_}")
print(f"Best CV F1: {grid_rf.best_score_:.4f}")

y_pred_rf = grid_rf.predict(X_test)
print(f"Test set accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"\nTest set classification report:")
print(classification_report(y_test, y_pred_rf, target_names=['malignant', 'benign']))

---
## Section 4: Feature Selection

More features don't always mean a better model. Irrelevant features add noise, correlated features cause instability, and fewer features make models simpler, faster, and easier to interpret.

We'll look at two practical approaches: **filter methods** and **wrapper methods**.

### Filter Methods: SelectKBest

Score each feature independently using a statistical test (e.g., ANOVA F-test), then keep the top K.
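
For intuition about what the score measures, the one-way ANOVA F statistic can be computed by hand as between-class variance divided by within-class variance. The sketch below uses made-up data and a hypothetical helper `anova_f`, then compares the result with `f_classif`.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = np.array([0] * 30 + [1] * 30)
# Hypothetical data: feature 0 shifts with the class, feature 1 is noise
X_toy = np.column_stack([
    rng.normal(loc=y_toy * 1.5, scale=1.0),
    rng.normal(size=60),
])

def anova_f(x, y):
    """One-way ANOVA F statistic for a single feature."""
    classes = np.unique(y)
    grand_mean = x.mean()
    ss_between = sum((y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between, df_within = len(classes) - 1, len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

manual = np.array([anova_f(X_toy[:, j], y_toy) for j in range(X_toy.shape[1])])
sk_scores, _ = f_classif(X_toy, y_toy)
print(manual.round(3), sk_scores.round(3))
```

Because each feature is scored on its own, this filter cannot detect features that are only useful in combination.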

In [None]:
# ============================================================
# SelectKBest: rank features by ANOVA F-score
# ============================================================

selector = SelectKBest(score_func=f_classif, k='all')  # score all features
selector.fit(X_train, y_train)

# Visualize feature scores
feature_scores = pd.DataFrame({
    'feature': X_train.columns,
    'f_score': selector.scores_,
    'p_value': selector.pvalues_
}).sort_values('f_score', ascending=False)

fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.barh(
    feature_scores['feature'],
    feature_scores['f_score'],
    color='steelblue'
)
ax.set_xlabel('ANOVA F-Score')
ax.set_title('Feature Importance: ANOVA F-Test Scores')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 features:")
print(feature_scores.head(10).to_string(index=False))

In [None]:
# ============================================================
# SelectKBest inside a Pipeline: how many features is enough?
# ============================================================

results_by_k = []

for k in range(1, X_train.shape[1] + 1):
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest(score_func=f_classif, k=k)),
        ('model', LogisticRegression(max_iter=5000, random_state=42))
    ])
    scores = cross_val_score(
        pipe, X_train, y_train,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        scoring='f1'
    )
    results_by_k.append({'k': k, 'mean_f1': scores.mean(), 'std_f1': scores.std()})

results_df = pd.DataFrame(results_by_k)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results_df['k'], results_df['mean_f1'], 'o-', color='steelblue', markersize=4)
ax.fill_between(
    results_df['k'],
    results_df['mean_f1'] - results_df['std_f1'],
    results_df['mean_f1'] + results_df['std_f1'],
    alpha=0.2, color='steelblue'
)
best_k = results_df.loc[results_df['mean_f1'].idxmax(), 'k']
ax.axvline(best_k, color='coral', linestyle='--', label=f'Best k={int(best_k)}')
ax.set_xlabel('Number of Features (k)')
ax.set_ylabel('Mean CV F1 Score')
ax.set_title('SelectKBest: F1 Score vs. Number of Features')
ax.legend()
plt.tight_layout()
plt.show()

print(f"Best k = {int(best_k)} with mean F1 = {results_df['mean_f1'].max():.4f}")

### Wrapper Methods: Recursive Feature Elimination (RFECV)

Instead of scoring features independently, RFECV uses a model to iteratively remove the weakest feature and evaluates performance at each step using cross-validation. It automatically selects the optimal number of features.
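
The elimination loop itself is simple. Below is a hand-rolled sketch of the "remove the weakest feature, refit, repeat" part only, using squared logistic-regression coefficients as the importance measure on synthetic data. It is not scikit-learn's implementation, and it omits the cross-validation that RFECV adds to choose how many features to keep.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

remaining = list(range(X_toy.shape[1]))   # column indices still in play
elimination_order = []

while len(remaining) > 1:
    model = LogisticRegression(max_iter=1000).fit(X_toy[:, remaining], y_toy)
    importance = model.coef_[0] ** 2                  # squared coefficient as importance
    weakest = remaining[int(np.argmin(importance))]   # least important remaining column
    remaining.remove(weakest)
    elimination_order.append(weakest)

print("eliminated (first to last):", elimination_order)
print("last feature standing:", remaining[0])
```

Each iteration needs a full refit, which is why wrapper methods cost far more than filter methods on wide datasets.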

In [None]:
# ============================================================
# RFECV with a Random Forest
# ============================================================

# RFECV needs a model with feature_importances_ or coef_
rf_for_rfe = RandomForestClassifier(n_estimators=100, random_state=42)

rfecv = RFECV(
    estimator=rf_for_rfe,
    step=1,                # remove 1 feature at a time
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
    n_jobs=-1
)

rfecv.fit(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"\nSelected features:")
selected = X_train.columns[rfecv.support_].tolist()
for f in selected:
    print(f"  • {f}")

print(f"\nDropped features:")
dropped = X_train.columns[~rfecv.support_].tolist()
for f in dropped:
    print(f"  ✗ {f}")

In [None]:
# ============================================================
# RFECV: Performance vs number of features
# ============================================================

fig, ax = plt.subplots(figsize=(10, 5))

n_features_range = range(1, len(rfecv.cv_results_['mean_test_score']) + 1)
means = rfecv.cv_results_['mean_test_score']
stds = rfecv.cv_results_['std_test_score']

ax.plot(n_features_range, means, 'o-', color='steelblue', markersize=4)
ax.fill_between(n_features_range, means - stds, means + stds, alpha=0.2, color='steelblue')
ax.axvline(rfecv.n_features_, color='coral', linestyle='--',
           label=f'Optimal: {rfecv.n_features_} features')
ax.set_xlabel('Number of Features')
ax.set_ylabel('Mean CV F1 Score')
ax.set_title('RFECV: Recursive Feature Elimination with Cross-Validation')
ax.legend()
plt.tight_layout()
plt.show()

### Exercise 3: Feature Selection

1. Build a pipeline with: `StandardScaler` → `SelectKBest(k=10)` → `LogisticRegression(max_iter=5000, random_state=42)`.
2. Use `GridSearchCV` to search over `selector__k` values of [5, 10, 15, 20, 25] with `scoring='f1'` and 5-fold stratified CV.
3. Print the best `k` and the corresponding F1 score.
4. Which features were selected at the best `k`? Print their names. *(Hint: after fitting, access the selector step from the best estimator with `grid.best_estimator_.named_steps['selector']`, then use `.get_support()` to get a boolean mask.)*

In [None]:
# Exercise 3: Your code here


In [None]:
#@title Click to reveal solution

# 1. Build pipeline
pipe_fs = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(score_func=f_classif)),
    ('model', LogisticRegression(max_iter=5000, random_state=42))
])

# 2. Grid search over k
param_grid_k = {'selector__k': [5, 10, 15, 20, 25]}

grid_fs = GridSearchCV(
    pipe_fs, param_grid_k,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1', n_jobs=-1
)
grid_fs.fit(X_train, y_train)

# 3. Best k and score
print(f"Best k: {grid_fs.best_params_['selector__k']}")
print(f"Best CV F1: {grid_fs.best_score_:.4f}")

# 4. Which features were selected?
best_selector = grid_fs.best_estimator_.named_steps['selector']
selected_mask = best_selector.get_support()
selected_features = X_train.columns[selected_mask].tolist()

print(f"\nSelected features ({len(selected_features)}):")
for f in selected_features:
    print(f"  • {f}")

---
## Section 5: Support Vector Machines

SVMs find the decision boundary that **maximizes the margin** between classes. With kernel functions, they can model non-linear boundaries without explicitly transforming features.

Key hyperparameters:
- **C**: tradeoff between margin width and training errors (large C = narrower margin, fewer training errors, higher risk of overfitting)
- **kernel**: linear, rbf, poly
- **gamma** (for rbf/poly): controls how far the influence of a single training point reaches
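
To see what "how far the influence reaches" means for the RBF kernel, the kernel value can be computed by hand as exp(-gamma * squared distance). The sketch below uses three made-up points and a hypothetical helper `rbf_by_hand`, and checks the result against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three hypothetical 2-D points: two close together, one far away
pts = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])

def rbf_by_hand(A, gamma):
    """K[i, j] = exp(-gamma * squared Euclidean distance between A[i] and A[j])."""
    sq_dist = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

for gamma in (0.1, 1.0, 10.0):
    K = rbf_by_hand(pts, gamma)
    assert np.allclose(K, rbf_kernel(pts, gamma=gamma))
    print(f"gamma={gamma}: near pair {K[0, 1]:.3f}, far pair {K[0, 2]:.3f}")
```

As gamma grows, similarity falls off faster with distance, so each training point affects only its immediate neighbourhood and the decision boundary can become more irregular.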

In [None]:
# ============================================================
# Visualize SVM decision boundaries (2D synthetic data)
# ============================================================
from sklearn.datasets import make_circles

# Create non-linearly separable data
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

kernels = ['linear', 'rbf', 'poly']

for ax, kernel in zip(axes, kernels):
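    # degree=2 only affects the poly kernel. The default odd degree (3) with coef0=0
    # cannot separate classes that are symmetric about the origin, such as these circles.
    svm = SVC(kernel=kernel, C=1.0, gamma='scale', degree=2, random_state=42)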
    svm.fit(X_circles, y_circles)

    # Create mesh for decision boundary
    xx, yy = np.meshgrid(
        np.linspace(X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5, 200),
        np.linspace(X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5, 200)
    )
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles,
               cmap='coolwarm', edgecolors='k', s=30)
    acc = svm.score(X_circles, y_circles)
    ax.set_title(f'{kernel} kernel (acc: {acc:.2f})')
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')

plt.suptitle('SVM Decision Boundaries: Different Kernels', y=1.02, fontsize=13)
plt.tight_layout()
plt.show()

print("The linear kernel can't separate these concentric circles.")
print("RBF and polynomial kernels handle non-linear boundaries naturally.")

In [None]:
# ============================================================
# Effect of C on decision boundary
# ============================================================
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))
C_values = [0.01, 1.0, 100.0]

for ax, C in zip(axes, C_values):
    svm = SVC(kernel='rbf', C=C, gamma='scale', random_state=42)
    svm.fit(X_demo, y_demo)

    xx, yy = np.meshgrid(
        np.linspace(X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1, 200),
        np.linspace(X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1, 200)
    )
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo,
               cmap='coolwarm', edgecolors='k', s=30)
    # Highlight support vectors
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=100, facecolors='none', edgecolors='gold', linewidths=1.5)
    acc = svm.score(X_demo, y_demo)
    ax.set_title(f'C={C} ({len(svm.support_vectors_)} SVs, acc: {acc:.2f})')
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')

plt.suptitle('Effect of C on SVM Decision Boundary (RBF kernel)', y=1.02, fontsize=13)
plt.tight_layout()
plt.show()

print("Small C → wide margin, more support vectors, tolerates more errors.")
print("Large C → narrow margin, fewer support vectors, tries to classify every point correctly.")
print("Gold circles = support vectors (the points that define the boundary).")

### Tuning an SVM with GridSearchCV + Pipeline

SVMs are **sensitive to feature scale**, so we always standardize first. We'll search over C, kernel, and gamma using a pipeline.

In [None]:
# ============================================================
# Full SVM pipeline with hyperparameter search
# ============================================================

pipe_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(random_state=42))
])

param_grid_svm = {
    'model__C': [0.1, 1, 10, 100],
    'model__kernel': ['linear', 'rbf'],
    'model__gamma': ['scale', 'auto', 0.01, 0.1]
}

# Note: gamma is ignored by the linear kernel, so this grid refits an identical
# linear model once per gamma value. That is harmless but wastes some fits.

grid_svm = GridSearchCV(
    pipe_svm, param_grid_svm,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
    n_jobs=-1,
    return_train_score=True
)

grid_svm.fit(X_train, y_train)

print(f"Best parameters: {grid_svm.best_params_}")
print(f"Best CV F1:      {grid_svm.best_score_:.4f}")
print(f"\nTest set results:")
y_pred_svm = grid_svm.predict(X_test)
print(classification_report(y_test, y_pred_svm, target_names=['malignant', 'benign']))
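
One way to avoid fitting the same linear model once per gamma value is to pass a list of dictionaries, one per kernel, which `GridSearchCV` also accepts as a parameter grid. The sketch below only counts candidates with `ParameterGrid`. The per-kernel split is a suggested alternative, not the grid used above.

```python
from sklearn.model_selection import ParameterGrid

single_dict = {
    "model__C": [0.1, 1, 10, 100],
    "model__kernel": ["linear", "rbf"],
    "model__gamma": ["scale", "auto", 0.01, 0.1],
}

# One dict per kernel: gamma is searched only where it has an effect
per_kernel = [
    {"model__kernel": ["linear"], "model__C": [0.1, 1, 10, 100]},
    {"model__kernel": ["rbf"], "model__C": [0.1, 1, 10, 100],
     "model__gamma": ["scale", "auto", 0.01, 0.1]},
]

print(len(ParameterGrid(single_dict)), "candidates vs", len(ParameterGrid(per_kernel)))
```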

### Exercise 4: SVM Tuning with RandomizedSearchCV

1. Create a pipeline: `StandardScaler` → `SVC(random_state=42, probability=True)`. *(We set `probability=True` because we'll need it for SHAP later.)*
2. Define a parameter distribution for `RandomizedSearchCV`:
   - `model__C`: `uniform(0.1, 100)` (SciPy's `uniform(loc, scale)`: continuous uniform on [0.1, 100.1])
   - `model__kernel`: `['linear', 'rbf', 'poly']`
   - `model__gamma`: `['scale', 'auto']`
3. Run `RandomizedSearchCV` with `n_iter=30`, `scoring='f1'`, and 5-fold stratified CV.
4. Print the best parameters, best CV F1 score, and the test set classification report.
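
A note on the distribution for `C`: SciPy's `uniform` takes `loc` and `scale`, not lower and upper bounds. The sketch below draws samples to show the resulting range. It also draws from `scipy.stats.loguniform`, which is an alternative we bring in for comparison rather than part of the exercise.

```python
from scipy.stats import loguniform, uniform

c_uniform = uniform(0.1, 100).rvs(size=10_000, random_state=0)         # loc=0.1, scale=100
c_loguniform = loguniform(0.1, 100).rvs(size=10_000, random_state=0)   # bounds 0.1 and 100

print(f"uniform range:   {c_uniform.min():.2f} to {c_uniform.max():.2f}")
print(f"share below C=1: uniform {(c_uniform < 1).mean():.1%}, "
      f"log-uniform {(c_loguniform < 1).mean():.1%}")
```

A uniform distribution rarely samples small values of `C`, so a log-uniform distribution is a common choice when a hyperparameter spans several orders of magnitude.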

In [None]:
# Exercise 4: Your code here


In [None]:
#@title Click to reveal solution
from scipy.stats import uniform

# 1. Create pipeline with probability=True
pipe_svm_ex = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(random_state=42, probability=True))
])

# 2. Define parameter distributions
param_dist_svm = {
    'model__C': uniform(0.1, 100),
    'model__kernel': ['linear', 'rbf', 'poly'],
    'model__gamma': ['scale', 'auto']
}

# 3. Run RandomizedSearchCV
random_svm = RandomizedSearchCV(
    pipe_svm_ex, param_dist_svm,
    n_iter=30,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_svm.fit(X_train, y_train)

# 4. Print results
print(f"Best parameters: {random_svm.best_params_}")
print(f"Best CV F1: {random_svm.best_score_:.4f}")

print(f"\nTest set classification report:")
y_pred_svm_ex = random_svm.predict(X_test)
print(classification_report(y_test, y_pred_svm_ex, target_names=['malignant', 'benign']))

---
## Section 6: Model Interpretability with SHAP

We've seen model-specific interpretability tools:
- Linear/logistic regression → **coefficients**
- Trees / random forests → **feature importances**

**SHAP** (SHapley Additive exPlanations) is **model-agnostic** — it works with *any* model. It assigns each feature a contribution to each individual prediction, based on Shapley values from game theory.

Two views:
- **Local:** Why did the model make *this specific* prediction?
- **Global:** Which features matter most *across all predictions*?
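
To build intuition for the Shapley values behind SHAP, they can be computed exactly for a tiny model by enumerating every coalition of features. This is a hand-rolled sketch, not the SHAP library's algorithm. The toy model, the input, and the all-zeros baseline (used as the value of "absent" features) are all assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    """Hypothetical model: one additive term plus one interaction."""
    return 2 * x[0] + x[1] * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition take the baseline value."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi

x, baseline = [1, 2, 3], [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))  # contributions add up to the gap
```

The interaction term is split equally between the two features involved, and the contributions sum to the difference between the prediction and the baseline prediction. Exact enumeration grows exponentially with the number of features, which is why practical explainers use faster model-specific or approximate methods.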

In [None]:
# ============================================================
# Train a model to interpret
# We'll use a Random Forest since it's fast with SHAP's TreeExplainer
# ============================================================

# Fit an RF with hand-picked settings on the full training set
rf_final = RandomForestClassifier(
    n_estimators=200, max_depth=10, random_state=42
)
rf_final.fit(X_train, y_train)

print(f"Random Forest test accuracy: {rf_final.score(X_test, y_test):.4f}")

In [None]:
# ============================================================
# SHAP: Compute Shapley values
# ============================================================

# TreeExplainer is optimized for tree-based models
explainer = shap.TreeExplainer(rf_final)
shap_values = explainer(X_test)

print(f"SHAP values shape: {shap_values.shape}")
print(f"  {shap_values.shape[0]} test samples × {shap_values.shape[1]} features")
print(f"\nEach sample gets a SHAP value per feature, explaining that feature's")
print(f"contribution to pushing the prediction above or below the base value.")

In [None]:
# ============================================================
# Global view: Summary / Beeswarm plot
# ============================================================
# Each dot = one feature for one sample
# Position on x-axis = SHAP value (impact on prediction)
# Color = feature value (red = high, blue = low)

shap.plots.beeswarm(shap_values[:, :, 1], max_display=15)  # class 1 = benign

**Reading the beeswarm plot:**
- Features are ranked by overall importance (top = most important).
- Each dot is one test sample. The x-axis shows the SHAP value: positive values push the prediction toward "benign," negative toward "malignant."
- The color shows the actual feature value for that sample — red = high, blue = low.
- This gives you a *global* view of which features matter and *how* they affect predictions.

In [None]:
# ============================================================
# Global view: Bar plot of mean absolute SHAP values
# ============================================================

shap.plots.bar(shap_values[:, :, 1], max_display=15)

In [None]:
# ============================================================
# Local view: Waterfall plot for a single prediction
# ============================================================

# Pick a specific test sample
sample_idx = 0
true_label = 'benign' if y_test.iloc[sample_idx] == 1 else 'malignant'
pred_label = 'benign' if rf_final.predict(X_test.iloc[[sample_idx]])[0] == 1 else 'malignant'

print(f"Sample {sample_idx}: true = {true_label}, predicted = {pred_label}")
print(f"\nWaterfall plot shows how each feature pushed the prediction")
print(f"away from the base value (average prediction) toward the final output.\n")

shap.plots.waterfall(shap_values[sample_idx, :, 1], max_display=12)

In [None]:
# ============================================================
# Local view: a misclassified sample (if any)
# ============================================================

y_pred_final = rf_final.predict(X_test)
misclassified = np.where(y_pred_final != y_test.values)[0]

if len(misclassified) > 0:
    idx = misclassified[0]
    true_label = 'benign' if y_test.iloc[idx] == 1 else 'malignant'
    pred_label = 'benign' if y_pred_final[idx] == 1 else 'malignant'

    print(f"Misclassified sample {idx}: true = {true_label}, predicted = {pred_label}")
    print(f"SHAP waterfall shows which features led the model astray:\n")
    shap.plots.waterfall(shap_values[idx, :, 1], max_display=12)
else:
    print("No misclassified samples in the test set!")

In [None]:
# ============================================================
# Compare: SHAP importance vs built-in feature_importances_
# ============================================================

# Built-in (Gini / impurity-based)
builtin_imp = pd.Series(rf_final.feature_importances_, index=X_train.columns)
builtin_imp = builtin_imp.sort_values(ascending=False).head(10)

# SHAP-based (mean absolute SHAP value)
shap_imp = pd.Series(
    np.abs(shap_values[:, :, 1].values).mean(axis=0),
    index=X_train.columns
)
shap_imp = shap_imp.sort_values(ascending=False).head(10)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].barh(builtin_imp.index[::-1], builtin_imp.values[::-1], color='steelblue')
axes[0].set_title('Built-in Feature Importance (Gini)')
axes[0].set_xlabel('Importance')

axes[1].barh(shap_imp.index[::-1], shap_imp.values[::-1], color='coral')
axes[1].set_title('SHAP Feature Importance')
axes[1].set_xlabel('Mean |SHAP value|')

plt.suptitle('Feature Importance: Built-in vs SHAP', y=1.02, fontsize=13)
plt.tight_layout()
plt.show()

print("Rankings often agree on the top features but can differ in the middle.")
print("SHAP is generally more reliable — it measures actual impact on predictions,")
print("while Gini importance can overweight high-cardinality or noisy features.")

---
## Summary

**What we covered:**

| Tool | What it does | Key takeaway |
|---|---|---|
| **Stratified K-Fold** | Preserves class proportions in CV folds | Default for classification |
| **GridSearchCV** | Exhaustive hyperparameter search | Best for small search spaces |
| **RandomizedSearchCV** | Random sampling of hyperparameters | Better for large / continuous spaces |
| **Pipelines** | Chain preprocessing + model | Prevents data leakage by design |
| **SelectKBest** | Filter-based feature selection | Fast, model-independent |
| **RFECV** | Wrapper-based feature selection | Uses model performance, picks optimal k |
| **SVM** | Maximum-margin classifier with kernels | Powerful for non-linear boundaries |
| **SHAP** | Model-agnostic feature explanations | Local (per-prediction) + global (overall) |

**The workflow pattern:**

1. Build a **pipeline** (preprocessing + model)
2. Define a **search space** for hyperparameters
3. Use **GridSearchCV / RandomizedSearchCV** with **stratified CV**
4. Evaluate on **held-out test set**
5. Interpret with **SHAP**
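
Steps 1 to 4 of the pattern fit in a few lines. The following is a compact sketch with a deliberately tiny, illustrative search space. The model choice and grid values are assumptions, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# 1. Pipeline: preprocessing + model
pipe = Pipeline([("scaler", StandardScaler()), ("model", SVC())])

# 2. Search space (tiny, for illustration only)
grid = {"model__C": [0.1, 1, 10], "model__gamma": ["scale", 0.01]}

# 3. Search with stratified CV on the training split only
search = GridSearchCV(
    pipe, grid, scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)

# 4. One final evaluation on the held-out test set
test_f1 = f1_score(y_te, search.predict(X_te))
print(search.best_params_, f"CV F1={search.best_score_:.3f}", f"test F1={test_f1:.3f}")

# 5. Interpretation (for example with SHAP) would start from search.best_estimator_
```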