# Homework 5 — SVM on LFW (Linear & RBF) + Optional PCA

This starter notebook is intentionally **incomplete**. The overall workflow is:
- Load **LFW** (only identities with at least 50 images) and build a stratified **60/15/25** train/val/test split with a fixed seed. These two steps are provided.
- Implement evaluation utilities and tuning loops for **Linear SVM** and **RBF SVM**.
- (Optional) Add **PCA → SVM** experiments.
- Report **validation-selected** hyperparameters and **test accuracy**.

> You **may** use scikit-learn (e.g., `SVC`, `Pipeline`, `StandardScaler`, `PCA`).
> 
> **Important:** Tune hyperparameters on the **validation** set only; after choosing best params, retrain on **train+val** and evaluate once on **test**.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## 1) Load LFW people (min_faces_per_person=50) and flatten

This cell is **provided** so that everyone uses the same filtered dataset.

**Hint:** The first run downloads the dataset, which may take a while; later runs load it from the local cache.

In [None]:
lfw = fetch_lfw_people(min_faces_per_person=50, resize=0.4, color=False)
X_images = lfw.images                 # (n_samples, h, w)
X = lfw.data.astype(np.float32)       # (n_samples, h*w)
y = lfw.target                        # integer labels
target_names = lfw.target_names       # label -> name mapping
h, w = lfw.images.shape[1:3]

print("Images:", X_images.shape, "| Flattened:", X.shape, "| Labels:", y.shape)
print("Num classes:", len(target_names), "Names:", list(target_names))

## 2) Visualize some faces (optional but recommended)

Use the helper below to display a few samples to sanity-check the data and labels.

In [None]:
def plot_faces(images, labels, label_names, n_row=2, n_col=5, title=None):
    plt.figure(figsize=(1.8*n_col, 2.2*n_row))
    if title:
        plt.suptitle(title)
    for i in range(n_row*n_col):
        plt.subplot(n_row, n_col, i+1)
        plt.imshow(images[i], cmap="gray")
        plt.title(str(label_names[labels[i]]), fontsize=9)
        plt.xticks([]); plt.yticks([])
    plt.tight_layout(rect=[0,0,1,0.95])
    plt.show()

# Uncomment to preview
# plot_faces(X_images[:10], y[:10], target_names, n_row=2, n_col=5, title="Sample faces")

## 3) Stratified split: 60% train / 15% val / 25% test

The two-stage split below is **provided**; keep the same seed for fairness and reproducibility.

**Hint:** Both calls pass `stratify=` so that class proportions are preserved in every split.
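
As a quick check of the arithmetic behind the two-stage split, here is a minimal sketch on synthetic data (not LFW). The `*_d` and `*_demo` names and the class probabilities are made up for illustration. It shows that holding out 25%, then taking 20% of the remainder, yields 60/15/25 overall, and that stratification keeps the class proportions close in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for LFW: 1000 samples, 3 imbalanced classes (illustration only).
rng = np.random.default_rng(42)
y_demo = rng.choice(3, size=1000, p=[0.6, 0.3, 0.1])
X_demo = rng.normal(size=(1000, 5))

# Stage 1: hold out 25% as the test set.
X_trval_d, X_te_d, y_trval_d, y_te_d = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
# Stage 2: 20% of the remaining 75% becomes validation (0.20 * 0.75 = 0.15 overall).
X_tr_d, X_val_d, y_tr_d, y_val_d = train_test_split(
    X_trval_d, y_trval_d, test_size=0.20, stratify=y_trval_d, random_state=42
)

def class_props(labels):
    """Fraction of samples in each of the 3 classes."""
    return np.bincount(labels, minlength=3) / len(labels)

overall_props = class_props(y_demo)
for name, part in [("train", y_tr_d), ("val", y_val_d), ("test", y_te_d)]:
    print(f"{name}: n={len(part)} ({len(part) / len(y_demo):.2f}) "
          f"class proportions={np.round(class_props(part), 3)}")
```

With 1000 samples the parts contain 600, 150, and 250 samples. Without `stratify=`, the rarest class could end up over- or under-represented in the small validation part.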

In [None]:
# Step 1: train+val vs test (25% test)
X_trval, X_te, y_trval, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
)
# Step 2: train vs val (20% of trval -> 15% overall)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.20, stratify=y_trval, random_state=RANDOM_STATE
)
print("Train:", X_tr.shape, "Val:", X_val.shape, "Test:", X_te.shape)

## 4) Utilities — **YOU implement** evaluation & plots

**TODO A:** Implement `evaluate_model` that:
1. **Fits** the model on **TRAIN** only.
2. Predicts on **VAL** and **TEST**.
3. Returns a dict with the keys `val_acc`, `test_acc`, `val_pred`, and `test_pred`.

**TODO B (optional):** Implement `show_confusion` to display a confusion matrix.

**Hints:**
- Use `accuracy_score` from `sklearn.metrics`.
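- Return a dict so you can compare results across hyperparameters. Compare by `val_acc` only; the `test_acc` reported here is diagnostic and must not drive the selection.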
- Keep function signatures unchanged for reuse later.
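
To make the metric calls concrete, here is a small sketch on hand-written toy labels (not LFW). It shows what `accuracy_score` and `confusion_matrix` return, and one possible way to pick the best entry from a list of result dicts. The `*_demo` names and the two `val_acc` values are placeholders invented for illustration, not real results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 4 samples, 3 classes; exactly one sample of class 2 is misclassified as 1.
y_true_demo = np.array([0, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 2])

acc_demo = accuracy_score(y_true_demo, y_pred_demo)   # 3 of 4 correct -> 0.75
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class

# Result dicts can be collected in a list and compared afterwards.
# The val_acc values below are placeholders, not measured accuracies.
results_demo = [
    {"label": "run A", "val_acc": 0.81},
    {"label": "run B", "val_acc": 0.86},
]
best_demo = max(results_demo, key=lambda r: r["val_acc"])

print("accuracy:", acc_demo)
print(cm_demo)
print("best by val_acc:", best_demo["label"])
```

The off-diagonal entry in row 2, column 1 of the confusion matrix is the single misclassified sample.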

In [None]:
def evaluate_model(clf, X_tr, y_tr, X_val, y_val, X_te, y_te, label=""):
    """Fit on TRAIN only; eval on VAL and TEST; return metrics dict.
    TODO: implement steps described above.
    """
    #### YOUR CODE HERE ####
    # Example sketch:
    # clf.fit(X_tr, y_tr)
    # val_pred = clf.predict(X_val)
    # test_pred = clf.predict(X_te)
    # val_acc = accuracy_score(y_val, val_pred)
    # test_acc = accuracy_score(y_te, test_pred)
    # print(f"[{label}] Val Acc: {val_acc:.4f} | Test Acc: {test_acc:.4f}")
    # return {"val_acc": val_acc, "test_acc": test_acc, "val_pred": val_pred, "test_pred": test_pred}
    raise NotImplementedError("Implement evaluate_model as described in the cell above.")

def show_confusion(y_true, y_pred, label_names, title="Confusion matrix"):
    """Optional helper to plot confusion matrix.
    TODO: implement if you want to visualize per-class performance.
    """
    #### YOUR CODE HERE (optional) ####
    # disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=label_names, xticks_rotation=90)
    # plt.title(title)
    # plt.tight_layout(); plt.show()
    pass

## 5) Linear SVM — **YOU implement** tuning over C on validation

**Goal:** Build a pipeline `StandardScaler → SVC(kernel='linear')` and tune **C** on the validation set.

**TODO C:** Write the loop over a small grid of `C` values, call `evaluate_model`, and track the best by **validation accuracy**.

**TODO D:** After selecting the best `C`, **retrain on TRAIN+VAL** and evaluate once on **TEST**.

**Hints:**
- Use `Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='linear', C=C))])`.
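- For retraining on train+val, stack arrays: `np.vstack([X_tr, X_val])`, `np.hstack([y_tr, y_val])`. The `X_trval`, `y_trval` arrays from the split cell contain the same samples and can be used instead.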
- Keep a clear printout of your chosen best C and corresponding accuracies.
- Example grid: `[0.01, 0.1, 1, 10, 100]` (you can adjust).
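
The select-then-retrain pattern described above can be sketched on synthetic data as follows. This is only an illustration of the protocol under stated assumptions, not the reference solution: the data come from `make_classification`, and `make_linear_svm`, `val_scores`, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the LFW arrays (illustration only).
Xs, ys = make_classification(n_samples=400, n_features=20, n_informative=8,
                             n_classes=3, random_state=0)
Xs_trval, Xs_te, ys_trval, ys_te = train_test_split(
    Xs, ys, test_size=0.25, stratify=ys, random_state=0)
Xs_tr, Xs_val, ys_tr, ys_val = train_test_split(
    Xs_trval, ys_trval, test_size=0.20, stratify=ys_trval, random_state=0)

def make_linear_svm(C):
    """Hypothetical helper: a fresh, unfitted scaler + linear SVM pipeline."""
    return Pipeline([("scaler", StandardScaler()),
                     ("svc", SVC(kernel="linear", C=C))])

# Step 1: fit on train, score on validation, for every C in the grid.
val_scores = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = make_linear_svm(C).fit(Xs_tr, ys_tr)
    val_scores[C] = accuracy_score(ys_val, model.predict(Xs_val))

# Step 2: pick the C with the highest validation accuracy (test data untouched so far).
best_C_demo = max(val_scores, key=val_scores.get)

# Step 3: refit a fresh pipeline on train+val, then evaluate on test exactly once.
final_model = make_linear_svm(best_C_demo).fit(
    np.vstack([Xs_tr, Xs_val]), np.hstack([ys_tr, ys_val]))
test_acc_demo = accuracy_score(ys_te, final_model.predict(Xs_te))

print("validation accuracy per C:", val_scores)
print(f"best C = {best_C_demo}, test accuracy = {test_acc_demo:.4f}")
```

Building a new pipeline for each fit keeps the scaler statistics tied to the data that fit actually sees, so no validation or test information leaks into the scaling.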

In [None]:
C_grid_linear = [0.01, 0.1, 1, 10, 100]  # you may modify

best_linear = None
best_C = None

#### YOUR CODE HERE (TODO C): build loop over C, construct pipeline, call evaluate_model, track best by val_acc ####
raise NotImplementedError("Implement Linear SVM tuning loop over C and select best by validation accuracy.")

# (TODO D) Retrain best Linear SVM on TRAIN+VAL and evaluate on TEST
#### YOUR CODE HERE ####
raise NotImplementedError("Retrain Linear SVM on train+val with best C, evaluate on test, print final test accuracy.")

## 6) RBF SVM — **YOU implement** tuning over C and gamma on validation

**Goal:** Build `StandardScaler → SVC(kernel='rbf')` and tune over a grid of **C** and **gamma**.

**TODO E:** Write the nested loop over C and gamma, call `evaluate_model`, and track the best pair by **validation accuracy**.

**TODO F:** After selecting the best `(C, gamma)`, **retrain on TRAIN+VAL** and evaluate once on **TEST**.

**Hints:**
- Start with a small grid, then expand near the best region if needed.
- Example grids: `C in [0.1, 1, 10, 100]`, `gamma in ['scale', 0.001, 0.01, 0.1]` (you may adjust).
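
When mixing `'scale'` with numeric values in the gamma grid, it helps to know which number `'scale'` corresponds to. scikit-learn documents it as `1 / (n_features * X.var())`, computed on the data passed to `fit`. The sketch below checks this on synthetic data and also shows `itertools.product` as one way to flatten the nested loop; all `*_demo`-style names (`Xg`, `gamma_scale`, `grid_pairs`, and so on) are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the training set (illustration only).
Xg, yg = make_classification(n_samples=200, n_features=10, n_informative=5,
                             random_state=0)
Xg_std = StandardScaler().fit_transform(Xg)

# Numeric value that gamma='scale' resolves to for this input.
# After standardization X.var() is about 1, so this is about 1 / n_features.
gamma_scale = 1.0 / (Xg_std.shape[1] * Xg_std.var())

# Two models that should be equivalent: symbolic 'scale' vs. the explicit number.
svc_scale = SVC(kernel="rbf", C=1, gamma="scale").fit(Xg_std, yg)
svc_numeric = SVC(kernel="rbf", C=1, gamma=gamma_scale).fit(Xg_std, yg)
same_decisions = np.allclose(svc_scale.decision_function(Xg_std),
                             svc_numeric.decision_function(Xg_std))

# product() turns the nested (C, gamma) loop into a single iterable of pairs.
grid_pairs = list(product([0.1, 1, 10, 100], ["scale", 0.001, 0.01, 0.1]))

print(f"gamma='scale' resolves to {gamma_scale:.4f}")
print("same decision values:", same_decisions)
print("number of (C, gamma) pairs:", len(grid_pairs))
```

Because the value depends on the number of features, it is worth computing it for your own standardized training data before deciding which numeric gammas to try around it.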

In [None]:
C_grid_rbf = [0.1, 1, 10, 100]            # you may modify
gamma_grid = ["scale", 0.001, 0.01, 0.1]  # you may modify

best_rbf = None
best_params_rbf = None

#### YOUR CODE HERE (TODO E): nested loop over (C, gamma), build pipeline, call evaluate_model, track best by val_acc ####
raise NotImplementedError("Implement RBF SVM tuning loop over (C, gamma) and select best by validation accuracy.")

# (TODO F) Retrain best RBF SVM on TRAIN+VAL and evaluate on TEST
#### YOUR CODE HERE ####
raise NotImplementedError("Retrain RBF SVM on train+val with best (C, gamma), evaluate on test, print final test accuracy.")

## 7) (Optional) PCA → SVM — **YOU implement**

**Goal:** Insert `PCA` before SVM (Linear and/or RBF). Choose `k` via a target variance ratio.

**TODO G:**
1. Fit a temporary PCA on **TRAIN** only, standardized the same way as in the pipeline, to compute cumulative explained variance and select `k` for a target ratio (e.g., 0.95).
2. Build a pipeline `StandardScaler → PCA(n_components=k) → SVC(...)`.
3. Repeat tuning (C for Linear; C & gamma for RBF) using the **validation** set.
4. Retrain the best PCA+SVM model(s) on **TRAIN+VAL** and evaluate once on **TEST**.

**Hints:**
- Use `PCA(svd_solver='full', whiten=False, random_state=RANDOM_STATE)`.
- Compute `k` with `np.cumsum(pca_tmp.explained_variance_ratio_)` and `np.searchsorted`.
- Report `k`, chosen hyperparameters, and test accuracy.
- Be careful to avoid data leakage: fit PCA only on training data.
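
The `cumsum` / `searchsorted` recipe for choosing `k` can be sketched on synthetic data as follows. The correlated features built from `latent` are an assumption made for illustration, as are the names `Xp_tr`, `k_demo`, and `pca_auto`. The last step uses scikit-learn's option of passing a float `n_components` together with `svd_solver='full'`, which asks PCA to choose the number of components for that variance ratio itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features standing in for X_tr (illustration only):
# 5 latent factors mixed into 40 observed features, plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
Xp_tr = latent @ rng.normal(size=(5, 40)) + 0.3 * rng.normal(size=(300, 40))

target_ratio = 0.95

# Standardize first so PCA sees the same input it would see inside the pipeline.
Xp_tr_std = StandardScaler().fit_transform(Xp_tr)

# Full PCA on training data only, then the cumulative explained variance ratio.
pca_tmp = PCA(svd_solver="full").fit(Xp_tr_std)
cum = np.cumsum(pca_tmp.explained_variance_ratio_)

# searchsorted gives the first index where cum >= target_ratio;
# adding 1 converts that zero-based index into a component count.
k_demo = int(np.searchsorted(cum, target_ratio) + 1)

# Alternative: let PCA choose k from the target ratio directly.
pca_auto = PCA(n_components=target_ratio, svd_solver="full").fit(Xp_tr_std)

print(f"k from searchsorted: {k_demo}")
print(f"k chosen by PCA(n_components={target_ratio}): {pca_auto.n_components_}")
print(f"variance retained with k components: {cum[k_demo - 1]:.4f}")
```

Once `k` is fixed this way, it stays a plain integer in `PCA(n_components=k)` for all tuning runs, which keeps the comparison across `C` and `gamma` values consistent.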

In [None]:
use_pca = False   # set True to run PCA experiments
variance_ratio = 0.95

if use_pca:
    #### YOUR CODE HERE (TODO G): determine k from TRAIN only ####
    # Standardize first so that k matches what PCA sees inside the pipeline.
    # X_tr_std = StandardScaler().fit_transform(X_tr)
    # pca_tmp = PCA(svd_solver='full', whiten=False, random_state=RANDOM_STATE)
    # pca_tmp.fit(X_tr_std)
    # cum = np.cumsum(pca_tmp.explained_variance_ratio_)
    # k = int(np.searchsorted(cum, variance_ratio) + 1)  # first index reaching the ratio, as a count
    # print(f"PCA: retaining ~{variance_ratio*100:.0f}% variance ⇒ k={k} components")

    #### YOUR CODE HERE: build pipelines with PCA + SVM; tune on VAL; retrain on TR+VAL; evaluate on TEST ####
    raise NotImplementedError("Implement PCA→SVM experiments as described.")
else:
    print("Skipped PCA experiments. Set use_pca=True to enable.")

## 8) What to include in your PDF report
- Final **test accuracy** for:
  - Linear SVM (best C)
  - RBF SVM (best C, gamma)
  - (Optional) PCA + SVM variants (report k)
- The **chosen hyperparameters** (selected by validation accuracy)
- Short discussion of how **C** and **gamma** affected performance
- If you used PCA: its effect on accuracy and runtime
- (Optional) confusion matrices for best models

**Tip:** Keep code and results well organized so we can follow your steps.