# CE49X: Introduction to Computational Thinking and Data Science for Civil Engineers
## Week 8: Support Vector Machines

**Instructor:** Dr. Eyuphan Koc  
**Department of Civil Engineering, Bogazici University**  
**Semester:** Spring 2026

Based on *Python Data Science Handbook* by Jake VanderPlas  
Chapter 5: Machine Learning (Section 5.07 - Support Vector Machines)  
https://jakevdp.github.io/PythonDataScienceHandbook/

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
%matplotlib inline

## Table of Contents

1. [Introduction to Support Vector Machines](#1.-Introduction-to-Support-Vector-Machines)
2. [Maximum Margin Classification](#2.-Maximum-Margin-Classification)
3. [Beyond Linear Boundaries: Kernel SVM](#3.-Beyond-Linear-Boundaries:-Kernel-SVM)
4. [Tuning the SVM: Softening Margins](#4.-Tuning-the-SVM:-Softening-Margins)
5. [Example: Face Recognition](#5.-Example:-Face-Recognition)
6. [Support Vector Machine Summary](#6.-Support-Vector-Machine-Summary)

---
## 1. Introduction to Support Vector Machines

### What are Support Vector Machines?

**Overview:**
- Support vector machines (SVMs) are a powerful family of **discriminative classifiers**
- Unlike generative models (Naive Bayes), SVMs directly find decision boundaries
- Can be used for both **classification** and **regression** tasks
- One of the most effective and widely used machine learning algorithms

**Key Applications:**
- Image classification and face recognition
- Text categorization and sentiment analysis
- Bioinformatics and medical diagnosis
- Structural health monitoring and fault detection

> **Key Idea**  
> Find the decision boundary that maximizes the margin between classes

### Motivating Example: Simple Classification

**The Problem:**

Consider two classes of points that are well separated in 2D space.

**Linear Discriminative Classifier:**
- Goal: Draw a straight line separating the two classes
- Create a model for classification based on this boundary
- For 2D data, this is a line; for higher dimensions, it's a hyperplane

**The Challenge:**
- There are **many possible dividing lines** that perfectly separate the classes
- Each different line would classify new points differently
- The simple intuition of "drawing a line between classes" is not enough
- **Question:** Which line is the best?

> **Key Insight: The SVM Solution**  
> Among all possible separating lines, choose the one that maximizes the margin to the nearest data points from each class

### [LIVE] Visualizing Multiple Decision Boundaries

Let us generate two well-separated classes and draw several lines that all perfectly separate them, illustrating the ambiguity that SVM resolves.

In [None]:
from sklearn.datasets import make_blobs

# Generate sample data with two well-separated clusters
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)

# Plot data with three arbitrary separating lines
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')

# Draw three plausible separating lines
xfit = np.linspace(-1, 3.5)
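# Set color and linestyle via keywords only; mixing a '-k' format string with
# linestyle=... defines the line style twice and triggers a matplotlib warning
ax.plot(xfit, 1 * xfit + 0.65, color='k', linewidth=1, linestyle='-', label='Line 1')
ax.plot(xfit, 0.5 * xfit + 1.6, color='k', linewidth=1, linestyle='--', label='Line 2')
ax.plot(xfit, -0.2 * xfit + 2.9, color='k', linewidth=1, linestyle='-.', label='Line 3')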

ax.set_xlim(-1, 3.5)
ax.set_ylim(-1, 6)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Multiple Valid Decision Boundaries')
ax.legend()
plt.show()

print("All three lines perfectly separate the training data.")
print("But they would classify new points differently!")
print("Which line should we trust?")

---
## 2. Maximum Margin Classification

### The Margin Concept

**SVM Solution: Maximize the Margin**

Rather than drawing a zero-width line, draw a **margin** of some width around each line, up to the nearest point.

**Margin Definition:**
- The perpendicular distance from the decision boundary to the nearest data point
- Creates a "buffer zone" around the decision boundary
- Larger margin = more confident classification
- Points exactly on the margin boundary are called **support vectors**

**Maximum Margin Classifier:**
- Among all possible separating boundaries, choose the one with the largest margin
- This is the **optimal** decision boundary
- Provides the most robust classification
- Less sensitive to small perturbations in the data

> **Key Insight**  
> The line that maximizes the margin is the one we choose as the optimal model
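
To make the margin definition concrete, here is a minimal NumPy sketch that measures the margin of three candidate lines on a tiny hand-made dataset. The points, the three lines, and the helper `margin` are all hypothetical and chosen only for illustration; an SVM searches over all possible lines rather than a fixed list.

```python
import numpy as np

# Hypothetical toy data: two well-separated classes in 2D
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],   # class 0
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# Candidate boundaries written as w . x + b = 0
candidates = {
    "x + y = 3.5": (np.array([1.0, 1.0]), -3.5),
    "x = 2":       (np.array([1.0, 0.0]), -2.0),
    "y = 1.5":     (np.array([0.0, 1.0]), -1.5),
}

def margin(w, b, X):
    """Perpendicular distance from the line w . x + b = 0 to the nearest point."""
    return np.min(np.abs(X @ w + b) / np.linalg.norm(w))

margins = {}
for name, (w, b) in candidates.items():
    # A line separates the classes if class 1 is on the positive side, class 0 on the negative
    separates = np.all((X @ w + b > 0) == (y == 1))
    margins[name] = margin(w, b, X)
    print(f"{name:12s} separates: {separates}, margin = {margins[name]:.3f}")

best = max(margins, key=margins.get)
print("Largest margin:", best)
```

All three lines separate the classes, but their margins differ; the maximum margin classifier would prefer the one with the largest value.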

### [TOGETHER] Visualizing Margins

The following visualization shows the optimal decision boundary (solid line), the margin boundaries (dashed lines), and the support vectors (circled points).

In [None]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC."""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X_grid = np.meshgrid(y, x)
    xy = np.vstack([X_grid.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X_grid.shape)

    # Plot decision boundary and margins
    ax.contour(X_grid, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

    # Plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none',
                   edgecolors='black')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

In [None]:
# Generate sample data
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)

# Create and fit the SVM model with a hard margin
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

# Plot decision boundary with margins
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model, ax)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('SVM Decision Boundary and Margins')
plt.show()

print("Solid line: optimal decision boundary")
print("Dashed lines: margin boundaries")
print("Circled points: support vectors")

### Support Vectors: The Pivotal Points

**What are Support Vectors?**

**Definition:**
- The training points that lie exactly on the margin boundaries
- These are the "pivotal elements" of the fit
- They give the algorithm its name: **Support Vector Machine**
- Only these points determine the decision boundary

**Key Property:**
- For the fit, **only the positions of the support vectors matter**
- Points far from the margin don't affect the model
- Can add/remove non-support-vector points without changing the boundary
- This makes SVM very efficient and robust

**Mathematical Insight:**
- Non-support vectors don't contribute to the loss function
- Their position and number don't matter (as long as they're on the correct side)
- This is why SVM is **insensitive to points** that lie far from the boundary on the correct side

> **Example: Efficiency**  
> Even with thousands of training points, often only a handful are support vectors
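
One way to check that only the support vectors enter the model is to rebuild the decision function by hand. The sketch below assumes a linear kernel, where the weight vector can be formed from Scikit-Learn's `dual_coef_` and `support_vectors_` attributes; the variable names `w`, `b`, and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
model = SVC(kernel='linear', C=1E10).fit(X, y)

# For a linear kernel, the weight vector is a weighted sum of the support vectors only
w = model.dual_coef_ @ model.support_vectors_   # shape (1, 2)
b = model.intercept_                            # shape (1,)

# Rebuild the decision function without using any non-support-vector point
manual_scores = (X @ w.T).ravel() + b
sklearn_scores = model.decision_function(X)

print("Support vectors used:", len(model.support_vectors_), "of", len(X))
print("Max difference:", np.abs(manual_scores - sklearn_scores).max())
```

The two sets of scores should agree up to floating-point error, even though most training points never appear in the computation of `w`.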

In [None]:
# Display the support vectors
print("Support Vectors:")
print(model.support_vectors_)
print(f"\nNumber of support vectors: {len(model.support_vectors_)}")
print(f"Total training points: {len(X)}")
print(f"Only {len(model.support_vectors_)} out of {len(X)} points define the boundary!")

### Fitting an SVM in Scikit-Learn

In [None]:
from sklearn.svm import SVC  # "Support Vector Classifier"
from sklearn.datasets import make_blobs

# Generate sample data
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)

# Create and fit the SVM model
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

# The support vectors are stored here
print("Support vector coordinates:")
print(model.support_vectors_)

# Make predictions on the data
predictions = model.predict(X)
print(f"\nTraining accuracy: {accuracy_score(y, predictions):.2%}")

**Key Parameters:**
- `kernel='linear'`: Use linear (straight line/plane) decision boundary
- `C=1E10`: Very large C means hard margin (we will discuss this later)

### SVM Model Stability

**Insensitivity to Distant Points:**

A key strength of SVM: the model is determined only by support vectors.

**Experiment:**
- Train SVM on 60 points from a dataset
- Identify the support vectors
- Train SVM on 120 points from the same dataset
- The same support vectors appear!
- The decision boundary remains unchanged

**Implications:**
- Adding more points that lie outside the margin on the correct side doesn't change the model
- The fit does not depend on how many points lie far from the boundary
- Focus is on the "difficult" points near the boundary
- Computationally efficient: only support vectors matter

> **Key Insight: Robustness**  
> This insensitivity to distant points is one of the key strengths of the SVM model. It focuses on what matters: the boundary region.

In [None]:
def plot_svm(N=10, ax=None):
    """Plot SVM with N training points."""
    X, y = make_blobs(n_samples=200, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)

    ax = ax or plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 6)
    plot_svc_decision_function(model, ax)

# Compare models with different numbers of points
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for axi, N in zip(ax, [60, 120]):
    plot_svm(N, axi)
    axi.set_title(f'N = {N}')

plt.show()

print("Notice: The support vectors and decision boundary remain the")
print("same even when we double the training data!")

---
## 3. Beyond Linear Boundaries: Kernel SVM

### Limitation of Linear Boundaries

**When Linear SVM Fails:**

Many real-world datasets are **not linearly separable**.

**Example: Concentric Circles**
- Inner circle = Class 1
- Outer circle = Class 2
- No straight line can separate these classes
- Linear SVM will perform poorly

**The Challenge:**
- Real data often has complex, non-linear patterns
- Linear boundaries are too restrictive
- Need more flexible decision boundaries
- But we want to keep SVM's advantages (margin maximization, support vectors)

> **Key Insight: Solution -- Kernel Methods**  
> Project the data into a higher-dimensional space where it becomes linearly separable, then apply linear SVM in that space

In [None]:
from sklearn.datasets import make_circles

# Generate concentric circles
X, y = make_circles(100, factor=.1, noise=.1)

# Visualize the data
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Non-Linearly Separable Data (Concentric Circles)')
plt.show()

### The Kernel Trick: Intuition

**Basic Idea:**

Transform the data into a higher-dimensional space where classes become linearly separable.

**Example: 2D to 3D Transformation**

For concentric circles in 2D $(x, y)$:
- Compute a radial basis function centered on the origin: $r = e^{-(x^2 + y^2)}$
- Project to 3D: $(x, y) \to (x, y, r)$
- Points near the origin get $r$ close to 1; points farther out get smaller $r$
- In 3D, classes become linearly separable by a plane

**General Strategy:**
1. Apply basis function transformation to features
2. Create higher-dimensional representation
3. Apply linear SVM in transformed space
4. Decision boundary is non-linear in original space

> **Example: Kernel Functions**  
> A kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ computes similarity between data points, implicitly performing this transformation without explicitly constructing the high-dimensional space
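
A small worked example may help show what "implicitly" means here. The sketch below uses a degree-2 polynomial kernel (the polynomial form with $\gamma = 1$, $r = 0$, $d = 2$) rather than the RBF kernel, because its feature map is short enough to write out. The functions `phi` and `poly_kernel` and the two sample points are illustrative assumptions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # inner product in the 3D feature space
implicit = poly_kernel(a, b)        # computed entirely in the original 2D space

print("Explicit feature-space inner product:", explicit)
print("Kernel evaluated in the original space:", implicit)
```

Both routes give the same number, but the kernel never builds the 3D vectors. For the RBF kernel, the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.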

In [None]:
# Visualize the kernel transformation: 2D -> 3D
# Compute radial basis function for visualization
r = np.exp(-(X ** 2).sum(1))

from mpl_toolkits import mplot3d

fig = plt.figure(figsize=(14, 5))

# Original 2D data
ax1 = fig.add_subplot(121)
ax1.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_title('Original 2D Data (NOT linearly separable)')

# Projected 3D data
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('r = exp(-(x^2 + y^2))')
ax2.set_title('Projected 3D Data (linearly separable!)')

plt.tight_layout()
plt.show()

print("In 3D space, the classes become linearly separable!")
print("A flat plane can now separate the two groups.")

### Radial Basis Function (RBF) Kernel

**Most Popular Kernel: RBF (Gaussian Kernel)**

**Idea:**
- Create a basis function centered at **every** training point
- Each basis function is Gaussian: $\phi_i(\mathbf{x}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}_i\|^2\right)$
- Transform $N$ points into $N$ dimensions
- Let SVM find the best combination

**The Kernel Trick:**
- Problem: Projecting $N$ points to $N$ dimensions is expensive!
- Solution: Use the **kernel trick**
- Compute inner products in high-dimensional space **implicitly**
- Never actually construct the $N$-dimensional representation
- Makes the computation tractable

**RBF Kernel Formula:**

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$$

where $\gamma$ controls the "reach" of each training point.
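
The formula can be evaluated directly with NumPy. This is a minimal sketch on three made-up points; the helper name `rbf_kernel_matrix` is hypothetical and is not a Scikit-Learn function.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows in X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Hypothetical points: two close together, one far away
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [3.0, 4.0]])

for gamma in (0.1, 1.0):
    K = rbf_kernel_matrix(X, gamma)
    print(f"gamma = {gamma}")
    print(np.round(K, 4))
```

Each point has similarity 1 with itself, nearby points have similarity close to 1, and distant points have similarity close to 0. Increasing $\gamma$ makes the similarity fall off faster with distance, which is what "reach" refers to.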

### [LIVE] Kernel SVM Visualization: Linear vs. RBF

Let us compare a linear kernel (which fails on concentric circles) with the RBF kernel (which succeeds).

In [None]:
# Generate concentric circles
X, y = make_circles(100, factor=.1, noise=.1)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: Linear kernel (fails)
clf_linear = SVC(kernel='linear').fit(X, y)
axes[0].scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf_linear, axes[0], plot_support=False)
axes[0].set_title(f'Linear Kernel (Accuracy: {clf_linear.score(X, y):.2%})')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Right: RBF kernel (succeeds)
clf_rbf = SVC(kernel='rbf', C=1E6).fit(X, y)
axes[1].scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf_rbf, axes[1])
axes[1].set_title(f'RBF Kernel (Accuracy: {clf_rbf.score(X, y):.2%})')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("Left: Linear kernel CANNOT separate concentric circles.")
print("Right: RBF kernel creates a circular decision boundary and succeeds!")
print("Non-linear boundaries allow SVM to handle complex patterns.")

### Common Kernel Functions

Scikit-Learn provides several kernels:

| Kernel | Formula | Use Case |
|--------|---------|----------|
| Linear | $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$ | Linearly separable data |
| Polynomial | $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma\, \mathbf{x}_i \cdot \mathbf{x}_j + r)^d$ | Polynomial relationships |
| RBF (Gaussian) | $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$ | General purpose, most popular |
| Sigmoid | $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma\, \mathbf{x}_i \cdot \mathbf{x}_j + r)$ | Similar to neural networks |

**Choosing a Kernel:**
- **Linear:** Start here for high-dimensional data, fast and interpretable
- **RBF:** Default choice for non-linear problems, works well in most cases
- **Polynomial:** When you expect polynomial relationships, but can be slow
- **Custom:** You can define your own kernel function for specialized applications
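
As a sketch of the custom option: `SVC` accepts a callable for `kernel`, which receives two data matrices and must return the matrix of kernel values between their rows. The function `my_linear_kernel` below is a hypothetical example that simply reproduces the linear kernel, so the result can be compared with the built-in one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

def my_linear_kernel(A, B):
    """Hypothetical custom kernel: Gram matrix of dot products between rows of A and B."""
    return A @ B.T

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)

custom = SVC(kernel=my_linear_kernel, C=1.0).fit(X, y)
builtin = SVC(kernel='linear', C=1.0).fit(X, y)

same = np.array_equal(custom.predict(X), builtin.predict(X))
print("Custom and built-in linear kernels agree on the training data:", same)
```

A real custom kernel would replace the body of the function with a domain-specific similarity measure; it should be symmetric and positive semi-definite for the optimization to be well behaved.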

In [None]:
# Compare all kernel types on linearly separable data
X_blobs, y_blobs = make_blobs(n_samples=100, centers=2,
                              random_state=42, cluster_std=1.0)

kernels = ['linear', 'poly', 'rbf', 'sigmoid']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, kernel in enumerate(kernels):
    model = SVC(kernel=kernel, C=1.0, gamma='auto')
    model.fit(X_blobs, y_blobs)

    axes[idx].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_blobs, s=50, cmap='autumn')
    plot_svc_decision_function(model, axes[idx])
    axes[idx].set_title(f'{kernel.upper()} Kernel (Accuracy: {model.score(X_blobs, y_blobs):.2%})')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

---
## 4. Tuning the SVM: Softening Margins

### The Problem with Hard Margins

**Real Data is Messy:**

Our discussion so far assumed perfectly separable data. But what if:
- Classes overlap slightly?
- There are outliers?
- No perfect separation exists?

**Hard Margin SVM:**
- Requires all points to be on the correct side of the margin
- Cannot handle any misclassification
- Very sensitive to outliers
- May not find a solution if data is not separable

**Solution: Soft Margin SVM**
- Allow some points to be **within** or even **on the wrong side** of the margin
- Trade-off between: (1) wide margin, (2) few misclassifications
- Controlled by a parameter $C$
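
This trade-off is usually written as an optimization problem. The formulation below is the standard textbook soft-margin objective, included as background rather than taken from the source. The weight vector $\mathbf{w}$, offset $b$, and slack variables $\xi_i$ are notation introduced here, and labels are assumed to be coded as $y_i \in \{-1, +1\}$.

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

The first term favors a wide margin (the margin width is $2 / \|\mathbf{w}\|$). The second term charges a penalty for every point that falls inside the margin ($0 < \xi_i \le 1$) or on the wrong side of the boundary ($\xi_i > 1$), and $C$ sets how expensive those violations are.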

### The C Parameter

**Tuning Parameter: C**

The $C$ parameter controls the hardness of the margin:

**Large C (e.g., $C = 10^{10}$):**
- Hard margin
- Penalizes misclassifications heavily
- Narrow margin, tries to classify all training points correctly
- Risk of overfitting
- Sensitive to outliers

**Small C (e.g., $C = 0.1$):**
- Soft margin
- Tolerates some misclassifications
- Wide margin, better generalization
- Risk of underfitting
- Robust to outliers
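
The effect of $C$ on the margin can also be measured numerically. The following sketch assumes a linear kernel, where the margin width is $2 / \|\mathbf{w}\|$ and $\mathbf{w}$ is available as `coef_`; the dataset and the two $C$ values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)

results = {}
for C in (0.1, 10.0):
    model = SVC(kernel='linear', C=C).fit(X, y)
    width = 2 / np.linalg.norm(model.coef_)   # distance between the two margin lines
    results[C] = (width, len(model.support_vectors_))
    print(f"C = {C:>4}: margin width = {width:.3f}, "
          f"support vectors = {results[C][1]}")
```

The smaller $C$ should report the wider margin, typically with more support vectors, because more points are allowed to sit inside it.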

### [LIVE] Soft Margin Visualization

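Compare a larger C (harder, narrower margin with few points inside it) with a smaller C (softer, wider margin that lets more points fall inside it). With a linear kernel the boundary is a straight line in both cases; only its position and margin width change.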

In [None]:
# Generate data with some overlap
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=0.8)

# Compare different C values
C_values = [10.0, 0.1]

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for axi, C in zip(ax, C_values):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    # plot_svc_decision_function already circles the support vectors
    plot_svc_decision_function(model, axi)
    axi.set_title(f'C = {C:.1f}', size=14)

plt.show()

print("Left (C=10.0): Harder margin -- narrower, few points allowed inside the margin")
print("Right (C=0.1): Softer margin -- wider, more points allowed inside the margin")
print("\nSoft margin often generalizes better to new, unseen data.")

### Finding Optimal C

- Use cross-validation to tune $C$
- Try logarithmic range: $[10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 100, 1000]$
- Balance between training accuracy and generalization

### Hyperparameter Tuning Strategy

**Two Main Hyperparameters to Tune:**

**1. Kernel Parameter ($\gamma$ for RBF kernel):**
- Controls the "reach" of each training point
- Large $\gamma$: tight fit, each point has small influence
- Small $\gamma$: loose fit, each point has wide influence
- Typical range: $[10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1]$

**2. Regularization Parameter ($C$):**
- Controls margin hardness
- Large $C$: hard margin, fewer errors on training data
- Small $C$: soft margin, wider margin
- Typical range: $[10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3]$

**Tuning Approach:**
- Use **grid search** with cross-validation
- Try combinations of both parameters
- Select the combination with best cross-validation score

### [PRACTICE] Effect of Gamma Parameter (RBF Kernel)

Gamma controls the "reach" of each training point:
- Large gamma = tight fit, small influence region
- Small gamma = loose fit, large influence region

In [None]:
# Generate non-linear data
X_circles, y_circles = make_circles(100, factor=.1, noise=.1)

# Test different gamma values
gamma_values = [0.1, 1.0, 10.0, 100.0]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, gamma in enumerate(gamma_values):
    model = SVC(kernel='rbf', C=1.0, gamma=gamma)
    model.fit(X_circles, y_circles)

    axes[idx].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, s=50, cmap='autumn')
    plot_svc_decision_function(model, axes[idx])
    axes[idx].set_title(f'Gamma = {gamma} (Accuracy: {model.score(X_circles, y_circles):.2%})')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("Note: Very large gamma can lead to overfitting!")
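
Training accuracy alone cannot reveal overfitting, so a held-out test set is needed to see it. The sketch below is one possible check, assuming the `make_moons` dataset and a default train/test split; the specific gamma values are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data, split into training and held-out test sets
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for gamma in (0.1, 1.0, 10.0, 1000.0):
    model = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_train, y_train)
    scores[gamma] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"gamma = {gamma:>6}: train = {scores[gamma][0]:.2%}, "
          f"test = {scores[gamma][1]:.2%}")
```

A growing gap between training and test accuracy at large gamma is the signature of overfitting: the model memorizes individual training points instead of learning the overall shape of the classes.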

---
## 5. Example: Face Recognition

### [TOGETHER] Application: Face Recognition

**Real-World Example:**

Use SVM to recognize faces from images.

**Dataset: Labeled Faces in the Wild**
- Several thousand photos of public figures
- Multiple photos per person
- Built-in fetcher in Scikit-Learn
- Select people with at least 60 photos for sufficient training data

**Total:** 8 classes, 1,348 images of size $62 \times 47$ pixels (2,914 features per image)

### Loading the Face Dataset

In [None]:
from sklearn.datasets import fetch_lfw_people

# Fetch faces with at least 60 photos per person
faces = fetch_lfw_people(min_faces_per_person=60)

print("Categories (people):")
print(faces.target_names)
print(f"\nDataset shape: {faces.images.shape}")
print(f"Number of images: {faces.images.shape[0]}")
print(f"Image size: {faces.images.shape[1]} x {faces.images.shape[2]}")
print(f"Total features per image: {faces.images.shape[1] * faces.images.shape[2]}")

### Visualize Sample Faces

In [None]:
# Plot sample faces
fig, ax = plt.subplots(3, 5, figsize=(12, 8))
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
plt.suptitle('Sample Faces from Dataset', fontsize=16)
plt.tight_layout()
plt.show()

### Challenge: High-Dimensional Data

**The Dimensionality Problem:**
- Each image has $62 \times 47 = 2{,}914$ pixels
- Could use each pixel as a feature
- But raw pixels are not the best features
- Too many dimensions can cause overfitting
- Need feature extraction/dimensionality reduction

**Solution: Principal Component Analysis (PCA)**
- Extract the most important patterns (principal components)
- Reduce 2,914 dimensions to 150 dimensions
- Retain most of the information
- More meaningful features for classification

> **Example: Pipeline Approach**  
> Combine PCA (preprocessing) and SVM (classification) into a single pipeline for clean, efficient code
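
The claim that PCA retains most of the information can be checked through the explained variance. The sketch below uses the small digits dataset bundled with Scikit-Learn as a stand-in for the face images (an assumption made to avoid a download), and asks PCA for enough components to explain 95% of the variance, a threshold chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 8x8 digit images bundled with Scikit-Learn
digits = load_digits()
print("Original shape:", digits.data.shape)   # 64 pixel features per image

# Keep the smallest number of components that explains at least 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
reduced = pca.fit_transform(digits.data)

retained = pca.explained_variance_ratio_.sum()
print("Reduced shape:", reduced.shape)
print(f"Variance retained: {retained:.1%}")
```

The same check applies to the face pipeline: after fitting, `explained_variance_ratio_` on the PCA step indicates how much of the pixel variance the 150 components capture.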

### Building the Face Recognition Pipeline

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Create pipeline: PCA (dimension reduction) -> SVM (classification)
pca = PCA(n_components=150, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

print("Pipeline created: PCA(150 components) -> SVM(RBF kernel)")
print("Pipeline benefits: Preprocessing and model fit in one step, prevents data leakage")

In [None]:
# Split data into training and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(
    faces.data, faces.target, random_state=42
)

print(f"Training set size: {Xtrain.shape}")
print(f"Test set size: {Xtest.shape}")

### Grid Search for Hyperparameter Tuning

Find optimal values for C and gamma using cross-validation.

In [None]:
# Define parameter grid
param_grid = {
    'svc__C': [1, 5, 10, 50],
    'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]
}

# Perform grid search with cross-validation
grid = GridSearchCV(model, param_grid, cv=5)

print("Starting grid search...")
print(f"Testing {len(param_grid['svc__C']) * len(param_grid['svc__gamma'])} parameter combinations")
print("This may take a minute...\n")

%time grid.fit(Xtrain, ytrain)

print(f"\nBest parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.2%}")

# Note: Grid search tries all 16 combinations (4 x 4) with
# cross-validation to find the best parameters

### Evaluating Face Recognition Results

In [None]:
# Use the best model for predictions
best_model = grid.best_estimator_
yfit = best_model.predict(Xtest)

print(f"Test set accuracy: {best_model.score(Xtest, ytest):.2%}")

In [None]:
# Show sample predictions
fig, ax = plt.subplots(4, 6, figsize=(12, 8))
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    # Color correct predictions in black, wrong predictions in red
    color = 'black' if yfit[i] == ytest[i] else 'red'
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color=color)
plt.suptitle('Predicted Names (Incorrect Labels in Red)', size=14)
plt.tight_layout()
plt.show()

In [None]:
# Print classification report
print("Classification Report:")
print("=" * 70)
print(classification_report(ytest, yfit,
                            target_names=faces.target_names))

**Interpretation:**
- **Precision:** Of the faces predicted as person X, what fraction are correct?
- **Recall:** Of all faces of person X, what fraction did we find?
- **F1-Score:** Harmonic mean of precision and recall
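
These three definitions can be computed by hand from counts of true positives, false positives, and false negatives. The labels below are made up for illustration, with class 1 playing the role of "person X".

```python
import numpy as np

# Hypothetical labels for a two-class problem; class 1 plays the role of "person X"
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted X, really X
fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted X, really someone else
fn = np.sum((y_pred == 0) & (y_true == 1))   # really X, but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
# prints: precision = 0.800, recall = 0.667, F1 = 0.727
```

`classification_report` performs the same calculation once per class, treating each person in turn as "X".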

### Confusion Matrix for Face Recognition

In [None]:
import seaborn as sns

# Compute confusion matrix
mat = confusion_matrix(ytest, yfit)

# Plot confusion matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=True,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names,
            cmap='Blues')
plt.xlabel('True Label')
plt.ylabel('Predicted Label')
plt.title('Confusion Matrix for Face Recognition')
plt.tight_layout()
plt.show()

print("Key Observations:")
print("- Diagonal values = correct predictions")
print("- Off-diagonal values = misclassifications")
print("- Darker blue = more predictions")

### Limitations and Modern Approaches

**Limitations of This Example:**
- Dataset has pre-cropped faces; real-world needs face detection first
- Photos are mostly press images of public figures; everyday images can vary more in pose, lighting, and quality
- Limited training data (60-530 images per person)

**Modern Approaches:**
- Use CNNs instead of PCA+SVM
- Learn features automatically from raw pixels
- State-of-the-art: deep models report 99%+ accuracy on benchmarks such as Labeled Faces in the Wild

---
## 6. Support Vector Machine Summary

### SVM Advantages

**Why Use Support Vector Machines?**

**1. Effective in High Dimensions**
- Work well even when dimensions exceed samples (e.g., face recognition: 2,914 raw pixel features vs. 1,011 training samples, reduced to 150 PCA components in our pipeline)

**2. Memory Efficient**
- Model defined by support vectors only (small fraction of training points)
- Fast prediction even with large training sets

**3. Versatile**
- Different kernels for different data types; can model complex non-linear boundaries

**4. Robust**
- Insensitive to points far from boundary; less affected by outliers

**5. Mathematically Elegant**
- Convex optimization with unique global solution

**6. Often High Accuracy**
- Especially strong on medium-sized datasets

### SVM Disadvantages

**When NOT to Use SVM:**

**1. Computational Cost**
- Training time: $\mathcal{O}(N^2)$ to $\mathcal{O}(N^3)$ -- slow on large datasets ($N > 10{,}000$)
- Grid search for hyperparameters is expensive

**2. Hyperparameter Sensitivity**
- Results depend heavily on $C$ and $\gamma$ -- requires careful cross-validation

**3. No Direct Probability Estimates**
- SVM outputs class labels and signed distances to the boundary (`decision_function`); probability estimates require extra calibration

**4. Black Box Nature**
- With kernel methods, less interpretable than linear models
- Hard to understand why a particular classification was made
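
On the probability point above, one common workaround is to calibrate the SVM's scores after the fact. The sketch below wraps an `SVC` in Scikit-Learn's `CalibratedClassifierCV`; the dataset and settings are illustrative assumptions, and other calibration approaches exist.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0, cluster_std=1.2)

# A plain SVC gives labels and signed distances to the boundary, not probabilities
svc = SVC(kernel='rbf', C=1.0).fit(X, y)
print("decision_function:", np.round(svc.decision_function(X[:3]), 3))

# Wrapping the SVC in a calibrator (sigmoid scaling by default) adds predict_proba
calibrated = CalibratedClassifierCV(SVC(kernel='rbf', C=1.0), cv=5).fit(X, y)
proba = calibrated.predict_proba(X[:3])
print("calibrated probabilities:")
print(np.round(proba, 3))
```

Calibration refits the model several times under cross-validation, so it adds to the training cost already listed as a disadvantage.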

### When to Use SVM

**SVM is a Good Choice When:**
- Medium-sized dataset (hundreds to thousands of samples)
- High-dimensional data (many features)
- Clear margin of separation between classes
- Accuracy is more important than speed/interpretability
- Willing to invest time in hyperparameter tuning

**Consider Alternatives When:**
- Very large dataset ($N > 100{,}000$) --> use linear models, SGD, or deep learning
- Need probability estimates --> use logistic regression, Naive Bayes
- Need interpretability --> use decision trees, linear models
- Very small dataset --> try simpler models first (Naive Bayes, logistic regression)
- Real-time predictions required --> simpler models may be faster

> **Key Insight: Author's Recommendation**  
> "I generally only turn to SVMs once other simpler, faster, and less tuning-intensive methods have been shown to be insufficient for my needs."

### Key Takeaways and Best Practices

**Core Concepts:**
- **Maximum Margin**: Choose boundary with largest margin
- **Support Vectors**: Only points on the margin (or inside it, for soft margins) matter (compact model)
- **Kernel Trick**: Non-linear boundaries without explicit transformation
- **Soft Margins**: Parameter $C$ controls margin hardness

**Best Practices:**
- **Always** scale features (`StandardScaler`)
- Start with linear kernel, then try RBF if needed
- Use `GridSearchCV`: $C \in [10^{-2}, 10^3]$, $\gamma \in [10^{-4}, 1]$
- For high-dimensional data, consider PCA first
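
The first and third practices can be combined in one pipeline. The sketch below uses the wine dataset bundled with Scikit-Learn as a stand-in (an assumption; its features have very different scales) and the parameter ranges listed above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): wine dataset, whose features have very different scales
X, y = load_wine(return_X_y=True)

# Baseline: RBF SVM on the raw, unscaled features
unscaled = cross_val_score(SVC(kernel='rbf'), X, y, cv=5).mean()

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {
    'svc__C': [0.01, 0.1, 1, 10, 100, 1000],
    'svc__gamma': [0.0001, 0.001, 0.01, 0.1, 1],
}
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

print(f"Unscaled CV accuracy: {unscaled:.2%}")
print(f"Scaled + tuned CV accuracy: {grid.best_score_:.2%}")
print("Best parameters:", grid.best_params_)
```

Because the RBF kernel depends on distances between points, a feature with a large numeric range can dominate the kernel unless the features are put on a common scale first.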

> **Key Insight: Final Thought**  
> SVMs excel on medium-sized, high-dimensional datasets. Try simpler models first; use SVM when you need the extra accuracy.

---

## [PRACTICE] Practice Exercises

Try these on your own:

1. **Exercise 1:** Create your own 2D dataset with `make_blobs` and experiment with different C values. What happens when C is very small (0.01) or very large (1000)?

2. **Exercise 2:** Generate a more complex non-linear dataset using `make_moons` from sklearn. Compare linear and RBF kernels.

3. **Exercise 3:** Load a different dataset (e.g., digits dataset) and build an SVM classifier. Use grid search to find optimal hyperparameters.

4. **Exercise 4:** Implement a multi-class classification problem with 3 or more classes. How does SVM handle multi-class problems?

5. **Exercise 5:** Compare SVM with other classifiers (Naive Bayes, Logistic Regression, Random Forest) on the same dataset. Which performs best?