# Lesson 11 - Learning Theory and VC Dimension


## Objectives
- Visualize empirical (training) risk against true risk, estimated on held-out data.
- Simulate capacity vs generalization behavior.
- Discuss VC dimension intuition.


## From the notes

**Learning theory**
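- Empirical risk minimization (ERM): choose the hypothesis in the class that minimizes the average loss on the training set (the empirical risk).
- Generalization bounds control the gap between empirical risk and true risk; the gap grows with hypothesis class complexity (e.g., VC dimension) and shrinks as the number of training examples grows.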
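
In symbols, one common way to write these two statements is sketched below. The notation is an assumption of this sketch rather than something fixed by the notes: $\mathcal{H}$ is the hypothesis class, $n$ the number of training examples, $d = \mathrm{VC}(\mathcal{H})$, $\delta$ a confidence parameter, and the loss is taken to be 0-1 classification error. Constants are hidden inside the $O(\cdot)$.

$$
\hat{\varepsilon}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{h(x^{(i)}) \neq y^{(i)}\},
\qquad
\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}(h)
$$

$$
\text{with probability at least } 1-\delta:\quad
\sup_{h \in \mathcal{H}} \left|\varepsilon(h) - \hat{\varepsilon}(h)\right|
\le O\!\left(\sqrt{\frac{d}{n}\log\frac{n}{d} + \frac{1}{n}\log\frac{1}{\delta}}\right)
$$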

_TODO: Validate learning theory statements with CS229 main notes PDF._


## Intuition
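As model capacity grows, training error decreases, but without enough data the gap between training error and true error can widen, so generalization can worsen. VC dimension measures the richness of a hypothesis class: it is the size of the largest set of points the class can shatter, meaning it can realize every possible binary labeling of those points. For example, linear classifiers in the plane have VC dimension 3.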
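
Shattering can be checked by brute force for a very simple class. The sketch below uses 1D intervals, h(x) = 1 if a <= x <= b and 0 otherwise, which is a class chosen here for illustration (it is not the polynomial class used in the rest of this lesson). The helper names `interval_labelings` and `is_shattered` are hypothetical, and the points are assumed to be distinct.

```python
def interval_labelings(points):
    """All labelings of `points` realizable by h(x) = 1[a <= x <= b]."""
    pts = sorted(points)
    # Only the position of a and b relative to the points matters, so it is
    # enough to try one value below all points, the midpoints, and one above.
    cuts = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    labelings = set()
    for a in cuts:
        for b in cuts:
            labelings.add(tuple(int(a <= p <= b) for p in points))
    return labelings

def is_shattered(points):
    # Shattered means every one of the 2^k binary labelings is realizable.
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([0.0, 1.0]))        # two points
print(is_shattered([0.0, 1.0, 2.0]))   # three points
```

Two points can be shattered, but no three points can, because an interval cannot label the outer two points 1 and the middle point 0. That is why intervals on the line have VC dimension 2.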


## Data
We draw 80 evenly spaced points on [-1, 1] from y = sin(3x) plus Gaussian noise (standard deviation 0.3), then fit polynomials of increasing degree to show overfitting.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

def make_data(n=80):
    x = np.linspace(-1, 1, n)
    y = np.sin(3 * x) + 0.3 * np.random.randn(n)
    return x, y

x, y = make_data()

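def poly_features(x, degree):
    # Design matrix with columns x**0, x**1, ..., x**degree
    return np.vander(x, degree + 1, increasing=True)

def fit_poly(x, y, degree):
    X = poly_features(x, degree)
    # Least-squares solve; more stable than forming pinv(X.T @ X) explicitly
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta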


## Experiments


In [None]:
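degrees = range(1, 12)
train_err = []
val_err = []
split = 60
# x is sorted, so shuffle before splitting. A contiguous split would put all
# validation points at the right end of the interval and measure
# extrapolation rather than generalization.
idx = np.random.permutation(len(x))
train_idx, val_idx = idx[:split], idx[split:]
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]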
for d in degrees:
    theta = fit_poly(x_train, y_train, d)
    train_err.append(np.mean((poly_features(x_train, d) @ theta - y_train)**2))
    val_err.append(np.mean((poly_features(x_val, d) @ theta - y_val)**2))


## Visualizations


In [None]:
plt.figure(figsize=(6,4))
plt.plot(list(degrees), train_err, label="train")
plt.plot(list(degrees), val_err, label="validation")
plt.title("Generalization vs model capacity")
plt.xlabel("polynomial degree")
plt.ylabel("MSE")
plt.legend()
plt.show()

plt.figure(figsize=(6,4))
plt.scatter(x, y, alpha=0.6)
theta = fit_poly(x, y, 9)
plt.plot(x, poly_features(x, 9) @ theta, color="black")
plt.title("High-capacity fit example")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


## Takeaways
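- Training error is not a reliable measure of generalization: it never increases as the polynomial degree grows, even when error on held-out data does.
- Model complexity needs to be balanced against data size: the more expressive the hypothesis class, the more examples are needed for training error to track true error.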
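
To get a feel for this balance, the sketch below evaluates the complexity term that appears in standard VC-style bounds, sqrt((d log(n/d) + log(1/delta)) / n), with constants dropped. The function name `vc_gap_term` and the default `delta` are illustrative choices. The numbers are only meaningful as relative trends, not as actual error guarantees.

```python
import math

def vc_gap_term(d, n, delta=0.05):
    """Order-of-magnitude gap term from a VC-style bound (constants dropped)."""
    if n <= d:
        raise ValueError("the term is only informative when n > d")
    return math.sqrt((d * math.log(n / d) + math.log(1 / delta)) / n)

# Larger d (richer class) raises the term; larger n (more data) lowers it.
for d in (5, 50):
    for n in (100, 1_000, 10_000):
        print(f"d={d:3d}  n={n:6d}  term={vc_gap_term(d, n):.3f}")
```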


## Explain it in an interview
- Explain what VC dimension captures in a hypothesis class.
- Describe empirical risk minimization.


## Exercises
- Simulate how validation error changes with more training data.
- Explain why VC dimension influences generalization bounds.
- Try a different hypothesis class (e.g., splines).
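
One possible starting point for the first exercise is sketched below. It assumes fresh samples from the same sin(3x)-plus-noise model used in this lesson, drawn at uniformly random x rather than on a grid. The helpers `sample` and `val_mse` are hypothetical names, and NumPy's own polynomial fit is used instead of `fit_poly` so the snippet runs on its own.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def sample(n):
    # Same target and noise level as the lesson's data, at random x in [-1, 1]
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
    return x, y

def val_mse(n_train, degree, n_val=2000):
    # Fit on n_train points, then score on a large independent validation set
    x_tr, y_tr = sample(n_train)
    x_va, y_va = sample(n_val)
    coef = P.polyfit(x_tr, y_tr, degree)   # coefficients, lowest degree first
    pred = P.polyval(x_va, coef)
    return float(np.mean((pred - y_va) ** 2))

sizes = [20, 50, 200, 1000]
curve = [val_mse(n, degree=9) for n in sizes]
for n, err in zip(sizes, curve):
    print(f"n_train={n:5d}  validation MSE={err:.3f}")
```

Averaging `val_mse` over several random draws per size would give a smoother curve; a single draw per size is used here only to keep the sketch short.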
