[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/04-applied-ml/notebooks/03-regularization.ipynb)

# Lesson 3: Regularization — Taming Complexity

*"Eulr Voss demanded a complex model. The Tribunal wanted protection against overfitting. I offered them a compromise: keep the complexity, but penalize it. Make the model pay a price for every parameter it uses. This is regularization—the art of controlled complexity."*  
— Mink Pavar, third day of testimony

---

## The Controlled Complexity

The Forgery Trial had reached an impasse. Mink Pavar's simple model was too biased. Eulr Voss's complex model was too variable. The Tribunal demanded a solution that preserved flexibility while preventing memorization.

Mink Pavar proposed a radical idea:

> *"We can keep a complex model—with many parameters—but we must tax its complexity. For every weight that grows large, we add a penalty to our loss function. The model must earn its complexity by making better predictions."*

This is **regularization**: adding a penalty term to the loss function that discourages overly complex solutions.

---

## Learning Objectives

By the end of this lesson, you will:
1. Understand why regularization works (shrinking weights)
2. Implement L2 (Ridge) regularization and interpret its effects
3. Implement L1 (Lasso) regularization and understand feature selection
4. Know when to use Ridge vs Lasso
5. Tune the regularization strength hyperparameter

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load datasets
manuscripts = pd.read_csv(BASE_URL + "manuscript_features.csv")
expeditions = pd.read_csv(BASE_URL + "expedition_outcomes.csv")

print(f"Loaded {len(manuscripts)} manuscripts")
print(f"Loaded {len(expeditions)} expeditions")

## Part 1: The Problem — Large Weights Cause Overfitting

*"The overfitting model had enormous weights—some positive, some negative, fighting each other to fit every wiggle in the training data. I realized: if we force the weights to stay small, the model cannot overfit."*  
— Mink Pavar

Let's revisit the overfitting problem and examine the weights:

In [None]:
# Create synthetic data
def true_function(x):
    return 0.5 + 0.3 * np.sin(2 * x) + 0.1 * x

np.random.seed(42)
n_samples = 30
X_train = np.sort(np.random.uniform(0, 4, n_samples))
y_train = true_function(X_train) + np.random.normal(0, 0.15, n_samples)

X_test = np.sort(np.random.uniform(0, 4, 20))
y_test = true_function(X_test) + np.random.normal(0, 0.15, 20)

# Fit a high-degree polynomial WITHOUT regularization
degree = 10
poly = PolynomialFeatures(degree=degree, include_bias=True)
X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
X_test_poly = poly.transform(X_test.reshape(-1, 1))

# Standard linear regression (no regularization)
model_unregularized = LinearRegression(fit_intercept=False)
model_unregularized.fit(X_train_poly, y_train)

# Look at the weights
print("Weights of degree-10 polynomial (NO regularization):")
print("=" * 50)
for i, coef in enumerate(model_unregularized.coef_):
    print(f"  x^{i}: {coef:12.2f}")

print(f"\nSum of absolute weights: {np.abs(model_unregularized.coef_).sum():.2f}")
print(f"Max absolute weight: {np.abs(model_unregularized.coef_).max():.2f}")

In [None]:
# Visualize the overfitting
x_smooth = np.linspace(0, 4, 200)
X_smooth_poly = poly.transform(x_smooth.reshape(-1, 1))
y_pred_unreg = model_unregularized.predict(X_smooth_poly)
y_pred_unreg_clipped = np.clip(y_pred_unreg, -1, 3)

plt.figure(figsize=(10, 6))
plt.plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True function')
plt.scatter(X_train, y_train, c='blue', s=60, edgecolor='black', label='Training data', zorder=5)
plt.plot(x_smooth, y_pred_unreg_clipped, 'r-', linewidth=2, label='Unregularized (overfitting)')

plt.xlabel('Manuscript Feature', fontsize=12)
plt.ylabel('Authenticity Score', fontsize=12)
plt.title('Large Weights → Wild Oscillations → Overfitting', fontsize=13)
plt.legend()
plt.ylim(-0.5, 2)
plt.show()

train_mse = mean_squared_error(y_train, model_unregularized.predict(X_train_poly))
test_mse = mean_squared_error(y_test, model_unregularized.predict(X_test_poly))
print(f"Training MSE: {train_mse:.4f}")
print(f"Test MSE: {test_mse:.4f}")
print("\nThe large weights cause wild oscillations between data points!")

## Part 2: L2 Regularization (Ridge Regression)

*"I proposed a simple penalty: add the sum of squared weights to the loss function. The model must now minimize BOTH the prediction error AND the size of its weights."*  
— Mink Pavar

### The Ridge Loss Function

$$L_{\text{Ridge}} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{MSE}} + \underbrace{\lambda \sum_{j=1}^{p} w_j^2}_{\text{L2 penalty}}$$

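Where:
- $\lambda$ (lambda) controls the regularization strength
- Higher $\lambda$ = more penalty = smaller weights
- In scikit-learn, the alpha argument plays the role of $\lambda$. Ridge penalizes the *sum* of squared errors rather than the mean, so its alpha corresponds to $n\lambda$ in the formula above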
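
For intuition, the ridge problem has a closed-form solution when no intercept is fitted. The sketch below is not part of the lesson's pipeline: it assumes small synthetic data, the names `ridge_closed_form`, `X_demo`, and `y_demo` are made up for this illustration, and it follows scikit-learn's convention (sum of squared errors plus alpha times the sum of squared weights). It checks that solving $(X^\top X + \alpha I)\,w = X^\top y$ reproduces the weights from `Ridge`, and shows the weight norm shrinking as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical synthetic data, used only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

for alpha in [0.1, 1.0, 10.0, 100.0]:
    w_closed = ridge_closed_form(X_demo, y_demo, alpha)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(w_closed):.4f}, "
          f"matches sklearn: {np.allclose(w_closed, w_sklearn)}")
```

The added $\alpha I$ term is also why Ridge stays numerically stable when features are highly correlated: it keeps the matrix being inverted away from singularity.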

In [None]:
# Scale features (important for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)
X_smooth_scaled = scaler.transform(X_smooth_poly)

# Fit Ridge regression with different lambda values
lambdas = [0, 0.001, 0.01, 0.1, 1, 10]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, lam in enumerate(lambdas):
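    # The scaler turns the constant x^0 column into zeros, so the offset
    # must come from the model's own (unpenalized) intercept.
    if lam == 0:
        # No regularization
        model = LinearRegression(fit_intercept=True)
    else:
        model = Ridge(alpha=lam, fit_intercept=True)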
    
    model.fit(X_train_scaled, y_train)
    
    y_smooth_pred = model.predict(X_smooth_scaled)
    y_smooth_pred_clipped = np.clip(y_smooth_pred, -1, 3)
    
    test_mse = mean_squared_error(y_test, model.predict(X_test_scaled))
    weight_sum = np.sum(model.coef_**2)
    
    ax = axes[idx]
    ax.plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5)
    ax.scatter(X_train, y_train, c='blue', s=30, edgecolor='black')
    ax.plot(x_smooth, y_smooth_pred_clipped, 'r-', linewidth=2)
    ax.set_ylim(-0.5, 2)
    ax.set_title(f'λ = {lam}\nTest MSE: {test_mse:.4f}\nΣw²: {weight_sum:.2f}')

plt.suptitle('Ridge Regression: Effect of Regularization Strength', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Compare weights at different regularization strengths
print("Weight comparison: Unregularized vs Ridge (λ=1)")
print("=" * 60)

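# fit_intercept=True: scaling zeroes the constant x^0 column, so its weight
# is 0 and the offset is carried by the model's unpenalized intercept
model_ridge = Ridge(alpha=1.0, fit_intercept=True)
model_ridge.fit(X_train_scaled, y_train)

model_unreg = LinearRegression(fit_intercept=True)
model_unreg.fit(X_train_scaled, y_train)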

print(f"{'Term':>10} | {'Unregularized':>15} | {'Ridge (λ=1)':>15}")
print("-" * 50)
for i in range(len(model_unreg.coef_)):
    print(f"{'x^'+str(i):>10} | {model_unreg.coef_[i]:>15.4f} | {model_ridge.coef_[i]:>15.4f}")

print("-" * 50)
print(f"{'Σ|w|':>10} | {np.abs(model_unreg.coef_).sum():>15.4f} | {np.abs(model_ridge.coef_).sum():>15.4f}")
print(f"{'Σw²':>10} | {np.sum(model_unreg.coef_**2):>15.4f} | {np.sum(model_ridge.coef_**2):>15.4f}")

## Part 3: L1 Regularization (Lasso Regression)

*"The Tribunal asked: 'What if some features are useless for detecting forgeries? Can the model learn to ignore them?' I showed them Lasso—a regularization that forces useless weights to exactly zero."*  
— Mink Pavar

### The Lasso Loss Function

$$L_{\text{Lasso}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|$$

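The key difference: L1 uses **absolute values**, not squares. This creates a different geometry (explored in Part 4) that can push some weights to exactly zero rather than merely shrinking them. In scikit-learn, Lasso minimizes $\frac{1}{2n}\sum_{i}(y_i - \hat{y}_i)^2 + \alpha \sum_{j} |w_j|$, so its alpha corresponds to $\lambda/2$ in the formula above.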
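
One way to see where the exact zeros come from is the single-weight case, where both penalties have well-known closed-form minimizers. Minimizing $\frac{1}{2}(w - z)^2 + \lambda |w|$ gives the soft-thresholding rule $w = \text{sign}(z)\max(|z| - \lambda, 0)$, while minimizing $\frac{1}{2}(w - z)^2 + \frac{1}{2}\lambda w^2$ gives $w = z / (1 + \lambda)$. In the sketch below, `z` stands in for hypothetical unpenalized weights, and the function names are illustrative rather than part of any library.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w|: shift toward 0, clip at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: rescale toward 0."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])  # hypothetical unpenalized weights
lam = 0.5

print("unpenalized:", z)
print("L1 (soft)  :", soft_threshold(z, lam))
print("L2 (scale) :", ridge_shrink(z, lam))
```

With these numbers, the three entries whose magnitude is at most 0.5 become exactly zero under the L1 rule, while the L2 rule leaves all five non-zero and simply scales them down.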

In [None]:
# Compare Ridge vs Lasso
lambda_val = 0.1

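# fit_intercept=True: scaling zeroes the constant x^0 column, so the offset
# must come from the model's own (unpenalized) intercept
model_ridge = Ridge(alpha=lambda_val, fit_intercept=True)
model_lasso = Lasso(alpha=lambda_val, fit_intercept=True, max_iter=10000)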

model_ridge.fit(X_train_scaled, y_train)
model_lasso.fit(X_train_scaled, y_train)

print(f"Weight comparison: Ridge vs Lasso (λ={lambda_val})")
print("=" * 60)
print(f"{'Term':>10} | {'Ridge':>15} | {'Lasso':>15}")
print("-" * 50)
for i in range(len(model_ridge.coef_)):
    ridge_w = model_ridge.coef_[i]
    lasso_w = model_lasso.coef_[i]
    zero_marker = " ← ZERO" if abs(lasso_w) < 1e-6 else ""
    print(f"{'x^'+str(i):>10} | {ridge_w:>15.4f} | {lasso_w:>15.4f}{zero_marker}")

print("-" * 50)
print(f"Non-zero weights: Ridge={np.sum(np.abs(model_ridge.coef_) > 1e-6)}, Lasso={np.sum(np.abs(model_lasso.coef_) > 1e-6)}")

In [None]:
# Visualize Ridge vs Lasso
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge
y_smooth_ridge = np.clip(model_ridge.predict(X_smooth_scaled), -1, 3)
axes[0].plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True')
axes[0].scatter(X_train, y_train, c='blue', s=40, edgecolor='black')
axes[0].plot(x_smooth, y_smooth_ridge, 'r-', linewidth=2, label='Ridge')
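test_mse_ridge = mean_squared_error(y_test, model_ridge.predict(X_test_scaled))
# Count instead of asserting: the zeroed-out x^0 column always gets weight 0
n_nonzero_ridge = np.sum(np.abs(model_ridge.coef_) > 1e-6)
axes[0].set_title(f'Ridge (L2): λ={lambda_val}\nTest MSE: {test_mse_ridge:.4f}\n{n_nonzero_ridge}/{len(model_ridge.coef_)} weights non-zero', fontsize=11)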
axes[0].set_ylim(-0.2, 1.8)
axes[0].legend()

# Lasso
y_smooth_lasso = np.clip(model_lasso.predict(X_smooth_scaled), -1, 3)
axes[1].plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True')
axes[1].scatter(X_train, y_train, c='blue', s=40, edgecolor='black')
axes[1].plot(x_smooth, y_smooth_lasso, 'r-', linewidth=2, label='Lasso')
test_mse_lasso = mean_squared_error(y_test, model_lasso.predict(X_test_scaled))
n_nonzero = np.sum(np.abs(model_lasso.coef_) > 1e-6)
axes[1].set_title(f'Lasso (L1): λ={lambda_val}\nTest MSE: {test_mse_lasso:.4f}\nOnly {n_nonzero}/{len(model_lasso.coef_)} weights non-zero', fontsize=11)
axes[1].set_ylim(-0.2, 1.8)
axes[1].legend()

plt.tight_layout()
plt.show()

## Part 4: The Geometry of Regularization

*"Why does L1 push weights to zero while L2 merely shrinks them? The answer lies in geometry. Let me draw you the constraint regions..."*  
— Mink Pavar
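
The pictures in the next cell rely on the constrained view of the same two problems. By a standard result for convex problems like these, each penalty strength $\lambda$ corresponds to some budget $t$ for which the penalized and the constrained problem share a solution, so Ridge and Lasso can also be written as:

$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} w_j^2 \le t \qquad \text{(Ridge)}$$

$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |w_j| \le t \qquad \text{(Lasso)}$$

With two weights, the Ridge budget is a disk and the Lasso budget is a diamond. A larger $\lambda$ corresponds to a smaller $t$, and the solution is the point where the smallest loss contour touches the budget region.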

In [None]:
# Visualize the geometry of L1 vs L2
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create loss contours (ellipses centered at optimal point)
w1_range = np.linspace(-2, 4, 100)
w2_range = np.linspace(-2, 4, 100)
W1, W2 = np.meshgrid(w1_range, w2_range)

# Simulated loss function (centered at w1=2, w2=1).
# The bowl is steeper along w1 than along w2, so the minimum of this loss
# over the diamond really does fall on the corner (1.5, 0).
optimal_w1, optimal_w2 = 2.0, 1.0
Loss = 2 * (W1 - optimal_w1)**2 + 0.5 * (W2 - optimal_w2)**2

# L2 constraint region (circle)
theta = np.linspace(0, 2*np.pi, 100)
r_l2 = 1.5
l2_w1 = r_l2 * np.cos(theta)
l2_w2 = r_l2 * np.sin(theta)

# L1 constraint region (diamond)
r_l1 = 1.5
l1_w1 = [r_l1, 0, -r_l1, 0, r_l1]
l1_w2 = [0, r_l1, 0, -r_l1, 0]

# L2 (Ridge) plot
axes[0].contour(W1, W2, Loss, levels=20, cmap='Greys', alpha=0.7)
axes[0].fill(l2_w1, l2_w2, alpha=0.3, color='blue')
axes[0].plot(l2_w1, l2_w2, 'b-', linewidth=2, label='L2 constraint')
axes[0].plot(optimal_w1, optimal_w2, 'r*', markersize=15, label='Unconstrained optimum')
# Ridge solution (approximate minimizer of this loss on the circle)
ridge_w1, ridge_w2 = 1.45, 0.40
axes[0].plot(ridge_w1, ridge_w2, 'g*', markersize=15, label='Ridge solution')
axes[0].axhline(0, color='black', linewidth=0.5)
axes[0].axvline(0, color='black', linewidth=0.5)
axes[0].set_xlabel('w₁', fontsize=12)
axes[0].set_ylabel('w₂', fontsize=12)
axes[0].set_title('L2 (Ridge): Solution on Circle\nWeights shrink but stay non-zero', fontsize=12)
axes[0].legend(loc='upper left')
axes[0].set_xlim(-2.5, 4)
axes[0].set_ylim(-2.5, 4)
axes[0].set_aspect('equal')

# L1 (Lasso) plot
axes[1].contour(W1, W2, Loss, levels=20, cmap='Greys', alpha=0.7)
axes[1].fill(l1_w1, l1_w2, alpha=0.3, color='red')
axes[1].plot(l1_w1, l1_w2, 'r-', linewidth=2, label='L1 constraint')
axes[1].plot(optimal_w1, optimal_w2, 'r*', markersize=15, label='Unconstrained optimum')
# Lasso solution (on a corner!)
lasso_w1, lasso_w2 = 1.5, 0  # On the corner → w2 is exactly 0!
axes[1].plot(lasso_w1, lasso_w2, 'g*', markersize=15, label='Lasso solution')
axes[1].axhline(0, color='black', linewidth=0.5)
axes[1].axvline(0, color='black', linewidth=0.5)
axes[1].set_xlabel('w₁', fontsize=12)
axes[1].set_ylabel('w₂', fontsize=12)
axes[1].set_title('L1 (Lasso): Solution on Diamond Corner\nWeights hit exactly zero!', fontsize=12)
axes[1].legend(loc='upper left')
axes[1].set_xlim(-2.5, 4)
axes[1].set_ylim(-2.5, 4)
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

print("Key Insight:")
print("- L2 constraint is a CIRCLE → solutions land on the smooth edge")
print("- L1 constraint is a DIAMOND → solutions often land on CORNERS")
print("- At a corner, one coordinate is exactly zero → AUTOMATIC FEATURE SELECTION")

## Part 5: Choosing λ — The Regularization Path

*"The Tribunal asked: 'How much penalty is right?' I showed them the regularization path—how weights change as we increase λ from 0 to infinity."*  
— Mink Pavar

In [None]:
# Plot regularization paths
lambdas = np.logspace(-4, 2, 100)

# Store weights at each lambda
ridge_weights = []
lasso_weights = []

for lam in lambdas:
    # fit_intercept=True: the scaler zeroes the constant column, so the
    # offset must come from the model's own (unpenalized) intercept
    ridge = Ridge(alpha=lam, fit_intercept=True)
    ridge.fit(X_train_scaled, y_train)
    ridge_weights.append(ridge.coef_.copy())
    
    lasso = Lasso(alpha=lam, fit_intercept=True, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    lasso_weights.append(lasso.coef_.copy())

ridge_weights = np.array(ridge_weights)
lasso_weights = np.array(lasso_weights)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge path
for i in range(ridge_weights.shape[1]):
    axes[0].plot(lambdas, ridge_weights[:, i], linewidth=1.5)
axes[0].set_xscale('log')
axes[0].set_xlabel('λ (regularization strength)', fontsize=12)
axes[0].set_ylabel('Weight value', fontsize=12)
axes[0].set_title('Ridge Regularization Path\n(Weights shrink smoothly toward zero)', fontsize=12)
axes[0].axhline(0, color='black', linewidth=0.5)

# Lasso path
for i in range(lasso_weights.shape[1]):
    axes[1].plot(lambdas, lasso_weights[:, i], linewidth=1.5)
axes[1].set_xscale('log')
axes[1].set_xlabel('λ (regularization strength)', fontsize=12)
axes[1].set_ylabel('Weight value', fontsize=12)
axes[1].set_title('Lasso Regularization Path\n(Weights hit zero at different λ values)', fontsize=12)
axes[1].axhline(0, color='black', linewidth=0.5)

plt.tight_layout()
plt.show()

print("Observation:")
print("- Ridge: All weights approach zero asymptotically but never reach it")
print("- Lasso: Weights drop to exactly zero at different λ values")
print("- Lasso performs AUTOMATIC FEATURE SELECTION")

In [None]:
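# Find the best lambda by scanning held-out error.
# NOTE: for simplicity the test set doubles as the validation set here, so the
# "best" test MSE reported below is optimistic. In practice, tune lambda on a
# separate validation split or with cross-validation (see Exercise 1).
train_errors_ridge = []
test_errors_ridge = []
train_errors_lasso = []
test_errors_lasso = []

for lam in lambdas:
    # fit_intercept=True: the scaler zeroes the constant column, so the
    # offset must come from the model's own (unpenalized) intercept
    ridge = Ridge(alpha=lam, fit_intercept=True)
    ridge.fit(X_train_scaled, y_train)
    train_errors_ridge.append(mean_squared_error(y_train, ridge.predict(X_train_scaled)))
    test_errors_ridge.append(mean_squared_error(y_test, ridge.predict(X_test_scaled)))
    
    lasso = Lasso(alpha=lam, fit_intercept=True, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    train_errors_lasso.append(mean_squared_error(y_train, lasso.predict(X_train_scaled)))
    test_errors_lasso.append(mean_squared_error(y_test, lasso.predict(X_test_scaled)))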

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge
axes[0].plot(lambdas, train_errors_ridge, 'b-', linewidth=2, label='Training')
axes[0].plot(lambdas, test_errors_ridge, 'r-', linewidth=2, label='Test')
best_idx_ridge = np.argmin(test_errors_ridge)
axes[0].axvline(lambdas[best_idx_ridge], color='green', linestyle='--', 
                label=f'Best λ = {lambdas[best_idx_ridge]:.4f}')
axes[0].set_xscale('log')
axes[0].set_xlabel('λ', fontsize=12)
axes[0].set_ylabel('MSE', fontsize=12)
axes[0].set_title('Ridge: Finding Optimal λ', fontsize=12)
axes[0].legend()

# Lasso
axes[1].plot(lambdas, train_errors_lasso, 'b-', linewidth=2, label='Training')
axes[1].plot(lambdas, test_errors_lasso, 'r-', linewidth=2, label='Test')
best_idx_lasso = np.argmin(test_errors_lasso)
axes[1].axvline(lambdas[best_idx_lasso], color='green', linestyle='--', 
                label=f'Best λ = {lambdas[best_idx_lasso]:.4f}')
axes[1].set_xscale('log')
axes[1].set_xlabel('λ', fontsize=12)
axes[1].set_ylabel('MSE', fontsize=12)
axes[1].set_title('Lasso: Finding Optimal λ', fontsize=12)
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Best Ridge λ: {lambdas[best_idx_ridge]:.4f} (Test MSE: {test_errors_ridge[best_idx_ridge]:.4f})")
print(f"Best Lasso λ: {lambdas[best_idx_lasso]:.4f} (Test MSE: {test_errors_lasso[best_idx_lasso]:.4f})")

## Part 6: Applying to Manuscript Forgery Detection

*"Now let us apply regularization to the forgery problem. Which features truly matter? Lasso will tell us."*  
— Mink Pavar

In [None]:
# Prepare manuscript data with multiple features
feature_cols = ['stylometric_variance', 'era_marker_score', 'vocabulary_richness', 
                'avg_sentence_length', 'philosophical_term_density', 'word_count']

X_ms = manuscripts[feature_cols].values
y_ms = manuscripts['is_forgery'].astype(int).values

# Split and scale
X_train_ms, X_test_ms, y_train_ms, y_test_ms = train_test_split(
    X_ms, y_ms, test_size=0.3, random_state=42
)

scaler_ms = StandardScaler()
X_train_ms_scaled = scaler_ms.fit_transform(X_train_ms)
X_test_ms_scaled = scaler_ms.transform(X_test_ms)

print(f"Features: {feature_cols}")
print(f"Training samples: {len(X_train_ms)}")
print(f"Test samples: {len(X_test_ms)}")

In [None]:
# Compare unregularized, Ridge, and Lasso
models = {
    'Unregularized': LinearRegression(),
    'Ridge (λ=0.1)': Ridge(alpha=0.1),
    'Lasso (λ=0.01)': Lasso(alpha=0.01, max_iter=10000),
}

print("Feature Importance Comparison for Forgery Detection")
print("=" * 80)

results = []
for name, model in models.items():
    model.fit(X_train_ms_scaled, y_train_ms)
    train_mse = mean_squared_error(y_train_ms, model.predict(X_train_ms_scaled))
    test_mse = mean_squared_error(y_test_ms, model.predict(X_test_ms_scaled))
    results.append((name, train_mse, test_mse, model.coef_))

# Print weights
print(f"\n{'Feature':<30} | {'Unreg':>10} | {'Ridge':>10} | {'Lasso':>10}")
print("-" * 75)
for i, feat in enumerate(feature_cols):
    unreg_w = results[0][3][i]
    ridge_w = results[1][3][i]
    lasso_w = results[2][3][i]
    zero_marker = "*" if abs(lasso_w) < 1e-6 else ""
    print(f"{feat:<30} | {unreg_w:>10.4f} | {ridge_w:>10.4f} | {lasso_w:>10.4f}{zero_marker}")

print("-" * 75)
print("* = weight set to zero by Lasso")

print(f"\n{'Model':<20} | {'Train MSE':>12} | {'Test MSE':>12}")
print("-" * 50)
for name, train_mse, test_mse, _ in results:
    print(f"{name:<20} | {train_mse:>12.4f} | {test_mse:>12.4f}")

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(12, 6))

x_pos = np.arange(len(feature_cols))
width = 0.25

colors = ['steelblue', 'coral', 'green']
for i, (name, _, _, weights) in enumerate(results):
    ax.bar(x_pos + i*width, np.abs(weights), width, label=name, color=colors[i], alpha=0.8)

ax.set_xticks(x_pos + width)
ax.set_xticklabels([f.replace('_', '\n') for f in feature_cols], fontsize=9)
ax.set_ylabel('|Weight| (Absolute Value)', fontsize=12)
ax.set_title('Feature Importance for Forgery Detection\n(Lasso Performs Automatic Feature Selection)', fontsize=13)
ax.legend()

plt.tight_layout()
plt.show()

# Find selected features
lasso_weights = results[2][3]
selected_features = [f for f, w in zip(feature_cols, lasso_weights) if abs(w) > 1e-6]
print(f"\nLasso selected features: {selected_features}")
print(f"Lasso eliminated features: {[f for f in feature_cols if f not in selected_features]}")

## Part 7: When to Use Ridge vs Lasso

*"The Tribunal asked: 'Which is better—Ridge or Lasso?' I told them: it depends on your beliefs about the problem."*  
— Mink Pavar

### Decision Guide

| Situation | Use Ridge (L2) | Use Lasso (L1) |
|-----------|----------------|----------------|
| All features likely relevant | ✓ | |
| Many irrelevant features | | ✓ |
| Feature selection needed | | ✓ |
| Correlated features | ✓ | |
| Interpretability important | | ✓ |
| Numerical stability needed | ✓ | |
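
The "correlated features" row can be illustrated with an extreme case. The sketch below is an assumption-laden toy example (synthetic data, made-up names `X_dup` and `y_dup`) in which the same feature appears twice. By symmetry, Ridge splits the weight evenly between the two copies. Lasso's coordinate descent typically keeps one copy and zeros the other, and which copy survives is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: one informative feature, duplicated
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_dup = np.column_stack([x, x])  # two identical (perfectly correlated) columns
y_dup = 2.0 * x + rng.normal(scale=0.1, size=200)

ridge_dup = Ridge(alpha=1.0).fit(X_dup, y_dup)
lasso_dup = Lasso(alpha=0.1, max_iter=10000).fit(X_dup, y_dup)

print("Ridge weights:", ridge_dup.coef_)
print("Lasso weights:", lasso_dup.coef_)
```

This is why Ridge is usually the safer choice when correlated features all carry signal: Lasso's selection among them can be unstable and hard to interpret.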

### Elastic Net: The Best of Both Worlds

$$L_{\text{Elastic}} = \text{MSE} + \lambda_1 \sum |w_j| + \lambda_2 \sum w_j^2$$
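
scikit-learn parameterizes Elastic Net with `alpha` and `l1_ratio` rather than $\lambda_1$ and $\lambda_2$. Its documented objective is $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha\,\rho \sum |w_j| + \frac{1}{2}\alpha(1-\rho)\sum w_j^2$ with $\rho$ = `l1_ratio`. Multiplying by 2 and matching terms against the formula above (with MSE as the mean squared error) gives $\lambda_1 = 2\alpha\rho$ and $\lambda_2 = \alpha(1-\rho)$. The helper names below are hypothetical, written only to make that bookkeeping explicit.

```python
def to_sklearn_elastic_net(lam1, lam2):
    """Map (lam1, lam2) in MSE + lam1*sum|w| + lam2*sum(w^2) to (alpha, l1_ratio)."""
    alpha = lam1 / 2.0 + lam2
    if alpha <= 0:
        raise ValueError("At least one of lam1, lam2 must be positive.")
    l1_ratio = (lam1 / 2.0) / alpha
    return alpha, l1_ratio

def from_sklearn_elastic_net(alpha, l1_ratio):
    """Inverse mapping: recover (lam1, lam2) from scikit-learn's parameters."""
    lam1 = 2.0 * alpha * l1_ratio
    lam2 = alpha * (1.0 - l1_ratio)
    return lam1, lam2

print(from_sklearn_elastic_net(alpha=0.01, l1_ratio=0.5))
print(to_sklearn_elastic_net(lam1=0.01, lam2=0.005))
```

Under this mapping, `ElasticNet(alpha=0.01, l1_ratio=0.5)` corresponds to $\lambda_1 = 0.01$ and $\lambda_2 = 0.005$, so an even `l1_ratio` does not mean the two $\lambda$ values are equal.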

In [None]:
from sklearn.linear_model import ElasticNet

# Compare all three on manuscript data
elastic = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)  # 50% L1, 50% L2
elastic.fit(X_train_ms_scaled, y_train_ms)

print("Elastic Net Weights (50% L1 + 50% L2):")
print("=" * 50)
for feat, w in zip(feature_cols, elastic.coef_):
    zero_marker = "← ZERO" if abs(w) < 1e-6 else ""
    print(f"  {feat:30}: {w:8.4f} {zero_marker}")

test_mse_elastic = mean_squared_error(y_test_ms, elastic.predict(X_test_ms_scaled))
print(f"\nElastic Net Test MSE: {test_mse_elastic:.4f}")

---

## Exercises

### Exercise 1: Optimal Lambda Search

Use cross-validation to find the optimal λ for Ridge regression on the manuscript data.

In [None]:
# Exercise 1: Find optimal lambda using cross-validation
from sklearn.model_selection import cross_val_score

lambdas_to_try = [0.001, 0.01, 0.1, 1, 10, 100]

# YOUR CODE HERE
# for lam in lambdas_to_try:
#     ridge = Ridge(alpha=lam)
#     scores = cross_val_score(ridge, X_train_ms_scaled, y_train_ms, 
#                              cv=5, scoring='neg_mean_squared_error')
#     print(f"λ={lam}: Mean CV MSE = {-scores.mean():.4f} (+/- {scores.std():.4f})")

### Exercise 2: Feature Selection with Lasso

Find the λ value where Lasso selects exactly 3 features for forgery detection.

In [None]:
# Exercise 2: Find lambda for exactly 3 features

# YOUR CODE HERE
# Test different lambda values and count non-zero weights
# for lam in [0.001, 0.005, 0.01, 0.02, 0.05, 0.1]:
#     lasso = Lasso(alpha=lam, max_iter=10000)
#     lasso.fit(X_train_ms_scaled, y_train_ms)
#     n_features = np.sum(np.abs(lasso.coef_) > 1e-6)
#     print(f"λ={lam}: {n_features} non-zero features")

### Exercise 3: Regularization for Expedition Data

Apply Ridge and Lasso to predict expedition catch_value from multiple features.

In [None]:
# Exercise 3: Expedition catch value prediction

exp_features = ['days_in_field', 'crew_size', 'leader_experience_years', 'creature_encounters']
# Note: You'll need to handle missing values if any

# YOUR CODE HERE
# X_exp = expeditions[exp_features].values
# y_exp = expeditions['catch_value'].values
# 
# Split, scale, fit Ridge and Lasso, compare results

### Exercise 4: The Effect of Scaling

What happens if you apply regularization WITHOUT scaling the features first? Compare results.

In [None]:
# Exercise 4: Regularization without scaling

# YOUR CODE HERE
# Fit Ridge on scaled vs unscaled data
# Compare the weights - what do you observe?

### Exercise 5: Elastic Net Tuning

Tune both α (overall strength) and l1_ratio (L1 vs L2 balance) for Elastic Net.

In [None]:
# Exercise 5: Elastic Net parameter tuning

alphas = [0.001, 0.01, 0.1]
l1_ratios = [0.1, 0.5, 0.9]  # 0.1 = mostly L2, 0.9 = mostly L1

# YOUR CODE HERE
# for alpha in alphas:
#     for l1_ratio in l1_ratios:
#         elastic = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
#         elastic.fit(X_train_ms_scaled, y_train_ms)
#         test_mse = mean_squared_error(y_test_ms, elastic.predict(X_test_ms_scaled))
#         n_features = np.sum(np.abs(elastic.coef_) > 1e-6)
#         print(f"α={alpha}, L1_ratio={l1_ratio}: MSE={test_mse:.4f}, features={n_features}")

---

## Summary

| Method | Penalty | Effect | Use When |
|--------|---------|--------|----------|
| **Ridge (L2)** | $\lambda \sum w_j^2$ | Shrinks all weights | All features relevant, correlated features |
| **Lasso (L1)** | $\lambda \sum \lvert w_j \rvert$ | Zeros some weights | Feature selection needed |
| **Elastic Net** | $\lambda_1 \sum \lvert w_j \rvert + \lambda_2 \sum w_j^2$ | Shrinks all weights and zeros some | Feature selection with groups of correlated features |

---

## Key Takeaways

1. **Regularization prevents overfitting** by penalizing large weights

2. **L2 (Ridge) shrinks weights smoothly** — all features remain, but with smaller influence

3. **L1 (Lasso) creates sparse solutions** — automatic feature selection

4. **The geometry explains the difference** — L2 constraint is a circle, L1 is a diamond with corners

5. **λ controls the trade-off** — use validation data to find the optimal value

6. **Always scale features** before applying regularization
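
Takeaways 5 and 6 can be combined in one pattern. The sketch below is one possible arrangement, not the lesson's own method: it assumes synthetic data (`X_demo`, `y_demo`, and `candidate_alphas` are made-up names) and wraps the scaler and the model in a pipeline, so that the scaler is re-fit on the training portion of every cross-validation fold while the grid search picks the penalty strength.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 5 features on very different scales
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 0.1, 1000.0])
X_demo = rng.normal(size=(100, 5)) * scales
y_demo = X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Scaling lives inside the pipeline, so no fold sees statistics from its held-out part
candidate_alphas = np.logspace(-3, 3, 13)
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    {"ridge__alpha": candidate_alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha:.4g}, CV MSE: {-search.best_score_:.4f}")
```

Swapping `Ridge()` for `Lasso(max_iter=10000)` (and the grid key for `lasso__alpha`) gives the same workflow for L1 regularization.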

---

## Next Lesson

In **Lesson 4: Model Selection & Cross-Validation (Capstone)**, we'll bring everything together. We'll learn how to properly evaluate models, tune hyperparameters, and build a complete ML pipeline for the forgery detection challenge.

*"The Tribunal has heard about loss functions, bias-variance trade-offs, and regularization. Now comes the final test: building a complete system that can be trusted to classify manuscripts we have never seen. This is the true challenge of machine learning—generalization."*  
— Mink Pavar, preparing for the final day of testimony