[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/04-applied-ml/notebooks/02-bias-variance-tradeoff.ipynb)

# Lesson 2: The Bias-Variance Trade-off

*"The Tribunal grew impatient. 'Your simple model makes errors,' they said. 'Give us a more complex one—one that makes no errors on the manuscripts we've shown you.' I warned them: a model that perfectly fits the past may fail spectacularly on the future. This is the deepest truth of prediction."*  
— Mink Pavar, second day of testimony

---

## The Most Important Concept in Machine Learning

The Forgery Trial took an unexpected turn on the second day. A rival scholar, Eulr Voss, challenged Mink Pavar's method:

> *"Your linear model makes errors! I can construct a model so complex that it perfectly classifies every manuscript you've shown us. Zero errors. Is that not better?"*

The Tribunal murmured in approval. Zero errors sounded impressive.

But Mink Pavar shook his head.

> *"Bring me a new manuscript—one your model has never seen. I wager my model will outperform yours. For yours has memorized the training data, while mine has learned the underlying pattern."*

This challenge revealed the central tension in all of machine learning: **the bias-variance trade-off**.

---

## Learning Objectives

By the end of this lesson, you will:
1. Understand bias (underfitting) and variance (overfitting) intuitively
2. Visualize how model complexity affects both
3. See why training error alone is misleading
4. Learn to diagnose which problem your model has
5. Understand the mathematical decomposition of error

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load datasets
manuscripts = pd.read_csv(BASE_URL + "manuscript_features.csv")
expeditions = pd.read_csv(BASE_URL + "expedition_outcomes.csv")

print(f"Loaded {len(manuscripts)} manuscripts")
print(f"Loaded {len(expeditions)} expeditions")

## Part 1: The Fundamental Problem

*"We have limited data—a sample of manuscripts. From this sample, we must learn a rule that applies to all manuscripts, including those we haven't yet seen. This is the challenge: learning from the particular to predict the general."*  
— Mink Pavar

### The True Function vs. Our Approximation

In the real world, there's some true relationship between features and outcomes. We never see this true function—we only see noisy samples from it.

Let's create a synthetic example to illustrate:

In [None]:
# The TRUE function (unknown in practice)
def true_function(x):
    """The true relationship between manuscript age and authenticity score."""
    return 0.5 + 0.3 * np.sin(2 * x) + 0.1 * x

# Generate "observed" data with noise
np.random.seed(42)
n_samples = 30
X_train = np.sort(np.random.uniform(0, 4, n_samples))
noise = np.random.normal(0, 0.15, n_samples)
y_train = true_function(X_train) + noise

# Generate test data (new manuscripts)
X_test = np.sort(np.random.uniform(0, 4, 20))
y_test = true_function(X_test) + np.random.normal(0, 0.15, 20)

# Plot
x_smooth = np.linspace(0, 4, 200)

plt.figure(figsize=(10, 6))
plt.plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, label='True function (unknown)')
plt.scatter(X_train, y_train, c='blue', s=60, edgecolor='black', label='Training data', zorder=5)
plt.scatter(X_test, y_test, c='orange', s=60, edgecolor='black', marker='s', label='Test data (unseen)', zorder=5)
plt.xlabel('Manuscript Feature (e.g., normalized age)', fontsize=12)
plt.ylabel('Authenticity Score', fontsize=12)
plt.title('The Fundamental Problem: Learning the True Function from Noisy Data', fontsize=13)
plt.legend()
plt.show()

print("We see the blue points (training data).")
print("We must predict the orange points (test data).")
print("The green line is the truth—but we never actually see it!")

## Part 2: Underfitting (High Bias)

*"Eulr Voss mocked my simple linear model. 'A straight line?' he laughed. 'The world is not so simple!' He was partly right—my model was too simple. It had HIGH BIAS: it assumed a relationship that was simpler than reality."*  
— Mink Pavar

### What is Bias?

**Bias** is the error from overly simplistic assumptions. A high-bias model:
- Is too simple to capture the true pattern
- Makes systematic errors
- Underfits both training AND test data

In [None]:
# Fit a simple linear model (degree 1 polynomial)
def fit_polynomial(X, y, degree):
    """Fit a polynomial of given degree."""
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    X_poly = poly.fit_transform(X.reshape(-1, 1))
    model = LinearRegression(fit_intercept=False)
    model.fit(X_poly, y)
    return model, poly

def predict_polynomial(model, poly, X):
    """Predict using fitted polynomial."""
    X_poly = poly.transform(X.reshape(-1, 1))
    return model.predict(X_poly)

# Fit linear model (degree 1) - HIGH BIAS
model_linear, poly_linear = fit_polynomial(X_train, y_train, degree=1)
y_train_pred_linear = predict_polynomial(model_linear, poly_linear, X_train)
y_test_pred_linear = predict_polynomial(model_linear, poly_linear, X_test)

# Calculate errors
train_mse_linear = mean_squared_error(y_train, y_train_pred_linear)
test_mse_linear = mean_squared_error(y_test, y_test_pred_linear)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True function')
plt.scatter(X_train, y_train, c='blue', s=60, edgecolor='black', label='Training data', zorder=5)
y_smooth_linear = predict_polynomial(model_linear, poly_linear, x_smooth)
plt.plot(x_smooth, y_smooth_linear, 'r-', linewidth=2, label=f'Linear model (degree 1)')

plt.xlabel('Manuscript Feature', fontsize=12)
plt.ylabel('Authenticity Score', fontsize=12)
plt.title('HIGH BIAS (Underfitting): Model Too Simple', fontsize=13)
plt.legend()
plt.show()

print("Underfitting (High Bias):")
print(f"  Training MSE: {train_mse_linear:.4f}")
print(f"  Test MSE:     {test_mse_linear:.4f}")
print("\nBoth errors are HIGH because the model is too simple.")
print("The straight line cannot capture the curved true function.")

## Part 3: Overfitting (High Variance)

*"Eulr Voss presented his model: a polynomial of degree 15. It passed through every training point perfectly. 'Zero training error!' he proclaimed. But I asked him: 'What happens when we test it on new manuscripts?'"*  
— Mink Pavar

### What is Variance?

**Variance** is the error from sensitivity to small fluctuations in the training data. A high-variance model:
- Is too complex
- Memorizes the noise in the training data
- Fits the training data very closely (sometimes perfectly) but fails on new data
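
To make "sensitivity to the training data" concrete, here is a minimal sketch that redraws the training set many times and measures how much the prediction at one fixed point moves. It assumes the same synthetic truth and noise level (0.15) as Part 1. The helper `prediction_spread` and the use of NumPy's `Polynomial.fit` (instead of this lesson's scikit-learn pipeline) are choices made for this illustration only.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_function(x):
    # Same synthetic "truth" as in Part 1 of this lesson
    return 0.5 + 0.3 * np.sin(2 * x) + 0.1 * x

def prediction_spread(degree, x0=2.0, n_resamples=200, n_samples=30,
                      noise_sd=0.15, seed=0):
    """Std. dev. of the prediction at x0 across independently drawn training sets.

    Hypothetical helper for illustration; not part of the lesson's own code.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_resamples):
        x = rng.uniform(0, 4, n_samples)
        y = true_function(x) + rng.normal(0, noise_sd, n_samples)
        # Polynomial.fit rescales x internally, which keeps high degrees stable
        fitted = Polynomial.fit(x, y, degree)
        preds.append(fitted(x0))
    return float(np.std(preds))

spread_simple = prediction_spread(degree=1)
spread_complex = prediction_spread(degree=15)
print(f"Degree  1: prediction std at x=2.0 = {spread_simple:.4f}")
print(f"Degree 15: prediction std at x=2.0 = {spread_complex:.4f}")
```

The larger spread for the degree-15 fit is what "high variance" refers to: the answer depends heavily on which 30 manuscripts happened to be sampled.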

In [None]:
# Fit a high-degree polynomial (degree 15) - HIGH VARIANCE
model_complex, poly_complex = fit_polynomial(X_train, y_train, degree=15)
y_train_pred_complex = predict_polynomial(model_complex, poly_complex, X_train)
y_test_pred_complex = predict_polynomial(model_complex, poly_complex, X_test)

# Calculate errors
train_mse_complex = mean_squared_error(y_train, y_train_pred_complex)
test_mse_complex = mean_squared_error(y_test, y_test_pred_complex)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True function')
plt.scatter(X_train, y_train, c='blue', s=60, edgecolor='black', label='Training data', zorder=5)

# Predict on smooth x values (clip extreme predictions for visualization)
y_smooth_complex = predict_polynomial(model_complex, poly_complex, x_smooth)
y_smooth_complex = np.clip(y_smooth_complex, -1, 3)  # Clip for visualization
plt.plot(x_smooth, y_smooth_complex, 'r-', linewidth=2, label=f'Complex model (degree 15)')

plt.xlabel('Manuscript Feature', fontsize=12)
plt.ylabel('Authenticity Score', fontsize=12)
plt.title('HIGH VARIANCE (Overfitting): Model Too Complex', fontsize=13)
plt.ylim(-0.5, 2)
plt.legend()
plt.show()

print("Overfitting (High Variance):")
print(f"  Training MSE: {train_mse_complex:.4f} (very low!)")
print(f"  Test MSE:     {test_mse_complex:.4f} (much higher!)")
print("\nThe model memorized the training data (low training error)")
print("but fails to generalize (high test error).")
print("\nThe wiggly curve fits the noise, not the signal!")

## Part 4: The Sweet Spot (Balanced Complexity)

*"The Tribunal asked: 'What is the right complexity?' I told them: enough to capture the pattern, but not so much that we fit the noise. We seek the sweet spot."*  
— Mink Pavar

In [None]:
# Fit models of various complexities
degrees = [1, 2, 3, 4, 5, 7, 10, 15]
train_errors = []
test_errors = []

for degree in degrees:
    model, poly = fit_polynomial(X_train, y_train, degree)
    y_train_pred = predict_polynomial(model, poly, X_train)
    y_test_pred = predict_polynomial(model, poly, X_test)
    train_errors.append(mean_squared_error(y_train, y_train_pred))
    test_errors.append(mean_squared_error(y_test, y_test_pred))

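# Find best degree (selecting on the test set is for illustration only;
# in practice, choose complexity on a separate validation set or by cross-validation)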
best_idx = np.argmin(test_errors)
best_degree = degrees[best_idx]

# Plot
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'b-o', linewidth=2, markersize=8, label='Training Error')
plt.plot(degrees, test_errors, 'r-s', linewidth=2, markersize=8, label='Test Error')
plt.axvline(best_degree, color='green', linestyle='--', linewidth=2, label=f'Best degree = {best_degree}')

# Annotate regions
plt.annotate('UNDERFITTING\n(High Bias)', xy=(1.5, 0.04), fontsize=11, ha='center',
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
plt.annotate('OVERFITTING\n(High Variance)', xy=(12, 0.1), fontsize=11, ha='center',
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
plt.annotate('SWEET SPOT', xy=(best_degree, test_errors[best_idx] - 0.015), fontsize=11, ha='center',
             bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('The Bias-Variance Trade-off: Finding the Sweet Spot', fontsize=13)
plt.legend()
plt.ylim(0, 0.2)
plt.show()

print("Key Observations:")
print("1. Training error keeps falling (or at least never rises) as complexity grows")
print("2. Test error decreases, then INCREASES (the U-curve)")
print(f"3. Best model: degree {best_degree} (lowest test error)")

In [None]:
# Visualize the sweet spot model
model_best, poly_best = fit_polynomial(X_train, y_train, degree=best_degree)
y_smooth_best = predict_polynomial(model_best, poly_best, x_smooth)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Underfitting
axes[0].plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True')
axes[0].scatter(X_train, y_train, c='blue', s=40, edgecolor='black')
axes[0].plot(x_smooth, y_smooth_linear, 'r-', linewidth=2)
axes[0].set_title(f'Underfitting (degree 1)\nTest MSE: {test_mse_linear:.4f}', fontsize=11)
axes[0].set_ylim(-0.2, 1.8)

# Sweet Spot
axes[1].plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True')
axes[1].scatter(X_train, y_train, c='blue', s=40, edgecolor='black')
axes[1].plot(x_smooth, y_smooth_best, 'r-', linewidth=2)
axes[1].set_title(f'Sweet Spot (degree {best_degree})\nTest MSE: {test_errors[best_idx]:.4f}', fontsize=11)
axes[1].set_ylim(-0.2, 1.8)

# Overfitting
axes[2].plot(x_smooth, true_function(x_smooth), 'g-', linewidth=2, alpha=0.5, label='True')
axes[2].scatter(X_train, y_train, c='blue', s=40, edgecolor='black')
y_smooth_complex_clipped = np.clip(y_smooth_complex, -0.5, 2)
axes[2].plot(x_smooth, y_smooth_complex_clipped, 'r-', linewidth=2)
axes[2].set_title(f'Overfitting (degree 15)\nTest MSE: {test_mse_complex:.4f}', fontsize=11)
axes[2].set_ylim(-0.2, 1.8)

for ax in axes:
    ax.set_xlabel('Feature')
    ax.set_ylabel('Score')

plt.tight_layout()
plt.show()

## Part 5: The Mathematical Decomposition

*"The Tribunal demanded proof. I showed them that any prediction error can be decomposed into three parts: bias, variance, and irreducible noise. Only by understanding this decomposition can we make wise choices."*  
— Mink Pavar

### The Bias-Variance Decomposition

For squared-error loss, the expected prediction error at a given point, averaged over possible training sets and over the noise, can be written as:

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

Where:
- **Bias²**: Error from wrong assumptions (model too simple)
- **Variance**: Error from sensitivity to training data (model too complex)
- **Irreducible Noise**: Random error we can never eliminate
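
Written out more formally, the decomposition might look like the following. This is a sketch under assumptions that the lesson does not state explicitly: observations follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, the noise is independent of the training set $D$, and $\hat f_D$ denotes the model fitted on $D$. The symbols $f$, $\hat f_D$, $D$, and $\sigma^2$ are notation introduced here for illustration.

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}_D\big[(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Noise}}
$$

The identity follows from adding and subtracting $\mathbb{E}_D[\hat f_D(x)]$ inside the square and expanding; the cross terms vanish because the noise has zero mean and is independent of $D$.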

In [None]:
# Simulate the bias-variance decomposition
# by training many models on different samples

def simulate_bias_variance(degree, n_simulations=100):
    """Estimate bias and variance by training on many different samples."""
    predictions = []
    
    # Fixed test point
    x_test_point = np.array([2.0])
    true_y = true_function(x_test_point)[0]
    
    for _ in range(n_simulations):
        # Generate new training data
        X_sim = np.sort(np.random.uniform(0, 4, 30))
        y_sim = true_function(X_sim) + np.random.normal(0, 0.15, 30)
        
        # Fit model
        model, poly = fit_polynomial(X_sim, y_sim, degree)
        pred = predict_polynomial(model, poly, x_test_point)[0]
        predictions.append(pred)
    
    predictions = np.array(predictions)
    
    # Calculate components
    mean_pred = np.mean(predictions)
    bias = mean_pred - true_y
    variance = np.var(predictions)
    
    return bias**2, variance, predictions

# Calculate for different complexities
degrees_to_test = [1, 2, 3, 5, 7, 10, 15]
biases_sq = []
variances = []

print("Simulating bias-variance for different model complexities...")
for deg in degrees_to_test:
    b2, v, _ = simulate_bias_variance(deg)
    biases_sq.append(b2)
    variances.append(v)
    print(f"  Degree {deg:2d}: Bias² = {b2:.4f}, Variance = {v:.4f}, Bias² + Variance = {b2+v:.4f}")

In [None]:
# Visualize the decomposition
irreducible_noise = 0.15**2  # Noise variance (sigma^2); the simulated noise has std 0.15

plt.figure(figsize=(10, 6))
plt.plot(degrees_to_test, biases_sq, 'b-o', linewidth=2, markersize=8, label='Bias²')
plt.plot(degrees_to_test, variances, 'r-s', linewidth=2, markersize=8, label='Variance')
total_error = [b + v for b, v in zip(biases_sq, variances)]
plt.plot(degrees_to_test, total_error, 'g-^', linewidth=2, markersize=8, label='Bias² + Variance')
plt.axhline(irreducible_noise, color='gray', linestyle='--', label=f'Irreducible Noise ({irreducible_noise:.4f})')

plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=12)
plt.ylabel('Error Component', fontsize=12)
plt.title('The Bias-Variance Decomposition', fontsize=13)
plt.legend()
plt.show()

print("Key Insight:")
print("- Bias tends to DECREASE with complexity (model becomes more flexible)")
print("- Variance tends to INCREASE with complexity (model becomes more sensitive)")
print("- Total error is minimized at intermediate complexity")
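
As a rough numerical check of the decomposition, the sketch below compares Bias² + Variance + noise variance against the squared error measured on fresh noisy observations at the same point. It assumes the Part 1 setup (same true function, noise std 0.15, 30 training points). The degree (3), the number of simulations, and the use of NumPy's `Polynomial.fit` are choices for this illustration, not the lesson's own method.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_function(x):
    # Same synthetic "truth" as in Part 1 of this lesson
    return 0.5 + 0.3 * np.sin(2 * x) + 0.1 * x

rng = np.random.default_rng(0)
noise_sd = 0.15
x0, degree, n_sims = 2.0, 3, 2000  # illustrative choices

# One prediction at x0 per independently drawn training set
preds = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0, 4, 30)
    y = true_function(x) + rng.normal(0, noise_sd, 30)
    preds[i] = Polynomial.fit(x, y, degree)(x0)

# Fresh noisy observations at x0, one per simulated model
y_new = true_function(x0) + rng.normal(0, noise_sd, n_sims)

bias_sq = (preds.mean() - true_function(x0)) ** 2
variance = preds.var()
decomposed = bias_sq + variance + noise_sd ** 2
measured = np.mean((y_new - preds) ** 2)

print(f"Bias² + Variance + Noise = {decomposed:.4f}")
print(f"Measured squared error   = {measured:.4f}")
```

The two numbers should agree up to simulation error, which shrinks as `n_sims` grows.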

## Part 6: Diagnosing Your Model

*"The Tribunal asked: 'How do we know if our model has bias or variance problems?' I gave them this diagnostic table."*  
— Mink Pavar

### Diagnostic Table

| Training Error | Test Error | Diagnosis | Solution |
|----------------|------------|-----------|----------|
| High | High | **Underfitting** (High Bias) | Increase complexity |
| Low | High | **Overfitting** (High Variance) | Decrease complexity, add regularization |
| Low | Low | **Good fit** | You're done! |
| High | Low | **Unusual** | Check for data leakage, an evaluation bug, or a small, unrepresentative test set |

In [None]:
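# Rule-based diagnosis helper
def diagnose_model(train_error, test_error, threshold=0.03):
    """Diagnose a model from its training and test error.

    The default threshold (0.03) suits this lesson's synthetic data, where the
    noise variance is 0.15**2 = 0.0225; adjust it for other problems.
    """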
    gap = test_error - train_error
    
    print(f"Training Error: {train_error:.4f}")
    print(f"Test Error:     {test_error:.4f}")
    print(f"Gap:            {gap:.4f}")
    print("\n" + "="*50)
    
    if train_error > threshold and test_error > threshold:
        print("DIAGNOSIS: UNDERFITTING (High Bias)")
        print("\nBoth training and test errors are high.")
        print("The model is too simple to capture the pattern.")
        print("\nSolutions:")
        print("  - Increase model complexity")
        print("  - Add more features")
        print("  - Use a more flexible model family")
        
    elif train_error < threshold and test_error > threshold:
        print("DIAGNOSIS: OVERFITTING (High Variance)")
        print("\nTraining error is low but test error is high.")
        print("The model memorized training data but doesn't generalize.")
        print("\nSolutions:")
        print("  - Reduce model complexity")
        print("  - Add regularization (next lesson!)")
        print("  - Get more training data")
        print("  - Use dropout (for neural networks)")
        
    elif train_error < threshold and test_error < threshold:
        print("DIAGNOSIS: GOOD FIT")
        print("\nBoth errors are below the threshold.")
        print("The model appears to generalize well.")
        
    else:
        print("DIAGNOSIS: CHECK YOUR DATA")
        print("\nTest error lower than training error is suspicious.")
        print("Check for data leakage or evaluation errors.")

print("Example 1: Linear Model (Underfitting)")
print("-" * 50)
diagnose_model(train_mse_linear, test_mse_linear)

In [None]:
print("\nExample 2: Complex Model (Overfitting)")
print("-" * 50)
diagnose_model(train_mse_complex, test_mse_complex)

In [None]:
print("\nExample 3: Balanced Model (Good Fit)")
print("-" * 50)
train_mse_best = mean_squared_error(y_train, predict_polynomial(model_best, poly_best, X_train))
test_mse_best = mean_squared_error(y_test, predict_polynomial(model_best, poly_best, X_test))
diagnose_model(train_mse_best, test_mse_best)

## Part 7: Applying to Manuscript Forgery Detection

*"Let us return to the forgery problem. How complex should our model be?"*  
— Mink Pavar

Let's apply bias-variance thinking to the real manuscript data:

In [None]:
# Prepare manuscript data
from sklearn.model_selection import train_test_split

# Use stylometric_variance and era_marker_score to predict forgery
X_ms = manuscripts[['stylometric_variance', 'era_marker_score']].values
y_ms = manuscripts['is_forgery'].astype(int).values

# Split into train/test
X_train_ms, X_test_ms, y_train_ms, y_test_ms = train_test_split(
    X_ms, y_ms, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train_ms)}")
print(f"Test samples: {len(X_test_ms)}")
print(f"Forgery rate in training: {y_train_ms.mean()*100:.1f}%")
print(f"Forgery rate in test: {y_test_ms.mean()*100:.1f}%")

In [None]:
# Test different polynomial complexities
degrees_ms = [1, 2, 3, 4, 5, 6]
train_errors_ms = []
test_errors_ms = []

for degree in degrees_ms:
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    X_train_poly = poly.fit_transform(X_train_ms)
    X_test_poly = poly.transform(X_test_ms)
    
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train_poly, y_train_ms)
    
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    
    train_errors_ms.append(mean_squared_error(y_train_ms, y_train_pred))
    test_errors_ms.append(mean_squared_error(y_test_ms, y_test_pred))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(degrees_ms, train_errors_ms, 'b-o', linewidth=2, markersize=8, label='Training Error')
plt.plot(degrees_ms, test_errors_ms, 'r-s', linewidth=2, markersize=8, label='Test Error')

best_degree_ms = degrees_ms[np.argmin(test_errors_ms)]
plt.axvline(best_degree_ms, color='green', linestyle='--', linewidth=2, 
            label=f'Best degree = {best_degree_ms}')

plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('Bias-Variance Trade-off for Forgery Detection', fontsize=13)
plt.legend()
plt.show()

print(f"Best polynomial degree for forgery detection: {best_degree_ms}")
print(f"Test MSE at best degree: {min(test_errors_ms):.4f}")

---

## Exercises

### Exercise 1: Diagnose These Models

For each scenario, diagnose whether the model suffers from bias or variance:

In [None]:
# Exercise 1: Diagnose these models
scenarios = [
    ("Model A", 0.15, 0.16),  # Training MSE, Test MSE
    ("Model B", 0.02, 0.18),
    ("Model C", 0.05, 0.06),
    ("Model D", 0.25, 0.24),
]

print("Diagnose each model:")
print("="*60)
for name, train_err, test_err in scenarios:
    print(f"\n{name}: Train={train_err:.2f}, Test={test_err:.2f}")
    # YOUR DIAGNOSIS HERE
    # Is this underfitting, overfitting, or good fit?

### Exercise 2: Learning Curves

Create learning curves showing how training and test error change with the amount of training data. This is another way to diagnose bias vs variance.

In [None]:
# Exercise 2: Learning curves
# How do errors change as we add more training data?

train_sizes = [5, 10, 15, 20, 25, 30]
# For each size, train a model and record train/test error

# YOUR CODE HERE
# degree = 3  # Choose a complexity
# train_errors_lc = []
# test_errors_lc = []
#
# for size in train_sizes:
#     # Use first 'size' training points
#     # Fit model
#     # Record errors
#     pass
#
# plt.plot(train_sizes, train_errors_lc, 'b-o', label='Training')
# plt.plot(train_sizes, test_errors_lc, 'r-s', label='Test')
# plt.xlabel('Training Set Size')
# plt.ylabel('MSE')
# plt.legend()

### Exercise 3: The Effect of Noise

How does the noise level affect the optimal model complexity? Test with noise levels 0.05, 0.15, and 0.30.

In [None]:
# Exercise 3: Effect of noise on optimal complexity

noise_levels = [0.05, 0.15, 0.30]

# YOUR CODE HERE
# For each noise level:
#   - Generate new training/test data
#   - Test different polynomial degrees
#   - Find the optimal degree
#   - Print results

# What pattern do you observe?

### Exercise 4: Manuscript Complexity Analysis

Add more features to the forgery model (vocabulary_richness, avg_sentence_length). Does higher feature count lead to overfitting?

In [None]:
# Exercise 4: More features for forgery detection

feature_sets = [
    ['stylometric_variance'],
    ['stylometric_variance', 'era_marker_score'],
    ['stylometric_variance', 'era_marker_score', 'vocabulary_richness'],
    ['stylometric_variance', 'era_marker_score', 'vocabulary_richness', 'avg_sentence_length'],
]

# YOUR CODE HERE
# For each feature set:
#   - Split data
#   - Train linear regression
#   - Record train/test error
#   - Does adding features help or hurt?

### Exercise 5: Eulr Voss's Challenge

Eulr Voss claims his degree-20 polynomial achieves "perfect training accuracy" on the manuscript data. Show him why this is misleading.

In [None]:
# Exercise 5: Debunk Eulr Voss's perfect model

# Fit a degree-20 polynomial on manuscript data
# Show the training vs test error
# Explain why "perfect training accuracy" is not the goal

# YOUR CODE HERE

---

## Summary

| Concept | Definition | Symptom |
|---------|------------|----------|
| **Bias** | Error from simplistic assumptions | High train error, high test error |
| **Variance** | Error from sensitivity to training data | Low train error, high test error |
| **Underfitting** | Model too simple | Cannot capture the pattern |
| **Overfitting** | Model too complex | Memorizes noise instead of signal |
| **Sweet Spot** | Balanced complexity | Minimizes total error |

---

## Key Takeaways

1. **Training error alone is misleading** — a model with zero training error might be terrible on new data

2. **The bias-variance trade-off is fundamental** — reducing one often increases the other

3. **Model complexity must be chosen carefully** — not too simple (underfitting), not too complex (overfitting)

4. **Always evaluate on held-out test data** — this is how we detect overfitting

5. **The goal is generalization** — we want to predict new data, not memorize old data
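
Takeaways 1 and 4 leave one practical question open: if the test set is reserved for the final check, how should the degree be chosen? A common answer is k-fold cross-validation on the training data. The sketch below is a minimal NumPy-only version under the Part 1 assumptions (same true function, noise std 0.15, 30 points). The helper `cv_mse_for_degree` is hypothetical and written for this illustration; scikit-learn's `cross_val_score` offers the same idea as a library call.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_function(x):
    # Same synthetic "truth" as in Part 1 of this lesson
    return 0.5 + 0.3 * np.sin(2 * x) + 0.1 * x

rng = np.random.default_rng(42)
x_train = rng.uniform(0, 4, 30)
y_train = true_function(x_train) + rng.normal(0, 0.15, 30)

def cv_mse_for_degree(x, y, degree, n_folds=5, seed=0):
    """Mean validation MSE over k shuffled folds (hypothetical helper)."""
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, n_folds)
    fold_mse = []
    for val_idx in folds:
        # Train on everything except the current validation fold
        train_idx = np.setdiff1d(order, val_idx)
        fitted = Polynomial.fit(x[train_idx], y[train_idx], degree)
        fold_mse.append(np.mean((y[val_idx] - fitted(x[val_idx])) ** 2))
    return float(np.mean(fold_mse))

degrees = [1, 2, 3, 4, 5, 7, 10, 15]
cv_mse = {d: cv_mse_for_degree(x_train, y_train, d) for d in degrees}
best_degree = min(cv_mse, key=cv_mse.get)

for d in degrees:
    print(f"Degree {d:2d}: CV MSE = {cv_mse[d]:.4f}")
print(f"Degree chosen by cross-validation: {best_degree}")
```

Because the degree is chosen without touching the test set, the test error of the selected model remains an honest estimate of performance on new manuscripts.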

---

## Next Lesson

In **Lesson 3: Regularization — Taming Complexity**, we'll learn a powerful technique to prevent overfitting without reducing model complexity.

*"The Tribunal asked: 'Is there a way to have a complex model that doesn't overfit?' I told them yes—but we must add a penalty for complexity. This is regularization: we tell the model that large weights are expensive. The model must justify its complexity with better predictions."*  
— Mink Pavar