[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/04-applied-ml/notebooks/01-deriving-linear-regression.ipynb)

# Lesson 1: Deriving Linear Regression from First Principles

*"The Tribunal demanded evidence. 'Show us,' they said, 'how you know this manuscript is false.' I could not simply declare it—I needed mathematics that would stand before the court. I needed a method that derived its authority from pure logic, not from the intuition of scholars who might be bribed or mistaken."*  
— Mink Pavar, testimony at the Great Forgery Trial of 912

---

## The Forgery Trial Begins

In the year 912, the Capital Archives faced a crisis. Dozens of manuscripts attributed to the great philosopher Grigsu Haldo had been called into question. The accusations came from an unlikely source: Mink Pavar, the reclusive scholar of the Water School, who claimed that someone had been forging Haldo's works for decades.

The Tribunal convened. Careers hung in the balance. Fortunes had been spent on manuscripts now suspected to be worthless. Mink Pavar stood before them with a radical proposition:

> *"Let the numbers speak. Each manuscript has measurable features—sentence length, vocabulary richness, philosophical term density. A true Haldo manuscript will have patterns. A forgery will deviate. I propose we derive a mathematical rule that separates authentic from false."*

The Tribunal was skeptical. "And how do we know your rule is correct?"

Mink Pavar smiled. "Because we will derive it from first principles. We will not assume the answer—we will prove it."

---

## Learning Objectives

By the end of this lesson, you will:
1. Understand what "best fit" actually means mathematically
2. Derive the loss function from intuitive requirements
3. Use calculus to find the optimal parameters
4. Implement linear regression from scratch on Densworld data
5. Connect linear regression to the forgery detection problem

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load the manuscript features data (for forgery detection)
manuscripts = pd.read_csv(BASE_URL + "manuscript_features.csv")
# Load creature market data (for price prediction)
market = pd.read_csv(BASE_URL + "creature_market.csv")
# Load expedition data
expeditions = pd.read_csv(BASE_URL + "expedition_outcomes.csv")

print(f"Loaded {len(manuscripts)} manuscript records")
print(f"Loaded {len(market)} market transactions")
print(f"Loaded {len(expeditions)} expedition records")
print(f"\nForgeries in manuscript data: {manuscripts['is_forgery'].sum()} ({manuscripts['is_forgery'].mean()*100:.1f}%)")

## Part 1: The Problem Statement

*"Before we can find the answer, we must precisely state the question. The Tribunal wanted me to identify forgeries. But I realized: the first step is simpler. Can we predict one measurable feature from another?"*  
— Mink Pavar

### The Miasto Trappers Guild's Question

Before tackling the complex forgery problem, let's start with a simpler question from the Miasto Trappers Guild:

> "Can we predict a creature's market price from its danger rating?"

The Guild believes: *"Dangerous creatures fetch higher prices."*

Let's see if this holds—and find the "best" line to describe the relationship.

In [None]:
# Group by creature to get average price per danger level
creature_stats = market.groupby('creature_id').agg({
    'creature_name': 'first',
    'danger_rating': 'first',
    'rarity_rating': 'first',
    'price_per_unit': 'mean'
}).reset_index()

creature_stats.columns = ['id', 'name', 'danger', 'rarity', 'avg_price']

# Plot danger vs price
plt.figure(figsize=(10, 6))
plt.scatter(creature_stats['danger'], creature_stats['avg_price'], 
            s=100, c='darkred', edgecolor='black', alpha=0.7)

for _, row in creature_stats.iterrows():
    plt.annotate(row['name'].split()[0], 
                 xy=(row['danger'], row['avg_price']),
                 xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.xlabel('Danger Rating', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.title('Creature Danger vs. Market Price\nWhat is the "Best" Line?', fontsize=13)
plt.show()

print("The Miasto Guild's hypothesis seems plausible.")
print("But what exactly is the relationship? And what is 'best'?")

## Part 2: Defining "Best" — Building the Loss Function

*"The Tribunal asked me: 'What makes one line better than another?' I told them: a good line makes small errors. But then they asked: 'How do you combine many small errors into a single measure of goodness?' This was the key question."*  
— Mink Pavar

Let's think step by step about what we want.

### Step 1: What are we fitting?

A line: $\hat{y} = w \cdot x + b$

where:
- $x$ = danger rating (input)
- $\hat{y}$ = predicted price (output)
- $w$ = slope (how much does price increase per danger point?)
- $b$ = intercept (base price for a harmless creature)

### Step 2: What is an error?

For each creature, the **residual** is the vertical distance between actual and predicted price:

$$\text{residual}_i = y_i - \hat{y}_i = \text{actual price} - \text{predicted price}$$

In [None]:
# Extract our data
X = creature_stats['danger'].values
y = creature_stats['avg_price'].values

# Let's try a guess: w=50, b=0 (price increases 50 per danger point)
w_guess = 50
b_guess = 0

y_pred = w_guess * X + b_guess
residuals = y - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=100, c='darkred', edgecolor='black', label='Actual prices')
plt.plot(X, y_pred, 'b-', linewidth=2, label=f'Guess: price = {w_guess}×danger + {b_guess}')

# Draw residuals as vertical lines
for i in range(len(X)):
    color = 'green' if residuals[i] > 0 else 'red'
    plt.vlines(X[i], y_pred[i], y[i], colors=color, linestyles='dashed', linewidth=2)

plt.xlabel('Danger Rating', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.title('Residuals: The Vertical Errors\n(Green = underpredicted, Red = overpredicted)', fontsize=13)
plt.legend()
plt.show()

print("Residuals (actual - predicted):")
for name, res in zip(creature_stats['name'], residuals):
    sign = '+' if res > 0 else ''
    print(f"  {name:25}: {sign}{res:.1f}")

### Step 3: The Summing Problem

*"My first instinct was to add up all the errors. But I quickly realized my mistake..."*  
— Mink Pavar

We might think: "Just minimize the sum of residuals!"

**Problem**: Positive and negative residuals cancel out!

A line could be terribly wrong in both directions, but have a sum near zero.
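
To see the cancellation on numbers small enough to check by hand, here is a minimal sketch. The data is a hypothetical toy set (not from the Densworld files), and the "bad line" is a flat line at the mean of `y_toy` that ignores `x_toy` entirely.

```python
import numpy as np

# Hypothetical toy data (not from the Densworld files): y = 2x exactly
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = 2.0 * x_toy

# A deliberately bad "line": flat at the mean of y, ignoring x entirely
y_flat = np.full_like(y_toy, y_toy.mean())

residuals_toy = y_toy - y_flat
sum_residuals = residuals_toy.sum()
sum_abs_residuals = np.abs(residuals_toy).sum()

print("Residuals:", residuals_toy)
print("Sum of residuals:", sum_residuals)
print("Sum of absolute residuals:", sum_abs_residuals)
```

The flat line misses every point except the middle one, yet its residuals sum to zero because the misses above and below the line offset each other.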

In [None]:
print(f"Sum of residuals: {residuals.sum():.1f}")
print(f"But total absolute error: {np.abs(residuals).sum():.1f}")
print("\nThe sum can be misleadingly small due to cancellation!")
print("\nMink Pavar's insight: 'We need all errors to count positively.'")

### Step 4: The Solution — Square the Residuals

*"I considered taking absolute values—but the absolute value function has sharp corners, making calculus difficult. Squaring is smooth everywhere. And it has another virtue: large errors are penalized more heavily. A line that makes one huge mistake is worse than a line that makes several small ones."*  
— Mink Pavar

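**Why squaring wins:**
- Squaring is differentiable everywhere, whereas the absolute value has a corner at zero (important for calculus!)
- Squaring penalizes large errors more heavily: doubling a residual quadruples its penalty
- The derivative of a squared residual is linear in $w$ and $b$, so setting it to zero gives equations we can solve exactly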
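
The "large errors are penalized more heavily" point can be illustrated with two hypothetical sets of residuals that have the same total absolute error. The numbers below are made up for illustration and do not come from the creature data.

```python
import numpy as np

# Two hypothetical residual patterns with the same total absolute error
one_big = np.array([10.0, 0.0, 0.0, 0.0, 0.0])        # one huge mistake
several_small = np.array([2.0, 2.0, 2.0, 2.0, 2.0])   # several small mistakes

abs_big = np.abs(one_big).sum()
abs_small = np.abs(several_small).sum()
sq_big = (one_big ** 2).sum()
sq_small = (several_small ** 2).sum()

print(f"Sum of absolute errors: {abs_big} vs {abs_small}")
print(f"Sum of squared errors:  {sq_big} vs {sq_small}")
```

Absolute error cannot tell the two patterns apart, while squared error rates the single large miss as five times worse.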

### The Loss Function: Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - wx_i - b)^2$$

In [None]:
def compute_mse(w, b, X, y):
    """Compute Mean Squared Error for price prediction."""
    y_pred = w * X + b
    return np.mean((y - y_pred)**2)

# Our guess
mse_guess = compute_mse(w_guess, b_guess, X, y)
print(f"MSE for our guess (w={w_guess}, b={b_guess}): {mse_guess:.1f}")

# Try a different line
mse_alt = compute_mse(40, 20, X, y)
print(f"MSE for alternative (w=40, b=20): {mse_alt:.1f}")

print(f"\nLower MSE = Better fit")
print(f"The alternative is {'better' if mse_alt < mse_guess else 'worse'}!")

## Part 3: The Loss Landscape

*"Imagine a vast terrain where every point represents a choice of parameters. The height at each point is the error. We seek the lowest valley—the parameters that minimize error. This is optimization."*  
— Mink Pavar, addressing the Tribunal

Now we have a loss function. How do we find the $w$ and $b$ that minimize it?

Let's visualize the "loss landscape"—how MSE changes with different parameter values:

In [None]:
# Create a grid of w and b values
w_range = np.linspace(-20, 80, 100)
b_range = np.linspace(-100, 200, 100)
W, B = np.meshgrid(w_range, b_range)

# Compute MSE for each combination
MSE = np.zeros_like(W)
for i in range(len(b_range)):
    for j in range(len(w_range)):
        MSE[i, j] = compute_mse(W[i, j], B[i, j], X, y)

# Plot the loss landscape
fig, ax = plt.subplots(figsize=(10, 8))
contour = ax.contourf(W, B, np.log10(MSE + 1), levels=50, cmap='viridis')
ax.contour(W, B, np.log10(MSE + 1), levels=15, colors='white', alpha=0.3, linewidths=0.5)
plt.colorbar(contour, label='log10(MSE + 1)')

ax.set_xlabel('Weight (w) — Price Increase per Danger Point', fontsize=12)
ax.set_ylabel('Bias (b) — Base Price', fontsize=12)
ax.set_title('The Loss Landscape\nDarker = Lower MSE = Better Fit', fontsize=13)

# Mark our guess
ax.plot(w_guess, b_guess, 'ro', markersize=12, label='Our guess')
ax.legend()
plt.show()

print("The dark valley is where the optimal parameters live.")
print("Our goal: find the lowest point in this landscape.")

## Part 4: The Analytical Solution

*"The Tribunal was impressed by the landscape visualization, but they demanded more. 'Can you prove that your solution is truly optimal?' I showed them that at the minimum, the derivative equals zero—this is the condition for optimality."*  
— Mink Pavar

For simple linear regression, we can solve directly by setting the derivatives to zero.

### The Derivation (from Module 3: Calculus)

We want to minimize:
$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - wx_i - b)^2$$

Setting $\frac{\partial L}{\partial w} = 0$ and $\frac{\partial L}{\partial b} = 0$ gives us:

$$w^* = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$b^* = \bar{y} - w^*\bar{x}$$
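
For readers who want the intermediate steps, one way to fill them in is sketched here, using only the symbols already defined above. Differentiating $L$ with respect to $b$ and setting the result to zero gives the intercept:

$$\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - wx_i - b) = 0 \quad\Longrightarrow\quad b = \bar{y} - w\bar{x}$$

Differentiating with respect to $w$ gives a second condition:

$$\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - wx_i - b) = 0$$

Substituting $b = \bar{y} - w\bar{x}$ into the second condition and solving for $w$:

$$\sum_{i=1}^{n} x_i \big[(y_i - \bar{y}) - w(x_i - \bar{x})\big] = 0 \quad\Longrightarrow\quad w = \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}$$

Because $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$, replacing $x_i$ with $x_i - \bar{x}$ in the numerator and in the first factor of the denominator changes neither sum, which yields the centered form of $w^*$ shown above.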

In [None]:
def fit_linear_regression(X, y):
    """Compute optimal w and b using the closed-form solution."""
    x_mean = np.mean(X)
    y_mean = np.mean(y)
    
    # Optimal weight
    numerator = np.sum((X - x_mean) * (y - y_mean))
    denominator = np.sum((X - x_mean)**2)
    w = numerator / denominator
    
    # Optimal bias
    b = y_mean - w * x_mean
    
    return w, b

w_optimal, b_optimal = fit_linear_regression(X, y)
mse_optimal = compute_mse(w_optimal, b_optimal, X, y)

print("Optimal Parameters for Creature Pricing:")
print("=" * 50)
print(f"  w = {w_optimal:.2f} (price increases by ~{w_optimal:.0f} per danger point)")
print(f"  b = {b_optimal:.2f} (base price for danger=0 creature)")
print(f"  MSE = {mse_optimal:.1f}")
print(f"\nOur guess MSE was {mse_guess:.1f} — optimal is {mse_guess/mse_optimal:.1f}× better!")
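
As a sanity check on the closed-form formulas, the sketch below compares them with NumPy's `np.polyfit` on synthetic data. The data, the "true" slope and intercept, and the helper name `closed_form_fit` are all hypothetical choices for this illustration.

```python
import numpy as np

def closed_form_fit(x, y):
    """Slope and intercept from the closed-form least-squares formulas."""
    x_mean, y_mean = np.mean(x), np.mean(y)
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - w * x_mean
    return w, b

# Hypothetical synthetic data: a known line plus Gaussian noise
rng = np.random.default_rng(0)
x_syn = rng.uniform(0, 10, size=50)
y_syn = 40.0 * x_syn + 15.0 + rng.normal(0, 5.0, size=50)

w_cf, b_cf = closed_form_fit(x_syn, y_syn)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
w_np, b_np = np.polyfit(x_syn, y_syn, deg=1)

print(f"Closed form: w={w_cf:.4f}, b={b_cf:.4f}")
print(f"np.polyfit:  w={w_np:.4f}, b={b_np:.4f}")
```

Both routes minimize the same squared-error objective, so they should agree up to floating-point rounding.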

In [None]:
# Visualize the optimal line
x_line = np.linspace(0, 10, 100)
y_optimal = w_optimal * x_line + b_optimal

plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=100, c='darkred', edgecolor='black', label='Actual prices')
plt.plot(x_line, y_optimal, 'b-', linewidth=2, 
         label=f'Best fit: price = {w_optimal:.1f}×danger + {b_optimal:.1f}')

# Draw residuals
y_pred_optimal = w_optimal * X + b_optimal
for i in range(len(X)):
    plt.vlines(X[i], y_pred_optimal[i], y[i], colors='green', 
               linestyles='dashed', alpha=0.5)

# Add creature names
for _, row in creature_stats.iterrows():
    plt.annotate(row['name'].split()[0], 
                 xy=(row['danger'], row['avg_price']),
                 xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.xlabel('Danger Rating', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.title('The Optimal Line Minimizes Squared Residuals', fontsize=13)
plt.legend()
plt.show()

print("\nInterpretation for the Miasto Guild:")
print(f"  - A danger-0 creature is worth ~{b_optimal:.0f} on average")
print(f"  - Each point of danger adds ~{w_optimal:.0f} to the price")
print(f"  - A Stakdur (danger=9) should fetch ~{w_optimal*9 + b_optimal:.0f}")

## Part 5: Gradient Descent — The General-Purpose Algorithm

*"The closed-form solution works for simple problems. But what of complex ones—with many features, many parameters? The Tribunal would need a more general method. I showed them the algorithm of descent: follow the slope downhill until you reach the valley floor."*  
— Mink Pavar

The closed-form solution works for simple linear regression. But for more complex models (neural networks, etc.), there's no closed form.

**Gradient Descent** is the general-purpose optimization algorithm:

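1. Start with an initial guess for $w$ and $b$ (random or hand-picked)
2. Compute the gradient (the direction in which the loss increases fastest)
3. Take a small step in the opposite direction, scaled by the learning rate
4. Repeat until the loss stops decreasing (convergence)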

This connects back to Module 3—the Colonel's siege was gradient descent in the fog of war!
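
To make the four steps concrete, here is a single update worked through on a tiny data set. The three points, the starting values `w0 = b0 = 0`, and the learning rate of 0.1 are hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x
x_step = np.array([1.0, 2.0, 3.0])
y_step = np.array([2.0, 4.0, 6.0])
n_step = len(x_step)

def mse_step(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return np.mean((y_step - (w * x_step + b)) ** 2)

# Step 1: initial guess
w0, b0 = 0.0, 0.0
learning_rate = 0.1  # illustrative value
mse_before = mse_step(w0, b0)

# Step 2: gradient of MSE with respect to w and b at the current guess
error = y_step - (w0 * x_step + b0)
dw = -2 / n_step * np.sum(x_step * error)
db = -2 / n_step * np.sum(error)

# Step 3: move against the gradient
w1 = w0 - learning_rate * dw
b1 = b0 - learning_rate * db
mse_after = mse_step(w1, b1)

print(f"Gradient: dw={dw:.4f}, db={db:.4f}")
print(f"After one step: w={w1:.4f}, b={b1:.4f}")
print(f"MSE: {mse_before:.4f} -> {mse_after:.4f}")
```

Both gradient components are negative, meaning the loss falls as `w` and `b` increase, so the update raises both parameters and the MSE drops. Step 4 simply repeats this update.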

In [None]:
def gradient_descent(X, y, learning_rate=0.001, n_iterations=1000):
    """Find optimal w, b using gradient descent."""
    # Initialize with a guess
    w = 10.0  # Start with a low estimate
    b = 50.0
    n = len(X)
    
    history = {'w': [w], 'b': [b], 'mse': [compute_mse(w, b, X, y)]}
    
    for iteration in range(n_iterations):
        # Compute predictions
        y_pred = w * X + b
        
        # Compute gradients (partial derivatives of MSE)
        dw = -2/n * np.sum(X * (y - y_pred))
        db = -2/n * np.sum(y - y_pred)
        
        # Update parameters (move opposite to gradient)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record history
        history['w'].append(w)
        history['b'].append(b)
        history['mse'].append(compute_mse(w, b, X, y))
    
    return w, b, history

# Run gradient descent
w_gd, b_gd, history = gradient_descent(X, y, learning_rate=0.005, n_iterations=500)

print("Gradient Descent Results:")
print(f"  w = {w_gd:.2f} (optimal: {w_optimal:.2f})")
print(f"  b = {b_gd:.2f} (optimal: {b_optimal:.2f})")
print(f"  MSE = {history['mse'][-1]:.1f} (optimal: {mse_optimal:.1f})")
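
A common way to gain confidence in hand-derived gradients is to compare them with finite differences. The sketch below does this on synthetic data. The helper names (`mse_fn`, `analytic_gradient`, `numeric_gradient`) and the data are hypothetical, and the analytic formulas are the same ones used for `dw` and `db` above.

```python
import numpy as np

def mse_fn(w, b, x, y):
    """Mean squared error of the line y = w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def analytic_gradient(w, b, x, y):
    """Partial derivatives of MSE with respect to w and b."""
    n = len(x)
    error = y - (w * x + b)
    return -2 / n * np.sum(x * error), -2 / n * np.sum(error)

def numeric_gradient(w, b, x, y, eps=1e-4):
    """Central finite-difference approximation of the same gradient."""
    dw = (mse_fn(w + eps, b, x, y) - mse_fn(w - eps, b, x, y)) / (2 * eps)
    db = (mse_fn(w, b + eps, x, y) - mse_fn(w, b - eps, x, y)) / (2 * eps)
    return dw, db

# Hypothetical synthetic data for the check
rng = np.random.default_rng(1)
x_chk = rng.uniform(0, 10, size=30)
y_chk = 3.0 * x_chk + 7.0 + rng.normal(0, 1.0, size=30)

grad_analytic = analytic_gradient(10.0, 50.0, x_chk, y_chk)
grad_numeric = numeric_gradient(10.0, 50.0, x_chk, y_chk)

print("Analytic gradient:", grad_analytic)
print("Numeric gradient: ", grad_numeric)
```

If the two disagree noticeably, the analytic formula (or its sign) is the first place to look.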

In [None]:
# Visualize the gradient descent journey
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Path on loss landscape
contour = axes[0].contourf(W, B, np.log10(MSE + 1), levels=50, cmap='viridis')
axes[0].plot(history['w'], history['b'], 'r.-', markersize=2, linewidth=1, alpha=0.7)
axes[0].plot(history['w'][0], history['b'][0], 'ro', markersize=10, label='Start')
axes[0].plot(w_optimal, b_optimal, 'g*', markersize=15, label='Optimal')
axes[0].set_xlabel('Weight (w)', fontsize=11)
axes[0].set_ylabel('Bias (b)', fontsize=11)
axes[0].set_title('Gradient Descent Path on Loss Landscape', fontsize=12)
axes[0].legend()

# MSE over iterations
axes[1].plot(history['mse'], 'b-', linewidth=2)
axes[1].axhline(mse_optimal, color='green', linestyle='--', label='Optimal MSE')
axes[1].set_xlabel('Iteration', fontsize=11)
axes[1].set_ylabel('MSE', fontsize=11)
axes[1].set_title('Loss Decreasing During Training', fontsize=12)
axes[1].legend()
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

print("Left: The red path shows gradient descent 'walking downhill'")
print("Right: MSE decreases rapidly at first, then plateaus")

## Part 6: Applying to the Forgery Problem

*"The Tribunal saw that I could predict prices from danger ratings. 'But what of manuscripts?' they asked. 'Can your method detect forgeries?' I told them: first we must find features that correlate with authenticity."*  
— Mink Pavar

Let's apply linear regression to the manuscript data. One hypothesis: stylometric variance (inconsistency in writing style) might predict whether a manuscript is a forgery.

In [None]:
# Examine the manuscript features
print("Manuscript Features for Forgery Detection:")
print(manuscripts[['manuscript_id', 'attributed_author', 'is_forgery', 
                   'stylometric_variance', 'era_marker_score']].head(10))

# Compare authentic vs forgery
authentic = manuscripts[manuscripts['is_forgery'] == False]
forgeries = manuscripts[manuscripts['is_forgery'] == True]

print(f"\nAuthentic manuscripts: {len(authentic)}")
print(f"Forgeries: {len(forgeries)}")
print(f"\nMean stylometric variance:")
print(f"  Authentic: {authentic['stylometric_variance'].mean():.4f}")
print(f"  Forgeries: {forgeries['stylometric_variance'].mean():.4f}")
print(f"\nMean era marker score:")
print(f"  Authentic: {authentic['era_marker_score'].mean():.4f}")
print(f"  Forgeries: {forgeries['era_marker_score'].mean():.4f}")

In [None]:
# Visualize the relationship
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Stylometric variance vs forgery
axes[0].scatter(authentic['stylometric_variance'], [0]*len(authentic), 
                alpha=0.5, label='Authentic', c='green', s=50)
axes[0].scatter(forgeries['stylometric_variance'], [1]*len(forgeries), 
                alpha=0.5, label='Forgery', c='red', s=50)
axes[0].set_xlabel('Stylometric Variance', fontsize=11)
axes[0].set_ylabel('Is Forgery (0=No, 1=Yes)', fontsize=11)
axes[0].set_title('Stylometric Variance vs. Forgery Status', fontsize=12)
axes[0].legend()

# Era marker score vs forgery
axes[1].scatter(authentic['era_marker_score'], [0]*len(authentic), 
                alpha=0.5, label='Authentic', c='green', s=50)
axes[1].scatter(forgeries['era_marker_score'], [1]*len(forgeries), 
                alpha=0.5, label='Forgery', c='red', s=50)
axes[1].set_xlabel('Era Marker Score', fontsize=11)
axes[1].set_ylabel('Is Forgery (0=No, 1=Yes)', fontsize=11)
axes[1].set_title('Era Marker Score vs. Forgery Status', fontsize=12)
axes[1].legend()

plt.tight_layout()
plt.show()

print("Observation: Forgeries tend to have higher stylometric variance")
print("and higher era marker scores (anachronistic terms).")
print("\nMink Pavar: 'The forger's hand betrays inconsistency.'")

In [None]:
# Fit linear regression to predict forgery probability from stylometric variance
X_ms = manuscripts['stylometric_variance'].values
y_ms = manuscripts['is_forgery'].astype(int).values

w_ms, b_ms = fit_linear_regression(X_ms, y_ms)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_ms, y_ms + np.random.normal(0, 0.02, len(y_ms)), 
            alpha=0.5, c=['green' if label == 0 else 'red' for label in y_ms], s=50)

x_line = np.linspace(0, X_ms.max(), 100)
y_line = w_ms * x_line + b_ms
plt.plot(x_line, y_line, 'b-', linewidth=2, label=f'Prediction = {w_ms:.2f}×variance + {b_ms:.2f}')
plt.axhline(0.5, color='orange', linestyle='--', label='Decision boundary (0.5)')

plt.xlabel('Stylometric Variance', fontsize=12)
plt.ylabel('Forgery Probability', fontsize=12)
plt.title('Linear Regression for Forgery Detection\n(Note: This is a preview—logistic regression works better!)', fontsize=13)
plt.legend()
plt.ylim(-0.2, 1.2)
plt.show()

print(f"Model: forgery_probability = {w_ms:.4f} × stylometric_variance + {b_ms:.4f}")
print(f"\nFor a manuscript with variance = 0.1: probability = {w_ms*0.1 + b_ms:.2f}")
print(f"For a manuscript with variance = 0.3: probability = {w_ms*0.3 + b_ms:.2f}")
print("\nNote: Linear regression can give predictions outside [0,1], so they are not valid probabilities!")
print("We'll fix this with logistic regression in a later course.")
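
The out-of-range problem is easy to reproduce on a small example. The feature values and 0/1 labels below are hypothetical (they are not the manuscript data), and the fit uses the same closed-form formulas as `fit_linear_regression`.

```python
import numpy as np

# Hypothetical feature values with 0/1 labels (not the manuscript data)
x_bin = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40])
y_bin = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Closed-form least-squares line through the 0/1 labels
x_bar, y_bar = x_bin.mean(), y_bin.mean()
w_bin = np.sum((x_bin - x_bar) * (y_bin - y_bar)) / np.sum((x_bin - x_bar) ** 2)
b_bin = y_bar - w_bin * x_bar

pred_bin = w_bin * x_bin + b_bin
print("Predictions:", np.round(pred_bin, 3))
print(f"Smallest prediction: {pred_bin.min():.3f}")
print(f"Largest prediction:  {pred_bin.max():.3f}")
```

Even with perfectly separated labels, the straight line dips below 0 at the low end and rises above 1 at the high end, so its output cannot be read as a probability without further work.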

## Part 7: Expedition Data — Multiple Features Preview

*"One feature is a start. But the world is multidimensional. True prediction requires many features working together."*  
— Mink Pavar

Let's start with a single-feature regression on the expedition data; Exercise 1 below extends the idea to multiple features:

In [None]:
# Simple regression: days in field vs catch value
X_exp = expeditions['days_in_field'].values
y_exp = expeditions['catch_value'].values

# Fit regression
w_exp, b_exp = fit_linear_regression(X_exp, y_exp)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_exp, y_exp, alpha=0.3, s=20, c='steelblue')

x_line = np.linspace(X_exp.min(), X_exp.max(), 100)
y_line = w_exp * x_line + b_exp
plt.plot(x_line, y_line, 'r-', linewidth=2, 
         label=f'catch = {w_exp:.1f}×days + {b_exp:.1f}')

plt.xlabel('Days in Field', fontsize=12)
plt.ylabel('Catch Value', fontsize=12)
plt.title('Expedition Duration vs. Catch Value', fontsize=13)
plt.legend()
plt.show()

print(f"Model: catch_value = {w_exp:.2f} × days + {b_exp:.2f}")
print(f"\nInterpretation: Each additional day adds ~{w_exp:.1f} to catch value")

# Calculate R-squared
y_pred_exp = w_exp * X_exp + b_exp
ss_res = np.sum((y_exp - y_pred_exp)**2)
ss_tot = np.sum((y_exp - y_exp.mean())**2)
r_squared = 1 - ss_res / ss_tot
print(f"R² = {r_squared:.3f} ({r_squared*100:.1f}% of variance explained)")
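
Since R² is reused in the exercises, it can help to wrap the calculation in a small function and check it on cases where the answer is known. The function name `r_squared_score` and the demo values are hypothetical, and the formula is the same $1 - SS_{res}/SS_{tot}$ used above.

```python
import numpy as np

def r_squared_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot for arbitrary predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_demo = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared_score(y_demo, y_demo)                        # exact predictions
r2_mean = r_squared_score(y_demo, np.full(4, y_demo.mean()))        # always predict the mean
r2_reversed = r_squared_score(y_demo, y_demo[::-1])                 # predictions in reverse order

print(f"Perfect predictions:  R² = {r2_perfect:.2f}")
print(f"Mean-only prediction: R² = {r2_mean:.2f}")
print(f"Reversed predictions: R² = {r2_reversed:.2f}")
```

A least-squares line with an intercept, scored on its own training data, always lands between 0 and 1. Predictions that do worse than the mean, as in the reversed case, produce a negative value.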

---

## Exercises

### Exercise 1: Multi-feature Pricing

The Miasto Guild suspects that **rarity** also affects price, not just danger. Fit a model using both features:

$$\text{price} = w_1 \cdot \text{danger} + w_2 \cdot \text{rarity} + b$$

*Hint: Use `np.linalg.lstsq` for the matrix solution.*
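
Before trying the hint on the Guild's data, it may help to see `np.linalg.lstsq` recover known weights from synthetic data. Everything in this sketch is hypothetical: the feature ranges, the "true" weights, and the variable names are chosen only for illustration.

```python
import numpy as np

# Hypothetical synthetic features and known "true" parameters
rng = np.random.default_rng(2)
n_samples = 200
danger_syn = rng.uniform(1, 10, size=n_samples)
rarity_syn = rng.uniform(1, 5, size=n_samples)
true_w1, true_w2, true_b = 40.0, 25.0, 10.0

# Noise-free prices, so the fit should recover the parameters exactly
price_syn = true_w1 * danger_syn + true_w2 * rarity_syn + true_b

# Design matrix: one column per feature plus a column of ones for the bias
A = np.column_stack([danger_syn, rarity_syn, np.ones(n_samples)])

# lstsq returns (solution, sum of squared residuals, rank, singular values)
coef, sse, rank, sing_vals = np.linalg.lstsq(A, price_syn, rcond=None)

print(f"Recovered danger weight: {coef[0]:.2f}")
print(f"Recovered rarity weight: {coef[1]:.2f}")
print(f"Recovered bias:          {coef[2]:.2f}")
```

The column of ones is what lets the solver learn an intercept. Without it, the fitted plane is forced through the origin.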

In [None]:
# Exercise 1: Multi-feature model
# Build feature matrix with danger, rarity, and bias term

X_multi = np.column_stack([
    creature_stats['danger'].values,
    creature_stats['rarity'].values,
    np.ones(len(creature_stats))  # bias term
])
y_multi = creature_stats['avg_price'].values

# Solve using least squares
# YOUR CODE HERE
# weights, sse, rank, sing_vals = np.linalg.lstsq(X_multi, y_multi, rcond=None)
# print(f"Danger weight: {weights[0]:.2f}")
# print(f"Rarity weight: {weights[1]:.2f}")
# print(f"Bias: {weights[2]:.2f}")

### Exercise 2: Learning Rate Experiments

The Tribunal asks: "What happens if we step too fast or too slow?"

Try gradient descent with:
- `learning_rate = 0.1` (too fast?)
- `learning_rate = 0.0001` (too slow?)

What happens in each case?
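
One way to reason about "too fast" before experimenting: for a quadratic loss such as MSE, fixed-step gradient descent is stable only when the learning rate is below $2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the loss's Hessian. The sketch below checks this on hypothetical toy data. The helper `run_gd` and the data are illustrative assumptions, not the notebook's `gradient_descent` or the creature data.

```python
import numpy as np

def run_gd(x, y, learning_rate, n_iterations):
    """Plain gradient descent on MSE from w=b=0; returns the final MSE."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iterations):
        error = y - (w * x + b)
        dw = -2 / n * np.sum(x * error)
        db = -2 / n * np.sum(error)
        w -= learning_rate * dw
        b -= learning_rate * db
    return np.mean((y - (w * x + b)) ** 2)

# Hypothetical toy data on an exact line
x_lr = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y_lr = 40.0 * x_lr + 15.0
n_lr = len(x_lr)

# Hessian of MSE with respect to (w, b); it does not depend on w or b
hessian = (2 / n_lr) * np.array([[np.sum(x_lr ** 2), np.sum(x_lr)],
                                 [np.sum(x_lr), n_lr]])
lr_limit = 2 / np.linalg.eigvalsh(hessian).max()

mse_start = np.mean(y_lr ** 2)  # MSE at the starting point w=b=0
mse_stable = run_gd(x_lr, y_lr, 0.5 * lr_limit, 200)
mse_unstable = run_gd(x_lr, y_lr, 1.5 * lr_limit, 200)

print(f"Stability limit for the learning rate: {lr_limit:.4f}")
print(f"MSE at start:              {mse_start:.3e}")
print(f"MSE with 0.5 x limit:      {mse_stable:.3e}")
print(f"MSE with 1.5 x limit:      {mse_unstable:.3e}")
```

The limit depends on the scale of the inputs, so the threshold for the creature data will differ from the one printed here.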

In [None]:
# Exercise 2: Learning rate experiments
# Try different learning rates and observe the behavior

# YOUR CODE HERE
# w_fast, b_fast, history_fast = gradient_descent(X, y, learning_rate=0.1, n_iterations=100)
# w_slow, b_slow, history_slow = gradient_descent(X, y, learning_rate=0.0001, n_iterations=500)

# Plot comparison

### Exercise 3: Outlier Analysis

The Golden Amalgam Snail fetches extreme prices. Mink Pavar warns: "Outliers can distort our models."

Remove the Golden Amalgam Snail and refit. How do the parameters change?
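
To get a feel for how strongly a single point can pull a least-squares line, consider this toy example. The data and the helper name `fit_line` are hypothetical. The helper applies the same closed-form formulas as `fit_linear_regression`.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares slope and intercept."""
    x_mean, y_mean = x.mean(), y.mean()
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    return w, y_mean - w * x_mean

# Hypothetical toy data: perfectly linear, price = 10 * danger
x_out = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean_demo = 10.0 * x_out

# Same data, but the lowest-danger point fetches an extreme price
y_outlier_demo = y_clean_demo.copy()
y_outlier_demo[0] = 500.0

w_clean_demo, b_clean_demo = fit_line(x_out, y_clean_demo)
w_outlier_demo, b_outlier_demo = fit_line(x_out, y_outlier_demo)

print(f"Without outlier: w={w_clean_demo:.1f}, b={b_clean_demo:.1f}")
print(f"With outlier:    w={w_outlier_demo:.1f}, b={b_outlier_demo:.1f}")
```

Because errors are squared, one extreme point out of five is enough to flip the sign of the slope in this toy case.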

In [None]:
# Exercise 3: Outlier analysis
# Remove the outlier and compare

# YOUR CODE HERE
# no_outlier = creature_stats[~creature_stats['name'].str.contains('Golden Amalgam')]
# X_clean = no_outlier['danger'].values
# y_clean = no_outlier['avg_price'].values
# w_clean, b_clean = fit_linear_regression(X_clean, y_clean)
# print(f"Original: w={w_optimal:.2f}, b={b_optimal:.2f}")
# print(f"Without outlier: w={w_clean:.2f}, b={b_clean:.2f}")

### Exercise 4: Forgery Features

Mink Pavar wants to know: which manuscript feature best predicts forgery?

Compare R² values for:
- `stylometric_variance`
- `era_marker_score`
- `vocabulary_richness`

In [None]:
# Exercise 4: Compare forgery predictors
# Calculate R² for each feature

features_to_test = ['stylometric_variance', 'era_marker_score', 'vocabulary_richness']
y_forgery = manuscripts['is_forgery'].astype(int).values

# YOUR CODE HERE
# for feature in features_to_test:
#     X_feat = manuscripts[feature].values
#     w_feat, b_feat = fit_linear_regression(X_feat, y_forgery)
#     y_pred = w_feat * X_feat + b_feat
#     ss_res = np.sum((y_forgery - y_pred)**2)
#     ss_tot = np.sum((y_forgery - y_forgery.mean())**2)
#     r2 = 1 - ss_res / ss_tot
#     print(f"{feature}: R² = {r2:.4f}")

### Exercise 5: The Forgery Trial Challenge

The Tribunal gives you 5 manuscripts. Based on `stylometric_variance` alone, predict which are forgeries (probability > 0.5).

In [None]:
# Exercise 5: Predict on new manuscripts
# Use the model from Part 6

test_variances = [0.05, 0.12, 0.25, 0.08, 0.35]
test_ids = ['MS-TEST-1', 'MS-TEST-2', 'MS-TEST-3', 'MS-TEST-4', 'MS-TEST-5']

print("Forgery Predictions:")
print("=" * 50)
# YOUR CODE HERE
# for ms_id, variance in zip(test_ids, test_variances):
#     prob = w_ms * variance + b_ms
#     prediction = "FORGERY" if prob > 0.5 else "Authentic"
#     print(f"{ms_id}: variance={variance:.2f} -> prob={prob:.3f} -> {prediction}")

---

## Summary

| Concept | Key Insight | Forgery Trial Application |
|---------|-------------|---------------------------|
| **Loss Function** | MSE measures average squared error | How far are predictions from truth? |
| **Why Squared?** | Differentiable, penalizes outliers, no cancellation | Ensures all errors count positively |
| **Closed-Form Solution** | Set derivative = 0, solve directly | Works for simple linear regression |
| **Gradient Descent** | Iteratively follow the downhill slope | General method for any differentiable model |
| **R²** | Proportion of variance explained | How good is our forgery predictor? |

---

## Key Takeaways

1. **"Best fit" means minimize squared errors** — we derived this from first principles, just as Mink Pavar did for the Tribunal

2. **The loss landscape is smooth and convex**: for MSE with a linear model the surface is a single bowl, so the gradient always tells us which way is downhill, and downhill leads to the one global minimum

3. **Two ways to find the minimum**: closed-form (when available) or gradient descent (general purpose)

4. **Linear regression is a building block** — it's the foundation for more complex models

5. **The Forgery Trial continues** — we can detect forgeries, but our method has limitations (probabilities outside [0,1])

---

## Next Lesson

In **Lesson 2: The Bias-Variance Trade-off**, we'll discover one of the most important theoretical concepts in machine learning.

*"The Tribunal asked me a dangerous question: 'If we give you more data, more features, more complexity—will your predictions improve?' I told them the truth: not necessarily. There is a hidden tension between simplicity and complexity, between fitting the past and predicting the future. This tension is the bias-variance trade-off, and understanding it is the key to all of machine learning."*  
— Mink Pavar