# Part 1.3: Probability & Statistics for Deep Learning — The Formula 1 Edition

Probability and statistics are essential for understanding:
- How models make predictions (probabilistic outputs)
- How we train models (maximum likelihood)
- How we measure uncertainty and information

**The F1 Connection**: Formula 1 is a sport drowning in probability. Will it rain at Spa? What's the chance of a safety car at Monaco? How do lap times distribute around the mean? Every race strategy decision — when to pit, which tire compound to choose, whether to risk a one-stop — is a bet against a probability distribution. The teams that model these distributions best win championships.

## Learning Objectives
- [ ] Work with common probability distributions
- [ ] Apply Bayes' theorem to update beliefs
- [ ] Derive MLE estimators for simple distributions
- [ ] Calculate entropy and KL divergence

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import comb

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Probability Basics

### Random Variables

A **random variable** is a variable whose value is determined by a random process.

- **Discrete**: Takes on countable values (e.g., coin flips, dice rolls)
- **Continuous**: Takes on any value in a range (e.g., height, temperature)

**F1 analogy**: A driver's finishing position is a discrete random variable (1st, 2nd, ..., DNF). Their lap time is a continuous random variable — it can be 1:31.204 or 1:31.205 or anything in between.

### Probability Distributions

A **probability distribution** describes the likelihood of each possible outcome.

- **PMF** (Probability Mass Function): For discrete variables, $P(X = x)$
- **PDF** (Probability Density Function): For continuous variables, $f(x)$

**F1 analogy**: The PMF is like a grid of starting positions with the probability of each driver winning from that slot. The PDF is like the smooth curve of lap time variation — you can't ask "what's the probability of exactly 1:31.204?" but you can ask "what's the probability of a lap between 1:31 and 1:32?"

### Deep Dive: What is a Probability Distribution?

A probability distribution answers a fundamental question: **"What outcomes are possible, and how likely is each one?"**

Think of it as a complete recipe for uncertainty:
- It lists every possible outcome
- It assigns a probability (or density) to each outcome
- All probabilities sum to 1, or the density integrates to 1 in the continuous case (something must happen!)

**The Key Insight**: A distribution captures *everything* we know about a random process. Once you have the distribution, you can compute any probability, expectation, or uncertainty measure.

**F1 analogy**: An F1 strategist's entire job is building probability distributions. Before a race, they model: the distribution of possible lap times on each tire compound, the probability of rain in each 10-minute window, the likelihood of a safety car on each lap. The team with the best distributions makes the best pit stop calls.

#### Discrete vs Continuous Distributions

| Aspect | Discrete | Continuous | F1 Example |
|--------|----------|------------|------------|
| **Possible values** | Countable (finite or infinite) | Uncountable (any value in a range) | Finishing position (1st-20th, DNF) vs. lap time (continuous) |
| **Probability function** | PMF: P(X = x) gives exact probability | PDF: f(x) gives density, not probability | P(win from pole) = 0.45 vs. lap time density curve |
| **Finding probabilities** | Sum: P(a ≤ X ≤ b) = Σ P(X = x) | Integrate: P(a ≤ X ≤ b) = ∫f(x)dx | P(podium) = P(1st) + P(2nd) + P(3rd) vs. P(lap < 1:32) |
| **Examples** | Coin flips, dice, word counts | Height, temperature, neural network weights | Points scored, pit stops made vs. fuel load, tire degradation rate |
| **ML applications** | Classification labels, token IDs | Regression targets, latent variables | Predicting race winner vs. predicting lap time |

**Important**: For continuous distributions, P(X = x) = 0 for any specific value! We can only ask about ranges.
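
A minimal sketch of the "Finding probabilities" row of the table. The finishing-position PMF and the lap-time parameters (mean 91.5 s, σ = 0.4 s) are made-up numbers for illustration only. The Gaussian CDF is written with `math.erf` so the block needs only the standard library.

```python
import math

# Hypothetical PMF over finishing outcomes (illustrative numbers that sum to 1)
position_pmf = {1: 0.30, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.10, "DNF": 0.15}

# Discrete: P(podium) is a sum of PMF values
p_podium = position_pmf[1] + position_pmf[2] + position_pmf[3]

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Continuous: P(a <= X <= b) is the integral of the PDF, i.e. a CDF difference
mu, sigma = 91.5, 0.4  # assumed mean lap time and spread, in seconds
p_between = normal_cdf(92.0, mu, sigma) - normal_cdf(91.0, mu, sigma)

# A single exact value is an interval of zero width, so its probability is zero
p_exact = normal_cdf(91.204, mu, sigma) - normal_cdf(91.204, mu, sigma)

print(f"P(podium)              = {p_podium:.2f}")
print(f"P(91.0 <= lap <= 92.0) = {p_between:.4f}")
print(f"P(lap == 91.204)       = {p_exact}")
```

The density `f(x)` itself can exceed 1 for a narrow distribution; only the area under it is a probability.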

---

## 2. Common Distributions

### 2.1 Bernoulli Distribution

Models a single binary outcome (success/failure, yes/no, 1/0).

$$P(X = 1) = p, \quad P(X = 0) = 1 - p$$

**In ML**: Binary classification outputs, dropout masks

**F1 analogy**: Will the car finish the race? Every Grand Prix is a Bernoulli trial for each driver — they either finish (1) or DNF (0). A reliable car might have p = 0.95, while a fragile one has p = 0.70. Dropout in neural networks works the same way: each neuron is an "engine component" that randomly fails (is zeroed out) during training.
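
The dropout comparison can be sketched as a Bernoulli mask applied to a layer's activations. This uses the common "inverted dropout" formulation with an assumed keep probability of 0.8 and placeholder activations; it is an illustration, not the implementation of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

keep_prob = 0.8                     # assumed probability that a unit survives
activations = np.ones((1000, 50))   # placeholder activations (all ones for clarity)

# Each mask entry is an independent Bernoulli(keep_prob) draw
mask = rng.random(activations.shape) < keep_prob

# Inverted dropout: zero the dropped units and rescale survivors by 1/keep_prob
# so the expected activation stays the same
dropped = activations * mask / keep_prob

print(f"Fraction of units kept: {mask.mean():.3f}")
print(f"Mean activation before: {activations.mean():.3f}, after: {dropped.mean():.3f}")
```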

In [None]:
# Bernoulli distribution — Will the car finish the race?
finish_probability = 0.7  # Probability of finishing (no DNF)

# Generate samples: 1000 race starts
race_results = np.random.binomial(1, finish_probability, size=1000)

print(f"Bernoulli(p={finish_probability}) — Car Finish Probability")
print(f"Mean (theoretical): {finish_probability}")
print(f"Mean (empirical): {race_results.mean():.3f}")
print(f"Variance (theoretical): {finish_probability * (1-finish_probability):.3f}")
print(f"Variance (empirical): {race_results.var():.3f}")

# Visualize
plt.figure(figsize=(8, 4))
plt.bar([0, 1], [1-finish_probability, finish_probability], width=0.4, alpha=0.7)
plt.xticks([0, 1], ['DNF (0)', 'Finish (1)'])
plt.ylabel('Probability')
plt.title(f'Bernoulli Distribution: Will the Car Finish? (p={finish_probability})')
plt.show()

### 2.2 Binomial Distribution

Number of successes in $n$ independent Bernoulli trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

**In ML**: Counting successes in multiple trials

**F1 analogy**: If a driver enters 20 races in a season and has a 30% chance of finishing on the podium at each race, the binomial distribution tells us the probability of getting exactly k podiums across the season. "How many podium finishes will this driver collect over a 20-race calendar?"
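
To see the formula at work, here is a small sketch that evaluates the binomial PMF directly from its definition, using the same 20 races and 30% podium chance as above. The helper name `binomial_pmf` is introduced here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), straight from the formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_races, podium_prob = 20, 0.3

pmf = [binomial_pmf(k, n_races, podium_prob) for k in range(n_races + 1)]

p_exactly_6 = pmf[6]                                  # one term of the PMF
p_at_least_10 = sum(pmf[10:])                         # tail probability is a sum of terms
expected = sum(k * pk for k, pk in enumerate(pmf))    # E[X] = sum of k * P(X = k)

print(f"P(exactly 6 podiums)   = {p_exactly_6:.4f}")
print(f"P(at least 10 podiums) = {p_at_least_10:.4f}")
print(f"Sum of PMF = {sum(pmf):.6f}, E[X] = {expected:.2f}")
```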

In [None]:
# Binomial distribution — Podium finishes in a season
n_races = 20  # Races in the season
podium_prob = 0.3  # Probability of podium at each race

# PMF
k = np.arange(0, n_races+1)
pmf = stats.binom.pmf(k, n_races, podium_prob)

plt.figure(figsize=(10, 4))
plt.bar(k, pmf, alpha=0.7)
plt.xlabel('Number of Podium Finishes (k)')
plt.ylabel('P(X = k)')
plt.title(f'Binomial Distribution: Podiums in a {n_races}-Race Season (p={podium_prob})')
plt.axvline(x=n_races*podium_prob, color='red', linestyle='--', label=f'Expected podiums = np = {n_races*podium_prob}')
plt.legend()
plt.show()

print(f"Expected podiums: E[X] = np = {n_races*podium_prob}")
print(f"Variance: Var[X] = np(1-p) = {n_races*podium_prob*(1-podium_prob):.2f}")

### 2.3 Categorical Distribution

Generalization of Bernoulli to $K$ categories.

$$P(X = k) = p_k, \quad \sum_{k=1}^K p_k = 1$$

**In ML**: Multi-class classification (softmax output)

**F1 analogy**: Predicting the race winner is a categorical distribution across all 20 drivers. The favorites might have P(Verstappen wins) = 0.40, P(Hamilton wins) = 0.25, and the remaining probability spread across the field. A softmax output in a neural network works exactly the same way — probabilities across categories that must sum to 1.
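
As a rough sketch of how a softmax layer produces such a distribution: raw scores (logits) are exponentiated and normalized so they are positive and sum to 1. The driver names and logit values below are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary real-valued scores into a categorical distribution."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores for four drivers (values are made up)
drivers = ["Driver A", "Driver B", "Driver C", "Driver D"]
logits = np.array([2.0, 1.5, 0.3, -0.5])

win_probs = softmax(logits)
for name, p in zip(drivers, win_probs):
    print(f"P({name} wins) = {p:.3f}")
print(f"Sum = {win_probs.sum():.6f}")

# Picking a winner is a single draw from the categorical distribution
rng = np.random.default_rng(0)
winner = rng.choice(drivers, p=win_probs)
print(f"Simulated winner: {winner}")
```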

In [None]:
# Categorical distribution — Predicting the race winner
teams = ['Red Bull', 'Mercedes', 'Ferrari', 'McLaren']
win_probabilities = [0.4, 0.35, 0.15, 0.1]

# Generate samples: simulate 1000 race outcomes
race_outcomes = np.random.choice(len(teams), size=1000, p=win_probabilities)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.bar(teams, win_probabilities, alpha=0.7, color='steelblue')
plt.ylabel('Win Probability')
plt.title('Pre-Race Win Probabilities (True)')

plt.subplot(1, 2, 2)
empirical_wins = [np.mean(race_outcomes == i) for i in range(len(teams))]
plt.bar(teams, empirical_wins, alpha=0.7, color='coral')
plt.ylabel('Win Frequency')
plt.title('Simulated Race Wins (1000 races)')

plt.tight_layout()
plt.show()

### 2.4 Gaussian (Normal) Distribution

The most important continuous distribution.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**In ML**: 
- Weight initialization
- Noise in VAEs
- Regression targets
- Batch normalization

**F1 analogy**: Lap times follow an approximately normal distribution. A driver's laps cluster around their mean pace (mu), with some natural variation (sigma). A consistent driver has small sigma (tight lap time window), while an erratic driver has large sigma. The same math that describes lap time scatter also describes how neural network weights are initialized — small random values drawn from a Gaussian.

In [None]:
# Gaussian distribution — Lap time variation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Different means = different drivers' average pace
lap_time = np.linspace(85, 100, 200)  # Lap times in seconds
for mean_pace in [89, 91, 93, 95]:
    axes[0].plot(lap_time, stats.norm.pdf(lap_time, mean_pace, 1), label=f'Mean pace={mean_pace}s, σ=1s')
axes[0].set_xlabel('Lap Time (seconds)')
axes[0].set_ylabel('Density')
axes[0].set_title('Effect of Mean Pace (μ) — Different Drivers')
axes[0].legend()

# Different standard deviations = different consistency levels
lap_time = np.linspace(84, 100, 200)
for consistency in [0.5, 1, 2, 3]:
    axes[1].plot(lap_time, stats.norm.pdf(lap_time, 92, consistency), label=f'Mean=92s, σ={consistency}s')
axes[1].set_xlabel('Lap Time (seconds)')
axes[1].set_ylabel('Density')
axes[1].set_title('Effect of Consistency (σ) — Same Driver, Different Conditions')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# The 68-95-99.7 rule — Lap time consistency bands
mu, sigma = 0, 1
x = np.linspace(-4, 4, 200)
y = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2)

# Fill regions
plt.fill_between(x, y, where=(x >= -3) & (x <= 3), alpha=0.2, color='blue', label='99.7% of laps (±3σ)')
plt.fill_between(x, y, where=(x >= -2) & (x <= 2), alpha=0.3, color='blue', label='95% of laps (±2σ)')
plt.fill_between(x, y, where=(x >= -1) & (x <= 1), alpha=0.4, color='blue', label='68% of laps (±1σ)')

plt.xlabel('Deviation from Mean Lap Time (in standard deviations)')
plt.ylabel('Density')
plt.title('Lap Time Variation — The 68-95-99.7 Rule\n"68% of laps fall within ±1σ of the driver\'s average pace"')
plt.legend()
plt.show()

# Verify with scipy
print("Probability within:")
print(f"  ±1σ: {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f} (68.27%)")
print(f"  ±2σ: {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f} (95.45%)")
print(f"  ±3σ: {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f} (99.73%)")
print("\nF1 insight: A lap outside ±3σ is almost certainly due to")
print("traffic, an incident, or a mistake — not random variation.")

### 2.5 Multivariate Gaussian

Extension to multiple dimensions:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Where:
- $\boldsymbol{\mu}$: Mean vector
- $\Sigma$: Covariance matrix

**F1 analogy**: A single lap time is univariate Gaussian, but a car's full telemetry (speed, tire temperature, fuel load, lap time) can be modeled jointly as a multivariate Gaussian. The covariance matrix captures how these variables relate: when tire temperature climbs, grip and therefore cornering speed go down (negative correlation). When fuel load drops, lap time drops with it (positive correlation). Understanding these joint distributions is how teams optimize strategy across multiple interacting variables simultaneously.
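
The density formula can be evaluated term by term with NumPy. This is a sketch under assumed values: two standardized telemetry variables with correlation 0.8. The helper name `mvn_pdf` is introduced here for illustration.

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate Gaussian density, following the formula term by term."""
    d = mean.shape[0]
    diff = x - mean
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu); solve() avoids an explicit inverse
    quad = diff @ np.linalg.solve(cov, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm_const

# Assumed, standardized telemetry pair with correlation 0.8
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

# Two points at the same Euclidean distance from the mean
along = np.array([1.0, 1.0])      # agrees with the positive correlation
against = np.array([1.0, -1.0])   # contradicts it

print(f"density at {along}:  {mvn_pdf(along, mean, cov):.4f}")
print(f"density at {against}: {mvn_pdf(against, mean, cov):.4f}")
```

The point that agrees with the correlation gets a much higher density, which is the information the off-diagonal entries of $\Sigma$ carry.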

### Choosing the Right Distribution: A Decision Guide

| Distribution | Use When | Parameters | Example in ML | F1 Parallel |
|--------------|----------|------------|---------------|-------------|
| **Bernoulli** | Single yes/no outcome | p (success probability) | Binary classification output, dropout mask | Will the car finish the race? (finish/DNF) |
| **Binomial** | Count of successes in n trials | n (trials), p (success prob) | Number of correct predictions in batch | How many podiums in a 20-race season? |
| **Categorical** | Single choice from K options | p₁, p₂, ..., pₖ (probabilities) | Softmax output, token prediction | Which of the 20 drivers wins this race? |
| **Multinomial** | Counts across K categories | n (trials), p₁...pₖ | Word counts in document (bag of words) | Finishing position counts across a season |
| **Gaussian** | Continuous value, symmetric uncertainty | μ (mean), σ (std dev) | Regression targets, weight initialization | Lap time variation around mean pace |
| **Multivariate Gaussian** | Multiple correlated continuous values | μ (mean vector), Σ (covariance) | VAE latent space, GP predictions | Joint distribution of speed, tire temp, fuel load |

**The Pattern**: 
- Bernoulli/Binomial are for binary outcomes (yes/no)
- Categorical/Multinomial are for multi-class outcomes  
- Gaussian is for continuous outcomes with symmetric uncertainty

**Key ML Connection**: The distribution you choose for your model's output determines your loss function:
- Categorical output → Cross-entropy loss
- Gaussian output → MSE loss (equivalent to assuming Gaussian noise)
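
A small numerical check of the Gaussian case, on made-up data with an assumed fixed noise level σ = 1. The average Gaussian negative log-likelihood equals MSE / (2σ²) plus a constant, so minimizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: targets and two candidate sets of predictions
y_true = rng.normal(92.0, 1.0, size=50)
pred_a = y_true + rng.normal(0.0, 0.3, size=50)   # close predictions
pred_b = y_true + rng.normal(0.0, 1.0, size=50)   # worse predictions

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def gaussian_nll(y, y_hat, sigma=1.0):
    """Average negative log-likelihood of y under N(y_hat, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat) ** 2 / (2 * sigma**2))

sigma = 1.0                                   # assumed fixed noise level
const = 0.5 * np.log(2 * np.pi * sigma**2)    # term that does not depend on the predictions

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"Model {name}: MSE = {mse(y_true, pred):.4f}, "
          f"NLL = {gaussian_nll(y_true, pred, sigma):.4f}, "
          f"MSE/(2σ²) + const = {mse(y_true, pred) / (2 * sigma**2) + const:.4f}")
```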

In [None]:
# 2D Gaussian — Joint distributions of car telemetry variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Generate grid for contour plots
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Different covariance matrices representing different telemetry relationships
covariances = [
    (np.array([[1, 0], [0, 1]]), 'Independent Variables\n(Speed vs. Fuel Load)'),
    (np.array([[2, 0], [0, 0.5]]), 'Different Variances\n(Speed varies more than Tire Temp)'),
    (np.array([[1, 0.8], [0.8, 1]]), 'Correlated Variables\n(Tire Temp vs. Degradation Rate)')
]

mean = np.array([0, 0])

for ax, (cov, title) in zip(axes, covariances):
    rv = stats.multivariate_normal(mean, cov)
    Z = rv.pdf(pos)
    
    ax.contour(X, Y, Z, levels=10, cmap='viridis')
    
    # Draw samples
    samples = rv.rvs(size=200)
    ax.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=10, color='red')
    
    ax.set_xlabel('Telemetry Variable 1')
    ax.set_ylabel('Telemetry Variable 2')
    ax.set_title(f'{title}\nΣ = {cov.tolist()}')
    ax.set_aspect('equal')
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)

plt.tight_layout()
plt.show()

---

## 3. Expected Value and Variance

### Expected Value (Mean)

The "average" outcome weighted by probability:

- Discrete: $E[X] = \sum_x x \cdot P(X = x)$
- Continuous: $E[X] = \int x \cdot f(x) dx$

### Variance

Measures spread around the mean:

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

**F1 analogy**: Expected value is the average championship points a driver earns from a given starting position. Starting from pole, E[points] might be 20 (weighted by probability of each finishing position). Variance measures how much the actual result varies — a driver who always finishes where they qualify has low variance, while one who either wins or DNFs has high variance. Teams use expected points calculations to evaluate strategy decisions: "Does this pit stop gamble increase our expected points?"

### Deep Dive: Understanding Each Term in Bayes' Theorem

$$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{data})}$$

Let's break down what each term really means:

| Term | Name | Meaning | Medical Example | F1 Example |
|------|------|---------|-----------------|------------|
| **P(H)** | Prior | Your belief *before* seeing any evidence | 1% of population has disease | 30% chance of rain before the race |
| **P(D\|H)** | Likelihood | How probable is this evidence *if* hypothesis is true? | 95% chance of positive test *if* you have disease | If it rains, 80% chance the track is wet by lap 10 |
| **P(D)** | Evidence (Marginal) | Total probability of seeing this evidence | Overall rate of positive tests | Overall probability of a wet track by lap 10 |
| **P(H\|D)** | Posterior | Updated belief *after* seeing evidence | Probability you have disease *given* positive test | P(rain) *given* the track is wet at lap 10 |

**The Core Insight**: Bayes' theorem is a *belief update* mechanism:
```
New Belief = (How well evidence supports hypothesis) × (Old Belief) / (How common is this evidence)
```

**F1 analogy**: Every lap, the strategy team is running Bayes' theorem in their heads. Before the race: P(rain) = 30%. They see dark clouds forming: that's new evidence. P(dark clouds | rain) is high, P(dark clouds | no rain) is low. Their posterior P(rain | dark clouds) shoots up. Now they're preparing wet tires. This is exactly how Bayesian neural networks update their weight distributions given new training data.

**Why the denominator matters**: P(D) normalizes everything. If positive tests are common (many false positives), a positive test is less informative.
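
The dark-clouds update can be worked through numerically. The 30% prior comes from the analogy above; the two likelihoods (0.90 for "high", 0.20 for "low") are assumed values chosen only to illustrate the calculation.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' theorem for a binary hypothesis H given evidence D."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)  # P(D)
    return likelihood_if_true * prior / evidence                               # P(H|D)

p_rain = 0.30                # prior from the analogy above
p_clouds_given_rain = 0.90   # assumed value for "high"
p_clouds_given_dry = 0.20    # assumed value for "low"

p_rain_given_clouds = posterior(p_rain, p_clouds_given_rain, p_clouds_given_dry)
print(f"P(rain | dark clouds) = {p_rain_given_clouds:.3f}")

# If dark clouds were just as common on dry days, seeing them would change nothing
uninformative = posterior(p_rain, 0.90, 0.90)
print(f"P(rain | uninformative evidence) = {uninformative:.3f}")
```

The second call shows the role of the denominator: evidence that is equally likely under both hypotheses leaves the posterior equal to the prior.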

In [None]:
# Visual: How Bayes' Theorem Updates Beliefs
# F1 scenario: Predicting rain during a race based on weather radar data

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Setup — Using the medical testing example (the math is identical)
P_disease = 0.01
P_positive_given_disease = 0.95      # True positive rate
P_positive_given_no_disease = 0.05   # False positive rate

# Imagine 10,000 people
n_people = 10000
n_sick = round(n_people * P_disease)
n_healthy = n_people - n_sick

# Among sick people (round, not int: int() truncates, so a float like 94.999... would become 94)
sick_test_positive = round(n_sick * P_positive_given_disease)
sick_test_negative = n_sick - sick_test_positive

# Among healthy people
healthy_test_positive = round(n_healthy * P_positive_given_no_disease)
healthy_test_negative = n_healthy - healthy_test_positive

# Plot 1: Prior - Population breakdown
ax = axes[0, 0]
ax.bar(['Sick', 'Healthy'], [n_sick, n_healthy], color=['red', 'green'], alpha=0.7)
ax.set_ylabel('Number of People')
ax.set_title(f'Step 1: PRIOR\n{n_people:,} people: {n_sick} sick (1%), {n_healthy} healthy (99%)')
ax.set_ylim(0, n_people * 1.1)
for i, v in enumerate([n_sick, n_healthy]):
    ax.text(i, v + 200, str(v), ha='center', fontweight='bold')

# Plot 2: Likelihood - Test results by group
ax = axes[0, 1]
x = np.arange(2)
width = 0.35
bars1 = ax.bar(x - width/2, [sick_test_positive, healthy_test_positive], width, 
               label='Test Positive', color='orange', alpha=0.7)
bars2 = ax.bar(x + width/2, [sick_test_negative, healthy_test_negative], width,
               label='Test Negative', color='blue', alpha=0.7)
ax.set_xticks(x)
ax.set_xticklabels([f'Sick ({n_sick})', f'Healthy ({n_healthy})'])
ax.set_ylabel('Number of People')
ax.set_title('Step 2: LIKELIHOOD\nHow the test performs on each group')
ax.legend()

# Plot 3: Evidence - All positive tests
ax = axes[1, 0]
ax.bar(['True Positives\n(Sick + Positive)', 'False Positives\n(Healthy + Positive)'], 
       [sick_test_positive, healthy_test_positive], 
       color=['red', 'green'], alpha=0.7)
total_positive = sick_test_positive + healthy_test_positive
ax.set_ylabel('Number of People')
ax.set_title(f'Step 3: EVIDENCE\nAll positive tests: {total_positive} total\n'
             f'P(positive) = {total_positive/n_people:.2%}')
for i, v in enumerate([sick_test_positive, healthy_test_positive]):
    ax.text(i, v + 10, str(v), ha='center', fontweight='bold')

# Plot 4: Posterior - Among positive tests, who is actually sick?
ax = axes[1, 1]
posterior = sick_test_positive / total_positive
ax.bar(['Actually Sick', 'Actually Healthy'], 
       [sick_test_positive, healthy_test_positive],
       color=['red', 'green'], alpha=0.7)
ax.set_ylabel('Number of People (with positive test)')
ax.set_title(f'Step 4: POSTERIOR\nAmong {total_positive} positive tests:\n'
             f'P(sick|positive) = {sick_test_positive}/{total_positive} = {posterior:.1%}')
for i, v in enumerate([sick_test_positive, healthy_test_positive]):
    pct = v / total_positive * 100
    ax.text(i, v + 10, f'{v} ({pct:.1f}%)', ha='center', fontweight='bold')

plt.tight_layout()
plt.suptitle("Bayes' Theorem: Why a 95% Accurate Test Gives Only 16% Confidence\n"
             '(Same math applies: a 95% accurate rain radar still misleads when rain is rare)', 
             fontsize=14, fontweight='bold', y=1.02)
plt.show()

print("\nThe Counterintuitive Result Explained:")
print("=" * 50)
print(f"Even though the test is 95% accurate:")
print(f"  - Out of {n_sick} sick people: {sick_test_positive} test positive")
print(f"  - Out of {n_healthy} healthy people: {healthy_test_positive} ALSO test positive (false positives)")
print(f"\nTotal positive tests: {total_positive}")
print(f"True positives: {sick_test_positive} ({sick_test_positive/total_positive:.1%})")
print(f"False positives: {healthy_test_positive} ({healthy_test_positive/total_positive:.1%})")
print(f"\nThe false positives OVERWHELM the true positives because")
print(f"healthy people vastly outnumber sick people!")
print(f"\nF1 parallel: If your rain radar is 95% accurate but rain only happens")
print(f"1% of the time, most 'rain detected' alerts are false positives.")

In [None]:
# Computing expected value — Expected championship points from a grid position
# Example: Points distribution from starting P3
finishing_positions = np.array([1, 2, 3, 4, 5, 6])
points_scored = np.array([25, 18, 15, 12, 10, 8])  # F1 points system
position_probabilities = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3])  # Biased toward lower positions

# Expected points
expected_points = np.sum(points_scored * position_probabilities)
print(f"E[Points] = Σ points·P(position) = {expected_points:.2f}")

# Variance in points
variance_points = np.sum((points_scored - expected_points)**2 * position_probabilities)
print(f"Var(Points) = E[(Points - E[Points])²] = {variance_points:.4f}")
print(f"Std(Points) = √Var(Points) = {np.sqrt(variance_points):.4f}")

# Verify with sampling
sampled_points = np.random.choice(points_scored, size=10000, p=position_probabilities)
print(f"\nEmpirical mean points: {sampled_points.mean():.4f}")
print(f"Empirical variance: {sampled_points.var():.4f}")

---

## 4. Bayes' Theorem

Bayes' theorem tells us how to update beliefs given new evidence:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

In ML terms:

$$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{data})}$$

Or:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

**F1 analogy**: Bayes' theorem is the mathematical backbone of real-time race strategy. Before the race, the team has a prior belief about tire degradation (say, 0.1s per lap). As laps unfold and actual lap times come in, they update this belief. If the driver's lap times are rising faster than expected, the posterior shifts toward higher degradation, and the team calls an earlier pit stop. Every lap is new evidence, and the strategy wall is constantly computing posteriors.

### Bayes' Theorem in Machine Learning

Bayesian thinking is fundamental to many ML techniques:

| Application | Prior P(H) | Likelihood P(D\|H) | Posterior P(H\|D) | F1 Parallel |
|-------------|------------|-------------------|-------------------|-------------|
| **Naive Bayes Classifier** | Class frequencies in training data | P(features\|class) assumed independent | P(class\|features) for prediction | Predicting tire compound from telemetry features |
| **Bayesian Neural Networks** | Prior on weights (e.g., Gaussian) | P(data\|weights) from network output | Distribution over weights given data | Uncertainty in lap time predictions |
| **Bayesian Optimization** | GP prior over objective function | Observations so far | Updated belief about function | Finding optimal car setup (test limited configs) |
| **Spam Filtering** | Base rate of spam emails | P(words\|spam) and P(words\|ham) | P(spam\|email content) | Filtering valid telemetry from sensor noise |
| **A/B Testing** | Prior belief about conversion rates | Observed clicks/conversions | Updated belief about which variant wins | Testing two setup configurations mid-weekend |

**The Bayesian vs Frequentist Perspective**:
- **Frequentist**: Parameters are fixed, unknown constants. We estimate them.
- **Bayesian**: Parameters have probability distributions. We update our beliefs.

In deep learning, we're usually frequentist (point estimates via SGD), but Bayesian methods give us uncertainty quantification.

### Deep Dive: The Intuition Behind Maximum Likelihood

**The Core Question**: Given observed data, what parameters would have made this data *most probable*?

Imagine you flip a coin 10 times and get 7 heads. What's the "most likely" value of p (probability of heads)?

**MLE answers**: Find the p that maximizes P(7 heads in 10 flips | p)

The answer is p = 0.7, because:
- If p = 0.5, getting 7 heads is somewhat unlikely
- If p = 0.9, getting only 7 heads (not 9) is unlikely
- p = 0.7 makes our observed data most probable
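
The three cases above can be checked directly. This sketch evaluates the binomial likelihood of 7 heads in 10 flips at each candidate p, then scans a grid of values to find the maximizer. The grid resolution of 0.01 is an arbitrary choice.

```python
from math import comb
import numpy as np

n_flips, n_heads = 10, 7

def likelihood(p):
    """P(7 heads in 10 flips | p) under a binomial model."""
    return comb(n_flips, n_heads) * p**n_heads * (1 - p) ** (n_flips - n_heads)

for p in (0.5, 0.7, 0.9):
    print(f"p = {p}: likelihood = {likelihood(p):.4f}")

# Grid search over candidate values of p
grid = np.linspace(0.01, 0.99, 99)
p_mle = grid[np.argmax(likelihood(grid))]
print(f"Grid-search MLE: p = {p_mle:.2f}")
```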

**F1 analogy**: Imagine you're an engineer trying to estimate the tire degradation rate from lap data. You observe lap times of 92.1, 92.3, 92.5, 92.8, 93.0 seconds over 5 laps. MLE asks: "What degradation rate makes these observed lap times most probable?" If you assume lap times rise linearly with lap number, plus Gaussian noise, MLE finds the slope that best fits the data, which is the same as fitting a least-squares line through your lap time scatter plot.
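
One way to make this concrete is to assume each lap time is a straight line in lap number plus independent Gaussian noise of constant variance. Under that assumption the maximum-likelihood slope coincides with the ordinary least-squares slope, so `np.polyfit` can stand in for an explicit likelihood maximization. The lap numbering (1 to 5) is an assumption; the lap times are the ones quoted above.

```python
import numpy as np

lap_number = np.arange(1, 6)                            # assumed lap indices 1..5
lap_times = np.array([92.1, 92.3, 92.5, 92.8, 93.0])    # seconds, from the example above

# Least-squares line; under the Gaussian-noise assumption this is also the MLE
slope, intercept = np.polyfit(lap_number, lap_times, deg=1)

print(f"Estimated degradation: {slope:.3f} s per lap")
print(f"Extrapolated lap-0 pace: {intercept:.3f} s")
```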

**Why Log-Likelihood?**
1. Products become sums: log(a × b × c) = log(a) + log(b) + log(c)
2. Numerical stability: Avoids underflow when multiplying many small probabilities
3. Same maximum: log is monotonic, so argmax is preserved
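
The numerical-stability point is easy to demonstrate. In this sketch, 2,000 observations each have a made-up probability of 0.5 under the model: the raw product falls below the smallest representable float64, while the sum of logs stays an ordinary number.

```python
import numpy as np

# 2,000 independent observations, each with (made-up) probability 0.5 under the model
probs = np.full(2000, 0.5)

direct_product = np.prod(probs)           # 0.5**2000 is too small for float64 and underflows
log_likelihood = np.sum(np.log(probs))    # the same quantity on the log scale

print(f"Product of probabilities: {direct_product}")
print(f"Sum of log-probabilities: {log_likelihood:.2f}")
```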

**The Profound Connection to Loss Functions**:

For classification with softmax outputs:
$$\text{Minimize Cross-Entropy} = \text{Maximize Log-Likelihood}$$

They're the same optimization! When you train with cross-entropy loss, you're doing MLE.

In [None]:
# Classic example: Safety car prediction
# Safety cars occur on ~1% of race laps (the prior)
# Sensor flags 95% of real incidents, with a 5% false-alarm rate

P_safety_car = 0.01  # Prior: probability of safety car on any given lap
P_sensor_alert_given_incident = 0.95  # Sensitivity (true positive rate)
P_sensor_alert_given_no_incident = 0.05  # False positive rate (1 - specificity)

# P(alert) = P(alert|incident)P(incident) + P(alert|no incident)P(no incident)
P_alert = P_sensor_alert_given_incident * P_safety_car + P_sensor_alert_given_no_incident * (1 - P_safety_car)

# Bayes' theorem: P(incident|alert)
P_incident_given_alert = (P_sensor_alert_given_incident * P_safety_car) / P_alert

print("F1 Safety Car Prediction (Same Math as Medical Testing)")
print("=" * 55)
print(f"Prior P(safety car this lap) = {P_safety_car:.2%}")
print(f"Sensor sensitivity = {P_sensor_alert_given_incident:.2%}")
print(f"Sensor specificity = {1 - P_sensor_alert_given_no_incident:.2%}")
print()
print(f"P(sensor alert) = {P_alert:.4f}")
print(f"P(actual incident | sensor alert) = {P_incident_given_alert:.2%}")
print()
print("Surprising! Even with an alert from a 95% accurate sensor,")
print(f"there's only a {P_incident_given_alert:.1%} chance of an actual safety car!")
print("This is because incidents are rare on any given lap (low prior).")

In [None]:
# Demonstrating: Cross-Entropy Loss = Negative Log-Likelihood
# This shows they're mathematically equivalent!

print("Cross-Entropy Loss vs Negative Log-Likelihood")
print("=" * 50)

# Imagine a 3-class tire compound prediction problem
# True label is Soft compound (class 0), model outputs these probabilities:
true_compound = 0
model_probs = np.array([0.7, 0.2, 0.1])  # Model is fairly confident it's Soft

# Method 1: Cross-Entropy Loss (what we use in practice)
# CE = -sum(y_true * log(y_pred)) where y_true is one-hot
one_hot = np.array([1, 0, 0])  # One-hot encoding of true compound (Soft)
cross_entropy_val = -np.sum(one_hot * np.log(model_probs))
print(f"\nCross-Entropy Loss: -sum(y_true * log(y_pred))")
print(f"  = -({one_hot[0]} * log({model_probs[0]:.2f}) + {one_hot[1]} * log({model_probs[1]:.2f}) + {one_hot[2]} * log({model_probs[2]:.2f}))")
print(f"  = -({np.log(model_probs[0]):.4f})")
print(f"  = {cross_entropy_val:.4f}")

# Method 2: Negative Log-Likelihood (MLE perspective)
# NLL = -log(P(true_class))
neg_log_likelihood = -np.log(model_probs[true_compound])
print(f"\nNegative Log-Likelihood: -log(P(true_compound))")
print(f"  = -log({model_probs[true_compound]:.2f})")
print(f"  = {neg_log_likelihood:.4f}")

print(f"\nThey're identical! CE = NLL = {cross_entropy_val:.4f}")
print("\nThis means: Training with cross-entropy loss is doing MLE!")
print("We're finding network weights that maximize P(correct labels | inputs)")

# Show how loss changes with confidence
print("\n" + "=" * 50)
print("How loss varies with model confidence (predicting tire compound):")
probs_for_true_compound = [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
print(f"{'P(correct compound)':<22} {'Cross-Entropy Loss':<20}")
print("-" * 42)
for p in probs_for_true_compound:
    loss = -np.log(p)
    print(f"{p:<22.2f} {loss:<20.4f}")

In [None]:
# Visualize how posterior changes with prior
# F1 context: How P(rain | radar alert) changes with base rain probability
priors = np.linspace(0.001, 0.5, 100)
sensitivity = 0.95
specificity = 0.95

posteriors = []
for prior in priors:
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    posterior = (sensitivity * prior) / p_positive
    posteriors.append(posterior)

plt.figure(figsize=(10, 6))
plt.plot(priors * 100, np.array(posteriors) * 100, 'b-', linewidth=2)
plt.xlabel('Prior Probability [%]\n(e.g., base rate of rain at this circuit)')
plt.ylabel('Posterior Probability [%]\n(e.g., P(rain | radar alert))')
plt.title('How the Prior Affects the Posterior (95% Accurate Sensor)\n'
          'F1: Low base-rate rain circuits give more false alarms')
plt.grid(True, alpha=0.3)

# Mark some key points
for prior in [0.01, 0.1, 0.5]:
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    posterior = (sensitivity * prior) / p_positive
    plt.scatter([prior * 100], [posterior * 100], s=100, zorder=5)
    plt.annotate(f'({prior*100:.0f}%, {posterior*100:.1f}%)', 
                 (prior * 100 + 1, posterior * 100 - 3))

plt.show()

In [None]:
# Visualization: The Likelihood Surface
# F1 context: Estimating mean lap time (mu) and consistency (sigma) from observed laps

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Generate data from known distribution (simulated lap times)
np.random.seed(42)
true_mean_lap = 3.0  # True mean (offset for math convenience)
true_consistency = 1.5  # True sigma
lap_times = np.random.normal(true_mean_lap, true_consistency, size=30)

# Plot 1: 1D likelihood for mu (sigma fixed at true value)
mus = np.linspace(0, 6, 100)
log_likelihoods_mu = [np.sum(stats.norm.logpdf(lap_times, mu, true_consistency)) for mu in mus]

axes[0].plot(mus, log_likelihoods_mu, 'b-', linewidth=2)
axes[0].axvline(x=lap_times.mean(), color='red', linestyle='--', label=f'MLE: {lap_times.mean():.2f}')
axes[0].axvline(x=true_mean_lap, color='green', linestyle=':', label=f'True: {true_mean_lap}')
axes[0].set_xlabel('Mean Lap Time (mu)')
axes[0].set_ylabel('Log-Likelihood')
axes[0].set_title('Likelihood vs. Mean Pace\n(consistency fixed)')
axes[0].legend()

# Plot 2: 1D likelihood for sigma (mu fixed at true value)
sigmas = np.linspace(0.5, 4, 100)
log_likelihoods_sigma = [np.sum(stats.norm.logpdf(lap_times, true_mean_lap, sigma)) for sigma in sigmas]

axes[1].plot(sigmas, log_likelihoods_sigma, 'b-', linewidth=2)
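# With mu fixed at its true value, the MLE of sigma is the RMS deviation from that mu
sigma_mle_fixed_mu = np.sqrt(np.mean((lap_times - true_mean_lap) ** 2))
axes[1].axvline(x=sigma_mle_fixed_mu, color='red', linestyle='--', label=f'MLE: {sigma_mle_fixed_mu:.2f}')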
axes[1].axvline(x=true_consistency, color='green', linestyle=':', label=f'True: {true_consistency}')
axes[1].set_xlabel('Lap Time Consistency (sigma)')
axes[1].set_ylabel('Log-Likelihood')
axes[1].set_title('Likelihood vs. Consistency\n(mean pace fixed)')
axes[1].legend()

# Plot 3: 2D likelihood surface
mus_2d = np.linspace(1, 5, 50)
sigmas_2d = np.linspace(0.5, 3, 50)
MU, SIGMA = np.meshgrid(mus_2d, sigmas_2d)

LL = np.zeros_like(MU)
for i in range(len(sigmas_2d)):
    for j in range(len(mus_2d)):
        LL[i, j] = np.sum(stats.norm.logpdf(lap_times, MU[i, j], SIGMA[i, j]))

contour = axes[2].contourf(MU, SIGMA, LL, levels=30, cmap='viridis')
axes[2].scatter([lap_times.mean()], [lap_times.std()], color='red', s=150, marker='*', 
                label='MLE', zorder=5, edgecolors='white')
axes[2].scatter([true_mean_lap], [true_consistency], color='white', s=100, marker='o',
                label='True', zorder=5, edgecolors='black')
axes[2].set_xlabel('Mean Lap Time (mu)')
axes[2].set_ylabel('Consistency (sigma)')
axes[2].set_title('2D Log-Likelihood Surface\n(Finding best mu, sigma from lap data)')
axes[2].legend()
plt.colorbar(contour, ax=axes[2], label='Log-Likelihood')

plt.tight_layout()
plt.show()

print("Key Observations:")
print("1. The likelihood surface has a clear peak (the MLE)")
print("2. As we move away from the MLE, likelihood decreases")
print("3. Gradient ascent on this surface finds the MLE")
print("4. Neural network training follows the same idea: gradient descent")
print("   on the negative log-likelihood (the loss)")
print("\nF1 insight: MLE finds the mean pace and consistency that best")
print("explain the observed lap times — the same technique teams use")
print("to estimate tire degradation rates from stint data.")

### Deep Dive: Understanding Entropy

Entropy has several intuitive interpretations that all lead to the same formula:

**Interpretation 1: Average Surprise**
- "Surprise" of an event = -log P(event)
- Rare events (low probability) are more surprising
- Entropy = average surprise across all possible outcomes
- H(X) = E[-log P(X)] = "How surprised will I be on average?"

**Interpretation 2: Uncertainty**  
- How uncertain are we about the outcome?
- Maximum entropy = maximum uncertainty (uniform distribution)
- Zero entropy = complete certainty (deterministic)

**Interpretation 3: Information Content (Bits)**
- "How many yes/no questions do I need to identify the outcome?"
- Fair coin: 1 bit (one yes/no question: "Was it heads?")
- Fair 4-sided die: 2 bits ("Is it 1 or 2?" then "Is it the first of those two?")
- Biased distributions need fewer questions on average (can ask about likely outcomes first)

**F1 analogy**: Entropy measures the **unpredictability of race results**. A season where one driver dominates (P(Verstappen wins) = 0.9) has low entropy — you're rarely surprised by the winner. A season with 5 competitive drivers splitting wins equally has high entropy — every race is a genuine surprise. This is exactly why fans call competitive seasons "exciting" — high entropy = high unpredictability = better entertainment. In ML, softmax temperature controls this same trade-off: low temperature = peaked distribution (confident), high temperature = flat distribution (uncertain).

**Why log base 2?** 
- Gives entropy in "bits" - the number of binary questions
- log base e gives "nats" (natural units)
- They're proportional: 1 nat = 1/ln(2) ≈ 1.44 bits
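
The conversion between bits and nats can be checked numerically. The snippet below is a minimal sketch; the helper name `entropy_in_base` and the example distribution are assumptions made for illustration, not part of the notebook's own code.

```python
import numpy as np

def entropy_in_base(p, base):
    """Entropy of a discrete distribution, using logarithms in the given base."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) = 0
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

p = [0.7, 0.2, 0.1]  # illustrative win probabilities for three drivers
h_bits = entropy_in_base(p, 2)
h_nats = entropy_in_base(p, np.e)

print(f"H = {h_bits:.4f} bits = {h_nats:.4f} nats")
print(f"bits / nats = {h_bits / h_nats:.4f}")  # 1 / ln(2), about 1.4427
```

The ratio is the same for every distribution, which is why the choice of base only changes the unit, not the ranking of distributions by entropy.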

In [None]:
# Visualizing Entropy as "Average Surprise"
# NOTE: relies on entropy() defined in the Information Theory section
# Surprise of an event = -log2(P(event))
# F1: How surprised are you when a particular driver wins?

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Surprise function
probs = np.linspace(0.01, 1, 100)
surprise = -np.log2(probs)

axes[0].plot(probs, surprise, 'b-', linewidth=2)
axes[0].set_xlabel('P(driver wins)')
axes[0].set_ylabel('Surprise = -log2(P(win))')
axes[0].set_title('Surprise Function\n"How shocked are you by the race winner?"')
axes[0].grid(True, alpha=0.3)
axes[0].annotate('Underdog wins!\n(high surprise)', xy=(0.1, 3.3), fontsize=10)
axes[0].annotate('Favorite wins\n(low surprise)', xy=(0.7, 0.8), fontsize=10)

# Plot 2: Entropy for different championship scenarios
distributions = {
    'Dominant era\n[1,0,0,0]': [1, 0, 0, 0],
    'Clear favorite\n[0.7,0.2,0.1,0]': [0.7, 0.2, 0.1, 0],
    'Competitive\n[0.4,0.3,0.2,0.1]': [0.4, 0.3, 0.2, 0.1],
    'Wide open\n[0.25,0.25,0.25,0.25]': [0.25, 0.25, 0.25, 0.25],
}

names = list(distributions.keys())
entropies = [entropy(p) for p in distributions.values()]

bars = axes[1].bar(range(len(names)), entropies, color=['darkblue', 'blue', 'steelblue', 'lightblue'])
axes[1].set_xticks(range(len(names)))
axes[1].set_xticklabels(names, fontsize=9)
axes[1].set_ylabel('Entropy (bits)')
axes[1].set_title('Season Competitiveness = Entropy\n"How unpredictable is each race?"')
axes[1].set_ylim(0, 2.5)
for i, v in enumerate(entropies):
    axes[1].text(i, v + 0.1, f'{v:.2f}', ha='center', fontweight='bold')

# Plot 3: Why uniform has maximum entropy
# Show entropy vs "peakedness" of distribution (like softmax temperature)
alphas = np.linspace(0, 5, 50)
scores = np.array([1, 2, 3, 4])  # Base driver ratings

entropy_values = []
for alpha in alphas:
    if alpha == 0:
        probs = np.ones(4) / 4  # Uniform — anyone can win
    else:
        logits = alpha * scores
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()
    entropy_values.append(entropy(probs))

axes[2].plot(alphas, entropy_values, 'b-', linewidth=2)
axes[2].set_xlabel('Performance gap (lower = more equal)')
axes[2].set_ylabel('Entropy (bits)')
axes[2].set_title('Entropy vs. Field Competitiveness\n(Like softmax temperature in ML)')
axes[2].axhline(y=2, color='r', linestyle='--', alpha=0.5, label='Max entropy (anyone can win)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Entropy Reference Table

| Distribution | Probabilities | Entropy (bits) | Interpretation | F1 Parallel |
|--------------|---------|----------------|----------------|-------------|
| Fair coin | [0.5, 0.5] | 1.00 | 1 yes/no question needed | "Will the car finish?" — pure coin flip |
| Biased coin (90/10) | [0.9, 0.1] | 0.47 | Less than 1 question on average | Reliable car: almost certainly finishes |
| Certain outcome | [1, 0] | 0.00 | No uncertainty, no questions needed | Dominant driver: guaranteed win |
| Fair 4-sided die | [0.25, 0.25, 0.25, 0.25] | 2.00 | 2 yes/no questions needed | 4-way title fight: any of them can win |
| Fair 8-sided die | [1/8] * 8 | 3.00 | 3 yes/no questions needed | 8 competitive drivers: wide-open field |
| Fair N-sided die | [1/N] * N | log2(N) | log2(N) questions needed | N equally-matched drivers |

**Pattern**: For a uniform distribution over N outcomes, entropy = log2(N) bits.

**Why Maximum Entropy = Uniform?**
- Mathematically: Proven via Lagrange multipliers (maximizing H subject to sum = 1)
- Intuitively: Any preference toward one outcome reduces average surprise
- Philosophically: Maximum entropy = maximum ignorance = all outcomes equally plausible
- In F1 terms: The most unpredictable season is when every driver has equal chance of winning
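
The Lagrange-multiplier argument can be sketched in a few lines. Natural logs are used here for convenience; changing the base only rescales the result. Maximize $H$ subject to $\sum_i p_i = 1$:

$$\mathcal{L}(p, \lambda) = -\sum_{i=1}^{N} p_i \log p_i + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right)$$

$$\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda = 0 \implies p_i = e^{\lambda - 1}$$

Every $p_i$ equals the same constant, so the constraint forces $p_i = 1/N$, giving $H = \log N$. Because entropy is concave in $p$, this stationary point is the maximum.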

### Naive Bayes Classifier

A simple but effective classifier using Bayes' theorem:

$$P(y|x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i|y)$$

The "naive" assumption is that features are conditionally independent given the class.

**F1 analogy**: Imagine predicting which tire compound a driver is on (Soft/Medium/Hard) based on telemetry features: average speed, tire temperature, and lap time degradation rate. Naive Bayes assumes these features are independent given the compound — which isn't perfectly true (speed and degradation correlate), but it works surprisingly well in practice, just as it does in spam filtering and text classification.
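
In practice the product in the formula above is evaluated in log space, because multiplying many small likelihoods underflows in floating point. The snippet below is a minimal sketch of that effect; the likelihood values, the feature count, and the equal priors are all made up for illustration.

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | y) for one sample with 500 features
likelihoods_class_a = np.full(500, 1e-3)
likelihoods_class_b = np.full(500, 2e-3)

# Multiplying directly underflows: both products collapse to exactly 0.0
print(np.prod(likelihoods_class_a), np.prod(likelihoods_class_b))

# Summing logs keeps the scores finite and comparable (equal priors assumed here)
log_scores = np.array([
    np.log(0.5) + np.sum(np.log(likelihoods_class_a)),
    np.log(0.5) + np.sum(np.log(likelihoods_class_b)),
])

# Normalize in log space: subtract the max before exponentiating
posterior = np.exp(log_scores - log_scores.max())
posterior /= posterior.sum()
print(log_scores, posterior)
```

With the direct products both classes look equally impossible, while the log scores still identify class B as far more probable.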

In [None]:
# Simple Naive Bayes from scratch
# F1 context: Classifying tire compound from telemetry features
class NaiveBayesClassifier:
    def __init__(self):
        self.class_priors = {}
        self.feature_params = {}  # (class, feature) -> (mean, std)
        
    def fit(self, X, y):
        """Fit Gaussian Naive Bayes."""
        classes = np.unique(y)
        n_samples = len(y)
        
        for c in classes:
            # Class prior
            self.class_priors[c] = np.sum(y == c) / n_samples
            
            # Feature parameters (Gaussian)
            X_c = X[y == c]
            for j in range(X.shape[1]):
                self.feature_params[(c, j)] = (X_c[:, j].mean(), X_c[:, j].std() + 1e-6)
                
    def predict_proba(self, X):
        """Compute class probabilities."""
        classes = list(self.class_priors.keys())
        n_samples = X.shape[0]
        probs = np.zeros((n_samples, len(classes)))
        
        for i, c in enumerate(classes):
            # Start with log prior
            log_prob = np.log(self.class_priors[c])
            
            # Add log likelihood for each feature
            for j in range(X.shape[1]):
                mean, std = self.feature_params[(c, j)]
                log_prob += stats.norm.logpdf(X[:, j], mean, std)
            
            probs[:, i] = log_prob
        
        # Convert to probabilities (softmax of log probs)
        probs = np.exp(probs - probs.max(axis=1, keepdims=True))
        probs = probs / probs.sum(axis=1, keepdims=True)
        
        return probs
    
    def predict(self, X):
        """Predict class labels."""
        probs = self.predict_proba(X)
        classes = list(self.class_priors.keys())
        return np.array([classes[i] for i in probs.argmax(axis=1)])


# Generate synthetic telemetry data (two tire compounds)
np.random.seed(42)
n_samples = 200

# Compound 0 (Hard): lower avg speed, lower degradation
hard_compound_telemetry = np.random.randn(n_samples // 2, 2) + np.array([0, 0])
# Compound 1 (Soft): higher avg speed, higher degradation
soft_compound_telemetry = np.random.randn(n_samples // 2, 2) + np.array([3, 3])

X_telemetry = np.vstack([hard_compound_telemetry, soft_compound_telemetry])
y_compound = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Train
clf = NaiveBayesClassifier()
clf.fit(X_telemetry, y_compound)

# Predict
y_pred = clf.predict(X_telemetry)
accuracy = np.mean(y_pred == y_compound)
print(f"Training accuracy: {accuracy:.2%}")

# Visualize decision boundary
x_min, x_max = X_telemetry[:, 0].min() - 1, X_telemetry[:, 0].max() + 1
y_min, y_max = X_telemetry[:, 1].min() - 1, X_telemetry[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X_telemetry[y_compound==0, 0], X_telemetry[y_compound==0, 1], c='blue', label='Hard Compound', edgecolors='k')
plt.scatter(X_telemetry[y_compound==1, 0], X_telemetry[y_compound==1, 1], c='red', label='Soft Compound', edgecolors='k')
plt.xlabel('Average Speed (normalized)')
plt.ylabel('Tire Degradation Rate (normalized)')
plt.title('Naive Bayes: Classifying Tire Compound from Telemetry')
plt.legend()
plt.show()

### Deep Dive: Understanding KL Divergence

KL divergence measures how "different" one distribution is from another. Here are multiple ways to understand it:

**Interpretation 1: Extra Bits for Wrong Encoding**
- Suppose you design a code optimized for distribution Q
- But the true data comes from distribution P
- KL(P || Q) = extra bits needed because you used the wrong distribution
- If P = Q, you need exactly H(P) bits (optimal)
- If P != Q, you need H(P) + KL(P||Q) bits (suboptimal)
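
A small worked example makes the "extra bits" reading concrete. This is a sketch under simple assumptions: the distributions are chosen so that ideal code lengths are whole numbers of bits, and the variable names are illustrative.

```python
import numpy as np

# True distribution P over four outcomes (dyadic, so ideal code lengths are whole bits)
p = np.array([0.5, 0.25, 0.125, 0.125])
# Assumed (wrong) distribution Q used to design the code: uniform
q = np.array([0.25, 0.25, 0.25, 0.25])

# An ideal code assigns each outcome a length of -log2(probability) bits
len_opt = -np.log2(p)    # lengths 1, 2, 3, 3 (designed for P)
len_wrong = -np.log2(q)  # lengths 2, 2, 2, 2 (designed for Q)

avg_opt = np.sum(p * len_opt)      # H(P)
avg_wrong = np.sum(p * len_wrong)  # H(P, Q)
extra = avg_wrong - avg_opt        # KL(P || Q)

print(f"Average bits with the code built for P: {avg_opt:.2f}")    # 1.75
print(f"Average bits with the code built for Q: {avg_wrong:.2f}")  # 2.00
print(f"Extra bits per outcome (KL):            {extra:.2f}")      # 0.25
```

The quarter-bit overhead per outcome is exactly the price of designing the code for the wrong distribution.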

**Interpretation 2: Information Lost**
- KL(P || Q) measures information lost when Q is used to approximate P
- It's the "distance" from Q to P (but not symmetric!)

**Interpretation 3: The Fundamental Relationship**
$$D_{KL}(P || Q) = H(P, Q) - H(P) = \text{Cross-Entropy} - \text{Entropy}$$

This tells us:
- Cross-entropy = cost of using Q to encode P
- Entropy = minimum possible cost (using P itself)
- KL divergence = the "wasted" bits from using Q instead of P

**F1 analogy**: KL divergence measures **how different qualifying pace is from race pace**. If a team qualifies brilliantly but fades in the race (different distributions), the KL divergence between their qualifying and race performance is large. A team whose race pace mirrors qualifying (like Red Bull in a dominant season) has low KL divergence. In ML, this exact concept powers knowledge distillation — measuring how well the student model's distribution matches the teacher's.

**Why Not Symmetric?**
- KL(P || Q): Cost of using Q when truth is P
- KL(Q || P): Cost of using P when truth is Q
- These are different questions!

In [None]:
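# Visualizing the KL = CrossEntropy - Entropy relationship
# NOTE: relies on entropy(), cross_entropy() and kl_divergence() defined in the Information Theory section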
# And demonstrating asymmetry
# F1: Qualifying pace (P) vs Race pace (Q)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Example distributions: Qualifying vs Race performance
quali_pace = np.array([0.6, 0.3, 0.1])  # Qualifying: strong at top positions
race_pace = np.array([0.33, 0.33, 0.34])  # Race: much more spread out

# Calculate all quantities
H_P = entropy(quali_pace)
H_P_Q = cross_entropy(quali_pace, race_pace)  # Cross-entropy
KL_P_Q = kl_divergence(quali_pace, race_pace)

# Plot 1: Bar chart showing the relationship
quantities = ['H(P)\nEntropy', 'KL(P||Q)\nDivergence', 'H(P,Q)\nCross-Entropy']
values = [H_P, KL_P_Q, H_P_Q]
colors = ['green', 'red', 'blue']

bars = axes[0].bar(quantities, values, color=colors, alpha=0.7)
axes[0].set_ylabel('Bits')
axes[0].set_title('H(P,Q) = H(P) + KL(P||Q)\nQuali vs Race Pace Gap')

# Add value labels
for bar, val in zip(bars, values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05, 
                 f'{val:.3f}', ha='center', fontweight='bold')

# Verify the relationship
axes[0].text(0.5, 0.85, f'{H_P:.3f} + {KL_P_Q:.3f} = {H_P + KL_P_Q:.3f}', 
             transform=axes[0].transAxes, ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

# Plot 2: Asymmetry visualization
P = np.array([0.9, 0.1])  # Dominant driver: almost always wins
Q = np.array([0.5, 0.5])  # Equal competition

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

x = np.arange(2)
width = 0.35

axes[1].bar(x - width/2, P, width, label='P (dominant era)', color='blue', alpha=0.7)
axes[1].bar(x + width/2, Q, width, label='Q (equal field)', color='orange', alpha=0.7)
axes[1].set_xticks(x)
axes[1].set_xticklabels(['Win', 'Lose'])
axes[1].set_ylabel('Probability')
axes[1].set_title(f'KL Asymmetry\nKL(P||Q)={kl_pq:.3f}, KL(Q||P)={kl_qp:.3f}')
axes[1].legend()

# Plot 3: Why asymmetry matters in practice
axes[2].text(0.5, 0.85, 'KL(P || Q) vs KL(Q || P)', fontsize=14, fontweight='bold',
             ha='center', transform=axes[2].transAxes)

explanation = """
KL(P || Q): "How bad is Q as a model of P?"
- Averages over P (true distribution)
- Catastrophic if Q gives 0 probability 
  where P has probability (log(0) = -inf!)
- Used in: max likelihood, cross-entropy loss

KL(Q || P): "How bad is P as a model of Q?"  
- Averages over Q (approximate distribution)
- Catastrophic if P gives 0 probability
  where Q has probability
- Used in: VAEs, variational inference (mode-seeking)

In classification:
- P = true labels (one-hot), Q = model predictions
- Cross-entropy = H(P) + KL(P||Q) = KL(P||Q)
  (since H(P) = 0 for one-hot)

F1: KL(quali || race) != KL(race || quali)
"How different is race pace from quali pace"
is NOT the same as the reverse!
"""

axes[2].text(0.05, 0.75, explanation, fontsize=9, transform=axes[2].transAxes,
             verticalalignment='top', fontfamily='monospace')
axes[2].axis('off')

plt.tight_layout()
plt.show()

print("Key Insight for Classification:")
print("When true labels are one-hot, H(P) = 0")
print("So: Cross-Entropy Loss = KL(true || predicted)")
print("Minimizing cross-entropy = minimizing KL divergence!")

In [None]:
# Demonstrating KL divergence in Knowledge Distillation
# NOTE: relies on kl_divergence() defined in the Information Theory section
# F1 context: An experienced race engineer (teacher) training a junior (student)

# Imagine a 5-class race outcome prediction problem
race_outcomes = ['Win', 'Podium', 'Points', 'No Points', 'DNF']

# Hard label (ground truth: the driver won)
hard_label = np.array([1, 0, 0, 0, 0])  # True outcome is 'Win'

# Experienced engineer's prediction (teacher — soft, nuanced)
# Notice: teacher thinks podium was likely too (car was competitive)
senior_engineer = np.array([0.7, 0.2, 0.05, 0.03, 0.02])

# Junior engineer predictions at different training stages
junior_untrained = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # Uniform (no insight)
junior_learning = np.array([0.5, 0.15, 0.15, 0.1, 0.1])  # Developing intuition
junior_trained = np.array([0.68, 0.18, 0.07, 0.04, 0.03])  # Well-calibrated

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Compare distributions
x = np.arange(len(race_outcomes))
width = 0.2

axes[0].bar(x - 1.5*width, hard_label, width, label='Hard Label (result)', color='red', alpha=0.7)
axes[0].bar(x - 0.5*width, senior_engineer, width, label='Senior Engineer', color='blue', alpha=0.7)
axes[0].bar(x + 0.5*width, junior_learning, width, label='Junior (learning)', color='green', alpha=0.7)
axes[0].bar(x + 1.5*width, junior_trained, width, label='Junior (trained)', color='purple', alpha=0.7)

axes[0].set_xticks(x)
axes[0].set_xticklabels(race_outcomes)
axes[0].set_ylabel('Probability')
axes[0].set_title('Knowledge Distillation: Senior to Junior Engineer\n"Learning the nuance behind race outcomes"')
axes[0].legend()

# Plot 2: KL divergences
juniors = {
    'Untrained': junior_untrained,
    'Learning': junior_learning, 
    'Trained': junior_trained
}

# KL from hard labels (what standard cross-entropy uses)
kl_hard = [kl_divergence(hard_label, s) for s in juniors.values()]

# KL from senior engineer's soft labels (knowledge distillation)
kl_soft = [kl_divergence(senior_engineer, s) for s in juniors.values()]

x = np.arange(len(juniors))
width = 0.35

axes[1].bar(x - width/2, kl_hard, width, label='KL(Result || Junior)', color='red', alpha=0.7)
axes[1].bar(x + width/2, kl_soft, width, label='KL(Senior || Junior)', color='blue', alpha=0.7)
axes[1].set_xticks(x)
axes[1].set_xticklabels(list(juniors.keys()))
axes[1].set_ylabel('KL Divergence (bits)')
axes[1].set_title('Loss: Hard Result vs Knowledge Distillation')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Why Soft Labels Help:")
print("=" * 50)
print("Hard label only says: 'The driver won this race'")
print("Soft label says: 'The driver won, but podium was likely too,")
print("                  DNF was very unlikely given car reliability'")
print("\nThe relationships between outcomes ('dark knowledge') help")
print("the junior engineer develop better race intuition!")

### KL Divergence in Machine Learning Applications

| Application | P (first argument) | Q (second argument) | What KL Measures | F1 Parallel |
|-------------|-----------------|----------------------|------------------|-------------|
| **Classification Loss** | One-hot labels | Softmax predictions | How wrong are predictions | How far model's race prediction is from actual result |
| **VAE Loss** | Posterior q(z\|x) | Prior p(z), usually N(0,1) | How far latent code is from prior | How far telemetry encoding is from baseline |
| **Knowledge Distillation** | Teacher softmax | Student softmax | How well student mimics teacher | Junior engineer matching senior's race intuition |
| **Policy Gradient (PPO)** | Old policy | New policy | Prevents too-large policy updates | Gradual strategy updates between races |
| **Variational Inference** | Variational approx q(z) | True posterior p(z\|x) | Quality of approximation | How well simplified model captures real tire behavior |

**The VAE Loss Decomposition**:
$$\mathcal{L}_{VAE} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q(z|x) || p(z))}_{\text{Regularization}}$$

The KL term pulls the encoder's latent distribution toward the prior, enabling generation.
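
When the encoder outputs a diagonal Gaussian and the prior is a standard normal, the KL term has a closed form. The sketch below assumes that setup and the common convention of parameterizing the encoder by a mean and a log-variance; the function name and example values are illustrative.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Encoder output that already matches the prior: no penalty
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Encoder output that has drifted away from the prior: positive penalty
print(kl_diag_gaussian_to_standard_normal([1.0, -2.0], [0.5, -1.0]))
```

Per dimension this is the univariate Gaussian KL formula with the second distribution fixed to mean 0 and standard deviation 1.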

**Knowledge Distillation**:
- Teacher: Large, accurate model with "soft" predictions
- Student: Small model learning to match teacher
- Loss = KL(Teacher || Student) on softmax outputs, usually softened with a temperature
- Student learns teacher's "dark knowledge" (relationships between classes)
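
A distillation loss along these lines might look like the sketch below. It assumes temperature-softened softmax outputs and the commonly used $T^2$ scaling of the KL term; the logits are made up, and real training setups typically combine this term with the ordinary hard-label loss.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) in nats on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return temperature**2 * kl

# Hypothetical logits for the five outcomes: Win, Podium, Points, No Points, DNF
teacher_logits = [4.0, 2.5, 1.0, 0.5, 0.0]
student_logits = [3.0, 1.0, 1.0, 0.5, 0.5]

for T in (1.0, 2.0, 4.0):
    print(f"T = {T}: loss = {distillation_loss(teacher_logits, student_logits, T):.4f}")
```

Raising the temperature flattens both distributions, so the less likely outcomes (the "dark knowledge") contribute more to the loss.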

---

## 5. Maximum Likelihood Estimation (MLE)

MLE finds the parameters that maximize the probability of the observed data. For independent, identically distributed samples, the likelihood factorizes into a product:

$$\hat{\theta}_{MLE} = \arg\max_\theta P(\text{data}|\theta) = \arg\max_\theta \prod_i P(x_i|\theta)$$

In practice, we maximize the **log-likelihood** (easier to work with):

$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_i \log P(x_i|\theta)$$

**Key insight**: Minimizing cross-entropy loss = maximizing log-likelihood!
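
This equivalence is easy to verify for a classifier: with one-hot targets, the average cross-entropy and the average negative log-likelihood are the same number. The predicted probabilities and labels below are made up for illustration.

```python
import numpy as np

# Hypothetical model outputs: predicted probabilities for 3 classes on 4 samples
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 2, 1])  # observed classes

# Negative log-likelihood: -sum_i log P(x_i | theta), averaged over samples
nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Cross-entropy between one-hot targets and predictions, averaged over samples
one_hot = np.eye(probs.shape[1])[labels]
ce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

print(f"Average NLL:           {nll:.6f}")
print(f"Average cross-entropy: {ce:.6f}")
```

The one-hot vector simply selects the log-probability of the observed class, which is exactly the term that appears in the log-likelihood.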

**F1 analogy**: MLE is a natural way to estimate tire degradation rate from lap data. You observe a stint of 15 laps with gradually increasing times. MLE asks: "What degradation rate per lap makes these observed times most probable?" If lap-time noise is modeled as Gaussian with constant variance, the answer is the least-squares slope through the lap time data. Neural networks learn the same way: find the parameters (weights) that make the training data most probable.
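
The degradation example can be sketched as follows, assuming a linear degradation model with Gaussian noise of constant variance. All numbers are simulated, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stint (all values are made up for illustration)
laps = np.arange(15)
true_base, true_deg = 91.0, 0.08  # base lap time (s) and degradation (s per lap)
stint_times = true_base + true_deg * laps + rng.normal(0.0, 0.05, size=laps.size)

# With Gaussian noise of fixed variance, maximizing the log-likelihood over
# (base, degradation) is the same as minimizing squared error: least squares.
deg_mle, base_mle = np.polyfit(laps, stint_times, deg=1)

# The MLE of the noise level is the RMS residual (no degrees-of-freedom correction)
residuals = stint_times - (base_mle + deg_mle * laps)
sigma_mle = np.sqrt(np.mean(residuals**2))

print(f"Estimated degradation: {deg_mle:.3f} s/lap (true {true_deg})")
print(f"Estimated base pace:   {base_mle:.2f} s (true {true_base})")
print(f"Estimated noise sigma: {sigma_mle:.3f} s")
```

A different noise model would change the estimator: Laplace-distributed noise, for example, leads to least absolute deviations rather than least squares.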

In [None]:
# MLE for Gaussian parameters — Estimating a driver's true pace
# True parameters (unknown to us in practice)
true_mean_pace = 5.0  # True mean lap time offset
true_pace_sigma = 2.0  # True consistency

# Generate observed lap times
n_laps = 100
observed_laps = np.random.normal(true_mean_pace, true_pace_sigma, n_laps)

# MLE estimates (can be derived analytically)
pace_mle = observed_laps.mean()  # Sample mean
sigma_mle = observed_laps.std()  # Sample std (biased, but MLE)

print(f"True parameters: mean pace = {true_mean_pace}, consistency = {true_pace_sigma}")
print(f"MLE estimates:   mean pace = {pace_mle:.3f}, consistency = {sigma_mle:.3f}")

# Visualize
x = np.linspace(true_mean_pace - 4*true_pace_sigma, true_mean_pace + 4*true_pace_sigma, 100)

plt.figure(figsize=(10, 6))
plt.hist(observed_laps, bins=20, density=True, alpha=0.6, label='Observed lap times')
plt.plot(x, stats.norm.pdf(x, true_mean_pace, true_pace_sigma), 'g-', linewidth=2, 
         label=f'True: N({true_mean_pace}, {true_pace_sigma}²)')
plt.plot(x, stats.norm.pdf(x, pace_mle, sigma_mle), 'r--', linewidth=2,
         label=f'MLE: N({pace_mle:.2f}, {sigma_mle:.2f}²)')
plt.xlabel('Lap Time (offset from baseline)')
plt.ylabel('Density')
plt.title('MLE for Lap Time Distribution\n"Finding the driver\'s true pace from observed data"')
plt.legend()
plt.show()

In [None]:
# Visualize the likelihood function — Finding the best-fit pace parameters
def log_likelihood(mu, sigma, data):
    """Compute log-likelihood of data under N(mu, sigma^2)."""
    return np.sum(stats.norm.logpdf(data, mu, sigma))

# Create grid of parameters
mus = np.linspace(3, 7, 50)
sigmas = np.linspace(1, 4, 50)
MU, SIGMA = np.meshgrid(mus, sigmas)

# Compute log-likelihood at each point
LL = np.zeros_like(MU)
for i in range(len(sigmas)):
    for j in range(len(mus)):
        LL[i, j] = log_likelihood(MU[i, j], SIGMA[i, j], observed_laps)

plt.figure(figsize=(10, 8))
plt.contourf(MU, SIGMA, LL, levels=30, cmap='viridis')
plt.colorbar(label='Log-Likelihood')
plt.scatter([pace_mle], [sigma_mle], color='red', s=200, marker='*', 
            label=f'MLE: (pace={pace_mle:.2f}, consistency={sigma_mle:.2f})', zorder=5)
plt.scatter([true_mean_pace], [true_pace_sigma], color='white', s=100, marker='o',
            label=f'True: (pace={true_mean_pace}, consistency={true_pace_sigma})', zorder=5)
plt.xlabel('Mean Lap Time (mu)')
plt.ylabel('Consistency / Sigma')
plt.title('Log-Likelihood Surface for Driver Pace Estimation\n"The peak is the MLE — the best estimate from observed laps"')
plt.legend()
plt.show()

### MLE for Bernoulli (Coin Flip / Race Finish)

If we observe $k$ heads in $n$ flips, the MLE estimate is simply:

$$\hat{p}_{MLE} = \frac{k}{n}$$

**F1 analogy**: If a car finishes 14 out of 20 races, the MLE estimate of its reliability is p = 14/20 = 0.70. Simple, elegant, and exactly how teams track reliability statistics.
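
The result follows from a short derivation. The binomial coefficient does not depend on $p$, so it can be dropped from the log-likelihood:

$$\ell(p) = \log \left[ p^{k} (1-p)^{n-k} \right] = k \log p + (n-k) \log(1-p)$$

Setting the derivative to zero:

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \implies k(1-p) = (n-k)\,p \implies \hat{p}_{MLE} = \frac{k}{n}$$

The second derivative, $-k/p^2 - (n-k)/(1-p)^2$, is negative for $0 < p < 1$, so this stationary point is a maximum.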

In [None]:
# MLE for Bernoulli — Estimating car reliability from race finishes
reliability_true = 0.7
n_races = 50
race_finishes = np.random.binomial(1, reliability_true, n_races)
n_finishes = race_finishes.sum()  # Number of races finished

reliability_mle = n_finishes / n_races

print(f"True reliability: {reliability_true}")
print(f"Observed: {n_finishes} finishes in {n_races} races")
print(f"MLE estimate: reliability = {reliability_mle:.3f}")

# Visualize likelihood function
p_values = np.linspace(0.01, 0.99, 100)
likelihoods = [stats.binom.pmf(n_finishes, n_races, p) for p in p_values]

plt.figure(figsize=(10, 5))
plt.plot(p_values, likelihoods, 'b-', linewidth=2)
plt.axvline(x=reliability_mle, color='red', linestyle='--', label=f'MLE: reliability = {reliability_mle:.3f}')
plt.axvline(x=reliability_true, color='green', linestyle=':', label=f'True: reliability = {reliability_true}')
plt.xlabel('Reliability (p)')
plt.ylabel('Likelihood P(data|p)')
plt.title(f'Likelihood Function: {n_finishes} Finishes in {n_races} Race Starts\n'
          f'"What reliability makes this season most probable?"')
plt.legend()
plt.show()

---

## 6. Information Theory

Information theory quantifies information and uncertainty.

### Entropy

Entropy measures the "uncertainty" or "information content" of a distribution:

$$H(X) = -\sum_x P(x) \log P(x) = -E[\log P(X)]$$

**Properties**:
- Higher entropy = more uncertainty
- Uniform distribution has maximum entropy
- Deterministic variable has entropy 0

**F1 analogy**: Entropy is the **excitement level of a championship**. A season where one team dominates has low entropy (boring, predictable). A season with 5 teams in contention has high entropy (thrilling, unpredictable). The 2021 Hamilton-Verstappen title fight had much higher entropy than the 2023 Verstappen dominance. In ML, when your model's softmax output has high entropy, it means the model is uncertain about its prediction — just like a pundit who says "anyone could win this race."

In [None]:
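def entropy(p):
    """Compute entropy (in bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # Avoid log(0); by convention 0 * log(0) = 0
    # Adding 0.0 turns the -0.0 produced by a deterministic distribution into 0.0
    return -np.sum(p * np.log2(p)) + 0.0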

# F1 season competitiveness examples (in bits)
print("Entropy examples — Race/Season Unpredictability (in bits):")
print(f"Two equal rivals [0.5, 0.5]: H = {entropy([0.5, 0.5]):.4f} bits")
print(f"Dominant driver [0.9, 0.1]: H = {entropy([0.9, 0.1]):.4f} bits")
print(f"Certain winner [1.0, 0.0]: H = {entropy([1.0, 0.0]):.4f} bits")
print(f"Six equal drivers [1/6]*6: H = {entropy([1/6]*6):.4f} bits")
print(f"Eight-way fight [1/8]*8: H = {entropy([1/8]*8):.4f} bits")

In [None]:
# Entropy of binary distribution as function of p
# F1: How uncertain is a head-to-head title fight?
p_values = np.linspace(0.001, 0.999, 100)
entropies = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_values]

plt.figure(figsize=(10, 6))
plt.plot(p_values, entropies, 'b-', linewidth=2)
plt.xlabel('P(Driver A wins the championship)')
plt.ylabel('Entropy (bits)')
plt.title('Binary Entropy: Excitement of a Two-Way Title Fight\n'
          '"Maximum drama when both drivers have equal chance"')
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5, label='Maximum = 1 bit (50/50 fight)')
plt.axvline(x=0.5, color='g', linestyle='--', alpha=0.5, label='p = 0.5 (equal)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Maximum entropy at p = 0.5 (maximum uncertainty / most exciting)")
print("Entropy = 0 when p = 0 or p = 1 (championship already decided)")

### Cross-Entropy

Cross-entropy measures the "cost" of using distribution $Q$ to encode samples from distribution $P$:

$$H(P, Q) = -\sum_x P(x) \log Q(x) = -E_P[\log Q(X)]$$

**In ML**: Cross-entropy loss measures how well predicted probabilities $Q$ match true labels $P$.

**F1 analogy**: Imagine you're a betting house using your model's race predictions (Q) to set odds, but the actual outcomes follow distribution P. Cross-entropy measures how much money you lose because your model doesn't perfectly match reality. The closer your predictions to truth, the lower the cross-entropy — and the less you lose.

In [None]:
def cross_entropy(p, q):
    """Compute cross-entropy H(P, Q)."""
    p = np.array(p)
    q = np.array(q)
    # Avoid log(0) by clipping
    q = np.clip(q, 1e-10, 1.0)
    return -np.sum(p * np.log2(q))

# Example: True race winner vs model predictions
# True outcome: Driver A won (class 0)
true_result = np.array([1, 0, 0])  # Driver A won

predictions = [
    ([0.9, 0.05, 0.05], "Model confident in Driver A (correct!)"),
    ([0.6, 0.2, 0.2], "Model leans toward A but unsure"),
    ([0.33, 0.33, 0.34], "Model has no idea (uniform)"),
    ([0.1, 0.45, 0.45], "Model backs the wrong drivers"),
]

print("Cross-entropy loss for different race predictions:")
print(f"True result: Driver A wins (one-hot: {true_result})\n")

for q, desc in predictions:
    ce = cross_entropy(true_result, q)
    print(f"{desc}")
    print(f"  Prediction: {q}")
    print(f"  Cross-entropy: {ce:.4f} bits\n")

### KL Divergence

KL divergence measures how different distribution $Q$ is from $P$:

$$D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$$

**Properties**:
- $D_{KL}(P || Q) \geq 0$ (always non-negative)
- $D_{KL}(P || Q) = 0$ if and only if $P = Q$
- Not symmetric: in general, $D_{KL}(P || Q) \neq D_{KL}(Q || P)$

**In ML**: Used in VAEs, knowledge distillation, regularization

**F1 analogy**: KL divergence measures how different two performance distributions are. If P is a team's qualifying pace distribution and Q is their race pace distribution, KL(P||Q) quantifies the "qualifying-to-race translation gap." A team that's a qualifying specialist (fast in quali, slow in race) has high KL divergence between these distributions. A team whose race pace reliably mirrors qualifying has low KL divergence — they "translate" Saturday pace to Sunday.
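
Non-negativity follows from Jensen's inequality applied to the concave logarithm. Summing over outcomes with $P(x) > 0$:

$$-D_{KL}(P || Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)} \leq \log \sum_x P(x) \frac{Q(x)}{P(x)} = \log \sum_x Q(x) \leq \log 1 = 0$$

Equality requires $Q(x)/P(x)$ to be constant wherever $P(x) > 0$ and $Q$ to place no mass elsewhere, which happens only when $P = Q$.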

In [None]:
def kl_divergence(p, q):
    """Compute KL divergence D_KL(P || Q)."""
    p = np.array(p)
    q = np.array(q)
    # Only sum where p > 0
    mask = p > 0
    q = np.clip(q, 1e-10, 1.0)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Compare qualifying pace vs race pace distributions
quali_dist = np.array([0.4, 0.3, 0.2, 0.1])  # Qualifying: tends to be at front
race_similar = np.array([0.35, 0.35, 0.2, 0.1])  # Race pace similar to quali
race_reversed = np.array([0.1, 0.2, 0.3, 0.4])  # Race pace: drops back (quali specialist)
race_uniform = np.array([0.25, 0.25, 0.25, 0.25])  # Race pace: anything can happen

print(f"Qualifying distribution = {quali_dist}")
print(f"\nKL divergences (how different is race pace from quali?):")
print(f"  Similar race pace    {race_similar}: {kl_divergence(quali_dist, race_similar):.4f} bits")
print(f"  Reversed (drops back) {race_reversed}: {kl_divergence(quali_dist, race_reversed):.4f} bits")
print(f"  Unpredictable race   {race_uniform}: {kl_divergence(quali_dist, race_uniform):.4f} bits")

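print(f"\nNote asymmetry (direction matters!):")
# The reversed distribution is a mirror image of quali_dist, so KL happens to be
# identical in both directions for that pair; the uniform case shows the asymmetry.
print(f"  KL(quali || uniform) = {kl_divergence(quali_dist, race_uniform):.4f}")
print(f"  KL(uniform || quali) = {kl_divergence(race_uniform, quali_dist):.4f}")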

In [None]:
# Visualize KL divergence between two Gaussians
# F1: How different is one driver's lap time distribution from another's?
def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL divergence between two univariate Gaussians."""
    return (np.log(sigma2/sigma1) + 
            (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

# Driver A's lap time distribution: N(0, 1) (baseline)
mu_driver_a, sigma_driver_a = 0, 1

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Vary mean pace difference
pace_offsets = np.linspace(-3, 3, 100)
kls = [kl_gaussian(mu_driver_a, sigma_driver_a, mu, sigma_driver_a) for mu in pace_offsets]

axes[0].plot(pace_offsets, kls, 'b-', linewidth=2)
axes[0].set_xlabel('Driver B Mean Pace Difference (seconds)')
axes[0].set_ylabel('KL Divergence (nats)')
axes[0].set_title('KL Divergence: Same Consistency, Different Pace\n'
                   '"How different is Driver B\'s pace from Driver A?"')
axes[0].grid(True, alpha=0.3)

# Vary consistency
consistencies = np.linspace(0.1, 4, 100)
kls = [kl_gaussian(mu_driver_a, sigma_driver_a, mu_driver_a, sigma) for sigma in consistencies]

axes[1].plot(consistencies, kls, 'r-', linewidth=2)
axes[1].axvline(x=1, color='g', linestyle='--', alpha=0.5, label='Same consistency as A')
axes[1].set_xlabel('Driver B Consistency (sigma, seconds)')
axes[1].set_ylabel('KL Divergence (nats)')
axes[1].set_title('KL Divergence: Same Pace, Different Consistency\n'
                   '"How different is Driver B\'s consistency?"')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
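
As a sanity check on the closed-form expression used in `kl_gaussian`, the sketch below compares it against a direct numerical integration of $\int p(x)\,[\log p(x) - \log q(x)]\,dx$. The helper names `kl_gaussian_closed_form` and `kl_gaussian_numeric`, the ±12 sigma integration window, and the three test cases are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kl_gaussian_closed_form(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

def kl_gaussian_numeric(mu1, sigma1, mu2, sigma2):
    """Numerically integrate p(x) * (log p(x) - log q(x))."""
    p = stats.norm(mu1, sigma1)
    q = stats.norm(mu2, sigma2)

    def integrand(x):
        return p.pdf(x) * (p.logpdf(x) - q.logpdf(x))

    # Assumption: +/- 12 sigma around p's mean captures essentially all the mass
    lower, upper = mu1 - 12 * sigma1, mu1 + 12 * sigma1
    value, _ = quad(integrand, lower, upper)
    return value

# Hypothetical cases: pace offset only, then consistency in both directions
cases = [(0.0, 1.0, 1.0, 1.0), (0.0, 1.0, 0.0, 2.0), (0.0, 2.0, 0.0, 1.0)]
for mu1, s1, mu2, s2 in cases:
    closed = kl_gaussian_closed_form(mu1, s1, mu2, s2)
    numeric = kl_gaussian_numeric(mu1, s1, mu2, s2)
    print(f"KL(N({mu1}, {s1}) || N({mu2}, {s2})): "
          f"closed form = {closed:.6f}, numeric = {numeric:.6f}")
```

The last two cases swap the roles of the two distributions, so they also illustrate that KL divergence is asymmetric for Gaussians with different spreads.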

---

## Exercises

### Exercise 1: Bayesian Tire Degradation Inference

Use Bayes' theorem to update beliefs about a tire compound's degradation rate after observing lap times. Just as we used Beta-Binomial conjugacy for coin flips, we treat each lap as a Bernoulli trial (performance drop or no drop) and update a Beta prior over the per-lap probability of a drop.

In [None]:
# Bayesian inference for tire degradation rate
# Model: Each lap, we observe if there was a "performance drop" (1) or not (0)
# Prior: Beta(a, b) distribution over the degradation probability p
# Posterior after k drops in n laps: Beta(a + k, b + n - k)

def plot_beta_posterior(a_prior, b_prior, n_drops, n_clean_laps):
    """Plot prior and posterior distributions for degradation rate."""
    p = np.linspace(0, 1, 100)
    
    # Prior
    prior = stats.beta.pdf(p, a_prior, b_prior)
    
    # Posterior
    a_post = a_prior + n_drops
    b_post = b_prior + n_clean_laps
    posterior = stats.beta.pdf(p, a_post, b_post)
    
    plt.figure(figsize=(10, 6))
    plt.plot(p, prior, 'b--', linewidth=2, label=f'Prior: Beta({a_prior}, {b_prior})')
    plt.plot(p, posterior, 'r-', linewidth=2, 
             label=f'Posterior: Beta({a_post}, {b_post})')
    n_laps = n_drops + n_clean_laps
    if n_laps > 0:  # the MLE k/n is undefined before any laps are observed
        mle = n_drops / n_laps
        plt.axvline(x=mle, color='g', linestyle=':', label=f'MLE: {mle:.3f}')
    plt.xlabel('Degradation Rate (p = probability of performance drop per lap)')
    plt.ylabel('Density')
    plt.title(f'Bayesian Tire Degradation Inference: {n_drops} drops in {n_drops + n_clean_laps} laps')
    plt.legend()
    plt.show()
    
    # Posterior statistics
    post_mean = a_post / (a_post + b_post)
    print(f"Posterior mean degradation rate: {post_mean:.4f}")
    print(f"95% credible interval: [{stats.beta.ppf(0.025, a_post, b_post):.4f}, "
          f"{stats.beta.ppf(0.975, a_post, b_post):.4f}]")

# Start with uniform prior (no prior knowledge about this tire compound)
# Observe 7 laps with performance drops, 3 clean laps
plot_beta_posterior(a_prior=1, b_prior=1, n_drops=7, n_clean_laps=3)
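
The conjugate update `Beta(a + k, b + n - k)` can be checked against Bayes' theorem applied directly on a grid. The sketch below is a minimal check under stated assumptions: the grid resolution, the variable names, and the simple Riemann-sum normalization are illustrative choices, and the data (7 drops, 3 clean laps, uniform prior) mirror the call above.

```python
import numpy as np
from scipy import stats

a_prior, b_prior = 1, 1          # uniform prior, as in the call above
n_drops, n_clean_laps = 7, 3     # observed laps

# Grid over the per-lap drop probability p (endpoints excluded)
p_grid = np.linspace(0.001, 0.999, 999)
dp = p_grid[1] - p_grid[0]

# Bayes' theorem: posterior is proportional to prior * likelihood
prior = stats.beta.pdf(p_grid, a_prior, b_prior)
likelihood = p_grid**n_drops * (1 - p_grid)**n_clean_laps
unnormalized = prior * likelihood
grid_posterior = unnormalized / (unnormalized.sum() * dp)  # Riemann-sum normalization

# Conjugate shortcut: Beta(a + k, b + n - k)
conjugate_posterior = stats.beta.pdf(p_grid, a_prior + n_drops, b_prior + n_clean_laps)

max_abs_diff = np.max(np.abs(grid_posterior - conjugate_posterior))
grid_mean = np.sum(p_grid * grid_posterior) * dp
print(f"Max difference between grid and conjugate posterior: {max_abs_diff:.2e}")
print(f"Grid posterior mean: {grid_mean:.4f}")
```

The two curves should agree up to discretization error, which is why the notebook can skip the integral and simply add the counts to the prior parameters.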

In [None]:
# TODO: Experiment with different priors and data
# What happens with:
# 1. Strong prior from testing data that degradation is moderate (centered on p = 0.5): Beta(10, 10)
# 2. More race laps observed: 70 drops, 30 clean laps
# 3. Prior from testing conflicts with race data (e.g., testing says durable but race says fragile)

# Your experiments here:
# Try: What if the team tested extensively and believed degradation was moderate?
plot_beta_posterior(a_prior=10, b_prior=10, n_drops=7, n_clean_laps=3)

### Exercise 2: Implement Softmax Cross-Entropy Loss for Race Outcome Prediction

In [None]:
def softmax(x):
    """Compute softmax."""
    x_shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def cross_entropy_loss(logits, labels):
    """
    Compute cross-entropy loss for race outcome prediction.
    
    Args:
        logits: Raw model outputs (before softmax), shape (batch_size, num_classes)
               e.g., scores for [Win, Podium, Points]
        labels: True outcome indices, shape (batch_size,)
    
    Returns:
        Scalar loss value
    """
    # TODO: Implement
    # 1. Apply softmax to get probabilities
    # 2. Extract probability of true outcome
    # 3. Return negative log probability (averaged over batch)
    
    probs = softmax(logits)
    batch_size = len(labels)
    # Get probability assigned to correct outcome for each race
    correct_probs = probs[np.arange(batch_size), labels]
    # Negative log likelihood
    loss = -np.mean(np.log(correct_probs + 1e-10))
    return loss

# Test: Predicting race outcomes for 3 different race weekends
# Classes: [Win, Podium, Points finish]
race_logits = np.array([[2.0, 1.0, 0.1],   # Model thinks Win is likely
                        [0.1, 2.5, 0.3],   # Model thinks Podium is likely
                        [0.2, 0.3, 3.0]])  # Model thinks Points finish
true_outcomes = np.array([0, 1, 2])  # Actual results: Win, Podium, Points

loss = cross_entropy_loss(race_logits, true_outcomes)
print(f"Race outcome logits:\n{race_logits}")
print(f"Softmax probabilities:\n{softmax(race_logits).round(4)}")
print(f"True outcomes: {true_outcomes} (Win=0, Podium=1, Points=2)")
print(f"Cross-entropy loss: {loss:.4f}")
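
The `1e-10` guard inside the log keeps the loss finite, but it also caps the loss at about 23 when the model assigns essentially zero probability to the true class. One common alternative is to compute log-probabilities directly with the log-sum-exp trick. The sketch below assumes that approach; `log_softmax`, `cross_entropy_from_logits`, and the extreme logits are illustrative names and values, not part of the exercise.

```python
import numpy as np

def log_softmax(logits):
    """Log of softmax computed via log-sum-exp, without forming probabilities."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def cross_entropy_from_logits(logits, labels):
    """Mean negative log-probability of the true class, taken in log space."""
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Same logits and outcomes as the test above
race_logits = np.array([[2.0, 1.0, 0.1],
                        [0.1, 2.5, 0.3],
                        [0.2, 0.3, 3.0]])
true_outcomes = np.array([0, 1, 2])
stable_loss = cross_entropy_from_logits(race_logits, true_outcomes)
print(f"Cross-entropy via log-softmax: {stable_loss:.4f}")

# Hypothetical extreme case: the model is confidently wrong
extreme_logits = np.array([[1000.0, 0.0, -1000.0]])
extreme_loss = cross_entropy_from_logits(extreme_logits, np.array([2]))
print(f"Loss for a confidently wrong prediction: {extreme_loss:.1f}")
```

For ordinary logits the two approaches agree closely. They differ only when a probability underflows to zero, where the log-space version keeps growing with the logit gap instead of saturating.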

### Exercise 3: Information Gain for Pit Stop Strategy Decisions

In decision trees, we split data to maximize information gain (reduction in entropy). Here, imagine you're deciding whether to split past races into groups based on a feature (e.g., "is it raining?") to better predict outcomes. The split that gives the highest information gain is the most useful for the decision.

In [None]:
def information_gain(parent_labels, left_labels, right_labels):
    """
    Compute information gain from a split.
    
    IG = H(parent) - weighted_avg(H(left), H(right))
    
    F1 context: Splitting race data by a condition (e.g., wet vs dry)
    to better predict outcome (e.g., podium vs no podium).
    """
    def label_entropy(labels):
        """Compute entropy of label distribution."""
        if len(labels) == 0:
            return 0
        _, counts = np.unique(labels, return_counts=True)
        probs = counts / len(labels)
        return entropy(probs)
    
    n = len(parent_labels)
    n_left = len(left_labels)
    n_right = len(right_labels)
    
    h_parent = label_entropy(parent_labels)
    h_left = label_entropy(left_labels)
    h_right = label_entropy(right_labels)
    
    weighted_child = (n_left/n) * h_left + (n_right/n) * h_right
    
    return h_parent - weighted_child

# Example: Splitting race results by track condition
# Parent: mixed results (4 no-podiums, 6 podiums)
race_results = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # 0=no podium, 1=podium

# Good split: "Is it a dry race?" separates podiums from non-podiums
dry_races = np.array([0, 0, 0, 0])        # Dry: all no-podiums (clear pattern)
wet_races = np.array([1, 1, 1, 1, 1, 1])  # Wet: all podiums (rain specialist!)

# Bad split: Random grouping that doesn't separate outcomes
group_a = np.array([0, 0, 1, 1, 1])  # Mixed outcomes
group_b = np.array([0, 0, 1, 1, 1])  # Also mixed

print(f"Parent entropy (all races): {entropy([0.4, 0.6]):.4f} bits")
print(f"\nGood split (dry vs wet): IG = {information_gain(race_results, dry_races, wet_races):.4f} bits")
print(f"  -> Track condition perfectly separates outcomes!")
print(f"\nBad split (random groups): IG = {information_gain(race_results, group_a, group_b):.4f} bits")
print(f"  -> No useful separation of outcomes")
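
The two information-gain values can also be worked out by hand. This is a sketch that assumes the `entropy` helper defined earlier uses base-2 logarithms, as the "bits" labels suggest, with values rounded to three decimals.

Parent (4 no-podiums, 6 podiums):

$$H(\text{parent}) = -0.4\log_2 0.4 - 0.6\log_2 0.6 \approx 0.529 + 0.442 = 0.971 \text{ bits}$$

Good split (both children are pure, so each has zero entropy):

$$IG = 0.971 - \left(\tfrac{4}{10}\cdot 0 + \tfrac{6}{10}\cdot 0\right) = 0.971 \text{ bits}$$

Bad split (each group has 2 no-podiums and 3 podiums, the same 0.4/0.6 mix as the parent):

$$IG = 0.971 - \left(\tfrac{5}{10}\cdot 0.971 + \tfrac{5}{10}\cdot 0.971\right) = 0 \text{ bits}$$

A split that leaves every child with the parent's class proportions removes no uncertainty, which is why its information gain is exactly zero.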

---

## Summary

### Key Concepts and Their F1 Parallels

| Concept | What It Does | F1 Parallel |
|---------|-------------|-------------|
| **Probability Distributions** | Describe likelihood of all outcomes | Lap time distributions, race outcome probabilities |
| **Bernoulli** | Models binary yes/no outcomes | Will the car finish this race? |
| **Gaussian** | Models continuous symmetric uncertainty | Lap time variation around mean pace |
| **Bayes' Theorem** | Updates beliefs given evidence (posterior ∝ prior × likelihood) | Updating rain probability as clouds form during a race |
| **Maximum Likelihood** | Finds parameters that maximize P(data\|params) | Estimating tire degradation rate from observed lap times |
| **Entropy** | Measures uncertainty in a distribution | Season competitiveness — high entropy = anyone can win |
| **Cross-Entropy** | The loss function for classification | How wrong is your race prediction model? |
| **KL Divergence** | Measures how one distribution differs from another (asymmetric) | Gap between qualifying pace and race pace |

### Connection to Deep Learning

- **Classification**: Softmax outputs a categorical distribution, trained with cross-entropy
- **Regression**: Often assumes Gaussian noise, in which case minimizing MSE is equivalent to MLE (for a fixed noise variance)
- **VAEs**: Use KL divergence to regularize latent distributions
- **Dropout**: Samples from Bernoulli to create masks
- **Bayesian NN**: Treat weights as distributions, use Bayes' theorem
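
The regression bullet can be made concrete with a small numerical sketch. Under the assumption of Gaussian noise with a known, fixed standard deviation, the negative log-likelihood is an affine function of the MSE, so both are minimized by the same parameters. Everything below (the synthetic stint data, the linear degradation model, `sigma`, and the function names) is a hypothetical illustration rather than data or code from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stint: lap time grows linearly with tire age, plus Gaussian noise
tire_age = np.arange(20, dtype=float)
lap_times = 90.0 + 0.08 * tire_age + rng.normal(0.0, 0.2, size=tire_age.size)

sigma = 0.2        # assumed known, fixed noise standard deviation
intercept = 90.0   # held fixed so only the degradation slope is scanned

def mse(slope):
    """Mean squared error of the linear lap-time model."""
    residuals = lap_times - (intercept + slope * tire_age)
    return np.mean(residuals**2)

def gaussian_nll(slope):
    """Negative log-likelihood under N(prediction, sigma^2) noise."""
    residuals = lap_times - (intercept + slope * tire_age)
    n = residuals.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(residuals**2) / (2 * sigma**2)

# Scan candidate slopes and compare the two minimizers
slopes = np.linspace(0.0, 0.2, 201)
mse_values = np.array([mse(s) for s in slopes])
nll_values = np.array([gaussian_nll(s) for s in slopes])

best_by_mse = slopes[np.argmin(mse_values)]
best_by_nll = slopes[np.argmin(nll_values)]
print(f"Slope minimizing MSE:          {best_by_mse:.3f} s/lap")
print(f"Slope minimizing Gaussian NLL: {best_by_nll:.3f} s/lap")
```

Because the NLL equals a constant plus `n / (2 * sigma**2)` times the MSE, the choice of `sigma` rescales the loss but does not move the minimizer.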

### Checklist
- [ ] I understand common probability distributions (and can map them to race scenarios)
- [ ] I can apply Bayes' theorem (like updating rain predictions mid-race)
- [ ] I understand MLE and its connection to loss functions (like fitting tire degradation curves)
- [ ] I can compute entropy and KL divergence (like measuring season unpredictability)

---

## Next Steps

Continue to **Part 1.4: Classical Machine Learning** (Notebook 04), where we'll put probability and statistics to work with practical ML algorithms:
- Decision trees, random forests, and gradient boosting
- SVMs and kernel methods
- Clustering (k-means, DBSCAN)
- Model evaluation: cross-validation, ROC curves, confusion matrices
- When to use classical ML vs deep learning

**Looking ahead with F1**: The probability foundations from this notebook power everything ahead. When we build neural networks, softmax + cross-entropy (from this notebook) becomes our classification loss. When we explore VAEs, KL divergence becomes the regularizer. When we study reinforcement learning, Bayes' theorem helps the agent update its beliefs about the world.