# Part 1.3: Probability & Statistics for Deep Learning

Probability and statistics are essential for understanding:
- How models make predictions (probabilistic outputs)
- How we train models (maximum likelihood)
- How we measure uncertainty and information

## Learning Objectives
- [ ] Work with common probability distributions
- [ ] Apply Bayes' theorem to update beliefs
- [ ] Derive MLE estimators for simple distributions
- [ ] Calculate entropy and KL divergence

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import comb

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Probability Basics

### Random Variables

A **random variable** is a variable whose value is determined by a random process.

- **Discrete**: Takes on countable values (e.g., coin flips, dice rolls)
- **Continuous**: Takes on any value in a range (e.g., height, temperature)

### Probability Distributions

A **probability distribution** describes the likelihood of each possible outcome.

- **PMF** (Probability Mass Function): For discrete variables, $P(X = x)$
- **PDF** (Probability Density Function): For continuous variables, $f(x)$

### Deep Dive: What is a Probability Distribution?

A probability distribution answers a fundamental question: **"What outcomes are possible, and how likely is each one?"**

Think of it as a complete recipe for uncertainty:
- It lists every possible outcome
- It assigns a probability (or density) to each outcome
- All probabilities sum to 1 (for a continuous variable, the density integrates to 1): something must happen!

**The Key Insight**: A distribution captures *everything* we know about a random process. Once you have the distribution, you can compute any probability, expectation, or uncertainty measure.

#### Discrete vs Continuous Distributions

| Aspect | Discrete | Continuous |
|--------|----------|------------|
| **Possible values** | Countable (finite or infinite) | Uncountable (any value in a range) |
| **Probability function** | PMF: P(X = x) gives exact probability | PDF: f(x) gives density, not probability |
| **Finding probabilities** | Sum: P(a ≤ X ≤ b) = Σ P(X = x) over a ≤ x ≤ b | Integrate: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx |
| **Examples** | Coin flips, dice, word counts | Height, temperature, neural network weights |
| **ML applications** | Classification labels, token IDs | Regression targets, latent variables |

**Important**: For continuous distributions, P(X = x) = 0 for any specific value! We can only ask about ranges.
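
To make the density-versus-probability distinction concrete, here is a small sketch using a narrow Gaussian. The choice of σ = 0.1 is an assumption made purely for illustration. The density at a single point can exceed 1, while the probability of a range, obtained as a difference of CDF values, always stays between 0 and 1.

```python
from scipy import stats

# Illustrative assumption: a narrow Gaussian with mean 0 and std 0.1
narrow = stats.norm(loc=0.0, scale=0.1)

# Density at a single point: a height, not a probability, so it can exceed 1
density_at_zero = narrow.pdf(0.0)

# Probability of a range: area under the PDF, computed as a CDF difference
prob_within_one_std = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(f"f(0) = {density_at_zero:.3f}")
print(f"P(-0.1 <= X <= 0.1) = {prob_within_one_std:.4f}")
```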

---

## 2. Common Distributions

### 2.1 Bernoulli Distribution

Models a single binary outcome (success/failure, yes/no, 1/0).

$$P(X = 1) = p, \quad P(X = 0) = 1 - p$$

**In ML**: Binary classification outputs, dropout masks

In [None]:
# Bernoulli distribution
p = 0.7  # Probability of success

# Generate samples
samples = np.random.binomial(1, p, size=1000)

print(f"Bernoulli(p={p})")
print(f"Mean (theoretical): {p}")
print(f"Mean (empirical): {samples.mean():.3f}")
print(f"Variance (theoretical): {p * (1-p):.3f}")
print(f"Variance (empirical): {samples.var():.3f}")

# Visualize
plt.figure(figsize=(8, 4))
plt.bar([0, 1], [1-p, p], width=0.4, alpha=0.7)
plt.xticks([0, 1], ['Failure (0)', 'Success (1)'])
plt.ylabel('Probability')
plt.title(f'Bernoulli Distribution (p={p})')
plt.show()

### 2.2 Binomial Distribution

Number of successes in $n$ independent Bernoulli trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

**In ML**: Counting successes across repeated trials, e.g. the number of correct predictions in a batch
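
As a sanity check on the formula, the sketch below evaluates the PMF term by term with `scipy.special.comb` and compares it against `stats.binom.pmf`. The helper name `binomial_pmf` and the values n = 20, p = 0.3 are illustrative choices, not part of any library.

```python
import numpy as np
from scipy import stats
from scipy.special import comb

def binomial_pmf(k, n, p):
    """Binomial PMF written out directly: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_trials, p_success = 20, 0.3          # illustrative values
k_values = np.arange(0, n_trials + 1)  # every possible number of successes

manual_pmf = binomial_pmf(k_values, n_trials, p_success)
scipy_pmf = stats.binom.pmf(k_values, n_trials, p_success)

print("Matches scipy:", np.allclose(manual_pmf, scipy_pmf))
print(f"Sum over all k: {manual_pmf.sum():.6f}")
```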

In [None]:
# Binomial distribution
n = 20  # Number of trials
p = 0.3  # Probability of success

# PMF
k = np.arange(0, n+1)
pmf = stats.binom.pmf(k, n, p)

plt.figure(figsize=(10, 4))
plt.bar(k, pmf, alpha=0.7)
plt.xlabel('Number of Successes (k)')
plt.ylabel('P(X = k)')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.axvline(x=n*p, color='red', linestyle='--', label=f'Mean = np = {n*p}')
plt.legend()
plt.show()

print(f"Mean: E[X] = np = {n*p}")
print(f"Variance: Var[X] = np(1-p) = {n*p*(1-p):.2f}")

### 2.3 Categorical Distribution

Generalization of Bernoulli to $K$ categories.

$$P(X = k) = p_k, \quad \sum_{k=1}^K p_k = 1$$

**In ML**: Multi-class classification (softmax output)
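
The link to softmax can be sketched as follows: a softmax turns arbitrary real-valued scores into the parameters $p_1, \dots, p_K$ of a categorical distribution. The `softmax` helper and the logits below are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def softmax(logits):
    """Map real-valued scores to a categorical distribution."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical raw scores for K = 4 classes
class_probs = softmax(logits)

print("Probabilities:", np.round(class_probs, 3))
print("Sum:", class_probs.sum())
```

Every entry is positive and the entries sum to 1, so the output is a valid categorical distribution whatever the input scores are.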

In [None]:
# Categorical distribution (e.g., softmax output)
categories = ['Cat', 'Dog', 'Bird', 'Fish']
probabilities = [0.4, 0.35, 0.15, 0.1]

# Generate samples
samples = np.random.choice(len(categories), size=1000, p=probabilities)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.bar(categories, probabilities, alpha=0.7, color='steelblue')
plt.ylabel('Probability')
plt.title('Categorical Distribution (True)')

plt.subplot(1, 2, 2)
empirical = [np.mean(samples == i) for i in range(len(categories))]
plt.bar(categories, empirical, alpha=0.7, color='coral')
plt.ylabel('Frequency')
plt.title('Empirical Distribution (1000 samples)')

plt.tight_layout()
plt.show()

### 2.4 Gaussian (Normal) Distribution

The most widely used continuous distribution in machine learning, defined by its mean $\mu$ and standard deviation $\sigma$:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**In ML**: 
- Weight initialization
- Noise in VAEs
- Regression targets
- Batch normalization

In [None]:
# Gaussian distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Different means
x = np.linspace(-6, 10, 200)
for mu in [-2, 0, 2, 4]:
    axes[0].plot(x, stats.norm.pdf(x, mu, 1), label=f'μ={mu}, σ=1')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Density')
axes[0].set_title('Effect of Mean (μ)')
axes[0].legend()

# Different standard deviations
x = np.linspace(-8, 8, 200)
for sigma in [0.5, 1, 2, 3]:
    axes[1].plot(x, stats.norm.pdf(x, 0, sigma), label=f'μ=0, σ={sigma}')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Density')
axes[1].set_title('Effect of Standard Deviation (σ)')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# The 68-95-99.7 rule
mu, sigma = 0, 1
x = np.linspace(-4, 4, 200)
y = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2)

# Fill regions
plt.fill_between(x, y, where=(x >= -3) & (x <= 3), alpha=0.2, color='blue', label='99.7% (±3σ)')
plt.fill_between(x, y, where=(x >= -2) & (x <= 2), alpha=0.3, color='blue', label='95% (±2σ)')
plt.fill_between(x, y, where=(x >= -1) & (x <= 1), alpha=0.4, color='blue', label='68% (±1σ)')

plt.xlabel('x (in standard deviations)')
plt.ylabel('Density')
plt.title('Standard Normal Distribution - The 68-95-99.7 Rule')
plt.legend()
plt.show()

# Verify with scipy
print("Probability within:")
print(f"  ±1σ: {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f} (68.27%)")
print(f"  ±2σ: {stats.norm.cdf(2) - stats.norm.cdf(-2):.4f} (95.45%)")
print(f"  ±3σ: {stats.norm.cdf(3) - stats.norm.cdf(-3):.4f} (99.73%)")

### 2.5 Multivariate Gaussian

Extension to multiple dimensions:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Where:
- $\boldsymbol{\mu}$: Mean vector
- $\Sigma$: Covariance matrix
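
To connect the density formula to code, the sketch below evaluates it directly with NumPy and compares the result to `stats.multivariate_normal`. The helper `mvn_pdf`, the covariance matrix, and the test point are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density evaluated term by term from the formula."""
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse
    quad_form = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

mu_vec = np.array([0.0, 0.0])
Sigma_mat = np.array([[1.0, 0.8],
                      [0.8, 1.0]])   # illustrative correlated covariance
x_point = np.array([0.5, -0.3])      # illustrative evaluation point

manual_density = mvn_pdf(x_point, mu_vec, Sigma_mat)
scipy_density = stats.multivariate_normal(mu_vec, Sigma_mat).pdf(x_point)

print(f"Manual: {manual_density:.6f}, scipy: {scipy_density:.6f}")
```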

### Choosing the Right Distribution: A Decision Guide

| Distribution | Use When | Parameters | Example in ML |
|--------------|----------|------------|---------------|
| **Bernoulli** | Single yes/no outcome | p (success probability) | Binary classification output, dropout mask |
| **Binomial** | Count of successes in n trials | n (trials), p (success prob) | Number of correct predictions in batch |
| **Categorical** | Single choice from K options | p₁, p₂, ..., pₖ (probabilities) | Softmax output, token prediction |
| **Multinomial** | Counts across K categories | n (trials), p₁...pₖ | Word counts in document (bag of words) |
| **Gaussian** | Continuous value, symmetric uncertainty | μ (mean), σ (std dev) | Regression targets, weight initialization |
| **Multivariate Gaussian** | Multiple correlated continuous values | μ (mean vector), Σ (covariance) | VAE latent space, GP predictions |

**The Pattern**: 
- Bernoulli/Binomial are for binary outcomes (yes/no)
- Categorical/Multinomial are for multi-class outcomes  
- Gaussian is for continuous outcomes with symmetric uncertainty

**Key ML Connection**: The distribution you choose for your model's output determines your loss function:
- Categorical output → Cross-entropy loss
- Gaussian output → MSE loss (equivalent to assuming Gaussian noise with a fixed variance)
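
The Gaussian-to-MSE connection can be checked numerically. The sketch below assumes a fixed noise level σ = 1 and uses made-up targets and predictions; under that assumption the negative log-likelihood equals half the sum of squared errors plus a constant that does not depend on the predictions.

```python
import numpy as np
from scipy import stats

y_true = np.array([1.0, 2.0, 3.5, 0.5])   # hypothetical regression targets
y_pred = np.array([0.8, 2.4, 3.0, 0.9])   # hypothetical model predictions
sigma = 1.0                               # assumed fixed noise standard deviation

# Negative log-likelihood of the targets under N(y_pred, sigma^2)
nll = -np.sum(stats.norm.logpdf(y_true, loc=y_pred, scale=sigma))

# Sum of squared errors, and the term that is constant in y_pred
sse = np.sum((y_true - y_pred) ** 2)
constant = 0.5 * len(y_true) * np.log(2 * np.pi * sigma**2)

print(f"NLL              = {nll:.4f}")
print(f"SSE/(2σ²) + const = {sse / (2 * sigma**2) + constant:.4f}")
```

Because the constant does not change with the predictions, minimizing the NLL and minimizing the squared error select the same predictions.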

In [None]:
# 2D Gaussian with different covariance structures
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Generate grid for contour plots
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Different covariance matrices
covariances = [
    (np.array([[1, 0], [0, 1]]), 'Spherical\n(Independent)'),
    (np.array([[2, 0], [0, 0.5]]), 'Diagonal\n(Different variances)'),
    (np.array([[1, 0.8], [0.8, 1]]), 'Full\n(Correlated)')
]

mean = np.array([0, 0])

for ax, (cov, title) in zip(axes, covariances):
    rv = stats.multivariate_normal(mean, cov)
    Z = rv.pdf(pos)
    
    ax.contour(X, Y, Z, levels=10, cmap='viridis')
    
    # Draw samples
    samples = rv.rvs(size=200)
    ax.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=10, color='red')
    
    ax.set_xlabel('x₁')
    ax.set_ylabel('x₂')
    ax.set_title(f'{title}\nΣ = {cov.tolist()}')
    ax.set_aspect('equal')
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)

plt.tight_layout()
plt.show()

---

## 3. Expected Value and Variance

### Expected Value (Mean)

The "average" outcome weighted by probability:

- Discrete: $E[X] = \sum_x x \cdot P(X = x)$
- Continuous: $E[X] = \int x \cdot f(x) dx$

### Variance

Measures spread around the mean:

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
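
The two forms of the variance can be compared on a small example. The biased die below is an illustrative assumption; both expressions should give the same number.

```python
import numpy as np

die_faces = np.array([1, 2, 3, 4, 5, 6])
die_probs = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3])   # illustrative biased die

mean = np.sum(die_faces * die_probs)               # E[X]
second_moment = np.sum(die_faces**2 * die_probs)   # E[X^2]

var_definition = np.sum((die_faces - mean) ** 2 * die_probs)   # E[(X - E[X])^2]
var_shortcut = second_moment - mean**2                         # E[X^2] - (E[X])^2

print(f"E[X] = {mean:.2f}")
print(f"Var (definition) = {var_definition:.4f}")
print(f"Var (shortcut)   = {var_shortcut:.4f}")
```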

### Deep Dive: Understanding Each Term in Bayes' Theorem

$$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{data})}$$

Let's break down what each term really means:

| Term | Name | Meaning | Example (Disease Testing) |
|------|------|---------|---------------------------|
| **P(H)** | Prior | Your belief *before* seeing any evidence | 1% of population has disease |
| **P(D\|H)** | Likelihood | How probable is this evidence *if* hypothesis is true? | 95% chance of positive test *if* you have disease |
| **P(D)** | Evidence (Marginal) | Total probability of seeing this evidence | Overall rate of positive tests |
| **P(H\|D)** | Posterior | Updated belief *after* seeing evidence | Probability you have disease *given* positive test |

**The Core Insight**: Bayes' theorem is a *belief update* mechanism:
```
New Belief = (How well evidence supports hypothesis) × (Old Belief) / (How common is this evidence)
```

**Why the denominator matters**: P(D) normalizes everything. If positive tests are common (many false positives), a positive test is less informative.

In [None]:
# Visual: How Bayes' Theorem Updates Beliefs
# Let's visualize the belief update process step by step

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Setup
P_disease = 0.01
P_positive_given_disease = 0.95      # True positive rate
P_positive_given_no_disease = 0.05   # False positive rate

# Imagine 10,000 people
n_people = 10000
# round() rather than int(): int() truncates, so a float product such as 94.999... would become 94
n_sick = round(n_people * P_disease)
n_healthy = n_people - n_sick

# Among sick people
sick_test_positive = round(n_sick * P_positive_given_disease)
sick_test_negative = n_sick - sick_test_positive

# Among healthy people
healthy_test_positive = round(n_healthy * P_positive_given_no_disease)
healthy_test_negative = n_healthy - healthy_test_positive

# Plot 1: Prior - Population breakdown
ax = axes[0, 0]
ax.bar(['Sick', 'Healthy'], [n_sick, n_healthy], color=['red', 'green'], alpha=0.7)
ax.set_ylabel('Number of People')
ax.set_title(f'Step 1: PRIOR\n{n_people:,} people: {n_sick} sick (1%), {n_healthy} healthy (99%)')
ax.set_ylim(0, n_people * 1.1)
for i, v in enumerate([n_sick, n_healthy]):
    ax.text(i, v + 200, str(v), ha='center', fontweight='bold')

# Plot 2: Likelihood - Test results by group
ax = axes[0, 1]
x = np.arange(2)
width = 0.35
bars1 = ax.bar(x - width/2, [sick_test_positive, healthy_test_positive], width, 
               label='Test Positive', color='orange', alpha=0.7)
bars2 = ax.bar(x + width/2, [sick_test_negative, healthy_test_negative], width,
               label='Test Negative', color='blue', alpha=0.7)
ax.set_xticks(x)
ax.set_xticklabels([f'Sick ({n_sick})', f'Healthy ({n_healthy})'])
ax.set_ylabel('Number of People')
ax.set_title('Step 2: LIKELIHOOD\nHow the test performs on each group')
ax.legend()

# Plot 3: Evidence - All positive tests
ax = axes[1, 0]
ax.bar(['True Positives\n(Sick + Positive)', 'False Positives\n(Healthy + Positive)'], 
       [sick_test_positive, healthy_test_positive], 
       color=['red', 'green'], alpha=0.7)
total_positive = sick_test_positive + healthy_test_positive
ax.set_ylabel('Number of People')
ax.set_title(f'Step 3: EVIDENCE\nAll positive tests: {total_positive} total\n'
             f'P(positive) = {total_positive/n_people:.2%}')
for i, v in enumerate([sick_test_positive, healthy_test_positive]):
    ax.text(i, v + 10, str(v), ha='center', fontweight='bold')

# Plot 4: Posterior - Among positive tests, who is actually sick?
ax = axes[1, 1]
posterior = sick_test_positive / total_positive
ax.bar(['Actually Sick', 'Actually Healthy'], 
       [sick_test_positive, healthy_test_positive],
       color=['red', 'green'], alpha=0.7)
ax.set_ylabel('Number of People (with positive test)')
ax.set_title(f'Step 4: POSTERIOR\nAmong {total_positive} positive tests:\n'
             f'P(sick|positive) = {sick_test_positive}/{total_positive} = {posterior:.1%}')
for i, v in enumerate([sick_test_positive, healthy_test_positive]):
    pct = v / total_positive * 100
    ax.text(i, v + 10, f'{v} ({pct:.1f}%)', ha='center', fontweight='bold')

plt.tight_layout()
plt.suptitle("Bayes' Theorem: Why a 95% Accurate Test Gives Only 16% Confidence", 
             fontsize=14, fontweight='bold', y=1.02)
plt.show()

print("\nThe Counterintuitive Result Explained:")
print("=" * 50)
print(f"Even though the test is 95% accurate:")
print(f"  - Out of {n_sick} sick people: {sick_test_positive} test positive")
print(f"  - Out of {n_healthy} healthy people: {healthy_test_positive} ALSO test positive (false positives)")
print(f"\nTotal positive tests: {total_positive}")
print(f"True positives: {sick_test_positive} ({sick_test_positive/total_positive:.1%})")
print(f"False positives: {healthy_test_positive} ({healthy_test_positive/total_positive:.1%})")
print(f"\nThe false positives OVERWHELM the true positives because")
print(f"healthy people vastly outnumber sick people!")

In [None]:
# Computing expected value for a discrete distribution
# Example: Unfair die
outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3])  # Biased toward higher numbers

# Expected value
expected_value = np.sum(outcomes * probabilities)
print(f"E[X] = Σ x·P(x) = {expected_value}")

# Variance
variance = np.sum((outcomes - expected_value)**2 * probabilities)
print(f"Var(X) = E[(X - E[X])²] = {variance:.4f}")
print(f"Std(X) = √Var(X) = {np.sqrt(variance):.4f}")

# Verify with sampling
samples = np.random.choice(outcomes, size=10000, p=probabilities)
print(f"\nEmpirical mean: {samples.mean():.4f}")
print(f"Empirical variance: {samples.var():.4f}")

---

## 4. Bayes' Theorem

Bayes' theorem tells us how to update beliefs given new evidence:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

In ML terms:

$$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{data})}$$

Or:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$
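
One way to see the "update" in action is to apply the formula twice, feeding the first posterior back in as the new prior. The sketch below uses a disease-testing setup (1% prevalence, 95% sensitivity, 5% false positive rate) and assumes the two test results are conditionally independent given the true disease status. The helper `bayes_update` is a hypothetical name.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior P(H | D) from the prior P(H) and the two likelihoods of D."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

prior = 0.01                 # P(disease)
sensitivity = 0.95           # P(positive | disease)
false_positive_rate = 0.05   # P(positive | no disease)

# First positive test: prior -> posterior
after_first = bayes_update(prior, sensitivity, false_positive_rate)

# Second positive test: yesterday's posterior is today's prior
after_second = bayes_update(after_first, sensitivity, false_positive_rate)

print(f"After one positive test:  {after_first:.1%}")
print(f"After two positive tests: {after_second:.1%}")
```

A single positive result leaves the probability low because the prior is small; a second positive result raises it substantially because the updated prior is no longer small.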

### Bayes' Theorem in Machine Learning

Bayesian thinking is fundamental to many ML techniques:

| Application | Prior P(H) | Likelihood P(D\|H) | Posterior P(H\|D) |
|-------------|------------|-------------------|-------------------|
| **Naive Bayes Classifier** | Class frequencies in training data | P(features\|class) assumed independent | P(class\|features) for prediction |
| **Bayesian Neural Networks** | Prior on weights (e.g., Gaussian) | P(data\|weights) from network output | Distribution over weights given data |
| **Bayesian Optimization** | GP prior over objective function | Observations so far | Updated belief about function |
| **Spam Filtering** | Base rate of spam emails | P(words\|spam) and P(words\|ham) | P(spam\|email content) |
| **A/B Testing** | Prior belief about conversion rates | Observed clicks/conversions | Updated belief about which variant wins |

**The Bayesian vs Frequentist Perspective**:
- **Frequentist**: Parameters are fixed, unknown constants. We estimate them.
- **Bayesian**: Parameters have probability distributions. We update our beliefs.

In deep learning, we're usually frequentist (point estimates via SGD), but Bayesian methods give us uncertainty quantification.

### Deep Dive: The Intuition Behind Maximum Likelihood

**The Core Question**: Given observed data, what parameters would have made this data *most probable*?

Imagine you flip a coin 10 times and get 7 heads. What's the "most likely" value of p (probability of heads)?

**MLE answers**: Find the p that maximizes P(7 heads in 10 flips | p)

The answer is p = 0.7, because:
- If p = 0.5, getting 7 heads is somewhat unlikely
- If p = 0.9, getting only 7 heads (not 9) is unlikely
- p = 0.7 makes our observed data most probable
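
The coin example can be checked with a brute-force search: evaluate the log-likelihood of "7 heads in 10 flips" on a grid of candidate values of p and pick the largest. The grid resolution is an arbitrary choice for this sketch.

```python
import numpy as np
from scipy import stats

n_flips, n_heads = 10, 7

# Candidate values of p on a grid from 0.01 to 0.99 in steps of 0.01
candidate_ps = np.linspace(0.01, 0.99, 99)

# Log-likelihood of the observed data under each candidate
log_likelihoods = stats.binom.logpmf(n_heads, n_flips, candidate_ps)
p_mle = candidate_ps[np.argmax(log_likelihoods)]

# Likelihood of the data at the three values discussed in the text
lik_05, lik_07, lik_09 = stats.binom.pmf(n_heads, n_flips, [0.5, 0.7, 0.9])

print(f"Grid-search MLE: p = {p_mle:.2f}")
print(f"P(data | p=0.5) = {lik_05:.4f}")
print(f"P(data | p=0.7) = {lik_07:.4f}")
print(f"P(data | p=0.9) = {lik_09:.4f}")
```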

**Why Log-Likelihood?**
1. Products become sums: log(a × b × c) = log(a) + log(b) + log(c)
2. Numerical stability: Avoids underflow when multiplying many small probabilities
3. Same maximum: log is monotonic, so argmax is preserved
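
The numerical stability point is easy to demonstrate. In the sketch below, 2000 observations each have probability 0.5 (an artificial setup chosen for illustration): the raw product underflows to exactly zero in double precision, while the sum of logs stays finite.

```python
import numpy as np

probs = np.full(2000, 0.5)   # artificial example: 2000 observations, each with probability 0.5

# Multiplying directly: 0.5**2000 is far below the smallest positive double, so it underflows
naive_likelihood = np.prod(probs)

# Summing logs instead: perfectly representable
log_likelihood = np.sum(np.log(probs))

print(f"Product of probabilities: {naive_likelihood}")
print(f"Sum of log-probabilities: {log_likelihood:.2f}")
```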

**The Profound Connection to Loss Functions**:

For classification with softmax outputs:
$$\text{Minimize Cross-Entropy} = \text{Maximize Log-Likelihood}$$

They're the same optimization! When you train with cross-entropy loss, you're doing MLE.

In [None]:
# Classic example: Medical testing
# Disease affects 1% of population
# Test is 95% accurate (both sensitivity and specificity)

P_disease = 0.01  # Prior: probability of having disease
P_positive_given_disease = 0.95  # Sensitivity (true positive rate)
P_positive_given_no_disease = 0.05  # False positive rate (1 - specificity)

# P(positive) = P(positive|disease)P(disease) + P(positive|no disease)P(no disease)
P_positive = P_positive_given_disease * P_disease + P_positive_given_no_disease * (1 - P_disease)

# Bayes' theorem: P(disease|positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print("Medical Testing Example")
print("=" * 40)
print(f"Prior P(disease) = {P_disease:.2%}")
print(f"Test sensitivity = {P_positive_given_disease:.2%}")
print(f"Test specificity = {1 - P_positive_given_no_disease:.2%}")
print()
print(f"P(positive test) = {P_positive:.4f}")
print(f"P(disease | positive test) = {P_disease_given_positive:.2%}")
print()
print("Surprising! Even with a positive test from a 95% accurate test,")
print(f"there's only a {P_disease_given_positive:.1%} chance of actually having the disease!")
print("This is because the disease is rare (low prior).")

In [None]:
# Demonstrating: Cross-Entropy Loss = Negative Log-Likelihood
# This shows they're mathematically equivalent!

print("Cross-Entropy Loss vs Negative Log-Likelihood")
print("=" * 50)

# Imagine a 3-class classification problem
# True label is class 0, model outputs these probabilities:
true_class = 0
model_probs = np.array([0.7, 0.2, 0.1])  # Model is fairly confident

# Method 1: Cross-Entropy Loss (what we use in practice)
# CE = -sum(y_true * log(y_pred)) where y_true is one-hot
one_hot = np.array([1, 0, 0])  # One-hot encoding of true class
cross_entropy = -np.sum(one_hot * np.log(model_probs))
print(f"\nCross-Entropy Loss: -sum(y_true * log(y_pred))")
print(f"  = -({one_hot[0]} * log({model_probs[0]:.2f}) + {one_hot[1]} * log({model_probs[1]:.2f}) + {one_hot[2]} * log({model_probs[2]:.2f}))")
print(f"  = -{np.log(model_probs[0]):.4f}")
print(f"  = {cross_entropy:.4f}")

# Method 2: Negative Log-Likelihood (MLE perspective)
# NLL = -log(P(true_class))
neg_log_likelihood = -np.log(model_probs[true_class])
print(f"\nNegative Log-Likelihood: -log(P(true_class))")
print(f"  = -log({model_probs[true_class]:.2f})")
print(f"  = {neg_log_likelihood:.4f}")

print(f"\nThey're identical! CE = NLL = {cross_entropy:.4f}")
print("\nThis means: Training with cross-entropy loss is doing MLE!")
print("We're finding network weights that maximize P(correct labels | inputs)")

# Show how loss changes with confidence
print("\n" + "=" * 50)
print("How loss varies with model confidence:")
probs_for_true_class = [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
print(f"{'P(true class)':<15} {'Cross-Entropy Loss':<20}")
print("-" * 35)
for p in probs_for_true_class:
    loss = -np.log(p)
    print(f"{p:<15.2f} {loss:<20.4f}")

In [None]:
# Visualize how posterior changes with prior
priors = np.linspace(0.001, 0.5, 100)
sensitivity = 0.95
specificity = 0.95

posteriors = []
for prior in priors:
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    posterior = (sensitivity * prior) / p_positive
    posteriors.append(posterior)

plt.figure(figsize=(10, 6))
plt.plot(priors * 100, np.array(posteriors) * 100, 'b-', linewidth=2)
plt.xlabel('Prior P(disease) [%]')
plt.ylabel('Posterior P(disease|positive) [%]')
plt.title('Posterior vs Prior (95% accurate test)')
plt.grid(True, alpha=0.3)

# Mark some key points
for prior in [0.01, 0.1, 0.5]:
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    posterior = (sensitivity * prior) / p_positive
    plt.scatter([prior * 100], [posterior * 100], s=100, zorder=5)
    plt.annotate(f'({prior*100:.0f}%, {posterior*100:.1f}%)', 
                 (prior * 100 + 1, posterior * 100 - 3))

plt.show()

In [None]:
# Visualization: The Likelihood Surface
# Shows how likelihood changes as we vary parameters

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Generate data from known distribution
np.random.seed(42)
true_mu, true_sigma = 3.0, 1.5
data = np.random.normal(true_mu, true_sigma, size=30)

# Plot 1: 1D likelihood for mu (sigma fixed at true value)
mus = np.linspace(0, 6, 100)
log_likelihoods_mu = [np.sum(stats.norm.logpdf(data, mu, true_sigma)) for mu in mus]

axes[0].plot(mus, log_likelihoods_mu, 'b-', linewidth=2)
axes[0].axvline(x=data.mean(), color='red', linestyle='--', label=f'MLE: {data.mean():.2f}')
axes[0].axvline(x=true_mu, color='green', linestyle=':', label=f'True: {true_mu}')
axes[0].set_xlabel('mu')
axes[0].set_ylabel('Log-Likelihood')
axes[0].set_title('Likelihood Slice (sigma fixed)')
axes[0].legend()

# Plot 2: 1D likelihood for sigma (mu fixed at true value)
sigmas = np.linspace(0.5, 4, 100)
log_likelihoods_sigma = [np.sum(stats.norm.logpdf(data, true_mu, sigma)) for sigma in sigmas]

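# With mu fixed at true_mu, this slice is maximized by the RMS deviation from true_mu
# (data.std() is the MLE only when mu is also set to the sample mean)
sigma_mle_fixed_mu = np.sqrt(np.mean((data - true_mu) ** 2))

axes[1].plot(sigmas, log_likelihoods_sigma, 'b-', linewidth=2)
axes[1].axvline(x=sigma_mle_fixed_mu, color='red', linestyle='--', label=f'MLE: {sigma_mle_fixed_mu:.2f}')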
axes[1].axvline(x=true_sigma, color='green', linestyle=':', label=f'True: {true_sigma}')
axes[1].set_xlabel('sigma')
axes[1].set_ylabel('Log-Likelihood')
axes[1].set_title('Likelihood Slice (mu fixed)')
axes[1].legend()

# Plot 3: 2D likelihood surface
mus_2d = np.linspace(1, 5, 50)
sigmas_2d = np.linspace(0.5, 3, 50)
MU, SIGMA = np.meshgrid(mus_2d, sigmas_2d)

LL = np.zeros_like(MU)
for i in range(len(sigmas_2d)):
    for j in range(len(mus_2d)):
        LL[i, j] = np.sum(stats.norm.logpdf(data, MU[i, j], SIGMA[i, j]))

contour = axes[2].contourf(MU, SIGMA, LL, levels=30, cmap='viridis')
axes[2].scatter([data.mean()], [data.std()], color='red', s=150, marker='*', 
                label=f'MLE', zorder=5, edgecolors='white')
axes[2].scatter([true_mu], [true_sigma], color='white', s=100, marker='o',
                label=f'True', zorder=5, edgecolors='black')
axes[2].set_xlabel('mu')
axes[2].set_ylabel('sigma')
axes[2].set_title('2D Log-Likelihood Surface')
axes[2].legend()
plt.colorbar(contour, ax=axes[2], label='Log-Likelihood')

plt.tight_layout()
plt.show()

print("Key Observations:")
print("1. The likelihood surface has a clear peak (the MLE)")
print("2. As we move away from the MLE, likelihood decreases")
print("3. Gradient ascent on this surface finds the MLE")
print("4. Neural network training follows the same idea: gradient descent on the negative log-likelihood")

### Deep Dive: Understanding Entropy

Entropy has several intuitive interpretations that all lead to the same formula:

**Interpretation 1: Average Surprise**
- "Surprise" of an event = -log P(event)
- Rare events (low probability) are more surprising
- Entropy = average surprise across all possible outcomes
- H(X) = E[-log P(X)] = "How surprised will I be on average?"

**Interpretation 2: Uncertainty**  
- How uncertain are we about the outcome?
- Maximum entropy = maximum uncertainty (uniform distribution)
- Zero entropy = complete certainty (deterministic)

**Interpretation 3: Information Content (Bits)**
- "How many yes/no questions do I need to identify the outcome?"
- Fair coin: 1 bit (one yes/no question: "Was it heads?")
- Fair 4-sided die: 2 bits ("Is it 1 or 2?" then "Is it the first of those two?")
- Biased distributions need fewer questions on average (can ask about likely outcomes first)

**Why log base 2?** 
- Gives entropy in "bits" - the number of binary questions
- log base e gives "nats" (natural units)
- They're proportional: 1 nat = 1/ln 2 ≈ 1.44 bits
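
The bits-versus-nats relationship can be verified on a small example. The distribution below is an arbitrary illustration; the only difference between the two computations is the base of the logarithm.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # illustrative distribution

entropy_bits = -np.sum(p * np.log2(p))   # log base 2 -> bits
entropy_nats = -np.sum(p * np.log(p))    # natural log -> nats

bits_per_nat = 1 / np.log(2)             # conversion factor, about 1.4427

print(f"Entropy: {entropy_bits:.4f} bits = {entropy_nats:.4f} nats")
print(f"nats * {bits_per_nat:.4f} = {entropy_nats * bits_per_nat:.4f} bits")
```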

In [None]:
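# Visualizing Entropy as "Average Surprise"
# Surprise of an event = -log2(P(event))

def entropy(p):
    """Shannon entropy in bits; outcomes with zero probability contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.sum(p * np.log2(1.0 / p))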

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Surprise function
probs = np.linspace(0.01, 1, 100)
surprise = -np.log2(probs)

axes[0].plot(probs, surprise, 'b-', linewidth=2)
axes[0].set_xlabel('Probability P(x)')
axes[0].set_ylabel('Surprise = -log2(P(x))')
axes[0].set_title('Surprise Function')
axes[0].grid(True, alpha=0.3)
axes[0].annotate('Rare events\n(high surprise)', xy=(0.1, 3.3), fontsize=10)
axes[0].annotate('Common events\n(low surprise)', xy=(0.7, 0.8), fontsize=10)

# Plot 2: Entropy for different distributions
distributions = {
    'Certain\n[1,0,0,0]': [1, 0, 0, 0],
    'Skewed\n[0.7,0.2,0.1,0]': [0.7, 0.2, 0.1, 0],
    'Moderate\n[0.4,0.3,0.2,0.1]': [0.4, 0.3, 0.2, 0.1],
    'Uniform\n[0.25,0.25,0.25,0.25]': [0.25, 0.25, 0.25, 0.25],
}

names = list(distributions.keys())
entropies = [entropy(p) for p in distributions.values()]

bars = axes[1].bar(range(len(names)), entropies, color=['darkblue', 'blue', 'steelblue', 'lightblue'])
axes[1].set_xticks(range(len(names)))
axes[1].set_xticklabels(names, fontsize=9)
axes[1].set_ylabel('Entropy (bits)')
axes[1].set_title('Entropy of Different Distributions')
axes[1].set_ylim(0, 2.5)
for i, v in enumerate(entropies):
    axes[1].text(i, v + 0.1, f'{v:.2f}', ha='center', fontweight='bold')

# Plot 3: Why uniform has maximum entropy
# Show entropy vs "peakedness" of distribution
# Using parameterized softmax: p_i = exp(alpha * x_i) / sum(exp(alpha * x_j))
alphas = np.linspace(0, 5, 50)
scores = np.array([1, 2, 3, 4])  # Base preferences

entropy_values = []
for alpha in alphas:
    if alpha == 0:
        probs = np.ones(4) / 4  # Uniform
    else:
        logits = alpha * scores
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()
    entropy_values.append(entropy(probs))

axes[2].plot(alphas, entropy_values, 'b-', linewidth=2)
axes[2].set_xlabel('Inverse temperature α (higher = peakier)')
axes[2].set_ylabel('Entropy (bits)')
axes[2].set_title('Entropy vs Distribution Sharpness\n(α acts like 1 / softmax temperature)')
axes[2].axhline(y=2, color='r', linestyle='--', alpha=0.5, label='Max entropy (uniform)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Entropy Reference Table

| Distribution | Probabilities | Entropy (bits) | Interpretation |
|--------------|---------|----------------|----------------|
| Fair coin | [0.5, 0.5] | 1.00 | 1 yes/no question needed |
| Biased coin (90/10) | [0.9, 0.1] | 0.47 | Less than 1 question on average |
| Certain outcome | [1, 0] | 0.00 | No uncertainty, no questions needed |
| Fair 4-sided die | [0.25, 0.25, 0.25, 0.25] | 2.00 | 2 yes/no questions needed |
| Fair 8-sided die | [1/8] * 8 | 3.00 | 3 yes/no questions needed |
| Fair N-sided die | [1/N] * N | log2(N) | log2(N) questions needed |

**Pattern**: For a uniform distribution over N outcomes, entropy = log2(N) bits.

**Why Maximum Entropy = Uniform?**
- Mathematically: Proven via Lagrange multipliers (maximizing H subject to sum = 1)
- Intuitively: Any preference toward one outcome reduces average surprise
- Philosophically: Maximum entropy = maximum ignorance = all outcomes equally plausible
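
The maximum-entropy claim can also be probed empirically. The sketch below draws random distributions over 4 outcomes (sampling from a flat Dirichlet is an assumption made for convenience) and checks that none exceeds the uniform value of log2(4) = 2 bits. This is a numerical spot check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n_outcomes = 4

# 1000 random probability vectors; each row sums to 1
random_dists = rng.dirichlet(np.ones(n_outcomes), size=1000)

# Entropy of each row in bits, treating 0 * log(0) as 0
safe = np.where(random_dists > 0, random_dists, 1.0)
entropies_bits = -np.sum(random_dists * np.log2(safe), axis=1)

max_possible = np.log2(n_outcomes)   # entropy of the uniform distribution

print(f"Largest sampled entropy: {entropies_bits.max():.4f} bits")
print(f"Uniform entropy:         {max_possible:.4f} bits")
```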

### Naive Bayes Classifier

A simple but effective classifier using Bayes' theorem:

$$P(y|x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i|y)$$

The "naive" assumption is that features are conditionally independent given the class.

In [None]:
# Simple Naive Bayes from scratch
class NaiveBayesClassifier:
    def __init__(self):
        self.class_priors = {}
        self.feature_params = {}  # (class, feature) -> (mean, std)
        
    def fit(self, X, y):
        """Fit Gaussian Naive Bayes."""
        classes = np.unique(y)
        n_samples = len(y)
        
        for c in classes:
            # Class prior
            self.class_priors[c] = np.sum(y == c) / n_samples
            
            # Feature parameters (Gaussian)
            X_c = X[y == c]
            for j in range(X.shape[1]):
                self.feature_params[(c, j)] = (X_c[:, j].mean(), X_c[:, j].std() + 1e-6)
                
    def predict_proba(self, X):
        """Compute class probabilities."""
        classes = list(self.class_priors.keys())
        n_samples = X.shape[0]
        probs = np.zeros((n_samples, len(classes)))
        
        for i, c in enumerate(classes):
            # Start with log prior
            log_prob = np.log(self.class_priors[c])
            
            # Add log likelihood for each feature
            for j in range(X.shape[1]):
                mean, std = self.feature_params[(c, j)]
                log_prob += stats.norm.logpdf(X[:, j], mean, std)
            
            probs[:, i] = log_prob
        
        # Convert to probabilities (softmax of log probs)
        probs = np.exp(probs - probs.max(axis=1, keepdims=True))
        probs = probs / probs.sum(axis=1, keepdims=True)
        
        return probs
    
    def predict(self, X):
        """Predict class labels."""
        probs = self.predict_proba(X)
        classes = list(self.class_priors.keys())
        return np.array([classes[i] for i in probs.argmax(axis=1)])


# Generate synthetic data
np.random.seed(42)
n_samples = 200

# Class 0: centered at (0, 0)
X0 = np.random.randn(n_samples // 2, 2) + np.array([0, 0])
# Class 1: centered at (3, 3)
X1 = np.random.randn(n_samples // 2, 2) + np.array([3, 3])

X = np.vstack([X0, X1])
y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Train
clf = NaiveBayesClassifier()
clf.fit(X, y)

# Predict
y_pred = clf.predict(X)
accuracy = np.mean(y_pred == y)
print(f"Training accuracy: {accuracy:.2%}")

# Visualize decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', edgecolors='k')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Naive Bayes Decision Boundary')
plt.legend()
plt.show()

### Deep Dive: Understanding KL Divergence

KL divergence measures how "different" one distribution is from another. Here are multiple ways to understand it:

**Interpretation 1: Extra Bits for Wrong Encoding**
- Suppose you design a code optimized for distribution Q
- But the true data comes from distribution P
- KL(P || Q) = expected extra bits per symbol because you used the wrong distribution
- If P = Q, you need H(P) bits per symbol on average (optimal)
- If P != Q, you need H(P) + KL(P||Q) bits per symbol on average (suboptimal)

**Interpretation 2: Information Lost**
- KL(P || Q) measures information lost when Q is used to approximate P
- It acts like a "distance" from Q to P, but it is not a true metric (not symmetric, no triangle inequality)

**Interpretation 3: The Fundamental Relationship**
$$D_{KL}(P || Q) = H(P, Q) - H(P) = \text{Cross-Entropy} - \text{Entropy}$$

This tells us:
- Cross-entropy = cost of using Q to encode P
- Entropy = minimum possible cost (using P itself)
- KL divergence = the "wasted" bits from using Q instead of P

**Why Not Symmetric?**
- KL(P || Q): Cost of using Q when truth is P
- KL(Q || P): Cost of using P when truth is Q
- These are different questions!
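
To make the asymmetry concrete, here is a minimal self-contained sketch that evaluates both directions for a peaked and a uniform two-outcome distribution. The helper name `kl_bits` and the two example distributions are illustrative choices, not part of the notebook's own helpers.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits for discrete distributions given as sequences.

    Terms with p_i == 0 contribute 0 by convention.
    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_peaked = [0.9, 0.1]   # illustrative peaked distribution
q_uniform = [0.5, 0.5]  # illustrative uniform distribution

kl_pq = kl_bits(p_peaked, q_uniform)  # expectation taken under the peaked P
kl_qp = kl_bits(q_uniform, p_peaked)  # expectation taken under the uniform Q

print(f"KL(P || Q) = {kl_pq:.3f} bits")  # about 0.531
print(f"KL(Q || P) = {kl_qp:.3f} bits")  # about 0.737
```

The two numbers differ because each direction weights the log-ratio by a different distribution: KL(Q || P) puts half its weight on the outcome that P considers unlikely, so it is penalized more heavily.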

In [None]:
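# Visualizing the KL = CrossEntropy - Entropy relationship
# And demonstrating asymmetry
# Note: entropy, cross_entropy, and kl_divergence are defined in Section 6;
# run those cells first if you execute the notebook out of order.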

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Example distributions
P = np.array([0.6, 0.3, 0.1])  # True distribution
Q = np.array([0.33, 0.33, 0.34])  # Approximation (roughly uniform)

# Calculate all quantities
H_P = entropy(P)
H_P_Q = cross_entropy(P, Q)  # Cross-entropy
KL_P_Q = kl_divergence(P, Q)

# Plot 1: Bar chart showing the relationship
quantities = ['H(P)\nEntropy', 'KL(P||Q)\nDivergence', 'H(P,Q)\nCross-Entropy']
values = [H_P, KL_P_Q, H_P_Q]
colors = ['green', 'red', 'blue']

bars = axes[0].bar(quantities, values, color=colors, alpha=0.7)
axes[0].set_ylabel('Bits')
axes[0].set_title('H(P,Q) = H(P) + KL(P||Q)')

# Add value labels
for bar, val in zip(bars, values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05, 
                 f'{val:.3f}', ha='center', fontweight='bold')

# Verify the relationship
axes[0].text(0.5, 0.85, f'{H_P:.3f} + {KL_P_Q:.3f} = {H_P + KL_P_Q:.3f}', 
             transform=axes[0].transAxes, ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

# Plot 2: Asymmetry visualization
P = np.array([0.9, 0.1])  # Peaked distribution
Q = np.array([0.5, 0.5])  # Uniform distribution

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

x = np.arange(2)
width = 0.35

axes[1].bar(x - width/2, P, width, label='P (peaked)', color='blue', alpha=0.7)
axes[1].bar(x + width/2, Q, width, label='Q (uniform)', color='orange', alpha=0.7)
axes[1].set_xticks(x)
axes[1].set_xticklabels(['Outcome 1', 'Outcome 2'])
axes[1].set_ylabel('Probability')
axes[1].set_title(f'KL Asymmetry\nKL(P||Q)={kl_pq:.3f}, KL(Q||P)={kl_qp:.3f}')
axes[1].legend()

# Plot 3: Why asymmetry matters in practice
# KL(P||Q) penalizes Q=0 where P>0 (infinity!)
# KL(Q||P) penalizes P=0 where Q>0 (infinity!)

axes[2].text(0.5, 0.85, 'KL(P || Q) vs KL(Q || P)', fontsize=14, fontweight='bold',
             ha='center', transform=axes[2].transAxes)

explanation = """
KL(P || Q): "How bad is Q as a model of P?"
- Averages over P (true distribution)
- Catastrophic if Q gives 0 probability 
  where P has probability (log(0) = -inf!)
- Used in: max likelihood, cross-entropy loss

KL(Q || P): "How bad is P as a model of Q?"
- Averages over Q (approximate distribution)
- Catastrophic if P gives 0 probability
  where Q has probability
- Used in: variational inference, VAEs (mode-seeking)

In classification:
- P = true labels (one-hot), Q = model predictions
- Cross-entropy = H(P) + KL(P||Q) = KL(P||Q)
  (since H(P) = 0 for one-hot)
"""

axes[2].text(0.05, 0.75, explanation, fontsize=9, transform=axes[2].transAxes,
             verticalalignment='top', fontfamily='monospace')
axes[2].axis('off')

plt.tight_layout()
plt.show()

print("Key Insight for Classification:")
print("When true labels are one-hot, H(P) = 0")
print("So: Cross-Entropy Loss = KL(true || predicted)")
print("Minimizing cross-entropy = minimizing KL divergence!")

In [None]:
# Demonstrating KL divergence in Knowledge Distillation
# Soft labels from teacher contain more information than hard labels

# Imagine a 5-class image classification problem
classes = ['cat', 'dog', 'bird', 'car', 'plane']

# Hard label (ground truth)
hard_label = np.array([1, 0, 0, 0, 0])  # True class is 'cat'

# Teacher model's soft prediction (trained, accurate)
# Notice: teacher thinks it could be dog (similar to cat)
teacher_soft = np.array([0.7, 0.2, 0.05, 0.03, 0.02])

# Student predictions at different training stages
student_untrained = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # Uniform (no knowledge)
student_partial = np.array([0.5, 0.15, 0.15, 0.1, 0.1])  # Learning
student_trained = np.array([0.68, 0.18, 0.07, 0.04, 0.03])  # Well-trained

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Compare distributions
x = np.arange(len(classes))
width = 0.2

axes[0].bar(x - 1.5*width, hard_label, width, label='Hard Label', color='red', alpha=0.7)
axes[0].bar(x - 0.5*width, teacher_soft, width, label='Teacher (soft)', color='blue', alpha=0.7)
axes[0].bar(x + 0.5*width, student_partial, width, label='Student (learning)', color='green', alpha=0.7)
axes[0].bar(x + 1.5*width, student_trained, width, label='Student (trained)', color='purple', alpha=0.7)

axes[0].set_xticks(x)
axes[0].set_xticklabels(classes)
axes[0].set_ylabel('Probability')
axes[0].set_title('Knowledge Distillation: Soft vs Hard Labels')
axes[0].legend()

# Plot 2: KL divergences
students = {
    'Untrained': student_untrained,
    'Partial': student_partial, 
    'Trained': student_trained
}

# KL from hard labels (what standard cross-entropy uses)
kl_hard = [kl_divergence(hard_label, s) for s in students.values()]

# KL from teacher soft labels (knowledge distillation)
kl_soft = [kl_divergence(teacher_soft, s) for s in students.values()]

x = np.arange(len(students))
width = 0.35

axes[1].bar(x - width/2, kl_hard, width, label='KL(Hard || Student)', color='red', alpha=0.7)
axes[1].bar(x + width/2, kl_soft, width, label='KL(Teacher || Student)', color='blue', alpha=0.7)
axes[1].set_xticks(x)
axes[1].set_xticklabels(list(students.keys()))
axes[1].set_ylabel('KL Divergence (bits)')
axes[1].set_title('Loss: Hard Labels vs Knowledge Distillation')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Why Soft Labels Help:")
print("=" * 50)
print("Hard label only says: 'This is a cat'")
print("Soft label says: 'This is a cat, but somewhat similar to dog,")
print("                  not similar to car or plane'")
print("\nThe relationships between classes ('dark knowledge') help")
print("the student generalize better!")

### KL Divergence in Machine Learning Applications

| Application | P (first argument) | Q (second argument) | What KL Measures |
|-------------|--------------------|---------------------|------------------|
| **Classification Loss** | One-hot labels | Softmax predictions | How wrong the predictions are |
| **VAE Loss** | Approximate posterior q(z\|x) | Prior p(z), usually N(0, I) | How far the latent code is from the prior |
| **Knowledge Distillation** | Teacher softmax | Student softmax | How well the student mimics the teacher |
| **Policy Gradient (PPO)** | Old policy | New policy | Size of the policy update (kept small) |
| **Variational Inference** | Variational approximation q(z) | True posterior p(z\|x) | Quality of the approximation |

**The VAE Loss Decomposition**:
$$\mathcal{L}_{VAE} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q(z|x) || p(z))}_{\text{Regularization}}$$

The KL term pulls the encoder's latent distribution toward the prior, enabling generation.
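
When the encoder outputs a diagonal Gaussian and the prior is a standard normal, this KL term has a closed form. The sketch below assumes that setup and a log-variance parameterization; the function name `kl_to_standard_normal` is hypothetical, and the result is in nats.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log(sigma^2)).
    """
    return 0.5 * sum(
        m ** 2 + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# The posterior equals the prior, so there is no penalty
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Shifting the mean by 1 in one dimension costs 0.5 nats
print(kl_to_standard_normal([1.0], [0.0]))
```

The penalty grows both when the mean drifts away from 0 and when the variance moves away from 1 in either direction.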

**Knowledge Distillation**:
- Teacher: Large, accurate model with "soft" predictions
- Student: Small model learning to match teacher
- Loss = KL(Teacher || Student) on softmax outputs
- Student learns teacher's "dark knowledge" (relationships between classes)
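
In practice the teacher and student logits are often divided by a temperature before the softmax so that the small probabilities carry more signal. The following is a minimal sketch under that assumption; the function names, the example logits, and the temperature values are illustrative, and the result is in nats.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) in nats on temperature-softened outputs."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 2.0, 0.5]   # hypothetical teacher outputs
student_logits = [2.0, 1.5, 1.0]   # hypothetical student outputs

for T in (1.0, 2.0, 4.0):
    soft = [round(v, 3) for v in softmax_with_temperature(teacher_logits, T)]
    loss = distillation_kl(teacher_logits, student_logits, T)
    print(f"T={T}: teacher probs={soft}, KL={loss:.4f} nats")
```

Higher temperatures flatten the teacher distribution, which exposes more of the relative ordering among the non-target classes. Implementations commonly rescale this term by the squared temperature so gradient magnitudes stay comparable across temperatures; that scaling is omitted here.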

---

## 5. Maximum Likelihood Estimation (MLE)

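MLE finds the parameters that maximize the probability of the observed data. Assuming the samples $x_i$ are independent and identically distributed (i.i.d.), the likelihood factorizes into a product: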

$$\hat{\theta}_{MLE} = \arg\max_\theta P(\text{data}|\theta) = \arg\max_\theta \prod_i P(x_i|\theta)$$

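In practice, we maximize the **log-likelihood**. The log turns the product into a sum, avoids numerical underflow, and has the same maximizer because $\log$ is monotonic: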

$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_i \log P(x_i|\theta)$$

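**Key insight**: Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood. With one-hot labels, the cross-entropy of a sample is exactly $-\log Q(\text{true class})$.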
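
A small numerical check of this equivalence, using hypothetical predicted probabilities for three samples. The names `probs`, `labels`, and `one_hot` are illustrative only.

```python
import math

# Hypothetical model probabilities for 3 samples over 3 classes
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # observed class index for each sample

# Average negative log-likelihood of the observed labels
nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(num_classes)]

# Average cross-entropy between one-hot targets and predictions
ce = sum(
    -sum(t * math.log(q) for t, q in zip(one_hot(y, len(p)), p))
    for p, y in zip(probs, labels)
) / len(labels)

print(f"Mean NLL:           {nll:.6f} nats")
print(f"Mean cross-entropy: {ce:.6f} nats")
```

Both quantities agree because the one-hot target zeroes out every term except the log-probability of the true class.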

In [None]:
# MLE for Gaussian parameters
# True parameters
mu_true = 5.0
sigma_true = 2.0

# Generate data
n_samples = 100
data = np.random.normal(mu_true, sigma_true, n_samples)

# MLE estimates (can be derived analytically)
mu_mle = data.mean()  # Sample mean
sigma_mle = data.std(ddof=0)  # MLE divides by n (biased); ddof=1 would divide by n - 1

print(f"True parameters: μ = {mu_true}, σ = {sigma_true}")
print(f"MLE estimates:   μ̂ = {mu_mle:.3f}, σ̂ = {sigma_mle:.3f}")

# Visualize
x = np.linspace(mu_true - 4*sigma_true, mu_true + 4*sigma_true, 100)

plt.figure(figsize=(10, 6))
plt.hist(data, bins=20, density=True, alpha=0.6, label='Data histogram')
plt.plot(x, stats.norm.pdf(x, mu_true, sigma_true), 'g-', linewidth=2, 
         label=f'True: N({mu_true}, {sigma_true}²)')
plt.plot(x, stats.norm.pdf(x, mu_mle, sigma_mle), 'r--', linewidth=2,
         label=f'MLE: N({mu_mle:.2f}, {sigma_mle:.2f}²)')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('MLE for Gaussian Distribution')
plt.legend()
plt.show()

In [None]:
# Visualize the likelihood function
def log_likelihood(mu, sigma, data):
    """Compute log-likelihood of data under N(mu, sigma^2)."""
    return np.sum(stats.norm.logpdf(data, mu, sigma))

# Create grid of parameters
mus = np.linspace(3, 7, 50)
sigmas = np.linspace(1, 4, 50)
MU, SIGMA = np.meshgrid(mus, sigmas)

# Compute log-likelihood at each point
LL = np.zeros_like(MU)
for i in range(len(sigmas)):
    for j in range(len(mus)):
        LL[i, j] = log_likelihood(MU[i, j], SIGMA[i, j], data)

plt.figure(figsize=(10, 8))
plt.contourf(MU, SIGMA, LL, levels=30, cmap='viridis')
plt.colorbar(label='Log-Likelihood')
plt.scatter([mu_mle], [sigma_mle], color='red', s=200, marker='*', 
            label=f'MLE: ({mu_mle:.2f}, {sigma_mle:.2f})', zorder=5)
plt.scatter([mu_true], [sigma_true], color='white', s=100, marker='o',
            label=f'True: ({mu_true}, {sigma_true})', zorder=5)
plt.xlabel('μ')
plt.ylabel('σ')
plt.title('Log-Likelihood Surface')
plt.legend()
plt.show()
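
The closed-form estimates used in the Gaussian example above follow from setting the gradient of the log-likelihood to zero. A short derivation, assuming i.i.d. samples $x_1, \dots, x_n$ from $\mathcal{N}(\mu, \sigma^2)$:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \hat{\mu}_{MLE})^2 = 0 \;\Rightarrow\; \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu}_{MLE})^2$$

The variance estimate divides by $n$ rather than $n - 1$, which matches NumPy's default `ddof=0` in `data.std()`.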

### MLE for Bernoulli (Coin Flip)

If we observe $k$ heads in $n$ flips, the MLE estimate is simply:

$$\hat{p}_{MLE} = \frac{k}{n}$$
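
This result can be derived in a few lines. Assuming i.i.d. flips with $k$ heads out of $n$, the log-likelihood is

$$\ell(p) = k \log p + (n - k) \log(1 - p)$$

Setting the derivative to zero gives

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \;\Rightarrow\; k(1 - p) = (n - k)\,p \;\Rightarrow\; \hat{p}_{MLE} = \frac{k}{n}$$

For example, 35 heads in 50 flips gives $\hat{p}_{MLE} = 35/50 = 0.7$.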

In [None]:
# MLE for coin flip
p_true = 0.7
n_flips = 50
flips = np.random.binomial(1, p_true, n_flips)
k = flips.sum()  # Number of heads

p_mle = k / n_flips

print(f"True p: {p_true}")
print(f"Observed: {k} heads in {n_flips} flips")
print(f"MLE estimate: p̂ = {p_mle:.3f}")

# Visualize likelihood function
p_values = np.linspace(0.01, 0.99, 100)
likelihoods = [stats.binom.pmf(k, n_flips, p) for p in p_values]

plt.figure(figsize=(10, 5))
plt.plot(p_values, likelihoods, 'b-', linewidth=2)
plt.axvline(x=p_mle, color='red', linestyle='--', label=f'MLE: p̂ = {p_mle:.3f}')
plt.axvline(x=p_true, color='green', linestyle=':', label=f'True: p = {p_true}')
plt.xlabel('p')
plt.ylabel('Likelihood P(data|p)')
plt.title(f'Likelihood Function for {k} heads in {n_flips} flips')
plt.legend()
plt.show()

---

## 6. Information Theory

Information theory quantifies information and uncertainty.

### Entropy

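Entropy measures the "uncertainty" or average "information content" of a distribution:

$$H(X) = -\sum_x P(x) \log P(x) = -E[\log P(X)]$$

With $\log_2$ the unit is bits (as in the code in this section); with the natural log it is nats.

**Properties**:
- Higher entropy = more uncertainty
- Among distributions over $n$ outcomes, the uniform distribution has maximum entropy ($\log n$)
- Deterministic variable has entropy 0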

In [None]:
def entropy(p):
    """Compute entropy of a discrete distribution."""
    p = np.array(p)
    p = p[p > 0]  # Avoid log(0)
    return -np.sum(p * np.log2(p))

# Examples
print("Entropy examples (in bits):")
print(f"Fair coin [0.5, 0.5]: H = {entropy([0.5, 0.5]):.4f} bits")
print(f"Biased coin [0.9, 0.1]: H = {entropy([0.9, 0.1]):.4f} bits")
print(f"Certain [1.0, 0.0]: H = {entropy([1.0, 0.0]):.4f} bits")
print(f"Fair die [1/6]*6: H = {entropy([1/6]*6):.4f} bits")
print(f"Uniform 8-sided: H = {entropy([1/8]*8):.4f} bits")

In [None]:
# Entropy of binary distribution as function of p
p_values = np.linspace(0.001, 0.999, 100)
entropies = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_values]

plt.figure(figsize=(10, 6))
plt.plot(p_values, entropies, 'b-', linewidth=2)
plt.xlabel('p (probability of heads)')
plt.ylabel('Entropy (bits)')
plt.title('Binary Entropy Function H(p)')
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5, label='Maximum = 1 bit')
plt.axvline(x=0.5, color='g', linestyle='--', alpha=0.5, label='p = 0.5')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Maximum entropy occurs at p = 0.5 (maximum uncertainty)")
print("Entropy = 0 when p = 0 or p = 1 (no uncertainty)")

### Cross-Entropy

Cross-entropy measures the "cost" of using distribution $Q$ to encode samples from distribution $P$:

$$H(P, Q) = -\sum_x P(x) \log Q(x) = -E_P[\log Q(X)]$$

**In ML**: Cross-entropy loss measures how well predicted probabilities $Q$ match true labels $P$.

In [None]:
def cross_entropy(p, q):
    """Compute cross-entropy H(P, Q)."""
    p = np.array(p)
    q = np.array(q)
    # Avoid log(0) by clipping
    q = np.clip(q, 1e-10, 1.0)
    return -np.sum(p * np.log2(q))

# Example: True distribution vs predictions
p_true = np.array([1, 0, 0])  # True class is 0

predictions = [
    ([0.9, 0.05, 0.05], "Confident correct"),
    ([0.6, 0.2, 0.2], "Less confident"),
    ([0.33, 0.33, 0.34], "Uniform (uncertain)"),
    ([0.1, 0.45, 0.45], "Confident wrong"),
]

print("Cross-entropy loss for different predictions:")
print(f"True label: class 0 (one-hot: {p_true})\n")

for q, desc in predictions:
    ce = cross_entropy(p_true, q)
    print(f"{desc}")
    print(f"  Prediction: {q}")
    print(f"  Cross-entropy: {ce:.4f} bits\n")

### KL Divergence

KL divergence measures how different distribution $Q$ is from $P$:

$$D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$$

**Properties**:
- $D_{KL}(P || Q) \geq 0$ (always non-negative)
- $D_{KL}(P || Q) = 0$ if and only if $P = Q$
- Not symmetric: in general, $D_{KL}(P || Q) \neq D_{KL}(Q || P)$

**In ML**: Used in VAEs, knowledge distillation, regularization

In [None]:
def kl_divergence(p, q):
    """Compute KL divergence D_KL(P || Q)."""
    p = np.array(p)
    q = np.array(q)
    # Only sum where p > 0
    mask = p > 0
    q = np.clip(q, 1e-10, 1.0)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Compare two distributions
p = np.array([0.4, 0.3, 0.2, 0.1])
q1 = np.array([0.35, 0.35, 0.2, 0.1])  # Similar to P
q2 = np.array([0.1, 0.2, 0.3, 0.4])    # Reversed
q3 = np.array([0.25, 0.25, 0.25, 0.25])  # Uniform

print(f"P = {p}")
print(f"\nKL divergences:")
print(f"D_KL(P || Q1) where Q1 = {q1}: {kl_divergence(p, q1):.4f} bits")
print(f"D_KL(P || Q2) where Q2 = {q2}: {kl_divergence(p, q2):.4f} bits")
print(f"D_KL(P || Q3) where Q3 = {q3}: {kl_divergence(p, q3):.4f} bits")

print(f"\nNote asymmetry:")
print(f"D_KL(P || Q2) = {kl_divergence(p, q2):.4f}")
print(f"D_KL(Q2 || P) = {kl_divergence(q2, p):.4f}")

In [None]:
# Visualize KL divergence between two Gaussians
def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL divergence D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats.

    Uses the natural log, unlike the base-2 (bits) helpers above.
    """
    return (np.log(sigma2/sigma1) + 
            (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

# P is N(0, 1), vary Q
mu_p, sigma_p = 0, 1

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Vary mean
mus = np.linspace(-3, 3, 100)
kls = [kl_gaussian(mu_p, sigma_p, mu, sigma_p) for mu in mus]

axes[0].plot(mus, kls, 'b-', linewidth=2)
axes[0].set_xlabel('μ_Q')
axes[0].set_ylabel('D_KL(P || Q)')
axes[0].set_title('KL Divergence: P = N(0,1), Q = N(μ,1)')
axes[0].grid(True, alpha=0.3)

# Vary std
sigmas = np.linspace(0.1, 4, 100)
kls = [kl_gaussian(mu_p, sigma_p, mu_p, sigma) for sigma in sigmas]

axes[1].plot(sigmas, kls, 'r-', linewidth=2)
axes[1].axvline(x=1, color='g', linestyle='--', alpha=0.5, label='σ_Q = σ_P')
axes[1].set_xlabel('σ_Q')
axes[1].set_ylabel('D_KL(P || Q)')
axes[1].set_title('KL Divergence: P = N(0,1), Q = N(0,σ²)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Exercises

### Exercise 1: Bayesian Coin Inference

Use Bayes' theorem to update beliefs about a coin's bias after observing flips.
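
The update rule used in this exercise comes from the fact that the Beta prior is conjugate to the Bernoulli likelihood. With a $\text{Beta}(a, b)$ prior and $k$ heads in $n$ flips:

$$P(p \mid \text{data}) \propto \underbrace{p^{k}(1 - p)^{n - k}}_{\text{likelihood}} \cdot \underbrace{p^{a - 1}(1 - p)^{b - 1}}_{\text{prior}} = p^{a + k - 1}(1 - p)^{b + n - k - 1}$$

This is the kernel of a $\text{Beta}(a + k,\; b + n - k)$ distribution, whose mean is

$$E[p \mid \text{data}] = \frac{a + k}{a + b + n}$$

For a uniform prior $\text{Beta}(1, 1)$ and 7 heads, 3 tails, the posterior is $\text{Beta}(8, 4)$ with mean $8/12 \approx 0.667$, slightly pulled toward 0.5 compared with the MLE of $0.7$.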

In [None]:
# Bayesian inference for coin bias
# Prior: Beta(a, b) distribution over p
# Posterior after k heads, n-k tails: Beta(a + k, b + n - k)

def plot_beta_posterior(a_prior, b_prior, n_heads, n_tails):
    """Plot prior and posterior distributions."""
    p = np.linspace(0, 1, 100)
    
    # Prior
    prior = stats.beta.pdf(p, a_prior, b_prior)
    
    # Posterior
    a_post = a_prior + n_heads
    b_post = b_prior + n_tails
    posterior = stats.beta.pdf(p, a_post, b_post)
    
    plt.figure(figsize=(10, 6))
    plt.plot(p, prior, 'b--', linewidth=2, label=f'Prior: Beta({a_prior}, {b_prior})')
    plt.plot(p, posterior, 'r-', linewidth=2, 
             label=f'Posterior: Beta({a_post}, {b_post})')
    # MLE is undefined with no data; fall back to 0.5 for plotting only
    n_total = n_heads + n_tails
    mle = n_heads / n_total if n_total > 0 else 0.5
    plt.axvline(x=mle, color='g', linestyle=':', label=f'MLE: {mle:.3f}')
    plt.xlabel('p (probability of heads)')
    plt.ylabel('Density')
    plt.title(f'Bayesian Inference: {n_heads} heads, {n_tails} tails')
    plt.legend()
    plt.show()
    
    # Posterior statistics
    post_mean = a_post / (a_post + b_post)
    print(f"Posterior mean: {post_mean:.4f}")
    print(f"95% credible interval: [{stats.beta.ppf(0.025, a_post, b_post):.4f}, "
          f"{stats.beta.ppf(0.975, a_post, b_post):.4f}]")

# Start with uniform prior (no prior knowledge)
# Update after observing data
plot_beta_posterior(a_prior=1, b_prior=1, n_heads=7, n_tails=3)

In [None]:
# TODO: Experiment with different priors and data
# What happens with:
# 1. A strong prior belief that coin is fair: Beta(10, 10)
# 2. More data: 70 heads, 30 tails
# 3. Conflicting prior and data

# Your experiments here:
plot_beta_posterior(a_prior=10, b_prior=10, n_heads=7, n_tails=3)

### Exercise 2: Implement Softmax Cross-Entropy Loss

In [None]:
def softmax(x):
    """Compute softmax."""
    x_shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def cross_entropy_loss(logits, labels):
    """
    Compute cross-entropy loss.
    
    Args:
        logits: Raw model outputs (before softmax), shape (batch_size, num_classes)
        labels: True class indices, shape (batch_size,)
    
    Returns:
        Scalar loss value
    """
    # TODO: Implement
    # 1. Apply softmax to get probabilities
    # 2. Extract probability of true class
    # 3. Return negative log probability (averaged over batch)
    
    probs = softmax(logits)
    batch_size = len(labels)
    # Get probability assigned to correct class for each sample
    correct_probs = probs[np.arange(batch_size), labels]
    # Negative log likelihood
    loss = -np.mean(np.log(correct_probs + 1e-10))
    return loss

# Test
logits = np.array([[2.0, 1.0, 0.1],   # Should predict class 0
                   [0.1, 2.5, 0.3],   # Should predict class 1
                   [0.2, 0.3, 3.0]])  # Should predict class 2
labels = np.array([0, 1, 2])

loss = cross_entropy_loss(logits, labels)
print(f"Logits:\n{logits}")
print(f"Softmax probabilities:\n{softmax(logits).round(4)}")
print(f"True labels: {labels}")
print(f"Cross-entropy loss: {loss:.4f}")
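
One possible variant of this exercise computes the loss directly from logits with the log-sum-exp trick, which avoids taking the log of a probability that has underflowed to zero. This is a sketch on plain Python lists, not the notebook's reference solution; the names `log_softmax` and `cross_entropy_from_logits` are illustrative.

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed via a shifted log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy_from_logits(batch_logits, labels):
    """Mean negative log-probability of the true class, in nats."""
    losses = [-log_softmax(row)[y] for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

example_logits = [
    [2.0, 1.0, 0.1],
    [0.1, 2.5, 0.3],
    [0.2, 0.3, 3.0],
]
example_labels = [0, 1, 2]
print(f"Loss: {cross_entropy_from_logits(example_logits, example_labels):.4f}")

# Extreme logits: the true class has probability ~exp(-1000),
# yet the loss stays finite and equals the logit gap.
print(cross_entropy_from_logits([[1000.0, 0.0]], [1]))
```

Working in log space means no epsilon is needed inside the logarithm, so very confident wrong predictions are penalized by their true magnitude instead of being capped.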

### Exercise 3: Information Gain for Decision Trees

In [None]:
def information_gain(parent_labels, left_labels, right_labels):
    """
    Compute information gain from a split.
    
    IG = H(parent) - weighted_avg(H(left), H(right))
    """
    def label_entropy(labels):
        """Compute entropy of label distribution."""
        if len(labels) == 0:
            return 0
        _, counts = np.unique(labels, return_counts=True)
        probs = counts / len(labels)
        return entropy(probs)
    
    n = len(parent_labels)
    n_left = len(left_labels)
    n_right = len(right_labels)
    
    h_parent = label_entropy(parent_labels)
    h_left = label_entropy(left_labels)
    h_right = label_entropy(right_labels)
    
    weighted_child = (n_left/n) * h_left + (n_right/n) * h_right
    
    return h_parent - weighted_child

# Example: Splitting data based on a feature
# Parent has mixed classes
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Good split: separates classes well
left_good = np.array([0, 0, 0, 0])
right_good = np.array([1, 1, 1, 1, 1, 1])

# Bad split: doesn't separate well
left_bad = np.array([0, 0, 1, 1, 1])
right_bad = np.array([0, 0, 1, 1, 1])

print(f"Parent entropy: {entropy([0.4, 0.6]):.4f} bits")
print(f"\nGood split information gain: {information_gain(parent, left_good, right_good):.4f} bits")
print(f"Bad split information gain: {information_gain(parent, left_bad, right_bad):.4f} bits")

---

## Summary

### Key Concepts

1. **Probability Distributions**: Bernoulli, Categorical, Gaussian are fundamental to ML
2. **Bayes' Theorem**: Updates beliefs given evidence (posterior ∝ prior × likelihood)
3. **Maximum Likelihood**: Find parameters that maximize P(data|params)
4. **Entropy**: Measures uncertainty in a distribution
5. **Cross-Entropy**: The standard loss function for classification
6. **KL Divergence**: Measures difference between distributions

### Connection to Deep Learning

- **Classification**: Softmax outputs a categorical distribution, trained with cross-entropy
- **Regression**: Often assumes Gaussian noise with fixed variance, so minimizing MSE is equivalent to MLE
- **VAEs**: Use KL divergence to regularize latent distributions
- **Dropout**: Samples from Bernoulli to create masks
- **Bayesian NN**: Treat weights as distributions, use Bayes' theorem
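
The MSE and Gaussian MLE connection in the list above can be checked numerically. The sketch below assumes a fixed noise standard deviation `sigma`; the function names and the toy targets and predictions are illustrative.

```python
import math

def mse(y, y_hat):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def gaussian_nll(y, y_hat, sigma=1.0):
    """Negative log-likelihood of y under N(y_hat, sigma^2), in nats."""
    n = len(y)
    squared_error = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

y = [1.0, 2.0, 3.0]            # toy targets
good_fit = [1.1, 1.9, 3.2]     # toy predictions, close to the targets
bad_fit = [0.0, 3.0, 5.0]      # toy predictions, far from the targets

for name, pred in [("good", good_fit), ("bad", bad_fit)]:
    print(f"{name}: MSE = {mse(y, pred):.4f}, Gaussian NLL = {gaussian_nll(y, pred):.4f}")

# With sigma fixed, NLL = constant + n / (2 * sigma^2) * MSE
constant = 0.5 * len(y) * math.log(2 * math.pi)
print(constant + len(y) / 2 * mse(y, good_fit))
```

Because the NLL is an increasing affine function of the MSE when `sigma` is fixed, both objectives have the same minimizer.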

### Checklist
- [ ] I understand common probability distributions
- [ ] I can apply Bayes' theorem
- [ ] I understand MLE and its connection to loss functions
- [ ] I can compute entropy and KL divergence

---

## Next Steps

Continue to **Part 2: Python Foundations** where we'll cover:
- Python OOP for deep learning
- NumPy deep dive
- Building the tools we'll use throughout the course