# Module 01: Mathematical & Programming Foundations

## 01-08: Information Theory for ML

**Objective:** Build, from scratch, the information-theoretic foundations that underpin loss functions, model comparison, and representation learning: entropy, cross-entropy, KL divergence, and mutual information, with numerical experiments and visual intuition.

**Prerequisites:** 01-01 (Python, NumPy & Tensor Speed), 01-07 (Probability & Statistics for ML)

---

## Part 0: Setup & Prerequisites

Information theory, pioneered by Claude Shannon in 1948, provides the mathematical language for quantifying uncertainty, surprise, and the amount of information in data. In machine learning, information-theoretic concepts appear everywhere:

- **Cross-entropy** is the standard classification loss function
- **KL divergence** measures how one distribution differs from another (VAEs, knowledge distillation)
- **Mutual information** quantifies feature relevance and representation quality
- **Entropy** sets limits on compression and optimal coding

This notebook covers:

- **Shannon entropy**: measuring uncertainty in a random variable
- **Cross-entropy**: the loss function connection
- **KL divergence**: a directed (asymmetric) measure of how one distribution differs from another
- **Mutual information**: shared information between variables
- **Conditional entropy and chain rule**: decomposing information
- **Jensen–Shannon divergence**: a symmetric, bounded alternative to KL

In [None]:
# ── Imports ──────────────────────────────────────────────────────────────────
import sys
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from scipy import stats as sp_stats

from sklearn.datasets import load_digits, fetch_california_housing
from sklearn.model_selection import train_test_split

print(f'Python: {sys.version.split()[0]}')
print(f'PyTorch: {torch.__version__}')
print(f'NumPy: {np.__version__}')
if torch.cuda.is_available():
    print(f'CUDA: {torch.version.cuda}')
    print(f'GPU: {torch.cuda.get_device_name(0)}')

In [None]:
# ── Reproducibility ──────────────────────────────────────────────────────────
import random

SEED = 1103
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

In [None]:
# ── Configuration ────────────────────────────────────────────────────────────
FIGSIZE = (10, 6)
COLORS = {
    'blue': '#1E88E5',
    'red': '#E53935',
    'green': '#43A047',
    'orange': '#FF9800',
    'purple': '#9C27B0',
    'teal': '#00897B',
    'gray': '#757575',
}
COLOR_LIST = list(COLORS.values())

### Data Loading

We use the Digits dataset (10-class classification) and synthetic distributions to demonstrate information-theoretic concepts.

In [None]:
# Digits dataset for classification-related demonstrations
digits = load_digits()
X_digits = digits.data.astype(np.float64)
y_digits = digits.target
n_classes = len(np.unique(y_digits))
print(f'Digits dataset: {X_digits.shape}')
print(f'Classes: {n_classes}, samples per class:')
class_counts = np.bincount(y_digits)
for c in range(n_classes):
    print(f'  Digit {c}: {class_counts[c]} samples')

# Show sample digits
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(digits.images[i * 18], cmap='gray_r')
    ax.set_title(f'Label: {y_digits[i * 18]}')
    ax.axis('off')
plt.suptitle('Sample Digits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## Part 1: Information Theory from Scratch

We build up from the notion of surprise (information content) to entropy, cross-entropy, KL divergence, and mutual information.

### 1.1 Information Content and Shannon Entropy

The **information content** (surprise) of an event $x$ with probability $p(x)$ is:

$$I(x) = -\log_2 p(x)$$

Rare events carry more information (more surprise). The **Shannon entropy** is the expected information content:

$$H(X) = -\sum_{x} p(x) \log_2 p(x) = \mathbb{E}[-\log_2 p(X)]$$

Key properties:

- $H(X) \geq 0$: entropy is non-negative
- $H(X) = 0$ iff $X$ is deterministic
- $H(X) \leq \log_2 K$ for $K$ outcomes, with the maximum attained by the uniform distribution
- Uses $0 \cdot \log_2(0) \triangleq 0$ by continuity
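
As a hand-checkable case, the sketch below evaluates both definitions on a three-outcome distribution using only the standard library. The helper name `entropy_bits` and the example probabilities are illustrative assumptions, not part of the notebook's own functions.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative distribution (an assumption chosen so the arithmetic is exact)
p = [0.5, 0.25, 0.25]

# Surprise of each outcome: -log2(1/2) = 1 bit, -log2(1/4) = 2 bits
surprise = [-math.log2(pi) for pi in p]

# Entropy is the probability-weighted average surprise:
# 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5 bits
h = entropy_bits(p)
print('surprise per outcome:', surprise)
print('entropy:', h)

# The uniform distribution over 3 outcomes reaches the upper bound log2(3)
h_uniform = entropy_bits([1 / 3, 1 / 3, 1 / 3])
print('uniform entropy:', h_uniform, 'bound:', math.log2(3))
```

The skewed distribution stays below the $\log_2 3$ bound, while the uniform one meets it.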

In [None]:
def information_content(p: np.ndarray, base: float = 2.0) -> np.ndarray:
    """Compute information content (surprise) of events.

    Args:
        p: Probability of each event.
        base: Logarithm base (2 for bits, e for nats).

    Returns:
        Information content for each event.
    """
    return -np.log(p) / np.log(base)


def shannon_entropy(p: np.ndarray, base: float = 2.0) -> float:
    """Compute Shannon entropy of a discrete distribution.

    Args:
        p: Probability vector (must sum to 1).
        base: Logarithm base (2 for bits, e for nats).

    Returns:
        Entropy value.
    """
    # Filter out zero probabilities
    p_nonzero = p[p > 0]
    return -np.sum(p_nonzero * np.log(p_nonzero) / np.log(base))


# Demonstrate: information content vs probability
probs = np.linspace(0.001, 1.0, 500)
info = information_content(probs, base=2.0)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Information content curve
axes[0].plot(probs, info, color=COLORS['blue'], linewidth=2)
axes[0].set_xlabel('Probability p(x)', fontsize=12)
axes[0].set_ylabel('Information Content (bits)', fontsize=12)
axes[0].set_title(r'$I(x) = -\log_2 p(x)$: Rare Events = More Surprise')  # raw string keeps \l literal
# Mark specific probabilities
for p_val, label in [(0.5, 'Fair coin'), (0.1, '10% event'), (0.01, '1% event')]:
    i_val = information_content(np.array([p_val]))[0]
    axes[0].plot(p_val, i_val, 'o', markersize=8)
    axes[0].annotate(f'{label}\n{i_val:.2f} bits',
                      xy=(p_val, i_val), xytext=(p_val + 0.08, i_val + 0.5),
                      fontsize=9, arrowprops=dict(arrowstyle='->', color='gray'))
axes[0].grid(True, alpha=0.2)

# Entropy of a binary variable
p_binary = np.linspace(0.001, 0.999, 500)
entropy_binary = np.array(
    [shannon_entropy(np.array([p, 1 - p]), base=2.0) for p in p_binary]
)
axes[1].plot(p_binary, entropy_binary, color=COLORS['red'], linewidth=2)
axes[1].set_xlabel('p (probability of heads)', fontsize=12)
axes[1].set_ylabel('H(X) (bits)', fontsize=12)
axes[1].set_title('Binary Entropy Function')
axes[1].axvline(0.5, color=COLORS['gray'], linestyle='--', alpha=0.5,
                 label='Max entropy at p=0.5')
axes[1].legend()
axes[1].grid(True, alpha=0.2)

plt.tight_layout()
plt.show()

# Verify: maximum entropy for fair coin
print(f'Binary entropy at p=0.5: {shannon_entropy(np.array([0.5, 0.5])):.4f} bits')
print(f'Binary entropy at p=0.9: {shannon_entropy(np.array([0.9, 0.1])):.4f} bits')
print(f'Binary entropy at p=1.0: {shannon_entropy(np.array([1.0, 0.0])):.4f} bits')

### 1.2 Entropy of Multi-Class Distributions

For a $K$-class distribution, entropy measures how "spread out" the probability mass is. A uniform distribution has maximum entropy $\log_2 K$, while a peaked distribution has low entropy.

In [None]:
def entropy_vs_concentration() -> None:
    """Visualize how entropy changes with distribution shape."""
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    n_classes_demo = 8

    # Create distributions from uniform to peaked
    temperatures = [2.0, 1.0, 0.5, 0.1, 0.01]
    logits = np.random.RandomState(SEED).randn(n_classes_demo)

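    bar_width = 0.15
    for i, (temp, color) in enumerate(zip(temperatures, COLOR_LIST[:5])):
        probs = np.exp(logits / temp) / np.sum(np.exp(logits / temp))
        h = shannon_entropy(probs)
        # Offset bars by series index (not by temperature) so the five
        # groups sit side by side instead of overlapping
        axes[0].bar(np.arange(n_classes_demo) + (i - 2) * bar_width,
                     probs, width=bar_width, alpha=0.7, color=color,
                     label=f'T={temp}, H={h:.2f}')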

    axes[0].set_xlabel('Class')
    axes[0].set_ylabel('Probability')
    axes[0].set_title('Softmax at Different Temperatures')
    axes[0].legend(fontsize=8)

    # Entropy vs temperature (smooth curve)
    temp_range = np.linspace(0.01, 5.0, 200)
    entropies_temp = []
    for t in temp_range:
        p = np.exp(logits / t) / np.sum(np.exp(logits / t))
        entropies_temp.append(shannon_entropy(p))
    axes[1].plot(temp_range, entropies_temp, color=COLORS['blue'], linewidth=2)
    axes[1].axhline(np.log2(n_classes_demo), color=COLORS['red'], linestyle='--',
                     label=f'Max entropy = log₂({n_classes_demo}) = {np.log2(n_classes_demo):.2f}')
    axes[1].set_xlabel('Temperature')
    axes[1].set_ylabel('Entropy (bits)')
    axes[1].set_title('Entropy vs Temperature')
    axes[1].legend(fontsize=9)
    axes[1].grid(True, alpha=0.2)

    # Entropy of real digit class distribution
    digit_probs = class_counts / class_counts.sum()
    digit_entropy = shannon_entropy(digit_probs)
    uniform_entropy = np.log2(n_classes)

    axes[2].bar(range(n_classes), digit_probs, color=COLORS['green'], alpha=0.7,
                 edgecolor='white')
    axes[2].set_xlabel('Digit Class')
    axes[2].set_ylabel('Proportion')
    axes[2].set_title(f'Digits Distribution\nH={digit_entropy:.4f} bits '
                       f'(max={uniform_entropy:.4f})')
    axes[2].grid(True, alpha=0.2, axis='y')

    plt.tight_layout()
    plt.show()

    print(f'Digits class entropy: {digit_entropy:.4f} bits')
    print(f'Maximum possible:     {uniform_entropy:.4f} bits')
    print(f'Efficiency:           {digit_entropy / uniform_entropy:.2%}')


entropy_vs_concentration()

### 1.3 Cross-Entropy

The **cross-entropy** between a true distribution $p$ and a model distribution $q$ is:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Key insight: **Cross-entropy = Entropy + KL divergence**

$$H(p, q) = H(p) + D_{\text{KL}}(p \| q)$$

Since $H(p)$ is constant w.r.t. model parameters, minimizing cross-entropy $\equiv$ minimizing KL divergence $\equiv$ making $q$ match $p$.

**ML connection:** When we train a classifier with cross-entropy loss, we're finding the $q$ (model predictions) closest to $p$ (true labels) in the KL sense.
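
The decomposition follows in two lines from the definitions: add and subtract $\sum_x p(x) \log p(x)$ inside the cross-entropy sum. The derivation below uses the same symbols as the formulas in this section and assumes $q(x) > 0$ wherever $p(x) > 0$.

$$
\begin{aligned}
H(p, q) &= -\sum_{x} p(x) \log q(x) \\
        &= -\sum_{x} p(x) \log p(x) + \sum_{x} p(x) \log \frac{p(x)}{q(x)} \\
        &= H(p) + D_{\text{KL}}(p \| q)
\end{aligned}
$$

For a one-hot label, $H(p) = 0$, so the cross-entropy reduces to $-\log q(y)$ for the true class $y$, which is the per-sample quantity a classification loss averages.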

In [None]:
def cross_entropy(
    p: np.ndarray,
    q: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute cross-entropy H(p, q).

    Args:
        p: True distribution.
        q: Model distribution.
        base: Logarithm base.

    Returns:
        Cross-entropy value.
    """
    eps = 1e-15
    q = np.clip(q, eps, 1.0)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]) / np.log(base))


def categorical_cross_entropy_loss(
    labels: np.ndarray,
    logits: np.ndarray,
) -> float:
    """Compute categorical cross-entropy loss (nats) from logits.

    This is equivalent to PyTorch's nn.CrossEntropyLoss.
    Applies softmax internally for numerical stability.

    Args:
        labels: Integer class labels of shape (n,).
        logits: Raw model outputs of shape (n, K).

    Returns:
        Average cross-entropy loss.
    """
    # Log-softmax (numerically stable)
    max_logits = logits.max(axis=1, keepdims=True)
    shifted = logits - max_logits
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    log_probs = shifted - log_sum_exp

    # Pick log-probability of correct class
    n = len(labels)
    nll = -log_probs[np.arange(n), labels]
    return nll.mean()


# Demonstrate cross-entropy behavior
true_dist = np.array([0.7, 0.2, 0.1])
model_dists = {
    'Perfect match': np.array([0.7, 0.2, 0.1]),
    'Good model':    np.array([0.6, 0.25, 0.15]),
    'Bad model':     np.array([0.33, 0.34, 0.33]),
    'Wrong model':   np.array([0.1, 0.2, 0.7]),
}

print('Cross-entropy H(p, q) for different model distributions q:')
print(f'  True distribution p = {true_dist}')
print(f'  Entropy H(p) = {shannon_entropy(true_dist):.4f} bits (lower bound)\n')

results = []
for name, q in model_dists.items():
    ce = cross_entropy(true_dist, q)
    h = shannon_entropy(true_dist)
    kl = ce - h  # KL divergence = CE - H
    results.append({'Model': name, 'H(p,q)': f'{ce:.4f}', 'KL(p||q)': f'{kl:.4f}'})
    print(f'  {name:16s}: H(p,q) = {ce:.4f}  KL(p||q) = {kl:.4f}')

print(f'\nVerify: H(p,q) = H(p) + KL(p||q) always holds')

Let's verify our cross-entropy implementation against PyTorch's built-in.

In [None]:
def verify_cross_entropy_against_pytorch() -> None:
    """Compare our cross-entropy with PyTorch's nn.CrossEntropyLoss."""
    rng = np.random.RandomState(SEED)
    n_samples = 100
    n_cls = 5

    # Random logits and labels
    logits_np = rng.randn(n_samples, n_cls).astype(np.float32)
    labels_np = rng.randint(0, n_cls, size=n_samples)

    # Our implementation
    our_loss = categorical_cross_entropy_loss(labels_np, logits_np)

    # PyTorch implementation
    logits_pt = torch.tensor(logits_np)
    labels_pt = torch.tensor(labels_np, dtype=torch.long)
    criterion = torch.nn.CrossEntropyLoss()
    pt_loss = criterion(logits_pt, labels_pt).item()

    print(f'Our cross-entropy:     {our_loss:.6f}')
    print(f'PyTorch CrossEntropy:  {pt_loss:.6f}')
    print(f'Absolute difference:   {abs(our_loss - pt_loss):.2e}')
    assert abs(our_loss - pt_loss) < 1e-5, 'Mismatch!'
    print('Verification PASSED')


verify_cross_entropy_against_pytorch()

### 1.4 Kullback–Leibler (KL) Divergence

KL divergence measures how much one distribution $q$ diverges from a reference distribution $p$:

$$D_{\text{KL}}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

Key properties:

- $D_{\text{KL}}(p \| q) \geq 0$ (Gibbs' inequality)
- $D_{\text{KL}}(p \| q) = 0$ iff $p = q$
- **Not symmetric:** $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general
- **Not a metric:** does not satisfy the triangle inequality

**ML applications:**

- **VAE loss:** $D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$ regularizes the latent space
- **Knowledge distillation:** $D_{\text{KL}}(p_{\text{teacher}} \| p_{\text{student}})$ transfers knowledge
- **Policy optimization:** $D_{\text{KL}}(\pi_{\text{old}} \| \pi_{\text{new}})$ constrains policy updates
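
To make the VAE application concrete, here is a minimal sketch that assumes a one-dimensional Gaussian posterior $\mathcal{N}(\mu, \sigma^2)$ and a standard normal prior. For that pair the KL term has the well-known closed form $\tfrac{1}{2}(\mu^2 + \sigma^2 - 1 - \ln \sigma^2)$ in nats. The function names and the values of `mu` and `sigma` are illustrative assumptions; the notebook itself does not build a VAE. The closed form is checked against a plain Riemann sum of the defining integral.

```python
import math


def kl_gauss_vs_std_normal(mu, sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, 1)) in nats."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kl_numeric(mu, sigma, lo=-12.0, hi=12.0, n=20001):
    """Riemann-sum approximation of the integral of q(x) * ln(q(x) / p(x))."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        q = normal_pdf(x, mu, sigma)    # approximate posterior
        p = normal_pdf(x, 0.0, 1.0)     # standard normal prior
        if q > 0.0:
            total += q * math.log(q / p) * dx
    return total


mu, sigma = 1.0, 0.5                    # hypothetical posterior parameters
closed = kl_gauss_vs_std_normal(mu, sigma)
numeric = kl_numeric(mu, sigma)
print(f'closed form: {closed:.6f} nats')
print(f'numeric:     {numeric:.6f} nats')
```

The divergence is zero only when `mu = 0` and `sigma = 1`, which is why this term pulls the latent code toward the prior.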

In [None]:
def kl_divergence(
    p: np.ndarray,
    q: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute KL divergence D_KL(p || q).

    Args:
        p: True/reference distribution.
        q: Approximate distribution.
        base: Logarithm base.

    Returns:
        KL divergence value (non-negative).
    """
    eps = 1e-15
    q = np.clip(q, eps, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]) / np.log(base))


# Demonstrate asymmetry
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f'p = {p}')
print(f'q = {q} (uniform)')
print(f'KL(p || q) = {kl_pq:.6f} bits')
print(f'KL(q || p) = {kl_qp:.6f} bits')
print(f'Asymmetry: KL(p||q) - KL(q||p) = {kl_pq - kl_qp:.6f} bits')
print(f'\nNote: KL is NOT symmetric — it is a directed divergence, not a distance.')

#### Forward vs Reverse KL

The choice of direction matters in practice:

- **Forward KL** $D_{\text{KL}}(p \| q)$: penalizes $q$ for having low probability where $p$ has high probability → **mode-covering** ($q$ is spread out)
- **Reverse KL** $D_{\text{KL}}(q \| p)$: penalizes $q$ for having high probability where $p$ has low probability → **mode-seeking** ($q$ is peaked)
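
The two behaviours can be checked numerically on a tiny discrete case. The sketch below assumes a three-bin bimodal target and two hand-picked candidates, one spread out and one committed to a single mode; all names and probabilities are illustrative. It shows which candidate each direction of KL prefers.

```python
import math


def kl_nats(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.495, 0.01, 0.495]        # bimodal target: two modes, near-empty valley
q_spread = [1 / 3, 1 / 3, 1 / 3]  # covers both modes and the valley
q_peaked = [0.98, 0.01, 0.01]   # commits to the left mode only

candidates = {'spread': q_spread, 'peaked': q_peaked}
forward = {name: kl_nats(p, q) for name, q in candidates.items()}  # KL(p || q)
reverse = {name: kl_nats(q, p) for name, q in candidates.items()}  # KL(q || p)

for name in candidates:
    print(f'{name:7s} forward KL(p||q) = {forward[name]:.4f}   '
          f'reverse KL(q||p) = {reverse[name]:.4f}')

print('forward KL prefers:', min(forward, key=forward.get))
print('reverse KL prefers:', min(reverse, key=reverse.get))
```

Forward KL heavily penalizes the peaked candidate for missing the right-hand mode, while reverse KL penalizes the spread candidate for placing mass in the valley.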

In [None]:
def visualize_kl_asymmetry() -> None:
    """Visualize forward vs reverse KL with a bimodal target."""
    x = np.linspace(-6, 6, 1000)

    # True distribution: bimodal
    p = 0.5 * sp_stats.norm.pdf(x, -2, 0.8) + 0.5 * sp_stats.norm.pdf(x, 2, 0.8)
    p = p / p.sum()  # Normalize to a proper PMF for computation

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Forward KL: mode-covering
    axes[0].fill_between(x, p / p.max(), alpha=0.3, color=COLORS['blue'], label='p (true)')
    axes[0].plot(x, p / p.max(), color=COLORS['blue'], linewidth=2)
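    # Hand-picked wide Gaussian for illustration, not an optimized fit
    # (the exact forward-KL Gaussian fit matches moments: mean 0, std ≈ 2.15)
    q_forward = sp_stats.norm.pdf(x, 0, 2.5)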
    q_forward = q_forward / q_forward.max()
    axes[0].plot(x, q_forward, '--', color=COLORS['red'], linewidth=2,
                  label='q (forward KL fit)')
    axes[0].set_title('Forward KL: Mode-Covering\n$q$ spreads to cover all of $p$')
    axes[0].legend(fontsize=10)
    axes[0].set_xlabel('x')
    axes[0].grid(True, alpha=0.2)

    # Reverse KL: mode-seeking (left mode)
    axes[1].fill_between(x, p / p.max(), alpha=0.3, color=COLORS['blue'], label='p (true)')
    axes[1].plot(x, p / p.max(), color=COLORS['blue'], linewidth=2)
    # Hand-picked Gaussian on the left mode for illustration, not an optimized fit
    q_reverse = sp_stats.norm.pdf(x, -2, 0.8)
    q_reverse = q_reverse / q_reverse.max()
    axes[1].plot(x, q_reverse, '--', color=COLORS['green'], linewidth=2,
                  label='q (reverse KL fit)')
    axes[1].set_title('Reverse KL: Mode-Seeking\n$q$ locks onto one mode')
    axes[1].legend(fontsize=10)
    axes[1].set_xlabel('x')
    axes[1].grid(True, alpha=0.2)

    # Both on same plot
    axes[2].fill_between(x, p / p.max(), alpha=0.3, color=COLORS['blue'], label='p (true)')
    axes[2].plot(x, p / p.max(), color=COLORS['blue'], linewidth=2)
    axes[2].plot(x, q_forward, '--', color=COLORS['red'], linewidth=2, label='Forward KL')
    axes[2].plot(x, q_reverse, '--', color=COLORS['green'], linewidth=2, label='Reverse KL')
    axes[2].set_title('Forward vs Reverse KL')
    axes[2].legend(fontsize=10)
    axes[2].set_xlabel('x')
    axes[2].grid(True, alpha=0.2)

    plt.suptitle('KL Divergence Asymmetry', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()


visualize_kl_asymmetry()

### 1.5 Jensen–Shannon (JS) Divergence

Since KL divergence is asymmetric, we often use the **Jensen–Shannon divergence**, which is symmetric and bounded:

$$D_{\text{JS}}(p \| q) = \frac{1}{2} D_{\text{KL}}\left(p \| \frac{p+q}{2}\right) + \frac{1}{2} D_{\text{KL}}\left(q \| \frac{p+q}{2}\right)$$

Properties:

- $0 \leq D_{\text{JS}}(p \| q) \leq 1$ (when using $\log_2$)
- Symmetric: $D_{\text{JS}}(p \| q) = D_{\text{JS}}(q \| p)$
- $\sqrt{D_{\text{JS}}}$ is a proper metric

**ML application:** The original GAN loss is related to JS divergence between the real and generated distributions.
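
The upper bound is attained when the two distributions have disjoint support, a case where KL itself would be infinite. The following sketch, with illustrative helper names `kl_bits` and `js_bits`, checks both ends of the range using only the standard library.

```python
import math


def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_bits(p, q):
    """Jensen-Shannon divergence in bits via the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # m is positive wherever p or q is, so both KL terms stay finite
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


p = [1.0, 0.0]
q = [0.0, 1.0]   # no overlap with p

js_disjoint = js_bits(p, q)
js_same = js_bits(p, p)
print('JS for disjoint supports:', js_disjoint)   # upper bound, 1 bit
print('JS for identical inputs: ', js_same)       # lower bound, 0 bits
```

Because the mixture always covers the support of both inputs, no clipping constant is needed to keep the result finite.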

In [None]:
def js_divergence(
    p: np.ndarray,
    q: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute Jensen-Shannon divergence.

    Args:
        p: First distribution.
        q: Second distribution.
        base: Logarithm base.

    Returns:
        JS divergence (symmetric, bounded in [0, 1] for base=2).
    """
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


# Compare KL and JS divergences as distributions diverge
base_dist = np.array([0.4, 0.3, 0.2, 0.1])
alpha_range = np.linspace(0, 1, 50)
uniform_dist = np.ones(4) / 4

kl_fwd_vals = []
kl_rev_vals = []
js_vals = []

for alpha in alpha_range:
    q_interp = (1 - alpha) * base_dist + alpha * uniform_dist
    q_interp = q_interp / q_interp.sum()
    kl_fwd_vals.append(kl_divergence(base_dist, q_interp))
    kl_rev_vals.append(kl_divergence(q_interp, base_dist))
    js_vals.append(js_divergence(base_dist, q_interp))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(alpha_range, kl_fwd_vals, color=COLORS['blue'], linewidth=2, label='KL(p || q)')
ax.plot(alpha_range, kl_rev_vals, color=COLORS['green'], linewidth=2, label='KL(q || p)')
ax.plot(alpha_range, js_vals, color=COLORS['red'], linewidth=2, label='JS(p, q)')
ax.set_xlabel('α (interpolation towards uniform)', fontsize=12)
ax.set_ylabel('Divergence (bits)', fontsize=12)
ax.set_title('KL vs JS Divergence: Asymmetry and Boundedness')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# Verify symmetry
p_test = np.array([0.3, 0.5, 0.2])
q_test = np.array([0.1, 0.6, 0.3])
print(f'JS(p, q) = {js_divergence(p_test, q_test):.6f}')
print(f'JS(q, p) = {js_divergence(q_test, p_test):.6f}')
print(f'Symmetric: {np.isclose(js_divergence(p_test, q_test), js_divergence(q_test, p_test))}')

### 1.6 Conditional Entropy and the Chain Rule

The **conditional entropy** measures the remaining uncertainty in $Y$ after observing $X$:

$$H(Y | X) = -\sum_{x, y} p(x, y) \log_2 p(y | x)$$

The **chain rule of entropy**:

$$H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)$$

This decomposes the joint uncertainty into the uncertainty of one variable plus the remaining uncertainty of the other.
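
The chain rule is only a meaningful check if $H(Y|X)$ is computed from its own definition rather than as $H(X,Y) - H(X)$. The sketch below does that on the weather and umbrella joint table used in this section, with $p(y|x) = p(x,y) / p(x)$. The helper name `entropy_bits` is an illustrative assumption.

```python
import math

# Joint table P(X, Y): rows are weather states, columns are umbrella yes/no
joint = [
    [0.10, 0.40],
    [0.35, 0.05],
    [0.05, 0.05],
]


def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


p_x = [sum(row) for row in joint]            # marginal over rows
p_y = [sum(col) for col in zip(*joint)]      # marginal over columns

# Definition: H(Y|X) = -sum_{x,y} p(x,y) * log2 p(y|x)
h_y_given_x = -sum(
    pxy * math.log2(pxy / px)
    for row, px in zip(joint, p_x)
    for pxy in row
    if pxy > 0
)

h_x = entropy_bits(p_x)
h_xy = entropy_bits([pxy for row in joint for pxy in row])

print(f'H(Y|X) from definition: {h_y_given_x:.4f} bits')
print(f'H(X,Y) - H(X):          {h_xy - h_x:.4f} bits')
```

Agreement between the two printed values confirms the chain rule without assuming it, and $H(Y|X)$ never exceeds $H(Y)$ because conditioning cannot increase entropy.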

In [None]:
def joint_entropy(
    joint_prob: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute joint entropy H(X, Y) from a joint probability table.

    Args:
        joint_prob: Joint probability matrix of shape (|X|, |Y|).
        base: Logarithm base.

    Returns:
        Joint entropy value.
    """
    p_flat = joint_prob.ravel()
    p_nonzero = p_flat[p_flat > 0]
    return -np.sum(p_nonzero * np.log(p_nonzero) / np.log(base))


def conditional_entropy(
    joint_prob: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute conditional entropy H(Y|X) from joint probability table.

    H(Y|X) = H(X,Y) - H(X)

    Args:
        joint_prob: Joint probability matrix of shape (|X|, |Y|).
        base: Logarithm base.

    Returns:
        Conditional entropy H(Y|X).
    """
    h_xy = joint_entropy(joint_prob, base)
    p_x = joint_prob.sum(axis=1)
    h_x = shannon_entropy(p_x, base)
    return h_xy - h_x


# Example: Weather and Umbrella
# Joint distribution P(Weather, Umbrella)
#              Umbrella=Yes  Umbrella=No
# Sunny          0.10          0.40
# Rainy          0.35          0.05
# Cloudy         0.05          0.05
joint_wu = np.array([
    [0.10, 0.40],
    [0.35, 0.05],
    [0.05, 0.05],
])

# Marginals
p_weather = joint_wu.sum(axis=1)
p_umbrella = joint_wu.sum(axis=0)

h_w = shannon_entropy(p_weather)
h_u = shannon_entropy(p_umbrella)
h_wu = joint_entropy(joint_wu)
h_u_given_w = conditional_entropy(joint_wu)
h_w_given_u = h_wu - h_u  # Chain rule: H(W|U) = H(W,U) - H(U)

print('=== Entropy Decomposition ===')
print(f'H(Weather)       = {h_w:.4f} bits')
print(f'H(Umbrella)      = {h_u:.4f} bits')
print(f'H(Weather, Umb)  = {h_wu:.4f} bits')
print(f'H(Umb | Weather) = {h_u_given_w:.4f} bits')
print(f'H(Weather | Umb) = {h_w_given_u:.4f} bits')
print()
print('Chain rule verification:')
print(f'  H(W) + H(U|W) = {h_w:.4f} + {h_u_given_w:.4f} = {h_w + h_u_given_w:.4f}')
print(f'  H(W, U)        = {h_wu:.4f}')
assert np.isclose(h_w + h_u_given_w, h_wu), 'Chain rule violated!'
print('  Chain rule VERIFIED')

### 1.7 Mutual Information

**Mutual information** $I(X; Y)$ measures how much knowing $X$ reduces the uncertainty about $Y$ (and vice versa):

$$I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y)$$

Equivalently:

$$I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)} = D_{\text{KL}}(p(x, y) \| p(x) p(y))$$

Properties:

- $I(X; Y) \geq 0$ (non-negative)
- $I(X; Y) = 0$ iff $X$ and $Y$ are independent
- $I(X; Y) = I(Y; X)$ (symmetric)
- $I(X; X) = H(X)$ (a variable's mutual information with itself is its entropy)

**ML applications:**

- Feature selection (select features with highest MI with target)
- Information bottleneck (compress representations while preserving MI with labels)
- Infomax principle (maximize MI between input and learned representation)
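
Two of these properties are easy to confirm on toy tables: an independent joint distribution should give zero mutual information, and a variable paired with itself should give its entropy. The sketch below uses the KL form of the definition; the function name and the example marginals are illustrative assumptions.

```python
import math


def mutual_information_bits(joint):
    """I(X; Y) in bits from a joint table, via KL(p(x,y) || p(x) p(y))."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


px, py = [0.3, 0.7], [0.6, 0.4]

# Independent case: the joint table is the outer product of the marginals
independent = [[a * b for b in py] for a in px]

# Fully dependent case: Y is a copy of X, so all mass is on the diagonal
identical = [[0.3, 0.0], [0.0, 0.7]]

mi_independent = mutual_information_bits(independent)
mi_identical = mutual_information_bits(identical)
print(f'I(X;Y), independent: {mi_independent:.6f} bits')
print(f'I(X;X):              {mi_identical:.6f} bits')
print(f'H(X):                {entropy_bits(px):.6f} bits')
```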

In [None]:
def mutual_information(
    joint_prob: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute mutual information I(X; Y) from joint probability table.

    Args:
        joint_prob: Joint probability matrix of shape (|X|, |Y|).
        base: Logarithm base.

    Returns:
        Mutual information value.
    """
    p_y = joint_prob.sum(axis=0)  # Marginal P(Y)
    h_y = shannon_entropy(p_y, base)
    h_y_given_x = conditional_entropy(joint_prob, base)  # H(Y|X)
    return h_y - h_y_given_x  # I(X;Y) = H(Y) - H(Y|X)


def mutual_information_direct(
    joint_prob: np.ndarray,
    base: float = 2.0,
) -> float:
    """Compute mutual information directly via KL(p(x,y) || p(x)p(y)).

    Args:
        joint_prob: Joint probability matrix of shape (|X|, |Y|).
        base: Logarithm base.

    Returns:
        Mutual information value.
    """
    p_x = joint_prob.sum(axis=1, keepdims=True)
    p_y = joint_prob.sum(axis=0, keepdims=True)
    independent = p_x * p_y  # P(X)P(Y) product distribution

    eps = 1e-15
    mask = joint_prob > 0
    mi = np.sum(joint_prob[mask] * np.log(
        joint_prob[mask] / (independent[mask] + eps)
    ) / np.log(base))
    return mi


# Compute MI for the weather-umbrella example
mi_formula = h_u - h_u_given_w  # I(W; U) = H(U) - H(U|W)
mi_direct = mutual_information_direct(joint_wu)

print(f'MI via formula:   I(W; U) = H(U) - H(U|W) = {h_u:.4f} - {h_u_given_w:.4f} = {mi_formula:.4f}')
print(f'MI via KL(joint || product): {mi_direct:.4f}')
assert np.isclose(mi_formula, mi_direct, atol=1e-10), 'MI computation mismatch!'
print('Both methods agree!')
print(f'\nInterpretation: Knowing the weather reduces umbrella uncertainty by {mi_formula:.4f} bits')

### 1.8 The Information Venn Diagram

The relationships between entropy, conditional entropy, joint entropy, and mutual information form a Venn diagram:

- $H(X, Y) = H(X) + H(Y) - I(X; Y)$
- $H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$
- $I(X; Y) = H(X) + H(Y) - H(X, Y)$

In [None]:
def plot_information_venn(
    h_x: float,
    h_y: float,
    h_xy: float,
    mi_xy: float,
    label_x: str = 'X',
    label_y: str = 'Y',
) -> None:
    """Plot Venn diagram of information-theoretic quantities.

    Args:
        h_x: Entropy of X.
        h_y: Entropy of Y.
        h_xy: Joint entropy.
        mi_xy: Mutual information.
        label_x: Label for variable X.
        label_y: Label for variable Y.
    """
    from matplotlib.patches import Circle

    fig, ax = plt.subplots(figsize=(10, 7))

    # Draw circles
    c1 = Circle((-0.5, 0), 1.5, fill=False, edgecolor=COLORS['blue'],
                 linewidth=3, label=f'H({label_x})')
    c2 = Circle((0.5, 0), 1.5, fill=False, edgecolor=COLORS['red'],
                 linewidth=3, label=f'H({label_y})')
    ax.add_patch(c1)
    ax.add_patch(c2)

    # Non-overlapping parts of each circle:
    # H(X|Y) = H(X) - I(X;Y) and H(Y|X) = H(Y) - I(X;Y)
    h_x_only = h_x - mi_xy
    h_y_only = h_y - mi_xy

    # Labels
    ax.text(-1.3, 0, f'H({label_x}|{label_y})\n{h_x_only:.3f}',
             ha='center', va='center', fontsize=12, fontweight='bold',
             color=COLORS['blue'])
    ax.text(0, 0, f'I({label_x};{label_y})\n{mi_xy:.3f}',
             ha='center', va='center', fontsize=12, fontweight='bold',
             color=COLORS['purple'])
    ax.text(1.3, 0, f'H({label_y}|{label_x})\n{h_y_only:.3f}',
             ha='center', va='center', fontsize=12, fontweight='bold',
             color=COLORS['red'])

    # Overall labels
    ax.text(-0.5, 1.8, f'H({label_x}) = {h_x:.3f}',
             ha='center', fontsize=11, color=COLORS['blue'])
    ax.text(0.5, 1.8, f'H({label_y}) = {h_y:.3f}',
             ha='center', fontsize=11, color=COLORS['red'])
    ax.text(0, -2.0, f'H({label_x},{label_y}) = {h_xy:.3f}',
             ha='center', fontsize=11, color=COLORS['gray'])

    ax.set_xlim(-3, 3)
    ax.set_ylim(-2.5, 2.5)
    ax.set_aspect('equal')
    ax.set_title('Information Venn Diagram', fontsize=14, fontweight='bold')
    ax.axis('off')
    plt.tight_layout()
    plt.show()


plot_information_venn(
    h_x=h_w, h_y=h_u, h_xy=h_wu, mi_xy=mi_formula,
    label_x='Weather', label_y='Umbrella',
)

---

## Part 2: Putting It All Together: InformationTheoryToolkit

We assemble our information-theoretic primitives into a reusable class that can analyze distributions and compute all key quantities.

In [None]:
class InformationTheoryToolkit:
    """Reusable information theory toolkit for ML.

    Provides methods for computing entropy, cross-entropy, KL divergence,
    mutual information, and related quantities.

    Attributes:
        base: Logarithm base (2 for bits, e for nats).
    """

    def __init__(self, base: float = 2.0) -> None:
        """Initialize toolkit.

        Args:
            base: Logarithm base (2 for bits, e for nats).
        """
        self.base = base

    def entropy(self, p: np.ndarray) -> float:
        """Compute Shannon entropy.

        Args:
            p: Probability distribution.

        Returns:
            Entropy value.
        """
        return shannon_entropy(p, self.base)

    def cross_entropy(self, p: np.ndarray, q: np.ndarray) -> float:
        """Compute cross-entropy H(p, q).

        Args:
            p: True distribution.
            q: Model distribution.

        Returns:
            Cross-entropy value.
        """
        return cross_entropy(p, q, self.base)

    def kl_divergence(self, p: np.ndarray, q: np.ndarray) -> float:
        """Compute KL divergence D_KL(p || q).

        Args:
            p: Reference distribution.
            q: Approximate distribution.

        Returns:
            KL divergence value.
        """
        return kl_divergence(p, q, self.base)

    def js_divergence(self, p: np.ndarray, q: np.ndarray) -> float:
        """Compute Jensen-Shannon divergence.

        Args:
            p: First distribution.
            q: Second distribution.

        Returns:
            JS divergence value.
        """
        return js_divergence(p, q, self.base)

    def mutual_information(self, joint_prob: np.ndarray) -> float:
        """Compute mutual information from joint probability table.

        Args:
            joint_prob: Joint probability matrix.

        Returns:
            Mutual information value.
        """
        return mutual_information_direct(joint_prob, self.base)

    def summary(self, joint_prob: np.ndarray, label_x: str = 'X', label_y: str = 'Y') -> pd.DataFrame:
        """Compute all information-theoretic quantities for a joint distribution.

        Args:
            joint_prob: Joint probability matrix.
            label_x: Name for the row variable.
            label_y: Name for the column variable.

        Returns:
            DataFrame summarizing all quantities.
        """
        p_x = joint_prob.sum(axis=1)
        p_y = joint_prob.sum(axis=0)

        h_x = self.entropy(p_x)
        h_y = self.entropy(p_y)
        h_xy = joint_entropy(joint_prob, self.base)
        h_y_given_x = conditional_entropy(joint_prob, self.base)
        h_x_given_y = h_xy - h_y
        mi = self.mutual_information(joint_prob)

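        # Name the unit after the configured log base instead of assuming bits
        unit = {2.0: 'bits', np.e: 'nats'}.get(self.base, f'base-{self.base:g}')
        return pd.DataFrame({
            'Quantity': [
                f'H({label_x})', f'H({label_y})', f'H({label_x},{label_y})',
                f'H({label_y}|{label_x})', f'H({label_x}|{label_y})',
                f'I({label_x};{label_y})',
            ],
            f'Value ({unit})': [h_x, h_y, h_xy, h_y_given_x, h_x_given_y, mi],
        })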


# Sanity check
toolkit = InformationTheoryToolkit(base=2.0)

# Verify toolkit with the weather-umbrella example
summary_df = toolkit.summary(joint_wu, 'Weather', 'Umbrella')
print(summary_df.to_string(index=False))

# Quick check: MI via toolkit matches our manual computation
assert np.isclose(toolkit.mutual_information(joint_wu), mi_formula), 'Toolkit MI mismatch!'
print('\nToolkit verification PASSED')

---

## Part 3: Application: Information Theory in Practice

We apply our information-theoretic tools to real ML scenarios: cross-entropy as a loss function, KL divergence for model comparison, and mutual information for feature selection.

### 3.1 Cross-Entropy as Classification Loss

When we train a classifier, the cross-entropy loss measures how well the model's predicted distribution matches the true label distribution. Let's see how cross-entropy behaves during simulated training.
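
Before looking at whole batches, it helps to see the loss for a single sample. With a one-hot label the cross-entropy is just $-\ln q(\text{correct class})$ in nats, so a model that guesses uniformly over $K$ classes scores $\ln K$. The sketch below assumes $K = 5$ and a few hypothetical predicted probabilities; the function name `nll` is illustrative.

```python
import math


def nll(prob_correct):
    """Per-sample cross-entropy in nats for a one-hot label."""
    return -math.log(prob_correct)


n_classes = 5   # assumed number of classes for this illustration

# A uniform guess gives the chance-level loss ln(K)
baseline = nll(1 / n_classes)
print(f'uniform guess:     {baseline:.4f} nats (ln K = {math.log(n_classes):.4f})')

# Hypothetical probabilities assigned to the correct class
for label, q_correct in [('confident, right', 0.9),
                         ('unsure', 0.5),
                         ('confident, wrong', 0.01)]:
    print(f'{label:17s}  q = {q_correct:<5} loss = {nll(q_correct):.4f} nats')
```

The loss grows without bound as the probability of the correct class approaches zero, so a few confidently wrong predictions can dominate the batch average.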

In [None]:
def simulate_training_progress() -> None:
    """Simulate how cross-entropy changes as a model improves."""
    rng = np.random.RandomState(SEED)
    n_samples = 500
    n_cls = 5

    # True labels (integer class indices, not one-hot vectors)
    true_labels = rng.randint(0, n_cls, n_samples)

    # Simulate model improvement stages
    stages = [
        ('Random (untrained)', 0.0),
        ('Early training', 0.3),
        ('Mid training', 0.7),
        ('Late training', 0.9),
        ('Well-trained', 0.95),
        ('Overconfident', 1.0),
    ]

    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    axes = axes.ravel()
    results = []

    for idx, (name, confidence) in enumerate(stages):
        # Generate predicted probabilities
        logits = rng.randn(n_samples, n_cls)
        # Increase the logit of the correct class for every sample at once
        logits[np.arange(n_samples), true_labels] += confidence * 5

        # Softmax
        exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

        # Cross-entropy
        ce = categorical_cross_entropy_loss(true_labels, logits)

        # Average predicted probability for correct class
        correct_probs = probs[np.arange(n_samples), true_labels]
        avg_correct = correct_probs.mean()

        results.append({'Stage': name, 'CE Loss': ce, 'Avg P(correct)': avg_correct})

        # Plot distribution of predicted probabilities for correct class
        axes[idx].hist(correct_probs, bins=30, color=COLOR_LIST[idx],
                        alpha=0.7, edgecolor='white', density=True)
        axes[idx].set_xlabel('P(correct class)')
        axes[idx].set_ylabel('Density')
        axes[idx].set_title(f'{name}\nCE={ce:.3f}, Avg P(c)={avg_correct:.3f}')
        axes[idx].set_xlim(0, 1)

    plt.suptitle('Cross-Entropy Loss During Training', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    # Summary table
    results_df = pd.DataFrame(results)
    print(results_df.to_string(index=False))


simulate_training_progress()

### 3.2 KL Divergence for Distribution Comparison

KL divergence is commonly used to compare how well a model's output distribution matches a reference. Here we compare per-class histograms of a simple feature derived from the Digits dataset: the sum of pixel intensities in each image.
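When histograms contain empty bins, the smoothing constant strongly affects the resulting KL value. The sketch below illustrates this with made-up bin counts; `kl_bits` and `smoothed_hist` are hypothetical helpers for this example only and may differ from the notebook's `kl_divergence`.

```python
import numpy as np


def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) in bits; bins with p = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def smoothed_hist(counts: np.ndarray, pseudo_count: float) -> np.ndarray:
    """Add a pseudo-count to every bin, then normalize to sum to 1."""
    smoothed = counts.astype(np.float64) + pseudo_count
    return smoothed / smoothed.sum()


counts_p = np.array([10, 30, 40, 20])   # made-up bin counts
counts_q = np.array([25, 50, 25, 0])    # last bin is empty

for pseudo in (1e-10, 1.0):
    p = smoothed_hist(counts_p, pseudo)
    q = smoothed_hist(counts_q, pseudo)
    print(f'pseudo-count {pseudo:g}: KL(p||q) = {kl_bits(p, q):.3f} bits, '
          f'KL(q||p) = {kl_bits(q, p):.3f} bits')
```

With a tiny pseudo-count, the single empty bin in `counts_q` dominates KL(p||q), while KL(q||p) barely changes. Large entries in a KL matrix built from sparse histograms can therefore reflect empty bins as much as genuine differences in shape.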

In [None]:
def compare_digit_distributions() -> None:
    """Compare pixel-sum distributions across digit classes using KL divergence."""
    # Sum all pixel values per image as a simple feature
    pixel_sums = X_digits.sum(axis=1)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Histogram per digit class
    n_bins = 30
    bin_edges = np.linspace(pixel_sums.min(), pixel_sums.max(), n_bins + 1)

    class_hists = {}
    for c in range(n_classes):
        mask = y_digits == c
        hist, _ = np.histogram(pixel_sums[mask], bins=bin_edges, density=False)
        hist = hist.astype(np.float64) + 1e-10  # Tiny pseudo-count: avoids exact zeros in empty bins
        hist = hist / hist.sum()
        class_hists[c] = hist
        if c < 5:
            axes[0].plot(bin_edges[:-1], hist, '-', linewidth=1.5,
                          color=COLOR_LIST[c], label=f'Digit {c}', alpha=0.8)

    axes[0].set_xlabel('Pixel Sum')
    axes[0].set_ylabel('Probability')
    axes[0].set_title('Pixel Sum Distribution per Class')
    axes[0].legend(fontsize=9)
    axes[0].grid(True, alpha=0.2)

    # KL divergence matrix
    kl_matrix = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            kl_matrix[i, j] = kl_divergence(class_hists[i], class_hists[j])

    im = axes[1].imshow(kl_matrix, cmap='YlOrRd', aspect='auto')
    axes[1].set_xlabel('Digit (q)')
    axes[1].set_ylabel('Digit (p)')
    axes[1].set_title('KL(digit_p || digit_q) Matrix')
    axes[1].set_xticks(range(n_classes))
    axes[1].set_yticks(range(n_classes))
    plt.colorbar(im, ax=axes[1], label='KL Divergence (bits)')

    plt.tight_layout()
    plt.show()

    # Find most and least similar pairs (excluding self)
    kl_no_diag = kl_matrix.copy()
    np.fill_diagonal(kl_no_diag, np.inf)
    min_idx = np.unravel_index(kl_no_diag.argmin(), kl_no_diag.shape)
    np.fill_diagonal(kl_no_diag, 0)
    max_idx = np.unravel_index(kl_no_diag.argmax(), kl_no_diag.shape)

    print(f'Most similar pair:  digits {min_idx[0]} and {min_idx[1]} '
          f'(KL = {kl_matrix[min_idx]:.4f})')
    print(f'Most different pair: digits {max_idx[0]} and {max_idx[1]} '
          f'(KL = {kl_matrix[max_idx]:.4f})')


compare_digit_distributions()

### 3.3 Feature Selection with Mutual Information

Mutual information can measure how informative each feature is about the target variable. Features with high MI are good candidates for model input.
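A tiny binary example shows what "high MI" means in practice. The counts below are made up, and `mi_bits` is a hypothetical helper for this illustration; it is assumed to behave like the notebook's `mutual_information_direct`, which is defined outside this section.

```python
import numpy as np


def mi_bits(joint: np.ndarray) -> float:
    """I(X;Y) in bits from a joint probability table (zero cells contribute 0)."""
    p_x = joint.sum(axis=1, keepdims=True)   # marginal over rows
    p_y = joint.sum(axis=0, keepdims=True)   # marginal over columns
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_x * p_y)[mask])))


# Rows: feature value (0/1); columns: label (0/1). Counts are made up.
informative = np.array([[45, 5], [5, 45]], dtype=float)
uninformative = np.array([[25, 25], [25, 25]], dtype=float)

mi_informative = mi_bits(informative / informative.sum())
mi_uninformative = mi_bits(uninformative / uninformative.sum())
print(f'feature agrees with label 90% of the time: {mi_informative:.3f} bits')
print(f'feature independent of label:              {mi_uninformative:.3f} bits')
```

For a balanced binary label the maximum possible MI is 1 bit, so a feature that agrees with the label 90% of the time recovers roughly half of the label's uncertainty.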

In [None]:
def estimate_mi_continuous(
    x: np.ndarray,
    y: np.ndarray,
    n_bins: int = 20,
) -> float:
    """Estimate mutual information between a continuous feature and discrete labels.

    Uses histogram-based discretization.

    Args:
        x: Continuous feature values of shape (n,).
        y: Discrete labels of shape (n,).
        n_bins: Number of bins for discretizing x.

    Returns:
        Estimated mutual information in bits.
    """
    # Discretize x
    x_bins = np.digitize(x, np.linspace(x.min(), x.max(), n_bins + 1)[1:-1])

    # Build joint distribution
    classes = np.unique(y)
    joint = np.zeros((n_bins, len(classes)))
    for i in range(len(x)):
        bin_idx = min(x_bins[i], n_bins - 1)
        class_idx = np.searchsorted(classes, y[i])
        joint[bin_idx, class_idx] += 1

    # Add smoothing and normalize
    joint += 1e-10
    joint = joint / joint.sum()

    return mutual_information_direct(joint)


def mi_feature_selection() -> None:
    """Rank digit pixel positions by mutual information with digit label."""
    n_features = X_digits.shape[1]
    mi_scores = np.zeros(n_features)

    for f in range(n_features):
        mi_scores[f] = estimate_mi_continuous(X_digits[:, f], y_digits)

    # Compare with sklearn's mutual_info_classif
    from sklearn.feature_selection import mutual_info_classif
    mi_sklearn = mutual_info_classif(
        X_digits, y_digits, discrete_features=False, random_state=SEED,
    )

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Reshape MI scores to 8x8 image
    mi_image = mi_scores.reshape(8, 8)
    im1 = axes[0].imshow(mi_image, cmap='hot', interpolation='nearest')
    axes[0].set_title('Our MI Scores (Histogram-based)')
    axes[0].set_xlabel('Pixel Column')
    axes[0].set_ylabel('Pixel Row')
    plt.colorbar(im1, ax=axes[0], label='MI (bits)')

    # Sklearn MI scores
    mi_sk_image = mi_sklearn.reshape(8, 8)
    im2 = axes[1].imshow(mi_sk_image, cmap='hot', interpolation='nearest')
    axes[1].set_title('sklearn MI Scores (k-NN based)')
    axes[1].set_xlabel('Pixel Column')
    axes[1].set_ylabel('Pixel Row')
    plt.colorbar(im2, ax=axes[1], label='MI (nats)')

    # Correlation between our and sklearn MI
    axes[2].scatter(mi_scores, mi_sklearn, alpha=0.6, color=COLORS['blue'])
    axes[2].set_xlabel('Our MI (bits)')
    axes[2].set_ylabel('sklearn MI (nats)')
    axes[2].set_title(f'Our MI vs sklearn MI\nCorrelation: {np.corrcoef(mi_scores, mi_sklearn)[0,1]:.4f}')
    axes[2].grid(True, alpha=0.2)

    plt.suptitle('Feature Importance via Mutual Information', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    # Top and bottom features
    top_indices = np.argsort(mi_scores)[::-1][:5]
    bot_indices = np.argsort(mi_scores)[:5]
    print('Top 5 most informative pixels:')
    for i in top_indices:
        row, col = divmod(i, 8)
        print(f'  Pixel ({row},{col}): MI = {mi_scores[i]:.4f} bits')
    print('\nBottom 5 least informative pixels:')
    for i in bot_indices:
        row, col = divmod(i, 8)
        print(f'  Pixel ({row},{col}): MI = {mi_scores[i]:.4f} bits')


mi_feature_selection()

### 3.4 Prediction Entropy as Uncertainty Measure

The entropy of a model's predicted probability vector indicates how uncertain the model is. Low entropy = confident prediction, high entropy = uncertain prediction.
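Three hand-written probability vectors illustrate the range of values. This is a sketch: `prediction_entropy_bits` is a hypothetical vectorized helper, not the notebook's `shannon_entropy`.

```python
import numpy as np


def prediction_entropy_bits(probs: np.ndarray) -> np.ndarray:
    """Row-wise Shannon entropy in bits; 0 * log(0) is treated as 0."""
    safe = np.where(probs > 0, probs, 1.0)   # log2(1) = 0 removes zero-probability terms
    return -(probs * np.log2(safe)).sum(axis=1)


probs = np.array([
    [1.00, 0.00, 0.00, 0.00],   # fully confident
    [0.70, 0.10, 0.10, 0.10],   # fairly confident
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
])
entropies = prediction_entropy_bits(probs)
for row, h in zip(probs, entropies):
    print(f'{row} -> {h:.3f} bits')
```

The entropy ranges from 0 bits (all mass on one class) to log2(K) bits (uniform over K classes), which is why the threshold sweep in the cell below runs from 0 to `np.log2(n_cls)`.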

In [None]:
def demonstrate_prediction_entropy() -> None:
    """Show how prediction entropy correlates with model confidence."""
    rng = np.random.RandomState(SEED)
    n_test = 200
    n_cls = 10

    # Simulate model predictions with varying confidence
    true_labels = rng.randint(0, n_cls, n_test)
    logits = rng.randn(n_test, n_cls) * 0.5

    # Add signal for correct class (varying strength per sample)
    signal_strengths = rng.uniform(0, 6, n_test)
    logits[np.arange(n_test), true_labels] += signal_strengths

    # Softmax
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

    # Compute entropy for each prediction
    pred_entropies = np.array([shannon_entropy(probs[i]) for i in range(n_test)])

    # Check correctness
    predicted = probs.argmax(axis=1)
    correct = predicted == true_labels

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Entropy distribution for correct vs incorrect
    axes[0].hist(pred_entropies[correct], bins=25, alpha=0.7, color=COLORS['green'],
                  label=f'Correct ({correct.sum()})', density=True)
    axes[0].hist(pred_entropies[~correct], bins=25, alpha=0.7, color=COLORS['red'],
                  label=f'Incorrect ({(~correct).sum()})', density=True)
    axes[0].set_xlabel('Prediction Entropy (bits)')
    axes[0].set_ylabel('Density')
    axes[0].set_title('Entropy: Correct vs Incorrect Predictions')
    axes[0].legend()

    # Max probability vs entropy
    max_probs = probs.max(axis=1)
    axes[1].scatter(max_probs, pred_entropies, c=correct.astype(int),
                     cmap='RdYlGn', alpha=0.6, edgecolors='gray', linewidth=0.3)
    axes[1].set_xlabel('Max Predicted Probability')
    axes[1].set_ylabel('Prediction Entropy (bits)')
    axes[1].set_title('Confidence vs Entropy')
    axes[1].grid(True, alpha=0.2)

    # Accuracy at different entropy thresholds
    thresholds = np.linspace(0, np.log2(n_cls), 50)
    accs = []
    coverages = []
    for t in thresholds:
        mask = pred_entropies <= t
        if mask.sum() > 0:
            accs.append(correct[mask].mean())
            coverages.append(mask.mean())
        else:
            accs.append(np.nan)
            coverages.append(0)

    ax_twin = axes[2].twinx()
    axes[2].plot(thresholds, accs, color=COLORS['blue'], linewidth=2, label='Accuracy')
    ax_twin.plot(thresholds, coverages, color=COLORS['orange'], linewidth=2, label='Coverage')
    axes[2].set_xlabel('Entropy Threshold (bits)')
    axes[2].set_ylabel('Accuracy', color=COLORS['blue'])
    ax_twin.set_ylabel('Coverage', color=COLORS['orange'])
    axes[2].set_title('Accuracy-Coverage Trade-off')
    axes[2].grid(True, alpha=0.2)

    # Combined legend
    lines1, labels1 = axes[2].get_legend_handles_labels()
    lines2, labels2 = ax_twin.get_legend_handles_labels()
    axes[2].legend(lines1 + lines2, labels1 + labels2, loc='center right')

    plt.suptitle('Prediction Entropy as Uncertainty', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    # Key insight
    low_entropy_mask = pred_entropies < 1.0
    high_entropy_mask = pred_entropies > 2.5
    print(f'Low entropy (<1.0 bits): accuracy={correct[low_entropy_mask].mean():.2%}, '
          f'coverage={low_entropy_mask.mean():.2%}')
    if high_entropy_mask.sum() > 0:
        print(f'High entropy (>2.5 bits): accuracy={correct[high_entropy_mask].mean():.2%}, '
              f'coverage={high_entropy_mask.mean():.2%}')


demonstrate_prediction_entropy()

---

## Part 4 — Evaluation & Analysis

We analyze edge cases, verify theoretical properties numerically, and benchmark our implementations.

### 4.1 Verifying Theoretical Properties

Let's systematically verify the key properties of entropy, KL divergence, and mutual information.
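Several of these properties are linked by one identity, $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$: once KL is known to be non-negative, cross-entropy can never fall below entropy. The following sketch checks the identity numerically on one random pair of distributions. It is self-contained and deliberately does not use the notebook's helper functions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # Dirichlet draws are strictly positive
q = rng.dirichlet(np.ones(5))

entropy = -np.sum(p * np.log2(p))          # H(p)
cross_entropy = -np.sum(p * np.log2(q))    # H(p, q)
kl = np.sum(p * np.log2(p / q))            # D_KL(p || q)

print(f'H(p)           = {entropy:.6f} bits')
print(f'D_KL(p || q)   = {kl:.6f} bits')
print(f'H(p) + D_KL    = {entropy + kl:.6f} bits')
print(f'H(p, q)        = {cross_entropy:.6f} bits')
```

The last two printed values should agree up to floating-point error.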

In [None]:
def verify_properties() -> None:
    """Numerically verify key information-theoretic properties."""
    rng = np.random.RandomState(SEED)
    results = []

    # 1. Entropy non-negativity
    for _ in range(100):
        p = rng.dirichlet(np.ones(5))
        h = shannon_entropy(p)
        assert h >= -1e-10, f'Entropy negative: {h}'
    results.append(('H(X) >= 0', 'PASSED', 'Tested 100 random distributions'))

    # 2. Maximum entropy is log2(K) for uniform
    for k in [2, 5, 10, 50, 100]:
        uniform_p = np.ones(k) / k
        h = shannon_entropy(uniform_p)
        assert np.isclose(h, np.log2(k)), f'Max entropy mismatch for K={k}'
    results.append(('H(uniform) = log2(K)', 'PASSED', 'Verified for K=2,5,10,50,100'))

    # 3. KL non-negativity (Gibbs' inequality)
    for _ in range(100):
        p = rng.dirichlet(np.ones(5))
        q = rng.dirichlet(np.ones(5))
        kl = kl_divergence(p, q)
        assert kl >= -1e-10, f'KL negative: {kl}'
    results.append(('D_KL(p||q) >= 0', 'PASSED', 'Tested 100 random distribution pairs'))

    # 4. KL(p||p) = 0 (checks one direction of "KL = 0 iff p = q")
    p = rng.dirichlet(np.ones(5))
    kl_same = kl_divergence(p, p)
    assert np.isclose(kl_same, 0), f'KL(p||p) != 0: {kl_same}'
    results.append(('D_KL(p||p) = 0', 'PASSED', f'KL(p||p) = {kl_same:.2e}'))

    # 5. Cross-entropy >= entropy
    for _ in range(100):
        p = rng.dirichlet(np.ones(5))
        q = rng.dirichlet(np.ones(5))
        ce = cross_entropy(p, q)
        h = shannon_entropy(p)
        assert ce >= h - 1e-10, f'CE < H: {ce} < {h}'
    results.append(('H(p,q) >= H(p)', 'PASSED', 'Tested 100 pairs'))

    # 6. Mutual information non-negativity
    for _ in range(50):
        joint = rng.dirichlet(np.ones(12)).reshape(3, 4)
        mi = mutual_information_direct(joint)
        assert mi >= -1e-10, f'MI negative: {mi}'
    results.append(('I(X;Y) >= 0', 'PASSED', 'Tested 50 joint distributions'))

    # 7. MI = 0 for independent variables
    p_x = np.array([0.3, 0.7])
    p_y = np.array([0.4, 0.6])
    joint_indep = np.outer(p_x, p_y)
    mi_indep = mutual_information_direct(joint_indep)
    assert np.isclose(mi_indep, 0, atol=1e-10), f'MI for independent not 0: {mi_indep}'
    results.append(('I(X;Y) = 0 for independent', 'PASSED', f'MI = {mi_indep:.2e}'))

    # 8. JS symmetry
    for _ in range(50):
        p = rng.dirichlet(np.ones(5))
        q = rng.dirichlet(np.ones(5))
        assert np.isclose(js_divergence(p, q), js_divergence(q, p)), 'JS not symmetric'
    results.append(('JS(p,q) = JS(q,p)', 'PASSED', 'Tested 50 pairs'))

    # 9. JS bounded [0, 1]
    for _ in range(100):
        p = rng.dirichlet(np.ones(5))
        q = rng.dirichlet(np.ones(5))
        js = js_divergence(p, q)
        assert -1e-10 <= js <= 1 + 1e-10, f'JS out of bounds: {js}'
    results.append(('0 <= JS <= 1', 'PASSED', 'Tested 100 pairs'))

    results_df = pd.DataFrame(results, columns=['Property', 'Status', 'Details'])
    print(results_df.to_string(index=False))


verify_properties()

### 4.2 Entropy Estimation from Finite Samples

Estimating entropy from data (using empirical frequencies) introduces a systematic **negative bias**: the plug-in estimator underestimates the true entropy on average, especially with small samples. This matters when using MI for feature selection, because plug-in MI estimates built from the same empirical frequencies are biased in turn (upward, as Section 4.5 shows).
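One standard first-order remedy is the Miller-Madow correction, which adds $(m - 1) / (2n)$ nats to the plug-in estimate, where $m$ is the number of non-empty bins and $n$ the sample size. The sketch below converts the term to bits by dividing by $\ln 2$; the function names are hypothetical and chosen for this example.

```python
import numpy as np


def plugin_entropy_bits(counts: np.ndarray) -> float:
    """Plug-in (maximum-likelihood) entropy estimate in bits."""
    freq = counts[counts > 0] / counts.sum()
    return float(-np.sum(freq * np.log2(freq)))


def miller_madow_entropy_bits(counts: np.ndarray) -> float:
    """Plug-in estimate plus the Miller-Madow term (m - 1) / (2 n ln 2) bits."""
    m = int(np.count_nonzero(counts))   # number of non-empty bins
    n = int(counts.sum())               # sample size
    return plugin_entropy_bits(counts) + (m - 1) / (2 * n * np.log(2))


rng = np.random.default_rng(0)
samples = rng.integers(0, 10, size=20)        # 20 draws, uniform over 10 symbols
counts = np.bincount(samples, minlength=10)

print(f'True entropy:        {np.log2(10):.4f} bits')
print(f'Plug-in estimate:    {plugin_entropy_bits(counts):.4f} bits')
print(f'Miller-Madow:        {miller_madow_entropy_bits(counts):.4f} bits')
```

The correction only removes the leading bias term, so for very small samples the corrected estimate can still miss the true value noticeably.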

In [None]:
def entropy_estimation_bias() -> None:
    """Demonstrate bias of plug-in entropy estimator."""
    rng = np.random.RandomState(SEED)
    true_k = 10
    true_p = np.ones(true_k) / true_k  # Uniform
    true_entropy = np.log2(true_k)

    sample_sizes = [5, 10, 20, 50, 100, 200, 500, 1000, 5000]
    n_trials = 200

    means = []
    stds = []

    for n in sample_sizes:
        estimates = []
        for _ in range(n_trials):
            samples = rng.choice(true_k, size=n, p=true_p)
            counts = np.bincount(samples, minlength=true_k)
            freq = counts / counts.sum()
            h_est = shannon_entropy(freq)
            estimates.append(h_est)
        means.append(np.mean(estimates))
        stds.append(np.std(estimates))

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    axes[0].errorbar(sample_sizes, means, yerr=stds, fmt='o-',
                      color=COLORS['blue'], capsize=4, linewidth=2,
                      label='Estimated H (mean ± std)')
    axes[0].axhline(true_entropy, color=COLORS['red'], linestyle='--',
                     linewidth=2, label=f'True H = log₂({true_k}) = {true_entropy:.4f}')
    axes[0].set_xlabel('Sample Size')
    axes[0].set_ylabel('Entropy (bits)')
    axes[0].set_title('Plug-in Entropy Estimator Bias')
    axes[0].set_xscale('log')
    axes[0].legend()
    axes[0].grid(True, alpha=0.2)

    # Underestimation (true H minus mean estimate) as a function of sample size.
    # The bias of the plug-in estimator is the negative of this quantity.
    underestimates = [true_entropy - m for m in means]
    axes[1].plot(sample_sizes, underestimates, 'o-', color=COLORS['green'],
                  linewidth=2, label='True H - mean plug-in estimate')
    axes[1].axhline(0, color=COLORS['gray'], linestyle='--')
    axes[1].set_xlabel('Sample Size')
    axes[1].set_ylabel('Underestimation (bits)')
    axes[1].set_title('Negative Bias of Plug-in Estimator')
    axes[1].set_xscale('log')
    axes[1].grid(True, alpha=0.2)

    # Miller-Madow correction: (K-1)/(2n) nats = (K-1)/(2n ln 2) bits.
    # Using the true K is an approximation; the estimator proper uses the
    # number of non-empty bins observed in each sample.
    mm_bits = [(true_k - 1) / (2 * n * np.log(2)) for n in sample_sizes]
    axes[1].plot(sample_sizes, mm_bits, 's--', color=COLORS['orange'],
                  label='Miller-Madow correction')
    axes[1].legend()

    plt.tight_layout()
    plt.show()

    print(f'True entropy: {true_entropy:.4f} bits')
    print(f'Underestimation at n=10:   {underestimates[1]:.4f} bits')
    print(f'Underestimation at n=1000: {underestimates[7]:.4f} bits')
    print(f'Miller-Madow correction at n=10: +{mm_bits[1]:.4f} bits')


entropy_estimation_bias()

### 4.3 Data Processing Inequality

The **data processing inequality** states that processing data cannot increase information:

$$X \to Y \to Z \implies I(X; Z) \leq I(X; Y)$$

Here $X \to Y \to Z$ denotes a Markov chain: $Z$ depends on $X$ only through $Y$. This is fundamental to understanding why later layers of a deep network cannot carry more information about the input than earlier layers, and it motivates the information bottleneck theory.
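The inequality can also be checked on exact distributions, without sampling noise. The sketch below builds a small Markov chain from made-up transition matrices; `mi_bits` is a hypothetical helper, and the channel probabilities are assumptions chosen only for illustration.

```python
import numpy as np


def mi_bits(joint: np.ndarray) -> float:
    """I(X;Y) in bits from a joint probability table (zero cells contribute 0)."""
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_x * p_y)[mask])))


p_x = np.array([0.4, 0.3, 0.2, 0.1])
# Channel X -> Y: keep the symbol with prob. 0.85, else each other symbol with prob. 0.05
p_y_given_x = np.full((4, 4), 0.05) + 0.80 * np.eye(4)
# Processing Y -> Z: deterministic merge of {0, 1} -> 0 and {2, 3} -> 1
p_z_given_y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

joint_xy = p_x[:, None] * p_y_given_x   # p(x, y) = p(x) p(y | x)
joint_xz = joint_xy @ p_z_given_y       # p(x, z) = sum_y p(x, y) p(z | y)

mi_xy = mi_bits(joint_xy)
mi_xz = mi_bits(joint_xz)
print(f'I(X;Y) = {mi_xy:.4f} bits, I(X;Z) = {mi_xz:.4f} bits')
```

Because `joint_xz` is obtained from `joint_xy` purely through `p_z_given_y`, the Markov property holds by construction and `mi_xz` cannot exceed `mi_xy`.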

In [None]:
def demonstrate_data_processing_inequality() -> None:
    """Empirically verify the data processing inequality."""
    rng = np.random.RandomState(SEED)

    # X: original signal (4 classes)
    n_samples = 10000
    x = rng.choice(4, size=n_samples, p=[0.4, 0.3, 0.2, 0.1])

    # Y = noisy observation of X
    noise_levels = [0.0, 0.05, 0.1, 0.2, 0.3, 0.5]
    mi_xy_list = []
    mi_xz_list = []

    for noise in noise_levels:
        # Y: noisy version of X. With probability 'noise' the value is replaced by a
        # uniformly random class, which may coincide with the original value.
        flip_mask = rng.random(n_samples) < noise
        y = x.copy()
        y[flip_mask] = rng.choice(4, size=flip_mask.sum())

        # Z: further processed Y (quantize to 2 classes)
        z = (y >= 2).astype(int)

        # Estimate MI(X; Y)
        joint_xy = np.zeros((4, 4))
        for i in range(n_samples):
            joint_xy[x[i], y[i]] += 1
        joint_xy = (joint_xy + 1e-10) / joint_xy.sum()
        mi_xy = mutual_information_direct(joint_xy)

        # Estimate MI(X; Z)
        joint_xz = np.zeros((4, 2))
        for i in range(n_samples):
            joint_xz[x[i], z[i]] += 1
        joint_xz = (joint_xz + 1e-10) / joint_xz.sum()
        mi_xz = mutual_information_direct(joint_xz)

        mi_xy_list.append(mi_xy)
        mi_xz_list.append(mi_xz)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(noise_levels, mi_xy_list, 'o-', color=COLORS['blue'],
             linewidth=2, markersize=8, label='I(X; Y) — noisy copy')
    ax.plot(noise_levels, mi_xz_list, 's-', color=COLORS['red'],
             linewidth=2, markersize=8, label='I(X; Z) — quantized Y')
    ax.fill_between(noise_levels, mi_xz_list, mi_xy_list,
                     alpha=0.15, color=COLORS['gray'])
    ax.set_xlabel('Noise Level', fontsize=12)
    ax.set_ylabel('Mutual Information (bits)', fontsize=12)
    ax.set_title('Data Processing Inequality: I(X;Z) ≤ I(X;Y)')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.2)
    plt.tight_layout()
    plt.show()

    # Verify inequality holds for all noise levels
    for noise, mi_xy_val, mi_xz_val in zip(noise_levels, mi_xy_list, mi_xz_list):
        violation = mi_xz_val > mi_xy_val + 1e-10
        status = 'VIOLATED!' if violation else 'OK'
        print(f'Noise={noise:.2f}: I(X;Y)={mi_xy_val:.4f} >= I(X;Z)={mi_xz_val:.4f} [{status}]')


demonstrate_data_processing_inequality()

### 4.4 Benchmarks: Our Implementations vs Library

We compare the speed and accuracy of our implementations against SciPy's.
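Micro-benchmarks of this kind are noisy, so it can help to repeat each measurement and keep the fastest run. The sketch below does this with the standard-library `timeit` module; `entropy_nats` and `best_time_us` are hypothetical helpers for this example and are separate from the benchmark cell that follows.

```python
import timeit

import numpy as np


def entropy_nats(p: np.ndarray) -> float:
    """Shannon entropy in nats for a strictly positive distribution."""
    return float(-np.sum(p * np.log(p)))


def best_time_us(func, *args, number: int = 200, repeat: int = 5) -> float:
    """Best-of-`repeat` mean time per call, in microseconds."""
    timings = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(timings) / number * 1e6


rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(1000))
print(f'entropy_nats, K=1000: {best_time_us(entropy_nats, p):.1f} us per call')
```

Taking the minimum over repeats reduces the influence of background activity on the machine; the absolute numbers still depend on hardware and library versions.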

In [None]:
import time


def benchmark_implementations() -> None:
    """Benchmark our information theory functions against scipy."""
    rng = np.random.RandomState(SEED)
    results = []

    # Generate test distributions
    sizes = [10, 100, 1000, 10000]
    n_repeats = 1000

    for k in sizes:
        p = rng.dirichlet(np.ones(k))
        q = rng.dirichlet(np.ones(k))

        # Entropy benchmark
        start = time.perf_counter()
        for _ in range(n_repeats):
            h_ours = shannon_entropy(p, base=np.e)
        our_time = (time.perf_counter() - start) / n_repeats * 1e6

        start = time.perf_counter()
        for _ in range(n_repeats):
            h_scipy = sp_stats.entropy(p)
        scipy_time = (time.perf_counter() - start) / n_repeats * 1e6

        diff = abs(h_ours - h_scipy)
        results.append({
            'Function': 'Entropy', 'K': k,
            'Ours (μs)': f'{our_time:.1f}',
            'SciPy (μs)': f'{scipy_time:.1f}',
            'Max Diff': f'{diff:.2e}',
        })

        # KL divergence benchmark
        start = time.perf_counter()
        for _ in range(n_repeats):
            kl_ours = kl_divergence(p, q, base=np.e)
        our_time = (time.perf_counter() - start) / n_repeats * 1e6

        start = time.perf_counter()
        for _ in range(n_repeats):
            kl_scipy = sp_stats.entropy(p, q)  # scipy uses natural log
        scipy_time = (time.perf_counter() - start) / n_repeats * 1e6

        diff = abs(kl_ours - kl_scipy)
        results.append({
            'Function': 'KL Divergence', 'K': k,
            'Ours (μs)': f'{our_time:.1f}',
            'SciPy (μs)': f'{scipy_time:.1f}',
            'Max Diff': f'{diff:.2e}',
        })

    results_df = pd.DataFrame(results)
    print(results_df.to_string(index=False))


benchmark_implementations()

### 4.5 Common Information Theory Mistakes in ML

Let's illustrate pitfalls that practitioners frequently encounter.

In [None]:
def common_mistakes() -> None:
    """Demonstrate common information theory mistakes."""
    print('=== Mistake 1: Treating KL divergence as a symmetric distance ===')
    p = np.array([0.9, 0.05, 0.05])
    q = np.array([0.33, 0.34, 0.33])
    print(f'  p = {p}')
    print(f'  q = {q}')
    print(f'  KL(p||q) = {kl_divergence(p, q):.4f} bits')
    print(f'  KL(q||p) = {kl_divergence(q, p):.4f} bits')
    print(f'  Ratio: {kl_divergence(p, q) / kl_divergence(q, p):.1f}x')
    print(f'  Use JS divergence if you need symmetry: JS = {js_divergence(p, q):.4f}')
    print()

    print('=== Mistake 2: Ignoring numerical stability in log probabilities ===')
    pred_probs = np.array([1.0, 0.0, 0.0])  # Overconfident prediction
    true_label = 1  # But the true label is class 1!
    # Without clipping: log(0) = -inf
    # With clipping: small but finite loss
    eps = 1e-15
    clipped = np.clip(pred_probs, eps, 1.0)
    loss = -np.log(clipped[true_label])
    print(f'  Overconfident wrong prediction: P(correct) = {pred_probs[true_label]}')
    print(f'  Without clipping: loss = -log(0) = inf')
    print(f'  With clipping (eps={eps}): loss = {loss:.2f}')
    print(f'  Lesson: Always use log-softmax or clip probabilities!\n')

    print('=== Mistake 3: Confusing bits and nats ===')
    p = np.array([0.5, 0.3, 0.2])
    h_bits = shannon_entropy(p, base=2.0)
    h_nats = shannon_entropy(p, base=np.e)
    print(f'  H(p) in bits: {h_bits:.4f}')
    print(f'  H(p) in nats: {h_nats:.4f}')
    print(f'  Conversion: bits × ln(2) = nats: {h_bits * np.log(2):.4f} = {h_nats:.4f}')
    print(f'  PyTorch uses nats (natural log); most textbooks use bits (log₂)\n')

    print('=== Mistake 4: MI from small samples is biased upward ===')
    rng = np.random.RandomState(SEED)
    # Independent variables
    x = rng.choice(5, size=20)
    y = rng.choice(5, size=20)
    joint = np.zeros((5, 5))
    for i in range(20):
        joint[x[i], y[i]] += 1
    joint_norm = (joint + 1e-10) / joint.sum()
    mi_small = mutual_information_direct(joint_norm)

    # Same with large sample
    x_large = rng.choice(5, size=100000)
    y_large = rng.choice(5, size=100000)
    joint_large = np.zeros((5, 5))
    for i in range(100000):
        joint_large[x_large[i], y_large[i]] += 1
    joint_large_norm = (joint_large + 1e-10) / joint_large.sum()
    mi_large = mutual_information_direct(joint_large_norm)

    print(f'  Independent variables with n=20:     MI = {mi_small:.4f} (should be ~0)')
    print(f'  Independent variables with n=100000: MI = {mi_large:.6f} (closer to 0)')
    print(f'  Small samples overestimate MI!')


common_mistakes()

### 4.6 Error Analysis: When Information Measures Mislead

Information-theoretic measures have failure modes that can mislead practitioners. Let's examine specific cases.
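The support-mismatch case is easy to reproduce by hand. In the sketch below, `kl_bits` and `js_bits` are hypothetical helpers that do not clip probabilities, so KL is reported as infinite; this is an assumption made for illustration and may differ from the notebook's `kl_divergence`, which appears to clip its inputs.

```python
import numpy as np


def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) in bits; returns inf if q = 0 somewhere p > 0."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return float('inf')
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def js_bits(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence in bits, via the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])   # disjoint support

print(f'KL(p||q) = {kl_bits(p, q)}')
print(f'JS(p,q)  = {js_bits(p, q):.4f} bits')
```

JS stays finite because the mixture `m` is positive wherever `p` or `q` is, and for disjoint supports it reaches its maximum of 1 bit.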

In [None]:
def error_analysis() -> None:
    """Analyze failure cases of information-theoretic measures."""
    rng = np.random.RandomState(SEED)

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Case 1: KL divergence explodes when supports don't overlap
    p1 = np.array([0.5, 0.5, 0.0, 0.0])
    q1_cases = [
        ('Full overlap', np.array([0.4, 0.4, 0.1, 0.1])),
        ('Partial overlap', np.array([0.1, 0.1, 0.4, 0.4])),
        ('No overlap (clipped)', np.array([0.0, 0.0, 0.5, 0.5])),
    ]

    kl_vals = []
    js_vals_case1 = []
    labels_case1 = []
    for name, q_case in q1_cases:
        kl_val = kl_divergence(p1, q_case)
        js_val = js_divergence(p1, q_case)
        kl_vals.append(kl_val)
        js_vals_case1.append(js_val)
        labels_case1.append(name)

    x_pos = np.arange(len(labels_case1))
    axes[0].bar(x_pos - 0.15, kl_vals, width=0.3, color=COLORS['blue'],
                 label='KL(p||q)', alpha=0.7)
    axes[0].bar(x_pos + 0.15, js_vals_case1, width=0.3, color=COLORS['green'],
                 label='JS(p,q)', alpha=0.7)
    axes[0].set_xticks(x_pos)
    axes[0].set_xticklabels(labels_case1, fontsize=9)
    axes[0].set_ylabel('Divergence (bits)')
    axes[0].set_title('Support Mismatch Problem')
    axes[0].legend()

    # Case 2: Cross-entropy sensitive to label noise
    n_test = 1000
    true_labels = rng.randint(0, 5, n_test)
    logits = rng.randn(n_test, 5) * 0.5
    logits[np.arange(n_test), true_labels] += 3.0  # boost the correct class for every sample

    noise_rates = np.linspace(0, 0.5, 20)
    ce_values = []
    for noise_rate in noise_rates:
        noisy_labels = true_labels.copy()
        n_flip = int(noise_rate * n_test)
        flip_idx = rng.choice(n_test, n_flip, replace=False)
        noisy_labels[flip_idx] = rng.randint(0, 5, n_flip)
        ce = categorical_cross_entropy_loss(noisy_labels, logits)
        ce_values.append(ce)

    axes[1].plot(noise_rates, ce_values, 'o-', color=COLORS['red'], linewidth=2)
    axes[1].set_xlabel('Label Noise Rate')
    axes[1].set_ylabel('Cross-Entropy Loss')
    axes[1].set_title('CE Sensitivity to Label Noise')
    axes[1].grid(True, alpha=0.2)

    # Case 3: MI estimate depends on the number of histogram bins
    x_cont = rng.randn(500)
    y_cont = x_cont + rng.randn(500) * 0.5  # Correlated
    bin_counts = [3, 5, 10, 20, 30, 50, 100]
    mi_by_bins = []
    for n_b in bin_counts:
        x_d = np.digitize(x_cont, np.linspace(x_cont.min(), x_cont.max(), n_b + 1)[1:-1])
        y_d = np.digitize(y_cont, np.linspace(y_cont.min(), y_cont.max(), n_b + 1)[1:-1])
        joint = np.zeros((n_b, n_b))
        for i in range(len(x_cont)):
            joint[min(x_d[i], n_b - 1), min(y_d[i], n_b - 1)] += 1
        joint = (joint + 1e-10) / joint.sum()
        mi_by_bins.append(mutual_information_direct(joint))

    axes[2].plot(bin_counts, mi_by_bins, 'o-', color=COLORS['purple'], linewidth=2)
    axes[2].set_xlabel('Number of Bins')
    axes[2].set_ylabel('Estimated MI (bits)')
    axes[2].set_title('MI Estimate vs Binning Choice\n(true MI is constant)')
    axes[2].grid(True, alpha=0.2)

    plt.suptitle('Error Analysis: Information Theory Failure Modes',
                  fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    print('Key failure modes:')
    print('  1. KL divergence → ∞ when supports do not overlap (JS stays bounded)')
    print('  2. Cross-entropy is highly sensitive to label noise')
    print('  3. Histogram-based MI estimates depend heavily on bin count')


error_analysis()

---

## Part 5 — Summary & Lessons Learned

### Key Takeaways

- **Shannon entropy** $H(X) = -\sum p(x) \log p(x)$ measures the average surprise (uncertainty) in a random variable. It is maximized by the uniform distribution and equals 0 for deterministic variables.
- **Cross-entropy** $H(p, q) = -\sum p(x) \log q(x)$ is the standard classification loss. Minimizing it is equivalent to minimizing KL divergence between true labels and model predictions.
- **KL divergence** $D_{\text{KL}}(p \| q)$ measures distributional mismatch. It is non-negative, asymmetric, and appears in VAE losses, knowledge distillation, and policy optimization. Use JS divergence when you need a symmetric alternative.
- **Mutual information** $I(X; Y)$ quantifies how much one variable tells you about another. It is the gold standard for feature relevance but suffers from positive bias in finite samples.
- The **data processing inequality** ($I(X; Z) \leq I(X; Y)$ for $X \to Y \to Z$) explains why deep networks must prioritize preserving relevant information through their layers.

### What's Next

→ **01-09 (Calculus & Optimization Foundations)** builds on the loss function perspective developed here, showing how gradients optimize cross-entropy and other objectives.