# Probability and Distributions

**Course:** Mathematics for Machine Learning  
**Instructor:** Mohammed Alnemari

---

## What You'll Learn

This notebook covers probability distributions and related concepts essential for machine learning:

1. **Discrete Distributions** - Bernoulli, Binomial, Geometric
2. **Continuous Distributions** - Uniform, Exponential, Gaussian
3. **Bayes' Theorem** - Prior, likelihood, and posterior
4. **Joint and Marginal Distributions** - 2D distributions and marginalization
5. **Covariance and Correlation** - Measuring linear relationships
6. **Gaussian Distribution** - Univariate and multivariate
7. **Central Limit Theorem** - Why the Gaussian is everywhere

---

## Google Colab Ready!

This notebook runs in Google Colab as-is: NumPy, Matplotlib, and SciPy are all pre-installed there.

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Plotting settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

np.random.seed(42)
print("Libraries imported successfully!")

---

# Part 1: Discrete Distributions

Discrete random variables take on countable values. We study three foundational discrete distributions.

## 1.1 Bernoulli Distribution

A single trial with two outcomes: success (1) with probability $p$, or failure (0) with probability $1 - p$.

$$P(X = k) = p^k (1 - p)^{1-k}, \quad k \in \{0, 1\}$$

- Mean: $\mu = p$
- Variance: $\sigma^2 = p(1-p)$
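
Both moments follow directly from the PMF. Because $X$ only takes the values 0 and 1, we have $X^2 = X$, so:

$$E[X] = 0 \cdot (1 - p) + 1 \cdot p = p$$

$$\text{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$$

For example, $p = 0.7$ gives a mean of $0.7$ and a variance of $0.7 \times 0.3 = 0.21$.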

In [None]:
# Bernoulli Distribution
p = 0.7
bernoulli = stats.bernoulli(p)

# PMF values
k_values = [0, 1]
pmf_values = bernoulli.pmf(k_values)

print(f"Bernoulli(p={p})")
print(f"P(X=0) = {pmf_values[0]:.4f}")
print(f"P(X=1) = {pmf_values[1]:.4f}")
print()

# Verify mean and variance against formulas
print(f"Mean (scipy):   {bernoulli.mean():.4f}")
print(f"Mean (formula): {p:.4f}")
print(f"Variance (scipy):   {bernoulli.var():.4f}")
print(f"Variance (formula): {p * (1 - p):.4f}")

# Plot
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(k_values, pmf_values, color=['#e74c3c', '#2ecc71'], edgecolor='black', width=0.4)
ax.set_xticks(k_values)
ax.set_xticklabels(['Failure (0)', 'Success (1)'])
ax.set_ylabel('Probability')
ax.set_title(f'Bernoulli Distribution (p = {p})')
ax.set_ylim(0, 1)
for i, v in enumerate(pmf_values):
    ax.text(i, v + 0.02, f'{v:.2f}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

## 1.2 Binomial Distribution

The number of successes in $n$ independent Bernoulli trials, each with success probability $p$.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n$$

- Mean: $\mu = np$
- Variance: $\sigma^2 = np(1-p)$
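
As a quick sanity check, the PMF can be evaluated straight from the formula. The sketch below is illustrative only: the helper `binomial_pmf` and the values `n = 10`, `p = 0.5` are assumptions made for this example. It confirms that the probabilities sum to 1 and that the mean and variance computed from the PMF match $np$ and $np(1-p)$.

```python
import math

import numpy as np


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed directly from the formula."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # illustrative values
k_values = np.arange(n + 1)
pmf = np.array([binomial_pmf(int(k), n, p) for k in k_values])

total = pmf.sum()                                 # a valid PMF sums to 1
mean = (k_values * pmf).sum()                     # should equal n * p
variance = ((k_values - mean) ** 2 * pmf).sum()   # should equal n * p * (1 - p)

print(f"sum = {total:.4f}, mean = {mean:.4f}, variance = {variance:.4f}")
```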

In [None]:
# Binomial Distribution for different parameters
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

params = [(10, 0.5), (20, 0.3), (20, 0.7)]

for ax, (n, p) in zip(axes, params):
    binom = stats.binom(n, p)
    k = np.arange(0, n + 1)
    pmf = binom.pmf(k)

    ax.bar(k, pmf, color='steelblue', edgecolor='black', alpha=0.8)
    ax.set_xlabel('k')
    ax.set_ylabel('P(X = k)')
    ax.set_title(f'Binomial(n={n}, p={p})')

    # Verify mean and variance
    print(f"Binomial(n={n}, p={p}):")
    print(f"  Mean  -> scipy: {binom.mean():.4f}, formula (np): {n * p:.4f}")
    print(f"  Var   -> scipy: {binom.var():.4f}, formula (np(1-p)): {n * p * (1 - p):.4f}")

plt.tight_layout()
plt.show()

## 1.3 Geometric Distribution

The number of independent Bernoulli($p$) trials needed to get the first success.

$$P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \dots$$

- Mean: $\mu = 1/p$
- Variance: $\sigma^2 = (1-p)/p^2$
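
The definition can also be simulated literally: repeat Bernoulli trials until the first success and record how many were needed. This is a minimal sketch under assumed settings: the helper name `trials_until_first_success`, the seed, and the sample size are all illustrative. It uses a local `np.random.default_rng` generator so that the notebook's global seed is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed is an arbitrary choice


def trials_until_first_success(p: float, rng: np.random.Generator) -> int:
    """Run Bernoulli(p) trials and return the index of the first success (k >= 1)."""
    k = 1
    while rng.random() >= p:  # a draw below p counts as a success
        k += 1
    return k


p = 0.2
draws = np.array([trials_until_first_success(p, rng) for _ in range(20_000)])

print(f"simulated mean:     {draws.mean():.3f}  (formula 1/p = {1 / p:.3f})")
print(f"simulated variance: {draws.var():.3f}  (formula (1-p)/p^2 = {(1 - p) / p**2:.3f})")
```

The simulated values should land close to the formulas, but not exactly on them, because they are estimates from a finite sample.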

In [None]:
# Geometric Distribution
fig, ax = plt.subplots(figsize=(10, 5))

colors = ['#e74c3c', '#3498db', '#2ecc71']
p_values = [0.2, 0.5, 0.8]

for p, color in zip(p_values, colors):
    geom = stats.geom(p)
    k = np.arange(1, 16)
    pmf = geom.pmf(k)

    ax.plot(k, pmf, 'o-', color=color, label=f'p = {p}', markersize=6)

    # Verify mean and variance
    print(f"Geometric(p={p}):")
    print(f"  Mean  -> scipy: {geom.mean():.4f}, formula (1/p): {1 / p:.4f}")
    print(f"  Var   -> scipy: {geom.var():.4f}, formula ((1-p)/p^2): {(1 - p) / p**2:.4f}")

ax.set_xlabel('k (number of trials)')
ax.set_ylabel('P(X = k)')
ax.set_title('Geometric Distribution PMF')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

# Part 2: Continuous Distributions

Continuous random variables can take any value in an interval. We describe them using probability density functions (PDFs).

## 2.1 Uniform Distribution

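The density is constant on the interval $[a, b]$, so sub-intervals of equal length are equally likely.

$$f(x) = \frac{1}{b - a}, \quad a \le x \le b$$

- Mean: $\mu = (a + b)/2$
- Variance: $\sigma^2 = (b - a)^2/12$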

In [None]:
# Uniform Distribution
a, b = 2, 8
uniform = stats.uniform(loc=a, scale=b - a)  # scipy parameterizes as (loc, scale)

x = np.linspace(0, 10, 500)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# PDF
axes[0].plot(x, uniform.pdf(x), 'b-', linewidth=2)
axes[0].fill_between(x, uniform.pdf(x), alpha=0.3)
axes[0].set_title(f'Uniform PDF on [{a}, {b}]')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_ylim(0, 0.25)
axes[0].grid(True, alpha=0.3)

# CDF
axes[1].plot(x, uniform.cdf(x), 'r-', linewidth=2)
axes[1].set_title(f'Uniform CDF on [{a}, {b}]')
axes[1].set_xlabel('x')
axes[1].set_ylabel('F(x)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Sample and compare with theoretical PDF
samples = uniform.rvs(size=5000)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(samples, bins=40, density=True, alpha=0.6, color='steelblue', edgecolor='black', label='Histogram of samples')
ax.plot(x, uniform.pdf(x), 'r-', linewidth=2, label='Theoretical PDF')
ax.set_title('Uniform Distribution: Samples vs Theoretical PDF')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Mean  -> scipy: {uniform.mean():.4f}, formula ((a+b)/2): {(a + b) / 2:.4f}")
print(f"Var   -> scipy: {uniform.var():.4f}, formula ((b-a)^2/12): {(b - a)**2 / 12:.4f}")

## 2.2 Exponential Distribution

Models the waiting time between consecutive events in a Poisson process with rate $\lambda$.

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$

- Mean: $\mu = 1/\lambda$
- Variance: $\sigma^2 = 1/\lambda^2$
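
Integrating the density from $0$ to $x$ gives the CDF $F(x) = 1 - e^{-\lambda x}$. The sketch below checks both formulas against SciPy at a few points; the rate `lam = 1.5` and the evaluation points are arbitrary choices for illustration. Note that `stats.expon` is parameterized by `scale = 1/lambda`, not by the rate itself.

```python
import numpy as np
from scipy import stats

lam = 1.5                                  # rate parameter (illustrative value)
x = np.array([0.0, 0.5, 1.0, 2.0])

pdf_formula = lam * np.exp(-lam * x)       # f(x) = lambda * exp(-lambda * x)
cdf_formula = 1 - np.exp(-lam * x)         # F(x) = 1 - exp(-lambda * x)

dist = stats.expon(scale=1 / lam)          # SciPy expects scale = 1 / lambda
pdf_matches = np.allclose(pdf_formula, dist.pdf(x))
cdf_matches = np.allclose(cdf_formula, dist.cdf(x))

print(pdf_matches, cdf_matches)
```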

In [None]:
# Exponential Distribution
x = np.linspace(0, 8, 500)
lambdas = [0.5, 1.0, 2.0]
colors = ['#e74c3c', '#3498db', '#2ecc71']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for lam, color in zip(lambdas, colors):
    exp_dist = stats.expon(scale=1 / lam)  # scipy uses scale = 1/lambda
    axes[0].plot(x, exp_dist.pdf(x), color=color, linewidth=2, label=f'$\\lambda$ = {lam}')
    axes[1].plot(x, exp_dist.cdf(x), color=color, linewidth=2, label=f'$\\lambda$ = {lam}')

axes[0].set_title('Exponential PDF')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_title('Exponential CDF')
axes[1].set_xlabel('x')
axes[1].set_ylabel('F(x)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Sample and compare
lam = 1.5
exp_dist = stats.expon(scale=1 / lam)
samples = exp_dist.rvs(size=5000)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(samples, bins=50, density=True, alpha=0.6, color='steelblue', edgecolor='black', label='Histogram of samples')
ax.plot(x, exp_dist.pdf(x), 'r-', linewidth=2, label=f'Theoretical PDF ($\\lambda$ = {lam})')
ax.set_title('Exponential Distribution: Samples vs Theoretical PDF')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Mean  -> scipy: {exp_dist.mean():.4f}, formula (1/lambda): {1 / lam:.4f}")
print(f"Var   -> scipy: {exp_dist.var():.4f}, formula (1/lambda^2): {1 / lam**2:.4f}")

## 2.3 Gaussian (Normal) Distribution

The most widely used continuous distribution in statistics and machine learning, fully specified by its mean $\mu$ and variance $\sigma^2$.

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
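
To connect the formula to the library call, here is a minimal sketch that evaluates the density by hand and compares it with `stats.norm.pdf`. The function name `gaussian_pdf` and the parameter values are illustrative assumptions. One detail worth noticing: SciPy's `scale` argument is the standard deviation $\sigma$, not the variance $\sigma^2$.

```python
import numpy as np
from scipy import stats


def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)


mu, sigma = 3.0, 1.5
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)

matches_scipy = np.allclose(gaussian_pdf(x, mu, sigma),
                            stats.norm.pdf(x, loc=mu, scale=sigma))
peak = gaussian_pdf(mu, mu, sigma)   # maximum density, attained at x = mu

print(matches_scipy, round(float(peak), 4))
```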

In [None]:
# Gaussian Distribution
x = np.linspace(-8, 12, 500)

gaussians = [
    (0, 1, 'Standard Normal'),
    (2, 0.5, '$\\mu=2, \\sigma=0.5$'),
    (-1, 2, '$\\mu=-1, \\sigma=2$'),
]
colors = ['#e74c3c', '#3498db', '#2ecc71']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for (mu, sigma, label), color in zip(gaussians, colors):
    norm_dist = stats.norm(loc=mu, scale=sigma)
    axes[0].plot(x, norm_dist.pdf(x), color=color, linewidth=2, label=label)
    axes[1].plot(x, norm_dist.cdf(x), color=color, linewidth=2, label=label)

axes[0].set_title('Gaussian PDF')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_title('Gaussian CDF')
axes[1].set_xlabel('x')
axes[1].set_ylabel('F(x)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Sample and compare
mu, sigma = 3, 1.5
norm_dist = stats.norm(loc=mu, scale=sigma)
samples = norm_dist.rvs(size=5000)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(samples, bins=50, density=True, alpha=0.6, color='steelblue', edgecolor='black', label='Histogram of samples')
x_plot = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 500)
ax.plot(x_plot, norm_dist.pdf(x_plot), 'r-', linewidth=2, label=f'Theoretical PDF ($\\mu$={mu}, $\\sigma$={sigma})')
ax.set_title('Gaussian Distribution: Samples vs Theoretical PDF')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

# Part 3: Bayes' Theorem

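Bayes' theorem lets us update our beliefs given new evidence:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

Here $P(A)$ is the **prior**, $P(B \mid A)$ is the **likelihood**, $P(B)$ is the evidence (a normalizing constant), and $P(A \mid B)$ is the **posterior**.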

We illustrate this with a classic **medical test** example.

## 3.1 Medical Test Example

Suppose:
- **Prevalence** (prior probability of disease): $P(D) = 0.01$ (1% of the population has the disease)
- **Sensitivity** (true positive rate): $P(+|D) = 0.95$ (test correctly detects 95% of sick people)
- **Specificity** (true negative rate): $P(-|\neg D) = 0.90$ (test correctly identifies 90% of healthy people)

**Question:** If a person tests positive, what is the probability they actually have the disease?
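
One way to build intuition before applying the formula is to count people in a hypothetical cohort (often called natural frequencies). The sketch below assumes a cohort of 10,000 people; the cohort size and variable names are illustrative choices made for this example, while the three rates are the ones given above.

```python
population = 10_000                  # hypothetical cohort size, chosen for round numbers

prevalence = 0.01
sensitivity = 0.95
specificity = 0.90

sick = population * prevalence                   # about 100 people have the disease
healthy = population - sick                      # about 9,900 do not
true_positives = sick * sensitivity              # about 95 sick people test positive
false_positives = healthy * (1 - specificity)    # about 990 healthy people test positive

# Among everyone who tests positive, the fraction who are actually sick
posterior = true_positives / (true_positives + false_positives)
print(f"P(D | +) = {true_positives:.0f} / {true_positives + false_positives:.0f} = {posterior:.4f}")
```

The false positives outnumber the true positives roughly ten to one, simply because healthy people are so much more common than sick ones.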

In [None]:
# Bayes' Theorem - Medical Test Example

# Given probabilities
P_disease = 0.01           # Prevalence (prior)
P_no_disease = 1 - P_disease
P_pos_given_disease = 0.95  # Sensitivity
P_neg_given_no_disease = 0.90  # Specificity
P_pos_given_no_disease = 1 - P_neg_given_no_disease  # False positive rate

# Total probability of testing positive (Law of Total Probability)
P_pos = P_pos_given_disease * P_disease + P_pos_given_no_disease * P_no_disease

# Bayes' Theorem: P(Disease | Positive)
P_disease_given_pos = (P_pos_given_disease * P_disease) / P_pos

print("=== Medical Test - Bayes' Theorem ===")
print(f"Prior P(Disease)          = {P_disease:.4f}")
print(f"Sensitivity P(+|D)        = {P_pos_given_disease:.4f}")
print(f"Specificity P(-|~D)       = {P_neg_given_no_disease:.4f}")
print(f"False positive rate P(+|~D) = {P_pos_given_no_disease:.4f}")
print(f"P(+)                      = {P_pos:.4f}")
print()
print(f"Posterior P(Disease|+)    = {P_disease_given_pos:.4f}")
print()
print("Despite a positive test, the probability that the patient actually")
print(f"has the disease is only {P_disease_given_pos:.1%}.")

## 3.2 Visualizing Prior vs Posterior

In [None]:
# Visualize prior vs posterior
categories = ['Has Disease', 'No Disease']
prior = [P_disease, P_no_disease]
posterior = [P_disease_given_pos, 1 - P_disease_given_pos]

x_pos = np.arange(len(categories))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar(x_pos - width / 2, prior, width, label='Prior (before test)', color='#3498db', edgecolor='black')
bars2 = ax.bar(x_pos + width / 2, posterior, width, label='Posterior (after positive test)', color='#e74c3c', edgecolor='black')

ax.set_ylabel('Probability')
ax.set_title('Prior vs Posterior Probability after Positive Test')
ax.set_xticks(x_pos)
ax.set_xticklabels(categories)
ax.legend()

# Add value labels on bars
for bar in bars1:
    h = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, h + 0.01, f'{h:.4f}', ha='center', fontsize=10)
for bar in bars2:
    h = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, h + 0.01, f'{h:.4f}', ha='center', fontsize=10)

ax.set_ylim(0, 1.15)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# How posterior changes with prevalence
prevalences = np.linspace(0.001, 0.5, 200)
posteriors = (P_pos_given_disease * prevalences) / (
    P_pos_given_disease * prevalences + P_pos_given_no_disease * (1 - prevalences)
)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(prevalences, posteriors, 'b-', linewidth=2)
ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='P = 0.5')
ax.axvline(x=P_disease, color='red', linestyle='--', alpha=0.5, label=f'Prevalence = {P_disease}')
ax.plot(P_disease, P_disease_given_pos, 'ro', markersize=10, zorder=5)
ax.annotate(f'  ({P_disease}, {P_disease_given_pos:.3f})', xy=(P_disease, P_disease_given_pos),
            fontsize=11, color='red')
ax.set_xlabel('Prevalence P(Disease)')
ax.set_ylabel('Posterior P(Disease | Positive Test)')
ax.set_title('Effect of Disease Prevalence on Posterior Probability')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

# Part 4: Joint and Marginal Distributions

A **joint distribution** $P(X, Y)$ gives the probability of each combination of values that two random variables $X$ and $Y$ can take together.

**Marginal distributions** are obtained by summing (discrete) or integrating (continuous) over the other variable:

$$P(X = x) = \sum_y P(X = x, Y = y)$$
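
In array terms, marginalizing means summing the joint table along the axis of the variable being removed. The sketch below uses a small hypothetical table, `joint_small`, invented for this illustration. It writes the sum once as an explicit loop and once with NumPy's `axis` argument.

```python
import numpy as np

# Hypothetical 2x3 joint table: rows index x, columns index y
joint_small = np.array([
    [0.10, 0.20, 0.10],   # x = 0
    [0.30, 0.10, 0.20],   # x = 1
])

# Marginal of X: for each x, add up the entries over every y (explicit loop version)
p_x_loop = [sum(joint_small[i, j] for j in range(joint_small.shape[1]))
            for i in range(joint_small.shape[0])]

# The same sums written as NumPy reductions
p_x = joint_small.sum(axis=1)   # collapse the y axis -> P(X)
p_y = joint_small.sum(axis=0)   # collapse the x axis -> P(Y)

print("P(X):", p_x.round(2), " P(Y):", p_y.round(2))
```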

## 4.1 Creating a Joint Distribution

In [None]:
# Create a discrete joint distribution table
# Example: X = Weather {Sunny=0, Cloudy=1, Rainy=2}
#          Y = Mood    {Happy=0, Neutral=1, Sad=2}

# Joint probability table P(X, Y)
joint = np.array([
    [0.20, 0.10, 0.02],  # Sunny
    [0.08, 0.15, 0.07],  # Cloudy
    [0.02, 0.10, 0.26],  # Rainy
])

x_labels = ['Sunny', 'Cloudy', 'Rainy']
y_labels = ['Happy', 'Neutral', 'Sad']

print("Joint Distribution P(X, Y):")
print(f"{'':>10}", end='')
for yl in y_labels:
    print(f"{yl:>10}", end='')
print()
for i, xl in enumerate(x_labels):
    print(f"{xl:>10}", end='')
    for j in range(len(y_labels)):
        print(f"{joint[i, j]:>10.2f}", end='')
    print()

print(f"\nTotal probability: {joint.sum():.2f} (should be 1.00)")

## 4.2 Computing Marginal Distributions

In [None]:
# Marginal distributions by summing over the other variable
P_X = joint.sum(axis=1)  # Sum over Y (columns) -> P(X)
P_Y = joint.sum(axis=0)  # Sum over X (rows) -> P(Y)

print("Marginal P(X) [Weather]:")
for label, prob in zip(x_labels, P_X):
    print(f"  P({label}) = {prob:.2f}")
print()

print("Marginal P(Y) [Mood]:")
for label, prob in zip(y_labels, P_Y):
    print(f"  P({label}) = {prob:.2f}")

# Verify marginals sum to 1
print(f"\nSum of P(X): {P_X.sum():.2f}")
print(f"Sum of P(Y): {P_Y.sum():.2f}")

## 4.3 Visualizing the Joint Distribution

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Heatmap of joint distribution
im = axes[0].imshow(joint, cmap='Blues', aspect='auto')
axes[0].set_xticks(range(len(y_labels)))
axes[0].set_xticklabels(y_labels)
axes[0].set_yticks(range(len(x_labels)))
axes[0].set_yticklabels(x_labels)
axes[0].set_title('Joint Distribution P(X, Y)')
axes[0].set_xlabel('Mood (Y)')
axes[0].set_ylabel('Weather (X)')
# Annotate cells
for i in range(len(x_labels)):
    for j in range(len(y_labels)):
        axes[0].text(j, i, f'{joint[i, j]:.2f}', ha='center', va='center',
                     fontsize=14, fontweight='bold',
                     color='white' if joint[i, j] > 0.15 else 'black')
plt.colorbar(im, ax=axes[0], shrink=0.8)

# Marginal P(X)
axes[1].bar(x_labels, P_X, color='#3498db', edgecolor='black')
axes[1].set_title('Marginal P(X) [Weather]')
axes[1].set_ylabel('Probability')
for i, v in enumerate(P_X):
    axes[1].text(i, v + 0.01, f'{v:.2f}', ha='center', fontweight='bold')

# Marginal P(Y)
axes[2].bar(y_labels, P_Y, color='#e74c3c', edgecolor='black')
axes[2].set_title('Marginal P(Y) [Mood]')
axes[2].set_ylabel('Probability')
for i, v in enumerate(P_Y):
    axes[2].text(i, v + 0.01, f'{v:.2f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

---

# Part 5: Covariance and Correlation

**Covariance** measures how two variables change together:

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

**Correlation** is the normalized covariance:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}, \quad -1 \le \rho \le 1$$
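
The practical difference between the two is that covariance depends on the units of $X$ and $Y$, while correlation does not. A minimal sketch, using synthetic data and an arbitrary rescaling factor of 100 (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed for reproducibility
x = rng.normal(size=2_000)
y = 0.5 * x + rng.normal(scale=0.5, size=2_000)    # y depends linearly on x, plus noise

cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Rescale x (for example, a change of units): covariance scales, correlation does not
x_scaled = 100 * x
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(f"covariance:  {cov_xy:.3f} -> {cov_scaled:.3f}")
print(f"correlation: {corr_xy:.3f} -> {corr_scaled:.3f}")
```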

## 5.1 Generating Correlated Random Variables
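
One explicit way to produce correlated Gaussian samples is through a Cholesky factor: if $\Sigma = LL^T$ and $\mathbf{z}$ has independent standard normal entries, then $L\mathbf{z}$ has covariance $\Sigma$. The sketch below illustrates that idea as an alternative to calling `np.random.multivariate_normal` directly; the seed, sample size, and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed
target_cov = np.array([[1.0, 0.8],
                       [0.8, 1.0]])

L = np.linalg.cholesky(target_cov)         # lower-triangular factor, L @ L.T == target_cov
z = rng.standard_normal((50_000, 2))       # rows of independent N(0, 1) draws
correlated = z @ L.T                       # each row is L @ z_i, so Cov = L I L^T = target_cov

print(np.cov(correlated.T).round(2))
```

The printed sample covariance should be close to `target_cov`, up to sampling noise.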

In [None]:
# Generate correlated random variables by sampling from a bivariate normal
np.random.seed(42)
n_samples = 1000

# Define the desired covariance matrix
mean = [0, 0]
cov_matrix = [[1.0, 0.8],
              [0.8, 1.0]]

# Generate samples from multivariate normal
samples = np.random.multivariate_normal(mean, cov_matrix, n_samples)
X = samples[:, 0]
Y = samples[:, 1]

print("Desired covariance matrix:")
print(np.array(cov_matrix))
print()

# Compute sample covariance matrix
sample_cov = np.cov(X, Y)
print("Sample covariance matrix (np.cov):")
print(sample_cov)
print()

# Compute sample correlation matrix
sample_corr = np.corrcoef(X, Y)
print("Sample correlation matrix (np.corrcoef):")
print(sample_corr)

## 5.2 Manual Computation of Covariance and Correlation

In [None]:
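# Manual computation with population formulas (np.mean and np.std divide by N)
cov_manual = np.mean((X - np.mean(X)) * (Y - np.mean(Y)))
# The N vs N-1 factor cancels in the ratio, so the correlation matches np.corrcoef
corr_manual = cov_manual / (np.std(X) * np.std(Y))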

print(f"Manual Cov(X, Y):  {cov_manual:.4f}")
print(f"np.cov result:     {sample_cov[0, 1]:.4f}  (uses N-1 denominator)")
print()
print(f"Manual Corr(X, Y): {corr_manual:.4f}")
print(f"np.corrcoef result: {sample_corr[0, 1]:.4f}")

## 5.3 Visualizing Different Correlations

In [None]:
# Scatter plots for different correlation values
np.random.seed(42)
correlations = [-0.9, -0.5, 0.0, 0.5, 0.9]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for ax, rho in zip(axes, correlations):
    cov_mat = [[1.0, rho],
               [rho, 1.0]]
    data = np.random.multivariate_normal([0, 0], cov_mat, 500)

    ax.scatter(data[:, 0], data[:, 1], alpha=0.4, s=10, color='steelblue')
    ax.set_title(f'$\\rho$ = {rho}')
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.set_xlabel('X')
    if rho == correlations[0]:
        ax.set_ylabel('Y')

plt.suptitle('Scatter Plots for Different Correlation Values', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---

# Part 6: Gaussian Distribution (Deep Dive)

The Gaussian distribution is central to machine learning. We explore both the univariate and multivariate cases.

## 6.1 Univariate Gaussian

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

In [None]:
# Univariate Gaussian for different parameters
x = np.linspace(-8, 12, 500)

params = [
    (0, 1, 'steelblue'),
    (0, 2, '#e74c3c'),
    (0, 0.5, '#2ecc71'),
    (3, 1, '#9b59b6'),
    (-2, 1.5, '#f39c12'),
]

fig, ax = plt.subplots(figsize=(12, 6))

for mu, sigma, color in params:
    pdf = stats.norm.pdf(x, loc=mu, scale=sigma)
    ax.plot(x, pdf, color=color, linewidth=2,
            label=f'$\\mu={mu}, \\sigma={sigma}$')

ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Univariate Gaussian for Different Parameters')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 6.2 Multivariate Gaussian

For a $d$-dimensional random vector $\mathbf{x}$:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
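
To see how each term of the formula maps to code, the sketch below evaluates the density at a single point by hand and compares it with `scipy.stats.multivariate_normal`. The helper name `mvn_pdf`, the test point, and the parameter values are illustrative assumptions. The quadratic form in the exponent is computed with `np.linalg.solve` rather than an explicit matrix inverse, which is a common numerical choice and not something the formula itself requires.

```python
import numpy as np
from scipy.stats import multivariate_normal


def mvn_pdf(x, mu, Sigma):
    """Evaluate the d-dimensional Gaussian density at a single point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    quad_form = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const


mu = [1.0, -1.0]
Sigma = [[2.0, 1.2],
         [1.2, 1.0]]
point = [0.5, 0.0]

by_hand = mvn_pdf(point, mu, Sigma)
by_scipy = multivariate_normal(mu, Sigma).pdf(point)
print(np.isclose(by_hand, by_scipy))
```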

In [None]:
# 2D Multivariate Gaussian contour plots
from scipy.stats import multivariate_normal

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Define grid
x_grid = np.linspace(-4, 4, 200)
y_grid = np.linspace(-4, 4, 200)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
pos = np.dstack((X_grid, Y_grid))

# Different covariance matrices
configs = [
    ([0, 0], [[1, 0], [0, 1]], 'Isotropic ($\\Sigma = I$)'),
    ([0, 0], [[2, 0], [0, 0.5]], 'Axis-aligned'),
    ([0, 0], [[1, 0.8], [0.8, 1]], 'Correlated ($\\rho = 0.8$)'),
]

for ax, (mean, cov, title) in zip(axes, configs):
    rv = multivariate_normal(mean, cov)
    Z = rv.pdf(pos)

    contour = ax.contourf(X_grid, Y_grid, Z, levels=20, cmap='Blues')
    ax.contour(X_grid, Y_grid, Z, levels=6, colors='navy', linewidths=0.5, alpha=0.5)
    ax.set_title(title)
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.2)
    plt.colorbar(contour, ax=ax, shrink=0.8)

plt.suptitle('2D Multivariate Gaussian Distributions', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## 6.3 Sampling from Multivariate Gaussian

In [None]:
# Sample from multivariate Gaussian and visualize
np.random.seed(42)

mean = [1, -1]
cov = [[2.0, 1.2],
       [1.2, 1.0]]

samples = np.random.multivariate_normal(mean, cov, 2000)

fig, ax = plt.subplots(figsize=(8, 8))

# Scatter plot of samples
ax.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=8, color='steelblue', label='Samples')

# Overlay theoretical contours
x_grid = np.linspace(-5, 7, 200)
y_grid = np.linspace(-5, 3, 200)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
pos = np.dstack((X_grid, Y_grid))
rv = multivariate_normal(mean, cov)
Z = rv.pdf(pos)

ax.contour(X_grid, Y_grid, Z, levels=8, colors='red', linewidths=1.5, alpha=0.8)
ax.plot(mean[0], mean[1], 'r*', markersize=15, label=f'Mean ({mean[0]}, {mean[1]})')

ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_title('Samples from 2D Gaussian with Theoretical Contours')
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Verify sample statistics
print("True mean:", mean)
print("Sample mean:", np.mean(samples, axis=0).round(3))
print()
print("True covariance:")
print(np.array(cov))
print("Sample covariance:")
print(np.cov(samples.T).round(3))

---

# Part 7: Central Limit Theorem

The **Central Limit Theorem (CLT)** states that the sum (or average) of a large number of independent, identically distributed random variables is approximately normally distributed, whatever the original distribution, provided it has finite mean and variance.

If $X_1, X_2, \dots, X_n$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then as $n \to \infty$:

$$\sqrt{n}\,\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \qquad \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$

Equivalently, for large $n$ the sample mean is approximately distributed as $\mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$.

## 7.1 CLT Demonstration: Sum of Uniform Random Variables
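
Before looking at histograms, it can help to check the scaling that the theorem relies on: the mean of $n$ draws has variance $\sigma^2/n$. The sketch below estimates that variance for Uniform(0, 1) draws, where $\sigma^2 = 1/12$. The seed, the number of repetitions, and the values of $n$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
sigma2 = 1 / 12                     # variance of a single Uniform(0, 1) draw
n_repeats = 20_000

empirical = {}
for n in (1, 4, 16, 64):
    # Each row holds n draws; averaging across a row gives one sample mean
    means = rng.uniform(0, 1, size=(n_repeats, n)).mean(axis=1)
    empirical[n] = means.var()
    print(f"n = {n:>2}: Var(mean) = {empirical[n]:.5f}, sigma^2 / n = {sigma2 / n:.5f}")
```

Each fourfold increase in $n$ should cut the variance of the sample mean by roughly a factor of four.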

In [None]:
# Central Limit Theorem demonstration
# Sum of Uniform(0,1) random variables -> approaches Gaussian

np.random.seed(42)
n_experiments = 10000
n_values = [1, 2, 5, 10, 30, 100]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

# Uniform(0,1) has mean=0.5, var=1/12
mu_uniform = 0.5
var_uniform = 1 / 12

for ax, n in zip(axes, n_values):
    # Generate n_experiments sums, each summing n uniform random variables
    sums = np.sum(np.random.uniform(0, 1, (n_experiments, n)), axis=1)

    # Standardize: Z = (S_n - n*mu) / sqrt(n*var)
    standardized = (sums - n * mu_uniform) / np.sqrt(n * var_uniform)

    # Plot histogram
    ax.hist(standardized, bins=50, density=True, alpha=0.6,
            color='steelblue', edgecolor='black', label='Empirical')

    # Overlay standard normal
    x = np.linspace(-4, 4, 200)
    ax.plot(x, stats.norm.pdf(x), 'r-', linewidth=2, label='N(0, 1)')

    ax.set_title(f'n = {n}', fontsize=13)
    ax.set_xlim(-4, 4)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

plt.suptitle('Central Limit Theorem: Sum of Uniform(0,1) Variables (Standardized)',
             fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## 7.2 CLT Convergence: Tracking the Distribution Shape

In [None]:
# Measure how quickly the CLT converges
# Use the Kolmogorov-Smirnov (KS) statistic: the largest gap between the empirical CDF
# of the standardized sample means and the standard normal CDF

np.random.seed(42)
n_values_fine = np.arange(1, 101)
n_experiments = 5000
ks_statistics = []

for n in n_values_fine:
    # Compute sample means of n uniform(0,1) draws
    sample_means = np.mean(np.random.uniform(0, 1, (n_experiments, n)), axis=1)

    # Standardize
    standardized = (sample_means - mu_uniform) / np.sqrt(var_uniform / n)

    # KS test against standard normal
    ks_stat, _ = stats.kstest(standardized, 'norm')
    ks_statistics.append(ks_stat)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(n_values_fine, ks_statistics, 'b-', linewidth=1.5)
ax.axhline(y=0.05, color='red', linestyle='--', alpha=0.7, label='KS = 0.05 (arbitrary reference level)')
ax.set_xlabel('n (number of variables averaged)')
ax.set_ylabel('KS Statistic (distance from Normal)')
ax.set_title('CLT Convergence: KS Statistic vs Sample Size')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Find when KS statistic drops below 0.05
ks_arr = np.array(ks_statistics)
below_threshold = np.where(ks_arr < 0.05)[0]
if len(below_threshold) > 0:
    print(f"KS statistic first drops below 0.05 at n = {below_threshold[0] + 1}")
else:
    print("KS statistic did not drop below 0.05 in the tested range.")

## 7.3 CLT with Non-Uniform Distributions

The CLT holds for any distribution with finite mean and variance. Let us check this empirically with the exponential distribution, which is strongly right-skewed.

In [None]:
# CLT with Exponential distribution (highly skewed)
np.random.seed(42)
lam = 2.0  # rate parameter
mu_exp = 1 / lam
var_exp = 1 / lam**2

n_experiments = 10000
n_values = [1, 2, 5, 10, 30, 100]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for ax, n in zip(axes, n_values):
    # Sample means from exponential
    sample_means = np.mean(np.random.exponential(scale=1/lam, size=(n_experiments, n)), axis=1)

    # Standardize
    standardized = (sample_means - mu_exp) / np.sqrt(var_exp / n)

    ax.hist(standardized, bins=50, density=True, alpha=0.6,
            color='#2ecc71', edgecolor='black', label='Empirical')

    x = np.linspace(-4, 4, 200)
    ax.plot(x, stats.norm.pdf(x), 'r-', linewidth=2, label='N(0, 1)')

    ax.set_title(f'n = {n}', fontsize=13)
    ax.set_xlim(-4, 4)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

plt.suptitle('CLT with Exponential Distribution (Standardized Sample Means)',
             fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---

# Practice Exercises

Try these on your own:

1. **Poisson Distribution:** The Poisson distribution models the number of events in a fixed interval. Using `scipy.stats.poisson`, plot the PMF for $\lambda = 1, 5, 10$. Verify that the mean and variance are both equal to $\lambda$. Then draw 10,000 samples for $\lambda = 5$ and overlay the histogram with the theoretical PMF.

2. **Bayesian Coin Flip:** You have a coin that may be biased. Your prior belief is that $p$ (probability of heads) follows a Beta(2, 2) distribution. You flip the coin 10 times and observe 7 heads. Using Bayes' theorem, the posterior is Beta(2+7, 2+3) = Beta(9, 5). Plot the prior and posterior distributions on the same axes. Compute the prior and posterior means and print them.

3. **Independence Test:** Generate two independent standard normal random variables $X$ and $Y$ (1000 samples each). Then create $Z = 2X + 3Y + \text{noise}$. Compute the correlation matrix of $(X, Y, Z)$ using `np.corrcoef`. Which pairs are correlated and which are approximately uncorrelated? Visualize with a heatmap. (Here $X$ and $Y$ are independent by construction; in general, zero correlation alone does not imply independence.)

4. **CLT with Dice:** Simulate rolling a fair six-sided die. The mean of a single die roll is $3.5$ and the variance is $35/12$. For $n \in \{1, 2, 5, 10, 50, 200\}$, compute the average of $n$ dice rolls (repeat 10,000 times). Plot histograms of the standardized averages and overlay the standard normal PDF. At what $n$ does the distribution look approximately Gaussian?
