# Random Variables and Continuous Distributions

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.special import comb
import pandas as pd
from ipywidgets import interact, FloatSlider, IntSlider
from IPython.display import HTML, display, IFrame
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

In [None]:
import sys
from pathlib import Path

# Add the "resources" directory to the path
project_root = Path().resolve().parent
resources_path = project_root / 'resources'
sys.path.insert(0, str(resources_path))

In [None]:
from random_variable import(demo_pmf_limitations_cont_rv, discrete_vs_cont_rv, demo_pdf, demo_cdf_disrete_vs_cont, cdf_interval, show_complementary_event_cdf, demo_weights)

## Learning Objectives

- Master continuous probability distributions (Normal, Exponential)
- Calculate expectations and variances
- Apply distributions to ML scenarios

<div class="alert alert-info">
<h4>🎯 Weight Initialization Paradox</h4>

You're initializing weights for a neural network layer with 100 neurons.

**Scenario 1**: Uniform initialization $W \sim Uniform(-1, 1)$

*Question*: What is $P(W = 0.5\text{ exactly})$?

*Your intuition*: "There are infinite values in $[-1,1]$, so maybe $1/\infty = 0$?"

*Correct answer*: $P(W = 0.5000000...) = 0$

But wait... If EVERY exact value has probability 0, how can ANY weight exist?!

**Scenario 2**: Normal initialization $W \sim \mathcal{N}(0, 0.01)$

You measure a weight: $W = 0.0234$

*Question*: What was the probability of getting exactly this value?

*Answer*: Also 0!

Yet the weight exists... How is this possible?

**Scenario 3**: The Real Question</br>
If $P(W = \text{any exact value}) = 0$, how do we calculate:
- $P(0.4 < W \leq 0.6)$?
- $P(|W| \leq 0.1)$?
- What's the "typical" range for weights?

This is the difference between DENSITY and PROBABILITY.

Today you'll understand continuous distributions and solve weight initialization properly.
</div>
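
The interval questions in Scenario 3 can already be answered numerically with a CDF. The sketch below uses `scipy.stats` and assumes that the second parameter of $\mathcal{N}(0, 0.01)$ is the variance, so the standard deviation is 0.1; the variable names are illustrative only.

```python
from scipy import stats

# Scenario 1: W ~ Uniform(-1, 1); scipy takes loc (start) and scale (width)
w_uniform = stats.uniform(loc=-1, scale=2)
# Scenario 2: W ~ N(0, 0.01), ASSUMING 0.01 is the variance, so std = 0.1
w_normal = stats.norm(loc=0, scale=0.1)

# P(0.4 < W <= 0.6) = F(0.6) - F(0.4)
p_interval_uniform = w_uniform.cdf(0.6) - w_uniform.cdf(0.4)

# P(|W| <= 0.1) = F(0.1) - F(-0.1)
p_small_uniform = w_uniform.cdf(0.1) - w_uniform.cdf(-0.1)
p_small_normal = w_normal.cdf(0.1) - w_normal.cdf(-0.1)

print(f"Uniform: P(0.4 < W <= 0.6) = {p_interval_uniform:.4f}")
print(f"Uniform: P(|W| <= 0.1)     = {p_small_uniform:.4f}")
print(f"Normal:  P(|W| <= 0.1)     = {p_small_normal:.4f}")
```

Under these assumptions the same event $|W| \leq 0.1$ is much more likely for the normal initialization than for the uniform one, even though every single value has probability 0 in both cases.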

## From Discrete to Continuous

> Why can't we use PMF for continuous variables?
</br>

Let's consider the following case. We observe 1000 data points.

In [None]:
# demo hist of 1000 points
demo_pmf_limitations_cont_rv()

<div class="alert .alert-warning">
<h4>💡 Key Insight: Fundamental Difference between Discrete and Continuous R.V.</h4>

DISCRETE: $P(X = x)$ can be > 0
- Heights recorded to the nearest cm: $P(Height = 175\text{ cm})$ makes sense
- Finite/countable values

CONTINUOUS: $P(X = x) = 0$ always!
- Measure height: $P(Height = 175.0000...\text{ cm})$ = 0
- Uncountably infinite values
- Instead, we ask: $P(174.5 < Height \leq 175.5)$?

Solution: Use DENSITY, not PROBABILITY
</div>

## Probability Density Function, PDF

<div class="alert alert-success">
<h4>Definition: Probability Density Function (PDF)</h4>

Let $X$ be an absolutely continuous real r.v. **The probability density function** (or **pdf**) $f_X(x)$ is a non-negative, integrable function on $\mathbb{R}$ such that, for all $a \leq b$:
$$\mathbb{P}(a\leq X\leq b) = \int_{a}^{b}f_X(t)dt$$

Note that:
$f_X(x) = \frac{d}{dx}F_X(x)$
almost everywhere.

Properties:

* $f_X(x)$ is a **density** (probability per unit length): $f_X(x) = \lim_{\Delta\rightarrow 0}\frac{\mathbb{P}(x< X\leq x + \Delta)}{\Delta}$
* $f_X(x) \geq 0$ for all $x$
* $f_X(x)$ can be > 1 (unlike PMF)
* $\forall t\in \mathbb{R}, \ f_X(t)\in \mathbb{R}^+$
* $\int_{\mathbb{R}}f_X(t)dt = 1$
* Probability = AREA under $f_X(x)$, not HEIGHT
</div>
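
A quick numerical check of these properties, as a minimal sketch: $U(0, 0.5)$ is chosen here purely for illustration because its density equals 2 everywhere on its support, yet the total area is still 1.

```python
from scipy import stats
from scipy.integrate import quad

# U(0, 0.5): constant density 1 / (0.5 - 0) = 2 on its support
X = stats.uniform(loc=0, scale=0.5)

density_at_point = X.pdf(0.25)               # a density value, allowed to exceed 1
total_area, _ = quad(X.pdf, 0, 0.5)          # area under the density over [0, 0.5]
prob_interval = X.cdf(0.30) - X.cdf(0.20)    # P(0.2 < X <= 0.3) = height * width

# Density as a limit: P(x < X <= x + delta) / delta for a small delta
x, delta = 0.25, 1e-6
density_from_limit = (X.cdf(x + delta) - X.cdf(x)) / delta

print(f"f(0.25) = {density_at_point:.4f}")
print(f"Total area = {total_area:.4f}")
print(f"P(0.2 < X <= 0.3) = {prob_interval:.4f}")
print(f"Limit ratio at x=0.25: {density_from_limit:.4f}")
```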


In [None]:
# demo PDF can be > 1
demo_pdf()

<div class="alert alert-danger">
<h4>⚠️ Common Mistake:</h4>

WRONG: "$f_X(0.5) = 0.8$ means $P(X=0.5) = 0.8$"</br>
RIGHT: "$f_X(0.5) = 0.8$ means density at $x=0.5$ is $0.8$"

Think of density like "people per square km":
- 1000 people/km² doesn't mean 1000 people at a single point!
- It means 1000 people spread over 1 km²
- Similarly, $f(x)=0.8$ means probability density, not probability

</div>

In [None]:
# demo comparison PMF vs PDF
discrete_vs_cont_rv()

## CDF for Continuous Variables

<div class="alert alert-success">
<h4>Definition: CDF for Continuous R.V.</h4>

$F_X(x) = \mathbb{P}(X \leq x) = \int_{-\infty}^{x}f_X(t)dt$

Thus, the distribution function $F_X(x)$ corresponds to the area under the curve $f_X(x)$

Relationship: $f_X(x) = \frac{d}{dx}F_X(x)$ (PDF is derivative of CDF)

**Properties**:

1. $F(x)$ is continuous (no jumps!)
2. $0 \leq F(x) \leq 1$
3. $F(-\infty) = 0$, $F(+\infty) = 1$
4. $F(x)$ is non-decreasing

</div>

### Comparison: Discrete vs Continuous CDF Properties

| Property | Discrete | Continuous |
|---|---|---|
| CDF Shape | Step function (jumps at values) | Smooth curve (no jumps)|
| $P(X = x)$ | Can be > 0 (height of jump) | Always = 0 (single point)|
| $F(x)$ calculation | $F(x) = \sum_{i} p(x_i)$ for all $x_i \leq x$| $F(x) = \int_{-\infty}^x f(t)dt$|
| $P(a < X \leq b)$ formula | $F(b) - F(a)$ </br> SAME FORMULA |  $F(b) - F(a)$ </br> SAME FORMULA |
| $P(a < X \leq b)$ interpretation | Sum of probabilities at discrete points | Area under PDF curve in interval |

In [None]:
# comparison CDF for cont vs discrete rv
demo_cdf_disrete_vs_cont()

### Interval Probability

The formula for interval probability holds for the continuous case:

$$\mathbb{P}(a< X\leq b) = F_X(b) - F_X(a)$$
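
As a sketch of this formula in practice, the snippet below compares $F_X(b) - F_X(a)$ with a direct numerical integration of the density. The standard normal and the interval $(-1, 1]$ are arbitrary choices made for illustration.

```python
from scipy import stats
from scipy.integrate import quad

X = stats.norm(loc=0, scale=1)   # standard normal, chosen only as an example
a, b = -1.0, 1.0

p_from_cdf = X.cdf(b) - X.cdf(a)      # F(b) - F(a)
p_from_area, _ = quad(X.pdf, a, b)    # area under the PDF between a and b

print(f"F(b) - F(a)       = {p_from_cdf:.6f}")
print(f"Integral of f(x)  = {p_from_area:.6f}")
```

Both numbers agree up to integration error, which is exactly the statement that the CDF accumulates area under the PDF.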

In [None]:
# demo CDF on the interval 
cdf_interval()

### Probability of the Complementary Event

Similar to finding the probability of the complementary event, we can find the probability $\mathbb{P}(X > a)$ as follows:
$\mathbb{P}(X > a) = 1 - \mathbb{P}(X \leq a)$
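
In `scipy.stats`, the complement is also available directly as the survival function `sf`. A minimal sketch, with an exponential distribution of mean 2 picked only as an example:

```python
from scipy import stats

X = stats.expon(scale=2.0)   # example distribution with mean 2
a = 3.0

p_complement = 1 - X.cdf(a)  # P(X > a) = 1 - P(X <= a)
p_survival = X.sf(a)         # same quantity via the survival function

print(f"1 - F({a}) = {p_complement:.6f}")
print(f"sf({a})    = {p_survival:.6f}")
```

For probabilities far in the tail, `sf` is usually the numerically safer choice because it avoids subtracting two nearly equal numbers.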

In [None]:
# complementary event CDF
show_complementary_event_cdf()

<div class="alert .alert-exercise">
<h4>Calculated Example</h4>

Let's take the example proposed in [@orloff2014].

The r.v. $X$ is defined on the interval $[0,2]$ by the density function $f(x) = cx^2$.

1. What is the value of $c$?
2. What is the distribution function $F(x)$?
3. What is the probability $\mathbb{P}(1\leq X \leq 2)$?

</div>

<details>
<summary>Reveal solution</summary>

1. What is the value of $c$?

The density must integrate to 1 over the entire interval $[a,b]$ on which $X$ is defined, i.e. $\int_a^b f(x)dx = 1$. In our case, $X$ is defined on $[0,2]$. Then:

$\int_a^b f(x)dx = \int_0^2 cx^2 dx = c\frac{x^3}{3}\Bigg\rvert_{0}^2 = c\frac{8}{3} - 0 = c\frac{8}{3} = 1$
Hence $c = \mathbf{\frac{3}{8}}$.

By replacing $c$ with this value, we obtain $f(x) = \frac{3}{8}x^2$.

2. What is the distribution function $F(x)$?

As a reminder, the distribution function is given by $F(x) = \mathbb{P}(X\leq x) = \int_{-\infty}^x f(t)dt$. Thus, knowing $f(x)$, we can find $F(x)$.

Note that by definition, the density function $f(x)=0$ outside the interval on which $X$ is defined. In our case, we can rewrite $f(x)$ as follows:
$f(x) = \left\{ \begin{array}{ll} 0, \text{ if } x < 0 \\ \frac{3}{8}x^2, \text{ if } x\in [0,2] \\ 0, \text{ if } x>2 \end{array} \right.$
Therefore, the distribution function $F(x) = 0, \text{ if } x<0$ and $F(x) = 1, \text{ if } x>2$. It remains to find the distribution function for $x\in [0, 2]$.

$F(x) = \int_{-\infty}^x f(t)dt = \int_{\mathbf{0}}^x ct^2dt = c\frac{t^3}{3}\Bigg\rvert_{0}^x = c\frac{x^3}{3} = \frac{3}{8}\cdot\frac{x^3}{3} = \frac{x^3}{8} = \left(\frac{x}{2}\right)^3$
So, putting it all together:

$F(x) = \left\{ \begin{array}{ll} 0, \text{ if } x < 0 \\ \left(\frac{x}{2}\right)^3, \text{ if } x\in [0,2] \\ 1, \text{ if } x>2 \end{array} \right.$


3. What is the probability $\mathbb{P}(1\leq X \leq 2)$?
We can approach the calculation of $\mathbb{P}(a\leq X\leq b)$ in two ways:

1. $\mathbb{P}(a\leq X\leq b) = \int_a^b f(x)dx$
2. $\mathbb{P}(a\leq X\leq b) = F(b) - F(a)$

Let's try both options.

3.1. Option 1: $\mathbb{P}(a\leq X\leq b) = \int_a^b f(x)dx$

$\mathbb{P}(1\leq X\leq 2) = \int_1^2 cx^2 dx = c\frac{x^3}{3}\Bigg\rvert_{1}^2 = \frac{3}{8}\cdot\frac{x^3}{3}\Bigg\rvert_{1}^2 = \frac{x^3}{8}\Bigg\rvert_{1}^2 = \frac{8}{8} - \frac{1}{8} = \mathbf{\frac{7}{8}}$

3.2. Option 2: $\mathbb{P}(a\leq X\leq b) = F(b) - F(a)$

Note that $1\in [0,2]$ and $2\in [0,2]$. On the interval in question, the distribution function takes the form $F(x) = \left(\frac{x}{2}\right)^3$. Therefore:

$\mathbb{P}(1\leq X\leq 2) = F(2) - F(1) = \left(\frac{2}{2}\right)^3 - \left(\frac{1}{2}\right)^3 = 1 - \frac{1}{8} = \mathbf{\frac{7}{8}}$
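
The three results can be cross-checked numerically. The following is a sketch only; the helper names `f` and `F` are introduced here for illustration and mirror the formulas derived above.

```python
from scipy.integrate import quad

# 1. Normalizing constant: c * integral_0^2 x^2 dx = 1
area_x2, _ = quad(lambda x: x**2, 0, 2)   # equals 8/3
c = 1 / area_x2                           # expected 3/8

def f(x):
    """Density f(x) = c x^2 on [0, 2], zero elsewhere."""
    return c * x**2 if 0 <= x <= 2 else 0.0

def F(x):
    """CDF from the closed form (x/2)^3, clamped outside [0, 2]."""
    if x < 0:
        return 0.0
    if x > 2:
        return 1.0
    return (x / 2) ** 3

# 3. Both options for P(1 <= X <= 2)
p_integral, _ = quad(f, 1, 2)   # Option 1: integrate the density
p_cdf = F(2) - F(1)             # Option 2: difference of CDF values

print(f"c = {c:.4f}")
print(f"Option 1: {p_integral:.4f}, Option 2: {p_cdf:.4f}")
```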



</details>

## Expected Value and Variance for Continuous R.V.

<div class="alert alert-success">
<h4>Definition: Expected Value</h4>

**The expectation** (or **expected value** or **mean** or **average** or **first moment**) of a real r.v. $X$, denoted $\mathbb{E}X$ or $\mathbb{E}[X]$ or $\mathbb{E}(X)$ is a generalization of the weighted average value. It can also be seen as the "center of mass" of the distribution.

Its calculation (when this quantity exists) depends on the nature of $X$:

* continuous real r.v.:
$\mathbb{E}[X] = \int_{-\infty}^{+\infty}tf(t)dt$


Note that these definitions can be generalized for the expectation of a real r.v. $g(X)$, where $g : X(\Omega) \rightarrow \mathbb{R}$:

* continuous case: $\mathbb{E}[g(X)] = \int_{-\infty}^{+\infty}g(t)f(t)dt$


<h5>Properties:</h5>

* if $X \geq 0$, then $\mathbb{E}[X] \geq 0$
* $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
* $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$

Note that these results can be generalized for the expectation of a real r.v. $g(X)$, where $g : X(\Omega) \rightarrow \mathbb{R}$:

* if $g \geq 0$, then $\mathbb{E}[g(X)] \geq 0$
* $\mathbb{E}[g_1(X) + g_2(X)] = \mathbb{E}[g_1(X)] + \mathbb{E}[g_2(X)]$
* $\mathbb{E}[ag(X) + b] = a\mathbb{E}[g(X)] + b$
* let $X$ be a continuous real r.v., $g_1$ and $g_2$ two functions such that $g_1 \leq g_2$, then $\mathbb{E}[g_1(X)] \leq \mathbb{E}[g_2(X)]$
* if $X$ is a constant real r.v. on $\Omega$ and $g$ any function, then $\mathbb{E}[g(X)] = g(X)$

</div>
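
The integral definition can be evaluated numerically. The sketch below reuses the density $f(x) = \frac{3}{8}x^2$ on $[0,2]$ from the calculated example; the helper `expect` and the constants `a`, `b` are illustrative names, not part of the notebook's resources.

```python
from scipy.integrate import quad

def f(t):
    """Density from the calculated example: f(t) = (3/8) t^2 on [0, 2]."""
    return 0.375 * t**2

def expect(g):
    """E[g(X)] = integral of g(t) f(t) dt over the support [0, 2]."""
    value, _ = quad(lambda t: g(t) * f(t), 0, 2)
    return value

mean = expect(lambda t: t)             # E[X]

# Linearity: E[aX + b] = a E[X] + b
a, b = 4.0, -1.0
lhs = expect(lambda t: a * t + b)
rhs = a * mean + b

print(f"E[X] = {mean:.4f}")
print(f"E[aX + b] = {lhs:.4f}, a E[X] + b = {rhs:.4f}")
```

Here $\mathbb{E}[X] = \int_0^2 t \cdot \frac{3}{8}t^2 dt = \frac{3}{2}$, which lies to the right of the midpoint of $[0,2]$ because the density puts more mass near 2.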

<div class="alert alert-success">
<h4>Definition: Variance</h4>

**The variance** of the real r.v. $X$, denoted $Var(X)$ or $\sigma^2$, is a measure of the dispersion of data around its expectation $\mathbb{E}X$.

Its calculation (when this quantity exists) depends on the nature of $X$:

* continuous real r.v.:
$Var(X) = \mathbb{E}[(X-\mathbb{E}X)^2] = \int_{-\infty}^{+\infty}(x-\mathbb{E}X)^2f(x)dx = \mathbb{E}[X^2]-(\mathbb{E}X)^2$

<h5>Properties:</h5>

* $Var(X) \geq 0$
* $Var(X + Y) = Var(X) + Var(Y)\text{, (if } X\text{ and } Y\text{ indep.)}$
* $Var(aX + b) = a^2Var(X)$

</div>
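
A numerical sketch of the two variance formulas and of $Var(aX + b) = a^2Var(X)$, again using the example density $f(x) = \frac{3}{8}x^2$ on $[0,2]$. The helper `expect` and the constants `a`, `b` are hypothetical names used only for this check.

```python
from scipy.integrate import quad

def f(x):
    """Example density f(x) = (3/8) x^2 on [0, 2]."""
    return 0.375 * x**2

def expect(g):
    """E[g(X)] as the integral of g(x) f(x) over the support [0, 2]."""
    value, _ = quad(lambda x: g(x) * f(x), 0, 2)
    return value

mean = expect(lambda x: x)
var_definition = expect(lambda x: (x - mean) ** 2)   # E[(X - E X)^2]
var_shortcut = expect(lambda x: x**2) - mean**2      # E[X^2] - (E X)^2

# Shifting by b does not change the spread; scaling by a multiplies it by a^2
a, b = 4.0, -1.0
mean_y = a * mean + b
var_y = expect(lambda x: (a * x + b - mean_y) ** 2)

print(f"Var(X) by definition = {var_definition:.4f}")
print(f"Var(X) by shortcut   = {var_shortcut:.4f}")
print(f"Var(aX + b) = {var_y:.4f}, a^2 Var(X) = {a**2 * var_definition:.4f}")
```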

<div class="alert alert-success">
<h4>Definition: Standard Deviation</h4>

**The standard deviation** (or **std**) of the real r.v. $X$, denoted $\sqrt{Var(X)}$ or $\sigma$, is a measure of the deviation between the values taken by $X$ and its expectation $\mathbb{E}X$

$\sigma(X)=\sigma_X = \sqrt{Var(X)}$

</div>

## Common Continuous Distributions

### Uniform Distribution, $\mathcal{U}(a, b)$

<div class="alert alert-example">
<h4>Uniform Distribution</h4>

$X \sim \mathcal{U}(a,b)$ or $X \sim \mathcal{U}[a,b]$ or $X \sim Uniform(a,b)$

**When to use:**

Models situations where **all values in an interval are equally likely**. It represents complete uncertainty within a bounded range - no value is preferred over any other.

- All outcomes in a range are equally probable
- You have no reason to prefer one value over another
- Modeling random selection from an interval
- Need a "baseline" or "uninformative" distribution

**Parameters & Domain:**
- Parameters: 

    * $a \in \mathbb{R}$: lower bound (minimum value)
    * $b \in \mathbb{R}$: upper bound (maximum value)
    * Constraint: $a < b$

- Support (Domain): $X \in [a, b]$



| PDF | CDF | $E(X)$ | $Var(X)$| $\sigma$|
|:---:|:---:|:----:|:---:|:---:|
|$$f(x) = \begin{cases} \frac{1}{b-a} & \text{if } a \leq x \leq b \\0 & \text{otherwise}\end{cases}$$ *Interpretation:*</br> Constant density across the entire interval</br>Height = 1/(width) ensures total area = 1</br> Rectangular shape (hence "rectangular distribution")| $$F(x) = P(X \leq x) = \begin{cases} 0 & \text{if } x < a \\ \frac{x-a}{b-a} & \text{if } a \leq x \leq b \\1 & \text{if } x > b\end{cases}$$ *Interpretation:* </br> Linear growth from 0 to 1 across $[a, b]$</br> $Slope = 1/(b-a)$ = constant density| $$\frac{a + b}{2}$$ Midpoint of the interval (by symmetry)| $$\frac{(b-a)^2}{12}$$ | $$\frac{b-a}{2\sqrt{3}}\approx 0.289(b-a)$$|

**Key Properties:**

1. Symmetry: Symmetric around the midpoint $(a+b)/2$
2. Conditioning on a subinterval: If $X \sim U(a,b)$ and you know $X \in (c,d) \subset (a,b)$, then $X \mid (X \in (c,d)) \sim U(c,d)$
3. Transformation property: If $X \sim U(0,1)$, then $Y = a + (b-a)X \sim U(a,b)$
4. Standard uniform: $U(0,1)$ is the base case:

- Computer random number generators produce U(0,1)
- Can transform to any other distribution

5. Maximum entropy: Among all distributions on $[a,b]$, Uniform has maximum entropy (maximum uncertainty)
6. Quantile function (inverse CDF):
 $$F^{-1}(p) = a + p(b-a)$$
7. Mode: Not unique (all values in $[a,b]$ are equally likely)

**Real-world examples:**
1. Physical/Natural:

- Angle of a spinner ($U(0, 360°)$)
- Position where a dart hits (if truly random)
- Breaking point of a uniformly weak rod
- Round-off errors in measurements

2. Time-based:

- Arrival time when "sometime between 2-3 PM" with equal probability
- Random moment within a time window
- Phase of a periodic signal at random observation

3. Selection:

- Random number from a range
- Coordinates within a square/rectangle
- Random point on a line segment

**Typical Events This Describes**

"Anywhere in this range is equally likely":

- Initial guess when you know bounds but nothing else
- Random sampling from a known range
- Uniform allocation across resources
- Monte Carlo sampling baseline

*Key characteristic*: Complete lack of bias within the interval

**Relationships to Other Distributions:**
1. Parent/Child Relationships:

- Standard Uniform $U(0,1)$: Base case for all distributions
- If $U \sim U(0,1)$: Then $a + (b-a)U \sim U(a,b)$

2. Connection to other distributions:

- Order statistics: If $X_1, ..., X_n \sim U(0,1)$ are ordered, the gaps follow Beta distributions
- Sum of Uniforms: The standardized sum of $n$ independent $U(0,1)$ converges to a Normal distribution (by CLT as $n\rightarrow \infty$)
- Max/Min of Uniforms: $Max$ of $n$ $U(0,1)$ $\sim Beta(n, 1)$, $Min \sim Beta(1, n)$

3. Inverse Transform Method:

- If $F$ is any CDF and $U \sim U(0,1)$, then $F^{-1}(U)$ follows distribution $F$

This is how we generate random samples from any distribution!

4. Limiting case:

- Can approximate any distribution over $[a,b]$ with mixture of uniforms
</div>
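
The transformation property and the inverse transform method can be sketched with NumPy alone. The seed, sample size, and the target $Exp(\lambda)$ with $\lambda = 3$ are arbitrary choices for illustration; the inverse CDF used is $F^{-1}(p) = -\ln(1-p)/\lambda$, obtained by solving $p = 1 - e^{-\lambda x}$ for $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)   # U ~ U(0, 1)

# Transformation property: a + (b - a) U ~ U(a, b)
a, b = 2.0, 5.0
x_uniform = a + (b - a) * u

# Inverse transform for Exp(lam): F^{-1}(p) = -ln(1 - p) / lam
lam = 3.0
x_expon = -np.log1p(-u) / lam             # log1p(-u) computes ln(1 - u)

print(f"U(a, b): mean = {x_uniform.mean():.3f} (theory {(a + b) / 2:.3f})")
print(f"U(a, b): var  = {x_uniform.var():.3f} (theory {(b - a) ** 2 / 12:.3f})")
print(f"Exp(lam): mean = {x_expon.mean():.3f} (theory {1 / lam:.3f})")
```

The sample moments should land close to the theoretical values, with small deviations due to sampling noise.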

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

1. Initialization:

- Weight initialization: $W \sim U(-c, c)$

A classic heuristic is $U(-1/\sqrt{n}, 1/\sqrt{n})$ for a layer with $n$ inputs

Xavier (Glorot) and He initialization refine the choice of $c$ and exist in both uniform and normal variants

Still used for some architectures (embeddings)


2. Regularization:

- Dropout variant: Random dropout rate $\sim U(0.1, 0.5)$
- Data augmentation: Random crop position, rotation angle
- Label smoothing: Mix true label with uniform over all classes

3. Optimization:

- Random search: Hyperparameter values ~ Uniform over range
- Batch shuffling: Random permutation of training data
- Random restart: Initial guess ~ Uniform over feasible region

4. Sampling:

- Monte Carlo integration: Sample points $\sim U(domain)$
- Rejection sampling: Accept/reject based on $U(0,1)$
- Stochastic gradient descent: Random mini-batch selection

5. Exploration:

- Epsilon-greedy: With probability $\epsilon$, choose action $\sim Uniform$
- Uniform exploration: Random action selection in RL
- Random feature generation: Random Fourier features

</div>
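
As a sketch of the initialization item above, the snippet draws a weight matrix from $U(-1/\sqrt{n}, 1/\sqrt{n})$ and compares its empirical variance with the uniform formula $(b-a)^2/12$. The layer sizes `n_inputs` and `n_outputs` are hypothetical values, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_outputs = 256, 100   # hypothetical layer sizes
c = 1 / np.sqrt(n_inputs)        # bound of the symmetric interval [-c, c]

# W ~ U(-c, c), one entry per connection
W = rng.uniform(-c, c, size=(n_inputs, n_outputs))

# Variance of U(-c, c) is (2c)^2 / 12 = c^2 / 3 = 1 / (3 n_inputs)
theoretical_var = (2 * c) ** 2 / 12

print(f"Empirical variance:   {W.var():.6f}")
print(f"Theoretical variance: {theoretical_var:.6f}")
```

The variance shrinks as $1/n$, which keeps the scale of a neuron's weighted input roughly independent of the number of inputs.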

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

❌ Thinking it's appropriate for unbounded ranges

- Uniform ONLY works on finite intervals $[a, b]$
- Can't have $U(-\infty, \infty)$ (doesn't integrate to 1)

❌ Confusing discrete uniform with continuous uniform

- Discrete: $U\{1,2,3,4,5,6\}$ (die roll) - PMF
- Continuous: $U(1,6)$ - PDF
- They're different distributions!

❌ Assuming "random" means uniform

- Many processes are NOT uniform (e.g., human reaction times)
- Check assumptions before using uniform

❌ Forgetting the 1/12 factor in variance

- Common mistake: using $(b-a)^2$ instead of $(b-a)^2/12$

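❌ Choosing the bounds of a uniform initializer arbitrarily

- Xavier/He initialization scales the range (or standard deviation) with the layer size and comes in both uniform and normal variants
- The variance of the weights usually matters more than the choice between a uniform and a normal shape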

⚠️ Edge cases:

- Check if endpoints $a$, $b$ are included (closed interval $[a,b]$)
- $P(X = a) = P(X = b) = 0$ (continuous), but support includes them

</div>

We can use the SciPy implementation of this distribution, [`scipy.stats.uniform`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html):

In [None]:
a, b = 2, 5
X = stats.uniform(loc=a, scale=b-a)  # loc=a, scale=width

# IMPORTANT: scipy uses loc (start) and scale (width), NOT (a, b) directly!
# For U(a, b): use loc=a, scale=b-a

x = 3.5
# PDF
print(f"f({x}) = {X.pdf(x):.4f}")  # density f(3.5) = 1/(b-a) = 1/3 (a density, not a probability)

# CDF
print(f"F({x}) = {X.cdf(x):.4f}")  # F(x) = (x-a)/(b-a) = 0.5

# Probability of interval
prob = X.cdf(4) - X.cdf(3)
print(f"P(3 ≤ X ≤ 4) = {prob:.4f}")  # F(4) - F(3) = 1/3

# Moments
print(f"E[X] = {X.mean()}")      # E[X] = (a+b)/2 = 3.5
print(f"Var(X) = {X.var()}")     # Var(X) = (b-a)²/12 = 0.75
print(f"σ = {X.std()}")          # σ = (b-a)/(2√3) ≈ 0.866

# Random sampling
samples = X.rvs(size=10, random_state=42)
print(f"Generated 10 samples: {samples}")

In [None]:
def plot_uniform(a=-2.0, b=3.0):
    """
    Interactive visualization of Uniform distribution
    
    Parameters:
    -----------
    a : float - lower bound
    b : float - upper bound
    """
    print("Visual Characteristics")
    print("PDF Shape:")
    print("      Rectangular/flat across [a, b]")
    print("      Height = 1/(b-a)")
    print("      Zero outside [a, b]")
    print("      Sharp discontinuities at endpoints")
    print("CDF Shape:")
    print("      Linear ramp from 0 to 1")
    print("      Constant slope = 1/(b-a)")
    print("      Horizontal at 0 before a")
    print("      Horizontal at 1 after b")
    print("Key visual features:")
    print("      Perfect symmetry")
    print("      No peak (uniform height)")
    print("      Clear bounded support")
    
    
    # Create distribution
    if a >= b:
        print("Error: a must be < b")
        return
    
    uniform_dist = stats.uniform(loc=a, scale=b-a)
    
    # Create figure with subplots
    fig = plt.figure(figsize=(8, 4))
    gs = fig.add_gridspec(1, 2, hspace=0.3, wspace=0.3)
    
    # ========================================================================
    # Panel 1: PDF
    # ========================================================================
    ax1 = fig.add_subplot(gs[0, 0])
    
    x_range = np.linspace(a-1, b+1, 500)
    pdf_values = uniform_dist.pdf(x_range)
    
    ax1.fill_between(x_range, pdf_values, alpha=0.3, color='skyblue')
    ax1.plot(x_range, pdf_values, 'b-', linewidth=2.5)
    ax1.axvline(a, color='red', linestyle='--', alpha=0.7, linewidth=1.5, label='Bounds')
    ax1.axvline(b, color='red', linestyle='--', alpha=0.7, linewidth=1.5)
    ax1.axvline((a+b)/2, color='green', linestyle=':', linewidth=2, label='Mean')
    
    ax1.set_xlabel('x', fontsize=11)
    ax1.set_ylabel('f(x) - Density', fontsize=11)
    ax1.set_title(f'PDF: Uniform({a:.2f}, {b:.2f})', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, max(pdf_values) * 1.3 if max(pdf_values) > 0 else 1)
    
    # Add text box with PDF formula
    pdf_text = f'f(x) = {1/(b-a):.3f} for x ' + r'$\in$' + f' [{a:.1f}, {b:.1f}]'
    ax1.text(0.5, 0.95, pdf_text, transform=ax1.transAxes,
            verticalalignment='top', horizontalalignment='center',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
            fontsize=10)
    
    # ========================================================================
    # Panel 2: CDF
    # ========================================================================
    ax2 = fig.add_subplot(gs[0, 1])
    
    cdf_values = uniform_dist.cdf(x_range)
    
    ax2.plot(x_range, cdf_values, 'r-', linewidth=2.5)
    ax2.axhline(0, color='gray', linestyle=':', alpha=0.5)
    ax2.axhline(1, color='gray', linestyle=':', alpha=0.5)
    ax2.axhline(0.5, color='green', linestyle=':', linewidth=1.5, label='Median')
    ax2.axvline(a, color='red', linestyle='--', alpha=0.7, linewidth=1.5)
    ax2.axvline(b, color='red', linestyle='--', alpha=0.7, linewidth=1.5)
    
    ax2.set_xlabel('x', fontsize=11)
    ax2.set_ylabel('F(x) = P(X ≤ x)', fontsize=11)
    ax2.set_title('CDF: Cumulative Distribution', fontsize=12, fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.05, 1.05)
    
    plt.close()
    
    return fig

interact(plot_uniform,
         a=FloatSlider(min=-10, max=0, step=0.5, value=-2, description='Lower (a):'),
         b=FloatSlider(min=0, max=10, step=0.5, value=3, description='Upper (b):'))
         

### Exponential Distribution, $\mathcal{E}(\lambda)$

<div class="alert alert-example">
<h4>Exponential Distribution</h4>

$X \sim \mathcal{E}(\lambda)$ or $X \sim Exp(\lambda)$ or $X \sim Exponential(\lambda)$

**When to use:**

The exponential distribution models **waiting times** or **time between events** in a Poisson process. It describes how long you wait for the **first occurrence** of an event when events happen randomly at a constant average rate.

- Modeling time until an event occurs
- Events happen independently at a constant rate
- You're measuring "time until failure" or "duration"
- Analyzing lifetimes or survival times
- Continuous analog of the Geometric distribution

**Parameters & Domain:**
- Parameter: $\lambda > 0$ rate parameter (events per unit time) 

    * $\lambda = 1/\mu$ where $\mu$ is the mean waiting time
    * Higher $λ$ → shorter waiting times (events happen more frequently)

Alternative Parameterization:

- Some sources use $\beta = 1/\lambda$ (scale parameter = mean)
- `scipy` uses `scale = 1/λ`    

- Support (Domain): $X \in [0, \infty)$

Always non-negative (can't have negative waiting times)



| PDF | CDF | $E(X)$ | $Var(X)$| $\sigma$|
|:---:|:---:|:----:|:---:|:---:| 
|$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\0 & \text{if } x < 0\end{cases}$$ or $\lambda e^{-\lambda x}$ for $x\geq 0$</br>*Interpretation:*</br>Exponential decay starting from $\lambda$ at $x=0$</br>Peak at $x=0$ ($mode = 0$)</br>Long right tail (some events take much longer)</br>Decay rate controlled by $\lambda$ | $$F(x) = P(X \leq x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$$ **Survival function** $$S(x) = P(X > x) = e^{-\lambda x}$$ *Interpretation:*</br>$F(x)$ grows from 0 to 1 (asymptotically) </br>$S(x)$ = probability of waiting longer than $x$</br>Half-life: time for $S(x)$ to reach 0.5| $$\frac{1}{\lambda}$$ Average waiting time = inverse of rate| $$\frac{1}{\lambda^2}$$ | $$\frac{1}{\lambda}$$ mean = standard deviation|

**Key Properties:**

1. Mode: 0 (distribution peaks at zero, then decays)
2. Memoryless Property (Most Important!): 
$$P(X > s + t \mid X > s) = P(X>t)$$
*Interpretation:* "The future doesn't depend on the past". If you've already waited $s$ time units, the probability of waiting $t$ more is the same as waiting $t$ from the start
3. Relationship to Poisson Process: If events occur according to a Poisson process with rate $\lambda$:

- Number of events in time $t \sim Poisson(\lambda t)$
- Time until first event $\sim Exponential(\lambda)$
- Time between consecutive events $\sim Exponential(\lambda)$
4. Minimum of Exponentials: If $X_1, ..., X_n$ are independent with $X_i \sim Exp(\lambda_i)$:
$$\min(X_1, ..., X_n) \sim Exp(\lambda_1 + ... + \lambda_n)$$
5. Sum of Exponentials: The sum of $n$ independent $Exp(\lambda)$ variables follows a $Gamma(n, \lambda)$ distribution
6. Scaling Property: If $X \sim Exp(\lambda)$, then $cX \sim Exp(\lambda/c)$ for $c > 0$


**Real-world examples:**
1. Time-Based:

- Time until next phone call arrives
- Lifetime of electronic components
- Time between arrivals at service counter
- Duration of phone calls
- Time until radioactive decay

2. System Reliability:

- Time to failure for devices with constant hazard rate
- Server uptime before crash
- Time between network packet arrivals
- Battery lifetime (simplified model)

3. Natural Phenomena:

- Time between earthquakes (approximation)
- Intervals between lightning strikes
- Waiting time for rain to start

4. Service/Queue:

- Service time at a server (if memoryless)
- Inter-arrival times in queuing systems
- Time to complete a task

**Typical Events This Describes**

"Time until something happens" when:

- Events occur randomly
- No "aging" or "learning" (memoryless)
- Constant hazard rate
- Independent events

*Key characteristics:*

- Most events happen quickly (mode at 0)
- Some events take much longer (long tail)
- No memory of past waiting

**Relationships to Other Distributions:**
1. Discrete Analog:

- Geometric distribution is discrete version
- Geometric counts trials, Exponential measures time
- Both are memoryless

2. Parent Distribution:

- $Gamma(k, \lambda)$: Exponential is $Gamma(1, \lambda)$
- Sum of $k$ independent $Exp(\lambda) \sim Gamma(k, \lambda)$

3. Related to Poisson:

- If $N(t) \sim Poisson(\lambda t)$ counts events, then inter-arrival times $\sim Exp(\lambda)$

4. Continuous Uniform Connection:

- If $X \sim Exp(\lambda)$, then $1 - e^{-\lambda X} \sim Uniform(0, 1)$
- Inverting this gives a sampler: if $U \sim Uniform(0, 1)$, then $-\ln(1-U)/\lambda \sim Exp(\lambda)$

5. Weibull Distribution:

- Exponential is Weibull with shape parameter $k=1$
- Weibull allows non-constant hazard rates

6. Chi-squared:

- $2\lambda X \sim \chi^2(2)$ if $X \sim Exp(\lambda)$
</div>
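
Two of the key properties can be checked numerically. This is a sketch under arbitrary example values: $\lambda = 0.5$, $s = 2$, $t = 3$ for the memoryless property, and rates $1, 2, 3$ for the minimum of independent exponentials.

```python
import numpy as np
from scipy import stats

# Memoryless property: P(X > s + t | X > s) = P(X > t)
lam = 0.5
X = stats.expon(scale=1 / lam)            # scipy uses scale = 1/lambda
s, t = 2.0, 3.0
p_conditional = X.sf(s + t) / X.sf(s)     # P(X > s + t) / P(X > s)
p_fresh = X.sf(t)                         # P(X > t)
print(f"P(X > s+t | X > s) = {p_conditional:.6f}, P(X > t) = {p_fresh:.6f}")

# Minimum of independent exponentials: rates add up
rng = np.random.default_rng(0)
rates = np.array([1.0, 2.0, 3.0])
samples = rng.exponential(scale=1 / rates, size=(200_000, 3))  # column i ~ Exp(rates[i])
minimum = samples.min(axis=1)
print(f"Mean of minimum = {minimum.mean():.4f} (theory {1 / rates.sum():.4f})")
```

The first check is exact up to floating-point error, while the second is a simulation and only matches the theoretical mean $1/(\lambda_1 + \lambda_2 + \lambda_3)$ approximately.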

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

1. System Monitoring:

- Server failure prediction: Time until server fails
- Request inter-arrival times: Model traffic patterns
- Session duration: How long users stay active
- Response time modeling: Server/API latency (simplified)

2. Reinforcement Learning:

- Episode length: Time until termination in continuous time
- Event-triggered learning: Time between state changes
- Exploration strategies: Random wait times

3. Generative Models:

- Event timing in sequences: Time stamps in temporal data
- Survival analysis: Time until event (customer churn, etc.)
- Hawkes processes: Self-exciting point processes

4. Network & Communication:

- Packet inter-arrival: Time between network packets
- Transmission delays: Simplified communication models
- Timeout settings: When to declare connection lost

5. Anomaly Detection:

- Baseline timing model: Normal inter-event times
- Deviation detection: Unusually long/short waits indicate anomalies
- Failure prediction: Based on time since last event

6. Queueing Systems:

- Service time distribution: Time to serve a request
- Arrival process modeling: $M/M/1$ queues (Markovian)
- Wait time analysis: Expected delays

</div>

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

❌ Assuming memoryless property applies to real systems

- Many real systems DO have memory (aging, learning)
- Example: A battery's failure rate increases with age (NOT exponential)
- Example: Human waiting patience decreases (NOT memoryless)
- Exponential is an approximation, often valid only for short times

❌ Confusing rate $\lambda$ with mean

- $Mean = 1/\lambda$ (inverse relationship!)
- $\lambda = 2\text{ events/hour} \rightarrow \text{mean wait} = 0.5\text{ hours}$
- Common error: using $\lambda$ directly as mean

❌ Using for non-constant hazard rates

- Exponential assumes constant failure rate
- If hazard rate changes over time, use Weibull or other distributions
- Bathtub curve (early failures, stable, wear-out) NOT exponential

❌ `scipy` parameterization confusion

- `scipy` uses `scale = 1/λ` (mean), not `λ` directly!
- `stats.expon(scale=2)` has $λ=0.5$, $mean=2$
- Must remember to use `scale = 1/λ`

❌ Forgetting support is $[0, \infty)$

- Cannot have negative waiting times
- Check if data has natural lower bound at 0

⚠️ Overdispersion issues:

- Exponential has $Var = Mean^2$ (high variance relative to mean)
- If real data has lower variance, consider other distributions
- If higher variance, consider mixture models

❌ Using for "time to nth event"

- Exponential is for FIRST event
- For $n$-th event, use $Gamma(n, \lambda)$ or Erlang distribution

</div>
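
The last pitfall and the `scipy` parameterization can be illustrated together. In this sketch, the waiting time until the $n$-th event is simulated as a sum of $n$ exponential waits and compared with `scipy.stats.gamma`, assuming shape `a=n` and `scale=1/λ`; the values $\lambda = 2$ and $n = 3$ are arbitrary.

```python
import numpy as np
from scipy import stats

lam, n = 2.0, 3                                  # rate and event index, example values

first_event = stats.expon(scale=1 / lam)         # time to the FIRST event, mean 1/lam
nth_event = stats.gamma(a=n, scale=1 / lam)      # time to the n-th event, mean n/lam

# Simulate the n-th event time as the sum of n independent Exp(lam) waits
rng = np.random.default_rng(0)
waits = rng.exponential(scale=1 / lam, size=(100_000, n)).sum(axis=1)

empirical_cdf_at_1 = np.mean(waits <= 1.0)       # fraction of simulated waits <= 1
print(f"Mean first event: {first_event.mean():.3f}")
print(f"Mean n-th event:  {nth_event.mean():.3f} (simulated {waits.mean():.3f})")
print(f"P(T_n <= 1): gamma {nth_event.cdf(1.0):.4f}, simulated {empirical_cdf_at_1:.4f}")
```

Using `stats.expon` for the third event would underestimate the mean wait by a factor of $n$, which is the mistake the pitfall warns about.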

We can use the SciPy implementation of this distribution, [`scipy.stats.expon`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html):

In [None]:
lambda_rate = 3.0  # events per minute
mean_time = 1 / lambda_rate  # mean waiting time = 1/3 minute

# IMPORTANT: scipy uses scale (mean), not rate!
exp_dist = stats.expon(scale=mean_time)  # scale = 1/λ = 1/3

# Alternative (equivalent)
exp_dist_alt = stats.expon(scale=1/lambda_rate)

# PDF
x = 0.5  # 0.5 minutes
pdf_value = exp_dist.pdf(x)
print(f"PDF at {x}: f({x}) = {pdf_value:.4f}")
# Manual: λ * e^(-λx) = 3 * e^(-3*0.5) ≈ 0.669

# CDF
cdf_value = exp_dist.cdf(x)
print(f"CDF at {x}: F({x}) = {cdf_value:.4f}")
print(f"P(wait ≤ {x} min) = {cdf_value:.4f}")
# Manual: 1 - e^(-λx) = 1 - e^(-1.5) ≈ 0.777

# Survival function (more intuitive for waiting times)
survival = exp_dist.sf(x)  # P(X > x) = e^(-λx)
print(f"P(wait > {x} min) = {survival:.4f}")
print(f"Verification: {1 - cdf_value:.4f}")

# Mean and variance
print(f"\nMean: {exp_dist.mean():.4f} (should be {1/lambda_rate:.4f})")
print(f"Variance: {exp_dist.var():.4f} (should be {(1/lambda_rate)**2:.4f})")
print(f"Std: {exp_dist.std():.4f} (equals mean for exponential!)\n")

# Random sampling
samples = exp_dist.rvs(size=10, random_state=42)
print(f"Generated 10 samples: {samples}")

In [None]:
def plot_expon(lambda_rate=2.0):    
    """
    Interactive visualization of Exponential distribution
    
    Parameters:
    -----------
    lambda_rate : float - rate parameter (events per unit time)
    """
    print("Visual Characteristics")
    print("PDF Shape:")
    print("     Monotonically decreasing from maximum at x=0")
    print("     Long right tail (some values much larger than mean)")
    print("     Exponential decay: f(x) = λe^(-λx)")
    print("     Peak at zero (mode = 0)")
    print("     Most probability mass near zero")
    print("CDF Shape:")
    print("     Starts at 0 (at x=0)")
    print("     Asymptotically approaches 1 (never quite reaches it)")
    print("     Concave (second derivative negative)")
    print("     Steeper rise for larger λ (events happen faster)")
    print("Effect of λ:")
    print("     Large λ: Steep decay, short waiting times, concentrated near zero")
    print("     Small λ: Slow decay, long waiting times, more spread out")
    print("Key visual features:")
    print("     Always skewed right (positive skew)")
    print("     No inflection point")
    print("     Simple exponential curve")
    
    if lambda_rate <= 0:
        print("Error: λ must be > 0")
        return
    
    # Create distribution (remember: scipy uses scale = 1/λ)
    exp_dist = stats.expon(scale=1/lambda_rate)
    
    mean_val = 1 / lambda_rate
    var_val = 1 / (lambda_rate ** 2)
    std_val = 1 / lambda_rate
    median_val = np.log(2) / lambda_rate
    
    # Create figure
    fig = plt.figure(figsize=(8, 4))
    gs = fig.add_gridspec(1, 2, hspace=0.35, wspace=0.3)
    
    # ========================================================================
    # Panel 1: PDF
    # ========================================================================
    ax1 = fig.add_subplot(gs[0, 0])
    
    x_max = max(10/lambda_rate, 5)  # Show enough of the tail
    x_range = np.linspace(0, x_max, 500)
    pdf_values = exp_dist.pdf(x_range)
    
    ax1.fill_between(x_range, pdf_values, alpha=0.3, color='orange')
    ax1.plot(x_range, pdf_values, 'darkorange', linewidth=2.5)
    ax1.axvline(mean_val, color='blue', linestyle='--', linewidth=2, label=f'Mean={mean_val:.3f}')
    ax1.axvline(median_val, color='green', linestyle=':', linewidth=2, label=f'Median={median_val:.3f}')
    ax1.axvline(0, color='red', linestyle='-', linewidth=1.5, alpha=0.7, label='Mode=0')
    
    ax1.set_xlabel('x (time)', fontsize=11)
    ax1.set_ylabel('f(x) - Density', fontsize=11)
    ax1.set_title(f'PDF: Exponential(λ={lambda_rate:.3f})', fontsize=12, fontweight='bold')
    ax1.legend(fontsize=9)
    ax1.grid(True, alpha=0.3)
    
    # Add formula
    ax1.text(0.98, 0.98, f'f(x) = {lambda_rate:.3f}e^(-{lambda_rate:.3f}x)',
            transform=ax1.transAxes, verticalalignment='top', horizontalalignment='right',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
            fontsize=10)
    
    # ========================================================================
    # Panel 2: CDF and Survival Function
    # ========================================================================
    ax2 = fig.add_subplot(gs[0, 1])
    
    cdf_values = exp_dist.cdf(x_range)
    sf_values = exp_dist.sf(x_range)  # Survival function = 1 - CDF
    
    ax2.plot(x_range, cdf_values, 'b-', linewidth=2.5, label='CDF F(x)')
    ax2.plot(x_range, sf_values, 'r--', linewidth=2.5, label='Survival S(x)=1-F(x)')
    ax2.axhline(0.5, color='green', linestyle=':', alpha=0.7)
    ax2.axvline(median_val, color='green', linestyle=':', alpha=0.7)
    ax2.plot(median_val, 0.5, 'go', markersize=10, label='Median')
    
    ax2.set_xlabel('x (time)', fontsize=11)
    ax2.set_ylabel('Probability', fontsize=11)
    ax2.set_title('CDF and Survival Function', fontsize=12, fontweight='bold')
    ax2.legend(fontsize=9)
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.05, 1.05)
    
    plt.close()
    
    return fig

interact(plot_expon,
         lambda_rate=FloatSlider(min=0.1, max=10, step=0.1, value=0.7, description='Rate (# events per unit time)'),
         )
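
The closed-form summaries computed inside `plot_expon` (mean $1/\lambda$, variance $1/\lambda^2$, median $\ln 2/\lambda$) can be cross-checked against SciPy. This is a minimal sketch; the rate `lambda_rate = 0.7` is only an illustrative value matching the slider default.

```python
import numpy as np
from scipy import stats

lambda_rate = 0.7  # illustrative value (same as the slider default above)
exp_dist = stats.expon(scale=1 / lambda_rate)  # scipy parameterizes by scale = 1/λ

# Closed-form summaries of Exponential(λ)
mean_formula = 1 / lambda_rate
var_formula = 1 / lambda_rate**2
median_formula = np.log(2) / lambda_rate

print(f"mean:   formula={mean_formula:.4f}  scipy={exp_dist.mean():.4f}")
print(f"var:    formula={var_formula:.4f}  scipy={exp_dist.var():.4f}")
print(f"median: formula={median_formula:.4f}  scipy={exp_dist.median():.4f}")
```

The median is smaller than the mean, which is the right skew visible in the PDF panel.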

### Normal (Gaussian) Distribution, $\mathcal{N}(\mu, \sigma^2)$

<div class="alert alert-example">
<h4>Normal Distribution</h4>

$X \sim \mathcal{N}(\mu, \sigma^2)$ or $X \sim Normal(\mu, \sigma^2)$

**When to use:**

The most important distribution in statistics and machine learning. It models phenomena where values cluster symmetrically around a central mean, with variation governed by a bell-shaped curve.

- Data is symmetric around the mean
- Variation comes from many small, independent random effects
- Modeling measurement errors
- Central Limit Theorem applies (sum/average of many variables)
- Natural variation in populations
- You need a mathematically tractable model

> Why it's everywhere:

- Natural phenomena often approximately normal
- CLT makes sums/means converge to normal
- Mathematical convenience (closed under addition, scaling)
- Well-understood statistical properties
- Foundation of classical statistics

**Parameters & Domain:**
- Parameters: 

    * $\mu \in \mathbb{R}$:  location parameter (mean, center of distribution)
    * $\sigma^2 > 0$: scale parameter (variance, spread of distribution)
    * $\sigma > 0$: standard deviation

- Support (Domain): $X \in (-\infty, \infty)$

    * Can take any real value
    * No bounds (though extreme values have tiny probability)

- Standard Normal: $\mathcal{N}(0, 1)$
    * Mean = 0, Variance = 1
    * Base case for all normal distributions
    * Used for Z-scores and standardization

|Variant| PDF | CDF | $E(X)$ | $Var(X)$| $\sigma$|
|---|:---:|:---:|:----:|:---:|:---:|
|$\mathcal{N}(\mu, \sigma^2)$|$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$ *Interpretation:*</br> Bell-shaped curve symmetric around $\mu$</br>Maximum at $x = \mu$ (peak at the mean) </br> Inflection points at $\mu \pm \sigma$ </br> Shape determined by $\sigma$ (wider for larger $\sigma$)</br> Total area under curve = 1| $$F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right) = \int_{-\infty}^x \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(t - \mu)^2}{2\sigma^2}}dt$$ *Important!:*</br> No closed form in terms of elementary functions! Must use numerical integration, statistical tables (Z-tables), or library functions (`scipy.stats.norm.cdf`)| $$\mu$$ | $$\sigma^2$$| $$\sigma$$|
|$\mathcal{N}(0, 1)$| $$\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}$$| $$\Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt$$ *Properties*:</br> $\Phi(0) = 0.5$ (median at $0$)</br> $\Phi(-z) = 1 - \Phi(z)$ (symmetry) </br> Smooth S-shaped curve | $$0$$ | $$1$$ | $$1$$|
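
Although the normal CDF has no elementary closed form, it can be written with the error function as $\Phi(z) = \tfrac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$, which is one way library routines can evaluate it. The sketch below compares that identity with `scipy.stats.norm.cdf`; the helper names `phi_cdf` and `normal_cdf` are made up for this illustration.

```python
import math
from scipy import stats

def phi_cdf(z):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), obtained by standardizing first."""
    return phi_cdf((x - mu) / sigma)

for z in (-1.96, 0.0, 1.0):
    print(f"z={z:+.2f}  erf-based={phi_cdf(z):.6f}  scipy={stats.norm.cdf(z):.6f}")
print(f"P(X <= 80) for N(75, 10^2): {normal_cdf(80, 75, 10):.4f}")
```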

**Key Properties:**

1. Mode: $\mu$ (peak at the mean)
2. Skewness: 0 (perfectly symmetric)
3. Empirical Rule (68-95-99.7 Rule):  
For $X \sim \mathcal{N}(\mu, \sigma^2)$:

- 68% of values within $\mu ± \sigma$ (1 standard deviation)
- 95% of values within $\mu ± 2\sigma$ (2 standard deviations)
- 99.7% of values within $\mu ± 3\sigma$ (3 standard deviations)

More precisely:

- $P(\mu - \sigma ≤ X ≤ \mu + \sigma) = 0.6827$
- $P(\mu - 2\sigma ≤ X ≤ \mu + 2\sigma) = 0.9545$
- $P(\mu - 3\sigma ≤ X ≤ \mu + 3\sigma) = 0.9973$

Practical: Almost all values (99.7%) lie within $3\sigma$ of the mean!

4. Standardization (Z-Score Transformation)
$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
Purpose:

- Convert any normal to standard normal
- Compare values from different normal distributions
- Use standard normal tables
- Interpretable scale (units of standard deviation)

Reverse: $X = \mu + \sigma Z$

5. Linear Transformation Property
If $X \sim \mathcal{N}(\mu, \sigma^2)$, then:

$$aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$$
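
A quick simulation can make the linear transformation property concrete. This is only a sketch: the values of `mu`, `sigma`, `a`, and `b` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 75.0, 10.0   # illustrative parameters of X
a, b = -2.0, 5.0         # illustrative affine map

x = rng.normal(mu, sigma, size=200_000)
y = a * x + b

# Theory: aX + b ~ N(a*mu + b, a^2 * sigma^2)
print(f"theory:    mean={a * mu + b:.2f}, var={a**2 * sigma**2:.2f}")
print(f"simulated: mean={y.mean():.2f}, var={y.var():.2f}")
```

Note that the variance scales with $a^2$, so a negative $a$ flips the distribution but does not produce a negative variance.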

6. Sum of Independent Normals
If $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ are independent:

$$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$
General: Sum of $n$ independent normals is normal!

- Means add: $\mu_{sum} = \sum\mu_i$
- Variances add: $\sigma^2_{sum} = \sum \sigma_i^2$
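
The sketch below checks the sum rule by simulation, with two arbitrary illustrative normals. It also highlights an easy slip: variances add, standard deviations do not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu1, s1 = 2.0, 3.0    # X1 ~ N(2, 3^2), illustrative
mu2, s2 = -1.0, 4.0   # X2 ~ N(-1, 4^2), illustrative

total = rng.normal(mu1, s1, 200_000) + rng.normal(mu2, s2, 200_000)

# Variances add, so the std of the sum is sqrt(3^2 + 4^2) = 5, not 3 + 4 = 7
sum_dist = stats.norm(loc=mu1 + mu2, scale=np.sqrt(s1**2 + s2**2))
print(f"theory:    mean={sum_dist.mean():.3f}, std={sum_dist.std():.3f}")
print(f"simulated: mean={total.mean():.3f}, std={total.std():.3f}")
print(f"P(X1+X2 <= 6): theory={sum_dist.cdf(6):.4f}, simulated={(total <= 6).mean():.4f}")
```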

7. Maximum Entropy
Among all distributions with:

- Fixed mean $\mu$
- Fixed variance $\sigma^2$
- Support on entire real line

The normal distribution has maximum entropy (maximum uncertainty/randomness).
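
One way to see the maximum-entropy claim numerically is to compare differential entropies of a few distributions rescaled to the same mean and variance. This is a sketch, not a proof; the Laplace and uniform distributions are arbitrary comparison choices.

```python
import numpy as np
from scipy import stats

# Three distributions on the real line (or an interval) with mean 0 and variance 1
candidates = {
    "normal":  stats.norm(loc=0, scale=1),
    "laplace": stats.laplace(loc=0, scale=1 / np.sqrt(2)),            # var = 2 * scale^2
    "uniform": stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # var = width^2 / 12
}

entropies = {}
for name, dist in candidates.items():
    entropies[name] = float(dist.entropy())  # differential entropy in nats
    print(f"{name:8s} var={float(dist.var()):.3f}  entropy={entropies[name]:.4f}")
```

The normal entropy matches the closed form $\tfrac{1}{2}\ln(2\pi e \sigma^2)$ and is the largest of the three.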
8. Reproductive Property
Normal family is closed under:

- Linear combinations (of independent or jointly normal variables)
- Convolution (sums of independent normals)
- Affine transformations

If you start with normals and do these operations, you get normals!
9. Symmetry
$$f(\mu + x) = f(\mu - x)$$
PDF is symmetric about μ.
10. Relationship to Chi-Squared: If $Z \sim N(0,1)$, then $Z^2 \sim \chi^2(1)$. 
More generally, the sum of $n$ squared independent standard normals $\sim \chi^2(n)$
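
The chi-squared link can be checked two ways: exactly for one degree of freedom, since $P(Z^2 \le c) = P(-\sqrt{c} \le Z \le \sqrt{c})$, and by simulation for $n$ degrees of freedom, using the fact that $\chi^2(n)$ has mean $n$ and variance $2n$. The threshold `c` and the choice `n = 5` are illustrative.

```python
import numpy as np
from scipy import stats

# Exact identity for one degree of freedom
c = 2.5  # illustrative threshold
lhs = stats.chi2.cdf(c, df=1)
rhs = stats.norm.cdf(np.sqrt(c)) - stats.norm.cdf(-np.sqrt(c))
print(f"chi2(1) CDF at {c}: {lhs:.6f}   via normal: {rhs:.6f}")

# Simulation for n degrees of freedom: sum of n squared independent N(0, 1) draws
rng = np.random.default_rng(2)
n = 5
q = (rng.standard_normal((100_000, n)) ** 2).sum(axis=1)
print(f"n={n}: simulated mean={q.mean():.3f}, var={q.var():.3f}")
```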

**Real-world examples:**
1. Physical Measurements:

- Heights of people in a population
- Weights (approximately, with slight right skew)
- Blood pressure readings
- IQ scores (designed to be normal)
- Measurement errors in instruments

2. Natural Phenomena:

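- Velocity components of gas particles (each component is normal; the speeds follow the Maxwell-Boltzmann distribution)
- Thermal noise in electronics
- Positions in Brownian motion
- Quantum mechanical position/momentum in Gaussian states, e.g. the harmonic-oscillator ground state (wave function squared)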

3. Social Sciences:

- Test scores (when well-designed)
- Reaction times (approximately)
- Psychological trait measures
- Survey response aggregates

4. Finance:

- Log returns of stock prices (approximately)
- Portfolio returns (by CLT)
- Option pricing (Black-Scholes assumes normally distributed log returns, i.e. lognormal prices)

5. Manufacturing:

- Product dimensions (when process is in control)
- Quality control measurements
- Process variations

**Typical Events This Describes**

"Sum of many small independent effects":

- Measurement with many error sources
- Aggregate of many random influences
- Average of many samples (CLT)
- Natural variation around a fixed target

*Key characteristics:*

- Symmetric variation around center
- Most values near mean
- Extreme values rare but possible
- No inherent bounds (unlike uniform)

**Relationships to Other Distributions:**
1. Parent/Child Relationships:
**Central Limit Theorem Connection** (Most Important Relationship)

If $X_1, X_2, ..., X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

*Consequence*: Standardized sample means converge in distribution to a normal, whatever the original distribution, as long as its variance is finite!
This is why the normal appears everywhere.
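
The Central Limit Theorem is easy to probe by simulation. The sketch below draws strongly skewed Exponential(1) data, standardizes the sample means exactly as in the formula above, and compares them with $N(0, 1)$. The sample size `n = 50` and the number of repetitions are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 50, 20_000    # illustrative sample size and number of repetitions
mu, sigma = 1.0, 1.0    # Exponential(λ=1) has mean 1 and std 1

samples = rng.exponential(scale=1.0, size=(reps, n))    # right-skewed raw data
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))  # standardized sample means

print(f"mean of z = {z.mean():.3f} (target 0)")
print(f"std of z  = {z.std():.3f} (target 1)")
print(f"P(z <= 1) = {(z <= 1).mean():.3f} vs Phi(1) = {stats.norm.cdf(1):.3f}")
```

Rerunning with a smaller `n` (say 5) shows the remaining skew, which is why heavy-tailed or skewed data need larger samples.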

2. Special Cases and Limits
- From Binomial:

    * $Binomial(n, p) \rightarrow N(np, np(1-p))$ as $n \rightarrow \infty$
    * Rule of thumb: $np \geq 5$ and $n(1-p) \geq 5$

- From Poisson:

    * $Poisson(\lambda) \rightarrow N(\lambda, \lambda)$ as $\lambda \rightarrow \infty$
    * Rule of thumb: $\lambda \geq 10$

- To Lognormal:

If $X \sim N(\mu, \sigma^2)$, then $e^X \sim Lognormal(\mu, \sigma^2)$. Used for modeling positive quantities with multiplicative growth
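
The normal approximations to the Binomial and Poisson can be compared against the exact CDFs. This sketch uses illustrative parameters that satisfy the rules of thumb above. The `+ 0.5` continuity correction is a standard refinement added here for the comparison; it is not part of the limits stated above.

```python
import numpy as np
from scipy import stats

# Binomial(n=100, p=0.3): np = 30 and n(1-p) = 70, so the rule of thumb holds
n, p, k = 100, 0.3, 35
exact_binom = stats.binom.cdf(k, n, p)
approx_binom = stats.norm.cdf(k + 0.5, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
print(f"P(Binomial <= {k}): exact={exact_binom:.4f}, normal approx={approx_binom:.4f}")

# Poisson(λ=50): λ >= 10
lam, m = 50, 55
exact_pois = stats.poisson.cdf(m, lam)
approx_pois = stats.norm.cdf(m + 0.5, loc=lam, scale=np.sqrt(lam))
print(f"P(Poisson <= {m}):  exact={exact_pois:.4f}, normal approx={approx_pois:.4f}")
```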

3. Related Distributions
- Chi-Squared:

Sum of $n$ squared independent $N(0,1)$ $\sim \chi^2(n)$

- Student's $t$:

Ratio of $N(0,1)$ to an independent $\sqrt{\chi^2(n)/n}$ is $\sim t(n)$. Approaches normal as $n \rightarrow \infty$

- $F$-distribution:

Ratio of two independent chi-squared variables, each divided by its degrees of freedom (related to normals)

- Multivariate Normal:

Vector generalization: $X \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ where $\boldsymbol{\Sigma}$ is the covariance matrix

- Cauchy:

Ratio of two independent $N(0,1) \sim Cauchy$. Heavy tails, no mean!
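
As a small illustration of one of these relationships, the sketch below shows Student's $t$ critical values approaching the normal one as the degrees of freedom grow. The chosen `df` values are arbitrary.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 95% critical value of N(0, 1)
for df in (2, 5, 30, 1000):
    t_crit = stats.t.ppf(0.975, df=df)
    print(f"df={df:5d}: t critical={t_crit:.4f}  (normal: {z_crit:.4f})")
```

The heavier tails of $t$ at small `df` are why small samples call for wider intervals than the normal would suggest.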
</div>

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

*Fundamental Assumptions*

1. Gaussian Noise Model:

- Linear regression: Y = Xβ + ε, where ε ~ N(0, σ²)
- Leads to least squares estimation
- Maximum likelihood = minimize squared error

2. Weight Initialization:

- Xavier (fan-in form): W ~ N(0, 1/n_in); the original Glorot formulation uses variance 2/(n_in + n_out)
- He: W ~ N(0, 2/n_in) for ReLU
- Controls variance propagation

3. Feature Engineering:

- Standardization: z = (x - μ)/σ has mean 0 and variance 1 (it is N(0, 1) only if x was normal to begin with)
- Many algorithms assume/prefer normalized features
- Improves gradient descent convergence
</br>
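
The Gaussian noise model listed under *Fundamental Assumptions* ties maximum likelihood to least squares: with $\varepsilon \sim N(0, \sigma^2)$, the negative log-likelihood is a constant plus $\text{SSE}/(2\sigma^2)$. The sketch below checks this on synthetic data; `beta_true`, the noise level, and the perturbation are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma = 200, 0.5                                       # illustrative size and noise level
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])  # intercept + one feature
beta_true = np.array([1.0, 2.0])                          # hypothetical "true" coefficients
y = X @ beta_true + rng.normal(0, sigma, n)

def neg_log_likelihood(beta):
    """Negative log-likelihood of y under y = X beta + eps, eps ~ N(0, sigma^2)."""
    return -stats.norm.logpdf(y - X @ beta, loc=0, scale=sigma).sum()

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution

print("OLS estimate:", beta_ols)
print("NLL at OLS estimate:      ", neg_log_likelihood(beta_ols))
print("NLL at perturbed estimate:", neg_log_likelihood(beta_ols + np.array([0.05, -0.05])))
```

Any move away from the least-squares solution raises the negative log-likelihood, so the two criteria pick the same coefficients.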

*Probabilistic Models*

4. Gaussian Mixture Models (GMM):

- Mixture of K normal distributions
- Clustering with soft assignments
- Density estimation

5. Naive Bayes (Gaussian):

- P(feature|class) ~ N(μ_class, σ²_class)
- Simple but effective classifier

6. Gaussian Processes:

- Distribution over functions
- Bayesian non-parametric regression
- Uncertainty quantification
</br>

*Deep Learning*

7. Variational Autoencoders (VAE):

- Latent space ~ N(0, I)
- Reparameterization trick: z = μ + σε, ε ~ N(0,1)
- Learn mean and variance functions

8. Batch Normalization:

- Normalize activations: (x - μ_batch)/σ_batch
- Stabilizes training
- Reduces internal covariate shift

9. Gradient Noise:

- Mini-batch gradients are often modeled as approximately N(true_gradient, variance)
- Can help escape sharp minima
- Adds regularization effect

10. Dropout Approximation:

- Dropout can be viewed as approximate Gaussian noise
- Connection to Bayesian neural networks
</br>
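
The reparameterization trick listed under Variational Autoencoders can be sketched in a few lines of NumPy. The encoder outputs `mu` and `log_var` below are hypothetical fixed values, and the log-variance parameterization is an assumption of this sketch; in a real VAE they would come from a network.

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.5, -1.0])        # hypothetical encoder mean for a 2-D latent
log_var = np.array([-1.0, 0.2])   # hypothetical encoder log-variance
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal((100_000, 2))  # ε ~ N(0, I): all randomness lives here
z = mu + sigma * eps                     # deterministic in (mu, sigma) given ε

print("sample mean:", z.mean(axis=0), "target:", mu)
print("sample std: ", z.std(axis=0), "target:", sigma)
```

Because `z` is a deterministic function of `mu` and `sigma` once `eps` is drawn, gradients can flow through the sampling step.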

*Optimization*

11. Gaussian Approximations:

- Laplace approximation for posterior
- Variational inference with Gaussian family
- Natural gradient methods

12. Confidence Intervals:

- Parameter estimates ± z_α/2 × SE
- Based on asymptotic normality
</br>

*Generative Models*

13. Diffusion Models:

- Add Gaussian noise progressively
- Learn to denoise
- State-of-the-art image generation

14. Score Matching:

- Learn gradient of log-density
- Gaussian perturbations
</br>

*Anomaly Detection*

15. Outlier Detection:

- Flag points > 3σ from mean
- Mahalanobis distance for multivariate
- Assumes normal baseline
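
A minimal sketch of the 3σ rule on synthetic data, with three anomalies injected by hand. The baseline parameters and the injected values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=50, scale=5, size=1000)  # synthetic "normal" baseline
data[:3] = [90.0, 10.0, 85.0]                  # three injected anomalies

z = (data - data.mean()) / data.std()          # z-scores from sample mean and std
outlier_idx = np.flatnonzero(np.abs(z) > 3)
print("flagged indices:", outlier_idx)
print("flagged values: ", data[outlier_idx])
```

Even for clean normal data about 0.27% of points fall outside 3σ, so a few false alarms are expected. The outliers themselves also inflate the estimated mean and std, which is one reason robust estimates are sometimes preferred.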

</div>

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

❌ Assuming everything is normal

- Many real distributions are NOT normal
- Check with Q-Q plots, Shapiro-Wilk test
- Example: Income (lognormal), waiting times (exponential)
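
The normality checks mentioned above can be sketched with `scipy.stats.shapiro` and `scipy.stats.probplot`. The two samples are synthetic, and the Q-Q correlation `r` is used here as a compact numeric stand-in for the plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
normal_sample = rng.normal(0, 1, 300)
skewed_sample = rng.exponential(1.0, 300)  # waiting-time-like data

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    stat, p_value = stats.shapiro(sample)
    # probplot returns the Q-Q points and a line fit; r near 1 means a good normal fit
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    print(f"{name:12s} Shapiro W={stat:.3f}, p={p_value:.3g}, Q-Q r={r:.3f}")
```

A small p-value is evidence against normality; a large one does not prove the data are normal.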

❌ Ignoring outliers

- Normal has light tails
- Real data often has heavier tails
- Outliers can distort μ and σ estimates

❌ Forgetting it's unbounded

- Normal can theoretically produce any value
- Problems when modeling bounded quantities (e.g., probabilities)
- Use a beta distribution (or a logit/sigmoid transform) for [0,1], lognormal for positive quantities

❌ Confusing $\sigma$ with $\sigma^2$

- Parameters are $(\mu, \sigma^2)$ but we often think in terms of $\sigma$
- `scipy` uses $scale=\sigma$ not $scale=\sigma^2$
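
The σ versus σ² mix-up is easy to make with a specification such as $\mathcal{N}(0, 0.01)$, where 0.01 is the variance. A short sketch of the right and wrong calls:

```python
import numpy as np
from scipy import stats

variance = 0.01  # W ~ N(0, 0.01) means variance 0.01, i.e. σ = 0.1

correct = stats.norm(loc=0, scale=np.sqrt(variance))  # scale = σ
mistake = stats.norm(loc=0, scale=variance)           # passes σ² as if it were σ

print(f"correct: std={correct.std():.3f}, "
      f"P(|W| <= 0.1)={correct.cdf(0.1) - correct.cdf(-0.1):.4f}")
print(f"mistake: std={mistake.std():.3f}, "
      f"P(|W| <= 0.1)={mistake.cdf(0.1) - mistake.cdf(-0.1):.4f}")
```

The mistaken version is ten times narrower than intended, so almost all of its mass sits inside $[-0.1, 0.1]$.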

❌ Misapplying CLT

- CLT requires sufficiently large $n$
- Original distribution matters (heavy tails need larger $n$)
- Rule of thumb: $n \geq 30$ (but depends on skewness)

❌ Treating correlation as independence

- Joint normal needs covariance matrix
- Uncorrelated ≠ independent in general; the two coincide only for jointly normal variables
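
A classic counterexample, sketched by simulation: $X \sim N(0, 1)$ and $Y = X^2$ are uncorrelated yet completely dependent. $X$ is normal on its own, but the pair $(X, Y)$ is not jointly normal, so the shortcut does not apply.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_normal(200_000)
y = x**2  # fully determined by x, so clearly dependent

corr_xy = np.corrcoef(x, y)[0, 1]
corr_absx_y = np.corrcoef(np.abs(x), y)[0, 1]
print(f"corr(X, X^2)   = {corr_xy:+.4f}  (near 0: uncorrelated)")
print(f"corr(|X|, X^2) = {corr_absx_y:+.4f}  (dependence shows up after a transform)")
```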

⚠️ Standardization errors:

- Using sample std when population std is known
- Dividing by $n$ instead of $\sqrt{n}$ for standard error
- Forgetting to center (subtract mean) before scaling

❌ Inappropriate hypothesis tests

- Using normal-based tests on non-normal data
- Small samples need $t$-distribution not normal


</div>

We can use SciPy implementation of this distribution, [`scipy.stats.norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html):

In [None]:
mu = 75
sigma = 10
norm_dist = stats.norm(loc=mu, scale=sigma)  # loc=mean, scale=std

# Note: scipy uses (μ, σ) NOT (μ, σ²)
# Alternative: stats.norm(75, 10)

# PDF
x = 80
pdf_value = norm_dist.pdf(x)
print(f"\nPDF at x={x}: f({x}) = {pdf_value:.6f}")

# CDF
cdf_value = norm_dist.cdf(x)
print(f"CDF at x={x}: P(X ≤ {x}) = {cdf_value:.4f}")
print(f"Interpretation: {cdf_value*100:.2f}% scored ≤ {x}")

# Survival function
sf_value = norm_dist.sf(x)  # P(X > x) = 1 - CDF(x)
print(f"P(X > {x}) = {sf_value:.4f}")

# Random sampling
samples = norm_dist.rvs(size=10, random_state=42)
print(f"Generated 10 samples: {samples}")

In [None]:
# 68-95-99.7 Rule Verification
# Within 1 sigma
prob_1sigma = norm_dist.cdf(mu + sigma) - norm_dist.cdf(mu - sigma)
print(f"\nP(μ-σ ≤ X ≤ μ+σ) = P({mu-sigma} ≤ X ≤ {mu+sigma})")
print(f"  = {prob_1sigma:.4f} ≈ 0.6827 (68.27%)")

# Within 2 sigma
prob_2sigma = norm_dist.cdf(mu + 2*sigma) - norm_dist.cdf(mu - 2*sigma)
print(f"\nP(μ-2σ ≤ X ≤ μ+2σ) = P({mu-2*sigma} ≤ X ≤ {mu+2*sigma})")
print(f"  = {prob_2sigma:.4f} ≈ 0.9545 (95.45%)")

# Within 3 sigma
prob_3sigma = norm_dist.cdf(mu + 3*sigma) - norm_dist.cdf(mu - 3*sigma)
print(f"\nP(μ-3σ ≤ X ≤ μ+3σ) = P({mu-3*sigma} ≤ X ≤ {mu+3*sigma})")
print(f"  = {prob_3sigma:.4f} ≈ 0.9973 (99.73%)")

# Practical rule: 95% within 1.96 sigma (for confidence intervals)
prob_196sigma = norm_dist.cdf(mu + 1.96*sigma) - norm_dist.cdf(mu - 1.96*sigma)
print(f"\nP(μ-1.96σ ≤ X ≤ μ+1.96σ) = {prob_196sigma:.4f} (≈ 95%)")
print("This is used for 95% confidence intervals!")
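
The value 1.96 is a rounded critical value. A short sketch using the inverse CDF (`ppf`) recovers the unrounded one and shows how close 1.96 gets to 95% coverage:

```python
from scipy import stats

z_exact = stats.norm.ppf(0.975)  # leaves 2.5% in each tail
coverage_196 = stats.norm.cdf(1.96) - stats.norm.cdf(-1.96)
coverage_exact = stats.norm.cdf(z_exact) - stats.norm.cdf(-z_exact)

print(f"critical value from ppf:   {z_exact:.6f}")
print(f"coverage within ±1.96σ:    {coverage_196:.6f}")
print(f"coverage within ±{z_exact:.4f}σ: {coverage_exact:.6f}")
```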

In [None]:
# Standardization
# Original score
x_score = 90
z_score = (x_score - mu) / sigma
print(f"\nOriginal score: {x_score}")
print(f"Z-score: z = (x-μ)/σ = ({x_score}-{mu})/{sigma} = {z_score:.2f}")
print(f"Interpretation: {x_score} is {z_score:.2f} standard deviations above the mean")

# Using standardized normal
std_norm = stats.norm(0, 1)
prob_above = std_norm.sf(z_score)
print(f"\nP(X > {x_score}) = P(Z > {z_score:.2f}) = {prob_above:.4f}")
print(f"About {prob_above*100:.2f}% scored higher than {x_score}")

In [None]:
# interactive visualisation
def plot_normal(mu=0.0, sigma=1.0):    
    """
    Interactive visualization of Normal distribution
    
    Parameters:
    -----------
    mu : float - mean (location parameter)
    sigma : float - standard deviation (scale parameter)
    """
    print("Visual Characteristics")
    print("PDF Shape:")
    print("     Symmetric bell curve centered at μ")
    print("     Maximum at μ (mode = median = mean)")
    print("     Inflection points at μ ± σ")
    print("     Gaussian shape: f(x) = (1/(σ√(2π)))e^(-(x-μ)²/(2σ²))")
    print("     Tails extend to ±∞ (theoretically)")
    print("     68% of data within μ ± σ")
    print("     95% of data within μ ± 2σ")
    print("     99.7% of data within μ ± 3σ (empirical rule)")
    print("CDF Shape:")
    print("     S-shaped (sigmoid) curve")
    print("     Point of inflection at μ (where CDF = 0.5)")
    print("     Symmetric around μ")
    print("     Steepest slope at μ")
    print("     Asymptotically approaches 0 and 1")
    print("Effect of μ (mean):")
    print("     Shifts entire distribution left/right")
    print("     Does not change shape")
    print("     Centers the bell curve at new location")
    print("Effect of σ (std dev):")
    print("     Large σ: Wider, flatter bell, more spread")
    print("     Small σ: Narrower, taller bell, more concentrated")
    print("     Controls dispersion around mean")
    print("Key visual features:")
    print("     Perfect symmetry (skewness = 0)")
    print("     Mean = Median = Mode (all coincide)")
    print("     Unimodal (single peak)")
    print("     Tails never touch x-axis")
    print("     Universal 68-95-99.7 rule applies")
    
    if sigma <= 0:
        print("Error: σ must be > 0")
        return
    
    # Create distribution
    norm_dist = stats.norm(loc=mu, scale=sigma)
    
    mean_val = mu
    var_val = sigma ** 2
    std_val = sigma
    median_val = mu
    mode_val = mu
    
    # Create figure
    fig = plt.figure(figsize=(16, 5))
    gs = fig.add_gridspec(1, 3, hspace=0.35, wspace=0.3)
    
    # ========================================================================
    # Panel 1: PDF
    # ========================================================================
    ax1 = fig.add_subplot(gs[0, 0])
    
    # Show range: mean ± 4 standard deviations
    x_range = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
    pdf_values = norm_dist.pdf(x_range)
    
    ax1.fill_between(x_range, pdf_values, alpha=0.3, color='skyblue')
    ax1.plot(x_range, pdf_values, 'darkblue', linewidth=2.5)
    
    # Mark mean/median/mode (all same for normal)
    ax1.axvline(mean_val, color='red', linestyle='--', linewidth=2, 
                label=f'Mean=Median=Mode={mean_val:.3f}')
    
    # Mark μ ± σ (inflection points)
    ax1.axvline(mu - sigma, color='orange', linestyle=':', linewidth=2, 
                label=f'μ-σ={mu-sigma:.3f}')
    ax1.axvline(mu + sigma, color='orange', linestyle=':', linewidth=2, 
                label=f'μ+σ={mu+sigma:.3f}')
    
    # Shade the 68% region (μ ± σ)
    mask_1sigma = (x_range >= mu - sigma) & (x_range <= mu + sigma)
    ax1.fill_between(x_range[mask_1sigma], pdf_values[mask_1sigma], 
                      alpha=0.2, color='green', label='68% area (μ±σ)')
    
    ax1.set_xlabel('x', fontsize=11)
    ax1.set_ylabel('f(x) - Density', fontsize=11)
    ax1.set_title(f'PDF: Normal(μ={mu:.3f}, σ={sigma:.3f})', 
                  fontsize=12, fontweight='bold')
    ax1.legend(fontsize=9, loc='upper right')
    ax1.grid(True, alpha=0.3)
    
    # Add formula
    formula_text = f'f(x) = (1/({sigma:.2f}√(2π)))e^(-(x-{mu:.2f})²/(2·{sigma:.2f}²))'
    ax1.text(0.02, 0.05, formula_text,
            transform=ax1.transAxes, verticalalignment='bottom',
            horizontalalignment='left',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
            fontsize=9)
    
    # ========================================================================
    # Panel 2: CDF
    # ========================================================================
    ax2 = fig.add_subplot(gs[0, 1])
    
    cdf_values = norm_dist.cdf(x_range)
    
    ax2.plot(x_range, cdf_values, 'b-', linewidth=2.5, label='CDF F(x)')
    
    # Mark median (where CDF = 0.5)
    ax2.axhline(0.5, color='green', linestyle=':', alpha=0.7, linewidth=2)
    ax2.axvline(median_val, color='green', linestyle=':', alpha=0.7, linewidth=2)
    ax2.plot(median_val, 0.5, 'go', markersize=10, label=f'Median={median_val:.3f}')
    
    # Mark F(μ ± σ): the CDF levels that bound the central 68% region
    ax2.axhline(norm_dist.cdf(mu + sigma), color='orange', linestyle='--', 
                alpha=0.5, label='F(μ+σ)≈0.84')
    ax2.axhline(norm_dist.cdf(mu - sigma), color='orange', linestyle='--', 
                alpha=0.5, label='F(μ-σ)≈0.16')
    
    ax2.set_xlabel('x', fontsize=11)
    ax2.set_ylabel('F(x) - Cumulative Probability', fontsize=11)
    ax2.set_title('CDF: Cumulative Distribution Function', 
                  fontsize=12, fontweight='bold')
    ax2.legend(fontsize=9)
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.05, 1.05)
    
    # ========================================================================
    # Panel 3: Empirical Rule Visualization
    # ========================================================================
    ax3 = fig.add_subplot(gs[0, 2])
    
    ax3.fill_between(x_range, pdf_values, alpha=0.2, color='lightgray', 
                     label='Total area = 1')
    ax3.plot(x_range, pdf_values, 'darkblue', linewidth=2.5)
    
    # 68% region (μ ± σ)
    mask_1sigma = (x_range >= mu - sigma) & (x_range <= mu + sigma)
    ax3.fill_between(x_range[mask_1sigma], pdf_values[mask_1sigma], 
                      alpha=0.4, color='green', label='68.27% (μ±σ)')
    
    # 95% region (μ ± 2σ)
    mask_2sigma = (x_range >= mu - 2*sigma) & (x_range <= mu + 2*sigma)
    ax3.fill_between(x_range[mask_2sigma], pdf_values[mask_2sigma], 
                      alpha=0.3, color='yellow', label='95.45% (μ±2σ)')
    
    # 99.7% region (μ ± 3σ)
    mask_3sigma = (x_range >= mu - 3*sigma) & (x_range <= mu + 3*sigma)
    ax3.fill_between(x_range[mask_3sigma], pdf_values[mask_3sigma], 
                      alpha=0.2, color='orange', label='99.73% (μ±3σ)')
    
    # Mark the boundaries
    for i in range(1, 4):
        ax3.axvline(mu - i*sigma, color='red', linestyle='--', 
                    alpha=0.5, linewidth=1)
        ax3.axvline(mu + i*sigma, color='red', linestyle='--', 
                    alpha=0.5, linewidth=1)
    
    ax3.set_xlabel('x', fontsize=11)
    ax3.set_ylabel('f(x) - Density', fontsize=11)
    ax3.set_title('Empirical Rule (68-95-99.7)', 
                  fontsize=12, fontweight='bold')
    ax3.legend(fontsize=9, loc='upper right')
    ax3.grid(True, alpha=0.3)
    
    # Add text annotations for percentages
    ax3.text(mu, max(pdf_values)*0.5, '68%', 
             ha='center', va='center', fontsize=12, fontweight='bold', 
             color='darkgreen')
    ax3.text(mu, max(pdf_values)*0.7, '95%', 
             ha='center', va='center', fontsize=11, fontweight='bold', 
             color='darkorange')
    ax3.text(mu, max(pdf_values)*0.85, '99.7%', 
             ha='center', va='center', fontsize=10, fontweight='bold', 
             color='red')
    
    plt.tight_layout()
    plt.close()
    
    return fig

# Interactive widget
interact(plot_normal,
         mu=FloatSlider(min=-10, max=10, step=0.5, value=0, 
                        description='Mean (μ)'),
         sigma=FloatSlider(min=0.1, max=5, step=0.1, value=1, 
                          description='Std Dev (σ)'),
)

## Return to Opening Challenge

QUESTION 1: P(W = 0.5 exactly) = ?
ANSWER: 0

WHY? Because continuous distributions have:
- Uncountably infinite values
- Probability is AREA, not height
- Single point has zero width → zero area

QUESTION 2: But the weight exists as 0.5. How?

ANSWER: The event happened, even though it had probability 0!
- Probability 0 ≠ Impossible
- Individual outcomes have probability 0
- But SOME outcome must occur

QUESTION 3: How do we work with this?

ANSWER: Use intervals!
- Don't ask: P(W = 0.5)
- Ask: $P(0.49 \leq W \leq 0.51) = \int_{0.49}^{0.51} f(w)dw$
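
The interval questions from the opening challenge can be answered directly with CDF differences. A minimal sketch for both initializations (the helper `interval_prob` is introduced here only for illustration):

```python
import numpy as np
from scipy import stats

uniform_w = stats.uniform(loc=-1, scale=2)         # Uniform(-1, 1): loc=a, scale=b-a
normal_w = stats.norm(loc=0, scale=np.sqrt(0.01))  # N(0, 0.01): variance 0.01, σ = 0.1

def interval_prob(dist, a, b):
    """P(a < W <= b) = F(b) - F(a)."""
    return dist.cdf(b) - dist.cdf(a)

for name, dist in [("Uniform(-1, 1)", uniform_w), ("N(0, 0.01)", normal_w)]:
    print(name)
    print(f"  P(0.49 <= W <= 0.51) = {interval_prob(dist, 0.49, 0.51):.6f}")
    print(f"  P(0.4 < W <= 0.6)    = {interval_prob(dist, 0.4, 0.6):.6f}")
    print(f"  P(|W| <= 0.1)        = {interval_prob(dist, -0.1, 0.1):.6f}")
```

For a continuous distribution, including or excluding the endpoints does not change these probabilities, since single points carry zero probability.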

QUESTION 4: Weight initialization?

ANSWER: W ~ N(0, 1/n_in) means:
- NOT that each weight has specific probability
- But that weights are distributed according to PDF
- We care about VARIANCE staying constant: σ² = 1/n_in
- This helps prevent activations from exploding or vanishing as they pass through layers
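
The variance argument can be checked with a single linear layer and no activation function. This is a sketch under simplifying assumptions: unit-variance inputs and illustrative layer sizes.

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_out, batch = 512, 256, 2000     # illustrative layer sizes
x = rng.standard_normal((batch, n_in))  # inputs with unit variance

W_scaled = rng.normal(0, np.sqrt(1 / n_in), size=(n_in, n_out))  # variance 1/n_in
W_unit = rng.normal(0, 1.0, size=(n_in, n_out))                  # variance 1

var_scaled = (x @ W_scaled).var()
var_unit = (x @ W_unit).var()
print(f"pre-activation variance with N(0, 1/n_in): {var_scaled:.3f}")
print(f"pre-activation variance with N(0, 1):      {var_unit:.1f} (about n_in = {n_in})")
```

With unit-variance weights the signal variance grows by a factor of roughly `n_in` per layer, which compounds quickly in deep networks.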

KEY LESSON:

Continuous distributions describe:
- Not individual outcomes (all have P=0)
- But the DENSITY of outcomes over intervals
- And the SHAPE of the distribution (where values concentrate)

PRACTICAL TAKEAWAY:
When you initialize weights W ~ N(0, σ²):

- You're not assigning probabilities to exact values
- You're defining how weights SPREAD around 0
- The σ determines concentration vs dispersion
- Choose σ = 1/√n_in to keep activations stable!


In [None]:
# demo
demo_weights()

## Common Mistakes

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls to Avoid:</h4>

- Thinking $f(x)$ is a probability                                   
- Expecting $P(X=x) > 0$ for continuous                            
- Using PMF formulas for continuous R.V.                         
- Forgetting that $f(x)$ can exceed 1                              
- Confusing $\sigma$ (std dev) with $\sigma^2$ (variance)                       
- Using $\mathcal{N}(0,1)$ for all weight initialization 
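
Several of these pitfalls can be seen in one small example: a narrow normal whose density exceeds 1 at the peak, while the total area is still 1 and any single point has probability 0. The choice σ = 0.1 is illustrative.

```python
from scipy import stats
from scipy.integrate import quad

narrow = stats.norm(loc=0, scale=0.1)       # small σ gives a tall, thin bell
peak = narrow.pdf(0)                        # density at the mode: 1/(σ√(2π))
area, _ = quad(narrow.pdf, -1, 1)           # ±10σ captures essentially all the mass
point_prob = narrow.cdf(0) - narrow.cdf(0)  # P(X = 0) as a zero-width interval

print(f"f(0) = {peak:.4f}  (a density, allowed to exceed 1)")
print(f"area under the curve on [-1, 1] = {area:.6f}")
print(f"P(X = 0) = {point_prob}")
```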

</div>

## Applications in Machine Learning



<div class="alert alert-secondary">
<h4>🤖 ML Applications Summary</h4>

1. Weight Initialization
- Xavier (fan-in form):  $W \sim \mathcal{N}(0, 1/n_{\text{in}})$      [tanh, sigmoid]                  
- He:      $W \sim \mathcal{N}(0, 2/n_{\text{in}})$      [ReLU] 

2. Feature Scaling
$$Z = \frac{X - \mu}{\sigma}$$
- $Z$ has mean 0 and variance 1; $Z \sim \mathcal{N}(0, 1)$ only if $X$ is normal

3. Gaussian Noise
- Add $\epsilon \sim \mathcal{N}(0, \sigma^2)$ for regularization 
</div>

<div class="alert alert-secondary">
<h4>🔧 Python Essentials</h4>

```
from scipy import stats

# PDF (density)
stats.norm.pdf(x, loc=mu, scale=sigma)

# CDF (probability)
stats.norm.cdf(x, loc=mu, scale=sigma)
                                                                   
# Inverse CDF (quantiles)
stats.norm.ppf(0.95, loc=mu, scale=sigma)  # 95th percentile
                                                                  
# Random sampling
stats.norm.rvs(loc=mu, scale=sigma, size=n)
```

</div>

## Key Takeaways

<div class="alert alert-summary">
<h4>🎓 Key Takeaways</h4>

1. CONTINUOUS vs DISCRETE
   - Discrete: $P(X=x)$ can be > 0 (PMF)
   - Continuous: $P(X=x) = 0$ always (PDF)
   - Continuous: Use $P(a ≤ X ≤ b) = \int_a^b f(x)dx$

2. PROBABILITY DENSITY FUNCTION (PDF)
   - $f(x) ≥ 0$ everywhere
   - $\int_{-\infty}^{\infty} f(x)dx = 1$
   - $f(x)$ is DENSITY, not probability
   - $f(x)$ CAN exceed 1! (unlike PMF)

3. CUMULATIVE DISTRIBUTION FUNCTION (CDF)
   - $F(x) = P(X ≤ x) = \int_{-\infty}^x f(t)dt$
   - $F(x)$ is CONTINUOUS (no jumps)
   - $f(x) = dF/dx$ (derivative relationship)
   - $P(a < X ≤ b) = F(b) - F(a)$

4. KEY DISTRIBUTIONS
   
   UNIFORM(a,b):
   - $f(x) = 1/(b-a)$ on $[a,b]$
   - Constant density: equal-length intervals in $[a,b]$ are equally likely
   - $E[X] = (a+b)/2$, $Var = (b-a)^2/12$
   
   EXPONENTIAL($\lambda$):
   - $f(x) = \lambda e^{-\lambda x}$ for $x\geq 0$
   - Waiting times
   - Memoryless property
   - $E[X] = 1/\lambda$, $Var = 1/\lambda^2$
   
   NORMAL($\mu$, $\sigma^2$): ⭐ MOST IMPORTANT
   - $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
   - Bell-shaped, symmetric
   - 68-95-99.7 rule
   - $E[X] = \mu$, $Var = \sigma^2$
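
The PDF/CDF relationships in points 2 and 3 can be verified numerically. This sketch uses an illustrative $\mathcal{N}(75, 10^2)$, a central difference for $dF/dx$, and a hand-written trapezoid rule for the area.

```python
import numpy as np
from scipy import stats

dist = stats.norm(loc=75, scale=10)  # illustrative N(75, 10^2)
x, h = 80.0, 1e-4

# Central difference of the CDF approximates the PDF: f(x) = dF/dx
pdf_from_cdf = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)
print(f"f({x}) from scipy: {dist.pdf(x):.8f}")
print(f"f({x}) from dF/dx: {pdf_from_cdf:.8f}")

# P(a < X <= b) = F(b) - F(a) agrees with the area under the PDF
a, b = 70.0, 85.0
grid = np.linspace(a, b, 10_001)
area = np.sum(0.5 * (dist.pdf(grid[1:]) + dist.pdf(grid[:-1])) * np.diff(grid))  # trapezoid rule
print(f"F(b) - F(a) = {dist.cdf(b) - dist.cdf(a):.8f},  area = {area:.8f}")
```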


</div>