<a href="https://colab.research.google.com/github/farrelrassya/teachingMLDL/blob/main/06.Probability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

> *"The laws of probability, so true in general, so fallacious in particular."*  
> —Edward Gibbon

---

It is hard to do data science without some sort of understanding of probability and its mathematics. As with our treatment of statistics in Chapter 5, we’ll wave our hands a lot and elide many of the technicalities.

For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from some universe of events. Rather than getting technical about what these terms mean, think of rolling a die. The universe consists of all possible outcomes, and any subset of these outcomes is an event; for example, “the die rolls a 1” or “the die rolls an even number.”

Notationally, we write P(E) to mean “the probability of the event E.”
We’ll use probability theory to build models. We’ll use probability theory to evaluate models. We’ll use probability theory all over the place.
One could, were one so inclined, get really deep into the philosophy of what probability theory means. (This is best done over beers.) We won’t be doing that.

# Dependence and Independence (Probability Events)

---

## Summary: Dependence and Independence (Probability Events)

**Why Dependence/Independence?**
In probability, we often want to know whether two events are connected. If learning about one event changes what we believe about the other, they are **dependent**. If not, they are **independent**.

### Key Points:

**1. Definition: Dependent vs Independent Events**

* **Dependent:** knowing whether event **E** happened gives information about whether **F** happened (and vice versa)
* **Independent:** knowing **E** happened gives **no information** about **F**

**2. Example: Two Coin Flips**

* Event: “first flip is heads”

  * gives no information about “second flip is heads”
    → these are **independent**

* Event: “first flip is heads”

  * gives information about “both flips are tails”
    → these are **dependent**
    (because if the first flip is heads, then “both tails” is impossible)

**3. Mathematical Condition for Independence**
Two events **E** and **F** are independent if:
$$
P(E \cap F) = P(E)\,P(F)
$$
Meaning:


* probability both happen = probability of (E) × probability of (F)

**4. Coin Flip Probability Check**

- $P(\text{first flip heads}) = \frac{1}{2}$  
- $P(\text{both flips tails}) = \frac{1}{4}$  
- $P(\text{first flip heads AND both tails}) = 0$  

Since:

- $0 \neq \left(\frac{1}{2}\right)\left(\frac{1}{4}\right)$  
  these events are **not independent** (they are dependent).
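
The check above can be reproduced by enumerating the four equally likely outcomes of two fair flips. This is a minimal sketch: the helper names `prob` and `independent` are illustrative and do not come from the original notebook.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coin flips: 4 outcomes, each with probability 1/4.
outcomes = list(product("HT", repeat=2))

def prob(event) -> Fraction:
    """Probability of an event given as a predicate over (first, second)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def independent(e, f) -> bool:
    """Check the product rule P(E and F) == P(E) * P(F)."""
    both = prob(lambda o: e(o) and f(o))
    return both == prob(e) * prob(f)

first_heads = lambda o: o[0] == "H"
second_heads = lambda o: o[1] == "H"
both_tails = lambda o: o == ("T", "T")

print(independent(first_heads, second_heads))  # True
print(independent(first_heads, both_tails))    # False
```

Exact fractions are used instead of floats so that the equality test in `independent` is not affected by rounding.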



---

**Takeaway:** Independence means one event gives no information about the other, and mathematically it requires:

$$
P(E \cap F) = P(E)\,P(F)
$$

Coin flips are a classic example of both independent and dependent event pairs depending on how the events are defined.

**Key Concepts:**

* Dependent events → information carries over
* Independent events → no influence
* Independence rule:

$$
P(E \cap F) = P(E)\,P(F)
$$


# Conditional Probability

**Why Conditional Probability?**

Conditional probability lets us compute the chance of an event $E$ happening **given that** another event $F$ has happened. This is essential when events are not independent and information changes the probability.

### Key Points:

**1. Independence vs Conditional Probability**

If events $E$ and $F$ are independent:

$$P(E \cap F) = P(E)\,P(F)$$

Whether or not they are independent, as long as $P(F) \neq 0$, the probability of $E$ conditional on $F$ is defined as:

$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$

* Read $P(E \mid F)$ as "probability of $E$ given $F$"

**2. Equivalent Rearrangement (Useful Identity)**

$$P(E \cap F) = P(E \mid F)\,P(F)$$

**3. Independence Implies Conditioning Doesn't Change Anything**

If $E$ and $F$ are independent:

$$P(E \mid F) = P(E)$$

Meaning: knowing $F$ happened gives no extra information about $E$

---

## Two-Child Example (Tricky Conditional Probability)

**Why This Example?**

It shows that conditional probability depends heavily on **what exactly you are told**, even if the situations seem similar.

### Key Points:

**1. Setup Assumptions**

* Each child is equally likely to be a boy or girl
* Children's genders are independent

So probabilities are:

* no girls: $1/4$
* one girl + one boy: $1/2$
* two girls: $1/4$

**2. Conditional on "Older Child is a Girl"**

Let:

* $B$: both children are girls
* $G$: older child is a girl

Then:

$$P(B \mid G) = \frac{P(B \cap G)}{P(G)}$$

Since $B \cap G = B$, we get:

$$P(B \mid G) = \frac{P(B)}{P(G)} = \frac{1/4}{1/2} = \frac{1}{2}$$

**3. Conditional on "At Least One Child is a Girl"**

Let:

* $L$: at least one child is a girl

Similarly $B \cap L = B$, so:

$$P(B \mid L) = \frac{P(B)}{P(L)}$$

Now $P(L) = 3/4$, hence:

$$P(B \mid L) = \frac{1/4}{3/4} = \frac{1}{3}$$

Reasoning:

* Given "at least one girl," families with one girl + one boy are **twice as likely** as families with two girls.

---

## Summary: Simulation to Verify the Result

**Why Simulate?**

Simulation confirms the theoretical probabilities by generating many random families and counting outcomes.

### Key Points:

**1. Generate Random Kids**
```python
import enum, random

class Kid(enum.Enum):
    BOY = 0
    GIRL = 1

def random_kid() -> Kid:
    return random.choice([Kid.BOY, Kid.GIRL])
```

**2. Run Trials and Estimate Probabilities**
```python
both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)

for _ in range(10000):
    younger = random_kid()
    older = random_kid()

    if older == Kid.GIRL:
        older_girl += 1
    if older == Kid.GIRL and younger == Kid.GIRL:
        both_girls += 1
    if older == Kid.GIRL or younger == Kid.GIRL:
        either_girl += 1

print("P(both | older):", both_girls / older_girl)   # ~ 1/2
print("P(both | either):", both_girls / either_girl) # ~ 1/3
```

* Results match expectation:
  * $P(B \mid G) \approx 1/2$
  * $P(B \mid L) \approx 1/3$

---

**Takeaway:** Conditional probability depends on what information you condition on. Even similar-sounding facts ("older child is a girl" vs "at least one is a girl") can produce different results.

**Key Concepts:**

* Conditional probability: $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$

* Product rule: $P(E \cap F) = P(E \mid F)\,P(F)$

* Independence implies: $P(E \mid F) = P(E)$

# Bayes’s Theorem (Reversing Conditional Probabilities)



**Why Bayes's Theorem?**

Bayes's theorem helps you compute $P(E \mid F)$ when you instead know $P(F \mid E)$. It's a core tool for reasoning under uncertainty (especially when base rates matter).

### Key Points:

**1. Start from Conditional Probability**

$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$

$$P(F \mid E) = \frac{P(E \cap F)}{P(E)}$$

Rearranging gives:

$$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F)}$$

**2. Expand $P(F)$ Using Two Mutually Exclusive Cases**

Event $F$ can happen either with $E$ or with $\neg E$:

$$P(F) = P(F \cap E) + P(F \cap \neg E)$$

Substitute:

$$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F \mid E)\,P(E) + P(F \mid \neg E)\,P(\neg E)}$$

This is the common Bayes form.
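
As a quick sanity check, the expanded form can be applied to the two-child example from the previous section, with $E$ = "both girls" and $F$ = "at least one girl". This is a sketch only: the function name `bayes` and its parameter names are illustrative.

```python
from fractions import Fraction

def bayes(p_f_given_e: Fraction, p_e: Fraction,
          p_f_given_not_e: Fraction) -> Fraction:
    """P(E | F) computed from P(F | E), P(E), and P(F | not E)."""
    numerator = p_f_given_e * p_e
    return numerator / (numerator + p_f_given_not_e * (1 - p_e))

p_b = Fraction(1, 4)              # P(both girls)
p_l_given_b = Fraction(1)         # two girls always means at least one girl
p_l_given_not_b = Fraction(2, 3)  # of boy-boy, boy-girl, girl-boy, two contain a girl

print(bayes(p_l_given_b, p_b, p_l_given_not_b))  # 1/3
```

The result agrees with the $1/3$ obtained directly from the definition of conditional probability.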

---

## Summary: Medical Test Example (Base Rate Fallacy)

**Why This Example?**

It shows that even a highly accurate test can produce many false positives when the disease is very rare.

### Key Points:

**1. Define Events**

* $T$: test is positive
* $D$: person has the disease

We want: $P(D \mid T)$

**2. Plug Into Bayes's Theorem**

$$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid \neg D)\,P(\neg D)}$$

Given:
* $P(D) = 1/10{,}000 = 0.0001$
* $P(\neg D) = 0.9999$
* $P(T \mid D) = 0.99$
* $P(T \mid \neg D) = 0.01$

Result:
* $P(D \mid T) \approx 0.98\%$, so fewer than 1% of positive tests actually indicate disease.

**3. Intuition with 1,000,000 People**

* Expected diseased: $100$
  * positives among them: $99$
* Expected not diseased: $999{,}900$
  * false positives among them: $9{,}999$

So positive tests total ($99 + 9{,}999$), and only $99$ are real:

$$\frac{99}{99 + 9999} \approx 0.98\%$$
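
Both routes to the answer (Bayes's theorem and the head count over 1,000,000 people) can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are the ones given above.

```python
p_d = 0.0001            # P(D): 1 person in 10,000 has the disease
p_t_given_d = 0.99      # P(T | D)
p_t_given_not_d = 0.01  # P(T | not D)

# Route 1: Bayes's theorem
numerator = p_t_given_d * p_d
p_d_given_t = numerator / (numerator + p_t_given_not_d * (1 - p_d))

# Route 2: expected counts in a population of 1,000,000
population = 1_000_000
diseased = population * p_d
true_positives = diseased * p_t_given_d
false_positives = (population - diseased) * p_t_given_not_d
from_counts = true_positives / (true_positives + false_positives)

print(f"{p_d_given_t:.4%}")  # 0.9804%
print(f"{from_counts:.4%}")  # 0.9804%
```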

**4. Important Assumption**

This assumes people take the test randomly. If mostly symptomatic people test, the probability would be higher because you'd be conditioning on more information.

---

**Takeaway:** Bayes's theorem shows why base rates matter. Even with an accurate test, the probability that a positive result reflects real disease can be low when the condition is rare.

**Key Formulas:**

$$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F)}$$

$$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F \mid E)\,P(E) + P(F \mid \neg E)\,P(\neg E)}$$

# Random Variable

**Why Random Variables?**

A random variable represents a quantity whose value is uncertain, and whose possible outcomes follow a probability distribution. Random variables are fundamental for modeling randomness in statistics and data science.

### Key Points:

**1. Definition of a Random Variable**

A random variable is a variable whose possible values have an associated probability distribution.

Examples:
* Coin flip outcome: $1$ if heads, $0$ if tails
* Number of heads in 10 flips
* Uniform pick from `range(10)` (values 0–9 equally likely)

**2. Probability Distribution Describes Outcome Likelihood**

Examples:
* Coin flip variable:
  * $P(X=0)=0.5$
  * $P(X=1)=0.5$
* Uniform `range(10)` variable:
  * $P(X=k)=0.1$ for each $k \in \{0,1,2,\dots,9\}$

**3. Expected Value (Mean of a Random Variable)**

Expected value is the probability-weighted average of outcomes.

Coin flip example:

$$\mathbb{E}[X] = 0 \cdot \frac{1}{2} + 1 \cdot \frac{1}{2} = \frac{1}{2}$$

Uniform `range(10)` example:
* Expected value is $4.5$
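
The two expected values above can be computed from the distributions directly. This is a small sketch in which a distribution is assumed to be stored as a dict from value to probability; `expected_value` is an illustrative name.

```python
from fractions import Fraction
from typing import Dict

def expected_value(distribution: Dict[int, Fraction]) -> Fraction:
    """Probability-weighted average of the values of a discrete distribution."""
    assert sum(distribution.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in distribution.items())

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
uniform_10 = {k: Fraction(1, 10) for k in range(10)}

print(expected_value(coin))               # 1/2
print(float(expected_value(uniform_10)))  # 4.5
```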

**4. Random Variables Can Be Conditioned**

A random variable can be conditioned on an event, which gives a new random variable with its own distribution.

Two-child example ($X$ = number of girls):
* $P(X=0)=\frac{1}{4}$
* $P(X=1)=\frac{1}{2}$
* $P(X=2)=\frac{1}{4}$

Conditional variables:
* $Y$ = number of girls given at least one child is a girl
  * $P(Y=1)=\frac{2}{3}$
  * $P(Y=2)=\frac{1}{3}$
* $Z$ = number of girls given older child is a girl
  * $P(Z=1)=\frac{1}{2}$
  * $P(Z=2)=\frac{1}{2}$
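
The distributions of $X$, $Y$, and $Z$ can be derived by listing the four equally likely families and keeping only those that satisfy the condition. This is a sketch under the setup assumptions stated earlier; `families` and `girls_distribution` are illustrative names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# (older, younger) pairs; all four are equally likely.
families = list(product("BG", repeat=2))

def girls_distribution(condition=lambda family: True):
    """Distribution of the number of girls among families meeting `condition`."""
    kept = [f for f in families if condition(f)]
    counts = Counter(f.count("G") for f in kept)
    return {k: Fraction(v, len(kept)) for k, v in sorted(counts.items())}

X = girls_distribution()
Y = girls_distribution(lambda f: "G" in f)     # at least one girl
Z = girls_distribution(lambda f: f[0] == "G")  # older child is a girl

for name, dist in (("X", X), ("Y", Y), ("Z", Z)):
    print(name, {k: str(p) for k, p in dist.items()})
# X {0: '1/4', 1: '1/2', 2: '1/4'}
# Y {1: '2/3', 2: '1/3'}
# Z {1: '1/2', 2: '1/2'}
```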

**5. Random Variables Often Appear Implicitly**

* In practice, many data science methods use random variables even if not explicitly labeled as such
* Looking deeper often reveals random-variable thinking behind the math

---

**Takeaway:** A random variable maps uncertain outcomes into numeric values with a probability distribution. You can compute expected values and update distributions using conditioning, which is key for probabilistic reasoning.

**Key Concepts:**

* Random variable → numeric outcomes + distribution
* Distribution → probabilities for each value
* Expected value: $$\mathbb{E}[X] = \sum_x x \, P(X=x)$$
* Conditioning changes the distribution of a random variable

# Continuous Distributions (PDF and CDF)

**Why Continuous Distributions?**

Discrete distributions assign probability to distinct outcomes (like coin flips). But many real-world values vary continuously (like measurements), so we use **continuous distributions** over real numbers.

### Key Points:

**1. Discrete vs Continuous Distributions**

* **Discrete distribution:** assigns positive probability to specific outcomes
  Example: coin flip ($0$ or $1$)
* **Continuous distribution:** outcomes lie on a continuum (infinitely many values)
  Example: uniform distribution over $[0,1]$

**2. Why Individual Points Have Probability 0**

* Since there are infinitely many real numbers in an interval, a continuous distribution must assign:
  * $P(X = x) = 0$ for any single point

**3. Probability Density Function (PDF)**

A continuous distribution is represented by a **PDF** $f(x)$, where probability comes from **area under the curve**.

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

Approximation intuition:

$$P(x \le X \le x+h) \approx h \cdot f(x) \quad \text{(for small } h\text{)}$$

**4. Uniform Distribution PDF**
```python
def uniform_pdf(x: float) -> float:
    return 1 if 0 <= x < 1 else 0
```

* Constant density of $1$ inside $[0,1)$, zero outside
* Example: probability between $0.2$ and $0.3$ is $0.1$
* `random.random()` behaves like a pseudo-random variable with this uniform density
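
To make the "area under the curve" idea concrete, the integral can be approximated numerically. The sketch below uses a midpoint Riemann sum; `integrate` is an illustrative helper, not something defined in the original notebook, and `uniform_pdf` is repeated so the block runs on its own.

```python
def uniform_pdf(x: float) -> float:
    return 1 if 0 <= x < 1 else 0

def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# P(0.2 <= X <= 0.3) for the uniform density
print(round(integrate(uniform_pdf, 0.2, 0.3), 6))  # 0.1

# The total area under any PDF should be (approximately) 1
total_area = integrate(uniform_pdf, -1.0, 2.0)
print(total_area)
```

The second result is only close to 1, because the sum cannot resolve the jumps of the density at 0 and 1 exactly.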

**5. Cumulative Distribution Function (CDF)**

The **CDF** gives the probability that a random variable is less than or equal to $x$.

$$F(x) = P(X \le x)$$

Uniform distribution CDF:
```python
def uniform_cdf(x: float) -> float:
    """Returns the probability that a uniform random variable is <= x"""
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1
```

* If $x < 0$, probability is $0$
* If $0 \le x < 1$, probability is $x$
* If $x \ge 1$, probability is $1$
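
One way to see what the CDF means is to compare it with the fraction of random draws that fall at or below $x$. This is a rough empirical check, not a proof; `empirical_cdf` is an illustrative helper, and `uniform_cdf` is repeated so the block is self-contained.

```python
import random

def uniform_cdf(x: float) -> float:
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1

random.seed(0)
samples = [random.random() for _ in range(100_000)]

def empirical_cdf(x: float) -> float:
    """Fraction of the samples that are <= x."""
    return sum(1 for s in samples if s <= x) / len(samples)

for x in (-0.5, 0.25, 0.7, 1.5):
    print(x, uniform_cdf(x), round(empirical_cdf(x), 3))

# Interval probabilities are differences of CDF values:
# P(0.2 < X <= 0.3) = F(0.3) - F(0.2), which is about 0.1
interval = uniform_cdf(0.3) - uniform_cdf(0.2)
```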

---

**Takeaway:** Continuous distributions use a PDF to represent how probability is spread across intervals, and a CDF to represent accumulated probability up to a point.

**Key Methods / Concepts:**

* PDF: $$P(a \le X \le b) = \int_a^b f(x)\,dx$$

* CDF: $$F(x) = P(X \le x)$$

* `uniform_pdf(x)` → uniform density on $[0,1)$
* `uniform_cdf(x)` → probability up to $x$ for uniform distribution

# The Normal Distribution (PDF, CDF, Standardization, Inverse CDF)

**Why Normal Distribution?**

The normal distribution is the classic **bell-shaped** curve used throughout statistics and ML. It is fully defined by two parameters:

* mean $\mu$ (center)
* standard deviation $\sigma$ (spread)

### Key Points:

**1. Normal Distribution Parameters**

* $\mu$ controls where the curve is centered
* $\sigma$ controls how wide or narrow the curve is
  * larger $\sigma$ → wider and flatter
  * smaller $\sigma$ → narrower and taller

**2. Probability Density Function (PDF)**

Formula:

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Python implementation:
```python
import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma))
```

**3. Plotting Normal PDFs (Different Shapes)**
```python
import matplotlib.pyplot as plt

xs = [x / 10.0 for x in range(-50, 50)]

plt.plot(xs, [normal_pdf(x, sigma=1) for x in xs], '-',  label='mu=0,sigma=1')
plt.plot(xs, [normal_pdf(x, sigma=2) for x in xs], '--', label='mu=0,sigma=2')
plt.plot(xs, [normal_pdf(x, sigma=0.5) for x in xs], ':', label='mu=0,sigma=0.5')
plt.plot(xs, [normal_pdf(x, mu=-1) for x in xs], '-.', label='mu=-1,sigma=1')

plt.legend()
plt.title("Various Normal pdfs")
plt.show()
```

**4. Standard Normal + Standardization**

* Standard normal: $\mu = 0$, $\sigma = 1$

If $Z \sim \mathcal{N}(0,1)$, then:

$$X = \sigma Z + \mu$$

gives $X \sim \mathcal{N}(\mu, \sigma)$

Conversely, if $X \sim \mathcal{N}(\mu,\sigma)$, then:

$$Z = \frac{X - \mu}{\sigma}$$

turns it into a standard normal variable.
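
A sampling experiment illustrates both directions. The sketch assumes the standard library's `random.gauss` as the source of standard normal draws (this sampler is not used in the original notebook), and the values of `mu` and `sigma` are arbitrary.

```python
import random
import statistics

random.seed(0)
mu, sigma = 10.0, 3.0  # arbitrary illustrative parameters

z = [random.gauss(0, 1) for _ in range(50_000)]  # standard normal draws
x = [sigma * zi + mu for zi in z]                # rescaled: mean mu, sd sigma
z_back = [(xi - mu) / sigma for xi in x]         # standardized again

print(round(statistics.fmean(x), 2), round(statistics.stdev(x), 2))
print(round(statistics.fmean(z_back), 2), round(statistics.stdev(z_back), 2))
```

The first line should print values near `10` and `3`, the second values near `0` and `1`.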

---

**5. Cumulative Distribution Function (CDF)**

The normal CDF cannot be written in terms of elementary functions, but it can be computed using the error function `erf` (available in Python as `math.erf`).
```python
def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2
```

**6. Plotting Normal CDFs**
```python
xs = [x / 10.0 for x in range(-50, 50)]

plt.plot(xs, [normal_cdf(x, sigma=1) for x in xs], '-',  label='mu=0,sigma=1')
plt.plot(xs, [normal_cdf(x, sigma=2) for x in xs], '--', label='mu=0,sigma=2')
plt.plot(xs, [normal_cdf(x, sigma=0.5) for x in xs], ':', label='mu=0,sigma=0.5')
plt.plot(xs, [normal_cdf(x, mu=-1) for x in xs], '-.', label='mu=-1,sigma=1')

plt.legend(loc=4)
plt.title("Various Normal cdfs")
plt.show()
```

---

**7. Inverse Normal CDF (Quantiles) via Binary Search**

Sometimes we need the value $x$ such that:

$$P(X \le x) = p$$

Since there's no simple inverse formula, we approximate it using binary search:
```python
def inverse_normal_cdf(p: float,
                       mu: float = 0,
                       sigma: float = 1,
                       tolerance: float = 1e-5) -> float:
    """Find approximate inverse using binary search"""

    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)

    low_z = -10.0
    hi_z = 10.0

    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        mid_p = normal_cdf(mid_z)

        if mid_p < p:
            low_z = mid_z
        else:
            hi_z = mid_z

    return mid_z
```

* `normal_cdf` is strictly increasing, so binary search works reliably
* It repeatedly halves the search interval until the interval is narrower than `tolerance`, so the result is accurate to about that tolerance
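
The binary search can be cross-checked against the standard library, which has shipped `statistics.NormalDist.inv_cdf` since Python 3.8. The comparison is an addition for illustration and is not part of the original notebook; the two functions are repeated in compact form so the block runs on its own.

```python
import math
from statistics import NormalDist

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def inverse_normal_cdf(p: float, mu: float = 0, sigma: float = 1,
                       tolerance: float = 1e-5) -> float:
    """Binary search for z with normal_cdf(z) close to p, then rescale."""
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, hi_z = -10.0, 10.0
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z   # true quantile lies above the midpoint
        else:
            hi_z = mid_z    # true quantile lies at or below the midpoint
    return mid_z

for p in (0.025, 0.9, 0.975):
    ours = inverse_normal_cdf(p)
    reference = NormalDist().inv_cdf(p)
    print(p, round(ours, 3), round(reference, 3))
```

For `p = 0.975` both columns should show the familiar value of about `1.96`.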

---

**Takeaway:** The normal distribution is controlled by $\mu$ and $\sigma$, can be standardized to/from the standard normal, and while its CDF has no elementary formula, it can be computed using `erf` and inverted numerically with binary search.

**Key Methods / Concepts:**

* Normal PDF: $$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

* Standardization: $$Z = \frac{X-\mu}{\sigma}$$

* `normal_pdf(x, mu, sigma)`
* `normal_cdf(x, mu, sigma)`
* `inverse_normal_cdf(p, mu, sigma)`

# The Central Limit Theorem (CLT)

**Why Does the CLT Matter?**

The Central Limit Theorem explains why the **normal distribution** appears everywhere: averages (or sums) of many independent random variables tend to look normal, even if the original variables are not.

### Key Points:

**1. Core Idea of CLT**

If $x_1, x_2, \dots, x_n$ are **independent and identically distributed (i.i.d.)** with mean $\mu$ and standard deviation $\sigma$, then when $n$ is large:

The average:

$$\frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right)$$

is approximately normal with:

* mean $\mu$
* standard deviation $\frac{\sigma}{\sqrt{n}}$

So:

$$\frac{1}{n}\sum_{i=1}^{n} x_i \approx \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

**2. Standardized (Often More Useful) Form**

Equivalently, the standardized sum:

$$\frac{(x_1 + \cdots + x_n) - n\mu}{\sigma\sqrt{n}}$$

is approximately:

$$\mathcal{N}(0,1)$$
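
A small simulation makes the statement tangible. The sketch below averages uniform random numbers, which are clearly not normal, and compares the averages with what the CLT predicts. The choices `n = 50` and `num_averages = 5_000` are arbitrary.

```python
import math
import random
import statistics

random.seed(0)

n = 50                # variables per average
num_averages = 5_000  # number of averages to draw

# Uniform(0, 1) has mean 1/2 and standard deviation sqrt(1/12).
mu, sigma = 0.5, math.sqrt(1 / 12)
predicted_sd = sigma / math.sqrt(n)

averages = [sum(random.random() for _ in range(n)) / n
            for _ in range(num_averages)]

print("mean of averages:", round(statistics.fmean(averages), 4), "predicted:", mu)
print("sd of averages:  ", round(statistics.stdev(averages), 4),
      "predicted:", round(predicted_sd, 4))

# For a normal distribution, about 95% of values lie within 1.96 sd of the mean.
within = sum(abs(a - mu) <= 1.96 * predicted_sd for a in averages)
print("fraction within 1.96 sd:", within / num_averages)
```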

---

## Summary: Binomial as a CLT Example

**Why Binomial?**

A binomial random variable is the sum of $n$ independent Bernoulli trials, making it a natural CLT demonstration.

### Key Points:

**1. Bernoulli Trial**

A Bernoulli($p$) random variable:

* equals $1$ with probability $p$
* equals $0$ with probability $1-p$
```python
def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1-p"""
    return 1 if random.random() < p else 0
```

**2. Binomial Random Variable**

Binomial($n,p$) = sum of $n$ Bernoulli($p$) trials:
```python
def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials"""
    return sum(bernoulli_trial(p) for _ in range(n))
```

**3. Mean and Standard Deviation**

For Bernoulli($p$):

* mean $= p$
* standard deviation: $$\sqrt{p(1-p)}$$

For Binomial($n,p$):

* mean: $$\mu = np$$

* standard deviation: $$\sigma = \sqrt{np(1-p)}$$
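
These formulas can be verified exactly from the binomial probability mass function $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$. The sketch below does so for one arbitrary choice of `n` and `p`; `binomial_pmf` is an illustrative helper that does not appear in the original notebook.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 40, 0.3  # arbitrary illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(mean, "vs", n * p)
print(math.sqrt(variance), "vs", math.sqrt(n * p * (1 - p)))
```

Each printed pair should agree up to floating-point rounding.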

**4. Normal Approximation to Binomial**

CLT implies that for large $n$:

$$\text{Binomial}(n,p) \approx \mathcal{N}\left(np,\sqrt{np(1-p)}\right)$$

---

## Summary: Visualization (Binomial vs Normal Approximation)

**Why Plot It?**

Plotting shows how closely the binomial histogram matches the normal curve when $n$ is large.
```python
from collections import Counter

def binomial_histogram(p: float, n: int, num_points: int) -> None:
    """Picks points from a Binomial(n, p) and plots their histogram"""
    data = [binomial(n, p) for _ in range(num_points)]

    histogram = Counter(data)
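    # plt.bar centers bars by default in current matplotlib; align='edge'
    # makes each bar span x - 0.4 to x + 0.4, so it is centered on x
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75',
            align='edge')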

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
          for i in xs]

    plt.plot(xs, ys)
    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()
```

---

## Summary: Practical Use (Approximating Probabilities)

**Why Is This Useful?**

Computing binomial probabilities directly can be hard, but normal probabilities are easier to work with.

Example:

* Probability of getting more than 60 heads in 100 fair coin flips:

Instead of:

* Binomial($100, 0.5$)

Approximate with:

$$\mathcal{N}(50, 5)$$

since:

* mean $= 100 \cdot 0.5 = 50$
* standard deviation: $$\sqrt{100 \cdot 0.5 \cdot 0.5} = 5$$
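
For this particular example the exact binomial answer is still cheap to compute, so the quality of the approximation can be checked directly. The continuity-corrected variant is an addition for comparison (it mirrors the `i + 0.5` trick used in `binomial_histogram`); `normal_cdf` is repeated so the block runs on its own.

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

n, p = 100, 0.5
mu = n * p                           # 50
sigma = math.sqrt(n * p * (1 - p))   # 5

# Exact: add up the binomial probabilities of 61, 62, ..., 100 heads
exact = sum(math.comb(n, k) for k in range(61, n + 1)) / 2 ** n

# Normal approximation: P(X > 60) with X ~ N(50, 5)
approx = 1 - normal_cdf(60, mu, sigma)

# Same approximation with a continuity correction: "more than 60" -> "above 60.5"
corrected = 1 - normal_cdf(60.5, mu, sigma)

print(f"exact:     {exact:.4f}")
print(f"normal:    {approx:.4f}")
print(f"corrected: {corrected:.4f}")
```

The plain approximation lands near 0.023, while the exact value is near 0.018; the continuity correction closes most of that gap.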

---

**Takeaway:** The CLT explains why sums and averages often look normal. It lets you approximate complex distributions (like binomial) with the normal distribution, making probability calculations much easier.

**Key Concepts:**

* Average of i.i.d. variables becomes normal: $$\frac{1}{n}\sum_{i=1}^{n} x_i \approx \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

* Binomial normal approximation: $$\text{Binomial}(n,p) \approx \mathcal{N}\left(np,\sqrt{np(1-p)}\right)$$