# Random Variables and Discrete Distributions

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.special import comb
import pandas as pd
from ipywidgets import interact, FloatSlider, IntSlider
from IPython.display import HTML, display, IFrame
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

In [None]:
import sys
from pathlib import Path

# Add the "resources" directory to the path
project_root = Path().resolve().parent
resources_path = project_root / 'resources'
sys.path.insert(0, str(resources_path))

In [None]:
from random_variable import(generate_click_data, visualize_rv_concept, coffee_example, coffee_example_mean_var, 
                            demo_markov_ineq, demo_markov_ineq_2, demo_schebyshev_ineq, demo_chebyshev_ineq_2, 
                            demo_pdf_cdf_discrete, demo_cdf_interval_discrete, comparison_discrete_rv, mystery_prob)

## Learning Objectives

- Understand random variables as functions from sample spaces to numbers
- Master discrete probability distributions (Bernoulli, Binomial, Poisson)
- Calculate expectations and variances
- Apply distributions to ML scenarios

<div class="alert alert-info">
<h4>🎯 The Distribution Detective Problem</h4>

You're analyzing user engagement data for a social media platform. You notice:
- Average of 2 clicks per day per user
- About 41% of users click 0-1 times
- Clicks per day range from 0 to 10+
- The distribution has a long right tail

Your boss asks: "Model this data. Which distribution should we use?"

Choices:
<ul>
<li>(a) Binomial distribution (n=50, p=0.05)</li>
<li>(b) Poisson distribution (λ=2)</li>
<li>(c) Normal distribution (μ=2, σ=5)</li>
</ul>

Most beginners pick Normal because "*it's always normal, right?*"

Wrong. By the end of today, you'll understand WHY distribution choice matters.
</div>

In [None]:
# visualise demo data
generate_click_data()

## What is a Random Variable?

The raw sample space $\Omega$ can be awkward to work with. For example:
- Coin flip: $\Omega = \{heads,tails\}$  - how do you calculate an average of "heads"?
- Card draw: $\Omega = \{A♠, K♠, ..., 2♣\}$ - what's the expected value of a "queen"?

By defining **a random variable**, you create a bridge from the abstract to the numerical. But a random variable is NOT like algebraic variables you know!

Think of it as a FUNCTION that maps outcomes to numbers:
- Sample space $\Omega$ (abstract outcomes) $\rightarrow$ Real numbers $\mathbb{R}$ (quantifiable values)

<center>
<img src="img/rv.png" alt="Random variable as a function" width="800px">
</center>

Examples: 

1. Coin toss
- Sample space: $\Omega = \{Heads, Tails\}$
- Random variable $X: Heads \rightarrow 1, Tails \rightarrow 0$
- Now we can compute probabilities, means, and variances: for a fair coin, $E[X] = 0.5$ and $Var(X) = 0.25$

2. Card draw:
- $\Omega = \{A♠, K♠, ..., 2♣\}$
- Random variable: $X(card) = \text{value of card}$
- Now you can analyze expected winnings

3. In ML: "Label for this image" is a random variable!
- Sample space: All possible images
- Random variable $Y: Image \rightarrow \{0=cat, 1=dog\}$
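
The coin-toss example can be sketched in a few lines of code. This is only an illustration, assuming a fair coin; the names `prob` and `X` are chosen here for the sketch and are not part of the course's `random_variable` module. It shows that a random variable is an ordinary function on the sample space, and that expectation and variance are sums over outcomes.

```python
from fractions import Fraction

# Sample space of a fair coin toss, with the probability of each outcome
# (fairness is an assumption of this sketch).
prob = {"Heads": Fraction(1, 2), "Tails": Fraction(1, 2)}

# The random variable X is just a function from outcomes to numbers.
def X(outcome):
    return 1 if outcome == "Heads" else 0

# Expectation and variance computed by summing over the sample space.
mean = sum(X(w) * p for w, p in prob.items())
var = sum((X(w) - mean) ** 2 * p for w, p in prob.items())

print(mean, var)  # 1/2 1/4
```

Exact fractions are used so that the results match the hand-computed values $0.5$ and $0.25$ without rounding noise.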

In [None]:
# coin flipping visualisation
visualize_rv_concept()

<div class="alert alert-success">
<h4>Definition: Random Variable</h4>

Let $\mathcal{A}$ be a set of events associated with the sample space $\Omega$, called **sigma-algebra** or **$\sigma$-algebra**, satisfying the following properties:

* $\Omega \in \mathcal{A}$
* $A \in \mathcal {A} \Rightarrow A^c \in \mathcal{A}$
* $\forall i \in \mathbb{N},  A_i \in \mathcal{A} \Rightarrow \cup_{i\in \mathbb{N}} A_i \in \mathcal{A}$

For finite or countably infinite $\Omega$, we often consider $\mathcal{A} = \mathcal{P}(\Omega)$.

For $\Omega = \mathbb{R}$: $\mathcal{A} = \mathcal{B}(\mathbb{R})$ where $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra, i.e., the smallest $\sigma$-algebra of subsets of $\mathbb{R}$ that contains all intervals of $\mathbb{R}$.

Let $(\Omega, \mathcal{A})$ be an event space of the probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and $(E, \mathcal{E})$ a measurable space. A **random variable, r.v.** $X$ from $\Omega$ to $E$ is a measurable function:
$$X : \Omega \rightarrow E$$

such that:

$$\forall A' \in \mathcal{E}, X^{-1}(A') \in \mathcal{A}$$

where $X^{-1}(A')$ is the preimage (inverse image) $X^{-1}(A') = \{\omega\in \Omega | X(\omega)\in A'\}\in \mathcal{A}$.

Often, we will consider the case where $E \subset \mathbb{R}$. A **real-valued random variable** (or **real random variable**) on the event space $(\Omega, \mathcal{A})$ is a measurable function $X : \Omega \rightarrow \mathbb{R}$, such that:
$$\forall x\in \mathbb{R}, X^{-1}(]-\infty, x])\in \mathcal{A}$$

**Remark:** By convention, capital letters are used for the notation of random variables.

</div>

## Types of Random Variables

> What are the values of a real random variable $X$?

Depending on the values a random variable can take, we distinguish different types of random variables.

|| DISCRETE r.v. | CONTINUOUS r.v. |
|----|-----------|---------|
|**Type of values**| Countable values (finite or countably infinite)|Uncountably infinite values (intervals)|
|**Examples**|Number of clicks, defective items, network packets, number of TikTok posts published in the next hour|Weight, temperature, time duration, (exact) height cleared by the athlete winning the pole vault at the next Olympic Games, level of global warming reached around 2030|
|**Use in ML**|Classification labels, token counts, layer depth|Neural network weights, loss values, confidence scores|

</br>

<div class="alert alert-success">
<h4>Definition: Discrete r.v.</h4>

A real r.v. $X$ is called **discrete** if it takes only a finite or countably infinite number of values in $\mathbb{R}$, i.e. $X(\Omega) = \left\{x_j\in \mathbb{R}, j\in J\right\}$ with $J \subset \mathbb{N}$.

</div>

<div class="alert alert-success">
<h4>Definition: Continuous r.v.</h4>

Informally, a real r.v. $X$ is called **continuous** if its image is uncountably infinite (often an interval).

More precisely, the notion used in practice is *absolute continuity*: the law of $X$ admits a density with respect to the Lebesgue measure.

</div>

In this session, we focus on DISCRETE r.v.


## Probability Distribution, Probability Mass Function (PMF) and Cumulative Distribution Function

<div class="alert alert-success">
<h4>Definition: Probability Distribution</h4>

A space $(\Omega, \mathcal{A})$ is equipped with a probability measure $\mathbb{P}$ in the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Thanks to the measurability of the random variable $X$, it is possible to define the probability $\mathbb{P}_X: \mathcal{E} \rightarrow [0,1]$ (also called **the probability law of the r.v. $X$** or just **law of the r.v. $X$** or **probability distribution**) on the measurable space $(E, \mathcal{E})$:
$$\forall A'\in \mathcal{E}, \ \ \mathbb{P}_X(A') = \mathbb{P}(X^{-1}(A')) = \mathbb{P}(X\in A')$$

Thus, a probability law is a function that describes the probability of occurrence of possible outcomes of the random experiment.

</div>

In the general case, we can speak of the distribution function that characterizes the probability law.


<a id="cdf"></a>
<div class="alert alert-success">
<h4>Definition: Cumulative Distribution Function (CDF)</h4>

Let $X$ be a real r.v. on the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We call **the cumulative distribution function of a real r.v. $X$** (or **CDF**) the function $F_X$ which associates with every $x\in \mathbb{R}$ the probability of obtaining a value less than or equal to $x$, i.e.:
$$F_X: \ \left.
    \begin{array}{ll}
        \mathbb{R}\rightarrow [0,1]  \\
        x \rightarrow \mathbb{P}(X^{-1}(]-\infty,x]))
    \end{array}
\right.
$$
In other words, $F_X(x) = \mathbb{P}(X\leq x)$, i.e. it answers a simple question *What's the probability that my random variable is at most this value?* or *What fraction of outcomes fall to the left of $x$?*

*Intuition*: The CDF can be seen as a *running total* or *accumulator of probability* as you sweep from left to right along the number line. 

For discrete random variables, the CDF is a STEP FUNCTION that jumps by $p(x_i)$ at each possible value $x_i$.

<h5>Basic Properties of the CDF</h5>

* $F_X$ is non-decreasing, i.e. $\forall (a,b)\in \mathbb{R}^2,\ a \leq b \Rightarrow F_X(a) \leq F_X(b)$
* $F_X$ is right-continuous
* $\lim\limits_{x\rightarrow-\infty} F_X(x) = 0$ and $\lim\limits_{x\rightarrow+\infty} F_X(x) = 1$

Note that $F_X$ is a bounded monotonic function:
$$\forall x\in \mathbb{R}, \ 0\leq F_X(x)\leq 1$$

The distribution function allows us to calculate the probability of a real r.v. $X$ being included in a left half-open interval $]a,b]$ where $a < b$ as follows:
$$\mathbb{P}(X \in ]a,b]) = \mathbb{P}(a < X \leq b) = F_X(b)-F_X(a)$$

</div>

Suppose we are interested in the number of cups of coffee a student drinks before the lunch break.

In the table, we present all values with non-zero probability. In our case, these are: $1$, $2$, $3$, $4$, $5$. Suppose also that we know the probability of each of these values. Let's write them in the same table.

|$x$| $1$ | $2$ | $3$ | $4$ | $5$ |
|--:|:--:|:--:|:--:|:--:|:--:|
|$P(X=x)$| $0.4$ | $0.25$ | $0.2$ | $0.1$| $0.05$|


<div class="alert alert-success">
<h4>Definition: Probability Mass Function (PMF)</h4>

Let $X$ be a discrete real r.v., $X(\Omega) = \left\{x_j\in \mathbb{R}, j\in J\right\}$ with $J \subset \mathbb{N}$. **The probability mass function of the real r.v. $X$** (or **pmf**) is a function $p$ such that:
$$p \ \left.
    \begin{array}{ll}
        J\rightarrow [0,1]  \\
        j \rightarrow p_j
    \end{array}
\right.$$
where $p_j = \mathbb{P}(X = x_j)$, considering that $\forall j \in \mathbb{N}\setminus J, \ p_j = 0$.

The mass function $p$ has the following properties:

* $\forall j\in J, \ p_j \geq 0,\ p_j \in [0,1]$
* $\sum\limits_{j \in \mathbb{N}} p_j = 1$

*Intuition*: "How much probability MASS sits at each point?"

</div>

Let's check if the conditions of PMF are verified for our example:

1. $\forall j\in J, \ p_j \geq 0,\ p_j \in [0,1]$: $\left\{\begin{aligned}0.4 \geq 0 \\0.25\geq 0 \\0.2 \geq 0 \\ 0.1 \geq 0 \\ 0.05 \geq 0\end{aligned}\right.\ \checkmark$
2. $\sum\limits_{j \in \mathbb{N}} p_j = 1$: $0.4 + 0.25 + 0.2 + 0.1 + 0.05 = 1\ \checkmark$ 
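
The same two checks can be automated. The helper below is a minimal sketch (the name `is_valid_pmf` and the tolerance `tol` are assumptions made for this illustration); it tests that every entry lies in $[0, 1]$ and that the entries sum to 1 up to floating-point rounding.

```python
import numpy as np

def is_valid_pmf(probs, tol=1e-9):
    """Return True if all entries are in [0, 1] and they sum to 1 (up to tol)."""
    probs = np.asarray(probs, dtype=float)
    in_range = np.all((probs >= 0) & (probs <= 1))
    sums_to_one = abs(probs.sum() - 1.0) <= tol
    return bool(in_range and sums_to_one)

coffee_pmf = [0.4, 0.25, 0.2, 0.1, 0.05]
print(is_valid_pmf(coffee_pmf))        # True
print(is_valid_pmf([0.5, 0.6, -0.1]))  # False: one entry is negative
```

A tolerance is used instead of `== 1` because sums of decimal fractions are generally not exact in binary floating point.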

Now, let's calculate the CDF.

|$x$| $1$ | $2$ | $3$ | $4$ | $5$ |
|--:|:--:|:--:|:--:|:--:|:--:|
|$P(X=x)$| $0.4$ | $0.25$ | $0.2$ | $0.1$| $0.05$|
|$P(X \leq x)$ | $0.4$ | $0.4 + 0.25 = 0.65$ | $0.4 + 0.25 + 0.2 = 0.85$ | $0.4 + 0.25 + 0.2 + 0.1 = 0.95$ | $0.4 + 0.25 + 0.2 + 0.1 + 0.05 = 1$|
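
The table gives the CDF only at the support points, but $F_X$ is defined for every real $x$. The sketch below (the function name `cdf` is an assumption of this illustration) evaluates the step function anywhere on the real line by counting how many support points are at most $x$.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.4, 0.25, 0.2, 0.1, 0.05])
cdf_at_values = np.cumsum(pmf)  # running total of the PMF

def cdf(x):
    """F(x) = P(X <= x) for any real x (right-continuous step function)."""
    # Number of support points that are <= x.
    k = np.searchsorted(values, x, side="right")
    return 0.0 if k == 0 else float(cdf_at_values[k - 1])

for x in [0.5, 1, 2.7, 5, 10]:
    print(x, round(cdf(x), 4))
```

Using `side="right"` makes the jump at $x_i$ count towards $F_X(x_i)$, which is exactly the right-continuity property of the CDF.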

In [None]:
# IFrame(src='img/cdf_animation_html.html', width=700, height=400)
HTML(filename='img/cdf_animation_html.html')

In [None]:
# example 
coffees = np.array([1, 2, 3, 4, 5])
# probabilities 
probs = np.array([0.4, 0.25, 0.2, 0.1, 0.05])

In [None]:
# PMF and CDF of coffee example
coffee_example()

Let's consider the following example:

We are interested in the number of successful API calls out of 5 attempts. 

| $x$ | 0 | 1 | 2 | 3 | 4 | 5 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| $P(X = x)$ | 0.0102 |  0.0768 | 0.2304 | 0.3456 | 0.2592 | 0.0778 |

Calculate CDF, and visualise PMF and CDF.

In [None]:
x = np.array([0, 1, 2, 3, 4, 5])
# PMF
prob_x = np.array([0.0102, 0.0768, 0.2304, 0.3456, 0.2592, 0.0778])
# CDF
cdf_x = np.cumsum(prob_x)
print(f"CDF: {cdf_x}")
# visualise PMF
plt.stem(x, prob_x, linefmt='b-', markerfmt='bo', basefmt=' ', label='PMF p(x)')
plt.title("PMF: Probability Mass Function")

In [None]:
# visualise CDF
plt.step(x, cdf_x, where='post', linewidth=2.5, color='red', label='CDF')
plt.title("CDF: Cumulative Distribution Function")

Let's revisit the example above: the number of successful API calls out of 5 attempts. 

|$x$| 0| 1|2|3|4|5|
|--|:--:|:--:|:--:|:--:|:--:|:--:|
|$P(X=x)$|0.0102|0.0768|0.2304|0.3456|0.2592|0.0778|

Let's calculate the CDF.

In [None]:
# Example: Number of successful API calls out of 5 attempts
f1, f2 = demo_pdf_cdf_discrete()
# construction of CDF step by step
f1

In [None]:
# PMF and CDF of Binomial Distribution
f2

Now, let's calculate the probability of an interval, $\mathbb{P}(1 < X \leq 4)$. As mentioned above (see Section [CDF](#cdf)), $\mathbb{P}(a < X \leq b) = F_X(b)-F_X(a)$; here $a = 1$ and $b = 4$.

1. Option 1: as we are interested in $1 < X \leq 4$, the set of favourable values of $X$ is $\{2, 3, 4\}$, so we can apply direct summation:

$\mathbb{P}(1 < X \leq 4) = \mathbb{P}(X = 2) + \mathbb{P}(X = 3) + \mathbb{P}(X = 4) = 0.2304 + 0.3456 + 0.2592 = 0.8352$

2. Option 2: Using CDF formula

$$F(4) = \mathbb{P}(X \leq 4) = 0.9222$$

$$F(1) = \mathbb{P}(X \leq 1) = 0.0870$$

$$\mathbb{P}(1 < X \leq 4) = F_X(4)-F_X(1) = 0.9222 - 0.0870 = 0.8352$$
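
Both options can be checked numerically. This is a small sketch using the table of the API-call example; the variable names are chosen for this illustration only.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5])
pmf = np.array([0.0102, 0.0768, 0.2304, 0.3456, 0.2592, 0.0778])
cdf = np.cumsum(pmf)

a, b = 1, 4
# Option 1: sum the PMF over the values with a < x <= b.
direct = pmf[(x > a) & (x <= b)].sum()
# Option 2: difference of CDF values, F(b) - F(a).
via_cdf = cdf[x == b][0] - cdf[x == a][0]

print(f"direct = {direct:.4f}, via CDF = {via_cdf:.4f}")
```

The strict inequality on the left matters for discrete variables: including $x = 1$ would add $\mathbb{P}(X = 1)$ to the result.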


In [None]:
# CDF on the interval
demo_cdf_interval_discrete()

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>
<p>PMF models discrete predictions:</p>

<ul>
<li>Softmax output: p(class=k) for k ∈ {0,1,...,9} in digit classification</li>
<li>Token probabilities in language models</li>
<li>Batch accuracy: P(correct predictions = k) in a batch of size n</li>
</ul>

</div>

## Numerical Indicators. Expectation (Mean) and Variance

<div class="alert alert-success">
<h4>Definition: Expected Value</h4>

**The expectation** (or **expected value** or **mean** or **average** or **first moment**) of a real r.v. $X$, denoted $\mathbb{E}X$ or $\mathbb{E}[X]$ or $\mathbb{E}(X)$ is a generalization of the weighted average value. It can also be seen as the "center of mass" of the distribution.

Its calculation (when this quantity exists) depends on the nature of $X$:

* discrete real r.v.:
$\mathbb{E}[X] = \sum_i x_ip_i = \sum_i x_iP(X=x_i)$

Note that these definitions can be generalized for the expectation of a real r.v. $g(X)$, where $g : X(\Omega) \rightarrow \mathbb{R}$:

* discrete case: $\mathbb{E}[g(X)] = \sum_i g(x_i)p_i = \sum_i g(x_i)P(X=x_i)$


<h5>Properties:</h5>

* if $X \geq 0$, then $\mathbb{E}[X] \geq 0$
* $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
* $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$

Note that these results can be generalized for the expectation of a real r.v. $g(X)$, where $g : X(\Omega) \rightarrow \mathbb{R}$:

* if $g \geq 0$, then $\mathbb{E}[g(X)] \geq 0$
* $\mathbb{E}[g_1(X) + g_2(X)] = \mathbb{E}[g_1(X)] + \mathbb{E}[g_2(X)]$
* $\mathbb{E}[ag(X) + b] = a\mathbb{E}[g(X)] + b$
* let $X$ be a real r.v., $g_1$ and $g_2$ two functions such that $g_1 \leq g_2$, then $\mathbb{E}[g_1(X)] \leq \mathbb{E}[g_2(X)]$
* if $X$ is a constant real r.v. on $\Omega$, i.e. $X = c$, and $g$ any function, then $\mathbb{E}[g(X)] = g(c)$

</div>
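
The formula for $\mathbb{E}[g(X)]$ and the linearity property can be checked on the coffee distribution introduced earlier. This is a minimal sketch; the helper `expectation` and the constants `a`, `b` are assumptions of the illustration.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.4, 0.25, 0.2, 0.1, 0.05])

def expectation(g=lambda x: x):
    """E[g(X)] = sum_i g(x_i) * p_i for the discrete distribution above."""
    return float(np.sum(g(values) * pmf))

a, b = 3.0, 2.0
e_x = expectation()                          # E[X]
e_x2 = expectation(lambda x: x ** 2)         # E[X^2], with g(x) = x^2
e_affine = expectation(lambda x: a * x + b)  # E[aX + b]

# Linearity: the last two printed numbers should coincide.
print(e_x, e_x2, e_affine, a * e_x + b)
```

Note that $\mathbb{E}[X^2]$ is computed by applying $g$ to the values and keeping the same probabilities; in general it differs from $(\mathbb{E}[X])^2$.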

Let's calculate the expected value for our example:

|$x$| $1$ | $2$ | $3$ | $4$ | $5$ |
|--:|:--:|:--:|:--:|:--:|:--:|
|$P(X=x)$| $0.4$ | $0.25$ | $0.2$ | $0.1$| $0.05$|
|$P(X \leq x)$ | $0.4$ | $0.4 + 0.25 = 0.65$ | $0.4 + 0.25 + 0.2 = 0.85$ | $0.4 + 0.25 + 0.2 + 0.1 = 0.95$ | $0.4 + 0.25 + 0.2 + 0.1 + 0.05 = 1$|
|$xP(X=x)$|$1\times 0.4 = 0.4$| $2\times 0.25 = 0.5$ | $3\times 0.2 = 0.6$ | $4\times 0.1 = 0.4$ | $5\times 0.05 = 0.25$|

Thus, 
$$\mathbb{E}[X] = \sum_i x_ip_i = \sum_i x_iP(X=x_i) = 0.4 + 0.5 + 0.6 + 0.4 + 0.25 = 2.15$$

In [None]:
# example
expected_val = sum(coffees * probs)
print(f"Expected value E(X) = {expected_val}")

<div class="alert .alert-exercise">
<h4>Calculated Example</h4>

Each time you get *heads* when tossing a fair coin, you win 5 euros, and each time you get *tails*, you lose 5 euros. What is the average gain, in other words, the expected gain?

</div>

<details>
<summary>Reveal Solution</summary>

$\mathbb{E}[X] = \sum_i x_ip_i = \sum_i x_iP(X=x_i) = 5\times \frac{1}{2} + (-5)\times \frac{1}{2} = \mathbf{0}$

On average, we gain nothing.

</details>

<div class="alert alert-success">
<h4>Definition: Markov's Inequality</h4>

Let $X$ be a real r.v. Then:
$\forall a> 0, a\in \mathbb{R} :  \ \  \mathbb{P}(|X|\geq a) \leq \frac{\mathbb{E}[|X|]}{a} $

**Remark:** this is a worst-case bound; the actual probability is usually much smaller.

</div>

This inequality can be illustrated with the following *budget constraint* analogy: 

You know the average wealth in a country is \$50,000 per person. 

> What's the maximum fraction of people who could be millionaires?

**Answer**: at most 5%.

Assuming nobody has negative wealth, if more than 5% were millionaires, the average would have to exceed \$50,000: 
- If 5% have \$1,000,000 and 95% have \$0: $\text{Average} = 0.05 \times 1,000,000 + 0.95 \times 0 = 50,000\ \checkmark$
- If 6% have \$1,000,000 and 94% have \$0: $\text{Average} = 0.06 \times 1,000,000 + 0.94 \times 0 = 60,000$ ✗ (contradicts our known average)

This problem can be formulated as Markov's inequality as follows:

$$\mathbb{P}(wealth\geq 1,000,000) \leq \frac{50,000}{1,000,000} = 0.05$$
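
The budget argument can be reproduced numerically. The sketch below assumes the extreme two-point population used in the text (a fraction `q` owns 1,000,000 and everyone else owns 0); the function names are hypothetical and chosen for this illustration.

```python
def markov_bound(mean, a):
    """Upper bound on P(X >= a) for a non-negative X with the given mean."""
    return min(1.0, mean / a)

def mean_wealth(q, rich=1_000_000):
    """Average wealth when a fraction q owns `rich` and the rest own 0."""
    return q * rich

bound = markov_bound(50_000, 1_000_000)
print(f"Markov bound on the fraction of millionaires: {bound:.2f}")
print(f"Average with  5% millionaires: {mean_wealth(0.05):,.0f}")
print(f"Average with  6% millionaires: {mean_wealth(0.06):,.0f}")
```

The two-point population is the worst case: any other non-negative distribution with the same mean has at most this fraction of millionaires.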

In [None]:
# demo
demo_markov_ineq_2()

<div class="alert .alert-exercise">
<h4>Calculated Example</h4>

Average training time per epoch is 10 minutes. What's the probability an epoch takes ≥ 50 minutes?

</div>

<details>
<summary>Reveal Solution</summary>

Let $X = \text{time per epoch}$, $\mathbb{E}[X] = 10$, $\alpha = 50$.

Using Markov's inequality:

$$P(X \geq \alpha) = P(X \geq 50) \leq \frac{10}{50} = 0.2$$

**Answer:** at most 20%. If more than 20% of epochs took ≥50 min, the average would exceed 10 min.


</details>
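
To see how loose the bound can be, one can compare it with an exact tail probability under a specific model. The sketch below assumes, purely for illustration, that the epoch time is exponentially distributed with mean 10 minutes; Markov's inequality itself needs no such assumption.

```python
import math

mean_minutes = 10.0
threshold = 50.0

# Distribution-free bound from Markov's inequality.
markov = mean_minutes / threshold
# Exact tail P(X >= 50) if X were exponential with mean 10 (an assumption).
exact_exp = math.exp(-threshold / mean_minutes)

print(f"Markov bound: {markov:.4f}")
print(f"Exact tail under the exponential assumption: {exact_exp:.4f}")
```

Under this assumed model the true probability is far below 20%, which is what "worst-case bound" means in practice.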

In [None]:
# visualisation 
# training time is often modelled with an exponential distribution
demo_markov_ineq()

<div class="alert alert-success">
<h4>Definition: Variance</h4>

**The variance** of the real r.v. $X$, denoted $Var(X)$ or $\sigma^2$, is a measure of the dispersion of data around its expectation $\mathbb{E}X$.

Its calculation (when this quantity exists) depends on the nature of $X$:

* discrete real r.v.:
$Var(X) = \mathbb{E}[(X-\mathbb{E}X)^2] = \mathbb{E}[X^2]-(\mathbb{E}X)^2=\sum_i (x_i-\mathbb{E}X)^2 p_i$

* continuous real r.v.:
$Var(X) = \mathbb{E}[(X-\mathbb{E}X)^2] = \mathbb{E}[X^2]-(\mathbb{E}X)^2=\int_{a}^{b}(x-\mathbb{E}X)^2f(x)dx$

<h5>Properties:</h5>

* $Var(X) \geq 0$
* $Var(X + Y) = Var(X) + Var(Y)\text{, (if } X\text{ and } Y\text{ are independent)}$
* $Var(aX + b) = a^2Var(X)$

</div>
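
The two expressions for the variance and the scaling property $Var(aX + b) = a^2Var(X)$ can be verified on the coffee distribution. This is a sketch with hypothetical helper names (`mean`, `variance`) and arbitrary constants `a`, `b`.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.4, 0.25, 0.2, 0.1, 0.05])

def mean(v):
    """E[V] when V takes the values v with the probabilities pmf."""
    return float(np.sum(v * pmf))

def variance(v):
    """Var(V) = E[(V - E[V])^2]."""
    return float(np.sum((v - mean(v)) ** 2 * pmf))

var_def = variance(values)                            # E[(X - EX)^2]
var_shortcut = mean(values ** 2) - mean(values) ** 2  # E[X^2] - (EX)^2

a, b = 3.0, 7.0
var_affine = variance(a * values + b)                 # Var(aX + b)

print(var_def, var_shortcut)
print(var_affine, a ** 2 * var_def)
```

The shift `b` moves every value and the mean by the same amount, so it cancels in $x_i - \mathbb{E}X$; only the scale `a` survives, squared.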

<div class="alert alert-success">
<h4>Definition: Standard Deviation</h4>

**The standard deviation** (or **std**) of the real r.v. $X$, denoted $\sqrt{Var(X)}$ or $\sigma$, is a measure of the deviation between the values taken by $X$ and its expectation $\mathbb{E}X$

$\sigma(X)=\sigma_X = \sqrt{Var(X)}$

</div>

Let's calculate the variance and standard deviation for our example:

|$x$| $1$ | $2$ | $3$ | $4$ | $5$ |
|--:|:--:|:--:|:--:|:--:|:--:|
|$P(X=x)$| $0.4$ | $0.25$ | $0.2$ | $0.1$| $0.05$|
|$P(X \leq x)$ | $0.4$ | $0.4 + 0.25 = 0.65$ | $0.4 + 0.25 + 0.2 = 0.85$ | $0.4 + 0.25 + 0.2 + 0.1 = 0.95$ | $0.4 + 0.25 + 0.2 + 0.1 + 0.05 = 1$|
|$xP(X=x)$|$1\times 0.4 = 0.4$| $2\times 0.25 = 0.5$ | $3\times 0.2 = 0.6$ | $4\times 0.1 = 0.4$ | $5\times 0.05 = 0.25$|
|$x - \mathbb{E}[X]$ | $1 - 2.15 = -1.15$ | $2 - 2.15 = -0.15$ | $3 - 2.15 = 0.85$ | $4 - 2.15 = 1.85$ | $5 - 2.15 = 2.85$|
|$(x - \mathbb{E}[X])^2$| $(-1.15)^2 = 1.3225$ | $(-0.15)^2 = 0.0225$ | $0.85^2 = 0.7225$ | $1.85^2 = 3.4225$ | $2.85^2 = 8.1225$ |
|$(x - \mathbb{E}[X])^2P(X=x)$| $1.3225 \times 0.4 = 0.529$ | $0.0225 \times 0.25 = 0.005625$ | $0.7225 \times 0.2 = 0.1445$ | $3.4225 \times 0.1 = 0.34225$ | $8.1225\times 0.05 = 0.406125$ |

Hence:

$$Var(X) = \mathbb{E}[(X-\mathbb{E}X)^2] = \mathbb{E}[X^2]-(\mathbb{E}X)^2=\sum_i (x_i-\mathbb{E}X)^2 p_i = 0.529 + 0.005625 + 0.1445 + 0.34225 + 0.406125 = 1.4275$$

Now:

$$\sigma(X) = \sqrt{Var(X)} = \sqrt{1.4275} \approx 1.1948$$

In [None]:
# variance of our toy example
var = sum((coffees - expected_val)**2 * probs)
print(f"Variance Var(X) = {var:.4f}")
# standard deviation 
std = np.sqrt(var)
print(f"Standard deviation std(X) = {std:.4f}")

In [None]:
# visualisation
coffee_example_mean_var()

<div class="alert alert-success">
<h4>Definition: Chebyshev's Inequality</h4>

Let $X$ be a real r.v. Then:
$$\forall \alpha > 0, \alpha\in \mathbb{R} : \ \ \mathbb{P}(|X - \mathbb{E}X|\geq \alpha) \leq \frac{Var(X)}{\alpha^2}$$

Or equivalently (measuring the deviation in standard deviations, i.e. $\alpha = k\sigma$ with $k > 0$):

$$\mathbb{P}(|X - \mathbb{E}X|\geq k\sigma) \leq \frac{1}{k^2}$$

</div>

Chebyshev's inequality is a powerful refinement of Markov's inequality that uses information about variance to make much tighter statements about how data clusters around the mean.

It can be illustrated with the *"spread cannot lie" principle*: 
If variance is small, the data MUST be concentrated near the mean. Low variance means values can't stray far from the mean without violating the variance constraint.

<div class="alert .alert-exercise">
<h4>Calculated Example</h4>

Class average is 75, standard deviation is 5 points. What fraction of students could have scored ≥ 95 (20 points above mean)?

</div>

<details>
<summary>Reveal Solution</summary>

Let $X = \text{student's grade}$, $\mathbb{E}[X] = 75$, $\sigma(X) = 5$, $\alpha = 95 - 75 = 20$ (distance from the mean).

Using Chebyshev's inequality:

$$P(|X - 75| \geq 20) \leq \frac{5^2}{20^2} = \frac{25}{400} = 0.0625$$

**Answer:** at most 6.25% of students scored ≥95 or ≤55 (combined). If more students were at extremes, the variance would have to be larger than 25.


</details>
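
Chebyshev's bound can also be compared with exact probabilities when the distribution is known. The sketch below uses the coffee distribution from earlier in this notebook; the list `results` and the chosen values of `k` are assumptions of the illustration.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.4, 0.25, 0.2, 0.1, 0.05])

mu = np.sum(values * pmf)
sigma = np.sqrt(np.sum((values - mu) ** 2 * pmf))

results = []
for k in [1.5, 2.0, 3.0]:
    # Exact probability of landing at least k standard deviations from the mean.
    exact = float(pmf[np.abs(values - mu) >= k * sigma].sum())
    bound = 1.0 / k ** 2
    results.append((k, exact, bound))
    print(f"k={k}: exact={exact:.4f} <= bound={bound:.4f}")
```

As with Markov's inequality, the bound holds for every distribution with finite variance, so it is usually far from tight for any particular one.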

In [None]:
# visualisation
demo_schebyshev_ineq()

In [None]:
# demo Chebyshev's inequality
demo_chebyshev_ineq_2()

<div class="alert alert-success">
<h4>Definition: Moment of order p</h4>

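**A moment of order $p\ (p\in \mathbb{N})$** of the real r.v. $X$ is the real number $\mathbb{E}[X^p]$, when it exists (i.e. when $\mathbb{E}[|X|^p] < \infty$).

**A centered moment of order $p\ (p\in \mathbb{N})$** of the real r.v. $X$ is the real number $\mathbb{E}[(X - \mathbb{E}X)^p]$, when it exists. In particular, the expectation is the moment of order 1 and the variance is the centered moment of order 2.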


</div>

<div class="alert alert-warning">
<h4>💡 Key Insight: Expected Value and Variance</h4>

$E[X]$ tells you *'typical value'*, $Var(X)$ tells you *'how much variation to expect'*.

</div>

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

<ul>
<li>Expected loss: E[Loss] guides optimization</li>
<li>Variance of gradients: affects training stability (high var → noisy updates)</li>
<li>Weight initialization: control E[W]=0, Var(W)=σ² for stable forward pass</li>
</ul>

</div>

## Common Discrete Distributions

### Bernoulli Distribution, $Bernoulli(p)$

<div class="alert alert-example">
<h4>Bernoulli Distribution</h4>

$X \sim Bernoulli(p)$ or $X \sim Bern(p)$

**When to use:**

Models a single binary trial with two outcomes: success (1) or failure (0). Use when you have exactly one experiment with two possible outcomes, where the probability of success is $p$.

**Parameters & Domain:**
- Parameter: $p \in [0, 1]$ (probability of success)
- Domain (what values $X$ can take): $X \in \{0, 1\}$

| PMF | CDF | $E(X)$ | $Var(X)$|
|:---:|:---:|:----:|:---:|
|$$P(X = k) = \begin{cases} 1-p & \text{if } k=0 \\ p & \text{if } k=1 \\ 0 & \text{otherwise} \end{cases}$$</br>or $P(X=k) = p^k(1−p)^{1−k}$ for $k \in \{0,1\}$| $$F(x) = P(X \leq x) = \begin{cases} 0 & \text{if } x < 0 \\ 1-p & \text{if } 0 \leq x < 1 \\ 1 & \text{if } x \geq 1 \end{cases}$$| $$p$$ | $$p(1-p)$$ </br> Variance is maximized when $p = 0.5$ |

**Key Properties:**
- Mode: $\begin{cases} 1 & \text{if } p > 0.5 \\ 0 & \text{if } p < 0.5 \\ \text{both } 0 \text{ and } 1  & \text{if } p = 0.5 \end{cases}$
- Symmetry: Symmetric only when $p = 0.5$
- Special case: $Bernoulli(0.5)$ is a fair coin flip

**Real-world examples:**
- Coin flip (fair: $p = 0.5$, biased: $p \neq 0.5$)
- Single customer makes purchase (yes/no)
- Single patient recovers (yes/no)
- Quality control: single item is defective or not

**Typical Events This Describes**

- Single classification prediction (correct/incorrect)
- Single packet transmitted successfully/failed
- Single user clicks ad (yes/no)
- Single neuron fires (active/inactive)
- Single A/B test participant converts (yes/no)

**Relationships to Other Distributions:**
- Binomial: Bernoulli is Binomial(n=1, p)
- Sum: Sum of n independent Bernoulli(p) trials = Binomial(n, p)
- Categorical: Bernoulli is Categorical distribution with 2 categories
</div>
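
The relationships listed above can be checked numerically. This is a sketch using `scipy.stats` and NumPy's random generator; the seed, `n`, and the number of repetitions are arbitrary choices for the illustration.

```python
import numpy as np
from scipy import stats

p = 0.7

# Bernoulli(p) and Binomial(n=1, p) assign the same probabilities to 0 and 1.
k = np.array([0, 1])
bern_pmf = stats.bernoulli.pmf(k, p)
binom1_pmf = stats.binom.pmf(k, 1, p)
print(bern_pmf, binom1_pmf)

# Summing n independent Bernoulli(p) draws gives one Binomial(n, p) draw.
rng = np.random.default_rng(0)
n, repeats = 10, 50_000
sums = rng.binomial(1, p, size=(repeats, n)).sum(axis=1)
print(sums.mean(), n * p)            # sample mean vs. np
print(sums.var(), n * p * (1 - p))   # sample variance vs. np(1-p)
```

The simulated mean and variance only approximate $np$ and $np(1-p)$; the agreement improves as the number of repetitions grows.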

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

Training:
- Dropout mask for a single neuron: keep with probability p
- Data augmentation decision: apply transform with probability p
- Stochastic decision gates in neural architectures

Modeling:
- Binary classification output (after sigmoid/threshold)
- Bernoulli likelihood in generative models
- Binary labels in supervised learning

Evaluation:
- Single prediction correctness
- Binary event detection (anomaly present: yes/no)

</div>

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

- Don't use for multiple trials (use Binomial instead)
- Don't confuse $p$ (parameter) with $P(X=k)$ (probability at $k$)
- Remember: $Var(X) \neq p$ (it's $p(1-p)$)
- Maximum variance is 0.25, not 1

</div>

We can use the SciPy implementation of this distribution, [`scipy.stats.bernoulli`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bernoulli.html):

In [None]:
# Create distribution
X = stats.bernoulli(p=0.7)

# Calculate probabilities
print(f"P(X = 1) = {X.pmf(1)}")        # P(X=1) = 0.7
print(f"P(X≤0) = {X.cdf(0)}")        # P(X≤0) = 0.3
print(f"Generated 10 samples: {X.rvs(size=10)}")  # Generate 10 samples

# Moments
print(f"E(X) = {X.mean()}")        # E[X] = 0.7
print(f"Var(X) = {X.var()}")         # Var(X) = 0.21

In [None]:
def plot_bernoulli(p=0.7):
    print("Visual Characteristics")
    print("Two bars at x=0 and x=1")
    print("Height at x=1 is p, height at x=0 is 1-p")
    print("Symmetric when p=0.5, skewed otherwise\n")
    
    fig, ax = plt.subplots(figsize=(8, 4))
    x = [0, 1]
    pmf = [1-p, p]
    ax.bar(x, pmf, color=['salmon', 'lightgreen'], edgecolor='black', alpha=0.7)
    ax.set_xlabel('Outcome')
    ax.set_ylabel('Probability')
    ax.set_title(f'Bernoulli({p}): E[X]={p:.2f}, Var(X)={p*(1-p):.4f}')
    ax.set_xticks([0, 1])
    ax.set_xticklabels(['Failure (0)', 'Success (1)'])
    ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
    plt.close()
    return fig

interact(plot_bernoulli, p=FloatSlider(min=0.1, max=0.9, step=0.1, value=0.7))

### Binomial Distribution, $\mathcal{B}(n, p)$

<div class="alert alert-example">
<h4>Binomial Distribution</h4>

$X \sim Bin(n, p)$ or $X \sim \mathcal{B}(n, p)$

**When to use:**

Models the number of successes in $n$ independent identical trials, where each trial has probability $p$ of success. Use when you have:

- Fixed number of trials ($n$)
- Each trial is independent
- Each trial has only two outcomes (success/failure)
- Probability of success ($p$) is constant across trials

**Parameters & Domain:**
- Parameters: 

$n \in \mathbb{N}$ (number of trials, $n\geq 1$)

$p \in [0, 1]$ (probability of success per trial)
- Support (what values $X$ can take): $X \in \{0, 1, 2, ..., n\}$

| PMF | CDF | $E(X)$ | $Var(X)$|
|:---:|:---:|:----:|:---:|
|$$P(X = k) = \left(\begin{matrix}n\\k\end{matrix}\right)p^k(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}$$ for $k\in \{0, 1, 2, ..., n\}$</br> *Interpretation:*</br>$\binom{n}{k}$ = number of ways to choose $k$ successes from $n$ trials</br>$p^k$ = probability of $k$ successes </br> $(1-p)^{n-k}$ = probability of $(n-k)$ failures| $$F(x) = P(X \leq x) = \sum_{k=0}^{⌊x⌋}\left(\begin{matrix}n\\k\end{matrix}\right)p^k(1-p)^{n-k}$$ for $x\geq 0$ ($F(x) = 0$ for $x < 0$, $F(x) = 1$ for $x\geq n$) </br> *Note*: No closed-form expression; typically computed numerically or via tables| $$np$$ | $$np(1-p)$$ </br>Variance decreases as p → 0 or p → 1 (less uncertainty)</br>Variance is maximized when $p = 0.5$ |

**Key Properties:**
- Mode: $⌊(n + 1)p⌋$ (floor of $(n+1)p$)

Can have two modes if $(n+1)p$ is an integer

- Symmetry: Symmetric when $p = 0.5$, otherwise skewed
    * Right-skewed if p < 0.5
    * Left-skewed if p > 0.5

- Sum property: If $X_1 \sim Bin(n_1, p)$ and $X_2 \sim Bin(n_2, p)$ are independent, then $X_1 + X_2 \sim Bin(n_1 + n_2, p)$

**Real-world examples:**
- Number of heads in 10 coin flips
- Number of defective items in a batch of 100
- Number of patients who recover out of 50 treated
- Number of emails marked as spam out of 200

**Typical Events This Describes**

Counting successes in repeated trials:

- Number of website visitors who make a purchase (out of n visitors)
- Number of correctly classified samples (out of n test samples)
- Number of successful HTTP requests (out of n total requests)
- Number of spam emails (out of n total emails)
- Number of defective products in quality control
- Number of students passing an exam (out of n students)
- Number of "heads" in n coin flips
- Number of winning lottery tickets purchased

**Relationships to Other Distributions:**
1. Special cases:

- Bernoulli: $Binomial(n=1, p) = Bernoulli(p)$
- Certain event: $Binomial(n, p=1)$ gives $X = n$ always
- Impossible event: $Binomial(n, p=0)$ gives $X = 0$ always

2. Limiting cases:

- Normal approximation: As $n \rightarrow \infty$ with $np$ and $n(1-p)$ both large (typically ≥ 5):

$$\text{Binomial}(n,p) \approx \mathcal{N}(np, np(1-p))$$

- Poisson approximation: When $n$ is large, $p$ is small, and $np = \lambda$ remains moderate:

$$\text{Binomial}(n,p) \approx \text{Poisson}(\lambda = np)$$
  (Rule of thumb: $n \geq 20, p \leq 0.05$)


3. Composition:

Binomial is the sum of $n$ independent $Bernoulli(p)$ trials
</div>
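
The Poisson approximation can be inspected directly by comparing the two PMFs. The sketch below uses the standard Poisson PMF $e^{-\lambda}\lambda^k/k!$, which this notebook has not formally introduced at this point, and the values `n = 100`, `p = 0.02` are arbitrary choices that satisfy the rule of thumb.

```python
import math

def binom_pmf(k, n, p):
    """Binomial PMF from the formula C(n, k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson PMF exp(-lam) lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

n, p = 100, 0.02   # large n, small p
lam = n * p        # lambda = np = 2

max_gap = 0.0
for k in range(0, 8):
    b, q = binom_pmf(k, n, p), poisson_pmf(k, lam)
    max_gap = max(max_gap, abs(b - q))
    print(f"k={k}: Binomial={b:.4f}  Poisson={q:.4f}")
print(f"largest gap over k=0..7: {max_gap:.4f}")
```

Repeating the comparison with a larger `p` (for example 0.3, keeping `n` fixed) shows the gap widening, which is why the approximation is restricted to small `p`.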

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

Model Training:

- Batch accuracy: Number of correct predictions in a mini-batch of size $n$

If model has accuracy $p$, correct predictions $\sim Bin(n, p)$

- Ensemble methods: Number of models (out of $n$) that agree on a prediction
- Dropout: Total number of neurons kept in a layer with $n$ neurons and keep probability $p$
- Bootstrap sampling: Number of unique samples in bootstrap (related)

Model Evaluation:

- Cross-validation: Number of folds where model performs above threshold
- Statistical testing: Count successes in repeated experiments
- A/B testing: Number of conversions in treatment group of size n

Data Analysis:

- Binary classification metrics: At threshold, count of true positives
- Sample statistics: Modeling counts in repeated binary outcomes
- Confidence intervals: For proportions and success rates

Generative Models:

- Likelihood: Binomial likelihood for count data
- Bayesian inference: Conjugate with Beta prior for Bayesian updating

</div>

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

❌ Don't use when:

- Trials are not independent (use Hypergeometric instead)
- Probability changes between trials (use Beta-Binomial)
- Number of trials is not fixed in advance (use Negative Binomial or Poisson)

⚠️ Common mistakes:

- Confusing $n$ (number of trials) with $k$ (number of successes)
- Using when trials have different success probabilities
- Forgetting that $P(X = k)$ requires the binomial coefficient
- Assuming Normal approximation works for small $n$

⚠️ Computational issues:

- Binomial coefficients can overflow for large $n$
- Use log probabilities for numerical stability
- `scipy.stats` handles this automatically

</div>
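
The numerical issue mentioned above is easy to reproduce. The sketch below picks `n = 5000`, `k = 2500`, `p = 0.5` (arbitrary values chosen to trigger underflow) and computes the log-PMF by hand with `math.lgamma`, then compares it with `scipy.stats.binom.logpmf`.

```python
import math
from scipy import stats

n, k, p = 5000, 2500, 0.5

# Naive formula: p**k underflows to 0.0 in double precision,
# so the product with the (huge) binomial coefficient is unusable.
naive_factor = p ** k
print(naive_factor)

# Working in log space avoids the underflow; lgamma(m + 1) = log(m!).
log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
log_pmf = log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

scipy_log_pmf = float(stats.binom.logpmf(k, n, p))
print(log_pmf, scipy_log_pmf)   # the two values should agree closely
print(math.exp(log_pmf))        # back to a probability
```

Sums of logarithms stay in a comfortable numeric range even when the individual factors are astronomically large or small.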

We can use the SciPy implementation of this distribution, [`scipy.stats.binom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html):

In [None]:
# Create distribution: 10 trials, p=0.3
X = stats.binom(n=10, p=0.3)

# Probabilities
print(f"PMF: P(X=5) = {X.pmf(5)}")           # P(X=5) = probability of exactly 5 successes
print(f"CDF: P(X≤5) = {X.cdf(5)}")           # P(X≤5) = cumulative probability
print(f"P(X>4) = {1 - X.cdf(4)}")       # P(X>4) = P(X≥5) for discrete

# Interval probability
print(f"P(3 < X ≤ 7) = {X.cdf(7) - X.cdf(3)}")  # P(3 < X ≤ 7) = P(X ∈ {4,5,6,7})

# Moments
print(f"E[X] = {X.mean()}")           # E[X] = np = 3.0
print(f"Var(X) = {X.var()}")            # Var(X) = np(1-p) = 2.1
print(f"σ = {X.std()}")            # σ = √2.1 ≈ 1.45

# Random sampling
samples = X.rvs(size=10, random_state=42)
print(f"Generated 10 samples: {samples}")

# Survival function
print(f"Survival function, P(X > 5) = {X.sf(5)}")            # P(X > 5) = 1 - F(5)

In [None]:
def plot_binomial(n=10, p=0.7):
    print("Visual Characteristics")
    print("     p = 0.5: Symmetric, bell-shaped (approaches Normal as n increases)")
    print("     p < 0.5: Right-skewed (tail extends right)")
    print("     p > 0.5: Left-skewed (tail extends left)")
    print("Width: Spread increases with n")
    print("Peak: Centered around np")
    print("As n increases:")
    print("     Distribution becomes smoother")
    print("     Approaches Normal distribution (Central Limit Theorem preview)")
    print("     Relative to its range, more concentrated around the mean (X/n concentrates around p, by the Law of Large Numbers)")
    
    fig, ax = plt.subplots(figsize=(8, 4))
    x = np.arange(0, n + 1)
    pmf = []
    for xx in x:
        pmf.append(stats.binom.pmf(xx, n, p))
    ax.stem(x, pmf, basefmt=' ', linefmt='salmon', markerfmt='mo')
    ax.set_xlabel('x (number of successes)')
    ax.set_ylabel('Probability')
    ax.set_title(f'Binomial({n}, {p}): E[X]={n*p:.2f}, Var(X)={n*p*(1-p):.4f}')
    #ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
    plt.close()
    return fig

interact(plot_binomial, p=FloatSlider(min=0.1, max=0.9, step=0.1, value=0.7), n=IntSlider(min=1, max=100, step=1, value=10))

Let's consider the following example:

We are interested in the number of successful API calls out of 5 attempts. Each attempt has a binary outcome (success or failure), and the probability of success is 0.6.
Define the PMF and the CDF.

Each API call has a binary outcome, and we assume the calls are independent with the same success probability. With $n = 5$ trials in total, $X = \text{number of successful API calls}$ therefore follows a Binomial distribution with $n=5$ and $p=0.6$.
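Before looking at the plots, the PMF and CDF for this example can be computed directly from the Binomial formula. This is a minimal sketch that is independent of the notebook's `demo_pdf_cdf_discrete` helper, whose internals are not shown here.

```python
import numpy as np
from math import comb
from scipy import stats

n, p = 5, 0.6
ks = np.arange(n + 1)

# PMF from the formula C(n, k) p^k (1-p)^(n-k)
pmf_manual = np.array([comb(n, int(k)) * p**k * (1 - p)**(n - k) for k in ks])
cdf_manual = np.cumsum(pmf_manual)   # CDF is the running sum of the PMF

for k, pk, Fk in zip(ks, pmf_manual, cdf_manual):
    print(f"k={k}: P(X=k)={pk:.5f}, P(X<=k)={Fk:.5f}")
```

The same numbers come out of `stats.binom.pmf(ks, n, p)` and `stats.binom.cdf(ks, n, p)`.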

In [None]:
# Example: Number of successful API calls out of 5 attempts
f1, f2 = demo_pdf_cdf_discrete()

In [None]:
# PMF and CDF of Binomial Distribution
f2

### Geometric Distribution, $\mathcal{G}(p)$

<div class="alert alert-example">
<h4>Geometric Distribution</h4>

$X \sim \mathcal{G}(p)$ or $X \sim Geometric(p)$ or $X \sim Geo(p)$

**When to use:**

Models the **number of trials needed to get the first success** in a sequence of independent Bernoulli trials. Use when you're waiting for the first occurrence of an event with:

- Independent trials
- Constant probability $p$ of success per trial
- No memory of previous failures (memoryless property)
- Counting trials until first success

**Parameters & Domain:**
- Parameter: $p \in (0, 1]$ (probability of success per trial)

There are two common definitions:

- Number of trials until first success (includes the success) - support: {1, 2, 3, ...}
- Number of failures before first success - support: {0, 1, 2, ...}

We use convention 1, which is `scipy`'s default.

*Note*: $X = 1$ means success on first trial.

| PMF | CDF | $E(X)$ | $Var(X)$|
|:---:|:---:|:----:|:---:|
|$$P(X = k) = (1 - p)^{k-1}\cdot p$$ for $k \in \{1, 2, 3, ...\}$<br>- $(1 - p)^{k-1}$ is the probability of $(k-1)$ failures<br>- $p$ is the probability of success on the $k$-th trial<br>No binomial coefficient needed (order is fixed: failures then success)<br>*Alternative form (failures before success)*$$P(Y=k) = (1 - p)^k\cdot p$$ for $k\in \{0,1,2,...\}$| $$F(x) = P(X \leq x) = 1 - (1 - p)^{⌊x⌋}$$ for $x\geq 1$ <br>*Intuition*: P(success within $k$ trials) = 1 - P(all $k$ trials fail)<br>**Survival function** (often more useful): $$P(X > k) = (1 - p)^k$$ This is the probability of needing more than $k$ trials (the first $k$ trials all fail)| $$1/p$$ <br> - If $p = 0.5$ (coin flip): expect 2 trials for first heads <br> - If $p = 0.1$ (rare event): expect 10 trials for first success <br> - If $p = 0.01$: expect 100 trials for first success| $$(1-p)/p^2$$ <br> Variance increases (more uncertainty) as $p$ decreases |

**Key Properties:**

1. **Memoryless Property** (UNIQUE to Geometric and Exponential):
$$P(X > s + t \mid X > s) = P(X > t)$$

*Interpretation*: If you've already had $s$ failures, the probability of needing more than $t$ additional trials is the same as if you were starting fresh. Past failures don't affect future probabilities.

*Example*: If you flip 10 tails in a row, the probability of needing more than 5 additional flips for heads is the same as the probability of needing more than 5 flips from the start.
2. **Minimum Property**: If $X_1, X_2, ..., X_n$ are independent $Geometric(p)$ random variables, then:

$$\min(X_1, X_2, ..., X_n) \sim Geometric(1-(1-p)^n)$$

3. Lack of Memory in Practical Terms:

The distribution "resets" after each failure. This makes it suitable for modeling processes where each trial is truly independent.

4. Mode:

- $Mode = 1$ (first trial is most likely to be the success)
- Distribution is always right-skewed (long tail)

**Real-world examples:**
- Number of coin flips until first heads
- Number of job applications until first interview
- Number of attempts until passing a test
- Number of products inspected until finding first defect
- Number of customers until first sale
- Number of days until first rain

**Typical Events This Describes**

"How many trials until first success?"

- Number of coin flips until first heads
- Number of dice rolls until first six
- Number of packets sent until first successful transmission
- Number of job interviews until first offer
- Number of sales calls until first conversion
- Number of attempts until first bug is found
- Number of generations until mutation occurs
- Number of iterations until convergence criterion met
- Number of samples until finding a rare class
- Number of epochs until model improvement

*Key characteristic*: Counting discrete waiting time in a memoryless process.

**Relationships to Other Distributions:**
1. Special cases:

- Certain success: $Geometric(p=1)$ gives $X = 1$ always (success on first trial)
- Impossible success: $Geometric(p\rightarrow 0)$ gives $X \rightarrow \infty$ (never succeed)

2. Related distributions:

- Negative Binomial: Geometric is Negative Binomial with $r = 1$ (waiting for first success vs. $r$-th success)
- Exponential: Continuous analog of Geometric (memoryless property shared)

In the limit of many trials per unit time, each with a small success probability, so that successes occur at rate $\lambda$, $Geometric \rightarrow Exponential(\lambda)$

3. Sum property: If $X_1, X_2, ..., X_r$ are independent $Geometric(p)$, then:

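$X_1 + X_2 + ... + X_r \sim NegativeBinomial(r,p)$

This counts the trials needed for $r$ successes. Note that `scipy.stats.nbinom` uses the other convention and counts the failures before the $r$-th success.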
</div>
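The minimum and sum properties above can be checked numerically. The sketch below uses arbitrary illustration values for `p`, `n`, and `r`. Because `scipy.stats.nbinom` counts failures rather than trials, `r` is added to its mean before comparing.

```python
import numpy as np
from scipy import stats

p, n = 0.2, 3                       # illustrative values
q = 1 - (1 - p) ** n                # claimed success probability of the minimum
ks = np.arange(1, 11)

# P(min > k) = P(X_1 > k) * ... * P(X_n > k) by independence
sf_min = stats.geom(p).sf(ks) ** n
sf_claimed = stats.geom(q).sf(ks)
print(np.allclose(sf_min, sf_claimed))

# Sum property: nbinom counts failures before the r-th success,
# so add r to compare with the sum of r "trials" Geometric variables
r = 4
mean_sum_of_geoms = r * stats.geom(p).mean()
mean_negbin_trials = stats.nbinom(r, p).mean() + r
print(mean_sum_of_geoms, mean_negbin_trials)
```

Comparing survival functions is enough for the minimum property, since a distribution on $\{1, 2, ...\}$ is fully determined by $P(X > k)$ for all $k$. The sum check only compares means, so it is a sanity check rather than a proof.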

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

Training & Optimization:

- Convergence analysis: Number of epochs until loss < threshold
- Early stopping: Iterations until validation improvement
- Hyperparameter search: Number of configurations tested until finding good one
- Gradient descent: Steps until reaching local minimum
- Stochastic processes: Modeling random exploration until success

Sampling & Data:

- Rare class sampling: Samples needed until finding minority class example
- Rejection sampling: Proposals needed until acceptance
- Active learning: Queries needed until finding informative sample
- Data augmentation: Attempts until generating valid augmented sample

Reinforcement Learning:

- Episode length: Steps until reaching terminal state
- Exploration: Actions until discovering reward
- Success probability: Trials until agent succeeds at task

System Monitoring:

- Failure detection: Requests until first failure
- Anomaly detection: Events until first anomaly
- Testing: Test cases until first bug found

Networking & Distributed Systems:

- Retry mechanisms: Attempts until successful connection
- Leader election: Rounds until consensus
- Packet transmission: Retransmissions until ACK received

</div>
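The rejection sampling item can be illustrated with a small simulation. As an assumed toy setup, points are proposed uniformly in the square $[-1, 1]^2$ and accepted when they fall inside the unit disk, so the acceptance probability is $\pi/4$ and the number of proposals per accepted point should be Geometric with mean $4/\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def proposals_until_accept(rng):
    """Count uniform proposals in [-1, 1]^2 until one lands in the unit disk."""
    count = 0
    while True:
        count += 1
        x, y = rng.uniform(-1, 1, size=2)
        if x * x + y * y <= 1:
            return count

p_accept = np.pi / 4                     # area of disk / area of square
counts = np.array([proposals_until_accept(rng) for _ in range(20_000)])

print(f"Empirical mean proposals: {counts.mean():.3f}")
print(f"Geometric mean 1/p:       {1 / p_accept:.3f}")
```

Each proposal is independent with the same acceptance probability, which is exactly the setting the Geometric distribution assumes.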

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

❌ Don't use when:

- Trials are not independent (use Markov chains)
- Success probability changes over time (use non-stationary models)
- You're waiting for multiple successes (use Negative Binomial)
- Dealing with continuous time (use Exponential instead)
- There's a maximum number of trials (use truncated Geometric)

⚠️ Common mistakes:

- Confusing "trials until success" vs "failures before success" conventions
- Forgetting that the memoryless property does not apply to real-world "learning" situations
- Using when success probability improves with practice (violates constant p)
- Misapplying to situations with "hot streaks" or "cold streaks"

⚠️ Memoryless property misuse:

- Valid: Independent coin flips, random sampling with replacement
- Invalid: "*I've failed 10 job interviews, so I'm due for success*" (under memorylessness you are never "due", and human learning violates independence anyway)

</div>
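The two conventions differ only by a shift of one. A minimal sketch, assuming you want the "failures before success" version from SciPy, is to pass `loc=-1` to `scipy.stats.geom`:

```python
from scipy import stats

p = 0.3
X = stats.geom(p)            # trials until first success, support {1, 2, ...}
Y = stats.geom(p, loc=-1)    # shifted by -1: failures before first success, support {0, 1, ...}

print(f"P(X = 1) = {X.pmf(1):.4f}, P(Y = 0) = {Y.pmf(0):.4f}")   # both equal p
print(f"E[X] = {X.mean():.4f}  (1/p)")
print(f"E[Y] = {Y.mean():.4f}  ((1-p)/p)")
print(f"Var(X) = {X.var():.4f}, Var(Y) = {Y.var():.4f}")          # shift leaves variance unchanged
```

The means differ by exactly 1, while the variance is the same under both conventions.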

We can use the SciPy implementation of this distribution, [`scipy.stats.geom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html):

In [None]:
p = 0.3

# Create distribution
X = stats.geom(p)

# PMF: P(k trials until success)
print(f"P(X = 0) = {X.pmf(0)}")           # 0.0, since X = 0 is not in the support {1, 2, ...}
print(f"P(X = 1) = {X.pmf(1)}")           # P(success on 1st trial) = p
k = 5
print(f"P(X = {k}) = {X.pmf(k)}")           # P(first success on the k-th trial) = (1-p)^(k-1) p

# CDF
print(f"P(X≤5) = {X.cdf(5)}")           # P(success within 5 trials)
print(f"P(need more than 5 trials) = {1 - X.cdf(5)}")       # P(need more than 5 trials) = (1-p)^5

# Survival function (often more useful)
print(f"Survival function P(X > k) = {X.sf(k)}")            # P(X > k) = (1-p)^k

# Moments
print(f"E[X] = {X.mean()}")           # E[X] = 1/p (scipy.stats.geom counts trials, not failures)
print(f"Var(X) = {X.var()}")            # Var(X) = (1-p)/p²
print(f"σ = {X.std()}")            # σ

# Random sampling
samples = X.rvs(size=10, random_state=42)
# These are numbers of trials (support starts at 1); subtract 1 for the number of failures
print(f"Generated 10 samples: {samples}")


In [None]:
# Memoryless property demonstration:
p = 0.3
X = stats.geom(p)

# P(X > 10): the first 10 trials all fail, so more than 10 trials are needed
prob_more_than_10 = X.sf(10)  # equals (1-p)**10
print(f"P(X > 10) = {prob_more_than_10:.4f}")

# P(X > 15 | X > 5) = P(X > 15) / P(X > 5) should equal P(X > 10)
# (After 5 failures, the probability that the next 10 trials also fail)
prob_conditional = X.sf(15) / X.sf(5)
print(f"P(X > 15 | X > 5) = {prob_conditional:.4f}")

# Memoryless: past doesn't affect future!

In [None]:
def plot_geom(p=0.7):
    print("Visual Characteristics")
    print("Always right-skewed (long tail to the right)")
    print("Mode at k=1 (most likely to succeed immediately)")
    print("Decreasing probabilities: P(X=k) decreases geometrically as k increases")
    print("Rate of decrease: Faster decay for larger p (success more likely)")
    
    fig, ax = plt.subplots(figsize=(8, 4))
    x = np.arange(1, 11)  # integer support values 1..10
    X = stats.geom(p)
    
    pmf = []
    for xx in x:
        pmf.append(X.pmf(xx))
    ax.bar(x, pmf, color=['salmon', 'lightgreen'], edgecolor='black', alpha=0.7)
    ax.set_xlabel('x (trials until first success)')
    ax.set_ylabel('Probability')
    ax.set_title(f'Geometric({p}): E[X]={1/p:.2f}, Var(X)={(1-p)/p**2:.4f}')
    ax.set_xticks(x)
    ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
    plt.close()
    return fig

interact(plot_geom, p=FloatSlider(min=0.1, max=0.9, step=0.1, value=0.7))

### Poisson Distribution, $\mathcal{P}(\lambda)$

<div class="alert alert-example">
<h4>Poisson Distribution</h4>

$X \sim \mathcal{P}(\lambda)$ or $X \sim Pois(\lambda)$

**When to use:**

Models the number of events occurring in a fixed interval (time, space, volume, etc.) when events occur:

- Independently
- At a constant average rate $\lambda$
- One at a time (no simultaneous events)

Use for rare events or when counting occurrences with:

- No fixed upper limit on counts
- Events occur randomly over time/space
- Average rate is known

**Parameters & Support/Domain:**
- Parameter: $\lambda > 0$ (rate parameter, average number of events per interval)
- Support (what values $X$ can take): $X \in \{0, 1, 2, ...\}$ (all non-negative integers)

| PMF | CDF | $E(X)$ | $Var(X)$|
|:---:|:---:|:----:|:---:|
|$$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$$ for $k\in \{0,1,2,...\}$<br>*Interpretation:*<br> $\lambda^k$ grows with more events<br>$e^{-\lambda}$ is the normalizing constant<br>$k!$ accounts for the different orderings| $$F(x) = P(X \leq x) = e^{-\lambda}\sum_{k=0}^{⌊x⌋}\frac{\lambda^k}{k!}$$| $$\lambda$$ | $$\lambda$$ |

**Key Properties:**

*KEY PROPERTY*: Mean equals variance! This is the signature of the Poisson distribution.

- If you observe data where sample mean ≈ sample variance, suspect Poisson
- Coefficient of variation: $CV = \frac{\sigma}{\mu} = \frac{\sqrt{\lambda}}{\lambda} = \frac{1}{\sqrt{\lambda}}$ (decreases as $\lambda$ increases)

1. Mode:
- $⌊\lambda⌋$ (floor of $\lambda$)
- Both $\lambda - 1$ and $\lambda$ if $\lambda$ is an integer

2. Symmetry:
- Right-skewed for small $\lambda$
- Approaches symmetry as $\lambda$ increases
- Nearly symmetric for $\lambda \geq 10$

3. Sum property: If $X_1 \sim Pois(\lambda_1)$ and $X_2 \sim Pois(\lambda_2)$ are independent, then $X_1 + X_2 \sim Pois(\lambda_1 + \lambda_2)$
4. Divisibility: If events occur at rate $\lambda$ per unit interval, then the count over an interval of length $t$ follows $Pois(\lambda t)$ (e.g. time or space scaling)
5. Memoryless (between events): Related to exponential distribution for waiting times

**Real-world examples:**
- Number of phone calls received per hour
- Number of typos per page in a book
- Number of earthquakes per year in a region
- Number of customers arriving at a store per day
- Number of mutations in a DNA sequence
- Number of radioactive decay events per second

**Typical Events This Describes**

Count data with no fixed upper bound:

- Number of server requests per second
- Number of errors/bugs per 1000 lines of code
- Number of clicks on an ad per day
- Number of API calls per minute
- Number of network packets dropped per hour
- Number of fraud attempts per day
- Number of customer support tickets per week
- Number of rare disease cases per year
- Number of typos/errors in a document
- Number of goals scored in a soccer match
- Number of accidents at an intersection per month

Key characteristic: Events are rare relative to the possible opportunities.

**Relationships to Other Distributions:**

1. Special cases:

- Degenerate limit: as $\lambda \rightarrow 0$, $P(X=0) \rightarrow 1$ (no events)
- Rare events: Poisson models "needle in haystack" scenarios

2. Limiting cases:

- From Binomial: Poisson is the limit of $Binomial(n, p)$ as $n \rightarrow \infty, p \rightarrow 0$, with $np = \lambda$ fixed

Rule of thumb: $n \geq 20, p \leq 0.05$, use $Poisson(\lambda = np)$


- To Normal: As $\lambda \rightarrow \infty$:

$$\text{Poisson}(\lambda) \approx \mathcal{N}(\lambda, \lambda)$$

Rule of thumb: $\lambda \geq 10$ for reasonable approximation, $\lambda \geq 30$ for good approximation

3. Related distributions:

- Exponential: If events follow $Poisson(\lambda)$ per unit time, waiting time between events $\sim Exponential(\lambda)$

</div>
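The Binomial-to-Poisson limit above can be observed numerically. The sketch below keeps $np = \lambda$ fixed (with $\lambda = 2$ as an arbitrary choice) and tracks the largest PMF difference as $n$ grows.

```python
import numpy as np
from scipy import stats

lam = 2.0
ks = np.arange(0, 21)
poisson_pmf = stats.poisson.pmf(ks, lam)

max_errors = []
for n in [10, 100, 1000, 10000]:
    binom_pmf = stats.binom.pmf(ks, n, lam / n)      # keep n*p = lam fixed
    err = np.max(np.abs(binom_pmf - poisson_pmf))
    max_errors.append(err)
    print(f"n={n:>6}, p={lam / n:.4f}: max |Binomial - Poisson| = {err:.2e}")
```

The gap shrinks as $n$ grows and $p$ shrinks, which is why the rule of thumb asks for both a large $n$ and a small $p$.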

<div class="alert alert-primary">
<h4>🤖 ML Applications</h4>

Anomaly Detection:

- Rare event modeling: Detect anomalies in event counts (fraud, intrusion, failures)
- If normal behavior $\sim Pois(\lambda)$, unusually high counts indicate anomalies
- Network security: Unusual number of failed login attempts
- Quality control: Defects per product unit

System Monitoring:

- Server load: Model request rates, predict capacity needs
- Traffic analysis: Number of users per time window
- Error rates: Number of exceptions/crashes per deployment
- Queue length: Number of pending jobs/requests

Natural Language Processing:

- Word frequency: Model rare word occurrences in text
- Document statistics: Number of mentions of rare entities
- Topic modeling: Word counts in topics (though often overdispersed)

Computer Vision:

- Object counting: Number of objects in an image (when rare)
- Event detection: Number of events per frame in video

Recommendation Systems:

- Impression modeling: Number of times user sees an item
- Click modeling: Number of clicks per user (when rare)

Generative Models:

- Poisson regression: Model count outcomes as function of features
- Poisson processes: Model event streams over time
- Count-based likelihoods: Bayesian models for count data

A/B Testing:

- Model conversion counts when rates are low
- Test differences in rare event rates

</div>
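As a sketch of the anomaly detection idea, a count can be scored by its tail probability under a baseline Poisson model. The baseline rate and the observed count below are hypothetical values chosen for illustration.

```python
from scipy import stats

lam = 3.5            # hypothetical baseline: average failed logins per hour
observed = 12        # hypothetical count seen in the current hour

baseline = stats.poisson(mu=lam)

# P(X >= observed) = P(X > observed - 1), via the survival function
p_value = baseline.sf(observed - 1)

# Smallest count c with P(X <= c) >= 0.999; counts above c are flagged
threshold = baseline.ppf(0.999)

print(f"P(X >= {observed}) under Poisson({lam}) = {p_value:.2e}")
print(f"Flag counts above {int(threshold)}")
```

Note the `observed - 1`: for a discrete variable, `sf(k)` is $P(X > k)$, so $P(X \geq k)$ needs `sf(k - 1)`.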

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls:</h4>

❌ Don't use when:
- Events are not independent
- Rate varies over time (use non-homogeneous Poisson process)
- Variance >> Mean (overdispersion → use Negative Binomial)
- Variance << Mean (underdispersion → use constrained models)
- Number of trials is fixed (use Binomial instead)

⚠️ Common mistakes:

- Using Poisson when $mean \neq variance$ (check empirical data!)
- Confusing $\lambda$ (rate) with the actual count $X$
- Forgetting that support is infinite (though probability becomes negligible for large $k$)
- Using when events are not rare or independent

⚠️ When to question Poisson:

- If sample variance > 2 × sample mean, consider Negative Binomial
- If many zeros beyond what Poisson predicts, consider Zero-Inflated Poisson
- If events are clustered in time/space, Poisson assumptions violated

⚠️ Computational issues:

- For large $\lambda$ and $k$, factorial $k!$ can overflow
- Use log probabilities: $\log P(X=k) = k \log\lambda - \lambda - \log(k!)$
- `scipy.stats` uses numerically stable algorithms

</div>
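A quick way to apply the "check mean versus variance" advice is to compute the ratio of sample variance to sample mean. The sketch below compares simulated Poisson counts with simulated Negative Binomial counts that have the same mean. Both samples are synthetic, and `dispersion_index` is a helper name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 100_000

# Both samples have mean 2, but only the first is Poisson
poisson_counts = rng.poisson(lam=2.0, size=size)
negbin_counts = rng.negative_binomial(n=2, p=0.5, size=size)   # mean 2, variance 4

def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return counts.var(ddof=1) / counts.mean()

d_poisson = dispersion_index(poisson_counts)
d_negbin = dispersion_index(negbin_counts)
print(f"Poisson sample:           variance/mean = {d_poisson:.3f}")
print(f"Negative Binomial sample: variance/mean = {d_negbin:.3f}")
```

A ratio well above 1 signals overdispersion, in which case a Negative Binomial model is usually the better starting point.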

We can use the SciPy implementation of this distribution, [`scipy.stats.poisson`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html):

In [None]:
# Create distribution: average rate λ = 3.5
X = stats.poisson(mu=3.5)

# Probabilities
print(f"P(X=5) = {X.pmf(5)}")           # P(X=5) = probability of exactly 5 events
print(f"P(X≤5) = {X.cdf(5)}")           # P(X≤5) = cumulative probability
print(f"P(X>4) = {1 - X.cdf(4)}")       # P(X>4) = P(X≥5) for discrete
print(f"Survival function, P(X>4) = {X.sf(4)}")            # P(X>4) = survival function (same as above)

# Interval probability
print(f"P(2 < X ≤ 8) = {X.cdf(8) - X.cdf(2)}")  # P(2 < X ≤ 8) = P(X ∈ {3,4,5,6,7,8})

# Moments
print(f"E[X]: {X.mean()}")           # E[X] = λ = 3.5
print(f"Var(X): {X.var()}")            # Var(X) = λ = 3.5 (same as mean!)
print(f"σ: {X.std()}")            # σ = √3.5 ≈ 1.87

# Random sampling
samples = X.rvs(size=10, random_state=42)
print(f"Generated 10 samples: {samples}")

In [None]:
def plot_poisson(lam=0.7):
    print("Visual Characteristics")
    print("For small λ (< 1):")
    print("     Strongly right-skewed")
    print("     Mode at 0")
    print("     Most probability mass at 0 and 1")
    print("For moderate λ (1-10):")
    print("     Still right-skewed but less pronounced")
    print("     Mode shifts right")
    print("     Bell-shape starts emerging")
    print("For large λ (> 10):")
    print("     Nearly symmetric")
    print("     Bell-shaped (resembles Normal)")
    print("     Well-approximated by Normal distribution\n")
    print("As λ increases:")
    print("     Distribution shifts right (higher mean)")
    print("     Distribution spreads out (higher variance)")
    print("     Distribution becomes more symmetric")
    print("     Approaches Normal distribution")
    
    fig, ax = plt.subplots(figsize=(10, 5))
    x = np.arange(0, 41)  # Poisson support starts at 0
    X = stats.poisson(lam)
    
    pmf = []
    for xx in x:
        pmf.append(X.pmf(xx))
    ax.bar(x, pmf, color=['salmon', 'lightgreen'], edgecolor='black', alpha=0.7)
    ax.set_xlabel('x (number of events)')
    ax.set_ylabel('Probability')
    ax.set_title(f'Poisson({lam:.2f}): E[X]={lam:.2f}, Var(X)={lam:.2f}')
    ax.set_xticks(x)
    #ax.set_ylim(0, 1)
    ax.grid(True, alpha=0.3)
    plt.close()
    return fig

interact(plot_poisson, lam=FloatSlider(min=0.1, max=20, step=0.1, value=0.7))

### Comparison

|Property | $Bernoulli(p)$ | $Binomial(n, p)$ | $Geometric(p)$ | $Poisson(\lambda)$ | 
|------|:-------:|:-----:|:------:|:------:|
|**What it counts** | Single trial outcome |Successes in n trials | Trials until 1st success | Events in interval | 
|**Support** | $\{0, 1\}$ |$\{0, 1, ..., n\}$| $\{1, 2, 3, ...\}$ (infinite) | $\{0, 1, 2, ...\}$ (infinite)| 
|**Parameters** |$p$ |$n, p$| $p$ | $\lambda$| 
|**PMF** | $p^k(1-p)^{1-k}$ | $\binom{n}{k}p^k(1-p)^{n-k}$ | $(1-p)^{k-1}p$ | $\frac{\lambda^k e^{-\lambda}}{k!}$ |
|$E[X]$| $p$ | $np$ | $\frac{1}{p}$ | $\lambda$ |
|$Var(X)$| $p(1-p)$| $np(1-p)$ | $\frac{1-p}{p^2}$ | $\lambda$ | 
|**Mean = Variance?**| No| No| No | ✓ Yes (signature property)|
|**Memoryless?**| N/A (single trial) | No | ✓ Yes (signature property) | No (but the waiting times between events are) |
|**Fixed # trials?**| Yes ($n=1$) |Yes | No (stops at success) | No|
|**Shape** | Two bars | Bell (large n) | Right-skewed (mode=1) | Right-skewed → Normal |
|**When to use**| One binary event |Fixed repeated trials| Waiting for 1st event | Rare events, no limit|
|**Relationships** |$Bin(1,p)$| Sum of n Bernoullis| $NegBin(r=1, p)$ | Limit of Binomial| 
|**Typical ML use**| Single neuron dropout| Batch accuracy| Epochs to converge | Server requests|
|**Example question** | "Is prediction correct?" | "How many correct in batch?" | "How many tries until success?" | "How many errors per hour?" |




<center>
<img src="img/distribution_decision_tree_svg.svg" alt="Choosing distribution decision tree" width="700px">
</center>

In [None]:
comparison_discrete_rv()

## Return to Opening Challenge

Let's analyze our mystery click data systematically:

1. Check if it matches $Binomial(n=50, p=0.05)$
2. Check if it matches $Poisson(λ=2)$
3. Check if it matches $Normal(μ=2, σ²=25)$



In [None]:
mystery_prob()

The answer is (b) Poisson(λ=2)!

WHY?
1. Data is discrete counts $(0, 1, 2, ...)$ → Rules out Normal
2. $Mean ≈ Variance ≈ 2$ → Characteristic of Poisson
3. Rare events (clicks) over time → Poisson models this well
4. No fixed $n$ trials → Binomial doesn't fit (and $Binomial(50, 0.05)$ would have mean $np = 2.5$, not 2)

KEY LESSON: Distribution choice encodes assumptions about data generation!
- Binomial: Fixed $n$ trials, binary outcomes
- Poisson: Rare events, no fixed $n$
- Normal: Continuous, symmetric around mean
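
The reasoning above can be cross-checked against the three candidates from the opening problem. This is a minimal sketch that is independent of the notebook's `mystery_prob` helper (whose internals are not shown here). It compares each candidate's mean, variance, $P(X \leq 1)$, and the probability it puts on negative values.

```python
from scipy import stats

candidates = {
    "Binomial(50, 0.05)": stats.binom(n=50, p=0.05),
    "Poisson(2)": stats.poisson(mu=2),
    "Normal(2, sd=5)": stats.norm(loc=2, scale=5),
}

summary = {}
for name, dist in candidates.items():
    summary[name] = {
        "mean": dist.mean(),
        "var": dist.var(),
        "P(X<=1)": dist.cdf(1),
        "P(X<0)": dist.cdf(-1e-9),   # probability mass on impossible negative counts
    }
    s = summary[name]
    print(f"{name:>20}: mean={s['mean']:.2f}, var={s['var']:.2f}, "
          f"P(X<=1)={s['P(X<=1)']:.3f}, P(X<0)={s['P(X<0)']:.3f}")
```

Only the Poisson candidate matches both the observed mean of 2 and the roughly 41% of users with 0-1 clicks. The Normal candidate puts a large share of its probability on negative click counts.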

## Common Mistakes

<div class="alert alert-danger">
<h4>⚠️ Common Pitfalls to Avoid:</h4>

1. Don't use Normal for discrete counts!
2. Binomial needs fixed $n$; Poisson doesn't
3. Mean ≈ Variance is Poisson signature
4. Remember: $P(X=x)$ vs $P(X\leq x)$ are different!

</div>

## Applications in Machine Learning

<div class="alert alert-secondary">
<h4>🤖 ML Applications Summary</h4>

- Bernoulli: Dropout, binary classification outputs
- Binomial: Batch accuracy, ensemble voting
- Poisson: Rare events (anomalies, server load)
- Always check: Do $E[X]$ and $Var(X)$ match your distribution?

</div>

<div class="alert alert-secondary">
<h4>🔧 Python Essentials</h4>

`from scipy import stats`
- `stats.binom.pmf(k, n, p)` → $P(X=k)$
- `stats.binom.cdf(k, n, p)` → $P(X\leq k)$
- `stats.binom.rvs(n, p, size=size)` → random samples (pass `size` by keyword; the third positional argument is `loc`)

</div>

## Key Takeaways

<div class="alert alert-summary">
<h4>🎓 Key Takeaways</h4>

1. Random Variables transform outcomes → numbers (enables math!)
2. Discrete R.V.: PMF $p(x) = P(X=x)$, CDF $F(x) = P(X\leq x)$
3. The PMF gives the jump sizes in the CDF
4. The CDF is the sum of all PMF values up to x
5. Expectation $E[X]$ = center, Variance $Var(X)$ = spread
6. Distribution choice encodes data generation assumptions:
   - $Bernoulli(p)$: Single binary trial
   - $Binomial(n,p)$: Count successes in $n$ trials
   - $Poisson(\lambda)$: Rare events, no fixed $n$

</div>
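
Takeaways 3 and 4 can be verified directly. The sketch below uses a $Poisson(2)$ variable as an arbitrary example and recovers the CDF from the PMF and the PMF from the CDF.

```python
import numpy as np
from scipy import stats

X = stats.poisson(mu=2)
ks = np.arange(0, 15)

pmf = X.pmf(ks)
cdf = X.cdf(ks)

# CDF is the running sum of the PMF (support starts at 0 here)
cdf_from_pmf = np.cumsum(pmf)

# PMF is the jump of the CDF at each k: F(k) - F(k-1), with F(-1) = 0
pmf_from_cdf = np.diff(cdf, prepend=0.0)

print(np.allclose(cdf_from_pmf, cdf), np.allclose(pmf_from_cdf, pmf))
```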