# Random Vectors

In [None]:
%%html
<link rel="stylesheet" type="text/css" href="../styles/styles.css">

## Learning Objectives

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
#sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
import sys
from pathlib import Path

# Add the "resources" directory to the path
project_root = Path().resolve().parent
resources_path = project_root / 'resources'
sys.path.insert(0, str(resources_path))

In [None]:
from multivariate import interactive_carousel_demo

<div class="alert alert-info">
<h3>🎯 Viral Ad Carousel Mystery</h3>

**Context**: You're an ML engineer at a social media company. The product team has designed a new ad carousel feature where N = 10 advertisements are displayed in a circular sequence. User testing shows that each ad gets a "like" (*L*) or "skip" (*S*) with roughly equal probability ($p = 0.5$).

**The Bonus System**: Marketing wants to reward advertisers with a bonus whenever their ad stands out from the crowd. Specifically, an ad earns a bonus if the user's reaction to it differs from BOTH adjacent ads.

Example:

|Ad sequence: |   [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8] | [9] | [10]|
|----|:---:|:---:|:--:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **User reactions:** | L | S | L | L | S | L | S | S | L | S |
| **Bonuses?:** | YES | YES | No | No | YES | YES | No | No | YES | YES |

Total:  6 bonuses (the carousel is circular, so ads 1 and 10 are neighbors)

**Your Task**: The CFO asks you: "On average, how many bonuses will we pay per user? We need this for our quarterly budget!"

Quick Poll (make a guess!):

- About 2 bonuses?
- About 5 bonuses?
- About 8 bonuses?

**Challenge**: Can you solve this without simulating millions of users?

<h4>💡 Why This Is Hard (Without Today's Tools)</h4>

- There are 2^10 = 1,024 possible user reaction sequences
- Each sequence has a different number of bonuses
- The dependencies between adjacent ads make direct calculation complex
- We could simulate, but that doesn't give us the exact mathematical answer

**Promise:** By the end of this lesson, you'll solve this elegantly using random vectors.

</div>

<center>
<img src="img/circular_carousel_with_relationships.svg" alt="Carousel ad" width="800px">
</center>

## Random Vectors - The Foundation

When training a neural network, we don't process one feature at a time. We process feature vectors:

- Image: [pixel₁, pixel₂, ..., pixel₇₈₄] for 28×28 MNIST
- Text embedding: [dim₁, dim₂, ..., dim₅₁₂] for BERT
- User profile: [age, income, clicks, time_on_site, ...]

*Question*: How do we mathematically model the joint behavior of multiple random quantities?

Let's start simple. Imagine you're tracking 3 metrics for a web user:

- $X_1 = \text{time spent on page (seconds)}$
- $X_2 = \text{number of clicks}$
- $X_3 = \text{scroll depth (percentage)}$

Instead of treating these separately, we group them: $\mathbf{X} = [X_1, X_2, X_3]$. This is a random vector - a vector whose components are random variables.

<div class="alert alert-success">
<h4>Definition: Random Vector</h4>

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. A random vector is a mapping:
$$\mathbf{X}: \Omega \rightarrow \mathbb{R}^n$$
represented as:
$$\mathbf{X} = [X_1, X_2, ..., X_n]^T$$
where each $X_i$ is a random variable.

Notation: We use bold uppercase letters ($\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$) for random vectors.
</div>

Two Ways to Visualize Random Vectors:

1. View 1: Direct Mapping from Sample Space
- Sample space $\Omega \rightarrow \mathbb{R}^n$
- Each outcome ω maps to a point $(x_1, x_2, ..., x_n)$ in $n$-dimensional space

<center>
<img src="img/vectors-1.png" width="400px">
</center>

2. View 2: Vector of Random Variables
- Each component $X_i: \Omega \rightarrow \mathbb{R}$
- The vector combines $n$ separate random variables

<center>
<img src="img/vectors-2.png" width="400px">
</center>

<div class="alert alert-primary">
<h4>🤖 ML Application Spotlight: Feature Vectors</h4>
In Machine Learning:

- Input features: $X = [x_1, x_2, ..., x_n]$ (e.g., user age, income, browsing history)
- Model weights: $W = [w_1, w_2, ..., w_n]$ (learned parameters)
- Hidden layer activations: $H = [h_1, h_2, ..., h_k]$
- Prediction: $\hat{y} = f(X·W)$

Random vectors allow us to model uncertainty in all these quantities simultaneously!

*Example*: In A/B testing, each user's behavior is a random vector [clicks, time_on_site, conversion]. We need to understand the joint distribution to make business decisions.
</div>

## Joint Distributions for $n$ Variables

**Scenario**: You have sensor data from an autonomous vehicle:

- $X_1 = \text{speed (km/h)}$
- $X_2 = \text{steering angle (degrees)}$
- $X_3 = \text{brake pressure (PSI)}$

Knowing each distribution separately isn't enough! You need to know: "*What's the probability that speed > 100 AND steering angle > 30 AND brake pressure < 50?*"

This requires the joint distribution.

<div class="alert alert-success">
<h4>Definition: Joint Distribution (n variables)</h4>

Let $X_1, X_2, ..., X_n$ be $n$ random variables defined on $(\Omega, \mathcal{A}, \mathbb{P})$.

<h5>Discrete case</h5>

The **joint probability mass function** (or *joint PMF*) is given by:

$$\mathbb{P}_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) = \mathbb{P}(X_1=x_1, X_2=x_2, ..., X_n=x_n)$$

The **joint CDF** of $n$ r.v. $X_1,...,X_n$ is defined as:
$$F_{X_1X_2...X_n}(x_1,x_2,...,x_n) = \mathbb{P}([X_1\leq x_1] \ \cap\ [X_2\leq x_2]\ \cap\ ...\ \cap\ [X_n \leq x_n])$$

<h5>Continuous case</h5>

The **joint probability density function (PDF)** $f_{X_1X_2...X_n}(x_1,x_2,...,x_n)$ satisfies, for any (measurable) set $A\subset \mathbb{R}^n$:
$$\mathbb{P}\left((X_1, X_2, ..., X_n)\in A\right) = \int\cdots\int_A f_{X_1X_2...X_n}(t_1, t_2, ..., t_n)\,dt_1 dt_2 ... dt_n$$

The **joint CDF** of $X_1,...,X_n$ is given by:

$$F_{X_1X_2...X_n}(x_1,x_2,...,x_n) = \mathbb{P}(X_1\leq x_1, X_2\leq x_2,..., X_n \leq x_n)  =$$
$$= \int_{-\infty}^{x_1}\int_{-\infty}^{x_2}...\int_{-\infty}^{x_n}f_{X_1X_2...X_n}(t_1, t_2, ..., t_n)dt_1 dt_2 ... dt_n$$


**Key Properties:**

* $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}...\int_{-\infty}^{\infty}f_{X_1X_2...X_n}(x_1, x_2, ..., x_n)dx_1 dx_2 ... dx_n = 1$
* $\lim \limits_{x\to-\infty} F_{\mathbf{X}}(x,...,x) = 0$
* $\lim \limits_{x\to+\infty} F_{\mathbf{X}}(x,...,x) = 1$

</div>

<div class="alert example">
<h4>Calculated Example: Finding the Normalizing Constant</h4>

Three features from a recommendation system have joint PDF:

$$f_{XYZ}(x,y,z) = \left\{ \begin{array}{ll} c(3x + 2y + z) & 0 \leq x \leq 1, \ 0\leq y \leq 1, \ 0\leq z \leq 1 \\ 0 & \text{otherwise} \end{array}\right.$$

where $c$ is a constant.

Find the constant $c$.

</div>

<details>
<summary>Reveal solution</summary>

A joint PDF must integrate to 1 over the whole space. This gives our solution strategy:

$$\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}f_{XYZ}(x,y,z)dxdydz = 1$$

Using this expression, we can find $c$. Note that outside $0 \leq x \leq 1, \ 0\leq y \leq 1, \ 0\leq z \leq 1$ the PDF equals 0. Then:

$$\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}f_{XYZ}(x,y,z)dxdydz = \int\limits_{0}^{1}\int\limits_{0}^{1}\int\limits_{0}^{1}c(3x + 2y + z)dxdydz = 1$$

$$\int\limits_{0}^{1}\int\limits_{0}^{1}\int\limits_{0}^{1}c(3x + 2y + z)dxdydz = \int\limits_{0}^{1}\int\limits_{0}^{1}c\left(\frac{3x^2}{2} + 2yx + zx\right)\Bigg\rvert_{0}^{1}dydz = \int\limits_{0}^{1}\int\limits_{0}^{1}c\left(\frac{3}{2} + 2y + z\right)dydz =$$

$$=\int\limits_{0}^{1}c\left(\frac{3}{2}y + \frac{2y^2}{2} + zy\right)\Bigg\rvert_{0}^{1}dz = \int\limits_{0}^{1}c\left(\frac{3}{2}y + y^2 + zy\right)\Bigg\rvert_{0}^{1}dz =\int\limits_{0}^{1}c\left(\frac{3}{2} + 1 + z\right)dz=$$

$$=\int\limits_{0}^{1}c\left(\frac{5}{2} + z\right)dz = c\left(\frac{5}{2}z + \frac{z^2}{2}\right)\Bigg\rvert_{0}^{1} = c\left(\frac{5}{2} + \frac{1}{2}\right) = 3c$$

Then:
$$3c = 1$$
$$c = \frac{1}{3}$$

Therefore, we can substitute $c$ with its value in $f_{XYZ}$:

$$f_{XYZ}(x,y,z) = \left\{ \begin{array}{ll} \frac{1}{3}(3x + 2y + z) & 0 \leq x \leq 1, \ 0\leq y \leq 1, \ 0\leq z \leq 1 \\ 0 & \text{otherwise} \end{array}\right.$$

</details>
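
As a quick sanity check on the worked solution, the normalizing constant can also be approximated numerically. The sketch below uses a midpoint rule on the unit cube; the grid size `m` and the variable names are illustrative choices, not part of the original example.

```python
import numpy as np

# Midpoint grid on the unit cube [0, 1]^3 (grid size is an arbitrary choice)
m = 50
mid = (np.arange(m) + 0.5) / m
x, y, z = np.meshgrid(mid, mid, mid, indexing="ij")

# Unnormalized density g(x, y, z) = 3x + 2y + z on the cube
g = 3 * x + 2 * y + z

# Midpoint rule: average value of g times the cube volume (which is 1)
integral = g.mean()
c_numeric = 1.0 / integral

print(f"integral of (3x + 2y + z) over the cube: {integral:.6f}")
print(f"normalizing constant c: {c_numeric:.6f}")
```

Because the integrand is linear, the midpoint rule is essentially exact here; for a general density a finer grid or `scipy.integrate` would be the safer choice.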

<div class="alert alert-warning">
<h4>💡 Key Insight: The Integration Order</h4>

Notice we integrated $x$ first, then $y$, then $z$. For integrable functions such as this one, we can change the order (Fubini's theorem), but:

- Always respect the limits of integration
- In ML applications, choosing the right order can simplify computation
- Think about which variables are "easier" to integrate first

</div>

## Marginal Distributions

**Scenario:** You have a dataset with 100 features, but you want to analyze just feature #37. Do you need the full 100-dimensional distribution?

**Answer:** No! You can get the distribution of X₃₇ by "marginalizing out" the other 99 variables.

<div class="alert alert-success">
<h4>Definition: Marginal Distribution</h4>

Let $\mathbf{X} = (X_1,...,X_n)$ be a random vector in $\mathbb{R}^n$.

We call the $i^{th}$ **marginal distribution** of $\mathbf{X}$, for $i\in\{1,...,n\}$, the distribution of the r.v. $X_i$.

How to compute it:

<h5>Discrete case:</h5>

$$P_{X_i}(x_i) = \sum_{x_1}...\sum_{x_{i-1}}\sum_{x_{i+1}}...\sum_{x_n} P(x_1,...,x_n)$$

<h5>Continuous case:</h5>

$$f_{X_i}(x_i) = \int_{-\infty}^{\infty}...\int_{-\infty}^{\infty}f_{X_1X_2...X_n}(x_1, x_2, ..., x_n)dx_1...dx_{i-1}dx_{i+1}...dx_n$$

*Interpretation*: "Integrate out" or "sum out" all variables except $X_i$.

$$F_{X_i}(x) = \mathbb{P}(X_i\leq x) = \mathbb{P}(X_1\in\mathbb{R}, ..., X_{i-1}\in\mathbb{R}, X_i\leq x, X_{i+1}\in\mathbb{R}  ,..., X_n\in\mathbb{R}) =$$

$$= \lim\limits_{y\to +\infty}F_X(\underbrace{y,...,y}_{i-1\ \text{elements}},\overbrace{x}^{i^{th}},\underbrace{y,...,y}_{\text{from }(i+1)^{th}})$$

In the continuous case, with $f_{X_i}$ the marginal density of $X_i$:
$$F_{X_i}(x) = \int_{-\infty}^x f_{X_i}(t)\,dt$$

</div>
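
In the discrete case, "summing out" is just a sum over array axes. The sketch below uses a small hypothetical joint PMF for three binary variables (the numbers are made up for illustration) and recovers two marginals with `numpy.sum`.

```python
import numpy as np

# Hypothetical joint PMF of (X1, X2, X3), each taking values in {0, 1}.
# joint[a, b, c] = P(X1 = a, X2 = b, X3 = c); the numbers are illustrative only.
joint = np.array([
    [[0.10, 0.05], [0.15, 0.10]],
    [[0.05, 0.20], [0.05, 0.30]],
])

# Marginal of X1: sum over the axes belonging to X2 and X3
marginal_x1 = joint.sum(axis=(1, 2))

# Marginal of X2: sum over the axes belonging to X1 and X3
marginal_x2 = joint.sum(axis=(0, 2))

print("P(X1 = 0), P(X1 = 1):", marginal_x1)
print("P(X2 = 0), P(X2 = 1):", marginal_x2)
```

Each marginal is itself a valid PMF, so its entries sum to 1.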

<div class="alert example">
<h4>Calculated Example: Finding Marginal Density</h4>

Using our previous example with joint PDF:

$$f_{XYZ}(x,y,z) = \left\{ \begin{array}{ll} (1/3)(3x + 2y + z) & 0 \leq x \leq 1, \ 0\leq y \leq 1, \ 0\leq z \leq 1 \\ 0 & \text{otherwise} \end{array}\right.$$

find the marginal density of $X$.

</div>

<details>
<summary>Reveal solution</summary>

As we noted previously, the joint PDF equals 0 outside $0 \leq x \leq 1, \ 0\leq y \leq 1, \ 0\leq z \leq 1$. Then for $0\leq x\leq 1$, the marginal density can be calculated as follows:

$$f_X(x) = \int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}f_{XYZ}(x,y,z)dydz =\int\limits_{0}^{1}\int\limits_{0}^{1}\frac{1}{3}(3x + 2y + z)dydz = \int\limits_{0}^{1}\frac{1}{3}(3xy + 2\frac{y^2}{2} + zy)\Bigg\rvert_{0}^{1}dz =$$

$$= \int\limits_{0}^{1}\frac{1}{3}(3x + 1 + z)dz = \frac{1}{3}\left(3xz + z +\frac{z^2}{2}\right)\Bigg\rvert_{0}^{1} = \frac{1}{3}\left(3x + 1 + \frac{1}{2}\right) = \frac{1}{3}\left(3x + \frac{3}{2}\right) = x + \frac{1}{2}$$

for $x \in [0, 1]$.

Check: 
$$\int_0^1 \left(x + \frac{1}{2}\right) dx = \bigg[\frac{x^2}{2} + \frac{x}{2}\bigg]_0^1 = \frac{1^2}{2} + \frac{1}{2} - \left(\frac{0^2}{2} + \frac{0}{2}\right) = 1$$

</details>
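
The marginal $f_X(x) = x + \frac{1}{2}$ can be cross-checked numerically by integrating the joint PDF over $y$ and $z$ for a few fixed values of $x$. This is a minimal sketch; the helper name `joint_pdf`, the grid size, and the test points are assumptions made for the illustration.

```python
import numpy as np

def joint_pdf(x, y, z):
    """Joint PDF from the example: (1/3)(3x + 2y + z) on the unit cube."""
    return (3 * x + 2 * y + z) / 3.0

# Midpoint grid over (y, z) in [0, 1]^2
m = 200
mid = (np.arange(m) + 0.5) / m
y, z = np.meshgrid(mid, mid, indexing="ij")

# Integrate out y and z at a few values of x (square area is 1, so mean = integral)
xs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
marginal_numeric = np.array([joint_pdf(x, y, z).mean() for x in xs])
marginal_formula = xs + 0.5

for x, num, exact in zip(xs, marginal_numeric, marginal_formula):
    print(f"x = {x:.2f}: numeric = {num:.4f}, formula = {exact:.4f}")
```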

<div class="alert alert-primary">
<h4>🤖 ML Application: Feature Selection</h4>

**Problem**: You have 1,000 features but want to select the top 10 for your model.

Approach using marginals:

- Compute marginal distributions of each feature
- Calculate marginal statistics (mean, variance, entropy)
- Rank features by information content
- Select top-k features

**Why this works**: Marginal distributions tell us about individual feature importance, even when features are part of a high-dimensional joint distribution.

**Real example**: In text classification, word frequencies (marginals) often suffice even though words appear in complex joint patterns (n-grams).
</div>

## Expectation

<div class="alert alert-success">
<h4>Definition: Expectation of a Random Vector</h4>

Let $\mathbf{X} = (X_1, ..., X_n)$ be a random vector defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Let $\mathbb{E}[X_i]\in \mathbb{R}, \forall i=1,...,n$ be the expectation of $X_i$. The expectation of $\mathbf{X}$ is defined as:

$$\mathbb{E}\mathbf{X} = \left(\mathbb{E}[X_1],...,\mathbb{E}[X_n]\right) \in \mathbb{R}^n$$

Component-wise computation: Just take the expectation of each component.
</div>

<div class="alert alert-success"  style='background-color:white'>
<h4>Properties of Expectation</h4>

Let $\mathbf{X}$ and $\mathbf{Y}$ be two random vectors in $\mathbb{R}^n$. Let $A \in \mathcal{M}_n(\mathbb{R})$ be a square matrix of order $n$ with real coefficients. We have:

1. $\mathbb{E}[A\mathbf{X}] = A\ \mathbb{E}\mathbf{X}$
2. $\mathbb{E}(\mathbf{X} + \mathbf{Y}) = \mathbb{E}\mathbf{X} + \mathbb{E}\mathbf{Y}$

</div>
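
Both properties also hold for sample means, which makes them easy to check on simulated data. In the sketch below the distributions, the matrix `A`, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of two 3-dimensional random vectors X and Y (one sample per row)
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.5], size=(10_000, 3))
Y = rng.uniform(low=0.0, high=1.0, size=(10_000, 3))

# An arbitrary fixed 3x3 matrix
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, -1.0],
              [1.0, 1.0, 0.0]])

# Property 1: E[AX] = A E[X]
mean_of_transformed = (X @ A.T).mean(axis=0)
transformed_mean = A @ X.mean(axis=0)

# Property 2: E[X + Y] = E[X] + E[Y]
mean_of_sum = (X + Y).mean(axis=0)
sum_of_means = X.mean(axis=0) + Y.mean(axis=0)

print(np.allclose(mean_of_transformed, transformed_mean))
print(np.allclose(mean_of_sum, sum_of_means))
```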

## Covariance Matrix

**Scenario:** You're training a model on image data. You notice:

- Pixel intensity at position (i,j) strongly correlates with position (i+1, j)
- But weakly correlates with position (i+50, j+50)

How do we capture all pairwise relationships between n variables efficiently?

<div class="alert alert-success">
<h4>Definition: Covariance Matrix</h4>

Let $\mathbf{X} = (X_1, ..., X_n)^T$ be a random vector in $\mathbb{R}^n$.

The **covariance matrix**, denoted $\Sigma_{\mathbf{X}}$ or $K_{\mathbf{X}\mathbf{X}}$, is a symmetric square matrix such that:

$$K_{\mathbf{X}\mathbf{X}} = \Sigma_{\mathbf{X}} = \left(Cov(X_i,X_j)\right)_{i,j=1,...,n}$$

Properties:

1. *symmetric*, i.e.: $Cov(X_i,X_j) = Cov(X_j,X_i),\ \forall i,j = 1,...,n$
2. *positive semi-definite*, i.e.: for $\forall \mathbf{x} = \begin{pmatrix}x_1 & x_2 & ... &x_n\end{pmatrix}^T \in \mathbb{R}^n$ we have $\mathbf{x}^T\Sigma_{\mathbf{X}}\mathbf{x} \geq 0$
3. Diagonal elements: $Cov(X_i,X_i) = Var(X_i)$
</div>

For $\mathbf{X} = (X_1, X_2, X_3)$:

$$\Sigma_\mathbf{X} = \begin{pmatrix}Var(X_1) & Cov(X_1,X_2) & Cov(X_1,X_3)\\ Cov(X_2, X_1) & Var(X_2) & Cov(X_2,X_3) \\ Cov(X_3, X_1) & Cov(X_3,X_2) & Var(X_3)\end{pmatrix}$$
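
To see these properties on data, here is a minimal sketch that estimates a $3\times 3$ covariance matrix with `numpy.cov`. The three simulated variables (with `x2` built to correlate with `x1`) are hypothetical and serve only as an example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three hypothetical features: x2 depends on x1, x3 is unrelated
x1 = rng.normal(size=5000)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=5000)
x3 = rng.normal(size=5000)
data = np.column_stack([x1, x2, x3])  # shape (samples, features)

# Sample covariance matrix (rowvar=False: columns are the variables)
sigma = np.cov(data, rowvar=False)

# Eigenvalues of a symmetric matrix; all should be non-negative (PSD)
eigenvalues = np.linalg.eigvalsh(sigma)

print(np.round(sigma, 3))
print("symmetric:", np.allclose(sigma, sigma.T))
print("smallest eigenvalue:", eigenvalues.min())
```

The diagonal of `sigma` matches the per-feature sample variances, and the large entry for the pair (`x1`, `x2`) reflects the dependence built into the simulation.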

<div class="alert alert-primary">
<h4>🤖 ML Application: Principal Component Analysis (PCA)</h4>

PCA Workflow:

1. Center your data: X_centered = X - mean(X)
2. Compute covariance matrix: Σ = (1/n) X_centeredᵀ · X_centered
3. Eigen decomposition: Σ = QΛQᵀ
4. Principal components = eigenvectors with largest eigenvalues

Why covariance matrix matters:

- Captures all pairwise feature relationships
- Eigenvectors show directions of maximum variance
- Used for dimensionality reduction (keep top-k components)
- Essential for whitening, decorrelation, and many preprocessing steps

Real example: In face recognition, PCA on pixel covariance matrix reveals "eigenfaces" - the most important facial patterns.
</div>
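
The PCA workflow above can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic 2-D data (the `mixing` matrix and offsets are made up), not a replacement for a library routine such as scikit-learn's PCA.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data: 2000 samples, 2 features (illustrative only)
latent = rng.normal(size=(2000, 2))
mixing = np.array([[3.0, 0.0], [1.0, 0.5]])
X = latent @ mixing.T + np.array([5.0, -1.0])

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix with the 1/n convention used in the workflow
n = X_centered.shape[0]
sigma = X_centered.T @ X_centered / n

# Step 3: eigendecomposition of a symmetric matrix (eigh returns ascending order)
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the first principal component
projected = X_centered @ eigvecs[:, :1]

print("eigenvalues (descending):", eigvals)
print("variance along first component:", projected.var())
```

The variance of the projected data equals the largest eigenvalue, which is what "direction of maximum variance" means in practice.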

## Independence

As a reminder, the intuition behind the notion of independence of events is as follows.

The knowledge we have about one event (one random variable) has no influence on the probability of the remaining events.

Suppose we have events $A_1, A_2, ..., A_n$. The fact that all events are independent implies that:

$$\mathbb{P}\left(A_5 \cap \overline{A_6}\right) = \mathbb{P}\left(\underbrace{A_5 \cap \overline{A_6}}_{\text{event of interest}} \bigg|\ \ \underbrace{\overline{A_1} \cup A_2 \cup \left(A_3 \cap \overline{A_4}\right) \cup A_7}_{\text{what we know}} \right)$$

Note that the indices of the events (in our case $5$ and $6$) forming the event of interest ($A_5 \cap \overline{A_6}$) are different from the indices of the events about whose realizations we have information ($A_1$, $A_2$, $A_3$, $A_4$, $A_7$).

<div class="alert alert-success">
<h4>Definition: Mutually Independent Events</h4>

The events $A_1,A_2,...,A_n$ are said to be **mutually independent** if, for every choice of two or more events with pairwise distinct indices $i, j, ..., m$:

$$\mathbb{P}(A_i\cap A_j\ \cap\ ... \ \cap\ A_m) = \mathbb{P}(A_i)\times\mathbb{P}(A_j)\times...\times\mathbb{P}(A_m)$$

</div>


<div class="alert alert-success">
<h4>Definition: Mutually Independent RV</h4>

The r.v. $X_1, X_2, ..., X_n$ are said to be **mutually independent** if $\forall (x_1,...,x_n) \in \mathbb{R}^n$, the events $[X_i\leq x_i], i=1,...,n$ are *mutually independent*.

**Property:**

If $X_1,...,X_n$ are mutually independent, then:

$$Var\left(\sum_{i=1}^nX_i\right) = \sum_{i=1}^n Var(X_i)$$

</div>
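
The variance property can be verified exactly on a small example. The sketch below uses two independent fair dice (a hypothetical example, not from the text above) and enumerates all 36 equally likely outcomes.

```python
import itertools
import numpy as np

faces = np.arange(1, 7)

# Variance of a single fair die: E[X^2] - (E[X])^2
var_single = np.mean(faces**2) - np.mean(faces) ** 2

# All 36 equally likely outcomes of two independent dice
sums = np.array([a + b for a, b in itertools.product(faces, repeat=2)])
var_sum = np.mean(sums**2) - np.mean(sums) ** 2

print("Var(X1) + Var(X2):", 2 * var_single)
print("Var(X1 + X2):     ", var_sum)
```

For dependent variables the covariance terms do not vanish, so the sum of the variances is generally not the variance of the sum.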

<div class="alert alert-success">
<h4>Definition: i.i.d.</h4>

The r.v. $X_1, X_2, ..., X_n$ are said to be **independent and identically distributed (i.i.d.)** if they are mutually independent and have the same CDF:

$$F_{X_1}(x) = F_{X_2}(x) = ... = F_{X_n}(x), \ \forall x\in \mathbb{R}$$

</div>

<div class="alert alert-success"  style='background-color:white'>
<h4>Property: Expected value of the product</h4>

Let $X_1,...,X_n$ be $n$ i.i.d. r.v., then:

$$\mathbb{E}[X_1...X_n] = (\text{independence}) = \mathbb{E}X_1 \times \mathbb{E}X_2 \times ... \times \mathbb{E}X_n =$$
$$= (\text{identically distributed})  = \mathbb{E}X_1 \times \mathbb{E}X_1 \times ... \times \mathbb{E}X_1 = (\mathbb{E}X_1)^n$$

</div>

## Important Multivariate Distributions

<div class="alert alert-success"  style='background-color:white'>
<h4>Special Case: Multinomial distribution</h4>

**Multinomial distribution**, denoted $\mathcal{M}(n,p_1,...,p_k)$ where $n\in \mathbb{N^{*}}$ and $p_i\in ]0,1[ \ \forall i\in\{1,...,k\}$ is the generalization of the binomial distribution, i.e.:

$$\mathcal{B}(n,p) = \mathcal{M}(n,p,1-p)$$

The probability mass function of $\mathcal{M}(n,p_1,...,p_k)$ is defined by:
$$\mathbb{P}(X_1=\eta_1,...,X_k=\eta_k) = \frac{n!}{\eta_1!\times...\times \eta_k!}p_1^{\eta_1}\times...\times p_k^{\eta_k}, \ \forall \eta=(\eta_1,...,\eta_k)\in \mathbb{N}^k$$
where:

* $\sum_{i=1}^k \eta_i = n$
* $\sum_{i=1}^k p_i = 1$

</div>
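
A short check with `scipy.stats` illustrates both the PMF formula and the link to the binomial distribution. The counts and probabilities below are arbitrary example values.

```python
from math import factorial

from scipy import stats

# Binomial as a two-category multinomial: B(n, p) = M(n, p, 1 - p)
n, p, k = 10, 0.3, 3
binom_prob = stats.binom.pmf(k, n, p)
multi_prob = stats.multinomial.pmf([k, n - k], n=n, p=[p, 1 - p])

# Three-category PMF, compared against the closed-form formula
counts = [2, 3, 5]        # eta_1, eta_2, eta_3 (sum to n = 10)
probs = [0.2, 0.3, 0.5]   # p_1, p_2, p_3 (sum to 1)
pmf_scipy = stats.multinomial.pmf(counts, n=10, p=probs)
pmf_formula = (
    factorial(10) / (factorial(2) * factorial(3) * factorial(5))
    * 0.2**2 * 0.3**3 * 0.5**5
)

print("binomial vs multinomial:", binom_prob, multi_prob)
print("scipy vs formula:       ", pmf_scipy, pmf_formula)
```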

<div class="alert alert-primary">
<h4>🤖 ML Application: Multiclass Classification</h4>

1. Problem: Classifying images into $k=10$ categories (digits 0-9).

Model output: Softmax layer produces probabilities (p₁, ..., p₁₀).

Interpretation:

- Each image classified n=1 time
- Multinomial(1, p₁, ..., p₁₀) models the prediction
- Cross-entropy loss: -log(p_true_class)

2. Real scenario: Document classification

- $n = 100$ words in a document
- $k = 20$ topics
- Each word assigned to one topic
- Multinomial models word-topic assignments

3. Naive Bayes Classifier: Assumes features follow multinomial distribution given class label.
</div>

<div class="alert alert-success"  style='background-color:white'>
<h4>Special Case: Multivariate Normal Distribution</h4>

The **$n$-dimensional normal distribution** or **multivariate normal distribution** (also called *multivariate Gaussian distribution* or *joint normal distribution*), denoted $\mathcal{N}(\mu, \Sigma)$ where $\mu = (\mu_1,...,\mu_n)\in\mathbb{R}^n$ and $\Sigma$ is a covariance matrix (square matrix of order $n$, symmetric positive definite), is the generalization of the normal distribution $\mathcal{N}(m, \sigma^2)$ where $\mu = (m)$ and $\Sigma = (\sigma^2)$


$$f_{\mathbf{X}}(x_1,...,x_n) = \frac{1}{\sqrt{(2\pi)^{n}\det\Sigma}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)$$
where 

* $\mathbf{x}-\mathbf{\mu}$ denotes the column vector composed of $x_i - \mu_i, \ \forall i\in \{1,...,n\}$, and $(\mathbf{x}-\mathbf{\mu})^T$ is its transpose
* $\Sigma^{-1}$ denotes the inverse of the matrix $\Sigma$

Key term: $\sqrt{(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1}(\mathbf{x}-\mathbf{\mu})}$ is called [**Mahalanobis distance**](https://en.wikipedia.org/wiki/Mahalanobis_distance) between the point $\mathbf{x}$ and the expectation $\mathbf{\mu}$.

Special case: When $\Sigma = \sigma^2 I$ (with $I$ the identity matrix), the distribution reduces to $n$ independent $\mathcal{N}(\mu_i, \sigma^2)$ variables.

</div>
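
The density formula and the Mahalanobis distance can be evaluated directly and compared with `scipy.stats.multivariate_normal`. The mean, covariance, and evaluation point below are arbitrary values chosen for the sketch.

```python
import numpy as np
from scipy import stats

# Hypothetical 2-D parameters and evaluation point
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
diff = x - mu
maha_sq = diff @ np.linalg.solve(sigma, diff)
mahalanobis = np.sqrt(maha_sq)

# Density from the formula above
n = mu.size
density_manual = np.exp(-0.5 * maha_sq) / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))

# Reference value from SciPy
density_scipy = stats.multivariate_normal(mean=mu, cov=sigma).pdf(x)

print("Mahalanobis distance:", mahalanobis)
print("manual vs scipy density:", density_manual, density_scipy)
```

Using `np.linalg.solve` avoids forming $\Sigma^{-1}$ explicitly, which is usually the more stable choice numerically.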

<div class="alert alert-primary">
<h4>🤖 ML Application: Gaussian Processes & Anomaly Detection</h4>

1. Gaussian Mixture Models (GMM):

- Model data as mixture of $k$ multivariate Gaussians
- Each cluster has its own $(\mu_k, \Sigma_k)$
- Used for clustering, density estimation

2. Anomaly Detection

3. Gaussian Processes:

- Infinite-dimensional generalization
- Defines distribution over functions
- Used for regression with uncertainty quantification

4. Kalman Filters:

- State estimation assumes multivariate normal distributions
- Used in robotics, autonomous vehicles, time series

</div>

## Return to Opening Challenge

Recall: $N=10$ ads, each liked/skipped with $p=0.5$.

Bonus awarded when ad reaction differs from both neighbors.

Question: What's the expected number of bonuses?

**SOLUTION**

1. Step 1: Model as Random Vector

Let's define indicator random variables: $$X_i = \left\{\begin{array}{ll}1 & \text{ if ad } i \text{ gets a "like"}\\ 0 & \text{ if "skip"}\end{array}\right. (i = 1, ..., 10)$$
$X = (X_1, ..., X_{10})$ is our random vector

Each $X_i \sim Bernoulli(p=0.5)$, independent.

2. Step 2: Define Bonus Indicators

Let $$B_i = \left\{\begin{array}{ll}1 & \text{ if ad } i \text{ earns a bonus}\\ 0 & \text{ otherwise}\end{array}\right.$$

Bonus condition for ad $i$ (where $1 ≤ i ≤ 10$, with indices taken circularly, so ad 1 and ad 10 are neighbors):

- Ad $i$ differs from $i-1$: $X_i \neq X_{i-1}$
- Ad $i$ differs from $i+1$: $X_i \neq X_{i+1}$

Both must be true! i.e. there are two winning patterns: $LSL$ and $SLS$.

Therefore:

$$B_i = I(X_i \neq X_{i-1}\text{ AND }X_i \neq X_{i+1})$$

Total bonuses: $S = B_1 + B_2 + ... + B_{10}$

3. Step 3: Calculate $P(\text{Bonus for ad }i)$

For ad $i$ to get a bonus, we need $X_{i-1}, X_i, X_{i+1}$ in pattern:

$(0, 1, 0)$ or $(1, 0, 1)$

Since each ad is independent with $p=0.5$:
- $P(X_{i-1}=0, X_i=1, X_{i+1}=0) = 0.5 × 0.5 × 0.5 = 1/8$
- $P(X_{i-1}=1, X_i=0, X_{i+1}=1) = 0.5 × 0.5 × 0.5 = 1/8$
- $P(B_i = 1) = 1/8 + 1/8 = 1/4$

4. Step 4: Use Linearity of Expectation

Key property: $E[X + Y] = E[X] + E[Y]$ (always true, even if $X$ and $Y$ are dependent!)

$$E[S] = E[B_1 + B_2 + ... + B_{10}]
     = E[B_1] + E[B_2] + ... + E[B_{10}] \text{   (linearity!)}$$
$$= (0\times P(B_1=0) + 1\times P(B_1=1)) + (0\times P(B_2=0) + 1\times P(B_2=1)) + ... + (0\times P(B_{10}=0) + 1\times P(B_{10}=1))$$
$$     = P(B_1=1) + P(B_2=1) + ... + P(B_{10}=1)
     = 10 × (1/4)
     = 2.5$$


**ANSWER:** expected number of bonuses is 2.5

**Generalization**: For $N$ ads and $p$: $E[S] = Np(1-p)$. If $p=0.5$: $E[S] = N\times 0.5 \times 0.5 = N/4$.
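
For small $N$ the result can be confirmed exactly by brute force, weighting each of the $2^N$ sequences by its probability. The function name `expected_bonuses` is an illustrative helper, and the sketch assumes a circular carousel with $N \geq 3$.

```python
import itertools

def expected_bonuses(n_ads=10, p_like=0.5):
    """Exact E[S] for a circular carousel by enumerating all 2^n sequences."""
    total = 0.0
    for seq in itertools.product([0, 1], repeat=n_ads):
        likes = sum(seq)
        prob = p_like**likes * (1 - p_like) ** (n_ads - likes)
        # seq[i - 1] wraps to the last ad when i = 0 (circular neighbor)
        bonuses = sum(
            seq[i] != seq[i - 1] and seq[i] != seq[(i + 1) % n_ads]
            for i in range(n_ads)
        )
        total += prob * bonuses
    return total

print(expected_bonuses(10, 0.5))  # compare with N * p * (1 - p)
print(expected_bonuses(10, 0.3))
```

Enumeration costs $2^N$ evaluations, so it is only practical for small $N$; the linearity argument gives the same answer for any $N$ at no cost.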

<center>
<img src="img/ad_carousel_circular_solution.svg" alt="Summary of solution" width="800px">
</center>

<div class="alert alert-warning">
<h4>💡 Why Intuition Failed</h4>

Common wrong reasoning:

- "About half the ads should differ from neighbors" → Guess 5 bonuses
- "Dependencies make it complex" → Give up

Why random vector approach works:

- Breaks complex problem into simple pieces (indicator variables)
- Linearity of expectation works even with dependencies
- No need to enumerate all 2¹⁰ = 1,024 sequences
- Scales to any $N$ (try N=100!)

The Power of Abstraction: 
Random vectors + indicator variables + linearity = tractable solutions to seemingly impossible problems!
</div>

In [None]:
# simulation

def simulate_ad_carousel(n_ads=10, p_like=0.5, n_simulations=100000, circular=True):
    """
    Simulate the ad carousel bonus problem to verify our theoretical result.
    
    Parameters:
    -----------
    n_ads : int
        Number of ads in carousel
    p_like : float
        Probability of liking an ad
    n_simulations : int
        Number of simulations to run
    circular : bool
        If True, carousel wraps around (ad 1 and ad n are neighbors)
        If False, linear arrangement (only ads 2 through n-1 can earn bonuses)
        
    Returns:
    --------
    dict : Contains average bonuses and distribution
    """
    bonus_counts = []
    
    for _ in range(n_simulations):
        # Generate random likes/skips (1 = like, 0 = skip)
        reactions = np.random.binomial(1, p_like, n_ads)
        
        # Count bonuses
        bonuses = 0
        
        if circular:
            # ALL ads can earn bonuses in circular arrangement
            for i in range(n_ads):
                # Use modulo for circular indexing
                left_neighbor = reactions[(i - 1) % n_ads]
                right_neighbor = reactions[(i + 1) % n_ads]
                
                # Bonus if ad i differs from both neighbors
                if reactions[i] != left_neighbor and reactions[i] != right_neighbor:
                    bonuses += 1
        else:
            # Linear arrangement: only ads 2 through n-1 can earn bonuses
            for i in range(1, n_ads - 1):
                # Bonus if ad i differs from both neighbors
                if reactions[i] != reactions[i-1] and reactions[i] != reactions[i+1]:
                    bonuses += 1
        
        bonus_counts.append(bonuses)
    
    # Calculate statistics
    avg_bonuses = np.mean(bonus_counts)
    std_bonuses = np.std(bonus_counts)
    
    # Calculate theoretical expectation
    if circular:
        theoretical = n_ads * p_like * (1 - p_like) 
        arrangement = "Circular"
    else:
        theoretical = (n_ads - 2) * p_like * (1 - p_like) 
        arrangement = "Linear"
    
    # Plot distribution
    plt.figure(figsize=(12, 5))
    
    # Histogram
    plt.subplot(1, 2, 1)
    unique, counts = np.unique(bonus_counts, return_counts=True)
    plt.bar(unique, counts / n_simulations, alpha=0.7, color='steelblue', edgecolor='black')
    plt.axvline(avg_bonuses, color='red', linestyle='--', linewidth=2, 
                label=f'Simulated = {avg_bonuses:.3f}')
    plt.axvline(theoretical, color='green', linestyle='--', linewidth=2, 
                label=f'Theoretical = {theoretical:.1f}')
    plt.xlabel('Number of Bonuses', fontsize=12)
    plt.ylabel('Probability', fontsize=12)
    plt.title(f'{arrangement} Carousel: Distribution of Bonuses\n({n_simulations:,} simulations)', fontsize=14)
    plt.legend()
    plt.grid(alpha=0.3)
    
    # Example sequences
    plt.subplot(1, 2, 2)
    plt.axis('off')
    
    # Show a few example sequences
    examples_text = "Sample Sequences:\n\n"
    for i in range(5):
        reactions = np.random.binomial(1, p_like, n_ads)
        reaction_symbols = ['L' if r == 1 else 'S' for r in reactions]
        
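        bonuses_list = []
        # Use the same neighbor rule as the simulation above
        positions = range(n_ads) if circular else range(1, n_ads - 1)
        for j in positions:
            left = reactions[(j - 1) % n_ads]
            right = reactions[(j + 1) % n_ads]
            if reactions[j] != left and reactions[j] != right:
                bonuses_list.append(j + 1)  # report 1-based ad numbers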
        
        examples_text += f"{'  '.join(reaction_symbols)}\n"
        examples_text += f"Bonuses at positions: {bonuses_list if bonuses_list else 'None'}\n\n"
    
    plt.text(0.1, 0.5, examples_text, fontsize=10, family='monospace',
             verticalalignment='center')
    
    plt.tight_layout()
    plt.show()
    
    return {
        'average': avg_bonuses,
        'std': std_bonuses,
        'distribution': (unique, counts / n_simulations)
    }



In [None]:
simulate_ad_carousel()

In [None]:
# interactive demo
interactive_carousel_demo()

Now, let's calculate the covariance matrix of the bonus indicators $B_1, ..., B_{10}$.

From our previous analysis: $E[B_i] = P(B_i=1) = p(1-p)$ and for $p=0.5$: $E[B_i] = P(B_i=1) = p(1-p) = 0.5\times 0.5 = 0.25$.

1. Calculate Variance of $B_i$

$Var(B_i) = E[B_i^2] - (E[B_i])^2$

Since $B_i \in \{0,1\}$ is a Bernoulli random variable, we have $B_i^2 = B_i$, so: $E[B_i^2] = E[B_i] = p(1-p) = 0.25$

Therefore:
$Var(B_i) = p(1-p) - [p(1-p)]^2 = p(1-p)[1 - p(1-p)] = p(1-p)[1 - p + p^2]$

For $p = 0.5$:
$Var(B_i) = 0.5 \times 0.5 \times [1 - 0.5 + 0.25] = 0.25 \times 0.75 = 0.1875 = 3/16$

2. Calculate Covariance between bonus indicators

The key formula for covariance: $Cov(B_i, B_j) = E[B_iB_j] - E[B_i]E[B_j]$

Since all ads are symmetric (circular arrangement): $E[B_i] = E[B_j] = p(1-p) = 0.25$ (for $p = 0.5$)

So we need to find: $E[B_iB_j]$

This depends on the relationship between $i$ and $j$!

- Case 1: $i = j$ (Same Ad)

$Cov(B_i, B_i) = Var(B_i) = 3/16 = 0.1875$

This is just the variance we calculated above.

- Case 2: $|i - j| = 1$ (Adjacent Ads)

Example: $Cov(B_2, B_3)$ where ad 2 and ad 3 are neighbors.

```
Ad 1 - Ad 2 - Ad 3 - Ad 4
       ↑      ↑
       Bᵢ     Bⱼ
```

Note: <br>
a. For $B_2 = 1$: Ad 2 must differ from BOTH ad 1 and ad 3 ($X_2 \neq X_1$ AND $X_2 \neq X_3$)<br>
b. For $B_3 = 1$: Ad 3 must differ from BOTH ad 2 and ad 4 ($X_3 \neq X_2$ AND $X_3 \neq X_4$)

So these ads are dependent. 

For a better demonstration, let's enumerate all possible patterns for $(X_1, X_2, X_3, X_4)$, with $P(X_i = 1) = p$ (like) and $P(X_i = 0) = 1-p$ (skip):

|X₁|X₂|X₃|X₄|B₂ (X₂≠X₁ & X₂≠X₃)|B₃ (X₃≠X₂ & X₃≠X₄)|B₂B₃|Probability|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|0|0|0|0|0|0|(1-p)⁴|
|0|0|0|1|0|0|0|p(1-p)³|
|0|0|1|0|0|1|0|p(1-p)³|
|0|0|1|1|0|0|0|p²(1-p)²|
|0|1|0|0|1|0|0|p(1-p)³|
|0|1|0|1|1|1|**1**|p²(1-p)²|
|0|1|1|0|0|0|0|p²(1-p)²|
|0|1|1|1|0|0|0|p³(1-p)|
|1|0|0|0|0|0|0|p(1-p)³|
|1|0|0|1|0|0|0|p²(1-p)²|
|1|0|1|0|1|1|**1**|p²(1-p)²|
|1|0|1|1|1|0|0|p³(1-p)|
|1|1|0|0|0|0|0|p²(1-p)²|
|1|1|0|1|0|1|0|p³(1-p)|
|1|1|1|0|0|0|0|p³(1-p)|
|1|1|1|1|0|0|0|p⁴|

Finding $E[B_2B_3]$:
Only two patterns have $B_2B_3 = 1$, and both are perfect alternations:

a. $(0, 1, 0, 1)$ with probability $p^2(1-p)^2$<br>
b. $(1, 0, 1, 0)$ with probability $p^2(1-p)^2$

For $p = 0.5$:
$$E[B_2B_3] = (0.5)^2(0.5)^2 + (0.5)^2(0.5)^2
        = 0.0625 + 0.0625
        = 0.125
        = 1/8$$

Calculate Covariance:
$$Cov(B_2, B_3) = E[B_2B_3] - E[B_2]E[B_3]
            = 1/8 - (1/4)(1/4)
            = 1/8 - 1/16
            = 2/16 - 1/16
            = 1/16
            = 0.0625$$

Result for Adjacent Ads: $Cov(B_i, B_{i+1}) = 1/16$ (for p = 0.5)

*Interpretation*: Adjacent bonuses are positively correlated! If ad $i$ gets a bonus, it slightly increases the chance that ad $i+1$ also gets a bonus.
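
The adjacent-ad calculation is small enough to enumerate in code. The sketch below loops over all 16 patterns of $(X_1, X_2, X_3, X_4)$ with $P(X_i = 1) = p$; the function name `adjacent_moment` is a hypothetical helper introduced for this illustration.

```python
import itertools

def adjacent_moment(p):
    """E[B_2 B_3] by enumerating (X1, X2, X3, X4), with P(X_i = 1) = p."""
    e_b2b3 = 0.0
    winning = []
    for x1, x2, x3, x4 in itertools.product([0, 1], repeat=4):
        likes = x1 + x2 + x3 + x4
        prob = p**likes * (1 - p) ** (4 - likes)
        b2 = int(x2 != x1 and x2 != x3)  # bonus indicator for ad 2
        b3 = int(x3 != x2 and x3 != x4)  # bonus indicator for ad 3
        if b2 * b3 == 1:
            winning.append((x1, x2, x3, x4))
        e_b2b3 += prob * b2 * b3
    return e_b2b3, winning

e_b2b3, winning = adjacent_moment(0.5)
cov_adjacent = e_b2b3 - 0.25 * 0.25  # E[B_2] = E[B_3] = 1/4 when p = 0.5

print("patterns with B2 * B3 = 1:", winning)
print("E[B2 B3] =", e_b2b3)
print("Cov(B2, B3) =", cov_adjacent)
```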

- Case 3: $|i - j| = 2$ (One Ad Between)

Example: $Cov(B_2, B_4)$ where ads 2 and 4 have ad 3 between them.

Configuration:

```
Ad 1 - Ad 2 - Ad 3 - Ad 4 - Ad 5
       ↑             ↑
       Bᵢ            Bⱼ
```

For $B_2 = 1$: $X_2 \neq X_1$ AND $X_2 \neq X_3$<br>
For $B_4 = 1$: $X_4 \neq X_3$ AND $X_4 \neq X_5$

These share $X_3$ as a common neighbor.

Key Pattern Analysis:

For both $B_2 = 1$ and $B_4 = 1$:

* $B_2 = 1$ requires: $X_1 \neq X_2$ and $X_2 \neq X_3$
* $B_4 = 1$ requires: $X_3 \neq X_4$ and $X_4 \neq X_5$

The key constraint: $X_2 \neq X_3$ AND $X_3 \neq X_4$.
Since each $X_i$ is binary, this forces $X_2 = X_4$, and likewise $X_1 = X_3 = X_5$.

So we need an alternating pattern around position 3.

If we consider all possible patterns for $(X_1, X_2, X_3, X_4, X_5)$ (32 cases), we will see that for $B_2B_4 = 1$, the pattern must be a perfect alternation, i.e.: $L-S-L-S-L$ or $S-L-S-L-S$. 

The associated probabilities are:

* $(0, 1, 0, 1, 0)$ with probability $p^2(1-p)^3$
* $(1, 0, 1, 0, 1)$ with probability $p^3(1-p)^2$

For $p = 0.5$:
$$E[B_2B_4] = (0.5)^2(0.5)^3 + (0.5)^3(0.5)^2
        = 2 × (0.5)^5
        = 2 × 1/32
        = 1/16
        = 0.0625$$

Calculate Covariance:
$$Cov(B_2, B_4) = E[B_2B_4] - E[B_2]E[B_4]
            = 1/16 - (1/4)(1/4)
            = 1/16 - 1/16
            = 0$$

Result for Ads with One Between: $Cov(B_i, B_{i+2}) = 0$ (for $p = 0.5$)

Interpretation: These bonuses are uncorrelated when $p = 0.5$

- Case 4: $|i - j| ≥ 3$ (Non-overlapping Neighborhoods)

Example: $Cov(B_2, B_5)$

Configuration:

```
Ad 1 - Ad 2 - Ad 3 - Ad 4 - Ad 5 - Ad 6
       ↑                     ↑
       Bᵢ                    Bⱼ
```

- $B_2$ depends on: $X_1, X_2, X_3$
- $B_5$ depends on: $X_4, X_5, X_6$

These are completely disjoint sets!

Since all $X_i$ are independent:
$E[B_2B_5] = E[B_2]E[B_5]$

Therefore:
$$Cov(B_2, B_5) = E[B_2B_5] - E[B_2]E[B_5]
            = E[B_2]E[B_5] - E[B_2]E[B_5]
            = 0$$

Result for Non-overlapping Neighborhoods: $Cov(B_i, B_j) = 0$ for $|i - j| \geq 3$

3. Complete Covariance Matrix ($N=10$)

Collecting the four cases for the bonus indicators $\mathbf{B} = (B_1, ..., B_{10})$, and measuring the distance between indices around the circle (so ads 1 and 10 are adjacent), the covariance matrix has the structure:

$$\begin{pmatrix}
Var(B_1) & Cov(B_1,B_2) & Cov(B_1,B_3) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_1,B_9) & Cov(B_1,B_{10})\\
Cov(B_2,B_1) & Var(B_2) & Cov(B_2,B_3) & Cov(B_2,B_4) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_2,B_{10})\\
Cov(B_3,B_1) & Cov(B_3,B_2) & Var(B_3) & Cov(B_3,B_4) & Cov(B_3,B_5) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & Cov(B_4,B_2) & Cov(B_4,B_3) & Var(B_4) & Cov(B_4,B_5) & Cov(B_4,B_6) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & Cov(B_5,B_3) & Cov(B_5,B_4) & Var(B_5) & Cov(B_5,B_6) & Cov(B_5,B_7) & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_6,B_4) & Cov(B_6,B_5) & Var(B_6) & Cov(B_6,B_7) & Cov(B_6,B_8) & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_7,B_5) & Cov(B_7,B_6) & Var(B_7) & Cov(B_7,B_8) & Cov(B_7,B_9) & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_8,B_6) & Cov(B_8,B_7) & Var(B_8) & Cov(B_8,B_9) & Cov(B_8,B_{10})\\
Cov(B_9,B_1) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_9,B_7) & Cov(B_9,B_8) & Var(B_9) & Cov(B_9,B_{10})\\
Cov(B_{10},B_1) & Cov(B_{10},B_2) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_{10},B_8) & Cov(B_{10},B_9) & Var(B_{10})
\end{pmatrix}$$

Given that $Cov(B_1, B_3) = Cov(B_2, B_4) = ... = Cov(B_{N-1}, B_1) = Cov(B_N, B_2) = 0$:

$$\begin{pmatrix}
Var(B_1) & Cov(B_1,B_2) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_1,B_{10})\\
Cov(B_2,B_1) & Var(B_2) & Cov(B_2,B_3) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & Cov(B_3,B_2) & Var(B_3) & Cov(B_3,B_4) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & Cov(B_4,B_3) & Var(B_4) & Cov(B_4,B_5) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_5,B_4) & Var(B_5) & Cov(B_5,B_6) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_6,B_5) & Var(B_6) & Cov(B_6,B_7) & \mathbf{0} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_7,B_6) & Var(B_7) & Cov(B_7,B_8) & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_8,B_7) & Var(B_8) & Cov(B_8,B_9) & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_9,B_8) & Var(B_9) & Cov(B_9,B_{10})\\
Cov(B_{10},B_1) & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & Cov(B_{10},B_9) & Var(B_{10})
\end{pmatrix}$$

which equals:
$$\begin{pmatrix}\frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0} & \mathbf{0}& \mathbf{0}& \mathbf{0}& \mathbf{0} & \mathbf{0} & \frac{1}{16}\\ \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0} & \mathbf{0}& \mathbf{0}& \mathbf{0}& \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0} & \mathbf{0}& \mathbf{0}& \mathbf{0}& \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0}& \mathbf{0}& \mathbf{0}& \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} &\mathbf{0} &\mathbf{0} & \mathbf{0} &\mathbf{0} & \mathbf{0} & \frac{1}{16} & \frac{3}{16} & \frac{1}{16} \\ \frac{1}{16} & \mathbf{0} & \mathbf{0} & \mathbf{0} &\mathbf{0} &\mathbf{0} & \mathbf{0} &  \mathbf{0} & \frac{1}{16} & \frac{3}{16}\end{pmatrix}$$

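This is a symmetric circulant matrix: it is tridiagonal except for the two wrap-around entries in the corners, which appear because ads 1 and 10 are neighbors on the carousel.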
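The matrix can also be checked numerically. The following sketch enumerates all $2^{10} = 1{,}024$ reaction sequences, computes the exact covariance matrix of the bonus indicators, and compares it with the pattern of $3/16$ on the diagonal and $1/16$ for circular neighbors. The variable names (`Sigma`, `expected`, `dist`) and the 0/1 encoding are assumptions made for this illustration.

```python
import numpy as np
from itertools import product

N = 10

# All 2^N equally likely reaction sequences (0 = skip, 1 = like), one per row.
X = np.array(list(product([0, 1], repeat=N)))

# Circular neighbours: np.roll wraps ad 1 around to ad N and vice versa.
left = np.roll(X, 1, axis=1)
right = np.roll(X, -1, axis=1)
B = ((X != left) & (X != right)).astype(float)  # bonus indicators, shape (2^N, N)

# Exact covariance matrix: every sequence has probability 1 / 2^N.
mean = B.mean(axis=0)
Sigma = (B.T @ B) / len(B) - np.outer(mean, mean)

# Closed-form pattern: 3/16 on the diagonal, 1/16 for circular neighbours, 0 elsewhere.
idx = np.arange(N)
dist = np.abs(idx[:, None] - idx[None, :])
dist = np.minimum(dist, N - dist)  # circular distance between ads
expected = np.where(dist == 0, 3 / 16, np.where(dist == 1, 1 / 16, 0.0))

print(np.allclose(Sigma, expected))  # True
```

Because the enumeration covers every sequence, this is an exact check rather than a simulation.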

Now we can calculate the variance of the total number of bonuses $S$:

$$S = B_1 + B_2 + \dots + B_{10}$$

$$Var(S) = Var\left(\sum_{i=1}^N B_i\right) = \sum_{i=1}^N\sum_{j=1}^N Cov(B_i, B_j) = \sum_{i=1}^N Var(B_i) + \sum_{i\neq j}Cov(B_i, B_j)$$

Every ad has exactly two ads at circular distance 1 and two at circular distance 2, and all remaining covariances are zero, so:
$$\sum_{i=1}^N\sum_{\begin{matrix}j=1\\j\neq i\end{matrix}}^N Cov(B_i,B_j) = 2N\times Cov(B_1,B_2) + 2N\times Cov(B_1,B_3) = 2N\times \frac{1}{16} + 2N\times 0 = \frac{2N}{16}$$

Diagonal terms ($i = j$): $10 \times (3/16) = 30/16$

Adjacent terms ($|i-j| = 1$, including the wrap-around pair of ads 10 and 1):
  - Each of the 10 ads has 2 neighbors, which gives 10 distinct adjacent pairs
  - Each pair appears twice in the double sum, once as $(i, j)$ and once as $(j, i)$
  - Total: $10 \text{ pairs} \times 2 \times (1/16) = 20/16$

All other terms: 0

Hence: $$Var(S) = \frac{3N}{16} + \frac{2N}{16} + 0 = \frac{5N}{16} = \frac{50}{16} = 3.125$$
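The mean and variance of the total number of bonuses $S$ can be cross-checked without any covariance bookkeeping. This minimal sketch computes $S$ for every one of the $2^{10}$ equally likely sequences and evaluates $E[S]$ and $E[S^2] - E[S]^2$ directly. The names `mean_S` and `var_S` are illustrative.

```python
import numpy as np
from itertools import product

N = 10

# All 2^N equally likely reaction sequences (0 = skip, 1 = like).
X = np.array(list(product([0, 1], repeat=N)))

# Bonus indicator: reaction differs from both circular neighbours.
B = (X != np.roll(X, 1, axis=1)) & (X != np.roll(X, -1, axis=1))
S = B.sum(axis=1)  # total bonuses for each sequence

# Exact moments, since every sequence has the same probability.
mean_S = S.mean()
var_S = (S ** 2).mean() - mean_S ** 2

print(mean_S, N / 4)       # 2.5 2.5
print(var_S, 5 * N / 16)   # 3.125 3.125
```

The standard deviation is $\sqrt{3.125} \approx 1.77$, so a typical user generates roughly 1 to 4 bonuses around the average of 2.5.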

## AI Applications

<div class="alert alert-primary">
<h4>🤖 From Ad Carousels to Modern AI</h4>
This same mathematical framework powers:

1. Neural Network Training:

- Each layer: $h^{(l)} = \sigma(W^{(l)}h^{(l-1)} + b^{(l)})$
- Gradients are expectations over random mini-batches
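- LOTUS expresses the expected loss as an average over the data distribution; the gradient from a uniformly sampled mini-batch is then an unbiased estimate of the full gradient, which is what stochastic gradient descent relies on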


2. Reinforcement Learning:

- State-action pairs form random vectors
- Expected return: $E\left[\sum_t \gamma^t r(s_t, a_t)\right]$
- Linearity enables policy gradient methods


3. Recommendation Systems:

- User features + item features = joint distribution
- Expected utility guides recommendations
- Covariance matrix reveals latent factors

4. Anomaly Detection:

- Sensor readings are modeled as a multivariate normal random vector
- Mahalanobis distance identifies outliers
- Used in fraud detection, cybersecurity, manufacturing


Common Thread: Model uncertainty with random vectors, use mathematical properties (linearity, LOTUS, covariance) to make decisions efficiently.
</div>
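To make the anomaly detection item concrete, here is a minimal sketch of the Mahalanobis distance on synthetic data. The two "sensor channels", their mean and covariance, and the two test points are all hypothetical values chosen for illustration; a real system would estimate these from its own historical readings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal operation" sensor readings: two positively correlated channels.
mean_true = np.array([20.0, 50.0])
cov_true = np.array([[4.0, 3.0],
                     [3.0, 9.0]])
readings = rng.multivariate_normal(mean_true, cov_true, size=5000)

# Estimate the mean vector and covariance matrix from the data.
mu = readings.mean(axis=0)
Sigma = np.cov(readings, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Distance of x from mu, accounting for scale and correlation."""
    d = x - mu
    return float(np.sqrt(d @ Sigma_inv @ d))

typical = np.array([21.0, 52.0])  # both channels rise together, as the correlation predicts
unusual = np.array([24.0, 44.0])  # channels move in opposite directions

print(mahalanobis(typical, mu, Sigma_inv))
print(mahalanobis(unusual, mu, Sigma_inv))
```

The second point lies far from the mean in Mahalanobis terms because it contradicts the positive correlation between the channels, and that information comes only from the covariance matrix.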

<div class="alert alert-secondary">
<h4>🏭 Industry Case Study</h4>

Context: Google's ad auction determines which ads to show based on multiple factors:

- Bid amount ($B$)
- Quality score ($Q$)
- Click-through rate ($C$)
- User relevance ($R$)

Problem: Each factor is uncertain (random variable). Google needs to compute:

- Expected revenue: $E[B × C]$
- Expected user satisfaction: $E[Q × R]$
- Variance in outcomes: $Var(B × C × Q)$

Solution using Random Vectors:

- Model $(B, Q, C, R)$ as a 4-dimensional random vector
- Historical data provides joint distribution
- Use LOTUS to compute $E[B × C]$ without deriving distribution of the product
- Covariance matrix reveals correlations:
    * High $Q$ often correlates with high $C$
    * This information improves auction design


Impact:

- Billions of auctions per day
- Millisecond decision times required
- Random vector theory enables real-time optimization
- LOTUS makes expectations computationally tractable

Takeaway: Without random vector formalism, this system would require intractable Monte Carlo simulation for each auction!
</div>
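The LOTUS step in this case study can be illustrated on a toy joint distribution. The sketch below uses a small, entirely hypothetical joint PMF for the bid $B$ and click-through rate $C$ (the numbers are made up for illustration and are not real auction data) and computes $E[B \times C]$ directly from the joint distribution, without ever deriving the distribution of the product.

```python
import numpy as np

# Hypothetical joint PMF: rows are bid values (dollars), columns are click-through rates.
bid_vals = np.array([1.0, 2.0, 3.0])
ctr_vals = np.array([0.01, 0.05])
pmf = np.array([[0.30, 0.10],
                [0.20, 0.15],
                [0.05, 0.20]])  # entries sum to 1

# LOTUS: E[g(B, C)] = sum over (b, c) of g(b, c) * p(b, c), here with g(b, c) = b * c.
g = np.outer(bid_vals, ctr_vals)
e_bc = float((g * pmf).sum())

# Marginal means, to compare against the shortcut E[B] * E[C],
# which is only valid when B and C are uncorrelated.
e_b = float(bid_vals @ pmf.sum(axis=1))
e_c = float(ctr_vals @ pmf.sum(axis=0))
cov_bc = e_bc - e_b * e_c

print(e_bc, e_b * e_c, cov_bc)
```

In this toy table higher bids tend to come with higher click-through rates, so $Cov(B, C) > 0$ and multiplying the marginal means would underestimate the expected revenue.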

## Common Mistakes

<div class="alert alert-danger">
<h4>⚠️ Common Mistakes to Avoid</h4>

1. Confusing joint, marginal, and conditional distributions

- A marginal is not a conditional
- Marginalizing integrates (or sums) a variable out; conditioning fixes its value and renormalizes


2. Forgetting that linearity works even with dependence

- E[X + Y] = E[X] + E[Y] always holds
- Don't need independence for linearity!


3. Trying to derive intermediate distributions unnecessarily

- Use LOTUS instead!
- Don't make life harder than it needs to be


4. Assuming independence without checking

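- Check the covariance matrix: a nonzero covariance rules out independence, but a zero covariance does not prove it (see mistake 5)
- Independence requires the joint to factor: $f(x,y) = f_X(x)f_Y(y)$ for all $x, y$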


5. Misinterpreting covariance

- Cov(X,Y) = 0 doesn't imply independence
- (Independence → zero covariance, but not vice versa)

</div>
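Mistake 5 is easiest to see on a standard textbook counterexample, sketched here under the assumption that $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The variables are as dependent as possible ($Y$ is a function of $X$), yet their covariance is exactly zero.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X**2, so Y is a deterministic function of X.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)
e_y = sum(p * x**2 for x in support)
e_xy = sum(p * x * x**2 for x in support)
cov_xy = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p        # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p  # P(X=0) = 1/3 and P(Y=0) = 1/3

print(cov_xy, p_joint, p_product)  # 0 1/3 1/9
```

The joint probability $1/3$ differs from the product $1/9$, so $X$ and $Y$ are not independent even though $Cov(X, Y) = 0$.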

## Key Takeaways

<div class="alert alert-summary">
<h4>📋 Key Takeaways</h4>
Three Big Ideas:

1. Random Vectors = Joint Behavior

- Model multiple uncertain quantities simultaneously
- Captures dependencies through joint distributions
- Foundation for all multivariate statistics


2. LOTUS = Computational Efficiency

- Calculate E[g(X)] without finding distribution of g(X)
- Essential for ML: loss functions, gradients, expectations
- Saves enormous computational effort


3. Linearity = Problem Decomposition

- E[X + Y] = E[X] + E[Y] (even if dependent!)
- Break complex problems into simple pieces
- Key to analyzing indicator variables



How They Connect:
Random vectors → Model joint distributions → LOTUS for expectations → Linearity simplifies calculations

ML Applications Summary:

- Neural networks: Weight vectors, activations, gradients
- Dimensionality reduction: PCA uses covariance matrices
- Classification: Multinomial models, Gaussian discriminant analysis
- Anomaly detection: Mahalanobis distance in multivariate normal
- Optimization: Portfolio theory, resource allocation

</div>