# Random Variables

Given a sample space $\Omega$ and a target space $\mathcal{T}$, a random variable $X$ is a function that maps each element of $\Omega$ to an element of $\mathcal{T}$:

$$X : \Omega \to \mathcal{T}$$
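As a minimal sketch of this definition, the snippet below enumerates $\Omega$ for two coin tosses and applies a random variable that counts heads. The experiment and the names `omega`, `X` and `target_space` are illustrative assumptions, not part of the definition.

```python
from itertools import product

# Omega for two consecutive coin tosses (illustrative choice of experiment).
omega = ["".join(t) for t in product("HT", repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

def X(w):
    """Random variable: number of heads in the outcome string w."""
    return w.count("H")

# Target space: the set of values X can take.
target_space = sorted({X(w) for w in omega})

for w in omega:
    print(w, "->", X(w))
```

Here `X` is an ordinary deterministic function; the randomness comes only from which element of `omega` occurs.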

# Distributions of Random Variables

Consider a random variable $X : \Omega \to \mathcal{T}$ and a subset of the target space

$$ S \subseteq \mathcal{T} $$

(for example, $S$ can be the single element of $\mathcal{T}$ that corresponds to obtaining exactly one head when tossing two coins one after another).

Let $X^{-1}(S)$ be the pre-image of $S$ by $X$, i.e., the set of elements of $\Omega$ that map to $S$ under the random variable $X$. This can be denoted in set builder notation as

$$X^{-1}(S) = \{ \omega \in \Omega \, \vert \, X(\omega) \in S \}$$

To see how probability is carried from events in $\Omega$ to subsets of $\mathcal{T}$ via the random variable $X$, we define the probability of $S$ as the probability of its pre-image, i.e., for $S \subseteq \mathcal{T}$,

$$P_X(S) = P(X \in S) = P(X^{-1}(S)) = P(\{ \omega \in \Omega \, \vert \, X(\omega) \in S \})$$

The left hand side of this equation is the probability of the set of possible outcomes with a particular property that we are interested in. Via the random variable $X$, which maps states to outcomes, the right hand side is the probability of the set of states in $\Omega$ that have this property.

We say that a random variable $X$ is distributed according to a particular probability distribution $P_X$, which assigns to each subset $S$ of the target space the probability of the corresponding event in $\Omega$. In other words, the function $P_X$, or equivalently $P \circ X^{-1}$, is the _law_ or _distribution_ of the random variable $X$.
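The pre-image construction can be sketched in a few lines. The example assumes two fair coin tosses with every element of $\Omega$ equally likely; the names `preimage` and `P_X` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed setup: two fair coin tosses, each element of Omega equally likely.
omega = ["".join(t) for t in product("HT", repeat=2)]
P = {w: Fraction(1, len(omega)) for w in omega}

def X(w):
    """Random variable: number of heads."""
    return w.count("H")

def preimage(S):
    """X^{-1}(S): all elements of Omega that X maps into S."""
    return {w for w in omega if X(w) in S}

def P_X(S):
    """Law of X: probability of the pre-image of S."""
    return sum((P[w] for w in preimage(S)), Fraction(0))

print(preimage({1}))   # the two outcomes with exactly one head
print(P_X({1}))        # 1/2
print(P_X({0, 1, 2}))  # 1
```

Exact fractions are used instead of floats so that the probabilities can be compared without rounding error.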

The nature of the target space $\mathcal{T}$ (the range of the random variable $X$) determines the kind of probability space involved:

- When $\mathcal{T}$ is finite or countably infinite, $X$ is called a discrete random variable.
- When $\mathcal{T}$ is uncountably infinite, $X$ is called a continuous random variable. We mainly consider continuous random variables where the target spaces are $\mathcal{T} = \mathbb{R}$ or $\mathcal{T} = \mathbb{R}^D$.

# Discrete and Continuous Probabilities

### Discrete Target Spaces

When the target space $\mathcal{T}$ is discrete, the probability that a random variable $X$ takes on a particular value $x \in \mathcal{T}$ is 

$$P(X = x)$$

Viewed as a function of $x$, this is called the _probability mass function_ of the discrete random variable $X$.
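As a small illustration, a probability mass function can be tabulated by counting. The experiment below (the sum of two fair six-sided dice) and the name `pmf` are assumptions made for this sketch.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed experiment: two fair six-sided dice, X = sum of the faces.
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in outcomes)

# pmf[x] = P(X = x), stored exactly as a fraction.
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

print(pmf[7])             # 1/6
print(sum(pmf.values()))  # 1
```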

### Continuous Target Spaces

When the target space $\mathcal{T}$ is continuous, e.g., the real line $\mathbb{R}$, it is more natural to specify the probability that a random variable $X$ is in an interval $[a, b]$, where $a < b$, as

$$P(a \leq X \leq b)$$

With that in mind, the probability that a random variable $X$ is less than or equal to a particular value $x$ is denoted by

$$P(X \leq x)$$

This function, for a continuous random variable $X$, is known as the _cumulative distribution function_.
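For a continuous random variable, the interval probability and the cumulative distribution function $F$ are linked by $P(a \leq X \leq b) = F(b) - F(a)$. The sketch below assumes, purely for illustration, that $X$ is standard normal and uses `statistics.NormalDist` from the Python standard library; `interval_probability` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumption for illustration: X follows a standard normal distribution.
X = NormalDist(mu=0.0, sigma=1.0)

def interval_probability(dist, a, b):
    """P(a <= X <= b) computed as F(b) - F(a) from the cdf F."""
    return dist.cdf(b) - dist.cdf(a)

print(X.cdf(0.0))                          # 0.5, by symmetry
print(interval_probability(X, -1.0, 1.0))  # roughly 0.683
```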

### Univariate and Multivariate Distributions

- A univariate distribution is a probability distribution of a single random variable, whose states are denoted by a single non-bold letter such as $x$.
- A multivariate distribution is a probability distribution of more than one random variable, where the random variables are grouped into a vector whose states are denoted by a bold letter such as $\mathbf{x}$.

## Discrete Probabilities

When the target space is discrete, we can visualize a probability distribution of multiple random variables as filling out a multidimensional array of numbers.

The target space of the overall joint probability distribution $\mathcal{T}$ is the Cartesian product of the target spaces $\mathcal{T}_i$ of each of the random variables

$$\mathcal{T} = \mathcal{T}_1 \times \mathcal{T}_2 \times \dots \times \mathcal{T}_n$$
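A joint target space of this kind can be enumerated directly. The two target spaces `T1` and `T2` below are hypothetical values chosen only to make the sketch concrete.

```python
from itertools import product

# Hypothetical target spaces for two discrete random variables.
T1 = [1, 2, 3, 4, 5]
T2 = [30, 20, 10]

# Joint target space T = T1 x T2: every pair of states.
T = list(product(T1, T2))

print(len(T))  # 15, i.e. len(T1) * len(T2)
print(T[:3])   # [(1, 30), (1, 20), (1, 10)]
```

Each pair in `T` corresponds to one cell of the multidimensional array of probabilities described above.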

### Joint Probability

From a frequentist viewpoint, we define the _joint probability_ of $X = x_i$ and $Y = y_j$ as the relative frequency with which both values occur together:

$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$

where $n_{ij}$ is the number of events with states $x_i$ and $y_j$, and $N$ is the total number of events.

The joint probability is the probability of the intersection of both events, i.e.,

$$P(X = x_i, Y = y_j) = P(X = x_i \cap Y = y_j)$$

For two random variables $X$ and $Y$, the joint probability, i.e., the probability that $X = x$ and $Y = y$ is written as $p(x, y)$. The probability is a function that takes states $x$ and $y$ and returns a real number, which is why we write it using function notation as $p(x, y)$

### Marginal Probability

The _marginal probability_ that $X$ takes the value $x$ irrespective of what value the random variable $Y$ takes is written as $p(x)$.

We write $X \sim p(x)$ to denote that the random variable $X$ is distributed according to $p(x)$


### Conditional Probability

If we consider only the instances where $X = x$ for a particular state/value $x$, then the fraction of instances (the _conditional probability_) for which $Y = y$ is written as $p(y \, \vert \, x)$
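In terms of the counts $n_{ij}$ and $N$ used for the joint probability, the marginal and conditional probabilities can be written out explicitly. The shorthand $c_i = \sum_j n_{ij}$ (the number of events with state $x_i$) is notation introduced here for the sketch, and the conditional is only defined when $c_i > 0$:

$$P(X = x_i) = \frac{c_i}{N} = \sum_j P(X = x_i, Y = y_j)$$

$$P(Y = y_j \, \vert \, X = x_i) = \frac{n_{ij}}{c_i} = \frac{P(X = x_i, Y = y_j)}{P(X = x_i)}$$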

### Python Example

We now consider an example of a multivariate probability distribution with two random variables and illustrate the concepts discussed so far in Python

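```python
import numpy as np

# States of the two random variables (labels only; not used in the arithmetic).
X = np.array([1, 2, 3, 4, 5])
Y = np.array([30, 20, 10])

# Random count table: rows index states of Y, columns index states of X.
n = np.random.randint(0, 10, size=(3, 5))

N = np.sum(n)  # total number of observations

c = np.sum(n, axis=0)  # column sums: counts for each state of X
r = np.sum(n, axis=1)  # row sums: counts for each state of Y

# Marginal probabilities p(x) and p(y).
P_marginal_X = c / N
P_marginal_Y = r / N

# Each marginal must sum to one.
assert np.isclose(P_marginal_X.sum(), 1.0)
assert np.isclose(P_marginal_Y.sum(), 1.0)

# Dividing by column sums gives p(y | x) (columns sum to one);
# dividing by row sums gives p(x | y) (rows sum to one).
P_conditional_X = np.divide(n, c)
P_conditional_Y = np.divide(n.T, r).T
```

For one random draw of `n`, `P_conditional_Y` evaluates to the following (each row sums to one):

```
array([[0.11538462, 0.30769231, 0.        , 0.30769231, 0.26923077],
       [0.13636364, 0.40909091, 0.04545455, 0.27272727, 0.13636364],
       [0.28      , 0.24      , 0.2       , 0.12      , 0.16      ]])
```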
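Because the table above is random, a small fixed table makes the relationships easier to check by hand. The sketch below uses a hypothetical count table and exact fractions to confirm that the marginals sum to one and that $p(x, y) = p(y \, \vert \, x) \, p(x)$ holds in every cell; all names and values are assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical count table: rows index states of Y, columns index states of X.
n = [[2, 1, 1],
     [0, 3, 1]]

N = sum(sum(row) for row in n)     # total number of observations
c = [sum(col) for col in zip(*n)]  # column sums: counts per state of X
r = [sum(row) for row in n]        # row sums: counts per state of Y

joint = [[Fraction(v, N) for v in row] for row in n]
p_x = [Fraction(cj, N) for cj in c]  # marginal of X
p_y = [Fraction(ri, N) for ri in r]  # marginal of Y

# p(y | x): normalise each column by its column sum.
p_y_given_x = [[Fraction(n[i][j], c[j]) for j in range(len(c))]
               for i in range(len(r))]

# Marginals add up to one.
assert sum(p_x) == 1 and sum(p_y) == 1
# p(x, y) = p(y | x) p(x) for every cell.
assert all(joint[i][j] == p_y_given_x[i][j] * p_x[j]
           for i in range(len(r)) for j in range(len(c)))

print([str(p) for p in p_x])  # ['1/4', '1/2', '1/4']
```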

## Continuous Probabilities

### Probability Density Function

A function $f : \mathbb{R}^D \to \mathbb{R}$ is called a _probability density function (pdf)_ if

1. $\forall \mathbf{x} \in \mathbb{R}^D : f(\mathbf{x}) \geq 0$
2. Its integral exists (converges to a finite value) and

$$\int_{\mathbb{R}^D} f(\mathbf{x}) d\mathbf{x} = 1$$

For probability mass functions (pmf) of discrete random variables, the integral is replaced with a sum.

Observe that the pdf is any function $f$ that is non-negative $f(\mathbf{x}) \geq 0$ and integrates to one.
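Both conditions can be checked numerically for a candidate density. The sketch below assumes the standard normal density in one dimension and a simple midpoint rule; `midpoint_integral` is a hypothetical helper, and the finite range $[-10, 10]$ stands in for $\mathbb{R}$ because the tails beyond it are negligible for this choice of $f$.

```python
import math

def f(x):
    """Candidate density: standard normal pdf (an illustrative choice)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def midpoint_integral(func, a, b, steps=100_000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(func(a + (k + 0.5) * h) for k in range(steps))

# Condition 1: non-negativity, checked on a grid of points.
grid = [-10.0 + 0.01 * k for k in range(2001)]
print(all(f(x) >= 0.0 for x in grid))

# Condition 2: the integral is (numerically) close to one.
print(midpoint_integral(f, -10.0, 10.0))
```

A grid check cannot prove non-negativity everywhere; it only fails to find a counterexample.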

We associate a random variable $X$ with this function $f$ by

$$P(a \leq X \leq b) = \int^b_a f(x) dx$$

Where $a, b \in \mathbb{R}$ and $x \in \mathbb{R}$ are outcomes of the continuous random variable $X$.

States $\mathbf{x} \in \mathbb{R}^D$ are defined analogously by considering a vector of $x \in \mathbb{R}$. This association is called the law or distribution of the random variable $X$.

In contrast to discrete random variables, the probability of a continuous random variable $X$ taking a particular value, $P(X = x)$, is zero. This is like specifying an interval in the integral where $a = b$, i.e., the area under a curve at a single point is $0$.
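The limiting behaviour can be observed by shrinking an interval around a point. The sketch assumes a standard normal $X$ and an arbitrary point `x = 0.3`, both chosen only for illustration; `interval_probability` is a hypothetical helper.

```python
from statistics import NormalDist

# Assumption: X is standard normal, chosen only for illustration.
X = NormalDist(0.0, 1.0)
x = 0.3  # arbitrary point of interest

def interval_probability(eps):
    """P(x - eps <= X <= x + eps), the integral of the pdf over that interval."""
    return X.cdf(x + eps) - X.cdf(x - eps)

widths = [1e-1, 1e-2, 1e-3, 1e-4]
probs = [interval_probability(eps) for eps in widths]

for eps, p in zip(widths, probs):
    print(f"eps={eps:g}  P={p:.6f}")
```

The printed probabilities shrink roughly in proportion to the interval width, which is consistent with a probability of zero at the single point.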

### Cumulative Distribution Function

A _cumulative distribution function (cdf)_ of a multivariate real-valued random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$ is given by

$$ F_X(\mathbf{x}) = P(X_1 \leq x_1, \ldots, X_D \leq x_D)$$

where $X = [X_1, \ldots, X_D]^\top$ and $\mathbf{x} = [x_1, \ldots, x_D]^\top$.
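As a minimal sketch of a multivariate cdf, suppose the components $X_1, \ldots, X_D$ are independent standard normals. This is an extra assumption not made above; under it the joint cdf factorises into a product of univariate cdfs. The name `joint_cdf` is hypothetical.

```python
from statistics import NormalDist

def joint_cdf(x):
    """F_X(x) = P(X_1 <= x_1, ..., X_D <= x_D).

    Assumes independent standard normal components, so the joint cdf
    is the product of the univariate cdfs.
    """
    std_normal = NormalDist()
    result = 1.0
    for x_d in x:
        result *= std_normal.cdf(x_d)
    return result

print(joint_cdf([0.0, 0.0]))  # 0.25: each component is below 0 with probability 0.5
```

Without independence the product form no longer holds, and the joint cdf has to be obtained from the joint distribution itself.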