# Intro to random variables

### Intro
One of the most important concepts in statistics is the concept of a **random variable**. This is an abstraction of the concept of a variable that we see in our datasets. Namely, a random variable is an abstract concept corresponding to a variable in our dataset, but which is not necessarily observed. Thus, it carries its own distribution, which is a function that describes how the values of the variable are distributed. In statistics, we are often interested in the underlying "true" distribution of the random variable, rather than the distribution visible just from the values that we observe in our dataset. The goal of (inferential) statistics is to *estimate* this true distribution from the observed values in our dataset.

The set of all possible values of a random variable $X$ is called its **support**, denoted $\textup{supp}(X)$. It is more proper to think of a random variable as the data of a pair $(\textup{supp}(X), p_X)$, where $p_X$ is a PMF or PDF:

1. **Discrete case**: In this case, $\textup{supp}(X)$ is finite, and $p_X$ is a PMF, i.e. $p_X: \textup{supp}(X) \to [0,1]$ is a function that satisfies  $$\sum_{x \in \textup{supp}(X)} p_X(x) = 1.$$ For convenience, we can always assume that the support is some finite set of real numbers, and we can always view $p_X$ as a function from $\mathbb{R}$ to $[0,1]$ by defining $p_X(x) = 0$ for all $x \notin \textup{supp}(X)$.
2. **Continuous case**: In this case, $\textup{supp}(X) = \mathbb{R}$ or some interval in $\mathbb{R}$. In fact, for convenience and/or simplicity, we can always assume that $\textup{supp}(X) = \mathbb{R}$ by defining the PDF $p_X$ to be $0$ outside of the interval of interest. So, we can always regard our PDF $p_X: \mathbb{R} \to [0,\infty)$ as a function whose values add up to $1$, where "add" here means *continuous summation*, that is, an integral: $$\int_{-\infty}^{\infty} p_X(x) \, dx = 1.$$
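
As a quick sanity check of the two normalization conditions, here is a minimal Python sketch. The particular PMF, the PDF, and the helper `midpoint_integral` are made up for illustration (they are not part of any library), and the integral is only approximated numerically.

```python
from fractions import Fraction

# Hypothetical PMF on the support {1, 2, 3}, stored as value -> probability.
pmf = {1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 2)}
pmf_total = sum(pmf.values())  # exact sum of the PMF over its support


def pdf(x):
    """Hypothetical PDF: constant 1/2 on [0, 2], extended by 0 elsewhere."""
    return 0.5 if 0.0 <= x <= 2.0 else 0.0


def midpoint_integral(f, lo, hi, steps=100_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width


# Integrating over [-5, 5] suffices because this PDF vanishes outside [0, 2].
pdf_total = midpoint_integral(pdf, -5.0, 5.0)

print(pmf_total)            # total mass of the PMF
print(round(pdf_total, 6))  # approximate total area under the PDF
```

The discrete sum is exact because we use `Fraction`, while the continuous "sum" is a numerical approximation of the integral.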

### Three important distributions
Multiple random variables can share the same probability distribution. Some frequently used distributions are:

1. **Uniform distribution**: 

    - *Discrete case*: Suppose we have a discrete random variable with support $\Omega = \{1,\dotsc,n\}$. The **discrete uniform distribution** is the distribution in which all values are equally likely. Thus, the PMF is given by $$p_X(i) = \frac{1}{n}, \quad i=1,\dotsc,n.$$ The typical example of this is the roll of a die, where $n=6$, or the flip of a coin, where $n=2$.
    - *Continuous case*: Suppose we have a continuous random variable with support $\Omega = [a,b]$. The **continuous uniform distribution** is the distribution in which all values are equally likely. More precisely, the probability density at all points within the interval $\Omega$ is constant. In order for the area under the PDF to equal $1$, it follows that the PDF is given by $$p_X(x) = \frac{1}{b-a}, \quad x \in [a,b].$$ This highlights a key difference between PMFs and PDFs: a PMF will always have values between $0$ and $1$, while a PDF can have values greater than $1$: if $X$ is uniformly distributed over the interval $[a,b]$, with $b-a < 1$, then $p_X(x) = 1/(b-a) > 1$ for all $x \in [a,b]$!
2. **Normal distribution**: The most commonly occurring distribution for continuous random variables is (arguably) the normal distribution, which has support $\Omega = \mathbb{R}$ and PDF defined by
$$p_X(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}},$$
where $\mu$ is the mean and $\sigma$ is the standard deviation; we explain these parameters below. The PDF graph looks like a Bell Curve (nay, it *is the Bell Curve*). The most special case is when $\mu=0$ and $\sigma=1$, in which case we say that $X$ is a *standard normal variable*, and the corresponding PDF is denoted by $\varphi(x)$:
\begin{equation*}
    \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.
\end{equation*}
3. **Bernoulli distribution**: The Bernoulli distribution is a discrete probability distribution for a random variable which takes the value $1$ with probability $p$ and the value $0$ with probability $1-p$. For example, this random variable describes the outcome of flipping a biased coin, which favors one side with probability $p$ and the other side with probability $1-p$. The PMF is given by
\begin{align*}
    p_X(x) & = p^x (1-p)^{1-x}, \quad x \in \{0,1\}\\
        & = \begin{cases}
            p & \textup{if } x=1\\
            1-p & \textup{if } x=0.
        \end{cases}
\end{align*}
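
The formulas above translate directly into code. The following is a minimal sketch using only the Python standard library; the function names are our own choices for illustration, not names from any particular package.

```python
import math


def discrete_uniform_pmf(i, n):
    """PMF of the discrete uniform distribution on {1, ..., n}."""
    return 1.0 / n if i in range(1, n + 1) else 0.0


def continuous_uniform_pdf(x, a, b):
    """PDF of the continuous uniform distribution on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the normal distribution with mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


def bernoulli_pmf(x, p):
    """PMF of the Bernoulli distribution: p at x = 1, 1 - p at x = 0."""
    return p ** x * (1 - p) ** (1 - x) if x in (0, 1) else 0.0


die = discrete_uniform_pmf(3, 6)                   # one face of a fair die
narrow = continuous_uniform_pdf(0.25, 0.0, 0.5)    # density on an interval of length 1/2
peak = normal_pdf(0.0)                             # standard normal density at its mean
heads = bernoulli_pmf(1, 0.25)                     # biased coin with p = 1/4
tails = bernoulli_pmf(0, 0.25)

print(die, narrow, peak, heads, tails)
```

Note that `narrow` is a density value larger than $1$, which is exactly the PMF versus PDF difference pointed out in the uniform case.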

### Sample spaces
The definition of random variable above is still incomplete. The reason is that we want to be able to do arithmetic and algebra with random variables. For example, if $X_1$ and $X_2$ are random variables which represent alcohol content and sugar content of wine, respectively, then we want to be able to compare the two variables, or even add them together to get a new random variable $X_3 = X_1 + X_2$ which represents the total content of alcohol and sugar in the wine. 

For this to make sense, we need to define a **sample space** $\Omega$ for our random variables (also known as a **population** in statistics). The idea is as follows:

- We start with a sample space $\Omega$ consisting of all possible instances of a particular type of object (e.g. wines, iris flowers, cars, etc.). 
- Then, every time we ask a question about a feature or attribute of the object, the answer can be viewed as a *function* from the sample space to the set of all possible values (in principle) that could be attained by the feature. This then is how we define the associated random variable: it is a function 
\begin{equation*}
    X: \Omega \to \{\textup{all possible values of the feature}\}
\end{equation*}
which assigns to each instance in the sample space the value of the feature for that instance. Note that the codomain of this function is what we previously called the support of the random variable, $\textup{supp}(X)$.
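
To make the "random variable as a function on $\Omega$" idea concrete, here is a small sketch with a made-up, three-element sample space of wines. The wine names and measurements are invented purely for illustration.

```python
# Hypothetical finite sample space: each key is one instance (a wine),
# with invented measurements for two of its features.
wines = {
    "wine_a": {"alcohol": 12.5, "sugar": 2.0},
    "wine_b": {"alcohol": 13.0, "sugar": 1.5},
    "wine_c": {"alcohol": 11.0, "sugar": 4.0},
}


def X1(w):
    """Random variable: alcohol content of the wine w."""
    return wines[w]["alcohol"]


def X2(w):
    """Random variable: sugar content of the wine w."""
    return wines[w]["sugar"]


def X3(w):
    """Sum of the two random variables, defined pointwise on the sample space."""
    return X1(w) + X2(w)


values = {w: X3(w) for w in wines}
print(values)
```

The point of the sketch is that $X_3 = X_1 + X_2$ only makes sense because both functions take the *same* instance `w` as input.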

Note that this is an extremely abstract concept, and imho it is up for debate how useful/pertinent to reality it is. For example, let's say the random variable $X$ represents "height of a person". Then, our sample space $\Omega$ could be interpreted as "all people that currently exist", or perhaps, as "all people who have ever existed." Regardless, $\Omega$ will (until the end of time) be a finite set, and thus the random variable $X$ will be a function from a finite set to a finite set (i.e. the support will be finite because there will forever have been only finitely many instances to observe). However, we would want to view this as a *continuous* random variable supported on $[0,\infty)$, which seems a little strange because we *know* that infinitely many of these values will never be attained, even if they have non-zero probability densities!

### The true definition of random variable
To summarize, our *final* definition of a random variable is that of a triplet $(\Omega, X, p_X)$, where:

- $\Omega$ is the sample space (the space of all possible instances of the object under consideration).
- $X: \Omega \to \textup{supp}(X)$ is a function from the sample space to the support of the random variable (the set of all possible values attained by the random variable).
- $p_X: \textup{supp}(X) \to [0,\infty)$ is a PMF or PDF (the probability distribution of the random variable).

There is a (imho disturbing) trend in the literature wherein authors simply suppress or ignore the sample space $\Omega$ as though it does not exist. A lot of times this is doable because a lot of what we want out of the random variable can be obtained simply by considering the support and probability distribution $p_X$. However, this is not always the case, and it is important to keep in mind that the sample space is an integral part of the definition of a random variable.

### Expectation
The **expected value** or **mean** of a random variable $X$ is denoted by $E[X]$ and is defined as follows:

1. **Discrete case**: If $X$ is a discrete random variable with PMF $p_X$ whose support $\textup{supp}(X)$ is a finite set of real numbers, then the expected value is given by $$E[X] = \sum_{x \in \textup{supp}(X)} x p_X(x).$$
That is, we weight each possible value of the random variable by its probability, and then sum these products. For example, if the variable is uniformly distributed, then each $p_X(x) = 1/n$ (where $n$ is the size of the support), and thus the expected value is given by $$E[X] = \frac{1}{n} \sum_{x \in \textup{supp}(X)} x,$$ which is simply the average of the values in the support. 
2. **Continuous case**: If $X$ is a continuous random variable with PDF $p_X$, then the expected value is given by $$E[X] = \int_{-\infty}^{\infty} x p_X(x) dx.$$
This is simply the continuous analog of the discrete case: here, we weight each value in the support by its probability density, and continuously sum these weighted values. (Above, we've used the convention of assuming that $\textup{supp}(X) = \mathbb{R}$ by setting the PDF to $0$ outside of the interval of interest.)

The mean is also sometimes denoted as $\mu_X$ or (if $X$ is fixed) $\mu$. Sometimes, we also refer to $\mu$ as the mean of the associated probability distribution $p_X$. 

**Example.** Suppose we have a discrete random variable $X$ with support $\{1,2,3\}$ and PMF given by $$p_X(1) = \frac{1}{6}, \quad p_X(2) = \frac{1}{3}, \quad p_X(3) = \frac{1}{2}.$$ Then, the expected value is given by
\begin{align*}
    E[X] & = \sum_{x \in \textup{supp}(X)} x p_X(x)\\
        & = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{3} + 3 \cdot \frac{1}{2}\\
        & = \frac{1}{6} + \frac{2}{3} + \frac{3}{2}\\
        & = \frac{1}{6} + \frac{4}{6} + \frac{9}{6}\\
        & = \frac{14}{6} = \frac{7}{3}\\
        & \approx 2.33.
\end{align*}
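
The computation above can be double-checked with exact rational arithmetic. This is just a minimal sketch; the helper name `expectation` is our own.

```python
from fractions import Fraction

# PMF from the example: support {1, 2, 3} with probabilities 1/6, 1/3, 1/2.
pmf = {1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 2)}


def expectation(pmf):
    """Weight each value in the support by its probability and sum."""
    return sum(x * p for x, p in pmf.items())


mean = expectation(pmf)
print(mean, float(mean))
```

Using `Fraction` avoids rounding, so the result is the exact value of the sum rather than a decimal approximation.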

**Example.** Suppose $X$ is a random variable that is uniformly distributed over the interval $[1,3]$. Then, the PDF is defined by
\begin{equation*}
    p_X(x) = \begin{cases}
        \frac{1}{2} & \textup{if } x \in [1,3]\\
        0 & \textup{otherwise.}
    \end{cases}
\end{equation*}
So, the expected value is given by
\begin{align*}
    E[X] & = \int_{-\infty}^{\infty} x p_X(x) dx\\
        & = \int_{1}^{3} x \cdot \frac{1}{2} dx\\
        & = \frac{1}{2} \int_{1}^{3} x dx\\
        & = \frac{1}{2} \left[ \frac{x^2}{2} \right]_{1}^{3}\\
        & = \frac{1}{2} \left[ \frac{9}{2} - \frac{1}{2} \right]\\
        & = \frac{1}{2} \cdot 4\\
        & = 2.
\end{align*}
Note that, because all points in $[1,3]$ are equiprobable (i.e. have identical probability density), the expected value is the exact mid-point $2$, as we might intuitively expect.
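
We can also approximate the integral numerically. The sketch below uses a hand-rolled midpoint rule (the helper `midpoint_integral` is our own illustration, not a library function) and integrates only over $[1,3]$, since the PDF is $0$ elsewhere.

```python
def uniform_pdf(x, a=1.0, b=3.0):
    """PDF of the continuous uniform distribution on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def midpoint_integral(f, lo, hi, steps=10_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width


# E[X] is the integral of x * p_X(x); the integrand vanishes outside [1, 3].
mean = midpoint_integral(lambda x: x * uniform_pdf(x), 1.0, 3.0)
print(round(mean, 6))  # should agree with the mid-point of [1, 3] up to rounding
```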

### Scaling and adding constants to random variables
Given any non-zero constant $c$, we can define a new random variable $Y = cX$ by scaling the original random variable $X$ by the constant $c$. The expected value of this new random variable is given by $$E[Y] = E[cX] = c E[X].$$
Similarly, given any constant $b$, we can define a new random variable $Y = X + b$ by adding the constant $b$ to the original random variable $X$. The expected value of this new random variable is given by $$E[Y] = E[X + b] = E[X] + b.$$

Putting these together, it follows that for constants $a,b\in \mathbb{R}$, we have $$E[aX + b] = a E[X] + b.$$
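Note that, when $a \neq 0$, the probability distribution of $Y = aX + b$ has essentially the same shape as that of $X$: the support is scaled by $a$ and shifted by $b$ (and, in the continuous case, the density is rescaled by $1/|a|$ so that it still integrates to $1$).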
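
Here is a small exact check of $E[aX + b] = aE[X] + b$ in the discrete case. The PMF and the constants $a = 2$, $b = 5$ are arbitrary choices for illustration.

```python
from fractions import Fraction

# Hypothetical PMF of X on the support {1, 2, 3}.
pmf_x = {1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 2)}
a, b = 2, 5  # arbitrary constants, with a != 0

# Since a != 0, the map x -> a*x + b is one-to-one, so Y = aX + b keeps the
# same probabilities on a scaled and shifted support.
pmf_y = {a * x + b: p for x, p in pmf_x.items()}


def expectation(pmf):
    """Weight each value in the support by its probability and sum."""
    return sum(x * p for x, p in pmf.items())


mean_x = expectation(pmf_x)
mean_y = expectation(pmf_y)
print(mean_y, a * mean_x + b)  # the two values should coincide
```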

### Functions of random variables
Suppose we have a random variable $X$ with support $\mathbb{R}$ and PDF (or PMF) $p_X$. For any function $f: \mathbb{R} \to \mathbb{R}$, we can define a new random variable $f(X)$ by applying the function $f$ to the values (i.e. support) of $X$. Then the "Law of the Unconscious Statistician" states that the expected value of $f(X)$ is given by:
$$E[f(X)] = \sum_{x \in \textup{supp}(X)} f(x) p_X(x),$$
in the discrete case, and
$$E[f(X)] = \int_{-\infty}^{\infty} f(x) p_X(x) dx,$$
in the continuous case.
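
The practical content of this law is that we never need the distribution of $f(X)$ itself. The sketch below compares the two routes for a made-up example: $X$ uniform on $\{-1, 0, 1\}$ and $f(x) = x^2$, where $f$ is not one-to-one, so the PMF of $f(X)$ has to merge probabilities.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical PMF: X is uniform on {-1, 0, 1}.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}


def f(x):
    """The function applied to X; here f(x) = x^2."""
    return x ** 2


# Route 1: sum f(x) * p_X(x) over the support of X.
lotus = sum(f(x) * p for x, p in pmf_x.items())

# Route 2: first build the PMF of Y = f(X), merging x-values with equal f(x),
# then take the ordinary expectation of Y.
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[f(x)] += p
direct = sum(y * p for y, p in pmf_y.items())

print(lotus, direct)  # both routes should give the same value
```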

Thus, we see that the formula for the expected value of $aX + b$ above is a special case of this more general formula, where $f(x) = ax + b$; the computation for the continuous case is as follows:
\begin{align*}
    E[aX + b] & = \int_{-\infty}^{\infty} (ax + b) p_X(x) dx\\
        & = \int_{-\infty}^{\infty} ax p_X(x) dx + \int_{-\infty}^{\infty} b p_X(x) dx\\
        & = a \int_{-\infty}^{\infty} x p_X(x) dx + b \int_{-\infty}^{\infty} p_X(x) dx\\
        & = a E[X] + b.
\end{align*}

In a slightly different direction, given two random variables $X$ and $Y$ (defined on the same sample space), we can form linear combinations to get new random variables $aX + bY$, for scalars $a,b\in \mathbb{R}$. The expected value of this new random variable is given by $$E[aX + bY] = a E[X] + b E[Y].$$ This property is known as **linearity of expectation**.
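
Linearity can be checked on a small explicit sample space. In the sketch below we assume (for illustration only) that $\Omega$ is the set of outcomes of rolling two fair dice, with every outcome equally likely, and we compute expectations by summing over $\Omega$ directly. The variables $X$ (first die) and $Y$ (larger of the two dice) are deliberately *not* independent.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: all 36 outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))
prob = Fraction(1, len(omega))  # assumption: each outcome is equally likely


def expectation(rv):
    """Expected value of a random variable, viewed as a function on omega."""
    return sum(rv(w) * prob for w in omega)


def X(w):
    """Value shown by the first die."""
    return w[0]


def Y(w):
    """Larger of the two values (depends on the first die, so not independent of X)."""
    return max(w)


a, b = 2, 3  # arbitrary scalars
lhs = expectation(lambda w: a * X(w) + b * Y(w))
rhs = a * expectation(X) + b * expectation(Y)
print(lhs, rhs)  # the two sides should agree
```

The combination $aX + bY$ is evaluated outcome by outcome, which is only possible because both variables are defined on the same sample space.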

### Variance and standard deviation
The **variance** of a random variable $X$ is denoted by $Var(X)$ or $\sigma^2_X$ or (if $X$ is fixed) by $\sigma^2$. It is defined as $$Var(X) = E[(X - \mu_X)^2].$$
Here is one way to look at it: the shifted random variable $X' = X - \mu_X$ is the "de-meaned" version of $X$; it has mean $0$. Visually, this is because we have dragged the distribution such that its mean is at zero. Algebraically, this follows from the property of expectation that we just noted above: $$E[X'] = E[X - \mu_X] = E[X] - \mu_X = 0.$$
Thus, the variance is simply the expected value of the square of the de-meaned random variable:
\begin{equation*}
    \sigma_X^2 = E[(X')^2] = E[(X - \mu_X)^2].
\end{equation*}
The **standard deviation** of a random variable $X$, denoted $\sigma_X$ or (if $X$ is fixed) by $\sigma$, is defined as the square-root of the variance: $$\sigma_X = \sqrt{Var(X)} = \sqrt{E[(X - \mu_X)^2]}.$$
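
As an illustration, the sketch below computes the variance and standard deviation of a discrete random variable directly from the definition. The PMF is a made-up example on the support $\{1, 2, 3\}$.

```python
import math
from fractions import Fraction

# Hypothetical PMF on the support {1, 2, 3}.
pmf = {1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 2)}

# Mean: weight each value by its probability and sum.
mean = sum(x * p for x, p in pmf.items())

# Variance: expected value of the squared de-meaned variable (X - mu)^2.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Standard deviation: square root of the variance (a float, no longer exact).
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```

The mean and variance stay exact thanks to `Fraction`; only the final square root is a floating-point approximation.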