---
numbering:
  title:
    offset: 1
---

(ch4.1)=
# Expected Values

This chapter will introduce methods for summarizing distributions. We will start by looking for a summary value that describes the position of the "center" of the distribution.

Like any summary, a central value discards most of the information about the distribution. Depending on what we want to describe, and what we are willing to discard, we could choose different summaries. We saw one natural choice in the last chapter. The **mode** of a distribution, or location of its peak, is often adopted as a natural descriptor of the distribution. When the distribution has one peak, the location of the peak is a natural reference value. It represents the *most likely* outcome when sampling from the distribution. 

This chapter focuses on expected values as measures of center. Expected values are the most commonly used summary in all of statistics, data science, machine learning, and probability. They are, in essence, averages. We'll spend this section introducing the expected value. We will define it, compare it to other measures of center, discuss its interpretation, then extend it to functions of random variables.

## Definition

The expected value of a random variable is the *weighted average of possible values of the random variable, weighted by its distribution.*

:::{note} Expected Values of Discrete Random Variables

If $X$ is a discrete random variable with support $\mathcal{X}$ then the **expected value** of $X$ is:

$$\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \text{PMF}(x) = \sum_{x \in \mathcal{X}} x \text{Pr}(X = x) $$

:::

We will often abbreviate the sum over all possible values of $X$ with "all $x$". This avoids heavy subscripts. In general, you should assume a sum over $x$ runs over the full support of $X$ unless told otherwise.

Here's an example. Suppose that $\mathcal{X} = \{1,2,3,4\}$ and:

$x = $ | 1  | 2 | 3 | 4   
:----:|:-------------|:-------------|:-------------|:-------------
$\text{PMF}(x) = $  | $1/2$ | $1/4$ | $0$ | $1/4$

Then, to find the expected value of $X$, multiply the entries in each column, then add the entrywise products:

$$\begin{aligned} \mathbb{E}[X] & = \sum_{x=1}^4 x \text{PMF}(x) \\& = 1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times 0 + 4 \times \frac{1}{4} = 2. \end{aligned}$$
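The same column-by-column calculation can be sketched in a few lines of Python. This is only an illustration: the dictionary `pmf` and the helper `expected_value` are names chosen here, not notation from the text. Exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction

# PMF from the table above, stored as {value: probability}.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(0), 4: Fraction(1, 4)}


def expected_value(pmf):
    """Weighted average of the possible values, weighted by the PMF."""
    return sum(x * p for x, p in pmf.items())


assert sum(pmf.values()) == 1  # a valid PMF sums to one
print(expected_value(pmf))     # prints 2
```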


We can also compute expected values of continuous random variables. The formula is entirely analogous. Replace the sum over the support with an integral, and replace the mass function with a density function.

:::{note} Expected Values of Continuous Random Variables

If $X$ is a continuous random variable with support $\mathcal{X}$ then the **expected value** of $X$ is:

$$\mathbb{E}[X] = \int_{x \in \mathcal{X}} x \text{PDF}(x) dx = \int_{x \in \mathcal{X}} x f_X(x) dx. $$

:::

For example, consider the dartboard demonstration in [Section 2.4](#ch2.4). The distance from the dart to the center of the dartboard was a random variable $R \in [0,1]$ with PDF $2 r$. Therefore:

$$\begin{aligned} \mathbb{E}[R] & = \int_{r = 0}^1 r \times \text{PDF}(r) dr  = \int_{r=0}^1 r \times (2 r) dr \\& = \int_{r = 0}^1 2 r^2 dr  = \frac{2}{3} r^3 |_{r=0}^1 \\ & = \frac{2}{3}. \end{aligned}$$

Notice that the expected value of $R$ is not $1/2$ even though the dart's positions were drawn uniformly. The expected value is greater than $1/2$ since the dart's distance to the origin is not uniform. The dart is more likely to land with $R > 1/2$ than $R < 1/2$, so its expected value is also greater than $1/2$.
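A quick simulation can make this concrete. The sketch below assumes the dartboard is the unit disk and that darts land uniformly on it, which is consistent with the PDF $2r$ used above. The function name `sample_dart_distance`, the seed, and the sample size are arbitrary choices made for this illustration.

```python
import math
import random


def sample_dart_distance(rng):
    """Draw a point uniformly from the unit disk by rejection sampling
    and return its distance to the center."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        r = math.hypot(x, y)
        if r <= 1:  # keep only points that land on the board
            return r


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
distances = [sample_dart_distance(rng) for _ in range(n)]

mean_distance = sum(distances) / n
frac_far = sum(r > 0.5 for r in distances) / n

print(f"average distance    : {mean_distance:.3f}  (exact: 2/3)")
print(f"fraction with R > 1/2: {frac_far:.3f}  (exact: 3/4)")
```

The exact value $3/4$ comes from integrating the density: $\int_{1/2}^1 2r \, dr = 1 - 1/4$.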



## Interpretation

Expected values are commonly used to summarize the "center" of a distribution. Here are three interpretations of the expected value.

### Center of Mass

Suppose that $X$ is a random variable with some PMF. Imagine the PMF as a physical distribution of mass. For instance, if $\text{PMF}(2) = 0.4$ you could imagine putting a 0.4 pound weight at $x = 2$, so that the total mass on the beam is one pound. 

To find a central value, imagine that your masses are sitting on a long beam. You get to place a fulcrum under the beam. You can adjust the fulcrum, but cannot move the masses. You want to move the fulcrum to a position where the beam will balance without tipping left or right. 

The position of the fulcrum that balances the beam is the *center of mass* of the distribution. The center of mass can be found using the same formula as the expected value:

$$\text{center of mass} = \sum_{\text{all } x} x \times (\text{Mass at } x) = \sum_{\text{all } x} x \times \text{PMF}(x) = \mathbb{E}[X].  $$

The same applies for density using units of mass per length. Just replace the sum with an integral.

:::{note} Expectation is Center of Mass

So, the expected value of a random variable is analogous to the *center of mass* of its distribution function.

:::
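The balancing picture can be checked numerically. The sketch below reuses the PMF from the example in the Definition section and computes the net turning effect of the masses about a proposed fulcrum. The helper name `net_torque` is introduced here for illustration only.

```python
from fractions import Fraction

# PMF from the Definition section, read as masses placed along a beam.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(0), 4: Fraction(1, 4)}


def net_torque(pmf, fulcrum):
    """Sum of (signed distance from the fulcrum) x (mass) over all positions.
    Positive: more pull on the right. Negative: more pull on the left.
    Zero: the beam balances."""
    return sum((x - fulcrum) * mass for x, mass in pmf.items())


mean = sum(x * p for x, p in pmf.items())  # E[X] = 2

for fulcrum in (1, mean, 3):
    print(f"fulcrum at {fulcrum}: net torque = {net_torque(pmf, fulcrum)}")
```

Because the total mass is one, the net torque about a fulcrum at $c$ equals $\mathbb{E}[X] - c$, so it vanishes only at $c = \mathbb{E}[X]$.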

This physical analogy can be helpful when we want to visualize the expected value of a random variable. For example:

- If the distribution is symmetric about some $x_*$, then it must balance about $x_* $ so $x_* $ is its expected value.
- If the distribution is skewed, then the expected value will move away from the mode of the distribution in the direction of the skew. 
- If the distribution is highly skewed, then the expected value may be far from the peak of the distribution, and may be a highly atypical value. 

:::{caution} Expected Values are Not Typical Values

Expected values are weighted averages. So, like any weighted average, they need not correspond to possible values, or even typical values. For instance:

- In 2024, the average American family had 1.94 children. No family has 1.94 children.

- In 2022, the national average income for families with adult earners between the ages of 35 and 65 was 170,000 dollars. That number probably seems high to you. It seems high since the average is skewed by a small portion of very wealthy households. *The top 1% of American households control 31.7% of the national wealth. The bottom 50% of American households hold about 2.5% of the national wealth, less than the 3.8% controlled by the 800 billionaires in the US.* Since wealth distributions are heavily skewed, most economists and labor statisticians use medians. The median income was between 80,000 and 90,000 dollars (see [Source](https://web.archive.org/web/20231225163527/https://www.federalreserve.gov/publications/files/scf23.pdf)).

:::

### Sample Averages

We've just seen that expectations need not be typical values. They are balance points for the distribution. This raises the question, "*When do we expect an expected value?*"

In short, we should expect an expected value if we draw a large collection of sample values, and average the result. This is a very common procedure in data analysis. For instance, you might:

- Choose a sample of 1,000 individuals from a pool and measure their heights. Then, since your sample size is large, the average height of the 1,000 individuals in your sample will be close to $\mathbb{E}[H]$ where $H$ is the height of a uniformly selected individual. In this case, we would call the average over your sample a *sample average*, and the background average over the full population a *population average*. If you draw a large sample, then the sample average will estimate the population average closely. So, before we draw the sample, it is reasonable to *expect* the population average. 

- Run a random experiment repeatedly. Each new experiment produces a new sample value $X$. The values are independent and identical. You repeat the process 1,000 times, then average all your sampled values. As before, since your sample size is large, your sample average is likely close to some background value. In this case there is no population average since we are just running a process, not picking from a fixed pool. However, the likely sample average will still be concentrated around some value. That value is the *expected* value of $X$, $\mathbb{E}[X]$. Formally, the probability that the sample average differs from the expectation by any fixed amount vanishes as the number of samples increases. 

These are surprisingly subtle ideas. In both cases the expectation is a background quantity that we could compute, but that we cannot observe experimentally unless we either sample the entire population, or run infinitely many experiments. Instead, we have experimental procedures that produce sample averages. Those sample averages are random, since our data is random. However, if we draw enough samples, the sample averages will closely estimate the background expectation. 

So, we *expect* to see $\mathbb{E}[X]$ in the sense that it is *the value we should anticipate before averaging a long list of samples*. The more samples we average, the more prescient our expectation will be. 

:::{caution} How Many Samples?

Be careful when you go to apply the statement above. It is guaranteed to hold, for almost all distributions. However, it does not tell you how long a list is needed. The number of samples required depends critically on two factors:

1. How accurate we demand the estimate to be. Usually, this is stated as a guarantee that the sample average is within some fixed error of the expected value with sufficiently high probability.
1. How much individual samples vary. 

The second point is essential, and often overlooked. Expected values are often not a very sensible measure of center, or at least, should not be expected, if the individual samples vary tremendously. Then an enormously large number of samples will be needed before the expectation is a good guess for the sample average. 

:::
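The effect of the second factor can be illustrated with a small experiment. The sketch below compares two distributions that share the expected value 2: the PMF from the Definition section, and a hypothetical distribution invented for this illustration that returns 2000 with probability 0.001 and 0 otherwise. The sample size, number of trials, and seed are arbitrary choices.

```python
import random
import statistics


def sample_average(values, weights, n, rng):
    """Average of n independent draws from the given PMF."""
    return sum(rng.choices(values, weights=weights, k=n)) / n


rng = random.Random(1)  # fixed seed so the run is repeatable
n, trials = 1_000, 200

# Both distributions have expected value 2.
low_var = ([1, 2, 3, 4], [0.5, 0.25, 0.0, 0.25])  # PMF from the Definition section
high_var = ([0, 2000], [0.999, 0.001])            # hypothetical: rare, huge outcome

low_avgs = [sample_average(*low_var, n, rng) for _ in range(trials)]
high_avgs = [sample_average(*high_var, n, rng) for _ in range(trials)]

# Spread of the sample averages across repeated experiments.
low_spread = statistics.pstdev(low_avgs)
high_spread = statistics.pstdev(high_avgs)

print(f"low-variance case : averages spread by about {low_spread:.3f}")
print(f"high-variance case: averages spread by about {high_spread:.3f}")
```

With the same number of samples, the averages from the high-variance distribution scatter far more widely around 2, so many more samples would be needed before the expected value is a good guess for the sample average.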

Let's see this in a bit more detail. Suppose that $X$ is a discrete random variable with finitely many possible outcomes. Suppose that we draw $n$ samples, where $n$ is a large number. Let $N(x)$ denote the number of times we see the sample value $X = x$. Then, we can write our sample average:

$$\bar{X}_n = \frac{1}{n} \sum_{x \in \mathcal{X}} x N(x) $$

This is the usual formula for an average, grouped by possible outcomes. Now, to simplify, move the $1/n$ inside the sum:

$$\bar{X}_n = \sum_{x \in \mathcal{X}} x \frac{N(x)}{n}. $$

Notice, $N(x)/n$ is the frequency with which we saw the outcome $X = x$. So:

$$\bar{X}_n = \sum_{x \in \mathcal{X}} x \text{Fr}(x)$$

where $\text{Fr}(x)$ is the *frequency* with which we saw outcome $x$ in our $n$ trials. 

Recall, probabilities are meant to estimate frequencies. In [Section 1.2](#ch1.2) we defined chances as long run frequencies. So, assuming $n$ large, $\text{Fr}(x) \approx \text{Pr}(X = x) = \text{PMF}(x)$. Therefore:

$$\bar{X}_n \approx \sum_{x \in \mathcal{X}} x \text{PMF}(x) = \mathbb{E}[X]$$

In other words, the sample average will be close to the expected value, and should converge to the expected value as the number of samples diverges. We haven't proven this formally since we haven't established exactly what we mean by converge, or that frequencies do in fact converge in the long run, but this argument captures the essential background logic. Expected values *are not reasonable expectations for individual samples.* Expected values are reasonable expectations for long run sample averages.
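The grouping argument above can be tried out directly. The sketch below draws samples from the PMF in the Definition section, computes the sample average as a sum over outcomes of $x \times \text{Fr}(x)$, and compares it with the plain average. The helper name `grouped_average`, the seed, and the sample sizes are choices made for this illustration.

```python
import random
from collections import Counter

values, probs = [1, 2, 3, 4], [0.5, 0.25, 0.0, 0.25]  # PMF from the Definition section
rng = random.Random(2)  # fixed seed so the run is repeatable


def grouped_average(samples):
    """Sample average written as a sum over outcomes of x * Fr(x)."""
    n = len(samples)
    counts = Counter(samples)  # N(x) for each observed outcome x
    return sum(x * count / n for x, count in counts.items())


for n in (10, 1_000, 100_000):
    samples = rng.choices(values, weights=probs, k=n)
    print(f"n = {n:>7}: grouped = {grouped_average(samples):.4f}, "
          f"plain = {sum(samples) / n:.4f}")

final_average = grouped_average(samples)  # average from the largest run
```

The two ways of averaging agree up to rounding, and both drift toward $\mathbb{E}[X] = 2$ as $n$ grows.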

### Smallest Square Error

Here's a more algebraic way to define expected value. 

Often, we are looking for a single summary value when we want to predict the value of a random variable $X$. So, consider all possible values $x_*$ that we could propose as a central value. Let's try to use $x_*$ as a prediction for $X$.

Compute the errors in our prediction between sampled $X$ and the proposed value $x_*$ via $X - x_*$. Compute their magnitude using $(X - x_*)^2$ so that large underestimates and overestimates are both considered large errors. Pick $x_*$ so that, if we drew a long collection of samples $X$, and averaged the squared errors, our averaged square error would be as small as possible. That is, choose $x_*$ to minimize the mean square error over a long collection of samples. Equating long run frequency to chance:

$$\textbf{Minimize: } \text{MSE}(x_*) = \sum_{x \in \mathcal{X}} (x - x_*)^2 \text{PMF}(x).$$

You can think about the MSE at $x_*$ as a measure of how good $x_*$ is as a summary for the distribution. Anytime $X$ can take on multiple values every choice of summary value will ignore the variation in $X$. The quantity $(x - x_*)^2$ measures how badly $x_*$ misrepresents $x$. Minimizing an average against the PMF selects an $x_*$ that tries to stay as close to regions of high probability as possible, while compromising between the different possible values of $x$.

It turns out that the best choice of $x_*$ is the expected value $\mathbb{E}[X]$. In other words, *expected values minimize mean square errors*. 
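This claim is easy to check numerically for a specific distribution. The sketch below evaluates the MSE on a grid of candidate values using the PMF from the Definition section. It is a brute-force check, not a proof. The grid spacing and the helper name `mse` are arbitrary choices.

```python
# PMF from the Definition section.
pmf = {1: 0.5, 2: 0.25, 3: 0.0, 4: 0.25}


def mse(pmf, x_star):
    """Mean square error of predicting X with the single value x_star."""
    return sum((x - x_star) ** 2 * p for x, p in pmf.items())


candidates = [k / 100 for k in range(100, 401)]  # 1.00, 1.01, ..., 4.00
best = min(candidates, key=lambda c: mse(pmf, c))
mean = sum(x * p for x, p in pmf.items())

print(best, mean)  # prints 2.0 2.0
```

The grid minimizer coincides with $\mathbb{E}[X] = 2$ for this example.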

Notice the role of the square here. Squaring a number with magnitude greater than one makes its magnitude larger. Squaring a number with magnitude less than one makes its magnitude smaller. So, when we minimize a mean square error, we are discounting small errors, and exaggerating large errors. As a consequence, the expected value is the best choice of central value when we *aim to avoid very large errors, but mostly disregard small errors.* 

This interpretation matches our observation about skewed distributions. Expected values will be sensitive to outliers, since the expected value is chosen to try and avoid very large errors in prediction. 

We'll make this idea more precise in the future.

## Expectations of Functions of Random Variables

Often, we are interested in functions of random variables. For instance, consider $Y = X^2$, or $W = e^X$. Notice that both $Y$ and $W$ are random since $X$ was random. So, each function of a random variable defines a new random variable.

We are often interested in functions of random variables since the functions represent operations we could apply to the random variable. Data analysis is all about collecting samples, then applying functions to those samples in order to answer a question. 

So, if $Y = g(X)$ for some function $g$, what is $\mathbb{E}[Y]$?

There are two ways to calculate this expectation:

:::{note} Expectations of Functions (Discrete)

Given a discrete random variable $X$ and $Y = g(X)$ for some function $g$:

$$\mathbb{E}[Y] = \begin{cases} & \textbf{Range Space Formula: }\sum_{\text{all } y} y \text{Pr}(Y = y) \\
& \textbf{Domain Space Formula: }\sum_{\text{all } x} g(x) \text{Pr}(X = x)
\end{cases} $$

:::

Both formulas give the same answer. The first just treats $Y$ as a new discrete random variable, and applies the formula for generic expectations. To apply it, find the possible values of $Y = g(X)$, then find the PMF of $Y$. This is often a lot of work since finding the PMF of $g(X)$ involves finding $\text{Pr}(g(X) = y)$ for each possible $y$.

The second is often easier. It only requires evaluating a new weighted average against the PMF of $X$. It generalizes our original expectation formula nicely. Instead of averaging $x$, average $g(x)$. The domain space formula is the form we'll work with. It's called the domain space formula since it averages over the inputs to $g$ (the domain), rather than over the outputs of $g$ (its range). If you like exercises, showing that the two formulas always give the same result is a good problem to chew on. 


Here's an example. Suppose that $\mathcal{X} = \{1,2,3,4\}$ and:

$x = $ | 1  | 2 | 3 | 4   
:----:|:-------------|:-------------|:-------------|:-------------
$\text{PMF}(x) = $  | $1/2$ | $1/4$ | $0$ | $1/4$

What is $\mathbb{E}[X^2]$? 

Well, add a row to the table for $g(x) = x^2$:

$x = $ | 1  | 2 | 3 | 4   
:----:|:-------------|:-------------|:-------------|:-------------
$x^2 = $ | 1  | 4 | 9 | 16   
$\text{PMF}(x) = $  | $1/2$ | $1/4$ | $0$ | $1/4$

Then:

$$\mathbb{E}[X^2] = 1 \times \frac{1}{2} + 4 \times \frac{1}{4} + 9 \times 0 + 16 \times \frac{1}{4} = \frac{1}{2} + 1 + 4 = 5.5 $$
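Both formulas can be compared in code. The sketch below implements each one for a PMF stored as a dictionary and checks that they agree, first for $g(x) = x^2$ and then for $g(x) = (x - 2)^2$, a function added here for illustration because it sends two different inputs to the same output. The function names are hypothetical helpers, not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction

pmf_x = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(0), 4: Fraction(1, 4)}


def domain_space(pmf_x, g):
    """Average g(x) against the PMF of X."""
    return sum(g(x) * p for x, p in pmf_x.items())


def range_space(pmf_x, g):
    """Build the PMF of Y = g(X), then average y against it."""
    pmf_y = defaultdict(Fraction)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p  # Pr(Y = y) collects every x with g(x) = y
    return sum(y * p for y, p in pmf_y.items())


def square(x):
    return x ** 2


def shifted_square(x):
    return (x - 2) ** 2  # not one-to-one on the support: g(1) = g(3)


print(domain_space(pmf_x, square), range_space(pmf_x, square))                  # 11/2 11/2
print(domain_space(pmf_x, shifted_square), range_space(pmf_x, shifted_square))  # 3/2 3/2
```

The range space version has to assemble the PMF of $Y$ first, which is the extra work described above.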

Notice that $\mathbb{E}[X^2] = 5.5$ is not $\mathbb{E}[X]^2 = 2^2 = 4$. Instead, $\mathbb{E}[X^2] = 5.5 > 4 = \mathbb{E}[X]^2$. There are two related lessons here:

:::{caution} Expectations of Functions are Not Functions of Expectations

In general:

- The expectation of a function is not the function of the expectation:

$$\mathbb{E}[g(X)] \neq g(\mathbb{E}[X]) $$

- If the function $g$ is convex (curves upwards everywhere), then, for any random variable $X$:

$$\mathbb{E}[g(X)] \geq g(\mathbb{E}[X]) $$

The latter inequality is called **Jensen's inequality**. It holds for all convex functions and all distributions for which the expectations exist. The inequality is strict when $g$ is strictly convex and $X$ can take on at least two distinct values. The inequality holds in reverse if $g$ is concave (curves down). 

In particular, since $x^2$ curves upwards, *the expected square is greater than the squared expectation.*

:::
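The direction of Jensen's inequality can be checked on the running example. The sketch below uses the PMF from the Definition section with two convex functions and two concave functions. The particular functions are chosen here for illustration and `expect` is a hypothetical helper.

```python
import math

# PMF from the Definition section.
pmf = {1: 0.5, 2: 0.25, 3: 0.0, 4: 0.25}


def expect(pmf, g=lambda x: x):
    """Domain space formula: average g(x) against the PMF of X."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expect(pmf)  # E[X] = 2

functions = [
    ("x^2  (convex)", lambda x: x ** 2),
    ("exp  (convex)", math.exp),
    ("sqrt (concave)", math.sqrt),
    ("log  (concave)", math.log),
]

for name, g in functions:
    print(f"{name:15s} E[g(X)] = {expect(pmf, g):8.4f}   g(E[X]) = {g(mean):8.4f}")
```

For the convex functions the first column is larger, and for the concave functions it is smaller.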

The domain space formula also generalizes nicely to continuous random variables. As usual, integrate where we used a sum, and swap mass for density:

:::{note} Expectations of Functions (Continuous)

Given a continuous random variable $X$ and $Y = g(X)$ for some function $g$:

$$\mathbb{E}[Y] = \int_{\text{all } x} g(x) \text{PDF}(x) dx =  \int_{\text{all } x} g(x) f_X(x) dx$$

:::

For instance, suppose we had wanted to find the expected square distance from a dart's position to the center of the dartboard, when the dart's position is drawn uniformly. Then we would compute:

$$\mathbb{E}[R^2] = \int_{r = 0}^1 r^2 \times (2 r) dr = \int_{r = 0}^1 2 r^3 dr = \frac{2}{4} r^4|_{0}^1 = \frac{1}{2}. $$
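When an integral like this cannot be done by hand, the domain space formula can be approximated numerically. The sketch below uses a simple midpoint rule, which is one of many possible quadrature choices, and applies it to the dartboard density $2r$. The names `expect_continuous` and `dart_pdf` and the number of subintervals are assumptions made for this illustration.

```python
def expect_continuous(g, pdf, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of g(x) * pdf(x) over [a, b]."""
    h = (b - a) / n
    midpoints = (a + (i + 0.5) * h for i in range(n))
    return h * sum(g(x) * pdf(x) for x in midpoints)


def dart_pdf(r):
    """Density of the dart's distance to the center, for r in [0, 1]."""
    return 2 * r


mean_r = expect_continuous(lambda r: r, dart_pdf, 0.0, 1.0)
mean_r_squared = expect_continuous(lambda r: r ** 2, dart_pdf, 0.0, 1.0)

print(f"E[R]   is approximately {mean_r:.6f}  (exact: 2/3)")
print(f"E[R^2] is approximately {mean_r_squared:.6f}  (exact: 1/2)")
```

Note that $\mathbb{E}[R^2] = 1/2$ exceeds $\mathbb{E}[R]^2 = 4/9$, in line with Jensen's inequality.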



## Comparison to Mode and Median

There are other ways to summarize the "center" of a distribution. The two popular alternatives are:

1. The **mode** of a distribution is the value (or values) of $x$ that maximize its distribution function (PMF or PDF). These are the *most likely* possible values of the random variable. They correspond to the locations of the peaks in the distribution plot. 

1. The **median** of a distribution is the value where its CDF first crosses 1/2 from below. The discrete case is a bit awkward so we'll focus on the continuous case for interpretation. The continuous case is easier since, when we work with continuous random variables, we don't need to distinguish between the statements $X \leq x_*$ and $X < x_*$.
    - For a continuous random variable, the median $x_*$ is the value where $\text{CDF}(x_*) = 1/2$. 
    - Then $x_*$ is a central value in the sense that, if we predict $X$ using $x_*$ then we are *equally likely to over or underestimate*. Half of the probability mass lies above $x_*$ and half lies below. 
    - Like the expected value, the median can be defined as the central value that minimizes some average error. The median is the choice of central value that minimizes the mean *absolute* error, $\mathbb{E}[|X - x_*|]$. This means that the median weighs errors in proportion to their size, so it does not exaggerate large errors or discount small errors. It is less sensitive to outliers than the expectation. 

Importantly, the mean (expected value), median, and mode *are all different summaries*. Do not conflate them.

:::{caution} Mean is not Median or Mode

The expected value (mean) is *neither* the:

- *most likely outcome* (mode)
- *central value that is equally likely to over or underestimate a sample (median).*

:::
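The dartboard distance $R$ gives a concrete case where all three summaries differ. Its density $2r$ is largest at $r = 1$, so the mode is 1. Integrating the density gives $\text{CDF}(r) = r^2$, so the median is $1/\sqrt{2} \approx 0.707$. The mean is $2/3$. The sketch below estimates the mean and median from simulated samples and compares the two error criteria. It assumes inverse-CDF sampling, $R = \sqrt{U}$ with $U$ uniform on $[0,1]$, and the helper names, seed, and sample size are chosen for this illustration.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 100_000

# Inverse-CDF sampling: CDF(r) = r^2 on [0, 1], so R = sqrt(U) with U uniform.
samples = [rng.random() ** 0.5 for _ in range(n)]

sample_mean = sum(samples) / n
sample_median = statistics.median(samples)


def mean_abs_error(c):
    """Average of |R - c| over the samples."""
    return sum(abs(r - c) for r in samples) / n


def mean_sq_error(c):
    """Average of (R - c)^2 over the samples."""
    return sum((r - c) ** 2 for r in samples) / n


print(f"mean   is about {sample_mean:.3f}  (exact: 2/3)")
print(f"median is about {sample_median:.3f}  (exact: 0.707...)")
print("absolute error at median vs mean:",
      round(mean_abs_error(sample_median), 4), round(mean_abs_error(sample_mean), 4))
print("square error at mean vs median  :",
      round(mean_sq_error(sample_mean), 4), round(mean_sq_error(sample_median), 4))
```

On the samples, the median gives the smaller mean absolute error and the mean gives the smaller mean square error.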

We will mostly concern ourselves with expectations and modes in this class. Medians have received short shrift from statisticians and probabilists. Of the three, expectations are the easiest to compute, are easy to estimate, and are the easiest to analyze. They are not the easiest to estimate (medians are more stable). However, it is easier to build a theory that explains how estimates of expectations (sample averages) behave. This theory is deep, and is the crown jewel of probability. Many of the deepest theorems in probability regard sample averages. The law of large numbers, central limit theorem, and ergodic theorems all provide guarantees that relate sample averages to expectations. The law of large numbers, in particular, is a foundational pillar of probability. It is the key theorem that relates the abstract world of chances, and their rules, to the measurable world of frequencies. Without it, probability would be a purely abstract concept. 

As a result, we will also spend a lot of effort on expectations. While we focus there, don't forget the median. It is an equally, if not more, sensible notion, that minimizes a simpler notion of error, and is more stable to outliers. 