---
numbering:
  title:
    offset: 1
---

(ch4.2)=
# Rules of Expectation

In [Section 4.1](#ch4.1) we introduced the basic definition of an expectation as a weighted average. This definition provides a direct formula for computing an expectation: evaluate a weighted average of the possible values of the random variable, weighted by their likelihood.

Evaluating these averages can be tricky. We'll spend a couple of weeks practicing integrals and sums, but it will take some work before we can confidently evaluate interesting expectations directly.

Nevertheless, expectations are popular summary values because they obey a variety of useful rules. These rules make it possible to compute many expectations without ever performing summation acrobatics. Instead of plugging into the definition, we can often find an expectation by applying the rules of expectation. 

This section will introduce the most useful ones. We will start with an example for motivation, then will explore three sets of rules. Once we've built up a list of rules, we'll come back and complete our example without working through the sum directly. 

## Binomial Example

Suppose that $X \sim \text{Binom}(n,p)$. What is $\mathbb{E}[X]$?

Count variables are discrete, so we could start with a sum:

$$\mathbb{E}[X] = \sum_{\text{all } x} x \times \text{Pr}(X = x) = \sum_{x = 0}^n x \left( \begin{array}{c} n \\ x \end{array} \right) p^x (1 - p)^{n-x}.$$

That's not an easy sum. We could try simplifying:

$$\mathbb{E}[X] = \sum_{x = 0}^n \frac{x}{x!}  \frac{n!}{(n - x)!} p^x (1 - p)^{n-x} = \sum_{x = 0}^n \frac{n!}{(x- 1)!(n - x)!} p^x (1 - p)^{n-x}$$

but, unless you're quite clever, it's not clear how to progress. Sums are *easy to set up*, but often are *hard to close.*

:::{hint} Closing the Sum
:class: dropdown

To close the sum, start by focusing on the combinatorial term involving all the factorials. It is:

$$ \frac{n!}{(x- 1)!(n - x)!} $$

This is close to, but not quite, $n$ choose $(x - 1)$. It falls short since the terms in the denominator add to $(x - 1) + (n - x) = n - 1$, not $n$. That said, with a little rearranging, we can relate it to $(n-1)$ choose $(x - 1)$:

$$ \frac{n!}{(x- 1)!(n - x)!} = \frac{n \times (n - 1)!}{(x- 1)!(n - x)!} = n \left( \begin{array}{c} n - 1 \\ x - 1 \end{array} \right) $$ 

So, we can write our sum:

$$\mathbb{E}[X] = n \sum_{x = 0}^{n} \left( \begin{array}{c} n-1 \\ x - 1 \end{array}  \right) p^x (1 - p)^{n-x} $$

Now the term inside the sum looks almost like the PMF of a binomial, but the multiplicity (the choose factor) uses $n-1$ and $x - 1$, while the powers are $p^x$ and $(1 - p)^{n - x}$. The power $n - x$ isn't an issue since $n - x = n - 1 + 1 - x = (n - 1) - (x - 1)$. To handle the extra power of $p$, write:

$$p^x = p \times p^{x-1}. $$

Now our sum is:

$$\mathbb{E}[X] = n p \sum_{x = 0}^{n} \left( \begin{array}{c} n-1 \\ x - 1 \end{array}  \right) p^{x - 1} (1 - p)^{(n-1)-(x - 1)} $$

Take a look at the bounds of the sum again. We started from $x = 0$. But $x = 0$ will cause us headaches if we plug $0 - 1 = -1$ into a factorial. This is a clue to back up and look at the original expression. Originally, we had:

$$\mathbb{E}[X] = \sum_{x = 0}^n x \times \text{PMF}(x). $$

The first term in this sum is $0 \times \text{PMF}(0) = 0$. Adding zero doesn't change the value of a sum, so we can ignore the $x = 0$ term:

$$\mathbb{E}[X] = 0 + \sum_{x = 1}^n x \times \text{PMF}(x) = \sum_{x = 1}^n x \times \text{PMF}(x). $$

So, dropping the "zero" term from our sum, we now have:

$$\mathbb{E}[X] = n p \sum_{x = 1}^{n} \left( \begin{array}{c} n-1 \\ x - 1 \end{array} \right) p^{x - 1} (1 - p)^{(n-1)-(x - 1)} $$

Now all the terms in the sum involve $x - 1$, and the bounds of the sum make sense. At this point it may look like we've gone to a lot of effort just to arrive at a similar sum. However, we're actually almost done. Like most math problems, our initial steps are all about expanding terms until we reach an expanded form that is easier to simplify. 

To simplify, it helps to come up with a strategy. So far we've just simplified what was available, then organized terms. Now, notice that the resulting sum is a sum over $n$ terms, where each term looks like the value of a binomial PMF on $n - 1$ trials, for $x - 1$ successes. Every binomial PMF is normalized. So, if we write the sum as a sum over all the values of a binomial PMF, then we will know its value must equal one!

Let $y = x - 1$. Then, when $x = 1$, $y = 0$ and when $x = n$, $y = n-1$. Therefore,

$$\mathbb{E}[X] = n p \sum_{y = 0}^{n-1} \left( \begin{array}{c} n-1 \\ y \end{array}  \right) p^{y} (1 - p)^{(n-1)-y} $$

Let:

$$\text{PMF}(s;n,p) = \left(\begin{array}{c} n \\ s \end{array} \right) p^s (1 - p)^{n - s} $$

denote the binomial PMF for a generic binomial random variable $S \sim \text{Binom}(n,p)$. Then, we've just shown that, when $X \sim \text{Binom}(n,p)$:

$$\mathbb{E}[X] = \sum_{x = 0}^n x \times \text{PMF}(x;n,p) = n p \sum_{y = 0}^{n-1} \text{PMF}(y;n-1,p) $$

So, by manipulating the sum, we've replaced a weighted average of possible $x$ values over a binomial PMF on $n$ trials, with just a *sum of binomial PMF values on $n-1$ trials.* Since the second sum is the sum of a binomial PMF over all possible inputs:

$$\sum_{y = 0}^{n-1} \text{PMF}(y;n-1,p) = 1 $$

We don't need to use any algebra to close this sum. Instead, we're just recalling normalization. 

So, putting it all together:

$$ \mathbb{E}[X] = n p.$$

:::

It turns out that:

$$ \mathbb{E}[X] = n p.$$

That's a wonderfully simple formula. The expected number of successes in a string of $n$ independent, identical, binary trials is the number of trials, $n$, times the chance each individual trial succeeds, $p$. If I run 10 experiments, and each succeeds with chance 1/5, then I expect to see 2 successes. 
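
As a sanity check, the weighted-average sum can also be evaluated numerically. The snippet below is a minimal sketch, not part of the derivation: the helper names `binomial_pmf` and `binomial_expectation_by_sum` are made up for illustration, and exact fractions are used so that no rounding error creeps in.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ Binom(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_expectation_by_sum(n, p):
    """Evaluate the weighted average sum_x x * Pr(X = x) term by term."""
    return sum(x * binomial_pmf(x, n, p) for x in range(n + 1))


# The example from the text: 10 trials, each succeeding with chance 1/5.
n, p = 10, Fraction(1, 5)
direct = binomial_expectation_by_sum(n, p)

# Compare the brute-force sum against the closed form n * p.
print(direct, n * p)
```

For these inputs the brute-force sum and $n p$ agree exactly. A computer can grind through the sum for any particular $n$ and $p$, but it cannot tell us *why* the answer is always $n p$.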

This answer also closely tracks what we learned about the mode of the binomial. The most likely outcome for a binomial random variable is near $n p$. 

But, deriving it directly was exhausting. If you haven't, open the dropdown above to see how much work it took to get here. Whenever we reach an answer that is suspiciously simple, through a process that is evidently opaque, we should ask, "*was there a better way to find this answer?*" Often, if your answer is intuitive, but your work is ornate, there is a simpler method. The rest of this chapter will develop a series of rules that make this sort of calculation a breeze. 

## Rules of Expectation

Here's our basic strategy:

1. When given a random variable, always try to write out the expectation directly as a sum or an integral. If you can close it, go ahead. There's no need to try anything sharper.
1. If the sum/integral is tricky, we'll try to rewrite the random variable as a combination of simpler random variables.
1. Then, we will try to apply some rules of expectation to break up the original expectation into a combination of expectations of its parts. If each part is simple enough, then we can use the expectations of the parts to put together the expectation of the whole. 

### Expectations of Key Distributions

To use this strategy, we will need to know the expectations of some key reference distributions. Here are three you should always be ready to use:

1. **Constants:** If $X = c$ for some constant $c$, then $\mathbb{E}[X] = c$. 

    This result follows immediately from either interpretation of the expectation. If all of the mass of the distribution is at one value, then that value must be the center of mass. Alternately, if $X$ is always $c$, then any sample average of a string of samples will be a sample average of a string of $c$'s, so it must equal $c$. 

    :::{note} Expectations of Constants
    If $X = c$, then:

    $$ \mathbb{E}[X] = c. $$

    In other words, *the expected value of a constant, is the constant itself.*
    :::

1. **Bernoulli (binary) Random Variables:** Suppose that $X$ is an indicator random variable for some event $E$. Then $X \sim \text{Bern}(p)$ where $p = \text{Pr}(E)$. What is $\mathbb{E}[X]$?

    $$\mathbb{E}[X] = \sum_{x = 0}^1 x \text{PMF}(x) = 0 \times (1 - p) + 1 \times p = p. $$

    :::{note} Expectations of Indicators
    If $X \sim \text{Bern}(p)$, then:

    $$ \mathbb{E}[X] = p. $$

    In other words, *the expected value of an indicator random variable is the chance that the event it indicates occurs.*
    :::

1. **Symmetric Distributions:** Suppose that $X$ is drawn from a distribution that is symmetric about some value $x_*$. Then, to balance the distribution, the only possible midpoint is $x_*$, so the center of mass is $x_*$. It follows that:

    :::{note} Expectations of Symmetric Distributions
    If $X$ is drawn symmetrically about $x_*$, then:

    $$ \mathbb{E}[X] = x_*. $$
    :::
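
The three reference results above can be checked on small examples. This is an illustrative sketch only: the `expectation` helper and the particular distributions (a constant at 7, an indicator with $p = 1/4$, and a PMF symmetric about 3) are assumptions chosen for the demonstration.

```python
from fractions import Fraction


def expectation(pmf):
    """Weighted average of the values, weighted by their probabilities.

    `pmf` is a dict mapping each possible value to its probability.
    """
    return sum(value * prob for value, prob in pmf.items())


# 1. A constant: all of the probability mass sits at c.
c = 7
constant = {c: Fraction(1)}

# 2. An indicator (Bernoulli) variable with Pr(E) = p.
p = Fraction(1, 4)
bernoulli = {0: 1 - p, 1: p}

# 3. A distribution that is symmetric about x_* = 3.
symmetric = {
    1: Fraction(1, 8),
    2: Fraction(1, 4),
    3: Fraction(1, 4),
    4: Fraction(1, 4),
    5: Fraction(1, 8),
}

print(expectation(constant))   # equals c
print(expectation(bernoulli))  # equals p
print(expectation(symmetric))  # equals the point of symmetry
```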

### Linearity of Expectation

Next, we will need rules that help us compute expectations of transformations of random variables. These are just standard algebra rules for averages. 

The simplest transformation is a linear function. We'll break this rule into three parts. The first two are each special cases of the third.

1. **Translations:** If $Y = X + b$ for some $b$, then $\mathbb{E}[Y] = \mathbb{E}[X + b] = \mathbb{E}[X] + b$. 

    As usual, we can either prove this rule using the weighted average formula for expectations, or argue it using the interpretations of expectation. Let's work by interpretation. Adding a constant to $X$ just shifts its distribution rightward by $b$ (leftward if $b$ is negative), since $\text{PMF}_Y(y) = \text{Pr}(Y = y) = \text{Pr}(X + b = y) = \text{Pr}(X = y - b) = \text{PMF}_X(y - b)$. If I translate a distribution horizontally by a distance $b$, then I must also translate its center of mass horizontally by a distance of $b$. So, the new center of mass is the old center of mass, plus $b$. 

1. **Scaling:** If $Y = a X$ for some $a$, then $\mathbb{E}[Y] = \mathbb{E}[aX] = a \mathbb{E}[X]$. 

    Let's prove this one using the weighted average formula. We'll do the discrete case. The continuous case works for the same reason.

    $$\mathbb{E}[Y] = \mathbb{E}[a X] = \sum_{\text{all } x} (a x) \text{PMF}(x) = a \sum_{\text{all } x} x \text{PMF}(x) = a \mathbb{E}[X]. $$

Combining these two rules produces the following:

:::{note} Linearity of Expectation

If $Y = f(X)$, where $f(x) = a x + b$ is a linear function of $x$, then:

$$\mathbb{E}[Y] = \mathbb{E}[a X + b] = a \mathbb{E}[X] + b = f(\mathbb{E}[X]). $$

In other words, *the expectation of a linear function of $X$, is the function of the expectation.*

:::
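
Linearity is easy to confirm on a concrete case. The sketch below assumes a fair six-sided die for $X$ and the arbitrary choice $a = 3$, $b = -2$. The helpers `expectation` and `transform` are hypothetical names introduced here, not notation from the text.

```python
from fractions import Fraction


def expectation(pmf):
    """Weighted average of the values in a {value: probability} dict."""
    return sum(value * prob for value, prob in pmf.items())


def transform(pmf, f):
    """PMF of Y = f(X). Values of x that map to the same y pool their mass."""
    out = {}
    for x, prob in pmf.items():
        y = f(x)
        out[y] = out.get(y, 0) + prob
    return out


# X is a fair six-sided die (an assumed example).
die = {face: Fraction(1, 6) for face in range(1, 7)}
a, b = 3, -2

# Left side: build the distribution of Y = aX + b, then average it.
lhs = expectation(transform(die, lambda x: a * x + b))

# Right side: average X first, then apply the linear function.
rhs = a * expectation(die) + b

print(lhs, rhs)
```

The two sides agree, as the rule promises. Note that the right side never needs the distribution of $Y$, only $\mathbb{E}[X]$.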

:::{caution} Nonlinear Functions

We should be careful with this result. As noted in [Section 4.1](#ch4.1) the expectation of a function is rarely the function of the expectation. The result shown above *only works for linear functions.* If $f$ is nonlinear, then it might not hold. 

If $f$ is convex or concave, then we can still relate the function of the expectation to the expectation of the function, but only through bounds. As discussed in [Section 4.1](#ch4.1), if the function $f$ is strictly convex (curves upward everywhere), then, for any random variable $X$ that can take on at least two distinct values:

$$\mathbb{E}[f(X)] > f(\mathbb{E}[X]) $$

This is **Jensen's inequality**. In its general form, $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$, it holds for all convex functions and all distributions for which the expectations exist. The inequality reverses if $f$ is concave (curves downward).

:::
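
To see the gap that a nonlinear function opens up, here is a small sketch. It assumes a fair six-sided die for $X$, with $f(x) = x^2$ as the convex function and $-x^2$ as the concave one. These choices are illustrative only.

```python
from fractions import Fraction


def expectation(pmf, f=lambda x: x):
    """E[f(X)] for a discrete PMF given as a {value: probability} dict."""
    return sum(f(x) * prob for x, prob in pmf.items())


def square(x):
    """A strictly convex function."""
    return x * x


def neg_square(x):
    """A strictly concave function."""
    return -x * x


die = {face: Fraction(1, 6) for face in range(1, 7)}

# Convex case: the expectation of the function exceeds the function of the expectation.
mean_of_square = expectation(die, square)   # E[X^2]
square_of_mean = square(expectation(die))   # (E[X])^2
print(mean_of_square > square_of_mean)

# Concave case: the inequality flips.
mean_of_neg = expectation(die, neg_square)
neg_of_mean = neg_square(expectation(die))
print(mean_of_neg < neg_of_mean)
```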

### Additivity of Expectation

The final, and most useful, property of expectation is another statement about sums. This time, it concerns the expectations of sums of random variables.

:::{note} Additivity of Expectation
If $X$ and $Y$ are random variables, then:

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]. $$

In other words, *the expectation of a sum is the sum of the expectations.*
:::

:::{important}

This result *does not* require anything special about $X$ and $Y$. They need not be independent, nor do they need to regard disjoint events. The expectation of *any* sum of random variables, no matter the variables, or their relation to one another, is just the *sum of their expectations.*
:::

To prove this result, we'll use some of the ideas from [Section 1.4](#ch1.4) and [1.5](#ch1.5). Let $S = X + Y$. Then, by the weighted average formula for expectations:

$$\mathbb{E}[S] = \sum_{\text{all } s} s \times \text{Pr}(S = s) = \sum_{\text{all } s} \sum_{\text{all } x,y \text{ s.t. } x + y = s} (x + y) \text{Pr}(X = x, Y = y) $$

where the second equality follows by partitioning the event, $X + Y = s$, into each of the disjoint ways it could occur. The "s.t." in the subscript stands for "such that." 

The sum over all possible sums, of each pair that could add to the sum, is just the same thing as summing over all pairs $x$ and $y$. So, we can rewrite our sum more simply:

$$\mathbb{E}[S] = \sum_{\text{all } x, y} (x + y) \text{Pr}(X = x, Y = y) $$

Now, simplifying:

$$\mathbb{E}[S] = \sum_{\text{all } x,y } x \text{Pr}(X = x, Y = y) + \sum_{\text{all } x,y } y \text{Pr}(X = x, Y = y). $$

The probabilities in each sum are joint probabilities. We can expand each using the multiplication rule. For example:

$$\sum_{\text{all } x,y } x \text{Pr}(X = x, Y = y) = \sum_{\text{all } x,y } x \text{Pr}(X = x) \text{Pr}(Y = y|X = x). $$

Now, let's split up the sum. Let's first sum over all $x$, then, sum over all $y$:

$$\sum_{\text{all } x } \sum_{\text{all } y} x \text{Pr}(X = x) \text{Pr}(Y = y|X = x) = \sum_{\text{all } x }  x \text{Pr}(X = x) \left( \sum_{\text{all } y} \text{Pr}(Y = y|X = x) \right).$$

Here's the kicker. The sum inside the parentheses on the right is the sum of a distribution, over all possible values of the associated random variable. Anytime we sum the PMF of a random variable (here, the PMF of $Y$ given $X = x$) over all possible values, we must get back 1. All PMFs are normalized. 

So:

$$\sum_{\text{all } x } \sum_{\text{all } y} x \text{Pr}(X = x) \text{Pr}(Y = y|X = x) = \sum_{\text{all } x }  x \text{Pr}(X = x) = \mathbb{E}[X]. $$

The same argument applies for the second term in our original sum, so:


$$ \mathbb{E}[S] = \mathbb{E}[X] + \mathbb{E}[Y].$$

Essentially the same arguments apply in the continuous case. 
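
Since additivity does not need independence, it is worth checking on a pair of variables that are clearly dependent. The sketch below uses an assumed example: draw twice without replacement from an urn holding 2 red marbles and 1 blue marble, and let $X$ and $Y$ indicate whether the first and second draws are red.

```python
from fractions import Fraction

# Joint PMF Pr(X = x, Y = y) for the assumed urn example.
# Drawing two blues is impossible, so (0, 0) carries no mass and is omitted.
joint = {
    (1, 1): Fraction(1, 3),  # red, then red:  (2/3) * (1/2)
    (1, 0): Fraction(1, 3),  # red, then blue: (2/3) * (1/2)
    (0, 1): Fraction(1, 3),  # blue, then red: (1/3) * 1
}

# E[X + Y], computed directly from the joint distribution.
e_sum = sum((x + y) * prob for (x, y), prob in joint.items())

# E[X] and E[Y], each computed by summing over all pairs.
e_x = sum(x * prob for (x, y), prob in joint.items())
e_y = sum(y * prob for (x, y), prob in joint.items())

# Confirm that X and Y are NOT independent: the joint probability
# differs from the product of the marginals.
pr_x1 = sum(prob for (x, y), prob in joint.items() if x == 1)
pr_y1 = sum(prob for (x, y), prob in joint.items() if y == 1)
dependent = joint[(1, 1)] != pr_x1 * pr_y1

print(e_sum, e_x + e_y, dependent)
```

The expectation of the sum matches the sum of the expectations even though the two draws are dependent.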


## Expectations of Count Variables via Additivity

A **count variable** is an integer-valued random variable that represents some sort of count. For instance, binomial random variables count successes. Geometric random variables count trials until a success. The rules established above make it easy to find the expectations of count variables, since most count variables can be expanded as a sum. After all, most counting processes occur as sequences, where, each time an instance occurs, we add 1 to our running count. 

### Binomial Random Variables

Let's try to find the expectation of a binomial again. This time, we'll use rules instead of brute force algebra. 

First, suppose that $X \sim \text{Binom}(n,p)$. Then $X$ is the number of successes in a string of $n$ independent, identical, binary trials. So, if we let $I_j$ be an indicator for the event that the $j^{th}$ trial succeeds, then:

$$X = I_1 + I_2 + \dots + I_n = \sum_{j=1}^n I_j. $$

Then, using the additivity property:

$$\mathbb{E}[X] = \mathbb{E}\left[\sum_{j=1}^n I_j\right] = \sum_{j=1}^n \mathbb{E}[I_j].$$

Each $I_j$ is an indicator, so is a Bernoulli random variable with success probability $p$. So, since the expectation of any Bernoulli random variable is just its success probability, we know that:

$$\mathbb{E}[X] = \sum_{j=1}^n p = n \times p. $$

Done! Compare this proof to the dropdown argument provided at the start of the chapter. This one is *much* better. 

It is better in two ways:

1. It is simpler. It involves fewer steps and is easier to follow/remember.
1. Each of its steps is meaningful and relies on a clearly motivated logical argument that walks directly towards the desired result. The algebraic proof required a large number of little steps, none of which, except the last, carried much intrinsic meaning. In contrast, each step in this proof uses a powerful idea: count variables are sums of indicators, expectations of sums are sums of expectations, and the expectation of an indicator is its chance of success. 

This is why rules are so helpful. They will allow us to find expectations in situations where direct application of the weighted average formula is ungainly.
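
The indicator decomposition also suggests a direct way to simulate a binomial: flip $n$ indicators and add them up. The following is a rough sketch under assumed settings ($n = 10$, $p = 0.2$, 20,000 repetitions, a fixed seed). The function name `simulate_binomial_mean` is hypothetical. The sample average should land close to $n p$, though not exactly on it.

```python
import random


def simulate_binomial_mean(n, p, num_samples, seed=0):
    """Estimate E[X] for X ~ Binom(n, p) by averaging simulated draws.

    Each draw of X is built as a sum of n indicator variables,
    mirroring X = I_1 + I_2 + ... + I_n.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0
    for _ in range(num_samples):
        indicators = [1 if rng.random() < p else 0 for _ in range(n)]
        total += sum(indicators)
    return total / num_samples


estimate = simulate_binomial_mean(n=10, p=0.2, num_samples=20_000)
print(estimate, 10 * 0.2)
```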

### Hypergeometric Random Variables

🦺🔨🧱 Under construction. Here soon!

