---
numbering:
  title:
    offset: 1
---

# Conditional Probability

So far we've established rules for "not", "or", and "and" statements. However, we didn't really finish the job for "and" statements. We showed how to organize joint probabilities, and how to use the rules for "or" to relate joints and marginals, but we didn't derive any new rules that help us compute joint probabilities directly. We didn't answer the question, *how is $\text{Pr}(A,B)$ related to $\text{Pr}(A)$ and $\text{Pr}(B)$?*

In this section we'll see that, to work out the probability that $A$ and $B$ happen, it is easier to first work out the probability that $A$ happens *if* $B$ happens (or vice versa). The probability that $A$ happens if $B$ happens is a conditional probability. We call it a *conditional* probability since the statement *conditions* on some other outcome, i.e. it adds an additional condition that restricts the outcome space.

## If Statements and Conditional Probability

What is the probability that it rains tomorrow *if* the weather is cold?

Suppose that:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$
Warm  | $1/10 $   | $0$ | $1/10$ | $2/10$
Hot   | $0$   | $0 $ | $2/10$ | $2/10$
Marginals | $3/10$ | $3/10$ | $4/10$ | 1

Then, when we condition on the assumption that it is cold, we are rejecting the possibility that it is warm or hot. In essence, we are restricting our outcome space. If it is cold, then it cannot be warm or hot, so any event in the sets "warm" or "hot" is not possible after conditioning. So, when we condition, we can drop the bottom two rows of the table:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$

Unlike the operations "not", "or", and "and", which act on the definition of the event, the logical operation "if" acts on the space of possible outcomes, $\Omega$. The first three operations change the composition of the event, while "if" changes the list of outcomes that could occur. As a result, *conditioning* will change both the numerator *and* the denominator when we equate probability to proportion or frequency. All other operations only act on the numerator.

### Normalization

Take a look at the conditioned table. All of the numbers in the table are nonnegative and less than one, so they could be chances. However, they don't add to one, so they fail to form a distribution. The marginal, at the far right, is $6/10$, not 1, so the list $[2/10, 3/10, 1/10]$ can't define a full distribution.

There's an easy fix here. The list of joints adds to $6/10$, so, if we scale them all by $10/6$, they'll add to 1:

$$\frac{10}{6} \left(\frac{2}{10} + \frac{3}{10} + \frac{1}{10} \right) = \frac{2}{6} + \frac{3}{6} + \frac{1}{6} = 1 $$

So, we can get a valid distribution if we divide the joint entries of the row by their sum. Notice that this is the same as putting all elements of the row over a common denominator, dropping that denominator, then replacing it with the sum of the numerators:

$$[0.2, 0.3, 0.1] \rightarrow [2/10, 3/10, 1/10] \rightarrow [2, 3, 1] \rightarrow 2 + 3 + 1 = 6 \rightarrow [2/6, 3/6, 1/6]. $$

It turns out that this operation will work for any list of nonnegative numbers with a finite, positive sum. If we have a list $[p_1,p_2,p_3, ..., p_n]$ where $p_j \geq 0$ for all $j$, and at least one $p_j > 0$, then:

$$\frac{1}{\sum_{j=1}^n p_j} [p_1,p_2,p_3, ..., p_n] $$

is a valid categorical distribution. This operation is called **normalization** since it rescales the entries to make sure they are normalized (add to 1).
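
The normalization step is easy to check numerically. Here is a minimal sketch, assuming we store the cold row as exact fractions; the helper name `normalize` is hypothetical, introduced only for illustration:

```python
from fractions import Fraction


def normalize(weights):
    """Rescale a list of nonnegative weights so that the entries add to 1."""
    total = sum(weights)
    # Normalization only makes sense for nonnegative entries with a positive sum.
    if any(w < 0 for w in weights) or total <= 0:
        raise ValueError("weights must be nonnegative with a positive sum")
    return [w / total for w in weights]


# The cold row of the joint table: [rain, clouds, sun].
cold_row = [Fraction(2, 10), Fraction(3, 10), Fraction(1, 10)]
normalized_cold_row = normalize(cold_row)

print(normalized_cold_row)       # [Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)]
print(sum(normalized_cold_row))  # 1
```

The fractions print in lowest terms, so $2/6$ and $3/6$ appear as $1/3$ and $1/2$.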

### Conditional Probability

While we could normalize our list $[2/10, 3/10, 1/10]$ to make a valid categorical distribution, it is not clear that we should. Why would normalizing by the marginal correctly return the conditional probabilities?

To answer this question we need a probability model that directs our calculation. Without a model, we could define conditional probabilities however we like. With a model, conditional probabilities have to behave in a sensible way.

Our first probability model is probability as proportion. If all outcomes are equally likely, then the probability of an event is the number of ways it can occur divided by the number of possible outcomes. In other words, the probability of an event is the proportion of the outcome space contained in the event. We can use this model to define conditional probability for equally likely events.

Think again about what "if" does to our model. When we condition, we are restricting the set of possible outcomes. For instance, in the weather example, we reject all outcomes where the temperature is warm or hot. If we roll a die, and condition on an even roll, then we are rejecting all odd outcomes. 

Here's the new idea: *if two outcomes are equally likely before conditioning, and are both consistent with the conditioning statement, so remain in the outcome space, they should still be equally likely after conditioning.* Let's spell that out for some examples:

- I roll a fair die. All outcomes $\{1,2,3,4,5,6\}$ are equally likely. If the die roll is even, then all outcomes $\{2,4,6\}$ are equally likely.
- I toss two coins. All outcomes $\{HH,HT, TH, TT\}$ are equally likely. If the tosses match, then all outcomes $\{HH,TT\}$ are equally likely.

Since we have a rule that assigns chances to outcomes when the outcomes are equally likely, we can compute conditional probabilities using this rule:

- The probability a fair die lands on a 2  given that the roll is even: 
$$\text{Pr}(\{2\}|\text{ even}) = \frac{|\{2\}|}{|\{2,4,6\} |} = \frac{1}{3}$$
- The probability a fair die lands on a 2 or 4 given that the roll is even: 
$$\text{Pr}(\{2,4\}|\text{ even}) = \frac{|\{2, 4\}|}{|\{2,4,6\} |} = \frac{2}{3}$$
- The probability a fair die roll is less than 3 given that the roll is even: 
$$\text{Pr}(\{1,2\}|\text{ even}) = \frac{|\{2 \}|}{|\{2,4,6\} |} = \frac{1}{3}$$
- The probability a fair die roll is equal to 1 given that the roll is even: 
$$\text{Pr}(\{1\}|\text{ even}) = \frac{|\emptyset|}{|\{2,4,6\} |} = 0$$

In other words, the conditional probability of an event $B$ given another event $A$, when outcomes are equally likely is:

$$\text{Pr}(B|A) = \frac{\text{the numb. of ways } B \text{ and } A \text{ can happen}}{\text{the numb. of ways } A \text{ can happen}} = \frac{|B \cap A|}{|A|}. $$

We can rewrite the equation to recover the normalization rule we suggested earlier:

$$\begin{aligned} \text{Pr}(B | A) & = \frac{|\Omega|}{|\Omega|}\frac{|B \cap A|}{|A|} = \frac{|B \cap A|}{|\Omega|} \times \frac{|\Omega|}{|A|} = \frac{\text{Pr}(B,A)}{\text{Pr}(A)}. \end{aligned}$$

In other words, *the conditional probability of $B$ given $A$ is the probability $B$ and $A$ happen, divided by the probability $A$ happens*, or, *the conditional probability of $B$ given $A$ is the joint probability of $B$ and $A$, divided by the marginal probability of $A$.*

So, *when outcomes are equally likely*, we can compute conditional probabilities by isolating all outcomes that are consistent with the conditioning statement, then matching probability to proportion in the restricted space. In other words, just normalize the necessary collection of probabilities. 
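
The counting rule can be sketched in a few lines of Python. This is only an illustration for equally likely outcomes, using the fair die; the function name `conditional_by_counting` is an assumption made for this sketch:

```python
from fractions import Fraction


def conditional_by_counting(event, condition):
    """Pr(event | condition) when all outcomes are equally likely: |B and A| / |A|."""
    condition = set(condition)
    if not condition:
        raise ValueError("cannot condition on an event with no outcomes")
    return Fraction(len(set(event) & condition), len(condition))


omega = {1, 2, 3, 4, 5, 6}  # outcome space for one roll of a fair die
even = {2, 4, 6}            # the conditioning event

print(conditional_by_counting({2}, even))     # 1/3
print(conditional_by_counting({2, 4}, even))  # 2/3
print(conditional_by_counting({1}, even))     # 0

# The same value computed as joint over marginal, Pr(B, A) / Pr(A).
joint = Fraction(len({2, 4} & even), len(omega))
marginal = Fraction(len(even), len(omega))
print(joint / marginal)                       # 2/3
```

The last two lines mirror the algebra above: multiplying and dividing by $|\Omega|$ changes nothing, so counting in the restricted space and dividing joint by marginal agree.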

Does the rule $\text{Pr}(B | A) = \frac{\text{Pr}(B,A)}{\text{Pr}(A)}$ work if the underlying outcomes are not equally likely?

Consider our weather example again. We can isolate the appropriate row of the joint table:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$

but we don't know anything about the background outcome space $\Omega$ that produced these joint probabilities. Moreover, spelling out a detailed weather model in which all microscopic outcomes are equally likely is far too much work for this problem, and would be impractical in almost all settings. For conditional probability to be useful, we should be able to compute it in categorical settings, without somehow expanding our outcome space. So, while the derivation provided above works for equally likely outcome models, it is too restricted to work for general applications.

Thankfully, we have at hand an alternate model, probability as frequency. Recall that the probability an event occurs should be approximated by, and in the long run equal to, the frequency with which it occurs in a sequence of repeated trials. This relation should hold for any valid probability model. So, let's use it to show that the normalization approach correctly computes conditional probabilities.

Consider a long weather record. Say, the weather in Berkeley over the last year. Let's try to find the conditional probability that it rains, given that it is cold, on some future day selected at random. We'll assume that the climate is fixed, the process that produces weather does not change (is stationary), and the process doesn't remember its past forever (e.g. the probability that it rains today given that it rained on this day a century ago is the same as the probability that it rains today). Then the probability of any event should be approximated by the frequency with which the event occurs in the weather record.

Here's an example two-week record:

Day | 1  | 2  | 3  | 4 | 5| 6| 7| 8 | 9 | 10 | 11 | 12 | 13 | 14 |
:----:|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
Precip. | R | Cl | Cl | Cl | S | S | Cl | R | R | Cl | S | S | S | Cl
Temp  | Co | Co | Co | W | W | H | W | W | Co | Co | W | H | H | W

We can compute frequencies from this record. For example $\text{Fr}(\text{R}) = 3/14 $ and $\text{Fr}(\text{Cl}) = 6/14$ since it rained on 3 days, and was cloudy on 6, out of the last 14. 

We can also use this record to compute joint frequencies. For example $\text{Fr}(\text{R and Co}) = 2/14$ since it was rainy and cold on 2 of the 14 days. These were days 1 and 9.

Here's the good part. We can also compute conditional frequencies from the record. Suppose I wanted to find the *frequency* with which it rained, given that it was cold. Then, I would disregard all days when it wasn't cold, and compute the frequency out of the remaining days. Disregarding the days when it was not cold is equivalent to filtering for only the days when it was cold:

Day | 1  | 2  | 3  | 9 | 10 |
:----:|:---------|:---------|:---------|:---------|:---------|
Precip. | R | Cl | Cl | R | Cl |
Temp  | Co | Co | Co | Co | Co |

Now that we've filtered the record for only cold days, the conditional frequencies are apparent:

$$\text{Fr}(\text{R}|\text{Co}) = 2/5 $$

since it rained on 2 of the 5 days when it was cold. Notice, 2/5 = 0.4 is not a bad estimate of the value we got by normalizing, $2/10 \times 10/6 = 2/6 = 0.3333...$. These two numbers are different because our record of cold days was short, so the frequencies are only rough estimates of the true probabilities.

Nevertheless, the algebra for conditional frequencies is clear, and, should recover the appropriate probabilities if we run enough trials/collect a long enough record. 

Take a look at the frequency calculation again:

$$\text{Fr}(\text{R}|\text{Co}) = \frac{\text{the numb. of times R and Co happened}}{\text{the numb. of times Co happened}} $$

This expression looks a lot like what we wrote for equally likely outcomes. All we've done is changed the way we count. Instead of counting ways an outcome can occur, we are counting the number of times it did occur in a sequence. Let $N_{R,Co}$ be the number of times it rained and was cold (2), and $N_{Co}$ be the number of times it was cold (5). Let $n$ be the length of the record (14). Then:

$$\begin{aligned} \text{Fr}(\text{R}|\text{Co}) & = \frac{N_{R, Co}}{N_{Co}} = \frac{n}{n} \times \frac{N_{R, Co}}{N_{Co}} = \frac{N_{R, Co}}{n} \times \frac{n}{N_{Co}} \\ & = \frac{\text{Fr}(R, Co)}{\text{Fr}(Co)}. \end{aligned}$$

In other words, *the conditional frequency of $B$ given $A$ is the frequency with which $B$ and $A$ happen, divided by the frequency with which $A$ happens*, or, *the conditional frequency of $B$ given $A$ is the joint frequency of $B$ and $A$, divided by the marginal frequency of $A$.* 
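
Both routes to the conditional frequency can be checked against the two-week record. The sketch below copies the record from the table and assumes a simple list-of-pairs representation; the variable names are illustrative only:

```python
from fractions import Fraction

# The two-week record, one entry per day.
precip = ["R", "Cl", "Cl", "Cl", "S", "S", "Cl", "R", "R", "Cl", "S", "S", "S", "Cl"]
temp = ["Co", "Co", "Co", "W", "W", "H", "W", "W", "Co", "Co", "W", "H", "H", "W"]
days = list(zip(precip, temp))
n = len(days)

# Route 1: filter for the cold days, then compute the frequency of rain among them.
cold_days = [p for p, t in days if t == "Co"]
fr_filtered = Fraction(cold_days.count("R"), len(cold_days))

# Route 2: joint frequency divided by marginal frequency.
fr_rain_and_cold = Fraction(sum(1 for p, t in days if p == "R" and t == "Co"), n)
fr_cold = Fraction(len(cold_days), n)
fr_ratio = fr_rain_and_cold / fr_cold

print(fr_filtered, fr_ratio)  # 2/5 2/5
```

Filtering and dividing agree because the record length $n$ cancels, exactly as in the algebra above.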

Compare these statements to what we wrote for equally likely outcomes. They are identical, up to substituting frequency for probability. Since frequencies should match probabilities on long trials (in the limit as $n$ goes to infinity), we've just derived the general definition for conditional probability:

$$\text{Pr}(B|A) = \frac{\text{Pr}(B,A)}{\text{Pr}(A)} $$

In our example, $\text{Pr}(\text{R}|\text{Co}) = (2/10)/(6/10) = 2/6$ exactly as we predicted by normalizing.
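
To see the long-run claim in action, we can simulate a much longer record. This is a rough sketch under a simplifying assumption that is stronger than the ones stated earlier: each day is drawn independently from the joint table. The seed and the record length are arbitrary choices.

```python
import random

# Joint probabilities from the weather table, keyed by (temperature, precipitation).
joint = {
    ("Cold", "Rain"): 0.2, ("Cold", "Clouds"): 0.3, ("Cold", "Sun"): 0.1,
    ("Warm", "Rain"): 0.1, ("Warm", "Clouds"): 0.0, ("Warm", "Sun"): 0.1,
    ("Hot", "Rain"): 0.0, ("Hot", "Clouds"): 0.0, ("Hot", "Sun"): 0.2,
}

rng = random.Random(0)  # fixed seed so the run is repeatable
outcomes = list(joint)
weights = [joint[o] for o in outcomes]

# Assumption: days are independent draws from the joint distribution.
record = rng.choices(outcomes, weights=weights, k=100_000)

n_cold = sum(1 for t, p in record if t == "Cold")
n_rain_and_cold = sum(1 for t, p in record if t == "Cold" and p == "Rain")

fr_rain_given_cold = n_rain_and_cold / n_cold
print(fr_rain_given_cold)  # should land close to 2/6 = 0.333...
```

With a record this long, the conditional frequency should sit much closer to $2/6$ than the $2/5$ we found from five cold days.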

If we want probabilities to match long run frequencies, this is the only sensible definition. You should remember it, "conditional equals joint over marginal." Then, you should always make sure that you *divide by the marginal that matches the event you conditioned on.* If you forget what to divide by, either:

1. Go back to the normalization argument and check that you divide by the quantity needed to normalize the conditional distribution, or
1. Remember that we are filtering for only the outcomes that are consistent with the conditioning statement, so need to reduce the size of the outcome space, or length of the record, by the fraction of the outcome space/length of record consistent with the conditioning statement. We filtered for days when it was cold, not days when it rained, to find the conditional probability it rains when it is cold.

## Conditioning Preserves Odds

The definition, conditional equals joint over marginal, can be justified without referencing frequencies. We can, instead, generalize the rule we introduced for equally likely outcomes. We said before that, if two outcomes are equally likely before conditioning and both are consistent with the conditioning statement, then they should be equally likely after conditioning.

More generally, we might require that the relative likelihood of two outcomes is unchanged by conditioning if both outcomes are consistent with the conditioning statement. For instance, if it is twice as likely to rain and be cold as to be sunny and be cold, we should expect that it is twice as likely to rain if it is cold as to be sunny if it is cold.

The argument provided above is a statement about odds. The odds of two events is a comparison of their likelihood. Specifically, the odds $A$ happens relative to $B$ is defined:

$$\text{Odds}(A;B) = \frac{\text{Pr}(A)}{\text{Pr}(B)} $$

So, if $\text{Odds}(A;B) = 2$ then $A$ is twice as likely as $B$.

It turns out that, if we know all of the odds for a set of outcomes, then we also know their probabilities. For instance, suppose that there are three possible outcomes, $\{a,b,c\}$ and $a$ is twice as likely as $b$ and three times as likely as $c$. Then the list of their probabilities must be proportional to the list $[6,3,2]$. This is a list of nonnegative numbers, so there is only one matching distribution. The only matching distribution is the distribution we recover by normalizing:

$$[p_a,p_b,p_c] = \frac{1}{6 + 3 + 2}[6,3,2] = [6/11,3/11,2/11]$$

So, knowing the odds is the same thing as knowing the distribution. 
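
The step from odds to probabilities can be written out as a short sketch. It assumes we are given the odds of one reference outcome against every outcome (here $a$, with $\text{Odds}(a;a) = 1$); the function name `distribution_from_odds` is hypothetical:

```python
from fractions import Fraction


def distribution_from_odds(odds_of_reference):
    """Recover probabilities from Odds(ref; x) = Pr(ref) / Pr(x) for every outcome x."""
    # Pr(x) is proportional to 1 / Odds(ref; x).
    weights = {x: 1 / Fraction(odds) for x, odds in odds_of_reference.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# a is twice as likely as b and three times as likely as c.
odds_of_a = {"a": 1, "b": 2, "c": 3}
dist = distribution_from_odds(odds_of_a)

for outcome, prob in dist.items():
    print(outcome, prob)  # a 6/11, then b 3/11, then c 2/11
```

The weights $[1, 1/2, 1/3]$ are proportional to $[6, 3, 2]$, so normalizing either list gives the same distribution.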

Here's a challenge problem: *show that, if all odds for all pairs of outcomes consistent with the conditioning statement are unchanged by conditioning, then conditional probability must equal joint probability normalized by marginal probability.*

In other words, if the relative likelihood of two events that are consistent with the condition is unchanged by conditioning, then it must be true that conditional distributions are recovered from joint distributions by:

1. isolating the appropriate row or column of the joint probability table
1. dividing all joint entries by their sum, which is the associated marginal
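
The sketch below is a numerical check of the easy direction only, not a solution to the challenge: it confirms that normalizing the cold row leaves every pairwise odds unchanged. The dictionary layout is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Joint probabilities for the cold row of the weather table.
joint_cold = {
    "Rain": Fraction(2, 10),
    "Clouds": Fraction(3, 10),
    "Sun": Fraction(1, 10),
}

# Conditional equals joint over marginal.
marginal_cold = sum(joint_cold.values())
conditional = {k: v / marginal_cold for k, v in joint_cold.items()}

odds_preserved = True
for x, y in combinations(joint_cold, 2):
    odds_before = joint_cold[x] / joint_cold[y]    # odds using joint probabilities
    odds_after = conditional[x] / conditional[y]   # odds after conditioning on cold
    print(x, y, odds_before, odds_after)
    odds_preserved = odds_preserved and (odds_before == odds_after)

print(odds_preserved)  # True
```

The marginal cancels in every ratio, which is why dividing by it cannot change the odds.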

## Conditional Distributions

Let's practice using this rule. Here's the weather example again:

1. Write down the joint table:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$
Warm  | $1/10 $   | $0$ | $1/10$ | $2/10$
Hot   | $0$   | $0 $ | $2/10$ | $2/10$
Marginals | $3/10$ | $3/10$ | $4/10$ | 1

2. Filter for only the cold events:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$

3. Normalize by the marginal:

Event | Rain  | Clouds  | Sun    
:----:|:-------------|:-------------|:-------------
Cold  | $2/6$ | $3/6$ | $1/6$ 


Notice that the conditional distribution is proportional to the list of joint probabilities in the isolated row. This is a nice visual rule of thumb. If you have a joint table, and want the conditionals, just look up the appropriate row or column and scale it.

For instance, if we conditioned on sun, we'd isolate the column:

Event | Sun    | 
:----:|:-------------|
Cold  | $1/10$ |
Warm  |  $1/10$ | 
Hot   | $2/10$ | 
Marginals | $4/10$ | 

Then rescale to find the conditional probabilities:

Event | Sun    | 
:----:|:-------------|
Cold  | $1/4$ |
Warm  |  $1/4$ | 
Hot   | $2/4$ | 
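
The row and column recipes can be collected into one short sketch. It assumes the joint table is stored as a list of rows (temperatures) by columns (precipitation); the helper names `condition_on_row` and `condition_on_column` are hypothetical:

```python
from fractions import Fraction as F

temps = ["Cold", "Warm", "Hot"]    # row labels
skies = ["Rain", "Clouds", "Sun"]  # column labels

# Joint table without the marginals; rows follow temps, columns follow skies.
joint = [
    [F(2, 10), F(3, 10), F(1, 10)],
    [F(1, 10), F(0), F(1, 10)],
    [F(0), F(0), F(2, 10)],
]


def condition_on_row(table, i):
    """Isolate row i, then divide each entry by the row sum (its marginal)."""
    row = table[i]
    marginal = sum(row)
    return [p / marginal for p in row]


def condition_on_column(table, j):
    """Isolate column j, then divide each entry by the column sum (its marginal)."""
    column = [row[j] for row in table]
    marginal = sum(column)
    return [p / marginal for p in column]


given_cold = condition_on_row(joint, temps.index("Cold"))
given_sun = condition_on_column(joint, skies.index("Sun"))

print(given_cold)  # [Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)]
print(given_sun)   # [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
```

Both helpers would fail with a division by zero if the chosen row or column had marginal zero, which matches the fact that we cannot condition on an event that never happens.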

# The Multiplication Rule

% rearrange conditional definition to give the multiplication rule

% A then A example

% expand to general chain rule

## Reasoning with Sequences

% outcome trees example

# Bayes Rule

 % set up problem

 % match joint

 % derive Bayes

 ## Example: Base Rate Neglect

 % set up problem

 % distinguish likelihood from posterior

 % the base rate matters

# Independence and Dependence

% what does it mean for two events to be unrelated?

% it means that knowing one tells us nothing about the chance of the other

% define independence as conditional is invariant to conditioning, = marginal

% dependent otherwise

% show that, if independent, then probabilities multiply

% A then S example