---
numbering:
  title:
    offset: 1
---

(ch1.5)=
# Conditional Probability

[Section 1.3](ch1.3) and [1.4](ch1.4) establish rules for "*not*", "*or*", and "*and*" statements. However, we didn't really finish the job for "*and*" statements. We showed how to organize joint probabilities, and how to use the rules for "or" to relate joints and marginals, but we didn't derive any new rules that help us compute joint probabilities directly. We didn't answer the question, *how is $\text{Pr}(A,B)$ related to $\text{Pr}(A)$ and $\text{Pr}(B)$?*

In this section we'll see that, to work out the probability that $A$ and $B$ happen, it is easier to first work out the probability that $A$ happens *if* $B$ happens (or vice versa). The probability that $A$ happens if $B$ happens is a conditional probability. We call it a **conditional** probability since the statement *conditions* on some other outcome, i.e. adds an additional condition that restricts the outcome space.

:::{note} Conditional Probability Definition
A **conditional probability** is the probability one event occurs, given that another occurs.
:::

## If Statements and Conditional Probability

What is the probability that it rains tomorrow *if* the weather is cold?

Suppose that:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$
Warm  | $1/10 $   | $0$ | $1/10$ | $2/10$
Hot   | $0$   | $0 $ | $2/10$ | $2/10$
Marginals | $3/10$ | $3/10$ | $4/10$ | 1

Then, when we condition on the assumption that it is cold, we are rejecting the possibility that it is warm or hot. In essence, we are *restricting our outcome space*. If it is cold, then it cannot be warm or hot, so any event in the sets "warm" or "hot" is not possible after conditioning. So, when we condition, we can, in essence, drop the bottom two rows of the table:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$

Unlike the operations "not", "or", and "and", which act on the definition of the event, the logical operation "if" acts on the space of possible outcomes, $\Omega$. The first three operations change the composition of the event; "if" changes the list of outcomes that could occur. As a result, *conditioning* will change both the numerator *and* the denominator when we equate probability to proportion or frequency. All other operations only act on the numerator.

## Normalization

Take a look at the conditioned table. All of the numbers in the table are nonnegative and less than one, so they could be chances. However, they don't add to one, so they fail to form a distribution. The marginal, at the far right, is $6/10$, not 1, so the list $[2/10, 3/10, 1/10]$ can't define a full distribution.

There's an easy fix here. The list of joints adds to $6/10$, so, if we scale them all by $10/6$, they'll add to 1:

$$\frac{10}{6} \left(\frac{2}{10} + \frac{3}{10} + \frac{1}{10} \right) = \frac{2}{6} + \frac{3}{6} + \frac{1}{6} = 1 $$

So, we can get a valid distribution if we divide the joint entries of the row by their sum. Notice that this is the same as putting all elements of the row over a common denominator, ignoring that denominator, then replacing it with the sum of the numerators:

$$[0.2, 0.3, 0.1] \rightarrow [2/10, 3/10, 1/10] \rightarrow [2, 3, 1] \rightarrow 2 + 3 + 1 = 6 \rightarrow [2/6, 3/6, 1/6]. $$

It turns out that this operation will work for any list of nonnegative numbers with a finite, positive sum. If we have a list $[p_1,p_2,p_3, ..., p_n]$ where $p_j \geq 0$ for all $j$ and at least one $p_j > 0$, then:

$$\frac{1}{\sum_{j=1}^n p_j} [p_1,p_2,p_3, ..., p_n] $$

is a valid categorical distribution. This operation is called **normalization** since it rescales the entries to make sure they are normalized (add to 1).

:::{tip} Normalization
To normalize:

1. Add up all of the values
1. Divide each element in the list by the sum.
:::
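
The two steps in the tip can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `normalize` is a hypothetical helper, and exact fractions are used (an assumption made for readability) so that the results match the hand calculation.

```python
from fractions import Fraction


def normalize(values):
    """Rescale a list of nonnegative values so that they add to 1."""
    values = list(values)
    if any(v < 0 for v in values):
        raise ValueError("all values must be nonnegative")
    total = sum(values)  # step 1: add up all of the values
    if total == 0:
        raise ValueError("the values must have a positive sum")
    return [Fraction(v) / total for v in values]  # step 2: divide by the sum


# The "Cold" row of the joint table: rain, clouds, sun.
cold_row = [Fraction(2, 10), Fraction(3, 10), Fraction(1, 10)]
conditional = normalize(cold_row)
print(conditional)  # [Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)]
```

`Fraction` reduces automatically, so $2/6$ and $3/6$ print as $1/3$ and $1/2$.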

## Conditional Probability

### When Outcomes are Equally Likely

While we could normalize our list $[2/10, 3/10, 1/10]$ to make a valid categorical distribution, it is not clear that we should. Why would normalizing by the marginal correctly return the conditional probabilities?

To answer this question we need a probability model that directs our calculation. Without a model, we could define conditional probabilities however we like. With a model, conditional probabilities have to behave in a sensible way.

Our first probability model is probability as proportion. If all outcomes are equally likely, then the probability of an event is the number of ways it can occur divided by the number of possible outcomes. In other words, the probability of an event is the proportion of the outcome space contained in the event. We can use this model to define conditional probability for equally likely events.

Think again about what "if" does to our model. When we condition, we are restricting the set of possible outcomes. For instance, in the weather example, we reject all outcomes where the temperature is warm or hot. If we roll a die, and condition on an even roll, then we are rejecting all odd outcomes. 

:::{hint} Equally Likely Events Remain Equally Likely If Possible
Here's a new idea: *if two outcomes are equally likely before conditioning, and are both consistent with the conditioning statement, so remain in the outcome space, they should still be equally likely after conditioning.* Let's spell that out for some examples:

- I roll a fair die. All outcomes $\{1,2,3,4,5,6\}$ are equally likely. If the die roll is even, then all outcomes $\{2,4,6\}$ are equally likely.
- I toss two coins. All outcomes $\{HH,HT, TH, TT\}$ are equally likely. If the tosses match, then all outcomes $\{HH,TT\}$ are equally likely.
:::

Since we have a rule that assigns chances to outcomes when the outcomes are equally likely, we can compute conditional probabilities using this rule:

- The probability a fair die lands on a 2  given that the roll is even: 
$$\text{Pr}(\{2\}|\text{ even}) = \frac{|\{2\}|}{|\{2,4,6\} |} = \frac{1}{3}$$
- The probability a fair die lands on a 2 or 4 given that the roll is even: 
$$\text{Pr}(\{2,4\}|\text{ even}) = \frac{|\{2, 4\}|}{|\{2,4,6\} |} = \frac{2}{3}$$
- The probability a fair die roll is at most 3 given that the roll is even: 
$$\text{Pr}(\{1,2,3\}|\text{ even}) = \frac{|\{2 \}|}{|\{2,4,6\} |} = \frac{1}{3}$$
- The probability a fair die roll is equal to 1 given that the roll is even: 
$$\text{Pr}(\{1\}|\text{ even}) = \frac{|\emptyset|}{|\{2,4,6\} |} = 0$$

In other words, the conditional probability of an event $B$ given another event $A$, when outcomes are equally likely is:

$$\text{Pr}(B|A) = \frac{\text{the numb. of ways } B \text{ and } A \text{ can happen}}{\text{the numb. of ways } A \text{ can happen}} = \frac{|B \cap A|}{|A|}. $$

We can rewrite the equation to recover the normalization rule we suggested earlier:

$$\begin{aligned} \text{Pr}(B | A) & = \frac{|\Omega|}{|\Omega|}\frac{|B \cap A|}{|A|} = \frac{|B \cap A|}{|\Omega|} \times \frac{|\Omega|}{|A|} = \frac{\text{Pr}(B,A)}{\text{Pr}(A)}. \end{aligned}$$

:::{hint} A Candidate Rule for Conditional Probability
In other words, *the conditional probability of $B$ given $A$ is the probability $B$ and $A$ happen, divided by the probability $A$ happens*, or, *the conditional probability of $B$ given $A$ is the joint probability of $B$ and $A$, divided by the marginal probability of $A$.*
:::

So, *when outcomes are equally likely*, we can compute conditional probabilities by isolating all outcomes that are consistent with the conditioning statement, then matching probability to proportion in the restricted space. In other words, just normalize the necessary collection of probabilities. 
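
For equally likely outcomes, the counting rule $|B \cap A| / |A|$ translates directly into set operations. The sketch below is only an illustration; the helper name `conditional_probability` is hypothetical, and events are represented as Python sets of die faces.

```python
from fractions import Fraction


def conditional_probability(event, condition):
    """Pr(event | condition) when all outcomes are equally likely."""
    if not condition:
        raise ValueError("cannot condition on an empty event")
    # Count the outcomes in both sets, relative to the restricted outcome space.
    return Fraction(len(event & condition), len(condition))


even = {2, 4, 6}
print(conditional_probability({2}, even))        # 1/3
print(conditional_probability({2, 4}, even))     # 2/3
print(conditional_probability({1, 2, 3}, even))  # 1/3
print(conditional_probability({1}, even))        # 0
```

The outcome space $\Omega$ never appears in the function: once we condition, only the outcomes inside the conditioning event are counted.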

Does the rule $\text{Pr}(B | A) = \frac{\text{Pr}(B,A)}{\text{Pr}(A)}$ work if the underlying outcomes are not equally likely?

Consider our weather example again. We can isolate the appropriate row of the joint table:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$

but we don't know anything about the background outcome space $\Omega$ that produced these joint probabilities. Moreover, trying to spell out a detailed weather model in which all microscopic outcomes are equally likely is far too much work for this problem, and would be impractical in almost all settings. For conditional probability to be useful, we should be able to compute it in categorical settings, without somehow expanding our outcome space. So, while the derivation provided above works for equally likely outcome models, it is too restricted to work for general applications.

### By Conditional Frequency

Thankfully, we have at hand an alternate model, probability as frequency. Recall that the probability an event occurs should be approximated by, and in the long run equal to, the frequency with which it occurs in a sequence of repeated trials. This relation should hold for any valid probability model. So, let's use it to show that the normalization approach correctly computes conditional probabilities.

Consider a long weather record. Say, the weather in Berkeley over the last year. Let's try to find the conditional probability that it rains, given that it is cold, on some future day selected at random. We'll assume that the climate is fixed, the process that produces weather does not change (is stationary), and the process doesn't remember its past forever (e.g. the probability that it rains today given that it rained on this day a century ago is the same as the probability that it rains today). Then the probability of any event should be approximated by the frequency with which the event occurs in the weather record.

To keep our record short we'll use the following visuals:

Event | Rain  | Clouds  | Sun  | Cold | Warm | Hot |
:----:|:---------|:---------|:---------|:---------|:---------|:---------|
Icon | 💧 | ☁️ | ☀️ | 🥶 | 😎 | 🥵 |
Symbol | R | Cl | S | Co | W | H |


Here's an example two-week record:

Day | 1  | 2  | 3  | 4 | 5| 6| 7| 8 | 9 | 10 | 11 | 12 | 13 | 14 |
:----:|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
Precip. | 💧 | ☁️ | ☁️ | ☁️ | ☀️ | ☀️ | ☁️ | 💧 | 💧 | ☁️ | ☀️ | ☀️ | ☀️ | ☁️ |
Temp  | 🥶 | 🥶 | 🥶 | 😎 | 😎 | 🥵 | 😎 | 😎 | 🥶 | 🥶 | 😎 | 🥵 | 🥵 | 😎 |

We can compute frequencies from this record. For example $\text{Fr}(\text{R}) = 3/14 $ and $\text{Fr}(\text{Cl}) = 6/14$ since it rained on 3 days, and was cloudy on 6, out of the last 14. 

We can also use this record to compute joint frequencies. For example $\text{Fr}(\text{R and Co}) = 2/14$ since it was rainy and cold on 2 of the 14 days. These were days 1 and 9.

Here's the good part. We can also compute conditional frequencies from the record. Suppose I wanted to find the *frequency* with which it rained, given that it was cold. Then, I would disregard all days when it wasn't cold, and compute the frequency out of the remaining days. Disregarding the days when it was not cold is equivalent to filtering for only the days when it was cold:

Day | 1  | 2  | 3  | 9 | 10 |
:----:|:---------|:---------|:---------|:---------|:---------|
Precip. | 💧 | ☁️ | ☁️ | 💧 | ☁️ |
Temp  | 🥶 | 🥶 | 🥶 | 🥶 | 🥶 |

Now that we've filtered the record for only cold days, the conditional frequencies are apparent:

$$\text{Fr}(\text{R}|\text{Co}) = 2/5 $$

since it rained on 2 of the 5 days when it was cold. Notice, 2/5 = 0.4 is not a bad estimate of the value we got by normalizing, $2/10 \times 10/6 = 2/6 = 0.3333...$. These two numbers are different because our record of cold days was short, so the frequencies are only rough estimates of the true probabilities.

Nevertheless, the algebra for conditional frequencies is clear, and, should recover the appropriate probabilities if we run enough trials/collect a long enough record. 

Take a look at the frequency calculation again:

$$\text{Fr}(\text{R}|\text{Co}) = \frac{\text{the numb. of times R and Co happened}}{\text{the numb. times Co happened}} $$

This expression looks a lot like what we wrote for equally likely outcomes. All we've done is changed the way we count. Instead of counting ways an outcome can occur, we are counting the number of times it did occur in a sequence. Let $N_{R,Co}$ be the number of times it rained and was cold (2), and $N_{Co}$ be the number of times it was cold (5). Let $n$ be the length of the record (14). Then:

$$\begin{aligned} \text{Fr}(\text{R}|\text{Co}) & = \frac{N_{R, Co}}{N_{Co}} = \frac{n}{n} \times \frac{N_{R, Co}}{N_{Co}} = \frac{N_{R, Co}}{n} \times \frac{n}{N_{Co}} \\ & = \frac{\text{Fr}(R, Co)}{\text{Fr}(Co)}. \end{aligned}$$

:::{hint} Conditional Frequency
In other words, *the conditional frequency of $B$ given $A$ is the frequency with which $B$ and $A$ happen, divided by the frequency with which $A$ happens*, or, *the conditional frequency of $B$ given $A$ is the joint frequency of $B$ and $A$, divided by the marginal frequency of $A$.* 
:::
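
The filtering argument can be checked against the two-week record. The snippet below is a small sketch, assuming the record is transcribed into two Python lists using the symbols from the table (R, Cl, S for precipitation; Co, W, H for temperature). The variable names are illustrative only.

```python
from fractions import Fraction

# Two-week record, days 1 through 14.
precip = ["R", "Cl", "Cl", "Cl", "S", "S", "Cl", "R", "R", "Cl", "S", "S", "S", "Cl"]
temp = ["Co", "Co", "Co", "W", "W", "H", "W", "W", "Co", "Co", "W", "H", "H", "W"]

n = len(precip)                                  # length of the record
n_cold = sum(t == "Co" for t in temp)            # number of cold days
n_rain_and_cold = sum(
    p == "R" and t == "Co" for p, t in zip(precip, temp)
)                                                # number of rainy, cold days

# Conditional frequency by filtering for cold days, then counting.
by_filtering = Fraction(n_rain_and_cold, n_cold)

# Conditional frequency as joint frequency divided by marginal frequency.
by_ratio = Fraction(n_rain_and_cold, n) / Fraction(n_cold, n)

print(by_filtering, by_ratio)  # 2/5 2/5
```

Both routes give the same number because the record length $n$ cancels, exactly as in the algebra above.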

Compare this statement to what we wrote for equally likely outcomes. The two are identical, up to substituting frequency for probability. Since frequencies should match probabilities on long trials (in the limit as $n$ goes to infinity), we've just derived the general definition for conditional probability:

:::{note} Conditional Probability Definition
Given any two events, $A$ and $B$, the **conditional probability** of $B$ given $A$ is:

$$\text{Pr}(B|A) = \frac{\text{Pr}(B,A)}{\text{Pr}(A)} $$
:::

In our example, $\text{Pr}(\text{R}|\text{Co}) = (2/10)/(6/10) = 2/6$ exactly as we predicted by normalizing.

If we want probabilities to match long run frequencies, this is the only sensible definition. You should remember it, "conditional equals joint over marginal." 

:::{warning} Divide by the Appropriate Marginal
:class: dropdown
Always make sure that you *divide by the marginal that matches the event you conditioned on.* If you forget what to divide by, either:

1. Go back to the normalization argument and check that you divide by the quantity needed to normalize the conditional distribution, or
1. Remember that we are filtering for only the outcomes that are consistent with the conditioning statement, so need to reduce the size of the outcome space, or length of the record, by the fraction of the outcome space/length of record consistent with the conditioning statement. We filtered for days when it was cold, not days when it rained, to find the conditional probability it rains when it is cold.
:::

### Conditioning Preserves Odds

The definition, conditional equals joint over marginal, can be justified without referencing frequencies. We can, instead, generalize the rule we introduced for equally likely outcomes. We said before that, if two outcomes are equally likely before conditioning, then, if both outcomes are consistent with the conditioning statement, then they should be equally likely after conditioning. 

More generally, we might require that the relative likelihood of two outcomes is unchanged by conditioning if both outcomes are consistent with the conditioning statement. For instance, if it is twice as likely to rain and be cold than to be sunny and be cold, we should expect that it is twice as likely to rain if it is cold than to be sunny if it is cold. 

The argument provided above is a statement about odds. The odds of two events is a comparison of their likelihood. Specifically, the odds $A$ happens relative to $B$ is defined:

$$\text{Odds}(A;B) = \frac{\text{Pr}(A)}{\text{Pr}(B)} $$

So, if $\text{Odds}(A;B) = 2$ then $A$ is twice as likely as $B$.

It turns out that, if we know all of the odds for a set of outcomes, then we also know their probabilities. For instance, suppose that there are three possible outcomes, $\{a,b,c\}$ and $a$ is twice as likely as $b$ and three times as likely as $c$. Then the list of their probabilities must be proportional to the list $[6,3,2]$. This is a list of nonnegative numbers, so there is only one matching distribution. The only matching distribution is the distribution we recover by normalizing:

$$[p_a,p_b,p_c] = \frac{1}{6 + 3 + 2}[6,3,2] = [6/11,3/11,2/11]$$

So, knowing the odds is the same thing as knowing the distribution. 

:::{tip} Exercise 🛠️
Here's a challenge problem: *show that, if all odds for all pairs of outcomes consistent with the conditioning statement are unchanged by conditioning, then conditional probability must equal joint probability normalized by marginal probability.*
:::

In other words, if the relative likelihood of two events that are consistent with the condition is unchanged by conditioning, then it must be true that conditional distributions are recovered from joint distributions by:

1. isolating the appropriate row or column of the joint probability table
1. dividing all joint entries by their sum, which is the associated marginal
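
As a quick sanity check of the easy direction (normalizing a row leaves the odds inside that row unchanged), here is a short sketch using the cold row of the weather table. It does not solve the exercise, which asks for the converse; the dictionary names are illustrative.

```python
from fractions import Fraction

# Joint probabilities in the "Cold" row of the weather table.
joint_cold = {
    "Rain": Fraction(2, 10),
    "Clouds": Fraction(3, 10),
    "Sun": Fraction(1, 10),
}

marginal_cold = sum(joint_cold.values())  # 6/10
conditional = {k: p / marginal_cold for k, p in joint_cold.items()}

# Odds of rain relative to sun, before and after conditioning on cold.
odds_before = joint_cold["Rain"] / joint_cold["Sun"]
odds_after = conditional["Rain"] / conditional["Sun"]
print(odds_before, odds_after)  # 2 2
```

Dividing every entry by the same marginal cannot change any ratio between entries, which is why the odds survive.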

### Conditional Distributions

Let's practice using this rule. Here's the weather example again:

1. Write down the joint table:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$
Warm  | $1/10 $   | $0$ | $1/10$ | $2/10$
Hot   | $0$   | $0 $ | $2/10$ | $2/10$
Marginals | $3/10$ | $3/10$ | $4/10$ | 1

2. Filter for only the cold events:

Event | Rain  | Clouds  | Sun    | Marginals
:----:|:-------------|:-------------|:-------------|:-------------
Cold  | $2/10$ | $3/10$ | $1/10$ | $6/10$

3. Normalize by the marginal:

Event | Rain  | Clouds  | Sun    
:----:|:-------------|:-------------|:-------------
Cold  | $2/6$ | $3/6$ | $1/6$ 


Notice that the conditional distribution is proportional to the list of joint probabilities in the isolated row. This is a nice visual rule of thumb. If you have a joint table, and want the conditionals, just look up the appropriate row or column and scale it.

For instance, if we conditioned on sun, we'd isolate the column:

Event | Sun    | 
:----:|:-------------|
Cold  | $1/10$ |
Warm  |  $1/10$ | 
Hot   | $2/10$ | 
Marginals | $4/10$ | 

Then rescale to find the conditional probabilities:

Event | Sun    | 
:----:|:-------------|
Cold  | $1/4$ |
Warm  |  $1/4$ | 
Hot   | $2/4$ | 
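
The "look up a row or column, then scale it" recipe can be sketched as a small function over the joint table. This is one possible encoding, not the notes' own: the table is stored as a dictionary keyed by (temperature, precipitation) pairs, and `condition_on` is a hypothetical helper name.

```python
from fractions import Fraction as F

# Joint table: keys are (temperature, precipitation).
joint = {
    ("Cold", "Rain"): F(2, 10), ("Cold", "Clouds"): F(3, 10), ("Cold", "Sun"): F(1, 10),
    ("Warm", "Rain"): F(1, 10), ("Warm", "Clouds"): F(0), ("Warm", "Sun"): F(1, 10),
    ("Hot", "Rain"): F(0), ("Hot", "Clouds"): F(0), ("Hot", "Sun"): F(2, 10),
}


def condition_on(joint, axis, value):
    """Conditional distribution of the other variable.

    axis=0 conditions on temperature (isolates a row);
    axis=1 conditions on precipitation (isolates a column).
    """
    # Step 1: isolate the row or column consistent with the condition.
    selected = {key[1 - axis]: p for key, p in joint.items() if key[axis] == value}
    # Step 2: divide by the sum of the isolated entries, which is the marginal.
    marginal = sum(selected.values())
    if marginal == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return {k: p / marginal for k, p in selected.items()}


given_cold = condition_on(joint, 0, "Cold")
given_sun = condition_on(joint, 1, "Sun")
print(given_cold)  # Rain 1/3, Clouds 1/2, Sun 1/6
print(given_sun)   # Cold 1/4, Warm 1/4, Hot 1/2
```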

## The Multiplication Rule

Now that we know how to handle "if" statements, we can go back to our original aim, understanding "and" statements. Consider the definition of a conditional probability:

$$\text{Pr}(B|A) = \frac{\text{Pr}(A,B)}{\text{Pr}(A)} $$

Rearranging the expression gives our next fundamental probability rule:

:::{note} Rules of Chance
6. **The Multiplication Rule:** Given any two events $A$ and $B$:

$$\text{Pr}(A,B) = \text{Pr}(A \text{ and } B) = \text{Pr}(A \cap B) = \text{Pr}(A) \times \text{Pr}(B|A) $$

You should read this rule, *the probability $A$ and $B$ happen is the probability $A$ happens times the probability $B$ happens given that $A$ happened.* Alternately, *joint equals marginal times conditional.*
:::

This rule is sensible. Suppose that four in every ten days are sunny, and half of all sunny days are hot. Then it is sensible that the fraction of all days that are both hot and sunny should equal the fraction of all days that are sunny, times the fraction of all sunny days that are hot. This is precisely the calculation we performed to find the probability that a sunny day is hot, run in reverse:

$$\text{Pr}(\text{S}, \text{H}) = \text{Pr}(\text{S}) \times \text{Pr}(\text{H}|\text{S}) = \frac{4}{10} \times \frac{2}{4} = \frac{2}{10}. $$

The multiplication rule should be used in the same fashion as the addition rule or the complement rule...

1. You are asked for the probability of some event,
1. that event is naturally expressed as a joint event or intersection (it can be expanded as a sequence of "and" statements) where,
1. the individual parts are each simpler to work with.

In particular, look for examples where the event is best described with a conditional sequence. Then you can find the joint probability by simply walking through the sequence.

:::{tip} Example
:class: dropdown
Consider, for example, the probability that we draw two aces in a row, when we draw two cards from a thoroughly shuffled deck. We worked this probability out by counting in [Section 1.2](ch1.2). There, we computed:

$$\text{Pr}(AA) = \frac{4 \times 3 \times 50!}{52!} = \frac{4 \times 3}{52 \times 51} $$

We noticed that the 50! term cancelled because the order of the last 50 cards in the deck didn't matter, but, we didn't have an organized way of calculating this probability without counting the number of two card hands that are a pair of aces, relative to the number of possible two card hands. 

Notice that:

$$\text{Pr}(AA) = \frac{4 \times 3}{52 \times 51} = \frac{4}{52} \times \frac{3}{51} $$

These numbers are suggestive. There are four aces in the deck of 52. After removing an ace, there are three remaining in a deck of 51. So:

$$\text{Pr}(A) = \frac{4}{52}, \quad \text{Pr}(AA|A) = \frac{3}{51} $$

Therefore:

$$\text{Pr}(AA) =  \frac{4}{52} \times \frac{3}{51} = \text{Pr}(A) \times \text{Pr}(AA|A).$$

In short, the chance we draw two aces in a row is the chance our first draw is an ace, times the chance our second draw is an ace if our first draw was an ace.

If you've used the multiplication rule before, this approach might have seemed the most natural to you. Now we know why it is true. Anytime you are inclined to describe an event with a conditional sequence (first $A$ happens, then $B$ happens, then $C$ happens, ...) you should try the multiplication rule. 
:::

Notice: ⚠️ *the multiplication rule does not tell you to directly multiply marginal probabilities. This is a common mistake. Always multiply a marginal with a conditional.* Otherwise, your calculation will be incorrect. For example, the chance of drawing an ace on the first draw is $4/52 = 1/13$ and on the second draw is $4/52 = 1/13$, but the chance of drawing two aces is $1/13 \times 3/51$ not $1/13 \times 1/13$.

:::{important} Generalization
:class: dropdown
The multiplication rule generalizes naturally to longer sequences of events. The probability that $A$ and $B$ and $C$ happen is:

$$\text{Pr}(A,B,C) = \text{Pr}(A) \times \text{Pr}(B|A) \times \text{Pr}(C|B,A) $$

You should read the equation above, *the probability that $A$, $B$, and $C$ all happen is the probability $A$ happens times the probability $B$ happens if $A$ happens, times the probability $C$ happens if $A$ and $B$ both happen.*

For a longer sequence, we get the chain rule:

$$\text{Pr}(A_1,A_2,A_3,...,A_n) = \text{Pr}(A_1) \times \text{Pr}(A_2|A_1) \times \text{Pr}(A_3|A_1,A_2) \times \cdots \times \text{Pr}(A_n|A_1, A_2, ..., A_{n-1}) $$
:::
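
To see the chain rule in action, the sketch below extends the two-ace example to three aces in a row, drawn without replacement. The three-ace case is an added illustration, not an example from the notes, and `chain_rule` is a hypothetical helper that simply multiplies the conditionals in order.

```python
from fractions import Fraction


def chain_rule(conditionals):
    """Multiply Pr(A1), Pr(A2|A1), Pr(A3|A1,A2), ... to get the joint."""
    joint = Fraction(1)
    for p in conditionals:
        joint *= p
    return joint


# First ace: 4 of 52. Second ace given the first: 3 of 51.
two_aces = chain_rule([Fraction(4, 52), Fraction(3, 51)])
# Third ace given the first two: 2 of 50.
three_aces = chain_rule([Fraction(4, 52), Fraction(3, 51), Fraction(2, 50)])

print(two_aces)    # 1/221
print(three_aces)  # 1/5525
```

Each factor is conditioned on everything drawn so far, which is why the numerator and denominator both shrink by one at every step.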

We can now complete our summary table of probability rules:

:::{tip} Logic, Sets, and Algebra

Logical | Set Operation                               | Notation  | Chance Rule |
:----:|:-------------------------------------------------|:-------------|:-------------
not   | complement                            |$^c$ | $ 1 - ...$|
or   | union           |$\cup$ | $+$ if disjoint |
if   | restrict $\Omega$            |  $\mid$ | joint $/$ marginal|
and   | intersect | $\cap$ |marginal $\times$ conditional|     

:::

## Reasoning with Sequences

The multiplication rule, and its extension to sequences of events, gives us a new visual tool for computing probabilities. 

Consider the weather example again. If we rewrite the table thinking first about the marginal chance of temperature, then the conditional chances of precipitation, we can express the probability model:

Event |  Marginal Probability
:----:|:-------------|
Cold  |$6/10$
Warm  | $2/10$
Hot   | $2/10$

and

Event | Rain  | Clouds  | Sun    |
:----:|:-------------|:-------------|:-------------|
if Cold  | $2/6$ | $3/6$ | $1/6$ |
if Warm  | $1/2 $   | $0$ | $1/2$ |
if Hot   | $0$   | $0 $ | $1$ |

This information can then be represented with an **outcome tree**. The outcome tree works like a decision tree. First ask, what is the temperature? Then ask, given the temperature, what is the precipitation? Label each edge in the decision tree with the matching marginal or conditional probability:


![Outcome tree for the weather model.](weather_outcome_tree.png "Outcome tree for the weather model.")

To find the joint probabilities of the events at the far end of the outcome tree, simply multiply the probabilities down the matching path. 

For example:

$$\text{Pr}(\text{W}, \text{R}) = \frac{2}{10} \times \frac{1}{2} = \frac{1}{10}. $$

If you consult the joint table in [Section 1.4](ch1.4), you'll find the same value.
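
Multiplying down every path of the tree rebuilds the whole joint table. The following sketch assumes the tree is stored as a marginal dictionary for temperature plus a nested dictionary of conditionals for precipitation. The names `marginal_temp` and `precip_given_temp` are illustrative.

```python
from fractions import Fraction as F

# First level of the tree: marginal chance of each temperature.
marginal_temp = {"Cold": F(6, 10), "Warm": F(2, 10), "Hot": F(2, 10)}

# Second level: conditional chance of each precipitation, given temperature.
precip_given_temp = {
    "Cold": {"Rain": F(2, 6), "Clouds": F(3, 6), "Sun": F(1, 6)},
    "Warm": {"Rain": F(1, 2), "Clouds": F(0), "Sun": F(1, 2)},
    "Hot": {"Rain": F(0), "Clouds": F(0), "Sun": F(1)},
}

# Joint = marginal times conditional, one product per path through the tree.
joint = {
    (t, p): marginal_temp[t] * cond
    for t, branches in precip_given_temp.items()
    for p, cond in branches.items()
}

print(joint[("Warm", "Rain")])  # 1/10
print(joint[("Cold", "Clouds")])  # 3/10
print(sum(joint.values()))      # 1
```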

## Bayes Rule

How would we find the conditional probability that it is warm if it rains?

Notice that the outcome diagram sketched above does not provide this conditional directly. Nor does the specification:

Event |  Marginal Probability
:----:|:-------------|
Cold  |$6/10$
Warm  | $2/10$
Hot   | $2/10$

and

Event | Rain  | Clouds  | Sun    |
:----:|:-------------|:-------------|:-------------|
if Cold  | $2/6$ | $3/6$ | $1/6$ |
if Warm  | $1/2 $   | $0$ | $1/2$ |
if Hot   | $0$   | $0 $ | $1$ |


Nevertheless, we can always find the desired conditional by first solving for the appropriate joint and marginal, then scaling the joint by the marginal. In many ways, this procedure is the same as what we've done before, only we start with different information.

Suppose that we know $\text{Pr}(A)$ and the conditionals given $A$, $\text{Pr}(B|A)$ and $\text{Pr}(B|A^c)$. How can we find $\text{Pr}(A|B)$?

Well, let's use our rules. 

1. Always start from what you need to find. By definition:

    $$\text{Pr}(A|B) = \frac{\text{Pr}(A,B)}{\text{Pr}(B)} $$

2. Let's find the joint probabilities. If we have a complete joint probability table then we can find any conditionals we want. To find the conditional probability of $A$ given $B$ we need the joint $\text{Pr}(A,B)$. We can find it by multiplying down the appropriate path in the outcome tree:

$$\text{Pr}(A,B) = \text{Pr}(B,A) = \text{Pr}(A) \times \text{Pr}(B|A) $$

3. We now have two ways of expressing the joint. Both are valid applications of the multiplication rule:

$$\text{Pr}(A,B) = \text{Pr}(B,A) = \begin{cases} & \text{Pr}(A) \times \text{Pr}(B|A) \\
& \text{Pr}(B) \times \text{Pr}(A|B) \end{cases}$$

We can compute the top line, and want the last term in the bottom line. Since the two lines return the same joint they are equal, and:

$$\text{Pr}(A|B) = \frac{\text{Pr}(A,B)}{\text{Pr}(B)} = \frac{\text{Pr}(A) \times \text{Pr}(B|A)}{\text{Pr}(B)} $$

4. By assumption, we know all the values in the numerator. That leaves the denominator. The denominator is a marginal, so we can always expand it as we did in [Section 1.4](ch1.4):

$$\text{Pr}(B) = \text{Pr}(A,B) + \text{Pr}(A^c,B) $$

Then, since both terms on the right hand side are joint probabilities, we can find them with the multiplication rule:

$$\text{Pr}(B) = \text{Pr}(A) \times \text{Pr}(B|A) + \text{Pr}(A^c) \times \text{Pr}(B|A^c) $$

Putting the numerator and denominator together gives **Bayes rule**:

:::{note} Rules of Chance
7. **Bayes Rule:** Given $\text{Pr}(A)$, $\text{Pr}(B|A)$, and $\text{Pr}(B|A^c)$:

$$\text{Pr}(A|B) = \frac{\text{Pr}(A) \text{Pr}(B|A)}{\text{Pr}(B)} =  \frac{\text{Pr}(A) \text{Pr}(B|A)}{\text{Pr}(A) \text{Pr}(B|A) + \text{Pr}(A^c) \text{Pr}(B|A^c) }$$
:::
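
The rule is short enough to write as a function of its three inputs. In this sketch the function name `bayes` and its argument names are hypothetical. The test case takes $A$ to be "warm" and $B$ to be "rain" in the weather model, where $\text{Pr}(\text{R}|\text{W}^c) = (2/10)/(8/10) = 1/4$ follows from the joint table.

```python
from fractions import Fraction as F


def bayes(prior_a, b_given_a, b_given_not_a):
    """Pr(A | B) from Pr(A), Pr(B | A), and Pr(B | A^c)."""
    joint_a_b = prior_a * b_given_a                # Pr(A, B)
    joint_not_a_b = (1 - prior_a) * b_given_not_a  # Pr(A^c, B)
    marginal_b = joint_a_b + joint_not_a_b         # Pr(B)
    if marginal_b == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return joint_a_b / marginal_b                  # joint over marginal


# A = warm, B = rain: Pr(W) = 2/10, Pr(R | W) = 1/2, Pr(R | not W) = 1/4.
warm_given_rain = bayes(F(2, 10), F(1, 2), F(1, 4))
print(warm_given_rain)  # 1/3
```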

It is usually more helpful to think about Bayes rule in stages. First, find all the joint probabilities by multiplying down the paths of the outcome tree that point to an event where $B$ occurs. Then, find the marginal probability that $B$ occurs by summing over the joint probabilities. Finally, take the ratio of joint to marginal that recovers the desired conditional.

Here are two examples:

:::{tip} Warm if Rainy
:class: dropdown

*What is the chance it was warm if it rained?*

1. Calculate the joint that appears in the numerator:

$$\text{Pr}(\text{W},\text{R}) = \text{Pr}(\text{W}) \text{Pr}(\text{R}|\text{W}) = 2/10 \times 1/2 = 1/10 $$

1. Find the necessary marginal:
    - Expand the marginal by partitioning, then apply the addition rule for disjoint sets:

    $$\text{Pr}(\text{R}) = \text{Pr}(\text{R},\text{Co}) + \text{Pr}(\text{R},\text{W}) + \text{Pr}(\text{R},\text{H})  $$

    - Find each joint using the multiplication rule:

    $$\begin{aligned} & \text{Pr}(\text{R},\text{Co}) = \text{Pr}(\text{Co}) \text{Pr}(\text{R}|\text{Co}) = 6/10 \times 2/6 = 2/10\\
    &  \text{Pr}(\text{R},\text{W}) = \text{Pr}(\text{W}) \text{Pr}(\text{R}|\text{W}) = 2/10 \times 1/2 = 1/10 \\
    &  \text{Pr}(\text{R},\text{H}) = \text{Pr}(\text{H}) \text{Pr}(\text{R}|\text{H}) = 2/10 \times 0 = 0 \end{aligned} $$

    - Therefore:

    $$\text{Pr}(\text{R}) = \frac{2}{10} + \frac{1}{10} = \frac{3}{10} $$

1. Apply the definition of conditional probability (plug in):

    $$\text{Pr}(\text{W}|\text{R})  = \frac{1/10}{3/10} = \frac{1}{3} $$

    - So, if it rained, then there is a 1/3 chance it was warm. 

To check our answer, look up the column of the joint table where it rained:

Event | Rain  |
:----:|:------|
Cold  | $2/10$ |
Warm  | $1/10 $ |
Hot   | $0$ |
Marginals | $3/10$ |

Normalizing by the marginal produces the same calculation. Alternately, it is twice as likely to be cold if it rains than warm if it rains, and it is never hot, so the chance it is warm, if it rains, is one third.
:::

:::{tip} Ace On First Draw If Ace on Second
:class: dropdown

*If I pull two cards from a thoroughly shuffled deck, and the second card was an ace, what is the chance the first card was an ace?* 

Before we solve this problem, take a moment to think about what knowing the second card was an ace should tell us about the first card. If we know nothing, the probability the first card was an ace is $4/52 = 1/13$ since there are 4 aces among the 52 cards. However, if we pull an ace on the first draw, then we are less likely to draw an ace on our second draw. It stands to reason that, if we did draw an ace on our second draw, the chance we drew an ace first should be less than $1/13$. Let's check...

1. Calculate the joint that appears in the numerator:

    $$\text{Pr}(AA) = \text{Pr}(A) \times \text{Pr}(AA|A) = \frac{4}{52} \times \frac{3}{51} = \frac{1}{13} \times \frac{3}{51} $$

2. Calculate the necessary marginal. In this case we can do it directly. Knowing nothing about the first draw, the chance we draw an ace on our second draw is $4/52 = 1/13$:

    $$\text{Pr}(\text{second draw is an } A) = \text{Pr}( \underline{~~~} A) = \frac{4}{52} = \frac{1}{13} $$

3. Apply the definition of conditional probability (plug in):

    $$\text{Pr}(A|\underline{~~~}A) = \frac{\text{Pr}(AA)}{\text{Pr}(\underline{~~~}A)} = \frac{(1/13) \times (3/51)}{(1/13)} = \frac{3}{51} $$

Then, as expected $\text{Pr}(A|\underline{~~~}A) = 3/51 < 4/52 = \text{Pr}(A) $. In other words, learning that the second draw is an ace decreases the chance the first draw was an ace.
:::
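
Since every ordered pair of cards is equally likely, the answer can also be checked by brute-force counting. The sketch below is an independent check, not part of the original notes. It labels the cards 0 through 51 and, as an arbitrary assumption, treats cards 0 to 3 as the aces.

```python
from fractions import Fraction
from itertools import permutations

deck = range(52)
aces = {0, 1, 2, 3}  # arbitrary labelling: which four cards count as aces

# All ordered two-card draws without replacement; each is equally likely.
draws = list(permutations(deck, 2))

# Filter for draws consistent with the condition: the second card is an ace.
second_is_ace = [d for d in draws if d[1] in aces]
both_aces = [d for d in second_is_ace if d[0] in aces]

pr_second_ace = Fraction(len(second_is_ace), len(draws))
pr_first_given_second = Fraction(len(both_aces), len(second_is_ace))

print(pr_second_ace)          # 1/13
print(pr_first_given_second)  # 1/17
```

Note that $1/17 = 3/51$, in agreement with the calculation in the example.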

:::{caution} Problems in the Philosophy of Chance
:class: dropdown
The second example shows something odd about conditional probability. If we draw cards in sequence, the possible second cards drawn depend on the outcome of the first draw, but the second card drawn cannot influence the first card drawn since it was drawn second. Yet, observing the second draw can change what we know about the first draw. While the second draw cannot influence the first draw in a causal way, it can change the conditional chance the first card was an ace. 

This distinction between a causal relation, and an informational relation, may seem reasonable here, but it points to some of the deepest questions in probability. *Is probability a model for information, and how we learn from evidence?* If so, then there is no contradiction in the conditional statement that we can gain information about the first card from the second. If instead, *probability is a model for truly random events* then we are in a pickle since, by the time we are drawing the second card, the first card must be drawn, so cannot be random. We might not know its value (imagine drawing it but leaving it face down), but it is either an ace or not. It is not random. You could even imagine asking a friend to draw both cards and tell you only the value of the second. Your friend *knows* whether the first card is an ace or not. How then can we discuss its conditional probability? Does it make sense to say the value of the first draw is random simply because you don't know it?

These distinctions are an important part of the philosophical debate between two famous camps of statisticians, the Bayesians, who adopt probability as a self-consistent language for expressing uncertainty and for learning from information, and the Frequentists, who adopt a more restrictive definition that requires an empirical relation to frequency in a series of experiments. It also points to a thorny definition issue that we've glossed over... *what do we mean when we say a process is random?* 

We'll save this discussion for later. In the end, the more practical issue between the Frequentists and Bayesians concerns uses and misuses of modeling anyway. If you are curious, come talk to the Professor.

Keep this example in mind when you work on the Monty Hall problem in your homework this week.
::: 

### Example: Base Rate Neglect

Let's look at a practical problem where the Bayesian approach is necessary. 

Suppose you are subject to a medical test that is designed to determine whether or not you have some medical condition. For example, you take a Covid or Flu test. The result of the test is important, since it will impact your behavior. For example, if you are sick, you might decide to stay home, or may invest in a medical intervention which is costly. 

No test is perfectly accurate. In essentially all cases a test could predict that a healthy patient is sick, or that a sick patient is healthy. Let $H$ denote the event that the recipient is healthy, $S$ the event they are sick, $N$ the event that the test returns negative (predicts healthy), and $P$ the event that the test returns positive (predicts sick). Then, there are four possible outcomes. We can arrange them just like we did a joint probability table:

Event | N  | P  |
:----:|:-------------|:-------------|
H  | TN | FP |
S  | FN  | TP |

Here the labels TN, TP, FN, FP combine (true/false) with (positive/negative). Notice that there are two ways the test can make a mistake. Either it falsely predicts positive or falsely predicts negative. Both rates matter. False positives are costly and can be harmful to the recipient if they take actions assuming they are sick. At worst, a false positive can lead to unnecessary medical intervention. False negatives are dangerous, since the recipient may act as if they are healthy, so may risk others' health, or not take medical action that could address their condition.

It is standard in test design to control the false positive rate, that is, the conditional probability that the test returns positive if the truth is negative (i.e. the patient is healthy). The smaller the false positive rate, the more significant a positive result, and the more specific the procedure. The other error rate, the false negative rate, is the probability the test misses a sick patient. It controls the power of the test (its ability to detect sick patients) and its sensitivity (how sensitive it is to evidence of disease).

Let's name these probabilities:

$$\text{Pr}(P|H) = p_{FP}, \quad \text{Pr}(N|S) = p_{FN} $$

Suppose that:

$$\text{Pr}(P|H) = p_{FP} = 0.05, \quad \text{Pr}(N|S) = p_{FN} = 0.01$$

This looks like a good test. It is reasonably selective/specific, since it only makes a false detection for 5 percent of healthy patients. It's also pretty powerful/sensitive. It only misses a true detection in 1 percent of sick patients. For reference, the specificity of a mammogram is 90%, so about 10% of healthy women are falsely flagged for breast cancer. The sensitivity of mammograms is 87%, so they have a false negative rate of 0.13.

Now suppose you take the test, and it returns positive. Should you act as if you are sick? What is the probability you were a false positive, and are actually healthy? In other words, what is $\text{Pr}(H|P)$?

:::{caution} Problems in the Philosophy of Chance
:class: dropdown
So far, we don't have enough information to answer the question. We have the conditional probabilities of the test outcomes given the patient's health, but not the conditional probabilities of the patient's health given the test outcome. To find the conditional probability of the patient's health given the test outcome, we need a joint probability model that can assign joint probabilities to every entry in the table. This means that we must model the patient's health as random.

Here again, we hit a philosophical snag. If we believe that people can be sick or healthy, then at any given time, an individual is sick or healthy. Whether they are sick or not is not determined randomly when they happen to take a test. So, while they may not know whether they are sick, there is a fixed ground truth answer. The idea that there is a fixed ground truth answer is baked into the "true" and "false" labels we used to describe the error rates.

Yet, we're really asking a question about information and uncertainty. What should the patient learn from the test result, and how should they behave in response?

Here it's reasonable to say that they could treat their health status as if it were random. For an explicit example, imagine that you are not a patient, but are instead a doctor, or national health service, that applies the test and recommends action. You don't apply the test once, you apply it many times. So, you want it to give accurate advice for most patients, i.e. *on average*. When you try to make decisions that work *on average* over a population, you could model the situation by imagining that the decisions are applied to a randomly selected sample of individuals. From the doctor's perspective it is quite sensible to imagine that a patient's true health status is a random quantity, since they could imagine applying the test after selecting a random sample of patients. As long as the frequency of sick individuals in their sampling model matches the actual frequency among the patients to whom the test is administered, the doctor's decision to model the process that sends her patients as random won't lead her to make any errors when reasoning with averages.
:::

Imagine that, of all patients who take the test, a fraction $p_S$ are actually sick. This fraction is sometimes called a **base rate**. Base rates can influence conditional chances in surprising ways.

In many settings, base rates are quite small. For example, only 0.5% of women screened for breast cancer in a mammogram are diagnosed with breast cancer in a follow-up test. Since the mammogram is meant to filter for women with breast cancer, the population screened post mammogram should include more cases of breast cancer, so should have a higher base rate than the population of all women who take the mammogram. So, let's be conservative, and assume that the second stage test is perfect. Then we can put a conservative upper bound on the base rate of cancer in the population of women undergoing the mammogram at $p_S \leq 0.005$.

:::{hint} Using Outcome Trees
:class: dropdown
Here's an outcome tree representing the test, labelling the marginal probability for each test outcome:

![Outcome tree for mammography.](mammography_outcome_tree.png "Outcome tree for the mammography example.")

We could compute the joint probability of any outcome by applying the multiplication rule. For example, the chance of a true negative outcome is $0.995 \times 0.90 = 0.8955$. In other words, of all women who get a mammogram, 89.55% should be healthy and receive a negative test result (the test predicts healthy). Alternately, the chance a woman is healthy, but receives a falsely positive test result, is $0.995 \times 0.10 = 0.0995$. In other words, 9.95% of women will be healthy but incorrectly flagged as positive. Nearly ten percent is a lot of women, but it is not too large given the test accuracy. So far, the test looks reliable.

However, the diagram does not match our question. We didn't ask what is the chance a woman is healthy and is falsely flagged as sick. The woman's health status is unknown. The entire premise of the test is that we can't know her health status, and that the test might be wrong, but is usually right. We asked, *what is the chance a woman is healthy if she is flagged as sick*. Notice that we've only conditioned on what we could observe, then asked a probability question about what we can't observe.

Since we can't observe a woman's true health status, we should really draw the diagram in a way that groups all outcomes where the test result is identical:

![Outcome tree for mammography grouped by test result.](mammography_outcome_tree_observable.png "Outcome tree for the mammography example grouping by test result.")

Notice how the arrows corresponding to test decisions have moved. The arrows that move straight across are true predictions. The arrows that cross are errors. The error chances are relatively small since the test rarely makes a mistake when we condition on the patient's health status. When we ask, *what's the chance a woman flagged as sick is healthy*, we are asking what percent of the positive cases come from the diagonal error arrow.

Take a look at the marginals. Something odd has happened. Compare the marginal fraction of women who are actually sick (0.5%) to the fraction who have been flagged as sick (10.4%). About 20 times as many women have been flagged as sick as are actually sick! So, *most of the women flagged as sick must be healthy!*
:::
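
The joint and marginal probabilities behind the outcome tree can be recomputed with the multiplication rule. The sketch below uses the mammogram rates quoted in the text and takes the base rate at its upper bound of 0.005. The dictionary layout and variable names are illustrative choices.

```python
# Rates quoted in the text for the mammogram example.
p_sick = 0.005          # base rate Pr(S), taken at the upper bound from the text
p_fp = 0.10             # false positive rate Pr(P|H)
p_fn = 0.13             # false negative rate Pr(N|S)

p_healthy = 1 - p_sick

# Multiplication rule: Pr(health, result) = Pr(health) * Pr(result | health).
joint = {
    ("H", "N"): p_healthy * (1 - p_fp),   # true negative
    ("H", "P"): p_healthy * p_fp,         # false positive
    ("S", "N"): p_sick * p_fn,            # false negative
    ("S", "P"): p_sick * (1 - p_fn),      # true positive
}

# Marginal of each test result: add the joints that share that result.
pr_negative = joint[("H", "N")] + joint[("S", "N")]
pr_positive = joint[("H", "P")] + joint[("S", "P")]

for (health, result), p in joint.items():
    print(f"Pr({health},{result}) = {p:.5f}")
print(f"Pr(N) = {pr_negative:.5f}, Pr(P) = {pr_positive:.5f}")
print(f"flagged / actually sick = {pr_positive / p_sick:.1f}")
```

The four joint entries fill in the TN, FP, FN, TP table from earlier, and the last line compares the fraction flagged as sick to the fraction actually sick.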

:::{hint} By Population Diagram
:class: dropdown

Let's try this a bit more directly. Here's a grid of 200 icons. You can imagine that each icon represents 100 women who get tested each year, so this grid represents a population of 20,000 women. The base rate is 0.005, so we'd expect $0.005 \times 20 \times 10 \times 100 = 100$ women in this population to be sick. We can represent those women with one icon.

![Population of women.](10_by_20_grid_one_sick.png "Sample population.")

All the grey icons are healthy. The one orange icon represents the sick women. 

What fraction of the healthy women get incorrectly flagged as sick? Well, the chance of a false positive was $0.10$, so 10% of the remaining 19,900 women will be flagged. That's 19.9 icons, so about 20 icons. We'll represent these women with purple.

![The healthy women.](Only_Healthy_2.png "Only the healthy population.")

Most of the healthy women are correctly flagged as healthy. So why is the error rate so high the other way around?

Now, most of the sick women get detected, so we can be optimistic and assume all of those women get detected. So, here's the whole population with all the sick women highlighted and all the false positive cases highlighted:

![All women.](women_with_false_positives.png "The full population")

Then, filtering for only the positive cases, we can see the issue:

![All positives.](Only_Positives.png "All positives")

Most of the positive cases are false positives since the base rate is so low. 100% of a small population is smaller than 10% of a much larger population.


:::
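
The counting argument in the population diagram is easy to reproduce. This sketch follows the same steps with head counts instead of icons, including the optimistic assumption that every sick woman is detected. The variable names are our own.

```python
# Population from the diagram: 200 icons, each standing for 100 women.
icons = 200
women_per_icon = 100
population = icons * women_per_icon          # 20,000 women

base_rate = 0.005        # fraction of the population that is sick
p_fp = 0.10              # false positive rate Pr(P|H)

sick = round(base_rate * population)         # 100 women, one icon
healthy = population - sick                  # 19,900 women
false_positives = round(p_fp * healthy)      # 1,990 women, about 20 icons

# Optimistic assumption from the text: every sick woman is detected.
true_positives = sick

flagged = true_positives + false_positives
share_healthy_among_flagged = false_positives / flagged

print(sick, healthy, false_positives, flagged)
print(f"{share_healthy_among_flagged:.3f}")
```

Even when no sick woman is missed, the healthy women make up the large majority of the flagged group, simply because there are so many more of them to begin with.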

Let's compute a lower bound on the chance a woman who receives a positive mammogram does not have breast cancer. Applying Bayes rule:

$$\begin{aligned} \text{Pr}(H|P) & = \frac{\text{Pr}(H,P)}{\text{Pr}(P)} \\
& = \frac{\text{Pr}(H) \times \text{Pr}(P|H)}{\text{Pr}(H) \times \text{Pr}(P|H) + \text{Pr}(S) \times \text{Pr}(P|S)} \\
& \geq \frac{0.995 \times 0.1}{0.995 \times 0.1 + 0.005 \times 0.87} \\
& = \left(1 + \frac{0.005}{0.995} \frac{0.87}{0.1} \right)^{-1} \\
& \approx (1 + 0.04)^{-1} = \frac{100}{104} = \frac{25}{26} \approx 0.96 \end{aligned}$$
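
The same calculation can be wrapped in a small function, which makes it easy to see why plugging in the upper bound on the base rate gives a *lower* bound on $\text{Pr}(H|P)$. This is a minimal sketch; the function name `prob_healthy_given_positive` is hypothetical, and the two smaller base rates in the loop are made-up values used only to show the trend.

```python
def prob_healthy_given_positive(p_sick, p_fp, p_fn):
    """Pr(H|P) from the base rate and the two error rates, via Bayes rule."""
    p_healthy = 1 - p_sick
    false_pos = p_healthy * p_fp          # Pr(H) * Pr(P|H) = Pr(H,P)
    true_pos = p_sick * (1 - p_fn)        # Pr(S) * Pr(P|S) = Pr(S,P)
    return false_pos / (false_pos + true_pos)

# Mammogram numbers from the text, at the upper bound on the base rate.
at_bound = prob_healthy_given_positive(p_sick=0.005, p_fp=0.10, p_fn=0.13)
print(f"{at_bound:.4f}")

# Illustrative smaller base rates (not from the text): Pr(H|P) only goes up.
for p_sick in (0.005, 0.002, 0.001):
    print(p_sick, round(prob_healthy_given_positive(p_sick, 0.10, 0.13), 4))
```

Lowering the base rate shrinks the true positive term and grows the false positive term, so any base rate below 0.005 makes a positive result even less informative.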

That's a shocking number. Read it again. Don't skim it. 

Even though only 10% of healthy women get flagged by the mammogram, *the fraction of women flagged by the mammogram who do not actually have breast cancer is at least 95%*!

Pause to let that sink in.  What's happened here?

The problem is the base rate. The base rate of actual sick cases is so low that, even with a reasonably accurate test, the small fraction of healthy cases who are flagged as sick vastly outweighs the large fraction of actually sick cases who are flagged as sick. Why? Because there are 995 healthy cases for every 5 sick cases.

What about our hypothetical test with $\text{Pr}(P|H) = p_{FP} = 0.05$ and $\text{Pr}(N|S) = p_{FN} = 0.01$? This is a much more accurate test. Can it filter out enough healthy cases so that a patient who receives a positive result is usually sick? Is it good enough to at least get $\text{Pr}(H|P)$ under 50%?

$$\begin{aligned} \text{Pr}(H|P) & = \frac{\text{Pr}(H,P)}{\text{Pr}(P)} \\
& = \frac{\text{Pr}(H) \times \text{Pr}(P|H)}{\text{Pr}(H) \times \text{Pr}(P|H) + \text{Pr}(S) \times \text{Pr}(P|S)} \\
& \geq \frac{0.995 \times 0.05}{0.995 \times 0.05 + 0.005 \times 0.99} \\
& = \left(1 + \frac{0.005}{0.995} \frac{0.99}{0.05} \right)^{-1} \\
& \approx (1 + 0.1)^{-1} = \frac{10}{11} \approx 0.91 \end{aligned}$$
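
We can also turn the 50% question around and ask how small the false positive rate would need to be. From the formula, $\text{Pr}(H|P) \leq 1/2$ exactly when $\text{Pr}(H) \times p_{FP} \leq \text{Pr}(S) \times (1 - p_{FN})$. Here is a minimal sketch that solves for the break-even false positive rate; the function name is hypothetical, and the base rate is again taken at 0.005.

```python
def max_fp_rate_for_even_odds(p_sick, p_fn):
    """Largest false positive rate at which Pr(H|P) is still at most 1/2."""
    # Pr(H|P) <= 1/2 exactly when Pr(H) * p_fp <= Pr(S) * (1 - p_fn).
    return p_sick * (1 - p_fn) / (1 - p_sick)

p_sick, p_fn = 0.005, 0.01
threshold = max_fp_rate_for_even_odds(p_sick, p_fn)

# Sanity check: plug the threshold back into Bayes rule.
false_pos = (1 - p_sick) * threshold
true_pos = p_sick * (1 - p_fn)
p_healthy_given_pos = false_pos / (false_pos + true_pos)

print(f"false positive rate must be at most {threshold:.5f}")
print(f"Pr(H|P) at that rate: {p_healthy_given_pos:.3f}")
```

Under these assumptions the false positive rate would have to fall to about 0.5%, roughly ten times smaller than the hypothetical test's 5%, before a positive result is more likely to come from a sick patient than a healthy one.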

Even with the better test, the chance that a patient who received a positive result is actually healthy is still greater than 90%.

This is why we should use multistage test procedures when false positives are a problem, and base rates are low. Forgetting to account for the base rate is sometimes called base rate neglect. 
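
To see why a multistage procedure helps, here is a rough sketch that feeds the posterior from one positive result back in as the base rate for the next stage. It rests on strong assumptions that are ours, not the text's: every stage reuses the hypothetical test ($p_{FP} = 0.05$, $p_{FN} = 0.01$), and the stages make their errors independently once we condition on the patient's true health status. Real follow-up tests usually differ from the screening test and their errors may be correlated.

```python
def update_prob_sick(prior_sick, p_fp, p_fn):
    """Pr(S|P): chance of being sick after one more positive result."""
    true_pos = prior_sick * (1 - p_fn)
    false_pos = (1 - prior_sick) * p_fp
    return true_pos / (true_pos + false_pos)

# ASSUMPTION (not from the text): each stage reuses the hypothetical test
# and stages err independently given the patient's true health status.
prob_sick = 0.005                      # base rate before any testing
history = [prob_sick]
for stage in range(3):
    prob_sick = update_prob_sick(prob_sick, p_fp=0.05, p_fn=0.01)
    history.append(prob_sick)

for stage, p in enumerate(history):
    print(f"after {stage} positive results: Pr(S) = {p:.3f}")
```

Each positive result raises the effective base rate for the next stage, so under these assumptions the chance of being sick climbs from about 9% after one positive result to well over half after two.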

:::{seealso} Further Reading
Base rate neglect is a natural mistake to make. If the results above were surprising to you, or surprising in their degree, then you are experiencing surprise because the result is far from your intuition. Most people don't account for, or account enough for, base rates. If you want to learn more about the psychological research here, read Chapter 14 in Daniel Kahneman's *Thinking, Fast and Slow*. We'll highlight some other probability related problems later in the book where intuition is a faulty guide.
:::