## Why we need Bayes' theorem and other fun with conditional probability

Say we want to determine the probability that a test result is giving us correct information. For example: suppose you are getting tested for a disease that occurs in 1% of a population. You know that the test has a 2% false positive rate and a 5% false negative rate. If your result comes back positive, what's the probability that you actually have the disease? If it comes back negative, what's the chance that you actually do have the disease?

Let's make a table with everything we know about this test on a hypothetical population of 10000. Since the disease occurs in 1% of the population, 100 people will actually have it while 9900 won't. If all 100 people who have it were tested, 5% (or 5) would get a (false) negative result and the other 95 would get a (true) positive result. Of the 9900 people who don't have it, 2% (or 198) would get a (false) positive result and the other 9702 would get a (true) negative result.

Here's all that in a table. 

|               | diseased | not diseased | total |
|---------------|----------|--------------|-------|
| test positive | 95       | 198          | 293   |
| test negative | 5        | 9702         | 9707  |
| total         | 100      | 9900         | 10000 |
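If it helps to see the arithmetic spelled out, here is a minimal Python sketch that rebuilds the cells of the table from the three rates. The function name `confusion_table` and the dictionary keys are made up for illustration, and the counts are expected values, assuming everyone in the population gets tested.

```python
def confusion_table(population, prevalence, false_positive_rate, false_negative_rate):
    """Return the expected count in each cell of the test-result table."""
    diseased = population * prevalence
    not_diseased = population - diseased

    # Diseased people split into false negatives and true positives.
    false_negative = diseased * false_negative_rate
    true_positive = diseased - false_negative

    # Healthy people split into false positives and true negatives.
    false_positive = not_diseased * false_positive_rate
    true_negative = not_diseased - false_positive

    return {
        "true positive": true_positive,
        "false positive": false_positive,
        "false negative": false_negative,
        "true negative": true_negative,
    }


table = confusion_table(
    population=10000,
    prevalence=0.01,
    false_positive_rate=0.02,
    false_negative_rate=0.05,
)
for cell, count in table.items():
    print(f"{cell}: {count:.0f}")
```

Changing any of the four arguments and rerunning is a quick way to see how the cells shift.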

Before we move on to answering the original questions, let's stop here for a moment and define some terms about tests. 

**Specificity** is the test's ability to correctly identify healthy patients.

**Sensitivity** is the test's ability to correctly identify diseased patients.

Let's name each cell in the table so we can use them in some equations.

|               | diseased                | not diseased              | total |
|---------------|-------------------------|---------------------------|-------|
| test positive | 95 (true positive / TP) | 198 (false positive / FP) | 293   |
| test negative | 5 (false negative / FN) | 9702 (true negative / TN) | 9707  |
| total         | 100 (positive / P)      | 9900 (negative / N)       | 10000 |

From the Wikipedia article:

$\text{sensitivity} = \text{recall} = \text{true positive rate} = \text{TPR} =\frac{TP}{P}$

Plugging in our values, we have

$\text{TPR} =\frac{TP}{P} =\frac{95}{100} = 95\%$

And for specificity:

$\text{specificity} = \text{true negative rate} = \text{TNR} =\frac{TN}{N} =\frac{9702}{9900} = 98\%$

A few other terms worth mentioning are **Accuracy**, the fraction of all test results that are correct, and **Precision**, the fraction of positive test results that are true positives. Here's an example from the [Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall) for precision and recall:

> Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages, only 20 of which were relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is "how valid the search results are", and recall is "how complete the results are".

So back to our test from above, here are the definitions and results.

$\text{accuracy} = \text{ACC} =\frac{TP + TN}{P + N} =\frac{95 + 9702}{100 + 9900} = 97.97\%$

$\text{precision} = \text{positive predictive value} = \text{PPV} =\frac{TP}{TP + FP} =\frac{95}{95 + 198} =\frac{95}{293} \approx 32.4\%$
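The four definitions can also be checked mechanically. The following is a small Python sketch, not a library API: `diagnostic_metrics` is a hypothetical helper that takes the four cell counts from the table and applies the formulas for sensitivity, specificity, accuracy, and precision.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute summary metrics from the four cells of the table."""
    return {
        "sensitivity": tp / (tp + fn),                # TP / P
        "specificity": tn / (tn + fp),                # TN / N
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # correct results / all results
        "precision": tp / (tp + fp),                  # TP / all positive results
    }


metrics = diagnostic_metrics(tp=95, fp=198, fn=5, tn=9702)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Note that accuracy is high mostly because the true negatives dominate the population, which is why it says little about how much to trust a positive result.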

### Bayes' Theorem

Okay, but we still haven't answered our original question: if you test positive, what are the chances you actually have the disease? If we think about it and look at the table a bit, we might reason as follows: we tested positive, so we're in either the True Positive or the False Positive category. So, what's the likelihood of being in each? For True Positive, it would be the fraction of positive test results that are true positives, or $\frac{\text{TP}}{\text{TP+FP}} = \frac{95}{95 + 198} \approx 32.4\%$, right?

Right, and that fraction is exactly the precision of the test. But we could only read it off because we built the whole table first. Usually all we're handed are the rates, and this is where conditional probability and Bayes come in. Bayes' theorem is for this: we want to figure out the probability of some event A, given that B has happened. Specifically, the event B changes the probability of A happening. So if B is rolling a 6 on one 6-sided die while A is rolling a 1 on a second die, we don't need Bayes' theorem, because B happening tells us nothing about whether or not A is going to happen. But if we're talking about B being the result of a disease test coming back positive or negative, then we definitely do need the theorem. Here's what it says:

$\text{probability of A given B} = \text{P(A|B)} = \frac{P(B|A)P(A)}{P(B)}$

So let's try this out on our question.

$\text{probability we have disease if positive test result} = \text{P(disease | positive)} = \frac{\text{P(positive | disease)}\text{P(disease)}}{\text{P(positive)}}$

$\text{P(positive | disease)}$ is the probability we will test positive if we have the disease, which is the true positive rate (the sensitivity), 95%. We know the disease rate $\text{P(disease)}$ is 1%. The overall positive test rate $\text{P(positive)}$ counts both true and false positives:

$\text{P(positive)} = \text{P(positive | disease)}\text{P(disease)} + \text{P(positive | no disease)}\text{P(no disease)} = 95\% \times 1\% + 2\% \times 99\% = 2.93\%$

So then the probability we have the disease if we test positive is:

$\text{P(disease | positive)} = \frac{\text{P(positive | disease)}\text{P(disease)}}{\text{P(positive)}} = \frac{95\% \times 1\%}{2.93\%} = \frac{0.95\%}{2.93\%} \approx 32.4\%$

Still pretty unlikely: even after a positive result, the odds are about two to one that we don't have the disease. The same approach answers the other original question. The chance we have the disease after a negative result is $\text{P(disease | negative)} = \frac{5\% \times 1\%}{97.07\%} \approx 0.05\%$.
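As a cross-check, here is a minimal Python sketch of Bayes' theorem that works only from the three rates given at the start (prevalence, false positive rate, false negative rate), with no table. The function name `posterior` and its parameter names are illustrative assumptions. The denominator P(B) is expanded as P(B|A)P(A) + P(B|not A)P(not A), which covers every way B can happen.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' theorem for a yes/no event A and some evidence B.

    prior             -- P(A)
    likelihood        -- P(B | A)
    likelihood_if_not -- P(B | not A)
    Returns P(A | B).
    """
    # P(B): B can happen together with A or together with not-A.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


prevalence = 0.01           # P(disease)
false_positive_rate = 0.02  # P(positive | no disease)
false_negative_rate = 0.05  # P(negative | disease)
sensitivity = 1 - false_negative_rate  # P(positive | disease)
specificity = 1 - false_positive_rate  # P(negative | no disease)

p_disease_given_positive = posterior(prevalence, sensitivity, false_positive_rate)
p_disease_given_negative = posterior(prevalence, false_negative_rate, specificity)

print(f"P(disease | positive) = {p_disease_given_positive:.4f}")
print(f"P(disease | negative) = {p_disease_given_negative:.6f}")
```

With these inputs the first value comes out near 0.32 and the second near 0.0005, the same fractions as 95/293 and 5/9707 in the table.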

### Exercises
1. What if the disease rate ticked up to 2% of the population next year? Which measurements would stay the same and which would change?
2. What if we developed a better test that halved the false positive rate? How would that affect the chances that you have the disease if you test positive?
3. Suppose we have a test with a sensitivity of 50% and specificity of 75%, with a population prevalence of 1%. Make a table like the one above with all the cells filled in.