## Bayes' Theorem and the Base Rate Problem

Suppose that at your regular physical exam you test positive for Disease X. Although Disease X has only mild symptoms, you are concerned and ask your doctor about the accuracy of the test. It turns out that the test is 99% accurate. It would appear that the probability that you have Disease X is therefore 0.99.

However, the situation is not that simple.
 
For one thing, more information about the accuracy of the test is needed, because there are two kinds of errors the test can make: misses (false negatives) and false positives.
 
* If you actually had Disease X and the test failed to detect it, that would be a miss.

* If you did not have Disease X and the test indicated you did, that would be a false positive.

The miss and false positive rates are not necessarily the same.

For example, let's say that the test accurately indicates the disease in 99% of the people who have it and accurately indicates no disease in 91% of the people who do not have it. This would mean that the test has a miss rate of 1% and a false positive rate of 9%.
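As a quick check of that arithmetic, the two error rates are simply the complements of the two accuracy figures. The variable names below (`sensitivity`, `specificity`) are the conventional terms for those figures and are introduced here only for illustration.

```python
# Accuracy figures from the example; the variable names are illustrative.
sensitivity = 0.99   # P(test positive | disease)
specificity = 0.91   # P(test negative | no disease)

miss_rate = 1 - sensitivity            # P(test negative | disease)
false_positive_rate = 1 - specificity  # P(test positive | no disease)

print(f"miss rate: {miss_rate:.2f}")
print(f"false positive rate: {false_positive_rate:.2f}")
```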

This might lead you to revise your judgment and conclude that your chance of having the disease is 0.91 rather than 0.99. Should it?

The missing piece is the prevalence of the disease in the population. The proportion of people who have the disease is called the **base rate**. Assume that Disease X is rare, and only 2% of people have it. How does that affect the probability that you have it? In other words, what is the probability that someone who tests positive actually has the disease?

We can compute the probability two ways: by counting people in a hypothetical population, and with Bayes' theorem. Start with counting, and say we have a population of 1 million people.

In [1]:
N = 1000000
base_rate = 0.02

num_disease = N * base_rate
num_no_disease = N - num_disease

num_positive_no_disease = num_no_disease * 0.09
num_negative_no_disease = num_no_disease * 0.91

num_positive_disease = num_disease * 0.99
num_negative_disease = num_disease * 0.01

# Of the people who tested positive, what fraction actually have the disease?
# This is P(disease | positive test), stored here as `true_positive`.
true_positive = num_positive_disease / (num_positive_disease + num_positive_no_disease)
print('chance you have the disease if you test positive for it is', true_positive)

# Of the people who tested negative, what fraction actually have the disease?
# This is P(disease | negative test), stored here as `false_negative`.
false_negative = num_negative_disease / (num_negative_disease + num_negative_no_disease)
print('chance you have the disease if you test negative for it is', false_negative)

chance you have the disease if you test positive for it is 0.18333333333333332
chance you have the disease if you test negative for it is 0.0002242152466367713


Summary:

* In a population of one million, 20,000 people actually have the disease, and 980,000 do not.

* Of the 980,000 people who do not have the disease, 9% or 88,200 people will get a false positive on the test and the rest will test negative.

* Of the 20,000 people who do have the disease, 99% or 19,800 people will get a true positive on the test and the other 200 will get a false negative (the 1% miss rate).

<img src="images/base_rate.png" width=50%>

The numbers of people diagnosed with the disease are shown in red. Of the one million people tested, the test was correct for 891,800 of those without the disease and for 19,800 of those with the disease, so it was correct for 911,600 people, or about 91% of the time.

However, if you look only at the people testing positive (shown in red), only 19,800 of the 108,000 testing positive actually have the disease, a proportion of 0.1833.

To sum up, 19,800 people who tested positive would actually have the disease and 88,200 people who tested positive would not have the disease.

This means that of all those who tested positive, only 19,800/(19,800 + 88,200) = 0.1833 would actually have the disease. So the probability that you have the disease is only 18.33% (!!)
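The contrast between overall accuracy and the chance of disease after a positive test can be recomputed directly from the head counts. The following is a minimal sketch using the rates from the example; the variable names are illustrative, and integer arithmetic is used so the counts stay exact.

```python
N = 1_000_000

# Integer arithmetic keeps the head counts exact.
diseased = N * 2 // 100           # 2% base rate
healthy = N - diseased

true_pos = diseased * 99 // 100   # diseased and test positive
false_neg = diseased - true_pos   # diseased but missed by the test
false_pos = healthy * 9 // 100    # healthy but test positive
true_neg = healthy - false_pos    # healthy and test negative

total_positive = true_pos + false_pos
accuracy = (true_pos + true_neg) / N          # fraction of all tests that are correct
share_diseased = true_pos / total_positive    # fraction of positives who have the disease

print("tested positive:", total_positive)
print("overall accuracy:", round(accuracy, 4))
print("fraction of positives with the disease:", round(share_diseased, 4))
```

The test is right for most people because most people are healthy and test negative, yet the positives are dominated by the much larger healthy group.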

**Q**: How does the probability of having the disease given a positive test (called `true_positive` in the code above) vary with the base rate?

In [None]:
import numpy as np
import matplotlib.pyplot as plt

N = 1000000

x = np.linspace(0.01,0.99,1000)
y = []
for base_rate in x:

    true_positive = ...
    
    y.append(true_positive)
    
plt.plot(x,y,'k-')
plt.show()
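One possible way to approach the exercise, sketched under the assumption that the test's 99% hit rate and 9% false positive rate stay fixed while only the base rate changes. The helper name `prob_disease_given_positive` is hypothetical. The population size cancels out of the ratio, so the function works with proportions rather than head counts.

```python
def prob_disease_given_positive(base_rate, hit_rate=0.99, false_positive_rate=0.09):
    """P(disease | test positive) for a given base rate (hypothetical helper)."""
    positive_and_disease = hit_rate * base_rate
    positive_and_healthy = false_positive_rate * (1 - base_rate)
    return positive_and_disease / (positive_and_disease + positive_and_healthy)

# Print a few sample points instead of plotting the whole curve.
for base_rate in (0.01, 0.02, 0.10, 0.50):
    p = prob_disease_given_positive(base_rate)
    print(f"base rate {base_rate:.2f} -> P(disease | positive) = {p:.4f}")
```

Inside the loop of the exercise cell, the placeholder could then be filled with a call to this helper. The probability rises steeply with the base rate: the same positive result is far more convincing for a common disease than for a rare one.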

What would happen if one million people were tested?

1. Out of these one million people, 2% or 20,000 people would have the disease.
2. Of these 20,000 with the disease, the test would accurately detect it in 99% of them. This means that 19,800 cases would be accurately identified.
3. Now consider the 98% of the one million people (980,000) who do not have the disease. Since the false positive rate is 0.09, 9% of these 980,000 people will test positive for the disease. This is a total of **88,200 people incorrectly diagnosed**. 

**Q**: How does the fraction of positive tests that are false positives vary with the base rate? (The false positive rate of the test itself stays fixed at 9%.)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

N = 1000000

x = np.linspace(0.01,0.08,1000)
y = []
for base_rate in x:

    false_positive = ...
    
    y.append(false_positive)
    
plt.plot(x,y,'k-')
plt.show()
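One reading of the question above, assumed here, is that it asks for the share of positive results that come from people without the disease. The test's own false positive rate is a fixed 9%, so this share is the quantity that can change with the base rate. The helper name `false_share_of_positives` is hypothetical.

```python
def false_share_of_positives(base_rate, hit_rate=0.99, false_positive_rate=0.09):
    """Fraction of positive tests that are false positives (hypothetical helper)."""
    true_positives = hit_rate * base_rate
    false_positives = false_positive_rate * (1 - base_rate)
    return false_positives / (true_positives + false_positives)

# Sample points across the range used in the exercise cell.
for base_rate in (0.01, 0.02, 0.05, 0.08):
    share = false_share_of_positives(base_rate)
    print(f"base rate {base_rate:.2f} -> false share of positives = {share:.4f}")
```

Under this reading the share falls as the base rate rises, and it is the complement of the probability of disease given a positive test.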

Bayes' theorem is stated mathematically as the following equation:

$$ P(A\mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

where $A$ and $B$ are events and $P(B) \neq 0$.

It is often useful to compute $P(B)$ using the law of total probability, where the events $A_j$ are mutually exclusive and together cover every possible outcome:

$$P(B) = {\sum_j P(B\mid A_j) P(A_j)},$$
$$\Rightarrow P(A_i\mid B) = \frac{P(B\mid A_i)\,P(A_i)}{\sum\limits_j P(B\mid A_j)\,P(A_j)}$$
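The general formula translates almost line by line into code. The following is a minimal sketch, assuming the priors $P(A_j)$ and the likelihoods $P(B\mid A_j)$ are passed as two equal-length lists; the function name `posterior` is illustrative. It is applied to the disease example, where the two hypotheses are "disease" and "no disease" and $B$ is a positive test.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every i, given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_j) P(A_j)
    evidence = sum(joint)                                 # P(B), law of total probability
    if evidence == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return [j / evidence for j in joint]

# Hypotheses: [disease, no disease]; event B: the test is positive.
post = posterior(priors=[0.02, 0.98], likelihoods=[0.99, 0.09])
print([round(p, 4) for p in post])
```

Because every posterior is divided by the same $P(B)$, the returned values always sum to 1.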

In the special case where there are only two possible outcomes, $A$ and its complement $A^c$:

$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{ P(B\mid A) P(A) + P(B\mid A^c) P(A^c)}$$


In our example, we let $A$ be the event that you have the disease, and $B$ the event that you test positive.

From the problem statement,

* $P(A) = P(\mbox{disease}) = 0.02$, so $P(A^c) = P(\mbox{no disease}) = 0.98$
* $P(B | A) = P(\mbox{test positive}~|~   \mbox{disease}) = 0.99$
* $P(B | A^c) = P(\mbox{test positive}~|~\mbox{no disease}) = 0.09$  

Then by Bayes' theorem:

$$P(\mbox{disease}~|~\mbox{test positive}) = P(A\mid B) = \frac{P(B\mid A)\,P(A)}{ P(B\mid A) P(A) + P(B\mid A^c) P(A^c)} =  \frac{0.99 \cdot 0.02}{ 0.99 \cdot 0.02 + 0.09 \cdot 0.98} \approx 0.183$$
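The same calculation can be checked numerically, and the formula applies equally to a negative result by swapping in the complementary test probabilities. This sketch uses only the three numbers from the problem statement; the variable names are illustrative.

```python
p_disease = 0.02             # P(A), the base rate
p_pos_given_disease = 0.99   # P(B | A)
p_pos_given_healthy = 0.09   # P(B | A^c)

p_healthy = 1 - p_disease
p_neg_given_disease = 1 - p_pos_given_disease   # miss rate
p_neg_given_healthy = 1 - p_pos_given_healthy   # correct negatives

# Bayes' theorem for a positive test.
p_disease_given_pos = (p_pos_given_disease * p_disease) / (
    p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy
)

# Bayes' theorem for a negative test.
p_disease_given_neg = (p_neg_given_disease * p_disease) / (
    p_neg_given_disease * p_disease + p_neg_given_healthy * p_healthy
)

print(f"P(disease | positive) = {p_disease_given_pos:.4f}")
print(f"P(disease | negative) = {p_disease_given_neg:.6f}")
```

Both values agree with the counting approach in the one-million-person population: roughly 0.183 after a positive test and roughly 0.0002 after a negative one.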