# Lambda School Data Science Module 143

## Introduction to Bayesian Inference

!['Detector! What would the Bayesian statistician say if I asked him whether the--' [roll] 'I AM A NEUTRINO DETECTOR, NOT A LABYRINTH GUARD. SERIOUSLY, DID YOUR BRAIN FALL OUT?' [roll] '... yes.'](https://imgs.xkcd.com/comics/frequentists_vs_bayesians_2x.png)

*[XKCD 1132](https://www.xkcd.com/1132/)*


## Prepare - Bayes' Theorem and the Bayesian mindset

Bayes' theorem possesses a near-mythical quality - a bit of math that somehow magically evaluates a situation. But this reputation has more to do with its advanced applications than with its core - deriving it is remarkably straightforward.

### The Law of Total Probability

By definition, the total probability of all outcomes (events) of some variable (event space) $A$ is 1. That is:

$$P(A) = \sum_n P(A_n) = 1$$

The law of total probability takes this further, considering two variables ($A$ and $B$) and relating their marginal probabilities (their likelihoods considered independently, without reference to one another) and their conditional probabilities (the likelihood of one given that the other has occurred). A marginal probability is simply notated as e.g. $P(A)$, while a conditional probability is notated $P(A|B)$, which reads "probability of $A$ *given* $B$".

The law of total probability states:

$$P(A) = \sum_n P(A | B_n) P(B_n)$$

In words - the total probability of $A$ is the conditional probability of $A$ given each event $B_n$, times the probability of that event $B_n$, summed over all possible events in $B$ (the events $B_n$ must be mutually exclusive and together cover every possibility).
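As a small numeric illustration of the law, here is a sketch with made-up numbers: $B_n$ is which of three hypothetical machines produced a part, and $A$ is the event that the part is defective. The machine names and probabilities are assumptions chosen only for the example.

```python
# Hypothetical numbers: B_n is which of three machines made a part,
# and A is the event "the part is defective".
p_b = {"machine_1": 0.5, "machine_2": 0.3, "machine_3": 0.2}
p_a_given_b = {"machine_1": 0.01, "machine_2": 0.02, "machine_3": 0.05}

# The events B_n must cover every possibility exactly once.
assert abs(sum(p_b.values()) - 1.0) < 1e-12

# P(A) = sum_n P(A | B_n) * P(B_n)
p_a = sum(p_a_given_b[name] * p_b[name] for name in p_b)
print(f"P(A) = {p_a:.3f}")  # P(A) = 0.021
```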

### The Law of Conditional Probability

What's the probability of something conditioned on something else? To determine this we have to go back to set theory and think about the intersection of sets:

The formula for actual calculation:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

We can see how this relates back to the law of total probability - multiply both sides by $P(B)$ and you get $P(A|B)P(B) = P(A \cap B)$ - replaced back into the law of total probability we get $P(A) = \sum_n P(A \cap B_n)$.

This may not seem like an improvement at first, but try to relate it back to the above picture - if you think of sets as physical objects, we're saying that the total probability of $A$ is all the little pieces of it intersected with each $B_n$, added together. The conditional probability is then just one of those pieces, divided by the probability of $B$ itself happening in the first place.
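The set view can be checked directly by counting outcomes. The sketch below uses two fair dice as an assumed example: $A$ is "the dice sum to 8" and $B$ is "the first die is even". Both events are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered rolls of two fair dice.
outcomes = set(product(range(1, 7), repeat=2))

A = {roll for roll in outcomes if sum(roll) == 8}    # dice sum to 8
B = {roll for roll in outcomes if roll[0] % 2 == 0}  # first die is even


def prob(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(len(event), len(outcomes))


# Law of conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 1/6

# Law of total probability: the pieces of A inside B and outside B add up to A.
assert prob(A & B) + prob(A - B) == prob(A)
```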

### Bayes' Theorem

Here it is, the seemingly magic tool:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

In words - the probability of $A$ conditioned on $B$ is the probability of $B$ conditioned on $A$, times the probability of $A$ and divided by the probability of $B$. The unconditioned probability $P(A)$ is referred to as the "prior belief", and the conditioned probability $P(A|B)$ as the "updated" belief.

Why is this important? Scroll back up to the XKCD example - the Bayesian statistician draws a less absurd conclusion because their prior belief in the likelihood that the sun will go nova is extremely low. So, even when updated based on evidence from a detector that is $35/36 = 0.972$ accurate, the prior belief doesn't shift enough to change their overall opinion.

There are many examples of Bayes' theorem - one less absurd example is to apply it to [breathalyzer tests](https://www.bayestheorem.net/breathalyzer-example/). You may think that a breathalyzer test that is 100% accurate for true positives (detecting somebody who is drunk) is pretty good, but what if it also has 8% false positives (indicating somebody is drunk when they're not)? And furthermore, the rate of drunk driving (and thus our prior belief) is 1/1000.

What is the likelihood somebody really is drunk if they test positive? Some may guess it's 92% - the difference between the true positives and the false positives. But we have a prior belief of the background/true rate of drunk driving. Sounds like a job for Bayes' theorem!

$$
\begin{aligned}
P(Drunk | Positive) &= \frac{P(Positive | Drunk)P(Drunk)}{P(Positive)} \\
&= \frac{P(Positive | Drunk)P(Drunk)}{P(Positive | Drunk)P(Drunk) + P(Positive | \neg Drunk)P(\neg Drunk)} \\
&= \frac{1 \times 0.001}{1 \times 0.001 + 0.08 \times 0.999} \\
&= \frac{0.001}{0.08092} \\
&\approx 0.0124
\end{aligned}
$$

In other words, the likelihood that somebody is drunk given they tested positive with a breathalyzer in this situation is only about 1.24% - probably much lower than you'd guess. This is why, in practice, it's important to have a repeated test to confirm (the probability of two false positives in a row is $0.08 \times 0.08 = 0.0064$, much lower), and Bayes' theorem has been relevant in court cases where proper consideration of evidence was important.
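A minimal sketch of the breathalyzer calculation in Python. It computes $P(Positive)$ in full via the law of total probability ($1 \times 0.001 + 0.08 \times 0.999$, rather than rounding it to $0.08$), and then reuses the posterior as the prior for a second positive test. The function name `posterior` is illustrative, and the second step assumes the two tests are independent.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # P(Positive), from the law of total probability.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


# One positive test: prior 1/1000, 100% true positives, 8% false positives.
p_drunk = posterior(prior=0.001, true_positive_rate=1.0, false_positive_rate=0.08)
print(f"After one positive test:  {p_drunk:.4f}")

# A second positive test (assumed independent of the first):
# yesterday's posterior becomes today's prior.
p_drunk_twice = posterior(prior=p_drunk, true_positive_rate=1.0, false_positive_rate=0.08)
print(f"After two positive tests: {p_drunk_twice:.4f}")
```

Even after two positive tests the updated belief is still well under 50%, because the prior is so low.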

## Live Lecture - Deriving Bayes' Theorem, Calculating Bayesian Confidence

Notice that $P(A|B)$ appears in the above laws - in Bayesian terms, this is the belief in $A$ updated for the evidence $B$. So all we need to do is solve for this term to derive Bayes' theorem. Let's do it together!
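One way to write the derivation out, applying the law of conditional probability in both directions and assuming $P(B) > 0$:

$$
\begin{aligned}
P(A|B)P(B) &= P(A \cap B) \\
P(B|A)P(A) &= P(B \cap A) = P(A \cap B) \\
\implies P(A|B)P(B) &= P(B|A)P(A) \\
\implies P(A|B) &= \frac{P(B|A)P(A)}{P(B)}
\end{aligned}
$$

When $P(B)$ is not known directly, the law of total probability supplies it: $P(B) = \sum_n P(B | A_n) P(A_n)$.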

In [0]:
# Activity 2 - Use SciPy to calculate Bayesian confidence intervals
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bayes_mvs.html#scipy.stats.bayes_mvs

# From the frequentist approach - yesterday's notebook
def confidence_interval(data, confidence=0.95):
    """
    Calculate a confidence interval around a sample mean for
    given data.
    Using t-distribution and two-tailed test, default 95% confidence.
    
    Arguments:
      data - iterable (list or numpy array) of sample observations
      confidence - level of confidence for the interval
      
    Returns:
      tuple of (mean, lower bound, upper bound)
    """
    data = np.array(data)
    mean = np.mean(data)
    n = len(data)
    stderr = stats.sem(data) # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html
    # Two-tailed test, so each tail gets (1 - confidence) / 2 of the probability.
    # stats.t.ppf is the percent point function (inverse of the CDF), not a density.
    interval = stderr * stats.t.ppf((1 + confidence) / 2., n - 1)
    return (mean, mean - interval, mean + interval)


In [0]:
from scipy import stats
import numpy as np
import pandas as pd

In [0]:
coinflips = np.random.binomial(n=1, p=0.5, size=100)

In [12]:
# Frequentist 95% interval - broader than the bayes_mvs interval below,
# because stats.bayes_mvs defaults to alpha=0.9 (a 90% interval)
confidence_interval(coinflips)

(0.49, 0.39030929062808245, 0.5896907093719175)

In [9]:
# bayes_mvs also estimates the variance and standard deviation, each with its own interval
# Compare with the cell below: the Std_dev statistic is close to the std reported by pandas
stats.bayes_mvs(coinflips)

(Mean(statistic=0.49, minmax=(0.40657889423317223, 0.5734211057668277)),
 Variance(statistic=0.2576288659793814, minmax=(0.20279939208271736, 0.32435028891692574)),
 Std_dev(statistic=0.506265071182084, minmax=(0.45033253500354303, 0.5695175931583902)))

In [13]:
pd.DataFrame(coinflips).describe()

| | 0 |
|:--|--:|
| count | 100.0 |
| mean | 0.49 |
| std | 0.502418 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |
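The two intervals above are not quite comparable: `confidence_interval` was called at 95%, while `stats.bayes_mvs` uses `alpha=0.9` unless told otherwise. The sketch below puts them on the same footing with freshly generated flips, so the numbers will differ from the outputs above. The seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
flips = rng.binomial(n=1, p=0.5, size=100)

# Frequentist 95% t-interval around the sample mean.
mean = flips.mean()
half_width = stats.sem(flips) * stats.t.ppf((1 + 0.95) / 2, len(flips) - 1)
freq_95 = (mean - half_width, mean + half_width)

# Bayesian intervals for the mean at the default 90% and at 95%.
bayes_90 = stats.bayes_mvs(flips)[0].minmax
bayes_95 = stats.bayes_mvs(flips, alpha=0.95)[0].minmax

print("frequentist 95%:", freq_95)
print("bayes_mvs   90%:", bayes_90)
print("bayes_mvs   95%:", bayes_95)
```

For a sample of this size, the 95% interval from `bayes_mvs` should line up with the frequentist t-interval, so the narrower interval seen earlier reflects the 90% default rather than a difference between the two approaches.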


In [15]:
# Let's do something else medical

# Generating some data - condition that you recover from or don't and a treatment that you take or not
# Treated people recover with prob of .65
# Non-treated recover with prob .4

treatment_group =  np.random.binomial(n=1, p=0.65, size=100)
nontreated_group = np.random.binomial(n=1, p=0.4, size=100)

print(treatment_group)

[1 0 1 1 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1
 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1]


In [0]:
df = pd.DataFrame({'treated': treatment_group, 'untreated':nontreated_group})

In [21]:
df.describe()

| | treated | untreated |
|:--|--:|--:|
| count | 100.0 | 100.0 |
| mean | 0.66 | 0.42 |
| std | 0.476095 | 0.496045 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


In [25]:
# Frequentist hypothesis test
stats.ttest_ind(df.treated, df.untreated)

Ttest_indResult(statistic=3.490646843296438, pvalue=0.0005939845864903332)

In [27]:
# The intervals for the treated and untreated group means don't even overlap
# Statistically significant difference!
# Meaning their means are very different, so we reject the null
# When would we fail to reject the null? Roughly, when the intervals overlap
stats.bayes_mvs(df.treated)

(Mean(statistic=0.66, minmax=(0.5809495693071084, 0.7390504306928917)),
 Variance(statistic=0.23134020618556705, minmax=(0.1821055765640728, 0.29125332066009674)),
 Std_dev(statistic=0.4797403673067261, minmax=(0.4267382998560978, 0.5396789051464738)))

In [28]:
stats.bayes_mvs(df.untreated)

(Mean(statistic=0.42, minmax=(0.33763713292148456, 0.5023628670785154)),
 Variance(statistic=0.2511340206185568, minmax=(0.19768680236634642, 0.3161733908770034)),
 Std_dev(statistic=0.4998428440976869, minmax=(0.4446198402752023, 0.5622929760160653)))

In [0]:
# SUGGESTED TASK -> Write your own Bayes test function that compares CIs from stats.bayes_mvs
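One possible shape for the suggested task, as a sketch rather than a reference solution: a helper (the name `bayes_intervals_overlap` is made up here) that takes the mean intervals from `stats.bayes_mvs` for two samples and reports whether they overlap. The two deterministic samples are illustrative stand-ins for the treated and untreated groups.

```python
import numpy as np
from scipy import stats


def bayes_intervals_overlap(sample_a, sample_b, alpha=0.9):
    """Compare the bayes_mvs intervals for the means of two samples.

    Returns (overlap, interval_a, interval_b), where overlap is True
    when the two intervals share at least one point.
    """
    low_a, high_a = stats.bayes_mvs(np.asarray(sample_a), alpha=alpha)[0].minmax
    low_b, high_b = stats.bayes_mvs(np.asarray(sample_b), alpha=alpha)[0].minmax
    overlap = bool(low_a <= high_b and low_b <= high_a)
    return overlap, (low_a, high_a), (low_b, high_b)


# Illustrative data: 80% recovery in one group, 20% in the other.
group_a = np.array([1] * 80 + [0] * 20)
group_b = np.array([1] * 20 + [0] * 80)

overlap, ci_a, ci_b = bayes_intervals_overlap(group_a, group_b)
print("Intervals overlap:", overlap)
print("Group A mean interval:", ci_a)
print("Group B mean interval:", ci_b)
```

Non-overlapping intervals are a conservative signal of a difference. Overlapping intervals do not by themselves show that the groups are the same.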

## Assignment - Code it up!

Most of the above was pure math - write Python code to reproduce the results. This is purposefully open ended - you'll have to think about how you should represent probabilities and events. You can and should look things up, and as a stretch goal - refactor your code into helpful reusable functions!

If you're unsure where to start, check out [this blog post of Bayes theorem with Python](https://dataconomy.com/2015/02/introduction-to-bayes-theorem-with-python/) - you could and should create something similar!

Stretch goal - apply a Bayesian technique to a problem you previously worked on (in an assignment or project) from a frequentist (standard) perspective.

In [0]:
# TODO - code!
#help(stats.bayes_mvs)

You might be interested in finding out a patient’s probability of having liver disease if they are an alcoholic. “Being an alcoholic” is the __test__ (kind of like a litmus test) for liver disease.

__A__ could mean the event “Patient has liver disease.” Past data tells you that 10% of patients entering your clinic have liver disease. P(A) = 0.10.

__B__ could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s patients are alcoholics. P(B) = 0.05.

You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is your __B|A__: the probability that a patient is alcoholic, given that they have liver disease, is 7%.

Bayes’ theorem tells you:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{0.07 \times 0.10}{0.05} = 0.14$$

In other words, if the patient is an alcoholic, their chance of having liver disease is 0.14 (14%). This is a large increase from the 10% suggested by past data. But it’s still unlikely that any particular patient has liver disease.
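The same arithmetic as a small reusable function, in the spirit of the assignment. The function and variable names are illustrative.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence


p_liver = 0.10                  # P(A): patient has liver disease
p_alcoholic = 0.05              # P(B): patient is an alcoholic
p_alcoholic_given_liver = 0.07  # P(B|A)

p_liver_given_alcoholic = bayes_posterior(
    likelihood=p_alcoholic_given_liver, prior=p_liver, evidence=p_alcoholic
)
print(f"P(liver disease | alcoholic) = {p_liver_given_alcoholic:.2f}")  # 0.14
```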

In [0]:
liver_disease = np.random.binomial(n=1, p=0.10, size=100)
alcoholic = np.random.binomial(n=1, p=0.05, size=100)

print(liver_disease)
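Two independent binomial draws like the ones above cannot reproduce the 14% figure, because they carry no link between alcoholism and liver disease. A sketch of a simulation that does encode the link: draw liver disease first, then draw alcoholism with a probability that depends on it. The rate among patients without liver disease is not given in the example. It is derived here so that the overall rate comes out to 5%, and the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(143)  # arbitrary seed
n_patients = 500_000

p_liver = 0.10
p_alc_given_liver = 0.07
# Derived so the overall alcoholism rate is 5%:
# 0.05 = 0.07 * 0.10 + p_alc_given_healthy * 0.90
p_alc_given_healthy = (0.05 - p_alc_given_liver * p_liver) / (1 - p_liver)

# Draw liver disease first, then alcoholism conditional on it.
liver = rng.binomial(n=1, p=p_liver, size=n_patients).astype(bool)
p_alc = np.where(liver, p_alc_given_liver, p_alc_given_healthy)
alcoholic = rng.random(n_patients) < p_alc

# Empirical P(liver disease | alcoholic): share of alcoholics with liver disease.
estimate = liver[alcoholic].mean()
print(f"Simulated P(alcoholic) = {alcoholic.mean():.3f}")
print(f"Simulated P(liver disease | alcoholic) = {estimate:.3f}")
```

The simulated value should land close to the 0.14 obtained from Bayes' theorem.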

## Introduction to Bayes Theorem in Python - [Article here](https://dataconomy.com/2015/02/introduction-to-bayes-theorem-with-python/)


Factoring the joint probability $P(A,B)$ both ways gives:

$P(A|B)*P(B) = P(B|A)*P(A)$

Which implies:

$P(A|B) = \dfrac{P(B|A)*P(A)}{P(B)}$

And plug in $\theta$ for $A$ and $X$ for $B$:

$P(\theta|X) = \dfrac{P(X|\theta)*P(\theta)}{P(X)}$

Nice! Now we can plug in some terminology we know:

$Posterior = \dfrac{likelihood * prior}{P(X)}$

But what is $P(X)$? Or in English, the probability of our data? That sounds weird… Let’s go back to some math and use B and A again:

We know that $P(B) = \sum_{A} P(A,B)$ (check out [this page](http://en.wikipedia.org/wiki/Marginal_distribution) for a refresher)

And from our definitions above, we know that:

$P(A,B) = P(B|A)*P(A)$

Thus:

$P(B) = \sum_{A} P(B|A)*P(A)$

Plug in our $\theta$ and $X$:

$P(X) = \sum_{\theta} P(X|\theta)*P(\theta)$

Plug in our terminology:

$P(X) = \sum_{\theta} likelihood * prior$

But what do we mean by $\sum_{\theta}$? This means to sum over all the values of our parameters. _In our coin flip example, we defined 100 values for our parameter p_, so we would have to calculate __the likelihood * prior for each of these values and sum all those answers__. That is our denominator for Bayes' Theorem. Thus our final answer for Bayes is:

$Posterior = \dfrac{likelihood * prior}{\sum_{\theta} likelihood * prior}$
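A minimal sketch of that final formula for a coin flip, using a grid of 100 candidate values for the parameter p. The data (60 heads in 100 flips) and the uniform prior are assumptions made for illustration. This is not the linked article's code.

```python
import numpy as np

# Assumed data for illustration: 60 heads out of 100 flips.
n_flips, n_heads = 100, 60

# 100 candidate values for the parameter p, with a uniform prior over them.
p_grid = np.linspace(0, 1, 100)
prior = np.ones_like(p_grid) / p_grid.size

# Likelihood of the data at each candidate p. The binomial coefficient is
# the same for every p, so it cancels in the ratio and is left out.
likelihood = p_grid**n_heads * (1 - p_grid) ** (n_flips - n_heads)

# Denominator: sum over theta of likelihood * prior.
evidence = np.sum(likelihood * prior)
posterior = likelihood * prior / evidence

print("Posterior sums to:", posterior.sum())
print("Most probable p on the grid:", p_grid[posterior.argmax()])
```

Dividing by the summed likelihood * prior is what makes the posterior a proper probability distribution over the grid.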