
# Lambda School Data Science Module 133

## Introduction to Bayesian Inference




## Assignment - Code it up!

Most of the above was pure math. Now write Python code to reproduce the results! This is purposefully open-ended: you'll have to think about how to represent probabilities and events. You can and should look things up and, as a stretch goal, refactor your code into helpful reusable functions!

Specific goals/targets:

1. Write a function `def prob_drunk_given_positive(prob_drunk_prior, prob_positive, prob_positive_drunk)` that reproduces the example from lecture, and use it to calculate and visualize a range of situations
2. Explore `scipy.stats.bayes_mvs` - read its documentation, and experiment with it on data you've tested in other ways earlier this week
3. Create a visualization comparing the results of a Bayesian approach to a traditional/frequentist approach
4. In your own words, summarize the difference between Bayesian and Frequentist statistics

If you're unsure where to start, check out [this blog post on Bayes' theorem with Python](https://dataconomy.com/2015/02/introduction-to-bayes-theorem-with-python/). You could and should create something similar!

Stretch goals:

- Apply a Bayesian technique to a problem you previously worked on (in an assignment or project) from a frequentist (standard) perspective
- Check out [PyMC3](https://docs.pymc.io/) (note this goes beyond hypothesis tests into modeling) - read the guides and work through some examples
- Take PyMC3 further - see if you can build something with it!

I need to find P(drunk | positive breath test). By Bayes' theorem, that equals the true positive rate, P(positive | drunk), times the background probability that someone is drunk (the prior belief), divided by the overall probability of a positive test. In the breathalyzer case the prior is so small that the overall probability of a positive test is nearly equal to the false positive rate.
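
Written out, with the denominator expanded by the law of total probability, the calculation looks like this. The symbols $D$ (driver is drunk) and $+$ (positive test) are notation introduced here for illustration, not taken from the lecture:

$$
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)},
\qquad
P(+) = P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,\bigl(1 - P(D)\bigr)
$$

Here $P(+ \mid D)$ is the true positive rate, $P(+ \mid \neg D)$ is the false positive rate, and $P(D)$ is the prior.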

In [0]:
def prob_drunk_given_positive(prob_drunk_prior, prob_positive, prob_positive_drunk):
  """Bayes' theorem: P(drunk | positive) = P(positive | drunk) * P(drunk) / P(positive)."""
  numerator = prob_positive_drunk * prob_drunk_prior
  return numerator / prob_positive
  

In [0]:
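def prob_positive(prior_belief, false_pos, true_pos):
  """Overall probability of a positive test (law of total probability)."""
  pos_pos = true_pos * prior_belief          # drunk and tests positive
  pos_neg = false_pos * (1 - prior_belief)   # sober but tests positive
  return pos_pos + pos_neg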
  

In [0]:
# Let's say the true positive rate is 95% and the false positive rate is 10% (based on manufacturer's stats).
# Research says there are 111 million self-reported incidents of drunk driving in a year,
# and there are 225 million licensed drivers in the US.
# If I work from the very basic assumption that everyone with a license drives twice per day
# (which balances out people who hardly drive against people who make several trips),
# that gives 225 million * 2 * 365 total drives per year, so my prior belief should be roughly 0.00067
# (drunk driving reports divided by total drives per year).

In [0]:
prob_pos = prob_positive(0.00067, 0.10, 0.95)

In [4]:
prob_drunk_given_positive(0.00067, prob_pos, 0.95)

0.006328956592207379

That would mean only a 0.6% chance that a driver is drunk given a positive breath test!

But that assumes drunk driving is spread uniformly across all drives, which is of course not true.

In [5]:
# Let's use the same true and false positive rates, and assume 5% of our drivers are drunk (it's 1 AM on Saturday)
prob_pos = prob_positive(0.05, 0.10, 0.95)
prob_drunk_given_positive(0.05, prob_pos, 0.95)

0.3333333333333333
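
To cover a range of situations, the two steps above can be folded into one helper and evaluated over several priors. This is a minimal sketch: `posterior_drunk` and the list of priors are hypothetical names and values chosen for illustration, and the 95% / 10% rates are the same assumed rates used above.

```python
def posterior_drunk(prior, true_pos=0.95, false_pos=0.10):
    """Return P(drunk | positive test) for a given prior P(drunk)."""
    # Denominator: overall probability of a positive test
    p_positive = true_pos * prior + false_pos * (1 - prior)
    return true_pos * prior / p_positive

# Hypothetical priors, from "random drive" up to "coin flip"
priors = [0.00067, 0.01, 0.05, 0.25, 0.50]
posteriors = [posterior_drunk(p) for p in priors]

for prior, post in zip(priors, posteriors):
    print(f"prior = {prior:.5f} -> posterior = {post:.4f}")
```

The resulting `posteriors` list could be passed to a plotting library to visualize how strongly the answer depends on the prior.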

In [6]:
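# Let's look at a re-test scenario: the chance of a false positive TWICE is only 10% * 10%, or 1%
# (simplification: the true positive rate is kept at 95%; two positive tests would strictly give 0.95 * 0.95)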

prob_pos = prob_positive(0.05, 0.01, 0.95)
prob_drunk_given_positive(0.05, prob_pos, 0.95)

0.8333333333333333
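
Another way to model the re-test is sequential updating: the posterior from the first positive test becomes the prior for the second. The sketch below assumes the two tests are independent given the driver's true state, and `update` is a hypothetical helper name. Because the true positive rate compounds as well as the false positive rate, the result after two positives comes out slightly lower than the 0.833 above.

```python
def update(prior, true_pos=0.95, false_pos=0.10):
    """One Bayesian update after a single positive test."""
    p_positive = true_pos * prior + false_pos * (1 - prior)
    return true_pos * prior / p_positive

belief = 0.05          # prior before any test (same 5% as above)
history = [belief]
for _ in range(2):     # two positive tests in a row
    belief = update(belief)
    history.append(belief)

print(history)         # prior, after first positive, after second positive
```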

In [0]:
import pandas as pd
import numpy as np

In [0]:

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

df = pd.read_csv(url, header=None)

columns = ['Class Name', 'handicapped-infants', 
           'water-project-cost-sharing',
           'adoption-of-the-budget-resolution',
           'physician-fee-freeze',
           'el-salvador-aid',
           'religious-groups-in-schools',
           'anti-satellite-test-ban', 
           'aid-to-nicaraguan-contras',
           'mx-missile',
           'immigration',
           'synfuels-corporation-cutback',
           'education-spending',
           'superfund-right-to-sue',
           'crime',
           'duty-free-exports',
           'export-administration-act-south-africa']
df.columns = columns

In [0]:
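# Encode votes as 1 (yes) / 0 (no); '?' marks a missing vote and becomes NaN
vote_cols = columns[1:]
df[vote_cols] = df[vote_cols].replace({'y': 1, 'n': 0, '?': np.nan}).astype(float)
# Fill missing votes with each column's mean across both parties
df = df.fillna(df.mean(numeric_only=True))
dem_mask = df['Class Name'] == 'democrat'
df_dems = df[dem_mask]
rep_mask = df['Class Name'] == 'republican'
df_reps = df[rep_mask]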

In [0]:
from scipy import stats

In [0]:
rep_arr_crime = df_reps['crime'].values
dem_arr_crime = df_dems['crime'].values

In [31]:
rep_arr_crime

array([1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 0.59330144, 0.59330144, 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 0.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.     

In [32]:
stats.bayes_mvs(rep_arr_crime, alpha=0.95)

(Mean(statistic=0.9651970836181363, minmax=(0.9417585340819545, 0.988635633154318)),
 Variance(statistic=0.023965680212103144, minmax=(0.01932010102772202, 0.029707162775876695)),
 Std_dev(statistic=0.1545741482465878, minmax=(0.13899676624915425, 0.17235765946390866)))

In [33]:
stats.bayes_mvs(dem_arr_crime, alpha=0.95)

(Mean(statistic=0.3592996792287153, minmax=(0.3025259916761121, 0.4160733667813185)),
 Variance(statistic=0.2236794115443048, minmax=(0.18862013252081036, 0.2651417081251185)),
 Std_dev(statistic=0.47249992404532587, minmax=(0.4343041935335305, 0.5149191277522311)))
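
For comparison with a frequentist approach, the same kind of interval can be computed with a classical t-based confidence interval. The sketch below uses a small hypothetical 0/1 array rather than the real voting data so that it runs on its own. As I understand it, `bayes_mvs` works from a noninformative prior, so for a sample of this size the two intervals for the mean should be numerically very close and differ mainly in interpretation.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 votes standing in for one party's column (not the real data)
votes = np.array([1.0] * 30 + [0.0] * 10)

# Bayesian: 95% credible interval for the mean
bayes_mean, _, _ = stats.bayes_mvs(votes, alpha=0.95)

# Frequentist: 95% t-based confidence interval for the mean
n = len(votes)
freq_low, freq_high = stats.t.interval(
    0.95, n - 1, loc=votes.mean(), scale=stats.sem(votes)
)

print("Bayesian   :", bayes_mean.minmax)
print("Frequentist:", (freq_low, freq_high))
```

Swapping `votes` for `rep_arr_crime` or `dem_arr_crime` would give the comparison on the real data, and the two pairs of endpoints could then be drawn side by side as error bars.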

### Bayesian vs. Frequentist

Frequentist statistics defines probability as the long-run relative frequency of a repeatable random outcome (e.g., a coin flip) over a very large number of trials. It does not attach a probability to a hypothesis or to a fixed but unknown value.

Bayesian statistics treats probability as a degree of belief, so it can attach a probability to any hypothesis or uncertain event, such as the outcome of an election. That matters because such an event cannot be re-run many times to observe a long-run frequency. Instead, a prior belief is updated with the observed data to give a posterior.
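
A small coin-flip example may make the contrast concrete. This is only a sketch: the data (7 heads in 10 flips) are hypothetical, and the uniform Beta(1, 1) prior is an assumption chosen for simplicity. The frequentist estimate is the observed relative frequency, while the Bayesian estimate is the mean of the posterior obtained by updating the prior with the data.

```python
heads, flips = 7, 10   # hypothetical coin-flip data

# Frequentist point estimate: the observed relative frequency
freq_estimate = heads / flips

# Bayesian: start from a uniform Beta(1, 1) prior over P(heads) and
# update with the data; the posterior is Beta(1 + heads, 1 + tails)
prior_a, prior_b = 1, 1
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
bayes_estimate = post_a / (post_a + post_b)   # posterior mean

print(freq_estimate, bayes_estimate)
```

With more flips the prior matters less and the two estimates move closer together.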

## Resources

- [Worked example of Bayes rule calculation](https://en.wikipedia.org/wiki/Bayes'_theorem#Examples) (helpful as it fully breaks out the denominator)
- [Source code for mvsdist in scipy](https://github.com/scipy/scipy/blob/90534919e139d2a81c24bf08341734ff41a3db12/scipy/stats/morestats.py#L139)