# Taking Hypothesis Testing One Step Forward with a Bayesian Twist

This lab is divided into two sections. In the first, we'll compute some of the Bayesian statistical testing procedures step by step, using Python as a calculator. In the second, we'll use PyMC to run the analysis and let the library produce more of the results for us automatically.

The first thing we'll do is load the data set, which contains three columns: the number of clicks on a website on a given day, the number of sales conversions (actual sales) on that day, and whether the day fell on a weekend.

>Note to instructor: This data is simulated, and I generated most of the click data using a Poisson process. Part of the "extra" challenge in this lab is to see whether the fast learners can figure out which distribution to use in place of the normal distribution I've outlined here for them.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt


# Load the data
CTR_dat = pd.read_csv('CSV/CTR_Sim.csv')

In [2]:
CTR_dat.head()

| Unnamed: 0 | Clicks Conv | Clicks | Weekend |
|---|---|---|---|
| 0 | 11 | 19 | 0 |
| 1 | 10 | 20 | 0 |
| 2 | 11 | 17 | 0 |
| 3 | 15 | 14 | 0 |
| 4 | 11 | 18 | 0 |


Our question is simple: if we calculate the mean clicks for weekend and non-weekend days, can we determine which mean represents the "true" mean for the site? In other words:

### If we wanted to know which set of clicks (weekend or weekday) represented the "true" set of clicks on the website, how would we state it formally in a statistical test?

**Answer**: $H_0$: the "true" mean is the Weekend Mean; $H_1$: the "true" mean is the Weekday Mean.

Compute the Weekend Mean and the Weekday Mean of clicks and assign them to the variables Weekend_Mean and Weekday_Mean.
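One way to approach this step is with boolean masks in pandas. The sketch below uses a small stand-in DataFrame so it runs on its own: the weekday rows copy the `head()` output above, the two weekend rows are invented, and coding the weekend as `Weekend == 1` is an assumption. In the lab, apply the same two lines to `CTR_dat`.

```python
import pandas as pd

# Stand-in rows with the same column names as CTR_Sim.csv.
# The two weekend rows are hypothetical; use CTR_dat in the lab.
toy = pd.DataFrame({
    "Clicks": [19, 20, 17, 14, 18, 26, 24],
    "Weekend": [0, 0, 0, 0, 0, 1, 1],
})

# Assumes Weekend == 1 marks weekend days and 0 marks weekdays.
Weekend_Mean = toy.loc[toy["Weekend"] == 1, "Clicks"].mean()
Weekday_Mean = toy.loc[toy["Weekend"] == 0, "Clicks"].mean()

print(Weekend_Mean, Weekday_Mean)
```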

We don't really have any beliefs one way or the other. How would I state this in Bayesian terms?

**Answer**: 
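A common way to express "no belief either way" is a uniform prior over the hypotheses. The sketch below assumes exactly two competing hypotheses, as in the question above; the dictionary keys are labels chosen for illustration.

```python
# No preference between the hypotheses: split the prior mass evenly.
hypotheses = ["H0: weekend mean", "H1: weekday mean"]
priors = {h: 1 / len(hypotheses) for h in hypotheses}

print(priors)
```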

Now that we have our priors, we need to move on to the likelihood function. Suppose I know that the true variance of clicks is 8. Further, we just received 100 more days of observations, all from mobile (as opposed to computer) access, and the mean number of clicks is 25. Compute the Standard Error (SER), assuming the number of clicks is drawn from a normal distribution.

**Answer**:
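With a known variance, the standard error of the mean is the square root of the variance divided by the number of observations. A minimal calculator-style sketch using the numbers from the prompt (variance 8, 100 days):

```python
import math

variance = 8.0   # known variance of clicks, from the prompt
n = 100          # number of new daily observations

# Standard error of the sample mean when the variance is known.
SER = math.sqrt(variance / n)

print(round(SER, 4))  # 0.2828
```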

Now compute the Z-scores and the associated Likelihoods:

**Answer**:
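The sketch below works through this step by hand. The two hypothesized means are placeholder values, not results from the data set; replace them with the `Weekend_Mean` and `Weekday_Mean` you computed. The likelihood here is taken to be the standard normal density evaluated at each Z-score. That omits a constant factor of 1/SER, which is the same for both hypotheses and therefore cancels in the posterior.

```python
import math

# Hypothetical placeholder means; substitute the values computed from the data.
Weekend_Mean = 25.4
Weekday_Mean = 24.7

observed_mean = 25.0        # mean clicks over the 100 new mobile days
SER = math.sqrt(8 / 100)    # standard error from the known variance


def z_score(hypothesized_mean):
    """Distance of the observed mean from a hypothesized mean, in SER units."""
    return (observed_mean - hypothesized_mean) / SER


def std_normal_pdf(z):
    """Standard normal density, written out by hand."""
    return math.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)


z_H0 = z_score(Weekend_Mean)   # H0: weekend mean is the "true" mean
z_H1 = z_score(Weekday_Mean)   # H1: weekday mean is the "true" mean
like_H0 = std_normal_pdf(z_H0)
like_H1 = std_normal_pdf(z_H1)

print(z_H0, like_H0)
print(z_H1, like_H1)
```

If the hypothesized means sit many standard errors away from the observed mean, these densities become extremely small, and working with log-likelihoods is the safer choice.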

Compute the total probability (remember from the first lecture):

**Answer**: 
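The total probability of the data is the prior-weighted sum of the likelihoods, $P(D) = \sum_i P(D \mid H_i) P(H_i)$. The helper name `total_probability` and the likelihood values below are made up for illustration; plug in your own priors and likelihoods.

```python
def total_probability(priors, likelihoods):
    """Law of total probability: sum of P(D | H_i) * P(H_i) over all hypotheses."""
    return sum(p * l for p, l in zip(priors, likelihoods))


# Hypothetical likelihood values, for illustration only.
priors = [0.5, 0.5]
likelihoods = [0.15, 0.23]

evidence = total_probability(priors, likelihoods)
print(evidence)
```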

Now let's bring it all together: what is the posterior for the first hypothesis and for the second hypothesis?

**Answer:**
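Bayes' rule for a finite set of hypotheses divides each prior-times-likelihood term by the total probability. The sketch below is one way to write it; the function name `posteriors` and the input numbers are hypothetical.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule for a finite set of hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)                  # total probability of the data
    return [j / evidence for j in joint]


# Equal priors and hypothetical likelihood values, for illustration only.
post_H0, post_H1 = posteriors([0.5, 0.5], [0.15, 0.23])
print(post_H0, post_H1)
```

With equal priors, the posteriors are simply the likelihoods rescaled to sum to one.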

We've just completed a hands-on exercise computing a statistical test using Bayesian analysis. But how do we interpret it? Think about it and discuss with your colleagues for a while. Is this strong evidence for one hypothesis or the other? What are the possible ways we could "measure" the strength of $H_0$ vis-a-vis $H_1$?

**Answer**: 
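One candidate measure, offered here as a suggestion rather than the lab's official answer, is the Bayes factor (the ratio of the two likelihoods) together with the posterior odds (prior odds times the Bayes factor). The function names and numbers below are hypothetical.

```python
def bayes_factor(like_a, like_b):
    """How much more probable the data are under hypothesis a than under b."""
    return like_a / like_b


def posterior_odds(prior_a, prior_b, like_a, like_b):
    """Posterior odds = prior odds * Bayes factor."""
    return (prior_a / prior_b) * bayes_factor(like_a, like_b)


# Hypothetical likelihoods for H1 and H0, with equal priors.
bf_10 = bayes_factor(0.23, 0.15)
odds_10 = posterior_odds(0.5, 0.5, 0.23, 0.15)
print(bf_10, odds_10)
```

A Bayes factor near 1 means the data barely discriminate between the hypotheses; with equal priors the posterior odds equal the Bayes factor.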

# Credible Interval vs Confidence Interval

To review again in plain English, there is a clear difference between Credible and Confidence interval interpretations. 

**Confidence Intervals** : If the experiment is repeated many times, X.X% ('X.X' being the desired confidence 'level') of the resulting confidence intervals will contain the "true" value of the parameter you are interested in (usually the mean). 

**Credible Intervals** : Given some data, there is an X.X% probability that the "true" value of your parameter is contained within your credible interval. 

As we discussed, the credible interval is much easier to explain to normal people, and does not require you to go into horse-shoe metaphors, or any other linguistic legerdemain. 

A subtle point deserves emphasis. As in other parts of Bayesian analysis, the Bayesian "inversion" trick often swaps what is held fixed in an analysis and what is allowed to vary. It is no different here: in the confidence interval, the parameter is fixed, but each repetition of the experiment brings a new confidence interval (hence the horse-shoe metaphor). 

In the credible interval it is the opposite: the interval is fixed once the data are observed, and the parameter is what varies.
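The contrast can be sketched numerically. The example below reuses the numbers from the first half of the lab (variance 8, 100 observations, observed mean 25) and makes two assumptions of its own: a flat prior on the mean, so that the posterior is normal and centered on the observed mean, and a true mean of 25 for the repeated-sampling simulation.

```python
import numpy as np
import scipy.stats as st

sigma = np.sqrt(8.0)        # known standard deviation (variance of 8)
n = 100
ser = sigma / np.sqrt(n)

# Credible interval: one fixed data set (observed mean 25) and a flat prior,
# so the posterior for the mean is Normal(25, ser). The parameter is random.
cred_low, cred_high = st.norm.interval(0.95, loc=25.0, scale=ser)

# Confidence interval: fix a true mean and redraw the data many times.
# The parameter never moves; the interval does.
rng = np.random.default_rng(0)
true_mean = 25.0            # assumed for the simulation only
sample_means = rng.normal(true_mean, sigma, size=(5000, n)).mean(axis=1)
half_width = st.norm.ppf(0.975) * ser
coverage = np.mean(np.abs(sample_means - true_mean) <= half_width)

print(cred_low, cred_high)
print(coverage)             # fraction of intervals that contain true_mean
```

In this simple setting the two intervals have the same endpoints for a given data set; only the interpretation differs.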


### Implementing a basic Bayesian statistical test with SciPy 

Before we put the finger calculations above into code, let's do some quick visualizations comparing the data to a fitted normal distribution. Below, create two matplotlib plots, one for weekend clicks and one for weekday clicks, each showing a histogram of the data overlaid with its fitted normal distribution.

Then plot both distributions over each other. Does the graph visually confirm what you suspected based on the finger exercises above?
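A possible layout for these plots is sketched below. To keep it self-contained it draws stand-in click counts from Poisson distributions with made-up rates; in the lab, use the Clicks column of `CTR_dat` split by the Weekend flag instead. The normal fit uses `scipy.stats.norm.fit`, which returns the maximum-likelihood mean and standard deviation.

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Simulated stand-in data with hypothetical rates; replace with the real clicks.
rng = np.random.default_rng(42)
groups = {
    "Weekday": rng.poisson(lam=18, size=250),
    "Weekend": rng.poisson(lam=25, size=100),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
fits = {}

# First two panels: histogram of each group with its fitted normal curve.
for ax, (label, clicks) in zip(axes, groups.items()):
    mu, sd = st.norm.fit(clicks)
    fits[label] = (mu, sd)
    x = np.linspace(clicks.min(), clicks.max(), 200)
    ax.hist(clicks, bins=15, density=True, alpha=0.5)
    ax.plot(x, st.norm.pdf(x, mu, sd))
    ax.set_title(f"{label} clicks with fitted normal")

# Third panel: both fitted curves on the same axes.
x_all = np.linspace(5, 45, 400)
for label, (mu, sd) in fits.items():
    axes[2].plot(x_all, st.norm.pdf(x_all, mu, sd), label=label)
axes[2].set_title("Fitted normals overlaid")
axes[2].legend()

fig.tight_layout()
```

Outside a notebook, call `plt.show()` or `fig.savefig(...)` to see the figure.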

Although SciPy isn't explicitly a Bayesian computing library, we are going to use its standard statistical methods to compute the two posteriors above.

First browse the following page (SciPy stats docs): http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html

Now, using the stats suite of methods, define the normal likelihood for both hypothesis 1 and hypothesis 2.




In [2]:
# normal_likelihood for hypothesis 1


In [3]:
# normal likelihood for hypothesis 2
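A minimal sketch of the two likelihood cells using `scipy.stats.norm.pdf`. It assumes hypothesis 1 is $H_0$ (weekend mean) and hypothesis 2 is $H_1$ (weekday mean), and the two means are placeholder values rather than results from the data. Passing `loc` and `scale` lets SciPy evaluate the density of the observed sample mean directly, without computing a Z-score first.

```python
import numpy as np
import scipy.stats as st

observed_mean = 25.0          # mean clicks over the 100 new mobile days
ser = np.sqrt(8 / 100)        # standard error from the known variance


def normal_likelihood(hypothesized_mean):
    """Density of the observed sample mean if the true mean were hypothesized_mean."""
    return st.norm.pdf(observed_mean, loc=hypothesized_mean, scale=ser)


# Hypothetical placeholder means; substitute Weekend_Mean and Weekday_Mean.
likelihood_1 = normal_likelihood(25.4)   # hypothesis 1 (weekend)
likelihood_2 = normal_likelihood(24.7)   # hypothesis 2 (weekday)

print(likelihood_1, likelihood_2)
```

These values differ from the standard normal density of the Z-score by a factor of 1/SER, which is identical for both hypotheses and does not affect the posterior.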



Now bring it all together and define both posterior distributions based on the work above. 

In [4]:
#Posterior 1


In [5]:
#Posterior 2
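One way to fill in the two posterior cells is sketched below, again with placeholder means and equal priors. It works in log space with `scipy.stats.norm.logpdf` and `scipy.special.logsumexp`, a design choice (not something the lab prescribes) that keeps the calculation stable when the hypothesized means lie many standard errors from the observed mean.

```python
import numpy as np
import scipy.stats as st
from scipy.special import logsumexp

# Hypothetical placeholder means for hypothesis 1 (weekend) and 2 (weekday).
means = np.array([25.4, 24.7])
log_priors = np.log([0.5, 0.5])          # equal priors

# Log-likelihood of the observed mean (25) under each hypothesis.
log_like = st.norm.logpdf(25.0, loc=means, scale=np.sqrt(8 / 100))

# Bayes' rule in log space: subtract the log of the total probability.
log_joint = log_priors + log_like
posterior = np.exp(log_joint - logsumexp(log_joint))

posterior_1, posterior_2 = posterior
print(posterior_1, posterior_2)
```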

