In [18]:
from __future__ import division
import math, random

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate inverse using binary search"""
    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, low_p = -10.0, 0            # normal_cdf(-10) is (very close to) 0
    hi_z,  hi_p  =  10.0, 1            # normal_cdf(10)  is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2     # consider the midpoint
        mid_p = normal_cdf(mid_z)      # and the cdf's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break

    return mid_z

# Chapter 7. Hypothesis and Inference

What will we do with all this statistics and probability theory?  
The science part of data science frequently involves forming and testing hypotheses about our data and the processes that generate it.  

## Statistical Hypothesis Testing

For our purposes, [hypotheses](https://en.wikipedia.org/wiki/Hypothesis) are assertions like "this coin is fair" that can be translated into statistics about data.  
In the classical setup, we have a null hypothesis $H_0$ that represents some default position, and an alternative hypothesis $H_1$ that we would like to compare it with.  
We then use statistics to decide whether we can reject $H_0$ as false or not.  

## Example: Flipping a Coin

Imagine that we have a coin and we want to test whether it's fair or not.  
We make the assumption that the coin has some probability $p$ of landing heads up, so our null hypothesis is that the coin is fair, or $p=0.5$.  
The null hypothesis is tested against the alternative hypothesis $p\neq0.5$.

Our test will involve flipping the coin some number $n$ times and counting the number of heads $X$.  
Each coin flip is a [Bernoulli trial](https://en.wikipedia.org/wiki/Bernoulli_trial), which means that $X$ is a Binomial($n$, $p$) random variable. For large $n$ it can be approximated by the normal distribution covered in Chapter 6.

In [19]:
def normal_approximation_to_binomial(n, p):
    """ finds mu and sigma corresponding to a Binomial(n, p) """
    mu = p * n
    sigma = math.sqrt(p *(1 - p) * n)
    return mu, sigma
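
As a rough check on how good this approximation is, the sketch below compares an exact binomial probability with its normal counterpart for $n=1000$, $p=0.5$. The helper `binomial_pmf` is not part of the original notebook; it is a hypothetical name introduced here, and the half-unit offset (510.5 rather than 510) is the continuity correction discussed later in the chapter.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def binomial_pmf(k, n, p):
    """exact P(X = k) for X ~ Binomial(n, p), computed in log space
    (hypothetical helper, not from the original notebook)"""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))

n, p = 1000, 0.5
mu, sigma = p * n, math.sqrt(p * (1 - p) * n)

# P(X <= 510): exact sum versus the normal approximation
exact = sum(binomial_pmf(k, n, p) for k in range(0, 511))
approx = normal_cdf(510.5, mu, sigma)   # 510.5 applies a continuity correction
print(exact)
print(approx)
```

With these parameters the two numbers should agree closely, which is what justifies using the normal functions for the rest of the example.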

Whenever a random variable follows a normal distribution, we can use <code>normal_cdf</code> to figure out the probability that its realized value lies within (or outside) a particular interval.

In [20]:
# The normal_cdf is the probability that the variable is below a threshold
normal_probability_below = normal_cdf

# therefore, if it's not below the threshold, it's above the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# less than hi and not less than lo gives us between
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# and if it's not between, it's outside
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
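
A quick usage sketch for a standard normal variable (the definitions are repeated so that the snippet runs on its own):

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)

# probability of landing within one standard deviation of the mean
within_one_sd = normal_probability_between(-1, 1)
# probability of landing more than two standard deviations from the mean
outside_two_sd = normal_probability_outside(-2, 2)
print(within_one_sd)
print(outside_two_sd)
```

These should come out near the familiar 68% and 4.6% figures for the normal distribution.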

We can also do the reverse -- find either the nontail region or the (symmetric) interval around the mean that accounts for a certain level of likelihood.  
For example, if we want to find an interval centered at the mean and containing 60% probability, then we find the cutoffs where the upper and lower tails each contain 20% of the probability:

In [21]:
def normal_upper_bound(probability, mu=0, sigma=1):
    """ returns the z for which P(Z <= z) = probability """
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """ returns the z for which P(Z >= z) = probability """
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """ returns the symmetric (about the mean) bounds that contain the specified probability """
    tail_probability = (1 - probability) / 2
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)

    return lower_bound, upper_bound
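
The 60% example can be cross-checked against the standard library. This sketch assumes Python 3.8 or newer, where `statistics.NormalDist` is available; it is an independent check and not the notebook's own binary-search implementation.

```python
from statistics import NormalDist  # assumption: Python 3.8 or newer

standard_normal = NormalDist(mu=0, sigma=1)

# 60% in the middle leaves 20% in each tail
lower = standard_normal.inv_cdf(0.20)
upper = standard_normal.inv_cdf(0.80)
print(lower, upper)   # symmetric about the mean

# confirm that each tail holds 20% of the probability
lower_tail = standard_normal.cdf(lower)
upper_tail = 1 - standard_normal.cdf(upper)
print(lower_tail, upper_tail)
```

Calling `normal_two_sided_bounds(0.60)` should give approximately the same pair, up to the binary search's tolerance.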

In particular, let's say that we choose to flip the coin $n=1000$ times.  
If our hypothesis of fairness is true, $X$ should be distributed approximately normally with mean 500 and standard deviation 15.8:

In [22]:
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
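
Written out, the arithmetic behind these two values follows directly from the formulas in `normal_approximation_to_binomial`:

$$\mu_0 = np = 1000 \times 0.5 = 500, \qquad \sigma_0 = \sqrt{np(1-p)} = \sqrt{1000 \times 0.5 \times 0.5} = \sqrt{250} \approx 15.8$$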

We need to make a decision about <em>significance</em>: how willing are we to make a <em>type 1 error</em> (false positive), in which we reject $H_0$ (the null hypothesis) even though it's true?  
For reasons lost to the annals of history (meaning it's common practice but no one can agree why), this willingness is often set at 5% or 1%.  
We're going with 5%.

Consider the test that rejects $H_0$ if $X$ falls outside the bounds given by:

In [23]:
normal_two_sided_bounds(0.95, mu_0, sigma_0)

(469.01026640487555, 530.9897335951244)

Assuming $p$ really equals 0.5 (meaning $H_0$ is true), there is just a 5% chance that we observe an $X$ that lies outside this interval, which is the exact significance that we wanted.  
Said differently, if $H_0$ is true, then, approximately 19 times out of 20, this test will give the correct result.  
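
Because $X$ is really an integer count, the 5% figure is itself an approximation. The sketch below sums exact binomial probabilities outside the bounds printed above to estimate the test's actual type 1 error rate. `binomial_pmf` is a hypothetical helper introduced for this illustration, not part of the original notebook.

```python
import math

def binomial_pmf(k, n, p):
    """exact P(X = k) for X ~ Binomial(n, p), computed in log space
    (hypothetical helper, not from the original notebook)"""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))

n = 1000
lo, hi = 469.01026640487555, 530.9897335951244   # bounds printed above

# the test rejects H_0 whenever the integer count X falls outside (lo, hi)
type_1_probability = sum(binomial_pmf(k, n, 0.5)
                         for k in range(n + 1)
                         if k < lo or k > hi)
print(type_1_probability)   # close to, but not exactly, 0.05
```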


We are also often interested in the <em>power</em> of a test, which is the probability of not making a <em>type 2 error</em> (false negative), in which we fail to reject $H_0$ even though it is false.  
In order to measure this, we have to specify what exactly $H_0$ being false <em>means</em>.  
In other words, knowing merely that $p$ is not 0.5 doesn't tell you very much about the distribution of $X$.  
In particular, let's check what happens if $p$ is really 0.55, so that the coin is slightly biased towards heads.

In that case, we can calculate the power of the test with:

In [24]:
# 95% bounds based on the assumption p is 0.5
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)

# actual mu and sigma based on p = 0.55
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)

# a type 2 error means we fail to reject the null hypothesis, 
# which will happen when X is still in our original interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(type_2_probability)
print(power)
power

0.113451998705
0.886548001295


0.8865480012953671
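
Power depends on which alternative is actually true. As a sketch, the same calculation can be wrapped in a function and evaluated at several values of $p$. The name `power_at` and the particular list of alternatives are assumptions made for this illustration.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_approximation_to_binomial(n, p):
    return p * n, math.sqrt(p * (1 - p) * n)

lo, hi = 469.01026640487555, 530.9897335951244   # 95% bounds under p = 0.5

def power_at(p, n=1000):
    """probability of rejecting H_0 when the true heads probability is p
    (hypothetical helper, not from the original notebook)"""
    mu, sigma = normal_approximation_to_binomial(n, p)
    type_2_probability = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return 1 - type_2_probability

for p in [0.51, 0.53, 0.55, 0.60]:
    print(p, power_at(p))
```

The further the true $p$ is from 0.5, the more likely the test is to detect the bias. At $p=0.5$ itself the "power" is just the 5% significance level.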

Imagine instead that our null hypothesis was that the coin is not biased towards heads, or that $p\le0.5$.  
In that case we want a <em>one-sided test</em> that rejects the null hypothesis when $X$ is much larger than 500 but not when $X$ is smaller than 500.  
So, let's do a 5%-significance test using <code>normal_upper_bound</code> to find the cutoff below which 95% of the probability lies:

In [25]:
hi = normal_upper_bound(0.95, mu_0, sigma_0)
hi  # is 526 (< 531, since we need more probability in the upper tail)

526.0073585242053

In [26]:
type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
power = 1 - type_2_probability
power

0.9363794803307173

This is a more powerful test, since it no longer rejects $H_0$ when X is below 469 (which is unlikely to happen if $H_1$ is true)  
and instead rejects $H_0$ when X is between 526 and 531 (which is somewhat likely to happen if $H_1$ is true).

## p-values

An alternative way of thinking about the preceding test involves p-values.  
Instead of choosing bounds based on some probability cutoff, we compute the probability -- assuming $H_0$ is true -- that we would see a value at least as extreme as the one we actually observed.

For our two-sided test of whether the coin is fair, we compute: 

In [27]:
def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is what's greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is what's less than x
        return 2 * normal_probability_below(x, mu, sigma)

# If we were to see 530 heads, we would compute:
two_sided_p_value(529.5, mu_0, sigma_0)  # mu_0 is 500.0

0.06207721579598857

Why did we choose 529.5 instead of 530?  
This is called a [continuity correction](https://en.wikipedia.org/wiki/Continuity_correction).  
It reflects the fact that <code>normal_probability_between(529.5, 530.5, mu_0, sigma_0)</code> is a better estimate of the probability of seeing 530 heads than <code>normal_probability_between(530, 531, mu_0, sigma_0)</code> is.  
Correspondingly, <code>normal_probability_above(529.5, mu_0, sigma_0)</code> is a better estimate of the probability of seeing at least 530 heads.
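
The claim can be checked against the exact binomial probability of seeing exactly 530 heads. This is a sketch; `binomial_pmf` is a hypothetical helper that does not appear in the original notebook.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def binomial_pmf(k, n, p):
    """exact P(X = k) for X ~ Binomial(n, p), computed in log space
    (hypothetical helper, not from the original notebook)"""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))

mu_0, sigma_0 = 500.0, math.sqrt(250)

exact = binomial_pmf(530, 1000, 0.5)
corrected = normal_probability_between(529.5, 530.5, mu_0, sigma_0)
uncorrected = normal_probability_between(530, 531, mu_0, sigma_0)
print(exact)
print(corrected)     # interval centered on 530
print(uncorrected)   # interval shifted half a unit to the right
```

The interval centered on 530 should land much closer to the exact value than the shifted one.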

One way to convince yourself that this is a sensible estimate is with a simulation:

In [28]:
extreme_value_count = 0
for _ in range(100000):
    # count number of heads in 1000 flips
    num_heads = sum(1 if random.random() < 0.5 else 0 for _ in range(1000))
    if num_heads >= 530 or num_heads <= 470:
        # and count how often the number is 'extreme'
        extreme_value_count += 1
        
print(extreme_value_count / 100000)

0.06168
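
For comparison, the same tail probability can also be computed exactly from the binomial distribution, with no simulation noise. This is a sketch using a hypothetical `binomial_pmf` helper that is not part of the original notebook.

```python
import math

def binomial_pmf(k, n, p):
    """exact P(X = k) for X ~ Binomial(n, p), computed in log space
    (hypothetical helper, not from the original notebook)"""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))

# probability of a count at least as extreme as 530 heads, in either direction
exact_p_value = sum(binomial_pmf(k, 1000, 0.5)
                    for k in range(1001)
                    if k >= 530 or k <= 470)
print(exact_p_value)
```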


Since the p-value is greater than our 5% significance, we <em>would not reject</em> the null hypothesis.  
If instead we saw 532 heads, the p-value would be:

In [29]:
two_sided_p_value(531.5, mu_0, sigma_0)

0.046345287837786575

which is smaller than the 5% significance level, so we <em>would reject</em> the null hypothesis.  
It's the exact same test as before, just a different way of approaching the statistics.
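
One way to see the equivalence: evaluated exactly at the 95% bounds computed earlier, the two-sided p-value should come out at (almost exactly) 0.05. Any observation beyond a bound therefore has a p-value below 5%. The sketch repeats the needed definitions so it runs on its own.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        return 2 * (1 - normal_cdf(x, mu, sigma))
    else:
        return 2 * normal_cdf(x, mu, sigma)

mu_0, sigma_0 = 500.0, math.sqrt(250)
lo, hi = 469.01026640487555, 530.9897335951244   # 95% bounds computed earlier

p_at_lo = two_sided_p_value(lo, mu_0, sigma_0)
p_at_hi = two_sided_p_value(hi, mu_0, sigma_0)
print(p_at_lo)
print(p_at_hi)
```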

Similarly, we would have:

In [30]:
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below

For our one-sided test, if we saw 525 heads we would compute:

In [31]:
upper_p_value(524.5, mu_0, sigma_0)

0.06062885772582083

which means we <em>would not reject</em> the null hypothesis.  
If we saw 527 heads, the computation would be:

In [32]:
upper_p_value(526.5, mu_0, sigma_0)

0.04686839508859242

and we <em>would reject</em> the null hypothesis.

#### Caution
Make sure your data is roughly normally distributed before using <code>normal_probability_above()</code> to compute p-values.  
The annals of bad data science are filled with examples of people opining that the chance of some observed event occurring at random is one in a million, when what they really mean is "the chance, assuming the data is distributed normally", which is pretty meaningless if the data is not distributed normally.  
There are various statistical tests for normality, but even plotting the data is a good start.
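
As a very crude stand-in for a proper normality test, one can check whether roughly 68% of the data falls within one standard deviation of the mean. The function name and the synthetic data sets below are assumptions made for illustration. Passing this check does not establish normality, but failing it badly is a warning sign.

```python
from __future__ import division
import math, random

def fraction_within_one_sd(data):
    """share of observations within one sample standard deviation of the mean;
    roughly 0.68 for normally distributed data (hypothetical helper)"""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return sum(1 for x in data if abs(x - mean) <= sd) / n

random.seed(0)
normal_data = [random.gauss(0, 1) for _ in range(10000)]
skewed_data = [random.expovariate(1) for _ in range(10000)]

print(fraction_within_one_sd(normal_data))   # near 0.68
print(fraction_within_one_sd(skewed_data))   # noticeably higher
```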

## Confidence Intervals

We have been testing hypotheses about the value of the heads probability $p$, which is a <em>parameter</em> of the unknown 'heads' distribution.  
When this is the case, a third approach is to construct a <em>confidence interval</em> around the observed value of the parameter.  
For example, we can estimate the probability of the unfair coin by looking at the average value of the Bernoulli variables corresponding to each flip -- 1 if heads, 0 if tails.  
If we observe 525 heads out of 1000 flips, then we estimate $p$ equals 0.525.  
How confident can we be about this estimate?  
Well, if we knew the exact value of $p$, the Central Limit Theorem tells us that the average of those Bernoulli variables should be approximately normal, with mean $p$ and standard deviation <code>math.sqrt(p * (1 - p) / 1000)</code>.  
Here, we don't know $p$, so instead we use our estimate:

In [33]:
p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
sigma

0.015791611697353755

This is not entirely justified, but people seem to do it anyway.  
Using the normal approximation, we conclude that we are "95 percent confident" that the following interval contains the parameter $p$:

In [34]:
normal_two_sided_bounds(0.95, mu, sigma)

(0.4940490278129096, 0.5559509721870904)
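
The same interval can be written by hand, using the standard normal cutoff of about 1.96 for a two-sided 95% interval:

$$\hat{p} \pm 1.96\,\sigma = 0.525 \pm 1.96 \times 0.0158 \approx 0.525 \pm 0.031 = (0.494,\ 0.556)$$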

Note:  
This is a statement about the <em>interval</em>, not about $p$.  
You should understand it as the assertion that if you were to repeat the experiment many times, 95% of the time the "true" parameter (which is the same every time) would lie within the observed confidence interval (which might be different every time).
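
This "repeat the experiment many times" reading can be illustrated with a small simulation. The true value `true_p = 0.525`, the number of repetitions, and the fixed seed are all assumptions chosen for the sketch. The cutoff 1.96 is the usual standard normal value for a two-sided 95% interval.

```python
from __future__ import division
import math, random

random.seed(0)
true_p = 0.525           # assumption: chosen only for this illustration
n, num_experiments = 1000, 2000
z = 1.96                 # standard normal cutoff for a 95% two-sided interval

covered = 0
for _ in range(num_experiments):
    # one experiment: flip the coin n times and build the interval
    heads = sum(1 for _ in range(n) if random.random() < true_p)
    p_hat = heads / n
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    # does this experiment's interval contain the fixed true parameter?
    if p_hat - z * sigma <= true_p <= p_hat + z * sigma:
        covered += 1

coverage = covered / num_experiments
print(coverage)   # should land near 0.95
```

Each experiment produces a different interval, while `true_p` never changes. The fraction of intervals that contain it is what the "95%" refers to.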

In particular, we do not conclude that the coin is unfair, since 0.5 falls within our confidence interval.  
If instead we had seen 540 heads, then we would have:

In [44]:
p_hat = 540 / 1000
mu = p_hat 
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
print("Sigma is {}".format(sigma))
normal_two_sided_bounds(0.95, mu, sigma)

Sigma is 0.0157607106439


(0.5091095927295919, 0.5708904072704082)

Here, the "fair coin" value $p = 0.5$ doesn't lie in the confidence interval.  
In other words, the "fair coin" hypothesis doesn't pass a test that you would expect it to pass 95% of the time if it were true.