# Hypothesis and Inference
- We will frequently form and test hypotheses about our data.

## Statistical Hypothesis Testing
- Hypotheses are assertions that can be translated into statistics about data.
- In the classical setup, we have a null hypothesis H0 that represents some default position, and an alternative hypothesis H1 that we’d like to compare it with. We use statistics to decide whether we can reject H0 as false or not.

## Example: Flipping a Coin
- say, we want to test whether a coin is fair
- assumption: the coin has some probability p of landing heads
 - null hypothesis: coin is fair (p=0.5)
 - alternative hypothesis: coin is not fair (p!=0.5)
 
Testing:
- flip coin some number n times, count the number of heads X
- Each coin flip is a Bernoulli trial, which means that X is a Binomial(n,p) random variable, which we can approximate using the normal distribution (central limit theorem)
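
To see how close the approximation is, the sketch below compares an exact binomial probability with its normal approximation. The helper `binomial_pmf`, the choice of n = 100, and the half-unit continuity correction are assumptions made for illustration; they are not part of the original notes.

```python
import math

def binomial_pmf(k, n, p):
    """exact P(X = k) for X ~ Binomial(n, p) (illustrative helper)"""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

n, p = 100, 0.5
mu = p * n                            # mean of Binomial(n, p)
sigma = math.sqrt(p * (1 - p) * n)    # standard deviation of Binomial(n, p)

# exact probability of seeing between 40 and 60 heads (inclusive)
exact = sum(binomial_pmf(k, n, p) for k in range(40, 61))

# normal approximation, widened by 0.5 on each side (continuity correction)
approx = normal_cdf(60.5, mu, sigma) - normal_cdf(39.5, mu, sigma)

print(exact, approx)
```

Under these assumptions the two values should agree to about three decimal places, which is why the normal distribution is a convenient stand-in for the binomial here.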

In [11]:
import math

def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

Whenever a random variable follows a normal distribution, we can use normal_cdf to figure out the probability that its realized value lies within (or outside) a particular interval.

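The probability of landing in an interval is the difference between the cdf values at its endpoints, which equals the area under the density curve between them.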

In [12]:
def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

# the normal cdf _is_ the probability the variable is below a threshold
normal_probability_below = normal_cdf

# it's above the threshold if it's not below the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# it's between if it's less than hi, but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# it's outside if it's not between
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
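
As a quick usage check of these helpers, the sketch below evaluates them for a standard normal variable at the familiar cutoffs of plus or minus 1.96. The cutoff value is an assumption chosen for illustration; the function definitions are repeated only so the block runs on its own.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)

# probability that a standard normal lands within 1.96 of its mean
inside = normal_probability_between(-1.96, 1.96)

# probability that it lands in either tail
outside = normal_probability_outside(-1.96, 1.96)

# probability of the upper tail alone (half of `outside`, by symmetry)
upper_tail = 1 - normal_cdf(1.96)

print(inside, outside, upper_tail)
```

The three values should come out near 0.95, 0.05 and 0.025 respectively.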

We can also go the other way: given a probability, find either the cutoff of a one-sided (nontail) region or the symmetric interval around the mean that contains that much probability.

For example, if we want to find an interval centered at the mean and containing 60% probability, then
we find the cutoffs where the upper and lower tails each contain 20% of the probability (leaving 60%):

In [19]:
# NOTE: these functions rely on inverse_normal_cdf,
# which must be defined before any of them is called

def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2

    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound
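
As a worked instance of the 60% example, the arithmetic can be sketched as follows. The quantile value 0.8416 is a standard tabulated figure for the normal distribution and is an assumption brought in here, not something stated in the original notes.

$$
\begin{aligned}
\text{tail probability} &= \frac{1 - 0.60}{2} = 0.20 \\
P(Z \le z) &= 1 - 0.20 = 0.80 \;\Rightarrow\; z \approx 0.8416 \\
(\text{lower bound},\ \text{upper bound}) &\approx (\mu - 0.8416\,\sigma,\ \mu + 0.8416\,\sigma)
\end{aligned}
$$

For a standard normal variable (mu = 0, sigma = 1) the interval is therefore roughly (-0.84, 0.84).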

- Say we choose to flip the coin n=1000 times.
- If our hypothesis of fairness is true, X should be approximately normally distributed with mean 500 and standard deviation 15.8.

In [20]:
import math
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

- We need to decide on a significance level: how willing we are to make a type 1 error (a false positive), in which we reject H0 even though it is true.
- Historically, this willingness is often set at 5% or 1%.
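
To make the idea of a type 1 error rate concrete, the sketch below simulates many experiments with a coin that really is fair and counts how often a two-sided test would reject H0 anyway. The rejection rule (more than 1.96 standard deviations from the mean, the conventional 5% cutoff), the helper `count_heads`, the seed, and the number of experiments are all assumptions made for illustration.

```python
import math
import random

def count_heads(n, p=0.5):
    """flips a coin n times and returns the number of heads (illustrative helper)"""
    return sum(random.random() < p for _ in range(n))

random.seed(0)  # fixed seed so the run is repeatable

n = 1000
mu_0 = 0.5 * n                        # expected heads under H0
sigma_0 = math.sqrt(0.5 * 0.5 * n)    # standard deviation under H0

# assumed rejection region: more than 1.96 sigma away from the mean
lo = mu_0 - 1.96 * sigma_0
hi = mu_0 + 1.96 * sigma_0

num_experiments = 1000
false_positives = sum(
    1
    for _ in range(num_experiments)
    if not lo <= count_heads(n) <= hi   # H0 is true, yet the test rejects
)
false_positive_rate = false_positives / num_experiments

print(false_positive_rate)
```

The printed rate should land in the neighborhood of 5%. It will not be exactly 5%, both because of sampling noise and because the head count is discrete while the cutoff comes from a continuous approximation.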

Consider the test that rejects H0 if X falls outside the bounds given by:

In [21]:
normal_two_sided_bounds(0.95, mu_0, sigma_0)

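NameError: name 'inverse_normal_cdf' is not defined

The call fails because inverse_normal_cdf, which the bounds functions depend on, is never defined in the cells above. It has to be defined before the bounds can be computed.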
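
One way to supply the missing function is sketched below, assuming a binary search over normal_cdf. The original definition is not shown in these notes, so this version, including its `tolerance` parameter and search range, is a stand-in and not necessarily the author's implementation.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=1e-5):
    """approximate inverse of normal_cdf, found by binary search (assumed approach)"""
    # solve for the standard normal, then rescale to the requested mu and sigma
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)

    low_z, hi_z = -10.0, 10.0  # normal_cdf(-10) is ~0 and normal_cdf(10) is ~1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z      # midpoint is too low, search above it
        else:
            hi_z = mid_z       # midpoint is too high, search below it
    return (low_z + hi_z) / 2

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """symmetric bounds about the mean that contain the given probability"""
    tail_probability = (1 - probability) / 2
    lower_bound = inverse_normal_cdf(tail_probability, mu, sigma)
    upper_bound = inverse_normal_cdf(1 - tail_probability, mu, sigma)
    return lower_bound, upper_bound

mu_0 = 0.5 * 1000
sigma_0 = math.sqrt(0.5 * 0.5 * 1000)

lower, upper = normal_two_sided_bounds(0.95, mu_0, sigma_0)
print(lower, upper)  # roughly (469, 531)
```

With bounds of about 469 and 531, the test would reject H0 whenever the number of heads in 1000 flips falls outside that range.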

# Reference
- Chapter 7 of Data Science From Scratch