# Hypothesis and Inference

Often, as data scientists, we’ll want to test whether a certain hypothesis is likely to be true.
For our purposes, hypotheses are assertions like “this coin is fair” or “data scientists prefer
Python to R”.

### Flipping a Coin

Imagine we have a coin and we want to test whether it's fair.

In [1]:
import math

In [2]:
def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = n * p
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma
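
As a quick check on these formulas, the sketch below computes the mean and standard deviation of a Binomial(100, 0.5) directly from its probability mass function and compares them with the closed-form values. The helper `binomial_pmf` and the choice of n = 100 are illustrative assumptions, not part of the original notebook.

```python
import math

def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = n * p
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p) (hypothetical helper for this check)"""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.5
mu, sigma = normal_approximation_to_binomial(n, p)

# mean and variance computed term by term from the pmf
exact_mu = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
exact_var = sum((k - exact_mu) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))

print(mu, sigma)                        # 50.0 5.0
print(exact_mu, math.sqrt(exact_var))   # agrees up to floating-point error
```

The closed-form values only give the mean and spread. Whether the bell shape itself is a good fit depends on n being large enough and p not being too close to 0 or 1.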

In [3]:
def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x-mu) / math.sqrt(2) / sigma)) / 2

In [4]:
# the normal_cdf is the probability the variable is below a threshold
normal_probability_below = normal_cdf

# it's above the threshold if it's not below the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# it's between if it's less than hi, but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# it's outside if it's not between
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
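
To make these helpers concrete, here is a small usage sketch on the standard normal. The functions are repeated so the block runs on its own, and the cutoffs (1 and 1.96) are chosen only for illustration.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)

# about 68% of the mass lies within one sigma of the mean
within_one_sigma = normal_probability_between(-1, 1)

# about 5% lies more than 1.96 sigma from the mean (both tails combined)
outside_196 = normal_probability_outside(-1.96, 1.96)

# by symmetry, half the mass lies above the mean
above_mean = normal_probability_above(0)

print(round(within_one_sigma, 4), round(outside_196, 4), above_mean)
# 0.6827 0.05 0.5
```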

In [6]:
# Inverse Normal CDF
def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate inverse using binary search"""
    
    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    
    low_z, low_p = -10.0, 0   # normal_cdf(-10) is very close to 0
    hi_z, hi_p = 10.0, 1      # normal_cdf(10) is very close to 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2    # consider the midpoint
        mid_p = normal_cdf(mid_z)     # and the CDF's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break
    
    return mid_z
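
One way to sanity-check an inverse CDF is a round trip: feed its result back into `normal_cdf` and compare against an independent implementation. The sketch below uses a compact bisection variant that evaluates `normal_cdf` at each midpoint, and it assumes Python 3.8+ so that `statistics.NormalDist.inv_cdf` is available as the reference. The test probabilities are arbitrary.

```python
import math
from statistics import NormalDist

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """bisection on z, evaluating normal_cdf at each midpoint"""
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, hi_z = -10.0, 10.0
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z   # midpoint too low, search above it
        else:
            hi_z = mid_z    # midpoint high enough, search below it
    return mid_z

for p in (0.025, 0.5, 0.975):
    z = inverse_normal_cdf(p)
    reference = NormalDist().inv_cdf(p)   # standard-library inverse CDF
    # z and reference should agree to roughly the bisection tolerance,
    # and normal_cdf(z) should come back close to p
    print(p, round(z, 3), round(reference, 3), round(normal_cdf(z), 3))
```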

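We can also do the reverse: find either the nontail region or the (symmetric) interval
around the mean that accounts for a certain level of likelihood. For example, an interval
centered at the mean that contains 60% of the probability leaves 20% in each tail.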

In [7]:
def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds that 
    contain the specified probability"""
    tail_probability = (1 - probability) / 2
    
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    
    return lower_bound, upper_bound
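
As a usage sketch, the familiar 95% cutoffs of the standard normal fall out of these functions. To keep the block short and runnable on its own, `inverse_normal_cdf` is replaced here by a stand-in built on `statistics.NormalDist.inv_cdf` (Python 3.8+). That substitution is an assumption of this example, not the notebook's implementation.

```python
from statistics import NormalDist

def inverse_normal_cdf(p, mu=0, sigma=1):
    # stand-in for the notebook's bisection version (assumption for brevity)
    return NormalDist(mu, sigma).inv_cdf(p)

def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    tail_probability = (1 - probability) / 2
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound

lo, hi = normal_two_sided_bounds(0.95)
print(round(lo, 2), round(hi, 2))            # -1.96 1.96
print(round(normal_upper_bound(0.95), 3))    # 1.645
```

The two-sided cutoff (1.96) is larger than the one-sided cutoff (1.645) because the remaining 5% is split across both tails rather than placed in one.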

---

In particular, let’s say that we choose to flip the coin n=1000 times. If our
hypothesis of fairness is true, X, the number of heads, should be distributed approximately
normally with mean 500 and standard deviation 15.8.

In [8]:
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

In [9]:
print(f"mu_0 is: {mu_0}, and sigma_0 is: {sigma_0}")

mu_0 is: 500.0, and sigma_0 is: 15.811388300841896
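
A simulation offers an informal check on these numbers. The sketch below repeats the 1000-flip experiment many times with a fair coin and looks at the sample mean and standard deviation of the head counts. The seed, the number of trials, and the `count_heads` helper are all arbitrary choices made for this illustration.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable
n, num_trials = 1000, 2000      # num_trials is an arbitrary choice

def count_heads(n):
    """number of heads in n fair flips: count the 1 bits among n random bits"""
    return bin(rng.getrandbits(n)).count("1")

head_counts = [count_heads(n) for _ in range(num_trials)]

sample_mean = statistics.mean(head_counts)
sample_sd = statistics.stdev(head_counts)

# these should land near 500 and 15.8 respectively
print(sample_mean, sample_sd)
```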
