## Hypothesis Testing

### Overview

Say you have a novel theory about the way the world works.  You decide to test your idea by collecting evidence.  As you go about gathering data you find some pieces of evidence that support your hypothesis but also you turn up many that do not.  When all is said and done, however, the sum of the data appears to suggest that your idea was correct!  Given the scatter in your measurements, however, how confident can you be that your positive result was not a fluke?

_Hypothesis testing is used to evaluate the statistical significance of a positive result._

It works by defining a null hypothesis that is mutually exclusive with your idea about the world. Given the evidence, a statistical test is performed to calculate how likely it would be to obtain evidence at least as strong as yours if the null hypothesis were true. If that is unlikely, the null hypothesis is rejected and your idea is supported. If it is likely, your positive result isn't statistically significant.

The words "likely" and "unlikely" here are vague. In practice, we define a threshold value that discriminates between the two. A commonly used, but by no means universal, threshold is 95%. That is, we reject the null hypothesis only if data at least as extreme as ours would arise less than 5% of the time were the null hypothesis true. It is important to note, however, that with this threshold about 1 out of 20 tests in which the null hypothesis is actually true will still reject it (falsely) just by random chance. This relates to the concepts of precision and recall that are discussed at the end of this lesson.
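The 1-in-20 figure can be checked numerically. The sketch below is only an illustration under assumed settings: it draws many samples for which the null hypothesis is true (a standard normal with mean 0, 30 points per sample, both arbitrary choices that do not come from this lesson) and counts how often a left-tailed test at the 5% level rejects anyway. The function name `false_rejection_rate` is hypothetical.

```python
import numpy as np
from scipy.stats import t

def false_rejection_rate(n_trials=2000, n=30, alpha=0.05, seed=0):
    """Fraction of left-tailed t-tests that reject a null hypothesis that is true."""
    rng = np.random.default_rng(seed)
    # Every row is one experiment drawn with the null hypothesis true (mean 0)
    x = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
    se = x.std(axis=1, ddof=1) / np.sqrt(n)   # standard error of each sample mean
    tstat = x.mean(axis=1) / se               # reference value is 0
    pvalue = t.cdf(tstat, df=n - 1)           # left-tail probability
    return float(np.mean(pvalue < alpha))

rate = false_rejection_rate()
print(f"Fraction of true nulls rejected: {rate:.3f}")
```

The printed fraction should land near 0.05, with some scatter from one seed to the next.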

### Example

To make this discussion more concrete, let's formulate null and alternative hypotheses for a real-life example. Art and Jesse are business partners who run a small cash-only donut shop. Until recently, their best-selling item was Art's Old Fashioned Donut, but Jesse recently added a new option to the menu, the Boston Creme Eclair, which has become quite popular. Art, feeling somewhat snubbed, wishes to show that the increased popularity of the new donut is actually costing the shop money, because the profit on each sold Boston Creme is smaller than on each Old Fashioned.

Art knows with a great (in fact, perfect) degree of certainty that the profit on one Old Fashioned donut is \$2.93, but he does not have the same information about Jesse's eclair. Therefore, he painstakingly tracks the profit on 100 individual eclairs chosen at random. He obtains the following distribution:

![Distribution](distribution.png)

Art is pleased to see that the average profit is smaller than $2.93 for the sample of 100 eclairs, but before he brings this finding to Jesse, he wants to be quite certain (95%) that his result is not a fluke.  He needs to do a hypothesis test where:

\begin{align}
H_0: \mu = \$2.93 \\
H_A: \mu < \$2.93 
\end{align}

#### The $t$-test

To decide whether he can reject the null hypothesis, Art must calculate how likely a result like his would be if the null hypothesis were true. Assuming that his data are independent and identically distributed (IID) draws from a normal distribution, which is statistical jargon meaning that none of the measurements are correlated and that they all derive from the same distribution, Art can do this by conducting a $t$-test.

This is done by computing the $t$-statistic of the dataset and evaluating the cumulative distribution function ([CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function)) of the $t$-distribution with $n - 1$ degrees of freedom at that value, where $n$ is the number of measurements. The $t$-statistic for parameter $\beta$ is given by

\begin{equation}
t_{\hat{\beta}} = \frac{\hat{\beta} - \beta_0}{SE(\hat{\beta})}
\end{equation}

where $\hat \beta$ is the estimator of $\beta$, $\beta_0$ is the reference value of $\beta$, and $SE(\hat \beta)$ is the standard error of the estimator. In Art's case, $\hat \beta$ is the sample mean and $\beta_0$ is \$2.93. He should be looking at the *left tail* of the $t$-distribution because he is seeking to show that the mean margin is *less than* \$2.93. Therefore, if the CDF evaluated at the $t$-statistic is $<0.05$, Art can reject the null hypothesis: a result this extreme would occur less than 5% of the time if the null hypothesis were true. The value 0.05 is called the *significance level* of the test.
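The left-tail rule can also be phrased in terms of a critical value: the $t$ at which the CDF equals the significance level. The sketch below assumes `scipy.stats.t` and uses Art's sample size of 100. The helper `reject_left_tailed` and the two trial values of $t$ are hypothetical and chosen only for illustration.

```python
from scipy.stats import t

alpha = 0.05   # significance level from the text
n = 100        # number of eclairs Art sampled
df = n - 1     # degrees of freedom

# Critical value: the t at which the left-tail probability equals alpha
t_crit = t.ppf(alpha, df=df)

def reject_left_tailed(tstat, alpha=alpha, df=df):
    """Return True when the left-tail probability of tstat is below alpha."""
    return bool(t.cdf(tstat, df=df) < alpha)

print(f"Critical t for alpha={alpha}, df={df}: {t_crit:.3f}")
print("t = -2.0 rejects:", reject_left_tailed(-2.0))
print("t = -1.0 rejects:", reject_left_tailed(-1.0))
```

Any $t$-statistic to the left of the critical value leads to rejection, which is equivalent to the CDF falling below 0.05.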

Let's look at the code that generated Art's distribution of measurements.

In [2]:
from numpy.random import seed
from numpy import mean, std, sqrt
from scipy.stats import norm

#Model Art's measurements as a normally distributed random variate with prescribed mean and variance
true_mean = 2.82
true_variance = 0.56

seed(313)
x = norm.rvs(loc=true_mean, scale=true_variance**0.5, size=100)

Most statistical programming languages have packages that can conduct the t-test under the hood, but let's go through the calculation ourselves.  First, compute the $t$-statistic:

In [3]:
#The mean margin on Old Fashioned
reference_mean = 2.93

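#Standard error of the mean (numpy's std defaults to ddof=0, the population formula)
se     = std(x)/sqrt(len(x))
#Number of standard errors between the sample mean and the reference mean
tstat  = (mean(x)-reference_mean)/se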

print("The t-statistic is %12f" % tstat)

The t-statistic is    -1.258419


Now let's calculate the CDF for the t-distribution for the value of the $t$-statistic:

In [5]:
from scipy.stats import t
pvalue = t.cdf(tstat,df=len(x)-1)

print('The p-value is %12f' % pvalue)

The p-value is     0.105600


The **$p$-value** is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. Since it is greater than 0.05 in this example, Art cannot reject the null hypothesis.
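For comparison, here is a hedged sketch of the same kind of left-tailed test done with a library routine, `scipy.stats.ttest_1samp`. The sample is synthetic (an arbitrary seed, mean, and spread that are not Art's data), so the numbers are not meant to match the output above. `ttest_1samp` reports a two-sided $p$-value by default, so the sketch converts it to a left-tail value by hand. It also uses the sample standard deviation (`ddof=1`), which is why the manual calculation here passes `ddof=1` as well.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.8, scale=0.75, size=100)  # illustrative data, not Art's
mu0 = 2.93                                          # reference mean under the null

# Manual calculation with the sample standard deviation (ddof=1)
se = sample.std(ddof=1) / np.sqrt(sample.size)
tstat_manual = (sample.mean() - mu0) / se
p_manual = t.cdf(tstat_manual, df=sample.size - 1)

# Library calculation: two-sided by default, so convert to the left tail
result = ttest_1samp(sample, popmean=mu0)
tstat_lib, p_two_sided = result.statistic, result.pvalue
p_left = p_two_sided / 2 if tstat_lib < 0 else 1 - p_two_sided / 2

print(f"manual:  t = {tstat_manual:.4f}, p = {p_manual:.4f}")
print(f"library: t = {tstat_lib:.4f}, p = {p_left:.4f}")
```

The two lines should agree to within floating-point precision. A calculation that uses numpy's default `std` (`ddof=0`) will differ slightly from the library result, though with 100 samples the gap is small.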

This example demonstrates a **type II error** in hypothesis testing: a false null hypothesis was not rejected. Unlike Art, we contrived the example and know that the null hypothesis was false (the true mean is \$2.82, not \$2.93). This is in contrast to a **type I error**, where a true null hypothesis is incorrectly rejected. Let us consider what Art could have done to make his hypothesis test more (or less) accurate using the concepts of precision and recall.

### Precision and Recall

Precision = $\frac{\mathrm{True\,Positives}}{\mathrm{True\,Positives} + \mathrm{False\,Positives}}$

Recall = $\frac{\mathrm{True\,Positives}}{\mathrm{True\,Positives} + \mathrm{False\,Negatives}}$

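A True Positive occurs when the null hypothesis is correctly rejected; a True Negative occurs when the null hypothesis is correctly *not* rejected. Likewise, a False Positive is an incorrect rejection of a true null hypothesis (a type I error), and a False Negative is a failure to reject a false one (a type II error).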
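As a small worked example of the two formulas, the sketch below plugs in the counts that the simulation in this lesson reports for a significance level of 0.05 (2204 true positives, 250 false positives, 2795 false negatives). The function name `precision_recall` is hypothetical, and the sketch assumes the denominators are nonzero.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts (denominators assumed nonzero)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Counts reported by this lesson's simulation at a significance level of 0.05
precision, recall = precision_recall(2204, 250, 2795)
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}")
```

Note that true negatives appear in neither formula: precision and recall only describe how the positive results behave.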

Using code, let's repeat Art's experiment 10,000 times to see how often he gets a negative result when drawing samples from the same underlying distribution. To make things more interesting, each repetition has a 50 percent chance of drawing data from the Eclair distribution (true result is positive); otherwise the data come from a distribution centered on the Old Fashioned mean (true result is negative).

In [7]:
def calc_precision_recall(nsamples = 100, signif_level = 0.05, p_positive=0.5):
    from numpy.random import rand #uniform random variate between 0 and 1
    
    n_iters = 10000
    
    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0

    for iteration in range(n_iters):
        seed(313+iteration)
        
        if rand() < p_positive:
            x = norm.rvs(loc=true_mean, scale=true_variance**0.5, size=nsamples)
            truth = 'positive'
        else:
            x = norm.rvs(loc=reference_mean, scale=true_variance**0.5, size=nsamples)
            truth = 'negative'
            
        se     = std(x)/sqrt(len(x))
        tstat  = (mean(x)-reference_mean)/se
        pvalue = t.cdf(tstat,df=len(x)-1)

        if truth == 'positive' and pvalue < signif_level:
            true_positives += 1
        if truth == 'positive' and pvalue >= signif_level:
            false_negatives += 1
        if truth == 'negative' and pvalue < signif_level:
            false_positives += 1
        if truth == 'negative' and pvalue >= signif_level:
            true_negatives += 1
            

    precision = true_positives / (true_positives + false_positives) 
    recall    = true_positives / (true_positives + false_negatives) 

    print('Signif. level %4.2f, Prob +ve result %4.2f:' % (signif_level, p_positive))
    print('     True  positives = %8d' % true_positives)
    print('     True  negatives = %8d' % true_negatives)
    print('     False positives = %8d' % false_positives)
    print('     False negatives = %8d' % false_negatives)
    print('     Precision       = %8d' % (precision*100.) + '%')
    print('     Recall          = %8d' % (recall*100.) + '%')
    print('********************************')
    
calc_precision_recall()

Signif. level 0.05, Prob +ve result 0.50:
     True  positives =     2204
     True  negatives =     4751
     False positives =      250
     False negatives =     2795
     Precision       =       89%
     Recall          =       44%
********************************


It turns out that Art's test was fairly precise: only about 11% of his positive results were false. Recall, however, was poor. He was more likely to falsely obtain a negative result than to correctly obtain a positive one. In light of this, it is instructive to explore how precision and recall are affected by the significance level, $\alpha$, of the $t$-test. Let's repeat the above analysis varying $\alpha$.

In [8]:
calc_precision_recall(signif_level=0.01)
calc_precision_recall(signif_level=0.05)
calc_precision_recall(signif_level=0.10)

Signif. level 0.01, Prob +ve result 0.50:
     True  positives =      965
     True  negatives =     4940
     False positives =       61
     False negatives =     4034
     Precision       =       94%
     Recall          =       19%
********************************
Signif. level 0.05, Prob +ve result 0.50:
     True  positives =     2204
     True  negatives =     4751
     False positives =      250
     False negatives =     2795
     Precision       =       89%
     Recall          =       44%
********************************
Signif. level 0.10, Prob +ve result 0.50:
     True  positives =     2947
     True  negatives =     4522
     False positives =      479
     False negatives =     2052
     Precision       =       86%
     Recall          =       58%
********************************


Notice how the number of false positives increases as the significance level, $\alpha$, of the statistical test is increased. Generally speaking, an optimal test seeks to consider both precision and recall.  Tuning $\alpha$ to a very small value would give high precision, but virtually no positive results would be returned.  Conversely, assigning $\alpha$ a high value would improve recall, but at the expense of an increased proportion of false positives.

Let's rerun the calculation above with $\alpha=0.05$, varying the probability that the true result is positive.

In [5]:
calc_precision_recall(signif_level=0.05, p_positive=0.5)
calc_precision_recall(signif_level=0.05, p_positive=0.1)
calc_precision_recall(signif_level=0.05, p_positive=0.01)

Signif. level 0.05, Prob +ve result 0.50:
     True  positives =     2204
     True  negatives =     4751
     False positives =      250
     False negatives =     2795
     Precision       =       89%
     Recall          =       44%
********************************
Signif. level 0.05, Prob +ve result 0.10:
     True  positives =      444
     True  negatives =     8519
     False positives =      460
     False negatives =      577
     Precision       =       49%
     Recall          =       43%
********************************
Signif. level 0.05, Prob +ve result 0.01:
     True  positives =       48
     True  negatives =     9388
     False positives =      506
     False negatives =       58
     Precision       =        8%
     Recall          =       45%
********************************


One major limitation of hypothesis testing is that it does not account for prior information. Applying it to hypotheses that are highly unlikely will result in a false positive far more often than a true one: when only 1% of the tested hypotheses were true, precision fell to 8%. Bayesian inference, which will be covered in the next lesson, is an alternative approach to hypothesis testing that accounts for prior information.
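The drop in precision follows from simple arithmetic. If a fraction `p_positive` of tested hypotheses are truly positive, the test detects a true effect with probability `power`, and it rejects a true null with probability `alpha`, then the expected precision is `p_positive * power / (p_positive * power + (1 - p_positive) * alpha)`. The sketch below evaluates that expression. The function name `expected_precision` is hypothetical, and the value 0.44 for `power` is an assumption taken from the recall observed in the simulation output above.

```python
def expected_precision(p_positive, power, alpha):
    """Expected fraction of rejections that are true positives."""
    true_pos = p_positive * power           # truly positive and detected
    false_pos = (1.0 - p_positive) * alpha  # truly negative but rejected anyway
    return true_pos / (true_pos + false_pos)

power = 0.44   # assumed: roughly the recall seen in the simulation
alpha = 0.05   # significance level used in the simulation

for p in (0.5, 0.1, 0.01):
    value = expected_precision(p, power, alpha)
    print(f"p_positive = {p:<4}  expected precision = {value:.3f}")
```

These expected values track the simulated precision closely, which shows that the collapse is driven by the prior probability rather than by any change in the test itself.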

[Related Reading at Five Thirty Eight](https://fivethirtyeight.com/features/how-shoddy-statistics-found-a-home-in-sports-research/)