<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Hypothesis Testing

_Authors: Tim Book (DC), Matt Brems (DC), et al._

---

### Learning Objectives
- Define the null and alternative hypotheses.
- Perform a two-sample t-test.
- Define the t-statistic and p-value.
- List the steps of hypothesis testing.

In [1]:
# Bring in our libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm

## Introduction to Hypothesis Testing

In the real world, we like to make **data-driven decisions$^{\text{TM}}$**!
- In order to make these decisions, though, we need to collect some data.
- We take this data, put it into a "box," which gives us a statistically-powered yes-or-no decision.
- This "box" is hypothesis testing.
- **Hypothesis testing is a mathematically rigorous way of making yes-or-no decisions!**

Hypothesis testing is a little more complicated than that, but not much!

## First: How do we interpret statsmodels?

In [3]:
houses = pd.read_csv('./data/houses-norm.csv')

houses.head()

| | sqft | bedrooms | age | price |
| --- | --- | --- | --- | --- |
| 0 | 2.104 | 3.0 | 7.0 | 3.999 |
| 1 | 1.6 | 3.0 | 2.8 | 3.299 |
| 2 | 2.4 | 3.0 | 4.4 | 3.69 |
| 3 | 1.416 | 2.0 | 4.9 | 2.32 |
| 4 | 3.0 | 4.0 | 7.5 | 5.399 |


In [5]:
X = houses[['sqft', 'bedrooms','age']]
y = houses['price']

# Add a column of ones as the first column of X so the model fits an intercept (print X to see it).
X = sm.add_constant(X, prepend=True)
results = sm.OLS(y, X).fit()

In [8]:
print(X.head())

   const   sqft  bedrooms  age
0    1.0  2.104       3.0  7.0
1    1.0  1.600       3.0  2.8
2    1.0  2.400       3.0  4.4
3    1.0  1.416       2.0  4.9
4    1.0  3.000       4.0  7.5


In [6]:
results.summary()

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | price | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.715 |
| Method: | Least Squares | F-statistic: | 39.38 |
| Date: | Thu, 21 Oct 2021 | Prob (F-statistic): | 2.12e-12 |
| Time: | 18:47:05 | Log-Likelihood: | -45.641 |
| No. Observations: | 47 | AIC: | 99.28 |
| Df Residuals: | 43 | BIC: | 106.7 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 0.9245 | 0.449 | 2.060 | 0.045 | 0.019 | 1.830 |
| sqft | 1.3933 | 0.150 | 9.305 | 0.000 | 1.091 | 1.695 |
| bedrooms | -0.0862 | 0.156 | -0.551 | 0.584 | -0.402 | 0.229 |
| age | -0.0081 | 0.043 | -0.188 | 0.852 | -0.095 | 0.079 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 3.841 | Durbin-Watson: | 1.819 |
| Prob(Omnibus): | 0.147 | Jarque-Bera (JB): | 2.771 |
| Skew: | 0.552 | Prob(JB): | 0.25 |
| Kurtosis: | 3.444 | Cond. No. | 28.9 |


### Interpreting the statsmodels results

![](./images/statsmodels1.png)

---

![](./images/statsmodels2.png)

---

![](./images/statsmodels3.png)

## You Try: Hypothesis Testing our OLS coefficients
By far the most common place we'll see hypothesis testing is in the context of linear regression coefficients. Let's read in some data and use `statsmodels` to conduct a quick linear regression.

In [9]:
import statsmodels.api as sm

In [10]:
# This is a NASA dataset of airfoils at various wind tunnel speeds and angles of attack.
# Their goal was to minimize noise (measured in db)
df = pd.read_csv(
    "data/airfoil_self_noise.dat",
    sep="\t",
    names=["freq", "angle", "chord_len", "velocity", "thickness", "db"]
)

df["junk"] = np.random.randn(df.shape[0])

df.head()

| | freq | angle | chord_len | velocity | thickness | db | junk |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 800 | 0.0 | 0.3048 | 71.3 | 0.002663 | 126.201 | 0.693487 |
| 1 | 1000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.201 | 1.713463 |
| 2 | 1250 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.951 | 1.879276 |
| 3 | 1600 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.591 | -0.129269 |
| 4 | 2000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.461 | -0.766628 |


In [11]:
X = df.drop('db', axis=1)
X = sm.add_constant(X)
y = df['db']

In [12]:
model = sm.OLS(y, X).fit()

In [13]:
model.summary()

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | db | R-squared: | 0.516 |
| Model: | OLS | Adj. R-squared: | 0.514 |
| Method: | Least Squares | F-statistic: | 265.6 |
| Date: | Thu, 21 Oct 2021 | Prob (F-statistic): | 2e-231 |
| Time: | 19:16:02 | Log-Likelihood: | -4490.0 |
| No. Observations: | 1503 | AIC: | 8994.0 |
| Df Residuals: | 1496 | BIC: | 9031.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 132.8290 | 0.545 | 243.725 | 0.000 | 131.760 | 133.898 |
| freq | -0.0013 | 4.21e-05 | -30.446 | 0.000 | -0.001 | -0.001 |
| angle | -0.4222 | 0.039 | -10.849 | 0.000 | -0.499 | -0.346 |
| chord_len | -35.6632 | 1.632 | -21.850 | 0.000 | -38.865 | -32.462 |
| velocity | 0.0999 | 0.008 | 12.279 | 0.000 | 0.084 | 0.116 |
| thickness | -147.2075 | 15.021 | -9.800 | 0.000 | -176.672 | -117.743 |
| junk | -0.0474 | 0.123 | -0.386 | 0.699 | -0.288 | 0.193 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 12.879 | Durbin-Watson: | 0.447 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 19.081 |
| Skew: | -0.022 | Prob(JB): | 7.19e-05 |
| Kurtosis: | 3.55 | Cond. No. | 518000.0 |


## Hypotheses
Notice the columns marked `t` and `P>|t|`. These are the $t$-statistics and $p$-values for the hypothesis test:

$$
H_0: \beta_i = 0 \\
H_A: \beta_i \ne 0
$$

(THREAD) In your own words, what would it mean if $\beta_i = 0$ for one of these coefficients?
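
To see roughly where the `t` and `P>|t|` columns come from, the sketch below recomputes them for the `junk` row using the rounded numbers printed in the summary. The variable names are made up for illustration, and because the inputs are rounded, the results only approximately match the table.

```python
from scipy import stats

# Rounded values copied from the `junk` row of the airfoil summary table.
coef, std_err = -0.0474, 0.123
df_resid = 1496  # "Df Residuals" in the summary

# Under H0: beta_i = 0, the t-statistic is the coefficient divided by its standard error.
t_stat_junk = coef / std_err

# Two-sided p-value: probability of a |t| at least this large under a t-distribution.
p_value_junk = 2 * stats.t.sf(abs(t_stat_junk), df=df_resid)

print(round(t_stat_junk, 3), round(p_value_junk, 3))
```

A large p-value like this one is what we would expect for a column of pure random noise.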

### Hypothesis Testing with Puppies

[This example is pulled liberally from Cassie Kozyrkov's Medium post.](https://hackernoon.com/explaining-p-values-with-puppies-af63d68005d0)
 
Let's say that we come home at the end of the day to find some unspooled toilet paper.

<img src="./images/pug_toilet_paper.jpg" alt="doggo" width="600"/>

We need to make a **data-driven** decision: Do we yell at our dog? 

Our possibilities are:
- Yes, we yell at our dog.
- No, we don't yell at our dog.

Let's assume that our dog is innocent. Being good data scientists, we want to gather data, then use this data to make a decision.
- **Gust of wind?** We check to see if the bathroom window is open or closed.
- **Floor vent?** We check the thermostat to see if we left the heating/air conditioning on.
- **Another human?** We text our sibling to see if they brought our niece over.

Once we're done "gathering our data," we determine the probability of observing this scene naturally, if our dog didn't do it.
- If the probability is low enough (say, below 0.05), we blame our dog.
- Otherwise, we can't blame our dog!

We just walked through a hypothesis test! We had two potential decisions, we gathered data, and used this data to make a decision.

> **Note that we only deem our dog guilty or not guilty. The dog is never pronounced innocent! Just like the U.S. court system, hypothesis testing works this way too.**

### Hypothesis Testing: A Drug Efficacy Example

---

Say we are testing the efficacy of a new drug:

- We randomly select 50 people to be in the control group and 50 people to receive the treatment.
    - In the context of experiments, we often talk about the "control" group and the "experimental" or "treatment" group. In our example, the control group is the one given the old drug (the one currently on the market) and the treatment group is the one given the new drug. 
    - In other experiments, the control group is the one that receives no treatment. There can be a placebo group as well, which is one that receives a false treatment. **Is this ethical in this scenario?**
- We are interested in the average difference in blood pressure levels between the treatment and control groups.
- We know our sample is selected from a broader, unknown population pool.
- We can imagine that, in a hypothetical parallel world, we could have ended up with a different random sample of subjects from the population pool.

<a id='null-hypothesis'></a>

### The "Null" Hypothesis

---

The **null hypothesis** is typically the exact opposite of what you want to test for, i.e. the "status quo". We typically denote the null hypothesis with $H_0$.
- In our dog example, we assume that our dog is innocent.
- In our drug efficacy experiment example, our null hypothesis is that there is no difference in average blood pressure between subjects in the control group and subjects taking the treatment drug.

> $H_0:$ The average difference in blood pressure between treatment and control groups is zero.

Or, as it's properly written:

> $H_0: \mu_\text{trt} = \mu_\text{ctrl}$

Or, as it's often written:

> $H_0: \mu_\text{trt} - \mu_\text{ctrl} = 0$

<a id='alternative-hypothesis'></a>

### The "Alternative Hypothesis"

---

The **alternative hypothesis** is the outcome of the experiment that we hope to show. It's the opposite of our null hypothesis!
- In our dog example, the alternative hypothesis is that our dog is guilty of unspooling the toilet paper.
- In our drug efficacy experiment example, the alternative hypothesis is that there is in fact an average difference in blood pressure between the treatment and control groups. 

> $H_A:$ The parameter of interest — our average difference between treatment and control — is not zero.

Or, in math:

> $H_A: \mu_\text{trt} \ne \mu_\text{ctrl}$

Again, we usually write

> $H_A: \mu_\text{trt} - \mu_\text{ctrl} \ne 0$

**NOTE:** The null and alternative hypotheses are concerned with the true values, or, in other words, the **parameter of the overall population**. Through hypothesis testing, we will make an **inference** (a decision) about this population parameter.

### Why is it written like this? $\mu$ vs $\bar{x}$
(THREAD) Can you remind me what a *population parameter* is?

(THREAD) Can you remind me what a *sample statistic* is?

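Population parameters are often denoted with Greek letters (such as $\mu$), while sample statistics use Latin letters (such as $\bar{x}$). We write hypotheses about population parameters because they are unknown. A sample statistic is computed directly from our data, so there is nothing to hypothesize about, and its value would change if we drew a different sample.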
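
As a quick illustration of the difference, the sketch below draws several samples from a hypothetical population whose mean we choose ourselves. The values of `mu` and `sigma` are made up for this example; the point is only that the parameter stays fixed while the sample mean changes from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: in real life mu is unknown; here we pick it for the demo.
mu, sigma = 120, 15

# Draw five independent samples of 50 subjects and compute each sample mean (x-bar).
sample_means = [rng.normal(mu, sigma, size=50).mean() for _ in range(5)]

print(f"Population parameter mu = {mu}")
print("Sample means:", np.round(sample_means, 2))
```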

### Introduction to the $t$-Test

---

In our dog example, we gathered data in a way that's different from how we'll usually gather data in order to make a decision.

Say that, in our drug experiment, we measure the following results:

- The 50 subjects in the control group have an average systolic blood pressure of 121.38.
- The 50 subjects in the experimental/treatment group have an average systolic blood pressure of 111.56.

The difference between experimental and control samples is -9.82 points. 

**But**, with only 50 subjects in each sample, how confident can we be that this measured difference is real? Do we have enough evidence to say that the population average blood pressure is different between these two groups?

We can perform what is known as a **t-test** to evaluate this. (A $t$-test is one of many, many types of hypothesis tests.)

Four steps to hypothesis testing:
1. Construct a null hypothesis that you want to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your test statistic.
4. Find your $p$-value and make a conclusion.

In [14]:
bp = pd.read_csv("data/blood-pressure.csv")
print(bp.shape)
bp.head()

(100, 2)


Unnamed: 0,bp,group
0,166,control
1,165,control
2,120,control
3,94,control
4,104,control


In [15]:
bp['group'].unique()

array(['control', 'treatment'], dtype=object)

In [16]:
# Separate the blood pressure data into two separate vectors
# (this is how we'll need it for a SciPy t-test)
ctrl = bp.loc[bp['group'] == 'control', 'bp']
trt = bp.loc[bp['group'] == 'treatment', 'bp']

In [18]:
# Print the average of the control and experimental groups.
print(f'Mean - Control: {np.mean(ctrl)}, Mean - Treatment: {np.mean(trt)}')

Mean - Control: 121.38, Mean - Treatment: 111.56


<a id='likelihood-data'></a>

### Step 1: Construct the null and alternative hypotheses

---

For our experiment, we will set up a null hypothesis and an alternative hypothesis:

$H_0:$ The true mean difference in systolic blood pressure between those who receive the treatment and those who do not is 0.

$H_A:$ The true mean difference in systolic blood pressure between those who receive the treatment and those who do not is NOT 0.

### Formally:

$$
\begin{align}
H_0: & \mu_\text{trt} = \mu_\text{ctrl} \\
H_A: & \mu_\text{trt} \ne \mu_\text{ctrl} \\
\end{align}
$$

Recall, our measured difference is $\bar{x}_\text{trt} - \bar{x}_\text{ctrl} = -9.82$

Written out using probability notation, we want to know:

### $$P(\text{data}\;|\;H_0 \text{ true})$$

**What is the probability that we observed this data, assuming that our null hypothesis is true?**


### Step 2: Specify a level of significance

If we assume that our null hypothesis is true, and the probability of observing the data we observed is "small," then our data does not support our null hypothesis. 

**But how "small" is small enough?**

This is set by our level of significance, which we call $\alpha$.

Typically (and arbitrarily) the value $\alpha=0.05$ is used.
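
One way to build intuition for $\alpha$: if the null hypothesis really is true, a test at $\alpha = 0.05$ should wrongly reject it in roughly 5% of experiments. The simulation below is a sketch under assumed values (two groups of 50 drawn from the same made-up normal population), not an analysis of our actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims = 2000

rejections = 0
for _ in range(n_sims):
    # Both groups come from the SAME population, so H0 is true by construction.
    a = rng.normal(120, 15, size=50)
    b = rng.normal(120, 15, size=50)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1

# Fraction of simulated experiments where we rejected a true null hypothesis.
false_positive_rate = rejections / n_sims
print(false_positive_rate)
```

Choosing a smaller $\alpha$ lowers this false-positive rate, at the cost of making real effects harder to detect.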

### Step 3: Calculating your Test Statistic

---

Remember that hypothesis testing is a "box" where the inputs are our data and the outputs allow us to make our decision? Well, in this "box," we are calculating $P(\text{data}\;|\;H_0 \text{ true})$.

When comparing two means, the **t-statistic** (based on the [Student's $t$-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)) is a classic way to quantify the difference between groups. In essence, our $t$-statistic is a standardized version of the difference between groups.

Luckily, our computer will do this for us!

---

<details><summary>Want the mathematical details of the calculation of the t-statistic?</summary>
When comparing the difference between groups, we can calculate the two-sample t-statistic like so:

### $$t = \frac{\bar{x}_E - \bar{x}_C}{\sqrt {s^2 \Big(\frac{1}{n_E} + \frac{1}{n_C}\Big)}}$$

In our example, $\bar{x}_E$ is the mean of our experimental group's sample measurements and $\bar{x}_C$ is the mean of our control group's sample measurements.

$n_E$ and $n_C$ are the number of observations in each group. 

The $s^2$ denotes our *pooled sample variance*. In this version of the t-test, we are assuming equal variances in our experimental and control groups in the overall population. There is another way to calculate the t-test where equal variance is not assumed (Welch's t-test), but, in our case, it is a reasonable assumption.

The pooled sample variance is calculated like so:

### $$ s^2 = \frac{\sum_{i=1}^{n_E} (x_i - \bar{x}_E)^2 + \sum_{j=1}^{n_C} (x_j - \bar{x}_C)^2}{ n_E + n_C -2} $$

This combines the variance of the two groups' measurements into a single pooled metric. 

</details>
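
For the curious, the pooled two-sample $t$-statistic can be computed by hand and checked against SciPy. This is a sketch on simulated data (the arrays `x_e` and `x_c` are made up, not read from the blood-pressure file), assuming the equal-variance form of the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated experimental and control measurements (illustrative values only).
x_e = rng.normal(112, 25, size=50)
x_c = rng.normal(121, 25, size=50)
n_e, n_c = len(x_e), len(x_c)

# Pooled variance: combined squared deviations divided by n_E + n_C - 2.
pooled_var = (
    np.sum((x_e - x_e.mean()) ** 2) + np.sum((x_c - x_c.mean()) ** 2)
) / (n_e + n_c - 2)

# t = difference in means divided by its standard error.
t_manual = (x_e.mean() - x_c.mean()) / np.sqrt(pooled_var * (1 / n_e + 1 / n_c))

# SciPy's equal-variance two-sample t-test should give the same statistic.
t_scipy = stats.ttest_ind(x_e, x_c, equal_var=True).statistic

print(t_manual, t_scipy)
```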

## TL;DR What are we doing?

**GOAL:** To tell whether or not our new treatment is effective. We define "effective" as whether or not those who get the treatment see lower systolic blood pressure, on average.

To do this, we follow the following steps to carry out a **hypothesis test**:

1. Set up null and alternative hypotheses. Remember, ours was this:

$$ H_0: \mu_\text{trt} - \mu_\text{ctrl} = 0 $$
$$ H_A: \mu_\text{trt} - \mu_\text{ctrl} \ne 0 $$

2. Decide on a significance level. $\alpha = 0.05$ is a typical choice.
3. Decide on a hypothesis test. There are a million of them. In this case, we're testing the difference between two means, which is a great time to use a **two-sample $t$-test**.

> The two-sample (independent) $t$-test tests whether or not two population means differ.

4. After carrying out this hypothesis test, we'll see if our data provide enough evidence to reject the null hypothesis.

## Let's do it!
Uh... how? What function do I use? Help me, Google!

In [19]:
# Import scipy.stats
from scipy import stats

In [22]:
# Conduct our t-test.
# With equal group sizes, equal_var=True and equal_var=False give the same t-statistic;
# only the degrees of freedom (and so the p-value) can differ slightly.
stats.ttest_ind(trt, ctrl)

Ttest_indResult(statistic=-1.8915462966190273, pvalue=0.061504240672530394)

In [23]:
# Assigning t_statistic and p-value from the scipy stat test to 2 variables
t_stat, p_value = stats.ttest_ind(trt, ctrl)
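
A side note on the `equal_var` argument of `stats.ttest_ind`: when both groups have the same size, the pooled test and Welch's test produce the same $t$-statistic, but Welch's test uses fewer degrees of freedom, so its p-value is never smaller. The sketch below checks this on simulated groups with deliberately different spreads (made-up values, not our blood-pressure data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two equally sized groups with different standard deviations (illustrative only).
g1 = rng.normal(112, 30, size=50)
g2 = rng.normal(121, 15, size=50)

pooled = stats.ttest_ind(g1, g2, equal_var=True)   # Student's t-test
welch = stats.ttest_ind(g1, g2, equal_var=False)   # Welch's t-test

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

With unequal group sizes the two statistics generally differ, which is when the choice matters most.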

<a id='p-value'></a>

### Step 4: The P-Value

---

Remember that our goal of doing all of this work is to make a decision? Well, using our $t$-statistic, we can generate a **p-value**.

> **The p-value is the probability that, given that the null hypothesis $H_0$ is true, we could have ended up with a statistic at least as extreme as the one we got.**

We have measured a difference in blood pressure of -9.82 between the experimental and control groups. We then calculated a $t$-statistic associated with this difference of -1.89. In our specific example:

> The p-value is the probability that, assuming there is truly no difference in blood pressure between treatment and control conditions (i.e., no effect of the drug), we get results that yield a t-statistic at least as extreme as -1.89 in either direction (below -1.89 or above 1.89).
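
To connect the $t$-statistic to the p-value, the sketch below recomputes the two-sided p-value from the statistic SciPy reported. It assumes the equal-variance test with 50 subjects per group, so the degrees of freedom are $50 + 50 - 2 = 98$; the variable names are made up for illustration.

```python
from scipy import stats

t_observed = -1.8915462966190273  # statistic reported by stats.ttest_ind above
dof = 50 + 50 - 2                 # n_trt + n_ctrl - 2 for the equal-variance test

# Two-sided p-value: area in both tails beyond |t| under the t-distribution.
p_two_sided = 2 * stats.t.sf(abs(t_observed), df=dof)

print(p_two_sided)
```

This should agree closely with the `pvalue` that `stats.ttest_ind` returned.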

### So how do we make the decision? *(This will show up in interviews!)*

Remember that $\alpha$ is our level of significance.

- If $p\text{-value} < \alpha$, then there is enough evidence to reject the null hypothesis: we reject $H_0$ in favor of $H_A$.
    - i.e., a statistically significant difference between the two groups!
    - This is like saying there is enough evidence to say our dog isn't innocent... so we say our dog is guilty.
- If $p\text{-value} \ge \alpha$, then there is insufficient evidence to reject the null hypothesis. We fail to reject $H_0$, but we do not conclude that $H_0$ is true.
    - i.e., we did not detect a statistically significant difference between the two groups.
    - This is like saying there is not enough evidence to say our dog isn't innocent. We can't totally determine that our dog is innocent, but we haven't determined that our dog is guilty, either.
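
The decision rule above can be written as a tiny helper. This is only a sketch: the function name `decide` and the p-values passed to it are hypothetical, chosen to show both outcomes and the boundary case.

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject H0"
    # We never "accept" H0; we only fail to reject it.
    return "fail to reject H0"


# Hypothetical p-values, including one exactly at alpha.
for p in (0.003, 0.05, 0.40):
    print(p, "->", decide(p))
```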

## So.... what is our decision?

> **ANSWER HERE:**

## Just for good measure... what's the opposite opinion?

> **ANSWER HERE:**

## The Law of Parsimony (aka: Occam's Razor)
This is usually paraphrased as:
> The simplest explanation for a phenomenon is usually the correct one.

We don't want to overspecify our model. In our context, that means we want to avoid any potential overfitting. While we **never accept the null hypothesis**, the truth is, _some decision must be made_. Oftentimes, we drop variables from our model that do not have significant $p$-values.

## Other Hypothesis Tests
The goal of this lesson was to teach you, in general, how hypothesis testing works. We showed you what is probably the most common variety of hypothesis test: the $t$-test. However, there are kajillions of other ones out there. It's not worth our time to go over so many more of them, since they all follow the same steps and are interpreted the same way, just in different situations. Instead, here is a list of many of the "big" ones and when to use them:

| Situation | Common hypothesis test | Example | Notes |
| --- | --- | --- | --- |
| Testing whether or not one mean is equal to a value | One-sample $t$-test | Do cars on a given road, on average, drive about 65mph? | |
| Testing whether or not two means are equal to each other | Two-sample $t$-test | Is the mean systolic blood pressure of people who receive Medicine A or Medicine B the same? | |
| Testing whether or not paired observations have the same value | Paired $t$-test | Among heterosexual married couples, is the husband, on average, taller than the wife? | This is functionally the same as a one-sample $t$-test of the differences |
| Testing whether or not three or more means are the same | One-way ANOVA test | Are base salaries upon graduation different for graduates of Penn State, Ohio State, and Michigan? | The ANOVA test has many variants |
| Testing whether or not there is a relationship between two categorical variables | $\chi^2$ test | Is there a relationship between home state and political affiliation? | |
| Testing whether or not a given sample is normally distributed | Kolmogorov-Smirnov Test | Testing whether or not model residuals are normally distributed. Useful for testing linear regression assumptions! | |
| Testing whether or not one proportion is equal to a number | One-sample $z$-test | Testing whether or not a coin is fair (i.e., testing $P(\text{Heads}) = 0.5$) | |
| Testing whether or not two proportions are equal | Two-sample $z$-test | Who is going to win an election? | Testing two or more proportions can be done better with a $\chi^2$ test |
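
Several of these tests are available in `scipy.stats`. The sketch below runs a few of them on simulated data (all values are made up, loosely following the examples in the table) and checks the note that a paired $t$-test matches a one-sample $t$-test on the differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Paired t-test vs. one-sample t-test of the differences (simulated heights in cm).
husband = rng.normal(178, 7, size=40)
wife = husband - rng.normal(13, 6, size=40)
paired = stats.ttest_rel(husband, wife)
one_sample = stats.ttest_1samp(husband - wife, popmean=0)

# One-way ANOVA across three simulated groups.
g1, g2, g3 = (rng.normal(m, 5, size=30) for m in (60, 62, 65))
anova = stats.f_oneway(g1, g2, g3)

# Chi-square test of independence on a made-up 2x2 contingency table.
table = np.array([[30, 20], [25, 35]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

print("paired:    ", paired.statistic, paired.pvalue)
print("one-sample:", one_sample.statistic, one_sample.pvalue)
print("ANOVA p:   ", anova.pvalue)
print("chi2 p:    ", chi2_p, "dof:", dof)
```

Each call returns a test statistic and a p-value, which are interpreted with the same steps as the $t$-test above.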






## Recap

Four steps to hypothesis testing:
1. Construct a null hypothesis that you want to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your test statistic.
4. Find your $p$-value and make a conclusion.