<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Hypothesis Testing

_Authors: Tim Book (DC), Matt Brems (DC), et al._

---

### Learning Objectives
- Define the null and alternative hypotheses.
- Perform a two-sample t-test.
- Define the t-statistic and p-value.
- List the steps of hypothesis testing.

In [1]:
# Bring in our libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm # a module enabling statistical testing, exploration

## Introduction to Hypothesis Testing

In the real world, we like to make **data-driven decisions$^{\text{TM}}$**!
- In order to make these decisions, though, we need to collect some data.
- We take this data, put it into a "box," which gives us a statistics-powered yes-or-no decision.
- This "box" is *hypothesis testing*.
- **Hypothesis testing is a mathematically rigorous way of making yes-or-no decisions!**

Hypothesis testing is a little more complicated than that, but not much!

# First: How do we interpret statsmodels?

In [2]:
houses = pd.read_csv('../data/houses-norm.csv')
print(houses.shape)
houses.head()

(47, 4)


Unnamed: 0,sqft,bedrooms,age,price
0,2.104,3.0,7.0,3.999
1,1.6,3.0,2.8,3.299
2,2.4,3.0,4.4,3.69
3,1.416,2.0,4.9,2.32
4,3.0,4.0,7.5,5.399


In [3]:
# if we wanted to only subset specific columns from our dataframe into a SUBSET dataframe
# pass subset columns list within list: dataframe[['subset_col1', 'subset_col2'..]]
houses[['sqft', 'bedrooms','age']].head()

Unnamed: 0,sqft,bedrooms,age
0,2.104,3.0,7.0
1,1.6,3.0,2.8
2,2.4,3.0,4.4
3,1.416,2.0,4.9
4,3.0,4.0,7.5


In [4]:
# double [] returns a pandas DataFrame
type(houses[['sqft', 'bedrooms','age']])

pandas.core.frame.DataFrame

In [5]:
houses['price'].head()

0    3.999
1    3.299
2    3.690
3    2.320
4    5.399
Name: price, dtype: float64

In [6]:
# single [] will only return a pandas Series
type(houses['price'])

pandas.core.series.Series

- What follows puts the statsmodels module to use. It may feel more familiar to those coming from an R programming background.
- In industry, the `stats` module from the SciPy library is what is mostly used for statistical testing and related tasks.
- As someone who started their coding journey with Python (not knowing R), I had never seen or used the statsmodels module before :)

In [7]:
X = houses[['sqft', 'bedrooms','age']] # X is now a subset df with column, values from only these 3 cols
y = houses['price'] # y is now a series with only values from this 1 col

X = sm.add_constant(X, prepend=True) # Add a column of ones to first col of X -> print X to see. This is done to let statsmodels know that we want the intercept of linear equation (more details soon)
results = sm.OLS(y, X).fit() # OLS: ordinary least squares, fits a simple linear reg model with OLS

- In statistics, ordinary least squares (OLS) is a type of linear least squares method for ***estimating the unknown parameters in a linear regression model*** (further read on [Wikipedia](https://en.wikipedia.org/wiki/Ordinary_least_squares) if interested)
- `sm.add_constant` in statsmodels plays the same role as sklearn's `fit_intercept` parameter in `LinearRegression()`. If you skip `sm.add_constant`, or set `LinearRegression(fit_intercept=False)`, the model assumes that b = 0 in y = mx + b and fits the line through the origin instead of estimating b from your data.
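
To make the role of the constant column concrete, here is a minimal sketch using plain NumPy least squares rather than statsmodels. The data is hypothetical (generated from y = 2x + 5 plus noise), and the variable names are ours, not part of the lesson's dataset.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 5 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 5.0 + rng.normal(0, 0.5, size=200)

# Without a constant column, the fit is forced through the origin (b = 0).
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a leading column of ones, the first coefficient is the intercept b.
X_const = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X_const, y, rcond=None)
intercept, slope = coefs

print(f"no constant:   slope = {slope_only[0]:.3f} (intercept fixed at 0)")
print(f"with constant: intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Without the column of ones, the slope has to absorb the missing intercept, so it drifts away from the true value of 2.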

Documentation links (easy to find via Google):
- [documentation for sm.add_constant](https://www.statsmodels.org/stable/generated/statsmodels.tools.tools.add_constant.html)
- [documentation for sm.OLS](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html)

We'll cover linear regression in more detail next week; for now, let's just scratch the surface. Think of `fit()` as fitting a relationship between the features (X) and the response (y) with a **mathematical equation**. For linear regression, that equation is `y = m1X1 + m2X2 + m3X3 + b`, *where b is the constant: the y-intercept, i.e. the value of y when every X is 0*.
- The `const` column of ones is not a starting guess. It is a feature whose value is always 1, so the coefficient the model estimates for it is the intercept b.

In [8]:
print(f'unique_vals_const: {X["const"].unique()}')
X.head() # see const column inserted as first col as a result of sm.add_constant() above

unique_vals_const: [1.]


Unnamed: 0,const,sqft,bedrooms,age
0,1.0,2.104,3.0,7.0
1,1.0,1.6,3.0,2.8
2,1.0,2.4,3.0,4.4
3,1.0,1.416,2.0,4.9
4,1.0,3.0,4.0,7.5


In [9]:
results.summary() # displays various linear regression model stats

0,1,2,3
Dep. Variable:,price,R-squared:,0.733
Model:,OLS,Adj. R-squared:,0.715
Method:,Least Squares,F-statistic:,39.38
Date:,"Wed, 20 Oct 2021",Prob (F-statistic):,2.12e-12
Time:,22:28:25,Log-Likelihood:,-45.641
No. Observations:,47,AIC:,99.28
Df Residuals:,43,BIC:,106.7
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.9245,0.449,2.060,0.045,0.019,1.830
sqft,1.3933,0.150,9.305,0.000,1.091,1.695
bedrooms,-0.0862,0.156,-0.551,0.584,-0.402,0.229
age,-0.0081,0.043,-0.188,0.852,-0.095,0.079

0,1,2,3
Omnibus:,3.841,Durbin-Watson:,1.819
Prob(Omnibus):,0.147,Jarque-Bera (JB):,2.771
Skew:,0.552,Prob(JB):,0.25
Kurtosis:,3.444,Cond. No.,28.9


### Interpreting the statsmodels results

![](../images/statsmodels1.png)

---

![](../images/statsmodels2.png)

---

![](../images/statsmodels3.png)

Key points:
- **Hypothesis**: 
    - In statistical hypothesis testing, we are generally trying to decide if there is a statistically significant difference between two or more groups of data. 
    - To do this, we formulate a **null hypothesis ($H_0$)**, which states that there is ***NO*** significant difference between the specified populations, with any observed difference being due to sampling or experimental error. Its opposite is called the **alternative hypothesis ($H_A$)**.
    - The goal of hypothesis testing is to perform statistical analysis and gather evidence to either ***reject the null hypothesis*** or ***fail to reject the null hypothesis***.
    - The key metric for making this decision is the **p-value**. In general, a $p$-value below 0.05 indicates we can **reject the null hypothesis**. More about this soon.
    

- For the above Linear Regression case, we are trying to build an equation that can determine the **Price** of a house using the **Area (sqft)**, **Number of Bedrooms** and **Age**. 
    - The equation we choose to build is a linear model, fit by ordinary least squares, of the form $y=\beta_1X_1+\beta_2X_2+\beta_3X_3+b$, which should look familiar from secondary school mathematics.
    - $y$ is the **Price**
    - $X_1$ is the **Area (sqft)**. $\beta_1$ is the coefficient or weight of **Area (sqft)**
    - $X_2$ is the **Number of Bedrooms**. $\beta_2$ is the coefficient or weight of **Number of Bedrooms**
    - $X_3$ is the **Age**. $\beta_3$ is the coefficient or weight of **Age**
    - $b$ is the constant (intercept) term


- In this case, each coefficient gets its own **null hypothesis**: that the coefficient is 0, meaning that variable has no linear relationship with $y$ once the other variables are accounted for. A coefficient whose estimate is close to 0 relative to its standard error gives us no grounds to reject that null hypothesis. As data scientists/analysts, we are generally checking for evidence to ***reject the null hypothesis ($H_{0}$)***.

- The values under coef in summary above are $\beta_0$ or $b$, $\beta_1$, $\beta_2$, $\beta_3$ respectively
    
- The p-value is a probability that measures evidence **AGAINST** the null hypothesis. The **smaller** the p-value, the stronger the evidence that you should **reject the null hypothesis**. A p-value ≤ 0.05 is typically considered statistically significant. **A low p-value means that data like yours would be unlikely if the null hypothesis were true.** For example, a p-value of 0.01 means that, if the null hypothesis were true, there would be only a 1% probability of seeing results at least as extreme as the ones observed.
    - The `P>|t|` column in the statsmodels results holds the p-value for each feature's coefficient, i.e. for its impact on the response, y
    - p-values are related to t-values [further read](https://www.statisticshowto.com/probability-and-statistics/t-test/#:~:text=Every%20t%2Dvalue%20has%20a,value%20of%205%25%20is%200.05.)
    
- AIC and BIC are two criteria used for model selection. They reward goodness of fit and penalize the number of parameters, and lower values are better. In real-world data science tasks, we train many different models and use metrics like these to choose among them
    - Further read from [Wiki-AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion)
    - Further read from [Wiki-BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion)
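
As a rough check on the AIC and BIC values in the houses summary, the sketch below recomputes them from the reported log-likelihood with the standard formulas $AIC = 2k - 2\ln L$ and $BIC = k\ln(n) - 2\ln L$. It assumes $k$ counts the four fitted coefficients (const, sqft, bedrooms, age), which reproduces the numbers statsmodels printed.

```python
import math

# Values copied from the houses summary table.
log_likelihood = -45.641
n_obs = 47
k_params = 4  # const, sqft, bedrooms, age

# Both criteria penalize extra parameters; BIC's penalty grows with sample size.
aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood

print(f"AIC = {aic:.2f}")  # summary reports 99.28
print(f"BIC = {bic:.1f}")  # summary reports 106.7
```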

## You Try: Hypothesis Testing our OLS coefficients

In [10]:
# This is a NASA dataset of airfoils at various wind tunnel speeds and angles of attack.
# Their goal was to minimize noise (measured in db)
df = pd.read_csv(
    "../data/airfoil_self_noise.dat",
    sep="\t",
    names=["freq", "angle", "chord_len", "velocity", "thickness", "db"]
)

# Let's create a random rubbish column to show that hypothesis testing works!
df["junk"] = np.random.randn(df.shape[0]) # one draw from the standard normal distribution per row of df
print(df.shape)
df.head()

(1503, 7)


Unnamed: 0,freq,angle,chord_len,velocity,thickness,db,junk
0,800,0.0,0.3048,71.3,0.002663,126.201,-0.279348
1,1000,0.0,0.3048,71.3,0.002663,125.201,0.117033
2,1250,0.0,0.3048,71.3,0.002663,125.951,0.246328
3,1600,0.0,0.3048,71.3,0.002663,127.591,-1.378038
4,2000,0.0,0.3048,71.3,0.002663,127.461,0.037208


In [11]:
# repeating the same feature-response assignment from defined dataframe above
X = df.drop("db", axis=1) # dropping db as that'll be response
X = sm.add_constant(X)
y = df["db"]

In [12]:
model = sm.OLS(y, X).fit()

In [13]:
model.summary()
# Check the p-value of the rubbish "junk" column: it is well above 0.05, so we fail to reject the null hypothesis for it.
# This indicates that we have no evidence the "junk" column has a statistically significant impact on our "db" column.

0,1,2,3
Dep. Variable:,db,R-squared:,0.516
Model:,OLS,Adj. R-squared:,0.514
Method:,Least Squares,F-statistic:,265.5
Date:,"Wed, 20 Oct 2021",Prob (F-statistic):,2.14e-231
Time:,22:28:25,Log-Likelihood:,-4490.1
No. Observations:,1503,AIC:,8994.0
Df Residuals:,1496,BIC:,9031.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,132.8331,0.545,243.767,0.000,131.764,133.902
freq,-0.0013,4.21e-05,-30.442,0.000,-0.001,-0.001
angle,-0.4218,0.039,-10.839,0.000,-0.498,-0.345
chord_len,-35.6867,1.631,-21.880,0.000,-38.886,-32.487
velocity,0.0999,0.008,12.275,0.000,0.084,0.116
thickness,-147.3613,15.030,-9.805,0.000,-176.843,-117.880
junk,-0.0138,0.125,-0.110,0.912,-0.259,0.231

0,1,2,3
Omnibus:,12.914,Durbin-Watson:,0.447
Prob(Omnibus):,0.002,Jarque-Bera (JB):,19.161
Skew:,-0.021,Prob(JB):,6.91e-05
Kurtosis:,3.552,Cond. No.,518000.0
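
For readers curious where the `coef`, `std err`, `t`, and `P>|t|` columns come from, here is a minimal sketch that computes them by hand with NumPy and SciPy. It uses hypothetical synthetic data (a `signal` column that drives `y` and a `junk` column that does not), not the airfoil file, and the textbook OLS formulas rather than statsmodels itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500

# Hypothetical data: y depends on `signal` but not on `junk`.
signal = rng.normal(size=n)
junk = rng.normal(size=n)
y = 3.0 + 2.0 * signal + rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), signal, junk])

# OLS estimates: beta = (X'X)^-1 X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# Residual variance and the standard error of each coefficient.
resid = y - X @ beta
df_resid = n - X.shape[1]
sigma2 = resid @ resid / df_resid
std_err = np.sqrt(sigma2 * np.diag(XtX_inv))

# t-statistic and two-sided p-value for H0: beta_i = 0.
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), df_resid)

for name, b, t, p in zip(["const", "signal", "junk"], beta, t_stats, p_values):
    print(f"{name:>6}: coef={b: .4f}  t={t: .3f}  p={p:.3f}")
```

The junk column's p-value will usually be large, but keep in mind that roughly 5% of purely random columns would still fall below 0.05 by chance.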


## Hypotheses
Notice the columns marked `t` and `P>|t|`. These are the $t$-statistics and $p$-values for the hypothesis test:

$$
H_0: \beta_i = 0
$$
$$
H_A: \beta_i \ne 0
$$

where $H_{0}$ --> null hypothesis and $H_{A}$ --> alternate hypothesis
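
To see how the `t`, `P>|t|`, and confidence-interval columns relate, the sketch below rebuilds them from the `coef` and `std err` values in the houses summary. Because the inputs are the rounded numbers from the table, the results should match the summary only approximately.

```python
from scipy import stats

df_resid = 43  # "Df Residuals" in the houses summary

# (coef, std err) pairs copied from the houses summary table.
rows = {
    "const": (0.9245, 0.449),
    "sqft": (1.3933, 0.150),
    "bedrooms": (-0.0862, 0.156),
    "age": (-0.0081, 0.043),
}

t_crit = stats.t.ppf(0.975, df_resid)  # critical value for a 95% interval

results = {}
for name, (coef, se) in rows.items():
    t_stat = coef / se                                 # the `t` column
    p_value = 2 * stats.t.sf(abs(t_stat), df_resid)    # the `P>|t|` column
    ci = (coef - t_crit * se, coef + t_crit * se)      # the [0.025, 0.975] columns
    results[name] = (t_stat, p_value, ci)
    print(f"{name:>8}: t={t_stat: .3f}  p={p_value:.3f}  95% CI=({ci[0]: .3f}, {ci[1]: .3f})")
```

Notice that a coefficient's 95% interval contains 0 exactly when its p-value is above 0.05, as with `bedrooms` and `age`.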


(THREAD) In your own words, what would it mean if $\beta_i = 0$?

Generally, one practical way to internalize how the null hypothesis and p-value work together: we use a significant p-value (<= 0.05) to reject the null hypothesis, which lets us conclude that there IS a statistically significant difference (or effect) in what we are studying. In the houses dataset, for example, the feature sqft has a statistically significant impact on the response, house price (its p-value is reported as 0.000), so we can reject the null hypothesis for sqft.

### Hypothesis Testing with Puppies

[This example is pulled liberally from Cassie Kozyrkov's Medium post.](https://hackernoon.com/explaining-p-values-with-puppies-af63d68005d0)

Let's say that we come home at the end of the day to find some unwound toilet paper.

<img src="./images/pug_toilet_paper.jpg" alt="doggo" width="600"/>

We need to make a **data-driven** decision: Do we yell at our dog? 

Our possibilities are:
- Yes, we yell at our dog.
- No, we don't yell at our dog.

Let's assume that our dog is innocent (null hypothesis). Being good data scientists, we want to gather data, then use this data to make a decision.
- **Gust of wind?** We check to see if the bathroom window is open or closed.
- **Floor vent?** We check the thermostat to see if we left the heating/air conditioning on.
- **Another human?** We text our sibling to see if they brought our niece over.

Once we're done "gathering our data," we determine the probability of observing this mess naturally, if our dog didn't do it.
- If the probability (p-value) is low enough (<= 0.05), it is unlikely that the incident happened naturally, so we reject the null hypothesis and blame our dog (because smaller p-values are stronger evidence ***against*** the null hypothesis).
- Otherwise, we can't blame our dog!

We just walked through a hypothesis test! We had two potential decisions, we gathered data, and used this data to make a decision.

> **Note that we only deem our dog guilty or not guilty. The dog is never pronounced innocent! Just like the U.S. court system, hypothesis testing works this way too.**

### Hypothesis Testing: A Drug Efficacy Example

---

Say we are testing the effectiveness of a new drug:

- We randomly select 50 people to be in the *control group*, who are given the old drug (the one currently on the market), and 50 people to receive the treatment (also known as the *experimental group*) in the context of our experiment.
    - In other experiments, the control group is the one that receives no treatment. There can be a placebo group as well, which is one that receives a false treatment. **Is this ethical in this scenario?**
- We are interested in the average difference in blood pressure levels between the treatment and control groups to gauge the efficacy of the new drug.
- We know our sample is selected from a broader, unknown population pool.
- We can imagine that, in a hypothetical parallel world, we could have ended up with a different random sample of subjects from the population pool.

<a id='null-hypothesis'></a>

### The "Null" Hypothesis

---

The **null hypothesis** is typically the exact opposite of what you want to test for, i.e. the "status quo". We typically denote the null hypothesis with $H_0$.
- In our dog example, we assume that our dog is innocent.
- In our drug efficacy experiment example, our *null hypothesis* is that there is ***no difference in blood pressure between a subject taking the old drug (control) and one taking the treatment drug***.

> $H_0:$ The *average difference* in blood pressure between treatment and control groups is ***zero***.

Or, as it's properly written:

> $H_0: \mu_\text{trt} = \mu_\text{ctrl}$

Or, as it's often written:

> $H_0: \mu_\text{trt} - \mu_\text{ctrl} = 0$

<a id='alternative-hypothesis'></a>

### The "Alternative Hypothesis"

---

The **alternative hypothesis** is the outcome of the experiment that we hope to show. It's the ***opposite of our null hypothesis***!
- In our dog example, the alternative hypothesis is that our dog is guilty of unspooling the toilet paper.
- In our drug efficacy experiment example, the *alternative hypothesis* is that ***there is in fact an average difference in blood pressure between the treatment and control groups***. 

> $H_A:$ The parameter of interest (our *average difference* between treatment and control) is ***not zero***.

Or, in math:

> $H_A: \mu_\text{trt} \ne \mu_\text{ctrl}$

Again, we usually write

> $H_A: \mu_\text{trt} - \mu_\text{ctrl} \ne 0$

**NOTE:** The null and alternative hypotheses are concerned with the true values, or, in other words, the **parameter of the overall population**. Through hypothesis testing, we will make an **inference** (a decision) about this ***population parameter***.

### Why is it written like this? $\mu$ vs $\bar{x}$
(THREAD) Can you remind me what a *population parameter* is?

(THREAD) Can you remind me what a *sample statistic* is?

Population parameters are often denoted with Greek letters. It would make no sense to conduct a hypothesis test with sample statistics, since they differ with each experiment, and you don't need to hypothesize about them.
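
A small simulation can make the parameter/statistic distinction concrete. In this sketch the population mean and standard deviation are hypothetical values we choose ourselves; in a real study they would be unknown, which is exactly why we hypothesize about them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: the parameter mu is fixed, but normally unknown to us.
mu, sigma = 120.0, 25.0

# Draw several independent samples of 50 subjects and record each sample mean.
sample_means = [rng.normal(mu, sigma, size=50).mean() for _ in range(5)]

print(f"population mean (parameter): {mu}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic):   {x_bar:.2f}")
```

Each "parallel world" sample gives a different $\bar{x}$, while $\mu$ never changes.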

### Introduction to the $t$-Test

---

In our dog example, we gathered data in a way that's different from how we'll usually gather data in order to make a decision.

Say that, in our drug experiment, we measure the following results:

- The 50 subjects in the control group have an average systolic blood pressure of 121.38.
- The 50 subjects in the experimental/treatment group have an average systolic blood pressure of 111.56.

The difference between the experimental and control samples is -9.82 points. So, is this **statistically significant enough to make a data-driven decision on the efficacy of the new drug?**

With **only 50 subjects** in each sample, how confident can we be that this measured difference is real? Do we have enough evidence to say that the population average blood pressure is different between these two groups?

We can perform what is known as a **t-test** to evaluate this. (A $t$-test is one of many, many types of hypothesis tests.)

***Four steps to hypothesis testing:***
1. Construct a null hypothesis that you want to contradict, along with its complement, the alternative hypothesis.
2. Specify a level (or threshold) of statistical significance.
3. Calculate your test statistic.
4. Find your $p$-value and make a conclusion.

In [14]:
bp = pd.read_csv("../data/blood-pressure.csv")
print(bp.shape)
bp.head()

(100, 2)


Unnamed: 0,bp,group
0,166,control
1,165,control
2,120,control
3,94,control
4,104,control


In [15]:
# let's look at the values in 'group'
bp['group'].unique()

array(['control', 'treatment'], dtype=object)

In [16]:
# Separate the blood pressure data into two separate vectors, corresponding to each group
# (this is how we'll need it for a SciPy t-test)
ctrl = bp.loc[bp["group"] == "control", "bp"] # specifying row, col filter for control group
trt = bp.loc[bp["group"] == "treatment", "bp"] # specifying row, col filter for treatment group

In [17]:
ctrl.head() # we get all the 'bp' values corresponding to control 'group'

0    166
1    165
2    120
3     94
4    104
Name: bp, dtype: int64

In [18]:
trt.head() # we get all the 'bp' values corresponding to treatment 'group'

50     83
51    100
52    123
53     75
54    130
Name: bp, dtype: int64

In [19]:
# alternative way to also extract the bp values corresponding to control group
bp[bp['group']=='control']['bp'].values

array([166, 165, 120,  94, 104, 166,  98,  85,  97,  87, 114, 100, 152,
        87, 152, 102,  82,  80,  84, 109,  98, 154, 135, 164, 137, 128,
       122, 146,  86, 146,  85, 101, 109, 105, 163, 136, 142, 144, 140,
       128, 126, 119, 121, 126, 169,  87,  97, 167,  89, 155])

In [20]:
# Print the average of the control and experimental groups.
print(ctrl.mean())
print(trt.mean())
print(round(trt.mean() - ctrl.mean(),2))

121.38
111.56
-9.82


<a id='likelihood-data'></a>

### Step 1: Construct the null and alternative hypotheses

---

For our experiment, we will set up a null hypothesis and an alternative hypothesis:

$H_0:$ The true mean difference in systolic blood pressure between those who receive the treatment and those who do not is 0.

$H_A:$ The true mean difference in systolic blood pressure between those who receive the treatment and those who do not is NOT 0.

### Formally:

$$
\begin{align}
H_0: & \mu_\text{trt} = \mu_\text{ctrl} \\
H_A: & \mu_\text{trt} \ne \mu_\text{ctrl} \\
\end{align}
$$

Recall, our measured difference is $\bar{x}_\text{trt} - \bar{x}_\text{ctrl} = -9.82$

Written out using probability notation, we want to know:

### $$P(\text{data}\;|\;H_0 \text{ true})$$

**What is the probability that we observed this data, assuming that our null hypothesis is true?**

If the probability is *low* --> **reject null hypothesis**
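
One way to build intuition for $P(\text{data}\;|\;H_0 \text{ true})$ is a permutation test, which is a different technique from the $t$-test used in this lesson. If the null hypothesis is true, the group labels are interchangeable, so we can shuffle them many times and count how often a difference at least as extreme as the observed one appears. The sketch below uses hypothetical simulated samples, and `permutation_p_value` is a helper written for this illustration, not a library function.

```python
import numpy as np

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffling the pooled values simulates "labels do not matter" (H0 true).
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (count + 1) / (n_perm + 1)

# Hypothetical blood-pressure samples (not the lesson's data file).
rng = np.random.default_rng(7)
ctrl_sim = rng.normal(120, 25, size=50)
trt_sim = rng.normal(110, 25, size=50)

p_sim = permutation_p_value(trt_sim, ctrl_sim)
print(f"estimated P(difference at least this extreme | H0 true) = {p_sim:.3f}")
```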

### Step 2: Specify a level of significance

If we assume that our null hypothesis is true, and the probability of observing the data we observed is "small," then our data does not support our null hypothesis. 

**But how "small" is small enough?**

This is set by our level of significance, which we call $\alpha$.

Typically (and arbitrarily) the value $\alpha=0.05$ is used.
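
What does choosing $\alpha = 0.05$ buy us? The simulation sketch below runs many hypothetical experiments in which the null hypothesis is true by construction (both groups come from the same population). The share of experiments that still produce a p-value at or below $\alpha$ should land near $\alpha$ itself, which is the false-positive rate we agree to tolerate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
alpha = 0.05
n_experiments = 2000

# Both groups come from the SAME hypothetical population, so H0 is true in every experiment.
group_a = rng.normal(120, 25, size=(n_experiments, 50))
group_b = rng.normal(120, 25, size=(n_experiments, 50))

# One two-sample t-test per row (axis=1).
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

false_positive_rate = np.mean(p_values <= alpha)
print(f"share of experiments with p <= {alpha}: {false_positive_rate:.3f}")
```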

We'll check whether **p-value ≤** $\alpha$ ***to reject the null hypothesis***.

### Step 3: Calculating your Test Statistic

---

Remember that hypothesis testing is a "box" where the inputs are our data and the outputs allow us to make our decision? Well, in this "box," we are calculating $P(\text{data}\;|\;H_0 \text{ true})$.

When comparing two means, the **t-statistic** (based on the [Student's $t$-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)) is a classic way to quantify the difference between groups. In essence, our $t$-statistic is a standardized version of the difference between groups.

Luckily, our computer will do this for us!

---

<details><summary>Want the mathematical details of the calculation of the t-statistic?</summary>
When comparing the difference between groups, we can calculate the two-sample t-statistic like so:

### $$t = \frac{\bar{x}_E - \bar{x}_C}{\sqrt {s^2 \Big(\frac{1}{n_E} + \frac{1}{n_C}\Big)}}$$

In our example, $\bar{x}_E$ is the mean of our experimental group's sample measurements and $\bar{x}_C$ is the mean of our control group's sample measurements.

$n_E$ and $n_C$ are the number of observations in each group. 

The $s^2$ denotes our *pooled sample variance*. In this version of the t-test, we are assuming equal variances in our experimental and control groups in the overall population. There is another way to calculate the t-test where equal variance is not assumed (Welch's t-test), but, in our case, it is a reasonable assumption.

The pooled sample variance is calculated like so:

### $$ s^2 = \frac{\sum_{i=1}^{n_E} (x_i - \bar{x}_E)^2 + \sum_{j=1}^{n_C} (x_j - \bar{x}_C)^2}{ n_E + n_C -2} $$

This combines the variance of the two groups' measurements into a single pooled metric. 

</details>
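
The pooled two-sample $t$-statistic can be checked by hand against SciPy. This sketch uses hypothetical simulated samples in place of the lesson's blood pressure file; the names `x_e` and `x_c` mirror $\bar{x}_E$ and $\bar{x}_C$ in the formula.

```python
import numpy as np
from scipy import stats

# Hypothetical samples standing in for the experimental (E) and control (C) groups.
rng = np.random.default_rng(5)
x_e = rng.normal(110, 25, size=50)
x_c = rng.normal(120, 25, size=50)
n_e, n_c = x_e.size, x_c.size

# Pooled sample variance s^2.
ss_e = np.sum((x_e - x_e.mean()) ** 2)
ss_c = np.sum((x_c - x_c.mean()) ** 2)
s2 = (ss_e + ss_c) / (n_e + n_c - 2)

# Two-sample t-statistic: standardized difference between the group means.
t_manual = (x_e.mean() - x_c.mean()) / np.sqrt(s2 * (1 / n_e + 1 / n_c))

# SciPy's equal-variance t-test should agree with the manual calculation.
t_scipy, _ = stats.ttest_ind(x_e, x_c, equal_var=True)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```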

## TL;DR What are we doing?
(I just found out TL;DR stands for “Too Long; Didn’t Read.” Thanks to Google, again!!)

**GOAL:** To tell *whether or not our new treatment is effective*. We hope that ***those who get the treatment see lower systolic blood pressure, on average***, and we test whether the difference between the groups is **statistically** significant.

To do this, we follow the following steps to carry out a **hypothesis test**:

1. Set up null and alternative hypotheses. Remember, ours was this:

$$ H_0: \mu_\text{trt} - \mu_\text{ctrl} = 0 $$
$$ H_A: \mu_\text{trt} - \mu_\text{ctrl} \ne 0 $$

2. Decide on a significance level. $\alpha = 0.05$ is a typical choice.
3. Decide on a hypothesis test. There are several different ones. In this case, we're testing the difference between two means, which is a great time to use a **two-sample $t$-test**.

> The two-sample (independent) $t$-test tests whether or not two population means differ.

4. After carrying out this hypothesis test, we'll see whether our data provides enough evidence to reject the null hypothesis, that is, whether the difference in mean blood pressure between the two groups is statistically significant.

## Let's do it!
Uh... how? What function do I use? Help me, Google!

In [21]:
# Import scipy.stats, the module mentioned near the beginning of the lesson.
from scipy import stats

In [22]:
# Conduct our t-test between treatment and control groups.
# stats.ttest_ind runs a t-test for the means of two independent samples
# equal_var=True (the default) pools the variances; with EQUAL sample sizes, equal_var=False (Welch's test)
# gives the same t-statistic, though the p-value can differ slightly because the degrees of freedom change
stats.ttest_ind(trt, ctrl)

Ttest_indResult(statistic=-1.8915462966190273, pvalue=0.061504240672530394)

In [23]:
# assigning t_statistic and p-value from the scipy stat test to 2 variables
t_stat, p_value = stats.ttest_ind(trt, ctrl)
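
As a side note on the `equal_var` argument of `stats.ttest_ind`, the sketch below compares the pooled (Student) test with Welch's test on hypothetical groups that have equal sizes but very different spreads. With equal sample sizes the two $t$-statistics coincide, while the p-values can differ because Welch's test estimates its degrees of freedom from the data.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with equal sizes but clearly different spreads.
rng = np.random.default_rng(11)
a = rng.normal(110, 10, size=50)
b = rng.normal(120, 40, size=50)

student = stats.ttest_ind(a, b, equal_var=True)   # pooled variance, df = n_a + n_b - 2
welch = stats.ttest_ind(a, b, equal_var=False)    # Welch's test, df estimated from the data

print(f"Student: t = {student.statistic:.4f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

When the group sizes differ as well as the variances, the two statistics themselves can diverge, which is when the choice matters most.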

<a id='p-value'></a>

### Step 4: The P-Value

---

Remember that our goal of doing all of this work is to make a decision? Well, using our $t$-statistic, we can generate a **p-value**.

> **The p-value is the probability that, given that the null hypothesis $H_0$ is true, we could have ended up with a statistic at least as extreme as the one we got.**

We have measured a difference in blood pressure of -9.82 between the experimental and control groups. We then calculated a $t$-statistic associated with this difference of -1.89. In our specific example:

> The p-value is the probability that, assuming there is truly no difference in blood pressure between treatment and control conditions (i.e., no effect of the drug), we get results that yield a t-statistic at least as extreme as -1.89 (in either direction, since our alternative is two-sided).
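
The p-value reported by SciPy can be recovered from the $t$-statistic alone. This sketch assumes the equal-variance test, so the reference distribution is a $t$-distribution with $n_\text{trt} + n_\text{ctrl} - 2 = 98$ degrees of freedom; the variable names are ours.

```python
from scipy import stats

t_reported = -1.8915462966190273  # statistic reported by stats.ttest_ind in the lesson
df_pooled = 50 + 50 - 2           # n_trt + n_ctrl - 2 for the equal-variance test

# Two-sided p-value: probability of |T| >= |t_reported| under the t-distribution.
# sf is the survival function, 1 - cdf; doubling it covers both tails.
p_from_t = 2 * stats.t.sf(abs(t_reported), df_pooled)
print(f"p-value = {p_from_t:.4f}")  # should agree with the reported pvalue of about 0.0615
```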

### So how do we make the decision? *(This will show up in interviews!)*

Remember that $\alpha$ is our level of significance.

- If $p\text{-value} < \alpha$, then ***there is evidence to reject the null hypothesis***, so you reject $H_0$ in favor of $H_A$.
    - i.e., **a statistically significant difference between the two groups!**
    - This is like saying there is enough evidence to say our dog isn't innocent... so we say our dog is guilty.
- If $p\text{-value} \ge \alpha$, then there is ***insufficient evidence to reject the null hypothesis*** and you cannot conclude that either $H_0$ or $H_A$ is correct.
    - i.e., we did **not detect a statistically significant difference between the two groups.**
    - This is like saying there is not enough evidence to say our dog isn't innocent. We can't totally determine that our dog is innocent, but we haven't determined that our dog is guilty, either.
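
The decision rule above is simple enough to write as a tiny function. `decide` is a hypothetical helper for illustration; note that it only ever returns "reject" or "fail to reject", never "accept".

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.0615))  # the blood pressure example
print(decide(0.001))   # a much smaller p-value
```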

## So.... what is our decision?

> **DECISION:** Because our $p$-value *(0.06 from our stats test) is greater than* our $\alpha = 0.05$, **we fail to reject our null hypothesis**. We do not have enough evidence to conclude that the mean systolic blood pressure differs between the treatment and control groups.

## Just for good measure... what would the opposite decision look like?

> **DECISION (if our $p$-value had been below 0.05):** Because our p-value is below 0.05, we reject the null hypothesis and conclude that the mean blood pressure differs between the treatment and control groups.

## The Law of Parsimony (aka: Occam's Razor)
This is usually paraphrased as:
> The simplest explanation for a phenomenon is usually the correct one.

We don't want to overspecify our model. In our context, that means we want to avoid any potential overfitting. While we **never accept the null hypothesis**, the truth is, _some decision must be made_. Oftentimes, we drop variables from our model that do not have significant $p$-values.

## Other Hypothesis Tests
The goal of this lesson was to teach you, in general, how hypothesis testing works. We showed you what is probably the most common variety of hypothesis test: the $t$-test. However, there are several other ones out there. It's not worth our time to go over so many more of them, as they all follow the same general steps and are interpreted the same way; they are just used in different situations. Instead, here is a list of many of the "big" ones and when to use them:

| Situation | Common hypothesis test | Example | Notes |
| --- | --- | --- | --- |
| Testing whether or not one mean is equal to a value | One-sample $t$-test | Do cars on a given road, on average, drive about 65mph? | |
| Testing whether or not two means are equal to each other | Two-sample $t$-test | Is the mean systolic blood pressure of people who receive Medicine A or Medicine B the same? | |
| Testing whether or not paired observations have the same value | Paired $t$-test | Among heterosexual married couples, is the husband, on average, taller than the wife? | This is functionally the same as a one-sample $t$-test of the differences |
| Testing whether or not three or more means are the same | One-way ANOVA test | Are base salaries upon graduation different for graduates of Penn State, Ohio State, and Michigan? | Intended for normally distributed variables. The ANOVA test has many variants |
| Testing whether or not there is a relationship between two categorical variables | $\chi^2$ test (read as "kai squared") | Is there a relationship between home state and political affiliation? | |
| Testing whether or not a given distribution is normally distributed | Kolmogorov-Smirnov Test | Testing whether or not model residuals are normally distributed. Useful for testing linear regression assumptions! | |
| Testing whether or not one proportion is equal to a number | One-sample $z$-test | Testing whether or not a coin is fair (ie, testing $P(Heads) = 0.5$) | |
| Testing whether or not two proportions are equal | Two-sample $z$-test | Who is going to win an election? | Testing two or more proportions can be done better with a $\chi^2$ test |
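
Several of the tests in the table are available in `scipy.stats`. The sketch below runs a few of them on hypothetical simulated data, and also checks the table's note that a paired $t$-test matches a one-sample $t$-test on the differences. All numbers here are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical paired heights (cm) for 40 couples.
husband = rng.normal(178, 7, size=40)
wife = rng.normal(165, 7, size=40)

# Paired t-test vs. one-sample t-test on the differences: same statistic and p-value.
paired = stats.ttest_rel(husband, wife)
one_sample = stats.ttest_1samp(husband - wife, popmean=0)
print(f"paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.3g}")
print(f"one-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.3g}")

# One-way ANOVA across three hypothetical groups.
g1, g2, g3 = (rng.normal(m, 5, size=30) for m in (50, 52, 55))
anova = stats.f_oneway(g1, g2, g3)

# Chi-squared test of independence on a hypothetical 2x3 contingency table.
table = np.array([[30, 45, 25], [35, 30, 35]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

print(f"ANOVA: F = {anova.statistic:.3f}, p = {anova.pvalue:.3g}")
print(f"chi-squared: stat = {chi2:.3f}, dof = {dof}, p = {chi2_p:.3g}")
```

In every case the workflow is the same as for the two-sample $t$-test: state the hypotheses, pick $\alpha$, compute the statistic, and compare the p-value to $\alpha$.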






## Recap

Four steps to hypothesis testing:
1. Construct a null hypothesis that you want to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your test statistic.
4. Find your $p$-value and make a conclusion.