# Lesson 2.06 Hypothesis Testing

In [1]:
# Bring in our libraries.
import numpy as np
import matplotlib.pyplot as plt

# For the full list of available styles check out matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
# Read more on matplotlib’s fivethirtyeight style at this link - https://www.dataquest.io/blog/making-538-plots/
plt.style.use('fivethirtyeight')

# Render figures at retina (high-DPI) quality so the plots look sharper.
# On a display below retina resolution, the improvement will be less noticeable.
%config InlineBackend.figure_format = 'retina'

# Make plot outputs appear inline, right after the code cell, and store them within the notebook.
%matplotlib inline

## Introduction to Hypothesis Testing

In the real world, we like to make **data-driven decisions**!
- In order to make these decisions, though, we need to collect some data.
- We take this data, put it into a "box," and the output effectively tells us what type of decision we should make.
- This "box" is hypothesis testing.

Hypothesis testing is a little more complicated than that, but not much!

### Hypothesis Testing: A Drug Efficacy Example

---

Say we are testing the efficacy of a new drug:

- We randomly select 50 people to be in the placebo control condition and 50 people to receive the treatment.
    - In the context of experiments, we often talk about the "control" group and the "experimental" or "treatment" group. In our example, the control group is the one given the placebo (sugar pill) and the treatment group is the one given the actual drug. 
- We are interested in the average difference in blood pressure levels between the treatment and control groups.
- We know our sample is selected from a broader, unknown population pool.
- We can imagine that, in a hypothetical parallel world, we could have ended up with a different random sample of subjects from the population pool.
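
To make the "parallel world" idea concrete, here is a minimal sketch. The population below is entirely made up (a normal distribution with mean 120 and standard deviation 25 is an assumption, not something measured in this lesson); the point is only that each random sample of 50 gives a slightly different sample mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of systolic blood pressures (made-up parameters).
population = rng.normal(loc=120, scale=25, size=100_000)

# Each "parallel world" draws its own random sample of 50 subjects.
sample_means = [
    rng.choice(population, size=50, replace=False).mean() for _ in range(5)
]

for world, mean in enumerate(sample_means, start=1):
    print(f"World {world}: sample mean = {mean:.2f}")
```

The sample means bounce around the population mean. Hypothesis testing is how we account for that bounce when comparing two groups.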

<a id='null-hypothesis'></a>

### The "Null" Hypothesis

---

The **null hypothesis** is a fundamental concept of statistical tests. We typically denote the null hypothesis with $H_0$.
- In our drug efficacy experiment example, our null hypothesis is that there is no difference in blood pressure between a subject taking a placebo and one taking the treatment drug.

> $H_0:$ The average difference in blood pressure between treatment and control groups is zero.

<a id='alternative-hypothesis'></a>

### The "Alternative Hypothesis"

---

The **alternative hypothesis** is the outcome of the experiment that we hope to show. It's the opposite of our null hypothesis!
- In our drug efficacy experiment example, the alternative hypothesis is that there is in fact an average difference in blood pressure between the treatment and control groups. 

> $H_A:$ The parameter of interest — our average difference between treatment and control — is not zero.

**NOTE:** The null and alternative hypotheses are concerned with the true values, or, in other words, the **parameter of the overall population**. Through hypothesis testing, we will make an **inference** (a decision) about this population parameter.

### Introduction to the $t$-Test

---

Say that, in our drug experiment, we measure the following results:

- The 50 subjects in the control group have an average systolic blood pressure of 121.38.
- The 50 subjects in the experimental/treatment group have an average systolic blood pressure of 111.56.

The difference between experimental and control samples is -9.82 points. 

**But**, with only 50 subjects in each sample, how confident can we be that this measured difference is real? Do we have enough evidence to say that the population average blood pressure is different between these two groups?

We can perform what is known as a [**t-test**](https://www.scribbr.com/statistics/t-test/) to evaluate this. (A $t$-test is one of many, many types of hypothesis tests.)

Four steps to hypothesis testing:
1. Construct a null hypothesis that you want to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your test statistic.
4. Find your $p$-value and make a conclusion.

**We can set up the experimental and control observations below as `numpy` arrays.**

In [2]:
control = np.array([166, 165, 120,  94, 104, 166,  98,  85,  97,  87, 114, 100, 152,
                    87, 152, 102,  82,  80,  84, 109,  98, 154, 135, 164, 137, 128,
                    122, 146,  86, 146,  85, 101, 109, 105, 163, 136, 142, 144, 140,
                    128, 126, 119, 121, 126, 169,  87,  97, 167,  89, 155])

experimental = np.array([ 83, 100, 123,  75, 130,  77,  78,  87, 116, 116, 141,  93, 107,
                         101, 142, 152, 130, 123, 122, 154, 119, 149, 106, 107, 108, 151,
                         97,  95, 104, 141,  80, 110, 136, 134, 142, 135, 111,  83,  86,
                         116,  86, 117,  87, 143, 104, 107,  86,  88, 124,  76])

# Print the average of the control and experimental groups.
print(f"Average control BP is: {np.mean(control)}")
print(f"Average experimental BP is: {np.mean(experimental)}")
# Print the difference of the sample means, too.
print(f"Difference in average BP between control and experimental is: {np.mean(experimental) - np.mean(control)}")

Average control BP is: 121.38
Average experimental BP is: 111.56
Difference in average BP between control and experimental is: -9.819999999999993


<a id='likelihood-data'></a>

### Step 1: Construct the null and alternative hypotheses

---

For our experiment, we will set up a null hypothesis and an alternative hypothesis:

$$ H_0: \text{The true mean difference in systolic blood pressure between those who receive the treatment and those who do not is 0.} $$

$$ H_A: \text{The true mean difference in systolic blood pressure between those who receive the treatment and those who do not is NOT 0.} $$

Recall that our measured difference is **-9.82**.

Written out using probability notation, we want to know:

### $$P(\text{data}\;|\;H_0 \text{ true})$$

**What is the probability that we observed this data, assuming that our null hypothesis is true?**
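
One way to build intuition for this probability is a permutation test, which is a different technique from the $t$-test used in this lesson. If $H_0$ is true, the group labels don't matter, so we can shuffle them many times and see how often a shuffled difference is at least as large in magnitude as the one we measured. This is only a sketch: the seed and the number of shuffles are arbitrary choices.

```python
import numpy as np

control = np.array([166, 165, 120,  94, 104, 166,  98,  85,  97,  87, 114, 100, 152,
                    87, 152, 102,  82,  80,  84, 109,  98, 154, 135, 164, 137, 128,
                    122, 146,  86, 146,  85, 101, 109, 105, 163, 136, 142, 144, 140,
                    128, 126, 119, 121, 126, 169,  87,  97, 167,  89, 155])

experimental = np.array([ 83, 100, 123,  75, 130,  77,  78,  87, 116, 116, 141,  93, 107,
                         101, 142, 152, 130, 123, 122, 154, 119, 149, 106, 107, 108, 151,
                         97,  95, 104, 141,  80, 110, 136, 134, 142, 135, 111,  83,  86,
                         116,  86, 117,  87, 143, 104, 107,  86,  88, 124,  76])

observed_diff = experimental.mean() - control.mean()

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility
pooled = np.concatenate([experimental, control])
n_exp = len(experimental)
n_permutations = 10_000              # arbitrary number of shuffles

perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    # Shuffle the pooled values, then split them into two pretend groups.
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_exp].mean() - shuffled[n_exp:].mean()

# Fraction of shuffles with a difference at least as extreme as the observed one.
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.2f}")
print(f"Permutation estimate of the probability: {p_perm:.4f}")
```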


### Step 2: Specify a level of significance

If we assume that our null hypothesis is true, and the probability of observing the data we observed is "small," then our data does not support our null hypothesis. 

**But how "small" is small enough?**

This is set by our level of significance, which we call $\alpha$.

Typically (and arbitrarily) the value $\alpha=0.05$ is used.
- This means that, if the null hypothesis is actually true, there is a 5% chance that we will _incorrectly reject it_ (a.k.a. Type I error or false positive).
- Put another way, when there is truly no difference in blood pressure between the two groups, there is a 5% chance that we will claim a significant difference anyway.
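
We can check this claim with a quick simulation. In the sketch below, both groups are drawn from the same made-up population (the mean of 120 and standard deviation of 25 are assumptions), so the null hypothesis is true by construction. Roughly 5% of the simulated experiments should still come out "significant."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 5_000

# Both groups come from the SAME hypothetical population, so H0 is true.
group_a = rng.normal(loc=120, scale=25, size=(n_experiments, 50))
group_b = rng.normal(loc=120, scale=25, size=(n_experiments, 50))

# Run one t-test per simulated experiment (one experiment per row).
_, p_values = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)

# Share of experiments where we would wrongly reject H0.
false_positive_rate = np.mean(p_values < alpha)
print(f"False positive rate: {false_positive_rate:.3f}")
```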

### Step 3: Calculating your Test Statistic

---

Remember that hypothesis testing is a "box" where the inputs are our data and the outputs allow us to make our decision? Well, in this "box," we are calculating $P(\text{data}\;|\;H_0 \text{ true})$. This calculation requires picking a probability distribution, then comparing the results of our experiment to this distribution to see how extreme our results are relative to the null hypothesis.

When comparing two means, the **t-statistic** (based on the [Student's $t$-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)) is a classic way to quantify the difference between groups. In essence, our $t$-statistic is a standardized version of the difference between groups.

Luckily, our computer will do this for us!
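
For reference, here is one common way to write that standardized difference: Welch's $t$-statistic, which does not assume the two groups have equal variances. The symbols are our own notation rather than anything defined earlier in the lesson: $\bar{x}$ is a sample mean, $s^2$ a sample variance, and $n$ a sample size.

$$ t = \frac{\bar{x}_{\text{treatment}} - \bar{x}_{\text{placebo}}}{\sqrt{\dfrac{s_{\text{treatment}}^2}{n_{\text{treatment}}} + \dfrac{s_{\text{placebo}}^2}{n_{\text{placebo}}}}} $$

The numerator is the difference we measured, and the denominator is an estimate of how much that difference would vary from sample to sample.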

## TL;DR What are we doing?

**GOAL:** To tell whether or not our new treatment is effective. We define "effective" as whether or not those who get the treatment see lower systolic blood pressure, on average. (The test we set up below is two-sided: it checks for a difference in either direction.)

To do this, we take the following steps to carry out a **hypothesis test**:

1. Set up null and alternative hypotheses. In pure math terms, that looks like this:

$$ H_0: \mu_{\text{treatment}} - \mu_{\text{placebo}} = 0 $$
$$ H_A: \mu_{\text{treatment}} - \mu_{\text{placebo}} \ne 0 $$

2. Decide on a significance level. $\alpha = 0.05$ is a typical choice.
3. Decide on a hypothesis test. There are a million of them. In this case, we're testing the difference between two means, which is a great time to use a **two-sample $t$-test**.

> The two-sample (independent) $t$-test tests whether or not two population means differ.

4. After carrying out this hypothesis test, we'll see if our data provide enough evidence to reject the null hypothesis.

**Let's do this calculation using `scipy.stats.ttest_ind`.**

> On your own: To try your skills at `numpy` and `python`, try computing the $t$-statistic yourself from the sample means, variances, and sizes!

In [3]:
# Import scipy.stats
import scipy.stats as stats

In [4]:
# Conduct our t-test
# More details at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
stats.ttest_ind(experimental, control, equal_var=False)

Ttest_indResult(statistic=-1.8915462966190273, pvalue=0.06161817112302221)

In [18]:
# store t-test results in variables
t_stat, p_value = stats.ttest_ind(experimental, control, equal_var=False)
p_value # p-value > 0.05 --> fail to reject null hypothesis

0.06161817112302221
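
If you want to see what `ttest_ind` is doing with `equal_var=False`, here is a sketch that computes the $t$-statistic by hand and compares it with `scipy`. It assumes the Welch form of the statistic (difference in sample means divided by the square root of the summed variance-over-size terms) and the Welch-Satterthwaite approximation for the degrees of freedom.

```python
import numpy as np
from scipy import stats

control = np.array([166, 165, 120,  94, 104, 166,  98,  85,  97,  87, 114, 100, 152,
                    87, 152, 102,  82,  80,  84, 109,  98, 154, 135, 164, 137, 128,
                    122, 146,  86, 146,  85, 101, 109, 105, 163, 136, 142, 144, 140,
                    128, 126, 119, 121, 126, 169,  87,  97, 167,  89, 155])

experimental = np.array([ 83, 100, 123,  75, 130,  77,  78,  87, 116, 116, 141,  93, 107,
                         101, 142, 152, 130, 123, 122, 154, 119, 149, 106, 107, 108, 151,
                         97,  95, 104, 141,  80, 110, 136, 134, 142, 135, 111,  83,  86,
                         116,  86, 117,  87, 143, 104, 107,  86,  88, 124,  76])

# Sample variance (ddof=1) divided by sample size, for each group.
var_exp = experimental.var(ddof=1) / len(experimental)
var_ctl = control.var(ddof=1) / len(control)

# Standardized difference between the two sample means.
t_manual = (experimental.mean() - control.mean()) / np.sqrt(var_exp + var_ctl)

# Welch-Satterthwaite approximation of the degrees of freedom.
dof = (var_exp + var_ctl) ** 2 / (
    var_exp ** 2 / (len(experimental) - 1) + var_ctl ** 2 / (len(control) - 1)
)

# Compare against scipy's result.
t_scipy, p_scipy = stats.ttest_ind(experimental, control, equal_var=False)
print(f"Manual t-statistic: {t_manual:.4f}")
print(f"scipy t-statistic:  {t_scipy:.4f}")
print(f"Approximate degrees of freedom: {dof:.1f}")
```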

<a id='p-value'></a>

### Step 4: The P-Value

---

Remember that our goal of doing all of this work is to make a decision? Well, using our $t$-statistic, we can generate a **p-value**.

> **The p-value is the probability that, given that the null hypothesis $H_0$ is true, we could have ended up with a statistic at least as extreme as the one measured from our random sample of data from the true population.**

We have measured a difference in blood pressure of -9.82 between the experimental and control groups. The $t$-statistic associated with this difference is -1.89. In our specific example:

> The p-value is the probability that, assuming there is truly no difference in blood pressure between experimental and control conditions (i.e., no effect of the drug), we get sample results that are at least as extreme as getting a test statistic of -1.89.

**Our p-value corresponds to the area under the curve of the $t$-distribution where the magnitude of the t-statistic is greater than or equal to the one we measured**.
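
Here is a sketch of that area calculation using `scipy.stats.t`. The degrees of freedom below are an assumption: we use $50 + 50 - 2 = 98$ as a simple approximation, and we plug in the rounded $t$-statistic of -1.89, so the result will be close to, but not exactly, the p-value `scipy` reported.

```python
from scipy import stats

t_stat = -1.89   # rounded t-statistic from the lesson
dof = 98         # assumption: n1 + n2 - 2, an approximation of the Welch value

# Area in the left tail: P(T <= -|t|).
left_tail = stats.t.cdf(-abs(t_stat), dof)
# Area in the right tail: P(T >= |t|).
right_tail = stats.t.sf(abs(t_stat), dof)

# The two-sided p-value is the total area in both tails.
p_two_sided = left_tail + right_tail
print(f"Left tail:  {left_tail:.4f}")
print(f"Right tail: {right_tail:.4f}")
print(f"Two-sided p-value: {p_two_sided:.4f}")
```

Because the $t$-distribution is symmetric, the two tails have the same area, which is why you will often see the two-sided p-value written as twice one tail.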

### So how do we make the decision? *(This will show up in interviews!)*

Remember that $\alpha$ is our level of significance.

- If $p\text{-value} < \alpha$, then there is enough evidence to reject the null hypothesis in favor of $H_A$. This is not proof that $H_A$ is true, only that our data would be unlikely if $H_0$ were true.
    - i.e., a statistically significant difference between the two groups!
- If $p\text{-value} \ge \alpha$, then there is insufficient evidence to reject the null hypothesis. This does not show that $H_0$ is true, so you cannot accept that either $H_0$ or $H_A$ is correct.
    - i.e., we did not detect a statistically significant difference between the two groups.
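
The decision rule is simple enough to write as a tiny function. The name `decide` is our own, purely for illustration.

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a p-value and significance level."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# The p-value from our drug experiment.
print(decide(0.06161817112302221))
```

Note that a p-value exactly equal to $\alpha$ falls on the "fail to reject" side, matching the $\ge$ in the rule above.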

## So.... what is our decision?

> **DECISION:** Because our $p$-value is greater than our $\alpha = 0.05$, we fail to reject our null hypothesis. We do not have enough evidence to conclude that the mean systolic blood pressure differs between the treatment and placebo group.

## Let's practice Hypothesis Testing with another example
- Let's check whether the survival rate differs significantly between male and female passengers on the Titanic.

In [10]:
import pandas as pd
df = pd.read_csv('train.csv')

In [11]:
df.dropna(subset=['Embarked'], inplace=True)
df.shape

df['IsMale'] = df['Sex'].map(lambda x: 1 if x == 'male' else 0)
df.drop('Sex', axis=1, inplace=True)

In [12]:
df.groupby('IsMale')['Survived'].mean()

IsMale
0    0.740385
1    0.188908
Name: Survived, dtype: float64

In [13]:
sample_m = df[df['IsMale']==1]['Survived']
print(len(sample_m))
sample_f = df[df['IsMale']==0]['Survived']
print(len(sample_f))

577
312


In [14]:
import scipy.stats as stats
t_stat, p_value = stats.ttest_ind(sample_m, sample_f, equal_var=False)
p_value # p-value < 0.05 --> reject null hypothesis

1.2967913799919294e-60
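
If you don't have `train.csv` handy, you can rebuild equivalent samples from the numbers printed above. This is a sketch under one assumption: the survivor counts (109 of 577 males, 231 of 312 females) are not printed in the lesson; we inferred them from the group sizes and survival rates.

```python
import numpy as np
from scipy import stats

n_male, n_female = 577, 312
# Assumption: counts implied by the survival rates 0.188908 and 0.740385.
survived_male, survived_female = 109, 231

# Rebuild 0/1 survival arrays with the same sizes and means as the real samples.
sample_m = np.array([1] * survived_male + [0] * (n_male - survived_male))
sample_f = np.array([1] * survived_female + [0] * (n_female - survived_female))

print(f"Male survival rate:   {sample_m.mean():.6f}")
print(f"Female survival rate: {sample_f.mean():.6f}")

# Same Welch t-test as above, on the rebuilt samples.
t_stat, p_value = stats.ttest_ind(sample_m, sample_f, equal_var=False)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.3e}")
```

The $t$-test only depends on each group's size, mean, and variance, so the order of the 0s and 1s doesn't matter.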

## Summary of Hypothesis Testing
The goal of this lesson was to teach you, in general, how hypothesis testing works. We showed you what is probably the most common variety of hypothesis test: the $t$-test. However, there are kajillions of other ones out there. It's not worth our time to go over so many more of them, as they all follow the same general steps and are interpreted the same way, just in different situations.

Read more at [Everything You Need To Know about Hypothesis Testing](https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a)


## Recap

Four steps to hypothesis testing:
1. Construct a null hypothesis that you want to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your test statistic.
4. Find your $p$-value and make a conclusion.