# Hypothesis Testing

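Hypothesis testing checks whether an observed difference, either between the means (or proportions) of two samples or between a sample mean and a reference value, is statistically significant. But what do we mean by "statistically significant"? It means that the difference would be unlikely to arise from random sampling variation alone if there were truly no difference in the underlying populations. Significance does not prove that the difference is real, and it says nothing about how large the difference is; it only says that chance is an implausible explanation at the chosen significance level.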

For this analysis, I will work with the Environmental Protection Agency's (EPA) Air Quality Index (AQI) data.
I will use the AQI data to help the U.S. government prioritize its strategy for improving air quality.

I will consider the following activities. For each, I will construct a hypothesis test and use its result to make a recommendation:

1. I will consider a metropolitan-focused approach. Within California, I want to know whether the mean AQI in Los Angeles County is statistically different from that of the rest of California.
2. I will check which of New York and Ohio has the lower (better) mean AQI, and whether the difference between their means is statistically significant.
3. I will check whether Michigan has a mean AQI of 10 or higher.

**Notes:**
1. For this analysis, I'll default to a 5% level of significance.
2. For two-sample t-tests, I'll use Welch's t-test throughout (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This accounts for possibly unequal variances between the two groups being compared.
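To make the second note concrete, here is a minimal sketch of what Welch's t-test computes, checked against `scipy.stats.ttest_ind()`. The two arrays are hypothetical readings invented for illustration, not values from the EPA file, and the variable names are my own.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI readings for two groups (NOT taken from the EPA file).
group_a = np.array([12.0, 18.0, 9.0, 25.0, 14.0, 20.0])
group_b = np.array([8.0, 11.0, 10.0, 9.0, 13.0, 7.0, 12.0, 10.0])

# Variance of each sample mean: unbiased sample variance divided by n.
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over the unpooled standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# The same test via SciPy, with equal_var=False selecting Welch's version.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The key design point is that the two variances are never pooled, which is why the degrees of freedom have to be approximated rather than taken as `n_a + n_b - 2`.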

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
aqi = pd.read_csv('c4_epa_air_quality.csv')

In [3]:
aqi.head()

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |


First, let's run some descriptive statistics.

In [4]:
aqi.describe()

| | Unnamed: 0 | arithmetic_mean | aqi |
|---|---|---|---|
| count | 260.0 | 260.0 | 260.0 |
| mean | 129.5 | 0.403169 | 6.757692 |
| std | 75.199734 | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 64.75 | 0.2 | 2.0 |
| 50% | 129.5 | 0.276315 | 5.0 |
| 75% | 194.25 | 0.516009 | 9.0 |
| max | 259.0 | 1.921053 | 50.0 |


We can see that the dataset contains 260 rows of AQI data. The mean AQI across the whole dataset is about 6.76. The minimum AQI is 0, the median is 5, and the maximum is 50.

## Statistical Tests

Hypothesis testing can be conducted in five steps:

1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw your conclusion.
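As a rough sketch, steps 2 to 5 can be wrapped in one small function for the two-sample case. `run_welch_test` is a hypothetical helper of my own, not part of SciPy, and the sample arrays are invented for illustration. The `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats


def run_welch_test(sample_a, sample_b, significance_level=0.05,
                   alternative="two-sided"):
    """Hypothetical helper covering steps 2-5 for a two-sample comparison.

    Step 1 (stating the hypotheses) stays with the analyst; it determines
    which `alternative` to pass in.
    """
    # Steps 3 and 4: Welch's t-test returns the statistic and the p-value.
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False,
                             alternative=alternative)
    # Step 5: reject the null hypothesis only if p is below the threshold.
    reject_null = bool(result.pvalue < significance_level)
    return result.statistic, result.pvalue, reject_null


# Invented samples: identical groups versus clearly separated groups.
same = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shifted = same + 100.0

stat_same, p_same, reject_same = run_welch_test(same, same)
stat_diff, p_diff, reject_diff = run_welch_test(same, shifted)
print(p_same, reject_same)
print(p_diff, reject_diff)
```

Keeping the decision rule in one place makes it harder to compare a p-value against the wrong threshold when several tests are run in a row.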

**What exactly is the null hypothesis?**
It is the conservative, default position: it states that there is no real effect. For example, let's say the mean of the first sample is 7 and the mean of the second sample is 9. The null hypothesis states that the populations these samples were drawn from have the same mean. In other words, the observed difference of 2 arose purely from random sampling variation (chance).

**What is the alternative hypothesis?**
It contradicts the null hypothesis. In the same example, the alternative hypothesis states that the population means really do differ. In other words, the observed difference between the two sample means is not explained by chance alone.
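One way to see what "happened by chance" means is a permutation test, sketched below. This is a different technique from the t-tests used in the rest of this notebook, shown only to illustrate the idea. The two samples are invented so that their means are 7 and 9, matching the example above; the seed and the number of shuffles are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Invented samples with means 7 and 9 (NOT from the EPA file).
sample_1 = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
sample_2 = np.array([7.0, 8.0, 9.0, 10.0, 11.0])

observed = sample_1.mean() - sample_2.mean()

# Under the null hypothesis the group labels are interchangeable,
# so we pool the values and reassign them to two groups at random.
pooled = np.concatenate([sample_1, sample_2])
n_1 = sample_1.size
n_permutations = 10_000

count_as_extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:n_1].mean() - shuffled[n_1:].mean()
    # Two-sided: count shuffles at least as extreme as what we observed.
    if abs(diff) >= abs(observed):
        count_as_extreme += 1

# Share of random relabelings that reproduce a difference this large.
p_value = count_as_extreme / n_permutations
print(observed, p_value)
```

A small share means random relabeling rarely produces a gap as large as the observed one, which is the intuition behind rejecting the null hypothesis.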

## Activity 1

To compare the mean AQI of Los Angeles County with that of the rest of California, we first have to subset the data.

In [5]:
la_aqi = aqi[aqi['county_name'] == 'Los Angeles']
non_la_aqi = aqi[(aqi['state_name'] == 'California') & (aqi['county_name'] != 'Los Angeles')]

print('Mean aqi for Los Angeles county cities :', la_aqi['aqi'].mean())
print('Mean aqi for Non - Los Angeles county cities :', non_la_aqi['aqi'].mean())

Mean aqi for Los Angeles county cities : 16.285714285714285
Mean aqi for Non - Los Angeles county cities : 11.0
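The boolean-mask subsetting used above can be tried on a tiny table first. The sketch below uses a hypothetical four-row DataFrame (`toy`, invented here) with the same column names, and also prints each group's size, since a t-test on very few observations deserves extra caution.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used above (NOT the EPA data).
toy = pd.DataFrame({
    "state_name": ["California", "California", "California", "Ohio"],
    "county_name": ["Los Angeles", "Los Angeles", "San Diego", "Belmont"],
    "aqi": [15, 17, 11, 5],
})

# Each comparison yields a boolean Series; & combines them, ~ negates.
is_ca = toy["state_name"] == "California"
is_la = toy["county_name"] == "Los Angeles"

la = toy[is_ca & is_la]
rest_of_ca = toy[is_ca & ~is_la]

# Report the group sizes alongside the means.
print(len(la), la["aqi"].mean())
print(len(rest_of_ca), rest_of_ca["aqi"].mean())
```

The parentheses around each comparison in the original cell matter because `&` binds more tightly than `==` in Python; naming the masks first, as here, sidesteps that.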


#### Step 1:
**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Step 2:
**Set the significance level:**

In [6]:
significance_level = 0.05

#### Step 3:
**Determine the appropriate test procedure:**

Here, we are comparing the sample means of two independent samples. Therefore, I will use a **two-sample $t$-test** (Welch's version, as stated in the notes).

#### Step 4:
**Compute the P-value**

In [7]:
stats.ttest_ind(a=la_aqi['aqi'], b=non_la_aqi['aqi'], equal_var=False)

Ttest_indResult(statistic=2.1107010796372014, pvalue=0.049839056842410995)

#### Step 5:
The p-value is 0.0498 (4.98%), which is just below our significance level of 0.05 (5%). Hence, we reject the null hypothesis and conclude that the mean AQI of Los Angeles County differs significantly from that of the rest of California. Because the p-value sits so close to the threshold, this evidence is borderline.

## Activity 2

To compare the mean AQI of New York with that of Ohio, we first have to subset the data.

In [8]:
ny_aqi = aqi[aqi['state_name'] == 'New York']
ohio_aqi = aqi[aqi['state_name'] == 'Ohio']

print('Mean aqi for New York state :', ny_aqi['aqi'].mean())
print('Mean aqi for Ohio state :', ohio_aqi['aqi'].mean())

Mean aqi for New York state : 2.5
Mean aqi for Ohio state : 3.3333333333333335


#### Step 1:
**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Step 2:
**Significance Level (remains at 5%)**

#### Step 3:
**Determine the appropriate test procedure:**

Here, we are comparing the sample means of two independent samples in one direction. Therefore, I will use a **one-sided two-sample $t$-test**.

#### Step 4:
**Compute the P-value**

In [9]:
stats.ttest_ind(a=ny_aqi['aqi'], b=ohio_aqi['aqi'], alternative='less', equal_var=False)

Ttest_indResult(statistic=-2.025951038880333, pvalue=0.030446502691934697)

#### Step 5:
The p-value is 0.0304 (3.04%), which is less than our significance level of 0.05 (5%). Hence, we reject the null hypothesis and conclude that the mean AQI of New York is significantly lower than that of Ohio.
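The effect of `alternative='less'` can be seen by running the same comparison three ways. The sketch below uses invented samples (not the New York and Ohio data) and assumes SciPy 1.6 or later, where the `alternative` parameter is available.

```python
import numpy as np
from scipy import stats

# Invented samples; the first has the lower mean (NOT the EPA data).
sample_a = np.array([2.0, 3.0, 2.0, 4.0, 1.0, 3.0])
sample_b = np.array([3.0, 5.0, 4.0, 2.0, 4.0, 5.0, 3.0])

two_sided = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
less = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False,
                       alternative="less")
greater = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False,
                          alternative="greater")

# The t statistic is identical in all three; only the tail area changes.
print(two_sided.statistic, less.statistic, greater.statistic)

# When the statistic points in the hypothesized direction (negative for
# 'less'), the one-sided p-value is half of the two-sided one.
print(two_sided.pvalue, less.pvalue)

# The two one-sided p-values are complementary tail areas.
print(less.pvalue + greater.pvalue)
```

This is why the direction of the alternative hypothesis has to be fixed before looking at the data: choosing the tail afterwards would halve the p-value for free.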

## Activity 3

To calculate the mean AQI of Michigan, we will first have to subset our data.

In [10]:
michigan_aqi = aqi[aqi['state_name'] == 'Michigan']

print('Mean aqi for Michigan state cities :', michigan_aqi['aqi'].mean())

Mean aqi for Michigan state cities : 8.11111111111111


#### Step 1:
**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Step 2:
**Significance Level (remains at 5%)**

#### Step 3:
**Determine the appropriate test procedure:**

Here, we are comparing one sample mean with a fixed value in one direction. Therefore, I will use a **one-sided one-sample $t$-test**.

#### Step 4:
**Compute the P-value**

In the previous two activities, we compared the means of two samples. Here, we test whether Michigan's mean AQI is greater than a fixed value of 10, so we set the `popmean` parameter to 10.

In [11]:
stats.ttest_1samp(a=michigan_aqi['aqi'], popmean=10, alternative='greater')

Ttest_1sampResult(statistic=-1.7395913343286131, pvalue=0.9399405193140109)

#### Step 5:
The p-value is 0.9399 (93.99%), which is greater than our significance level of 0.05 (5%). Hence, we fail to reject the null hypothesis: the data provide no evidence that the mean AQI for the state of Michigan is greater than 10. This agrees with the sample mean of about 8.11, which is itself below 10.
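The one-sample test can also be reproduced by hand, which shows why a sample mean below `popmean` yields a large p-value when the alternative is "greater". The sketch below uses an invented sample (not the Michigan data) and assumes SciPy 1.6 or later for the `alternative` parameter.

```python
import numpy as np
from scipy import stats

# Invented AQI readings whose mean is below 10 (NOT the Michigan data).
sample = np.array([6.0, 9.0, 12.0, 5.0, 8.0, 10.0, 7.0])
popmean = 10

n = sample.size
# Standard error of the mean, using the unbiased sample standard deviation.
standard_error = sample.std(ddof=1) / np.sqrt(n)

# One-sample t statistic: distance from popmean in standard-error units.
t_manual = (sample.mean() - popmean) / standard_error

# For the alternative "greater", the p-value is the upper-tail area.
p_greater = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_1samp(a=sample, popmean=popmean, alternative="greater")
print(t_manual, result.statistic)
print(p_greater, result.pvalue)
```

Because the sample mean is below 10, the t statistic is negative and the upper-tail area exceeds 0.5, so the test cannot support a mean greater than 10.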