# Explore hypothesis testing

## Introduction

The dataset covers air quality in the United States and is provided by an environmental protection agency.

The data includes the Air Quality Index (AQI), which gives reviewers guidance when making a decision:
- An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

For your analysis, you'll default to a 5% level of significance.

**The purpose** of this project is to conduct exploratory hypothesis tests on a provided dataset.
  
**The goal** is to analyze the dataset and perform hypothesis tests.
<br/>  
*This activity has 4 parts:*

**Part 1:** Imports, links, and loading

**Part 2:** Data Exploration
- Data cleaning

**Part 3:** Building a model for a hypothesis test
- For the analysis, set the default `significance level` to `5%`.

**Part 4:** Evaluate and share results

## Part 1. Imports, links, and loading

For EDA and testing, import the packages that will be most helpful, such as pandas, numpy, and scipy (its `stats` module provides the $t$-tests used later).

Then, import the dataset.

In [59]:
import pandas as pd
import numpy as np
from scipy import stats

In [60]:
aqi = pd.read_csv('hypothesis_test_data.csv')
df = aqi.copy()
df

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,1/1/18,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,1/1/18,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,1/1/18,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,1/1/18,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.300000,3
4,1/1/18,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3
...,...,...,...,...,...,...,...,...,...
255,1/1/18,District Of Columbia,District of Columbia,Washington,Near Road,Carbon monoxide,Parts per million,0.244444,3
256,1/1/18,Wisconsin,Dodge,Kekoskee,HORICON WILDLIFE AREA,Carbon monoxide,Parts per million,0.200000,2
257,1/1/18,Kentucky,Jefferson,Louisville,CANNONS LANE,Carbon monoxide,Parts per million,0.163158,2
258,1/1/18,Nebraska,Douglas,Omaha,,Carbon monoxide,Parts per million,0.421053,9


## Part 2. Data Exploration

### Before building a hypothesis test model, explore the dataset.

Review the descriptive statistics for the data, keeping the research questions in mind.

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date_local        260 non-null    object 
 1   state_name        260 non-null    object 
 2   county_name       260 non-null    object 
 3   city_name         260 non-null    object 
 4   local_site_name   257 non-null    object 
 5   parameter_name    260 non-null    object 
 6   units_of_measure  260 non-null    object 
 7   arithmetic_mean   260 non-null    float64
 8   aqi               260 non-null    int64  
dtypes: float64(1), int64(1), object(7)
memory usage: 18.4+ KB


In [62]:
df.describe()

Unnamed: 0,arithmetic_mean,aqi
count,260.0,260.0
mean,0.403169,6.757692
std,0.317902,7.061707
min,0.0,0.0
25%,0.2,2.0
50%,0.276315,5.0
75%,0.516009,9.0
max,1.921053,50.0


In [63]:
df['state_name'].value_counts()

California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania            10
Michigan                 9
Colorado                 9
Minnesota                7
New Jersey               6
Indiana                  5
North Carolina           4
Massachusetts            4
Maryland                 4
Oklahoma                 4
Virginia                 4
Nevada                   4
Connecticut              4
Kentucky                 3
Missouri                 3
Wyoming                  3
Iowa                     3
Hawaii                   3
Utah                     3
Vermont                  3
Illinois                 3
New Hampshire            2
District Of Columbia     2
New Mexico               2
Montana                  2
Oregon                   2
Alaska                   2
Georgia                  2
Washington               2
Idaho                    2
Nebraska                 2
...
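Because group sizes vary so much between states, it can help to look at the sample size, mean, and standard deviation of `aqi` per state side by side before testing. The following is a minimal sketch, assuming a small made-up frame named `toy` that reuses the `state_name` and `aqi` column names; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    'state_name': ['New York', 'New York', 'Ohio', 'Ohio', 'Ohio', 'Michigan'],
    'aqi': [2, 4, 3, 5, 10, 8],
})

# Per-state sample size, mean, and sample standard deviation of AQI,
# sorted so the largest groups come first.
summary = (
    toy.groupby('state_name')['aqi']
       .agg(['count', 'mean', 'std'])
       .sort_values('count', ascending=False)
)
print(summary)
```

A state with a single observation has an undefined sample standard deviation (`NaN`), which is a quick signal that a $t$-test on that group would not be meaningful.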

## Part 3. Statistical Tests

Before proceeding, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
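The five steps can be sketched as one small function. This is only an illustration: `run_two_sample_test` is a hypothetical helper, not part of scipy, and the sample values are made up.

```python
from scipy import stats

def run_two_sample_test(sample_a, sample_b, significance_level=0.05):
    """Hypothetical helper that walks through the five steps."""
    # Step 1: H0: mean(a) == mean(b); HA: mean(a) != mean(b).
    # Step 2: the significance level is passed in (5% by default).
    # Step 3: two-sample t-test that does not assume equal variances.
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: the p-value comes back with the test statistic.
    # Step 5: reject H0 only if the p-value is below the significance level.
    reject_null = bool(result.pvalue < significance_level)
    return result.statistic, result.pvalue, reject_null

# Made-up samples: clearly separated groups, then two identical groups.
stat_far, p_far, reject_far = run_two_sample_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
stat_same, p_same, reject_same = run_two_sample_test([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(reject_far, reject_same)
```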

### Hypothesis 1: Focus on comparing metropolitan areas. 

Within `California`, check whether the mean AQI in `Los Angeles` County is statistically different from `the rest of California`.

`Tip`: Subsetting the data into the two groups being compared makes the test easier to set up.

In [64]:
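# Create DataFrames for each sample being compared in the test.
# Filter on state as well, so only California's Los Angeles County is selected.
ca_la = df[(df['state_name']=='California') & (df['county_name']=='Los Angeles')]
ca_ot = df[(df['state_name']=='California') & (df['county_name']!='Los Angeles')]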

#### Formulate your hypothesis:

Formulate your null and alternative hypotheses:

*   $H_0$: There is `NO` difference in the mean AQI between `Los Angeles` County and the rest of `California`.
*   $H_A$: There is a difference in the mean AQI between `Los Angeles` County and the rest of `California`.

#### Set the significance level

In [66]:
significance_level = 0.05
significance_level

0.05

#### Determine the appropriate test procedure:

Compare the sample means of two independent samples using a `two-sample` $t$-`test`.
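To make the test statistic less of a black box, the sketch below computes the unequal-variance (Welch) form of the two-sample $t$-test by hand and compares it with `stats.ttest_ind(..., equal_var=False)`. The arrays `a` and `b` are made-up values, not the California samples.

```python
import numpy as np
from scipy import stats

# Made-up samples for illustration only.
a = np.array([12.0, 15.0, 9.0, 20.0, 14.0])
b = np.array([5.0, 7.0, 6.0, 9.0, 4.0, 8.0, 6.0])

# Sample variances (ddof=1) and the standard error of the mean difference.
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
se = np.sqrt(var_a / a.size + var_b / b.size)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch = (var_a / a.size + var_b / b.size) ** 2 / (
    (var_a / a.size) ** 2 / (a.size - 1) + (var_b / b.size) ** 2 / (b.size - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```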

#### Compute the P-value

In [67]:
stats.ttest_ind(a=ca_la['aqi'], b=ca_ot['aqi'], equal_var=False)

Ttest_indResult(statistic=2.1107010796372014, pvalue=0.049839056842410995)

The `p-value` (about `0.0498`) is just below the `0.05` significance level, and the `t-statistic` is positive.

This means the null hypothesis $H_0$ is `rejected`, though only narrowly.

Therefore, the conclusion is that the mean AQI in `Los Angeles` County differs from that of the `other` counties in `California` at the 5% significance level. The positive `t-statistic` shows that the `Los Angeles` sample mean is the higher of the two.

### Hypothesis 2: Focus on two specific states.

Compare the sample means of `New York` and `Ohio`.

For this test, subset the original data by state.

In [68]:
ny = df[df['state_name']=='New York']
oh = df[df['state_name']=='Ohio']

#### Formulate your hypothesis:

Formulate your null and alternative hypotheses:

*   $H_0$: The mean AQI of `New York` is greater than or equal to that of `Ohio`.
*   $H_A$: The mean AQI of `New York` is lower than that of `Ohio`.

#### Set the significance level (remains at 5%)

#### Determine the appropriate test procedure
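Compare the sample means of two independent samples in `one direction` using a **two-sample $t$-test**. Unlike Hypothesis 1, the call in this section leaves `equal_var` at its default of `True`, so it assumes the two states have equal variances.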
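The effect of testing in one direction can be seen by running the same samples through both forms of the test. In this sketch the arrays are made-up values, and the `alternative` argument requires SciPy 1.6 or later. When the `t-statistic` is negative, the `alternative='less'` p-value is half of the two-sided p-value.

```python
import numpy as np
from scipy import stats

# Made-up samples: a tends to be lower than b.
a = np.array([3.0, 2.0, 4.0, 1.0, 3.0, 2.0])
b = np.array([5.0, 4.0, 6.0, 3.0, 7.0, 5.0])

# Two-sided test (the default): HA is mean(a) != mean(b).
two_sided = stats.ttest_ind(a=a, b=b)

# One-sided test: HA is mean(a) < mean(b).
one_sided = stats.ttest_ind(a=a, b=b, alternative='less')

print(two_sided.statistic, one_sided.statistic)
print(two_sided.pvalue, one_sided.pvalue)
```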

#### Compute the P-value

In [69]:
# Pass alternative='less' for a one-sided test (H_A: mean of a < mean of b)
stats.ttest_ind(a=ny['aqi'], b=oh['aqi'], alternative='less')

Ttest_indResult(statistic=-1.891850434703295, pvalue=0.03654034300840755)

The `p-value` (about `0.037`) is less than the `0.05` significance level, and the `t-statistic` is negative.

This means the null hypothesis $H_0$ is `rejected`.

Therefore, the conclusion is that `New York` has a lower mean AQI than `Ohio` at the 5% significance level.
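The third question from the introduction, whether `Michigan` can be ruled out of a policy that affects states with a mean AQI of 10 or greater, is not tested in this notebook. One possible framing, shown here only as a sketch, is a one-sample $t$-test with $H_0$: mean AQI $\geq 10$ and $H_A$: mean AQI $< 10$. The values in `mi_aqi` are hypothetical stand-ins for `df[df['state_name']=='Michigan']['aqi']`, and the choice of direction is an assumption about how "rule out" should be read.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI readings standing in for the Michigan subset.
mi_aqi = np.array([4, 6, 9, 7, 5, 8, 6, 10, 7])

policy_threshold = 10
significance_level = 0.05

# One-sample, one-sided test: HA is that the mean is below the threshold.
result = stats.ttest_1samp(mi_aqi, popmean=policy_threshold, alternative='less')

# Reject H0 (mean >= 10) only if the p-value is below the significance level.
reject_null = bool(result.pvalue < significance_level)
print(result.statistic, result.pvalue, reject_null)
```

Rejecting $H_0$ under this framing would support ruling the state out; failing to reject would mean the data cannot rule it out.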

## Part 4. Evaluate and share results

### Part 4a. Evaluation
Hypothesis test 1 results
- At the 5% significance level, the mean Air Quality Index (AQI) differs between `Los Angeles` County and `the rest of California`, although the `p-value` (about 0.0498) is only just below the threshold.

Hypothesis test 2 results
- At the 5% significance level, we can conclude that `New York` has a lower mean AQI than `Ohio`.

### Part 4b. Conclusion or Takeaway

1. Even with small sample sizes, the differences in the data were large enough to reach statistical significance at the 5% level, although the first result is borderline.
2. For each test, state the `null` and `alternative` hypotheses first, then draw the conclusion by comparing the `p-value` with the significance level.
3. Performing an `A/B test`, one kind of hypothesis test, lets stakeholders quickly understand the difference between the groups being compared.