
# 1. Imports

In [1]:
# Import relevant packages

import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# Mount Google Drive in colab notebook

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# Get .csv from gdrive

aqi = pd.read_csv('/content/gdrive/Othercomputers/My MacBook Air/Estudo/Google Advance DA/Projects/AQI/c4_epa_air_quality.csv')
print('done!')

done!


# 2. Data Exploration

In [4]:
# Explore the AQI dataframe

print("Use head() to show a sample of data")
print(aqi.head())

print("Use describe() to summarize AQI")
print(aqi.describe(include='all'))

print("For a more thorough examination of observations by state use value_counts()")
print(aqi['state_name'].value_counts())

Use head() to show a sample of data
   Unnamed: 0  date_local    state_name   county_name      city_name  \
0           0  2018-01-01       Arizona      Maricopa        Buckeye   
1           1  2018-01-01          Ohio       Belmont      Shadyside   
2           2  2018-01-01       Wyoming         Teton  Not in a city   
3           3  2018-01-01  Pennsylvania  Philadelphia   Philadelphia   
4           4  2018-01-01          Iowa          Polk     Des Moines   

                                     local_site_name   parameter_name  \
0                                            BUCKEYE  Carbon monoxide   
1                                          Shadyside  Carbon monoxide   
2  Yellowstone National Park - Old Faithful Snow ...  Carbon monoxide   
3                             North East Waste (NEW)  Carbon monoxide   
4                                          CARPENTER  Carbon monoxide   

    units_of_measure  arithmetic_mean  aqi  
0  Parts per million         0.473684    7  
1 

# 3. Statistical tests

* Is the mean AQI in Los Angeles County statistically different from the rest of California?
* Does New York have a lower AQI than Ohio?

**Notes:**
1. For this analysis, we will default to a 5% level of significance.
2. Throughout this exercise, for two-sample t-tests, we will use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
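
To make note 2 concrete, here is a minimal sketch of what Welch's t-test computes. It uses synthetic samples rather than the EPA data, and `welch_t` is a hypothetical helper written only for this illustration. The statistic divides the difference in sample means by the unpooled standard error, and the degrees of freedom come from the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal sizes and variances (illustrative only, not the EPA data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=15)
group_b = rng.normal(loc=8.0, scale=6.0, size=40)


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Per-group squared standard errors, using the sample variance (ddof=1)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof


t_manual, dof_manual = welch_t(group_a, group_b)
# Two-sided p-value from the t distribution with the approximated degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), dof_manual)

# The same test through SciPy, with equal_var=False selecting Welch's variant
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual values should agree with SciPy's result up to floating-point error.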

### Steps to conduct hypothesis testing (A/B testing)

1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Interpret results.
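
The steps above can be sketched as one small function. This is only an illustration under stated assumptions: `run_welch_test` is a hypothetical helper, the samples are synthetic, and step 1 (stating the hypotheses) still has to be done by the analyst before calling it.

```python
import numpy as np
from scipy import stats


def run_welch_test(sample_a, sample_b, alpha=0.05, alternative="two-sided"):
    """Run Welch's t-test and compare the p-value with the significance level.

    alpha is the significance level (step 2), the test itself covers
    steps 3 and 4, and reject_null is the decision used in step 5.
    """
    result = stats.ttest_ind(
        a=sample_a, b=sample_b, equal_var=False, alternative=alternative
    )
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_null": bool(result.pvalue < alpha),
    }


# Synthetic samples whose means are far apart (illustrative only)
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=5.0, scale=1.0, size=200)
sample_b = rng.normal(loc=0.0, scale=1.0, size=200)

outcome = run_welch_test(sample_a, sample_b)
print(outcome)
```

Returning the statistic and p-value alongside the decision keeps the evidence visible instead of reducing the test to a yes/no answer.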

## Hypothesis 1: Is the mean AQI in Los Angeles County statistically different from the rest of California?

#### 1. Formulate the null and alternative hypotheses:



*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.

In [13]:
# Create dataframes for each sample being compared in the test

ca_la = aqi[aqi['county_name']=='Los Angeles']
ca_other = aqi[(aqi['state_name']=='California') & (aqi['county_name']!='Los Angeles')]
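
Before running the test, it can help to check how many observations each group holds and what the group means look like. The sketch below uses a tiny made-up frame (`toy`) that reuses the notebook's column names. The values are invented for illustration and are not EPA data.

```python
import pandas as pd

# Made-up rows that mirror the notebook's column names (not real EPA data)
toy = pd.DataFrame({
    "state_name": ["California", "California", "California", "Ohio"],
    "county_name": ["Los Angeles", "Los Angeles", "Fresno", "Belmont"],
    "aqi": [10, 14, 6, 5],
})

# Boolean masks make the two groups explicit and mutually exclusive
is_ca = toy["state_name"] == "California"
is_la = toy["county_name"] == "Los Angeles"

la = toy[is_ca & is_la]
rest_of_ca = toy[is_ca & ~is_la]

# Group sizes and means give context for interpreting the t-test
summary = pd.DataFrame(
    {
        "n": [len(la), len(rest_of_ca)],
        "mean_aqi": [la["aqi"].mean(), rest_of_ca["aqi"].mean()],
    },
    index=["Los Angeles", "Rest of California"],
)
print(summary)
```

Filtering the Los Angeles group on state as well as county is a defensive choice in this sketch. It guards against a same-named county appearing under another state.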

#### 2. Set Significance Level



In [8]:
# For this analysis, the significance level is 5%

significance_level = 0.05
significance_level

0.05

#### 3. Determine the appropriate test procedure:

* Two-sample t-test (Welch's, two-sided), since we are comparing the sample means of two independent samples.

#### 4. Compute P-value

In [9]:
# Compute p-value

stats.ttest_ind(a=ca_la['aqi'], b=ca_other['aqi'], equal_var=False)

TtestResult(statistic=2.1107010796372014, pvalue=0.049839056842410995, df=17.08246830361151)

#### 5. Interpretation



* The p-value (0.0498) is below the 0.05 significance level, though only marginally, so we reject the null hypothesis in favor of the alternative hypothesis.
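
As a sanity check on how close this result sits to the threshold, the p-value can be recomputed from the statistic and degrees of freedom printed in the `TtestResult` output. This is a minimal sketch: the three numbers are copied from that output, and the variable names are illustrative.

```python
from scipy import stats

# Values copied from the TtestResult output of the Los Angeles test
t_stat = 2.1107010796372014
dof = 17.08246830361151
reported_p = 0.049839056842410995

# Two-sided p-value: probability mass in both tails beyond |t|
recomputed_p = 2 * stats.t.sf(abs(t_stat), dof)

# Critical value that |t| must exceed to reject at the 5% level
critical_t = stats.t.ppf(1 - 0.05 / 2, dof)

print(recomputed_p)
print(critical_t)
```

The statistic exceeds the critical value only narrowly, which is why the p-value lands just under 0.05.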

## Hypothesis 2: Does New York have a lower AQI than Ohio?




#### 1. Formulate the null and alternative hypotheses:



*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.

In [11]:
# Create dataframes for each sample being compared in the test

ny = aqi[aqi['state_name']=='New York']
ohio = aqi[aqi['state_name']=='Ohio']

#### 2. Set Significance Level



* The significance level remains at 5%.

#### 3. Determine the appropriate test procedure:

* We will use a one-sided (left-tailed) two-sample t-test, since we are comparing the sample means of two independent samples in one direction.
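
To illustrate what the direction changes, the sketch below runs the same Welch's t-test with each value of the `alternative` parameter of `scipy.stats.ttest_ind`. The samples are synthetic (not the New York and Ohio data), and the names `lower` and `higher` are chosen only for this example.

```python
import numpy as np
from scipy import stats

# Synthetic samples; `lower` is drawn from a distribution with the smaller mean
rng = np.random.default_rng(1)
lower = rng.normal(loc=4.0, scale=2.0, size=30)
higher = rng.normal(loc=6.0, scale=3.0, size=30)

two_sided = stats.ttest_ind(a=lower, b=higher, equal_var=False)
less = stats.ttest_ind(a=lower, b=higher, equal_var=False, alternative="less")
greater = stats.ttest_ind(a=lower, b=higher, equal_var=False, alternative="greater")

# The t statistic is identical in all three cases; only the tail used
# to compute the p-value changes.
print(two_sided.statistic, less.statistic, greater.statistic)

# The two one-sided p-values sum to 1, and the smaller of them is half
# of the two-sided p-value.
print(less.pvalue, greater.pvalue, two_sided.pvalue)
```

With `alternative='less'`, a small p-value is only possible when the t-statistic is negative, that is, when the first sample's mean is below the second's.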

#### 4. Compute P-value

In [12]:
# Compute p-value

tstat, pvalue = stats.ttest_ind(a=ny['aqi'], b=ohio['aqi'], alternative='less', equal_var=False)
print(tstat)
print(pvalue)

-2.025951038880333
0.03044650269193468


#### 5. Interpretation


* With a p-value (0.030) below 0.05 (the significance level is 5%) and a negative t-statistic (-2.026), we **reject the null hypothesis in favor of the alternative hypothesis**.

* Therefore, we conclude at the 5% significance level that New York has a lower mean AQI than Ohio.

# 4. Results and Evaluation

* The results indicate that the mean AQI in Los Angeles County is statistically different from that of the rest of California.

* At the 5% significance level, we conclude that New York has a lower mean AQI than Ohio.

### Conclusion

* Even with small sample sizes, the variation in the data was enough to support statistically significant conclusions.
* At the 5% significance level, we found that the mean AQI in Los Angeles County was statistically different from that of the rest of California, and that New York has a lower mean AQI than Ohio.