# **Introduction**
You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health.

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, and use the results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

**Notes:**

- For your analysis, you'll default to a 5% level of significance.
- Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the equal_var parameter to False in scipy.stats.ttest_ind()). This will account for the possibly unequal variances between the two groups in the comparison.
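
To make the second note concrete, here is a minimal sketch of what Welch's t-test computes: the t-statistic with unpooled variances and the Welch-Satterthwaite degrees of freedom. The helper `welch_t` and the two small samples are hypothetical, made up for illustration; they are not taken from the EPA dataset.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # variance() is the sample variance (divides by n - 1)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    # Unlike the pooled test, each group keeps its own variance estimate
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof

# Hypothetical AQI readings, for illustration only
group_a = [12, 15, 9, 14, 10]
group_b = [6, 7, 5, 8, 6, 7]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The degrees of freedom are generally not an integer and never exceed the pooled value of n_a + n_b - 2, which is how the test compensates for unequal variances.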

In [3]:
import pandas as pd
import numpy as np
from scipy import stats

aqi = pd.read_csv('/content/_c4_epa_air_quality.csv')
print("Showing a sample of data:")
print(aqi.head())

print("\nSummarizing AQI:")
print(aqi.describe(include='all'))

print("\nFurther examinations:")
print(aqi['state_name'].value_counts())

Showing a sample of data:
   Unnamed: 0  date_local    state_name   county_name      city_name  \
0           0  2018-01-01       Arizona      Maricopa        Buckeye   
1           1  2018-01-01          Ohio       Belmont      Shadyside   
2           2  2018-01-01       Wyoming         Teton  Not in a city   
3           3  2018-01-01  Pennsylvania  Philadelphia   Philadelphia   
4           4  2018-01-01          Iowa          Polk     Des Moines   

                                     local_site_name   parameter_name  \
0                                            BUCKEYE  Carbon monoxide   
1                                          Shadyside  Carbon monoxide   
2  Yellowstone National Park - Old Faithful Snow ...  Carbon monoxide   
3                             North East Waste (NEW)  Carbon monoxide   
4                                          CARPENTER  Carbon monoxide   

    units_of_measure  arithmetic_mean  aqi  
0  Parts per million         0.473684    7  
1  Parts per

**Insights:**
- The dataset includes county-level data (the county_name column), which the first hypothesis requires.
- Ohio and New York both have a relatively large number of observations in the dataset, which the second hypothesis relies on.

**Statistical Tests:**
1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw the conclusion.
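
Step 5 comes down to comparing the p-value with the significance level. The helper below is a minimal sketch of that decision rule; the name `decide` is hypothetical, and the p-values are the rounded values reported by the three tests in this notebook.

```python
def decide(p_value, alpha=0.05):
    """Apply the usual decision rule for a hypothesis test."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# Rounded p-values from this notebook's three test outputs
reported = {
    "Hypothesis 1 (LA County vs. rest of California)": 0.0498,
    "Hypothesis 2 (New York vs. Ohio)": 0.0304,
    "Hypothesis 3 (Michigan vs. 10)": 0.9399,
}

for label, p_value in reported.items():
    print(f"{label}: p = {p_value} -> {decide(p_value)}")
```

Note that failing to reject H0 is not the same as proving H0; it only means the sample does not provide enough evidence against it.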

**Hypothesis 1:** ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.


In [10]:
#creating dataframes for each sample being compared in the test

ca_la = aqi[(aqi['state_name']=='California') & (aqi['county_name']=='Los Angeles')]
ca_other = aqi[(aqi['state_name']=='California') & (aqi['county_name'] !='Los Angeles')]

**Formulating the hypothesis:**

H0: There is no difference in the mean AQI between Los Angeles County and the rest of California.

Ha: There is a difference in the mean AQI between Los Angeles County and the rest of California.

In [13]:
#significance level will be set at 5%

significance_level = 0.05
significance_level

#utilizing a two-sample t-test to compare the sample means between the two independent samples.
stats.ttest_ind(a=ca_la['aqi'], b=ca_other['aqi'], equal_var=False)

Ttest_indResult(statistic=2.1107010796372014, pvalue=0.049839056842410995)

**Insights:**
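- Since the p-value (0.0498) is less than the significance level of 0.05 (5%), we reject the null hypothesis in favor of the alternative hypothesis: the mean AQI in Los Angeles County differs from that of the rest of California. The positive t-statistic (2.11) indicates that the Los Angeles County mean is the higher of the two, although the p-value only narrowly clears the 5% threshold. On this evidence, a metropolitan-focused approach would make sense.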
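
A test result is easier to interpret next to the group summaries it compares, and the same summary table can feed a bar chart. The sketch below assumes a small made-up DataFrame (`toy`) that only borrows the column names of the real dataset; the `group` column and its labels are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the dataset's columns; not real EPA data
toy = pd.DataFrame({
    "state_name": ["California"] * 6,
    "county_name": ["Los Angeles", "Los Angeles", "Los Angeles",
                    "Fresno", "San Diego", "Alameda"],
    "aqi": [20, 14, 17, 9, 12, 6],
})

# Label each row by the group it belongs to in the comparison
toy["group"] = toy["county_name"].eq("Los Angeles").map(
    {True: "Los Angeles County", False: "Rest of California"}
)

# Sample size, mean, and standard deviation of AQI per group
summary = toy.groupby("group")["aqi"].agg(["count", "mean", "std"])
print(summary)
```

If matplotlib is installed, `summary["mean"].plot(kind="bar")` would draw the two group means as a simple bar chart.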

**Hypothesis 2:** With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

In [14]:
#creating dataframes for each sample being compared in the test

ny = aqi[aqi['state_name']=='New York']
ohio = aqi[aqi['state_name']=='Ohio']

**Formulating the hypothesis:**

H0: The mean AQI of New York is greater than or equal to that of Ohio.

Ha: The mean AQI of New York is lower than that of Ohio.

- **Significance level remains at 5%**
- Two-sample t-test (Welch's, one-sided)

In [15]:
#computing the p-value

tstat, pvalue = stats.ttest_ind(a=ny['aqi'], b=ohio['aqi'], alternative='less', equal_var=False)
print("tstat:", tstat)
print("p-value:", pvalue)


tstat: -2.025951038880333
p-value: 0.03044650269193468


**Insights:**

- Since the p-value (0.030) is less than the significance level of 0.05 (5%) and the t-statistic is negative (-2.026), we reject the null hypothesis in favor of the alternative hypothesis.
- As a result, we can conclude at the 5% significance level that the mean AQI of New York is lower than the mean AQI of Ohio.
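
The sign of the t-statistic matters here because the test is one-sided. As a minimal sketch on made-up samples (`ny_toy` and `ohio_toy` are hypothetical, not the real data): when the t-statistic falls on the side named by the alternative, the one-sided p-value is half the two-sided p-value.

```python
from scipy import stats

# Made-up samples, for illustration only
ny_toy = [3, 4, 2, 5, 3, 4]
ohio_toy = [5, 6, 4, 7, 5, 6]

# Same Welch's t-test, once two-sided and once with alternative='less'
two_sided = stats.ttest_ind(a=ny_toy, b=ohio_toy, equal_var=False)
one_sided = stats.ttest_ind(a=ny_toy, b=ohio_toy, equal_var=False,
                            alternative="less")

print("t-statistic:", one_sided.statistic)
print("two-sided p-value:", two_sided.pvalue)
print("one-sided p-value:", one_sided.pvalue)
```

Had the t-statistic been positive instead, the `alternative='less'` p-value would have been greater than 0.5, and the null hypothesis could not be rejected.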

**Hypothesis 3:** A new policy will affect those states with a mean AQI of 10 or greater. Can you rule out Michigan from being affected by this new policy?

In [16]:
michigan = aqi[aqi['state_name']=='Michigan']

**Formulating the hypothesis:**

H0: The mean AQI of Michigan is less than or equal to 10.

Ha: The mean AQI of Michigan is greater than 10.

- Significance level remains at 5%
- One-sample t-test
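
For reference, the one-sample t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. The sketch below computes it by hand; `one_sample_t` and the sample values are hypothetical and only illustrate the formula.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (sample std / sqrt(n))."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    return (mean(sample) - mu0) / standard_error

# Made-up AQI readings, for illustration only
toy_sample = [8, 11, 9, 7, 10]

t_stat = one_sample_t(toy_sample, 10)
print(f"t = {t_stat:.4f}")
```

A negative t-statistic means the sample mean lies below the hypothesized value, so a test with the alternative "greater than" cannot reject the null hypothesis in that case.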

In [17]:
#computing p-value

tstat, pvalue = stats.ttest_1samp(michigan['aqi'], 10, alternative='greater')
print("tstat:", tstat)
print("p-value:", pvalue)

tstat: -1.7395913343286131
p-value: 0.9399405193140109


**Insights:**

- Since the p-value (0.940) is greater than the significance level of 0.05 and the t-statistic is negative (-1.74), we fail to reject the null hypothesis.

- As a result, we cannot conclude at the 5% significance level that Michigan's mean AQI is greater than 10. This suggests that Michigan would most likely not be affected by the new policy.
- Overall, the results indicate that the mean AQI in Los Angeles County is statistically different from that of the rest of California, and that New York has a lower mean AQI than Ohio, both at the 5% significance level.