# Capstone Project 1 Exploratory Data Analysis
## Molly McNamara

The dataset for this project consists of daily levels of four primary air pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone) and their air quality indices from major cities across the United States between 2000 and 2016. The data is sourced from the United States Environmental Protection Agency.

### Import packages and dataset

In [None]:
import pandas as pd
pd.set_option("display.max_columns", 500)
import matplotlib.pyplot as plt
pollution = pd.read_csv('~/Desktop/cleanpollution.csv', index_col='Unnamed: 0')

In [None]:
pollution.head(3)

To begin, the `describe` method was used to summarize the basic statistics of the dataset.

In [None]:
pollution.describe()

The dataframe comprises 412,856 records. Each of the four Air Quality Index values, one per pollutant, ranges from 0 to a maximum between 132 and 218, depending on the pollutant.

### Initial Exploratory Questions

How many states are represented in the dataset?

In [None]:
print("Number of States", pollution['State'].nunique())

How many cities are represented in the dataset?

In [None]:
print("Number of Cities", pollution['City'].nunique())

How many cities are represented for each state?

In [None]:
# Count distinct cities within each state
print("Number of Cities Per State:\n", pollution.groupby('State')['City'].nunique())

The number of cities with data collection sites varies by state. This may partly reflect each state's population or geographic size.

In [None]:
pollution.groupby('State').describe()

The statistics by state show that each area seems to have its own unique patterns and that the dataset is quite rich in information, especially for the more populated states that have been tracking pollutant levels longer (higher number of records). 
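The link between record count and tracking period can be checked directly. The following is a minimal sketch, assuming the `State` and `Date_Local` columns used elsewhere in this notebook; the helper name `state_coverage` is hypothetical and not part of the original analysis.

```python
import pandas as pd

def state_coverage(df):
    """Summarize record count and first/last observation date per state.

    Assumes `df` has 'State' and 'Date_Local' columns (hypothetical helper).
    """
    # Parse dates so min/max compare chronologically rather than as strings
    dated = df.assign(Date_Local=pd.to_datetime(df['Date_Local']))
    summary = (
        dated.groupby('State')['Date_Local']
             .agg(records='count', first_date='min', last_date='max')
    )
    # States with the most records first
    return summary.sort_values('records', ascending=False)
```

In this notebook it could be called as `state_coverage(pollution).head(10)` to list the states with the most records alongside their date ranges.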

To better visualize this large dataset, some of the features can be plotted, beginning with a box plot of the four Air Quality Indices.

In [None]:
cols = ['NO2_AQI', 'O3_AQI', 'SO2_AQI', 'CO_AQI']
pollution[cols].plot(kind='box')
plt.xlabel('Pollutant')
plt.ylabel('Air Quality Index')
plt.show()

### Top 10 Cities

To further visualize the data, the dataset is filtered to the 10 most populous cities in the US as of 2016, the end of the dataset's time period (based on census data: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population).

In [None]:
top10cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
top10 = pollution[pollution['City'].isin(top10cities)]

How many data collection sites exist in the 10 most populous cities?

In [None]:
print("Number of Data Collection Sites in 10 Cities:", top10['Site_Num'].nunique())

The basic statistics were compared amongst the cities.

In [None]:
top10.groupby('City').describe()

Each pollutant's Air Quality Index was compared visually between the cities.

In [None]:
top10.boxplot(column='CO_AQI', by='City', rot=45)
plt.show()

In [None]:
top10.boxplot(column='SO2_AQI', by='City', rot=45)
plt.show()

In [None]:
top10.boxplot(column='O3_AQI', by='City', rot=45)
plt.show()

In [None]:
top10.boxplot(column='NO2_AQI', by='City', rot=45)
plt.show()

The boxplots confirmed that each city has its own unique profile of pollutants.
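The per-city profiles can also be summarized numerically. The sketch below computes the median AQI per city for each pollutant, assuming the same four AQI column names used in the plots; `city_profile` and `AQI_COLS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# AQI columns as named in this notebook's plots
AQI_COLS = ['NO2_AQI', 'O3_AQI', 'SO2_AQI', 'CO_AQI']

def city_profile(df, cols=AQI_COLS):
    """Return the median AQI per city, one row per city and one column per pollutant."""
    return df.groupby('City')[cols].median()
```

Calling `city_profile(top10)` would give a compact table that can be read alongside the boxplots. The median is used here because it is less sensitive than the mean to the outliers visible in the plots.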

### Top 5 Cities

The dataset is narrowed further to the five most populous cities so that the Air Quality Indices can be plotted over time.

In [None]:
top5cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
top5 = pollution[pollution['City'].isin(top5cities)]

In [None]:
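cols = ['NO2_AQI', 'O3_AQI', 'SO2_AQI', 'CO_AQI']
# Parse dates for a true time axis; plt.plot_date is deprecated in recent matplotlib
dates = pd.to_datetime(top5['Date_Local'])
plt.plot(dates, top5[cols], 'o')
plt.legend(cols)
plt.xlabel('Date')
plt.ylabel('AQI')
plt.show()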

Can we look at these values by city over time?

In [None]:
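# Parse dates for a true time axis; plt.plot_date is deprecated in recent matplotlib
plt.plot(pd.to_datetime(top5['Date_Local']), top5['NO2_AQI'], 'o')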
plt.xlabel('Date')
plt.ylabel('NO2 AQI')
plt.show()
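One way to separate the cities in a time plot is to aggregate each city's readings by month and give each city its own column. This is a sketch only, assuming the `City`, `Date_Local` and `NO2_AQI` columns used above; the helper `monthly_by_city` is a hypothetical name, and monthly averaging is an illustrative choice rather than part of the original analysis.

```python
import pandas as pd

def monthly_by_city(df, value_col='NO2_AQI'):
    """Monthly mean of `value_col`, indexed by month start, one column per city."""
    # Parse dates so they can be grouped into calendar months
    dated = df.assign(Date_Local=pd.to_datetime(df['Date_Local']))
    monthly = (
        dated.groupby(['City', pd.Grouper(key='Date_Local', freq='MS')])[value_col]
             .mean()
             .unstack('City')  # cities become columns; missing months become NaN
    )
    return monthly
```

In this notebook, `monthly_by_city(top5).plot()` followed by `plt.ylabel('NO2 AQI')` and `plt.show()` would draw one line per city. Averaging by month smooths the daily noise enough for the cities to be compared on a single axis.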