## Data Fetching
This interactive notebook fetches and cleans data from the EPA's Air Quality System (AQS). We start by importing the required libraries, including our custom `pyaqs` module, which provides wrapper methods that convert data from the AQS REST API into easily accessible and modifiable Pandas dataframes.

In [1]:
from pyaqs import AQSFetcher
import pandas as pd

Now we instantiate a new `AQSFetcher` object and use it to get the required data from the EPA's API. For now, we focus on counties within Illinois, the state where we currently reside. We will use some of the custom methods to get the identification codes for the relevant locations and parameters.

Note that in this context, a *parameter* is a compound in the air that can be measured. The EPA defines many such parameters, sorted into classes whose descriptions are easily accessible through the API.

In [2]:
aqs_fetcher = AQSFetcher('bbjornstad.flatiron@gmail.com', 'ochrefox21')
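The cell above passes the account email and API key as literals. If the notebook is shared, one option is to read them from the environment instead. The sketch below assumes two environment variables, `AQS_EMAIL` and `AQS_KEY`; both names and the `load_aqs_credentials` helper are hypothetical and not part of `pyaqs`.

```python
import os


def load_aqs_credentials(env=os.environ):
    """Return (email, key) read from a mapping of environment variables.

    AQS_EMAIL and AQS_KEY are hypothetical names chosen for this sketch.
    """
    try:
        return env["AQS_EMAIL"], env["AQS_KEY"]
    except KeyError as missing:
        # Report which variable is absent rather than a bare KeyError.
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


# Possible usage, assuming the two-argument constructor shown above:
# aqs_fetcher = AQSFetcher(*load_aqs_credentials())
```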

In [3]:
state_codes = aqs_fetcher.get_state_codes()
state_codes.head()

Unnamed: 0,code,state_name
0,1,Alabama
1,2,Alaska
2,4,Arizona
3,5,Arkansas
4,6,California


Let's store the code for Illinois in a variable for easy access.

In [4]:
il_code = state_codes.loc[state_codes.state_name == 'Illinois', 'code'].values[0]
il_code

'17'
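The lookup above takes the first match with `.values[0]`, which raises an `IndexError` if the name is misspelled. A slightly more defensive variant is sketched below on a toy frame. The `lookup_code` helper and the toy rows are illustrative assumptions, not part of `pyaqs`.

```python
import pandas as pd

# Toy stand-in for the frame returned by get_state_codes().
state_codes = pd.DataFrame(
    {
        "code": ["01", "17", "18"],
        "state_name": ["Alabama", "Illinois", "Indiana"],
    }
)


def lookup_code(codes, name_column, name):
    """Return the single code whose name_column equals name.

    Hypothetical helper: raises LookupError on zero or multiple matches.
    """
    matches = codes.loc[codes[name_column] == name, "code"]
    if len(matches) != 1:
        raise LookupError(f"expected one match for {name!r}, found {len(matches)}")
    return matches.iloc[0]


il_code = lookup_code(state_codes, "state_name", "Illinois")
print(il_code)
```

The same helper would work for county lookups by passing `"county_name"` as the column.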

And now we will get a list of codes for the counties within Illinois.

In [5]:
il_county_codes = aqs_fetcher.get_counties_by_state(il_code)
il_county_codes.head()

Unnamed: 0,code,county_name
0,1,Adams
1,3,Alexander
2,5,Bond
3,7,Boone
4,9,Brown


Finally, let's take a look at the possible parameter classes and identify a set that seems reasonable for analysis.

In [6]:
aqs_fetcher.get_parameter_classes()

Unnamed: 0,class_name,class_description
0,AIRNOW MAPS,The parameters represented on AirNow maps (881...
1,ALL,Select all Parameters Available
2,AQI POLLUTANTS,Pollutants that have an AQI Defined
3,CORE_HAPS,Urban Air Toxic Pollutants
4,CRITERIA,Criteria Pollutants
5,CSN DART,List of CSN speciation parameters to populate ...
6,FORECAST,Parameters routinely extracted by AirNow (STI)
7,HAPS,Hazardous Air Pollutants
8,IMPROVE CARBON,IMPROVE Carbon Parameters
9,IMPROVE_SPECIATION,PM2.5 Speciated Parameters Measured at IMPROVE...


We are most interested in the parameters in the CRITERIA class. These are the criteria air pollutants, the common pollutants for which the EPA sets National Ambient Air Quality Standards, which makes them a reasonable basis for assessing overall air quality.

In [7]:
parameter_codes = aqs_fetcher.get_parameter_list_by_class('CRITERIA')
parameter_codes

Unnamed: 0,code,parameter_description
0,14129,Lead (TSP) LC
1,42101,Carbon monoxide
2,42401,Sulfur dioxide
3,42602,Nitrogen dioxide (NO2)
4,44201,Ozone
5,81102,PM10 Total 0-10um STP
6,85129,Lead PM10 LC FRM/FEM
7,88101,PM2.5 - Local Conditions


Fantastic. These codes let us easily partition and query the data we need for the analysis. We now request the annual summary data for Illinois, for all of these parameters, from 2012 through 2016.

In [8]:
il_aq_data = aqs_fetcher.annual_data_by_state(il_code, parameter_codes.code, 20120101, 20161231)
il_aq_data.head()

Unnamed: 0,state_code,county_code,site_number,parameter_code,poc,latitude,longitude,datum,parameter,sample_duration,...,fiftieth_percentile,tenth_percentile,local_site_name,site_address,state,county,city,cbsa_code,cbsa,date_of_last_change
0,17,31,1,44201,1,41.670992,-87.732457,WGS84,Ozone,1 HOUR,...,0.049,0.03,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,16980,"Chicago-Naperville-Elgin, IL-IN-WI",2018-07-20
1,17,31,1,44201,1,41.670992,-87.732457,WGS84,Ozone,8-HR RUN AVG BEGIN HOUR,...,0.045,0.026,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,16980,"Chicago-Naperville-Elgin, IL-IN-WI",2018-07-20
2,17,31,1,44201,1,41.670992,-87.732457,WGS84,Ozone,8-HR RUN AVG BEGIN HOUR,...,0.045,0.026,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,16980,"Chicago-Naperville-Elgin, IL-IN-WI",2018-07-20
3,17,31,1,44201,1,41.670992,-87.732457,WGS84,Ozone,8-HR RUN AVG BEGIN HOUR,...,0.044,0.026,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,16980,"Chicago-Naperville-Elgin, IL-IN-WI",2018-07-20
4,17,31,1,44201,1,41.670992,-87.732457,WGS84,Ozone,1 HOUR,...,0.043,0.031,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,16980,"Chicago-Naperville-Elgin, IL-IN-WI",2018-07-20


In [9]:
il_aq_data.columns

Index(['state_code', 'county_code', 'site_number', 'parameter_code', 'poc',
       'latitude', 'longitude', 'datum', 'parameter', 'sample_duration',
       'pollutant_standard', 'metric_used', 'method', 'year',
       'units_of_measure', 'event_type', 'observation_count',
       'observation_percent', 'validity_indicator', 'valid_day_count',
       'required_day_count', 'exceptional_data_count',
       'null_observation_count', 'primary_exceedance_count',
       'secondary_exceedance_count', 'certification_indicator',
       'arithmetic_mean', 'standard_deviation', 'first_max_value',
       'first_max_datetime', 'second_max_value', 'second_max_datetime',
       'third_max_value', 'third_max_datetime', 'fourth_max_value',
       'fourth_max_datetime', 'first_max_nonoverlap_value',
       'first_max_n_o_datetime', 'second_max_nonoverlap_value',
       'second_max_n_o_datetime', 'ninety_ninth_percentile',
       'ninety_eighth_percentile', 'ninety_fifth_percentile',
       'ninetieth_perc

Let's also pare down this large number of columns. Many of these fields are superfluous for our analysis, so we list them explicitly and drop them.

In [10]:
cols_to_drop = ['state_code', 'poc', 'latitude', 'longitude', 'datum', 'event_type', 'observation_percent', 'validity_indicator',
                'valid_day_count', 'required_day_count', 'primary_exceedance_count', 'secondary_exceedance_count', 
                'certification_indicator', 'first_max_value', 'first_max_datetime', 'second_max_value', 'second_max_datetime',
                'third_max_value', 'third_max_datetime', 'fourth_max_value', 'fourth_max_datetime', 'first_max_nonoverlap_value',
                'first_max_n_o_datetime', 'second_max_nonoverlap_value', 'second_max_n_o_datetime', 'ninety_ninth_percentile',
                'ninety_eighth_percentile', 'ninety_fifth_percentile', 'ninetieth_percentile', 'seventy_fifth_percentile',
                'fiftieth_percentile', 'tenth_percentile', 'cbsa_code', 'cbsa', 'pollutant_standard', 'method', 'metric_used']
il_aq_data.drop(columns=cols_to_drop, inplace=True)
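When the drop list grows this long, an alternative is to name the columns to keep instead. The sketch below shows on a toy frame that the two approaches give the same result. The toy data and the `cols_to_keep` name are illustrative only.

```python
import pandas as pd

# Toy frame with a few of the column names from the real data.
toy = pd.DataFrame(
    {
        "county": ["Cook"],
        "parameter": ["Ozone"],
        "arithmetic_mean": [0.05],
        "datum": ["WGS84"],
        "poc": [1],
    }
)

# Option 1: select the columns to keep (copy to avoid a view of the original).
cols_to_keep = ["county", "parameter", "arithmetic_mean"]
kept = toy[cols_to_keep].copy()

# Option 2: drop the unwanted columns, returning a new frame.
dropped = toy.drop(columns=["datum", "poc"])

print(kept.equals(dropped))
```

A keep list fails loudly with a `KeyError` if an expected column is missing, while a drop list silently passes through any new columns the API adds later.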

In [11]:
il_aq_data.head()

Unnamed: 0,county_code,site_number,parameter_code,parameter,sample_duration,year,units_of_measure,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
0,31,1,44201,Ozone,1 HOUR,2012,Parts per million,5075,0,61,0.051695,0.017956,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,2018-07-20
1,31,1,44201,Ozone,8-HR RUN AVG BEGIN HOUR,2012,Parts per million,5073,0,0,0.045976,0.016587,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,2018-07-20
2,31,1,44201,Ozone,8-HR RUN AVG BEGIN HOUR,2012,Parts per million,5073,0,0,0.045976,0.016587,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,2018-07-20
3,31,1,44201,Ozone,8-HR RUN AVG BEGIN HOUR,2012,Parts per million,3606,0,0,0.045788,0.016663,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,2018-07-20
4,31,1,44201,Ozone,1 HOUR,2013,Parts per million,6194,0,406,0.044667,0.013483,VILLAGE GARAGE,4500 W. 123RD ST.,Illinois,Cook,Alsip,2018-07-20


Let's also investigate the consistency of the data. In particular, the `units_of_measure` field takes several different values, which suggests that we may need to do some unit conversion before we begin the analysis. To check whether this is the case, we can group by the parameter name together with the units of measure. If each name is associated with only a single unit, then we know that no unit conversion is needed to compare values within each compound.

In [12]:
il_aq_data.groupby(['parameter', 'units_of_measure']).count()

parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
Carbon monoxide,Parts per million,46,46,46,46,46,46,46,46,46,46,46,46,46,46,34,46
Lead (TSP) LC,Micrograms/cubic meter (LC),62,62,62,62,62,62,62,62,62,62,62,62,62,62,58,62
Nitrogen dioxide (NO2),Parts per billion,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68
Ozone,Parts per million,748,748,748,748,748,748,748,748,748,748,748,748,748,748,608,748
PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35
PM2.5 - Local Conditions,Micrograms/cubic meter (LC),701,701,701,701,701,701,701,701,701,701,701,701,701,701,633,701
Sulfur dioxide,Parts per billion,311,311,311,311,311,311,311,311,311,311,307,311,311,311,269,311


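We see that the units have in fact already been standardized within each parameter: every parameter is paired with exactly one unit. Therefore, no unit conversion is needed to compare values within a particular parameter. Units still differ across parameters (for example, parts per million for Ozone and parts per billion for Sulfur dioxide), so values should not be compared across parameters directly. Lead PM10 LC FRM/FEM is absent from the grouping, meaning the query returned no rows for it. We also see that Ozone and PM2.5 have far more records than the other parameters.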
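The same check can be made without reading the table by eye, by counting distinct units per parameter. The sketch below runs on a toy frame in which one parameter deliberately mixes units. The toy rows are made up for illustration.

```python
import pandas as pd

# Toy data: Sulfur dioxide is deliberately recorded in two different units.
toy = pd.DataFrame(
    {
        "parameter": ["Ozone", "Ozone", "Sulfur dioxide", "Sulfur dioxide"],
        "units_of_measure": [
            "Parts per million",
            "Parts per million",
            "Parts per billion",
            "Parts per million",
        ],
    }
)

# Number of distinct units recorded for each parameter.
units_per_parameter = toy.groupby("parameter")["units_of_measure"].nunique()

# Parameters that would need a unit conversion before comparison.
mixed = units_per_parameter[units_per_parameter > 1]
print(mixed)
```

Applied to `il_aq_data`, an empty `mixed` series would confirm that each parameter uses a single unit.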

In [13]:
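# numeric_only=True averages only the numeric columns; recent pandas versions
# raise a TypeError on text columns such as site_address without it.
il_county_means = il_aq_data.groupby(['county', 'parameter', 'units_of_measure']).mean(numeric_only=True)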
il_county_means

county,parameter,units_of_measure,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation
Adams,Ozone,Parts per million,2014.000000,5809.400000,0.0,79.400000,0.042986,0.010343
Champaign,Carbon monoxide,Parts per million,2014.250000,21264.000000,0.0,1450.250000,0.139411,0.061764
Champaign,Ozone,Parts per million,2014.000000,7721.025000,0.0,128.325000,0.044802,0.011319
Champaign,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2015.147059,571.205882,0.0,97.617647,8.457767,4.246211
Champaign,Sulfur dioxide,Parts per billion,2014.136364,13005.545455,0.0,543.818182,1.286831,3.203973
...,...,...,...,...,...,...,...,...
Will,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2015.172414,122.551724,0.0,23.793103,8.351912,4.332240
Winnebago,Carbon monoxide,Parts per million,2012.000000,8664.500000,0.0,35.000000,0.396856,0.176690
Winnebago,Lead (TSP) LC,Micrograms/cubic meter (LC),2012.500000,55.500000,0.0,4.500000,0.027988,0.041999
Winnebago,Ozone,Parts per million,2014.000000,7905.850000,0.0,61.600000,0.042892,0.011944


In [14]:
il_county_counts = il_aq_data.groupby(['county', 'parameter', 'units_of_measure']).count()
il_county_counts

county,parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,city,date_of_last_change
Adams,Ozone,Parts per million,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20
Champaign,Carbon monoxide,Parts per million,12,12,12,12,12,12,12,12,12,12,12,12,12,0,12
Champaign,Ozone,Parts per million,40,40,40,40,40,40,40,40,40,40,40,40,40,20,40
Champaign,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),34,34,34,34,34,34,34,34,34,34,34,34,34,12,34
Champaign,Sulfur dioxide,Parts per billion,22,22,22,22,22,22,22,22,22,22,22,22,22,0,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Will,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
Winnebago,Carbon monoxide,Parts per million,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Winnebago,Lead (TSP) LC,Micrograms/cubic meter (LC),2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Winnebago,Ozone,Parts per million,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20
