## Data Fetching
This interactive notebook fetches and cleans data from the EPA's Air Quality System (AQS). We start by importing the required libraries, including our custom `pyaqs` module, which provides wrapper methods that convert information from the AQS REST API into easily accessible and modifiable Pandas dataframes.

In [1]:
from pyaqs import AQSFetcher
import pandas as pd

Now, we will instantiate a new AQSFetcher object and use it to get the required data from the EPA website. For now, we will focus on counties within the states of Illinois, New York, California, and Georgia. We selected these states because of their relatively high number of counties and to get a decent geographic spread across the United States. To fetch this data, we will use some of the custom methods to get the appropriate identification codes for the necessary locations and parameters.

Note that in this context, a *parameter* is a compound in the air that can be measured. The EPA has many such parameters, sorted into different classes whose descriptions are easily accessible through the API.

In [2]:
aqs_fetcher = AQSFetcher('bbjornstad.flatiron@gmail.com', 'ochrefox21')
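
The constructor call above passes the account email and key as string literals. As a minimal sketch of an alternative, the credentials could be read from environment variables so that they stay out of the notebook. The helper `load_aqs_credentials` and the variable names `AQS_EMAIL` and `AQS_KEY` are assumptions made for this example and are not part of `pyaqs`.

```python
import os


def load_aqs_credentials(env=os.environ):
    """Return (email, key) read from environment variables.

    AQS_EMAIL and AQS_KEY are hypothetical names chosen for this sketch.
    """
    try:
        return env["AQS_EMAIL"], env["AQS_KEY"]
    except KeyError as missing:
        # Report which variable is missing rather than a bare KeyError.
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


# Hypothetical usage, mirroring the constructor call above:
# email, key = load_aqs_credentials()
# aqs_fetcher = AQSFetcher(email, key)
```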

In [3]:
state_codes = aqs_fetcher.get_state_codes()
state_codes.head()

Unnamed: 0,code,state_name
0,1,Alabama
1,2,Alaska
2,4,Arizona
3,5,Arkansas
4,6,California


Let's store the codes for our states of interest (Illinois, New York, California, and Georgia) in variables for easy access.

In [4]:
il_code = state_codes.loc[state_codes.state_name == 'Illinois', 'code'].values[0]
il_code

'17'

In [5]:
ny_code = state_codes.loc[state_codes.state_name == 'New York', 'code'].values[0]
ny_code

'36'

In [6]:
ca_code = state_codes.loc[state_codes.state_name == 'California', 'code'].values[0]
ca_code

'06'

In [7]:
ga_code = state_codes.loc[state_codes.state_name == 'Georgia', 'code'].values[0]
ga_code

'13'
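
The four lookups above repeat the same `.loc[...]` pattern. One possible way to condense them is to build a name-to-code mapping in a single pass. The sketch below uses a small stand-in dataframe rather than the real output of `get_state_codes()`, and the names `states_of_interest` and `code_by_state` are made up for the example.

```python
import pandas as pd

# Stand-in for the frame returned by get_state_codes(); codes are kept as
# zero-padded strings, matching the '06' shown for California above.
state_codes = pd.DataFrame({
    "code": ["06", "13", "17", "36"],
    "state_name": ["California", "Georgia", "Illinois", "New York"],
})

states_of_interest = ["Illinois", "New York", "California", "Georgia"]

# Index by state name, select the states we care about, and convert to a dict.
# .loc raises a KeyError if any requested name is missing from the frame.
code_by_state = (
    state_codes.set_index("state_name")["code"]
    .loc[states_of_interest]
    .to_dict()
)

il_code = code_by_state["Illinois"]
```

Keeping the codes in one dictionary also makes it straightforward to loop over the states in later steps.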

And now we will get a list of codes for the counties within each of our states.

In [8]:
il_county_codes = aqs_fetcher.get_counties_by_state(il_code)
ny_county_codes = aqs_fetcher.get_counties_by_state(ny_code)
ca_county_codes = aqs_fetcher.get_counties_by_state(ca_code)
ga_county_codes = aqs_fetcher.get_counties_by_state(ga_code)

Finally, let's take a look at the possible parameter classes and identify a set that seems reasonable for analysis.

In [9]:
aqs_fetcher.get_parameter_classes()

Unnamed: 0,class_name,class_description
0,AIRNOW MAPS,The parameters represented on AirNow maps (881...
1,ALL,Select all Parameters Available
2,AQI POLLUTANTS,Pollutants that have an AQI Defined
3,CORE_HAPS,Urban Air Toxic Pollutants
4,CRITERIA,Criteria Pollutants
5,CSN DART,List of CSN speciation parameters to populate ...
6,FORECAST,Parameters routinely extracted by AirNow (STI)
7,HAPS,Hazardous Air Pollutants
8,IMPROVE CARBON,IMPROVE Carbon Parameters
9,IMPROVE_SPECIATION,PM2.5 Speciated Parameters Measured at IMPROVE...


We are most interested in those parameters held in the CRITERIA class, as indicated by the description. In particular, this class defines pollutants that the EPA has determined to be suitable criteria for overall air quality.

In [10]:
parameter_codes = aqs_fetcher.get_parameter_list_by_class('CRITERIA')
parameter_codes

Unnamed: 0,code,parameter_description
0,14129,Lead (TSP) LC
1,42101,Carbon monoxide
2,42401,Sulfur dioxide
3,42602,Nitrogen dioxide (NO2)
4,44201,Ozone
5,81102,PM10 Total 0-10um STP
6,85129,Lead PM10 LC FRM/FEM
7,88101,PM2.5 - Local Conditions


Fantastic, these codes will let us easily partition and query the data that we need to continue with the analysis. We will start by fetching and cleaning the data from Illinois, then move on to the other states.

In [11]:
il_aq_data = aqs_fetcher.annual_data_by_state(il_code, parameter_codes.code, 20120101, 20161231)
il_aq_data.head()

Unnamed: 0,state_code,county_code,site_number,parameter_code,poc,latitude,longitude,datum,parameter,sample_duration,...,fiftieth_percentile,tenth_percentile,local_site_name,site_address,state,county,city,cbsa_code,cbsa,date_of_last_change
0,17,115,110,14129,1,39.862576,-88.940748,WGS84,Lead (TSP) LC,24 HOUR,...,0.02,0.01,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,19500,"Decatur, IL",2013-06-28
1,17,115,110,14129,1,39.862576,-88.940748,WGS84,Lead (TSP) LC,24 HOUR,...,0.011,0.004,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,19500,"Decatur, IL",2014-02-25
2,17,115,110,14129,1,39.862576,-88.940748,WGS84,Lead (TSP) LC,24 HOUR,...,0.012,0.003,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,19500,"Decatur, IL",2015-03-18
3,17,115,110,14129,1,39.862576,-88.940748,WGS84,Lead (TSP) LC,24 HOUR,...,0.012,0.004,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,19500,"Decatur, IL",2016-01-19
4,17,115,110,14129,1,39.862576,-88.940748,WGS84,Lead (TSP) LC,24 HOUR,...,0.008,0.004,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,19500,"Decatur, IL",2017-02-02


In [12]:
il_aq_data.columns

Index(['state_code', 'county_code', 'site_number', 'parameter_code', 'poc',
       'latitude', 'longitude', 'datum', 'parameter', 'sample_duration',
       'pollutant_standard', 'metric_used', 'method', 'year',
       'units_of_measure', 'event_type', 'observation_count',
       'observation_percent', 'validity_indicator', 'valid_day_count',
       'required_day_count', 'exceptional_data_count',
       'null_observation_count', 'primary_exceedance_count',
       'secondary_exceedance_count', 'certification_indicator',
       'arithmetic_mean', 'standard_deviation', 'first_max_value',
       'first_max_datetime', 'second_max_value', 'second_max_datetime',
       'third_max_value', 'third_max_datetime', 'fourth_max_value',
       'fourth_max_datetime', 'first_max_nonoverlap_value',
       'first_max_n_o_datetime', 'second_max_nonoverlap_value',
       'second_max_n_o_datetime', 'ninety_ninth_percentile',
       'ninety_eighth_percentile', 'ninety_fifth_percentile',
       'ninetieth_perc

Let's also pare down this large number of columns. Many of these fields are superfluous for our analysis, so we can simply drop them and keep only the columns that we want.

In [13]:
cols_to_drop = ['state_code', 'poc', 'latitude', 'longitude', 'datum', 'event_type', 'observation_percent', 'validity_indicator',
                'valid_day_count', 'required_day_count', 'primary_exceedance_count', 'secondary_exceedance_count', 
                'certification_indicator', 'first_max_value', 'first_max_datetime', 'second_max_value', 'second_max_datetime',
                'third_max_value', 'third_max_datetime', 'fourth_max_value', 'fourth_max_datetime', 'first_max_nonoverlap_value',
                'first_max_n_o_datetime', 'second_max_nonoverlap_value', 'second_max_n_o_datetime', 'ninety_ninth_percentile',
                'ninety_eighth_percentile', 'ninety_fifth_percentile', 'ninetieth_percentile', 'seventy_fifth_percentile',
                'fiftieth_percentile', 'tenth_percentile', 'cbsa_code', 'cbsa', 'pollutant_standard', 'method', 'metric_used']
il_aq_data.drop(columns=cols_to_drop, inplace=True)
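
Dropping by name raises a `KeyError` if a listed column is missing from the frame, for example when a different endpoint returns a slightly different schema. As a hedged sketch on a toy frame (the column subset and the `cols_to_keep` list are chosen only for illustration), `errors="ignore"` makes the drop tolerant, and an explicit keep-list gives the same result from the other direction.

```python
import pandas as pd

# Toy frame with a handful of the columns listed above.
df = pd.DataFrame({
    "state_code": ["17"],
    "county_code": ["115"],
    "poc": [1],
    "parameter": ["Lead (TSP) LC"],
    "arithmetic_mean": [0.054107],
})

# 'latitude' is deliberately absent from the toy frame.
cols_to_drop = ["state_code", "poc", "latitude"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = df.drop(columns=cols_to_drop, errors="ignore")

# Equivalent keep-list formulation: name the columns that should survive.
cols_to_keep = ["county_code", "parameter", "arithmetic_mean"]
kept = df[cols_to_keep]
```

Returning a new frame, rather than using `inplace=True`, leaves the original data available if a dropped column turns out to be needed later.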

In [14]:
il_aq_data.head()

Unnamed: 0,county_code,site_number,parameter_code,parameter,sample_duration,year,units_of_measure,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
0,115,110,14129,Lead (TSP) LC,24 HOUR,2012,Micrograms/cubic meter (LC),56,0,0,0.054107,0.078457,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,2013-06-28
1,115,110,14129,Lead (TSP) LC,24 HOUR,2013,Micrograms/cubic meter (LC),60,0,1,0.032333,0.05609,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,2014-02-25
2,115,110,14129,Lead (TSP) LC,24 HOUR,2014,Micrograms/cubic meter (LC),60,0,1,0.02895,0.037702,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,2015-03-18
3,115,110,14129,Lead (TSP) LC,24 HOUR,2015,Micrograms/cubic meter (LC),56,0,5,0.030464,0.042582,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,2016-01-19
4,115,110,14129,Lead (TSP) LC,24 HOUR,2016,Micrograms/cubic meter (LC),59,0,3,0.025508,0.051635,MUELLER,1226 E. GARFIELD,Illinois,Macon,Decatur,2017-02-02


Let's also investigate the consistency of the data. In particular, the `units_of_measure` field takes several different values, which leads us to suspect that we may need to do some unit conversion before we are ready to begin analysis. To check whether this is the case, we can group by the parameter name together with the units of measure. If each name is associated with only a single unit, then we won't have to perform any unit conversions in order to compare readings of the same compound.

In [15]:
il_aq_data.groupby(['parameter', 'units_of_measure']).count()

parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
Carbon monoxide,Parts per million,46,46,46,46,46,46,46,46,46,46,46,46,46,46,34,46
Lead (TSP) LC,Micrograms/cubic meter (LC),62,62,62,62,62,62,62,62,62,62,62,62,62,62,58,62
Nitrogen dioxide (NO2),Parts per billion,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68
Ozone,Parts per million,748,748,748,748,748,748,748,748,748,748,748,748,748,748,608,748
PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35
PM2.5 - Local Conditions,Micrograms/cubic meter (LC),701,701,701,701,701,701,701,701,701,701,701,701,701,701,633,701
Sulfur dioxide,Parts per billion,311,311,311,311,311,311,311,311,311,311,307,311,311,311,269,311


We see that the units have in fact already been standardized: each parameter is reported in a single unit. Therefore, we don't need to do any unit conversion to make comparisons within a particular parameter. We also see that we have many more readings for the Ozone and PM2.5 parameters than for the others.
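
Reading the grouped table by eye works for a handful of parameters, but the same check can be expressed directly. The sketch below counts the distinct units per parameter on toy data shaped like the cleaned frame; the names `units_per_parameter` and `mixed` are invented for the example.

```python
import pandas as pd

# Toy data shaped like the cleaned frame: one row per site/year summary.
df = pd.DataFrame({
    "parameter": ["Ozone", "Ozone", "Sulfur dioxide", "Carbon monoxide"],
    "units_of_measure": [
        "Parts per million", "Parts per million",
        "Parts per billion", "Parts per million",
    ],
})

# Number of distinct units recorded for each parameter.
units_per_parameter = df.groupby("parameter")["units_of_measure"].nunique()

# Any parameter reported in more than one unit would need conversion.
mixed = units_per_parameter[units_per_parameter > 1]
assert mixed.empty, f"Mixed units found for: {list(mixed.index)}"
```

Because the check is an assertion, rerunning the notebook on a different state or date range would fail loudly if a parameter ever appeared in more than one unit.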

In [16]:
il_county_means = il_aq_data.groupby(['county', 'parameter', 'units_of_measure']).mean()
il_county_means

county,parameter,units_of_measure,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation
Adams,Ozone,Parts per million,2014.000000,5809.400000,0.0,79.400000,0.042986,0.010343
Champaign,Carbon monoxide,Parts per million,2014.250000,21264.000000,0.0,1450.250000,0.139411,0.061764
Champaign,Ozone,Parts per million,2014.000000,7721.025000,0.0,128.325000,0.044802,0.011319
Champaign,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2015.147059,571.205882,0.0,97.617647,8.457767,4.246211
Champaign,Sulfur dioxide,Parts per billion,2014.136364,13005.545455,0.0,543.818182,1.286831,3.203973
...,...,...,...,...,...,...,...,...
Will,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2015.172414,122.551724,0.0,23.793103,8.351912,4.332240
Winnebago,Carbon monoxide,Parts per million,2012.000000,8664.500000,0.0,35.000000,0.396856,0.176690
Winnebago,Lead (TSP) LC,Micrograms/cubic meter (LC),2012.500000,55.500000,0.0,4.500000,0.027988,0.041999
Winnebago,Ozone,Parts per million,2014.000000,7905.850000,0.0,61.600000,0.042892,0.011944
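
The grouped mean above keeps only the numeric columns. Recent pandas releases (2.0 and later) no longer drop text columns silently and raise a `TypeError` instead, so on a current installation it may be necessary to ask for numeric columns explicitly. A small sketch on toy data, assuming the same three grouping keys:

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["Adams", "Adams", "Will"],
    "parameter": ["Ozone", "Ozone", "Ozone"],
    "units_of_measure": ["Parts per million"] * 3,
    # Text column that cannot be averaged.
    "sample_duration": ["1 HOUR", "8-HR RUN AVG BEGIN HOUR", "1 HOUR"],
    "arithmetic_mean": [0.040, 0.044, 0.050],
})

# numeric_only=True restricts the aggregation to numeric columns, so the
# text column is left out instead of being passed to mean().
county_means = df.groupby(
    ["county", "parameter", "units_of_measure"]
).mean(numeric_only=True)
```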


In [17]:
il_county_counts = il_aq_data.groupby(['county', 'parameter', 'units_of_measure']).count()
il_county_counts

county,parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,city,date_of_last_change
Adams,Ozone,Parts per million,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20
Champaign,Carbon monoxide,Parts per million,12,12,12,12,12,12,12,12,12,12,12,12,12,0,12
Champaign,Ozone,Parts per million,40,40,40,40,40,40,40,40,40,40,40,40,40,20,40
Champaign,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),34,34,34,34,34,34,34,34,34,34,34,34,34,12,34
Champaign,Sulfur dioxide,Parts per billion,22,22,22,22,22,22,22,22,22,22,22,22,22,0,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Will,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
Winnebago,Carbon monoxide,Parts per million,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Winnebago,Lead (TSP) LC,Micrograms/cubic meter (LC),2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Winnebago,Ozone,Parts per million,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20


Let's go through the same steps for the other states that we are interested in, namely New York, California, and Georgia.

In [18]:
ny_aq_data = aqs_fetcher.annual_data_by_state(ny_code, parameter_codes.code, 20120101, 20161231)
ca_aq_data = aqs_fetcher.annual_data_by_state(ca_code, parameter_codes.code, 20120101, 20161231)
ga_aq_data = aqs_fetcher.annual_data_by_state(ga_code, parameter_codes.code, 20120101, 20161231)
ny_aq_data.drop(columns=cols_to_drop, inplace=True)
ca_aq_data.drop(columns=cols_to_drop, inplace=True)
ga_aq_data.drop(columns=cols_to_drop, inplace=True)
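
Since each state goes through the same fetch-then-drop steps, the repetition could be folded into a loop. The following is only a sketch: `fetch_and_trim` is a hypothetical helper, and a stub function stands in for the real fetcher so that the example runs offline.

```python
import pandas as pd


def fetch_and_trim(fetch, state_codes, cols_to_drop):
    """Fetch one frame per state and drop the unwanted columns from each.

    `fetch` is any callable that takes a state code and returns a DataFrame.
    """
    frames = {}
    for name, code in state_codes.items():
        frames[name] = fetch(code).drop(columns=cols_to_drop)
    return frames


def fake_fetch(code):
    """Stub standing in for the real fetcher; returns a one-row frame."""
    return pd.DataFrame(
        {"state_code": [code], "poc": [1], "arithmetic_mean": [0.1]}
    )


frames = fetch_and_trim(
    fake_fetch,
    {"ny": "36", "ca": "06", "ga": "13"},
    ["state_code", "poc"],
)

# With the notebook's fetcher, the callable might look like:
# lambda code: aqs_fetcher.annual_data_by_state(
#     code, parameter_codes.code, 20120101, 20161231)
```

Keeping the frames in a dictionary keyed by state also lets the later exploratory steps run in a loop instead of one cell per state.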

And we'll also do some exploratory data analysis, as we did in the case of Illinois, to check that the units are standardized and get some information about the breakdown by county and parameter in table format.

In [19]:
ny_aq_data.head()

Unnamed: 0,county_code,site_number,parameter_code,parameter,sample_duration,year,units_of_measure,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
0,1,12,42101,Carbon monoxide,1 HOUR,2012,Parts per million,8650,0,134,0.265642,0.123018,LOUDONVILLE,LOUDONVILLE RESERVOIR 300 ALBANY SHAKER RD,New York,Albany,Albany,2016-04-09
1,1,12,42101,Carbon monoxide,8-HR RUN AVG END HOUR,2012,Parts per million,8625,0,0,0.268649,0.118286,LOUDONVILLE,LOUDONVILLE RESERVOIR 300 ALBANY SHAKER RD,New York,Albany,Albany,2016-04-09
2,1,12,42101,Carbon monoxide,1 HOUR,2013,Parts per million,8618,0,141,0.231562,0.145281,LOUDONVILLE,LOUDONVILLE RESERVOIR 300 ALBANY SHAKER RD,New York,Albany,Albany,2016-04-08
3,1,12,42101,Carbon monoxide,8-HR RUN AVG END HOUR,2013,Parts per million,8650,0,0,0.234936,0.140438,LOUDONVILLE,LOUDONVILLE RESERVOIR 300 ALBANY SHAKER RD,New York,Albany,Albany,2016-04-08
4,1,12,42101,Carbon monoxide,1 HOUR,2014,Parts per million,8623,0,137,0.226638,0.132641,LOUDONVILLE,LOUDONVILLE RESERVOIR 300 ALBANY SHAKER RD,New York,Albany,Albany,2016-04-08


In [20]:
ny_aq_data.groupby(['parameter', 'units_of_measure']).count()

parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
Carbon monoxide,Parts per million,86,86,86,86,86,86,86,86,86,86,86,86,86,86,76,86
Lead (TSP) LC,Micrograms/cubic meter (LC),22,22,22,22,22,22,22,22,22,22,22,22,22,22,17,22
Nitrogen dioxide (NO2),Parts per billion,64,64,64,64,64,64,64,64,64,64,64,64,64,64,54,64
Ozone,Parts per million,664,664,664,664,664,664,664,664,664,664,664,664,664,664,372,664
PM2.5 - Local Conditions,Micrograms/cubic meter (LC),543,543,543,543,543,543,543,543,543,543,543,543,543,543,488,543
Sulfur dioxide,Parts per billion,500,500,500,500,500,500,500,500,500,500,500,500,500,500,315,500


In [21]:
ny_county_means = ny_aq_data.groupby(['county', 'parameter', 'units_of_measure']).mean()
ny_county_means

county,parameter,units_of_measure,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation
Albany,Carbon monoxide,Parts per million,2014.0,8575.00,0.0,94.100000,0.213539,0.124838
Albany,Ozone,Parts per million,2014.0,7884.20,0.0,64.700000,0.041735,0.011822
Albany,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2014.0,113.80,0.0,7.466667,6.946367,4.077637
Albany,Sulfur dioxide,Parts per billion,2014.0,24442.24,0.0,611.280000,1.410937,1.037610
Bronx,Carbon monoxide,Parts per million,2014.0,8647.30,0.0,69.900000,0.349356,0.179301
...,...,...,...,...,...,...,...,...
Tompkins,Ozone,Parts per million,2014.0,7850.25,0.0,119.500000,0.043515,0.010523
Ulster,Ozone,Parts per million,2012.0,7871.00,0.0,75.250000,0.044534,0.011054
Ulster,Sulfur dioxide,Parts per billion,2012.0,22088.40,0.0,1254.800000,0.396695,0.586396
Wayne,Ozone,Parts per million,2014.0,6941.00,0.0,170.650000,0.041794,0.011224


In [22]:
ca_aq_data.head()

Unnamed: 0,county_code,site_number,parameter_code,parameter,sample_duration,year,units_of_measure,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
0,73,1006,44201,Ozone,1 HOUR,2012,Parts per million,8268,0,516,0.057747,0.014912,Alpine,"2300 VICTORIA DR., ALPINE",California,San Diego,Alpine,2018-07-21
1,73,1006,44201,Ozone,8-HR RUN AVG BEGIN HOUR,2012,Parts per million,8582,0,0,0.051136,0.012308,Alpine,"2300 VICTORIA DR., ALPINE",California,San Diego,Alpine,2018-07-21
2,73,1006,44201,Ozone,8-HR RUN AVG BEGIN HOUR,2012,Parts per million,8582,0,0,0.051136,0.012308,Alpine,"2300 VICTORIA DR., ALPINE",California,San Diego,Alpine,2018-07-21
3,73,1006,44201,Ozone,8-HR RUN AVG BEGIN HOUR,2012,Parts per million,6138,0,0,0.05095,0.01246,Alpine,"2300 VICTORIA DR., ALPINE",California,San Diego,Alpine,2018-07-21
4,73,1006,44201,Ozone,1 HOUR,2013,Parts per million,8143,0,617,0.05893,0.013469,Alpine,"2300 VICTORIA DR., ALPINE",California,San Diego,Alpine,2018-07-21


In [23]:
ca_aq_data.groupby(['parameter', 'units_of_measure']).count()

parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
Carbon monoxide,Parts per million,762,762,762,762,762,762,762,762,762,762,756,762,762,762,722,762
Lead (TSP) LC,Micrograms/cubic meter (LC),138,138,138,138,138,138,138,138,138,138,138,138,138,138,132,138
Lead PM10 LC FRM/FEM,Micrograms/cubic meter (LC),12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
Nitrogen dioxide (NO2),Parts per billion,1066,1066,1066,1066,1066,1066,1066,1066,1066,1066,1066,1066,1066,1066,962,1066
Ozone,Parts per million,3861,3861,3861,3861,3861,3861,3861,3861,3861,3861,3861,3861,3861,3861,3101,3861
PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),1501,1501,1501,1501,1501,1501,1501,1501,1501,1501,1486,1501,1501,1501,1311,1501
PM2.5 - Local Conditions,Micrograms/cubic meter (LC),3784,3784,3784,3784,3784,3784,3784,3784,3784,3784,3784,3784,3784,3784,3534,3784
Sulfur dioxide,Parts per billion,712,712,712,712,712,712,712,712,712,712,712,712,712,712,678,712


In [24]:
ca_county_means = ca_aq_data.groupby(['county', 'parameter', 'units_of_measure']).mean()
ca_county_means

county,parameter,units_of_measure,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation
Alameda,Carbon monoxide,Parts per million,2014.357143,8055.071429,0.000000,119.678571,0.420692,0.229867
Alameda,Nitrogen dioxide (NO2),Parts per billion,2014.208333,7703.416667,0.000000,190.958333,15.562419,8.261447
Alameda,Ozone,Parts per million,2014.217391,7090.652174,159.184783,49.826087,0.033779,0.010086
Alameda,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2014.263158,1815.568421,0.000000,22.389474,8.641273,5.233965
Alameda,Sulfur dioxide,Parts per billion,2014.000000,4623.500000,0.000000,190.600000,1.030929,1.426650
...,...,...,...,...,...,...,...,...
Ventura,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2013.849162,1452.201117,0.000000,49.486034,8.600528,4.663786
Yolo,Nitrogen dioxide (NO2),Parts per billion,2014.000000,8108.800000,0.000000,470.200000,10.020551,5.890308
Yolo,Ozone,Parts per million,2014.000000,7944.025000,0.000000,63.875000,0.040638,0.011562
Yolo,PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),2014.000000,60.500000,0.000000,4.300000,18.797836,11.119300


In [25]:
ga_aq_data.head()

Unnamed: 0,county_code,site_number,parameter_code,parameter,sample_duration,year,units_of_measure,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
0,121,56,42101,Carbon monoxide,1 HOUR,2014,Parts per million,4714,0,86,0.636742,0.305351,NR-GA Tech,"Georgia Institute of Technology, 6th Street an...",Georgia,Fulton,Atlanta,2018-06-05
1,121,56,42101,Carbon monoxide,8-HR RUN AVG END HOUR,2014,Parts per million,4747,0,0,0.642237,0.250243,NR-GA Tech,"Georgia Institute of Technology, 6th Street an...",Georgia,Fulton,Atlanta,2018-06-05
2,121,56,42101,Carbon monoxide,1 HOUR,2015,Parts per million,8622,0,138,0.79949,0.333112,NR-GA Tech,"Georgia Institute of Technology, 6th Street an...",Georgia,Fulton,Atlanta,2016-04-07
3,121,56,42101,Carbon monoxide,8-HR RUN AVG END HOUR,2015,Parts per million,8720,0,0,0.80586,0.282898,NR-GA Tech,"Georgia Institute of Technology, 6th Street an...",Georgia,Fulton,Atlanta,2016-04-07
4,121,56,42101,Carbon monoxide,1 HOUR,2016,Parts per million,8210,0,556,0.787454,0.28833,NR-GA Tech,"Georgia Institute of Technology, 6th Street an...",Georgia,Fulton,Atlanta,2018-01-31


In [26]:
ga_aq_data.groupby(['parameter', 'units_of_measure']).count()

parameter,units_of_measure,county_code,site_number,parameter_code,sample_duration,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation,local_site_name,site_address,state,county,city,date_of_last_change
Carbon monoxide,Parts per million,30,30,30,30,30,30,30,30,30,30,30,30,30,30,12,30
Lead (TSP) LC,Micrograms/cubic meter (LC),25,25,25,25,25,25,25,25,25,25,23,25,25,25,23,25
Nitrogen dioxide (NO2),Parts per billion,36,36,36,36,36,36,36,36,36,36,36,36,36,36,18,36
Ozone,Parts per million,420,420,420,420,420,420,420,420,420,420,420,420,420,420,300,420
PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),37,37,37,37,37,37,37,37,37,37,35,37,37,37,26,37
PM2.5 - Local Conditions,Micrograms/cubic meter (LC),827,827,827,827,827,827,827,827,827,827,827,827,827,827,632,827
Sulfur dioxide,Parts per billion,144,144,144,144,144,144,144,144,144,144,144,144,144,144,104,144


In [27]:
ga_county_means = ga_aq_data.groupby(['county', 'parameter', 'units_of_measure']).mean()
ga_county_means

county,parameter,units_of_measure,year,observation_count,exceptional_data_count,null_observation_count,arithmetic_mean,standard_deviation
Bartow,Lead (TSP) LC,Micrograms/cubic meter (LC),2012.5,61.500000,0.000000,0.000000,0.012931,0.011252
Bibb,Ozone,Parts per million,2014.0,5375.800000,0.000000,26.750000,0.043618,0.011525
Bibb,PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),2012.0,57.000000,0.000000,4.000000,20.328197,8.606977
Bibb,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2014.0,174.000000,0.315789,13.105263,9.677191,4.172899
Bibb,Sulfur dioxide,Parts per billion,2014.0,5086.450000,0.000000,96.800000,1.242514,0.917359
...,...,...,...,...,...,...,...,...
Sumter,Ozone,Parts per million,2014.0,5400.100000,0.000000,19.400000,0.041083,0.010460
Walker,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2014.0,113.571429,0.285714,9.000000,10.005792,4.808037
Washington,PM10 Total 0-10um STP,Micrograms/cubic meter (25 C),2012.0,55.000000,0.000000,7.000000,16.181818,5.182202
Washington,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),2014.0,113.714286,0.285714,10.285714,9.261911,3.774138


Finally, let's save our cleaned-up dataframes to the [`cleaned_data`](./cleaned_data/) folder for easy access in our analysis notebook.

In [28]:
il_aq_data.to_csv('cleaned_data/il_aq_data.csv')
ny_aq_data.to_csv('cleaned_data/ny_aq_data.csv')
ca_aq_data.to_csv('cleaned_data/ca_aq_data.csv')
ga_aq_data.to_csv('cleaned_data/ga_aq_data.csv')
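
Two details can matter when these files are read back in: `to_csv` writes the row index as an extra column by default, and identifier codes stored as zero-padded strings (such as `'06'`) are parsed as integers unless told otherwise. The sketch below illustrates one way to handle both on a toy frame written to a temporary folder; the file name and data are invented for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "county_code": ["001", "115"],
    "arithmetic_mean": [0.27, 0.05],
})

# Write into a throwaway folder; to_csv does not create missing folders itself.
out_dir = os.path.join(tempfile.mkdtemp(), "cleaned_data")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "example_aq_data.csv")

# index=False leaves the row index out of the file, so no 'Unnamed: 0'
# column appears when the file is read back.
df.to_csv(path, index=False)

# Reading the code column as text preserves any leading zeros.
restored = pd.read_csv(path, dtype={"county_code": str})
```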