# Web scraping example 2: Downloading air quality data from US AQS API


## Air quality data in the US
The United States Environmental Protection Agency (US EPA) provides several ways to access air quality data in the US. The pages below give some background information:

- [US EPA Outdoor Air Quality Data Homepage](https://www.epa.gov/outdoor-air-quality-data)
- [Map of monitoring sites](https://gispub.epa.gov/airnow/?mlayer=ozonepm&clayer=none&panel=0%203%3E%20TROPOMI%20codes%20-%20visualisation)


## Download data from the Air Quality System (AQS) API
Detailed instructions:
- [Air Quality System API](https://aqs.epa.gov/aqsweb/documents/data_api.html)

In brief:
- you first need to get your unique "key" using your registered email address
- you need the state, county and site codes of your target location, and the parameter code of your target species

You can copy and paste the example links below to your web browser to see how the API works:
- retrieve the list of states in the US: https://aqs.epa.gov/data/api/list/states?email=test@aqs.api&key=test

- retrieve the list of counties in a certain state: https://aqs.epa.gov/data/api/list/countiesByState?email=test@aqs.api&key=test&state=06

- retrieve the list of sites in a certain county: https://aqs.epa.gov/data/api/list/sitesByCounty?email=test@aqs.api&key=test&state=06&county=037

Once you have your registered email and personal key, you can edit the link below (insert your own email and key) to get some sample data:

- request data at a sample site: https://aqs.epa.gov/data/api/sampleData/bySite?email=insert_your_own_email&key=insert_your_own_key&param=42602&bdate=20190101&edate=20190101&state=06&county=037&site=2005
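All the request links above follow the same pattern: a base address, an endpoint, and a query string. As a minimal sketch, the query string can be assembled with `urllib.parse.urlencode` from the standard library instead of being typed by hand. The helper name `build_aqs_url` is an assumption made for illustration; the endpoints and parameter names are the ones shown in the links above.

```python
from urllib.parse import urlencode

AQS_BASE = "https://aqs.epa.gov/data/api/"

def build_aqs_url(endpoint, **params):
    """Join the base address, an endpoint and the URL-encoded query parameters.
    (hypothetical helper for illustration, not part of the AQS API itself)"""
    # safe="@" keeps the "@" of the email address readable, as in the links above
    return AQS_BASE + endpoint + "?" + urlencode(params, safe="@")

counties_url = build_aqs_url("list/countiesByState",
                             email="test@aqs.api", key="test", state="06")
print(counties_url)
```

The codes are passed as strings so that leading zeros (for example `"06"`) are kept.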

## Now use Python to request the same data
- I show all the steps below, but I did not run them here because the returned results are very long. This version should still give you a good idea of how the code is developed.
- You only need to provide your `"registered email"` and `"unique key"` at the beginning; you should then be able to run all the code that follows.

In [None]:
###############################################################################
# provide your "registered email" and "unique key"
my_email = "insert-your-registered-email-address" 
my_key = "insert-your-unique-key-obtained-from-US-AQS-API"

In [None]:
###############################################################################
# load packages

import os
import pandas as pd
import json
from urllib.request import urlopen

# now build a function to download json data from the target url
# measurements are in "Data" ("Header" returns the status of this request)

def get_AQS_data(download_link):
    """Open the target download link provided by the US EPA Air Quality System (AQS) API,
       return the data, or the header with the status message (if there is no data)"""
    # read the raw response and decode it to text
    response = urlopen(download_link).read().decode("utf-8")
    responseJson = json.loads(response)
    # fall back to "Header" if "Data" is empty or absent
    data = responseJson.get("Data")
    if not data:
        return responseJson.get("Header")
    else:
        return data
    
# request data at the same sample site using your registered email and key
# you can remove "\" at the end and write the url in a single line
sample_url = "https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
             "&param=42602&bdate=20190101&edate=20190101&state=06&county=037&site=2005"

# retrieve data
sample_results = get_AQS_data(sample_url)
display(sample_results) 

In [None]:
# the returned results are a list of dictionaries (one dictionary per record)
print("type of each record:",type(sample_results[0]))
print("number of records:",len(sample_results))
print("dictionary keys:",sample_results[0].keys())

In [None]:
# check the first record (measurements at 00:00 on the selected day)
display(sample_results[0])
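Because the cells above are not run here, the following sketch shows the same "Data" or "Header" logic offline, on a hand-written stand-in for an API response. The field names and values in `fake_response` are illustrative assumptions rather than real measurements, and `parse_aqs_response` is a hypothetical helper that mirrors the fallback in `get_AQS_data` without opening a link.

```python
import json

# hand-written stand-in for an API response (illustrative values only)
fake_response = '''
{"Header": [{"status": "Success", "rows": 2}],
 "Data": [{"site_number": "2005", "sample_measurement": 1.0},
          {"site_number": "2005", "sample_measurement": 2.0}]}
'''

def parse_aqs_response(text):
    """Return the records in "Data", or "Header" if "Data" is empty or absent."""
    parsed = json.loads(text)
    return parsed.get("Data") or parsed.get("Header")

records = parse_aqs_response(fake_response)
print(type(records), len(records))    # a list with one dictionary per record
print(sorted(records[0].keys()))      # keys of the first record

# with an empty "Data" list, the header is returned instead
empty = parse_aqs_response('{"Header": [{"status": "No data", "rows": 0}], "Data": []}')
print(empty)
```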

## Using this function, we can loop through a large number of air quality stations quickly.

## For example, New York City seems to be missing from the historical data files provided by US EPA. Let's do a systematic check of NO2 data in New York City to confirm this.

In [None]:
###############################################################################
# New York state code: 36
# New York County (Manhattan, part of New York City) code: 061
# NO2 parameter code: 42602

NY_sites = get_AQS_data("https://aqs.epa.gov/data/api/list/sitesByCounty?email=test@aqs.api&key=test&state=36&county=061")

# Many site entries seem to be empty ('value_represented': None)
display(NY_sites)

In [None]:
# get the list of site codes in New York City
NY_sites_codes = [site['code'] for site in NY_sites]

print(NY_sites_codes)

In [None]:
# generate the corresponding link for each site code
NY_sites_links = []

for i in range(len(NY_sites_codes)):
    NY_sites_links.append("https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
                          "&param=42602&bdate=20190101&edate=20190131&state=36&county=061&site="+str(NY_sites_codes[i]))

# print out all the download links
for i in range(len(NY_sites_links)):
    print(NY_sites_links[i])

In [None]:
# get data from each link
NY_data = [get_AQS_data(link) for link in NY_sites_links]   

# print out the results
for i in range(len(NY_data)):
    print(NY_data[i])
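When looping over many links, it is considerate to pause between requests; check the usage limits in the API documentation linked above before running large loops. The following is a minimal sketch under that assumption. The helper name `fetch_all` and the default pause length are hypothetical choices for illustration, and the demonstration uses a stand-in fetcher so that it runs without contacting the API.

```python
import time

def fetch_all(links, fetch, pause_seconds=5):
    """Call fetch(link) for every link, sleeping between consecutive requests."""
    results = []
    for n, link in enumerate(links):
        if n > 0:
            time.sleep(pause_seconds)   # no pause is needed before the first request
        results.append(fetch(link))
    return results

# offline demonstration: str.upper stands in for the real download function
demo = fetch_all(["link_a", "link_b"], fetch=str.upper, pause_seconds=0)
print(demo)
```

In this notebook the call would look like `fetch_all(NY_sites_links, get_AQS_data)`.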

## Using this function, we can also download data for multiple species at multiple sites all at once. Let's use Los Angeles as an example.

In [None]:
###############################################################################
# download NO2,SO2,CO,O3,PM2.5,PM10 from Los Angeles

# first summarize the codes needed for the input variables (parameter,bdate,edate,state code,county code,site code)
# then extract the codes using regular expressions

import re

parameters = '''
             NO2: 42602
             SO2: 42401
             CO: 42101
             O3: 44201
             PM2.5 FRM/FEM Mass: 88101
             PM2.5 non FRM/FEM Mass: 88502
             PM10: 81102
             '''
# here create the list of species in the same order (this will be used later when assigning the output file names)
species = ['NO2','SO2','CO','O3','PM2.5 FRM FEM Mass','PM2.5 non FRM FEM Mass','PM10'] 

date = '''
       begin date: 20190101
       end date: 20191231
       '''

state_code = '''
             California: 06
             '''

county_code = '''
              Los Angeles: 037
              '''

site_code = '''
             Compton: 1302
             Lancaster: 9033
             North Main: 1103
             LAX: 5005
             Glendora: 0016
             '''
# once we have established the order, we only need to keep numbers on the right
# tell Python to find "digits" after ": "
parameters = re.findall(r'(?<=:\s)\d+',parameters)
date = re.findall(r'(?<=:\s)\d+',date)
state_code = re.findall(r'(?<=:\s)\d+',state_code)
county_code = re.findall(r'(?<=:\s)\d+',county_code)
site_code = re.findall(r'(?<=:\s)\d+',site_code)
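To see what the pattern `(?<=:\s)\d+` does, here is a small standalone sketch on a shortened text block. The lookbehind `(?<=:\s)` only accepts digits that directly follow a colon and one whitespace character, so the "2" and "5" in "PM2.5" are skipped. The variable names `sample` and `codes` are made up for this illustration.

```python
import re

sample = '''
         PM2.5 FRM/FEM Mass: 88101
         California: 06
         Glendora: 0016
         '''

# keep only the digits that come right after ": "
codes = re.findall(r'(?<=:\s)\d+', sample)
print(codes)   # ['88101', '06', '0016']
```

The codes are returned as strings, so leading zeros such as `06` and `0016` are preserved; converting them to integers would drop the zeros and break the request links.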

In [None]:
###############################################################################
# create download links to all species at each site
LA_Compton = []
LA_Lancaster = []
LA_North_Main = []
LA_LAX = []
LA_Glendora = []

for i in range(len(parameters)):
    LA_Compton.append("https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
                      "&param="+str(parameters[i])+"&bdate="+str(date[0])+"&edate="+str(date[1])+"&state="+str(state_code[0])+ \
                      "&county="+str(county_code[0])+"&site="+str(site_code[0]))
    LA_Lancaster.append("https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
                        "&param="+str(parameters[i])+"&bdate="+str(date[0])+"&edate="+str(date[1])+"&state="+str(state_code[0])+ \
                        "&county="+str(county_code[0])+"&site="+str(site_code[1]))
    LA_North_Main.append("https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
                         "&param="+str(parameters[i])+"&bdate="+str(date[0])+"&edate="+str(date[1])+"&state="+str(state_code[0])+ \
                         "&county="+str(county_code[0])+"&site="+str(site_code[2]))
    LA_LAX.append("https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
                  "&param="+str(parameters[i])+"&bdate="+str(date[0])+"&edate="+str(date[1])+"&state="+str(state_code[0])+ \
                  "&county="+str(county_code[0])+"&site="+str(site_code[3]))
    LA_Glendora.append("https://aqs.epa.gov/data/api/sampleData/bySite?email="+str(my_email)+"&key="+str(my_key)+ \
                       "&param="+str(parameters[i])+"&bdate="+str(date[0])+"&edate="+str(date[1])+"&state="+str(state_code[0])+ \
                       "&county="+str(county_code[0])+"&site="+str(site_code[4]))

In [None]:
# retrieve all species at each site in LA in 2019
LA_Compton_results = [get_AQS_data(link) for link in LA_Compton]
LA_Lancaster_results = [get_AQS_data(link) for link in LA_Lancaster]
LA_North_Main_results = [get_AQS_data(link) for link in LA_North_Main]
LA_LAX_results = [get_AQS_data(link) for link in LA_LAX]
LA_Glendora_results = [get_AQS_data(link) for link in LA_Glendora]

In [None]:
# check the results and save them
# if there is data, measurements during the sampling period are returned as a list of dictionaries
# if there is no data, the "Header" with the status message is kept instead,
# and its length will be "1"

# now build a function to convert a list of dictionaries to a pandas dataframe
def save_raw_AQS_results_to_df(raw_AQS_data_results):
    """For each request, the API returns measurements of one species at one site during one sampling period.
       Results are returned as a list of dictionaries. This function converts the results from each request 
       to a single pandas dataframe.
    """
    # one single-row dataframe per record, then stack them
    test_data = [pd.DataFrame([data]) for data in raw_AQS_data_results]
    test_df = pd.concat(test_data,ignore_index=True)
    return test_df

# now convert results to pandas dataframes if there is data

# 1> Compton site
LA_Compton_df = []

for i in range(len(parameters)):
    if (len(LA_Compton_results[i]) > 1):
        LA_Compton_df.append(save_raw_AQS_results_to_df(LA_Compton_results[i]))
    else:
        LA_Compton_df.append("There is no observation for "+str(species[i]))

# 2> Lancaster site
LA_Lancaster_df = []

for i in range(len(parameters)):
    if (len(LA_Lancaster_results[i]) > 1):
        LA_Lancaster_df.append(save_raw_AQS_results_to_df(LA_Lancaster_results[i]))
    else:
        LA_Lancaster_df.append("There is no observation for "+str(species[i]))

# 3> North Main street site
LA_North_Main_df = []

for i in range(len(parameters)):
    if (len(LA_North_Main_results[i]) > 1):
        LA_North_Main_df.append(save_raw_AQS_results_to_df(LA_North_Main_results[i]))
    else:
        LA_North_Main_df.append("There is no observation for "+str(species[i]))

# 4> LAX site
LA_LAX_df = []

for i in range(len(parameters)):
    if (len(LA_LAX_results[i]) > 1):
        LA_LAX_df.append(save_raw_AQS_results_to_df(LA_LAX_results[i]))
    else:
        LA_LAX_df.append("There is no observation for "+str(species[i]))

# 5> Glendora site
LA_Glendora_df = []

for i in range(len(parameters)):
    if (len(LA_Glendora_results[i]) > 1):
        LA_Glendora_df.append(save_raw_AQS_results_to_df(LA_Glendora_results[i]))
    else:
        LA_Glendora_df.append("There is no observation for "+str(species[i]))

In [None]:
###############################################################################
# output the results as csv files with unique informative names

site_dfs = {"LA_Compton": LA_Compton_df,
            "LA_Lancaster": LA_Lancaster_df,
            "LA_North_Main": LA_North_Main_df,
            "LA_LAX": LA_LAX_df,
            "LA_Glendora": LA_Glendora_df}

for site_name, site_df in site_dfs.items():
    for i in range(len(parameters)):
        # only dataframes can be saved; entries without data hold a message string instead
        if isinstance(site_df[i], pd.DataFrame):
            site_df[i].to_csv(site_name+"_2019_"+str(species[i])+".csv")

## The focus here is on developing a web scraping tool. You can rewrite the code to take arguments, which makes repeated tasks more efficient. Arguments can pass in information such as the "parameter code", "state code", "county code" and "site code". Check out ["argparse"](https://docs.python.org/3/library/argparse.html) for more information.
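As a minimal sketch of the idea, the codes could be read with `argparse` as follows. The option names (`--param`, `--state`, `--county`, `--site`, `--bdate`, `--edate`) and the default dates are assumptions chosen for illustration; they are not defined anywhere in this notebook.

```python
import argparse

def parse_args(argv=None):
    """Read the codes needed for one AQS request from the command line."""
    parser = argparse.ArgumentParser(description="Download AQS sample data for one site")
    parser.add_argument("--param", required=True, help="parameter code, e.g. 42602 for NO2")
    parser.add_argument("--state", required=True, help="state code, e.g. 06")
    parser.add_argument("--county", required=True, help="county code, e.g. 037")
    parser.add_argument("--site", required=True, help="site code, e.g. 2005")
    parser.add_argument("--bdate", default="20190101", help="begin date (YYYYMMDD)")
    parser.add_argument("--edate", default="20191231", help="end date (YYYYMMDD)")
    return parser.parse_args(argv)

# passing the list explicitly lets the sketch run inside a notebook;
# in a script, parse_args() would read the real command line instead
args = parse_args(["--param", "42602", "--state", "06", "--county", "037", "--site", "2005"])
print(args.param, args.state, args.county, args.site, args.bdate, args.edate)
```

`argparse` keeps the values as strings by default, so codes with leading zeros such as `06` and `037` are passed on unchanged.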