### Focusing on: https://aqs.epa.gov/aqsweb/documents/data_api.html  

Nice because the output is in JSON.

RATE LIMITING:
The API has the following limits imposed on request size:

* Length of time. All services (except Monitor) must have the end date (edate field) be in the same year as the begin date (bdate field).
* Number of parameters. Most services allow for the selection of multiple parameter codes (param field). A maximum of 5 parameter codes may be listed in a single request.
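
Because bdate and edate must fall in the same year, a multi-year range has to be broken into one request per calendar year. A minimal sketch of that split, assuming a hypothetical helper name `split_by_year` (it is not part of the API):

```python
from datetime import date

def split_by_year(start, end):
    """Split [start, end] into (bdate, edate) strings that each stay within one calendar year."""
    if start > end:
        raise ValueError("start must not be after end")
    ranges = []
    for year in range(start.year, end.year + 1):
        # Clamp each chunk to the requested range and to the calendar year
        chunk_start = max(start, date(year, 1, 1))
        chunk_end = min(end, date(year, 12, 31))
        ranges.append((chunk_start.strftime("%Y%m%d"), chunk_end.strftime("%Y%m%d")))
    return ranges

print(split_by_year(date(2020, 6, 15), date(2022, 2, 1)))
# [('20200615', '20201231'), ('20210101', '20211231'), ('20220101', '20220201')]
```

Each tuple can be passed straight into the bdate and edate fields of a request.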

Please adhere to the following when using the API.
* Limit the size of queries. Our database contains billions of values and you may request more than you intend. If you are unsure of the amount of data, start small and work your way up. We request that you limit queries to 1,000,000 rows of data each. You can use the "observation count" field on the annualData service to determine how much data exists for a time-parameter-geography combination. If you have any questions or need advice, please contact us.
* Limit the frequency of queries. Our system can process a limited load. If scripting requests, please wait for one request to complete before submitting another and do not make more than 10 requests per minute. Also, we request a pause of 5 seconds between requests and adjust accordingly for response time and size.
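
One way to honour the frequency guidance in a script is a small pacing helper. The sketch below is an assumption, not something the API provides: the class name `Throttle` and the 6 second default are illustrative. A 6 second pause after each completed request keeps a loop under 10 requests per minute and above the requested 5 second gap.

```python
import time

class Throttle:
    """Keep a minimum pause between the end of one request and the start of the next."""

    def __init__(self, min_pause=6.0, clock=time.monotonic, sleep=time.sleep):
        self.min_pause = min_pause
        self._clock = clock      # injectable so the helper can be tested without waiting
        self._sleep = sleep
        self._last_done = None   # time the previous request finished

    def wait(self):
        """Block until at least min_pause seconds have passed since done() was last called."""
        if self._last_done is None:
            return
        remaining = self.min_pause - (self._clock() - self._last_done)
        if remaining > 0:
            self._sleep(remaining)

    def done(self):
        """Record that a request has just completed."""
        self._last_done = self._clock()
```

In a loop, call `throttle.wait()` before each request and `throttle.done()` once the response has arrived, so that only one request is in flight at a time.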

In [None]:
import requests
import requests_cache
import html

# requires ipykernel ~ for this specific environment

In [None]:
# Troubleshooting

import os
print(os.getcwd())

# List files in the current directory to ensure personal.py is present
print(os.listdir())

# Identified issues with a stale __pycache__ directory ~ solved by removing it

In [None]:
# Creating the Cache
session = requests_cache.CachedSession('EPA_air_quality')

# Install cache globally
# requests_cache.install_cache('EPA_air_quality')

In [None]:
from personal import email

In [None]:
# https://aqs.epa.gov/data/api/signup?email=myemail@example.com
# Sending Access (signup token) to email

endpoint = "https://aqs.epa.gov/data/api/signup"
param = {"email" : email}

response = session.get(endpoint, params=param)
response.raise_for_status()


In [None]:
from personal import EPA_API_KEY

Relevant Packages to Add onto the project  
**Will add packages as necessary ~ not installing the entire redundancy into the environment yet**

### What are the relevant endpoints

* __list/__ for internal values or codes (e.g. list/states, list/countiesByState)
* __monitors/__ for operational information about the samplers (monitors) used to collect the data. Includes identifying information, operational dates, operating organizations
* __sampleData/__

DATA:
* __dailyData/__
* __quarterlyData/__
* __annualData/__
* __qaAnnualPerformanceEvaluations/__ pairs of data (known and measured values) at several concentration levels for gaseous criteria pollutants
* __qaCollocatedAssessments/__ collocated assessment data (paired samples collected at the same site)
* __qaFlowRateVerifications/__ flow rate checks performed by monitoring agencies
* __qaFlowRateAudits/__ flow rate audit data
* __qaOnePointQcRawData/__ measured versus actual concentration of 1 point QC checks
* __qaPepAudits/__ data related to PM2.5 monitoring system audits
* __transactionsSample/__ sample data in the submission (transaction) format for AQS.
* __transactionsQaAnnualPerformanceEvaluations/__ pairs of data QA at several concentration levels in the submission (transaction) format for AQS

~ blank samples?


## Identification of Relevant Data to Extract

In [None]:
# endpoint list/states with parameters email and key

state_endpoint = "https://aqs.epa.gov/data/api/list/states"
param = {"email" : email, "key" : EPA_API_KEY}

states = session.get(state_endpoint, params = param)
states.raise_for_status()
# going further list/countiesByState	

In [None]:
import pandas as pd

states_data = states.json()['Data']
states_data

state_df = pd.DataFrame(states_data)

In [None]:
state_df
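
The state list may contain entries whose code is not a two-digit number, and those cannot be used as a FIPS state filter. A minimal sketch for screening them out before looping; the function name and the sample rows are hypothetical stand-ins for `states.json()['Data']`:

```python
def numeric_state_codes(rows):
    """Keep only entries whose 'code' is a two-digit numeric string."""
    return [
        row for row in rows
        if isinstance(row.get("code"), str)
        and row["code"].isdigit()
        and len(row["code"]) == 2
    ]

# Hypothetical sample shaped like the list/states response
sample = [
    {"code": "01", "value_represented": "Alabama"},
    {"code": "06", "value_represented": "California"},
    {"code": "CC", "value_represented": "Canada"},
]
print(numeric_state_codes(sample))
```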

### Suggested Codes:  
_____
* PM2.5 (88101) for sure, since it's the most crucial for health impacts  
* Carbon Monoxide (42101) - direct byproduct of combustion and wildfires  
* Nitrogen Dioxide (42602) - common in fire affected regions and contributes to respiratory issues  
* Carbon Dioxide (42102)
* Ozone (44201) -  wildfire emissions interacting with sunlight, leading to smog formation  
* Maybe PM10 (81102) - coarser particulate matter that contributes to haze and visibility issues  
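
The list above has six codes, while a single request accepts at most five. A small sketch of batching the codes into comma separated `param` strings; the names `SUGGESTED_CODES` and `param_batches` are illustrative only:

```python
SUGGESTED_CODES = {
    "88101": "PM2.5",
    "42101": "Carbon Monoxide",
    "42602": "Nitrogen Dioxide",
    "42102": "Carbon Dioxide",
    "44201": "Ozone",
    "81102": "PM10",
}

def param_batches(codes, max_per_request=5):
    """Group parameter codes into comma separated strings of at most max_per_request codes."""
    codes = list(codes)
    return [
        ",".join(codes[i:i + max_per_request])
        for i in range(0, len(codes), max_per_request)
    ]

print(param_batches(SUGGESTED_CODES))
# ['88101,42101,42602,42102,44201', '81102']
```

Including PM10 would therefore need a second request per state and date range.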

## Obtaining Quarterly Data w/ quarterlyData/byState endpoint
  
parameters: email, key, param, bdate, edate, state  
(date format: YYYYMMDD) (param: AQS codes ~ comma separated list of 5 digit codes) (state: 2 digit FIPS)  
~ data supposedly begins in 1980  


In [None]:
# example to identify structure of data
endpoint = "https://aqs.epa.gov/data/api/quarterlyData/byState"
param = {
    "email": email,
    "key": EPA_API_KEY,
    "state": "22",  # Louisiana
    "param": "88101,42101,42602,42102,44201",
    "bdate": "20100101",  # Start date: January 1, 2010
    "edate": "20101231"   # End date: December 31, 2010 (within one year)
}  # extracting quarterly summary data for Louisiana for 2010

response = session.get(endpoint, params = param)
response.raise_for_status()
test_data = response.json()

In [None]:
len(test_data['Data'])  # 4 quarters x 5 parameters ~ yet far more records come back, which is questionable

In [None]:
test_data['Data'][1]

The identified parameters are listed in quarterly-data-structure.txt
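
To see why the record count exceeds quarters times parameters, it helps to count records per parameter and quarter. The sketch below assumes (unverified here) that the extra rows come from multiple monitoring sites, sample durations, or pollutant standards; the sample records are made up for illustration:

```python
from collections import Counter

def count_by_parameter_and_quarter(records):
    """Count how many records share each (parameter_code, quarter) pair."""
    return Counter((r.get("parameter_code"), r.get("quarter")) for r in records)

# Hypothetical records shaped like test_data['Data']
sample_records = [
    {"parameter_code": "88101", "quarter": "1", "latitude": 30.4},
    {"parameter_code": "88101", "quarter": "1", "latitude": 29.9},
    {"parameter_code": "44201", "quarter": "1", "latitude": 30.4},
]

counts = count_by_parameter_and_quarter(sample_records)
for (code, quarter), n in sorted(counts.items()):
    print(code, quarter, n)
```

Running the same function on `test_data['Data']` would show which combinations are repeated.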

In [None]:
def get_quarterly_data_by_state(state_code, bdate, edate):
    '''
    Structuring the get request; utilizing the cache ~ set baseline for most of the parameters
    '''
    # Define the endpoint and parameters
    endpoint = "https://aqs.epa.gov/data/api/quarterlyData/byState"
    param = {
        "email": email,
        "key": EPA_API_KEY,
        "state": state_code,
        "param": "88101,42101,42602,42102,44201",
        "bdate":  bdate, # ~ YYYYMMDD
        "edate":  edate, # ~ YYYYMMDD
    }

    # Make the API request
    response = session.get(endpoint, params=param)
    response.raise_for_status()

    # Process the response
    data = response.json()
    return data


def collect_data(dataframe, bdate, edate):
    '''
    ### DEFUNCT (superseded by data_extract)
    Helper function to execute get_quarterly_data_by_state over each state
    
    Returns a list of dictionaries, each containing the data for a state and the value represented
    '''
    repository = []
    for index, row in dataframe.iterrows():
        # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html
        state_code = row['code']
        value_rep = row['value_represented']
        # bdate and edate are required arguments of get_quarterly_data_by_state
        data = get_quarterly_data_by_state(state_code, bdate, edate)
        repository.append({'data': data, 'value_rep': value_rep})
    return repository

# Function to extract specific fields
def extract_fields(data):
    '''
    Function to extract specific fields from the API response
    '''
    extracted_data = []
    for record in data['Data']:
        extracted_record = {
            'state_code': record.get('state_code'),
            'parameter_code': record.get('parameter_code'),
            'latitude': record.get('latitude'),
            'longitude': record.get('longitude'),
            'parameter': record.get('parameter'),
            'sample_duration': record.get('sample_duration'),
            'pollutant_standard': record.get('pollutant_standard'),
            'year': record.get('year'),
            'quarter': record.get('quarter'),
            'observation_percent': record.get('observation_percent'),
            'arithmetic_mean': record.get('arithmetic_mean'),
            'maximum_value': record.get('maximum_value'),
            'quarterly_criteria_met': record.get('quarterly_criteria_met'),
            'monitoring_agency': record.get('monitoring_agency'),
            'state': record.get('state')
        }
        extracted_data.append(extracted_record)
    return extracted_data

In [None]:
import requests
import pandas as pd
from datetime import datetime, timedelta
import time

def data_extract(dataframe):
    '''
    Helper function to iterate over a state_code dataframe and get data for each state
    
    Returns a list of dictionaries, each containing the data for a state and the value represented
    
    [
    {
        'value_rep': 'Alabama',  # Example value represented
        'data': [  # List of extracted records for the state
            {
                'state_code': '01',
                ...
            },
            # More records...
        ]
    },
    # More states...
    ]
    '''
    repository = []
    for index, row in dataframe.iterrows():
        state_code = row['code']
        value_rep = row['value_represented']
        
        # Skip rows where state_code is not a two-digit numeric value
        if not (state_code.isdigit() and len(state_code) == 2): # error handling
            continue
        
        state_entry = next((item for item in repository if item['value_rep'] == value_rep), None)
        if state_entry is None:
            state_entry = {'value_rep': value_rep, 'data': []}
            repository.append(state_entry)
        for year in range(2020, 2025): # iterate over the last 5 years
            bdate = f"{year}0101"  # January 1st of the year
            edate = f"{year}1231"  # December 31st of the year
            data = get_quarterly_data_by_state(state_code, bdate, edate)
            extracted_data = extract_fields(data)
            state_entry['data'].extend(extracted_data)
            time.sleep(6)  # Pause ~6 s between requests (EPA guidance: at most 10 requests/minute, ~5 s pause)
    return repository
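
Since the session is cached, repeated runs mostly hit the local cache, and sleeping after those adds delay without protecting the API. A hedged sketch of pausing only after real network calls: it relies on the `from_cache` attribute that requests_cache is understood to set on responses, and the helper name is hypothetical.

```python
import time
from types import SimpleNamespace

def pause_unless_cached(response, seconds=6.0, sleep=time.sleep):
    """Sleep only when the response came from the network rather than the cache.

    Returns True if a pause was taken, False otherwise.
    """
    if getattr(response, "from_cache", False):
        return False
    sleep(seconds)
    return True

# Stand-in objects for illustration; a real call would pass the result of session.get(...)
cached = SimpleNamespace(from_cache=True)
fresh = SimpleNamespace(from_cache=False)
print(pause_unless_cached(cached))              # no sleep for a cached response
print(pause_unless_cached(fresh, seconds=0.1))  # short pause for a network response
```

To use this in the loop, the fetch function would need to return the response object (or its `from_cache` flag) alongside the parsed JSON.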

In [None]:
### Function Call

my_data = data_extract(state_df)

In [None]:
import json
import gzip

def export_compressed_data(my_data, filename):
    with gzip.open(filename, 'wt', encoding='utf-8') as f:
        json.dump(my_data, f)

# Call the function to export my_data
export_compressed_data(my_data, 'data/Raw_EPA_AQI_Quarterly(by_State_Last_5yrs).gz')
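
Reading the export back is the mirror image of writing it. A minimal sketch, with `load_compressed_data` as an assumed helper name and a temporary file standing in for the real export:

```python
import gzip
import json
import os
import tempfile

def load_compressed_data(filename):
    """Load JSON that was written with gzip.open(..., 'wt') and json.dump."""
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        return json.load(f)

# Round trip on a small made-up sample
sample = [{"value_rep": "Alabama", "data": [{"state_code": "01", "year": 2020}]}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f)
    restored = load_compressed_data(path)

print(restored == sample)
```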

In [None]:
def extract_metadata(my_data):
    for state_data in my_data:
        value_rep = state_data['value_rep']
        data_records = state_data['data']
        
        # Print metadata for each state
        print(f"Metadata for {value_rep}:")
        for record in data_records:
            state_code = record.get('state_code')
            parameter_code = record.get('parameter_code')
            year = record.get('year')
            quarter = record.get('quarter')
            monitoring_agency = record.get('monitoring_agency')
            
            # Print the extracted metadata
            print(f"  State Code: {state_code}")
            print(f"  Parameter Code: {parameter_code}")
            print(f"  Year: {year}")
            print(f"  Quarter: {quarter}")
            print(f"  Monitoring Agency: {monitoring_agency}")
            print("  ---")

# Call the function to extract and print metadata
extract_metadata(my_data)

## Retrieve Daily Data for Just California
* same parameters  
* endpoint: dailyData/byState
* state code for California is 06

In [None]:
# example to identify structure of data
endpoint = "https://aqs.epa.gov/data/api/dailyData/byState"
param = {
    "email": email,
    "key": EPA_API_KEY,
    "state": "22",  # Louisiana
    "param": "88101,42101,42602,42102,44201",
    "bdate": "20100101",  # Start date: January 1, 2010
    "edate": "20101231"   # End date: December 31, 2010 (within one year)
}  # extracting daily data for Louisiana for 2010 (structure check only)

response = session.get(endpoint, params = param)
response.raise_for_status()
test_data = response.json()

In [None]:
test_data['Data'][1]

In [None]:
import time
import requests
from requests.exceptions import ConnectionError, HTTPError

def get_daily_data_by_state(state_code, bdate, edate):
    '''
    Structuring the get request; utilizing the cache ~ set baseline for most of the parameters
    '''
    # Define the endpoint and parameters
    endpoint = "https://aqs.epa.gov/data/api/dailyData/byState"
    param = {
        "email": email,
        "key": EPA_API_KEY,
        "state": state_code,
        "param": "88101,42101,42602,42102,44201",
        "bdate": bdate,  # ~ YYYYMMDD
        "edate": edate,  # ~ YYYYMMDD
    }

    # Make the API request with error handling
    for attempt in range(5):  # Retry up to 5 times
        try:
            response = session.get(endpoint, params=param)
            response.raise_for_status()
            data = response.json()
            return data
        except (ConnectionError, HTTPError) as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2)  # Wait for 2 seconds before retrying
    raise Exception("Failed to fetch data after 5 attempts")

# ERROR HANDLING

# require a new extract_fields function
# require a new data_extract function

def extract_fields2(data):
    '''
    Function to extract specific fields from the API response
    '''
    extracted_data = []
    for record in data['Data']:
        extracted_record = {
            'latitude': record.get('latitude'),
            'longitude': record.get('longitude'),
            'parameter': record.get('parameter'),
            'sample_duration': record.get('sample_duration'),
            'pollutant_standard': record.get('pollutant_standard'),
            'date_local': record.get('date_local'),
            'units_of_measure': record.get('units_of_measure'),
            'arithmetic_mean': record.get('arithmetic_mean'),
            'first_max_value': record.get('first_max_value'),
            'state': record.get('state'),
            'city': record.get('city')
        }
        extracted_data.append(extracted_record)
    return extracted_data

def singlestate_dailydata_extract(state_code, years):
    '''
    Helper function to extract daily data for a given state and list of years, one quarter per request
    
    Aggregates the data into a single repository list
    '''
    quarters = [
        ("0101", "0331"),  # Q1: January 1st to March 31st
        ("0401", "0630"),  # Q2: April 1st to June 30th
        ("0701", "0930"),  # Q3: July 1st to September 30th
        ("1001", "1231")   # Q4: October 1st to December 31st
    ]
    
    repository = []
    
    for year in years:
        for bdate_suffix, edate_suffix in quarters:
            bdate = f"{year}{bdate_suffix}"
            edate = f"{year}{edate_suffix}"
            data = get_daily_data_by_state(state_code, bdate, edate)
            extracted_data = extract_fields2(data)
            repository.extend(extracted_data)
            time.sleep(6)  # Pause ~6 s between requests (EPA guidance: at most 10 requests/minute, ~5 s pause)
            
    return repository
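
The retry loop in get_daily_data_by_state waits a fixed 2 seconds between attempts. An alternative, sketched here as an assumption rather than the notebook's method, is exponential backoff, which gives a struggling server progressively more room. The helper name and the flaky demo function are hypothetical:

```python
import time

def retry_with_backoff(func, attempts=5, base_delay=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {exc}; retrying in {delay} s")
            sleep(delay)

# Demo with a function that fails twice before succeeding (no real waiting)
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []
result = retry_with_backoff(flaky, exceptions=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

In the notebook, `func` could be a lambda wrapping the `session.get(...)` call together with `raise_for_status()`.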

In [None]:
# Define the state code for California
state_code = "06"
years = list(range(2013, 2025))  # 2013 through 2024


california_daily_data = singlestate_dailydata_extract(state_code, years)
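
Twelve years of daily records is a long flat list, so a first look usually means aggregating it. The sketch below averages `arithmetic_mean` per parameter and month. It assumes `date_local` is a `YYYY-MM-DD` string, which should be checked against a real record, and the sample values are made up:

```python
from collections import defaultdict

def monthly_means(records):
    """Average arithmetic_mean per (parameter, 'YYYY-MM'), skipping missing values."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for r in records:
        value, day = r.get("arithmetic_mean"), r.get("date_local")
        if value is None or not day:
            continue
        key = (r.get("parameter"), day[:7])  # assumes date_local starts with YYYY-MM
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

# Hypothetical records shaped like the output of extract_fields2
sample_daily = [
    {"parameter": "PM2.5 - Local Conditions", "date_local": "2020-08-01", "arithmetic_mean": 8.0},
    {"parameter": "PM2.5 - Local Conditions", "date_local": "2020-08-02", "arithmetic_mean": 12.0},
    {"parameter": "PM2.5 - Local Conditions", "date_local": "2020-09-01", "arithmetic_mean": None},
]
print(monthly_means(sample_daily))
```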
