# Data Science Tools 1 - Final Project

For reference, here are our answers from the first homework:

We aim to analyze the impact of wildfires on daily AQI trends in Colorado over the past decade, focusing on how these events influence air quality across the state. Using air quality data from the EPA’s Air Quality System (AQS), this project will identify temporal patterns and highlight the most affected regions. Special attention will be given to pollutants like PM2.5 and Ozone, which are highly sensitive to wildfire activity. By examining variations in air quality during wildfire seasons, we seek to uncover actionable insights into seasonal and regional pollution dynamics, contributing to a better understanding of the environmental impacts of wildfires in Colorado.

We will download the air quality data from the U.S. Environmental Protection Agency (EPA) Air Quality System (AQS) website using web scraping and the web API, which we expect to take about two weeks.  
Some of the data fields (attributes) are listed below.

* **State Code** - The FIPS code of the state in which the monitor resides (CO = 08).
* **County Code** - The FIPS code of the county in which the monitor resides.
* **Site Num** - A unique number within the county identifying the site.
* **Date Local** - The calendar date for the summary. All daily summaries are for the local standard day (midnight to midnight) at the monitor.
* **Latitude and Longitude** - Geographic location of the monitoring station.
* **AQI** - The Air Quality Index value for the day.
* **Parameter Code** - The AQS code corresponding to the parameter measured by the monitor.
* **Parameter Name** - The name or description assigned in AQS to the parameter measured by the monitor. Parameters may be pollutants or non-pollutants (e.g. PM2.5, Ozone, etc.).
* **Pollutant Standard** - Specifies the ambient air quality standard rules used to aggregate statistics (calculate AQI).
* **Units of Measure** - The unit of measure for the parameter. QAD always returns data in the standard units for the parameter. Submitters can report data in any unit and the EPA converts it to a standard unit so we may use the data in calculations.

Noise could come from outliers, missing data, invalid values, or misspellings. So far, we have noticed that not every air monitoring station has a record for every day of the year. Also, as the EPA points out, some stations collect multiple types of air quality data (i.e., Ozone, CO, NO2, PM2.5, and PM10), while others collect only one. These will be challenges when we pre-process the data.

Noise in the dataset arises from missing data, as not all monitoring stations record data daily. Additionally, stations often measure different pollutants, leading to inconsistencies when comparing across locations. Outliers, such as unusually high AQI values, may occur due to equipment errors or localized events unrelated to wildfires. Finally, geographic bias exists as monitoring stations are often concentrated in urban areas, potentially underrepresenting rural regions where wildfires are more likely to occur.

Feature engineering opportunities include extracting wildfire season information by identifying months typically associated with wildfire activity (e.g., June–September). Temporal features, such as monthly or yearly aggregates and rolling averages, can be created to smooth short-term fluctuations and highlight seasonal trends. Pollutant-specific transformations, such as calculating ratios (e.g., PM2.5/AQI) or isolating key wildfire-related pollutants, will help identify their contributions to air quality degradation. Geospatial aggregation by county or monitoring station will allow for the identification of regions most affected by wildfires. Additionally, AQI values can be re-coded into categorical levels (e.g., “Good,” “Moderate,” “Unhealthy”) to better communicate air quality patterns during wildfire events.
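
As a rough illustration of the features described above, the sketch below builds a season flag, a rolling average, a monthly aggregate, and AQI categories with pandas. The toy dataframe, its values, the placeholder county code, the 7-day window, and the helper name `aqi_category` are all assumptions made for this example; the column names follow the daily-summary fields listed earlier, and the category bands are the standard EPA AQI breakpoints.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are random and only the column names
# mirror the daily-summary fields listed earlier.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-05-01", "2023-10-31", freq="D")
toy = pd.DataFrame({
    "Date Local": dates,
    "County Code": 13,  # placeholder county code
    "AQI": rng.integers(20, 180, size=len(dates)),
})

# Wildfire-season flag, assuming June-September as suggested in the text.
toy["wildfire_season"] = toy["Date Local"].dt.month.between(6, 9)

# 7-day rolling mean per county to smooth short-term fluctuations
# (the window length is an arbitrary choice for this sketch).
toy = toy.sort_values(["County Code", "Date Local"])
toy["AQI_7d_mean"] = toy.groupby("County Code")["AQI"].transform(
    lambda s: s.rolling(7, min_periods=1).mean()
)

# Monthly mean AQI per county.
monthly = (
    toy.groupby(["County Code", toy["Date Local"].dt.to_period("M")])["AQI"]
    .mean()
    .rename("AQI_monthly_mean")
)

# Re-code AQI into the standard EPA category bands.
AQI_BINS = [-1, 50, 100, 150, 200, 300, np.inf]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi):
    """Map numeric AQI values to EPA category labels (hypothetical helper)."""
    return pd.cut(aqi, bins=AQI_BINS, labels=AQI_LABELS)


toy["AQI_category"] = aqi_category(toy["AQI"])
```

Grouping by county before taking the rolling mean keeps one station group from bleeding into another once real data with several counties is used.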

## Package and Path Management



Add the proper path variables to the Python environment so we can pull in source files and save data and figures.

In [13]:
import os, sys
sys.path.append(os.path.join(os.path.abspath(''), '../src'))
sys.path.append(os.path.join(os.path.abspath(''), '../data'))
sys.path.append(os.path.join(os.path.abspath(''), '../figures'))

* **Pandas** - Data manipulation and analysis
* **NumPy** - Mathematical functions
* **Matplotlib** - Data visualization
* **Seaborn** - Data visualization
* **OS** - Operating system dependent functionality
* **Datetime** - Date and time manipulation
* **Requests** - HTTP library
* **Logging** - Logging facility for Python
* **Time** - Time access and conversions

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from AQS_API import AirQualityCollector

## Data Collection

Before describing the data source and collection process, we should explain the date range used for the analysis. The API `.py` code defaults to March 23 - Sept 23; we have not decided whether to change that, but we can justify it here with historical background (e.g., a bad year for forest fires).

In [15]:
# Month range of interest (rendered as YYYYMM in the output filename)
start_date = datetime(2023, 3, 1)
end_date = datetime(2023, 4, 1)

Background on EPA AQI data and API system for how we're pulling it.

We'll check first to see if we already have a dataset downloaded before we go through the process of downloading it again.

In [17]:
# Check if file already exists
filename = (f"../data/Colorado_AQI_{start_date.strftime('%Y%m')}_{end_date.strftime('%Y%m')}.csv")
if os.path.exists(filename):
    print(f"File {filename} already present.")
else:
    # Hard-coding the API key here which we can fix later, just a quick band-aid for notebook purposes
    API_KEY = "544ED264-55E3-4422-94CA-406B625CFF54"
    collector = AirQualityCollector(api_key=API_KEY, start_date=start_date, end_date=end_date)
    collector.collect_data()
    print(f"File {filename} downloaded.")

Data collection finished. Check air_quality_data.log for details.
File ../data/Colorado_AQI_202303_202304.csv downloaded.
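
One possible way to replace the hard-coded key is to read it from an environment variable. This is only a sketch: the variable name `AQS_API_KEY` and the helper `load_api_key` are our own hypothetical choices, not part of `AQS_API`.

```python
import os


def load_api_key(env_var="AQS_API_KEY"):
    """Return the API key stored in `env_var` (a hypothetical variable name).

    Raises RuntimeError with a readable message if the variable is unset or empty.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before collecting data."
        )
    return key
```

In the download cell, `API_KEY = load_api_key()` would then stand in for the literal string, and the key would stay out of the notebook and out of version control.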


## Data Cleaning

We can load the downloaded CSV into a dataframe and start poking around.

In [11]:
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Latitude,Longitude,UTC,Parameter,Unit,Value,RawConcentration,AQI,Category,SiteName,AgencyName,FullAQSCode,IntlAQSCode
0,40.086941,-108.761002,2023-03-01T00:00,OZONE,PPB,44.0,48.0,41,1,"Rangely, CO",National Park Service,81030006,840081030006
1,40.086941,-108.761002,2023-03-01T00:00,PM2.5,UG/M3,2.3,4.0,13,1,"Rangely, CO",National Park Service,81030006,840081030006
2,37.200056,-108.733111,2023-03-01T00:00,OZONE,PPB,43.0,42.0,40,1,Towaoc,Quality Review and Exchange System for Tribes ...,840080838001,840080838001
3,37.34997,-108.58737,2023-03-01T00:00,OZONE,PPB,36.0,38.0,33,1,Cortez Ozone,Colorado Department of Public Health and Envir...,80830006,840080830006
4,39.063599,-108.561096,2023-03-01T00:00,PM2.5,UG/M3,1.6,1.6,9,1,Grand Junction - Powell Building,Colorado Department of Public Health and Envir...,80770017,840080770017
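
The `Unnamed: 0` column above is the dataframe index that was written into the CSV, and `UTC` loads as plain text. A small sketch of a tidier load, using an inline sample instead of the real file: the sample keeps only a subset of the columns shown above, and that subset is our simplification.

```python
import io

import pandas as pd

# Inline sample with a subset of the columns shown above (a simplification);
# the empty first header field is what pandas displays as "Unnamed: 0".
sample_csv = io.StringIO(
    ",Latitude,Longitude,UTC,Parameter,Unit,Value,AQI,SiteName\n"
    '0,40.086941,-108.761002,2023-03-01T00:00,OZONE,PPB,44.0,41,"Rangely, CO"\n'
    '1,40.086941,-108.761002,2023-03-01T00:00,PM2.5,UG/M3,2.3,13,"Rangely, CO"\n'
    "2,37.34997,-108.58737,2023-03-01T00:00,OZONE,PPB,36.0,33,Cortez Ozone\n"
)

# index_col=0 reuses the saved index instead of creating a new column;
# parse_dates turns the UTC strings into datetime64 values.
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=["UTC"])

print(sample.dtypes)
```

The same two arguments should carry over to `pd.read_csv(filename, ...)` for the downloaded file, assuming it has the layout shown in the output above.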


In [12]:
# Check for missing values
missing = df.isnull().sum()
missing = missing[missing > 0]
missing

Series([], dtype: int64)
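
An empty result here only means that no cell is null; it does not show whether every station reported at every timestamp, which is the gap we noted earlier. A minimal sketch of a coverage check on hypothetical toy data follows. The column names match the dataframe above, but the sites, the values, and the hourly cadence are assumptions.

```python
import pandas as pd

# Hypothetical toy data: site A reports all six hours, site B only four.
hours = pd.date_range("2023-03-01", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    "SiteName": ["A"] * 6 + ["B"] * 4,
    "Parameter": "PM2.5",
    "UTC": list(hours) + list(hours[:4]),
    "AQI": [10, 12, 11, 13, 15, 14, 40, 42, 41, 39],
})

# Fraction of all observed timestamps that each site/parameter pair covers.
expected = toy["UTC"].nunique()
coverage = (
    toy.groupby(["SiteName", "Parameter"])["UTC"]
    .nunique()
    .div(expected)
    .rename("coverage")
    .reset_index()
)

print(coverage)
```

Pairs with low coverage could then be dropped or handled separately before comparing stations.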