# <center>US pollution analysis</center>

### Introduction

The US Pollution Data project leverages a comprehensive dataset of air pollution measurements across the United States. It consists of over 1.4 million observations and around 28 variables that record concentrations of major pollutants such as nitrogen dioxide (NO₂), sulfur dioxide (SO₂), carbon monoxide (CO), and ozone (O₃).


This dataset spans several years and states, allowing detailed spatio-temporal analysis of pollutant levels, seasonal trends, and geographic variation. The primary objective of this project is to transform this raw data into clean, analysis-ready formats (e.g., Parquet), conduct exploratory data analysis (EDA) to uncover patterns and insights, and, where possible, apply machine learning techniques to forecast pollutant concentrations, classify air-quality levels, or identify key factors driving pollution.

Given the public health importance of air quality, this project has the potential not only to improve our understanding of pollution trends in the U.S., but also to inform policy, raise awareness, or support predictive systems that warn populations about deteriorating air conditions.

The goal of this project is to:

* Clean and transform the raw data (e.g., into Parquet)
* Perform EDA to uncover trends, seasonal patterns, and geographic variation
* Apply ML techniques to forecast pollutant levels, classify air quality, and analyze feature importance

By improving our understanding of air quality trends, this project supports public health insights and data-driven policy decisions.

Dataset content:
* State Code: Numeric code representing the U.S. state
* County Code: Numeric code for the county within the state
* Site Num: Identifier for the air monitoring site
* Address: Street address of the monitoring station
* State: Full name of the U.S. state
* County: Name of the county
* City: City where the measurement site is located
* Date Local: Date of the observation (YYYY-MM-DD)
* NO2 Units: Units used for nitrogen dioxide measurements
* NO2 Mean: Daily average NO₂ concentration
* NO2 1st Max Value: Highest NO₂ value recorded that day
* NO2 1st Max Hour: Hour when the highest NO₂ was recorded
* NO2 AQI: Air Quality Index for NO₂ on that day
* O3 Units: Units used for ozone measurements
* O3 Mean: Daily average ozone concentration
* O3 1st Max Value: Highest O₃ value recorded that day
* O3 1st Max Hour: Hour when the highest O₃ was recorded
* O3 AQI: Air Quality Index for O₃ on that day
* SO2 Units: Units used for sulfur dioxide measurements
* SO2 Mean: Daily average SO₂ concentration
* SO2 1st Max Value: Highest SO₂ value recorded that day
* SO2 1st Max Hour: Hour when the highest SO₂ was recorded
* SO2 AQI: Air Quality Index for SO₂ on that day
* CO Units: Units used for carbon monoxide measurements
* CO Mean: Daily average CO concentration
* CO 1st Max Value: Highest CO value recorded that day
* CO 1st Max Hour: Hour when the highest CO was recorded
* CO AQI: Air Quality Index for CO on that day

---

### 1. EDA and Initial data visualisation

First, all necessary libraries are imported.

In [1]:
import os                          #import os for operating system interactions
import pandas as pd                 #import Pandas for data manipulation
import numpy as np                  #import Numpy for numerical operations
import matplotlib.pyplot as plt     #import Matplotlib for data visualization
import seaborn as sns               #import Seaborn for statistical data visualization
from plotly.subplots import make_subplots  #import Plotly subplots for creating complex figures
import plotly.express as px         #import Plotly Express for interactive visualizations
import plotly.graph_objects as go   #import Plotly Graph Objects for detailed figure customization


In [2]:
sns.set(style="whitegrid")                  # Set Seaborn style for plots
plt.rcParams["figure.figsize"] = (10,6)     # Set default figure size for Matplotlib plots

#### 1.1. ETL and EDA


This section covers the EDA, including data loading and cleaning. As a first step, the dataset is loaded into a DataFrame.

In [3]:
df = pd.read_parquet("../data/preprocessed/pollution_dataset_geocoded.parquet", engine="pyarrow")
df.head()

Unnamed: 0.1,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Latitude,Longitude
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,9.0,21,13.0,Parts per million,1.145833,4.2,21,,33.458426,-112.046574
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0,33.458426,-112.046574
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,6.6,23,,Parts per million,1.145833,4.2,21,,33.458426,-112.046574
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,6.6,23,,Parts per million,0.878947,2.2,23,25.0,33.458426,-112.046574
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,3.0,22,4.0,Parts per million,0.85,1.6,23,,33.458426,-112.046574


In [4]:
df.drop(columns=['Unnamed: 0'], inplace=True)  # Drop unnecessary column
df.head()

Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,NO2 Mean,...,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Latitude,Longitude
0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,19.041667,...,9.0,21,13.0,Parts per million,1.145833,4.2,21,,33.458426,-112.046574
1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,19.041667,...,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0,33.458426,-112.046574
2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,19.041667,...,6.6,23,,Parts per million,1.145833,4.2,21,,33.458426,-112.046574
3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,19.041667,...,6.6,23,,Parts per million,0.878947,2.2,23,25.0,33.458426,-112.046574
4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,22.958333,...,3.0,22,4.0,Parts per million,0.85,1.6,23,,33.458426,-112.046574


The following subsection performs an initial inspection of the dataset. The shape, summary info, and column data types of the DataFrame are shown.

In [5]:
print(df.shape)                     # Print the shape of the DataFrame
df.info()                           # Print a concise summary (info() prints directly and returns None)
print(df.dtypes)                    # Print data types of each column

(1746661, 30)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1746661 entries, 0 to 1746660
Data columns (total 30 columns):
 #   Column             Dtype  
---  ------             -----  
 0   State Code         int64  
 1   County Code        int64  
 2   Site Num           int64  
 3   Address            object 
 4   State              object 
 5   County             object 
 6   City               object 
 7   Date Local         object 
 8   NO2 Units          object 
 9   NO2 Mean           float64
 10  NO2 1st Max Value  float64
 11  NO2 1st Max Hour   int64  
 12  NO2 AQI            int64  
 13  O3 Units           object 
 14  O3 Mean            float64
 15  O3 1st Max Value   float64
 16  O3 1st Max Hour    int64  
 17  O3 AQI             int64  
 18  SO2 Units          object 
 19  SO2 Mean           float64
 20  SO2 1st Max Value  float64
 21  SO2 1st Max Hour   int64  
 22  SO2 AQI            float64
 23  CO Units           object 
 24  CO Mean            float64
 25  CO 1

In [6]:
df.dtypes.value_counts()         # Count occurrences of each data type

float64    12
int64       9
object      9
Name: count, dtype: int64

As shown above, the dataset consists of 1,746,661 entries and 30 columns: 12 float columns, 9 integer columns, and 9 object (string) columns.
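
Date Local is stored as an object (string) column, so any seasonal or yearly grouping will need a real datetime column first. The following is a minimal sketch on a toy frame, assuming the YYYY-MM-DD format given in the dataset description; the `sample` frame and the derived `Year` and `Month` columns are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative)
sample = pd.DataFrame({"Date Local": ["2000-01-01", "2000-01-02", "2011-04-30"]})

# Parse strings into datetime64; errors="raise" surfaces malformed dates early
sample["Date Local"] = pd.to_datetime(
    sample["Date Local"], format="%Y-%m-%d", errors="raise"
)

# Derive calendar features that are handy for seasonal grouping
sample["Year"] = sample["Date Local"].dt.year
sample["Month"] = sample["Date Local"].dt.month

print(sample.dtypes)
```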

In the next steps, the DataFrame is checked for inconsistencies (duplicates, missing values, etc.).

In [7]:
df.isnull().sum()           # Check for missing values in each column

State Code                0
County Code               0
Site Num                  0
Address                   0
State                     0
County                    0
City                      0
Date Local                0
NO2 Units                 0
NO2 Mean                  0
NO2 1st Max Value         0
NO2 1st Max Hour          0
NO2 AQI                   0
O3 Units                  0
O3 Mean                   0
O3 1st Max Value          0
O3 1st Max Hour           0
O3 AQI                    0
SO2 Units                 0
SO2 Mean                  0
SO2 1st Max Value         0
SO2 1st Max Hour          0
SO2 AQI              872907
CO Units                  0
CO Mean                   0
CO 1st Max Value          0
CO 1st Max Hour           0
CO AQI               873323
Latitude                  0
Longitude                 0
dtype: int64

The dataset is missing 872,907 and 873,323 values in the SO2 AQI and CO AQI columns respectively, roughly half of all rows in each case. That is too much data to discard by simply dropping rows. Instead, the missing values can be calculated, since the other columns have no missing values.
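
To put these counts in proportion, the share of missing values per column can be computed with `isna().mean()`. The sketch below uses a small toy frame with made-up values, then repeats the arithmetic with the counts reported above; `toy` and `missing_share` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real AQI columns
toy = pd.DataFrame({
    "SO2 AQI": [13.0, np.nan, 4.0, np.nan],
    "CO AQI": [np.nan, 25.0, np.nan, 25.0],
    "NO2 AQI": [46, 46, 34, 34],
})

# isna() gives booleans; their mean is the fraction of missing entries
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)

# Same calculation using the counts reported for the full dataset
total_rows = 1_746_661
so2_pct = 872_907 / total_rows * 100
co_pct = 873_323 / total_rows * 100
print(f"SO2 AQI missing: {so2_pct:.1f}%, CO AQI missing: {co_pct:.1f}%")
```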

The Air Quality Index (AQI) is calculated by converting measured pollutant concentrations (e.g., SO₂, CO, NO₂, O₃, PM₂.₅, PM₁₀) into a standardized scale (usually 0–500) using breakpoints.

Each pollutant gets its own AQI value, obtained by linear interpolation between the breakpoints of the interval that contains the measured concentration. The final AQI for a city or location is the highest of the individual pollutant values.
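
The interpolation applied by the AQI functions in this notebook can be written as follows, where $C$ is the measured concentration, $C_{low}$ and $C_{high}$ are the breakpoints of the interval containing $C$, and $I_{low}$ and $I_{high}$ are the AQI values assigned to those breakpoints:

$$
I = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} \,(C - C_{low}) + I_{low}
$$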

Each government sets its own concentration ranges for each pollutant.
Example (US EPA standard):

SO₂ 1-hour breakpoints (ppb)

| AQI Range | Category | SO₂ (ppb) |
|---|---|---|
| 0–50 | Good | 0–35 |
| 51–100 | Moderate | 36–75 |
| 101–150 | Unhealthy for Sensitive Groups | 76–185 |
| 151–200 | Unhealthy | 186–304 |
| 201–300 | Very Unhealthy | 305–604 |

CO 8-hour breakpoints (ppm)

| AQI Range | CO (ppm) |
|---|---|
| 0–50 | 0.0–4.4 |
| 51–100 | 4.5–9.4 |
| 101–150 | 9.5–12.4 |
| 151–200 | 12.5–15.4 |
| 201–300 | 15.5–30.4 |
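
As a worked example of how the breakpoint tables are used, the sketch below interpolates an AQI value with strict (unexpanded) intervals. The names `SO2_BREAKPOINTS`, `CO_BREAKPOINTS`, and `interpolate_aqi` are hypothetical helpers introduced only for illustration; the input values 9.0 ppb and 2.2 ppm are taken from the first rows of the dataset preview.

```python
# Breakpoint rows: (C_low, C_high, I_low, I_high), copied from the tables above
SO2_BREAKPOINTS = [(0, 35, 0, 50), (36, 75, 51, 100), (76, 185, 101, 150),
                   (186, 304, 151, 200), (305, 604, 201, 300)]
CO_BREAKPOINTS = [(0.0, 4.4, 0, 50), (4.5, 9.4, 51, 100), (9.5, 12.4, 101, 150),
                  (12.5, 15.4, 151, 200), (15.5, 30.4, 201, 300)]


def interpolate_aqi(conc, breakpoints):
    """Return the AQI for conc, or None if it lies outside every interval."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= conc <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (conc - c_low) + i_low
    return None


so2_example = interpolate_aqi(9.0, SO2_BREAKPOINTS)   # 9.0 ppb, first interval
co_example = interpolate_aqi(2.2, CO_BREAKPOINTS)     # 2.2 ppm, first interval
print(round(so2_example, 1), round(co_example, 1))
```

With these inputs the sketch yields about 12.9 for SO₂ and 25.0 for CO. A value such as 35.5 ppb falls between the strict integer bins and returns `None`, which is the kind of gap the expanded intervals in the functions further down are meant to close.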

Before calculating the SO₂ and CO AQIs, let's confirm that the relevant columns do not contain negative values, which are physically impossible.

In [8]:
neg_so2 = df[df["SO2 1st Max Value"] < 0][["SO2 Units", "SO2 Mean", "SO2 1st Max Value"]]
neg_co = df[df["CO Mean"] < 0][["CO Units", "CO Mean", "CO 1st Max Value"]]

print("Negative SO₂ values:")
print(neg_so2.head())

print("\nNegative CO values:")
print(neg_co.head())


Negative SO₂ values:
                 SO2 Units  SO2 Mean  SO2 1st Max Value
1004801  Parts per billion -0.428571               -0.3
1004802  Parts per billion -0.428571               -0.3
1091334  Parts per billion -0.391667               -0.3
1091335  Parts per billion -0.391667               -0.3
1091336  Parts per billion -0.375000               -0.3

Negative CO values:
                  CO Units   CO Mean  CO 1st Max Value
1116634  Parts per million -0.015385               0.0
1116636  Parts per million -0.015385               0.0
1116638  Parts per million -0.019048               0.1
1116639  Parts per million -0.009524               0.1
1116640  Parts per million -0.019048               0.1


In [9]:
print(len(neg_co))      # Number of rows with a negative CO Mean
print(len(neg_so2))     # Number of rows with a negative SO2 1st Max Value

1064
8286


There are 1,064 and 8,286 negative values in the CO Mean and SO2 1st Max Value columns, respectively. In this case, removing these rows from the dataset is the simplest and safest approach.

Negative pollution values are invalid: SO₂ and CO concentrations cannot be negative in reality. Such values typically come from:
* sensor malfunction
* data ingestion error
* interpolation issues

Removing these rows therefore loses very little valid information.

Only a small fraction of the data is affected. There are 1,746,661 total rows. Problematic rows:
* SO₂ negatives: 8,286
* CO negatives: 1,064
* Combined: at most 9,350 rows, about 0.5% of the data

Removing them has a negligible statistical impact on the AQI analysis.
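
The affected share can be checked with a few lines of arithmetic. This sketch only uses the counts reported above; the sum is an upper bound because a row could be negative in both columns.

```python
total_rows = 1_746_661          # rows before filtering (df.shape above)
neg_so2_rows = 8_286            # rows with a negative SO2 1st Max Value
neg_co_rows = 1_064             # rows with a negative CO Mean

# Upper bound: assumes no row is negative in both columns
upper_bound = neg_so2_rows + neg_co_rows
share_pct = upper_bound / total_rows * 100
print(f"at most {upper_bound} rows, {share_pct:.2f}% of the data")
```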

The invalid rows are deleted below.

In [10]:
df = df[(df["SO2 1st Max Value"] >= 0) & (df["CO Mean"] >= 0)]
print(df.shape)

(1737313, 30)


Below, the SO2 AQI column is calculated from the values in the SO2 1st Max Value column.

In [11]:

def calculate_so2_aqi(C):
    """
    Calculate SO2 AQI using expanded breakpoint intervals.
    This avoids NA values from strict EPA bins.
    
    C : float 
        1-hour SO2 concentration in ppb
    """

    if pd.isna(C):
        return np.nan

    # ===== Expanded breakpoints ensuring continuous coverage =====
    # Each interval runs up to (but excludes) the next lower breakpoint,
    # so fractional values such as 35.5 are not left without a bin.
    if 0 <= C < 36:
        Clow, Chigh = 0, 35
        Ilow, Ihigh = 0, 50

    elif 36 <= C < 76:
        Clow, Chigh = 36, 75
        Ilow, Ihigh = 51, 100

    elif 76 <= C < 186:
        Clow, Chigh = 76, 185
        Ilow, Ihigh = 101, 150

    elif 186 <= C < 305:
        Clow, Chigh = 186, 304
        Ilow, Ihigh = 151, 200

    elif 305 <= C < 605:
        Clow, Chigh = 305, 604
        Ilow, Ihigh = 201, 300

    else:
        # Negative or above the last breakpoint: no AQI is assigned
        return np.nan

    # ===== AQI Formula =====
    aqi = ((Ihigh - Ilow) / (Chigh - Clow)) * (C - Clow) + Ilow
    return round(aqi, 1)


In [12]:
# Function to add SO2 AQI column to DataFrame
def add_so2_aqi_column(df, col_name="SO2 1st Max Value"):
    """
    df : pandas dataframe  
    col_name : column containing SO2 1-hour max in ppb
    """
    df = df.copy()
    df["SO2 AQI"] = df[col_name].apply(calculate_so2_aqi)
    return df
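
Applying a Python function row by row with `.apply` can be slow on more than a million rows. One possible alternative, sketched below under the assumption that the same breakpoints and the same one-decimal rounding are wanted, looks up the interval for all values at once with `np.searchsorted`. The function `so2_aqi_vectorized` and the constant arrays are hypothetical names, not part of the notebook.

```python
import numpy as np

# Same SO2 breakpoints as calculate_so2_aqi, stored as arrays
C_LOW = np.array([0.0, 36.0, 76.0, 186.0, 305.0])
C_HIGH = np.array([35.0, 75.0, 185.0, 304.0, 604.0])
I_LOW = np.array([0.0, 51.0, 101.0, 151.0, 201.0])
I_HIGH = np.array([50.0, 100.0, 150.0, 200.0, 300.0])


def so2_aqi_vectorized(conc):
    """Vectorized SO2 AQI; NaN and values outside [0, 605) map to NaN."""
    conc = np.asarray(conc, dtype=float)
    valid = (conc >= 0) & (conc < 605)
    # Index of the interval whose lower breakpoint is the largest one <= conc
    # (invalid entries are temporarily replaced by 0 and masked at the end)
    idx = np.searchsorted(C_LOW, np.where(valid, conc, 0.0), side="right") - 1
    slope = (I_HIGH[idx] - I_LOW[idx]) / (C_HIGH[idx] - C_LOW[idx])
    aqi = slope * (conc - C_LOW[idx]) + I_LOW[idx]
    return np.where(valid, np.round(aqi, 1), np.nan)


result = so2_aqi_vectorized([9.0, 6.6, 3.0, np.nan, -0.3])
print(result)
```

In the notebook this could be used as `df["SO2 AQI"] = so2_aqi_vectorized(df["SO2 1st Max Value"])`, though it has not been benchmarked against the `.apply` version here.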

In [13]:
df = add_so2_aqi_column(df, "SO2 1st Max Value")
df["SO2 AQI"].isna().sum()

np.int64(0)

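Check the recomputed column against the source values. Because the function is applied to every row, AQI values that were already present are overwritten too (for row 0, the original 13.0 becomes 12.9).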

In [14]:
print(df[["SO2 1st Max Value", "SO2 AQI"]].head())

   SO2 1st Max Value  SO2 AQI
0                9.0     12.9
1                9.0     12.9
2                6.6      9.4
3                6.6      9.4
4                3.0      4.3
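
If the originally reported AQI values should be preserved, one option is to fill only the missing entries. The sketch below is an alternative to the approach taken in this notebook, not a description of it; it uses a toy frame and a simplified first-interval formula, and the column name `SO2 AQI filled` is illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SO2 1st Max Value": [9.0, 6.6, 3.0],
    "SO2 AQI": [13.0, np.nan, 4.0],     # 13.0 and 4.0 as reported in the source data
})

# Simplified first-interval interpolation (0-35 ppb -> AQI 0-50), for illustration only
computed = (toy["SO2 1st Max Value"] * 50 / 35).round(1)

# fillna keeps reported values and uses computed ones only where data is missing
toy["SO2 AQI filled"] = toy["SO2 AQI"].fillna(computed)
print(toy)
```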


The CO AQI is calculated in the same way, using the CO Mean column as input.

In [15]:
import numpy as np
import pandas as pd

def calculate_co_aqi(C):
    """
    Calculate CO AQI using expanded breakpoint intervals.
    This ensures continuous coverage with NO missing AQI values.

    C : float 
        8-hour CO concentration in ppm
    """

    if pd.isna(C):
        return np.nan

    # ===== Expanded breakpoints =====
    # Each interval runs up to (but excludes) the next lower breakpoint,
    # so values such as 4.45 are not left without a bin.
    if 0.0 <= C < 4.5:
        Clow, Chigh = 0.0, 4.4
        Ilow, Ihigh = 0, 50

    elif 4.5 <= C < 9.5:
        Clow, Chigh = 4.5, 9.4
        Ilow, Ihigh = 51, 100

    elif 9.5 <= C < 12.5:
        Clow, Chigh = 9.5, 12.4
        Ilow, Ihigh = 101, 150

    elif 12.5 <= C < 15.5:
        Clow, Chigh = 12.5, 15.4
        Ilow, Ihigh = 151, 200

    elif 15.5 <= C < 30.5:
        Clow, Chigh = 15.5, 30.4
        Ilow, Ihigh = 201, 300

    else:
        return np.nan  # negative, extremely high, or wrong units

    # ===== AQI Formula =====
    aqi = ((Ihigh - Ilow) / (Chigh - Clow)) * (C - Clow) + Ilow
    return round(aqi, 1)


In [16]:
def add_co_aqi_column(df, col="CO Mean"):
    df = df.copy()
    df["CO AQI"] = df[col].apply(calculate_co_aqi)
    return df

In [17]:
df = add_co_aqi_column(df, "CO Mean")
df["CO AQI"].isna().sum()

np.int64(0)

Verify that no missing values remain after imputation:

In [18]:
df.isnull().sum()           # Check for missing values in each column

State Code           0
County Code          0
Site Num             0
Address              0
State                0
County               0
City                 0
Date Local           0
NO2 Units            0
NO2 Mean             0
NO2 1st Max Value    0
NO2 1st Max Hour     0
NO2 AQI              0
O3 Units             0
O3 Mean              0
O3 1st Max Value     0
O3 1st Max Hour      0
O3 AQI               0
SO2 Units            0
SO2 Mean             0
SO2 1st Max Value    0
SO2 1st Max Hour     0
SO2 AQI              0
CO Units             0
CO Mean              0
CO 1st Max Value     0
CO 1st Max Hour      0
CO AQI               0
Latitude             0
Longitude            0
dtype: int64

In [19]:
df.shape

(1737313, 30)

Next, search for negative values in the other numeric columns:

In [20]:
def validate_no_negative_values(df, auto_fix=False, stop_on_error=False):
    """
    Validate that no negative values exist in numeric pollutant measurement columns.
    Excludes Units columns and AQI columns.
    """

    # Select *only pollutant numeric columns*
    pollutant_cols = [
        c for c in df.columns
        if any(p in c for p in ["NO2", "SO2", "CO", "O3"])
        and not c.endswith("Units")
        and not c.endswith("AQI")
    ]

    negatives_report = {}

    # Check each pollutant numeric column
    for col in pollutant_cols:
        neg_mask = df[col] < 0
        neg_count = neg_mask.sum()

        if neg_count > 0:
            negatives_report[col] = {
                "count": int(neg_count),
                "sample": df.loc[neg_mask].head()
            }

    # If clean → success
    if len(negatives_report) == 0:
        print("✓ No negative values found in pollutant numeric columns.")
        return df

    # Otherwise print detailed report
    print("✗ Negative values found:")
    for col, info in negatives_report.items():
        print(f"\nColumn: {col}")
        print(f"Negative count: {info['count']}")
        print(info["sample"])
        print("-" * 40)

    # Auto-fix negative rows
    if auto_fix:
        print("\n✓ Auto-fix enabled → removing rows containing negative pollutant values...")
        df = df[(df[pollutant_cols] >= 0).all(axis=1)]
        print("✓ Negative rows removed.")
        return df

    # Raise exception if required
    if stop_on_error:
        raise ValueError("Dataset contains negative pollutant values.")

    return df

df = validate_no_negative_values(df, auto_fix=True, stop_on_error=True)


✗ Negative values found:

Column: NO2 Mean
Negative count: 828
         State Code  County Code  Site Num  \
1144580          23            3      1100   
1144581          23            3      1100   
1144582          23            3      1100   
1144583          23            3      1100   
1157402          36           55      1007   

                                         Address     State     County  \
1144580  8 NORTHERN ROAD, PRESQUE ISLE, ME 04769     Maine  Aroostook   
1144581  8 NORTHERN ROAD, PRESQUE ISLE, ME 04769     Maine  Aroostook   
1144582  8 NORTHERN ROAD, PRESQUE ISLE, ME 04769     Maine  Aroostook   
1144583  8 NORTHERN ROAD, PRESQUE ISLE, ME 04769     Maine  Aroostook   
1157402         2 YARMOUTH ROAD, RG&E Substation  New York     Monroe   

                 City  Date Local          NO2 Units  NO2 Mean  ...  \
1144580  Presque Isle  2011-04-30  Parts per billion -0.279167  ...   
1144581  Presque Isle  2011-04-30  Parts per billion -0.279167  ...   
114458

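828 negative values were found in the NO2 Mean column; the report above is truncated, so negatives in other columns are not shown. Because auto_fix=True, every row containing a negative pollutant value was removed automatically: 15,116 rows in total (from 1,737,313 to 1,722,197).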
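
The number of rows removed can exceed the negative count of any single column, because a row is dropped as soon as any one of the checked columns is negative. The toy example below (made-up values, illustrative names) shows the per-column counts next to the row-level filter used by the auto-fix step.

```python
import pandas as pd

toy = pd.DataFrame({
    "NO2 Mean": [19.04, -0.28, 22.96],
    "SO2 Mean": [3.0, 1.5, -0.43],
    "CO Mean": [1.15, 0.88, 0.85],
})
cols = ["NO2 Mean", "SO2 Mean", "CO Mean"]

# Per-column count of negative readings
neg_counts = (toy[cols] < 0).sum()

# Keep only rows where every listed column is non-negative
clean = toy[(toy[cols] >= 0).all(axis=1)]

print(neg_counts.to_dict())
print("rows removed:", len(toy) - len(clean))
```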

In [21]:
print(df.shape)                         # Print the shape of the DataFrame
print(df.isnull().sum())                # Check for missing values in each column

(1722197, 30)
State Code           0
County Code          0
Site Num             0
Address              0
State                0
County               0
City                 0
Date Local           0
NO2 Units            0
NO2 Mean             0
NO2 1st Max Value    0
NO2 1st Max Hour     0
NO2 AQI              0
O3 Units             0
O3 Mean              0
O3 1st Max Value     0
O3 1st Max Hour      0
O3 AQI               0
SO2 Units            0
SO2 Mean             0
SO2 1st Max Value    0
SO2 1st Max Hour     0
SO2 AQI              0
CO Units             0
CO Mean              0
CO 1st Max Value     0
CO 1st Max Hour      0
CO AQI               0
Latitude             0
Longitude            0
dtype: int64


Next, check for true (exact) duplicates.

In [22]:
def find_full_pollutant_duplicates(df):
    """
    Find duplicates using:
    - Date Local
    - Address
    - All pollutant mean / max / hour values
    - All AQI values
    
    Shows duplicates but DOES NOT remove them.
    """

    # Identify pollutant measurement columns dynamically
    pollutant_cols = [
        col for col in df.columns 
        if any(p in col for p in ["NO2", "O3", "SO2", "CO"])
    ]

    # Identify AQI columns
    # Note: these already match the pollutant filter above, so they appear
    # twice in key_columns; this does not change the duplicate check.
    aqi_cols = [col for col in df.columns if col.endswith("AQI")]

    # Build final set of columns for duplicate detection
    key_columns = ["Date Local", "Address"] + pollutant_cols + aqi_cols

    print("🔍 Checking duplicates using ALL pollutant columns:")
    print(key_columns, "\n")

    # Find duplicates (count both first and later occurrences)
    dup_mask = df.duplicated(subset=key_columns, keep=False)
    duplicates = df.loc[dup_mask].sort_values(by=key_columns)

    print(f"📌 Total FULL pollutant duplicates found: {len(duplicates)}\n")

    if len(duplicates) == 0:
        print("✅ No duplicates found.")
        return duplicates

    print("📄 Sample of duplicate rows (first 30):")
    display(duplicates.head(30))

    print("\n📊 Duplicate groups summary:")
    group_counts = (
        duplicates.groupby(["Date Local", "Address"])
        .size()
        .reset_index(name="Count")
        .sort_values("Count", ascending=False)
    )
    display(group_counts.head(20))

    return duplicates


In [23]:
duplicates = find_full_pollutant_duplicates(df)

🔍 Checking duplicates using ALL pollutant columns:
['Date Local', 'Address', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI', 'NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI'] 

📌 Total FULL pollutant duplicates found: 94462

📄 Sample of duplicate rows (first 30):


Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,NO2 Mean,...,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Latitude,Longitude
28240,6,83,1025,LFC #1-LAS FLORES CANYON,California,Santa Barbara,Capitan,2000-01-03,Parts per billion,1.869565,...,0.0,0,0.0,Parts per million,0.1,0.1,0,1.1,34.479435,-120.042365
28241,6,83,1025,LFC #1-LAS FLORES CANYON,California,Santa Barbara,Capitan,2000-01-03,Parts per billion,1.869565,...,0.0,0,0.0,Parts per million,0.1,0.1,0,1.1,34.479435,-120.042365
28242,6,83,1025,LFC #1-LAS FLORES CANYON,California,Santa Barbara,Capitan,2000-01-03,Parts per billion,1.869565,...,0.0,5,0.0,Parts per million,0.1,0.1,0,1.1,34.479435,-120.042365
28243,6,83,1025,LFC #1-LAS FLORES CANYON,California,Santa Barbara,Capitan,2000-01-03,Parts per billion,1.869565,...,0.0,5,0.0,Parts per million,0.1,0.1,0,1.1,34.479435,-120.042365
86568,51,59,5,CUB RUN LEE RD CHANT.(CUBRUN TREAT PLANT,Virginia,Fairfax,Not in a city,2000-01-04,Parts per billion,4.869565,...,7.0,5,10.0,Parts per million,0.2,0.2,0,2.3,38.522667,-78.641278
86569,51,59,5,CUB RUN LEE RD CHANT.(CUBRUN TREAT PLANT,Virginia,Fairfax,Not in a city,2000-01-04,Parts per billion,4.869565,...,7.0,5,10.0,Parts per million,0.2,0.2,0,2.3,38.522667,-78.641278
86566,51,59,5,CUB RUN LEE RD CHANT.(CUBRUN TREAT PLANT,Virginia,Fairfax,Not in a city,2000-01-04,Parts per billion,4.869565,...,7.0,0,10.0,Parts per million,0.2,0.2,0,2.3,38.522667,-78.641278
86567,51,59,5,CUB RUN LEE RD CHANT.(CUBRUN TREAT PLANT,Virginia,Fairfax,Not in a city,2000-01-04,Parts per billion,4.869565,...,7.0,0,10.0,Parts per million,0.2,0.2,0,2.3,38.522667,-78.641278
28246,6,83,1025,LFC #1-LAS FLORES CANYON,California,Santa Barbara,Capitan,2000-01-04,Parts per billion,1.958333,...,0.3,11,0.4,Parts per million,0.1,0.1,0,1.1,34.479435,-120.042365
28247,6,83,1025,LFC #1-LAS FLORES CANYON,California,Santa Barbara,Capitan,2000-01-04,Parts per billion,1.958333,...,0.3,11,0.4,Parts per million,0.1,0.1,0,1.1,34.479435,-120.042365



📊 Duplicate groups summary:


Unnamed: 0,Date Local,Address,Count
2471,2002-06-10,HARRISON AVE,324
2465,2002-06-09,HARRISON AVE,320
14416,2011-05-24,"5888 MISSION BLVD., RUBIDOUX",80
16360,2012-10-11,2052 LAUWILIWILI ST,16
16608,2012-12-09,2052 LAUWILIWILI ST,16
16351,2012-10-08,2052 LAUWILIWILI ST,16
16010,2012-07-15,2052 LAUWILIWILI ST,16
20167,2015-01-25,2052 LAUWILIWILI ST,16
16173,2012-08-19,2052 LAUWILIWILI ST,16
16176,2012-08-20,2052 LAUWILIWILI ST,16


Next, remove only the true duplicates, i.e. rows that match exactly on all key columns.

This function:
* Removes only exact duplicates
* Keeps the first occurrence
* Shows how many were removed
* Shows which Date/Address pairs had duplicates
* Returns a cleaned dataframe
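
The detection step reported 94,462 duplicate rows, while the removal step drops 47,522. The difference comes from the `keep` argument: `duplicated(keep=False)` flags every member of a duplicate group, whereas `drop_duplicates(keep="first")` leaves one survivor per group. A toy illustration with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date Local": ["2000-01-03"] * 3 + ["2000-01-04"],
    "Address": ["LFC #1"] * 4,
    "CO Mean": [0.1, 0.1, 0.1, 0.2],
})

flagged = toy.duplicated(keep=False).sum()      # every member of a duplicate group
deduped = toy.drop_duplicates(keep="first")     # one survivor per group
removed = len(toy) - len(deduped)

print("flagged:", flagged, "removed:", removed)
```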

In [24]:
def remove_full_pollutant_duplicates(df):
    """
    Remove ONLY true duplicates based on:
    - Date Local
    - Address
    - All pollutant measurement columns (Mean, Max, Hour, Units)
    - All AQI columns

    Keeps the FIRST occurrence.
    Returns cleaned dataframe + summary report.
    """

    df_clean = df.copy()

    # Identify pollutant columns dynamically
    pollutant_cols = [
        col for col in df.columns
        if any(p in col for p in ["NO2", "O3", "SO2", "CO"])
    ]

    # Identify AQI columns
    aqi_cols = [col for col in df.columns if col.endswith("AQI")]

    key_columns = ["Date Local", "Address"] + pollutant_cols + aqi_cols

    print("🧹 Removing true duplicates based on columns:")
    print(key_columns, "\n")

    before = len(df_clean)

    # Remove duplicates (keep first occurrence)
    df_clean = df_clean.drop_duplicates(subset=key_columns, keep="first")

    after = len(df_clean)
    removed = before - after

    print(f"📉 Total rows BEFORE: {before}")
    print(f"📈 Total rows AFTER:  {after}")
    print(f"🗑️ Removed duplicates: {removed}\n")

    # Show top duplicate groups (optional)
    if removed > 0:
        print("📊 Duplicate groups (Date + Address) impacted:")
        dup_groups = (
            df[df.duplicated(subset=key_columns, keep=False)]
            .groupby(["Date Local", "Address"])
            .size()
            .reset_index(name="Count")
            .sort_values("Count", ascending=False)
        )
        display(dup_groups.head(20))
    else:
        print("✅ No duplicates were removed. Dataset was already clean.")

    return df_clean


In [25]:
df = remove_full_pollutant_duplicates(df)


🧹 Removing true duplicates based on columns:
['Date Local', 'Address', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI', 'NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI'] 

📉 Total rows BEFORE: 1722197
📈 Total rows AFTER:  1674675
🗑️ Removed duplicates: 47522

📊 Duplicate groups (Date + Address) impacted:


Unnamed: 0,Date Local,Address,Count
2471,2002-06-10,HARRISON AVE,324
2465,2002-06-09,HARRISON AVE,320
14416,2011-05-24,"5888 MISSION BLVD., RUBIDOUX",80
16360,2012-10-11,2052 LAUWILIWILI ST,16
16608,2012-12-09,2052 LAUWILIWILI ST,16
16351,2012-10-08,2052 LAUWILIWILI ST,16
16010,2012-07-15,2052 LAUWILIWILI ST,16
20167,2015-01-25,2052 LAUWILIWILI ST,16
16173,2012-08-19,2052 LAUWILIWILI ST,16
16176,2012-08-20,2052 LAUWILIWILI ST,16


Identifying conflicting duplicates (rows that share a date and address but differ in their measurements) makes it possible to:
* Detect measurement inconsistencies
* Identify stations that report multiple measurements for the same date
* Flag data quality problems
* Decide whether to:
  * average conflicts
  * drop the worst sensors
  * keep the max value (common for AQI rules)

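Conflicting duplicates are resolved here by keeping the row with the highest values (EPA-style AQI logic).

EPA AQI calculations already use maxima of 1-hour and 8-hour values, so, to remain consistent, the highest-valued row within each conflict group is kept. Note that the function below keeps one complete row per group, chosen by sorting on the pollutant and AQI columns in descending order. This is not a column-wise maximum: the first sort column (NO2 Mean) dominates, and later columns only break ties.

This choice:
* Suits AQI analysis
* Reduces the risk of underestimation
* Is consistent in spirit with U.S. EPA methodology

It is intended to support:
* proper pollutant selection
* AQI values that are not underestimated
* dataset integrity for environmental analysis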
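
Two ways of "keeping the maximum" within a conflict group can give different results. The sketch below contrasts them on a toy group with made-up values: sorting and keeping the first whole row, versus a column-wise maximum with `groupby(...).max()`, which may combine values from different rows. The names `row_wise` and `col_wise` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date Local": ["2000-01-19"] * 2,
    "Address": ["SITE A"] * 2,
    "NO2 Mean": [139.5, 139.5],
    "SO2 AQI": [8.6, 7.1],
    "CO AQI": [24.1, 24.4],
})
keys = ["Date Local", "Address"]
scores = ["NO2 Mean", "SO2 AQI", "CO AQI"]

# Strategy A: lexicographic sort, keep the first whole row per group
row_wise = (
    toy.sort_values(by=scores, ascending=False)
       .drop_duplicates(subset=keys, keep="first")
)

# Strategy B: column-wise maximum per group (may mix values from different rows)
col_wise = toy.groupby(keys, as_index=False)[scores].max()

print(row_wise[scores].iloc[0].tolist())
print(col_wise[scores].iloc[0].tolist())
```

Strategy A keeps an observed row intact but can discard a higher CO AQI recorded in another row; Strategy B never underestimates any single column but produces a row that was never measured as such.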

In [26]:
def detect_and_resolve_conflicts(df):
    """
    Detect and resolve conflicting duplicates based on:
        keys  : ['Date Local', 'Address']
        score : all pollutant measurement cols (Mean, Max, Hour) + all AQI cols

    Strategy:
      1. Sort rows so that "best" (highest pollutant/AQI) rows come first.
      2. Drop duplicates on (Date Local, Address), keeping the first row.
      3. Return cleaned df and the dropped 'conflict' rows.

    This DOES NOT touch non-pollutant columns (Address, City, State, etc.).
    """

    df_copy = df.copy()

    # --- 1. Identify pollutant + AQI columns dynamically ---
    pollutant_cols = [
        c for c in df_copy.columns
        if any(p in c for p in ["NO2", "O3", "SO2", "CO"])
        and not c.endswith("Units")
    ]

    aqi_cols = [c for c in df_copy.columns if c.endswith("AQI")]

    score_cols = pollutant_cols + aqi_cols

    key_cols = ["Date Local", "Address"]

    print("🔍 Checking columns:")
    print("  Pollutant cols:", pollutant_cols)
    print("  AQI cols      :", aqi_cols)
    print("  Keys          :", key_cols)
    print()

    # --- 2. Sort by pollutant/AQI descending (highest first) ---
    # Missing values will stay at the bottom of each group.
    df_sorted = df_copy.sort_values(by=score_cols, ascending=False)

    # --- 3. Mark conflicts: all but first in each (Date Local, Address) group ---
    conflict_mask = df_sorted.duplicated(subset=key_cols, keep="first")

    conflicts = df_sorted[conflict_mask].copy()
    df_cleaned = df_sorted[~conflict_mask].copy()

    # Optional: restore original index order
    df_cleaned = df_cleaned.sort_index()

    # --- 4. Reporting ---
    print(f"✅ Total rows BEFORE: {len(df_copy)}")
    print(f"✅ Total rows AFTER : {len(df_cleaned)}")
    print(f"⚠️  Conflicting rows removed: {len(conflicts)}")
    print()

    if not conflicts.empty:
        print("📌 Example conflict groups (Date Local + Address):")
        example = (
            conflicts
            .groupby(key_cols)
            .size()
            .reset_index(name="Count")
            .sort_values("Count", ascending=False)
            .head(10)
        )
        print(example.to_string(index=False))

    return df_cleaned, conflicts

In [27]:
df_cleaned, conflicts = detect_and_resolve_conflicts(df)


🔍 Checking columns:
  Pollutant cols: ['NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']
  AQI cols      : ['NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI']
  Keys          : ['Date Local', 'Address']

✅ Total rows BEFORE: 1674675
✅ Total rows AFTER : 407557
⚠️  Conflicting rows removed: 1267118

📌 Example conflict groups (Date Local + Address):
Date Local                      Address  Count
2002-06-09                 HARRISON AVE     63
2011-05-24 5888 MISSION BLVD., RUBIDOUX     47
2009-07-15              750 DUNDEE ROAD     31
2009-06-03              750 DUNDEE ROAD     31
2009-07-17              750 DUNDEE ROAD     31
2009-06-02              750 DUNDEE ROAD     31
2009-06-19              750 DUNDEE ROAD     31
2009-04-23              750 DUNDEE ROAD     31
2009-07-16              

In [28]:
df_cleaned.head()

Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,NO2 Mean,...,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Latitude,Longitude
0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,19.041667,...,9.0,21,12.9,Parts per million,1.145833,4.2,21,13.0,33.458426,-112.046574
5,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,22.958333,...,3.0,22,4.3,Parts per million,1.066667,2.3,0,12.1,33.458426,-112.046574
8,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-03,Parts per billion,38.125,...,11.0,19,15.7,Parts per million,1.929167,4.4,8,21.9,33.458426,-112.046574
12,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-04,Parts per billion,40.26087,...,16.0,8,22.9,Parts per million,1.991667,5.1,21,22.6,33.458426,-112.046574
17,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-05,Parts per billion,48.45,...,15.0,7,21.4,Parts per million,2.7,3.7,2,30.7,33.458426,-112.046574


In [29]:
conflicts.head()

Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,NO2 Mean,...,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Latitude,Longitude
1469,4,13,3003,2857 N MILLER RD-S SCOTTSDALE STN,Arizona,Maricopa,Scottsdale,2000-01-19,Parts per billion,139.541667,...,6.0,8,8.6,Parts per million,2.120833,2.9,23,24.1,33.4799,-111.917292
1470,4,13,3003,2857 N MILLER RD-S SCOTTSDALE STN,Arizona,Maricopa,Scottsdale,2000-01-19,Parts per billion,139.541667,...,5.0,8,7.1,Parts per million,2.15,3.9,20,24.4,33.4799,-111.917292
1471,4,13,3003,2857 N MILLER RD-S SCOTTSDALE STN,Arizona,Maricopa,Scottsdale,2000-01-19,Parts per billion,139.541667,...,5.0,8,7.1,Parts per million,2.120833,2.9,23,24.1,33.4799,-111.917292
1445,4,13,3003,2857 N MILLER RD-S SCOTTSDALE STN,Arizona,Maricopa,Scottsdale,2000-01-13,Parts per billion,135.333333,...,4.0,19,5.7,Parts per million,1.970833,3.1,23,22.4,33.4799,-111.917292
1446,4,13,3003,2857 N MILLER RD-S SCOTTSDALE STN,Arizona,Maricopa,Scottsdale,2000-01-13,Parts per billion,135.333333,...,3.3,20,4.7,Parts per million,2.079167,4.3,19,23.6,33.4799,-111.917292
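
The rows above share the same `Date Local` and `Address` but disagree in some pollutant columns. A minimal sketch of how such groups could be inspected follows. The toy frame and the helper name `conflicting_columns` are assumptions for illustration and are not part of `detect_and_resolve_conflicts`.

```python
import pandas as pd

# Toy data mimicking the conflict pattern above (values are illustrative only)
toy = pd.DataFrame({
    "Date Local": ["2000-01-19", "2000-01-19", "2000-01-20"],
    "Address": ["SITE A", "SITE A", "SITE A"],
    "NO2 Mean": [139.5, 139.5, 120.0],
    "SO2 1st Max Value": [6.0, 5.0, 4.0],
})


def conflicting_columns(df, key_cols, value_cols):
    """Hypothetical helper: count distinct values per key group and keep
    only the groups where at least one column has more than one value."""
    distinct = df.groupby(key_cols)[value_cols].nunique()
    return distinct[(distinct > 1).any(axis=1)]


report = conflicting_columns(
    toy, ["Date Local", "Address"], ["NO2 Mean", "SO2 1st Max Value"]
)
print(report)
```

A count above 1 in a column marks the measurement that differs within the group, which helps decide whether the rows are true conflicts or repeated readings that could be aggregated instead of dropped.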


In [30]:
df_cleaned.shape

(407557, 30)

In [31]:
df_cleaned.isnull().sum()           # Check for missing values in each column

State Code           0
County Code          0
Site Num             0
Address              0
State                0
County               0
City                 0
Date Local           0
NO2 Units            0
NO2 Mean             0
NO2 1st Max Value    0
NO2 1st Max Hour     0
NO2 AQI              0
O3 Units             0
O3 Mean              0
O3 1st Max Value     0
O3 1st Max Hour      0
O3 AQI               0
SO2 Units            0
SO2 Mean             0
SO2 1st Max Value    0
SO2 1st Max Hour     0
SO2 AQI              0
CO Units             0
CO Mean              0
CO 1st Max Value     0
CO 1st Max Hour      0
CO AQI               0
Latitude             0
Longitude            0
dtype: int64

A typical US EPA AQS dataset uses:

| Pollutant | Unit (US EPA)           | Type                |
| --------- | ----------------------- | ------------------- |
| **NO2**   | Parts per billion (ppb) | Volume mixing ratio |
| **SO2**   | Parts per billion (ppb) | Volume mixing ratio |
| **CO**    | Parts per million (ppm) | Volume mixing ratio |
| **O3**    | Parts per million (ppm) | Volume mixing ratio |
| **PM2.5** | µg/m³                   | Mass concentration  |
| **PM10**  | µg/m³                   | Mass concentration  |

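The dataset uses:
* NO2 = ppb
* SO2 = ppb
* O3 = ppb (as assumed in this notebook; the `O3 Units` column should be checked, because daily means on the order of 0.02 are more typical of ppm)
* CO = ppm
* AQI is unitless

So the units are consistent within each gas but differ across gases.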

Units can be standardized for comparison by converting all gases to the same mass concentration (µg/m³). This is the most common approach.

Conversion formula:

For gases, the conversion follows from the Ideal Gas Law:

$$C_{\mathrm{\mu g/m^3}} = C_{\mathrm{ppb}} \times \frac{M \times P}{R \times T}$$

Where:

* M = molar mass of the gas (in kg/mol, so that the result comes out in µg/m³)
* P = pressure (Pa)
* T = temperature (K)
* R = 8.3145 J/mol/K
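
As a rough check of the formula, the sketch below evaluates it at 25 °C and 1 atm. The molar masses (rounded, in g/mol) and the function name `ppb_to_ugm3_factor` are assumptions introduced here for illustration.

```python
R = 8.3145       # gas constant, J/(mol·K)
P = 101_325.0    # 1 atm expressed in Pa
T = 298.15       # 25 °C expressed in K

# Rounded molar masses in g/mol (assumed values for this sketch)
MOLAR_MASS_G_PER_MOL = {"NO2": 46.01, "SO2": 64.07, "O3": 48.00, "CO": 28.01}


def ppb_to_ugm3_factor(molar_mass_g_per_mol, pressure=P, temperature=T):
    """Return the µg/m³ equivalent of 1 ppb for an ideal gas."""
    mol_per_m3 = pressure / (R * temperature)  # moles of air per cubic metre
    # 1e-3 converts g/mol to kg/mol; 1 ppb of 1 kg/mol gas is then 1 µg per mole of air
    return molar_mass_g_per_mol * 1e-3 * mol_per_m3


factors = {gas: ppb_to_ugm3_factor(m) for gas, m in MOLAR_MASS_G_PER_MOL.items()}
for gas, factor in factors.items():
    print(f"{gas}: {factor:.3f} µg/m³ per ppb")
```

Under these assumptions the results round to 1.88 (NO2), 2.62 (SO2), 1.96 (O3), and 1.145 (CO, which equals 1145 µg/m³ per ppm). A different reference temperature shifts every factor slightly; at 20 °C, for example, the O3 factor is close to 2.00.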

EPA Simplified Conversion Factors:

| Pollutant | Convert from | To    | Factor  |
| --------- | ------------ | ----- | ------- |
| **NO2**   | 1 ppb        | µg/m³ | × 1.88  |
| **SO2**   | 1 ppb        | µg/m³ | × 2.62  |
| **O3**    | 1 ppb        | µg/m³ | × 2.00  |
| **CO**    | 1 ppm        | mg/m³ | × 1.145 |
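
As a worked example, the factors in the table can be applied to the first row shown by `df_cleaned.head()` (NO2 Mean 19.041667 ppb, SO2 1st Max Value 9.0 ppb, CO Mean 1.145833 ppm). The dictionary name `FACTORS` is an assumption of this sketch.

```python
# Factors from the table, all expressed as µg/m³ per input unit.
# CO is listed in mg/m³ per ppm, so it is multiplied by 1000 to reach µg/m³.
FACTORS = {"NO2": 1.88, "SO2": 2.62, "CO": 1.145 * 1000}

no2_mean_ugm3 = 19.041667 * FACTORS["NO2"]  # ppb -> µg/m³
so2_max_ugm3 = 9.0 * FACTORS["SO2"]         # ppb -> µg/m³
co_mean_ugm3 = 1.145833 * FACTORS["CO"]     # ppm -> µg/m³

print(f"NO2 mean: {no2_mean_ugm3:.3f} µg/m³")
print(f"SO2 max : {so2_max_ugm3:.3f} µg/m³")
print(f"CO mean : {co_mean_ugm3:.3f} µg/m³")
```

The CO value lands three orders of magnitude above the others mainly because its input is in ppm rather than ppb.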


Unit standardization is recommended:
* For numerical comparison across pollutants
* For clustering / PCA / modeling across pollutants
* To plot all pollutants on the same normalized scale
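
For the last point, sharing a unit does not by itself put pollutants on the same scale, since typical magnitudes still differ. One common option is z-score scaling, sketched below. The choice of z-scores, the toy values, and the helper name `zscore` are assumptions; the notebook does not state which normalization it uses.

```python
import pandas as pd

# Toy concentrations in µg/m³ (illustrative values only)
toy = pd.DataFrame({
    "NO2": [35.8, 43.2, 71.7, 75.7],
    "CO": [1312.0, 1221.3, 2208.9, 2280.5],
})


def zscore(frame):
    """Scale each column to zero mean and unit (population) standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)


scaled = zscore(toy)
print(scaled.round(2))
```

After scaling, each column is expressed in standard deviations from its own mean, so the series can share one axis.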

In [32]:
def convert_to_ugm3(df):
    """
    Convert pollutant concentrations to µg/m³ and drop the original
    pollutant columns (Mean, 1st Max Value, Units, 1st Max Hour, AQI).
    Returns a new DataFrame; the input is left unchanged.
    """
    # Work on a copy so the caller's frame is not mutated
    # (this also avoids SettingWithCopyWarning on filtered frames)
    df = df.copy()

    # Approximate conversion factors at 25°C and 1 atm
    # (2.00 for O3 is the rounded 20°C value; 25°C gives about 1.96)
    factors = {
        "NO2": 1.88,   # ppb → µg/m³
        "SO2": 2.62,   # ppb → µg/m³
        "O3": 2.00,    # ppb → µg/m³ (assumes O3 is recorded in ppb)
        "CO": 1145     # ppm → µg/m³ (1 ppm = 1145 µg/m³)
    }

    cols_to_drop = []

    for gas, factor in factors.items():

        # --- Mean ---
        col_mean = f"{gas} Mean"
        if col_mean in df.columns:
            new_col = f"{gas} Mean µg/m³"
            df[new_col] = df[col_mean] * factor
            cols_to_drop.append(col_mean)

        # --- Max ---
        col_max = f"{gas} 1st Max Value"
        if col_max in df.columns:
            new_col = f"{gas} Max µg/m³"
            df[new_col] = df[col_max] * factor
            cols_to_drop.append(col_max)

        # --- Units column ---
        col_units = f"{gas} Units"
        if col_units in df.columns:
            cols_to_drop.append(col_units)

        # --- Max hour column (not needed for modeling or maps) ---
        col_hour = f"{gas} 1st Max Hour"
        if col_hour in df.columns:
            cols_to_drop.append(col_hour)

        # --- AQI column (unitless but not needed if you recompute later) ---
        col_aqi = f"{gas} AQI"
        if col_aqi in df.columns:
            cols_to_drop.append(col_aqi)

    # Remove duplicates from drop list
    cols_to_drop = list(set(cols_to_drop))

    print("Dropping original pollutant columns:", cols_to_drop)

    # Drop old columns safely
    df = df.drop(columns=cols_to_drop, errors="ignore")

    return df


df_cleaned = convert_to_ugm3(df_cleaned)

Dropping original pollutant columns: ['O3 Units', 'CO 1st Max Value', 'NO2 Mean', 'NO2 1st Max Value', 'SO2 Units', 'CO 1st Max Hour', 'SO2 AQI', 'SO2 1st Max Hour', 'NO2 AQI', 'O3 1st Max Hour', 'NO2 Units', 'NO2 1st Max Hour', 'CO AQI', 'O3 1st Max Value', 'CO Mean', 'SO2 Mean', 'O3 Mean', 'O3 AQI', 'CO Units', 'SO2 1st Max Value']


In [33]:
df_cleaned.head()

Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,Latitude,Longitude,NO2 Mean µg/m³,NO2 Max µg/m³,SO2 Mean µg/m³,SO2 Max µg/m³,O3 Mean µg/m³,O3 Max µg/m³,CO Mean µg/m³,CO Max µg/m³
0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,33.458426,-112.046574,35.798334,92.12,7.86,23.58,0.045,0.08,1311.978785,4809.0
5,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,33.458426,-112.046574,43.161666,67.68,5.130832,7.86,0.02675,0.064,1221.333715,2633.5
8,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-03,33.458426,-112.046574,71.675,95.88,13.755,28.82,0.015916,0.032,2208.896215,5038.0
12,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-04,33.458426,-112.046574,75.690436,139.12,18.558332,41.92,0.028334,0.066,2280.458715,5839.5
17,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-05,33.458426,-112.046574,91.086,114.68,22.815832,39.3,0.013334,0.024,3091.5,4236.5


In [34]:
print(df_cleaned.shape)   # Print the shape of the DataFrame
df_cleaned.info()         # Print a concise summary (info() prints directly and returns None)

(407557, 18)
<class 'pandas.core.frame.DataFrame'>
Index: 407557 entries, 0 to 1746660
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   State Code      407557 non-null  int64  
 1   County Code     407557 non-null  int64  
 2   Site Num        407557 non-null  int64  
 3   Address         407557 non-null  object 
 4   State           407557 non-null  object 
 5   County          407557 non-null  object 
 6   City            407557 non-null  object 
 7   Date Local      407557 non-null  object 
 8   Latitude        407557 non-null  float64
 9   Longitude       407557 non-null  float64
 10  NO2 Mean µg/m³  407557 non-null  float64
 11  NO2 Max µg/m³   407557 non-null  float64
 12  SO2 Mean µg/m³  407557 non-null  float64
 13  SO2 Max µg/m³   407557 non-null  float64
 14  O3 Mean µg/m³   407557 non-null  float64
 15  O3 Max µg/m³    407557 non-null  float64
 16  CO Mean µg/m³   407557 non-null  

All concentrations are standardized to µg/m³ so that:
* All numbers represent the same physical quantity
* Models can compare pollutants
* Visuals become consistent
* Unit differences are eliminated

In [35]:
df_cleaned.to_parquet("../data/processed/pollution_dataset_cleaned.parquet", engine="pyarrow", index=False)