# How does wildfire impact air quality?

## The goal of this notebook is to build multiple models exploring the impact of widlfire on air quality.



## 1. Multiple Linear Regression Model

### The goal of this section is to build a multiple linear regression model to predict air quality based on various predictors surrounding wildfire proximity.

#### y = ß0 + ß1X1 + ß2X2 + ... + ßnXn + error, where y is PM2.5 (fine particulate matter) levels, ß0 is a constant, X1 through Xn are the predictors, and ß1 through ßn are the predictors' coefficients.

### Imports

In [1]:
# Needed imports

import pandas as pd
import numpy as np
import os
import statsmodels.api as sm
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

### Data Loading and Exploration

In [2]:
# Load the data
script_dir = os.getcwd()

df = pd.read_csv(f'{script_dir}/air_quality_weather_fires_final.csv')

df = df.dropna()

df

Unnamed: 0.1,Unnamed: 0,date,site_id,latitude,longitude,state_name,county_name,city_name,site_name,PM25,...,fires_within_100km,has_nearby_fire,datetime,month,day_of_week,is_weekend,season,wildfire_season,fire_distance_category,fire_intensity
0,0,2024-01-01,01-073-0023,33.553056,-86.815000,Alabama,Jefferson,Birmingham,North Birmingham,11.55,...,3,1,2024-01-01,1,0,0,winter,0,close,low
1,1,2024-01-01,04-013-9997,33.503833,-112.095767,Arizona,Maricopa,Phoenix,JLG SUPERSITE,85.35,...,0,1,2024-01-01,1,0,0,winter,0,far,low
2,2,2024-01-01,04-019-1028,32.295150,-110.982300,Arizona,Pima,Tucson,CHILDREN'S PARK NCore,16.30,...,0,1,2024-01-01,1,0,0,winter,0,far,low
3,3,2024-01-01,05-119-0007,34.756189,-92.281296,Arkansas,Pulaski,North Little Rock,PARR,5.90,...,0,1,2024-01-01,1,0,0,winter,0,far,low
4,4,2024-01-01,06-001-0011,37.814781,-122.282347,California,Alameda,Oakland,Oakland West,6.90,...,9,1,2024-01-01,1,0,0,winter,0,close,low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19797,19797,2024-12-31,49-035-3015,40.777145,-111.945849,Utah,Salt Lake,Salt Lake City,Utah Technical Center,4.50,...,0,1,2024-12-31,12,1,0,winter,0,far,low
19798,19798,2024-12-31,50-021-0002,43.608056,-72.982778,Vermont,Rutland,Rutland,State of Vermont District Court Parking Lot,4.70,...,0,1,2024-12-31,12,1,0,winter,0,far,low
19799,19799,2024-12-31,51-087-0014,37.556520,-77.400270,Virginia,Henrico,East Highland Park,MathScience Innovation Center,5.60,...,5,1,2024-12-31,12,1,0,winter,0,close,low
19800,19800,2024-12-31,53-033-0080,47.568236,-122.308628,Washington,King,Seattle,SEATTLE - BEACON HILL,3.40,...,1,1,2024-12-31,12,1,0,winter,0,moderate,low


In [3]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19792,19793,19794,19795,19796,19797,19798,19799,19800,19801
Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19792,19793,19794,19795,19796,19797,19798,19799,19800,19801
date,2024-01-01,2024-01-01,2024-01-01,2024-01-01,2024-01-01,2024-01-01,2024-01-01,2024-01-01,2024-01-01,2024-01-01,...,2024-12-31,2024-12-31,2024-12-31,2024-12-31,2024-12-31,2024-12-31,2024-12-31,2024-12-31,2024-12-31,2024-12-31
site_id,01-073-0023,04-013-9997,04-019-1028,05-119-0007,06-001-0011,06-013-0002,06-013-1004,06-019-0011,06-037-1103,06-065-8001,...,48-201-1035,48-201-1039,49-035-2005,49-035-3006,49-035-3010,49-035-3015,50-021-0002,51-087-0014,53-033-0080,56-021-0100
latitude,33.553056,33.503833,32.29515,34.756189,37.814781,37.936013,37.9604,36.78538,34.06659,33.99958,...,29.733726,29.670025,40.598056,40.736389,40.78422,40.777145,43.608056,37.55652,47.568236,41.182227
longitude,-86.815,-112.095767,-110.9823,-92.281296,-122.282347,-122.026154,-122.356811,-119.77321,-118.22688,-117.41601,...,-95.257593,-95.128508,-111.894167,-111.872222,-111.931,-111.945849,-72.982778,-77.40027,-122.308628,-104.778334
state_name,Alabama,Arizona,Arizona,Arkansas,California,California,California,California,California,California,...,Texas,Texas,Utah,Utah,Utah,Utah,Vermont,Virginia,Washington,Wyoming
county_name,Jefferson,Maricopa,Pima,Pulaski,Alameda,Contra Costa,Contra Costa,Fresno,Los Angeles,Riverside,...,Harris,Harris,Salt Lake,Salt Lake,Salt Lake,Salt Lake,Rutland,Henrico,King,Laramie
city_name,Birmingham,Phoenix,Tucson,North Little Rock,Oakland,Concord,San Pablo,Fresno,Los Angeles,Rubidoux,...,Houston,Deer Park,Midvale,Salt Lake City,Salt Lake City,Salt Lake City,Rutland,East Highland Park,Seattle,Cheyenne
site_name,North Birmingham,JLG SUPERSITE,CHILDREN'S PARK NCore,PARR,Oakland West,Concord,San Pablo,Fresno - Garland,Los Angeles-North Main Street,Rubidoux,...,Clinton,Houston Deer Park #2,Copper View,Hawthorne,ROSE PARK,Utah Technical Center,State of Vermont District Court Parking Lot,MathScience Innovation Center,SEATTLE - BEACON HILL,Cheyenne NCore
PM25,11.55,85.35,16.3,5.9,6.9,4.7,12.0,22.15,15.4,18.85,...,15.0,7.9,8.0,2.833333,4.666667,4.5,4.7,5.6,3.4,3.6


In [4]:
# Obtain a list of the columns for ease of typing
df.columns

Index(['Unnamed: 0', 'date', 'site_id', 'latitude', 'longitude', 'state_name',
       'county_name', 'city_name', 'site_name', 'PM25', 'CO', 'O3', 'NO2',
       'SO2', 'AQI_PM25', 'AQI_CO', 'AQI_O3', 'AQI_NO2', 'AQI_SO2', 'AQI',
       'temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min',
       'relative_humidity_2m_mean', 'wind_speed_10m_mean',
       'wind_direction_10m_dominant', 'precipitation_sum',
       'precipitation_hours', 'et0_fao_evapotranspiration', 'weather_code',
       'distance_to_fire_km', 'fire_brightness', 'fire_frp',
       'fires_within_50km', 'fires_within_100km', 'has_nearby_fire',
       'datetime', 'month', 'day_of_week', 'is_weekend', 'season',
       'wildfire_season', 'fire_distance_category', 'fire_intensity'],
      dtype='object')

### Variable Selection

In [5]:
# List numeric variables
# Only use mean temperature (instead of 'temperature_2m_max', 'temperature_2m_min'), mean wind speed (instead of 'wind_direction_10m_dominant'), and precipitation sum (instead of precipitation_hours') to avoid multicollinearity
# Temperature, humidity, and evaporatranspiration are often correlated, so remove the latter two variables
X_num = ['latitude', 'longitude', 'temperature_2m_mean', 'wind_speed_10m_mean', 'precipitation_sum', 'fires_within_50km', 'fires_within_100km', 'distance_to_fire_km', 'fire_brightness', 'fire_frp']

# Convert X_num to numerics, not strings
df[X_num] = df[X_num].apply(pd.to_numeric, errors='coerce')
df = df.dropna(subset=X_num + ['PM25']).reset_index(drop=True)

# Has nearby fire and wildfire season are already binary
# However, has nearby fire is described by the number of fires within 50 and 100 km - remove to avoid multicollinearity 
X_binary = ['wildfire_season']

# List categorical variables
# Site_name, Site_ID, state_name, county_name, city_name, and site name are multicollinear with lat and lon
# Fire distance category is correlated with the distance to fire in km so remove that 
X_cat = ['weather_code', 'season', 'fire_intensity']

# Time data are not used as predictors in our model; do not include those

# Reserve y variable in a separate df
y_df = df[['PM25']].copy() 

# Select binary and categorical X variables (still need to scale numeric variables)
selected_columns = X_binary + X_cat 
X_df = df[selected_columns].reset_index(drop=True)

# Scale our numeric variables
scaler = StandardScaler()
scaled_num_X = scaler.fit_transform(df[X_num])
scaled_X_df = pd.DataFrame(scaled_num_X, columns=X_num)

# Combine scaled_X_df and X_df
X_df = pd.concat([X_df, scaled_X_df], axis=1)

# Use One-Hot_Encoding for categorical variables
X_df_processed = pd.get_dummies(X_df, 
                                 columns=X_cat, 
                                 prefix='dummy', 
                                 drop_first=True)


In [6]:
cols = X_df_processed.columns.tolist()

In [7]:
# Set up X and y 
X = X_df_processed[cols].astype('float64')
X = sm.add_constant(X)
y = y_df[['PM25']].to_numpy(dtype='float64')

In [8]:
# Check for multicollinearity by analyzing the Variance Inflation Factor (VIF)
def vif(X):
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns

    vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                            for i in range(len(X.columns))]
    print(vif_data)

vif(X)

                     feature          VIF
0                      const  5080.630830
1            wildfire_season     5.631668
2                   latitude     1.607875
3                  longitude     1.281669
4        temperature_2m_mean     3.836251
5        wind_speed_10m_mean     1.157803
6          precipitation_sum     3.130571
7          fires_within_50km     1.769995
8         fires_within_100km     1.798840
9        distance_to_fire_km     1.292837
10           fire_brightness     1.548601
11                  fire_frp     6.754296
12       dummy_Dense drizzle     1.397890
13          dummy_Heavy rain     1.854548
14     dummy_Heavy snow fall     1.209746
15       dummy_Light drizzle     2.679066
16        dummy_Mainly clear     1.635395
17    dummy_Moderate drizzle     1.782914
18       dummy_Moderate rain     3.045576
19  dummy_Moderate snow fall     1.423706
20            dummy_Overcast     3.446839
21       dummy_Partly cloudy     1.839946
22         dummy_Slight rain     2

In [9]:
# Remove variables with moderate to strong multicollinearity (above 5) to improve model performance

# Fire frp and fire brightness instinctively will be correlated, so remove fire frp as per its VIF above 5
# Wildfire season has a high VIF - likely correlated to season
# The category fire_intensity has high VIF - likely due to multicollinearity with fire brightness - remove

# Important note: high VIFs within a single categorical variable are not necessarily a concern
# They likely indicate a small number of reference category cases due to our relatively small dataset
# We will address this later with bootstrapping
# One change to make is to combine certain weather_code values to increase the number of values in a category (will reduce VIF)


"""
Combine cells into clear, cloudy, rainy, snowy:

Notes:
1 ("Clouds generally dissolving or becoming less developed")
2 ("State of sky on the whole unchanged")
3 ("Clouds generally forming or developing")
51 ("Drizzle, not freezing, continuous - slight at time of observation")
53 ("Drizzle, not freezing, continuous - moderate at time of observation")
55 ("Drizzle, not freezing, continuous - heavy (dense) at time of observation") 
61 ("Rain, not freezing, continuous - slight at time of observation")
63 ("Rain, not freezing, continuous - moderate at time of observation"))
65 ("Rain, not freezing, continuous - heavy at time of observation")  
71 ("Continuous fall of snowflakes - slight at time of observation")
73 ("Continuous fall of snowflakes - moderate at time of observation")
75 ("Continuous fall of snowflakes - heavy at time of observation")

So:

Clear => 1, Clear sky, Mainly clear      
2 => Does not tell us about the actual state of the weather, remove
Cloudy => 3, Overcast, Partly cloudy 
Rainy => 51, 53, 55, 61, 63, 65, Dense drizzle, Heavy rain, Light drizzle, Moderate drizzle, Moderate rain, Slight rain
Snowy => 71, 73, 75, Heavy snow fall, Moderate snow fall, Slight snow fall 
"""

df['weather_code'] = df['weather_code'].replace(['1', 'Clear sky', 'Mainly clear'], 'clear')
df = df[df.weather_code != '2']
df['weather_code'] = df['weather_code'].replace(['3', 'Overcast', 'Partly cloudy'], 'cloudy')
df['weather_code'] = df['weather_code'].replace(['51', '53', '55', '61', '63', '65', 'Dense drizzle', 'Heavy rain', 'Light drizzle', 'Moderate drizzle', 'Moderate rain', 'Slight rain'], 'rainy')
df['weather_code'] = df['weather_code'].replace(['71', '73', '75', 'Heavy snow fall', 'Moderate snow fall', 'Slight snow fall'], 'snowy')

# List numeric variables
X_num = ['latitude', 'longitude', 'temperature_2m_mean', 'wind_speed_10m_mean', 'precipitation_sum', 'fires_within_50km', 'fires_within_100km', 'distance_to_fire_km', 'fire_brightness']

# Convert X_num to numerics, not strings
df[X_num] = df[X_num].apply(pd.to_numeric, errors='coerce')

# List categorical variables
# Site_name, Site_ID, state_name, county_name, city_name, and site name are multicollinear with lat and lon
# Fire distance category is correlated with the distance to fire in km so remove that 
X_cat = ['weather_code', 'season']

# Need to drop NaN values again because of the error coercion in X_num
# Drop NaNs only from columns we actually care about
# We include y ('PM25') in the subset to ensure we have targets for all rows
cols_to_check = X_num + X_cat + ['PM25']
df = df.dropna(subset=cols_to_check)

# Reserve y variable in a separate df
y_df = df[['PM25']].copy().reset_index(drop=True)

# Select categorical X variables (still need to scale numeric variables)
X_df_cat = df[X_cat].reset_index(drop=True)

# Select the numeric data from the X_df to scale
X_num_data = df[X_num]

# Scale our numeric variables
scaler = StandardScaler()
scaled_num_X = scaler.fit_transform(X_num_data)
scaled_X_df = pd.DataFrame(scaled_num_X, columns=X_num).reset_index(drop=True)

# Combine scaled_X_df and X_df
X_df = pd.concat([X_df_cat, scaled_X_df], axis=1)

# Use One-Hot_Encoding for categorical variables
X_df_processed = pd.get_dummies(X_df, 
                                columns=X_cat, 
                                prefix='dummy', 
                                drop_first=True)

cols = X_df_processed.columns.tolist()

# Set up X and y 
X = X_df_processed[cols].astype('float64')
X = sm.add_constant(X)
y = y_df[['PM25']].to_numpy(dtype='float64')

In [10]:
# Retest multicollinearity with VIF
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
print(vif_data)

                feature       VIF
0                 const  9.390895
1              latitude  1.525219
2             longitude  1.241639
3   temperature_2m_mean  3.319696
4   wind_speed_10m_mean  1.154808
5     precipitation_sum  1.249153
6     fires_within_50km  1.767626
7    fires_within_100km  1.794071
8   distance_to_fire_km  1.281565
9       fire_brightness  1.012107
10         dummy_cloudy  2.365160
11          dummy_rainy  2.684226
12          dummy_snowy  1.621005
13         dummy_spring  1.629298
14         dummy_summer  2.007822
15         dummy_winter  2.308080


In [11]:
# Now we can see that weather has high VIF regardless of the category, so remove this
# This are likely multicollinear with season (spring, summer, winter)

# List numeric variables as X, since only variables left are numeric
X_cols = ['latitude', 'longitude', 'temperature_2m_mean', 'wind_speed_10m_mean', 'precipitation_sum', 'fires_within_50km', 'fires_within_100km', 'distance_to_fire_km', 'fire_brightness']

# Convert X_num to numerics, not strings
df[X_cols] = df[X_cols].apply(pd.to_numeric, errors='coerce')

# Drop rows with NaN values after conversion and row removal in previous steps/ This aligns X and y within the main 'df' DataFrame.
df.dropna(subset=['PM25'] + X_cols, inplace=True)

# Reserve y variable in a separate df and reset index
y_df = df[['PM25']].copy().reset_index(drop=True)

# Select appropriate X variables and name it all such that we will scale it 
X_data = df[X_cols]

# Scale our  variables
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X_data)
scaled_df = pd.DataFrame(scaled_X, columns=X_cols)

# Set up X and y 
X = sm.add_constant(scaled_df)
y = y_df[['PM25']].to_numpy(dtype='float64')

In [12]:
# Check VIF one last time
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
print(vif_data)

               feature       VIF
0                const  1.000000
1             latitude  1.314693
2            longitude  1.202200
3  temperature_2m_mean  1.226561
4  wind_speed_10m_mean  1.129930
5    precipitation_sum  1.069564
6    fires_within_50km  1.766784
7   fires_within_100km  1.780217
8  distance_to_fire_km  1.244202
9      fire_brightness  1.009583


### Modeling and  Bootstrapping

In [13]:
# Make X and y pd dfs
X = pd.DataFrame(X)
y = pd.DataFrame(y)
y = y.rename(columns={0:'PM25'})

In [14]:
# Make a baseline model
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.30, random_state = 42) 

X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size = 0.50, random_state = 42) 

model1 = sm.OLS(y_train, X_train).fit()

print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:                   PM25   R-squared:                       0.120
Model:                            OLS   Adj. R-squared:                  0.120
Method:                 Least Squares   F-statistic:                     210.0
Date:                Thu, 11 Dec 2025   Prob (F-statistic):               0.00
Time:                        15:36:26   Log-Likelihood:                -41436.
No. Observations:               13827   AIC:                         8.289e+04
Df Residuals:                   13817   BIC:                         8.297e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   7.4185    

In [15]:
# Bootstrap entire process of making a model

def boot(X, y, S):
    
    # Create an array to store the coefficients for each model
    coef_store = np.zeros((S, X.shape[1]))

    n = len(X)

    # Divide each dataset into a test/train split
    for dataset in range(S):

        # Sample indices with replacement
        idx = np.random.choice(n, n, replace=True)

        # Bootstrap samples
        X_b = X.iloc[idx]
        y_b = y.iloc[idx]

        # Fit the model on the training data
        boot_model1 = sm.OLS(y_b,X_b).fit()

        # Store the coefficients
        coef_store[dataset, :] = boot_model1.params

    # Store results as a dataframe
    boot_results = pd.DataFrame(coef_store, columns=X.columns)

    return boot_results.mean(), boot_results.std()

# Bootstrap
mean, std = boot(X_train, y_train, 1000)

print(f"\nBootstrap Coefficient Means: \n{mean} \n\nBootstrap Coefficient Standard Errors: \n{std}")


Bootstrap Coefficient Means: 
const                  7.419504
latitude              -0.647469
longitude              0.158739
temperature_2m_mean    0.459312
wind_speed_10m_mean   -1.066381
precipitation_sum     -0.465975
fires_within_50km      0.015331
fires_within_100km     0.343225
distance_to_fire_km   -0.465118
fire_brightness       -0.118123
dtype: float64 

Bootstrap Coefficient Standard Errors: 
const                  0.041320
latitude               0.051926
longitude              0.046159
temperature_2m_mean    0.049732
wind_speed_10m_mean    0.050523
precipitation_sum      0.031031
fires_within_50km      0.237854
fires_within_100km     0.126309
distance_to_fire_km    0.033361
fire_brightness        0.037390
dtype: float64


In [16]:
# Compare the bootstrapped coefficients' standard errors to the OLS standard errors
# If they are similar: OLS assumptions are met, traditional inference is reliable
# If they differ greatly: OLS assumptions may be violated; trust bootstrap SEs for inference
# Large standard errors (regardless of method) suggest high uncertainty in coefficient estimates

# OLS standard errors (from summary)
ols_se = model1.bse

# Compare them
comparison = pd.DataFrame({
    'Bootstrap SE': std,
    'OLS SE': ols_se,
    'Abs Val of Diff': np.abs(std - ols_se)
})

print(comparison)

                     Bootstrap SE    OLS SE  Abs Val of Diff
const                    0.041320  0.041217         0.000103
latitude                 0.051926  0.047172         0.004754
longitude                0.046159  0.045068         0.001091
temperature_2m_mean      0.049732  0.045391         0.004340
wind_speed_10m_mean      0.050523  0.044008         0.006515
precipitation_sum        0.031031  0.042342         0.011311
fires_within_50km        0.237854  0.048915         0.188938
fires_within_100km       0.126309  0.052374         0.073934
distance_to_fire_km      0.033361  0.045117         0.011756
fire_brightness          0.037390  0.041471         0.004082


They are similar: OLS assumptions are met, traditional inference is reliable

### Residual Examination

In [17]:
df["fitted_vals"]=model1.fittedvalues
df["residuals"]=model1.resid

In [18]:
fig=px.scatter(df, x="fitted_vals", y="residuals")
fig.add_hline(y=0, line_dash="dash", line_color="red")
fig.show() 

As our fitted values for PM25 increase, the spread of residuals increases.

More of our residuals are positive, and some rise as high as almost 80.

This, combined with the low R-squared and adjusted R-squared, leads us to conclude that our model is not very accurate at predicting, despite the fact that all predictors were highly significant

### Comparing Train and Validate Set

In [19]:
def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** 0.5

rows = []

yhat_tr = model1.predict(X_train)
yhat_val = model1.predict(X_test)

rows.append({
    "Train RMSE": rmse(y_train, yhat_tr),
    "Validate RMSE": rmse(y_test, yhat_val),
    "Train R2": r2_score(y_train, yhat_tr),
    "Validate R2": r2_score(y_test, yhat_val),
})

results = pd.DataFrame(rows).sort_values("Validate RMSE")
print(results.round(4))

   Train RMSE  Validate RMSE  Train R2  Validate R2
0      4.8443         5.0489    0.1203       0.1076


Our train and validate RMSE and R-squared values do not differ significantly. This is a sign that our model adapts well to new data and is robust.

### Ridge Regularization

In [20]:
# Create Ridge regression models with various alphas to find the best one
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha = None
best_val_mse = float('inf')

for alpha in alphas:
    # Train on training set
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    
    # Evaluate on validation set
    y_pred_val = ridge_model.predict(X_val)
    val_mse = mean_squared_error(y_val, y_pred_val)

    print(f"Alpha: {alpha}, Validation MSE: {val_mse}")
    
    if val_mse < best_val_mse:
        best_val_mse = val_mse
        best_alpha = alpha

print(f"\nBest alpha: {best_alpha}")

Alpha: 0.01, Validation MSE: 22.40303040273487
Alpha: 0.1, Validation MSE: 22.40303168812954
Alpha: 1.0, Validation MSE: 22.403044549502585
Alpha: 10.0, Validation MSE: 22.403173904785593
Alpha: 100.0, Validation MSE: 22.40454055027116

Best alpha: 0.01


A note that the validation MSEs for all alphas are very similar. Still use the best one, of course.

In [21]:
# Evaluate on validate set
final_val_model = Ridge(alpha=best_alpha)
final_val_model.fit(X_val, y_val)

y_pred_val = final_val_model.predict(X_val)
val_mse = mean_squared_error(y_val, y_pred_val)

print(f"Final Validate MSE: {val_mse}")
print(f"Final Validate RMSE: {np.sqrt(val_mse)}")

Final Validate MSE: 22.215361376177817


Final Validate RMSE: 4.7133174491198675


The ridge model performs worse than the OLS model without ridge, as it has a higher RMSE value (5.0455) than the OLS validate RMSE (4.7662). Try lasso next.

### Lasso Regularization

In [22]:
# Create Lasso  models with various alphas to find the best one
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha = None
best_val_mse = float('inf')

# Use the train data to find the best alpha
for alpha in alphas:
    # Train on training set
    lasso_model = Lasso(alpha=alpha, fit_intercept=True) 
    lasso_model.fit(X_train, y_train)
    
    # Evaluate on validation set
    y_pred_val = lasso_model.predict(X_val)
    val_mse = mean_squared_error(y_val, y_pred_val)

    print(f"Alpha: {alpha}, Validation MSE: {val_mse}")
    
    if val_mse < best_val_mse:
        best_val_mse = val_mse
        best_alpha = alpha

print(f"\nBest alpha: {best_alpha}")

Alpha: 0.01, Validation MSE: 22.409455593006783
Alpha: 0.1, Validation MSE: 22.500409808320338
Alpha: 1.0, Validation MSE: 24.814444634359653
Alpha: 10.0, Validation MSE: 25.697989115640475
Alpha: 100.0, Validation MSE: 25.697989115640475

Best alpha: 0.01


In [23]:
# Evaluate on validate set
final_val_model = Lasso(alpha=best_alpha)
final_val_model.fit(X_val, y_val)

y_pred_val = final_val_model.predict(X_val)
val_mse = mean_squared_error(y_val, y_pred_val)

print(f"Final Validate MSE: {val_mse}")
print(f"Final Validate RMSE: {np.sqrt(val_mse)}")

Final Validate MSE: 22.216219705772478
Final Validate RMSE: 4.713408501898862


The lasso model also performs worse than the OLS model without ridge, as it has a higher RMSE value (5.0456) than the OLS validate RMSE (4.7662). Stick with OLS.

### Final Model Testing - OLS (No Ridge or Lasso)

In [24]:
model2 = sm.OLS(y_test,X_test).fit()

print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:                   PM25   R-squared:                       0.121
Model:                            OLS   Adj. R-squared:                  0.118
Method:                 Least Squares   F-statistic:                     45.05
Date:                Thu, 11 Dec 2025   Prob (F-statistic):           2.13e-76
Time:                        15:36:51   Log-Likelihood:                -8979.9
No. Observations:                2963   AIC:                         1.798e+04
Df Residuals:                    2953   BIC:                         1.804e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   7.4798    

In [25]:
def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** 0.5

rows = []

yhat_test = model2.predict(X_test)

rows.append({
    "Test RMSE": rmse(y_test, yhat_test),
    "Test R2": r2_score(y_test, yhat_test),
})

results = pd.DataFrame(rows).sort_values("Test RMSE")
print(results.round(4))

   Test RMSE  Test R2
0     5.0116   0.1207


In [26]:
df["fitted_vals"]=model2.fittedvalues
df["residuals"]=model2.resid

fig=px.scatter(df, x="fitted_vals", y="residuals")
fig.add_hline(y=0, line_dash="dash", line_color="red")
fig.show() 

### Conclusions

The residual plot for the model fitted to the validated model, as well as the RMSE and R-squared values, are similar to what was seen on the training and validation data.

The residual plot has the same flaws as previously noted, and our R-squared is low. These indicate a poor model fit.

To interpret the RMSE, we must determine the scaled range of the PM25 variable.

In [27]:
np.max(y_val['PM25']) - np.min(y_val['PM25'])

61.1

An RMSE of 4.7505 on a scale of almost 50 means that our predictions are off by about 8% of the range on average (about 4.7505/49.2). This means that our predictions are usually reasonable, despite being not very accurate, as indicated by our low R-squared. We conclude middling results for this model.