# How does wildfire impact air quality?

## The goal of this notebook is to build a logistic regression model exploring the impact of wildfire on air quality.

#### logit(p) = ln(p / (1 - p)) = β0 + β1X1 + β2X2 + ... + βnXn, where p is the probability that y = 1, y describes whether PM2.5 levels are at or above (1) or below (0) a chosen value, β0 is a constant, X1 through Xn are the predictors, and β1 through βn are the predictors' coefficients.
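
In a logistic regression, the linear combination of predictors is read as log-odds, and the logistic (sigmoid) function converts it to a probability. The sketch below illustrates that mapping for one observation. The coefficient and predictor values are hypothetical and chosen only for illustration; they are not estimates from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients for two predictors (illustrative only)
beta0 = -0.5
betas = np.array([0.8, -0.3])

# One hypothetical standardized observation
x = np.array([1.2, 0.4])

log_odds = beta0 + betas @ x      # linear predictor
prob = sigmoid(log_odds)          # P(y = 1 | x)
label = int(prob >= 0.5)          # default 0.5 decision threshold

print(f"log-odds = {log_odds:.3f}, P(y=1) = {prob:.3f}, predicted class = {label}")
```

A positive log-odds value always maps to a probability above 0.5, which is why the sign of the linear predictor determines the predicted class at the default threshold.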

### Imports

In [63]:
# Needed imports

import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix, roc_curve, roc_auc_score, auc
import plotly.graph_objects as go

### Loading the Data

In [54]:
# Load the data
script_dir = os.getcwd()

df = pd.read_csv(f'{script_dir}/air_quality_weather_fires.csv')

df = df.dropna()

df

Unnamed: 0.1,Unnamed: 0,date,site_id,latitude,longitude,state_name,county_name,city_name,site_name,PM25,...,fires_within_100km,has_nearby_fire,datetime,month,day_of_week,is_weekend,season,wildfire_season,fire_distance_category,fire_intensity
0,0,2024-01-01,01-073-0023,33.553056,-86.815000,Alabama,Jefferson,Birmingham,North Birmingham,11.55,...,3,1,2024-01-01,1,0,0,winter,0,close,low
1,1,2024-01-01,04-013-9997,33.503833,-112.095767,Arizona,Maricopa,Phoenix,JLG SUPERSITE,85.35,...,0,1,2024-01-01,1,0,0,winter,0,far,low
2,2,2024-01-01,04-019-1028,32.295150,-110.982300,Arizona,Pima,Tucson,CHILDREN'S PARK NCore,16.30,...,0,1,2024-01-01,1,0,0,winter,0,far,low
3,3,2024-01-01,05-119-0007,34.756189,-92.281296,Arkansas,Pulaski,North Little Rock,PARR,5.90,...,0,1,2024-01-01,1,0,0,winter,0,far,low
4,4,2024-01-01,06-001-0011,37.814781,-122.282347,California,Alameda,Oakland,Oakland West,6.90,...,9,1,2024-01-01,1,0,0,winter,0,close,low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19797,19797,2024-12-31,49-035-3015,40.777145,-111.945849,Utah,Salt Lake,Salt Lake City,Utah Technical Center,4.50,...,0,1,2024-12-31,12,1,0,winter,0,far,low
19798,19798,2024-12-31,50-021-0002,43.608056,-72.982778,Vermont,Rutland,Rutland,State of Vermont District Court Parking Lot,4.70,...,0,1,2024-12-31,12,1,0,winter,0,far,low
19799,19799,2024-12-31,51-087-0014,37.556520,-77.400270,Virginia,Henrico,East Highland Park,MathScience Innovation Center,5.60,...,5,1,2024-12-31,12,1,0,winter,0,close,low
19800,19800,2024-12-31,53-033-0080,47.568236,-122.308628,Washington,King,Seattle,SEATTLE - BEACON HILL,3.40,...,1,1,2024-12-31,12,1,0,winter,0,moderate,low


### Preparing the Data

In [55]:
# Select the same predictors used in the multiple linear regression and process them the same way

X_cols = ['latitude', 'longitude', 'temperature_2m_mean', 'wind_speed_10m_mean', 'precipitation_sum', 'fires_within_50km', 'fires_within_100km', 'distance_to_fire_km', 'fire_brightness']

# Convert the X_cols columns to numerics, not strings; unparseable values become NaN
df[X_cols] = df[X_cols].apply(pd.to_numeric, errors='coerce')

# Drop rows with NaN in the target or any predictor. This keeps X and y aligned within the main 'df' DataFrame.
df.dropna(subset=['PM25'] + X_cols, inplace=True)

# Reserve the y variable in a separate df and reset its index
y_df = df[['PM25']].copy().reset_index(drop=True)

# Label PM25 values as at or above (1) or below (0) the median, giving a binary y that works in a logit model
y_median = np.median(y_df['PM25'])
y = (y_df['PM25'] >= y_median).astype(int).rename('PM25_median')

# Select the X variables that will be scaled
X = df[X_cols]

# Scale the numeric variables
scaler = StandardScaler()
scaled_num_X = scaler.fit_transform(X)
scaled_X_df = pd.DataFrame(scaled_num_X, columns=X_cols)

# Create X df with type float64
X = scaled_X_df[X_cols].astype('float64')
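
The cell above fits the scaler on all rows before the data is split, so the validation and test rows contribute to the scaling statistics. One common alternative, sketched here as an assumption rather than as this notebook's method, is to place `StandardScaler` inside the `Pipeline` so that it is fitted on the training rows only. The data below is synthetic and the names `X_demo`, `y_demo`, and `leak_free` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 rows, 3 unscaled predictors (not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=500) >= 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

# The scaler is a pipeline step, so fit() learns its mean and variance from X_tr only
leak_free = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(solver="lbfgs", max_iter=2000)),
])
leak_free.fit(X_tr, y_tr)

# The stored scaling mean matches the training-set mean, not the full-data mean
scaler_mean = leak_free.named_steps["scaler"].mean_
print(scaler_mean, X_tr.mean(axis=0))
print("held-out accuracy:", leak_free.score(X_te, y_te))
```

With this layout, `predict` and `predict_proba` apply the training-set scaling to new rows automatically.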

In [56]:
# Train, validate, test split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size = 0.30, random_state = 42) 
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size = 0.50, random_state = 42) 
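
The two calls above first hold out 30% of the rows and then halve that holdout, which yields a 70/15/15 train, validation, and test split. The sketch below checks those proportions on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 1,000 rows
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# Step 1: 70% train, 30% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
# Step 2: split the held-out 30% evenly into validation and test
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
print(sizes)
```

If class balance across the splits matters, `train_test_split` also accepts a `stratify` argument; whether that is needed here is a judgment call, since a median split already produces nearly balanced classes.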

### Build the Logit Model

In [57]:
# Begin to build the logit model; the target is binary, so no multiclass setting is needed
LogisticRegression(solver="lbfgs")

parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


In [58]:
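# Fit the classifier; the predictors were standardized earlier, so the pipeline holds only the model step
pipe = Pipeline(steps = [
    ("model", LogisticRegression(solver="lbfgs", max_iter = 2000))])

pipe.fit(X_train, y_train)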

  y = column_or_1d(y, warn=True)


parameter,value
steps,"[('model', ...)]"
transform_input,None
memory,None
verbose,False

parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,2000


### Model Evaluation

In [62]:
# Evaluate the model on the test data
accuracy = accuracy_score(y_test, pipe.predict(X_test))
logloss = log_loss(y_test, pipe.predict_proba(X_test))

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

Accuracy: 0.6722915963550455
Log Loss: 0.620174773235742


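The accuracy is moderate: about 67% of the test predictions are correct. Because the target is split at the median, the two classes are nearly balanced, so always predicting the larger class would already score about 52% (1,529 of 2,963 test rows). The model improves on that baseline by roughly 15 percentage points.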

Log loss is the quantity a logistic regression minimizes when fitting its coefficients (scikit-learn adds an L2 penalty by default). For each observation, it takes the negative log of the probability the model assigned to the true class, then averages over all observations. It therefore measures not just whether the model is right or wrong but also how confident it was, and it penalizes confident wrong answers heavily. Lower is better, and 0 is a perfect score. A model that always predicts a probability of 0.5 scores ln(2) ≈ 0.693, so 0.620 is only a modest improvement over an uninformative model, which suggests the predictors carry limited signal.
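
To make the log loss concrete, the sketch below computes it by hand on a small set of hypothetical labels and probabilities, compares the result with `sklearn.metrics.log_loss`, and computes the score of an uninformative model that always predicts 0.5. The helper `binary_log_loss` and the toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, p_pred):
    """Average negative log-likelihood of binary labels under predicted P(y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Hypothetical labels and predicted probabilities of class 1
y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.3, 0.4]

manual = binary_log_loss(y_true, p_pred)
reference = log_loss(y_true, p_pred)                       # scikit-learn's implementation
baseline = binary_log_loss(y_true, [0.5] * len(y_true))    # always predict 0.5

print(f"manual = {manual:.4f}, sklearn = {reference:.4f}, coin-flip baseline = {baseline:.4f}")
```

The baseline equals ln(2) regardless of the labels, which makes it a convenient reference point for any binary log loss.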

In [61]:
confusion_matrix(y_test, pipe.predict(X_test))

array([[ 920,  514],
       [ 457, 1072]])

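In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix holds 920 true negatives, 514 false positives, 457 false negatives, and 1,072 true positives. Correct predictions outnumber errors in both rows: about 1.8 to 1 for the below-median class and about 2.3 to 1 for the at-or-above-median class. This is a good sign, though roughly a third of the test rows are still misclassified.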
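
The counts in the confusion matrix can be turned into the usual summary rates. The sketch below does that arithmetic using the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[920, 514],
               [457, 1072]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)       # share of predicted positives that are correct
recall = tp / (tp + fn)          # share of true positives that are found
specificity = tn / (tn + fp)     # share of true negatives that are found
majority_baseline = max(tn + fp, fn + tp) / cm.sum()   # always predict the larger class

print(f"accuracy = {accuracy:.4f}")
print(f"precision = {precision:.4f}, recall = {recall:.4f}, specificity = {specificity:.4f}")
print(f"majority-class baseline = {majority_baseline:.4f}")
```

The accuracy computed this way matches the value reported by `accuracy_score`, and the gap between recall and specificity shows the model is somewhat better at identifying high-PM2.5 days than low ones.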

### ROC Curves and AUC

Because the target is binary, a single ROC curve describes the model. It is built from the true labels and the predicted probabilities of the positive class, evaluated across all decision thresholds. The following code plots that curve.


In [None]:
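# Compute ROC points from the predicted probability of class 1 (PM2.5 at or above the median)
fpr, tpr, _ = roc_curve(y_test, pipe.predict_proba(X_test)[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x = fpr, y = tpr, mode = "lines"))

fig.update_layout(title='ROC Curve',xaxis_title='False Positive Rate',yaxis_title='True Positive Rate')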
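
A ROC curve is usually summarized by the area under it (AUC), where 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes the AUC two ways on a tiny hypothetical example; the labels and scores are illustrative and are not taken from this notebook's model.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: integrate the ROC curve with the trapezoidal rule
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# Route 2: compute the AUC directly from labels and scores
direct = roc_auc_score(y_true, y_score)

print(f"AUC from curve = {area:.2f}, AUC direct = {direct:.2f}")
```

The same two calls could be applied to `y_test` and the positive-class column of `pipe.predict_proba(X_test)` to report an AUC alongside the plot.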