# Logistic Regression Baseline for Wildfire Prediction

This notebook demonstrates how to build a **logistic regression** model as a baseline classifier for predicting wildfire occurrences using weather and environmental variables. Logistic regression is a simple yet powerful linear model that estimates the probability of a binary outcome—in this case, whether a wildfire occurs (`fire_occurred = 1`) or not (`fire_occurred = 0`).

Because the wildfire dataset is highly imbalanced, with far fewer wildfire events than non‑events, the imbalance has to be accounted for both when training the model and when reading its metrics. In the following sections we load the data, perform basic cleaning, engineer features, split the data into training and testing sets, scale the features, train a logistic regression classifier (implemented as a single sigmoid unit in Keras), and evaluate its performance. The training run recorded here does not apply class weights, and the results show what that costs on the wildfire class.
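One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below computes such weights with scikit-learn's `compute_class_weight`. The label vector `y_demo` is a made-up stand-in for `fire_occurred`, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 non-fire rows and 10 fire rows (not the real data)
y_demo = np.array([0] * 990 + [1] * 10)

classes = np.unique(y_demo)
# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Keras expects a {class_index: weight} dictionary
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

A dictionary of this form can be passed to Keras through the `class_weight` argument of `model.fit`, so that errors on the rare class contribute more to the loss.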


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Display settings
pd.set_option('display.max_columns', None)





In [2]:
# TODO: Update the file path to point to your data file
data_path = './matched_data_20_miles.csv'

# Load the matched data
final_matched_data = pd.read_csv(data_path)

# Identify quality control (QC) columns and remove rows with unreliable flags
qc_columns = ['qc.1','qc.2','qc.3','qc.4','qc.5','qc.6','qc.7','qc.8','qc.9','qc.10','qc.11','qc.12','qc.13']
excluded_flags = {'M','I','S'}  # Missing, Invalid, or suspect readings

qc_data = final_matched_data[qc_columns].astype(str)

def contains_excluded_flag(series, excluded_flags):
    return series.apply(lambda val: any(flag in val for flag in excluded_flags))

# Create a mask for rows without excluded QC flags
column_masks = qc_data.apply(lambda col: ~contains_excluded_flag(col, excluded_flags))
mask = column_masks.all(axis=1)

# Filter out poor‑quality rows
cleaned_data = final_matched_data[mask].copy()
print(f'Removed {(~mask).sum()} rows due to QC flags.')


Removed 13913 rows due to QC flags.
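The same QC filter can be expressed with vectorized string matching instead of a per-value lambda. This is a minimal sketch on a toy frame: the two QC columns and their flag values are invented for illustration, and the regular-expression character class is built from the same `M`, `I`, `S` flags used above.

```python
import pandas as pd

# Toy QC columns with made-up flag values (not taken from the real dataset)
toy_qc = pd.DataFrame({
    "qc.1": [" ", "M", "Y"],
    "qc.2": ["R", " ", "S"],
})
excluded = {"M", "I", "S"}

# Character class such as "[IMS]" matches any single excluded flag
pattern = "[" + "".join(sorted(excluded)) + "]"

# True wherever a cell contains an excluded flag
bad = toy_qc.astype(str).apply(lambda col: col.str.contains(pattern, regex=True))

# Keep only rows where no QC column is flagged
toy_mask = ~bad.any(axis=1)
print(toy_mask.tolist())
```

Like the original function, this is a substring match, so it also flags any value that merely contains one of the letters.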


In [3]:
# List the base feature columns
general_features = [
    'ETo (mm)', 'Precip (mm)', 'Sol Rad (W/sq.m)', 'Avg Vap Pres (kPa)',
    'Max Air Temp (C)', 'Min Air Temp (C)', 'Avg Air Temp (C)',
    'Max Rel Hum (%)', 'Min Rel Hum (%)', 'Avg Rel Hum (%)',
    'Dew Point (C)', 'Avg Wind Speed (m/s)', 'Wind Run (km)', 'Avg Soil Temp (C)'
]

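# Compute a 7‑day trailing average for each feature within each weather station.
# rolling(7) spans the current row and the 6 rows before it, so this assumes one row
# per station per day, already in chronological order.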
for col in general_features:
    cleaned_data[f'{col}_7d_avg'] = (
        cleaned_data.groupby('StationNbr')[col]
        .transform(lambda x: x.rolling(7, min_periods=1).mean())
    )

# Select the trailing average features and the target label
all_trailing_features = [f'{col}_7d_avg' for col in general_features]
X = cleaned_data[all_trailing_features].dropna()
y = cleaned_data.loc[X.index, 'fire_occurred']

print(f'Feature matrix shape: {X.shape}')


Feature matrix shape: (183666, 14)
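Because `rolling` works on row order rather than on dates, the trailing average is only meaningful when each station's rows are sorted chronologically. The sketch below shows the sort-then-roll pattern on a toy frame. The `Date` and `temp` column names are hypothetical, and a 3-row window is used to keep the example small.

```python
import pandas as pd

# Toy data; "Date" and "temp" are hypothetical column names
toy = pd.DataFrame({
    "StationNbr": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime([
        "2020-01-03", "2020-01-01", "2020-01-02",
        "2020-01-01", "2020-01-02", "2020-01-03",
    ]),
    "temp": [30.0, 10.0, 20.0, 5.0, 15.0, 25.0],
})

# Sort within each station so the window covers consecutive days
toy = toy.sort_values(["StationNbr", "Date"]).reset_index(drop=True)

# Trailing mean over the current row and up to 2 previous rows, per station
toy["temp_3d_avg"] = (
    toy.groupby("StationNbr")["temp"]
    .transform(lambda s: s.rolling(3, min_periods=1).mean())
)
print(toy)
```

With `min_periods=1`, the first rows of each station are averaged over fewer observations instead of being set to NaN, which matches the behavior of the cell above.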


In [4]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features using z‑score scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [5]:
model = Sequential([
    Dense(1, input_shape=(X_train_scaled.shape[1],), activation='sigmoid')
])



In [6]:
# Compile the model; binary cross-entropy is the loss minimized by logistic regression
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

In [7]:
# Train with the Keras defaults (a single epoch, batch size 32); no class_weight is passed
history = model.fit(
    X_train_scaled, y_train
)

4592/4592 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - accuracy: 0.8748 - loss: 0.4229


In [8]:
loss, accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Test Accuracy: {accuracy:.2f}")

# Predict class labels
y_probs = model.predict(X_test_scaled)
y_pred = (y_probs >= 0.5).astype(int)

Test Accuracy: 0.99
[1m1148/1148[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 890us/step


In [11]:
# Report precision, recall, accuracy
print("Logistic Regression Results:")
print("\nFull Classification Report:")
print(classification_report(y_test, y_pred))

Logistic Regression Results:

Full Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     36298
           1       0.00      0.00      0.00       436

    accuracy                           0.99     36734
   macro avg       0.49      0.50      0.50     36734
weighted avg       0.98      0.99      0.98     36734
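A report like this depends on the 0.5 decision threshold. When predicted probabilities for the rare class rarely exceed 0.5, threshold-free metrics and a lower cut-off can be more informative. The sketch below illustrates the idea on hand-built scores. Both `y_true_demo` and `scores_demo` are invented for illustration and say nothing about the model trained above.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

# Invented example: 95 negatives with low scores, 5 positives with modest scores
y_true_demo = np.array([0] * 95 + [1] * 5)
scores_demo = np.concatenate([
    np.linspace(0.0, 0.3, 95),
    np.array([0.20, 0.25, 0.35, 0.40, 0.45]),
])

# Average precision summarizes the precision-recall curve without fixing a threshold
ap = average_precision_score(y_true_demo, scores_demo)
print(f"Average precision: {ap:.3f}")

# Compare precision and recall for the positive class at two thresholds
results = {}
for threshold in (0.5, 0.32):
    y_hat = (scores_demo >= threshold).astype(int)
    p = precision_score(y_true_demo, y_hat, zero_division=0)
    r = recall_score(y_true_demo, y_hat, zero_division=0)
    results[threshold] = (p, r)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy case no score reaches 0.5, so recall is zero at the default threshold even though the positives are ranked fairly high. Lowering the threshold recovers some of them.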





## Conclusion

This notebook provided a straightforward logistic regression baseline for wildfire prediction. We cleaned the raw matched dataset by removing rows with unreliable quality‑control flags, engineered 7‑day trailing averages for key weather features, and trained a logistic regression classifier, implemented as a single sigmoid unit in Keras. The training run shown here did not apply class weights, so nothing counteracted the imbalanced target distribution.

- Logistic Regression: Accuracy = 99%; Precision = 0% and Recall = 0% for the wildfire class

The accuracy is extremely high only because of the class imbalance. The model successfully predicts only the majority 'no fire' class and fails to detect any fire events.
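The 99% figure can be reproduced from the support counts in the classification report alone. A classifier that always answers 'no fire' reaches the same accuracy, as this short calculation shows:

```python
# Support counts taken from the classification report above
n_no_fire = 36298
n_fire = 436

# Accuracy of a trivial classifier that always predicts "no fire"
baseline_accuracy = n_no_fire / (n_no_fire + n_fire)
print(f"Always-'no fire' accuracy: {baseline_accuracy:.4f}")
```

This comes to roughly 0.988, which rounds to the reported 0.99. Accuracy alone therefore cannot distinguish this model from one that has learned nothing about fires.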

Compare these results to other models such as random forests, gradient boosting (e.g., XGBoost), or deep neural networks. Logistic regression offers an interpretable baseline against which to judge them.
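As one possible next step, here is a minimal sketch of a class-weighted logistic regression using scikit-learn's `LogisticRegression` with `class_weight="balanced"`. The synthetic dataset from `make_classification` is an assumption standing in for the wildfire features, so the scores it prints do not describe the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 3% positives); not the wildfire dataset
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=14, n_informative=5,
    weights=[0.97, 0.03], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling and the classifier in one pipeline; "balanced" reweights classes
# inversely to their frequency in the training labels
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_tr, y_tr)

y_hat = clf.predict(X_te)
print(classification_report(y_te, y_hat, zero_division=0))
```

Class weighting typically trades some precision on the majority class for recall on the minority class. Whether that trade is acceptable depends on the relative cost of missed fires and false alarms.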