In [None]:
# Forest Fire Two-Stage Modeling: Classification + Negative Binomial Regression
# Author: [Your Name]
# Date: [Current Date]
# Description:
# This notebook models forest fire behavior in two stages:
# Stage 1 - Classification: Will a fire occur? (area > 0)
# Stage 2 - Regression: If fire occurs, how large is the area? (using Negative Binomial Regression)
# This respects real-world logic: first ignition, then spread. No rows are deleted: Stage 1 uses every record, and Stage 2 uses only the records where area > 0.
#
# Discovery Background:
# Initially, forest fire area ("area") appeared to follow an exponential pattern, but correlation and variance checks showed only a weak, scattered signal.
# ISI (Initial Spread Index) and wind showed very low variance and little correlation with area, so they are excluded from the Stage 2 regression.
# After filtering out records with ISI = 0 and area = 0, the remaining data revealed a clearer structure:
# - Fires seemed to emerge independently in separate patches
# - Each ignition resembled a binomial event (burn or not), and fire area resembled a sum of such ignition events
# This led to the hypothesis that forest fire area is better modeled as a **stochastic binomial process** than as an exponential one.
# A Negative Binomial Regression was then applied to capture this overdispersed, discrete-fire-area structure.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score

# Load data
df = pd.read_csv('forestfires.csv')
df = df.drop(columns=['day'], errors='ignore')

# Stage 1 target: fire_occur = 1 if area > 0, else 0
df['fire_occur'] = (df['area'] > 0).astype(int)

# Stage 1: Classification - Will a fire occur?
X_class = pd.get_dummies(df.drop(columns=['area', 'fire_occur']), drop_first=True)
y_class = df['fire_occur']
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_class, y_class, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_c, y_train_c)
y_pred_c = clf.predict(X_test_c)
print("Classification Accuracy:", accuracy_score(y_test_c, y_pred_c))
print(classification_report(y_test_c, y_pred_c))

# Stage 2: Regression - How much area will burn (only where area > 0)
df_reg = df[df['area'] > 0].copy()
df_reg = df_reg.drop(columns=['ISI', 'wind'], errors='ignore')
df_reg['month'] = df_reg['month'].astype('category')

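# Drop the target and fire_occur: fire_occur is 1 for every row in df_reg, so it would duplicate the intercept
X_reg = pd.get_dummies(df_reg.drop(columns=['area', 'fire_occur']), drop_first=True)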
y_reg = df_reg['area'].astype(float)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Ensure test has same columns as train
X_test_r = X_test_r.reindex(columns=X_train_r.columns, fill_value=0)

# Add an intercept column and cast to float (get_dummies can produce boolean columns)
X_train_r_const = sm.add_constant(X_train_r, has_constant='add').astype(float)
X_test_r_const = sm.add_constant(X_test_r, has_constant='add').astype(float)

# Fit Negative Binomial Regression (GLM with the default log link; alpha is fixed at 1.0 here, not estimated from the data)
model = sm.GLM(y_train_r, X_train_r_const, family=sm.families.NegativeBinomial(alpha=1.0)).fit()

# Predict and evaluate
predictions = model.predict(X_test_r_const)
rmse = np.sqrt(mean_squared_error(y_test_r, predictions))
r2 = r2_score(y_test_r, predictions)
print("\nNegative Binomial Regression Summary:")
print(model.summary())
print("\nRegression RMSE:", rmse)
print("Regression R²:", r2)

# Final Notes:
# This notebook introduces a two-stage statistical ML framework where fire occurrence and magnitude are modeled separately.
# It avoids naive assumptions (like exponential area spread), and instead uses variance, correlation, and pattern recognition
# to propose a binomial-generative theory of fire behavior.
# 
# Fire occurrence is treated as a stochastic binary outcome and modeled with a Random Forest classifier.
# Area burned is treated as overdispersed and modeled with Negative Binomial Regression.
# 
# The final model uses only real, interpretable features, with no artificially introduced time-sequence variables, preserving the clarity of the mathematical insight behind fire ignition and progression.
# 
# This approach is well-suited for roles in applied machine learning, data science, environmental modeling, and research-oriented ML teams.



Classification Accuracy: 0.5288461538461539
              precision    recall  f1-score   support

           0       0.52      0.45      0.48        51
           1       0.53      0.60      0.57        53

    accuracy                           0.53       104
   macro avg       0.53      0.53      0.53       104
weighted avg       0.53      0.53      0.53       104
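
To put the accuracy above in context, it helps to compare it with a majority-class baseline. From the reported support (51 vs. 53 out of 104), always predicting class 1 would score 53/104, about 0.51. The sketch below shows how such a comparison could be run with scikit-learn's `DummyClassifier` and stratified cross-validation. It uses synthetic data, and every variable name in it is an assumption introduced for illustration, not taken from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Baseline implied by the classification report above: always predict class 1
reported_baseline = 53 / 104
print("Majority-class accuracy from reported support:", round(reported_baseline, 3))

# Synthetic, weakly informative data (hypothetical, not forestfires.csv)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=2.0, size=500) > -0.8).astype(int)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline = DummyClassifier(strategy="most_frequent")
forest = RandomForestClassifier(n_estimators=100, random_state=42)

baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
forest_acc = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="accuracy").mean()

print("Baseline CV accuracy:", round(baseline_acc, 3))
print("Random Forest CV accuracy:", round(forest_acc, 3))
```

A classifier whose accuracy sits within a few points of the baseline is not yet separating the two classes in a useful way.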






Negative Binomial Regression Summary:
                 Generalized Linear Model Regression Results                  
Dep. Variable:                   area   No. Observations:                  216
Model:                            GLM   Df Residuals:                      187
Model Family:        NegativeBinomial   Df Model:                           28
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -845.16
Date:                Wed, 16 Apr 2025   Deviance:                       364.11
Time:                        02:28:14   Pearson chi2:                     541.
No. Iterations:                   100   Pseudo R-squ. (CS):             0.5728
Covariance Type:            nonrobust                                         
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------