# Wildfire Risk Prediction – Baseline Model

Goal: Build a simple baseline model to predict **wildfire occurrence** using 
environmental and weather features.

This notebook is part of the **Disaster GeoAI** portfolio:
- Task: Supervised classification (fire vs. no fire)
- Techniques: EDA, feature engineering, Random Forest baseline
- Output: Basic metrics + feature importance + optional map


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

import matplotlib.pyplot as plt

# Optional geo packages (install if needed)
try:
    import geopandas as gpd
except ImportError:
    gpd = None
    print("GeoPandas not installed – maps will be skipped.")
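
Before running the rest of the notebook, it can help to confirm that the packages imported above are actually installed. The following is a minimal sketch using only the standard library; the `REQUIRED` and `OPTIONAL` lists and the `missing_modules` helper are illustrative names, not part of the original notebook. Note that scikit-learn is installed under the package name `scikit-learn` but imported as `sklearn`.

```python
import importlib.util

# Import names (not pip names): scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "sklearn", "matplotlib"]
OPTIONAL = ["geopandas"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_required = missing_modules(REQUIRED)
missing_optional = missing_modules(OPTIONAL)

if missing_required:
    print("Missing required packages:", ", ".join(missing_required))
if missing_optional:
    print("Missing optional packages:", ", ".join(missing_optional))
```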



## 1. Load Data

Assumed columns (adapt these to match your dataset):

- `latitude`, `longitude`
- `temperature`, `humidity`, `wind_speed`
- `vegetation_index` (e.g., NDVI)
- `drought_index`
- `fire_occurred` (0/1 label)

Replace the path below with your actual CSV file.
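
Since the rest of the notebook relies on the column names listed above, a quick schema check right after loading can catch mismatches early. This is a sketch under the assumption that your file uses exactly those names; `check_schema` and `EXPECTED_COLUMNS` are hypothetical helpers, and the example frame holds made-up values used only to show the call.

```python
import pandas as pd

# Column names taken from the list above; adjust if your dataset differs.
EXPECTED_COLUMNS = [
    "latitude", "longitude",
    "temperature", "humidity", "wind_speed",
    "vegetation_index", "drought_index",
    "fire_occurred",
]


def check_schema(frame, expected=EXPECTED_COLUMNS, label="fire_occurred"):
    """Raise ValueError if columns are missing or the label is not 0/1."""
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    bad_labels = set(frame[label].dropna().unique()) - {0, 1}
    if bad_labels:
        raise ValueError(f"Unexpected label values: {sorted(bad_labels)}")
    return True


# Tiny made-up frame, only to show the call; not real data.
example = pd.DataFrame({
    "latitude": [34.1, 36.7],
    "longitude": [-118.2, -119.8],
    "temperature": [31.0, 24.5],
    "humidity": [18.0, 42.0],
    "wind_speed": [7.2, 3.1],
    "vegetation_index": [0.31, 0.58],
    "drought_index": [2.4, 0.9],
    "fire_occurred": [1, 0],
})
check_schema(example)
```

In the notebook itself, the call would be `check_schema(df)` once `df` has been loaded.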


In [None]:
# TODO: change this path to your data
data_path = "../data/wildfire_sample.csv"

df = pd.read_csv(data_path)
df.head()


In [None]:
# Column dtypes, non-null counts, and summary statistics
df.info()
display(df.describe(include="all"))

# Class balance: share of fire (1) vs. no-fire (0) rows
print("Class distribution (fire_occurred):")
print(df["fire_occurred"].value_counts(normalize=True))


In [None]:
# Fraction of missing values per column (20 columns with the most gaps)
df.isna().mean().sort_values(ascending=False).head(20)
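
If the missing-value shares are non-trivial, one common option is median imputation, fitted on the training portion only so that test-set statistics do not leak into the model. The sketch below uses scikit-learn's `SimpleImputer` on two tiny made-up frames; the `train` and `test` frames and their values are illustrative assumptions, not the notebook's data, and median imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numbers, only to illustrate the mechanics.
train = pd.DataFrame({
    "temperature": [30.0, np.nan, 34.0],
    "humidity": [20.0, 40.0, np.nan],
})
test = pd.DataFrame({"temperature": [np.nan], "humidity": [35.0]})

imputer = SimpleImputer(strategy="median")

# Fit on the training rows only, then reuse the learned medians on the test rows.
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```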


In [None]:
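target_col = "fire_occurred"
# Use every column except the label as a feature
# (assumes all remaining columns are numeric)
feature_cols = [c for c in df.columns if c != target_col]

X = df[feature_cols]
y = df[target_col]

# Hold out 20% for testing; stratify so both splits keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)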

X_train.shape, X_test.shape
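
The `stratify=y` argument matters when fires are rare, because a purely random split can leave the test set with too few positive cases. The following sketch checks the effect on synthetic labels with 10% positives; `X_demo` and `y_demo` are placeholder arrays invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (10% positives), only for illustration.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both rates should stay close to the overall 10%.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

The same comparison can be run on the notebook's own split with `y_train.mean()` and `y_test.mean()`.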


In [None]:
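# Baseline Random Forest: 200 trees, no depth limit, all CPU cores
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)

rf.fit(X_train, y_train)

# Hard class predictions and predicted probability of the fire class (1)
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]

# Per-class precision/recall/F1, plus threshold-independent ROC-AUC
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))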
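
When the positive class is rare, ROC-AUC can look optimistic, so it may be worth reporting average precision (a summary of the precision-recall curve) and a confusion matrix alongside it. The sketch below runs on synthetic data from `make_classification`, not on wildfire data; `demo_rf`, `X_demo`, and `y_demo` are illustrative names, and the smaller forest is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rf.fit(X_tr, y_tr)
proba = demo_rf.predict_proba(X_te)[:, 1]

# Average precision summarises the precision-recall curve in one number.
ap = average_precision_score(y_te, proba)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, demo_rf.predict(X_te))

print("PR-AUC (average precision):", ap)
print(cm)
```

On the notebook's own model, the equivalent calls would be `average_precision_score(y_test, y_proba)` and `confusion_matrix(y_test, y_pred)`.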


In [None]:
# Impurity-based importances from the fitted forest, highest first
importances = pd.Series(rf.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=False)

# Bar chart of the 15 most important features
plt.figure(figsize=(8, 4))
importances.head(15).plot(kind="bar")
plt.title("Top Feature Importances – Wildfire Risk")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()

importances.head(15)
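
The `feature_importances_` attribute used above is impurity-based and is computed from the training data. Permutation importance on held-out data is a commonly used cross-check. The sketch below shows the mechanics on synthetic data; the feature names, `demo_rf`, and the choice of `roc_auc` scoring are assumptions made for this illustration rather than part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rf.fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in ROC-AUC.
result = permutation_importance(
    demo_rf, X_te, y_te, n_repeats=5, random_state=0, scoring="roc_auc"
)
perm_importances = pd.Series(result.importances_mean, index=names)
perm_importances = perm_importances.sort_values(ascending=False)

print(perm_importances)
```

For the notebook's model, the call would take `rf`, `X_test`, and `y_test`, with `feature_cols` as the index.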


In [None]:
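if gpd is not None and {"latitude", "longitude"}.issubset(df.columns):
    # Build point geometries from lon/lat in WGS 84 (EPSG:4326)
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
        crs="EPSG:4326",
    )
    display(gdf.head())

    # Draw the points, coloured by the fire label
    ax = gdf.plot(
        column="fire_occurred", categorical=True,
        markersize=5, legend=True, figsize=(8, 6),
    )
    ax.set_title("Wildfire Occurrence by Location")
    plt.show()
else:
    print("GeoPandas not available or no lat/long columns – skipping map.")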
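
When GeoPandas is not installed, a plain Matplotlib scatter of longitude against latitude gives a rough view of where fires occur. This is a sketch only: `plot_points` is a hypothetical helper, the demo frame contains made-up coordinates, and the plot uses raw degrees with no map projection or basemap.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_points(frame, label="fire_occurred"):
    """Scatter longitude/latitude, coloured by the 0/1 label. Returns the Axes."""
    fig, ax = plt.subplots(figsize=(6, 5))
    points = ax.scatter(
        frame["longitude"], frame["latitude"],
        c=frame[label], cmap="coolwarm", s=20,
    )
    fig.colorbar(points, ax=ax, label=label)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("Fire occurrence by location")
    return ax


# Made-up coordinates and labels, only to show the call.
demo = pd.DataFrame({
    "latitude": [34.1, 36.7, 38.5, 39.2],
    "longitude": [-118.2, -119.8, -121.5, -120.1],
    "fire_occurred": [1, 0, 0, 1],
})
ax = plot_points(demo)
```

In the notebook, `plot_points(df)` followed by `plt.show()` would render the figure.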
