<a href="https://colab.research.google.com/github/gianinapetrascu/wids-datathon-university-solution1/blob/main/Route2_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Title & Team Info

**Project Title**: _Workshop 1: WiDS University Datathon 2026_  
**Team Name**: _Research DUO_  
**University**: _Bucharest University of Economic Studies_  
**Course**: _Software Open Source for Statistics and Data Science_  
**Term**: _1st Semester, 2025_  

**Team Members**:  
- Gianina Petrașcu (GitHub: [@gianinapetrascu](https://github.com/gianinapetrascu))  
- Ioana Bîrlan (GitHub: [@ioanabirlan](https://github.com/ioanabirlan))  


### 🔹 Route 2: Designing for Economic Resilience

**Core Question:**  
*How can wildfire disruption analytics inform stronger economic safety nets for affected workers, families, and small businesses?*

This route quantifies how wildfires affect employment, income, and tourism, and uses that insight to design better protections for vulnerable communities.

## Dataset Overview

This notebook combines one competition file with two external sources:

- `data/geo_events_geoevent.csv`: Watch Duty geo events from the Kaggle competition archive, filtered to wildfires
- `tl_2023_us_county.shp`: US Census TIGER/Line 2023 county boundaries, used to map each fire to a county
- ACS tables `B19013`, `B01003`, `DP03`, `S1701`: county-level median income, population, unemployment rate, and poverty rate

**Load data**

Kaggle data

In [1]:
!pip install kaggle



In [2]:
from google.colab import files
uploaded = files.upload()  # assign the result so the cell does not echo the file contents (the API key)

Saving kaggle.json to kaggle.json

In [3]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
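
As a hedged alternative to uploading the file by hand, the credentials file can be written from values typed at a prompt, so the key never appears in cell output. The sketch below is an assumption and not part of the original workflow: `write_kaggle_credentials` is a hypothetical helper, and the demo uses placeholder values in a temporary directory rather than `~/.kaggle`.

```python
import json
import tempfile
from pathlib import Path


def write_kaggle_credentials(username, key, directory):
    """Write kaggle.json into `directory` with owner-only permissions."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "kaggle.json"
    target.write_text(json.dumps({"username": username, "key": key}))
    target.chmod(0o600)  # same effect as `chmod 600` in the cell above
    return target


# Demo with placeholder values in a throwaway directory (not real credentials).
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials("example-user", "example-key", demo_dir)
demo_mode = demo_path.stat().st_mode & 0o777
print(demo_path.name, oct(demo_mode))
```

In Colab, the key could be read with `getpass.getpass()` and the directory set to `Path.home() / ".kaggle"`.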

In [4]:
!kaggle competitions download -c wids-university-datathon-2025


Downloading wids-university-datathon-2025.zip to /content
 91% 367M/403M [00:00<00:00, 470MB/s]
100% 403M/403M [00:00<00:00, 469MB/s]


In [5]:
!unzip wids-university-datathon-2025.zip -d data/

Archive:  wids-university-datathon-2025.zip
  inflating: data/WiDS _-_ Watch Duty_ Data Dictionary.docx  
  inflating: data/evac_zone_status_geo_event_map.csv  
  inflating: data/evac_zones_gis_evaczone.csv  
  inflating: data/evac_zones_gis_evaczonechangelog.csv  
  inflating: data/fire_perimeters_gis_fireperimeter.csv  
  inflating: data/fire_perimeters_gis_fireperimeterchangelog.csv  
  inflating: data/geo_events_externalgeoevent.csv  
  inflating: data/geo_events_externalgeoeventchangelog.csv  
  inflating: data/geo_events_geoevent.csv  
  inflating: data/geo_events_geoeventchangelog.csv  


External Census Data

In [6]:
!ls *.zip

wids-university-datathon-2025.zip


In [7]:
import zipfile, glob, os
from pathlib import Path

Path("external_data").mkdir(exist_ok=True)

# Extract ACS datasets
acs_zips = glob.glob("ACS*.zip")
for z in acs_zips:
    with zipfile.ZipFile(z, 'r') as zip_ref:
        zip_ref.extractall("external_data")

!ls external_data
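
The extraction loop above does nothing, silently, when no `ACS*.zip` file is present, and the problem only surfaces later as a missing CSV. A minimal sketch of a stricter version follows; `extract_archives` is a hypothetical helper, and the demo builds its own tiny archive so it can run without the real ACS downloads.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archives(pattern, source_dir, dest_dir):
    """Extract every zip in source_dir matching pattern; fail early if none exist."""
    archives = sorted(Path(source_dir).glob(pattern))
    if not archives:
        raise FileNotFoundError(
            f"No archives matching {pattern!r} in {source_dir}; upload them first."
        )
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    return [p.name for p in sorted(dest.iterdir())]


# Demo: create one small archive with a made-up file name, then extract it.
work = Path(tempfile.mkdtemp())
with zipfile.ZipFile(work / "ACS_demo.zip", "w") as zf:
    zf.writestr("ACS_demo-Data.csv", "GEO_ID,B19013_001E\n")
extracted = extract_archives("ACS*.zip", work, work / "external_data")
print(extracted)
```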

Merge Census data

In [8]:
import pandas as pd, re, glob

# Find all the ACS *-Data.csv files
csvs = [f for f in glob.glob("external_data/*-Data.csv")]
print("Found ACS CSVs:", csvs)

def read_acs(pattern, value_col, rename_to):
    path = next((f for f in csvs if pattern in f), None)
    if not path:
        raise FileNotFoundError(f"No ACS file matching {pattern}")
    df = pd.read_csv(path)
    geo_col = next((c for c in df.columns if re.search("GEO", c, re.I)), None)
    df["county_fips"] = df[geo_col].astype(str).str.extract(r'(\d{5})$')
    df = df.rename(columns={value_col: rename_to})
    return df[["county_fips", rename_to]]

income = read_acs("B19013", "B19013_001E", "median_income")
pop    = read_acs("B01003", "B01003_001E", "population")
unemp  = read_acs("DP03", "DP03_0005PE", "unemployment_rate")
pov    = read_acs("S1701", "S1701_C02_001E", "poverty_rate")

# Merge them
acs = income.merge(pop, on="county_fips", how="outer") \
            .merge(unemp, on="county_fips", how="outer") \
            .merge(pov, on="county_fips", how="outer")

for col in ["median_income", "unemployment_rate", "poverty_rate", "population"]:
    acs[col] = pd.to_numeric(acs[col], errors="coerce")

acs.dropna(subset=["county_fips"], inplace=True)
acs["county_fips"] = acs["county_fips"].astype(str).str.zfill(5)

acs.to_csv("external_data/acs_income_employment.csv", index=False)
print("Clean ACS dataset saved:", acs.shape)
acs.head()


Found ACS CSVs: []


FileNotFoundError: No ACS file matching B19013
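
The key step in `read_acs` is turning the geography identifier into a five-digit county FIPS code. The sketch below isolates that step in plain Python. It assumes county-level identifiers of the form `0500000US` followed by the FIPS code, and it assumes the export contains a second, descriptive header row (shown here as `"Geography"`); both are assumptions about the ACS files, which were not available when the cell above ran.

```python
import re


def county_fips_from_geo_id(geo_id):
    """Return the trailing five digits of a geography id, or None if absent."""
    match = re.search(r"(\d{5})$", str(geo_id))
    return match.group(1) if match else None


# Two county-style ids, one label row, and one missing value.
samples = ["0500000US06037", "0500000US48201", "Geography", None]
parsed = [county_fips_from_geo_id(s) for s in samples]
print(parsed)
```

Rows that yield `None` correspond to the rows that `acs.dropna(subset=["county_fips"])` removes.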

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import json
from shapely.geometry import Point

In [None]:
events = pd.read_csv("data/geo_events_geoevent.csv", low_memory=True)

In [None]:
events.head()

## Approach

In [None]:
# keep wildfires only
events = events[events["geo_event_type"] == "wildfire"].copy()

# basic cleaning: keep the columns used downstream and drop rows without coordinates
events = events[["id", "name", "lat", "lng", "date_created", "date_modified", "is_active", "data"]]
events = events.dropna(subset=["lat", "lng"])  # reassign instead of inplace to avoid SettingWithCopyWarning

# extract containment and acreage from JSON-encoded data
def extract_json_field(js, key):
    """Return parsed[key] from a JSON string, or NaN if parsing fails or the key is absent."""
    if not isinstance(js, str) or "{" not in js:
        return np.nan
    try:
        parsed = json.loads(js.strip().strip('"').strip("'"))
        return parsed.get(key, np.nan)
    except Exception:
        return np.nan

for col in ["is_prescribed","is_fps","containment","acreage"]:
    events[col] = events["data"].apply(lambda x: extract_json_field(x, col))

events.drop(columns=["data"], inplace=True, errors="ignore")
events.head()
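
To make the behaviour of the JSON extraction concrete, here is a standard-library sketch of the same idea applied to three inputs. The sample payload is hypothetical (the real `data` column may carry other keys), and `math.nan` stands in for `np.nan` so the example runs without NumPy.

```python
import json
import math


def extract_json_field(js, key):
    """Return parsed[key], or NaN when the input is not a JSON object or lacks the key."""
    if not isinstance(js, str) or "{" not in js:
        return math.nan
    try:
        parsed = json.loads(js.strip().strip('"').strip("'"))
    except json.JSONDecodeError:
        return math.nan
    if not isinstance(parsed, dict):
        return math.nan
    return parsed.get(key, math.nan)


# Hypothetical payload shaped like the fields read above.
sample = '{"acreage": 120.5, "containment": 40, "is_prescribed": false}'
acreage = extract_json_field(sample, "acreage")        # key present
missing = extract_json_field(sample, "is_fps")         # key absent
broken = extract_json_field("{not json", "acreage")    # malformed JSON
print(acreage, missing, broken)
```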

US county shapefile

In [None]:
!mkdir -p shapefiles
!wget -q https://www2.census.gov/geo/tiger/TIGER2023/COUNTY/tl_2023_us_county.zip -O shapefiles/us_counties.zip
!unzip -q shapefiles/us_counties.zip -d shapefiles/

In [None]:
# load shapefile
counties = gpd.read_file("shapefiles/tl_2023_us_county.shp")[["STATEFP", "COUNTYFP", "NAME", "geometry"]]
counties["county_fips"] = (counties["STATEFP"] + counties["COUNTYFP"]).astype(str)

# turn events into GeoDataFrame
gdf = gpd.GeoDataFrame(
    events,
    geometry=[Point(xy) for xy in zip(events.lng, events.lat)],
    crs="EPSG:4326"
)

# spatial join
gdf_joined = gpd.sjoin(gdf, counties, how="left", predicate="within")
gdf_joined["county_fips"] = gdf_joined["county_fips"].astype(str).str.zfill(5)
print("Mapped fires to counties:", gdf_joined.shape)
gdf_joined[["id", "name", "county_fips", "containment", "acreage"]].head()
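
Conceptually, the spatial join asks for each fire point which county polygon contains it. The sketch below illustrates that question with a plain-Python ray-casting test on two toy squares. It is a stand-in for illustration only: it is not how GeoPandas implements `sjoin`, and the names `toy_counties` and `assign_county` are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Two adjacent unit-free squares standing in for county polygons.
toy_counties = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}


def assign_county(x, y):
    """Return the first toy county containing the point, or None if there is none."""
    for name, polygon in toy_counties.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


assigned = [assign_county(1, 1), assign_county(3, 1), assign_county(5, 1)]
print(assigned)
```

The `None` result for the last point mirrors what a left join does with fires that fall outside every county: the row is kept, but its county fields are empty.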

In [None]:
gdf_joined["county_fips"] = (
    gdf_joined["county_fips"].astype(str).str.extract(r"(\d{5})")[0].str.zfill(5))
acs["county_fips"] = acs["county_fips"].astype(str).str.zfill(5)

merged = gdf_joined.merge(acs, on="county_fips", how="left")

print("Merged successfully:", merged.shape)
merged[["id","name","county_fips","median_income","unemployment_rate","poverty_rate","population"]].head()


Analyze correlations

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# correlation of fire impact with socioeconomic variables
cols = ["containment","acreage","median_income","unemployment_rate","poverty_rate","population"]
corr = merged[cols].apply(pd.to_numeric, errors="coerce").corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", vmin=-1, vmax=1)
plt.title("Correlation between Wildfire Metrics and Socioeconomic Factors")
plt.show()
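
Each cell of the heatmap is a Pearson correlation computed on the rows where both variables are present. The following sketch spells out that calculation in plain Python on made-up numbers; the values are illustrative only and are not taken from the merged dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over pairs where neither value is NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Made-up poverty rates (%) and incomes (thousands); the last pair has a missing value.
poverty = [10.0, 12.0, 15.0, 20.0, math.nan]
income = [80.0, 76.0, 70.0, 60.0, 50.0]
r = pearson(poverty, income)
print(round(r, 3))
```

Because the toy incomes fall exactly on a decreasing line, the coefficient comes out at -1 up to floating-point error; real data will sit somewhere between -1 and 1.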

Economic exposure by county

In [None]:
impact = (
    merged.groupby("county_fips")
    .agg(
        fires=("id","count"),
        total_acres=("acreage","sum"),
        avg_containment=("containment","mean"),
        median_income=("median_income","mean"),
        unemployment=("unemployment_rate","mean"),
        poverty=("poverty_rate","mean"),
        population=("population","mean")).reset_index())

impact["acres_per_1000_people"] = impact["total_acres"] / (impact["population"]/1000)
impact["fire_risk_index"] = (
    impact["acres_per_1000_people"] * (1 + impact["poverty"]/100 - impact["median_income"]/1e5))

impact.head()
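
A short worked example may help when reading the index. The function below restates the two formulas from the cell above for a single county, assuming that `poverty` is a percentage between 0 and 100 (which the division by 100 implies). The input numbers are invented for illustration.

```python
def fire_risk_index(total_acres, population, poverty_pct, median_income):
    """Burned acres per 1000 residents, scaled up by poverty and down by income."""
    acres_per_1000 = total_acres / (population / 1000)
    multiplier = 1 + poverty_pct / 100 - median_income / 1e5
    return acres_per_1000 * multiplier


# 5000 acres over 100,000 residents gives 50 acres per 1000 people in both cases.
typical = fire_risk_index(5000, 100_000, 15.0, 60_000)   # multiplier 1 + 0.15 - 0.60
wealthy = fire_risk_index(5000, 100_000, 5.0, 120_000)   # multiplier 1 + 0.05 - 1.20
print(round(typical, 2), round(wealthy, 2))
```

The second case shows a property of the formula worth keeping in mind: once median income exceeds roughly $100,000, the multiplier can turn negative, so the index is not bounded below by zero.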


In [None]:
plt.figure(figsize=(9,6))
sns.scatterplot(
    data=impact,
    x="median_income", y="acres_per_1000_people",
    size="poverty", hue="unemployment",
    palette="YlOrRd", alpha=0.8
)
plt.title("Wildfire Exposure vs Socioeconomic Factors")
plt.xlabel("Median Household Income ($)")
plt.ylabel("Burned Acres per 1000 People")
plt.legend(title="Unemployment / Poverty")
plt.show()


In [None]:
# import folium
# from folium import Choropleth

# if "GEOID" not in counties.columns:
#     counties["GEOID"] = (counties["STATEFP"] + counties["COUNTYFP"]).astype(str)

# # interactive Choropleth map
# m = folium.Map(location=[37.5, -119], zoom_start=5, tiles="cartodbpositron")

# Choropleth(
#     geo_data=counties,
#     data=impact,
#     columns=["county_fips", "fire_risk_index"],
#     key_on="feature.properties.GEOID",
#     fill_color="YlOrRd",
#     fill_opacity=0.7,
#     line_opacity=0.2,
#     legend_name="Fire Risk Index (weighted by poverty & income)").add_to(m)

# m.save("route2_fire_econ_map.html")
# m


ML models

In [None]:
!pip install lightgbm shap -q

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb
import shap

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error

In [None]:
# relevant features and target variable
features = ["fires", "total_acres", "avg_containment", "unemployment", "poverty", "median_income"]
target = "fire_risk_index"

df = impact.dropna(subset=features + [target]).copy()
X = df[features]
y = df[target]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
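
The scaler is fitted on the training split only and then reused on the test split, so no information from the test set leaks into the preprocessing. A plain-Python sketch of that pattern for a single feature is shown below; `fit_scaler` and `transform` are hypothetical helper names, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

mean, std = fit_scaler(train)                # statistics come from the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)     # the test data reuses them unchanged
print(train_scaled, test_scaled)
```

The scaled training values are centred on zero, while the test value lands outside that range because it was never seen during fitting.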


In [None]:
# Train a baseline LightGBM
lgbm = lgb.LGBMRegressor(
    objective="regression",
    boosting_type="gbdt",
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
lgbm.fit(X_train_scaled, y_train)

# baseline performance
y_pred = lgbm.predict(X_test_scaled)
print("LightGBM Baseline Performance")
print("R²:", round(r2_score(y_test, y_pred), 3))
print("MAE:", round(mean_absolute_error(y_test, y_pred), 2))
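
For readers less familiar with the two metrics, the sketch below computes them by hand on four invented predictions. It follows the standard definitions, R² = 1 - SS_res / SS_tot and MAE as the mean absolute difference; the numbers are not model output.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented targets and predictions.
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]

r2_demo = r2(y_true_demo, y_pred_demo)    # SS_res = 26, SS_tot = 500
mae_demo = mae(y_true_demo, y_pred_demo)  # (2 + 2 + 3 + 3) / 4
print(round(r2_demo, 3), round(mae_demo, 2))
```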

In [None]:
# grid of parameters to tune
param_grid = {
    "num_leaves": [15, 30],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [500, 1000],
    "subsample": [0.7, 0.9]
}

# Grid search
grid = GridSearchCV(
    estimator=lgbm,
    param_grid=param_grid,
    cv=3,
    scoring="r2",
    verbose=3,
    n_jobs=-1
)

grid.fit(X_train_scaled, y_train)

best_model = grid.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)

print("Best Model Parameters:")
print(grid.best_params_)
print("R^2:", round(r2_score(y_test, y_pred_best), 3))
print("MAE:", round(mean_absolute_error(y_test, y_pred_best), 2))
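
It can be useful to estimate the cost of a grid search before launching it. The sketch below counts the candidate settings in the grid above and multiplies by the number of folds; it only enumerates combinations and does not train anything.

```python
from itertools import product

param_grid = {
    "num_leaves": [15, 30],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [500, 1000],
    "subsample": [0.7, 0.9],
}
cv_folds = 3

# Every combination of one value per parameter: 2 * 3 * 2 * 2 candidates.
combos = list(product(*param_grid.values()))
n_candidates = len(combos)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set after the cross-validated fits.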


In [None]:
# feature importance scores
importances = pd.Series(best_model.feature_importances_, index=features).sort_values(ascending=True)

plt.figure(figsize=(7,5))
importances.plot(kind="barh", color="#E26A4A")
plt.title("Feature Importance: LightGBM (Fire Risk Prediction)")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()

In [None]:
# SHAP analysis
shap.initjs()

explainer = shap.Explainer(best_model)
shap_values = explainer(X_test_scaled)

shap.summary_plot(shap_values, X_test, plot_type="bar", color="#c20235")
shap.summary_plot(shap_values, X_test, color="#c20235")

for col in ["poverty", "unemployment", "median_income"]:
    shap.dependence_plot(col, shap_values.values, X_test, display_features=X_test)


## Results

Report your final results and key insights:
- Metrics: Precision, Recall, ROC AUC, RMSE, etc.
- Key findings or visualizations (include in slides)
- Any limitations or ethical considerations

## Team Contributions

| Name             | Contributions                        |
|------------------|--------------------------------------|
| Gianina Petrașcu | Feature engineering, model tuning    |
| Ioana Bîrlan     | EDA, preprocessing, geospatial joins |