# Urban Heat Island – Baseline Regression Model

Goal: Build a baseline model that predicts **land surface temperature (LST)**
for urban neighborhoods and identify which features are most strongly associated with urban heat.

Part of the **Urban Resilience AI** portfolio:
- Task: Regression (predict continuous LST)
- Focus: Urban heat island and resilience


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

import matplotlib.pyplot as plt

try:
    import geopandas as gpd
except ImportError:
    gpd = None
    print("GeoPandas not installed – maps will be skipped.")



## 1. Load Data

Assumed columns:

- `neighborhood_id`
- `latitude`, `longitude` or polygon shapefile key
- `lst` – land surface temperature (°C)
- `ndvi` – Normalized Difference Vegetation Index (greenness; higher values indicate denser vegetation)
- `impervious_surface`
- `population_density`
- `nightlight_intensity`
- optional: `median_income`

Replace `data_path` in the next cell with the path to your own dataset.
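
Before modeling, it can help to confirm that a loaded table actually has the columns listed above. The following is a minimal sketch, not part of the original notebook: the helper name `missing_columns` and the constant `REQUIRED_COLUMNS` are hypothetical, and the coordinate columns and `median_income` are left out of the check because the list treats them as alternatives or optional.

```python
import pandas as pd

# Column names taken from the "Assumed columns" list; coordinates and
# `median_income` are omitted because they are alternative or optional.
REQUIRED_COLUMNS = [
    "neighborhood_id",
    "lst",
    "ndvi",
    "impervious_surface",
    "population_density",
    "nightlight_intensity",
]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Tiny made-up frame, only to show the call; the values are not real data.
toy = pd.DataFrame({"neighborhood_id": [1, 2], "lst": [31.5, 29.0], "ndvi": [0.2, 0.6]})
print(missing_columns(toy))
```

In this notebook the check would be `missing_columns(df)` right after `pd.read_csv`; an empty list means every required column is present.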


In [2]:
data_path = "../data/urban_heat_sample.csv"
df = pd.read_csv(data_path)
df.head()



In [3]:
from IPython.display import display  # explicit, so the cell does not rely on Jupyter's built-in

df.info()
display(df.describe())

df["lst"].hist(bins=30)
plt.title("Distribution of Land Surface Temperature")
plt.xlabel("LST (°C)")
plt.ylabel("Count")
plt.show()



In [4]:
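target_col = "lst"
ignore_cols = ["neighborhood_id"]

# Keep numeric predictors only: a polygon key or other text column would
# make RandomForestRegressor.fit fail on non-numeric input.
feature_cols = [
    c for c in df.select_dtypes(include="number").columns
    if c not in ignore_cols + [target_col]
]

X = df[feature_cols]
y = df[target_col]

# Hold out 20% of neighborhoods for evaluation; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape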
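
A single 80/20 split gives one error estimate that depends on which rows land in the test set. As a complement, here is a hedged sketch of k-fold cross-validation. The helper name `cross_validated_mae` and the synthetic demo data are assumptions made for illustration. Note also that randomly shuffled folds can look optimistic when neighboring areas have similar temperatures.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def cross_validated_mae(model, X, y, n_splits=5, seed=42):
    """Return one MAE value per fold (lower is better)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scikit-learn reports errors as negative scores so that higher is better.
    neg_mae = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_absolute_error")
    return -neg_mae


# Synthetic stand-in data (not real observations), only to make the sketch runnable.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "ndvi": rng.uniform(0, 1, 120),
    "impervious_surface": rng.uniform(0, 1, 120),
})
demo_y = 30 - 5 * demo_X["ndvi"] + 4 * demo_X["impervious_surface"] + rng.normal(0, 0.5, 120)

fold_mae = cross_validated_mae(
    RandomForestRegressor(n_estimators=50, random_state=42), demo_X, demo_y
)
print(fold_mae.mean(), fold_mae.std())
```

With the notebook's own data the call would be `cross_validated_mae(RandomForestRegressor(n_estimators=200, random_state=42), X, y)`.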



In [5]:
rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.2f} °C")  # same unit as `lst`
print(f"R²: {r2:.3f}")
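
An MAE value is easier to judge next to a trivial reference. One common choice, sketched here as an assumption rather than as part of the original notebook, is a model that always predicts the training-set mean. The helper name `mean_baseline_mae` is hypothetical, and the numbers in the demo call are made up.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error


def mean_baseline_mae(y_train, y_test) -> float:
    """MAE of a model that always predicts the training-set mean."""
    dummy = DummyRegressor(strategy="mean")
    # DummyRegressor ignores feature values, so a placeholder column is enough.
    dummy.fit(np.zeros((len(y_train), 1)), y_train)
    preds = dummy.predict(np.zeros((len(y_test), 1)))
    return mean_absolute_error(y_test, preds)


# Made-up temperatures, only to show the call.
print(mean_baseline_mae([30.0, 32.0, 34.0], [31.0, 35.0]))
```

In this notebook, `mean_baseline_mae(y_train, y_test)` can be compared with `mae`: the random forest is only useful to the extent that it beats this number.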



In [6]:
plt.figure(figsize=(5, 5))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("True LST (°C)")
plt.ylabel("Predicted LST (°C)")
plt.title("True vs Predicted Land Surface Temperature")
# Reference line y = x: points on it are predicted exactly.
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, linestyle="--", color="k", label="Perfect prediction")
plt.legend()
plt.tight_layout()
plt.show()



In [7]:
importances = pd.Series(rf.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=False)

plt.figure(figsize=(8, 4))
importances.head(15).plot(kind="bar")
plt.title("Top Drivers of Urban Heat (Feature Importance)")
plt.ylabel("Importance (mean decrease in impurity)")
plt.tight_layout()
plt.show()

importances.head(15)
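
The `feature_importances_` attribute used above is impurity-based and is computed from the training data, which can favor features with many distinct values. A possible cross-check is permutation importance on held-out data. This is a sketch under assumptions: the helper name `permutation_ranking` and the synthetic demo data are illustrative only, and the demo scores the model on its own training data purely to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance


def permutation_ranking(model, X_eval, y_eval, n_repeats=10, seed=42):
    """Rank features by the mean score drop when each column is shuffled."""
    result = permutation_importance(
        model, X_eval, y_eval, n_repeats=n_repeats, random_state=seed
    )
    ranking = pd.Series(result.importances_mean, index=X_eval.columns)
    return ranking.sort_values(ascending=False)


# Synthetic demo (not real data): the target depends on `signal` only.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
demo_y = 3.0 * demo_X["signal"] + rng.normal(scale=0.1, size=200)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(demo_X, demo_y)
ranking = permutation_ranking(demo_model, demo_X, demo_y)
print(ranking)
```

In this notebook the call would be `permutation_ranking(rf, X_test, y_test)`. Either ranking describes association with LST, not causation.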



In [8]:
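# Attach predictions to the held-out rows; y_pred is in the same order as y_test.
results = df.loc[y_test.index].copy()
results["lst_pred"] = y_pred
# Positive residual: the neighborhood is hotter than the model predicts.
results["residual"] = results["lst"] - results["lst_pred"]

# The 20 hottest neighborhoods in the test set, with their predictions.
high_heat = results.sort_values("lst", ascending=False).head(20)
high_heat[["neighborhood_id", "lst", "lst_pred", "residual"]]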

