# Spatial Prediction Notebook
This notebook loads the cleaned Manila dataset produced by `01_data_preprocessing.ipynb` and demonstrates the simplified `SpatialPredictor` API in `src/core/analysis/predictor.py`. It trains the model configured in `config/settings.yaml`, evaluates its performance, and generates next-year hotspot forecasts for Manila barangays and districts.

In [1]:
from pathlib import Path
import sys
import pandas as pd

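# Add project root to Python path. `__file__` is normally undefined inside
# Jupyter, so fall back to the working directory (or its parent when the
# kernel starts in the `notebook` folder).
if '__file__' in globals():
    project_root = Path(__file__).resolve().parent.parent
else:
    project_root = Path.cwd().parent if Path.cwd().name == 'notebook' else Path.cwd()

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))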

from src.core.analysis.predictor import SpatialPredictor

In [2]:
if '__file__' in globals():
    NOTEBOOK_DIR = Path(__file__).resolve().parent
else:
    NOTEBOOK_DIR = Path.cwd()

DATA_PATH = NOTEBOOK_DIR / 'Missing People - cleaned.csv'
OUTPUT_DIR = NOTEBOOK_DIR / 'outputs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
if not DATA_PATH.exists():
    fallback_path = Path.cwd() / 'notebook' / 'Missing People - cleaned.csv'
    if fallback_path.exists():
        DATA_PATH = fallback_path
    else:
        raise FileNotFoundError(
            "Run 01_data_preprocessing.ipynb to regenerate 'Missing People - cleaned.csv' before continuing."
        )

full_df = pd.read_csv(DATA_PATH)
full_df.head()

Unnamed: 0,Person_ID,Person ID,AGE,GENDER,Date Reported Missing,Time Reported Missing,Date Last Seen,Location Last Seen,Post URL,Time_Obj,Age_Group,City_Cleaned,Barangay_Cleaned,District_Cleaned,Latitude,Longitude,Location_Match_Level,Location_Match_Score,Year,Hour_Missing
0,MP-0001,,59,Male,2020-01-14,12:48 PM,2019-12-14,"Malate, Manila",https://www.facebook.com/share/p/1Fp5H7uddW/,2025-12-11 12:48:00,Adult,Manila City,,Malate,14.5714,120.9904,district,0.95,2020.0,12.0
1,MP-0002,,41,Male,2020-01-24,5:12 PM,2021-01-16,"Sampaloc, Manila",https://www.facebook.com/share/p/1CwZW3pbpf/,2025-12-11 17:12:00,Adult,Manila City,,Sampaloc,14.6133,121.0003,district,0.95,2020.0,17.0
2,MP-0003,,43,Male,2020-02-09,7:03 PM,,"Tondo, Manila",https://www.facebook.com/share/p/1CoiXoTEjb/,2025-12-11 19:03:00,Adult,Manila City,,Tondo,14.6186,120.9681,district,0.95,2020.0,19.0
3,MP-0004,,14,Male,2020-02-15,12:19 PM,,"Binondo, Manila",https://www.facebook.com/share/p/17Umn23xj9/,2025-12-11 12:19:00,Young Teen,Manila City,,Binondo,14.6006,120.9754,district,0.95,2020.0,12.0
4,MP-0005,,16,Male,2020-03-23,12:25,2025-03-11,"Paco,. Manila",https://www.facebook.com/share/p/1BhMzYvEJN/,2025-12-11 12:25:00,Teen,Manila City,,Paco,14.5833,120.9961,district,0.95,2020.0,12.0


In [3]:
model_df = full_df.copy()

# Ensure identifier aligns with predictor expectations.
# Guard both columns so a missing 'Person_ID' never reaches fillna() as None.
if 'Person ID' not in model_df.columns:
    model_df['Person ID'] = pd.NA
if 'Person_ID' in model_df.columns:
    model_df['Person ID'] = model_df['Person ID'].fillna(model_df['Person_ID'])
# Fall back to a synthetic ID for any rows that still lack one
model_df['Person ID'] = model_df['Person ID'].fillna(
    model_df.index.to_series().apply(lambda idx: f"UNK-{idx:04d}")
)

# Build barangay/district key used by the predictor:
# prefer the barangay, fall back to the district, then to 'Unknown'
barangay_series = model_df['Barangay_Cleaned'] if 'Barangay_Cleaned' in model_df.columns else pd.Series(pd.NA, index=model_df.index)
district_series = model_df['District_Cleaned'] if 'District_Cleaned' in model_df.columns else pd.Series('Unknown', index=model_df.index)
model_df['Barangay District'] = (
    barangay_series
    .replace('', pd.NA)
    .fillna(district_series)
    .fillna('Unknown')
    .astype(str)
    .str.strip()
)

# Normalize numeric columns expected by the model
age_source = model_df['AGE'] if 'AGE' in model_df.columns else model_df.get('Age')
model_df['Age'] = pd.to_numeric(age_source, errors='coerce')
model_df['Latitude'] = pd.to_numeric(model_df['Latitude'], errors='coerce')
model_df['Longitude'] = pd.to_numeric(model_df['Longitude'], errors='coerce')
model_df['Date Reported Missing'] = pd.to_datetime(
    model_df['Date Reported Missing'], errors='coerce'
)
year_series = model_df['Year'] if 'Year' in model_df.columns else pd.Series(pd.NA, index=model_df.index)
model_df['Year'] = year_series.fillna(model_df['Date Reported Missing'].dt.year)
model_df['Year'] = model_df['Year'].astype('Int64')

# Drop rows missing coordinates or a year. 'Barangay District' is already a
# non-null string at this point, so it is listed only as a safeguard.
model_df = model_df.dropna(
    subset=['Barangay District', 'Latitude', 'Longitude', 'Year']
)

# Fill any remaining age gaps with the overall median age of the retained rows
age_median = model_df['Age'].median()
model_df['Age'] = model_df['Age'].fillna(age_median)

model_df[['Person ID', 'Barangay District', 'Age', 'Year', 'Latitude', 'Longitude']].head()

Unnamed: 0,Person ID,Barangay District,Age,Year,Latitude,Longitude
0,MP-0001,Malate,59.0,2020,14.5714,120.9904
1,MP-0002,Sampaloc,41.0,2020,14.6133,121.0003
2,MP-0003,Tondo,43.0,2020,14.6186,120.9681
3,MP-0004,Binondo,14.0,2020,14.6006,120.9754
4,MP-0005,Paco,16.0,2020,14.5833,120.9961


## Train Configured Model
Instantiate `SpatialPredictor`, then call `train_configured_model`. Passing `model_name='poisson'` explicitly mirrors the configured model, so the notebook keeps training the same model even if `config/settings.yaml` is edited later. Internally, the helper aggregates yearly counts per barangay and runs a small hyperparameter search.
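
The aggregation step mentioned above can be approximated outside the predictor. The following is a minimal sketch, assuming the predictor groups rows by `Barangay District` and `Year`; the real logic in `src/core/analysis/predictor.py` may differ. The helper name `aggregate_yearly_counts` and the `Cases` column are hypothetical, while `Prev_Year_Count` is borrowed from the prediction tables later in this notebook.

```python
import pandas as pd


def aggregate_yearly_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per (Barangay District, Year)."""
    counts = (
        df.groupby(["Barangay District", "Year"])
        .agg(
            Cases=("Person ID", "count"),     # number of reports
            Latitude=("Latitude", "mean"),    # representative coordinates
            Longitude=("Longitude", "mean"),
            Age=("Age", "mean"),              # mean age of reported persons
        )
        .reset_index()
        .sort_values(["Barangay District", "Year"])
    )
    # Lag feature: count from the previous *observed* year in the same area.
    # Years without any report have no row, so gaps are not zero-filled here.
    counts["Prev_Year_Count"] = (
        counts.groupby("Barangay District")["Cases"].shift(1).fillna(0)
    )
    return counts


# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Person ID": ["A", "B", "C", "D"],
        "Barangay District": ["Tondo", "Tondo", "Tondo", "Paco"],
        "Year": [2020, 2020, 2021, 2021],
        "Latitude": [14.6186, 14.6186, 14.6186, 14.5833],
        "Longitude": [120.9681, 120.9681, 120.9681, 120.9961],
        "Age": [30, 40, 20, 16],
    }
)
yearly = aggregate_yearly_counts(toy)
print(yearly)
```

With so few rows per area and year, a table like this is small, which is worth keeping in mind when reading the evaluation metrics.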

In [4]:
predictor = SpatialPredictor()
metrics = predictor.train_configured_model(model_df, model_name='poisson')
metrics

🤖 Training Poisson...
✓ Trained! Test R²: -2.0700, RMSE: 1.3382


{'model': 'poisson',
 'test_r2': -2.0700115899538836,
 'test_rmse': 1.3382227371180648,
 'train_r2': 0.48709139501021115,
 'overfit_gap': 2.5571029849640947}
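
These metrics deserve a careful reading. Under the conventional definition, a negative R² means the held-out predictions have a larger squared error than a constant prediction equal to the mean of the test targets. The sketch below implements that definition by hand and checks that the reported `overfit_gap` matches `train_r2 - test_r2` numerically. That the predictor computes the gap this way is an assumption inferred from the numbers, and `r2_score_manual` is a hypothetical helper.

```python
import numpy as np


def r2_score_manual(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot; negative when the model loses to the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # model error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean-baseline error
    return float(1.0 - ss_res / ss_tot)


# Toy targets: always predicting 3 is worse than always predicting the mean (2).
poor_r2 = r2_score_manual([1, 2, 3], [3, 3, 3])
baseline_r2 = r2_score_manual([1, 2, 3], [2, 2, 2])
print(poor_r2, baseline_r2)

# Values copied from the metrics dictionary above.
reported = {
    "test_r2": -2.0700115899538836,
    "train_r2": 0.48709139501021115,
    "overfit_gap": 2.5571029849640947,
}
gap = reported["train_r2"] - reported["test_r2"]
print(np.isclose(gap, reported["overfit_gap"]))
```

A gap of roughly 2.56 between training and test R² suggests the model does not generalize well on this split, so the forecasts that follow are best treated as a rough ranking, not as calibrated counts.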

## Forecast Next-Year Hotspots
Forecast hotspots for the year following the latest observation in `model_df`. Adjust `top_n` to inspect more or fewer candidate barangays.

In [5]:
latest_year = int(model_df['Year'].max())
next_year = latest_year + 1
top_predictions = predictor.predict_next_year_hotspots(model_df, next_year=next_year, top_n=15)
top_predictions

✓ Predicted top 15 hotspots for 2026


Unnamed: 0,Barangay District,Latitude,Longitude,Predicted_Cases,Prev_Year_Count
15,Tondo,14.6186,120.9681,11.925545,13.0
13,Santa Cruz,14.615,120.983,4.608773,4.0
10,Sampaloc,14.6133,121.0003,3.82901,4.0
2,Binondo,14.6006,120.9754,3.647915,3.0
8,Port Area,14.588,120.963,2.706035,1.0
0,Barangay 287,14.6006,120.9754,2.636757,0.0
9,Quiapo,14.5998,120.9844,2.47563,1.0
4,Intramuros,14.5904,120.977,2.46548,1.0
1,Barangay 641,14.5966,121.0008,2.195004,0.0
6,Paco,14.5833,120.9961,2.182596,2.0
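
One way to read this table is to compare each forecast with the previous year's count. The sketch below does that for four rows copied from the output above; the `Expected_Change` column name is a hypothetical choice, not something the predictor returns.

```python
import pandas as pd

# Four rows copied from the forecast table above.
sample = pd.DataFrame(
    {
        "Barangay District": ["Tondo", "Santa Cruz", "Port Area", "Barangay 287"],
        "Predicted_Cases": [11.925545, 4.608773, 2.706035, 2.636757],
        "Prev_Year_Count": [13.0, 4.0, 1.0, 0.0],
    }
)

# Positive values mean the model expects more cases than last year.
sample["Expected_Change"] = sample["Predicted_Cases"] - sample["Prev_Year_Count"]
rising = sample.sort_values("Expected_Change", ascending=False)
print(rising[["Barangay District", "Expected_Change"]])
```

Sorting by change instead of by absolute count surfaces areas with no reports in the previous year that the model nevertheless ranks among the hotspots.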


## Export Predictions
Persist the top hotspot forecasts so downstream tools can read `outputs/<year>_predictions.csv`.

In [6]:
prediction_csv_path = OUTPUT_DIR / f"{next_year}_predictions.csv"
top_predictions.to_csv(prediction_csv_path, index=False)
print(f"Saved top {len(top_predictions)} predictions to {prediction_csv_path.relative_to(NOTEBOOK_DIR)}")
prediction_csv_path

Saved top 15 predictions to outputs/2026_predictions.csv


PosixPath('/Users/benny/missing-person-heatmap/notebook/outputs/2026_predictions.csv')
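
A downstream consumer can load the exported file with `pandas.read_csv`. The round trip below is a small sketch that writes to a temporary directory instead of `outputs/`, using two rows copied from the forecast table. It also shows that `index=False` drops the original row labels.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the forecast table; the index labels (15, 13) are kept
# on purpose to show that index=False removes them from the CSV.
forecast = pd.DataFrame(
    {
        "Barangay District": ["Tondo", "Santa Cruz"],
        "Latitude": [14.6186, 14.615],
        "Longitude": [120.9681, 120.983],
        "Predicted_Cases": [11.925545, 4.608773],
        "Prev_Year_Count": [13.0, 4.0],
    },
    index=[15, 13],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary stand-in for OUTPUT_DIR so the sketch leaves no files behind.
    csv_path = Path(tmp_dir) / "2026_predictions.csv"
    forecast.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.columns.tolist())
print(reloaded.index.tolist())
```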

## Full Prediction Table
Retrieve the complete set of model predictions across all barangays and years. The table is useful as input for downstream visualizations, including heatmap generation via `generate_prediction_heatmap`.

In [7]:
all_predictions = predictor.get_all_predictions()
all_predictions.sort_values('Predicted_Cases', ascending=False).head(20)

Unnamed: 0,Barangay District,Latitude,Longitude,Age,Prev_Year_Count,Year,Predicted_Cases
15,Tondo,14.6186,120.9681,31.677419,13.0,2026,11.925545
13,Santa Cruz,14.615,120.983,42.0,4.0,2026,4.608773
10,Sampaloc,14.6133,121.0003,41.833333,4.0,2026,3.82901
2,Binondo,14.6006,120.9754,39.3,3.0,2026,3.647915
8,Port Area,14.588,120.963,29.5,1.0,2026,2.706035
0,Barangay 287,14.6006,120.9754,31.0,0.0,2026,2.636757
9,Quiapo,14.5998,120.9844,21.0,1.0,2026,2.47563
4,Intramuros,14.5904,120.977,30.25,1.0,2026,2.46548
1,Barangay 641,14.5966,121.0008,49.0,0.0,2026,2.195004
6,Paco,14.5833,120.9961,40.5,2.0,2026,2.182596
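
Heatmap layers commonly take `[latitude, longitude, weight]` triples. The sketch below converts a prediction table into that shape, scaling weights so the largest forecast is 1.0. The signature of `generate_prediction_heatmap` is not shown in this notebook, so `to_heat_points` is a hypothetical helper and the triple format is an assumption about what a plotting layer would accept.

```python
import pandas as pd


def to_heat_points(predictions: pd.DataFrame) -> list:
    """Hypothetical helper: [lat, lon, weight] rows with weights in [0, 1]."""
    peak = float(predictions["Predicted_Cases"].max())
    scale = peak if peak > 0 else 1.0  # avoid dividing by zero
    return [
        [float(lat), float(lon), float(cases) / scale]
        for lat, lon, cases in zip(
            predictions["Latitude"],
            predictions["Longitude"],
            predictions["Predicted_Cases"],
        )
    ]


# Two rows copied from the prediction table above.
rows = pd.DataFrame(
    {
        "Latitude": [14.6186, 14.615],
        "Longitude": [120.9681, 120.983],
        "Predicted_Cases": [11.925545, 4.608773],
    }
)
points = to_heat_points(rows)
print(points)
```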


## Feature Importance
Inspect the relative contribution of each feature used by the Poisson model as a quick diagnostic for the learned weights.

In [8]:
feature_importance = predictor.get_feature_importance()
feature_importance

Unnamed: 0,Feature,Importance
0,Latitude,0.228498
3,Prev_Year_Count,0.157544
1,Longitude,0.146985
4,Age,0.090578
2,Year,0.066788
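
The raw scores are easier to compare as shares of their total. The sketch below normalizes the values copied from the table above, assuming the scores are non-negative and on a common scale. How the predictor derives them from the Poisson model is not shown here, and the `Share` column is a hypothetical addition.

```python
import pandas as pd

# Values copied from the feature importance table above.
importance = pd.DataFrame(
    {
        "Feature": ["Latitude", "Prev_Year_Count", "Longitude", "Age", "Year"],
        "Importance": [0.228498, 0.157544, 0.146985, 0.090578, 0.066788],
    }
)

# Express each score as a fraction of the total so the shares sum to 1.
total = importance["Importance"].sum()
importance["Share"] = importance["Importance"] / total
importance = importance.sort_values("Share", ascending=False)
print(importance.round(3))
```

Given the weak test metrics reported earlier, these shares describe what the fitted model relies on, not necessarily what drives case counts.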
