# ML and Spatial Analysis

This notebook presents the results of machine learning anomaly detection, spatial autocorrelation
analysis, county-level clustering, ARIMA time-series forecasting, and covariate-adjusted
overperformance ranking for the Geospatial Crime Analysis project.

**Sections:**
1. ML Anomaly Detection -- Isolation Forest, Local Outlier Factor, and ensemble scoring
2. Spatial Analysis -- Global and local Moran's I for spatial autocorrelation
3. County Clustering -- DBSCAN and hierarchical clustering of county-level behavioral profiles
4. ARIMA Forecasting -- Time-series forecasts with backtesting against linear baselines
5. Overperformance Analysis -- Random Forest residual analysis for covariate-adjusted outlier detection
6. Summary of Key Findings

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 30)
pd.set_option("display.max_rows", 60)
pd.set_option("display.float_format", "{:.4f}".format)

DATA_DIR = "../data/analysis/"
print("Setup complete.")

---
## 1. ML Anomaly Detection

Two unsupervised algorithms were applied to county-decade observations:
- **Isolation Forest (IF):** scores how easily a point can be isolated via random splits (lower = more anomalous).
- **Local Outlier Factor (LOF):** measures local density deviation relative to neighbors (higher = more anomalous).

An ensemble score combines both, and a categorical `ml_classification` label is assigned:
Normal, Suspicious, Anomalous, or Extreme Anomaly.
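
The notebook loads precomputed scores and does not show how the ensemble is built. As a rough illustration only, the sketch below combines the two scores by averaging rank percentiles and then bins the result into the four labels. The function names, the rank-averaging rule, and the cutoffs `(0.90, 0.95, 0.99)` are all assumptions made for this example, not the project's actual method.

```python
import numpy as np

def rank_percentile(values):
    """Map values to (0, 1] by rank; ties are broken by order of appearance."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values)) + 1
    return ranks / len(values)

def ensemble_scores(if_score, lof_score):
    """Average the two rank percentiles so that higher always means more anomalous."""
    # IF: lower = more anomalous, so negate before ranking.
    if_pct = rank_percentile(-np.asarray(if_score, dtype=float))
    # LOF: higher = more anomalous, so rank as-is.
    lof_pct = rank_percentile(lof_score)
    return (if_pct + lof_pct) / 2

def classify(score, cutoffs=(0.90, 0.95, 0.99)):
    """Bin ensemble scores into four tiers (cutoffs are hypothetical)."""
    labels = np.array(["Normal", "Suspicious", "Anomalous", "Extreme Anomaly"])
    return labels[np.searchsorted(cutoffs, score, side="right")]
```

Ranking first puts both scores on a common scale, so neither algorithm dominates the average simply because its raw scores have a wider range.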

In [None]:
# Load ML anomaly scores
ml_df = pd.read_csv(DATA_DIR + "ml_anomaly_scores.csv")
print(f"ML anomaly scores: {ml_df.shape[0]} county-decade observations, {ml_df.shape[1]} columns")
print(f"Decades covered: {sorted(ml_df['decade'].unique())}")
print(f"Unique counties: {ml_df[['State', 'County']].drop_duplicates().shape[0]}")
ml_df.head(3)

In [None]:
# Classification distribution
class_order = ["Normal", "Suspicious", "Anomalous", "Extreme Anomaly"]
# fill_value=0 keeps classes with no observations as 0 instead of NaN
class_counts = ml_df["ml_classification"].value_counts().reindex(class_order, fill_value=0)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of classification counts
colors_bar = ["#2ecc71", "#f39c12", "#e74c3c", "#8e44ad"]
axes[0].bar(class_counts.index, class_counts.values, color=colors_bar, edgecolor="black", linewidth=0.5)
axes[0].set_title("ML Classification Distribution", fontsize=13, fontweight="bold")
axes[0].set_ylabel("Count")
for i, val in enumerate(class_counts.values):
    axes[0].text(i, val + 40, str(val), ha="center", fontsize=10, fontweight="bold")

# Breakdown by decade
decade_class = ml_df.groupby(["decade", "ml_classification"]).size().unstack(fill_value=0)
decade_class = decade_class.reindex(columns=class_order, fill_value=0)
decade_class.plot(kind="bar", stacked=True, ax=axes[1], color=colors_bar, edgecolor="black", linewidth=0.3)
axes[1].set_title("Classification by Decade", fontsize=13, fontweight="bold")
axes[1].set_ylabel("Count")
axes[1].set_xlabel("Decade")
axes[1].legend(title="Classification", fontsize=8, title_fontsize=9)
axes[1].tick_params(axis="x", rotation=0)

plt.tight_layout()
plt.show()

print("\nClassification counts:")
print(class_counts.to_frame("count").to_string())

In [None]:
# Scatter plot: IF score vs LOF score, colored by ML classification
color_map = {
    "Normal": "#2ecc71",
    "Suspicious": "#f39c12",
    "Anomalous": "#e74c3c",
    "Extreme Anomaly": "#8e44ad"
}

fig = px.scatter(
    ml_df,
    x="if_score",
    y="lof_score",
    color="ml_classification",
    color_discrete_map=color_map,
    category_orders={"ml_classification": class_order},
    hover_data=["state_name", "County", "decade", "missing_per_100k", "ensemble_score"],
    opacity=0.6,
    title="Isolation Forest Score vs Local Outlier Factor Score",
    labels={
        "if_score": "Isolation Forest Score",
        "lof_score": "Local Outlier Factor Score",
        "ml_classification": "Classification"
    },
    width=900,
    height=550
)
fig.update_traces(marker=dict(size=5, line=dict(width=0.3, color="black")))
fig.update_layout(legend=dict(title_text="ML Classification"))
fig.show()

In [None]:
# Top 20 anomalies by ensemble score
top20 = (
    ml_df
    .sort_values("ensemble_score", ascending=False)
    .head(20)
    [["state_name", "County", "decade", "missing_per_100k", "bodies_per_100k",
      "if_score", "lof_score", "ensemble_score", "ml_classification", "statistical_alert"]]
    .reset_index(drop=True)
)
top20.index = top20.index + 1
top20.index.name = "Rank"
print("Top 20 County-Decade Observations by Ensemble Anomaly Score:\n")
top20

In [None]:
# Concordance analysis: ML anomaly detection vs statistical alerts
# is_concordant = 1 when ML flags anomalous AND statistical alert is RED/YELLOW

alert_levels = ml_df["statistical_alert"].value_counts()
print("Statistical Alert Distribution:")
print(alert_levels.to_string())
print()

# Cross-tabulation: ML classification vs statistical alert
cross_tab = pd.crosstab(
    ml_df["ml_classification"],
    ml_df["statistical_alert"],
    margins=True
)
cross_tab = cross_tab.reindex(index=class_order + ["All"])
print("Cross-tabulation -- ML Classification vs Statistical Alert:\n")
print(cross_tab.to_string())
print()

# Concordance rate among non-Normal ML classifications
non_normal = ml_df[ml_df["ml_classification"] != "Normal"]
concordance_rate = non_normal["is_concordant"].mean() * 100
print(f"Concordance rate (ML flags non-Normal AND statistical alert agrees): {concordance_rate:.1f}%")
print(f"Total concordant observations: {ml_df['is_concordant'].sum()}")

---
## 2. Spatial Analysis

Spatial autocorrelation measures whether nearby counties have similar rates of missing persons
and unidentified bodies.

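- **Global Moran's I** tests for overall spatial clustering across all counties in a given decade.
  Values near +1 indicate strong positive spatial autocorrelation (similar neighbors cluster together),
  values near 0 indicate no spatial pattern, and negative values indicate dispersion (dissimilar neighbors).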
- **Local Moran's I (LISA)** identifies specific county-level clusters:
  - **HH (High-High):** a high-rate county surrounded by high-rate neighbors (hot spot).
  - **LL (Low-Low):** a low-rate county surrounded by low-rate neighbors (cold spot).
  - **HL / LH:** spatial outliers (high among low, or low among high).
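
To make the two statistics concrete, here is a minimal numpy sketch of global Moran's I and the LISA quadrant labels. It assumes a binary contiguity matrix in which every unit has at least one neighbor, and it omits the permutation-based significance test that produces p-values such as `mp_lisa_p`. The helper names and the toy "chain" geography are hypothetical; the results loaded in this notebook were computed elsewhere.

```python
import numpy as np

def global_morans_i(x, w):
    """Global Moran's I = (n / S0) * (z' W z) / (z' z), where z = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

def lisa_quadrants(x, w):
    """Label each unit HH, LL, HL, or LH (no significance filtering)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    # Row-standardized spatial lag: mean deviation of each unit's neighbors.
    # Assumes no isolated units (every row of w has a nonzero sum).
    lag = (w @ z) / w.sum(axis=1)
    own = np.where(z > 0, "H", "L")
    nbr = np.where(lag > 0, "H", "L")
    return np.char.add(own, nbr)

def chain_weights(n):
    """Toy geography: n units on a line, each adjacent to its immediate neighbors."""
    w = np.zeros((n, n))
    idx = np.arange(n - 1)
    w[idx, idx + 1] = w[idx + 1, idx] = 1.0
    return w
```

On a six-unit chain with values `[10, 9, 8, 1, 2, 3]`, high values sit next to high values and low next to low, so the global statistic is positive; a strictly alternating pattern such as `[10, 1, 10, 1]` drives it negative.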

In [None]:
# Global Moran's I results
global_mi = pd.read_csv(DATA_DIR + "global_morans_i.csv")
print("Global Moran's I -- Spatial Autocorrelation Summary\n")
print("A significant positive Moran's I indicates that county rates are spatially clustered")
print("(nearby counties tend to have similar values), not randomly distributed.\n")
global_mi

In [None]:
# Local Moran's I (LISA) -- spatial autocorrelation at county level
lisa_df = pd.read_csv(DATA_DIR + "spatial_autocorrelation.csv")
print(f"LISA results: {lisa_df.shape[0]} county-decade observations\n")

# Count of each cluster type per decade for missing persons
mp_cluster_counts = (
    lisa_df
    .groupby(["decade", "mp_lisa_cluster"])
    .size()
    .unstack(fill_value=0)
)
print("Missing Persons LISA Cluster Counts by Decade:\n")
print(mp_cluster_counts.to_string())
print()

# Count of each cluster type per decade for bodies
bodies_cluster_counts = (
    lisa_df
    .groupby(["decade", "bodies_lisa_cluster"])
    .size()
    .unstack(fill_value=0)
)
print("Unidentified Bodies LISA Cluster Counts by Decade:\n")
print(bodies_cluster_counts.to_string())

In [None]:
# Significant HH (High-High) clusters for missing persons -- hot spots
hh_mp = lisa_df[(lisa_df["mp_lisa_cluster"] == "HH") & (lisa_df["mp_lisa_p"] <= 0.05)].copy()
print(f"Significant HH hot spots for missing persons: {hh_mp.shape[0]} county-decade entries\n")

# Per-decade counts
hh_per_decade = hh_mp.groupby("decade").size()
print("HH Hot Spot Counts by Decade:")
print(hh_per_decade.to_string())
print()

# Top HH counties by LISA I statistic
top_hh = (
    hh_mp
    .sort_values("mp_lisa_i", ascending=False)
    .head(15)
    [["State", "County", "decade", "mp_lisa_i", "mp_lisa_p"]]
    .reset_index(drop=True)
)
top_hh.index = top_hh.index + 1
top_hh.index.name = "Rank"
print("Top 15 Missing Persons HH Hot Spots (by local Moran's I):\n")
top_hh

---
## 3. County Clustering

Counties were clustered based on their behavioral profiles (mean missing persons rate, mean bodies
rate, temporal trends, robust z-scores, and population). Two algorithms were used:
- **DBSCAN:** density-based clustering that identifies noise points (cluster = -1).
- **Hierarchical (Agglomerative):** partitions counties into a fixed number of clusters.

Cluster profiles reveal distinct groups such as high-rate rural counties versus low-rate
metropolitan counties.
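
The `-1` noise label is easier to interpret with the algorithm in view. The following is a compact, illustrative DBSCAN in numpy, not the code that produced `county_clusters.csv`. The function name and the `eps` / `min_samples` arguments follow common convention but are assumptions here, and features would normally be standardized first.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, with -1 marking noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances (O(n^2) memory, fine for a small example).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Each neighborhood includes the point itself (distance 0).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster_id
                    # Only core points keep expanding the cluster;
                    # border points are labeled but not expanded.
                    if is_core[k]:
                        frontier.append(k)
        cluster_id += 1
    return labels
```

A point ends up as `-1` when it is neither a core point nor within `eps` of one, which is why isolated counties with unusual profiles land in the noise group rather than being forced into a cluster, as a fixed-count hierarchical partition would do.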

In [None]:
# Load county clusters
clusters_df = pd.read_csv(DATA_DIR + "county_clusters.csv")
print(f"County clusters: {clusters_df.shape[0]} counties, {clusters_df.shape[1]} columns\n")

# Hierarchical cluster profiles
profile_cols = ["mean_mp_rate", "mean_bodies_rate", "mp_trend", "bodies_trend",
                "mean_mp_robust_z", "mean_bodies_robust_z", "log_population"]

hier_profile = clusters_df.groupby("hierarchical_cluster")[profile_cols].agg(["mean", "count"])
# Flatten multi-level columns
hier_profile.columns = [f"{col}_{stat}" for col, stat in hier_profile.columns]
# Keep only mean columns + one count column
count_col = "mean_mp_rate_count"
mean_cols = [c for c in hier_profile.columns if c.endswith("_mean")]
hier_summary = hier_profile[mean_cols].copy()
hier_summary.insert(0, "n_counties", hier_profile[count_col])
hier_summary.columns = ["n_counties"] + [c.replace("_mean", "") for c in mean_cols]

print("Hierarchical Cluster Profiles (mean values per cluster):\n")
hier_summary

In [None]:
# Scatter: mean missing persons rate vs mean bodies rate, colored by hierarchical cluster
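fig = px.scatter(
    # Cast cluster labels to strings so Plotly assigns discrete colors instead of a continuous scale
    clusters_df.assign(hierarchical_cluster=clusters_df["hierarchical_cluster"].astype(str)),
    x="mean_mp_rate",
    y="mean_bodies_rate",
    color="hierarchical_cluster",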
    hover_data=["state_name", "County", "log_population", "dbscan_cluster"],
    opacity=0.65,
    title="County Clusters: Mean Missing Persons Rate vs Mean Bodies Rate (Hierarchical Clustering)",
    labels={
        "mean_mp_rate": "Mean Missing Persons per 100k",
        "mean_bodies_rate": "Mean Unidentified Bodies per 100k",
        "hierarchical_cluster": "Hierarchical Cluster"
    },
    width=900,
    height=550
)
fig.update_traces(marker=dict(size=6, line=dict(width=0.3, color="gray")))
fig.show()

In [None]:
# DBSCAN cluster summary
dbscan_profile = clusters_df.groupby("dbscan_cluster")[profile_cols].mean()
dbscan_counts = clusters_df.groupby("dbscan_cluster").size().rename("n_counties")
dbscan_summary = pd.concat([dbscan_counts, dbscan_profile], axis=1)
dbscan_summary.index.name = "dbscan_cluster"

print("DBSCAN Cluster Profiles (cluster -1 = noise/outlier points):\n")
dbscan_summary

---
## 4. ARIMA Forecasting

ARIMA (AutoRegressive Integrated Moving Average) models were fit per zone and metric
to produce multi-year forecasts with 95% confidence intervals.

A backtest compares ARIMA out-of-sample RMSE against a simple linear trend model.
When `better_model = arima`, the ARIMA model reduced forecast error relative to the
linear baseline by the reported `improvement_pct`.
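
The backtest bookkeeping can be sketched without a time-series library. The example below holds out the last few points, fits a least-squares linear trend on the rest, and compares it with a random walk with drift, which is the closed-form forecast of an ARIMA(0,1,0) model with a constant. This stands in for whatever ARIMA orders the project actually selected. The function names and the `improvement_pct` formula (error reduction of the winner relative to the loser) are assumptions.

```python
import numpy as np

def rmse(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def backtest(series, holdout):
    """Compare out-of-sample RMSE of a linear trend vs ARIMA(0,1,0) with drift."""
    y = np.asarray(series, dtype=float)
    train, test = y[:-holdout], y[-holdout:]
    steps = np.arange(1, holdout + 1)

    # Linear baseline: least-squares trend fitted on the training window only.
    t = np.arange(len(train))
    slope, intercept = np.polyfit(t, train, deg=1)
    linear_fc = intercept + slope * (len(train) - 1 + steps)

    # ARIMA(0,1,0) with drift: last value plus the mean first difference per step.
    drift = np.diff(train).mean()
    arima_fc = train[-1] + drift * steps

    linear_rmse, arima_rmse = rmse(test, linear_fc), rmse(test, arima_fc)
    better = "arima" if arima_rmse < linear_rmse else "linear"
    worse = max(linear_rmse, arima_rmse)
    # Assumed definition: percent error reduction of the better model vs the worse one.
    improvement = 0.0 if worse == 0 else 100 * (worse - min(linear_rmse, arima_rmse)) / worse
    return {"linear_rmse": linear_rmse, "arima_rmse": arima_rmse,
            "better_model": better, "improvement_pct": improvement}
```

Fitting both models on the training window only is what makes the RMSE out-of-sample; refitting on the full series before scoring would leak the held-out years into the comparison.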

In [None]:
# Load backtest results
backtest_df = pd.read_csv(DATA_DIR + "forecast_backtest.csv")
print("Forecast Backtest: ARIMA vs Linear Model RMSE Comparison\n")
backtest_df

In [None]:
# Backtest comparison bar chart
fig, ax = plt.subplots(figsize=(12, 5))
bt = backtest_df.copy()
bt["label"] = bt["zone"] + " | " + bt["metric"]
x = np.arange(len(bt))
width = 0.35

bars1 = ax.bar(x - width / 2, bt["linear_rmse"], width, label="Linear RMSE", color="#3498db", edgecolor="black", linewidth=0.4)
bars2 = ax.bar(x + width / 2, bt["arima_rmse"], width, label="ARIMA RMSE", color="#e74c3c", edgecolor="black", linewidth=0.4)

ax.set_xlabel("Zone | Metric")
ax.set_ylabel("RMSE")
ax.set_title("Backtest RMSE Comparison: Linear vs ARIMA", fontsize=13, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(bt["label"], rotation=45, ha="right", fontsize=8)
ax.legend()

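# Annotate with better model (enumerate gives bar positions even if the index is not 0..n-1)
for i, (_, row) in enumerate(bt.iterrows()):
    better = row["better_model"].upper()
    pct = row["improvement_pct"]
    y_pos = max(row["linear_rmse"], row["arima_rmse"]) + 1
    # The "+" format flag shows the sign, avoiding labels like "+-3.0%" for negative values
    ax.text(i, y_pos, f"{better} {pct:+.1f}%", ha="center", fontsize=7, fontweight="bold")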

plt.tight_layout()
plt.show()

In [None]:
# Load ARIMA forecasts
arima_df = pd.read_csv(DATA_DIR + "arima_forecasts.csv")
print(f"ARIMA forecasts: {arima_df.shape[0]} rows")
print(f"Zones: {arima_df['zone'].unique().tolist()}")
print(f"Metrics: {arima_df['metric'].unique().tolist()}")
print(f"Year range: {arima_df['year'].min()} - {arima_df['year'].max()}")
arima_df.head()

In [None]:
# Plot ARIMA forecasts with 95% confidence intervals for 2 example zones
example_zones = ["US-Mexico Border", "I-35 Corridor"]
example_metric = "mp_count"

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=False)

for idx, zone in enumerate(example_zones):
    ax = axes[idx]
    subset = arima_df[(arima_df["zone"] == zone) & (arima_df["metric"] == example_metric)].sort_values("year")

    if subset.empty:
        ax.set_title(f"{zone} -- no data for {example_metric}")
        continue

    years = subset["year"].values
    forecast = subset["arima_forecast"].values
    lower = subset["arima_lower_95"].values
    upper = subset["arima_upper_95"].values
    order = subset["arima_order"].iloc[0]

    ax.plot(years, forecast, "o-", color="#2c3e50", linewidth=2, markersize=5, label="ARIMA Forecast")
    ax.fill_between(years, lower, upper, alpha=0.25, color="#3498db", label="95% CI")
    ax.set_title(f"{zone} -- {example_metric}\nARIMA{order}", fontsize=11, fontweight="bold")
    ax.set_xlabel("Year")
    ax.set_ylabel("Forecast Count")
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.xaxis.set_major_locator(mticker.MaxNLocator(integer=True))

plt.suptitle("ARIMA Forecasts with 95% Confidence Intervals", fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

---
## 5. Overperformance Analysis (Covariate-Adjusted Outliers)

A Random Forest regression model predicts expected missing persons rates based on
socioeconomic covariates (poverty rate, median income, unemployment, foreign-born
percentage, population density). Counties whose **observed** rate far exceeds the
**predicted** rate have large positive residuals, indicating they are anomalous
even after controlling for known risk factors.

- `rf_predicted_mp`: the RF-predicted missing persons rate given the county's covariates.
- `rf_residual_mp`: observed minus predicted (positive = higher than expected).
- `rf_residual_mp_z`: z-score of the residual (how many standard deviations above expected).
- `rf_overperformance_rank`: rank where 1 = highest overperformance.
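
Given observed and model-predicted rates, the three derived columns could be computed roughly as follows. This is a sketch under assumptions: the z-score here is centered on the mean residual and uses the population standard deviation, and ties in the ranking are broken by position. The project's exact conventions are not shown in this notebook, and the function name is hypothetical.

```python
import numpy as np

def overperformance(observed, predicted):
    """Return (residual, residual z-score, rank) where rank 1 = largest positive residual."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    # Positive residual = observed rate higher than the covariates predict.
    residual = observed - predicted

    # Assumed standardization: center on the mean residual, scale by population std.
    z = (residual - residual.mean()) / residual.std(ddof=0)

    # Rank 1 goes to the largest residual, rank n to the smallest.
    order = np.argsort(-residual)
    rank = np.empty(len(residual), dtype=int)
    rank[order] = np.arange(1, len(residual) + 1)
    return residual, z, rank
```

For example, `overperformance([12, 30, 8, 15], [10, 14, 9, 15])` yields residuals `[2, 16, -1, 0]`, so the second observation receives rank 1.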

In [None]:
# Load covariate-adjusted outliers
cov_df = pd.read_csv(DATA_DIR + "covariate_adjusted_outliers.csv")
print(f"Covariate-adjusted outlier data: {cov_df.shape[0]} county-decade observations\n")

# Top RF overperformers (observed >> predicted)
top_overperf = (
    cov_df
    .sort_values("rf_overperformance_rank")
    .head(20)
    [["State", "County", "decade", "missing_per_100k", "rf_predicted_mp",
      "rf_residual_mp", "rf_residual_mp_z", "rf_overperformance_rank"]]
    .reset_index(drop=True)
)
top_overperf.index = top_overperf.index + 1
top_overperf.index.name = "Rank"
print("Top 20 RF Overperformers (observed missing rate far exceeds covariate-predicted rate):\n")
top_overperf

In [None]:
# Observed vs Predicted missing persons rate (RF model)
# Color by residual z-score to highlight overperformers

fig = px.scatter(
    cov_df,
    x="rf_predicted_mp",
    y="missing_per_100k",
    color="rf_residual_mp_z",
    color_continuous_scale="RdYlBu_r",
    hover_data=["State", "County", "decade", "rf_residual_mp", "rf_overperformance_rank"],
    opacity=0.6,
    title="Observed vs RF-Predicted Missing Persons Rate (Covariate-Adjusted)",
    labels={
        "rf_predicted_mp": "RF Predicted Missing per 100k",
        "missing_per_100k": "Observed Missing per 100k",
        "rf_residual_mp_z": "Residual Z-Score"
    },
    width=900,
    height=600
)

# Add diagonal reference line (perfect prediction)
max_val = max(cov_df["missing_per_100k"].max(), cov_df["rf_predicted_mp"].max())
fig.add_trace(
    go.Scatter(
        x=[0, max_val],
        y=[0, max_val],
        mode="lines",
        line=dict(dash="dash", color="black", width=1.5),
        name="Perfect Prediction",
        showlegend=True
    )
)

fig.update_traces(marker=dict(size=5), selector=dict(mode="markers"))
fig.update_layout(coloraxis_colorbar=dict(title="Residual Z"))
fig.show()

In [None]:
# Distribution of RF residual z-scores
fig, ax = plt.subplots(figsize=(10, 4))

ax.hist(cov_df["rf_residual_mp_z"].dropna(), bins=60, color="#2c3e50", edgecolor="white", linewidth=0.3, alpha=0.85)
ax.axvline(x=2, color="#e74c3c", linestyle="--", linewidth=1.5, label="z = 2 (overperformer threshold)")
ax.axvline(x=-2, color="#3498db", linestyle="--", linewidth=1.5, label="z = -2 (underperformer threshold)")
ax.set_xlabel("RF Residual Z-Score (missing persons)")
ax.set_ylabel("Count")
ax.set_title("Distribution of Covariate-Adjusted Residual Z-Scores", fontsize=12, fontweight="bold")
ax.legend(fontsize=9)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

n_over = (cov_df["rf_residual_mp_z"] > 2).sum()
n_under = (cov_df["rf_residual_mp_z"] < -2).sum()
print(f"County-decade observations with z > 2 (overperformers): {n_over}")
print(f"County-decade observations with z < -2 (underperformers): {n_under}")

---
## 6. Summary of Key Findings

**ML Anomaly Detection:**
- The ensemble approach (Isolation Forest + LOF) classifies county-decade observations into four
  tiers: Normal, Suspicious, Anomalous, and Extreme Anomaly.
- The scatter of IF vs LOF scores shows how the classes separate along each score. Observations
  that look extreme on one axis but not the other are where the two algorithms capture
  complementary notions of anomaly.
- Concordance analysis shows the degree to which unsupervised ML flags align with statistical
  alert thresholds, so each approach serves as an independent cross-check on the other.

**Spatial Autocorrelation:**
- Global Moran's I values confirm statistically significant spatial clustering of missing persons
  rates across counties -- nearby counties do not have independent rates.
- LISA analysis identifies persistent HH hot spots (counties with high rates surrounded by
  high-rate neighbors), which may warrant focused investigation.

**County Clustering:**
- Hierarchical clustering separates counties into distinct behavioral profiles ranging from
  low-rate metropolitan counties to high-rate rural outliers.
- DBSCAN noise points (cluster -1) represent counties that do not fit any dense cluster,
  often corresponding to extreme or unusual behavioral patterns.

**ARIMA Forecasting:**
- Backtesting shows that ARIMA models outperform linear baselines for some zone-metric
  combinations, though improvement varies.
- Forecast confidence intervals widen substantially over multi-year horizons, reflecting
  inherent uncertainty in extrapolation.

**Covariate-Adjusted Overperformance:**
- Random Forest residual analysis identifies counties whose missing persons rates significantly
  exceed what socioeconomic covariates predict, making them strong candidates for further
  investigation.
- The observed-vs-predicted scatter highlights county-decade observations that fall well above the
  diagonal, indicating rates not explained by poverty, income, unemployment, foreign-born
  percentage, or population density.