# DBSCAN Clustering Analysis

**Notebook**: 01_dbscan_clustering.ipynb  
**Sprint**: Phase 2 Sprint 8 - Advanced Geospatial Analysis  
**Created**: 2025-11-08  

## Objectives

1. Apply DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to identify accident hotspots
2. Analyze cluster characteristics (size, fatalities, aircraft types)
3. Identify top hotspot clusters
4. Visualize clusters on interactive maps
5. Analyze temporal evolution of clusters

## DBSCAN Parameters

- **eps**: 50 km (search radius, converted to radians for Haversine metric)
- **min_samples**: 10 (minimum accidents to form a cluster)
- **metric**: Haversine (great-circle distance for lat/lon)
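
The conversion behind `eps` is arc length divided by radius: an angle of d / R radians subtends d km on a sphere of radius R. The sketch below uses only the standard library to show that conversion next to a reference Haversine distance. The function name `haversine_km` is illustrative and not part of the notebook; the 6371 km mean Earth radius is the value the notebook uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as used in the notebook


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


eps_km = 50
eps_rad = eps_km / EARTH_RADIUS_KM  # angle (radians) that subtends 50 km

print(f"eps = {eps_rad:.6f} rad")
print(f"1 degree of longitude at the equator = {haversine_km(0, 0, 0, 1):.1f} km")
```

Because the metric works on angles, the same `eps` covers the same ground distance at any latitude, which would not be true of a threshold expressed in degrees.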

## Expected Output

- Cluster assignments for all events
- Cluster statistics and rankings
- Interactive Folium map with clusters
- Temporal cluster evolution analysis

In [None]:
# Standard library
import json
from pathlib import Path
from typing import Dict, List

# Data manipulation
import pandas as pd
import numpy as np

# Geospatial
import geopandas as gpd
from shapely.geometry import Point, MultiPoint

# Machine learning
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import MarkerCluster

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

# Paths
DATA_DIR = Path('../../data')
FIG_DIR = Path('figures')
MAP_DIR = Path('maps')
FIG_DIR.mkdir(exist_ok=True)
MAP_DIR.mkdir(exist_ok=True)

print('✅ All packages imported successfully')

## 1. Load Geospatial Dataset

In [None]:
# Load geospatial data
gdf = gpd.read_parquet(DATA_DIR / 'geospatial_events.parquet')

# Load statistics
with open(DATA_DIR / 'geospatial_events_stats.json', 'r') as f:
    stats = json.load(f)

print(f'✅ Loaded {len(gdf):,} events')
print(f'Date range: {stats["date_range"]["min"]} to {stats["date_range"]["max"]}')
print(f'Total fatalities: {stats["fatalities"]["total"]:,}')
print(f'CRS: {gdf.crs}')

# Display sample
gdf.head()
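
Before clustering, it may be worth confirming that every coordinate is finite and in range, since scikit-learn estimators generally reject NaN inputs and an out-of-range value would distort distances. A minimal sketch on a made-up array follows; the sample values and the `valid` mask name are illustrative, and in the notebook the array would come from the `dec_latitude` and `dec_longitude` columns.

```python
import numpy as np

# Hypothetical sample: columns are [latitude, longitude] in degrees
coords = np.array([
    [34.05, -118.25],   # valid
    [np.nan, -97.00],   # missing latitude
    [95.00, 10.00],     # latitude out of range
    [40.71, -74.01],    # valid
    [12.00, 200.00],    # longitude out of range
])

lat, lon = coords[:, 0], coords[:, 1]
valid = (
    np.isfinite(coords).all(axis=1)   # no NaN or inf in either column
    & (np.abs(lat) <= 90)
    & (np.abs(lon) <= 180)
)

clean = coords[valid]
print(f"Kept {valid.sum()} of {len(coords)} rows")
```

Filtering the GeoDataFrame with the same mask before extracting coordinates keeps the cluster labels aligned with the remaining rows.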

## 2. DBSCAN Clustering

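Apply DBSCAN with the Haversine metric to identify spatial clusters. scikit-learn's Haversine metric expects `[latitude, longitude]` pairs in radians, so both the coordinates and `eps` are converted before fitting.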
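
To see the behavior on a case small enough to check by hand, the sketch below runs the same configuration on synthetic points: two tight groups of 12 points each and three isolated points. All coordinates are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic [lat, lon] points in degrees: two dense groups plus isolated points
offsets = np.arange(12) * 0.01  # each group spans well under 20 km
group_a = np.column_stack([40.0 + offsets, -100.0 + offsets])
group_b = np.column_stack([35.0 + offsets, -90.0 + offsets])
isolated = np.array([[10.0, 10.0], [-20.0, 50.0], [60.0, 120.0]])
points_deg = np.vstack([group_a, group_b, isolated])

eps_rad = 50 / 6371  # 50 km expressed as an angle in radians
model = DBSCAN(eps=eps_rad, min_samples=10, metric="haversine")
labels = model.fit_predict(np.radians(points_deg))  # input must be in radians

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

With these inputs each dense group forms one cluster, and the three isolated points receive the noise label `-1` because none has 10 neighbors within 50 km.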

In [None]:
# Extract coordinates
coords = gdf[['dec_latitude', 'dec_longitude']].values

# Convert to radians for Haversine metric
coords_rad = np.radians(coords)

# DBSCAN parameters
# eps in radians: 50 km / 6371 km (Earth radius) ≈ 0.00785
eps_km = 50  # kilometers
earth_radius_km = 6371
eps_rad = eps_km / earth_radius_km
min_samples = 10

print(f'DBSCAN Parameters:')
print(f'  eps: {eps_km} km ({eps_rad:.6f} radians)')
print(f'  min_samples: {min_samples}')
print(f'  metric: haversine')
print(f'\nRunning DBSCAN...')

# Run DBSCAN
db = DBSCAN(eps=eps_rad, min_samples=min_samples, metric='haversine', n_jobs=-1)
labels = db.fit_predict(coords_rad)

# Add cluster labels to GeoDataFrame
gdf['cluster'] = labels

# Analyze results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
n_clustered = len(gdf) - n_noise

print(f'\n✅ DBSCAN Complete')
print(f'Number of clusters: {n_clusters}')
print(f'Clustered events: {n_clustered:,} ({n_clustered/len(gdf)*100:.2f}%)')
print(f'Noise events: {n_noise:,} ({n_noise/len(gdf)*100:.2f}%)')

## 3. Cluster Statistics Analysis

In [None]:
# Calculate cluster statistics (exclude noise cluster -1)
cluster_stats = []

for cluster_id in range(n_clusters):
    cluster_data = gdf[gdf['cluster'] == cluster_id]
    
    # Calculate centroid
    centroid_lat = cluster_data['dec_latitude'].mean()
    centroid_lon = cluster_data['dec_longitude'].mean()
    
    # Most common state
    dominant_state = cluster_data['ev_state'].mode()[0] if len(cluster_data['ev_state'].mode()) > 0 else 'Unknown'
    
    # Fatality statistics
    total_fatalities = cluster_data['inj_tot_f'].sum()
    fatal_accidents = (cluster_data['inj_tot_f'] > 0).sum()
    
    # Aircraft statistics
    top_aircraft = cluster_data['acft_make'].mode()[0] if len(cluster_data['acft_make'].mode()) > 0 else 'Unknown'
    
    # Temporal statistics
    year_min = cluster_data['ev_year'].min()
    year_max = cluster_data['ev_year'].max()
    
    cluster_stats.append({
        'cluster_id': cluster_id,
        'size': len(cluster_data),
        'centroid_lat': centroid_lat,
        'centroid_lon': centroid_lon,
        'dominant_state': dominant_state,
        'total_fatalities': int(total_fatalities),
        'fatal_accidents': int(fatal_accidents),
        'avg_fatalities_per_accident': total_fatalities / len(cluster_data),
        'fatal_accident_rate': fatal_accidents / len(cluster_data),
        'top_aircraft_make': top_aircraft,
        'year_min': int(year_min),
        'year_max': int(year_max),
        'year_span': int(year_max - year_min)
    })

cluster_df = pd.DataFrame(cluster_stats)

# Save cluster statistics
stats_path = DATA_DIR / 'cluster_statistics.csv'
cluster_df.to_csv(stats_path, index=False)
print(f'✅ Saved cluster statistics to {stats_path}')

# Display top 10 clusters by size
print('\n=== Top 10 Clusters by Size ===')
print(cluster_df.nlargest(10, 'size')[[
    'cluster_id', 'size', 'dominant_state', 'total_fatalities', 
    'fatal_accidents', 'centroid_lat', 'centroid_lon'
]])
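
The centroids above are arithmetic means of latitude and longitude, a reasonable approximation for compact clusters away from the poles and the 180° meridian. Where that assumption might not hold, one alternative is to average the points as 3D unit vectors and convert the result back to latitude and longitude. The sketch below illustrates that technique; it is not the notebook's method, and `spherical_centroid` is a hypothetical helper.

```python
import math


def spherical_centroid(points_deg):
    """Mean direction of (lat, lon) points in degrees, via 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon in points_deg:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    lat_c = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon_c = math.degrees(math.atan2(y, x))
    return lat_c, lon_c


# Two points straddling the 180-degree meridian
pts = [(0.0, 179.0), (0.0, -179.0)]
naive_lon = sum(lon for _, lon in pts) / len(pts)  # arithmetic mean
lat_c, lon_c = spherical_centroid(pts)

print(f"arithmetic mean longitude: {naive_lon:.1f}")
print(f"spherical centroid longitude: {lon_c:.1f}")
```

The arithmetic mean places the centroid on the opposite side of the globe, while the vector average stays on the meridian between the two points. For a cluster within a 50 km neighborhood at mid-latitudes the two methods agree closely.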

In [None]:
# Top 10 clusters by fatality count
print('\n=== Top 10 Clusters by Total Fatalities ===')
top_fatality_clusters = cluster_df.nlargest(10, 'total_fatalities')[[
    'cluster_id', 'size', 'dominant_state', 'total_fatalities', 
    'fatal_accidents', 'avg_fatalities_per_accident'
]]
print(top_fatality_clusters)

# Top 10 clusters by fatal accident rate
print('\n=== Top 10 Clusters by Fatal Accident Rate ===')
top_fatal_rate_clusters = cluster_df[cluster_df['size'] >= 20].nlargest(10, 'fatal_accident_rate')[[
    'cluster_id', 'size', 'dominant_state', 'fatal_accident_rate', 'total_fatalities'
]]
print(top_fatal_rate_clusters)

## 4. Cluster Visualizations

In [None]:
# Figure 1: Cluster size distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Histogram
ax1.hist(cluster_df['size'], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
ax1.set_title('Cluster Size Distribution', fontsize=14, fontweight='bold')
ax1.set_xlabel('Cluster Size (number of accidents)', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.axvline(cluster_df['size'].median(), color='red', linestyle='--', label=f'Median: {cluster_df["size"].median():.0f}')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Box plot
ax2.boxplot(cluster_df['size'])  # vertical orientation is the default
ax2.set_title('Cluster Size Box Plot', fontsize=14, fontweight='bold')
ax2.set_ylabel('Cluster Size', fontsize=12)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIG_DIR / 'dbscan_cluster_size_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: dbscan_cluster_size_distribution.png')

In [None]:
# Figure 2: Cluster fatality analysis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Scatter: Size vs Fatalities
ax1.scatter(cluster_df['size'], cluster_df['total_fatalities'], 
            s=50, alpha=0.6, c=cluster_df['fatal_accident_rate'], 
            cmap='YlOrRd', edgecolors='black', linewidth=0.5)
ax1.set_title('Cluster Size vs Total Fatalities', fontsize=14, fontweight='bold')
ax1.set_xlabel('Cluster Size (accidents)', fontsize=12)
ax1.set_ylabel('Total Fatalities', fontsize=12)
ax1.grid(True, alpha=0.3)
cbar = plt.colorbar(ax1.collections[0], ax=ax1)
cbar.set_label('Fatal Accident Rate', fontsize=10)

# Box plot: Fatalities per accident by cluster size category
# The top bin is open-ended so clusters above 1,000 accidents are not dropped as NaN
cluster_df['size_category'] = pd.cut(cluster_df['size'], 
                                       bins=[0, 20, 50, 100, np.inf], 
                                       labels=['Small (≤20)', 'Medium (21-50)', 'Large (51-100)', 'Very Large (>100)'])
cluster_df.boxplot(column='avg_fatalities_per_accident', by='size_category', ax=ax2)
ax2.set_title('Average Fatalities per Accident by Cluster Size', fontsize=14, fontweight='bold')
ax2.set_xlabel('Cluster Size Category', fontsize=12)
ax2.set_ylabel('Avg Fatalities per Accident', fontsize=12)
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.savefig(FIG_DIR / 'dbscan_cluster_fatality_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: dbscan_cluster_fatality_analysis.png')

In [None]:
# Figure 3: Clusters by state (top 15 states)
state_cluster_counts = cluster_df['dominant_state'].value_counts().head(15)

fig, ax = plt.subplots(figsize=(12, 6))
state_cluster_counts.plot(kind='bar', ax=ax, color='teal', edgecolor='black', alpha=0.7)
ax.set_title('Number of Accident Clusters by State (Top 15)', fontsize=14, fontweight='bold')
ax.set_xlabel('State', fontsize=12)
ax.set_ylabel('Number of Clusters', fontsize=12)
ax.grid(True, alpha=0.3, axis='y')
plt.xticks(rotation=45)
for i, v in enumerate(state_cluster_counts.values):
    ax.text(i, v, f' {v}', ha='center', va='bottom', fontsize=10)
plt.tight_layout()
plt.savefig(FIG_DIR / 'dbscan_clusters_by_state.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: dbscan_clusters_by_state.png')

## 5. Interactive Cluster Map

Create a Folium map with color-coded clusters and centroids.
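
The color coding can be sketched as a lookup table built once: the largest clusters each get a distinct color and every other cluster falls back to gray. The helper name `build_color_map` and the sample sizes are illustrative. The sketch uses CSS color names on the assumption that circle markers are drawn by the browser, which resolves colors as CSS.

```python
def build_color_map(sizes_by_cluster, palette, fallback="gray"):
    """Map the largest clusters to palette colors; others use the fallback."""
    ranked = sorted(sizes_by_cluster, key=sizes_by_cluster.get, reverse=True)
    color_map = {cid: fallback for cid in sizes_by_cluster}
    for cid, color in zip(ranked, palette):
        color_map[cid] = color
    return color_map


# Hypothetical cluster sizes keyed by cluster id
sizes = {0: 15, 1: 240, 2: 60, 3: 12}
palette = ["red", "blue"]  # CSS color names

color_map = build_color_map(sizes, palette)
print(color_map)
```

A dictionary lookup avoids searching the top-10 list for every cluster and keeps the palette and the ranking in one place.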

In [None]:
# Create base map centered on US
m = folium.Map(
    location=[39.8283, -98.5795],  # Geographic center of contiguous US
    zoom_start=4,
    tiles='OpenStreetMap'
)

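# Color palette for clusters (top 10 clusters get distinct colors, rest get gray).
# CircleMarker colors are rendered as CSS colors, so CSS names are used here
# ('lightred' is a folium.Icon color, not a CSS color name).
top_10_clusters = cluster_df.nlargest(10, 'size')['cluster_id'].tolist()
colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkred', 'lightcoral', 
          'beige', 'darkblue', 'darkgreen']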

# Add cluster markers (sample for performance - MarkerCluster for dense areas)
for cluster_id in range(n_clusters):
    cluster_data = gdf[gdf['cluster'] == cluster_id]
    
    # Get color for this cluster
    if cluster_id in top_10_clusters:
        color = colors[top_10_clusters.index(cluster_id)]
    else:
        color = 'gray'
    
    # Sample events if cluster is large (for performance)
    if len(cluster_data) > 100:
        cluster_sample = cluster_data.sample(100, random_state=42)
    else:
        cluster_sample = cluster_data
    
    # Add markers to MarkerCluster
    marker_cluster = MarkerCluster(name=f'Cluster {cluster_id}')
    
    for idx, row in cluster_sample.iterrows():
        folium.CircleMarker(
            location=[row['dec_latitude'], row['dec_longitude']],
            radius=3,
            color=color,
            fill=True,
            fillColor=color,
            fillOpacity=0.6,
            popup=f"Event: {row['ev_id']}<br>Date: {row['ev_date']}<br>Fatalities: {row['inj_tot_f']}"
        ).add_to(marker_cluster)
    
    marker_cluster.add_to(m)

# Add cluster centroids with labels
for idx, row in cluster_df.nlargest(20, 'size').iterrows():
    folium.Marker(
        location=[row['centroid_lat'], row['centroid_lon']],
        icon=folium.Icon(color='black', icon='info-sign'),
        popup=f"""<b>Cluster {row['cluster_id']}</b><br>
                  Size: {row['size']} accidents<br>
                  State: {row['dominant_state']}<br>
                  Fatalities: {row['total_fatalities']}<br>
                  Fatal Accidents: {row['fatal_accidents']}<br>
                  Years: {row['year_min']}-{row['year_max']}"""
    ).add_to(m)

# Add legend
legend_html = f'''<div style="position: fixed; 
                bottom: 50px; right: 50px; width: 250px; height: auto; 
                background-color: white; border:2px solid grey; z-index:9999; 
                font-size:14px; padding: 10px">
                <p><b>DBSCAN Clusters</b></p>
                <p>Total Clusters: {n_clusters}<br>
                   Clustered Events: {n_clustered:,}<br>
                   Noise Events: {n_noise:,}</p>
                <p><b>Top 10 Clusters</b> (by size)<br>
                   shown with distinct colors</p>
                <p><b>Black markers</b>: Cluster centroids<br>
                   (top 20 by size)</p>
                </div>'''
m.get_root().html.add_child(folium.Element(legend_html))

# Save map
map_path = MAP_DIR / 'dbscan_clusters.html'
m.save(str(map_path))
print(f'✅ Saved interactive map: {map_path}')

# Display in notebook
m

## 6. Temporal Evolution of Clusters

Analyze how clusters have changed over decades.
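
Bucketing by decade is integer division followed by a grouped count. A compact sketch with pandas `groupby` follows, on invented rows that mimic the `cluster`, `ev_year`, and `inj_tot_f` columns.

```python
import pandas as pd

# Hypothetical events; cluster -1 marks DBSCAN noise
events = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, -1],
    "ev_year":   [1985, 1989, 1994, 2003, 2011, 1999],
    "inj_tot_f": [0, 2, 1, 0, 3, 1],
})

events["decade"] = (events["ev_year"] // 10) * 10  # 1985 -> 1980

temporal = (
    events[events["cluster"] >= 0]          # drop noise points
    .groupby(["cluster", "decade"])
    .agg(accidents=("inj_tot_f", "size"), fatalities=("inj_tot_f", "sum"))
    .reset_index()
)
print(temporal)
```

Only cluster and decade combinations that contain at least one event appear in the result, which matches skipping empty decades in a loop.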

In [None]:
# Add decade column
gdf['decade'] = (gdf['ev_year'] // 10) * 10

# Analyze cluster activity by decade
temporal_analysis = []

for cluster_id in range(n_clusters):
    cluster_data = gdf[gdf['cluster'] == cluster_id]
    
    for decade in sorted(gdf['decade'].unique()):
        decade_data = cluster_data[cluster_data['decade'] == decade]
        
        if len(decade_data) > 0:
            temporal_analysis.append({
                'cluster_id': cluster_id,
                'decade': decade,
                'accidents': len(decade_data),
                'fatalities': int(decade_data['inj_tot_f'].sum())
            })

temporal_df = pd.DataFrame(temporal_analysis)

# Top 5 clusters by total size
top_5_clusters = cluster_df.nlargest(5, 'size')['cluster_id'].tolist()

# Plot temporal evolution for top 5 clusters
fig, ax = plt.subplots(figsize=(14, 6))

for cluster_id in top_5_clusters:
    cluster_temporal = temporal_df[temporal_df['cluster_id'] == cluster_id]
    cluster_info = cluster_df[cluster_df['cluster_id'] == cluster_id].iloc[0]
    
    ax.plot(cluster_temporal['decade'], cluster_temporal['accidents'], 
            marker='o', linewidth=2, markersize=8,
            label=f'Cluster {cluster_id} ({cluster_info["dominant_state"]})')

ax.set_title('Temporal Evolution of Top 5 Largest Clusters', fontsize=14, fontweight='bold')
ax.set_xlabel('Decade', fontsize=12)
ax.set_ylabel('Number of Accidents', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(FIG_DIR / 'dbscan_temporal_evolution.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: dbscan_temporal_evolution.png')

## 7. Save Results

In [None]:
# Save clustered GeoDataFrame as GeoJSON
gdf_output = gdf[['ev_id', 'ev_date', 'ev_year', 'ev_state', 'dec_latitude', 'dec_longitude',
                  'inj_tot_f', 'cluster', 'geometry']].copy()

output_path = DATA_DIR / 'dbscan_clusters.geojson'
gdf_output.to_file(output_path, driver='GeoJSON')
print(f'✅ Saved: {output_path} ({output_path.stat().st_size / 1024**2:.2f} MB)')

# Save summary statistics
summary = {
    'dbscan_parameters': {
        'eps_km': eps_km,
        'eps_radians': float(eps_rad),
        'min_samples': min_samples,
        'metric': 'haversine'
    },
    'results': {
        'total_events': len(gdf),
        'n_clusters': n_clusters,
        'n_clustered': n_clustered,
        'n_noise': n_noise,
        'clustered_pct': round(n_clustered / len(gdf) * 100, 2),
        'noise_pct': round(n_noise / len(gdf) * 100, 2)
    },
    'cluster_statistics': {
        'avg_cluster_size': float(cluster_df['size'].mean()),
        'median_cluster_size': float(cluster_df['size'].median()),
        'max_cluster_size': int(cluster_df['size'].max()),
        'min_cluster_size': int(cluster_df['size'].min()),
        'total_fatalities_in_clusters': int(cluster_df['total_fatalities'].sum())
    },
    'top_10_clusters_by_size': cluster_df.nlargest(10, 'size')[[
        'cluster_id', 'size', 'dominant_state', 'total_fatalities'
    ]].to_dict('records'),
    'top_10_clusters_by_fatalities': cluster_df.nlargest(10, 'total_fatalities')[[
        'cluster_id', 'size', 'dominant_state', 'total_fatalities'
    ]].to_dict('records')
}

summary_path = DATA_DIR / 'dbscan_summary.json'
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)

print(f'✅ Saved: {summary_path}')

## Summary

**DBSCAN Clustering Complete** ✅

**Parameters**:
- eps: 50 km
- min_samples: 10
- metric: Haversine

**Results** (computed at run time and saved to `data/dbscan_summary.json` under the keys shown):
- Total Clusters: `results.n_clusters`
- Clustered Events: `results.n_clustered` (`results.clustered_pct`%)
- Noise Events: `results.n_noise` (`results.noise_pct`%)
- Average Cluster Size: `cluster_statistics.avg_cluster_size` accidents
- Largest Cluster: `cluster_statistics.max_cluster_size` accidents

**Top 3 Clusters by Size**: the first three entries of `top_10_clusters_by_size` in `data/dbscan_summary.json`, also printed as the first rows of the "Top 10 Clusters by Size" table in Section 3.
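
One way to show these values inline is to build the Markdown text in a code cell from the `summary` dictionary saved in Section 7. The sketch below assumes that dictionary's layout; `render_results` is a hypothetical helper, and the numbers in `example` are placeholders to exercise it, not results from this analysis.

```python
def render_results(summary):
    """Build a Markdown results block from the summary dictionary."""
    res = summary["results"]
    stats = summary["cluster_statistics"]
    lines = [
        "**Results**:",
        f"- Total clusters: {res['n_clusters']}",
        f"- Clustered events: {res['n_clustered']:,} ({res['clustered_pct']:.2f}%)",
        f"- Noise events: {res['n_noise']:,} ({res['noise_pct']:.2f}%)",
        f"- Average cluster size: {stats['avg_cluster_size']:.1f} accidents",
        f"- Largest cluster: {stats['max_cluster_size']} accidents",
        "",
        "**Top 3 Clusters by Size**:",
    ]
    top_three = summary["top_10_clusters_by_size"][:3]
    for rank, row in enumerate(top_three, start=1):
        lines.append(
            f"{rank}. Cluster {row['cluster_id']}: {row['size']} accidents "
            f"({row['dominant_state']})"
        )
    return "\n".join(lines)


# Placeholder values purely to exercise the function; not real results
example = {
    "results": {"n_clusters": 3, "n_clustered": 1200, "n_noise": 300,
                "clustered_pct": 80.0, "noise_pct": 20.0},
    "cluster_statistics": {"avg_cluster_size": 400.0, "max_cluster_size": 700},
    "top_10_clusters_by_size": [
        {"cluster_id": 2, "size": 700, "dominant_state": "AA"},
        {"cluster_id": 0, "size": 350, "dominant_state": "BB"},
        {"cluster_id": 1, "size": 150, "dominant_state": "CC"},
    ],
}

text = render_results(example)
print(text)
# In a notebook: from IPython.display import Markdown, display
#                display(Markdown(text))
```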

**Files Created**:
- `data/dbscan_clusters.geojson` - Clustered events
- `data/cluster_statistics.csv` - Cluster summary stats
- `data/dbscan_summary.json` - Analysis summary
- `maps/dbscan_clusters.html` - Interactive map
- `figures/dbscan_cluster_size_distribution.png`
- `figures/dbscan_cluster_fatality_analysis.png`
- `figures/dbscan_clusters_by_state.png`
- `figures/dbscan_temporal_evolution.png`

**Next Steps**:
- Kernel Density Estimation (02_kernel_density_estimation.ipynb)
- Getis-Ord Gi* Hotspot Analysis (03_getis_ord_gi_star.ipynb)