# Geospatial Data Preparation

**Notebook**: 00_geospatial_data_preparation.ipynb  
**Sprint**: Phase 2 Sprint 8 - Advanced Geospatial Analysis  
**Created**: 2025-11-08  

## Objectives

1. Extract geospatial data from PostgreSQL database
2. Validate and clean coordinate data
3. Create GeoDataFrame with proper CRS
4. Remove statistical outliers
5. Prepare data for geospatial analysis
6. Save processed dataset

## Data Source

- **Database**: ntsb_aviation (PostgreSQL 18.0 with PostGIS)
- **Table**: events
- **Total Events**: 179,809 (1962-2025)
- **Expected Coordinates**: ~77,891 events (43.32% coverage)

## Output

- `data/geospatial_events.parquet` - Clean geospatial dataset (EPSG:4326)
- `data/geospatial_events_projected.parquet` - Same dataset projected to EPSG:5070
- `data/geospatial_events_stats.json` - Dataset statistics

In [None]:
# Standard library
import json
from pathlib import Path
from typing import Tuple

# Data manipulation
import pandas as pd
import numpy as np

# Geospatial
import geopandas as gpd
from shapely.geometry import Point

# Database
import psycopg2
from sqlalchemy import create_engine

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 6)

# Paths
DATA_DIR = Path('../../data')
FIG_DIR = Path('figures')
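# parents=True so the call also works when intermediate directories are missing
DATA_DIR.mkdir(parents=True, exist_ok=True)
FIG_DIR.mkdir(parents=True, exist_ok=True)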

print('✅ All packages imported successfully')

## 1. Database Connection and Data Extraction

Extract events with coordinates and relevant attributes for geospatial analysis.

In [None]:
# Database connection
engine = create_engine('postgresql://parobek@localhost/ntsb_aviation')

# SQL query - extract events with coordinates
query = """
SELECT 
    ev_id,
    ev_date,
    ev_year,
    ev_state,
    ev_city,
    ev_site_zipcode,
    dec_latitude,
    dec_longitude,
    inj_tot_f,
    inj_tot_s,
    inj_tot_m,
    inj_tot_n,
    acft_damage,
    acft_make,
    acft_model,
    acft_category,
    far_part,
    flt_plan_filed,
    wx_cond_basic,
    light_cond
FROM events
WHERE dec_latitude IS NOT NULL 
  AND dec_longitude IS NOT NULL
  AND dec_latitude BETWEEN -90 AND 90
  AND dec_longitude BETWEEN -180 AND 180
ORDER BY ev_date;
"""

print('Extracting geospatial data from database...')
df = pd.read_sql(query, engine)

print(f'✅ Extracted {len(df):,} events with valid coordinates')
print(f'Date range: {df["ev_date"].min()} to {df["ev_date"].max()}')
print(f'Year range: {df["ev_year"].min()} to {df["ev_year"].max()}')
print(f'Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB')
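
The `WHERE` clause above keeps only rows whose latitude and longitude are present and inside the valid ranges. As a minimal sketch, the same rule can be written in Python, for example to re-check rows loaded from another source. The function name `is_valid_coordinate` and the optional `reject_null_island` flag are assumptions made for illustration and are not part of the notebook; the query itself does not exclude (0, 0).

```python
import math
from typing import Optional


def is_valid_coordinate(
    lat: Optional[float],
    lon: Optional[float],
    reject_null_island: bool = False,
) -> bool:
    """Mirror the SQL filter: both values present and inside valid ranges."""
    if lat is None or lon is None:
        return False
    if math.isnan(lat) or math.isnan(lon):
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    # Hypothetical extra check (not in the query): (0, 0) is a common placeholder.
    if reject_null_island and lat == 0 and lon == 0:
        return False
    return True


print(is_valid_coordinate(61.17, -149.99))  # True
print(is_valid_coordinate(95.0, -100.0))    # False
```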

## 2. Data Quality Assessment

Check for missing values, outliers, and data quality issues.

In [None]:
# Missing values analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing Count', ascending=False)

print('\n=== Missing Values Summary ===')
print(missing_df[missing_df['Missing Count'] > 0])

# Coordinate statistics
print('\n=== Coordinate Statistics ===')
print(df[['dec_latitude', 'dec_longitude']].describe())

# Fatality statistics
# int() keeps the printed totals free of a trailing ".0" if the columns load as float
total_fatalities = int(df['inj_tot_f'].sum())
total_serious = int(df['inj_tot_s'].sum())
print(f'\nTotal fatalities: {total_fatalities:,}')
print(f'Total serious injuries: {total_serious:,}')
print(f'Fatal accidents: {(df["inj_tot_f"] > 0).sum():,} ({(df["inj_tot_f"] > 0).mean()*100:.2f}%)')
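
The fatal-accident share above divides by all rows. With the usual NaN representation of missing numbers, a comparison such as `inj_tot_f > 0` evaluates to false for a missing value, so a row with an unknown fatality count is counted as non-fatal. Whether `inj_tot_f` actually contains missing values depends on the data. The sketch below uses made-up values and a hypothetical helper, `fatal_share`, only to show how the two denominators differ.

```python
import math
from typing import Sequence, Tuple


def fatal_share(fatalities: Sequence[float]) -> Tuple[float, float]:
    """Return (share over all rows, share over rows with a known count), in percent."""
    known = [v for v in fatalities if not math.isnan(v)]
    fatal = sum(1 for v in known if v > 0)
    return 100 * fatal / len(fatalities), 100 * fatal / len(known)


# Illustrative values only: two fatal events, two non-fatal, one unknown.
sample = [0.0, 2.0, float("nan"), 1.0, 0.0]
share_all, share_known = fatal_share(sample)
print(f"{share_all:.1f}% of all rows, {share_known:.1f}% of rows with known counts")
```

The helper assumes at least one known value; an all-missing input would need a guard against division by zero.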

## 3. Outlier Detection and Removal

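Use the IQR method to identify and remove coordinate outliers. The code below uses `k = 3.0`, so only extreme outliers are dropped, and it filters latitude first and then longitude on the already-filtered rows.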

In [None]:
def remove_outliers_iqr(data: pd.DataFrame, column: str, k: float = 1.5) -> Tuple[pd.DataFrame, int]:
    """Remove outliers using IQR method.
    
    Args:
        data: DataFrame to process
        column: Column name to check for outliers
        k: IQR multiplier (1.5 = mild outliers, 3.0 = extreme outliers)
    
    Returns:
        Tuple of (cleaned DataFrame, number of outliers removed)
    """
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - k * IQR
    upper_bound = Q3 + k * IQR
    
    mask = (data[column] >= lower_bound) & (data[column] <= upper_bound)
    # Cast to a plain int so the count stays JSON-serializable later on
    outliers_removed = int((~mask).sum())
    
    return data[mask], outliers_removed

# Remove latitude outliers
df_clean, lat_outliers = remove_outliers_iqr(df, 'dec_latitude', k=3.0)
print(f'Latitude outliers removed: {lat_outliers}')

# Remove longitude outliers
df_clean, lon_outliers = remove_outliers_iqr(df_clean, 'dec_longitude', k=3.0)
print(f'Longitude outliers removed: {lon_outliers}')

total_removed = lat_outliers + lon_outliers
print(f'\nTotal outliers removed: {total_removed} ({total_removed/len(df)*100:.3f}%)')
print(f'Clean dataset: {len(df_clean):,} events')
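
To make the effect of `k` concrete, here is a small sketch that computes the IQR fences for a toy list of latitudes using only the standard library. `statistics.quantiles` with `method="inclusive"` uses the same linear interpolation as the default of pandas `quantile`. The sample values and the `iqr_bounds` helper are illustrative assumptions, not notebook data.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def iqr_bounds(values: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy latitudes: mid-latitude values plus one far-north and one far-south point.
lats = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0, 61.0, -60.0]

for k in (1.5, 3.0):
    low, high = iqr_bounds(lats, k)
    kept = [v for v in lats if low <= v <= high]
    print(f"k={k}: bounds=({low:.1f}, {high:.1f}), kept {len(kept)} of {len(lats)}")
```

In this toy data the point at 61 degrees is dropped with `k=1.5` but kept with `k=3.0`, while the point at -60 degrees is dropped in both cases.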

## 4. Create GeoDataFrame

Convert to a GeoDataFrame in EPSG:4326 (WGS84, longitude/latitude in degrees), then reproject a copy to a metre-based CRS.

In [None]:
# Create the GeoDataFrame; points_from_xy builds Point geometries from x=longitude, y=latitude
gdf = gpd.GeoDataFrame(
    df_clean,
    geometry=gpd.points_from_xy(df_clean['dec_longitude'], df_clean['dec_latitude']),
    crs='EPSG:4326'  # WGS84 (lon/lat in degrees)
)

print(f'✅ GeoDataFrame created with {len(gdf):,} events')
print(f'CRS: {gdf.crs}')
print('\nBounds:')
print(f'  Latitude:  {gdf["dec_latitude"].min():.6f} to {gdf["dec_latitude"].max():.6f}')
print(f'  Longitude: {gdf["dec_longitude"].min():.6f} to {gdf["dec_longitude"].max():.6f}')

# Project to NAD83 / Conus Albers (EPSG:5070), an equal-area CRS in metres defined
# for the contiguous US, so that distance-based analysis works in metres, not degrees
gdf_proj = gdf.to_crs('EPSG:5070')
print('\n✅ Projected to EPSG:5070 (Albers Equal Area) for spatial analysis')
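
Distances measured directly in EPSG:4326 are in degrees, and the ground length of one degree of longitude shrinks toward the poles, which is one reason to reproject before distance-based analysis. The sketch below illustrates this with the haversine formula on a spherical Earth (mean radius 6371.0088 km, an approximation). It is a standalone illustration with a hypothetical helper, `haversine_km`, and is not part of the notebook's pipeline.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Ground length of one degree of longitude at several latitudes.
for lat in (0, 30, 45, 60):
    print(f"lat {lat:>2}: {haversine_km(lat, 0, lat, 1):.1f} km per degree of longitude")
```

At 60 degrees latitude a degree of longitude covers roughly half the ground distance it does at the equator, so a fixed radius in degrees does not correspond to a fixed radius on the ground.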

## 5. Exploratory Visualizations

In [None]:
# Figure 1: Coordinate scatter plot (all events)
fig, ax = plt.subplots(figsize=(14, 8))
gdf.plot(ax=ax, markersize=0.5, alpha=0.3, color='blue')
ax.set_title(f'NTSB Aviation Accidents with Coordinates (n={len(gdf):,})', fontsize=14, fontweight='bold')
ax.set_xlabel('Longitude', fontsize=12)
ax.set_ylabel('Latitude', fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(FIG_DIR / 'coordinate_scatter_all.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: coordinate_scatter_all.png')

In [None]:
# Figure 2: Event distribution by state (top 20)
state_counts = gdf['ev_state'].value_counts().head(20)

fig, ax = plt.subplots(figsize=(12, 8))
state_counts.plot(kind='barh', ax=ax, color='steelblue')
ax.set_title('Top 20 States by Accident Count (with coordinates)', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of Accidents', fontsize=12)
ax.set_ylabel('State', fontsize=12)
ax.invert_yaxis()
for i, v in enumerate(state_counts.values):
    ax.text(v, i, f' {v:,}', va='center', fontsize=10)
plt.tight_layout()
plt.savefig(FIG_DIR / 'state_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: state_distribution.png')

In [None]:
# Figure 3: Missing coordinate analysis
missing_coords_query = """
SELECT 
    ev_year,
    COUNT(*) as total_events,
    COUNT(dec_latitude) as with_coords,
    COUNT(*) - COUNT(dec_latitude) as missing_coords,
    ROUND(100.0 * COUNT(dec_latitude) / COUNT(*), 2) as coverage_pct
FROM events
WHERE ev_year IS NOT NULL
GROUP BY ev_year
ORDER BY ev_year;
"""

# Named coverage_df so the missing-values table from Section 2 is not overwritten
coverage_df = pd.read_sql(missing_coords_query, engine)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

# Plot 1: Coverage percentage over time
ax1.plot(coverage_df['ev_year'], coverage_df['coverage_pct'], linewidth=2, color='green', label='Coverage %')
ax1.fill_between(coverage_df['ev_year'], 0, coverage_df['coverage_pct'], alpha=0.3, color='green')
ax1.set_title('Coordinate Coverage Over Time', fontsize=14, fontweight='bold')
ax1.set_xlabel('Year', fontsize=12)
ax1.set_ylabel('Coverage %', fontsize=12)
ax1.grid(True, alpha=0.3)
ax1.legend()

# Plot 2: Events with/without coordinates (stacked bars)
ax2.bar(coverage_df['ev_year'], coverage_df['with_coords'], label='With Coordinates', color='steelblue', alpha=0.7)
ax2.bar(coverage_df['ev_year'], coverage_df['missing_coords'], bottom=coverage_df['with_coords'],
        label='Missing Coordinates', color='coral', alpha=0.7)
ax2.set_title('Event Counts by Coordinate Availability', fontsize=14, fontweight='bold')
ax2.set_xlabel('Year', fontsize=12)
ax2.set_ylabel('Number of Events', fontsize=12)
ax2.legend()
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIG_DIR / 'coordinate_coverage_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print('✅ Saved: coordinate_coverage_analysis.png')

## 6. Save Processed Dataset

In [None]:
# Save to Parquet (both CRS versions)
output_path = DATA_DIR / 'geospatial_events.parquet'
output_path_proj = DATA_DIR / 'geospatial_events_projected.parquet'

gdf.to_parquet(output_path)
gdf_proj.to_parquet(output_path_proj)

print(f'✅ Saved: {output_path} ({output_path.stat().st_size / 1024**2:.2f} MB)')
print(f'✅ Saved: {output_path_proj} ({output_path_proj.stat().st_size / 1024**2:.2f} MB)')

# Save dataset statistics
stats = {
    'total_events_db': 179809,
    'events_with_coords_raw': len(df),
    'outliers_removed': int(total_removed),  # NumPy integers are not JSON-serializable
    'events_clean': len(gdf),
    'coverage_pct': round(len(gdf) / 179809 * 100, 2),
    'date_range': {
        'min': str(gdf['ev_date'].min()),
        'max': str(gdf['ev_date'].max())
    },
    'year_range': {
        'min': int(gdf['ev_year'].min()),
        'max': int(gdf['ev_year'].max())
    },
    'coordinate_bounds': {
        'lat_min': float(gdf['dec_latitude'].min()),
        'lat_max': float(gdf['dec_latitude'].max()),
        'lon_min': float(gdf['dec_longitude'].min()),
        'lon_max': float(gdf['dec_longitude'].max())
    },
    'fatalities': {
        'total': int(gdf['inj_tot_f'].sum()),
        'fatal_accidents': int((gdf['inj_tot_f'] > 0).sum()),
        'fatal_accident_pct': round((gdf['inj_tot_f'] > 0).mean() * 100, 2)
    },
    'top_states': state_counts.head(10).to_dict(),
    'crs': {
        'original': str(gdf.crs),
        'projected': str(gdf_proj.crs)
    }
}

stats_path = DATA_DIR / 'geospatial_events_stats.json'
with open(stats_path, 'w') as f:
    json.dump(stats, f, indent=2)

print(f'✅ Saved: {stats_path}')
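
Because the headline figures depend on the run, one option is to build the summary text from the saved statistics. The sketch below formats a dictionary with the same keys as `stats`. The helper name `format_summary` is hypothetical, and the numbers in the example are placeholders chosen only to exercise the formatting, not results.

```python
from typing import Any, Dict


def format_summary(stats: Dict[str, Any]) -> str:
    """Render run-dependent summary lines from a stats dictionary."""
    fat = stats["fatalities"]
    lines = [
        f"Total events in database: {stats['total_events_db']:,}",
        f"Events with coordinates: {stats['events_clean']:,} ({stats['coverage_pct']:.2f}%)",
        f"Outliers removed: {stats['outliers_removed']:,}",
        f"Date range: {stats['date_range']['min']} to {stats['date_range']['max']}",
        f"Fatal accidents: {fat['fatal_accidents']:,} ({fat['fatal_accident_pct']:.2f}%)",
    ]
    return "\n".join(lines)


# Placeholder values for illustration only; real values come from the notebook run.
example = {
    "total_events_db": 1000,
    "events_clean": 400,
    "coverage_pct": 40.0,
    "outliers_removed": 5,
    "date_range": {"min": "2000-01-01", "max": "2000-12-31"},
    "fatalities": {"fatal_accidents": 80, "fatal_accident_pct": 20.0},
}
print(format_summary(example))
```

In the notebook, the same function could be called on the `stats` dictionary itself or on the result of `json.load` applied to the saved statistics file.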

## Summary

**Dataset Prepared** ✅

- **Total Events**: 179,809 in database
- **Events with Coordinates**: see `events_clean` and `coverage_pct` in `geospatial_events_stats.json`
- **Outliers Removed**: see `outliers_removed`
- **Date Range**: see `date_range`
- **Year Range**: see `year_range`
- **Total Fatalities**: see `fatalities.total`
- **Fatal Accidents**: see `fatalities.fatal_accidents` and `fatalities.fatal_accident_pct`

**Files Created**:
- `data/geospatial_events.parquet` - GeoDataFrame (EPSG:4326)
- `data/geospatial_events_projected.parquet` - GeoDataFrame (EPSG:5070)
- `data/geospatial_events_stats.json` - Statistics
- `figures/coordinate_scatter_all.png` - Coordinate scatter plot
- `figures/state_distribution.png` - State distribution
- `figures/coordinate_coverage_analysis.png` - Coverage analysis

**Next Steps**:
1. DBSCAN Clustering (01_dbscan_clustering.ipynb)
2. Kernel Density Estimation (02_kernel_density_estimation.ipynb)
3. Getis-Ord Gi* Analysis (03_getis_ord_gi_star.ipynb)
4. Moran's I Autocorrelation (04_morans_i_autocorrelation.ipynb)
5. Interactive Visualizations (05_interactive_geospatial_viz.ipynb)