# Phase 3: Processing - Data Cleaning

## Overview

Clean and standardize the crime data for analysis.

### Objectives
1. Handle missing values (deletion vs. imputation)
2. Remove or flag duplicates
3. Parse and standardize date/time formats
4. Fix geographic coordinate issues
5. Standardize categorical fields (crime types, districts)
6. Save cleaned data for next phases

## Cell 1: Setup and Imports

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np

# Add project root to path
PROJECT_ROOT = Path.cwd().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from src.data import loader
from src.utils.config import get_processed_data_path

print("Imports successful")

Imports successful


## Cell 2: Load Data

In [2]:
df = loader.load_crime_data()
print(f"Initial records: {len(df):,}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Initial records: 3,496,353
Memory usage: 1124.02 MB


## Cell 3: Handle Missing Values

In [3]:
print("Handling missing values...\n")

# Document missing values
missing_before = df.isnull().sum()
print(f"Missing values before cleaning:")
print(missing_before[missing_before > 0])

# Strategy: Drop rows with missing critical fields
critical_cols = ['date', 'latitude', 'longitude']
critical_missing = [col for col in critical_cols if col in df.columns]
df_cleaned = df.dropna(subset=critical_missing)

print(f"\nRows removed due to missing critical fields: {len(df) - len(df_cleaned):,}")
print(f"Records after dropping critical missing: {len(df_cleaned):,}")

df = df_cleaned

Handling missing values...

Missing values before cleaning:
the_geom                 55810
the_geom_webmercator     55927
psa                       1296
hour                    102245
location_block             187
point_x                  55912
point_y                  55912
dtype: int64

Rows removed due to missing critical fields: 0
Records after dropping critical missing: 3,496,353


## Cell 4: Remove Duplicates

In [4]:
print("Checking for duplicates...\n")

before_dedup = len(df)
df = df.drop_duplicates()
after_dedup = len(df)

print(f"Duplicate rows removed: {before_dedup - after_dedup:,}")
print(f"Records after deduplication: {after_dedup:,}")

Checking for duplicates...

Duplicate rows removed: 0
Records after deduplication: 3,496,353


## Cell 5: Standardize Date/Time

In [5]:
print("Standardizing date/time fields...\n")

if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    invalid_dates = df['date'].isnull().sum()
    if invalid_dates > 0:
        print(f"⚠ {invalid_dates:,} invalid date values found (set to NaT)")
        df = df.dropna(subset=['date'])
    print(f"Date range: {df['date'].min()} to {df['date'].max()}")

if 'time_24hr' in df.columns:
    print(f"\nTime range: {df['time_24hr'].min()} to {df['time_24hr'].max()}")
    
print(f"\nRecords after date standardization: {len(df):,}")

Standardizing date/time fields...


Records after date standardization: 3,496,353


## Cell 6: Fix Geographic Coordinates

## Cell 6: Fix Geographic Coordinates

In [6]:
print("Validating geographic coordinates...\n")

# Philadelphia bounds
PHI_LAT_MIN, PHI_LAT_MAX = 39.8, 40.1
PHI_LON_MIN, PHI_LON_MAX = -75.3, -74.9

if 'latitude' in df.columns and 'longitude' in df.columns:
    invalid_before = len(df[
        (df['latitude'] < PHI_LAT_MIN) | (df['latitude'] > PHI_LAT_MAX) |
        (df['longitude'] < PHI_LON_MIN) | (df['longitude'] > PHI_LON_MAX)
    ])
    
    # Remove records with invalid coordinates
    df = df[
        (df['latitude'] >= PHI_LAT_MIN) & (df['latitude'] <= PHI_LAT_MAX) &
        (df['longitude'] >= PHI_LON_MIN) & (df['longitude'] <= PHI_LON_MAX)
    ]
    
    print(f"Records with invalid coordinates removed: {invalid_before:,}")
    print(f"Coordinates range:")
    print(f"  Latitude: {df['latitude'].min():.4f} to {df['latitude'].max():.4f}")
    print(f"  Longitude: {df['longitude'].min():.4f} to {df['longitude'].max():.4f}")
    print(f"\nRecords after geo validation: {len(df):,}")

Validating geographic coordinates...



## Cell 7: Standardize Categories

In [8]:
print("Standardizing categorical fields...\n")

# Standardize text fields (strip whitespace, title case)
# Handle PyArrow-backed strings by converting to regular strings first
text_cols = df.select_dtypes(include=['object']).columns

for col in text_cols:
    if col not in ['date', 'time_24hr']:  # Skip date/time
        try:
            # Convert to string if needed, apply string operations, then convert back to category if it was categorical
            if hasattr(df[col], 'dtype') and df[col].dtype.name.startswith('string'):
                df[col] = df[col].astype(str).str.strip().str.title()
            elif hasattr(df[col], 'dtype') and 'string' in str(df[col].dtype):
                df[col] = df[col].astype(str).str.strip().str.title()
        except:
            pass  # Skip columns that can't be standardized
        
print("✓ Text fields standardized")
print(f"\nRecords after cleaning: {len(df):,}")

Standardizing categorical fields...

✓ Text fields standardized

Records after cleaning: 3,496,353


## Cell 8: Save Cleaned Data

In [9]:
# Save cleaned data
try:
    processed_path = get_processed_data_path()
    cleaned_file = processed_path / "crime_incidents_cleaned.parquet"
    
    df.to_parquet(cleaned_file, engine='pyarrow', compression='snappy')
    
    file_size_mb = cleaned_file.stat().st_size / 1024**2
    print(f"✓ Cleaned data saved to {cleaned_file}")
    print(f"  File size: {file_size_mb:.2f} MB")
    print(f"  Records: {len(df):,}")
except Exception as e:
    print(f"✗ Error saving cleaned data: {e}")

✓ Cleaned data saved to /Users/dustinober/Projects/Crime Incidents Philadelphia/data/processed/crime_incidents_cleaned.parquet
  File size: 269.44 MB
  Records: 3,496,353


## Summary

✓ **Data cleaning complete!**

### Changes Made
- Removed rows with missing critical fields
- Removed duplicate records
- Standardized date/time formats
- Validated and cleaned geographic coordinates
- Standardized categorical text fields

### Output
- Cleaned data: `data/processed/crime_incidents_cleaned.parquet`

### Next Steps
- Proceed to **02_feature_engineering.ipynb** to create analytical features