# Heat-Crime Hypothesis Analysis (HYP-HEAT)

**Objective:** Investigate the statistical relationship between temperature and crime patterns in Philadelphia.

**Research Question:** Is there a statistically significant relationship between daily temperature (heat) and daily crime counts, particularly for violent crimes?

**Data Sources:**
- Crime incidents: `data/crime_incidents_combined.parquet` (2006-2026)
- Weather data: `data/external/weather_philly_2006_2026.parquet` (2006-2026)

**Methodology:**
1. Data merging with temporal alignment (daily aggregation)
2. Correlation analysis (Pearson's r, Spearman's rho, Kendall's tau)
3. Hypothesis testing for statistical significance of each correlation
4. Effect size calculation and interpretation
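
As a rough preview of steps 2 to 4, the sketch below computes the three correlation coefficients with their two-sided p-values using `scipy.stats`. The data are synthetic and the helper name `correlation_summary` is hypothetical; nothing here reflects results from the Philadelphia datasets.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic daily data (illustrative only, NOT the Philadelphia datasets)
rng = np.random.default_rng(42)
temp = rng.normal(loc=15.0, scale=10.0, size=365)             # hypothetical daily mean temperature
crimes = 100 + 1.5 * temp + rng.normal(scale=10.0, size=365)  # hypothetical daily crime counts


def correlation_summary(x, y):
    """Return Pearson, Spearman and Kendall coefficients with two-sided p-values."""
    rows = []
    for name, func in [("pearson", stats.pearsonr),
                       ("spearman", stats.spearmanr),
                       ("kendall", stats.kendalltau)]:
        statistic, p_value = func(x, y)
        rows.append({"method": name, "coefficient": statistic, "p_value": p_value})
    return pd.DataFrame(rows).set_index("method")


summary = correlation_summary(temp, crimes)

# One common effect-size reading: squared Pearson r as the share of linear variance explained
pearson_r2 = summary.loc["pearson", "coefficient"] ** 2

print(summary)
print(f"Pearson r^2: {pearson_r2:.3f}")
```

Reporting all three coefficients side by side makes it easier to see whether an association is linear (Pearson) or merely monotonic (Spearman, Kendall).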

In [None]:
# Reproducibility
import sys
from pathlib import Path

# Ensure we can import from analysis module
repo_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(repo_root))

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('default')
sns.set_palette('husl')

# Create reports directory
REPORTS_DIR = repo_root / 'reports'
REPORTS_DIR.mkdir(exist_ok=True)

print(f"Repository root: {repo_root}")
print(f"Reports directory: {REPORTS_DIR}")
print(f"Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 1. Data Loading and Exploration

### 1.1 Load Crime Data

In [None]:
# Load crime data
crime_path = repo_root / 'data' / 'crime_incidents_combined.parquet'
crime_df = pd.read_parquet(crime_path)

print(f"Crime data shape: {crime_df.shape}")
print(f"\nColumns: {crime_df.columns.tolist()}")
print(f"\nDate range: {crime_df['dispatch_date'].min()} to {crime_df['dispatch_date'].max()}")
print("\nFirst few rows:")
crime_df.head()

In [None]:
# Check crime categories
print("Top 15 crime types:")
print(crime_df['text_general_code'].value_counts().head(15))

### 1.2 Load Weather Data

In [None]:
# Load weather data
weather_path = repo_root / 'data' / 'external' / 'weather_philly_2006_2026.parquet'
weather_df = pd.read_parquet(weather_path)

print(f"Weather data shape: {weather_df.shape}")
print(f"\nColumns: {weather_df.columns.tolist()}")
print(f"\nDate range: {weather_df.index.min()} to {weather_df.index.max()}")
print("\nFirst few rows:")
weather_df.head()

In [None]:
# Weather data summary statistics
print("Weather data summary:")
weather_df.describe()

## 2. Data Merging Strategy

### Join Strategy Documentation

**Temporal Alignment:**
- Weather data: Daily observations (one record per day)
- Crime data: Individual incidents with dispatch_date field
- **Strategy:** Aggregate crime data to daily counts, then join on date

**Spatial Considerations:**
- Weather data: Single station representing Philadelphia metropolitan area
- Crime data: Individual incidents across all police districts
- **Strategy:** Use city-wide weather data for all crimes (assumes temperature is relatively uniform across the city)
- **Limitation:** Does not account for micro-climate variations or heat island effects in specific neighborhoods

**Crime Classification:**
- Create categories: Violent crimes, Property crimes, Other crimes
- Based on UCR general codes (as established in Phase 1)
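
The temporal alignment strategy above (aggregate incidents to daily counts, then join on date) can be sketched on a toy example. All values and the `toy_*` names are made up for illustration. The `validate="one_to_one"` argument is an optional safeguard that the cells below do not use.

```python
import pandas as pd

# Hypothetical incident-level records: one row per incident
toy_incidents = pd.DataFrame({
    "dispatch_date": ["2020-07-01", "2020-07-01", "2020-07-02",
                      "2020-07-03", "2020-07-03", "2020-07-03"],
})

# Hypothetical daily weather: one row per day
toy_weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-03"]),
    "temp": [27.5, 30.1, 31.4],
})

# Step 1: collapse incidents to one row per day
toy_incidents["date"] = pd.to_datetime(toy_incidents["dispatch_date"])
toy_daily_counts = toy_incidents.groupby("date").size().reset_index(name="total_crimes")

# Step 2: join on date; validate raises MergeError if either side repeats a date
toy_merged = toy_daily_counts.merge(
    toy_weather, on="date", how="inner", validate="one_to_one"
)
print(toy_merged)
```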

### 2.1 Define Crime Categories

In [None]:
# Crime category mapping based on UCR general codes (hundred-bands 1-7)
# From analysis/config.py established in Phase 1

CRIME_CATEGORY_MAP = {
    1: 'Violent',      # Homicide
    2: 'Violent',      # Rape
    3: 'Violent',      # Robbery
    4: 'Violent',      # Aggravated Assault
    5: 'Property',     # Burglary
    6: 'Property',     # Theft
    7: 'Property',     # Motor Vehicle Theft
}

def categorize_crime(ucr_code):
    """Categorize crime based on UCR general code hundred-band."""
    hundred_band = int(ucr_code // 100) if pd.notna(ucr_code) else 0
    return CRIME_CATEGORY_MAP.get(hundred_band, 'Other')

# Apply categorization
crime_df['crime_category'] = crime_df['ucr_general'].apply(categorize_crime)

print("Crime category distribution:")
print(crime_df['crime_category'].value_counts())
print("\nPercentages:")
print(crime_df['crime_category'].value_counts(normalize=True) * 100)
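
On a dataset with millions of incidents, a row-wise `.apply` can be slow. A vectorized variant of the same hundred-band mapping might look like the sketch below. The names `CATEGORY_BY_BAND` and `categorize_crime_vectorized` are hypothetical, and the sketch assumes the UCR codes are numeric or can be coerced to numeric.

```python
import numpy as np
import pandas as pd

# Same hundred-band mapping as above, repeated so the sketch is self-contained
CATEGORY_BY_BAND = {
    1: "Violent", 2: "Violent", 3: "Violent", 4: "Violent",
    5: "Property", 6: "Property", 7: "Property",
}


def categorize_crime_vectorized(ucr_codes: pd.Series) -> pd.Series:
    """Map UCR general codes to categories without a per-row Python call."""
    # Non-numeric or missing codes become NaN and end up in 'Other'
    bands = pd.to_numeric(ucr_codes, errors="coerce") // 100
    return bands.map(CATEGORY_BY_BAND).fillna("Other")


sample_codes = pd.Series([100, 300, 600, 800, np.nan, 2600])
sample_categories = categorize_crime_vectorized(sample_codes)
print(sample_categories.tolist())
```

Missing codes and bands outside 1 to 7 fall through to `'Other'`, matching the behavior of `categorize_crime`.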

### 2.2 Aggregate Crime Data to Daily Counts

In [None]:
# Convert dispatch_date to datetime for proper aggregation
crime_df['date'] = pd.to_datetime(crime_df['dispatch_date'])

# Aggregate total crimes per day
daily_crime = crime_df.groupby('date').size().reset_index(name='total_crimes')

# Aggregate by crime category
daily_crime_by_category = crime_df.groupby(['date', 'crime_category']).size().unstack(fill_value=0)
daily_crime_by_category = daily_crime_by_category.reset_index()

# Merge total with categories
daily_crime_merged = daily_crime.merge(daily_crime_by_category, on='date', how='left')

print(f"Daily crime data shape: {daily_crime_merged.shape}")
print(f"\nDate range: {daily_crime_merged['date'].min()} to {daily_crime_merged['date'].max()}")
print("\nFirst few rows:")
daily_crime_merged.head()
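
Note that `groupby('date').size()` only produces rows for dates that have at least one incident. If zero-incident days should count as zeros rather than be absent, one option is to reindex onto a full calendar, as in the toy sketch below. Whether an absent date really means zero crimes, rather than a reporting gap, is an assumption that would need checking against the source data.

```python
import pandas as pd

# Hypothetical daily counts with 2020-01-03 absent (no incidents recorded that day)
toy_daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
    "total_crimes": [5, 3, 7],
})

# Build a gap-free calendar and fill absent days with zero counts
full_calendar = pd.date_range(toy_daily["date"].min(), toy_daily["date"].max(), freq="D")
toy_daily_filled = (
    toy_daily.set_index("date")
    .reindex(full_calendar, fill_value=0)
    .rename_axis("date")
    .reset_index()
)
print(toy_daily_filled)
```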

### 2.3 Merge Weather and Crime Data

In [None]:
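# Prepare weather data for merge
# reset_index() moves the datetime index into a column; the code below expects it to be named 'time'
weather_df_reset = weather_df.reset_index()
# Truncate timestamps to calendar dates, then convert back to datetime64
# so the join key has the same dtype as daily_crime_merged['date']
weather_df_reset['date'] = pd.to_datetime(weather_df_reset['time']).dt.date
weather_df_reset['date'] = pd.to_datetime(weather_df_reset['date'])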

# Select relevant weather columns
weather_cols = ['date', 'temp', 'tmin', 'tmax', 'rhum', 'prcp', 'wspd']
weather_for_merge = weather_df_reset[weather_cols]

print(f"Weather data for merge shape: {weather_for_merge.shape}")
print("\nFirst few rows:")
print(weather_for_merge.head())

In [None]:
# Merge crime and weather data on date
merged_df = daily_crime_merged.merge(weather_for_merge, on='date', how='inner')

print(f"\nMerged dataset shape: {merged_df.shape}")
print(f"Date range: {merged_df['date'].min()} to {merged_df['date'].max()}")
print(f"\nNumber of days: {len(merged_df)}")
print(f"\nMerged data columns: {merged_df.columns.tolist()}")
print("\nFirst few rows:")
merged_df.head(10)
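
An inner join silently discards any date present in only one source. One way to audit what was dropped is an outer merge with `indicator=True`, sketched here on toy frames. The values and the `toy_*` and `audit` names are illustrative only.

```python
import pandas as pd

# Hypothetical daily crime counts and weather with partially overlapping dates
toy_crime = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "total_crimes": [5, 3, 7],
})
toy_weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-04"]),
    "temp": [1.2, 0.4, -2.0],
})

# The '_merge' column reports whether each date came from the left, right, or both frames
audit = toy_crime.merge(toy_weather, on="date", how="outer", indicator=True)
dropped_by_inner = audit[audit["_merge"] != "both"]

print(audit["_merge"].value_counts())
print(dropped_by_inner[["date", "_merge"]])
```

Rows tagged `left_only` are crime days without weather, and `right_only` rows are weather days without recorded crime.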

In [None]:
# Check for missing values
print("Missing values in merged dataset:")
print(merged_df.isnull().sum())

# Summary statistics
print("\nSummary statistics of merged dataset:")
merged_df.describe()

### 2.4 Data Quality Checks

In [None]:
# Check data completeness against a gap-free daily calendar
date_range = pd.date_range(start=merged_df['date'].min(), end=merged_df['date'].max(), freq='D')
expected_days = len(date_range)
# Count unique dates so duplicated rows cannot mask a gap
actual_days = merged_df['date'].nunique()
duplicate_rows = int(merged_df['date'].duplicated().sum())

print(f"Expected days in range: {expected_days}")
print(f"Actual days in merged data: {actual_days}")
print(f"Rows with a repeated date: {duplicate_rows}")
print(f"Coverage: {actual_days / expected_days * 100:.2f}%")

# Check for any gaps (calendar dates with no merged record)
missing_dates = date_range.difference(pd.DatetimeIndex(merged_df['date']))
if len(missing_dates) > 0:
    print(f"\nNumber of missing dates: {len(missing_dates)}")
    if len(missing_dates) <= 10:
        print(f"Missing dates: {[d.date() for d in missing_dates]}")
else:
    print("\nNo missing dates - complete daily coverage.")

## Summary of Merge Strategy

**Approach:**
1. **Temporal alignment:** Aggregated crime incidents to daily counts to match weather data granularity
2. **Spatial approach:** Used city-wide weather station data for all crimes (single station)
3. **Join method:** Inner join on date to ensure both datasets have matching records
4. **Crime categorization:** Classified crimes into Violent, Property, and Other based on UCR codes

**Limitations:**
- Single weather station may not capture micro-climate variations across neighborhoods
- Heat island effects in urban cores vs. suburbs not considered
- Daily aggregation loses intra-day temperature variations
- Weather data represents average conditions, not peak exposure times

**Dataset Ready for Analysis:**
- Merged dataset includes daily crime counts by category and weather measurements
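- Temporal coverage spans 2006 to 2026; because the inner join drops any day missing from either source, completeness depends on the result of the check in Section 2.4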
- Ready for correlation analysis and hypothesis testing