# Heat-Crime Hypothesis Analysis (HYP-HEAT)

**Objective:** Investigate the statistical relationship between temperature and crime patterns in Philadelphia.

**Research Question:** Is there a significant relationship between temperature (heat) and crime rates, particularly for violent crimes?

**Data Sources:**
- Crime incidents: `data/crime_incidents_combined.parquet` (2006-2026)
- Weather data: `data/external/weather_philly_2006_2026.parquet` (2006-2026)

**Methodology:**
1. Data merging with temporal alignment (daily aggregation)
2. Correlation analysis (Pearson, Spearman, Kendall tau)
3. Hypothesis testing with statistical significance
4. Effect size calculation and interpretation

In [None]:
# Reproducibility
import sys
from pathlib import Path

# Ensure we can import from analysis module
repo_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(repo_root))

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('default')
sns.set_palette('husl')

# Create reports directory
REPORTS_DIR = repo_root / 'reports'
REPORTS_DIR.mkdir(exist_ok=True)

print(f"Repository root: {repo_root}")
print(f"Reports directory: {REPORTS_DIR}")
print(f"Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 1. Data Loading and Exploration

### 1.1 Load Crime Data

In [None]:
# Load crime data
crime_path = repo_root / 'data' / 'crime_incidents_combined.parquet'
crime_df = pd.read_parquet(crime_path)

print(f"Crime data shape: {crime_df.shape}")
print(f"\nColumns: {crime_df.columns.tolist()}")
print(f"\nDate range: {crime_df['dispatch_date'].min()} to {crime_df['dispatch_date'].max()}")
print(f"\nFirst few rows:")
crime_df.head()

In [None]:
# Check crime categories
print("Top 15 crime types:")
print(crime_df['text_general_code'].value_counts().head(15))

### 1.2 Load Weather Data

In [None]:
# Load weather data
weather_path = repo_root / 'data' / 'external' / 'weather_philly_2006_2026.parquet'
weather_df = pd.read_parquet(weather_path)

print(f"Weather data shape: {weather_df.shape}")
print(f"\nColumns: {weather_df.columns.tolist()}")
print(f"\nDate range: {weather_df.index.min()} to {weather_df.index.max()}")
print(f"\nFirst few rows:")
weather_df.head()

In [None]:
# Weather data summary statistics
print("Weather data summary:")
weather_df.describe()

## 2. Data Merging Strategy

### Join Strategy Documentation

**Temporal Alignment:**
- Weather data: Daily observations (one record per day)
- Crime data: Individual incidents with dispatch_date field
- **Strategy:** Aggregate crime data to daily counts, then join on date
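
A minimal sketch of this strategy on toy data, assuming incident-level records with a `dispatch_date` column and a daily weather table with `date` and `temp` columns. All values below are invented for illustration. One detail worth keeping in mind: `groupby(...).size()` produces no row for a day without incidents, so the sketch reindexes to the full calendar before joining.

```python
import pandas as pd

# Toy incident-level data (hypothetical values): no incidents on 2020-01-02
incidents = pd.DataFrame({
    "dispatch_date": ["2020-01-01", "2020-01-01", "2020-01-03"],
})
# Toy daily weather, one row per day
weather = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=3, freq="D"),
    "temp": [1.5, -0.5, 3.0],
})

incidents["date"] = pd.to_datetime(incidents["dispatch_date"])

# Count incidents per day; days without incidents are absent at this point
daily_counts = incidents.groupby("date").size()

# Reindex to the full calendar so zero-incident days become explicit zeros
full_days = pd.date_range(daily_counts.index.min(), daily_counts.index.max(), freq="D")
daily_counts = daily_counts.reindex(full_days, fill_value=0)
daily = daily_counts.rename_axis("date").reset_index(name="total_crimes")

# Join on date so each day carries both a count and a temperature
toy_merged = daily.merge(weather, on="date", how="inner")
print(toy_merged)
```

For a city-wide series, zero-incident days are probably rare, but the reindex step makes that assumption explicit rather than silent.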

**Spatial Considerations:**
- Weather data: Single station representing Philadelphia metropolitan area
- Crime data: Individual incidents across all police districts
- **Strategy:** Use city-wide weather data for all crimes (assumes temperature is relatively uniform across the city)
- **Limitation:** Does not account for micro-climate variations or heat island effects in specific neighborhoods

**Crime Classification:**
- Create categories: Violent crimes, Property crimes, Other crimes
- Based on UCR general codes (as established in Phase 1)
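
As a sketch of the hundred-band idea, assuming UCR general codes are stored as numbers such as 100, 300, or 600: integer-dividing by 100 gives the band, which is then looked up in a mapping. The mapping and sample codes below are illustrative assumptions, and the vectorized form is one possible alternative to a row-wise function.

```python
import pandas as pd

# Illustrative band-to-category mapping (assumed for this sketch)
band_to_category = {1: "Violent", 2: "Violent", 3: "Violent", 4: "Violent",
                    5: "Property", 6: "Property", 7: "Property"}

# Hypothetical UCR general codes, including an unmapped band and a missing value
ucr_codes = pd.Series([100, 300, 600, 1400, None], dtype="float64")

# Treat missing codes as band 0, then integer-divide by 100 to get the band
bands = (ucr_codes.fillna(0) // 100).astype(int)

# Look up each band; bands absent from the mapping fall through to 'Other'
categories = bands.map(band_to_category).fillna("Other")
print(categories.tolist())
# → ['Violent', 'Violent', 'Property', 'Other', 'Other']
```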

### 2.1 Define Crime Categories

In [None]:
# Crime category mapping based on UCR general codes (hundred-bands 1-7)
# From analysis/config.py established in Phase 1

CRIME_CATEGORY_MAP = {
    1: 'Violent',      # Homicide
    2: 'Violent',      # Rape
    3: 'Violent',      # Robbery
    4: 'Violent',      # Aggravated Assault
    5: 'Property',     # Burglary
    6: 'Property',     # Theft
    7: 'Property',     # Motor Vehicle Theft
}

def categorize_crime(ucr_code):
    """Categorize crime based on UCR general code hundred-band."""
    hundred_band = int(ucr_code // 100) if pd.notna(ucr_code) else 0
    return CRIME_CATEGORY_MAP.get(hundred_band, 'Other')

# Apply categorization
crime_df['crime_category'] = crime_df['ucr_general'].apply(categorize_crime)

print("Crime category distribution:")
print(crime_df['crime_category'].value_counts())
print(f"\nPercentages:")
print(crime_df['crime_category'].value_counts(normalize=True) * 100)

### 2.2 Aggregate Crime Data to Daily Counts

In [None]:
# Convert dispatch_date to datetime for proper aggregation
crime_df['date'] = pd.to_datetime(crime_df['dispatch_date'])

# Aggregate total crimes per day
daily_crime = crime_df.groupby('date').size().reset_index(name='total_crimes')

# Aggregate by crime category
daily_crime_by_category = crime_df.groupby(['date', 'crime_category']).size().unstack(fill_value=0)
daily_crime_by_category = daily_crime_by_category.reset_index()

# Merge total with categories
daily_crime_merged = daily_crime.merge(daily_crime_by_category, on='date', how='left')

print(f"Daily crime data shape: {daily_crime_merged.shape}")
print(f"\nDate range: {daily_crime_merged['date'].min()} to {daily_crime_merged['date'].max()}")
print(f"\nFirst few rows:")
daily_crime_merged.head()

### 2.3 Merge Weather and Crime Data

In [None]:
# Prepare weather data for merge
weather_df_reset = weather_df.reset_index()
weather_df_reset['date'] = pd.to_datetime(weather_df_reset['time']).dt.date
weather_df_reset['date'] = pd.to_datetime(weather_df_reset['date'])

# Select relevant weather columns
weather_cols = ['date', 'temp', 'tmin', 'tmax', 'rhum', 'prcp', 'wspd']
weather_for_merge = weather_df_reset[weather_cols]

print(f"Weather data for merge shape: {weather_for_merge.shape}")
print(f"\nFirst few rows:")
print(weather_for_merge.head())

In [None]:
# Merge crime and weather data on date
merged_df = daily_crime_merged.merge(weather_for_merge, on='date', how='inner')

print(f"\nMerged dataset shape: {merged_df.shape}")
print(f"Date range: {merged_df['date'].min()} to {merged_df['date'].max()}")
print(f"\nNumber of days: {len(merged_df)}")
print(f"\nMerged data columns: {merged_df.columns.tolist()}")
print(f"\nFirst few rows:")
merged_df.head(10)

In [None]:
# Check for missing values
print("Missing values in merged dataset:")
print(merged_df.isnull().sum())

# Summary statistics
print("\nSummary statistics of merged dataset:")
merged_df.describe()

### 2.4 Data Quality Checks

In [None]:
# Check data completeness
date_range = pd.date_range(start=merged_df['date'].min(), end=merged_df['date'].max(), freq='D')
expected_days = len(date_range)
actual_days = len(merged_df)

print(f"Expected days in range: {expected_days}")
print(f"Actual days in merged data: {actual_days}")
print(f"Coverage: {actual_days / expected_days * 100:.2f}%")

# Check for any gaps
if actual_days < expected_days:
    missing_dates = set(date_range) - set(merged_df['date'])
    print(f"\nNumber of missing dates: {len(missing_dates)}")
    if len(missing_dates) <= 10:
        print(f"Missing dates: {sorted(missing_dates)}")
else:
    print("\nNo missing dates - complete daily coverage.")

## Summary of Merge Strategy

**Approach:**
1. **Temporal alignment:** Aggregated crime incidents to daily counts to match weather data granularity
2. **Spatial approach:** Used city-wide weather station data for all crimes (single station)
3. **Join method:** Inner join on date to ensure both datasets have matching records
4. **Crime categorization:** Classified crimes into Violent, Property, and Other based on UCR codes
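
An inner join silently drops dates present in only one table. A small audit, sketched here on hypothetical toy frames, uses the `indicator` option of `DataFrame.merge` with an outer join to show which dates would be lost and from which side.

```python
import pandas as pd

# Toy daily tables (hypothetical values) with partially overlapping dates
crime_daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "total_crimes": [10, 12, 9],
})
weather_daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-04"]),
    "temp": [0.5, 2.0, 4.5],
})

# Outer join with indicator=True adds a '_merge' column naming each row's source
audit = crime_daily.merge(weather_daily, on="date", how="outer", indicator=True)

# Rows not marked 'both' are the ones an inner join would discard
dropped_by_inner = audit[audit["_merge"] != "both"]
print(audit["_merge"].value_counts())
print(dropped_by_inner[["date", "_merge"]])
```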

**Limitations:**
- Single weather station may not capture micro-climate variations across neighborhoods
- Heat island effects in urban cores vs. suburbs not considered
- Daily aggregation loses intra-day temperature variations
- Weather data represents average conditions, not peak exposure times

**Dataset Ready for Analysis:**
- Merged dataset includes daily crime counts by category and weather measurements
- Temporal coverage spans 2006 to 2026; completeness is quantified by the coverage check in Section 2.4
- Ready for correlation analysis and hypothesis testing

## 3. Correlation Analysis

### 3.1 Correlation Helper Function

Following Pattern 3 from the research, we'll use multiple correlation methods to assess the relationship between temperature and crime rates.
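
To illustrate why more than one method is useful, the sketch below applies all three to a synthetic, strictly increasing but non-linear relationship. The data are invented for illustration: the rank-based measures report a perfect monotonic association, while Pearson's r comes out lower because the relationship is not a straight line.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly increasing but non-linear relationship (no noise)
x = np.linspace(0, 5, 50)
y = np.exp(x)

# Each call returns a result that unpacks to (statistic, p-value)
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print(f"Pearson r    = {pearson_r:.3f}")    # below 1: the curve is not linear
print(f"Spearman rho = {spearman_rho:.3f}") # 1.000: ranks agree perfectly
print(f"Kendall tau  = {kendall_tau:.3f}")  # 1.000: every pair is concordant
```

A large gap between Pearson and Spearman on real data would hint at non-linearity or influential outliers.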

In [None]:
def analyze_correlation(x, y, x_label='X', y_label='Y'):
    """
    Comprehensive correlation analysis with multiple tests.
    Based on Pattern 3 from research documentation.
    
    Parameters:
    -----------
    x : array-like
        First variable (e.g., temperature)
    y : array-like
        Second variable (e.g., crime count)
    x_label : str
        Label for x variable
    y_label : str
        Label for y variable
    
    Returns:
    --------
    dict : Correlation results with multiple methods and significance tests
    """
    # Remove NaN values
    mask = ~(pd.isna(x) | pd.isna(y))
    x_clean = np.array(x)[mask]
    y_clean = np.array(y)[mask]
    
    # Pearson correlation (linear relationship)
    pearson_r, pearson_p = stats.pearsonr(x_clean, y_clean)
    
    # Spearman rank correlation (monotonic relationship, robust to outliers)
    spearman_r, spearman_p = stats.spearmanr(x_clean, y_clean)
    
    # Kendall tau (robust to outliers, good for small samples)
    kendall_tau, kendall_p = stats.kendalltau(x_clean, y_clean)
    
    # Linear regression for effect size
    slope, intercept, r_value, p_value, std_err = stats.linregress(x_clean, y_clean)
    
    # Effect size interpretation (based on absolute correlation)
    abs_corr = abs(pearson_r)
    if abs_corr < 0.1:
        strength = 'negligible'
    elif abs_corr < 0.3:
        strength = 'small'
    elif abs_corr < 0.5:
        strength = 'medium'
    else:
        strength = 'large'
    
    results = {
        'x_label': x_label,
        'y_label': y_label,
        'n_observations': len(x_clean),
        'pearson_r': pearson_r,
        'pearson_p': pearson_p,
        'pearson_significant': pearson_p < 0.05,
        'spearman_r': spearman_r,
        'spearman_p': spearman_p,
        'spearman_significant': spearman_p < 0.05,
        'kendall_tau': kendall_tau,
        'kendall_p': kendall_p,
        'kendall_significant': kendall_p < 0.05,
        'regression_slope': slope,
        'regression_intercept': intercept,
        'regression_r_squared': r_value**2,
        'regression_p_value': p_value,
        'effect_size_strength': strength
    }
    
    return results

def print_correlation_results(results):
    """Print correlation results in a readable format."""
    print(f"\n{'='*70}")
    print(f"Correlation Analysis: {results['x_label']} vs {results['y_label']}")
    print(f"{'='*70}")
    print(f"Sample size: {results['n_observations']:,} observations")
    print(f"\nPearson correlation (linear):")
    print(f"  r = {results['pearson_r']:.4f}, p = {results['pearson_p']:.4e}")
    print(f"  Significant: {'YES' if results['pearson_significant'] else 'NO'}")
    print(f"\nSpearman correlation (monotonic):")
    print(f"  ρ = {results['spearman_r']:.4f}, p = {results['spearman_p']:.4e}")
    print(f"  Significant: {'YES' if results['spearman_significant'] else 'NO'}")
    print(f"\nKendall tau (robust):")
    print(f"  τ = {results['kendall_tau']:.4f}, p = {results['kendall_p']:.4e}")
    print(f"  Significant: {'YES' if results['kendall_significant'] else 'NO'}")
    print(f"\nLinear regression:")
    print(f"  R² = {results['regression_r_squared']:.4f}")
    print(f"  Slope = {results['regression_slope']:.4f}")
    print(f"  p-value = {results['regression_p_value']:.4e}")
    print(f"\nEffect size: {results['effect_size_strength'].upper()}")
    print(f"{'='*70}\n")

### 3.2 Temperature vs. Total Crime

In [None]:
# Correlation: Temperature vs. Total Crime
total_crime_corr = analyze_correlation(
    merged_df['temp'], 
    merged_df['total_crimes'],
    x_label='Temperature (°C)',
    y_label='Total Daily Crimes'
)

print_correlation_results(total_crime_corr)

### 3.3 Temperature vs. Violent Crime

The heat-crime hypothesis specifically predicts a stronger relationship for violent crimes.
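
Judging whether one correlation is "stronger" than another is easier with interval estimates than with point estimates alone. The sketch below computes an approximate confidence interval for a Pearson r using the Fisher z-transformation. The function name `fisher_ci` and the example inputs are hypothetical, not part of this analysis. The method assumes independent observations, which daily crime counts (autocorrelated and seasonal) likely violate, so real intervals are probably wider than this approximation suggests.

```python
import numpy as np
from scipy import stats

def fisher_ci(r, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for a Pearson r."""
    z = np.arctanh(r)                       # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)               # standard error on the z scale
    z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided normal critical value
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

# Hypothetical inputs for illustration only (not results from this analysis)
lo, hi = fisher_ci(r=0.30, n=1000)
print(f"95% CI for r = 0.30 with n = 1000: [{lo:.3f}, {hi:.3f}]")
```

Because the violent and property correlations share the same temperature series, a formal test of their difference would need a method for dependent correlations; the intervals here are only a rough guide.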

In [None]:
# Correlation: Temperature vs. Violent Crime
violent_crime_corr = analyze_correlation(
    merged_df['temp'], 
    merged_df['Violent'],
    x_label='Temperature (°C)',
    y_label='Daily Violent Crimes'
)

print_correlation_results(violent_crime_corr)

### 3.4 Temperature vs. Property Crime

In [None]:
# Correlation: Temperature vs. Property Crime
property_crime_corr = analyze_correlation(
    merged_df['temp'], 
    merged_df['Property'],
    x_label='Temperature (°C)',
    y_label='Daily Property Crimes'
)

print_correlation_results(property_crime_corr)

### 3.5 Summary Table of Correlations

In [None]:
# Create summary table
correlation_summary = pd.DataFrame([
    {
        'Crime Type': 'Total Crime',
        'Pearson r': f"{total_crime_corr['pearson_r']:.4f}",
        'p-value': f"{total_crime_corr['pearson_p']:.2e}",
        'Spearman ρ': f"{total_crime_corr['spearman_r']:.4f}",
        'R²': f"{total_crime_corr['regression_r_squared']:.4f}",
        'Effect Size': total_crime_corr['effect_size_strength'].title()
    },
    {
        'Crime Type': 'Violent Crime',
        'Pearson r': f"{violent_crime_corr['pearson_r']:.4f}",
        'p-value': f"{violent_crime_corr['pearson_p']:.2e}",
        'Spearman ρ': f"{violent_crime_corr['spearman_r']:.4f}",
        'R²': f"{violent_crime_corr['regression_r_squared']:.4f}",
        'Effect Size': violent_crime_corr['effect_size_strength'].title()
    },
    {
        'Crime Type': 'Property Crime',
        'Pearson r': f"{property_crime_corr['pearson_r']:.4f}",
        'p-value': f"{property_crime_corr['pearson_p']:.2e}",
        'Spearman ρ': f"{property_crime_corr['spearman_r']:.4f}",
        'R²': f"{property_crime_corr['regression_r_squared']:.4f}",
        'Effect Size': property_crime_corr['effect_size_strength'].title()
    }
])

print("\nSummary of Temperature-Crime Correlations:")
print(correlation_summary.to_string(index=False))

# Display the table (it remains available as `correlation_summary` for later export)
correlation_summary

### 3.6 Visualization: Scatter Plots with Trend Lines

In [None]:
# Create scatter plots for each crime type
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Total Crime
axes[0].scatter(merged_df['temp'], merged_df['total_crimes'], alpha=0.3, s=10)
fit_df = merged_df[['temp', 'total_crimes']].dropna()  # drop rows jointly so x and y stay aligned
z = np.polyfit(fit_df['temp'], fit_df['total_crimes'], 1)
p = np.poly1d(z)
temp_sorted = fit_df['temp'].sort_values()
axes[0].plot(temp_sorted, p(temp_sorted),
             'r-', linewidth=2, label=f'Trend (r={total_crime_corr["pearson_r"]:.3f})')
axes[0].set_xlabel('Temperature (°C)', fontsize=12)
axes[0].set_ylabel('Total Daily Crimes', fontsize=12)
axes[0].set_title('Temperature vs. Total Crime', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Violent Crime
axes[1].scatter(merged_df['temp'], merged_df['Violent'], alpha=0.3, s=10, color='red')
fit_df = merged_df[['temp', 'Violent']].dropna()  # drop rows jointly so x and y stay aligned
z = np.polyfit(fit_df['temp'], fit_df['Violent'], 1)
p = np.poly1d(z)
temp_sorted = fit_df['temp'].sort_values()
axes[1].plot(temp_sorted, p(temp_sorted),
             'darkred', linewidth=2, label=f'Trend (r={violent_crime_corr["pearson_r"]:.3f})')
axes[1].set_xlabel('Temperature (°C)', fontsize=12)
axes[1].set_ylabel('Daily Violent Crimes', fontsize=12)
axes[1].set_title('Temperature vs. Violent Crime', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Property Crime
axes[2].scatter(merged_df['temp'], merged_df['Property'], alpha=0.3, s=10, color='green')
fit_df = merged_df[['temp', 'Property']].dropna()  # drop rows jointly so x and y stay aligned
z = np.polyfit(fit_df['temp'], fit_df['Property'], 1)
p = np.poly1d(z)
temp_sorted = fit_df['temp'].sort_values()
axes[2].plot(temp_sorted, p(temp_sorted),
             'darkgreen', linewidth=2, label=f'Trend (r={property_crime_corr["pearson_r"]:.3f})')
axes[2].set_xlabel('Temperature (°C)', fontsize=12)
axes[2].set_ylabel('Daily Property Crimes', fontsize=12)
axes[2].set_title('Temperature vs. Property Crime', fontsize=14, fontweight='bold')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(REPORTS_DIR / 'heat_crime_scatterplots.png', dpi=300, bbox_inches='tight')
print(f"Saved: {REPORTS_DIR / 'heat_crime_scatterplots.png'}")
plt.show()

### 3.7 Temperature Bins Analysis

Examine mean daily crime counts across temperature ranges to identify potential thresholds.
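
When binning with `pd.cut`, the choice of edges affects which observations are kept and where boundary values land. The sketch below, on invented temperatures, contrasts finite edges with the default `right=True` against open-ended edges with `right=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures, including boundary and out-of-range values
temps = pd.Series([-25.0, 0.0, 10.0, 35.0, 41.0])

# Finite edges: values outside (-20, 40] get no bin (NaN), and with the
# default right=True a value of exactly 0.0 lands in the (-20, 0] bin
closed = pd.cut(temps, bins=[-20, 0, 10, 20, 30, 40])

# Open-ended edges with right=False: intervals are [low, high), nothing is dropped
open_ended = pd.cut(temps, bins=[-np.inf, 0, 10, 20, 30, np.inf], right=False)

print(pd.DataFrame({"temp": temps, "closed": closed, "open_ended": open_ended}))
```

With finite edges, the two out-of-range values are silently excluded from any grouped summary, which matters most for the extreme days a threshold analysis cares about.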

In [None]:
# Create temperature bins
merged_df['temp_bin'] = pd.cut(merged_df['temp'],
                                bins=[-np.inf, 0, 10, 20, 30, np.inf],
                                right=False,  # left-closed bins so 0°C falls in '0-10°C', matching the labels
                                labels=['<0°C', '0-10°C', '10-20°C', '20-30°C', '≥30°C'])

# Calculate mean crime rates by temperature bin
temp_bin_summary = merged_df.groupby('temp_bin', observed=True).agg({
    'total_crimes': ['mean', 'std', 'count'],
    'Violent': ['mean', 'std'],
    'Property': ['mean', 'std']
}).round(2)

print("\nCrime Rates by Temperature Range:")
print(temp_bin_summary)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot bars for each crime type
temp_bins = merged_df.groupby('temp_bin', observed=True).agg({
    'total_crimes': 'mean',
    'Violent': 'mean',
    'Property': 'mean'
})

temp_bins['total_crimes'].plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Avg. Total Crime by Temperature', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Average Daily Crimes', fontsize=12)
axes[0].set_xlabel('Temperature Range', fontsize=12)
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(alpha=0.3, axis='y')

temp_bins['Violent'].plot(kind='bar', ax=axes[1], color='red', alpha=0.7)
axes[1].set_title('Avg. Violent Crime by Temperature', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Average Daily Violent Crimes', fontsize=12)
axes[1].set_xlabel('Temperature Range', fontsize=12)
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(alpha=0.3, axis='y')

temp_bins['Property'].plot(kind='bar', ax=axes[2], color='green', alpha=0.7)
axes[2].set_title('Avg. Property Crime by Temperature', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Average Daily Property Crimes', fontsize=12)
axes[2].set_xlabel('Temperature Range', fontsize=12)
axes[2].tick_params(axis='x', rotation=45)
axes[2].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(REPORTS_DIR / 'heat_crime_by_temperature_bins.png', dpi=300, bbox_inches='tight')
print(f"Saved: {REPORTS_DIR / 'heat_crime_by_temperature_bins.png'}")
plt.show()