# Correlation Analysis: Crime vs. Weather

This notebook performs a detailed analysis of the correlation between weather variables and daily crime incidence.  
The main objective is to understand how average temperature and precipitation relate to daily crime counts, using statistical methods and recruiter-friendly visualizations.

## 1. Import Libraries and Load Data

In this step, we load the essential libraries for data manipulation (`pandas`), interactive visualization (`plotly.express`), and statistical analysis (`scipy.stats`).  
Then we read the cleaned crime and weather datasets, ensuring the `date` column is parsed correctly.

In [None]:

# %%
# 1. Import libraries and load cleaned data
import pandas as pd
import plotly.express as px
from scipy.stats import pearsonr
from pathlib import Path

# Paths
dir_processed = Path('../data/processed')

# Load cleaned data
crime_df   = pd.read_csv(dir_processed / 'crime_clean.csv', parse_dates=['date'])
weather_df = pd.read_csv(dir_processed / 'weather_clean.csv', parse_dates=['date'])
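
Whether `parse_dates` succeeded can be checked from the column dtype. The following is a minimal sketch on made-up CSV content (the names `toy_csv` and `toy_df` are illustrative, not part of the notebook's data), showing what a correctly parsed `date` column looks like:

```python
import io
import pandas as pd

# Made-up CSV content standing in for a cleaned file.
toy_csv = io.StringIO("date,precipitation\n2023-01-01,0.0\n2023-01-02,5.2\n")

toy_df = pd.read_csv(toy_csv, parse_dates=['date'])

# A correctly parsed column has a datetime64 dtype rather than object (string).
is_datetime = pd.api.types.is_datetime64_any_dtype(toy_df['date'])
print(toy_df.dtypes)
print(is_datetime)
```

The same dtype check could be applied to `crime_df['date']` and `weather_df['date']` before merging, since a join on mismatched dtypes tends to fail or match nothing.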


## 2. Prepare Daily Metrics

Aggregate crime records by date to compute daily crime counts.  
If `temp_max` and `temp_min` are available, calculate the daily average temperature.  
Retain precipitation values for the joint analysis.

In [None]:
# 2. Prepare aggregated daily metrics
# Daily crime counts
daily_crime = (
    crime_df.groupby('date')
            .size()
            .reset_index(name='crime_count')
)

# Daily average temperature: if the file provides temp_max and temp_min, use their midpoint;
# otherwise rely on a temp_avg column already present in the file.
if 'temp_max' in weather_df.columns and 'temp_min' in weather_df.columns:
    weather_df['temp_avg'] = (weather_df['temp_max'] + weather_df['temp_min']) / 2

# Use the existing daily weather records
daily_weather = weather_df[['date', 'temp_avg', 'precipitation']]
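
The grouping step keys on exact `date` values. If the crime file stored full timestamps rather than calendar dates (an assumption; the cleaned file may already hold dates only), every distinct timestamp would become its own group. A minimal sketch on made-up records, showing how truncating to midnight first yields one row per day:

```python
import pandas as pd

# Hypothetical toy records; the real crime_clean.csv may already contain plain dates.
toy_crime = pd.DataFrame({
    'date': pd.to_datetime([
        '2023-01-01 08:15', '2023-01-01 22:40',
        '2023-01-02 03:05',
    ])
})

# Truncate each timestamp to midnight so records from the same day share a key.
toy_crime['date'] = toy_crime['date'].dt.normalize()

toy_daily = (
    toy_crime.groupby('date')
             .size()
             .reset_index(name='crime_count')
)
print(toy_daily)
```

If the `date` column contains no time component, the normalization is a no-op and can be left out.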

## 3. Merge Datasets on Date

Combine the daily crime counts with weather metrics via an inner join on the `date` column.  
Remove any rows with missing temperature or precipitation.

In [None]:
# 3. Merge datasets on date

merged = pd.merge(daily_crime, daily_weather, on='date', how='inner')
# Drop rows with NaN in 'temp_avg' or 'precipitation'
merged = merged.dropna(subset=['temp_avg', 'precipitation'])
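
One consequence of the inner join is worth noting: a day with weather data but no crime records has no row in `daily_crime`, so it drops out of `merged` instead of appearing with a count of zero. The sketch below uses made-up data to contrast the inner join with a left join from the weather side. Filling the gaps with 0 is an assumption that only holds if a missing date really means no incidents rather than missing data.

```python
import pandas as pd

# Hypothetical toy data: no crime was recorded on 2023-01-02.
toy_crime_counts = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-03']),
    'crime_count': [4, 2],
})
toy_weather = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'temp_avg': [10.0, 12.5, 9.0],
    'precipitation': [0.0, 5.2, 0.0],
})

# Inner join: the day without crime records disappears.
inner = pd.merge(toy_crime_counts, toy_weather, on='date', how='inner')

# Left join from the weather side keeps every day; missing counts become 0.
left = pd.merge(toy_weather, toy_crime_counts, on='date', how='left')
left['crime_count'] = left['crime_count'].fillna(0).astype(int)

print(len(inner), len(left))
```

Which variant is appropriate depends on how complete the crime records are. Silently excluding zero-crime days can bias the correlation if such days cluster under particular weather conditions.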

## 4. Compute Correlation Metrics

Use Pearson’s correlation coefficient to quantify the linear association between crime counts and each weather variable.  
Report both the correlation coefficient and the p-value for significance.

In [None]:

# 4. Correlation metrics

# Pearson correlation for average temperature
pearson_temp, pval_temp = pearsonr(merged['crime_count'], merged['temp_avg'])
# Pearson correlation for precipitation
pearson_precip, pval_precip = pearsonr(merged['crime_count'], merged['precipitation'])

# Print results (p-values use 3 significant digits so very small values do not round to 0.000)
print(f"Pearson correlation (crime_count vs. temp_avg): {pearson_temp:.3f} (p={pval_temp:.3g})")
print(f"Pearson correlation (crime_count vs. precipitation): {pearson_precip:.3f} (p={pval_precip:.3g})")
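
Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below, on made-up numbers rather than the notebook's data, computes it from that definition and compares the result with `pearsonr`. It also calls `spearmanr`, a rank-based alternative that this notebook does not use, included here only to illustrate a check for monotonic but non-linear association.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up values for illustration only; not taken from the notebook's data.
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
crimes = np.array([12.0, 15.0, 14.0, 18.0, 21.0, 20.0])

# Pearson's r from its definition: covariance over the product of standard deviations.
dx = temp - temp.mean()
dy = crimes - crimes.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, p_scipy = pearsonr(temp, crimes)

# Spearman's rho works on ranks, so it captures monotonic (not only linear) association.
rho, p_rho = spearmanr(temp, crimes)

print(f"manual r={r_manual:.3f}, scipy r={r_scipy:.3f}, spearman rho={rho:.3f}")
```

Precipitation is often heavily skewed toward zero, so comparing the two coefficients on `merged` may indicate whether a few extreme days dominate the Pearson value.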


## 5. Scatter Plots

Create scatter plots with OLS trendlines to visualize how crime counts vary with average temperature and precipitation.

In [None]:
# 5. Scatter plots

# %%
fig1 = px.scatter(
    merged, x='temp_avg', y='crime_count',
    trendline='ols',
    title='Scatter: Crime Count vs. Average Temperature'
)
fig1.show()

# %%
fig2 = px.scatter(
    merged, x='precipitation', y='crime_count',
    trendline='ols',
    title='Scatter: Crime Count vs. Precipitation'
)
fig2.show()
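
The `trendline='ols'` option draws an ordinary least squares line, and Plotly Express relies on the `statsmodels` package being installed to fit it. To see the numbers behind such a line, the sketch below fits the same kind of simple regression with `scipy.stats.linregress` on made-up values (not the notebook's data) and shows that the fit's correlation coefficient matches Pearson's r.

```python
import numpy as np
from scipy.stats import linregress, pearsonr

# Made-up values for illustration only.
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
crimes = np.array([12.0, 15.0, 14.0, 18.0, 21.0, 20.0])

fit = linregress(temp, crimes)
r, _ = pearsonr(temp, crimes)

# slope: estimated change in daily crime count per one-unit rise in temperature.
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, r={fit.rvalue:.3f}")
```

The slope gives the effect size in interpretable units (crimes per degree), which the correlation coefficient alone does not convey.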

## 6. Box Plot by Precipitation Category

Bin precipitation into categories (No rain, Light, Moderate, Heavy) and present a box plot showing the distribution of daily crime counts across these categories.


In [None]:

# 6. Box plot: Crime distribution by precipitation category

# %%
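# Define precipitation categories (intervals are right-closed, e.g. (0.0, 2.5])
# The open-ended upper edge avoids a ValueError from non-increasing bins
# when the maximum precipitation in the data is 10.0 or less.
merged['precip_bin'] = pd.cut(
    merged['precipitation'],
    bins=[-0.1, 0.0, 2.5, 10.0, float('inf')],
    labels=['No rain', 'Light rain', 'Moderate rain', 'Heavy rain']
)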

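fig3 = px.box(
    merged, x='precip_bin', y='crime_count',
    # Fix the x-axis order so categories run from driest to wettest
    category_orders={'precip_bin': ['No rain', 'Light rain', 'Moderate rain', 'Heavy rain']},
    title='Crime Count Distribution by Precipitation Level'
)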
fig3.show()
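
How `pd.cut` assigns boundary values is easy to get wrong, so here is a minimal sketch on made-up precipitation amounts. The thresholds mirror the ones in this section. The open-ended upper edge (`float('inf')`) is a choice made for this illustration so that the bins stay strictly increasing whatever the data's maximum is.

```python
import pandas as pd

# Made-up precipitation amounts; thresholds mirror the bins used in the notebook.
toy_precip = pd.Series([0.0, 0.4, 2.5, 7.0, 10.0, 35.0])

toy_bins = pd.cut(
    toy_precip,
    # Intervals are right-closed: (-0.1, 0], (0, 2.5], (2.5, 10], (10, inf]
    bins=[-0.1, 0.0, 2.5, 10.0, float('inf')],
    labels=['No rain', 'Light rain', 'Moderate rain', 'Heavy rain'],
)
print(toy_bins.tolist())
```

Because the intervals are closed on the right, a value of exactly 2.5 falls under "Light rain" and exactly 10.0 under "Moderate rain".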
