# 🚦 Traffic Incident Analysis

This project explores traffic incident patterns in the City of Calgary using public incident data. It combines spatial clustering, temporal trend analysis, and machine learning to characterize where and when incidents occur.

## ✨ Project Goals
- Detect spatial hotspots using clustering (DBSCAN)
- Analyze incident frequency by hour, weekday, and season
- Predict incident counts with Random Forest Regression

## 📂 Structure
- `data/`: Raw and cleaned datasets
- `notebooks/`: Jupyter notebooks for each analysis step
- `output/`: Visualizations and results

## 🛠️ Tools
- Python, Pandas, Matplotlib, Seaborn, Scikit-learn, Folium
- JupyterLab for development
- GitHub for version control

## 📍 Data Source
[City of Calgary Open Data Portal - Traffic Incidents](https://data.calgary.ca/Transportation-Transit/Traffic-Incidents/35ra-9556)

---

# Step 1: Load and Preprocess Traffic Incident Data

In this step, we will:
- Import the traffic incidents dataset.
- Clean the data by removing missing or invalid values.
- Extract useful temporal features (hour, day of the week, month, season).
- Save the cleaned dataset for future use.

## Import Libraries

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load the Dataset

The dataset is current to March 31, 2025. It was downloaded from the [City of Calgary Open Data Portal](https://data.calgary.ca/Transportation-Transit/Traffic-Incidents/35ra-9556) and stored in the `data/` folder.


In [58]:
df = pd.read_csv('../data/Traffic_Incidents_20250331.csv')

# Preview shape and column names
print("Shape:", df.shape)
df.columns


Shape: (53788, 10)


Index(['INCIDENT INFO', 'DESCRIPTION', 'START_DT', 'MODIFIED_DT', 'QUADRANT',
       'Longitude', 'Latitude', 'Count', 'id', 'Point'],
      dtype='object')

## Clean the Dataset

- Convert `START_DT` to datetime format.
- Drop rows with missing latitude or longitude.
- Keep only points inside a bounding box around Calgary (latitude between 50 and 52, longitude between -115 and -113).

In [61]:
# Convert START_DT to datetime
df['START_DT'] = pd.to_datetime(df['START_DT'])

# Drop rows with missing latitude or longitude
df.dropna(subset=['Latitude', 'Longitude'], inplace=True)

# Filtering for valid Calgary coordinates
df = df[(df['Latitude'] > 50) & (df['Latitude'] < 52)]
df = df[(df['Longitude'] < -113) & (df['Longitude'] > -115)]
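
The same cleaning steps can be wrapped in a small function and exercised on a few toy rows to confirm that each filter behaves as intended. This is a minimal sketch rather than the notebook's own code: the helper name `clean_incidents`, the use of `errors='coerce'` for unparseable timestamps, and the toy rows are assumptions made for illustration. The bounding box reuses the limits from the cell above.

```python
import pandas as pd

# Bounding box copied from the filter above (degrees).
LAT_MIN, LAT_MAX = 50, 52
LON_MIN, LON_MAX = -115, -113


def clean_incidents(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse START_DT and drop rows with bad values."""
    out = df.copy()
    # Assumption: unparseable timestamps become NaT instead of raising.
    out['START_DT'] = pd.to_datetime(out['START_DT'], errors='coerce')
    # Drop rows with a missing timestamp or missing coordinates.
    out = out.dropna(subset=['START_DT', 'Latitude', 'Longitude'])
    # Keep only points strictly inside the bounding box.
    in_box = (
        (out['Latitude'] > LAT_MIN) & (out['Latitude'] < LAT_MAX)
        & (out['Longitude'] > LON_MIN) & (out['Longitude'] < LON_MAX)
    )
    return out[in_box].reset_index(drop=True)


# Toy rows (made up): one valid, one bad date, one missing latitude,
# and one point far outside the bounding box.
toy = pd.DataFrame({
    'START_DT': ['2025-03-01 08:15:00', 'not a date',
                 '2025-03-02 17:40:00', '2025-03-03 12:00:00'],
    'Latitude': [51.05, 51.04, None, 0.0],
    'Longitude': [-114.07, -114.06, -114.05, 0.0],
})

cleaned = clean_incidents(toy)
print(f"Kept {len(cleaned)} of {len(toy)} rows")  # Kept 1 of 4 rows
```

Comparing `len(df)` before and after cleaning on the real data is a quick way to see how many rows, if any, the filters actually remove.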

## Extract Temporal Features

We'll derive additional columns for:
- Hour of the day
- Day of the week
- Month of the year
- Season

In [64]:
# Extract time-based features
df['hour'] = df['START_DT'].dt.hour
df['day_of_week'] = df['START_DT'].dt.dayofweek  # 0 = Monday
df['month'] = df['START_DT'].dt.month
df['season'] = df['month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall'
})
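
To sanity-check the feature extraction, the same logic can be run on a few hand-picked timestamps whose weekday and season are known in advance. The sketch below is illustrative only: the `sample` frame and the extra `day_name` column are assumptions and are not part of the dataset or the notebook.

```python
import pandas as pd

# Same month-to-season mapping as in the cell above.
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

# Hand-picked timestamps (made up for this check).
sample = pd.DataFrame({
    'START_DT': pd.to_datetime([
        '2024-01-01 07:30:00',  # a Monday in winter
        '2024-07-06 23:05:00',  # a Saturday in summer
        '2024-10-31 16:45:00',  # a Thursday in fall
    ])
})

sample['hour'] = sample['START_DT'].dt.hour
sample['day_of_week'] = sample['START_DT'].dt.dayofweek  # 0 = Monday
sample['day_name'] = sample['START_DT'].dt.day_name()    # readable label
sample['month'] = sample['START_DT'].dt.month
sample['season'] = sample['month'].map(SEASON_BY_MONTH)

print(sample[['hour', 'day_of_week', 'day_name', 'month', 'season']])
```

A readable label such as `day_name` can make later plots easier to interpret than the integer codes, while the integer column remains convenient for modeling.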

## Save the Cleaned Dataset

We'll export the preprocessed data to a new CSV file for use in analysis and modeling.

In [67]:
df.to_csv('../data/Traffic_Incidents_Cleaned.csv', index=False)
print("✅ Cleaned data saved successfully.")

✅ Cleaned data saved successfully.
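
One thing to keep in mind when reloading this file: CSV does not store column types, so `START_DT` comes back as plain text unless it is parsed again (the `df.info()` output in the summary section lists it as `object`). The sketch below shows the difference using an in-memory buffer and two made-up rows. The frame `original` and the buffer are assumptions for illustration; `parse_dates` is a standard `pd.read_csv` parameter.

```python
import io

import pandas as pd

# Two made-up rows with a real datetime column.
original = pd.DataFrame({
    'START_DT': pd.to_datetime(['2025-03-01 08:15:00', '2025-03-02 17:40:00']),
    'hour': [8, 17],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default read: START_DT is loaded as text, not as datetimes.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with parse_dates: START_DT is converted back to datetimes.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['START_DT'])

print("plain is datetime: ", pd.api.types.is_datetime64_any_dtype(plain['START_DT']))
print("parsed is datetime:", pd.api.types.is_datetime64_any_dtype(parsed['START_DT']))
```

If later steps need `.dt` accessors on `START_DT`, passing `parse_dates=['START_DT']` when reading the cleaned CSV avoids a second conversion step.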


## Preprocessed Data Summary

Let's inspect the cleaned dataset before proceeding with analysis.


In [74]:
# Reload the cleaned data just to be sure
df = pd.read_csv('../data/Traffic_Incidents_Cleaned.csv')

# Show first few rows (Jupyter only renders this when it is the last expression in the cell)
df.head()

# Check structure
df.info()

# Check for any remaining missing values
print("\nMissing values:\n", df.isnull().sum())

# Check unique values in key features
print("\nSeasons:", df['season'].unique())
print("Days of Week (0 = Monday):", df['day_of_week'].unique())
print("Unique months:", sorted(df['month'].unique()))
print("Hours:", sorted(df['hour'].unique()))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53788 entries, 0 to 53787
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   INCIDENT INFO  53788 non-null  object 
 1   DESCRIPTION    53786 non-null  object 
 2   START_DT       53788 non-null  object 
 3   MODIFIED_DT    39731 non-null  object 
 4   QUADRANT       39729 non-null  object 
 5   Longitude      53788 non-null  float64
 6   Latitude       53788 non-null  float64
 7   Count          53788 non-null  int64  
 8   id             53788 non-null  object 
 9   Point          53788 non-null  object 
 10  hour           53788 non-null  int64  
 11  day_of_week    53788 non-null  int64  
 12  month          53788 non-null  int64  
 13  season         53788 non-null  object 
dtypes: float64(2), int64(4), object(8)
memory usage: 5.7+ MB

Missing values:
 INCIDENT INFO        0
DESCRIPTION          2
START_DT             0
MODIFIED_DT      14057
QUADRANT       