# Chicago Car Crash Analysis

## Overview
This project analyzes traffic crash data from Chicago's Open Data Portal to identify patterns and factors that contribute to crashes. It combines three datasets covering crash incidents, the people involved, and the vehicles involved to surface high-risk conditions and behaviors. The results are intended to inform public safety initiatives and support data-driven decisions that reduce traffic crashes.


<img src="./images/chicago_header_image.jpg" width="1280" height="640">

*Photo by [Sawyer Bengtson](https://unsplash.com/@sawyerbengtson) on Unsplash*
___

## Table of Contents

### 1. [Business Understanding](#Business-Understanding)
* 1.1 [Background](#Background)
* 1.2 [Goals](#Goals)
* 1.3 [Success Criteria](#Success-Criteria)
   
### 2. [Data Understanding](#Data-Understanding)


### 3. [Data Preparation](#Data-Preparation)

### 4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)


### 5. [Modeling](#Modeling)


### 6. [Evaluation](#Evaluation)

### 7. [Conclusion](#Conclusion)
* 7.1 [Limitations](#Limitations)
* 7.2 [Recommendations](#Recommendations)
* 7.3 [Next Steps](#Next-Steps)

### 8. [References](#References)

## 1. <a name ="Business-Understanding"></a> Business Understanding

### 1.1 <a name ="Background"></a> Background 

[Vision Zero](https://visionzeronetwork.org/about/what-is-vision-zero/) is a traffic safety initiative aimed at eliminating traffic deaths. Despite pledges from many U.S. cities, traffic fatalities remain a persistent issue. In 2024, Illinois recorded 1,111 traffic deaths, with 361 in Cook County (including Chicago) ([source](https://apps.dot.illinois.gov/FatalCrash/snapshot.html)). Traffic safety is also an equity issue, as Black and Brown communities, particularly in urban areas like Chicago and Philadelphia, experience disproportionately high traffic fatalities. Analysis from the Philadelphia Department of Public Health reveals that zip codes with higher poverty rates also have higher traffic crash hospitalization rates (City of Philadelphia, 2024). These communities often face underinvestment in infrastructure, contributing to these disparities. 


One distinguishing aspect of traffic fatalities is that they are largely preventable. Existing research highlights speed as one of the leading contributors to traffic deaths and a critical factor in crash severity. This raises the question: How do we get drivers to slow down? The deeper issue is one of behavior change, which is much harder to address. Strategies such as speed cushions, lower speed limits, narrower lanes, and increased police presence in high-speed areas have been employed to varying degrees, but there is ongoing debate over which of them most effectively change driver behavior.

Through my involvement with the [City of Philadelphia’s Vision Zero Ambassadors](https://visionzerophl.com/get-involved/) program, I gained extensive domain knowledge and firsthand experience in community engagement, where I worked to raise awareness and drive actions toward traffic safety in underserved neighborhoods. This work provided me with a deeper understanding of the challenges faced by vulnerable communities and the importance of strategic, data-driven interventions in reducing traffic fatalities.

### 1.2 <a name ="Goals"></a> Goals

The goal of this project is to develop a model that predicts whether a crash resulted in serious or fatal injuries. For the purposes of this project, ‘serious’ injuries refer to ‘incapacitating’ injuries, as defined by Chicago’s Department of Transportation (CDOT). CDOT defines incapacitating injuries as injuries that prevent an individual from walking, driving, or performing normal activities. By predicting crash severity, the model aims to identify which factors—such as speed, road conditions, road design, enforcement, and vehicle type & size—are most strongly associated with the outcomes. The analysis will be based on data from Chicago’s open data portal, which includes detailed records on traffic incidents across the city.
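
One way the target described above could be encoded is as a binary flag on a crash-level injury field. The sketch below is illustrative only: it assumes the Crashes dataset exposes a `MOST_SEVERE_INJURY` column whose severe categories are labeled `FATAL` and `INCAPACITATING INJURY`. The `SEVERE` column name and the toy rows are hypothetical, and the label values should be checked against the real data before use.

```python
import pandas as pd

# Assumed label values for severe outcomes; verify against the actual column
SEVERE_LABELS = {"FATAL", "INCAPACITATING INJURY"}

# Toy rows, invented for illustration only
toy_crashes = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "b2", "c3", "d4"],
    "MOST_SEVERE_INJURY": [
        "NO INDICATION OF INJURY",
        "FATAL",
        "INCAPACITATING INJURY",
        None,
    ],
})

# 1 = serious or fatal, 0 = everything else
# (rows with a missing label also fall into 0 in this sketch)
toy_crashes["SEVERE"] = (
    toy_crashes["MOST_SEVERE_INJURY"].isin(SEVERE_LABELS).astype(int)
)
```

In this sketch, rows with a missing injury label are silently treated as non-severe. Dropping or flagging those rows instead is a modeling decision that would need to be made explicitly.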


The model will be designed to be easily interpretable, meaning it will provide clear and understandable explanations for how it arrives at its predictions. This is crucial for decision-makers, as it allows them to trust the model’s results and use it to guide traffic safety policies and resource allocation. An easily interpretable model will help agencies like CDOT and the Chicago Metropolitan Agency for Planning (CMAP) target resources more effectively, allow policymakers to evaluate the impact of different safety strategies, and help make informed decisions about where to focus their efforts. Ultimately, this model will assist in making data-driven decisions that reduce fatalities and improve the overall safety of Chicago’s roadways.

### 1.3 <a name ="Success-Criteria"></a> Success Criteria

The success of this project will be determined primarily by how clearly the model can explain its predictions and show which features drive them. Although the focus is on interpretability, the model’s ability to accurately predict whether a crash resulted in serious or fatal injuries is also a key success criterion, because it indicates how much confidence we can place in the results. Since this is treated as a binary classification problem (i.e., classifying whether or not a crash involved fatal or serious injuries), I will report accuracy, precision, recall, and F1.


The model will be considered successful if it achieves moderate accuracy (e.g., between 65% and 80%) and provides clear, actionable insights that guide traffic safety policies and resource allocation. Accuracy in this range is intended to make the insights trustworthy enough for stakeholders to act on when planning targeted interventions.
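
To make the four metrics named above concrete, here is a minimal pure-Python sketch that computes them from binary labels (1 = serious or fatal). The function name `classification_metrics` and the toy labels are hypothetical and used only for illustration. In practice, a library such as scikit-learn provides equivalent functions in `sklearn.metrics`.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = severe)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On these toy labels, accuracy is 0.8 while precision, recall, and F1 are each 2/3, which illustrates why accuracy is reported alongside the other three metrics rather than on its own.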

## 2. <a name ="Data-Understanding"></a> Data Understanding

The data for this project came from the City of Chicago's [Data Portal](https://data.cityofchicago.org/). From this portal, I used three traffic crash datasets:
1. [Traffic Crashes - Crashes](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data): Contains detailed information about each traffic crash that occurred within the City of Chicago.
    * Total of **901k** observations with **48** features.
        * Includes fields such as crash date, crash location, weather conditions, road conditions, and contributing factors like speed limits.
    * Unique identifier for each crash is the `CRASH_RECORD_ID`.
    * Updated regularly, with the most recent update on Dec 12, 2024.
    * Earliest recorded data dates back to March of 2013.
<br><br>

2. [Traffic Crashes - People](https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d/about_data): Provides information about individuals involved in a traffic crash, including details about their injuries.
    * Total of **1.98M** observations with **29** features.
        * Each record corresponds to one person involved in a crash listed in the Crashes dataset and includes fields such as role (driver, passenger, pedestrian, etc.), the type of injury, and whether the individual sustained any injuries.
    * Records link to the Crashes dataset through `CRASH_RECORD_ID`. Because one crash can involve several people, this key repeats across rows and is not unique per record; the `PERSON_ID` column identifies individual records.
    * Updated regularly, with the most recent update on Dec 12, 2024.
    * Earliest recorded data dates back to March of 2013.
<br><br>

3. [Traffic Crashes - Vehicles](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3/about_data): Contains information about the vehicle(s) involved in traffic crashes. 
   
    * Total of **1.84M** observations with **71** features.
        * Each “unit” involved in a crash (e.g., motor vehicles, bicycles, pedestrians) is assigned a record. 
        * Includes information about the vehicle type, damage, and trajectory, as well as the relationship with the individuals involved (drivers, passengers, pedestrians).
        
    * Links to the `Crash` and `People` datasets using the `CRASH_RECORD_ID`. 
    * Updated regularly, with the most recent update on Dec 12, 2024.
    * Earliest recorded data dates back to March of 2013.
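
Because the three datasets share `CRASH_RECORD_ID`, person-level or unit-level records can be collapsed to one row per crash and then joined onto the Crashes table. The following is a minimal sketch on toy data, not the project's actual preparation code: the column names `POSTED_SPEED_LIMIT` and `PERSON_TYPE` are assumed to match the portal's schema, and `N_PEOPLE` is a hypothetical derived feature.

```python
import pandas as pd

# Toy stand-ins for the Crashes and People tables (values invented)
crashes = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "b2"],
    "POSTED_SPEED_LIMIT": [30, 25],
})
people = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "a1", "b2"],
    "PERSON_TYPE": ["DRIVER", "PASSENGER", "PEDESTRIAN"],
})

# Collapse People to one row per crash before joining,
# so the crash-level table keeps exactly one row per crash
people_per_crash = (
    people.groupby("CRASH_RECORD_ID")
    .size()
    .rename("N_PEOPLE")
    .reset_index()
)

# validate="one_to_one" raises if either side has duplicate keys
merged = crashes.merge(
    people_per_crash,
    on="CRASH_RECORD_ID",
    how="left",
    validate="one_to_one",
)
```

Aggregating before merging avoids duplicating crash rows, which would otherwise overweight crashes that involve many people or vehicles.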

### 2.1 Importing Necessary Libraries and Data

In [1]:
# for getting data
import os
import zipfile

# for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline

# for modeling

In [2]:
# Sets environment variable to point to the location of kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = './config'

In [3]:
# Check if the KAGGLE_CONFIG_DIR environment variable is set
print(os.getenv('KAGGLE_CONFIG_DIR'))

./config


In [4]:
# Information about the dataset
dataset_name = 'ckucewicz/Chicago-Traffic-Data'
zip_filename = 'Chicago-Traffic-Data.zip' 
download_path = './data'
unzip_path = './data'

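# Step 1: Downloads dataset from Kaggle, if not already downloaded
dataset_path = os.path.join(download_path, zip_filename)
if not os.path.exists(dataset_path):
    print(f"Downloading {dataset_name}...")
    # Makes sure the target folder exists before downloading
    os.makedirs(download_path, exist_ok=True)
    status = os.system(f"kaggle datasets download -d {dataset_name} --path {download_path}")
    # Stops early if the download failed, rather than failing later at the unzip step
    if status != 0:
        raise RuntimeError(f"Kaggle download failed with status {status}")
else:
    print(f"{dataset_name} already downloaded.")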

# Step 2: Unzips the downloaded file
print(f"Unzipping {zip_filename}...")
with zipfile.ZipFile(dataset_path, 'r') as zip_ref:
    zip_ref.extractall(unzip_path)

# Step 3: Loads the CSV files into pandas dataframes
csv_filenames = ['people.csv', 'traffic_crashes.csv', 'vehicles.csv']
dataframes = {}
for csv_filename in csv_filenames:
    csv_file = os.path.join(unzip_path, csv_filename)
    print(f"Loading CSV: {csv_filename}...")
    
    # Creates a variable name based on the CSV filename
    dataframe_name = csv_filename.split('.')[0]
    
    # Stores the dataframe in the dictionary
    dataframes[dataframe_name] = pd.read_csv(csv_file, low_memory = False)

    # Prints a completion message once the CSV is successfully loaded
    print(f"Loading of {dataframe_name} complete.")
    print("-" * 50)

# stores each dataset in its own variable
people_df = dataframes['people']
traffic_crashes_df = dataframes['traffic_crashes']
vehicles_df = dataframes['vehicles']

ckucewicz/Chicago-Traffic-Data already downloaded.
Unzipping Chicago-Traffic-Data.zip...
Loading CSV: people.csv...
Loading CSV: traffic_crashes.csv...
Loading CSV: vehicles.csv...


### 2.2 Data Understanding Function

In [None]:
# Dataset understanding function
def dataset_understanding(dataset_path, date_col=None):
    """
    Automates the process of understanding the structure and contents of a given dataset.

    This function provides an overview of the dataset by:
    - Loading the dataset from a specified CSV file.
    - Displaying the first few rows of the dataset.
    - Printing information about the columns and data types.
    - Calculating and displaying the percentage of missing values for each feature.
    - Displaying the value counts and number of unique values for each feature.
    - Plotting histograms for numeric features and bar charts for categorical features.
    - Analyzing a date column (if specified) to identify the earliest and latest dates.

    Parameters:
    dataset_path (str): The path to the dataset CSV file to be analyzed.
    date_col (str, optional): The name of the column containing date values to analyze. 
                              If provided, the function will calculate the earliest and latest dates.

    Returns:
    pandas.DataFrame: The loaded DataFrame with the dataset contents.
    
    """
    # Read the dataset
    print(f"Loading dataset from {dataset_path}...\n")
    df = pd.read_csv(dataset_path, low_memory=False)
    
    # Print the first 5 rows
    print("First 5 rows of the dataset:\n")
    print(df.head(), "\n")
    
    # Print info about columns (df.info() prints directly and returns None)
    print("DataFrame Info:\n")
    df.info()
    print()
    
    # Calculate and print percentage of missing values
    print("Percentage of missing values in each feature:\n")
    print(round((df.isna().sum() / len(df) * 100), 2), "\n")
    
    # Loop through features and print value counts and unique values
    print("Value counts and number of unique values for each feature:\n")
    for feature in df.columns:
        print(f"Value counts for column '{feature}':")
        print(df[feature].value_counts())
        print(f"Number of unique values: {df[feature].nunique()}\n")
        print("-" * 50, "\n")
    
    # Plot histograms for numeric features
    numeric_cols = df.select_dtypes(include='number').columns
    print("Plotting histograms for numeric features...\n")
    df[numeric_cols].hist(bins=20, figsize=(12, 10))
    plt.tight_layout()
    plt.show()
    
    # Plot bar charts for categorical features (top 20 categories)
    categorical_cols = [col for col in df.select_dtypes(include='object').columns if col != 'CRASH_RECORD_ID']
    print("Plotting bar charts for categorical features (top 20 categories)...\n")
    n_cols = 3
    n_rows = math.ceil(len(categorical_cols) / n_cols)
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(25, n_rows * 5))
    axes = axes.flatten()
    
    for i, col in enumerate(categorical_cols):
        ax = axes[i]
        top_categories = df[col].value_counts().head(20)
        top_categories.plot(kind='bar', ax=ax, color='skyblue', edgecolor='black')
        ax.set_title(f'Bar Chart of {col}')
        ax.set_xlabel(col)
        ax.set_ylabel('Count')
        ax.tick_params(axis='x', rotation=0)
    
    for j in range(len(categorical_cols), len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()
    
    # Analyze date column for earliest and latest dates if provided
    if date_col:
        print(f"Analyzing date column '{date_col}'...\n")
        df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
        earliest_date = df[date_col].min()
        latest_date = df[date_col].max()
        print(f"Earliest {date_col}: {earliest_date}")
        print(f"Latest {date_col}: {latest_date}\n")
    
    return df
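
The function above reads from disk and produces plots, which makes it awkward to check on small inputs. As a hedged illustration of its missing-value and date-range steps only, here is a trimmed-down, plot-free variant that takes a DataFrame already in memory. The name `quick_summary`, the toy rows, and the `WEATHER_CONDITION` column are hypothetical and not part of the original notebook.

```python
import pandas as pd


def quick_summary(df, date_col=None):
    """Return missing-value percentages and, optionally, the date range of df."""
    # Share of missing values per column, as a percentage rounded to 2 decimals
    summary = {"missing_pct": df.isna().mean().mul(100).round(2)}
    if date_col is not None:
        # Unparseable or missing dates become NaT and are ignored by min/max
        dates = pd.to_datetime(df[date_col], errors="coerce")
        summary["earliest"] = dates.min()
        summary["latest"] = dates.max()
    return summary


# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "CRASH_DATE": [
        "2024-01-05 08:30:00",
        "2024-03-01 17:45:00",
        None,
        "2023-12-31 23:10:00",
    ],
    "WEATHER_CONDITION": ["CLEAR", None, "RAIN", "CLEAR"],
})
result = quick_summary(toy, date_col="CRASH_DATE")
```

Unlike `dataset_understanding`, this variant does not modify the input DataFrame, since the parsed dates are kept in a local variable.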

#### 2.2.1 Crashes Dataset

In [None]:
# Data understanding for Crashes dataset
crashes_df = dataset_understanding('./data/Traffic_Crashes_-_Crashes_20241213.csv', date_col='CRASH_DATE')

#### 2.2.2 People Dataset

In [None]:
# Data understanding for People dataset
people_df = dataset_understanding('./data/Traffic_Crashes_-_People_20241213.csv', date_col = 'CRASH_DATE')

#### 2.2.3 Vehicles Dataset

In [None]:
# Data understanding for Vehicles dataset
vehicles_df = dataset_understanding('./data/Traffic_Crashes_-_Vehicles_20241213.csv', date_col = 'CRASH_DATE')

## 3. <a name ="Data-Preparation"></a> Data Preparation

## 4. <a name ="Exploratory-Data-Analysis"></a>Exploratory Data Analysis (EDA)

## 5. <a name ="Modeling"></a>  Modeling

## 6. <a name ="Evaluation"></a> Evaluation

## 7. <a name ="Conclusion"></a> Conclusion

### 7.1 <a name ="Limitations"></a> Limitations

### 7.2 <a name ="Recommendations"></a> Recommendations

### 7.3 <a name ="Next-Steps"></a> Next Steps

## 8. <a name ="References"></a> References

1. City of Philadelphia. (2024). *Vision Zero Annual Report 2024*. Philadelphia.gov. https://visionzerophl.com/plans-and-reports/annual-report-2024/