## Traffic Crashes in Chicago

Urban traffic accidents pose significant challenges, leading to considerable human suffering and economic losses. In Chicago, the annual toll includes numerous traffic collisions that result in injuries, fatalities, and substantial property damage. Identifying the factors that lead to these incidents and forecasting their severity are crucial steps for city planners and public safety officials aiming to enhance road safety. This initiative utilizes the City of Chicago's Traffic Crashes dataset to develop predictive models that pinpoint high-risk situations and offer actionable insights.

## Project Overview

This initiative seeks to examine Chicago's traffic accident data to forecast the leading contributory causes of these incidents. By employing machine learning algorithms, the project aims to pinpoint critical factors that lead to accidents, thereby guiding policy decisions to enhance traffic safety and mitigate future collisions. The dataset encompasses various attributes related to accidents, including weather conditions, lighting, and roadway surface states. The primary focus is on the 'Primary Contributory Cause,' approached as a multi-class classification challenge.

## Business Understanding

Traffic accidents represent a critical public safety concern. Identifying the factors that lead to these incidents enables city planners, traffic engineers, and policymakers to implement more effective safety measures. This project aims to develop a model that accurately predicts the primary contributory causes of accidents, providing actionable insights to reduce their frequency and severity.

## Data Understanding

The dataset for this project is sourced from the City of Chicago's Traffic Crashes database, which provides comprehensive details for each incident, including:

- Crash Date: Specifies the exact date and time when the crash occurred.
- Traffic Control Device: Identifies the type of traffic control mechanism in place at the location of the crash.
- Weather Conditions: Describes the atmospheric conditions present at the time of the accident.
- Roadway Surface Conditions: Indicates the state of the road surface during the crash.
- Lighting Conditions: Details the level of illumination at the crash scene.
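
As a small illustration of working with the Crash Date field, the sketch below parses two timestamps in the format seen in the sample rows later in this notebook and derives hour, month, and day of week. The `Sunday = 1` numbering is an assumption based on the dataset's published column description, and the column names `hour`, `month`, and `day_of_week` are made up for this example.

```python
import pandas as pd

# Two timestamps copied from the sample rows shown later in this notebook
sample = pd.DataFrame({"CRASH_DATE": ["09/05/2023 07:05:00 PM",
                                      "07/29/2023 02:45:00 PM"]})

# An explicit format avoids ambiguous day/month parsing
parsed = pd.to_datetime(sample["CRASH_DATE"], format="%m/%d/%Y %I:%M:%S %p")

sample["hour"] = parsed.dt.hour      # 0-23
sample["month"] = parsed.dt.month    # 1-12
# pandas numbers days Monday=0 ... Sunday=6; shift to Sunday=1 ... Saturday=7 (assumed convention)
sample["day_of_week"] = (parsed.dt.dayofweek + 1) % 7 + 1
print(sample)
```

Parsing with an explicit format string also makes it obvious when a row does not match the expected layout, because `pd.to_datetime` raises an error instead of guessing.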


## Data Preparation

## Importing Libraries

In [3]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle, sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from numbers import Number

## Reading and accessing the data

In [4]:
df = pd.read_csv("Traffic_Crashes.csv")
print(df.head())
print(df.info())
print(df.describe())

                                     CRASH_RECORD_ID CRASH_DATE_EST_I  \
0  23a79931ef555d54118f64dc9be2cf2dbf59636ce253f7...              NaN   
1  2675c13fd0f474d730a5b780968b3cafc7c12d7adb661f...              NaN   
2  5f54a59fcb087b12ae5b1acff96a3caf4f2d37e79f8db4...              NaN   
3  7ebf015016f83d09b321afd671a836d6b148330535d5df...              NaN   
4  6c1659069e9c6285a650e70d6f9b574ed5f64c12888479...              NaN   

               CRASH_DATE  POSTED_SPEED_LIMIT TRAFFIC_CONTROL_DEVICE  \
0  09/05/2023 07:05:00 PM                  30         TRAFFIC SIGNAL   
1  09/22/2023 06:45:00 PM                  50            NO CONTROLS   
2  07/29/2023 02:45:00 PM                  30         TRAFFIC SIGNAL   
3  08/09/2023 11:00:00 PM                  30            NO CONTROLS   
4  08/18/2023 12:50:00 PM                  15                  OTHER   

       DEVICE_CONDITION WEATHER_CONDITION      LIGHTING_CONDITION  \
0  FUNCTIONING PROPERLY             CLEAR                  

In [5]:
df.columns

Index(['CRASH_RECORD_ID', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LATITUDE', 

## Check for any Missing values

In [6]:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

CRASH_DATE_EST_I                 13291
LANE_CNT                         11394
REPORT_TYPE                        342
INTERSECTION_RELATED_I           11313
NOT_RIGHT_OF_WAY_I               13657
HIT_AND_RUN_I                    10489
PRIM_CONTRIBUTORY_CAUSE              1
SEC_CONTRIBUTORY_CAUSE               1
STREET_NO                            1
STREET_DIRECTION                     4
STREET_NAME                          2
BEAT_OF_OCCURRENCE                   3
PHOTOS_TAKEN_I                   14120
STATEMENTS_TAKEN_I               13957
DOORING_I                        14259
WORK_ZONE_I                      14201
WORK_ZONE_TYPE                   14214
WORKERS_PRESENT_I                14260
NUM_UNITS                            1
MOST_SEVERE_INJURY                  31
INJURIES_TOTAL                      31
INJURIES_FATAL                      31
INJURIES_INCAPACITATING             31
INJURIES_NON_INCAPACITATING         31
INJURIES_REPORTED_NOT_EVIDENT       31
INJURIES_NO_INDICATION   
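
Raw counts are easier to judge as a share of all rows. The sketch below, run on a small made-up frame rather than the crash data, ranks columns by their fraction of missing values. The 0.5 cutoff used to flag drop candidates is an arbitrary illustrative threshold, not a value taken from this project.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the crash data
toy = pd.DataFrame({
    "DOORING_I": [np.nan, np.nan, np.nan, "Y"],
    "LANE_CNT": [2.0, np.nan, 4.0, 2.0],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR", "SNOW"],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical rule: columns that are more than half empty are candidates for dropping
drop_candidates = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(drop_candidates)
```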

## Check for Duplicate values

In [7]:
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)

Empty DataFrame
Columns: [CRASH_RECORD_ID, CRASH_DATE_EST_I, CRASH_DATE, POSTED_SPEED_LIMIT, TRAFFIC_CONTROL_DEVICE, DEVICE_CONDITION, WEATHER_CONDITION, LIGHTING_CONDITION, FIRST_CRASH_TYPE, TRAFFICWAY_TYPE, LANE_CNT, ALIGNMENT, ROADWAY_SURFACE_COND, ROAD_DEFECT, REPORT_TYPE, CRASH_TYPE, INTERSECTION_RELATED_I, NOT_RIGHT_OF_WAY_I, HIT_AND_RUN_I, DAMAGE, DATE_POLICE_NOTIFIED, PRIM_CONTRIBUTORY_CAUSE, SEC_CONTRIBUTORY_CAUSE, STREET_NO, STREET_DIRECTION, STREET_NAME, BEAT_OF_OCCURRENCE, PHOTOS_TAKEN_I, STATEMENTS_TAKEN_I, DOORING_I, WORK_ZONE_I, WORK_ZONE_TYPE, WORKERS_PRESENT_I, NUM_UNITS, MOST_SEVERE_INJURY, INJURIES_TOTAL, INJURIES_FATAL, INJURIES_INCAPACITATING, INJURIES_NON_INCAPACITATING, INJURIES_REPORTED_NOT_EVIDENT, INJURIES_NO_INDICATION, INJURIES_UNKNOWN, CRASH_HOUR, CRASH_DAY_OF_WEEK, CRASH_MONTH, LATITUDE, LONGITUDE, LOCATION]
Index: []

[0 rows x 48 columns]
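
`df.duplicated()` only flags rows that match in every column. A possible complementary check, sketched here on made-up rows, is to look for repeated values in the identifier column alone, since two records could share a `CRASH_RECORD_ID` while differing elsewhere. Whether such cases exist in the real data is not established here.

```python
import pandas as pd

# Made-up records: the first and third share an ID but differ in DAMAGE
toy = pd.DataFrame({
    "CRASH_RECORD_ID": ["a1", "b2", "a1"],
    "DAMAGE": ["OVER $1,500", "$500 OR LESS", "$501 - $1,500"],
})

full_row_dupes = int(toy.duplicated().sum())                            # identical in all columns
id_dupes = int(toy.duplicated(subset=["CRASH_RECORD_ID"]).sum())        # same ID only
print(full_row_dupes, id_dupes)
```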


## Dealing with missing values

In [8]:
# Fill categorical columns with their most frequent value (the mode).
# Assigning the result back avoids chained inplace fillna, which stops working in pandas 3.0.
mode_cols = ['WEATHER_CONDITION', 'LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND',
             'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 'INTERSECTION_RELATED_I',
             'NOT_RIGHT_OF_WAY_I', 'MOST_SEVERE_INJURY', 'HIT_AND_RUN_I', 'REPORT_TYPE']
for col in mode_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop sparsely populated indicator columns and street-level identifiers
df = df.drop(columns=['PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
                      'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'STREET_NO', 'STREET_DIRECTION',
                      'STREET_NAME', 'BEAT_OF_OCCURRENCE', 'CRASH_DATE_EST_I'])

# Fill numeric columns with their median
for col in ['LANE_CNT', 'LATITUDE', 'LONGITUDE']:
    df[col] = df[col].fillna(df[col].median())

# Treat missing injury counts as zero
injury_columns = [
    'INJURIES_TOTAL', 'INJURIES_FATAL', 'INJURIES_INCAPACITATING',
    'INJURIES_NON_INCAPACITATING', 'INJURIES_REPORTED_NOT_EVIDENT',
    'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN'
]
df[injury_columns] = df[injury_columns].fillna(0)

# Fill LOCATION from the previous row, then from the next row for any leading gaps
df['LOCATION'] = df['LOCATION'].ffill().bfill()

In [9]:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

PRIM_CONTRIBUTORY_CAUSE    1
SEC_CONTRIBUTORY_CAUSE     1
NUM_UNITS                  1
CRASH_HOUR                 1
CRASH_DAY_OF_WEEK          1
CRASH_MONTH                1
dtype: int64


## Feature Engineering

The dataset already includes pre-engineered features, such as CRASH_HOUR and CRASH_MONTH. To identify the top 10 features most relevant to our target variable, we will employ a machine learning model.

In the following cell, we fit a Logistic Regression model on a 1% random sample of the data to keep training time manageable. A preprocessing pipeline imputes missing values and one-hot encodes the categorical features, so that both categorical and numerical features are ready for modeling.

In [15]:

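# Work on a 1% random sample to keep training fast
df_sampled = df.sample(frac=0.01, random_state=42)

# Separate the features from the target
X = df_sampled.drop('PRIM_CONTRIBUTORY_CAUSE', axis=1)
y = df_sampled['PRIM_CONTRIBUTORY_CAUSE']

# Split columns by dtype so each group gets its own preprocessing
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

# Numerical columns: median imputation.
# Categorical columns: most-frequent imputation followed by one-hot encoding.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), numerical_cols),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_cols)
    ])

# Hold out 20% of the sample for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the preprocessing on the training split only, then apply it to the test split
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Train the multi-class Logistic Regression model
model = LogisticRegression(max_iter=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)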
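
One possible refinement, shown here only as a sketch on made-up data, is to bundle preprocessing and the classifier into a single scikit-learn `Pipeline`, so that imputation, scaling, and encoding are fitted on the training data and reapplied automatically at prediction time. The toy columns, the labels `"A"`/`"B"`, and the addition of `StandardScaler` are assumptions for illustration and are not part of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up features and labels standing in for the crash data
toy_X = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 50, np.nan, 15, 30, 45],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR", "CLEAR", "SNOW", "RAIN"],
})
toy_y = ["A", "B", "A", "A", "B", "B"]

preprocess = ColumnTransformer([
    # Numeric: impute the median, then standardize (helps the solver converge)
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["POSTED_SPEED_LIMIT"]),
    # Categorical: impute the mode, then one-hot encode
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["WEATHER_CONDITION"]),
])

clf = Pipeline([("preprocess", preprocess),
                ("model", LogisticRegression(max_iter=1000))])
clf.fit(toy_X, toy_y)
preds = clf.predict(toy_X)
print(preds)
```

Keeping everything in one object means the raw DataFrame can be passed straight to `fit` and `predict`, which reduces the risk of applying a transformer fitted on the wrong split.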

## Extracting coefficients and analyzing feature importance

Feature Name Construction
The code reconstructs feature names after preprocessing, including those generated through one-hot encoding for categorical variables.

Model Coefficients
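It retrieves the coefficients from the trained Logistic Regression model to analyze the contribution of each feature to the predictions. Because the target is multi-class, the model stores one row of coefficients per class, and the code below reads only the first row (model.coef_[0]), so the ranking reflects each feature's contribution to the class listed first in model.classes_.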

Feature Importance
By sorting the coefficients based on their absolute values, the code highlights the most influential features in the model, offering insights into the key factors driving the target variable.


In [16]:

feature_names = []

for col in numerical_cols:
    feature_names.append(col)

for col in categorical_cols:
    categories = preprocessor.named_transformers_['cat']['onehot'].categories_[categorical_cols.get_loc(col)]
    feature_names.extend([f"{col}_{category}" for category in categories])

coefficients = model.coef_[0]

importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})

importance_df['Abs_Coefficient'] = np.abs(importance_df['Coefficient'])
importance_df = importance_df.sort_values(by='Abs_Coefficient', ascending=False)

print(importance_df.head(20))

                                               Feature  Coefficient  \
256                       DEVICE_CONDITION_NO CONTROLS    -0.318587   
250                 TRAFFIC_CONTROL_DEVICE_NO CONTROLS    -0.312541   
452  LOCATION_POINT (-87.585945066953 41.744151639042)     0.300816   
11                                   CRASH_DAY_OF_WEEK    -0.294536   
272                             FIRST_CRASH_TYPE_ANGLE     0.285345   
8                               INJURIES_NO_INDICATION     0.254969   
283                           FIRST_CRASH_TYPE_TURNING     0.249495   
15                                      CRASH_HOUR_sin     0.237472   
319                                 DAMAGE_OVER $1,500     0.215492   
20                                     CRASH_MONTH_cos     0.212613   
12                                         CRASH_MONTH    -0.211625   
318                               DAMAGE_$501 - $1,500    -0.194453   
282          FIRST_CRASH_TYPE_SIDESWIPE SAME DIRECTION    -0.193910   
10    
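
For a multi-class target, scikit-learn's `LogisticRegression` stores one row of coefficients per class, so a single row describes only one class. The sketch below uses synthetic data (not the crash dataset) to show the shape of `coef_` and one simple way to get a per-feature score across all classes by averaging absolute coefficients. That aggregation is an illustrative choice, not the method used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the crash data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=42)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row per class, one column per feature
print(demo_model.coef_.shape)

# Average the absolute coefficients over classes for a single score per feature
overall_importance = np.abs(demo_model.coef_).mean(axis=0)
ranking = np.argsort(overall_importance)[::-1]   # feature indices, most influential first
print(ranking)
```

Coefficient magnitudes are only comparable across features when the features share a similar scale, so a ranking like this is most meaningful after standardizing the numeric inputs.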


Let's convert time-related features into a format that effectively represents their cyclic nature, which can enhance both the performance and interpretability of our machine learning models.

In [12]:
df['CRASH_HOUR_sin'] = np.sin(2 * np.pi * df['CRASH_HOUR'] / 24)
df['CRASH_HOUR_cos'] = np.cos(2 * np.pi * df['CRASH_HOUR'] / 24)

df['CRASH_DAY_OF_WEEK_sin'] = np.sin(2 * np.pi * df['CRASH_DAY_OF_WEEK'] / 7)
df['CRASH_DAY_OF_WEEK_cos'] = np.cos(2 * np.pi * df['CRASH_DAY_OF_WEEK'] / 7)

df['CRASH_MONTH_sin'] = np.sin(2 * np.pi * df['CRASH_MONTH'] / 12)
df['CRASH_MONTH_cos'] = np.cos(2 * np.pi * df['CRASH_MONTH'] / 12)
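
To see why the sine/cosine encoding helps, the short sketch below compares the distance between 11 PM and midnight as raw numbers and as points on the unit circle. The helper name `encode_hour` is made up for this example.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# As plain integers, 23:00 and 00:00 look as far apart as possible
raw_gap = abs(23 - 0)

# On the circle they are neighbours, while noon and midnight sit on opposite sides
cyclic_gap = np.linalg.norm(encode_hour(23) - encode_hour(0))
midday_gap = np.linalg.norm(encode_hour(12) - encode_hour(0))
print(raw_gap, cyclic_gap, midday_gap)
```

Both the sine and the cosine are needed: either one alone maps two different hours to the same value, whereas the pair identifies each hour uniquely.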

## Enhancing Model Predictive Power
Additional engineered features can be integrated into the raw dataset to improve the model's ability to predict the primary contributory causes of traffic accidents by incorporating crucial contextual information.

Two such features are described below.

## Weekend:
Traffic patterns during weekends often differ significantly from weekdays, influencing both the likelihood and types of crashes. Including a feature that identifies whether a crash occurred on a weekend allows the model to better distinguish between these patterns.

## Speed Weather Interaction:
The impact of speed limits on crashes can vary depending on weather conditions. For instance, high speed limits are particularly risky in adverse weather. This interaction feature enables the model to capture the relationship between speed and weather, improving predictive accuracy.

In [18]:
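# Interaction between the posted speed limit and an integer code for the weather category
df['Speed_Weather_Interaction'] = df['POSTED_SPEED_LIMIT'] * df['WEATHER_CONDITION'].factorize()[0]
# CRASH_DAY_OF_WEEK is documented with Sunday = 1, so the weekend is days 1 (Sunday) and 7 (Saturday)
df['Is_Weekend'] = df['CRASH_DAY_OF_WEEK'].isin([1, 7]).astype(int)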
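
Note that `factorize()` numbers categories by order of first appearance, so the product with the speed limit depends on which weather value happens to occur first. A hypothetical alternative, sketched on made-up rows, is to flag adverse weather explicitly and multiply that flag by the speed limit. The `ADVERSE_WEATHER` set and the new column names are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows standing in for the crash data
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 50, 30, 15],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "SNOW", "CLEAR"],
})

# factorize() assigns 0, 1, 2, ... in order of first appearance, not by severity
codes, uniques = pd.factorize(toy["WEATHER_CONDITION"])
print(list(codes), list(uniques))

# Hypothetical alternative: an explicit adverse-weather flag interacted with speed
ADVERSE_WEATHER = {"RAIN", "SNOW"}   # illustrative set, not taken from the project
toy["Is_Adverse_Weather"] = toy["WEATHER_CONDITION"].isin(ADVERSE_WEATHER).astype(int)
toy["Speed_Adverse_Weather"] = toy["POSTED_SPEED_LIMIT"] * toy["Is_Adverse_Weather"]
print(toy)
```

With this variant the feature is zero in clear weather and equal to the posted speed limit otherwise, which is easier to interpret than a product with an arbitrary category code.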

Let's examine how often each primary contributory cause occurs to understand the class distribution of the target variable.

In [19]:
cause_counts = df['PRIM_CONTRIBUTORY_CAUSE'].value_counts()
print(cause_counts)

PRIM_CONTRIBUTORY_CAUSE
UNABLE TO DETERMINE                                                                 5252
FAILING TO YIELD RIGHT-OF-WAY                                                       1689
FOLLOWING TOO CLOSELY                                                               1442
NOT APPLICABLE                                                                       763
IMPROPER OVERTAKING/PASSING                                                          744
IMPROPER LANE USAGE                                                                  638
IMPROPER BACKING                                                                     586
DRIVING SKILLS/KNOWLEDGE/EXPERIENCE                                                  538
FAILING TO REDUCE SPEED TO AVOID CRASH                                               526
IMPROPER TURNING/NO SIGNAL                                                           483
DISREGARDING TRAFFIC SIGNALS                                                         2
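
The counts above are heavily skewed toward a few causes. As a sketch of one common way to handle this, the example below converts counts to shares and merges infrequent classes into a single `OTHER` label. The labels are made-up stand-ins, and the `MIN_SHARE` cutoff of 0.2 is a hypothetical value chosen only to make the toy example work.

```python
import pandas as pd

# Made-up labels standing in for PRIM_CONTRIBUTORY_CAUSE
causes = pd.Series(["UNABLE TO DETERMINE"] * 6
                   + ["FOLLOWING TOO CLOSELY"] * 3
                   + ["IMPROPER BACKING"])

# Share of each class instead of raw counts
shares = causes.value_counts(normalize=True)
print(shares)

# Hypothetical cutoff: classes below this share are merged into OTHER
MIN_SHARE = 0.2
rare_classes = shares[shares < MIN_SHARE].index
grouped = causes.where(~causes.isin(rare_classes), "OTHER")
print(grouped.value_counts())
```

Other options for the same problem include stratified train/test splitting and the `class_weight='balanced'` setting that scikit-learn classifiers such as `LogisticRegression` accept.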