# <center>Predicting Motor Vehicle Accident Severity in Seattle, Washington</center>

## <center>Christopher Bates</center>
### <center>13 October 2020</center>

## Introduction

### Background

Personal transportation plays a critical role in the lives of nearly every resident of the United States. 
With the continued growth of suburbs and increasing commuting times, it is essential for commuters to have safe, reliable modes of transportation.
Ride-sharing services like Uber, as well as fleets of bicycles and electric scooters for rent, have certainly provided more travel options.
But these "greener" modes of transportation are not without risk.
In 2017, according to the Centers for Disease Control and Prevention, motor vehicle accidents alone accounted for 40,231 fatalities, 
not to mention many times this number of severe injuries.
Could the increased use of bicycles, scooters, and ride-sharing services actually increase the frequency of serious or fatal accidents
between motor vehicles and the more vulnerable pedestrians or cyclists?
Commuters certainly deserve to know the risks involved in their daily transit behaviors.
Machine learning technologies can quantify these risks, allowing commuters to make informed decisions based on travel conditions and traffic data.

### Problem

The goal of this project is to predict the severity of a given motor vehicle accident in Seattle, Washington, given certain data related to the accident.
Our models will classify each accident into exactly one of two categories:
<ol>
    <li>Minor Accidents</li>
    <ul>
        <li>property damage, but no injuries and no fatalities;</li>
        <li>at least one injury, but no serious injuries and no fatalities;</li>
    </ul>
    <li>Major Accidents</li>
    <ul>
        <li>at least one serious injury, but no fatalities;</li>
        <li>at least one fatality.</li>
    </ul>
 </ol>
Each model will receive data regarding a set of motor vehicle accidents.
For each incident, the model will take into account the weather, road, and lighting conditions at the time and location of the motor vehicle accident.
Input to the model will also include the number of people involved in the accident and the nature of their involvement,
as well as the number and types of vehicles and the nature of their involvement.

### Interest

When an accident is initially reported to first responders, it can be helpful to have an accurate prediction of the severity of the accident.
Such an "accident severity forecast" could allow emergency rooms and hospital staff to allocate their resources more efficiently.
Furthermore, these accident severity forecasts can enable city planners, engineers, and architects to design safer transit systems
and roadways by identifying locations with a higher frequency of accidents involving serious injuries or fatalities.

With the advent of autonomous vehicles, it is possible for accident severity models to aid in the selection of safer routes based on weather and traffic conditions. 
In the event an autonomous vehicle is involved in an accident, the onboard system could communicate data to first responders and local hospitals,
allowing them to formulate an efficient, effective response driven by accident severity models.

## Data Acquisition and Cleaning

### Data Sources

All of the data used in the training, testing, and validation of our machine learning models was obtained from the 
[City of Seattle's Open Data Portal](https://data-seattlecitygis.opendata.arcgis.com/datasets/collisions) free of charge. 
According to the website, the data set includes all types of collisions from 2004 through the present.
Because the data set available on this website may change over time,
we downloaded the collisions data set once and performed all training, testing, and validation with the same locally stored data set.

### Data Cleaning

We went through several iterations of cleaning the data and looking for trends. 
The original data set was in the form of a .csv file, which we imported as a data frame. 
Initially, before any data cleaning, the data frame consisted of 221143 rows and 40 columns.
There were 5 columns of floating point type, 12 columns of integer type, and 23 columns of object type.
Using the [Attributes Information Form](https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf) located on the same website as the collisions .csv file,
we removed any columns corresponding to unique identification keys or codes that would offer no benefit in the context of machine learning.
We also removed any columns deemed redundant in light of combinations of other columns that could yield the same information.
Using the Attributes Information Form, we also identified any "sentinel" values, which actually indicate that a value is "unknown," and converted those values to NaN.
We then dropped any column with more than 15% NaN because there was no reasonable way to assign a value to such fields.
Finally, we dropped any row that contained at least one NaN.
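
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code used in the project: the function name `clean_collisions`, its parameters, and the example column names and sentinel values are assumptions made for the sketch, and the real lists would come from the Attributes Information Form.

```python
import pandas as pd

def clean_collisions(df, sentinels, drop_columns=(), max_nan_fraction=0.15):
    """Sketch of the cleaning pipeline (hypothetical helper, not the project's code).

    sentinels: dict mapping a column name to the values that mean "unknown".
    drop_columns: identifier or redundant columns to remove up front.
    """
    # Remove identifier/code columns and redundant columns.
    df = df.drop(columns=list(drop_columns), errors="ignore")
    # Convert sentinel values to NaN, column by column.
    for column, values in sentinels.items():
        if column in df.columns:
            df[column] = df[column].mask(df[column].isin(list(values)))
    # Keep only columns whose fraction of NaN does not exceed the threshold.
    nan_fraction = df.isna().mean()
    df = df.loc[:, nan_fraction <= max_nan_fraction]
    # Drop every row that still contains at least one NaN.
    return df.dropna(axis=0, how="any").reset_index(drop=True)
```

A call might look like `clean_collisions(df, {"WEATHER": ["Unknown"]}, drop_columns=["OBJECTID"])`, where the column names and the sentinel string are placeholders.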

### Feature Selection

In selecting the features for training and testing the models, we went through several iterations to determine the most important features for this problem.
Since the goal is to predict accident severity, it makes sense to include features related to weather, road, and light conditions.
The number of pedestrians, bicycle riders, and vehicles involved also seems like reasonable information to use in the model.
In the real world, this information could be gathered at the scene of an accident.
While the original data set contains information representing the longitude, latitude, and date-time of the accidents,
these types of data require more processing and feature engineering to convert them into a form that is useful for machine learning applications.
In our first attempt at predicting accident severity, we will take a simpler approach and reserve the spatial and temporal data for future investigation.
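
As a rough sketch of this simpler approach, the selected features could be pulled from the cleaned data frame and the categorical ones one-hot encoded. The column names below (`WEATHER`, `ROADCOND`, `LIGHTCOND`, `PEDCOUNT`, `PEDCYLCOUNT`, `VEHCOUNT`) and the helper `build_feature_matrix` are assumptions for illustration; the final feature list used in the project may differ.

```python
import pandas as pd

# Assumed column names; verify against the Attributes Information Form.
CATEGORICAL_FEATURES = ["WEATHER", "ROADCOND", "LIGHTCOND"]
COUNT_FEATURES = ["PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT"]

def build_feature_matrix(df):
    """Select the chosen features and one-hot encode the categorical ones."""
    selected = df[COUNT_FEATURES + CATEGORICAL_FEATURES]
    # Each category becomes its own indicator column, e.g. WEATHER_Clear.
    return pd.get_dummies(selected, columns=CATEGORICAL_FEATURES)
```

Spatial and temporal columns are deliberately left out here, in line with the decision to reserve them for future investigation.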

## Exploratory Data Analysis

### Target Variable - Values and Distribution

The target variable to be predicted by our models is the severity of a given motor vehicle accident.
In the original data set, the severity of each accident is represented by the "SEVERITYCODE" column, taking one of the following values:
<ul>
    <li><b>1</b> - property damage, but no injuries and no fatalities;</li>
    <li><b>2</b> - at least one injury, but no serious injuries and no fatalities;</li>
    <li><b>2b</b> - at least one serious injury, but no fatalities;</li>
    <li><b>3</b> -  at least one fatality.</li>
</ul>
To sharpen the difference between the features of the accidents in the data set, we created a binary classification scheme.
SEVERITYCODE values 1 and 2 are consolidated into a single class designated "Minor Severity,"
and SEVERITYCODE values 2b and 3 are grouped into a class called "Major Severity."
Based on this consolidation of severity, we created a new boolean variable IS_SEVERE;
it has the value True if an accident has Major Severity, and False if the accident has Minor Severity.
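
The consolidation described above could be expressed along these lines. This is a sketch that assumes the codes are stored as strings or integers such as `"1"`, `"2"`, `"2b"`, and `3`; the helper name `add_is_severe` is hypothetical.

```python
import pandas as pd

# SEVERITYCODE values that map to "Major Severity".
MAJOR_CODES = {"2b", "3"}

def add_is_severe(df, code_column="SEVERITYCODE"):
    """Return a copy of df with a boolean IS_SEVERE column."""
    # Normalize to stripped strings so that 3 and "3" are treated alike.
    codes = df[code_column].astype(str).str.strip()
    return df.assign(IS_SEVERE=codes.isin(MAJOR_CODES))
```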

After cleaning the data, and prior to splitting it into training and testing sets, the motor vehicle accidents had the following distribution of labels according to our new binary classification scheme:

```python
df_categorical['IS_SEVERE'].value_counts(normalize=True, dropna=False)
```

Initially, we split the data into two disjoint, imbalanced sets:
<ul>
    <li>a set for generating a balanced training set;</li>
    <li>a set to be used for testing and comparing the models.</li>
</ul>
The classes for the values of column IS_SEVERE are imbalanced, skewing greatly in favor of minor accidents.
In order to generate a balanced training set, we used oversampling with replacement for the minority class (value is True in column IS_SEVERE),
along with undersampling without replacement for the majority class (value is False in column IS_SEVERE).
Oversampling with replacement occurred when the number of samples taken exceeded the number of accidents available for sampling;
conversely, undersampling without replacement occurred if the number of samples taken was less than the number of accidents available for sampling.
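
One way to sketch the split and the balanced resampling is shown below. The helper names, the test fraction, the per-class sample size, and the seed are all assumptions chosen for illustration; the values actually used in the project are not stated here. The sketch also assumes the data frame has a unique index.

```python
import pandas as pd

def split_disjoint(df, test_fraction=0.2, seed=0):
    """Split df into a training pool and a disjoint test set (both imbalanced)."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train_pool = df.drop(index=test.index)
    return train_pool, test

def balanced_sample(df, label="IS_SEVERE", n_per_class=1000, seed=0):
    """Draw n_per_class rows from each class to form a balanced training set."""
    parts = []
    for _, group in df.groupby(label):
        # Oversample with replacement only when the class has too few rows;
        # otherwise undersample without replacement.
        with_replacement = n_per_class > len(group)
        parts.append(
            group.sample(n=n_per_class, replace=with_replacement, random_state=seed)
        )
    # Shuffle so that the classes are interleaved.
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)
```

Only the training pool is rebalanced; the test set keeps its original class proportions so that model comparisons reflect the real distribution of accidents.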

### Relationship between Severity and Weather Conditions

### Relationship between Severity and Road Conditions

### Relationship between Severity and Lighting Conditions

### Relationship between Severity and Collision Type

## Classification Models

### Applying Standard Algorithms

### Issues with the Standard Algorithms

### Resolving the Algorithm Issues

### Performance Comparison of the Algorithms

## Conclusions

## Future Investigations

All of the models in this project were designed to solve a binary classification problem: classify the severity of a given motor vehicle accident as minor or major.
An extension of this problem is to predict the number of non-serious injuries, the number of serious injuries, and the number of fatalities for a given accident.
Because an accident may involve independent combinations of people from these three categories, this new problem is a multilabel problem.
At the same time, because the number of individuals in each category is integer-valued, it is also a regression problem.
Therefore, this new problem requires the construction of three regression models, one for each category of victim.

An additional problem brought to light by the Seattle Collisions Dataset is the question of how accident severity depends on location and time data.
Since the dataset contains attributes representing the longitude and latitude for each incident, as well as the date and time,
it is reasonable to incorporate this data into future model development.
These data will perhaps reveal certain hours of the day or parts of the year that are more dangerous compared to other times.
Traffic engineers would likely be interested in identifying specific locations and times, e.g. certain intersections, showing increased likelihood of fatal or serious accidents, so they can mitigate risk by modifying traffic patterns or infrastructure.
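
As a possible starting point for that future work, simple temporal features could be derived from the incident timestamp. The sketch below assumes a date-time column named `INCDTTM` in a consistent, parseable format; the column name, the helper `add_time_features`, and the new feature names are illustrative assumptions only.

```python
import pandas as pd

def add_time_features(df, datetime_column="INCDTTM"):
    """Return a copy of df with hour, day-of-week, and month columns."""
    # Unparseable timestamps become NaT instead of raising an error.
    timestamps = pd.to_datetime(df[datetime_column], errors="coerce")
    return df.assign(
        HOUR=timestamps.dt.hour,
        DAY_OF_WEEK=timestamps.dt.dayofweek,  # Monday is 0, Sunday is 6
        MONTH=timestamps.dt.month,
    )
```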

## Acknowledgements

In completing this Data Science Capstone Project, I am eternally grateful to my wife, Linda Garica, for her invaluable advice and infinite patience.

I am also indebted to Mike Jennings, Captain (Ret.), Oceanside Fire Department, Oceanside, CA. 
Mike was gracious enough to share his experiences as a first responder, as well as his professional insight and enthusiasm.