### Applied Data Science Capstone Project Report
Christopher Riper - October 2020

#### Introduction: The Problem & Background

*Is it possible to warn you, given the weather and road conditions, how severe a car accident you could be in?  Would it make you drive more carefully or possibly change your travel timing or route?*

The purpose of this project is to attempt to predict the severity of a car accident given certain conditions. 

There are several potential applications for this prediction model. As noted above, understanding the potential severity of an accident given certain weather, road and other conditions, before you travel, could influence your travel timing or route.  Insurance companies could incentivize drivers to avoid traveling when conditions tend to lead to more severe accidents, or make preliminary estimates of their monetary exposure in an accident before physical inspections by insurance adjusters, medical evaluations, etc.

A machine learning model will be used to attempt to solve this problem: predicting the severity of a car accident given certain conditions.

The following process will be followed:

- Data will be collected, then explored and visualized to understand the attributes that will be used to train the machine learning model.

- That data will then be prepared for modeling through labeling, transformation, filling missing data, etc.

- Various algorithms and methods will be selected and applied to build the model, including supervised machine learning techniques.

- The model will be evaluated to ensure the objective is achieved.

- Finally, a report will be developed describing the process and results.

#### The Data: Overview

This project will utilize Seattle, WA vehicle accident data from January 1, 2004 through May 20, 2020. Each record in this dataset represents one vehicle accident during this time period. 

Each accident has a "severity" rating, which is a code that corresponds to the severity of the collision.  Within the dataset provided there are two severity ratings:
- 1 Property Damage Only Collision
- 2 Injury Collision 

This differs from the metadata accompanying the dataset, which lists five severity ratings. The data provided may therefore be a subset of the complete dataset:
- 3 fatality
- 2b serious injury
- 2 injury
- 1 prop damage
- 0 unknown 

Other attributes for each accident are included, such as incident location; weather, light and road conditions at the time of the incident; the number of vehicles involved; whether a pedestrian was involved; etc.

For example, the weather attribute has the following values:
- Clear
- Raining
- Overcast
- Unknown
- Snowing
- Other
- Fog/Smog/Smoke
- Sleet/Hail/Freezing Rain
- Blowing Sand/Dirt
- Severe Crosswind
- Partly Cloudy

And the road condition attribute has the following values:
- Dry
- Wet
- Unknown
- Ice
- Snow/Slush
- Other
- Standing Water
- Sand/Mud/Dirt
- Oil

Complete metadata for this dataset is provided <a href='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf'>here</a>.

The purpose of this exercise is to develop a model which will predict the "severity" of an accident (the target/dependent variable) based upon other characteristics surrounding the collision (the independent variables).

#### Methodology

The data exploration phase revealed numerous accident attributes that could be considered for the predictive machine learning model.  The dependent variable (SEVERITYCODE) is not perfectly balanced, but it is not so skewed that rebalancing seemed necessary up front; I will rebalance only if the imbalance causes issues with the model.
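A quick way to check the balance of the dependent variable is to look at the share of each class. The sketch below is illustrative only: it uses a tiny made-up frame rather than the real collision data, and the counts in it are not the report's actual figures.

```python
import pandas as pd

# Toy stand-in for the collision data (hypothetical counts, not the real ones).
df = pd.DataFrame({"SEVERITYCODE": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]})

# Fraction of records in each severity class, largest class first.
class_share = df["SEVERITYCODE"].value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```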

Not all of these incident attributes will be used in the model. I removed several features from the dataset which were duplicative or did not appear to have any predictive value:
* OBJECTID
* INCKEY
* COLDETKEY
* REPORTNO
* STATUS
* EXCEPTRSNCODE
* EXCEPTRSNDESC
* SEVERITYCODE.1
* SDOTCOLNUM
* ST_COLCODE
* ST_COLDESC
* SEGLANEKEY
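Removing these columns could look like the following minimal sketch. The column names come from the list above; the `drop_unused_columns` helper and the sample frame are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Columns listed above as duplicative or without apparent predictive value.
DROP_COLUMNS = [
    "OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO", "STATUS",
    "EXCEPTRSNCODE", "EXCEPTRSNDESC", "SEVERITYCODE.1",
    "SDOTCOLNUM", "ST_COLCODE", "ST_COLDESC", "SEGLANEKEY",
]

def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns in DROP_COLUMNS."""
    # errors="ignore" skips any listed name that is not present in df.
    return df.drop(columns=DROP_COLUMNS, errors="ignore")

# Tiny made-up frame for illustration.
sample = pd.DataFrame({
    "OBJECTID": [1, 2],
    "REPORTNO": ["a", "b"],
    "WEATHER": ["Clear", "Raining"],
    "SEVERITYCODE": [1, 2],
})
trimmed = drop_unused_columns(sample)
print(trimmed.columns.tolist())
```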

To make the prediction I will use data points that would likely be known to a driver before leaving on a trip: weather conditions (WEATHER), road conditions (ROADCOND), and light conditions (LIGHTCOND).  Time of day (hour) could also be used; however, the data exploration phase showed that many records do not specify the time of day, and light conditions are likely a better indicator.

Each of the three attributes selected for the model has 8+ values, several of which appear infrequently. To simplify the model, these values were aggregated into broader groups.  The following features were then used to predict SEVERITYCODE:
* Weather_Bad         
* Weather_Clear       
* Weather_Impaired    
* Road_Dry            
* Road_Impaired       
* Road_Wet            
* Light_DawnDusk      
* Light_Daylight      
* Light_Impaired      
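The aggregation can be sketched as a mapping from raw values to groups followed by one-hot encoding. The report does not state which raw values went into which group, so the `WEATHER_GROUPS` and `ROAD_GROUPS` mappings below are guesses for illustration only, as is the `encode_conditions` helper. Light conditions would be handled the same way and are omitted here.

```python
import pandas as pd

# Hypothetical groupings: the actual assignment of raw values is an assumption.
WEATHER_GROUPS = {
    "Clear": "Clear",
    "Partly Cloudy": "Clear",
    "Overcast": "Impaired",
    "Fog/Smog/Smoke": "Impaired",
    "Blowing Sand/Dirt": "Impaired",
    "Severe Crosswind": "Impaired",
    "Raining": "Bad",
    "Snowing": "Bad",
    "Sleet/Hail/Freezing Rain": "Bad",
}
ROAD_GROUPS = {
    "Dry": "Dry",
    "Wet": "Wet",
    "Standing Water": "Impaired",
    "Ice": "Impaired",
    "Snow/Slush": "Impaired",
    "Sand/Mud/Dirt": "Impaired",
    "Oil": "Impaired",
}

def encode_conditions(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse raw categories into groups, then one-hot encode them."""
    grouped = pd.DataFrame({
        "Weather": df["WEATHER"].map(WEATHER_GROUPS),
        "Road": df["ROADCOND"].map(ROAD_GROUPS),
    })
    # Values absent from a mapping (e.g. "Unknown", "Other") become NaN
    # and therefore get 0 in every indicator column.
    return pd.get_dummies(grouped, dtype=int)

sample = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Overcast", "Unknown"],
    "ROADCOND": ["Dry", "Wet", "Ice", "Unknown"],
})
features = encode_conditions(sample)
print(features)
```

Leaving "Unknown" and "Other" unmapped is one design choice among several; dropping those rows or giving them their own group would be equally defensible.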

I will attempt to make a prediction using two classification models: decision tree and logistic regression.  

The data will then be split into training and test datasets, and the models will be evaluated on the test data. I will also make minor adjustments, changing the training/testing split and adding a feature (PERSONCOUNT), to see whether they change the accuracy of the model.
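The four experiments described above can be sketched with scikit-learn as follows. This is not the report's actual code: the data is synthetic, only three indicator features are included for brevity, and the hyperparameters, the `random_state` values, and the `evaluate` helper are assumptions, so the scores it prints will not match the report's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: random indicators and a roughly 70/30 target.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame(
    rng.integers(0, 2, size=(n, 3)),
    columns=["Weather_Bad", "Road_Wet", "Light_Daylight"],
)
X_plus = X.assign(PERSONCOUNT=rng.integers(1, 6, size=n))
y = pd.Series(np.where(rng.random(n) < 0.3, 2, 1), name="SEVERITYCODE")

def evaluate(model, features, target, test_size):
    """Fit on the training split and return accuracy on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=test_size, random_state=42
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

scores = {
    "decision_tree_70_30": evaluate(DecisionTreeClassifier(random_state=42), X, y, 0.3),
    "logreg_70_30": evaluate(LogisticRegression(), X, y, 0.3),
    "logreg_80_20": evaluate(LogisticRegression(), X, y, 0.2),
    "logreg_80_20_personcount": evaluate(LogisticRegression(), X_plus, y, 0.2),
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```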

#### Results

I built four models, each with very similar results: 

1. Decision Tree - This model used a 70%/30% split between training/testing data.  The resulting accuracy score was 0.6994109790760591. 
2. Logistic Regression 1 - This model used the same 70%/30% split between training/testing data as the Decision Tree.  The resulting accuracy score was 0.6994109790760591.
3. Logistic Regression 2 - This model was similar to Logistic Regression 1 but the data was split 80%/20% for training/testing. The resulting accuracy score was 0.6992423269551817.
4. Logistic Regression 3 - This model added a feature, PERSONCOUNT (the number of persons involved in the accident). Data was split 80%/20% for training/testing. The resulting accuracy score was 0.7006806215487351.
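When accuracy scores are this close to one another, it can help to compare them with a trivial baseline that always predicts the most frequent class. The report does not say whether such a comparison was made, so the sketch below uses made-up labels and a hypothetical model output purely to show the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 70% class 1, 30% class 2 in both splits.
y_train = np.array([1] * 70 + [2] * 30)
y_test = np.array([1] * 7 + [2] * 3)

# Hypothetical model output that predicts class 1 for every record.
y_pred = np.array([1] * 10)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train), 1)), y_train)
baseline_pred = baseline.predict(np.zeros((len(y_test), 1)))

model_acc = accuracy_score(y_test, y_pred)
baseline_acc = accuracy_score(y_test, baseline_pred)

# Rows are true classes (1, 2); columns are predicted classes (1, 2).
cm = confusion_matrix(y_test, y_pred, labels=[1, 2])

print(f"model accuracy:    {model_acc:.2f}")
print(f"baseline accuracy: {baseline_acc:.2f}")
print(cm)
```

If a model's accuracy equals the baseline and the second column of the confusion matrix is all zeros, the model is not distinguishing between the classes, and metrics such as per-class recall would be more informative than accuracy alone.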


#### Discussion

For this project I thought it would be interesting to experiment by comparing two different models and by making a few changes to the size of the training data and the features used, to see the impact.  I was surprised by how similar the accuracy scores were across the models, even when another feature was added to the final model.

There is still room for improvement. I could have further explored balancing the dataset before developing my prediction models.  There are additional features within the dataset which could be predictive in their current form or through feature engineering. Other prediction models could also be tested.
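One simple way to balance the dataset is to downsample the larger class to the size of the smaller one. The sketch below is a minimal illustration on a made-up frame; the `downsample_to_minority` helper is hypothetical, and other approaches (oversampling, class weights) were not explored in this report either.

```python
import pandas as pd

# Toy frame with a 7-to-3 class imbalance (made-up values).
df = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "Road_Wet": [0, 1, 0, 0, 1, 0, 1, 1, 1, 0],
})

def downsample_to_minority(df, target="SEVERITYCODE", seed=0):
    """Randomly keep as many rows per class as the smallest class has."""
    smallest = df[target].value_counts().min()
    return df.groupby(target).sample(n=smallest, random_state=seed)

balanced = downsample_to_minority(df)
print(balanced["SEVERITYCODE"].value_counts())
```

In practice this would normally be applied to the training split only, so that the test data keeps the original class distribution.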

Finally, although predicting the severity of an accident is an interesting problem, it might be more useful to attempt to predict **whether an accident will occur**, which would require additional data.

#### Conclusion

I was able to create a model that predicts the severity of an accident in Seattle, WA with approximately 70% accuracy, using road conditions, weather conditions, light conditions and the number of persons involved in the accident. This could be useful for people driving in the area who want to understand their risk should they be in an accident, or for insurance companies attempting to make an initial accident assessment.