# Codeup Individual Project: Predicting Vehicle Accident Severity

The purpose of this project is to analyze the selected dataset, answer questions regarding the data, and develop a machine learning model to predict the severity of an accident based on human and environmental circumstances. I obtained the dataset for this project from https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents?resource=download.

I am using this dataset for academic purposes only.

Initial Questions:

- What road conditions are most likely to result in an accident?
- What time of day are accidents most likely to occur? What time of year are accidents most likely to occur?
- Are there specific areas that are prone to crashes?

Questions regarding time:

- Has the number of accidents increased overall between 2016 and 2021?
- Has the severity of accidents changed between 2016 and 2021?

## Project Utility

- Predicting accident severity based on environmental conditions and road features can be useful for first responders, drivers, and rideshare companies. Accurate predictions can help first responders gauge the amount of services and emergency aid needed based on the most commonly required responses for each level of severity. Drivers can get accurate updates on how long traffic will be delayed and if alternate routes are needed. Future utility includes providing warnings to drivers and first responders of potential accident locations and severity based on current environmental and road conditions.


## Executive Summary

- The dataset was downsampled using random sampling due to an imbalance in the target variable. I split the downsampled data into train, validate, and test using a 60/20/20 split stratified on severity. The total number of observations after removing nulls and outliers and downsampling was 191,685.
- The selected model is a random forest classifier with a depth of 16 and minimum sample leaf size of 35. I selected 23 features for the final model based on visualizations and statistical tests. I used a random_seed of 217 for reproducibility. The baseline prediction for the training set was .34. The model performed above baseline accuracy at .71 on train and .69 on validate, indicating that the model was not overfit. The model scored .69 on the test set as well. The model was 34 percentage points more accurate than baseline on the validate and test sets.


## Acquisition and Preparation

- Acquire the dataset from Kaggle and save to a local csv
- Prepare the data with the intent to discover the main predictors of crash severity; clean the data and encode categorical features if necessary; ensure that the data is tidy
- Write functions to wrangle the data and save to wrangle.py

In [1]:
# required imports for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import explore
import wrangle
import model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer
import warnings
warnings.filterwarnings("ignore")

SyntaxError: invalid syntax (explore.py, line 37)

In [None]:
# acquire the dataset using wrangle.py file
df = wrangle.wrangle_data()

In [None]:
# verify dataset was wrangled successfully
df.head(3)

In [None]:
# downsample severity level 2 to balance the dataset, drop severity level 1
df = wrangle.downsample_data(df)
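
The downsampling step can be sketched roughly as follows. This is a minimal illustration on toy data, not the contents of wrangle.py: the function name `downsample_majority` and its parameters are hypothetical, and only the sample size of 65,000, the seed of 217, and the dropped severity level 1 come from this notebook.

```python
import pandas as pd

def downsample_majority(df, target="severity", majority=2, n=65_000, drop=(1,), seed=217):
    """Hypothetical stand-in for wrangle.downsample_data (an assumption, not the original)."""
    # remove the classes that are excluded from the analysis
    kept = df[~df[target].isin(list(drop))]
    # separate the majority class from the remaining classes
    major = kept[kept[target] == majority]
    minor = kept[kept[target] != majority]
    # randomly sample the majority class; the seed keeps the sample reproducible
    sampled = major.sample(n=min(n, len(major)), random_state=seed)
    # recombine the sampled majority class with the untouched minority classes
    return pd.concat([sampled, minor]).sort_index()

# toy data: 5 level-1, 100 level-2, 10 level-3, and 8 level-4 accidents
toy = pd.DataFrame({"severity": [1] * 5 + [2] * 100 + [3] * 10 + [4] * 8})
balanced = downsample_majority(toy, n=12)
print(balanced.severity.value_counts().sort_index())
```

Because the sample is drawn without regard to any feature, the retained level-2 rows are only representative of the majority class on average.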

In [None]:
# create a new column for the year the accident occurred
df['year'] = df.start_time.dt.year
# create a new column for the month the accident occurred
df['month'] = df.start_time.dt.month
# create a new column for the day of the month the accident occurred
df['day'] = df.start_time.dt.day
# create a new column for the hour of the day the accident occurred
df['hour'] = df.start_time.dt.hour

In [None]:
# verify that all columns are present and there are no null values
df.info(show_counts=True)

### Acquisition and Preparation Takeaways

- Four columns have been dropped: number, country, airport_code, and turning_loop. These columns are not necessary or useful for analysis at this time.
- All observations with null values were dropped.
- Outliers for wind_speed and wind_chill were dropped, and total_time was limited to accidents with a duration of one day or less.
- The final dataset has 191,685 observations. The target variable has been reduced from 4 categories to 3. One category was dropped because the observations all occurred within a 9-month span during the beginning of the Covid-19 pandemic.
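
As a rough sketch of the null and duration filters described above, the example below uses toy data. It assumes an `end_time` column exists and that `total_time` is the difference between `end_time` and `start_time`; both are assumptions, since the actual logic lives in wrangle.py.

```python
import pandas as pd

# toy data; end_time and the way total_time is derived are assumptions
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-01-01 08:00", "2020-01-01 09:00", "2020-01-02 10:00"]),
    "end_time": pd.to_datetime(["2020-01-01 08:45", "2020-01-03 12:00", "2020-01-02 11:30"]),
    "wind_speed": [5.0, 7.0, None],
})

# duration of each accident as a timedelta
toy["total_time"] = toy["end_time"] - toy["start_time"]

# drop every observation that contains a null value
no_nulls = toy.dropna()

# keep only accidents that lasted one day or less
within_day = no_nulls[no_nulls["total_time"] <= pd.Timedelta(days=1)]
print(within_day)
```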

## Exploratory Data Analysis

- Explore the data:
    - Univariate, bivariate, and multivariate analyses; statistical tests for significance, find the three primary features affecting crash severity; use distance, precipitation, and visibility for the first model
- Create graphical representations of the analyses
- Answer initial questions

In [None]:
# histograms of each feature and the target variable
df.hist(figsize=[26,20])
plt.show()

In [None]:
# split the dataset using a 60/20/20 split, stratified on the target variable
train, validate, test = explore.split_data(df, 'severity')

In [None]:
# verify the data was split correctly
train.shape, validate.shape, test.shape
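
A 60/20/20 stratified split like the one above can be sketched with two calls to `train_test_split`. The helper name `split_60_20_20` is hypothetical and the data is a toy frame; the real implementation is in explore.py and may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_60_20_20(df, target, seed=217):
    """Hypothetical sketch of a 60/20/20 split stratified on the target."""
    # hold out 20% of the rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # 25% of the remaining 80% is 20% of the original data
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test

# toy data with three equally sized severity classes
toy = pd.DataFrame({"severity": [2, 3, 4] * 100, "x": range(300)})
tr, va, te = split_60_20_20(toy, "severity")
print(tr.shape, va.shape, te.shape)
```

Stratifying both calls keeps the class proportions of the target roughly equal across all three subsets.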

#### Are accidents more likely to occur during the day or at night?

In [None]:
# plot graphs for the relationship between day/night and accident severity
explore.plot_day_night(train)

Statistical test for independence between severity and sunrise_sunset:
- H0: There is no association between the severity of an accident and whether it is day or night.
- Ha: There is an association between the severity of an accident and whether it is day or night.

In [None]:
# conduct a chi2 test of independence for sunrise_sunset and severity
explore.stat_chi2(train.severity, train.sunrise_sunset)
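
For reference, a chi-squared test of independence can be sketched with `pd.crosstab` and `scipy.stats.chi2_contingency`. The helper `chi2_independence`, the alpha of 0.05, and the toy data are assumptions for illustration; `explore.stat_chi2` may be written differently.

```python
import pandas as pd
from scipy import stats

def chi2_independence(a, b, alpha=0.05):
    """Hypothetical sketch: chi-squared test of independence for two categorical series."""
    # contingency table of observed counts
    observed = pd.crosstab(a, b)
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    # reject the null hypothesis when p is below alpha
    return chi2, p, dof, p < alpha

# toy data with a deliberately strong association between severity and day/night
severity = pd.Series([2] * 50 + [3] * 50 + [4] * 50, name="severity")
day_night = pd.Series(
    ["Day"] * 45 + ["Night"] * 5
    + ["Day"] * 25 + ["Night"] * 25
    + ["Day"] * 5 + ["Night"] * 45,
    name="sunrise_sunset",
)
chi2, p, dof, reject_null = chi2_independence(severity, day_night)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}, reject H0: {reject_null}")
```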

- Crashes occur more often during the day, but there is generally more traffic during daytime hours. An interesting finding is that accidents occur more frequently during nighttime as defined by the sunrise_sunset angle than as defined by the other angles. This may be because sunlight is still visible when the sun reaches the sunrise_sunset angle, but visibility is reduced during this particular time period.

#### Are there specific areas that are prone to crashes?

In [None]:
# plot the relationships between individual road features and accident severity
explore.countplot_data(train)

- Most of the accidents occurred when these particular features were NOT present, but there are certain features, such as traffic signals, crossings, and junctions, where accidents occur more often.

#### What road and environmental conditions are most likely to result in an accident?

In [None]:
# plot the relationships between environmental conditions and accident severity
explore.barplot_data(train)

- Severe accidents tend to cover more distance than moderate crashes. Precipitation seems to have an effect on the severity of a crash. Wind speed also appears to play a role in crash severity.

Statistical test for severity and distance:
- H0: There is no mean difference of accident distance between the three severity categories.
- Ha: There is a mean difference of accident distance between the three severity categories.

In [None]:
# create variables for distance based on severity levels
sev2_dist = train[train.severity==2]['distance']
sev3_dist = train[train.severity==3]['distance']
sev4_dist = train[train.severity==4]['distance']

In [None]:
# test for equal variance between severity 2 distance and severity 3 distance
explore.stat_levene(sev2_dist, sev3_dist)

In [None]:
# test for equal variance between severity 2 distance and severity 4 distance
explore.stat_levene(sev2_dist, sev4_dist)

In [None]:
# test for equal variance between severity 3 distance and severity 4 distance
explore.stat_levene(sev3_dist, sev4_dist)

In [None]:
# use the Kruskal-Wallis one-way analysis of variance for nonparametric data
explore.stat_kruskal(sev2_dist, sev3_dist, sev4_dist)
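
The pattern used in these tests (check for equal variance with Levene's test, then fall back to Kruskal-Wallis) can be sketched directly with SciPy. The data below is synthetic and the group names are hypothetical; the `explore.stat_levene` and `explore.stat_kruskal` wrappers may report results differently. Note that Kruskal-Wallis is rank-based, so it compares the distributions of the groups rather than their means.

```python
import numpy as np
from scipy import stats

# synthetic, right-skewed "distance" values for three severity groups (illustrative only)
rng = np.random.default_rng(217)
g2 = rng.exponential(0.3, 500)
g3 = rng.exponential(0.5, 500)
g4 = rng.exponential(1.0, 500)

# Levene's test: a small p-value suggests the two groups do not share a variance
lev_stat, lev_p = stats.levene(g2, g4)

# Kruskal-Wallis H-test: a rank-based alternative to one-way ANOVA
h_stat, kw_p = stats.kruskal(g2, g3, g4)

print(f"Levene p={lev_p:.4g}, Kruskal-Wallis p={kw_p:.4g}")
```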

Statistical test for severity and precipitation:
- H0: There is no mean difference in precipitation between the three severity categories.
- Ha: There is a mean difference in precipitation between the three severity categories.

In [None]:
# create variables for precipitation for each severity level
sev2_rain = train[train.severity==2]['precipitation']
sev3_rain = train[train.severity==3]['precipitation']
sev4_rain = train[train.severity==4]['precipitation']

In [None]:
# test for equal variance between severity 2 precipitation and severity 3 precipitation
explore.stat_levene(sev2_rain, sev3_rain)

In [None]:
# test for equal variance between severity 2 precipitation and severity 4 precipitation
explore.stat_levene(sev2_rain, sev4_rain)

In [None]:
# test for equal variance between severity 4 precipitation and severity 3 precipitation
explore.stat_levene(sev4_rain, sev3_rain)

In [None]:
# use the Kruskal-Wallis one-way analysis of variance for nonparametric data
explore.stat_kruskal(sev2_rain, sev3_rain, sev4_rain)

Statistical test for severity and visibility:
- H0: There is no mean difference in visibility between the three severity categories.
- Ha: There is a mean difference in visibility between the three severity categories.

In [None]:
# create variables for visibility for each severity level
sev2_vis = train[train.severity==2]['visibility']
sev3_vis = train[train.severity==3]['visibility']
sev4_vis = train[train.severity==4]['visibility']

In [None]:
# test for equal variance between severity 2 visibility and severity 3 visibility
explore.stat_levene(sev2_vis, sev3_vis)

In [None]:
# test for equal variance between severity 2 visibility and severity 4 visibility
explore.stat_levene(sev2_vis, sev4_vis)

In [None]:
# test for equal variance between severity 4 visibility and severity 3 visibility
explore.stat_levene(sev4_vis, sev3_vis)

In [None]:
# use the Kruskal-Wallis one-way analysis of variance for nonparametric data
explore.stat_kruskal(sev2_vis, sev3_vis, sev4_vis)

#### What time of day are accidents most likely to occur? What time of year are accidents most likely to occur?

In [None]:
# plot number of accidents based on different measurements of time
explore.plot_time_data(train)

- Accidents increase from April to June, and increase again in December. This coincides with periods where children are out of school and families are traveling. These months tend to have more precipitation in certain regions as well.
- Accidents appear to occur most frequently during afternoon rush-hour traffic, when most people are traveling home from work or school.
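
The counts behind plots like these can be sketched with the pandas `.dt` accessor. The timestamps below are toy values, and the 2pm to 6pm window is taken from the takeaways in this notebook; the plotting code itself is in explore.py and is not reproduced here.

```python
import pandas as pd

# toy timestamps (illustrative only)
toy = pd.DataFrame({"start_time": pd.to_datetime([
    "2019-04-03 16:10", "2019-05-17 17:45", "2019-06-21 15:05",
    "2019-12-24 17:20", "2019-09-09 08:30",
])})

# number of accidents per calendar month and per hour of the day
by_month = toy.start_time.dt.month.value_counts().sort_index()
by_hour = toy.start_time.dt.hour.value_counts().sort_index()

# accidents starting between 2pm and 5:59pm (hours 14 through 17, inclusive)
afternoon = toy.start_time.dt.hour.between(14, 17).sum()

print(by_month, by_hour, afternoon, sep="\n")
```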

### Exploration Takeaways

Initial exploration: 

- The target variable was imbalanced, as most accidents were classified as severity level 2. As a result, a baseline that always predicts the mode would be 93 percent accurate. To balance the dataset, I took a random sample of 65,000 level-2 severity accidents from the total dataset using random_seed=217 for reproducibility. I concatenated this sample with the total observations from the other severity classes into a new dataframe. This sampling did not take into account any features, so important data about key features of a crash may have been lost.

Time exploration:

- When resampling for start_time, additional visualization indicated that more accidents occur in April through June, and between the hours of 2pm and 6pm. Perhaps using start time as a feature will improve the models' performance. The number of accidents has increased year over year, but this may be due to improved data collection and digitized accident information over the years. 

Statistical exploration:

- Statistical testing using a Kruskal-Wallis one-way analysis of variance showed significant differences in the three initial features selected for modeling (precipitation, visibility, and distance). A chi-squared test of independence between severity and whether it is day or night (according to sunrise/sunset angle) showed an association.


## Modeling

- Train and test four models:
    - Establish a baseline using the mode for severity
    - Select key features and train multiple classification models (Decision Tree, Random Forest, KNN, Logistic Regression)
    - Test the model on the validate set, adjust for overfitting if necessary

In [None]:
# find the most observed severity level
train.severity.mode()

In [None]:
# establish a baseline prediction using the mode
baseline = len(train[train.severity==2]) / len(train)
baseline

In [None]:
# select significant features for modeling based on visualization and statistical testing
cols = ['distance','precipitation','visibility','humidity','temperature','pressure','wind_speed','amenity','bump', 
       'crossing','give_way','junction','no_exit','railway','roundabout','station','stop','traffic_calming',
       'traffic_signal','sunrise_sunset','year','month', 'hour']
# create the dataframes for train features and target
X_train, y_train = train[cols], train.severity
# create the dataframes for validate features and target
X_validate, y_validate = validate[cols], validate.severity
# create the dataframes for test features and target
X_test, y_test = test[cols], test.severity

#### Decision Tree Model, Depth = 8

In [None]:
# decision tree model function from model.py with a selected depth of 8
model.tree_model(X_train, y_train, X_validate, y_validate, 8)

#### Random Forest Model

In [None]:
# random forest function from model.py with a depth of 16 and 35 sample leaf size
model.rand_forest(X_train, y_train, X_validate, y_validate, 16, 35)
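
A random forest with the same hyperparameters can be sketched as follows. This is a stand-in on synthetic data generated with `make_classification`, not the `model.rand_forest` function itself; the depth of 16, the minimum leaf size of 35, and the seed of 217 are the only values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic three-class data standing in for the accident features (an assumption)
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5, n_classes=3, random_state=217
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=217
)

# limiting depth and requiring larger leaves both restrain overfitting
forest = RandomForestClassifier(max_depth=16, min_samples_leaf=35, random_state=217)
forest.fit(X_tr, y_tr)

# compare train and validate accuracy; a large gap would suggest overfitting
train_acc = forest.score(X_tr, y_tr)
val_acc = forest.score(X_va, y_va)
print(f"train accuracy: {train_acc:.2f}, validate accuracy: {val_acc:.2f}")
```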

#### K Nearest Neighbors Model, Scaled, n=40

In [None]:
# make the object, put it into the variable scaler
scaler = MinMaxScaler()
# fit the object to my data:
X_train_scaled = scaler.fit_transform(X_train)
X_validate_scaled = scaler.transform(X_validate)

In [None]:
# knn model function from model.py with 40 neighbors
model.knn_model(X_train_scaled, y_train, X_validate_scaled, y_validate, 40)

#### Logistic Regression Model

In [None]:
# logistic regression model function from model.py 
model.log_model(X_train, y_train, X_validate, y_validate)

In [None]:
# logistic regression model using scaled data
model.log_model(X_train_scaled, y_train, X_validate_scaled, y_validate)

### Modeling Takeaways

- The Decision Tree Classifier with a depth of 8 had a 68 percent accuracy on the training set and a 67 percent accuracy on validate. The model had high F1-scores for severity levels 2 and 3, but struggled to accurately predict level 4. This model was 33 percentage points more accurate than the baseline.

- The Random Forest Model with a depth of 16 and minimum sample leaf of 35 had a 71 percent accuracy on train and 69 percent accuracy on validate. This model has the most potential for tuning with hyperparameters and feature engineering. Random Forest was 34 percentage points more accurate on validate than the baseline prediction.

- The K Nearest Neighbors Model with 40 neighbors had an accuracy of 65 percent on train and 63 percent on validate. The features were scaled prior to training the model. The model was 29 percentage points more accurate than baseline.

- The Logistic Regression Model had a 60 percent accuracy on train and 61 percent on validate using scaled data. The accuracy was 49 percent on train and validate with unscaled data. This model was not tuned and may perform better with specific arguments and fine tuning. With scaled data, the model was 26 percentage points more accurate than baseline.

- I will use the Random Forest model for testing because it performed best on precision, accuracy, and recall. 

## Test the Best Model

In [None]:
# random forest model was selected; test the model using a function from model.py
model.test_forest(X_train, y_train, X_test, y_test, 16, 35)
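
Evaluating the selected model on held-out data can be sketched like this. The helper `evaluate_forest` is hypothetical and runs on synthetic data; it only mirrors the argument order of the `model.test_forest` call above and is not that function.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate_forest(X_train, y_train, X_test, y_test, depth, leaf, seed=217):
    """Hypothetical sketch: fit on train only, then score once on the test set."""
    forest = RandomForestClassifier(
        max_depth=depth, min_samples_leaf=leaf, random_state=seed
    )
    forest.fit(X_train, y_train)
    pred = forest.predict(X_test)
    # per-class precision, recall, and F1 in addition to overall accuracy
    report = classification_report(y_test, pred, zero_division=0)
    matrix = confusion_matrix(y_test, pred)
    return accuracy_score(y_test, pred), report, matrix

# synthetic three-class data (an assumption, not the accident dataset)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3, random_state=217
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=217
)
test_acc, report, matrix = evaluate_forest(X_tr, y_tr, X_te, y_te, 16, 35)
print(report)
```

The test set is used only once, after model selection, so the score is an estimate of performance on unseen data.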

## Conclusions, Recommendations, and Next Steps



- The target variable was imbalanced, as most accidents were classified as severity level 2. As a result, a baseline that always predicts the mode would be 93 percent accurate. To balance the dataset, I took a random sample of 65,000 level-2 severity accidents from the total dataset using random_seed=217 for reproducibility. I concatenated this sample with the total observations from the other severity classes into a new dataframe of 215,240 observations. This sampling did not take into account any features, so important data about key features of a crash may have been lost. Exploration of this data revealed that crashes of severity level 1 were limited to a nine-month time period in 2020; I dropped this severity level because it is in itself an outlier.


- The minimum viable product model is a decision tree classifier with a maximum depth of 4. I selected three features for the initial model: distance, precipitation, and visibility. I selected these features based on visualizations and statistical tests. I used a random_seed of 217 for reproducibility. The baseline prediction for the training set was .302. The model performed above baseline accuracy at .42 on train and .41 on validate, indicating that the decision tree was not overfit. 


- Subsequent models with other features added increased accuracy by another 30 percent on average. Decision Tree, Random Forest, K Nearest Neighbors, and Logistic Regression were used. I scaled the features for train and validate prior to using the KNN model, but I did not scale the test set because this model was not selected for testing. I evaluated multiple depths and sample sizes for each model. The selected parameters provided the highest performance without overfitting. 


- When resampling the start_time, additional visualization indicated that more accidents occur in April through June, and between the hours of 2pm and 6pm. Using month and hour as features will likely improve the models' performance. The number of accidents has increased year over year, but this may be due to improved data collection and digitized accident information over the years. 


- The Random Forest Model performed best overall, and when evaluated on the test set, achieved the same overall level of accuracy as train and validate, indicating that the model was not overfit. I used 23 features, 20 from the dataset and 3 engineered features using the start time. 


- I recommend using this model if real-time information of the selected features is available when a crash occurs. The model assumes that severity has been established based on specific parameters and that previous crash data was correctly classified. I also recommend adding posted speed limits to the dataset and whether the area has a special classification, e.g. construction zone, school zone, etc. Information about injuries and fatalities for previous crashes could provide valuable insight on what emergency services will be needed based on the predicted severity of the crash.


- If I had more time, I would explore the coordinate data to see if there are certain areas that experience recurring crashes. I would also like to know more about traffic patterns in the area of the crash to see if traffic density has a significant impact on crash severity. 