This notebook contains the analysis of car accident data for the capstone project of IBM's Data Science program on Coursera.


In [11]:
# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Exploratory Data Analysis](#exploratory)
* [Modeling](#modeling)
* [Conclusions](#conclusions)
* [Future Work](#future)

In [6]:
import pandas as pd
import numpy as np
%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')

# Introduction: <a name="introduction"></a>


## 1.1: Background
As the number of drivers on the road has increased over the last century, so, unfortunately, has the number of accidents. Car manufacturers and city planners have worked diligently to reduce both the likelihood of accidents and their severity when they do occur, but there is still work to be done to better predict the occurrence and severity of accidents.

## 1.2 Business Use Case
In this project we seek to answer a question on behalf of the Seattle Department of Transportation and the drivers of Seattle regarding car accident severity. Namely, we want to predict accident severity, in particular whether an accident results solely in property damage or also in personal injury, based on the road conditions and the lighting at the time of the accident. If we can answer this question well enough, can we use it to change how we educate drivers or to send alerts during particularly treacherous conditions?

# Data <a name="data"></a>

## 2.1 Data Sources
For the purposes of this project we are [using a dataset](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf "Dataset") of all accidents recorded in Seattle by the Seattle Police Department from 2004 to present. We have isolated only the data pertaining to road conditions based on factors such as rain, standing water, mud, oil, or other factors and those for lighting conditions based on time of day and street lights. We have coded our severity scores as either accidents that result in damage to property or those that also include injury to those in any vehicle involved. 

In [None]:
data = pd.read_csv(r'C:\Users\ejfel\Google Drive\Work\IBM DS\Capstone\ExampleData.csv')

In [None]:
### Let's take a closer look at our dataset
data.head()

## 2.2 Data Cleaning
Data was read from a CSV file into a pandas DataFrame for cleaning, structuring, and analysis. A number of fields were deemed not pertinent to our analysis, including descriptions of the accident, categories of locations, counts of pedestrians and passengers, and the date, among others. These items are not related to the impact of light and road conditions on accident severity, so we dropped them. The remaining columns make up our feature set. We also take this opportunity to retitle our columns.

In [None]:
# Keep only the columns needed for the analysis; .copy() avoids a SettingWithCopyWarning
features = data[['SEVERITYCODE','ROADCOND','LIGHTCOND','X','Y']].copy()
features = features.rename(columns={'SEVERITYCODE':'Severity', 'ROADCOND':'Precipitation','LIGHTCOND':'Lighting','X':'Longitude','Y':'Latitude'})
all(isinstance(column, str) for column in features.columns)
features.shape  # shape is an attribute, not a method

  Once the data had been reduced to pertinent items, we identified and dropped any accidents which were missing data from any column, as it would be impossible to know accident severity or conditions without them being given. Furthermore, many accidents listed lighting or road conditions as "Other" or "Unknown." For the same reasons as missing data, these items were removed from our data. 

In [None]:
features = features.dropna()
# The columns were renamed above: ROADCOND is now 'Precipitation', LIGHTCOND is now 'Lighting'.
# Drop rows whose condition label contains 'Other' or 'Unknown'.
features = features[~features['Precipitation'].str.contains('Other|Unknown')]
features = features[~features['Lighting'].str.contains('Other|Unknown')]
features.shape
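The filtering step above relies on substring matching through `str.contains`, so any label that merely contains the word "Other" or "Unknown" is dropped as well. The sketch below illustrates this on a small, hypothetical set of lighting labels (the labels are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical lighting labels, for illustration only
lighting = pd.Series(['Daylight', 'Unknown', 'Other',
                      'Dark - Unknown Lighting', 'Dusk'])

# Substring match: True wherever the label contains either word
uninformative = lighting.str.contains('Other|Unknown')

# Keep only the rows that did not match
kept = lighting[~uninformative].tolist()
print(kept)
```

If only exact matches should be removed, `Series.isin(['Other', 'Unknown'])` would be the stricter alternative.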

   Then, we replaced all our categorical variables with integer values to be able to apply our models and metrics more cleanly.

In [None]:
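from sklearn import preprocessing

# Feature matrix: road condition (column 0) and lighting (column 1)
X = features[['Precipitation','Lighting']].values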
le_precip = preprocessing.LabelEncoder()
le_precip.fit(['Dry','Wet','Ice','Oil','Sand/Mud/Dirt','Snow/Slush','Standing Water'])
X[:,0] = le_precip.transform(X[:,0]) 


le_light = preprocessing.LabelEncoder()
le_light.fit([ 'Dark - No Street Lights', 'Dark - Street Lights Off', 'Dark - Street Lights On','Dawn','Daylight','Dusk'])
X[:,1] = le_light.transform(X[:,1])
X
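To make the encoding step easier to interpret, the following sketch shows which integer `LabelEncoder` assigns to each road-condition category. It uses the same category list as the cell above; the variable names `encoder` and `mapping` are introduced here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Road-condition categories as listed in the cell above
road_conditions = ['Dry', 'Wet', 'Ice', 'Oil', 'Sand/Mud/Dirt',
                   'Snow/Slush', 'Standing Water']

encoder = LabelEncoder().fit(road_conditions)

# classes_ holds the categories in sorted order; the position is the integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from integer codes
decoded = list(encoder.inverse_transform([0, 6]))
print(decoded)
```

Note that the codes follow alphabetical order rather than any ranking of how hazardous a condition is, so models that treat the feature as numeric (such as logistic regression or k-Nearest Neighbors) will see an ordering that has no physical meaning.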

We'll also define our target set, severity, for future use.

In [None]:
y = features["Severity"]

Finally, we dealt with the imbalance in our severity target: far more accidents result in property damage alone than in injuries to passengers. If left unchecked, we would expect our model to achieve relatively high accuracy by always predicting solely property damage, without actually basing any prediction on lighting and road conditions. This imbalance is not the result of improper sampling and should not be ignored; it reflects the fact that most accidents do not result in injuries to passengers, though they may still result in costly damage for drivers. We have chosen to downsample the property-damage accidents to the same number as the personal-injury accidents. In this way, our models can learn the structure between the target and the feature set without being misled by the imbalance of real outcomes in our target.

In [None]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = features[features.Severity==0]
df_minority = features[features.Severity==1]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),  # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
features_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
features_downsampled.Severity.value_counts()
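The motivation for downsampling given above can be illustrated with a small sketch: a baseline that always predicts the majority class scores well on accuracy while never identifying an injury. The 70/30 class split and the 0/1 labels below are hypothetical and chosen only for illustration; they are not the proportions in the Seattle data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced target: 70 property-damage (0) and 30 injury (1) cases
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, pred)  # equals the majority-class share
injury_recall = recall_score(y_toy, pred)        # share of injuries correctly flagged
print(baseline_accuracy, injury_recall)
```

Any model trained on the imbalanced data should be compared against this kind of baseline before its accuracy is taken as evidence of predictive skill.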

   We then repeated the exercise previously done on our data set to replace our categorical variables with integer values for our modeling work. 

In [None]:
X_down = features_downsampled[[ 'Precipitation','Lighting']].values

le_precip = preprocessing.LabelEncoder()
le_precip.fit(['Dry','Wet','Ice','Oil','Sand/Mud/Dirt','Snow/Slush','Standing Water'])
X_down[:,0] = le_precip.transform(X_down[:,0]) 


le_light = preprocessing.LabelEncoder()
le_light.fit([ 'Dark - No Street Lights', 'Dark - Street Lights Off', 'Dark - Street Lights On','Dawn','Daylight','Dusk'])
X_down[:,1] = le_light.transform(X_down[:,1])
X_down

y_down = features_downsampled['Severity']

# 3. Exploratory Data Analysis<a name="exploratory"></a>

Let's take a quick look at how severity appears to be affected by our different features. This is the original data and NOT the downsampled data that we will use for training our models later on.

In [None]:
Lighting = features.groupby('Lighting').Severity.value_counts()
Lighting

In [None]:
Precipitation = features.groupby('Precipitation').Severity.value_counts()
Precipitation

In [None]:
k = features.groupby(['Lighting']).Severity.value_counts().rename("count")
frequencyL = k / k.groupby(level=0).sum()
frequencyL

c = features.groupby(['Precipitation']).Severity.value_counts().rename("count")
frequencyP = c / c.groupby(level=0).sum()
frequencyP
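The per-condition proportions computed above can also be obtained in one step with `pd.crosstab` and `normalize='index'`. The sketch below uses a small hypothetical frame named `toy` in place of `features`; its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the `features` frame used above
toy = pd.DataFrame({
    'Lighting': ['Daylight', 'Daylight', 'Daylight', 'Dusk', 'Dusk'],
    'Severity': [0, 0, 1, 0, 1],
})

# Row-normalized table: each row (lighting condition) sums to 1
proportions = pd.crosstab(toy['Lighting'], toy['Severity'], normalize='index')
print(proportions)
```

The result is a wide table with one row per condition and one column per severity level, which can be convenient for plotting grouped or stacked bars.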

For a better visual, here is that same information in a bar chart for each.

In [None]:
frequencyP.plot(kind='bar',figsize=(9,9),
              color=['coral','coral','darkslateblue','darkslateblue'])
plt.title('Severity of Accident by Precipitation')

plt.ylabel('Proportion in Condition')

plt.show()

In [None]:
frequencyL.plot(kind='bar',figsize=(9,9),
              color=['coral','coral','darkslateblue','darkslateblue'])
plt.title('Severity of Accident by Lighting')

plt.ylabel('Proportion in Condition')

plt.show()

# 4. Predictive Modeling <a name="modeling"></a>
We used three types of models to predict accident severity: logistic regression, tree-based classifiers (a decision tree and a random forest), and a k-Nearest Neighbors classifier.

## 4.1 Logistic Regression

In [None]:
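from sklearn.model_selection import train_test_split

# Standardize the encoded features to zero mean and unit variance
Xnorm = preprocessing.StandardScaler().fit(X).transform(X)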
X_train, X_test, y_train, y_test = train_test_split( Xnorm, y, test_size=0.2, random_state=4)
Xnorm

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
yhat = LR.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print (classification_report(y_test, yhat))

We can see here that while this model may have a decent accuracy score overall, it never predicts the higher severity level. This is a result of the imbalance in the target data used for training, which is what our downsampling method seeks to account for. Let's see how that downsampling helps us.
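One way to confirm that a classifier never predicts a given class is to inspect its confusion matrix, which the notebook imports but does not otherwise display. The sketch below uses hypothetical labels and a stand-in "model" that always predicts class 0; the names `y_true`, `y_pred`, and `never_predicts_injury` are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = property damage, 1 = injury
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.zeros_like(y_true)  # stand-in for a model that always predicts class 0

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# An all-zero second column means class 1 was never predicted
never_predicts_injury = bool(cm[:, 1].sum() == 0)
print(never_predicts_injury)
```

Applied to the notebook's own results, the same check would be `confusion_matrix(y_test, yhat)`.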

In [None]:
X_train1, X_test1, y_train1, y_test1 = train_test_split( X_down, y_down, test_size=0.2, random_state=4)
LR2 = LogisticRegression(C=0.01, solver='liblinear').fit(X_train1,y_train1.values.ravel())
yhat_down = LR2.predict(X_test1)

In [None]:
print (classification_report(y_test1, yhat_down))

From this, we can see that while our overall accuracy has dropped considerably, our model is now making genuine predictions for both the higher and lower severity levels. Considering these two outcomes, it seems that this may not be our best choice for a model.

## 4.2 Decision Tree and Random Forest Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

In [None]:
CrashTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
CrashTree.fit(X_train,y_train)

In [None]:
predTree = CrashTree.predict(X_test)

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_test, predTree))

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [None]:
clf_4 = RandomForestClassifier()
clf_4.fit(X_train, y_train)
 
pred_y_4 = clf_4.predict(X_test)
 
print( np.unique( pred_y_4 ) )
print( accuracy_score(y_test, pred_y_4) )

prob_y_4 = clf_4.predict_proba(X_test)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y_test, prob_y_4) )

We should also check whether downsampling improves our tree classifiers. As the results here show, it does not.

In [None]:
CrashTree2 = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
CrashTree2.fit(X_train1,y_train1)
predTree2 = CrashTree2.predict(X_test1)
print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_test1, predTree2))

In [None]:
clf_down = RandomForestClassifier()
clf_down.fit(X_train1, y_train1)
 
pred_down = clf_down.predict(X_test1)  # use the forest trained on the downsampled data
 
print( np.unique( pred_down ) )
print( accuracy_score(y_test1, pred_down) )

prob_down = clf_down.predict_proba(X_test1)
prob_down = [p[1] for p in prob_down]
print( roc_auc_score(y_test1, prob_down))

## 4.3 k-Nearest Neighbor

In [None]:
import itertools
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn.neighbors import KNeighborsClassifier
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neigh.predict(X_test)
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
ConfusionMx = []
for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc
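The `std_acc` expression in the loop above is the standard error of the accuracy estimate: for accuracy $p$ on $n$ test samples it equals $\sqrt{p(1-p)/n}$. The sketch below checks this identity on hypothetical predictions (the arrays are invented for illustration).

```python
import numpy as np

# Hypothetical labels and predictions: 8 of 10 are correct
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_hat = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 1])

correct = (y_hat == y_true)
n = correct.shape[0]
p = correct.mean()  # accuracy

# Expression used in the loop above
se_from_std = np.std(correct) / np.sqrt(n)

# Binomial standard error of a proportion
se_from_formula = np.sqrt(p * (1 - p) / n)
print(p, se_from_std, se_from_formula)
```

Because the band is built from a single train/test split, it reflects sampling noise in the test set only, not variation across different splits.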

In [None]:
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))  # the shaded band above spans one standard error
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

In [None]:
print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1) 

# 5. Conclusions <a name="conclusions"></a>
In this project, we have created and evaluated a number of models for predicting accident severity based on lighting and road conditions. These can be of great use to first responders in allocating resources, cities in planning for increased lighting or advisories for upcoming weather, or for driver's education about potentially dangerous driving situations to avoid if possible. We can see from our metrics that the two models that will provide our most accurate predictions going forward would be the 2-Nearest Neighbors and the Decision Tree Classifier created from the original data set.

# 6. Future Directions<a name="future"></a>
This data set provided a wealth of information, but it is missing some elements that we know exist. For example, no data points covered accidents with the highest levels of severity (accidents not just with injury to passengers but more severe cases involving fatalities).

Models in this study do not explain how or why these accidents occur; they simply capture correlation with particular weather and lighting conditions. They do not account for how the lighting conditions might affect a driver differently based on time of year, whether the drivers are local or not, or how drivers might have been influenced by choices of signage, speed limits, or past experiences.