# Capstone Project - Car Accident Severity Prediction

This is the notebook for the car accident severity prediction project :-)

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


In [3]:
df_before = pd.read_csv('Data-Collisions.csv')



In [4]:
df_before.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [5]:
df_before.isnull().sum()

SEVERITYCODE           0
X                   5334
Y                   5334
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE            1926
INTKEY            129603
LOCATION            2677
EXCEPTRSNCODE     109862
EXCEPTRSNDESC     189035
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        6329
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SDOTCOLNUM         79737
SPEEDING          185340
ST_COLCODE            18
ST_COLDESC          4904
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64

## 📖 Introduction/Business Problem ##

As the number of car accidents in our community increases, a model is needed to predict how severe an accident will be from factors such as weather, road condition and location. This project develops a machine learning model that predicts the severity of a car collision from a large dataset gathered from the community.

## 📊 Data ##

### *Severity codes are as follows:* ###

*0: Little to no Probability (Clear Conditions)*

*1: Very Low Probability - Chance of Property Damage*

*2: Low Probability - Chance of Injury*

*3: Mild Probability - Chance of Serious Injury*

*4: High Probability - Chance of Fatality*

### ❎ The original dataset has the following problems: ###

-- It has too many attributes (37)

-- It is an unbalanced dataset (SEVERITYCODE column)

### 🔧 So some changes are applied to this dataset: ###

-- The **independent variables** are restricted to WEATHER, ROADCOND and LIGHTCOND

-- The **dependent variable** SEVERITYCODE is balanced so that categories 1 and 2 have an equal number of rows

## 🔮 Data Conversion ##

### **❎ Deal with unbalanced SEVERITYCODE** ###

In [6]:
df_before['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

As we can see, the SEVERITYCODE column is unbalanced: category '1' has roughly 2.3 times as many rows as category '2' (136485 vs 58188).

In [7]:
from sklearn.utils import resample

# Split by class: severity 1 is the majority class, severity 2 the minority class
df_before_maj = df_before[df_before.SEVERITYCODE==1]
df_before_min = df_before[df_before.SEVERITYCODE==2]

# Downsample the majority class to the size of the minority class
df_before_maj_downsampled = resample(df_before_maj, 
                                 replace=False,     # sample without replacement
                                 n_samples=len(df_before_min),    # 58188, to match minority class
                                 random_state=123) # reproducible results
 
# Combine downsampled majority class with minority class
balanced_df = pd.concat([df_before_maj_downsampled, df_before_min])
balanced_df.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64

#### 🌟 Now the dataset is balanced! ####

### **❎ Deal with too many attributes** ###

In [8]:
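# Note: the columns are taken from df_before (the unbalanced frame), not from balanced_df,
# so the row counts reported in the cells below reflect the unbalanced data.
df = df_before[['SEVERITYCODE','WEATHER','ROADCOND','LIGHTCOND']].copy()
df.head()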

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,Overcast,Wet,Daylight
1,1,Raining,Wet,Dark - Street Lights On
2,1,Overcast,Dry,Daylight
3,1,Clear,Dry,Daylight
4,2,Raining,Wet,Daylight


In [9]:
df_before['WEATHER'].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [10]:
df_before['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [11]:
df_before['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [12]:
encoding_weather = {'WEATHER':{'Clear': 1, 'Partly Cloudy': 2, 'Overcast':3,  'Fog/Smog/Smoke':4, 'Severe Crosswind':5, 'Raining':6, 'Sleet/Hail/Freezing Rain':7, 'Blowing Sand/Dirt':8, 'Snowing':9,'Other':10,'Unknown':11}}
df.replace(encoding_weather, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,3.0,Wet,Daylight
1,1,6.0,Wet,Dark - Street Lights On
2,1,3.0,Dry,Daylight
3,1,1.0,Dry,Daylight
4,2,6.0,Wet,Daylight


In [13]:
encoding_roadcond = {'ROADCOND':{'Dry': 1, 'Sand/Mud/Dirt':2, 'Oil':3, 'Wet':4,'Standing Water':5,'Snow/Slush':6,'Ice':7,'Other':8,'Unknown':9}}
df.replace(encoding_roadcond, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,3.0,4.0,Daylight
1,1,6.0,4.0,Dark - Street Lights On
2,1,3.0,1.0,Daylight
3,1,1.0,1.0,Daylight
4,2,6.0,4.0,Daylight


In [14]:
encoding_lightcond = {'LIGHTCOND':{'Daylight':1, 'Dawn':2, 'Dusk':3, 'Dark - Street Lights On':4, 'Dark - Street Lights Off':5, 'Dark - No Street Lights':6, 'Dark - Unknown Lighting':7, 'Other':8,'Unknown':9 }}
df.replace(encoding_lightcond, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,3.0,4.0,1.0
1,1,6.0,4.0,4.0
2,1,3.0,1.0,1.0
3,1,1.0,1.0,1.0
4,2,6.0,4.0,1.0


In [15]:
df.dropna(axis = 'rows',inplace = True)


In [16]:
df.isnull().sum()

SEVERITYCODE    0
WEATHER         0
ROADCOND        0
LIGHTCOND       0
dtype: int64

## 💡 Methodology ##

The data is now ready to be fed into machine learning models.

I use four methods to predict the severity code:

- K Nearest Neighbor (KNN) 

*KNN picks a value of k, computes the distance from the unknown case to every training case, selects the k training observations that are "nearest" to the unknown data point, and predicts its response as the most common response value among those k nearest neighbors.*

- Decision Tree

*A decision tree lays out all of the possible outcomes so that we can fully analyse the consequences of a decision.*

- Support Vector Machine

*SVM is well suited to two-class classification problems such as this one. Note that training a kernel SVM becomes slow on large datasets, because its fit time grows at least quadratically with the number of samples.*

- Logistic Regression

*Logistic Regression indicates how much influence each condition has on the likelihood of a severe car collision.*
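The KNN procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the scikit-learn implementation used in this notebook: the helper name `knn_predict` and the tiny training set are made up for the example, with feature tuples standing in for encoded (WEATHER, ROADCOND, LIGHTCOND) values.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training point, paired with its label
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest points and return the most common one
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data (not taken from the collision dataset)
train_X = [(1, 1, 1), (1, 1, 4), (6, 4, 4), (6, 4, 1), (3, 4, 1)]
train_y = [1, 1, 2, 2, 2]

pred_a = knn_predict(train_X, train_y, (6, 4, 2), k=3)
pred_b = knn_predict(train_X, train_y, (1, 1, 2), k=3)
print(pred_a, pred_b)
```

With k = 3, the first query sits next to three class-2 points, while the second has two class-1 points among its three nearest neighbours, so the majority vote differs between the two.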

### Initialisation ###

In [17]:
import numpy as np
X = np.asarray(df[['WEATHER', 'ROADCOND', 'LIGHTCOND']])
X[0:5]

array([[3., 4., 1.],
       [6., 4., 4.],
       [3., 1., 1.],
       [1., 1., 1.],
       [6., 4., 1.]])

In [18]:
y = df['SEVERITYCODE'].values
y[0:5]

array([2, 1, 1, 1, 2])

### Normalisation ###

In [19]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-0.01750111,  0.65036952, -0.6551748 ],
       [ 0.96073359,  0.65036952,  0.66729801],
       [-0.01750111, -0.61604977, -0.6551748 ],
       [-0.66965757, -0.61604977, -0.6551748 ],
       [ 0.96073359,  0.65036952, -0.6551748 ]])
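As a rough illustration of what `StandardScaler` does to each column, the sketch below standardises a single list of values by subtracting the mean and dividing by the population standard deviation (the `ddof=0` convention that `StandardScaler` uses). The helper name `standardize` is hypothetical, and because it is fitted on only five values its output will not match the numbers above, which come from a scaler fitted on the full column.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population std (ddof=0)
    return [(v - mu) / sigma for v in column]

# First five encoded WEATHER values shown earlier in the notebook
weather_codes = [3.0, 6.0, 3.0, 1.0, 6.0]
z = standardize(weather_codes)
print(z)
```

After the transformation the column has mean 0 and standard deviation 1, which keeps any one feature from dominating the distance computation in KNN.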

### Train-Test Split

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state = 4)

print('Train set shape: ', X_train.shape, y_train.shape)
print('Test set shape: ', X_test.shape, y_test.shape)

Train set shape:  (151469, 3) (151469,)
Test set shape:  (37868, 3) (37868,)


## 🌟Classification

###  KNN

In [34]:
#We initially try k =4

from sklearn.neighbors import KNeighborsClassifier
k =4
neigh = KNeighborsClassifier(n_neighbors= k).fit(X_train, y_train)
neigh
#Predicting
Kyhat4 = neigh.predict(X_test)
Kyhat4
#Accuracy Evaluation
from sklearn import metrics
print("Train set accuracy = ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set accuracy = ", metrics.accuracy_score(y_test, Kyhat4 ))


Train set accuracy =  0.665964652833253
Test set accuracy =  0.6617724728002535


In [22]:
#Now we check which value of k in the range 1..19 gives the best test accuracy

Ks = 20
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    #Standard error of the accuracy estimate
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc

array([0.65421992, 0.6769568 , 0.6404352 , 0.66177247, 0.63753037,
       0.66990599, 0.66956269, 0.66972114, 0.66948347, 0.69259005,
       0.66343615, 0.69248442, 0.66850639, 0.69280131, 0.69272209,
       0.69578536, 0.69581177, 0.69599662, 0.69599662])
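One way to read the best k off this array is sketched below. The accuracies are copied from the printed output, so they carry only eight decimal places, and a tie at this precision may not be a tie in the underlying values.

```python
# Test accuracies for k = 1..19, copied from the mean_acc output above
mean_acc = [0.65421992, 0.6769568, 0.6404352, 0.66177247, 0.63753037,
            0.66990599, 0.66956269, 0.66972114, 0.66948347, 0.69259005,
            0.66343615, 0.69248442, 0.66850639, 0.69280131, 0.69272209,
            0.69578536, 0.69581177, 0.69599662, 0.69599662]

best_acc = max(mean_acc)
# Index 0 corresponds to k = 1, so enumerate from 1 to recover k directly
best_ks = [k for k, acc in enumerate(mean_acc, start=1) if acc == best_acc]
print(best_acc, best_ks)
```

At the printed precision, k = 18 and k = 19 share the highest test accuracy.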

In [35]:
#We can see that k = 19 gives the best accuracy (tied with k = 18 at the printed precision), so we refit with k = 19
k = 19
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train,y_train)
neigh
#Predicting
Kyhat19 = neigh.predict(X_test)
Kyhat19
#Accuracy Evaluation
print("Train set accuracy = ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set accuracy = ", metrics.accuracy_score(y_test, Kyhat19 ))


Train set accuracy =  0.699337818299454
Test set accuracy =  0.6959966198373296


### Decision Tree

In [36]:
#Train
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(criterion="entropy",max_depth=7)
DT.fit(X_train,y_train)

#Predict
DTyhat = DT.predict(X_test)

#Accuracy Evaluation
print("Train set accuracy: ", metrics.accuracy_score(y_train,DT.predict(X_train)))
print("Test set accuracy: ", metrics.accuracy_score(y_test,DTyhat))


Train set accuracy:  0.699337818299454
Test set accuracy:  0.6961550649625013


### Support Vector Machine

In [37]:
from sklearn import svm
#Train
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)

#Predict
SVMyhat = clf.predict(X_test)

#Accuracy Evaluation
print("Train set accuracy: ", metrics.accuracy_score(y_train,clf.predict(X_train)))
print("Test set accuracy: ", metrics.accuracy_score(y_test,SVMyhat))




Train set accuracy:  0.6993048082445913
Test set accuracy:  0.6961550649625013


### Logistic Regression

In [38]:
#Train
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=6, solver='liblinear').fit(X_train,y_train)

#Predict
LRyhat = LR.predict(X_test)

#Accuracy Evaluation
print("Train set accuracy: ", metrics.accuracy_score(y_train,LR.predict(X_train)))
print("Test set accuracy: ", metrics.accuracy_score(y_test,LRyhat))


Train set accuracy:  0.6993048082445913
Test set accuracy:  0.6961550649625013


## 📌 Evaluation

For each of the models we will calculate the **Jaccard index and F1-Score**

* The Jaccard Index, also known as the Jaccard similarity coefficient, is a statistic used in understanding the similarities between sample sets. The measurement emphasizes similarity between finite sample sets, and is formally defined as the size of the intersection divided by the size of the union of the sample sets. 


* The F1-score is calculated from the precision and recall of the test, where the **precision is the number of correctly identified positive results divided by the number of all positive results, including those not identified correctly,** and the **recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive.** *The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.*
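To make the two definitions concrete, here is a small hand computation in plain Python. The labels are invented for the example, and treating class 2 as the "positive" class is an assumption made only for illustration; the notebook itself relies on scikit-learn's metric functions.

```python
# Hypothetical true and predicted severity codes (not from the dataset)
y_true = [1, 1, 1, 2, 2, 1]
y_pred = [1, 1, 2, 2, 1, 1]
positive = 2  # assumption: treat severity 2 as the positive class

pairs = list(zip(y_true, y_pred))
tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Jaccard index for the positive class: |intersection| / |union| of the
# sets of samples labelled positive in y_true and in y_pred
jaccard = tp / (tp + fp + fn)

print(precision, recall, f1, jaccard)
```

Here one sample is correctly flagged as class 2, one is wrongly flagged and one is missed, so precision, recall and F1 all equal 0.5 while the Jaccard index is 1/3.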

### KNN

In [47]:
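from sklearn.metrics import f1_score
# Note: jaccard_similarity_score was removed in scikit-learn 0.23 (replaced by jaccard_score).
# For binary and multiclass targets it was equivalent to accuracy_score.
from sklearn.metrics import jaccard_similarity_score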

print('KNN Jaccard index: %.7f' % jaccard_similarity_score(y_test,Kyhat19))
print('KNN F1-score: %.7f' % f1_score(y_test,Kyhat19,average="weighted"))

KNN Jaccard index: 0.6959966
KNN F1-score: 0.5718590




### Decision Tree

In [49]:
print('Decision Tree Jaccard index: %.7f' % jaccard_similarity_score(y_test,DTyhat))
print('Decision Tree F1-score: %.7f' % f1_score(y_test,DTyhat,average="weighted"))

Decision Tree Jaccard index: 0.6961551
Decision Tree F1-score: 0.5714476





### Support Vector Machine

In [51]:
print('Support Vector Machine Jaccard index: %.7f' % jaccard_similarity_score(y_test,SVMyhat))
print('Support Vector Machine F1-score: %.7f' % f1_score(y_test,SVMyhat,average="weighted"))

Support Vector Machine Jaccard index: 0.6961551
Support Vector Machine F1-score: 0.5714476




### Logistic Regression

In [52]:
print('Logistic Regression Jaccard index: %.7f' % jaccard_similarity_score(y_test, LRyhat))
print('Logistic Regression F1-score: %.7f' % f1_score(y_test, LRyhat, average = 'weighted'))

Logistic Regression Jaccard index: 0.6961551
Logistic Regression F1-score: 0.5714476




## 📃 Results Table

| Algorithm           | Jaccard (7dp) | F1-score (7dp) | Train Set Accuracy (3dp) | Test Set Accuracy (3dp) |
|---------------------|---------------|----------------|--------------------------|-------------------------|
| KNN                 | 0.6959966     | 0.5718590      | 0.699                    | 0.696                   |
| Decision Tree       | 0.6961551     | 0.5714476      | 0.699                    | 0.696                   |
| SVM                 | 0.6961551     | 0.5714476      | 0.699                    | 0.696                   |
| Logistic Regression | 0.6961551     | 0.5714476      | 0.699                    | 0.696                   |
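The identical scores for the decision tree, SVM and logistic regression invite a quick arithmetic check. The sketch below tests one possible explanation, stated here as an assumption rather than a finding: that these models predict class 1 for every test sample. Under that assumption, the share of class 1 in the test set equals the reported test accuracy, and the weighted F1-score can be computed by hand.

```python
# Reported test accuracy shared by the decision tree, SVM and logistic regression
test_accuracy = 0.6961550649625013

# Assumption: every test sample is predicted as class 1, so the share of
# class 1 in the test set equals the accuracy.
share_class_1 = test_accuracy

precision_1 = share_class_1   # all predictions are class 1; this share is correct
recall_1 = 1.0                # every true class-1 sample is predicted as class 1
f1_class_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
f1_class_2 = 0.0              # class 2 is never predicted

# Weighted F1: per-class F1 weighted by each class's share of the test set
weighted_f1 = share_class_1 * f1_class_1 + (1 - share_class_1) * f1_class_2
print(round(weighted_f1, 7))
```

The result agrees with the reported F1-score of 0.5714476 to seven decimal places, which is consistent with these three models always predicting the majority class, though it does not prove it. A confusion matrix on `y_test` would confirm or rule this out.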

<p>Copyright &copy; Alyson Zhang</p>