## UK Road Safety: Traffic Accidents and Vehicles
Vinita Verma
30 Aug, 2020


### Applied Data Science Capstone By IBM/Coursera

## Table Of Contents 
* [Introduction : Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Conclusion](#conclusion)

## Business Problem

In an effort to reduce the frequency of car collisions in a community, an algorithm must be developed to predict the severity of an accident given the current weather, road and visibility conditions. When conditions are bad, this model will alert drivers to remind them to be more careful.

In [None]:
import numpy as np
import pandas as pd

In [None]:
import warnings

# Suppress library warnings (e.g. deprecation notices) for the rest of the notebook
warnings.filterwarnings("ignore")

In [None]:
#RESAMPLING

import matplotlib.pyplot as plt   
from pydotplus import graph_from_dot_data
from IPython.display import Image  
import seaborn as sns
from IPython.display import HTML, display
import tabulate
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.python.keras import utils
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from sklearn.linear_model import LogisticRegression

## Data preprocessing

In [15]:
df = pd.read_csv('Accident_Information.csv', sep=',')

In [16]:
print(df)

       Accident_Index 1st_Road_Class  ...    Year InScotland
0       200501BS00001              A  ...  2005.0         No
1       200501BS00002              B  ...  2005.0         No
2       200501BS00003              C  ...  2005.0         No
3       200501BS00004              A  ...  2005.0         No
4       200501BS00005   Unclassified  ...  2005.0         No
...               ...            ...  ...     ...        ...
121427  20054100C0339   Unclassified  ...  2005.0         No
121428  20054100C0340   Unclassified  ...  2005.0         No
121429  20054100C0341              A  ...  2005.0         No
121430  20054100C0342   Unclassified  ...  2005.0         No
121431  20054100C0343   Unclassified  ...     NaN        NaN

[121432 rows x 34 columns]


In [17]:
encoding = {
"Carriageway_Hazards": {"None": 0, "Other object on road": 1, "Any animal in carriageway (except ridden horse)": 1,  "Pedestrian in carriageway - not injured": 1, "Previous accident": 1, "Vehicle load on road": 1,  "Data missing or out of range": 0  }
}
df.replace(encoding, inplace=True)
print(df['Carriageway_Hazards'].value_counts())

0    119407
1      2025
Name: Carriageway_Hazards, dtype: int64


In [18]:
print(df['Light_Conditions'].value_counts())
encoding_light = {"Light_Conditions": {"Daylight": 0, "Darkness - lights lit": 1, "Darkness - no lighting": 1, "Darkness - lighting unknown": 1, "Darkness - lights unlit": 1, "Data missing or out of range": 0}}
df.replace(encoding_light, inplace=True)
print(df['Light_Conditions'].value_counts())

Daylight                       87573
Darkness - lights lit          27239
Darkness - no lighting          5412
Darkness - lighting unknown      744
Darkness - lights unlit          464
Name: Light_Conditions, dtype: int64
0    87573
1    33859
Name: Light_Conditions, dtype: int64
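
The binary encodings in this section could also be written with `Series.map`, which turns any category missing from the dictionary into `NaN` instead of silently leaving it as text. The following is a minimal sketch on a toy frame; the sample rows and the names `toy`, `light_map`, `encoded` and `unmapped` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the 'Light_Conditions' column (rows are made up).
toy = pd.DataFrame({"Light_Conditions": [
    "Daylight",
    "Darkness - lights lit",
    "Darkness - no lighting",
    "Data missing or out of range",
    "Daylight",
]})

# Same grouping as the notebook: daylight or missing -> 0, any darkness -> 1.
light_map = {
    "Daylight": 0,
    "Darkness - lights lit": 1,
    "Darkness - no lighting": 1,
    "Darkness - lighting unknown": 1,
    "Darkness - lights unlit": 1,
    "Data missing or out of range": 0,
}

encoded = toy["Light_Conditions"].map(light_map)

# Categories absent from the mapping would appear as NaN and be counted here.
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```

Checking `unmapped` after each mapping is a cheap way to catch a misspelled or unexpected category before it reaches a model.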


In [19]:
print(df['Day_of_Week'].value_counts())
encoding_day_of_week = {"Day_of_Week": {"Saturday": 1, "Sunday": 1, "Monday": 0, "Tuesday": 0, "Wednesday": 0, "Thursday": 0, "Friday": 0}}
df.replace(encoding_day_of_week, inplace=True)
print(df['Day_of_Week'].value_counts())

Friday       19987
Wednesday    18473
Thursday     18005
Tuesday      17847
Monday       17111
Saturday     16553
Sunday       13456
Name: Day_of_Week, dtype: int64
0    91423
1    30009
Name: Day_of_Week, dtype: int64


In [20]:
print(df['Special_Conditions_at_Site'].value_counts())
encoding_Special_Conditions_at_Site = {"Special_Conditions_at_Site": {"None": 0, "Roadworks": 1, "Oil or diesel": 1, "Mud": 1, "Road surface defective": 1, "Auto traffic signal - out": 1, "Road sign or marking defective or obscured": 1, "Auto signal part defective": 1, "Data missing or out of range": 0}}
df.replace(encoding_Special_Conditions_at_Site, inplace=True)
print(df['Special_Conditions_at_Site'].value_counts())

None                                          118572
Roadworks                                       1357
Oil or diesel                                    455
Mud                                              343
Road surface defective                           229
Road sign or marking defective or obscured       225
Auto traffic signal - out                        180
Auto signal part defective                        62
Data missing or out of range                       8
Name: Special_Conditions_at_Site, dtype: int64
0.0    118580
1.0      2851
Name: Special_Conditions_at_Site, dtype: int64


In [21]:
encoding_1st_road_class = {"1st_Road_Class": {"A": 1, "A(M)": 1, "B": 2, "C": 3, "Motorway": 4, "Unclassified": 1}}
df.replace(encoding_1st_road_class, inplace=True)
df['1st_Road_Class'].value_counts()

1    90296
2    14433
3    12766
4     3937
Name: 1st_Road_Class, dtype: int64

In [22]:
# Replace 'Data missing or out of range' with the most frequent value, 'Give way or uncontrolled'
df['Junction_Control'] = df['Junction_Control'].replace(['Data missing or out of range'], 'Give way or uncontrolled')

In [23]:
df['Junction_Control'].value_counts()

Give way or uncontrolled               105422
Auto traffic signal                     14389
Stop sign                                 864
Not at junction or within 20 metres       565
Authorised person                         192
Name: Junction_Control, dtype: int64

In [24]:
encoding_junction_control = {"Junction_Control": 
                            {"Give way or uncontrolled": 1,
                             "Auto traffic signal": 2,
                             "Not at junction or within 20 metres": 3, 
                             "Stop sign": 4,
                             "Authorised person": 5,
                              }}
df.replace(encoding_junction_control, inplace=True)
df['Junction_Control'].value_counts()

1    105422
2     14389
4       864
3       565
5       192
Name: Junction_Control, dtype: int64

In [25]:
encoding_junction_detail = {"Junction_Detail": 
                            {"Not at junction or within 20 metres": 1,
                             "T or staggered junction": 2,
                             "Crossroads": 3, 
                             "Roundabout": 4,
                             "Private drive or entrance": 5,
                             "Other junction": 6,
                             "Slip road": 7,
                             "More than 4 arms (not roundabout)": 8,
                             "Mini-roundabout": 9,
                             "Data missing or out of range": 1 }}
df.replace(encoding_junction_detail, inplace=True)
df['Junction_Detail'].value_counts()

1    44976
2    40510
3    13682
4     9283
5     4499
6     3856
8     1904
7     1599
9     1123
Name: Junction_Detail, dtype: int64

In [26]:
encoding_road_surface_cond = {"Road_Surface_Conditions": 
                            {"Dry": 1,
                             "Wet or damp": 2,
                             "Frost or ice": 3, 
                             "Snow": 4,
                             "Flood over 3cm. deep": 5,
                             "Data missing or out of range": 1 }}
df.replace(encoding_road_surface_cond, inplace=True)
df['Road_Surface_Conditions'].value_counts()

1.0    84707
2.0    34151
3.0     1832
4.0      650
5.0       91
Name: Road_Surface_Conditions, dtype: int64

In [27]:
encoding_road_type = {"Road_Type": 
                            {"Single carriageway": 1,
                             "Dual carriageway": 2,
                             "Roundabout": 3, 
                             "One way street": 4,
                             "Slip road": 5,
                             "Unknown": 0,
                             "Data missing or out of range": 1 }}
df.replace(encoding_road_type, inplace=True)
df['Road_Type'].value_counts()

1.0    91753
2.0    17769
3.0     7419
4.0     2808
5.0     1018
0.0      664
Name: Road_Type, dtype: int64

In [28]:
encoding_urban_rural = {"Urban_or_Rural_Area": 
                            {"Urban": 1,
                             "Rural": 2,
                             "Unallocated": 1 }}
df.replace(encoding_urban_rural, inplace=True)
df['Urban_or_Rural_Area'].value_counts()

1.0    86284
2.0    35147
Name: Urban_or_Rural_Area, dtype: int64

In [29]:

encoding_weather = {"Weather_Conditions": 
                            {"Fine no high winds": 1,
                             "Raining no high winds": 2,
                             "Raining + high winds": 3,
                             "Fine + high winds": 4,
                             "Snowing no high winds": 5,
                             "Fog or mist": 6,
                             "Snowing + high winds": 7,
                             "Unknown": 1,
                             "Other": 1,
                             "Data missing or out of range": 1 }}
df.replace(encoding_weather, inplace=True)
df['Weather_Conditions'].value_counts()

1.0    103364
2.0     13741
4.0      1482
3.0      1046
5.0      1041
6.0       618
7.0       139
Name: Weather_Conditions, dtype: int64

In [30]:
np.where(np.isnan(df['Speed_limit']))

(array([121431]),)

## Data

In [33]:
df['Speed_limit'].fillna((df['Speed_limit'].mean()), inplace=True)

In [34]:
df['Time'].fillna(0, inplace=True)

In [35]:
def period(row):
    # Floats are split on ".", time strings such as "17:42" on ":"
    if isinstance(row, float):
        parts = str(row).split(".")
    else:
        parts = str(row).split(":")

    hr = int(parts[0])
    # 1 = daytime (hours 9 to 19), 2 = evening, night and early morning
    return 1 if 8 < hr < 20 else 2

In [36]:
df['Time'] = df['Time'].apply(period)
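
The same day/night split could be computed without a row-wise function by extracting the hour with pandas string methods. This is a sketch under the assumption that `Time` holds `HH:MM` strings with occasional missing values; the sample series and the names `times`, `hours` and `time_period` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the raw 'Time' column, including one missing value.
times = pd.Series(["08:30", "09:00", "19:59", "20:00", None])

# Take the hour part; missing times become hour 0, as with fillna(0) in the notebook.
hours = times.str.split(":").str[0].fillna("0").astype(int)

# 1 = hours 9..19 inclusive, 2 = everything else, mirroring period().
time_period = hours.between(9, 19).map({True: 1, False: 2})
print(time_period.tolist())
```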

In [37]:
df_train1 = df[['1st_Road_Class','Carriageway_Hazards','Junction_Control','Day_of_Week','Junction_Detail','Light_Conditions','Road_Surface_Conditions','Road_Type','Special_Conditions_at_Site','Speed_limit','Time','Urban_or_Rural_Area','Weather_Conditions','Accident_Severity']]

In [38]:
df_slight = df_train1[df_train1['Accident_Severity']=='Slight']

In [41]:
df_serious = df_train1[df_train1['Accident_Severity']=='Serious']

In [40]:
df_fatal = df_train1[df_train1['Accident_Severity']=='Fatal']

In [42]:
df_serious['Accident_Severity'].value_counts()

Serious    15416
Name: Accident_Severity, dtype: int64

In [43]:
random_subset = df_slight.sample(n=3)
random_subset.head()

| | 1st_Road_Class | Carriageway_Hazards | Junction_Control | Day_of_Week | Junction_Detail | Light_Conditions | Road_Surface_Conditions | Road_Type | Special_Conditions_at_Site | Speed_limit | Time | Urban_or_Rural_Area | Weather_Conditions | Accident_Severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41155 | 1 | 0 | 2 | 0 | 3 | 1 | 2.0 | 2.0 | 0.0 | 40.0 | 2 | 1.0 | 2.0 | Slight |
| 95202 | 3 | 0 | 1 | 0 | 1 | 1 | 1.0 | 1.0 | 0.0 | 60.0 | 2 | 2.0 | 1.0 | Slight |
| 120591 | 2 | 0 | 3 | 0 | 1 | 0 | 2.0 | 1.0 | 0.0 | 60.0 | 2 | 2.0 | 1.0 | Slight |


In [44]:
df_fatal['Accident_Severity'].value_counts()

Fatal    1629
Name: Accident_Severity, dtype: int64

In [45]:

df_slight_sampling = df_slight.sample(n=45000)  # Downsample the Slight class to 45,000 records (Fatal is merged into Serious further below)

In [46]:

df_serious_sampling = df_serious.sample(n=1629)

In [47]:
df_final_sampling = pd.concat([df_serious_sampling,df_slight_sampling,df_fatal])

In [48]:
df_final_sampling.head()

| | 1st_Road_Class | Carriageway_Hazards | Junction_Control | Day_of_Week | Junction_Detail | Light_Conditions | Road_Surface_Conditions | Road_Type | Special_Conditions_at_Site | Speed_limit | Time | Urban_or_Rural_Area | Weather_Conditions | Accident_Severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105222 | 3 | 0 | 1 | 0 | 1 | 0 | 1.0 | 1.0 | 0.0 | 60.0 | 1 | 2.0 | 1.0 | Serious |
| 69388 | 1 | 0 | 1 | 0 | 8 | 1 | 1.0 | 1.0 | 0.0 | 30.0 | 2 | 1.0 | 1.0 | Serious |
| 18973 | 1 | 0 | 1 | 0 | 2 | 1 | 1.0 | 1.0 | 0.0 | 30.0 | 2 | 1.0 | 1.0 | Serious |
| 119078 | 1 | 0 | 1 | 0 | 4 | 1 | 1.0 | 3.0 | 0.0 | 30.0 | 2 | 1.0 | 1.0 | Serious |
| 112084 | 3 | 0 | 1 | 0 | 1 | 0 | 1.0 | 1.0 | 0.0 | 60.0 | 1 | 2.0 | 1.0 | Serious |


In [49]:

df_test = df_final_sampling[['Accident_Severity']]

In [50]:

# Merge 'Fatal' into 'Serious' so the target has two classes: Slight and Serious
df_test['Accident_Severity'] = df_test['Accident_Severity'].replace(['Fatal'], 'Serious')

In [51]:
df_train = df_final_sampling[['1st_Road_Class','Carriageway_Hazards','Junction_Control','Day_of_Week','Junction_Detail','Light_Conditions','Road_Surface_Conditions','Road_Type','Special_Conditions_at_Site','Speed_limit','Time','Urban_or_Rural_Area','Weather_Conditions']]

In [52]:
df_test['Accident_Severity'].value_counts()

Slight     45000
Serious     3258
Name: Accident_Severity, dtype: int64

## **Methodology** **and** **Analysis**

Our data is now ready to be fed into machine learning models.
We will use the following algorithms:
1. Random Forest Classifier
2. XGBClassifier
3. K-Nearest Neighbor (KNN)
4. Logistic Regression
5. Gaussian Naive Bayes
6. Gradient Boosting Classifier
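
A comparison of several classifiers can be organised as a loop over estimators that share one evaluation step. The sketch below uses synthetic data from `make_classification` as a stand-in for the accident features (an assumption, since the real CSV is not reproduced here), and the names `models` and `minority_recall` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the accident data (about 7% minority class).
X, y = make_classification(n_samples=2000, n_features=13,
                           weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=3, weights="distance"),
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50,
                                     class_weight="balanced_subsample",
                                     random_state=0),
}

# Fit each model and record recall on the minority class (label 1).
minority_recall = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    minority_recall[name] = recall_score(y_te, model.predict(X_te), pos_label=1)
print(minority_recall)
```

Recording minority-class recall rather than accuracy keeps the comparison meaningful when one class dominates.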



In [53]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train, df_test, test_size=0.2)
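
With a minority class this small, a purely random split can shift the class proportions between the training and test sets. One option, sketched here on placeholder features with the class counts reported above, is to pass `stratify` to `train_test_split`; the arrays `X` and `y` and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported above; features are random placeholders.
y = np.array(["Slight"] * 45000 + ["Serious"] * 3258)
X = np.random.default_rng(0).normal(size=(len(y), 3))

# stratify=y keeps the Serious/Slight ratio (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

share_all = (y == "Serious").mean()
share_test = (y_te == "Serious").mean()
print(round(share_all, 4), round(share_test, 4))
```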

In [54]:
from sklearn.ensemble import RandomForestClassifier
#class_weight = dict({2:1, 1:15, 0:50})
rdf = RandomForestClassifier(n_estimators=300,random_state=35)

rdf.fit(X_train,y_train)

y_pred=rdf.predict(X_test)

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

Accuracy: 0.9259220886862827
[[   6  671]
 [  44 8931]]
              precision    recall  f1-score   support

     Serious       0.12      0.01      0.02       677
      Slight       0.93      1.00      0.96      8975

    accuracy                           0.93      9652
   macro avg       0.53      0.50      0.49      9652
weighted avg       0.87      0.93      0.90      9652



In [55]:
from sklearn.ensemble import RandomForestClassifier
#class_weight = dict({2:1, 1:15, 0:50})
rdf = RandomForestClassifier(bootstrap=True,
            class_weight="balanced_subsample", 
            criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=300,
            oob_score=True,
            random_state=35,
            verbose=0, warm_start=False)

In [56]:
rdf.fit(X_train,y_train)

y_pred=rdf.predict(X_test)

In [57]:

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7232697886448405


In [58]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[ 349  328]
 [2343 6632]]
              precision    recall  f1-score   support

     Serious       0.13      0.52      0.21       677
      Slight       0.95      0.74      0.83      8975

    accuracy                           0.72      9652
   macro avg       0.54      0.63      0.52      9652
weighted avg       0.90      0.72      0.79      9652
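
The jump in recall for the Serious class comes from class weighting. In scikit-learn's "balanced" mode each class receives the weight `n_samples / (n_classes * class_count)`; `"balanced_subsample"` applies the same formula to each tree's bootstrap sample. The sketch below computes the weights for the full sampled data set, using the class counts reported earlier (the array `y` is rebuilt from those counts rather than loaded from the CSV).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the class counts reported earlier in the notebook.
y = np.array(["Slight"] * 45000 + ["Serious"] * 3258)
classes = np.unique(y)  # sorted alphabetically: ['Serious', 'Slight']

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand from the formula.
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

print(dict(zip(classes, weights.round(3))))
```

Under these counts a Serious record weighs roughly 14 times as much as a Slight one, which is why the forest trades overall accuracy for recall on the rare class.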



In [59]:
from xgboost import XGBClassifier
# Note: class_weight is a scikit-learn option and not an XGBoost parameter, so it is not passed here
model = XGBClassifier(learning_rate=0.07, n_estimators=300,
                      max_depth=8, min_child_weight=1,
                      scale_pos_weight=7,
                      seed=27, subsample=0.8, colsample_bytree=0.8)

model.fit(X_train,y_train)

y_pred=model.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9298590965602984


In [60]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[   0  677]
 [   0 8975]]
              precision    recall  f1-score   support

     Serious       0.00      0.00      0.00       677
      Slight       0.93      1.00      0.96      8975

    accuracy                           0.93      9652
   macro avg       0.46      0.50      0.48      9652
weighted avg       0.86      0.93      0.90      9652
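
XGBoost's `scale_pos_weight` scales the weight of the positive class (label 1), and a commonly suggested value is the ratio of negative to positive examples. If the string labels are encoded alphabetically, `Slight` becomes class 1, so the weight would favour the majority class; this may explain the zero recall above, although that is an assumption about how the labels were encoded in this run. A sketch of an explicit encoding, with `Serious` as the positive class:

```python
import numpy as np

# Labels rebuilt from the class counts reported earlier in the notebook.
y = np.array(["Slight"] * 45000 + ["Serious"] * 3258)

# Encode explicitly so that the rare class is the positive one (1 = Serious).
y_bin = (y == "Serious").astype(int)

# Ratio of negative (Slight) to positive (Serious) examples.
neg, pos = np.bincount(y_bin)
scale_pos_weight = neg / pos
print(neg, pos, round(scale_pos_weight, 2))
```

The resulting value could then be passed as `scale_pos_weight` when fitting on `y_bin` instead of the string labels.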



In [61]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model with 3 neighbours and distance-based weighting
knn = KNeighborsClassifier(n_neighbors=3,weights='distance')

# fit the model with data (occurs in-place)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[  32  645]
 [ 182 8793]]
              precision    recall  f1-score   support

     Serious       0.15      0.05      0.07       677
      Slight       0.93      0.98      0.96      8975

    accuracy                           0.91      9652
   macro avg       0.54      0.51      0.51      9652
weighted avg       0.88      0.91      0.89      9652



In [62]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
y_pred = logisticRegr.predict(X_test)
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[   0  677]
 [   0 8975]]
              precision    recall  f1-score   support

     Serious       0.00      0.00      0.00       677
      Slight       0.93      1.00      0.96      8975

    accuracy                           0.93      9652
   macro avg       0.46      0.50      0.48      9652
weighted avg       0.86      0.93      0.90      9652



In [63]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print(accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)
print(format(classification_report(y_test, y_pred)))

0.9000207210940737
              precision    recall  f1-score   support

     Serious       0.19      0.13      0.15       677
      Slight       0.94      0.96      0.95      8975

    accuracy                           0.90      9652
   macro avg       0.56      0.54      0.55      9652
weighted avg       0.88      0.90      0.89      9652



In [64]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(loss="deviance", learning_rate=0.1, 
      n_estimators=100, subsample=1.0, criterion="friedman_mse", 
      min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
      max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, 
      random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, 
      presort="auto")

y_pred = gbc.fit(X_train, y_train.values.ravel()).predict(X_test)
print(format(classification_report(y_test, y_pred)))
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

     Serious       0.00      0.00      0.00       677
      Slight       0.93      1.00      0.96      8975

    accuracy                           0.93      9652
   macro avg       0.46      0.50      0.48      9652
weighted avg       0.86      0.93      0.90      9652

0.9296518856195607


## **Discussion**

At the beginning of this notebook, the categorical features were of type 'object'. This data type cannot be fed to a learning algorithm, so each category was mapped to a numeric code with `DataFrame.replace`.

After solving that issue we were presented with another: imbalanced data. Slight accidents far outnumber Serious and Fatal ones. We downsampled the Slight class to 45,000 records with `DataFrame.sample`, sampled 1,629 Serious records to match the 1,629 Fatal records, and merged Fatal into Serious, which gave 3,258 Serious records. The resulting data set is still imbalanced at roughly 14 to 1.

Once the data was cleaned, it was fed through six classifiers: Random Forest, XGBoost, K-Nearest Neighbor, Logistic Regression, Gaussian Naive Bayes and Gradient Boosting. Most of them reach about 93% accuracy only by predicting Slight for almost every record. The Random Forest with `class_weight="balanced_subsample"` is the only model with substantial recall on the Serious class (0.52), at the cost of lower overall accuracy (0.72).

The models were evaluated with accuracy, the confusion matrix, and per-class precision, recall and F1-score. Because of the class imbalance, recall on the Serious class is more informative than overall accuracy.
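
For reference, the per-class figures in the classification reports follow directly from the confusion matrix. The sketch below recomputes them for the class-weighted Random Forest, taking the matrix from the output above; the variable names are illustrative.

```python
# Confusion matrix of the class-weighted Random Forest (rows: true, columns: predicted).
# Row and column order is [Serious, Slight], as in the printed output.
cm = [[349, 328],
      [2343, 6632]]

tp = cm[0][0]  # Serious correctly predicted as Serious
fn = cm[0][1]  # Serious predicted as Slight
fp = cm[1][0]  # Slight predicted as Serious
tn = cm[1][1]  # Slight correctly predicted as Slight

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"{precision:.2f} {recall:.2f} {f1:.2f} {accuracy:.4f}")
# prints: 0.13 0.52 0.21 0.7233
```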

## **Conclusion**

Based on the historical data, we can conclude that particular weather, road and light conditions have some impact on whether an accident is Slight or Serious (including Fatal).