# Accident Severity Prediction

## Introduction

The business objective is to use models trained on existing data provided by the UK Department for Transport to predict the severity of accidents before they happen. These predictions can then be used to make recommendations to improve road safety.

We will examine models that can be used to make the prediction outlined in the business objective and evaluate which model is best suited to the dataset.

The dataset used can be found at https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-accident-provisional-mid-year-unvalidated-2021.csv

In [23]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB

In [2]:
# Show every column when displaying the DataFrame
pd.set_option("display.max_columns", None)
df = pd.read_csv('dataset.csv')

df.head()

Unnamed: 0,status,accident_index,accident_year,accident_reference,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,accident_severity,number_of_vehicles,number_of_casualties,date,day_of_week,time,local_authority_district,local_authority_ons_district,local_authority_highway,first_road_class,first_road_number,road_type,speed_limit,junction_detail,junction_control,second_road_class,second_road_number,pedestrian_crossing_human_control,pedestrian_crossing_physical_facilities,light_conditions,weather_conditions,road_surface_conditions,special_conditions_at_site,carriageway_hazards,urban_or_rural_area,did_police_officer_attend_scene_of_accident,trunk_road_flag,lsoa_of_accident_location
0,Unvalidated,2021010000000.0,2021,10287148,521508.0,193079.0,,,1,3,3,1,01/01/2021,6,02:05,-1,E09000003,E09000003,6,0,6,30,9,4,6,0,0,0,4,7,4,1,0,-1,1,-1,-1
1,Unvalidated,2021010000000.0,2021,10287149,535379.0,180783.0,,,1,2,2,3,01/01/2021,6,03:30,-1,E09000030,E09000030,3,1203,3,30,7,2,3,1204,0,5,4,1,1,0,0,-1,1,-1,-1
2,Unvalidated,2021010000000.0,2021,10287151,529701.0,170398.0,,,1,2,2,4,01/01/2021,6,04:07,-1,E09000022,E09000022,4,272,6,30,9,2,5,0,0,5,4,1,1,0,0,-1,1,-1,-1
3,Unvalidated,2021010000000.0,2021,10287155,525312.0,178385.0,,,1,1,1,1,01/01/2021,6,04:26,-1,E09000020,E09000020,3,3220,2,30,9,4,6,0,0,4,4,1,1,0,0,-1,1,-1,-1
4,Unvalidated,2021010000000.0,2021,10287157,512144.0,171526.0,,,1,3,4,1,01/01/2021,6,03:10,-1,E09000018,E09000018,5,0,6,20,3,4,6,0,0,0,4,1,1,0,0,-1,1,-1,-1


## Data Cleaning and pre-processing

We can remove the 'status', 'accident_index', 'accident_year' and 'accident_reference' columns from the dataset, as the models do not use them.

In [3]:
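# Drop the four unused columns by name rather than by position
df = df.drop(columns=['status', 'accident_index', 'accident_year', 'accident_reference'])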

df.head()

Unnamed: 0,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,accident_severity,number_of_vehicles,number_of_casualties,date,day_of_week,time,local_authority_district,local_authority_ons_district,local_authority_highway,first_road_class,first_road_number,road_type,speed_limit,junction_detail,junction_control,second_road_class,second_road_number,pedestrian_crossing_human_control,pedestrian_crossing_physical_facilities,light_conditions,weather_conditions,road_surface_conditions,special_conditions_at_site,carriageway_hazards,urban_or_rural_area,did_police_officer_attend_scene_of_accident,trunk_road_flag,lsoa_of_accident_location
0,521508.0,193079.0,,,1,3,3,1,01/01/2021,6,02:05,-1,E09000003,E09000003,6,0,6,30,9,4,6,0,0,0,4,7,4,1,0,-1,1,-1,-1
1,535379.0,180783.0,,,1,2,2,3,01/01/2021,6,03:30,-1,E09000030,E09000030,3,1203,3,30,7,2,3,1204,0,5,4,1,1,0,0,-1,1,-1,-1
2,529701.0,170398.0,,,1,2,2,4,01/01/2021,6,04:07,-1,E09000022,E09000022,4,272,6,30,9,2,5,0,0,5,4,1,1,0,0,-1,1,-1,-1
3,525312.0,178385.0,,,1,1,1,1,01/01/2021,6,04:26,-1,E09000020,E09000020,3,3220,2,30,9,4,6,0,0,4,4,1,1,0,0,-1,1,-1,-1
4,512144.0,171526.0,,,1,3,4,1,01/01/2021,6,03:10,-1,E09000018,E09000018,5,0,6,20,3,4,6,0,0,0,4,1,1,0,0,-1,1,-1,-1


### Set the input and target variable

As was discussed in the group work, one of the factors that affects accident severity is speeding. As there is no data on whether accidents were caused by speeding, we will use the speed limit as our input variable.

In [4]:
X = df.iloc[:, 18].values.reshape(-1, 1)   

y = df.iloc[:, 6].values
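
Positional indexing depends on the column order, so it can help to confirm which columns positions 18 and 6 refer to. The sketch below is a hedged check rather than part of the original notebook: it rebuilds the first 19 column names and the first row from the `df.head()` output above, assuming that `Unnamed: 0` is the displayed row index and not a real column. The names `demo`, `X_demo` and `y_demo` are illustrative.

```python
import pandas as pd

# First 19 column names after the drop, copied from the df.head() output.
# Assumption: "Unnamed: 0" is the displayed row index, not a real column.
columns = [
    "location_easting_osgr", "location_northing_osgr", "longitude", "latitude",
    "police_force", "accident_severity", "number_of_vehicles",
    "number_of_casualties", "date", "day_of_week", "time",
    "local_authority_district", "local_authority_ons_district",
    "local_authority_highway", "first_road_class", "first_road_number",
    "road_type", "speed_limit", "junction_detail",
]
# First displayed row, truncated to the same 19 columns.
first_row = [
    521508.0, 193079.0, float("nan"), float("nan"), 1, 3, 3, 1,
    "01/01/2021", 6, "02:05", -1, "E09000003", "E09000003", 6, 0, 6, 30, 9,
]
demo = pd.DataFrame([first_row], columns=columns)

# Which columns do the positions used in the notebook point at?
print("position 18:", demo.columns[18])
print("position 6: ", demo.columns[6])

# Selecting by name does not depend on the column order.
X_demo = demo[["speed_limit"]].to_numpy()      # 2-D, shape (n_samples, 1)
y_demo = demo["accident_severity"].to_numpy()  # 1-D, shape (n_samples,)
```

Under the stated assumption, positions 18 and 6 resolve to `junction_detail` and `number_of_vehicles`. If the same holds for the full dataset, it would also explain why the classification reports list classes 1 to 11, although `accident_severity` takes only three values.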

### Train test split

We dedicate 33% of our data to testing; the remaining 67% will be used for training.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size = 0.33, random_state = 0)
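
Because the classes are heavily imbalanced, a stratified split may keep the class proportions similar in both subsets. The sketch below illustrates the `stratify` argument of `train_test_split` on synthetic labels (`X_demo` and `y_demo` are made up); it is not what the notebook does. Stratification needs at least two samples per class, which very rare classes may not satisfy.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 64 of class 2, 27 of class 1, 9 of class 3.
y_demo = np.array([2] * 64 + [1] * 27 + [3] * 9)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Same split ratio and seed as above, with stratification on the labels.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0, stratify=y_demo
)

print("train size:", len(y_tr), "test size:", len(y_te))
for label in np.unique(y_demo):
    train_share = round(float(np.mean(y_tr == label)), 2)
    test_share = round(float(np.mean(y_te == label)), 2)
    print("class", label, "train share:", train_share, "test share:", test_share)
```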

## Baseline Methods

To evaluate the models on the dataset, we first fit each one with its default hyperparameters and record a set of evaluation metrics. We then tune the hyperparameters and evaluate the models again using the same metrics.

### Logistic Regression

#### Fit the model to our data

In [6]:
LR = LogisticRegression(random_state = 0, max_iter=10000)
LR.fit(x_train, y_train)


y_pred = LR.predict(x_test)

#### Evaluation metrics 

To evaluate the model, we will use the accuracy score and the classification report.

In [7]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=1))

accuracy score:  0.638171543028829
              precision    recall  f1-score   support

           1       1.00      0.00      0.00      3756
           2       0.64      1.00      0.78      8921
           3       1.00      0.00      0.00       993
           4       1.00      0.00      0.00       220
           5       1.00      0.00      0.00        56
           6       1.00      0.00      0.00        18
           7       1.00      0.00      0.00         9
           8       1.00      0.00      0.00         3
           9       1.00      0.00      0.00         1
          10       1.00      0.00      0.00         1
          11       1.00      0.00      0.00         1

    accuracy                           0.64     13979
   macro avg       0.97      0.09      0.07     13979
weighted avg       0.77      0.64      0.50     13979
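
The accuracy above can be compared with a majority-class baseline. The sketch below rebuilds a label vector from the `support` column of the report and scores a `DummyClassifier` that always predicts the most frequent class. Fitting the dummy on these rebuilt labels is only for illustration, and the names `support`, `y_true_demo` and `dummy` are assumptions of this sketch.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Per-class support copied from the classification report above.
support = {1: 3756, 2: 8921, 3: 993, 4: 220, 5: 56, 6: 18,
           7: 9, 8: 3, 9: 1, 10: 1, 11: 1}

# Rebuild a label vector with the same class counts as the test set.
y_true_demo = np.repeat(list(support.keys()), list(support.values()))
X_placeholder = np.zeros((len(y_true_demo), 1))  # features are ignored

# A classifier that always predicts the most frequent class.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_placeholder, y_true_demo)
baseline_accuracy = accuracy_score(y_true_demo, dummy.predict(X_placeholder))

print("samples:", len(y_true_demo))
print("majority-class accuracy:", baseline_accuracy)
```

The baseline works out to 8921 / 13979, roughly 0.638, which is the same value as the reported accuracy. This suggests the fitted model predicts class 2 for every test sample.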


### Gaussian Naive Bayes 

#### Fit the model to our data

In [8]:
GNB = GaussianNB()
GNB.fit(x_train, y_train)
y_predict = GNB.predict(x_test)

#### Evaluation metrics

We will use the same metrics as for the Logistic Regression model to evaluate the Gaussian Naive Bayes model.

In [9]:
print("accuracy score: ", accuracy_score(y_test, y_predict))
print(classification_report(y_test, y_predict, zero_division=1))

accuracy score:  0.2356391730452822
              precision    recall  f1-score   support

           1       1.00      0.00      0.00      3756
           2       0.73      0.37      0.49      8921
           3       1.00      0.00      0.00       993
           4       1.00      0.00      0.00       220
           5       1.00      0.00      0.00        56
           6       1.00      0.00      0.00        18
           7       1.00      0.00      0.00         9
           8       0.00      0.67      0.00         3
           9       1.00      0.00      0.00         1
          10       1.00      0.00      0.00         1
          11       0.00      0.00      0.00         1

    accuracy                           0.24     13979
   macro avg       0.79      0.09      0.04     13979
weighted avg       0.83      0.24      0.31     13979
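
The precision column in these reports is shaped by `zero_division=1`. A class that is never predicted has an undefined precision, which this setting reports as 1.00, whereas a class that is predicted but almost never correctly shows a precision near 0.00. The small sketch below shows the effect on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative). It also computes `balanced_accuracy_score`, one metric that does not reward predicting only the majority class.

```python
from sklearn.metrics import balanced_accuracy_score, precision_score

# Made-up labels: the "model" predicts class 2 for every sample.
y_true_demo = [1, 1, 2, 2, 2, 3]
y_pred_demo = [2, 2, 2, 2, 2, 2]

# Classes 1 and 3 are never predicted, so their precision is undefined.
as_one = precision_score(y_true_demo, y_pred_demo, average=None, zero_division=1)
as_zero = precision_score(y_true_demo, y_pred_demo, average=None, zero_division=0)
print("per-class precision, zero_division=1:", as_one)
print("per-class precision, zero_division=0:", as_zero)

# The macro average changes a lot depending on that choice.
print("macro precision with zero_division=1:", as_one.mean())
print("macro precision with zero_division=0:", as_zero.mean())

# Balanced accuracy is the mean of the per-class recall values.
balanced = balanced_accuracy_score(y_true_demo, y_pred_demo)
print("balanced accuracy:", balanced)
```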


## Hyperparameter tuning for Logistic Regression

We will now tune the hyperparameters of our Logistic Regression model, searching over the solver and the regularization strength C.

In [10]:
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)

In [11]:
LR = LogisticRegression(max_iter=10000)
cv = RepeatedStratifiedKFold(n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=LR, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(x_train, y_train)



In [12]:
grid_result

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=1),
             error_score=0, estimator=LogisticRegression(max_iter=10000),
             n_jobs=-1,
             param_grid={'C': [100, 10, 1.0, 0.1, 0.01], 'penalty': ['l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear']},
             scoring='accuracy')

In [13]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.637690 using {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
0.637690 (0.000079) with: {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
0.637690 (0.000079) with: {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
0.637690 (0.000079) with: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
0.637690 (0.000079) with: {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}
0.637690 (0.000079) with: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
0.637690 (0.000079) with: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
0.637690 (0.000079) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
0.637690 (0.000079) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.637690 (0.000079) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
0.637690 (0.000079) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
0.637690 (0.000079) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
0.637690 (0.000079) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
0.637690 (
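
The same kind of summary can be read more easily as a table. The sketch below is a self-contained illustration on synthetic data from `make_classification` with a reduced grid; both are assumptions of this example, not the notebook's data or grid. It loads `cv_results_` into a pandas DataFrame and reports the spread of the mean scores, which makes a flat grid easy to spot.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic binary problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Reduced grid so the example runs quickly.
demo_grid = {"solver": ["lbfgs", "liblinear"], "C": [10, 1.0, 0.1]}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
search = GridSearchCV(LogisticRegression(max_iter=1000), demo_grid,
                      cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
wanted = ["param_solver", "param_C", "mean_test_score",
          "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[wanted].sort_values("rank_test_score")
print(results.to_string(index=False))

# A spread of zero means no combination did better than any other.
spread = results["mean_test_score"].max() - results["mean_test_score"].min()
print("spread of mean scores:", spread)
```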

In [14]:
y_pred = grid_search.predict(x_test)

In [25]:
print("accuracy score: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=1))

accuracy score:  0.638171543028829
              precision    recall  f1-score   support

           1       1.00      0.00      0.00      3756
           2       0.64      1.00      0.78      8921
           3       1.00      0.00      0.00       993
           4       1.00      0.00      0.00       220
           5       1.00      0.00      0.00        56
           6       1.00      0.00      0.00        18
           7       1.00      0.00      0.00         9
           8       1.00      0.00      0.00         3
           9       1.00      0.00      0.00         1
          10       1.00      0.00      0.00         1
          11       1.00      0.00      0.00         1

    accuracy                           0.64     13979
   macro avg       0.97      0.09      0.07     13979
weighted avg       0.77      0.64      0.50     13979


## Hyperparameter tuning for Gaussian NB

We will now tune the var_smoothing hyperparameter of the GaussianNB model, searching 100 values spaced logarithmically between 1 and 1e-9.

In [15]:
np.logspace(0,-9, num=100)

array([1.00000000e+00, 8.11130831e-01, 6.57933225e-01, 5.33669923e-01,
       4.32876128e-01, 3.51119173e-01, 2.84803587e-01, 2.31012970e-01,
       1.87381742e-01, 1.51991108e-01, 1.23284674e-01, 1.00000000e-01,
       8.11130831e-02, 6.57933225e-02, 5.33669923e-02, 4.32876128e-02,
       3.51119173e-02, 2.84803587e-02, 2.31012970e-02, 1.87381742e-02,
       1.51991108e-02, 1.23284674e-02, 1.00000000e-02, 8.11130831e-03,
       6.57933225e-03, 5.33669923e-03, 4.32876128e-03, 3.51119173e-03,
       2.84803587e-03, 2.31012970e-03, 1.87381742e-03, 1.51991108e-03,
       1.23284674e-03, 1.00000000e-03, 8.11130831e-04, 6.57933225e-04,
       5.33669923e-04, 4.32876128e-04, 3.51119173e-04, 2.84803587e-04,
       2.31012970e-04, 1.87381742e-04, 1.51991108e-04, 1.23284674e-04,
       1.00000000e-04, 8.11130831e-05, 6.57933225e-05, 5.33669923e-05,
       4.32876128e-05, 3.51119173e-05, 2.84803587e-05, 2.31012970e-05,
       1.87381742e-05, 1.51991108e-05, 1.23284674e-05, 1.00000000e-05,
      

In [16]:
GNB = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
cv = RepeatedStratifiedKFold(n_repeats=3, random_state=1)
gs_NB = GridSearchCV(estimator=GNB, 
                 param_grid=params_NB, 
                 cv=cv, 
                 verbose=1, 
                 scoring='accuracy') 
gs_NB.fit(x_train, y_train)


Fitting 15 folds for each of 100 candidates, totalling 1500 fits




GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=1),
             estimator=GaussianNB(),
             param_grid={'var_smoothing': array([1.00000000e+00, 8.11130831e-01, 6.57933225e-01, 5.33669923e-01,
       4.32876128e-01, 3.51119173e-01, 2.84803587e-01, 2.31012970e-01,
       1.87381742e-01, 1.51991108e-01, 1.23284674e-01, 1.00000000e-01,
       8.11130831e-02, 6.57933225e-02, 5.3...
       1.23284674e-07, 1.00000000e-07, 8.11130831e-08, 6.57933225e-08,
       5.33669923e-08, 4.32876128e-08, 3.51119173e-08, 2.84803587e-08,
       2.31012970e-08, 1.87381742e-08, 1.51991108e-08, 1.23284674e-08,
       1.00000000e-08, 8.11130831e-09, 6.57933225e-09, 5.33669923e-09,
       4.32876128e-09, 3.51119173e-09, 2.84803587e-09, 2.31012970e-09,
       1.87381742e-09, 1.51991108e-09, 1.23284674e-09, 1.00000000e-09])},
             scoring='accuracy', verbose=1)

In [17]:
gs_NB.best_params_

{'var_smoothing': 1.0}
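
The selected `var_smoothing` of 1.0 is the largest value in the grid. In scikit-learn's `GaussianNB`, this parameter adds a portion of the largest feature variance to every class variance, so a value of 1.0 flattens the per-class likelihoods and lets the class priors dominate. The sketch below illustrates this on made-up, single-feature data; the values and the names `X_demo` and `y_demo` are assumptions, not the accident data. With the default smoothing, a tiny zero-variance class captures every sample at its value, whereas with `var_smoothing=1.0` only the majority class is predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Made-up data: a majority class (0) spread over several values and a
# tiny minority class (1) whose three samples all share the value 30.
x_major = [20] * 10 + [30] * 30 + [40] * 10 + [60] * 8 + [70] * 6
x_minor = [30] * 3
X_demo = np.array(x_major + x_minor, dtype=float).reshape(-1, 1)
y_demo = np.array([0] * len(x_major) + [1] * len(x_minor))

default_nb = GaussianNB().fit(X_demo, y_demo)  # default var_smoothing=1e-9
smooth_nb = GaussianNB(var_smoothing=1.0).fit(X_demo, y_demo)

pred_default = default_nb.predict(X_demo)
pred_smooth = smooth_nb.predict(X_demo)

# With almost no smoothing, the zero-variance class wins wherever x == 30.
at_30 = X_demo[:, 0] == 30
print("default, predictions at x=30:", np.unique(pred_default[at_30]))
# With heavy smoothing, only the majority class is predicted.
print("smoothed, predicted classes:", np.unique(pred_smooth))

acc_default = accuracy_score(y_demo, pred_default)
acc_smooth = accuracy_score(y_demo, pred_smooth)
print("accuracy default:", acc_default, "smoothed:", acc_smooth)
```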

In [22]:
y_predict = gs_NB.predict(x_test)

print("accuracy score: ", accuracy_score(y_test, y_predict))
print(classification_report(y_test, y_predict, zero_division=1))

accuracy score:  0.638171543028829
              precision    recall  f1-score   support

           1       1.00      0.00      0.00      3756
           2       0.64      1.00      0.78      8921
           3       1.00      0.00      0.00       993
           4       1.00      0.00      0.00       220
           5       1.00      0.00      0.00        56
           6       1.00      0.00      0.00        18
           7       1.00      0.00      0.00         9
           8       1.00      0.00      0.00         3
           9       1.00      0.00      0.00         1
          10       1.00      0.00      0.00         1
          11       1.00      0.00      0.00         1

    accuracy                           0.64     13979
   macro avg       0.97      0.09      0.07     13979
weighted avg       0.77      0.64      0.50     13979


## Conclusion

The results from the evaluation of both models showed that:

- Based on the accuracy score, Logistic Regression was the better model before hyperparameter tuning (about 64% against about 24% for Gaussian NB).
- Hyperparameter tuning did not improve the accuracy of the Logistic Regression model; every parameter combination listed in the grid search output scored 0.637690 in cross-validation.
- Gaussian NB improved from about 24% to about 64% in accuracy score with hyperparameter tuning, matching Logistic Regression.
- For the Gaussian NB model, the classification report after tuning is identical to the Logistic Regression report: recall is 1.00 for class 2 and 0.00 for every other class. The weighted F1 score rose from 0.31 to 0.50, but no class other than class 2 is recovered.
- For the Logistic Regression model, the precision, recall and F1 score in the classification report showed no improvement with hyperparameter tuning.