Engineers and researchers in the automobile
industry have tried to design and build safer automobiles, but
traffic accidents are unavoidable. Patterns involved in
dangerous crashes could be detected if we develop a
prediction model that automatically classifies the type of
injury severity of various traffic accidents. These behavioral
and roadway patterns are useful in the development of traffic
safety control policy. 

Road accidents are never a happy topic to discuss. They not only have severe consequences for those involved but also affect the lives of many others, such as friends and family. With more vehicles on the road than ever before, it's important to understand accidents in greater detail and, where possible, to ‘predict’ their locations and consequences. Government agencies in the UK have been collecting data on reported accidents since 2005. The data include generic and specific details about the vehicles, the driver, the number of passengers, and the number of casualties.

With data available since 2005, one could develop a model to predict accidents. Because the database records only reported accidents, every record corresponds to an accident that has actually ‘happened’. We use these data to predict the location of an accident, in terms of latitude and longitude, and the expected number of casualties.

## Research Goals

Identify and quantify associations (if any) between the number of casualties and other variables in the data set.

Explore whether it is possible to predict accident hot-spots based on the data.

### Predictive Analytics Problem? 

Can we predict the number of casualties in each accident? Doing so helps us identify which features are associated with higher or lower casualty counts.

We use xgboost (extreme gradient boosting) regressors to build the models.

In [1]:
from __future__ import division  # __future__ imports must come before all others
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost
from scipy.stats import pearsonr
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split is provided by sklearn.model_selection.
from sklearn import tree, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score



In [2]:
%%time
df1 = pd.read_csv('/Users/user/Desktop/application/DHL/accidents_2005_to_2007.csv',low_memory=False)
df2 = pd.read_csv('/Users/user/Desktop/application/DHL/accidents_2009_to_2011.csv',low_memory=False)
df3 = pd.read_csv('/Users/user/Desktop/application/DHL/accidents_2012_to_2014.csv',low_memory=False)

# check that the three datasets have the same column headers
assert set(df1.columns) == set(df2.columns) == set(df3.columns)

# combine three datasets into one
df = pd.concat([df1, df2, df3],ignore_index=True)

print('Number of rows and columns', df1.shape, df2.shape, df3.shape, df.shape)

Number of rows and columns (570011, 33) (469442, 33) (464697, 33) (1504150, 33)
CPU times: user 10.6 s, sys: 1.67 s, total: 12.2 s
Wall time: 12.3 s


In [3]:
df.columns

Index(['Accident_Index', 'Location_Easting_OSGR', 'Location_Northing_OSGR',
       'Longitude', 'Latitude', 'Police_Force', 'Accident_Severity',
       'Number_of_Vehicles', 'Number_of_Casualties', 'Date', 'Day_of_Week',
       'Time', 'Local_Authority_(District)', 'Local_Authority_(Highway)',
       '1st_Road_Class', '1st_Road_Number', 'Road_Type', 'Speed_limit',
       'Junction_Detail', 'Junction_Control', '2nd_Road_Class',
       '2nd_Road_Number', 'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions',
       'Weather_Conditions', 'Road_Surface_Conditions',
       'Special_Conditions_at_Site', 'Carriageway_Hazards',
       'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident',
       'LSOA_of_Accident_Location', 'Year'],
      dtype='object')

# Add additional features to be used in prediction

In [4]:
df['Date'] = pd.to_datetime(df['Date'],dayfirst=True)

In [None]:
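from datetime import datetime

def to_hour(time):
    """Return the hour (0-23) of an 'HH:MM' string, or 0 if it cannot be parsed."""
    try:
        return datetime.strptime(str(time), '%H:%M').hour
    except ValueError:
        # Note: this fallback makes unparseable times look like midnight.
        return 0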
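
Because `to_hour` returns 0 on failure, missing times become indistinguishable from accidents recorded at midnight. One possible alternative, sketched here on a toy series rather than the real `Time` column, is to parse the whole column at once and keep failures as missing values. The series `times` and its contents are made up for illustration.

```python
import pandas as pd

# Toy 'Time' values (assumption): two valid times, one missing, one malformed.
times = pd.Series(['17:42', '00:15', None, 'bad'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(times, format='%H:%M', errors='coerce')

# .dt.hour yields NaN wherever parsing failed, so midnight stays distinct.
hours = parsed.dt.hour
print(hours.tolist())
```

Rows with a missing hour can then be dropped or imputed explicitly, rather than silently counted as midnight.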

In [None]:
# Extract month, day of month, and day of year from the Date column
df['month'] = df['Date'].dt.month
df['day_of_month'] = df['Date'].dt.day
df['day_of_year'] = df['Date'].dt.dayofyear
# Extract the hour from the Time column
df['hour'] = df['Time'].apply(to_hour)

In [None]:
del df['Junction_Detail']
df = df.dropna(how='any',axis=0) 

In [None]:
target = 'Number_of_Casualties'
# Use every column as a feature except identifiers, the target, and raw date/time fields
features = [col for col in df.columns
            if col not in ['Accident_Index', 'Number_of_Casualties', 'Date', 'Time',
                           'Special_Conditions_at_Site', 'Carriageway_Hazards']]

In [None]:
df[features].columns

# One-Hot Encode Categorical Data

In [None]:
# Categorical columns to one-hot encode (each listed once)
cat_columns = ['Local_Authority_(Highway)',
'Road_Type',
'Junction_Control',
'Pedestrian_Crossing-Human_Control',
'Pedestrian_Crossing-Physical_Facilities',
'Light_Conditions',
'Weather_Conditions',
'Road_Surface_Conditions',
'Did_Police_Officer_Attend_Scene_of_Accident',
'Urban_or_Rural_Area',
'LSOA_of_Accident_Location',
'Accident_Severity',
'Day_of_Week',
'1st_Road_Class',
'2nd_Road_Class']

In [None]:
df = pd.get_dummies(df, columns=cat_columns)
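
Note that `pd.get_dummies` replaces each encoded column with one indicator column per category, so a feature list built before encoding still refers to column names that no longer exist. The sketch below shows this on a toy frame. The frame `toy`, its values, and the name `features_demo` are made up for illustration.

```python
import pandas as pd

# Toy frame (assumption) with one categorical and one numeric feature.
toy = pd.DataFrame({
    'Speed_limit': [30, 60, 30],
    'Road_Type': ['Single carriageway', 'Dual carriageway', 'Single carriageway'],
    'Number_of_Casualties': [1, 2, 1],
})

# 'Road_Type' is dropped and replaced by 'Road_Type_<category>' columns.
encoded = pd.get_dummies(toy, columns=['Road_Type'])
print(list(encoded.columns))

# Rebuild the feature list from the encoded frame, excluding the target.
features_demo = [c for c in encoded.columns if c != 'Number_of_Casualties']
print(features_demo)
```

Applied to this notebook, that would mean recomputing `features` from `df.columns` after the encoding step.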

# Predicting Number of Casualties 

In [25]:
X = df[features].values
y = df[target].values

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Simple linear regression model

In [27]:
regr = linear_model.LinearRegression()

In [28]:
regr.fit(X_train, y_train)
print(regr.predict(X_test))

ValueError: could not convert string to float: 'E01013550'
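
This error means that `X` still contains string values, which `LinearRegression` cannot convert to numbers. The value 'E01013550' has the form of an LSOA code, which suggests that `LSOA_of_Accident_Location` (or another text column) was still among the features when the model was fitted. A minimal sketch for spotting such columns before fitting is shown below. The frame `toy`, its second LSOA value, and the names `candidate_features`, `non_numeric`, and `numeric_features` are made up for illustration.

```python
import pandas as pd

# Toy frame (assumption) mixing numeric columns with one string column.
toy = pd.DataFrame({
    'Speed_limit': [30, 60],
    'LSOA_of_Accident_Location': ['E01013550', 'E01000001'],
    'Number_of_Vehicles': [2, 1],
})
candidate_features = list(toy.columns)

# Columns whose dtype is not numeric cannot be passed to the regressor as-is.
non_numeric = (toy[candidate_features]
               .select_dtypes(exclude='number')
               .columns.tolist())
print(non_numeric)

# Keep only numeric columns, or encode the others first.
numeric_features = [c for c in candidate_features if c not in non_numeric]
print(numeric_features)
```

Any column reported in `non_numeric` needs to be encoded or excluded before calling `fit`.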