# Car accident severity

## Description of the problem and discussion of the background

Car crashes have risen to the 8th leading cause of death for people globally. About 1.35 million people die in road accidents worldwide every year, which is roughly **3700 deaths a day**.

Most accidents take place on the busiest roads, the ones people take every day on their commutes to and from work.

There are large records that describe the characteristics of each accident: road conditions, weather, vehicle, driver info (age, hours of sleep...), etc. For this project we will be working with some of these characteristics to predict the severity of an accident.

This prediction will make it possible to alert drivers to the current risks so that they can avoid traveling in dangerous conditions.
It will also help the authorities better allocate resources, such as police and emergency medical teams, according to the conditions that make severe accidents more likely in a certain area.

## Looking at the data and preparing it

The dataset we will use covers accidents in the city of Seattle. Let's import it and take a look.

In [1]:
# import necessary libraries: pandas and numpy
import pandas as pd
import numpy as np

In [2]:
# loading the data from a CSV file to a dataframe
df = pd.read_csv(r'C:\Users\u10054354\Desktop\Capstone\Data-Collisions.csv', low_memory=False)
df.columns # listing the columns available on the data

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

In [3]:
# sneak peek into the data as-is
df.head(3)

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N


The dataset has 38 columns, the first one being SEVERITYCODE. This column will be used as the target variable, as it measures the severity of the accident. SEVERITYCODE takes the values 1 and 2, which represent the following consequences of the accident:

	1. Property Damage
	2. Injury

We will use the attributes WEATHER, ROADCOND and LIGHTCOND to predict the severity of an accident.

The original dataset needs some preparation to be fit for analysis. First, let's keep only the columns that will be used for the analysis.

In [4]:
# discarding unnecessary columns that won't be used for the model
df.drop(['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE',
         'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE.1',
         'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT',
         'VEHCOUNT', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
         'INATTENTIONIND', 'UNDERINFL', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING',
         'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
        axis=1, inplace=True)
df.head() # checking that only the necessary columns remain

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,Overcast,Wet,Daylight
1,1,Raining,Wet,Dark - Street Lights On
2,1,Overcast,Dry,Daylight
3,1,Clear,Dry,Daylight
4,2,Raining,Wet,Daylight


In [5]:
# function for printing the current number of rows
def df_rows():
    print('Current number of rows: ', len(df))
    
df_rows()

Current number of rows:  194673


We now have around 195k rows of data, but in order to train our model we need to remove the rows with unusable data. Let's exclude the rows where any of the three considered features has no value (NaN) and also drop the rows where a feature is 'Unknown'.

In [6]:
# dropping rows where any column has no data
df.dropna(how='any', inplace=True)

# dropping rows where any of the features is 'Unknown'
indexNames = df[(df['WEATHER'] == 'Unknown') | (df['ROADCOND'] == 'Unknown') | (df['LIGHTCOND'] == 'Unknown')].index
df.drop(indexNames , inplace=True)

In [7]:
df_rows()

Current number of rows:  170510
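
The same two filters can also be written as boolean masks, which avoids the in-place drops. The following is only a minimal sketch on a small made-up frame: the `toy` data and the `clean` helper are illustrative names that are not part of the notebook, and the sketch checks only the three feature columns rather than every column.

```python
import pandas as pd

FEATURES = ['WEATHER', 'ROADCOND', 'LIGHTCOND']

def clean(frame, features=FEATURES):
    """Return the rows whose features are neither missing nor 'Unknown'."""
    missing = frame[features].isna().any(axis=1)          # NaN / None in any feature
    unknown = frame[features].eq('Unknown').any(axis=1)   # literal 'Unknown' in any feature
    return frame[~(missing | unknown)]

# toy data, made up for illustration only
toy = pd.DataFrame({
    'SEVERITYCODE': [1, 2, 1, 2],
    'WEATHER': ['Clear', 'Unknown', 'Raining', None],
    'ROADCOND': ['Dry', 'Wet', 'Wet', 'Dry'],
    'LIGHTCOND': ['Daylight', 'Daylight', 'Unknown', 'Daylight'],
})

toy_clean = clean(toy)
print(len(toy_clean))  # only the first row survives both filters
```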


Let's now take a look at the data types of our features.

In [8]:
# checking the type of each column
df.dtypes

SEVERITYCODE     int64
WEATHER         object
ROADCOND        object
LIGHTCOND       object
dtype: object

All three features are of type 'object', and we need them to be of a numerical type. We convert them using label encoding, which creates one new numerical column per feature, encoding the same categories as integers.

In [11]:
# label encoding for object to numeric conversion
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
objList = df.select_dtypes(include = "object").columns

# new columns will be created with the suffix '_LE' for label encoded data
for feat in objList:
    df[feat+'_LE'] = le.fit_transform(df[feat].astype(str))

df.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,WEATHER_LE,ROADCOND_LE,LIGHTCOND_LE
0,2,Overcast,Wet,Daylight,4,7,5
1,1,Raining,Wet,Dark - Street Lights On,6,7,2
2,1,Overcast,Dry,Daylight,4,0,5
3,1,Clear,Dry,Daylight,1,0,5
4,2,Raining,Wet,Daylight,6,7,5
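
To see which integer stands for which category, the fitted encoder can be inspected. The sketch below uses a small made-up `weather` series (not the Seattle data, so the codes differ from the table above) and assumes one encoder is kept per column; `LabelEncoder` stores the sorted categories in `classes_`, and the position of each category in that array is its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy column, made up for illustration only
weather = pd.Series(['Overcast', 'Raining', 'Clear', 'Raining'])

encoder = LabelEncoder()
codes = encoder.fit_transform(weather)

# classes_ is sorted, so enumerate() recovers the label -> code mapping
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The codes are arbitrary labels that follow alphabetical order, so they do not express any ranking between categories.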


The data seems consistent now. To make sure it is balanced, we will check how each class of the target variable is represented.

In [12]:
# defining a function to check the balance
def balance(df):
    len_sev_1 = len(df[df['SEVERITYCODE'] == 1])
    len_sev_2 = len(df[df['SEVERITYCODE'] == 2])
    len_tot = len(df)

    print('SEVERITYCODE = 1', len_sev_1, '(', round(len_sev_1/len_tot*100, 1), '%)')
    print('SEVERITYCODE = 2', len_sev_2, '(', round(len_sev_2/len_tot*100, 1), '%)')
    
balance(df)

SEVERITYCODE = 1 114659 ( 67.2 %)
SEVERITYCODE = 2 55851 ( 32.8 %)


We notice a strong imbalance in the target variable: only about one third of the rows represent the higher-severity accidents. We want around 50% representation for each class. This can be achieved by dropping rows from the majority class until both classes are the same size.

In [18]:
# finding the difference in the number of rows for each class
N = len(df[df['SEVERITYCODE'] == 1]) - len(df[df['SEVERITYCODE'] == 2])

# creating a DF with all the rows where SEVERITYCODE is 1, then dropping its top N rows
# (.copy() avoids a SettingWithCopyWarning on the in-place drop)
df1 = df[df['SEVERITYCODE'] == 1].copy()
df1.drop(df1.head(N).index, inplace=True)

df_final = pd.concat([df1, df[df['SEVERITYCODE'] == 2]])

In [19]:
balance(df_final)

SEVERITYCODE = 1 55851 ( 50.0 %)
SEVERITYCODE = 2 55851 ( 50.0 %)
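
The cell above removes the first N rows of the majority class. A common variation is to pick the rows to keep at random, which avoids any bias from the order of the file. The sketch below is an assumption-laden alternative, not the method used in this notebook: the `undersample` helper, the `seed` parameter and the `toy` frame are all made up for illustration.

```python
import pandas as pd

def undersample(frame, target='SEVERITYCODE', seed=0):
    """Randomly downsample every class to the size of the smallest one."""
    n_min = frame[target].value_counts().min()
    parts = [group.sample(n=n_min, random_state=seed)
             for _, group in frame.groupby(target)]
    # shuffle the result so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed)

# toy data, made up for illustration only: 6 rows of class 1, 3 rows of class 2
toy = pd.DataFrame({
    'SEVERITYCODE': [1] * 6 + [2] * 3,
    'WEATHER_LE': list(range(9)),
})

balanced = undersample(toy)
print(balanced['SEVERITYCODE'].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs.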


We now have the final dataframe, with the data ready to be used for our analysis.

Wish me luck! :)