# Reduce Accidents and their severity

## 1. Introduction
### 1.1 Background
Seattle is a seaport city on the West Coast of the United States.It's metropolitan area's population stands at 3.98 million, making it the 15th-largest in the United States. With that much population road accidents also occur. With the help of machine learning, I tried to find out the severity of an accident and what factors make an accident more severe and what are the factors by which accidents are more likely to happen.

### 1.2 Business Problem
The machine learning techniques which we applied, might contribute to reduce the number of accidents and its severity. I used the data which is provided by Traffic Records Group Seattle. SDOT Traffic Management Division would be interested in this report as it will help to take necessary measures to prevent or minimize the impact of road accidents.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df_col = pd.read_csv('data\Data-Collisions.csv')


## 2. Data acquisition and cleaning
### 2.1 Data sources
I used the same data which is povided by coursera capstone project course. It is a single dataset in csv format.
### 2.2 Data cleaning
Removing columns which are not described in the official documentation and their use are also not understandable and also contains a lot of null values or are just serial numbers.

In [4]:
df_col.drop(['EXCEPTRSNCODE','EXCEPTRSNDESC','SDOTCOLNUM','OBJECTID','INCKEY','COLDETKEY','REPORTNO','STATUS'], axis = 1,inplace=True)

Removing ST_COLDESC column and loading it into another table through webscrapping.

In [5]:
df_col.drop(['ST_COLDESC'], axis = 1,inplace=True)
df_col_desc = pd.read_csv('data\CollisionDesc.csv')

Dropping columns which are either duplicate or provide reduntant information

In [6]:
df_col.drop(['SEVERITYCODE.1','SEVERITYDESC','INCDATE','X','Y'], axis = 1,inplace=True)

There are three columns where data with 'Y' only is present. So Assuming nan values are 'N'

In [7]:
df_col= df_col.replace({'INATTENTIONIND' : { 'Y' : 1, np.nan : 0},'SPEEDING' : { 'Y' : 1, np.nan : 0},'PEDROWNOTGRNT' : { 'Y' : 1, np.nan : 0}})

Replacing null values to 'Unknown' in string columns and -1 to integer or float type columns

In [8]:
df_col['WEATHER']= df_col['WEATHER'].replace(np.nan,'Unknown')
df_col['ROADCOND']= df_col['ROADCOND'].replace(np.nan,'Unknown')
df_col['LIGHTCOND']= df_col['LIGHTCOND'].replace(np.nan,'Unknown')
df_col['JUNCTIONTYPE']= df_col['JUNCTIONTYPE'].replace(np.nan,'Unknown')
df_col['LOCATION']= df_col['LOCATION'].replace(np.nan,'Unknown')
df_col['COLLISIONTYPE']= df_col['COLLISIONTYPE'].replace(np.nan,'Unknown')
df_col['ADDRTYPE']= df_col['ADDRTYPE'].replace(np.nan,'Unknown')

df_col['INTKEY']= df_col['INTKEY'].replace(np.nan,-1)
df_col['ST_COLCODE']= df_col['ST_COLCODE'].replace(' ',-1)
df_col['ST_COLCODE']= df_col['ST_COLCODE'].replace(np.nan,-1)
df_col['UNDERINFL']= df_col['UNDERINFL'].replace(np.nan,-1)


Now, Correcting data types

In [9]:
df_col['UNDERINFL']= df_col['UNDERINFL'].replace('N',0)
df_col['UNDERINFL']= df_col['UNDERINFL'].replace('Y',1)
df_col['UNDERINFL']= df_col['UNDERINFL'].replace('1',1)
df_col['UNDERINFL']= df_col['UNDERINFL'].replace('0',0)
df_col['HITPARKEDCAR']= df_col['UNDERINFL'].replace('N',0)
df_col['HITPARKEDCAR']= df_col['UNDERINFL'].replace('Y',1)
df_col['ST_COLCODE'] = df_col['ST_COLCODE'].astype(str)
df_col['INCDTTM'] = pd.to_datetime(df_col['INCDTTM'])

In [12]:
df_col['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
Unknown                                               6338
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Name: JUNCTIONTYPE, dtype: int64

In [69]:
df_col['WEATHER'].unique()
df_col['ROADCOND'].unique()
df_col['LIGHTCOND'].unique()

array(['Daylight', 'Dark - Street Lights On', 'Dark - No Street Lights',
       'Unknown', 'Dusk', 'Dawn', 'Dark - Street Lights Off', 'Other',
       'Dark - Unknown Lighting'], dtype=object)

In [70]:
df_col['PEDROWNOTGRNT'].unique()

array([0, 1], dtype=int64)

In [71]:
df_col['SPEEDING'].unique()

array([0, 1], dtype=int64)

In [72]:
df_col['CROSSWALKKEY'].unique()

array([     0, 520838, 521466, ..., 523792, 650595, 523322], dtype=int64)

In [73]:
df_col['HITPARKEDCAR'].unique()

array([ 0, -1,  1], dtype=int64)

In [74]:
df_col['SEGLANEKEY'].unique()

array([    0,  6855, 25242, ..., 11583, 10319, 45880], dtype=int64)

In [75]:
df_col['INTKEY'].value_counts()

-1.0        129603
 29973.0       252
 29933.0       160
 29913.0       138
 29549.0       136
             ...  
 35869.0         1
 26126.0         1
 28236.0         1
 35863.0         1
 29460.0         1
Name: INTKEY, Length: 7615, dtype: int64

In [76]:
df_col.corr(method='pearson', min_periods=1)

Unnamed: 0,SEVERITYCODE,INTKEY,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,INATTENTIONIND,UNDERINFL,PEDROWNOTGRNT,SPEEDING,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
SEVERITYCODE,1.0,0.104975,0.130949,0.246338,0.214218,-0.054686,0.188905,0.046378,0.053163,0.206283,0.038938,0.104276,0.175093,0.053163
INTKEY,0.104975,1.0,0.03522,0.070484,0.043848,-0.043266,-0.006596,-0.036928,-0.017212,0.069409,-0.028897,0.012198,0.1061,-0.017212
PERSONCOUNT,0.130949,0.03522,1.0,-0.023464,-0.038809,0.380523,-0.12896,0.077435,0.057148,-0.027449,-0.002963,-0.021383,-0.032258,0.057148
PEDCOUNT,0.246338,0.070484,-0.023464,1.0,-0.01692,-0.261285,0.260393,-0.004315,0.030705,0.494641,-0.032838,0.00181,0.565326,0.030705
PEDCYLCOUNT,0.214218,0.043848,-0.038809,-0.01692,1.0,-0.253773,0.382521,0.004073,0.00295,0.323652,-0.020391,0.453657,0.10982,0.00295
VEHCOUNT,-0.054686,-0.043266,0.380523,-0.261285,-0.253773,1.0,-0.365814,0.076277,0.290873,-0.22799,-0.025743,-0.122941,-0.200526,0.290873
SDOT_COLCODE,0.188905,-0.006596,-0.12896,0.260393,0.382521,-0.365814,1.0,0.029484,0.115771,0.238643,0.144714,0.206835,0.189518,0.115771
INATTENTIONIND,0.046378,-0.036928,0.077435,-0.004315,0.004073,0.076277,0.029484,1.0,0.019387,-0.026442,-0.048805,-0.000513,-0.002053,0.019387
UNDERINFL,0.053163,-0.017212,0.057148,0.030705,0.00295,0.290873,0.115771,0.019387,1.0,0.000807,0.094065,-0.003305,-0.001723,1.0
PEDROWNOTGRNT,0.206283,0.069409,-0.027449,0.494641,0.323652,-0.22799,0.238643,-0.026442,0.000807,1.0,-0.02841,0.152103,0.448176,0.000807


2.3 Feature Selection
After data cleaning, there were 194673 samples and 25 features in the data.

In [77]:
df_col.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,INTKEY,LOCATION,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDTTM,...,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,ST_COLCODE,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,Intersection,37475.0,5TH AVE NE AND NE 103RD ST,Angles,2,0,0,2,2013-03-27 14:54:00,...,0,Overcast,Wet,Daylight,0,0,10,0,0,0
1,1,Block,-1.0,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Sideswipe,2,0,0,2,2006-12-20 18:55:00,...,0,Raining,Wet,Dark - Street Lights On,0,0,11,0,0,0
2,1,Block,-1.0,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Parked Car,4,0,0,3,2004-11-18 10:20:00,...,0,Overcast,Dry,Daylight,0,0,32,0,0,0
3,1,Block,-1.0,2ND AVE BETWEEN MARION ST AND MADISON ST,Other,3,0,0,3,2013-03-29 09:26:00,...,0,Clear,Dry,Daylight,0,0,23,0,0,0
4,2,Intersection,34387.0,SWIFT AVE S AND SWIFT AV OFF RP,Angles,2,0,0,2,2004-01-28 08:04:00,...,0,Raining,Wet,Daylight,0,0,10,0,0,0
