# Capstone Project - Car accident severity (Week 1)

### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [1. Introduction: Business Problem](#introduction)
* [2. Data](#data)
* [References](#notes)

## 1. Introduction: Business Problem <a name="introduction"></a>

According to the WHO ([1](#notes)), even though vehicles have become much safer in recent decades, around 1.35 million people still die every year in road traffic crashes, and between 20 and 50 million more suffer non-fatal injuries, many of which result in a disability.

From an economic perspective at the national level, road traffic crashes also cost most countries around 3% of their gross domestic product ([2](#notes)).

Therefore, many parts of society (such as governments, decision-makers, car-makers, drivers and insurance companies) have a strong interest in reversing this trend.

One way to reduce the number of incidents would be to warn drivers about the likelihood of getting into a car accident, and how severe it might be, given the weather and road conditions. Drivers could then drive more carefully or even stay home.

Transforming this issue into a machine learning problem, we could use a dataset provided by a city and its police department (in our case Seattle City and the SPD - Seattle Police Department) to predict the probability and severity of an accident based on various factors, such as the conditions of weather and the road.

## 2. Data <a name="data"></a>

### 2.1 Data understanding

As mentioned above, this project uses the data provided by the SPD (Seattle Police Department) and recorded by Traffic Records. The dataset, called Data-Collisions.csv, includes all types of collisions involving cars, bikes, pedestrians and others (around 200,000 records) from 2004 to present.

In this phase, after loading the dataset, we look for the attributes most relevant to the severity of an incident out of the 38 available. Logically, the relevant attributes are those related to the objective of our project, i.e. the condition of the road, the driver and the weather.

Lastly, we search for trends, patterns and correlations.

In [2]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline

In [3]:
#wget -O /Users/carlopeano/Desktop/projects/Coursera_Capstone/Data_Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv 

In [4]:
df = pd.read_csv('/Users/carlopeano/Desktop/projects/Coursera_Capstone/Data_Collisions.csv')
df.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [5]:
df.shape

(194673, 38)

Thanks to the description of the attributes that is available together with the dataset ([3](#notes)), we have been able to define what, in our opinion, are the best independent variables and their dependent variable.

The scope of the project is to predict the likelihood and severity of an accident. We therefore use SEVERITYCODE (i.e. the severity of the accident) as the dependent variable. SEVERITYCODE is a categorical variable whose code corresponds to the severity of the collision: 2 (injury) and 1 (property damage).

Out of the 37 remaining attributes in the Seattle accident dataset, we start by considering 8 of them as independent variables because of their logical connection to the objective of our study.

| VARIABLE | DESCRIPTION |
| :-: | :-: |
| PERSONCOUNT | The total number of people involved in the collision |
| VEHCOUNT | The number of vehicles involved in the collision. This is entered by the state |
| JUNCTIONTYPE | Category of junction at which the collision took place |
| WEATHER | A description of the weather conditions during the time of the collision |
| ROADCOND | The condition of the road during the collision |
| LIGHTCOND | The light conditions during the collision |
| SPEEDING | Whether or not speeding was a factor in the collision |
| LOCATION | Description of the general location of the collision |

JUNCTIONTYPE, WEATHER, ROADCOND, and LIGHTCOND are the main attributes, since they are directly connected to our objective: predicting an accident based on the weather and road conditions.

PERSONCOUNT and VEHCOUNT tell us how big the accident is. Note that an accident can involve many cars and people and still have a low severity.

SPEEDING has a direct impact on the probability of a collision and is the only attribute that is actually a choice of the driver.

Lastly, LOCATION is not suitable for a general model (so we do not use it in the general analysis), but it is useful if we want to show the different levels of danger across the districts of Seattle.
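One way to look for the link between a condition attribute and severity is a row-normalised contingency table. The following is a minimal sketch on made-up toy data (the rows below are not taken from Data-Collisions.csv, and the name `severity_by_road` is only illustrative); it shows, for each road condition, the share of collisions in each severity class.

```python
import pandas as pd

# Toy data, invented for illustration only (not from Data-Collisions.csv).
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 2, 1, 2, 2, 1, 1],
    "ROADCOND":     ["Dry", "Dry", "Wet", "Dry", "Wet", "Dry", "Wet", "Dry"],
})

# Row-normalised contingency table: for each road condition,
# the share of collisions that fall in each severity class.
severity_by_road = pd.crosstab(
    toy["ROADCOND"], toy["SEVERITYCODE"], normalize="index"
)
print(severity_by_road)
```

The same call could be applied to WEATHER, LIGHTCOND or JUNCTIONTYPE on the real data frame to compare how the severity mix changes between categories.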

### 2.2 Data Preparation

At this point, we drop the columns that are not of interest, deal with the missing data (find them and decide which method to apply), check the data format, balance the labeled data, transform the categorical variables into binary variables where needed, clean the dataset and normalise the data, so that it is ready for the next phase, i.e. modeling.

### 2.2.1 Unnecessary Columns

We then drop the unnecessary columns.

In [6]:
df.drop(['OBJECTID','INCKEY','COLDETKEY','REPORTNO','STATUS','SEVERITYCODE.1','SEVERITYDESC','COLLISIONTYPE','PEDCOUNT','PEDCYLCOUNT','ADDRTYPE', 'INTKEY','EXCEPTRSNCODE','EXCEPTRSNDESC','INCDATE','INCDTTM','SDOT_COLCODE','SDOT_COLDESC','INATTENTIONIND','UNDERINFL','PEDROWNOTGRNT','SDOTCOLNUM','ST_COLCODE','ST_COLDESC','SEGLANEKEY','CROSSWALKKEY','HITPARKEDCAR'], axis=1, inplace=True)
df.head(2)

Unnamed: 0,SEVERITYCODE,X,Y,LOCATION,PERSONCOUNT,VEHCOUNT,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,-122.323148,47.70314,5TH AVE NE AND NE 103RD ST,2,2,At Intersection (intersection related),Overcast,Wet,Daylight,
1,1,-122.347294,47.647172,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,2,2,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,


In [7]:
df.shape

(194673, 11)

### 2.2.2 Evaluating for Missing Data

In [8]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,SEVERITYCODE,X,Y,LOCATION,PERSONCOUNT,VEHCOUNT,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,False,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,True


"True" stands for missing value, while "False" stands for not missing value.

In [9]:
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

X
False    189339
True       5334
Name: X, dtype: int64

Y
False    189339
True       5334
Name: Y, dtype: int64

LOCATION
False    191996
True       2677
Name: LOCATION, dtype: int64

PERSONCOUNT
False    194673
Name: PERSONCOUNT, dtype: int64

VEHCOUNT
False    194673
Name: VEHCOUNT, dtype: int64

JUNCTIONTYPE
False    188344
True       6329
Name: JUNCTIONTYPE, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64

SPEEDING
True     185340
False      9333
Name: SPEEDING, dtype: int64
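The per-column counts above can also be obtained in a single table. This is a minimal sketch on a small made-up frame (the values and the names `missing_counts`, `missing_share` and `summary` are assumptions for illustration), using `isnull().sum()` for the counts and `isnull().mean()` for the share of missing rows.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "WEATHER":  ["Clear", np.nan, "Raining", "Clear"],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
    "VEHCOUNT": [2, 2, 1, 3],
})

# Number of missing values per column, and the same as a fraction of rows.
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Seeing the share next to the count makes it easier to spot columns, such as SPEEDING here, where almost every row is empty and the emptiness itself probably carries meaning.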



Based on the summary above, each column has 194,673 rows. Among the attributes of interest, six contain missing values, while PERSONCOUNT and VEHCOUNT are complete (X and Y, which we do not use as features, are each missing 5,334 values):
1. "SPEEDING" has 185340 missing values
2. "JUNCTIONTYPE" has 6329 missing values
3. "WEATHER" has 5081 missing values
4. "ROADCOND" has 5012 missing values
5. "LIGHTCOND" has 5170 missing values
6. "PERSONCOUNT" has 0 missing values
7. "VEHCOUNT" has 0 missing values
8. "LOCATION" has 2677 missing values

Furthermore:
1. "JUNCTIONTYPE" has 9 "Unknown"
2. "WEATHER" has 832 "Other" and 15091 "Unknown"
3. "ROADCOND" has 132 "Other" and 11012 "Unknown"
4. "LIGHTCOND" has 235 "Other" and 13473 "Unknown"
5. "PERSONCOUNT" has 5544 incidents involving 0 people
6. "VEHCOUNT" has 5085 accidents involving 0 vehicles
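Placeholder labels such as "Other" and "Unknown" are not detected by `isnull()`, so they have to be counted separately. The following is a minimal sketch on made-up toy data (the values and the names `PLACEHOLDERS` and `placeholder_counts` are assumptions for illustration) of how such counts could be produced per column.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "WEATHER":  ["Clear", "Unknown", "Other", "Raining", "Unknown"],
    "ROADCOND": ["Dry", "Unknown", "Wet", "Wet", "Other"],
})

# Labels that are present in the data but carry no real information.
PLACEHOLDERS = ["Other", "Unknown"]

# One column per placeholder label, one row per attribute.
placeholder_counts = pd.DataFrame({
    label: (toy == label).sum() for label in PLACEHOLDERS
})
print(placeholder_counts)
```

Zero counts in numeric columns can be obtained the same way, for example with `(df["PERSONCOUNT"] == 0).sum()`.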


### 2.2.3 Dealing with missing data

Considering the attributes that could be interesting for us, we take the following actions:

#### "SPEEDING"

* Action:
    * Replace the missing data with "N" 
    * Replace "N" and "Y" with 0 and 1
* Reason: 
    * SPD recorded this field only when speeding was a factor in the accident ("Y") and otherwise left it empty (SPD never wrote "N"). Missing values can therefore be read as incidents where speeding was not a factor, so we fill them with "N".

In [10]:
# Missing values mean "speeding was not a factor": fill them with "N", then encode N/Y as 0/1
df['SPEEDING'] = df['SPEEDING'].fillna('N').map({'N': 0, 'Y': 1})
df['SPEEDING'].value_counts()

0    185340
1      9333
Name: SPEEDING, dtype: int64

#### "LOCATION"
* Action:
   * Drop the rows where LOCATION is missing
* Reason: 
   * Rows with missing values are removed rather than imputed, so as to avoid biasing the dataset

In [11]:
#Drop the missing data in LOCATION

df.dropna(subset=['LOCATION'], axis=0, inplace=True)
df[['LOCATION']].isnull().sum(axis = 0)
df.shape

(191996, 11)

#### "JUNCTIONTYPE"
* Action:
    * Drop the rows with missing values
    * Replace "Unknown" with the mode
* Reason:
    * Rows with missing values are removed rather than imputed, so as to avoid biasing the dataset
    * The mode already accounts for a high percentage of the data, so we consider replacing "Unknown" with the mode the best option

#### "WEATHER"
* Action:
    * Drop the rows with "Other" or missing values
    * Replace "Unknown" with the mode
* Reason:
    * Rows with "Other" or missing values are removed rather than imputed, so as to avoid biasing the dataset
    * The mode already accounts for a high percentage of the data, so we consider replacing "Unknown" with the mode the best option

#### "ROADCOND"
* Action:
    * Drop the rows with "Other" or missing values
    * Replace "Unknown" with the mode
* Reason:
    * Rows with "Other" or missing values are removed rather than imputed, so as to avoid biasing the dataset
    * The mode already accounts for a high percentage of the data, so we consider replacing "Unknown" with the mode the best option

#### "LIGHTCOND"
* Action:
    * Drop the rows with "Other" or missing values
    * Replace "Unknown" with the mode
* Reason:
    * Rows with "Other" or missing values are removed rather than imputed, so as to avoid biasing the dataset
    * The mode already accounts for a high percentage of the data, so we consider replacing "Unknown" with the mode the best option

In [12]:
# Replace "Other" with NaN so that it is dropped together with the missing values
df.replace("Other", np.nan, inplace = True)

In [13]:
# Eliminate nan in 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'
df.dropna(subset=['JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis=0, inplace=True)
df[['JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND']].isnull().sum(axis = 0)
df.shape

(181628, 11)

In [14]:
# Replace "Unknown" with the mode in 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND' 
# mode() returns a Series; take its first element to get a single replacement value
df['JUNCTIONTYPE'] = df['JUNCTIONTYPE'].replace('Unknown', df['JUNCTIONTYPE'].mode()[0])
df['WEATHER'] = df['WEATHER'].replace('Unknown', df['WEATHER'].mode()[0])
df['ROADCOND'] = df['ROADCOND'].replace('Unknown', df['ROADCOND'].mode()[0])
df['LIGHTCOND'] = df['LIGHTCOND'].replace('Unknown', df['LIGHTCOND'].mode()[0])
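To see how much mode imputation changes a column, one can compare the share of the mode before and after the replacement. The following is a minimal sketch on a made-up series; `replace_unknown_with_mode` is a hypothetical helper, not part of the notebook, and unlike the cell above it computes the mode only over the real values, which is an assumption made here so that "Unknown" can never be chosen as the replacement.

```python
import pandas as pd

def replace_unknown_with_mode(series, placeholder="Unknown"):
    """Hypothetical helper: replace `placeholder` with the most frequent real value."""
    real_values = series[series != placeholder]
    mode_value = real_values.mode()[0]  # first mode if several values tie
    return series.replace(placeholder, mode_value), mode_value

# Toy series, invented for illustration only.
toy = pd.Series(["Clear", "Clear", "Raining", "Unknown", "Clear", "Unknown"])
filled, mode_value = replace_unknown_with_mode(toy)

# Share of rows equal to the mode, before and after imputation.
share_before = (toy == mode_value).mean()
share_after = (filled == mode_value).mean()
print(mode_value, share_before, share_after)
```

A large jump between the two shares would suggest that the imputation is reshaping the column noticeably, in which case dropping the rows or keeping "Unknown" as its own category may be worth considering.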

### 2.2.4 Dealing with Data and Project's Objective

As the project's objective is to predict the probability and severity of an accident based on various factors, such as the weather and road conditions, we are interested in incidents involving at least one person and one vehicle.

Therefore, we need to drop the rows where the number of people or the number of vehicles involved is zero.

In [15]:
# Drop the accidents with zero people involved

# Get the index labels of the rows where PERSONCOUNT is 0
indexNames = df[df['PERSONCOUNT'] == 0].index

# Delete these rows from the DataFrame
df.drop(indexNames, inplace=True)

In [16]:
# Drop the accidents with zero vehicles involved

# Get the index labels of the rows where VEHCOUNT is 0
indexNames = df[df['VEHCOUNT'] == 0].index

# Delete these rows from the DataFrame
df.drop(indexNames, inplace=True)

In [17]:
df.shape

(176098, 11)
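The two drop steps can also be written as a single boolean filter. This is a minimal sketch on made-up toy data (the values and the names `involves_person_and_vehicle` and `filtered` are assumptions for illustration); it keeps only the rows with at least one person and at least one vehicle.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "PERSONCOUNT": [2, 0, 3, 1, 0],
    "VEHCOUNT":    [2, 1, 0, 1, 0],
})

# True only for rows with at least one person AND at least one vehicle.
involves_person_and_vehicle = (toy["PERSONCOUNT"] > 0) & (toy["VEHCOUNT"] > 0)

# .copy() makes the result an independent frame rather than a view of `toy`.
filtered = toy[involves_person_and_vehicle].copy()
print(filtered.shape)
```

Filtering with a mask states the condition that rows must satisfy to be kept, which can be easier to read than collecting index labels to delete.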

### 2.2.5 Correct data format

In [18]:
df.dtypes

SEVERITYCODE      int64
X               float64
Y               float64
LOCATION         object
PERSONCOUNT       int64
VEHCOUNT          int64
JUNCTIONTYPE     object
WEATHER          object
ROADCOND         object
LIGHTCOND        object
SPEEDING          int64
dtype: object

Considering the data format of each attribute that we are going to use, we do not need to change any of them.

### 2.2.6 One hot encoding technique

We use the one-hot encoding technique to convert categorical variables to binary variables and append them to the feature DataFrame.

In [19]:
Feature = df[['SPEEDING','JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND', 'PERSONCOUNT', 'VEHCOUNT']]
Feature = pd.concat([Feature,pd.get_dummies(df['JUNCTIONTYPE'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['WEATHER'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['ROADCOND'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['LIGHTCOND'])], axis=1)
Feature.drop(['JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis = 1,inplace=True)
Feature.head()

Unnamed: 0,SPEEDING,PERSONCOUNT,VEHCOUNT,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Blowing Sand/Dirt,...,Snow/Slush,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk
0,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
1,0,2,2,0,0,0,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0
2,0,4,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,3,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
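As an alternative to concatenating one set of dummies per attribute, `pd.get_dummies` can encode several columns in one call through its `columns` parameter. The following is a minimal sketch on made-up toy data (the values and the name `encoded` are assumptions for illustration); `dtype=int` is passed so that the indicators are 0/1 integers.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "SPEEDING": [0, 1, 0],
    "WEATHER":  ["Clear", "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Wet"],
})

# Encode the listed categorical columns; numeric columns pass through unchanged.
# Each new column is prefixed with the name of the attribute it came from.
encoded = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND"], dtype=int)
print(encoded.columns.tolist())
```

The column prefix is the main difference from the cell above: it keeps the indicator names unique when two attributes share a category label.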


### 2.2.7 Feature selection

We define the feature set X and the labels y.

In [20]:
X = Feature
X[0:5]

Unnamed: 0,SPEEDING,PERSONCOUNT,VEHCOUNT,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Blowing Sand/Dirt,...,Snow/Slush,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk
0,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
1,0,2,2,0,0,0,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0
2,0,4,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,3,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0


In [21]:
y = df['SEVERITYCODE'].values
y[0:5]

array([2, 1, 1, 1, 2])
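Before modeling, it can help to check how the two SEVERITYCODE classes are distributed and, if needed, to balance them. The following is a minimal sketch on made-up toy data, assuming that random downsampling of the majority class is acceptable; this technique is an assumption chosen for illustration and is not a step taken from the cells above.

```python
import pandas as pd

# Toy frame, invented for illustration only: six class-1 rows, three class-2 rows.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "VEHCOUNT":     [2, 2, 1, 3, 2, 2, 2, 1, 2],
})

# Size of each class, and the size of the smallest one.
class_counts = toy["SEVERITYCODE"].value_counts()
minority_size = int(class_counts.min())

# Draw the same number of rows from every class (random downsampling).
balanced = toy.groupby("SEVERITYCODE").sample(n=minority_size, random_state=0)
print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards data from the larger class; oversampling the smaller class or using class weights in the model are other options with different trade-offs.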

Now that we have a cleaned dataset with a fully numeric feature set, we can finally normalise the data.

### 2.2.8 Normalize Data

In [22]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-0.2332857 , -0.4196598 ,  0.0458818 , -0.10557043,  1.4072195 ,
        -0.24737767, -0.37664609, -0.94293104, -0.02968109, -0.01651211,
        -1.39487249, -0.05494336,  2.37791899, -0.00532861, -0.46890302,
        -0.01191581, -0.0245418 , -0.0702587 , -1.63878024, -0.08075644,
        -0.01815132, -0.01876701, -0.07380343, -0.02407402,  1.6951211 ,
        -0.08916481, -0.08018456, -0.5964367 , -0.00714916, -0.1164739 ,
         0.68888404, -0.1810144 ],
       [-0.2332857 , -0.4196598 ,  0.0458818 , -0.10557043, -0.71062119,
        -0.24737767, -0.37664609,  1.06052295, -0.02968109, -0.01651211,
        -1.39487249, -0.05494336, -0.42053577, -0.00532861,  2.13263713,
        -0.01191581, -0.0245418 , -0.0702587 , -1.63878024, -0.08075644,
        -0.01815132, -0.01876701, -0.07380343, -0.02407402,  1.6951211 ,
        -0.08916481, -0.08018456,  1.67662387, -0.00714916, -0.1164739 ,
        -1.45162311, -0.1810144 ],
       [-0.2332857 ,  1.09189144,  1.83382005, -0.1055
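With its default settings, `StandardScaler` rescales each column to zero mean and unit variance by computing (x - mean) / std, where std is the population standard deviation. The following is a minimal NumPy sketch of that formula on a made-up matrix (the values and the names `features` and `standardized` are assumptions for illustration).

```python
import numpy as np

# Toy feature matrix, invented for illustration only:
# first column is a count, second column is a 0/1 indicator.
features = np.array([[2.0, 0.0],
                     [4.0, 1.0],
                     [3.0, 0.0],
                     [3.0, 1.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = features.mean(axis=0)
std = features.std(axis=0)

# Standardise every column: subtract its mean, divide by its std.
standardized = (features - mean) / std
print(standardized)
```

This also explains why a one-hot column, which only contains 0 and 1, shows just two distinct values after scaling: a small negative number for 0 and a positive number for 1, with magnitudes that depend on how rare the category is.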

We are now ready for the next phase, which is modeling.

## References <a name="notes"></a>

1. "Road traffic injuries", World Health Organisation (WHO), 07/02/2020, https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
2. Ibid.
3. "ArcGIS Metadata Form", https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf