# Capstone Project - Car accident severity (Week 1)

### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
* [References](#notes)

## Introduction: Business Problem <a name="introduction"></a>

According to the WHO ([1](#notes)), around 1.35 million people die every year in road traffic crashes, and between 20 and 50 million more suffer non-fatal injuries, many of which result in a disability.

At the national level, road traffic crashes also cost most countries around 3% of their gross domestic product ([2](#notes)).

It would therefore be valuable to warn drivers, given the weather and road conditions, about the possibility of getting into a car accident and how severe that accident could be. People could then drive more carefully or, where possible, change their travel plans.

Most of all, this could actually save lives.

This is exactly what we will try to do in this project. Using the data provided by the city of Seattle, we will take into consideration the specific road and weather conditions and try to predict the likelihood and severity of an accident in Seattle, so that a person can better decide what to do.

## Data <a name="data"></a>

### The data

The dataset from the city of Seattle contains 194,673 collisions provided by the SPD (Seattle Police Department) and recorded by Traffic Records. It includes all types of collisions from 2004 to present involving cars, bikes and pedestrians.

Thanks to the description of the data provided by the city of Seattle, we were able to choose what are, in our opinion, the most suitable independent variables and the dependent variable.

The aim of the project is to predict the likelihood and severity of an accident. We therefore use SEVERITYCODE (i.e. the severity of the accident) as the dependent variable. SEVERITYCODE is a categorical variable whose code corresponds to the severity of the collision: 2 (injury) and 1 (property damage).

Out of the 37 attributes available in the Seattle accident dataset, we start by considering 7 of them as independent variables because of their logical connection to the objective of our study.

| VARIABLE | DESCRIPTION |
| :-: | :-: |
| PERSONCOUNT | The total number of people involved in the collision |
| VEHCOUNT | The number of vehicles involved in the collision. This is entered by the state |
| JUNCTIONTYPE | Category of junction at which the collision took place |
| WEATHER | A description of the weather conditions during the time of the collision |
| ROADCOND | The condition of the road during the collision |
| LIGHTCOND | The light conditions during the collision |
| SPEEDING | Whether or not speeding was a factor in the collision |

Each of these variables is general enough to be meaningful in a generic situation and describes conditions that are related to a possible accident and its severity. They are therefore relevant to the objective of our study, i.e. predicting the likelihood and severity of an accident from the road and weather conditions. We will clean them of missing data, convert them to binary variables where needed, normalise them and use them in our model.

Even though LOCATION (i.e. the description of the general location of the collision) was considered as a possible variable at the beginning, we decided not to use it, since it would have referred only to Seattle and not to a more general environment.

### How the data will be used to solve the problem

Once we have decided which variables to use, we inspect the dataset to identify which attributes have missing data or imprecise values, such as "Unknown" or "Other", that cannot be used as they are.

We consider the available options for managing the missing data and apply the most suitable method to each attribute, so as to eliminate the missing data without biasing the dataset itself: in this case we delete the rows with "Other" or missing values and replace "Unknown" with the mode of the attribute (for SPEEDING, missing values are instead filled with "N"). Then, we justify our decision for each attribute.

In the next step, we check whether the data types in the dataset are properly formatted for our purpose. We apply the one hot encoding technique to convert categorical variables to binary variables and append them to the feature DataFrame. Lastly, we normalise the feature DataFrame so that the dataset is ready for building the model.

At this point we need to assess and apply the classification model (K-nearest neighbours, decision trees, logistic regression or support vector machine) that has the highest accuracy. Once we apply the best classification technique, we have the model ready to be used.
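The model selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the Seattle collisions), the hyperparameters are arbitrary, and names such as `candidates` and `accuracy` are hypothetical rather than taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned, encoded collision features (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# The four classifier families named in the text, with arbitrary settings
candidates = {
    "K-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(kernel="rbf"),
}

# Train each candidate and record its accuracy on the held-out split
accuracy = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

best_model = max(accuracy, key=accuracy.get)
print(accuracy)
print("Highest test accuracy:", best_model)
```

Comparing on a held-out split, rather than on the training data, keeps the accuracy figures from rewarding a model that merely memorises the training rows.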

In [43]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline

In [44]:
#wget -O /Users/carlopeano/Desktop/projects/Coursera_Capstone/Data_Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv 

In [45]:
df = pd.read_csv('/Users/carlopeano/Desktop/projects/Coursera_Capstone/Data_Collisions.csv')
df.head()


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [46]:
df.shape

(194673, 38)

### Evaluating for Missing Data
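
Before working on the full dataset, the idea can be shown on a small example. The sketch below uses a made-up `toy` frame (not the collision data) to show how `isnull()` marks missing cells and how summing the result gives a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the collision data (values are made up)
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", np.nan, np.nan],
    "VEHCOUNT": [2, 1, 2, 3],
})

# isnull() marks each missing cell as True; summing counts the True values per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Summing the boolean frame gives one number per column, which is easier to scan than a separate `value_counts()` printout for each column.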

In [47]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,False,False,False,False,False,False,False,False,False,False,...,False,False,True,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,True,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,True,...,False,False,True,False,True,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,...,False,False,True,True,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,True,False,False,False,False,False


"True" stands for missing value, while "False" stands for not missing value.

In [48]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

X
False    189339
True       5334
Name: X, dtype: int64

Y
False    189339
True       5334
Name: Y, dtype: int64

OBJECTID
False    194673
Name: OBJECTID, dtype: int64

INCKEY
False    194673
Name: INCKEY, dtype: int64

COLDETKEY
False    194673
Name: COLDETKEY, dtype: int64

REPORTNO
False    194673
Name: REPORTNO, dtype: int64

STATUS
False    194673
Name: STATUS, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

INTKEY
True     129603
False     65070
Name: INTKEY, dtype: int64

LOCATION
False    191996
True       2677
Name: LOCATION, dtype: int64

EXCEPTRSNCODE
True     109862
False     84811
Name: EXCEPTRSNCODE, dtype: int64

EXCEPTRSNDESC
True     189035
False      5638
Name: EXCEPTRSNDESC, dtype: int64

SEVERITYCODE.1
False    194673
Name: SEVERITYCODE.1, dtype: int64

SEVERITYDESC
False    194673
Name: SEVERITYDESC, dtype: int64

COLLISIONTYPE
False    189769
True       4904
Name: C

Based on the summary above, each column has 194,673 rows. The seven attributes of interest have the following numbers of missing values:
1. "SPEEDING" has 185340 missing data
2. "JUNCTIONTYPE" has 6329 missing data
3. "WEATHER" has 5081 missing data
4. "ROADCOND" has 5012 missing data
5. "LIGHTCOND" has 5170 missing data
6. "PERSONCOUNT" has 0 missing data
7. "VEHCOUNT" has 0 missing data

Furthermore:
1. "JUNCTIONTYPE" has 9 "Unknown"
2. "WEATHER" has 832 "Other" and 15091 "Unknown"
3. "ROADCOND" has 132 "Other" and 11012 "Unknown"
4. "LIGHTCOND" has 235 "Other" and 13473 "Unknown"
5. "PERSONCOUNT" has 5544 incidents involving 0 people
6. "VEHCOUNT" has 5085 accidents involving 0 vehicles
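
Counts of placeholder labels like the ones listed above can be obtained with `value_counts()`. The following is a minimal sketch on a made-up `toy` frame; the `placeholders` list and the `placeholder_counts` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the collision data (values are made up)
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Other", "Raining", "Unknown"],
    "LIGHTCOND": ["Daylight", "Daylight", "Unknown", "Other", "Dusk"],
})

placeholders = ["Other", "Unknown"]

# For each column, count how often each placeholder label occurs;
# reindex(..., fill_value=0) reports 0 for a label that never appears
placeholder_counts = pd.DataFrame({
    col: toy[col].value_counts().reindex(placeholders, fill_value=0)
    for col in toy.columns
})
print(placeholder_counts)
```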


### Dealing with missing data

Considering the attributes that could be interesting for us, we take the following actions:

#### "SPEEDING"

* Action:
    * Replace the missing data with "N".
* Reason:
    * The dataset records a value only when speeding was a factor in the collision (otherwise the attribute was left empty), so the missing data can be interpreted as incidents in which speeding was not a factor. All the missing data can therefore be filled with "N".
    
#### "JUNCTIONTYPE"
* Action:
    * Delete the rows with missing data
    * Replace "Unknown" with the mode
* Reason:
    * Deleting the rows with missing data avoids biasing the dataset
    * The mode already represents a high percentage of the data, so replacing "Unknown" with the mode is the best option
    
#### "WEATHER"
* Action:
    * Delete the rows with "Other" and with missing data
    * Replace "Unknown" with the mode
* Reason:
    * Deleting the rows with missing data avoids biasing the dataset
    * The mode already represents a high percentage of the data, so replacing "Unknown" with the mode is the best option

#### "ROADCOND"
* Action:
    * Delete the rows with "Other" and with missing data
    * Replace "Unknown" with the mode
* Reason:
    * Deleting the rows with missing data avoids biasing the dataset
    * The mode already represents a high percentage of the data, so replacing "Unknown" with the mode is the best option
    
#### "LIGHTCOND"
* Action:
    * Delete the rows with "Other" and with missing data
    * Replace "Unknown" with the mode
* Reason:
    * Deleting the rows with missing data avoids biasing the dataset
    * The mode already represents a high percentage of the data, so replacing "Unknown" with the mode is the best option

#### "PERSONCOUNT"
* Action:
    * Delete the rows with 0 people
* Reason:
    * Our research study considers accidents involving at least one person.
       
#### "VEHCOUNT"
* Action:
    * Delete the rows with 0 vehicles
* Reason:
    * Our research study considers accidents involving at least one vehicle.
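
The plan for the categorical attributes can be sketched on a small example. The helper `clean_categorical` and the `toy` frame below are hypothetical and only illustrate the three steps (turn "Other" into a missing value, drop rows with missing values, replace "Unknown" with the mode); they are not the notebook's own code.

```python
import numpy as np
import pandas as pd

def clean_categorical(frame, columns):
    """Drop rows with "Other" or missing values, then map "Unknown" to the mode."""
    out = frame.copy()
    # Treat "Other" as missing so that one dropna() call removes both cases
    out[columns] = out[columns].replace("Other", np.nan)
    out = out.dropna(subset=columns).copy()
    for col in columns:
        # mode() returns a Series (there can be ties); take its first entry
        out[col] = out[col].replace("Unknown", out[col].mode()[0])
    return out

# Toy frame standing in for the collision data (values are made up)
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Other", "Clear", np.nan, "Raining"],
    "ROADCOND": ["Dry", "Dry", "Wet", "Unknown", "Wet", "Dry"],
})

cleaned = clean_categorical(toy, ["WEATHER", "ROADCOND"])
print(cleaned)
```

Note that if "Unknown" were itself the most frequent label in a column, replacing it with the mode would change nothing, so this approach relies on an informative label being the mode.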

In [49]:
# Replace NaN in SPEEDING with "N", then encode "N" as 0 and "Y" as 1.
# Assigning the result back avoids chained in-place calls on a column,
# which newer pandas versions warn about.
df['SPEEDING'] = df['SPEEDING'].fillna('N').map({'N': 0, 'Y': 1})
df['SPEEDING'].value_counts()

0    185340
1      9333
Name: SPEEDING, dtype: int64

In [50]:
# replace "Other" to NaN
df.replace("Other", np.nan, inplace = True)

In [53]:
# Eliminate nan in 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'
df.dropna(subset=['JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis=0, inplace=True)
df[['JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND']].isnull().sum(axis = 0)
df.shape

(182137, 38)

In [54]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              86131
At Intersection (intersection related)               61060
Mid-Block (but intersection related)                 22267
Driveway Junction                                    10465
At Intersection (but not related to intersection)     2048
Ramp Junction                                          159
Unknown                                                  7
Name: JUNCTIONTYPE, dtype: int64

In [55]:
# Replace "Unknown" with the mode in 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'
# mode() returns a Series, so we take its first element as the replacement value
df['JUNCTIONTYPE'] = df['JUNCTIONTYPE'].replace('Unknown', df['JUNCTIONTYPE'].mode()[0])
df['WEATHER'] = df['WEATHER'].replace('Unknown', df['WEATHER'].mode()[0])
df['ROADCOND'] = df['ROADCOND'].replace('Unknown', df['ROADCOND'].mode()[0])
df['LIGHTCOND'] = df['LIGHTCOND'].replace('Unknown', df['LIGHTCOND'].mode()[0])

In [56]:
# We are interested in collisions involving at least one person, so we delete the accidents with zero people involved

# Get the index labels of the rows where PERSONCOUNT is 0
indexNames = df[df['PERSONCOUNT'] == 0].index

# Delete these rows from the DataFrame
df.drop(indexNames, inplace=True)
df['PERSONCOUNT'].value_counts()

2     105130
3      34741
4      14306
1      11168
5       6533
6       2684
7       1120
8        530
9        212
10       127
11        56
12        33
13        20
14        19
17        11
15        11
16         8
18         6
20         6
44         6
25         6
19         5
22         4
26         4
27         3
28         3
29         3
32         3
34         3
47         3
37         3
23         2
21         2
24         2
30         2
36         2
57         1
31         1
35         1
39         1
41         1
43         1
48         1
53         1
54         1
81         1
Name: PERSONCOUNT, dtype: int64

In [57]:
# We are interested only in accidents involving cars, so we delete those that do not involve any vehicle (99.48% of which involve bicycles)

# Get the index labels of the rows where VEHCOUNT is 0
indexNames = df[df['VEHCOUNT'] == 0].index

# Delete these rows from the DataFrame
df.drop(indexNames, inplace=True)
df['VEHCOUNT'].value_counts()

2     136892
1      24210
3      12459
4       2320
5        504
6        139
7         43
8         15
9          8
11         5
10         2
12         1
Name: VEHCOUNT, dtype: int64

In [58]:
df.shape

(176598, 38)
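
The two filtering cells above look up index labels and then call `drop`. As a hedged alternative, the same kind of selection can be written with a single boolean mask. The sketch below uses a made-up `toy` frame and does not modify `df`.

```python
import pandas as pd

# Toy frame standing in for the collision data (values are made up)
toy = pd.DataFrame({
    "PERSONCOUNT": [2, 0, 3, 1],
    "VEHCOUNT": [2, 1, 0, 1],
})

# Keep only the rows with at least one person and at least one vehicle
filtered = toy[(toy["PERSONCOUNT"] > 0) & (toy["VEHCOUNT"] > 0)]
print(filtered.shape)
```

The mask form returns a new frame instead of editing in place, which makes it easier to compare the row counts before and after filtering.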

### Correct data format

In [59]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING            int64
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

Considering the data format of each attribute that we are going to use, we do not need to change any of them.

| VARIABLES | FORMAT |
| :- | :-: |
| SEVERITYCODE | int64 |
| PERSONCOUNT | int64 |
| VEHCOUNT | int64 |
| JUNCTIONTYPE | object |
| WEATHER | object |
| ROADCOND | object |
| LIGHTCOND | object |
| SPEEDING | int64 |

### One hot encoding technique
We use the one hot encoding technique to convert categorical variables to binary variables and append them to the feature DataFrame.

In [60]:
Feature = df[['SPEEDING','JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND', 'PERSONCOUNT', 'VEHCOUNT']]
Feature = pd.concat([Feature,pd.get_dummies(df['JUNCTIONTYPE'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['WEATHER'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['ROADCOND'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['LIGHTCOND'])], axis=1)
Feature.drop(['JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis = 1,inplace=True)
Feature.head()

Unnamed: 0,SPEEDING,PERSONCOUNT,VEHCOUNT,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Blowing Sand/Dirt,...,Snow/Slush,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk
0,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
1,0,2,2,0,0,0,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0
2,0,4,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,3,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
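
For comparison, a more compact form of the same encoding is sketched below on a made-up `toy` frame. Passing `columns=` to `pd.get_dummies` encodes the listed fields and drops the originals in one call. Unlike the cell above, it prefixes each indicator column with the source column name, so the resulting column names differ from the notebook's.

```python
import pandas as pd

# Toy frame standing in for the collision data (values are made up)
toy = pd.DataFrame({
    "SPEEDING": [0, 1, 0],
    "ROADCOND": ["Dry", "Wet", "Dry"],
    "LIGHTCOND": ["Daylight", "Dusk", "Daylight"],
})

# columns= selects which fields to encode; each one is replaced by
# one indicator column per label, named "<column>_<label>"
encoded = pd.get_dummies(toy, columns=["ROADCOND", "LIGHTCOND"], dtype=int)
print(encoded.columns.tolist())
```

The prefix keeps indicator columns distinct when two attributes happen to share a label.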


### Feature selection

We define feature set X and the labels y

In [61]:
X = Feature
X[0:5]

Unnamed: 0,SPEEDING,PERSONCOUNT,VEHCOUNT,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction,Blowing Sand/Dirt,...,Snow/Slush,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk
0,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
1,0,2,2,0,0,0,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0
2,0,4,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,3,3,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,2,2,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0


In [62]:
y = df['SEVERITYCODE'].values
y[0:5]

array([2, 1, 1, 1, 2])

The data are now cleaned, correctly formatted and ready to be normalised.

### Normalize Data
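
As a rough illustration of what the scaling step does: `StandardScaler` centres each column on its mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance. The NumPy sketch below reproduces that arithmetic on a made-up array; `standardise` and `demo` are hypothetical names, not part of the notebook.

```python
import numpy as np

def standardise(values):
    """Column-wise z-scores: subtract the mean, divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # ddof=0, i.e. the population standard deviation
    std[std == 0] = 1.0       # leave constant columns unscaled instead of dividing by zero
    return (values - mean) / std

# Made-up feature matrix: one indicator-like column and one count-like column
demo = np.array([[0, 2], [0, 4], [1, 3], [0, 3]])
scaled = standardise(demo)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

This also explains why the one hot encoded columns no longer contain 0 and 1 after scaling: each indicator is shifted and rescaled like any other numeric column.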

In [63]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-0.23338275, -0.41883632,  0.0466896 , -0.10550156,  1.40990017,
        -0.24777912, -0.3761547 , -0.94475077, -0.02982978, -0.01648871,
        -1.3960664 , -0.05502082,  2.37945644, -0.00532106, -0.46856946,
        -0.01189893, -0.02450702, -0.07015868, -1.64001704, -0.0807124 ,
        -0.0181256 , -0.01918862, -0.07377594, -0.02415752,  1.69647751,
        -0.08961716, -0.08021337, -0.59606853, -0.00713903, -0.11638166,
         0.68857204, -0.18094994],
       [-0.23338275, -0.41883632,  0.0466896 , -0.10550156, -0.70927007,
        -0.24777912, -0.3761547 ,  1.05848022, -0.02982978, -0.01648871,
        -1.3960664 , -0.05502082, -0.42026405, -0.00532106,  2.13415532,
        -0.01189893, -0.02450702, -0.07015868, -1.64001704, -0.0807124 ,
        -0.0181256 , -0.01918862, -0.07377594, -0.02415752,  1.69647751,
        -0.08961716, -0.08021337,  1.67765944, -0.00713903, -0.11638166,
        -1.45228087, -0.18094994],
       [-0.23338275,  1.09335178,  1.83486802, -0.1055

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>

## References <a name="notes"></a>

1. "Road traffic injuries", World Health Organisation (WHO), 07/02/2020, https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
2. Ibidem
3. "ArcGIS Metadata Form", https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf