# Capstone Project - Car accident severity (Week 1)

### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
* [References](#notes)

## Introduction: Business Problem <a name="introduction"></a>

According to the WHO ([1](#notes)), around 1.35 million people die every year in road traffic crashes, and between 20 and 50 million more suffer non-fatal injuries, many of which result in a disability.

Road traffic crashes also have an economic impact at the national level: they cost most countries around 3% of their gross domestic product ([2](#notes)).

It would therefore be valuable to warn drivers about the likelihood of getting into a car accident, and about how severe that accident would be, given the weather and road conditions. People could then drive more carefully or even change their travel plans if they are able to.

Most of all, this could actually save lives.

This is exactly what we will try to do in this project. We will use the data provided by the city of Seattle, take into consideration the specific road and weather conditions, and try to predict the likelihood and severity of an accident in Seattle, so that a person can better decide what to do.

## Data <a name="data"></a>

### Data understanding:

In this phase, we collect the dataset from its source (here, a CSV file) and determine the attributes (columns) that we will use to train the machine learning model. We also assess the condition of the chosen attributes by looking for trends, patterns, skewed information, correlations, and so on.


### The data

The dataset from the city of Seattle lists all the collisions (194,673 in total) provided by the SPD (Seattle Police Department) and recorded by Traffic Records. It includes all types of collisions from 2004 to present, involving cars, bikes and pedestrians.

Thanks to the description of the data provided by the city of Seattle ([3](#notes)), we have been able to define what, in our opinion, are the best independent variables and their dependent variable.

The scope of the project is to predict the likelihood and severity of an accident. We therefore use SEVERITYCODE (i.e. the severity of the accident) as the dependent variable. SEVERITYCODE is a categorical variable whose code corresponds to the severity of the collision: 1 (property damage) and 2 (injury).

Out of the 37 attributes available in the Seattle accident dataset, we will consider 8 as independent variables.

| VARIABLE | DESCRIPTION |
| :-: | :-: |
| LOCATION | Description of the general location of the collision |
| PERSONCOUNT | The total number of people involved in the collision |
| VEHCOUNT | The number of vehicles involved in the collision. This is entered by the state |
| JUNCTIONTYPE | Category of junction at which the collision took place |
| WEATHER | A description of the weather conditions during the time of the collision |
| ROADCOND | The condition of the road during the collision |
| LIGHTCOND | The light conditions during the collision |
| SPEEDING | Whether or not speeding was a factor in the collision |

These attributes are the ones most closely related to our objective (predicting the likelihood and severity of an accident in Seattle from the road and weather conditions).

Firstly, we work on the dataset and define:
* which attributes have missing data
* what our options are for each column:
    * remove the rows that have empty observations
    * complete them when they have a binary result (like "SPEEDING", which can only be "Y" or "N" but has been filled in only when the answer was affirmative).
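The two options above can be sketched on a small toy frame. This is only an illustration: the column names come from the dataset, but the `toy` frame and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values standing in for the collisions data.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
})

# 1) Count the missing values of each attribute.
missing = toy.isnull().sum()

# 2) SPEEDING is only filled in when the answer is "Y", so a missing
#    value is read as "N" and the column is then encoded as 0/1.
toy["SPEEDING"] = toy["SPEEDING"].fillna("N").map({"N": 0, "Y": 1})

# 3) Drop the rows that still miss a value in the other attributes.
clean = toy.dropna(subset=["WEATHER", "ROADCOND"])

print(missing)
print(clean)
```

Filling the binary column before calling `dropna` matters here: otherwise almost every row would be dropped because of the missing "SPEEDING" values.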

Secondly, we normalise the dataset, train the candidate classification techniques (K-nearest neighbours, decision trees, logistic regression and support vector machine) and keep the one that predicts with the highest accuracy.
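A minimal sketch of this comparison with scikit-learn is shown below. It runs on synthetic data from `make_classification` rather than on the collisions dataset, and the hyperparameters (`n_neighbors=5`, `max_depth=4`, the RBF kernel) are arbitrary assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 samples, 8 features, 2 classes.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The four candidate techniques named in the text.
models = {
    "K-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(kernel="rbf"),
}

# Train each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_s))

best_name = max(scores, key=scores.get)
print(scores)
print("Best on this toy data:", best_name)
```

On an imbalanced label such as SEVERITYCODE, accuracy alone can look good for a model that mostly predicts the majority class, so it may be worth checking a per-class metric as well.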

We do all of this taking into consideration that our labelled dataset is imbalanced and that we need to balance it while preparing the data.
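One possible way to balance the labels is to downsample the majority class to the size of the minority class. The sketch below assumes that approach (the text does not say which balancing method is used) and works on a made-up `toy` frame.

```python
import pandas as pd

# Toy frame (hypothetical values): 7 rows of class 1 and 3 rows of class 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "VEHCOUNT": [2, 2, 1, 3, 2, 2, 1, 2, 1, 2],
})

# Size of the smallest class.
minority_size = toy["SEVERITYCODE"].value_counts().min()

# Sample the same number of rows from every class.
parts = [
    group.sample(n=minority_size, random_state=0)
    for _, group in toy.groupby("SEVERITYCODE")
]

# Concatenate the samples and shuffle the rows.
balanced = pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)

print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards data from the majority class; oversampling the minority class is an alternative with the opposite trade-off.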

In [107]:
import itertools

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing

%matplotlib inline

In [108]:
#wget -O /Users/carlopeano/Desktop/projects/Coursera_Capstone/Data_Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv 

In [183]:
df = pd.read_csv('/Users/carlopeano/Desktop/projects/Coursera_Capstone/Data_Collisions.csv')
df.head()


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [184]:
df.shape

(194673, 38)

In [185]:
df.isnull().sum(axis = 0)

SEVERITYCODE           0
X                   5334
Y                   5334
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE            1926
INTKEY            129603
LOCATION            2677
EXCEPTRSNCODE     109862
EXCEPTRSNDESC     189035
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        6329
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SDOTCOLNUM         79737
SPEEDING          185340
ST_COLCODE            18
ST_COLDESC          4904
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64

In [186]:
df["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
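To make the imbalance explicit, the counts above can be turned into class shares. The sketch below rebuilds the two counts by hand; on the real frame, `df["SEVERITYCODE"].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class and ratio between the two classes.
shares = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(shares.round(4))
print(round(imbalance_ratio, 2))
```

Roughly 70% of the collisions are class 1 (property damage) and 30% are class 2 (injury).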

In column "SPEEDING", convert NaN and "Y" to 0 and 1:

# Missing values are np.nan, not the string 'nan', so np.nan is what we need to replace
df['SPEEDING'] = df['SPEEDING'].replace(to_replace=[np.nan, 'Y'], value=[0, 1])
df.head()

* "LOCATION": 2677 missing values
    * Action:
       * delete the whole row

* "JUNCTIONTYPE": 6329 missing values and 9 "Unknown"
    * Action:
       * delete the rows with NaN or "Unknown"
       * transform the categories into columns of 0 and 1

* "WEATHER": 5081 missing values, 832 "Other" and 15091 "Unknown"
    * Action:
       * delete the rows with NaN or "Other"
       * replace "Unknown" with the mode (or distribute them according to the mean)
       * transform the categories into columns of 0 and 1

* "ROADCOND": 5012 missing values, 132 "Other" and 15078 "Unknown"
    * Action:
       * delete the rows with NaN or "Other"
       * replace "Unknown" with the mode (or distribute them according to the mean)
       * transform the categories into columns of 0 and 1

* "LIGHTCOND": 5170 missing values, 235 "Other" and 13473 "Unknown"
    * Action:
       * delete the rows with NaN or "Other"
       * replace "Unknown" with the mode (or distribute them according to the mean)
       * transform the categories into columns of 0 and 1

* "PERSONCOUNT": 5544 accidents involving 0 people (i.e. nobody has been injured)

* "VEHCOUNT": 5085 accidents involving zero vehicles
    * our study considers accidents involving vehicles, not those involving bicycles (99.5% of the cases with zero vehicles)
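The plan for the categorical columns (drop "Other", replace "Unknown" with the mode, then create 0/1 columns) can be sketched as follows. The `toy` frame is hypothetical, and the mode is taken over the known values only, which is an assumption about how the replacement would be done.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column.
toy = pd.DataFrame({"WEATHER": ["Clear", "Unknown", "Raining", "Clear", "Other"]})

# Drop the rows labelled "Other".
toy = toy[toy["WEATHER"] != "Other"].copy()

# Most frequent value among the known categories.
known_mode = toy.loc[toy["WEATHER"] != "Unknown", "WEATHER"].mode()[0]

# Replace "Unknown" with that mode.
toy["WEATHER"] = toy["WEATHER"].replace("Unknown", known_mode)

# One column of 0/1 per remaining category.
dummies = pd.get_dummies(toy["WEATHER"], prefix="WEATHER", dtype=int)
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

The same three steps would apply to "ROADCOND" and "LIGHTCOND" by changing the column name.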

In [187]:
# Replace the nan and "Y" in SPEEDING with 0 and 1

# Assign the result back instead of using inplace=True on a column selection
df['SPEEDING'] = df['SPEEDING'].fillna("N")

In [188]:
df['SPEEDING'] = df['SPEEDING'].replace(to_replace=['N', 'Y'], value=[0, 1])
df['SPEEDING'].value_counts()

0    185340
1      9333
Name: SPEEDING, dtype: int64

In [140]:
# Eliminate nan in LOCATION, JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND
df.dropna(subset=['LOCATION','JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis=0, inplace=True)
df[['LOCATION', 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND']].isnull().sum(axis = 0)
df.shape

(182679, 38)

In [189]:
df["JUNCTIONTYPE"].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

In [150]:
# Get the index labels of the rows where JUNCTIONTYPE is "Unknown"
indexNames = df[df['JUNCTIONTYPE'] == 'Unknown'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df["JUNCTIONTYPE"].value_counts()

Mid-Block (not related to intersection)              86438
At Intersection (intersection related)               61221
Mid-Block (but intersection related)                 22341
Driveway Junction                                    10460
At Intersection (but not related to intersection)     2054
Ramp Junction                                          160
Name: JUNCTIONTYPE, dtype: int64

In [151]:
# Get the index labels of the rows where WEATHER is "Other"
indexNames = df[df['WEATHER'] == 'Other'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df["WEATHER"].value_counts()

Clear                       108925
Raining                      32623
Overcast                     27160
Unknown                      11597
Snowing                        881
Fog/Smog/Smoke                 555
Sleet/Hail/Freezing Rain       112
Blowing Sand/Dirt               49
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [182]:
# Get the index labels of the rows where WEATHER is "Unknown"
indexNames = df[df['WEATHER'] == 'Unknown'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [190]:
df["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [153]:
# Get the index labels of the rows where ROADCOND is "Other"
indexNames = df[df['ROADCOND'] == 'Other'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df['ROADCOND'].value_counts()

Dry               121831
Wet                46615
Unknown            11012
Ice                 1158
Snow/Slush           970
Standing Water       108
Sand/Mud/Dirt         63
Oil                   59
Name: ROADCOND, dtype: int64

In [170]:
# Get the index labels of the rows where ROADCOND is "Unknown"
indexNames = df[df['ROADCOND'] == 'Unknown'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df['ROADCOND'].value_counts()

Dry               120822
Wet                46096
Ice                 1087
Snow/Slush           877
Standing Water       101
Oil                   57
Sand/Mud/Dirt         57
Name: ROADCOND, dtype: int64

In [191]:
df['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [156]:
# Get the index labels of the rows where LIGHTCOND is "Other"
indexNames = df[df['LIGHTCOND'] == 'Other'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df['LIGHTCOND'].value_counts()

Daylight                    113385
Dark - Street Lights On      47334
Unknown                      10131
Dusk                          5744
Dawn                          2438
Dark - No Street Lights       1435
Dark - Street Lights Off      1147
Dark - Unknown Lighting          9
Name: LIGHTCOND, dtype: int64

In [171]:
# Get the index labels of the rows where LIGHTCOND is "Unknown"
indexNames = df[df['LIGHTCOND'] == 'Unknown'].index

# Delete those rows from the DataFrame
df.drop(indexNames, inplace=True)
df['LIGHTCOND'].value_counts()

Daylight                    110677
Dark - Street Lights On      45951
Dusk                          5547
Dawn                          2365
Dark - No Street Lights       1335
Dark - Street Lights Off      1082
Dark - Unknown Lighting          8
Name: LIGHTCOND, dtype: int64

In [192]:
df['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64
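The cells above repeat the same "find the rows, then drop them" pattern for each column. As a possible simplification, the pattern could be wrapped in a small helper; `drop_rows_with_values` is a hypothetical name introduced here for illustration, and the `toy` frame is made up.

```python
import pandas as pd


def drop_rows_with_values(frame, column, values):
    """Return a copy of frame without the rows whose column value is in values."""
    return frame[~frame[column].isin(values)].copy()


# Toy frame (hypothetical values) with two of the categorical columns.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Other", "Unknown", "Raining"],
    "LIGHTCOND": ["Daylight", "Dusk", "Daylight", "Unknown"],
})

# Remove "Other" and "Unknown" from both columns in one loop.
for col in ["WEATHER", "LIGHTCOND"]:
    toy = drop_rows_with_values(toy, col, ["Other", "Unknown"])

print(toy)
```

Because the helper returns a new frame instead of modifying it in place, re-running a cell out of order is less likely to leave the data in an unexpected state.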

In [195]:
df['PERSONCOUNT'].value_counts()

2     114231
3      35553
4      14660
1      13154
5       6584
0       5544
6       2702
7       1131
8        533
9        216
10       128
11        56
12        33
13        21
14        19
15        11
17        11
16         8
44         6
18         6
20         6
25         6
19         5
26         4
22         4
27         3
28         3
29         3
47         3
32         3
34         3
37         3
23         2
21         2
24         2
30         2
36         2
57         1
31         1
35         1
39         1
41         1
43         1
48         1
53         1
54         1
81         1
Name: PERSONCOUNT, dtype: int64

In [193]:
# We are only interested in car accidents, so we can delete the rows that do not involve cars (99.48% of them involve bicycles)
# Get the index labels of the rows where VEHCOUNT is 0
#indexNames = df[df['VEHCOUNT'] == 0].index

# Delete those rows from the DataFrame
#df.drop(indexNames , inplace=True)
df['VEHCOUNT'].value_counts()

2     147650
1      25748
3      13010
0       5085
4       2426
5        529
6        146
7         46
8         15
9          9
11         6
10         2
12         1
Name: VEHCOUNT, dtype: int64

In [161]:
df.shape

(181429, 38)

In [172]:
df.shape

(166965, 38)

In [173]:
df['INCDATE'] = pd.to_datetime(df['INCDATE'])
df['dayofweek'] = df['INCDATE'].dt.dayofweek
df['dayofweek'].value_counts()

4    27792
3    25224
2    24733
1    24503
5    23598
0    22401
6    18714
Name: dayofweek, dtype: int64
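In pandas, `dt.dayofweek` numbers the days from Monday (0) to Sunday (6), so code 4, the most frequent value above, is Friday and code 6, the least frequent, is Sunday. The sketch below checks the mapping on three hand-picked dates, which are assumptions made for the example.

```python
import pandas as pd

# Three hand-picked dates: a Monday, a Friday and a Sunday.
dates = pd.to_datetime(pd.Series(["2020-01-06", "2020-01-10", "2020-01-12"]))

# Numeric code (Monday=0 ... Sunday=6) and the matching day name.
dayofweek = dates.dt.dayofweek
day_names = dates.dt.day_name()

print(pd.DataFrame({"date": dates, "dayofweek": dayofweek, "day_name": day_names}))
```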

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>

## References <a name="notes"></a>

1. "Road traffic injuries", World Health Organisation (WHO), 07/02/2020, https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
2. Ibidem
3. "ArcGIS Metadata Form", https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf