# Capstone Project - Auto Accident Prediction (Week 2)
## Applied Data Science Capstone by IBM/Coursera

This notebook will be used for the Applied Data Science Capstone project.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Say you are driving to another city for work or to visit some friends. It is rainy and windy. On the way to your destination, you come across a terrible traffic jam on the other side of the highway. Long lines of cars are barely moving. As you keep driving, police cars start appearing in the distance, shutting down the highway. There has been an accident, and a helicopter is transporting those involved in the crash to the nearest hospital. The victims must be in critical condition for all of this to be happening.
 
Now, wouldn't it be great if something were in place that could warn you, given the weather and the road conditions, about the possibility of getting into a car accident and how severe it would be? The advance warning could prompt you to drive more carefully or even change your travel plans if you are able to.

## Data <a name="data"></a>

Load the required libraries

In [13]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
%matplotlib inline

### Retrieve The Dataset
The model is trained and evaluated on the collision data set from the SDOT Traffic Management Division, Traffic Records Group. The data set covers collisions from 2004 to the present and is updated weekly. It includes all collisions provided by the Seattle Police Department and recorded by the Traffic Records Group.


Download the current collision data from [Seattle Geo Data](http://data-seattlecitygis.opendata.arcgis.com).

In [14]:
!wget -O Collisions.csv https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv

--2020-09-08 10:26:36--  https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv
Resolving opendata.arcgis.com (opendata.arcgis.com)... 54.204.141.17, 34.224.12.157, 50.19.49.12
Connecting to opendata.arcgis.com (opendata.arcgis.com)|54.204.141.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘Collisions.csv’

    [     <=>                               ] 84,855,377  87.6MB/s   in 0.9s   

2020-09-08 10:26:38 (87.6 MB/s) - ‘Collisions.csv’ saved [84855377]



### Load Data from CSV file
The data has unlabeled extra columns, which will cause an error if not accounted for. The _OBJECTID_ is used as the index for this dataset.

In [15]:
cols = pd.read_csv('Collisions.csv', nrows=1).columns
df = pd.read_csv('Collisions.csv', usecols=cols, index_col=2)
df.head()

| OBJECTID | X | Y | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | LOCATION | EXCEPTRSNCODE | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -122.386772 | 47.56472 | 326234 | 327734 | E984735 | Matched | Intersection | 31893.0 | CALIFORNIA AVE SW AND SW GENESEE ST |  | ... | Dry | Daylight | Y |  |  | 2.0 | Vehicle turning left hits pedestrian | 0 | 0 | N |
| 2 | -122.341806 | 47.686934 | 326246 | 327746 | E985430 | Matched | Intersection | 24228.0 | STONE AVE N AND N 80TH ST |  | ... | Wet | Dark - Street Lights On |  |  |  | 10.0 | Entering at angle | 0 | 0 | N |
| 3 | -122.374899 | 47.668666 | 329254 | 330754 | EA16720 | Matched | Block |  | NW MARKET ST BETWEEN 14TH AVE NW AND 15TH AVE NW |  | ... | Dry | Daylight |  |  |  | 11.0 | From same direction - both going straight - bo... | 0 | 0 | N |
| 4 | -122.300758 | 47.683047 | 21200 | 21200 | 1227970 | Matched | Intersection | 24661.0 | 25TH AVE NE AND NE 75TH ST |  | ... | Wet | Dark - Street Lights On |  | 4160038.0 |  | 28.0 | From opposite direction - one left turn - one ... | 0 | 0 | N |
| 5 | -122.313053 | 47.567241 | 17000 | 17000 | 1793348 | Unmatched | Block |  | S DAKOTA ST BETWEEN 15TH AVE S AND 16TH AVE S |  | ... |  |  |  | 4289025.0 |  |  |  | 0 | 0 | N |


### Preprocess The Data

Normalize the data and fill in missing values where it makes sense. Display the frequency tables for various features to help determine which features to use.

In [16]:
df['ADDRTYPE'] = df['ADDRTYPE'].fillna('Unknown')
print("\nAddress Type:\n", df['ADDRTYPE'].value_counts())

df['WEATHER'] = df['WEATHER'].fillna('Unknown')
print("\nWeather:\n", df['WEATHER'].value_counts())

df['LIGHTCOND'] = df['LIGHTCOND'].fillna('Unknown')
print("\nLight Conditions:\n", df['LIGHTCOND'].value_counts())

df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')
print("\nRoad Conditions:\n", df['ROADCOND'].value_counts())

df['JUNCTIONTYPE'] = df['JUNCTIONTYPE'].fillna('Unknown')
print("\nJunction Type:\n", df['JUNCTIONTYPE'].value_counts())

# treat a blank record as N
df['INATTENTIONIND'] = df['INATTENTIONIND'].fillna('N')
print("\nInattention Indicator:\n", df['INATTENTIONIND'].value_counts())

# treat a blank record as N, a 0 as N, and a 1 as Y
df['UNDERINFL'] = df['UNDERINFL'].fillna('N')
df['UNDERINFL'] = df['UNDERINFL'].replace(['0','1'],['N','Y'])
print("\nUnder Influence:\n", df['UNDERINFL'].value_counts())

# treat a blank record as N, a 0 as N, and a 1 as Y
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].fillna('N')
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].replace(['0','1'],['N','Y'])
print("\nPedestrian Not Granted:\n", df['PEDROWNOTGRNT'].value_counts())

# treat a blank record as N, a 0 as N, and a 1 as Y
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df['SPEEDING'] = df['SPEEDING'].replace(['0','1'],['N','Y'])
print("\nSpeeding:\n", df['SPEEDING'].value_counts())

print("\nHit Parked Car:\n", df['HITPARKEDCAR'].value_counts())

print("\nSeverity Code:\n", df['SEVERITYCODE'].value_counts())



Address Type:
 Block           144857
Intersection     71823
Unknown           3712
Alley              874
Name: ADDRTYPE, dtype: int64

Weather:
 Clear                       114361
Unknown                      41819
Raining                      34021
Overcast                     28508
Snowing                        919
Other                          853
Fog/Smog/Smoke                 577
Sleet/Hail/Freezing Rain       116
Blowing Sand/Dirt               56
Severe Crosswind                26
Partly Cloudy                    9
Blowing Snow                     1
Name: WEATHER, dtype: int64

Light Conditions:
 Daylight                    119166
Dark - Street Lights On      50053
Unknown                      40299
Dusk                          6076
Dawn                          2599
Dark - No Street Lights       1573
Dark - Street Lights Off      1236
Other                          244
Dark - Unknown Lighting         20
Name: LIGHTCOND, dtype: int64

Road Conditions:
 Dry               12
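
The indicator columns above all follow the same fill-and-map pattern, so the steps could be factored into one helper. This is a minimal sketch on a toy series, assuming the flags arrive as the strings `Y`, `N`, `0`, `1`, or as blanks; `normalize_flag` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd


def normalize_flag(series: pd.Series) -> pd.Series:
    """Map a mixed Y/N/0/1/blank indicator column to 'Y' or 'N'."""
    # Blank records are treated as 'N', matching the cells above.
    filled = series.fillna('N')
    # String '0' and '1' are mapped to 'N' and 'Y'; other values pass through.
    return filled.replace({'0': 'N', '1': 'Y'})


# Toy data standing in for a column such as UNDERINFL.
sample = pd.Series(['Y', '0', '1', np.nan, 'N'], dtype=object)
normalized = normalize_flag(sample)
print(normalized.tolist())
```

Applied to the real frame, the same helper could be called once per column, for example `df['SPEEDING'] = normalize_flag(df['SPEEDING'])`.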

### Assess The Features To Use

Change the _WEATHER_ types _Snowing_, _Fog/Smog/Smoke_, _Sleet/Hail/Freezing Rain_, _Blowing Sand/Dirt_, _Severe Crosswind_, _Partly Cloudy_, and _Blowing Snow_ to _Other_. Each accounts for fewer than 1,000 records, so they can be safely combined to help limit the feature set. Timestamps are not available. If they were, the _Unknown_ values could be set to reflect the appropriate weather conditions.


In [17]:
# Assign the result back; inplace=True on a column selection may not update df in newer pandas.
rare_weather = ['Snowing', 'Fog/Smog/Smoke', 'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt', 'Severe Crosswind', 'Partly Cloudy', 'Blowing Snow']
df['WEATHER'] = df['WEATHER'].replace(rare_weather, 'Other')
print("\nWeather:\n", df['WEATHER'].value_counts())


Weather:
 Clear       114361
Unknown      41819
Raining      34021
Overcast     28508
Other         2557
Name: WEATHER, dtype: int64
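
The grouping above lists the rare categories by hand. As an alternative, a frequency threshold can pick them automatically. The sketch below is illustrative only: `collapse_rare`, `min_count`, and `keep` are hypothetical names, and the toy series stands in for `df['WEATHER']`.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int, other: str = 'Other',
                  keep: tuple = ('Unknown',)) -> pd.Series:
    """Replace categories seen fewer than min_count times with `other`."""
    counts = series.value_counts()
    # Labels listed in `keep` are never collapsed, whatever their frequency.
    rare = [c for c in counts.index if counts[c] < min_count and c not in keep]
    # Keep each value unless it is one of the rare labels.
    return series.where(~series.isin(rare), other)


weather = pd.Series(['Clear'] * 5 + ['Raining'] * 3 + ['Snowing', 'Blowing Snow'])
collapsed = collapse_rare(weather, min_count=2)
print(collapsed.value_counts().to_dict())
```

On the frequency table shown earlier, `min_count=1000` would select the same seven categories, plus the existing _Other_, and leave _Clear_, _Raining_, _Overcast_, and _Unknown_ untouched.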


Change the _LIGHTCOND_ types _Dark - Street Lights On_, _Dark - No Street Lights_, _Dark - Street Lights Off_, and _Dark - Unknown Lighting_ to _Dark_. They all describe dark conditions, so they can be safely combined to help limit the feature set. Timestamps are not available. If they were, the _Unknown_ values could be set to reflect the appropriate light conditions.

In [18]:
# Assign the result back; inplace=True on a column selection may not update df in newer pandas.
dark_types = ['Dark - Street Lights On', 'Dark - No Street Lights', 'Dark - Street Lights Off', 'Dark - Unknown Lighting']
df['LIGHTCOND'] = df['LIGHTCOND'].replace(dark_types, 'Dark')
print("\nLight Conditions:\n", df['LIGHTCOND'].value_counts())


Light Conditions:
 Daylight    119166
Dark         52882
Unknown      40299
Dusk          6076
Dawn          2599
Other          244
Name: LIGHTCOND, dtype: int64


Change the _ROADCOND_ types _Ice_, _Snow/Slush_, _Standing Water_, _Sand/Mud/Dirt_, and _Oil_ to _Other_. These are not major factors in the data set and can be safely combined to help limit the feature set.

In [19]:
# Assign the result back; inplace=True on a column selection may not update df in newer pandas.
rare_road = ['Ice', 'Snow/Slush', 'Standing Water', 'Sand/Mud/Dirt', 'Oil']
df['ROADCOND'] = df['ROADCOND'].replace(rare_road, 'Other')
print("\nRoad Conditions:\n", df['ROADCOND'].value_counts())


Road Conditions:
 Dry        128171
Wet         48715
Unknown     41739
Other        2641
Name: ROADCOND, dtype: int64


Remove rows where _WEATHER_, _LIGHTCOND_, or _ROADCOND_ is _Unknown_ or _Other_, and rows where _SEVERITYDESC_ is _Unknown_.

In [20]:
print("Rows before cleaning: ", df.shape)
#df = df[~df['ADDRTYPE'].isin(['Unknown'])]
df = df[~df['WEATHER'].isin(['Unknown'])]
df = df[~df['WEATHER'].isin(['Other'])]
df = df[~df['LIGHTCOND'].isin(['Unknown'])]
df = df[~df['LIGHTCOND'].isin(['Other'])]
df = df[~df['ROADCOND'].isin(['Unknown'])]
df = df[~df['ROADCOND'].isin(['Other'])]
df = df[~df['SEVERITYDESC'].isin(['Unknown'])]
print ("Rows after cleaning: ", df.shape)

Rows before cleaning:  (221266, 39)
Rows after cleaning:  (171806, 39)
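
The row filters above can also be written as one boolean mask across the condition columns. This is a sketch on a toy frame, not the notebook's data; `excluded` and `condition_cols` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'WEATHER':   ['Clear', 'Unknown', 'Raining', 'Other', 'Clear'],
    'LIGHTCOND': ['Daylight', 'Dark', 'Unknown', 'Dark', 'Dusk'],
    'ROADCOND':  ['Dry', 'Wet', 'Wet', 'Dry', 'Other'],
})

excluded = ['Unknown', 'Other']
condition_cols = ['WEATHER', 'LIGHTCOND', 'ROADCOND']

# True for rows where none of the condition columns holds an excluded label.
keep = ~toy[condition_cols].isin(excluded).any(axis=1)
cleaned = toy[keep]
print("Rows before cleaning:", toy.shape)
print("Rows after cleaning: ", cleaned.shape)
```

Only the first toy row survives, since every other row has at least one _Unknown_ or _Other_ value.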


### Convert Categorical Features to Numeric Values

In [21]:
Feature = df[['PERSONCOUNT','VEHCOUNT']]
# do One Hot Encoding
Feature = pd.concat([Feature,pd.get_dummies(df['WEATHER'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['ROADCOND'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['LIGHTCOND'])], axis=1)
Feature.head()

| OBJECTID | PERSONCOUNT | VEHCOUNT | Clear | Overcast | Raining | Dry | Wet | Dark | Dawn | Daylight | Dusk |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 4 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 4 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 2 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 7 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
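
The encoded column names above carry no hint of which feature they came from. That works here because _Unknown_ and _Other_ were removed first; if those rows were kept, the _Other_ columns from _WEATHER_ and _ROADCOND_ would share a name. A hedged sketch of prefixed encoding on a toy frame (not the notebook's data):

```python
import pandas as pd

toy = pd.DataFrame({
    'PERSONCOUNT': [2, 4, 3],
    'WEATHER': ['Clear', 'Raining', 'Other'],
    'ROADCOND': ['Dry', 'Wet', 'Other'],
})

# Passing `columns` makes get_dummies prefix each indicator with its source column.
encoded = pd.get_dummies(toy, columns=['WEATHER', 'ROADCOND'], dtype=int)
print(list(encoded.columns))
```

The resulting names, such as `WEATHER_Other` and `ROADCOND_Other`, stay distinct even when two features share a label.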


In [22]:
X = Feature
y = df['SEVERITYDESC'].values

X= preprocessing.StandardScaler().fit(X).transform(X)

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (137444, 11) (137444,)
Test set: (34362, 11) (34362,)
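
The cell above fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. One alternative, sketched below on synthetic data, is to split first and wrap the scaler and the model in a pipeline so the scaler is fitted on the training rows only. The data, the variable names, and the choice of classifier are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.RandomState(4)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# The pipeline fits the scaler on X_tr and reuses its statistics for X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print('Train set:', X_tr.shape, 'Test set:', X_te.shape)
print('Test accuracy:', test_accuracy)
```

With one-hot features and a large sample the effect on the scores is likely small, but the pattern keeps the evaluation strictly out-of-sample.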



## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## K Nearest Neighbor (KNN)

In [None]:
Ks = 10
mean_accKNN = np.zeros((Ks-1))
mean_jacKNN = np.zeros((Ks-1))
mean_F1KNN  = np.zeros((Ks-1))
for n in range(1,Ks):
    
    #Train Model and Predict
    print ("Ks = ", n)
    neigh = KNeighborsClassifier(n_neighbors=n, n_jobs=1, weights='distance').fit(X_train,y_train)
    yhat=neigh.predict(X_test)

    mean_accKNN[n-1] = metrics.accuracy_score(y_test, yhat)
    mean_jacKNN[n-1] = jaccard_similarity_score(y_test, yhat)
    mean_F1KNN[n-1]  = f1_score(y_test, yhat, average='weighted')
    
print ("KNN Accuracy table: ", mean_accKNN)
print( "The best accuracy is", mean_accKNN.max(), "with k=", mean_accKNN.argmax()+1) 

print ("KNN Jaccard index table: ", mean_jacKNN)
print( "The best Jaccard index is", mean_jacKNN.max(), "with k=", mean_jacKNN.argmax()+1) 

print ("KNN F1-score table: ", mean_F1KNN)
print( "The best F1-score is", mean_F1KNN.max(), "with k=", mean_F1KNN.argmax()+1) 
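
The loop above picks `k` by scoring each model on the test set. A possible variation is to choose `k` by cross-validation on the training data and keep the test set for a final check. The sketch below uses synthetic data from `make_classification`; the sample size, fold count, and variable names are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=4)

k_values = range(1, 10)
# Mean 5-fold accuracy for each candidate k.
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k, weights='distance'),
                    X_demo, y_demo, cv=5, scoring='accuracy').mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(cv_scores))]
print("Best k by 5-fold CV:", best_k)
```

On the full training set of about 137,000 rows this multiplies the KNN prediction cost by the number of folds, so a subsample may be more practical.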

## Decision Tree

In [23]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 10)
DT_model.fit(X_train,y_train)
DT_model

Ks = 10
mean_accDT = np.zeros((Ks-1))
mean_jacDT = np.zeros((Ks-1))
mean_F1DT  = np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict
    print ("Ks = ", n)
    DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = n)
    DT_model.fit(X_train,y_train)
    yhat = DT_model.predict(X_test)

    mean_accDT[n-1] = metrics.accuracy_score(y_test, yhat)
    mean_jacDT[n-1] = jaccard_similarity_score(y_test, yhat)
    mean_F1DT[n-1]  = f1_score(y_test, yhat, average='weighted')
    
print ("Decision Tree Accuracy table: ", mean_accDT)
print( "The best accuracy is", mean_accDT.max(), "with k=", mean_accDT.argmax()+1) 

print ("Decision Tree Jaccard index table: ", mean_jacDT)
print( "The best Jaccard index is", mean_jacDT.max(), "with k=", mean_jacDT.argmax()+1) 

print ("Decision Tree F1-score table: ", mean_F1DT)
print( "The best F1-score is", mean_F1DT.max(), "with k=", mean_F1DT.argmax()+1) 

Ks =  1
Ks =  2
Ks =  3
Ks =  4
Ks =  5
Ks =  6
Ks =  7
Ks =  8
Ks =  9
Decision Tree Accuracy table:  [0.67647401 0.70499389 0.70551772 0.70935918 0.70898085 0.71264769
 0.71264769 0.71317153 0.71290961]
The best accuracy is 0.713171526686456 with k= 8
Decision Tree Jaccard index table:  [0.67647401 0.70499389 0.70551772 0.70935918 0.70898085 0.71264769
 0.71264769 0.71317153 0.71290961]
The best Jaccard index is 0.713171526686456 with k= 8
Decision Tree F1-score table:  [0.62938195 0.63573053 0.63820709 0.66148592 0.64745206 0.66085669
 0.66129807 0.66226663 0.66188714]
The best F1-score is 0.6622666321292395 with k= 8
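
The Jaccard table above is identical to the accuracy table because `jaccard_similarity_score` is equivalent to `accuracy_score` for binary and multiclass targets. That function was deprecated in scikit-learn 0.21 and removed in 0.23. Its replacement, `jaccard_score`, computes intersection over union per class. The sketch below uses made-up labels, not the notebook's severity classes, to show how the two metrics differ.

```python
from sklearn.metrics import accuracy_score, jaccard_score

# Toy labels for illustration only.
y_true = ['injury', 'damage', 'damage', 'injury', 'damage', 'damage']
y_pred = ['damage', 'damage', 'damage', 'injury', 'damage', 'injury']

acc = accuracy_score(y_true, y_pred)
# One Jaccard value per class, in the order given by `labels`.
jac_per_class = jaccard_score(y_true, y_pred, average=None, labels=['damage', 'injury'])
# Per-class values averaged with weights proportional to class support.
jac_weighted = jaccard_score(y_true, y_pred, average='weighted')

print("Accuracy:", acc)
print("Jaccard per class:", jac_per_class)
print("Weighted Jaccard:", jac_weighted)
```

Here four of six predictions are correct, yet the `injury` class has a Jaccard index of only 1/3, which the accuracy figure hides.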


## Support Vector Machine
Use `LinearSVC` because of the large number of samples; it scales better than the kernel-based `SVC`.

In [24]:
from sklearn import svm
SVM_model = svm.LinearSVC()
SVM_model.fit(X_train, y_train) 



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [25]:
yhat = SVM_model.predict(X_test)
print ("The SVM model accuracy is: ", metrics.accuracy_score(y_test, yhat))
print("SVM Jaccard index: %.2f" % jaccard_similarity_score(y_test, yhat))
print("SVM F1-score: %.2f" % f1_score(y_test, yhat, average='weighted') )


The SVM model accuracy is:  0.6627961119841685
SVM Jaccard index: 0.66
SVM F1-score: 0.57
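
The model above uses the `LinearSVC` defaults. The scikit-learn documentation recommends `dual=False` when there are more samples than features, which is the case here (137,444 rows by 11 columns). The sketch below shows that setting on synthetic data. `class_weight='balanced'` is an additional assumption that may help if the severity classes are imbalanced; it is not something the notebook tested.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in with the same number of features as the notebook's X.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=4)

# dual=False solves the primal problem, preferred when n_samples > n_features.
svm_demo = LinearSVC(dual=False, class_weight='balanced', random_state=4)
svm_demo.fit(X_demo, y_demo)
train_accuracy = svm_demo.score(X_demo, y_demo)
print("Training accuracy:", train_accuracy)
```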



## Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression(C=0.01).fit(X_train,y_train)
LR_model



LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [27]:
yhat = LR_model.predict(X_test)
print ("The LR model accuracy is: ", metrics.accuracy_score(y_test, yhat))
yhat_prob = LR_model.predict_proba(X_test)
print("LR Jaccard index: %.2f" % jaccard_similarity_score(y_test, yhat))
print("LR F1-score: %.2f" % f1_score(y_test, yhat, average='weighted') )
print("LR LogLoss: %.2f" % log_loss(y_test, yhat_prob))

The LR model accuracy is:  0.6632326407077586
LR Jaccard index: 0.66
LR F1-score: 0.57
LR LogLoss: 0.70
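
The log loss reported above is the mean negative log of the probability the model assigned to each sample's true class. A small worked sketch with hand-written probabilities (toy values, not model output) shows the calculation next to `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy three-class example: integer labels and one probability row per sample.
y_true = np.array([0, 2, 1, 0])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
])

# Pick the probability of the true class for each row, then average -log(p).
true_class_prob = proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(true_class_prob))
library = log_loss(y_true, proba, labels=[0, 1, 2])
print("Manual log loss: ", manual)
print("sklearn log loss:", library)
```

Lower values are better; confident wrong predictions are penalized most heavily because `-log(p)` grows quickly as `p` approaches zero.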



## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>