# Capstone Project - Auto Accident Prediction (Week 2)
## Applied Data Science Capstone by IBM/Coursera

This notebook will be used for the Applied Data Science Capstone project.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data Acquisition and Cleaning](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Say you are driving to another city for work or to visit some friends. It is rainy and windy. On the way to your destination, you come across a terrible traffic jam on the other side of the highway. Long lines of cars are barely moving. As you keep driving, police cars start appearing in the distance, shutting down the highway. There has been an accident, and a helicopter is transporting the people involved in the crash to the nearest hospital. The victims must be in critical condition for all of this to be happening.
 
Now, wouldn't it be great if something were in place that could warn you, given the weather and the road conditions, about the likelihood of getting into a car accident and how severe it would be? The advance warning could prompt you to drive more carefully or even change your travel plans if you are able to.

## Data Acquisition and Cleaning <a name="data"></a>

Load the required libraries

In [59]:
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import statsmodels.api as sm
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
%matplotlib inline

### Data Sources
The data used to train and evaluate the model is the collision data set from the SDOT Traffic Management Division, Traffic Records Group. It covers collisions from 2004 to the present and is updated weekly. The data set is compiled from all collisions provided by the Seattle Police Department and recorded by the Traffic Records Group.


Download the current collision data from <a href="http://data-seattlecitygis.opendata.arcgis.com">Seattle Geo Data</a>.

In [71]:
# Live data from Seattle
!wget -O Collisions.csv https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv

# Data on IBM Cloud
#!wget -O Collisions.csv https://dataplatform.cloud.ibm.com/projects/3e2f7aff-6ac9-4bb0-aab2-69024078c07a/data-assets/57754927-cde0-43b8-8c82-caf1f02f13a6

--2020-09-15 15:01:07--  https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv
Resolving opendata.arcgis.com (opendata.arcgis.com)... 54.204.141.17, 52.45.166.77, 50.19.49.12, ...
Connecting to opendata.arcgis.com (opendata.arcgis.com)|54.204.141.17|:443... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2020-09-15 15:01:08 ERROR 500: Internal Server Error.
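
The download above ended in a server error, so the later `read_csv` call had no file to read. A minimal sketch of a guarded download in Python follows; `fetch_csv` is a hypothetical helper, not part of the original notebook, and it only uses the standard library. It returns `False` instead of raising, so a notebook can fall back to a cached copy.

```python
import urllib.error
import urllib.request
from pathlib import Path


def fetch_csv(url, dest):
    """Download url to dest. Return True on success, False on any URL/HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            Path(dest).write_bytes(response.read())
        return True
    except urllib.error.URLError as err:
        # HTTPError (for example a 500 response) is a subclass of URLError
        print("Download failed:", err)
        return False
```

A call such as `fetch_csv(url, 'Collisions.csv')` could then gate the loading step, for example by stopping early with a clear message when it returns `False`.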



### Load Data from CSV file
The data has unlabeled extra columns, which will cause an error if not accounted for. The _OBJECTID_ is used as the index for this dataset.

In [70]:
cols = pd.read_csv('Collisions.csv', nrows=1).columns
df = pd.read_csv('Collisions.csv', usecols=cols, index_col=2)
df.head()

FileNotFoundError: [Errno 2] File b'/home/dsxuser/Collisions.csv' does not exist: b'/home/dsxuser/Collisions.csv'

### Data Cleaning

An initial review of the dataset indicates that a number of features may be safely eliminated. These features are used for various bookkeeping functions or are textual descriptions of categorical data. Some features, such as _X_, _Y_, and _LOCATION_, are redundant; of these, only _LOCATION_ is kept.

In [None]:
df.drop(inplace=True, columns=['X', 'Y', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'INTKEY', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYDESC', 'INCDATE', 'INCDTTM', 'SDOT_COLDESC', 'SDOTCOLNUM', 'ST_COLCODE', 'ST_COLDESC' ])

There are problems with the dataset. There are numerous missing values that need to be filled in. 

The _ADDRTYPE_, _WEATHER_, _LIGHTCOND_, _ROADCOND_, and _JUNCTIONTYPE_ features all consist of enumerated values. There are a significant number of blank values in these fields. The blank fields were set to _Unknown_ to make the frequency tables easier to generate.

The _INATTENTIONIND_, _UNDERINFL_, _PEDROWNOTGRNT_, _SPEEDING_, and _HITPARKEDCAR_ are binary values representing either _Yes_ or _No_. Blank values were assumed to represent a _No_ value. A _1_ is assumed to be a _Yes_ value while a _0_ is assumed to be a _No_ value. 
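
The rule for the binary indicators can be condensed into one helper. This is a sketch only: `encode_flag` is a hypothetical name, and it assumes the raw values are limited to blanks, `N`, `Y`, `0`, and `1`. Any other value is left unmapped and makes the final integer cast fail, which surfaces unexpected data instead of silently coercing it.

```python
import pandas as pd


def encode_flag(series):
    """Map blank/'N'/'0' to 0 and 'Y'/'1' to 1, returning an int64 Series."""
    mapping = {'N': 0, 'Y': 1, '0': 0, '1': 1}
    # Blank records are treated as "No" before the lookup
    cleaned = series.fillna('0').astype(str).str.strip().str.upper()
    return cleaned.map(mapping).astype('int64')


flags = pd.Series(['Y', None, 'N', '1', '0'])
print(encode_flag(flags).tolist())  # [1, 0, 0, 1, 0]
```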

In [None]:
df['ADDRTYPE'] = df['ADDRTYPE'].fillna('Unknown')
df["ADDRTYPE"] = df["ADDRTYPE"].astype('category')
df["ADDRTYPE_CAT"] = df["ADDRTYPE"].cat.codes
print("\nAddress Type:\n", df['ADDRTYPE'].value_counts())

df["COLLISIONTYPE"] = df["COLLISIONTYPE"].astype('category')
df["COLLISIONTYPE_CAT"] = df["COLLISIONTYPE"].cat.codes
print("\nCollision Type:\n", df['COLLISIONTYPE'].value_counts())

df["LOCATION"] = df["LOCATION"].astype('category')
df["LOCATION_CAT"] = df["LOCATION"].cat.codes
print("\nLocation:\n", df['LOCATION'].value_counts())

df['WEATHER'] = df['WEATHER'].fillna('Unknown')
df["WEATHER"] = df["WEATHER"].astype('category')
df["WEATHER_CAT"] = df["WEATHER"].cat.codes
print("\nWeather:\n", df['WEATHER'].value_counts())

df['LIGHTCOND'] = df['LIGHTCOND'].fillna('Unknown')
df["LIGHTCOND"] = df["LIGHTCOND"].astype('category')
df["LIGHTCOND_CAT"] = df["LIGHTCOND"].cat.codes
print("\nLight Conditions:\n", df['LIGHTCOND'].value_counts())

df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')
df["ROADCOND"] = df["ROADCOND"].astype('category')
df["ROADCOND_CAT"] = df["ROADCOND"].cat.codes
print("\nRoad Conditions:\n", df['ROADCOND'].value_counts())

df['JUNCTIONTYPE'] = df['JUNCTIONTYPE'].fillna('Unknown')
df["JUNCTIONTYPE"] = df["JUNCTIONTYPE"].astype('category')
df["JUNCTIONTYPE_CAT"] = df["JUNCTIONTYPE"].cat.codes
print("\nJunction Type:\n", df['JUNCTIONTYPE'].value_counts())

# treat a blank record as 0, an N as 0 and Y as 1
df['INATTENTIONIND'] = df['INATTENTIONIND'].fillna('0')
df['INATTENTIONIND'] = df['INATTENTIONIND'].replace(['N','Y'],['0','1'])
df["INATTENTIONIND"] = df["INATTENTIONIND"].astype('int64')
print("\nInattention Indicator:\n", df['INATTENTIONIND'].value_counts())

# treat a blank record as 0, an N as 0 and Y as 1
df['UNDERINFL'] = df['UNDERINFL'].fillna('0')
df['UNDERINFL'] = df['UNDERINFL'].replace(['N','Y'],['0','1'])
df["UNDERINFL"] = df["UNDERINFL"].astype('int64')
print("\nUnder Influence:\n", df['UNDERINFL'].value_counts())

# treat a blank record as 0, an N as 0 and Y as 1
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].fillna('0')
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].replace(['N','Y'],['0','1'])
df["PEDROWNOTGRNT"] = df["PEDROWNOTGRNT"].astype('int64')
print("\nPedestrian Not Granted:\n", df['PEDROWNOTGRNT'].value_counts())

# treat a blank record as 0, an N as 0 and Y as 1
df['SPEEDING'] = df['SPEEDING'].fillna('0')
df['SPEEDING'] = df['SPEEDING'].replace(['N','Y'],['0','1'])
df["SPEEDING"] = df["SPEEDING"].astype('int64')
print("\nSpeeding:\n", df['SPEEDING'].value_counts())

# treat a blank record as 0, an N as 0 and Y as 1
df['HITPARKEDCAR'] = df['HITPARKEDCAR'].fillna('0')
df['HITPARKEDCAR'] = df['HITPARKEDCAR'].replace(['N','Y'],['0','1'])
df["HITPARKEDCAR"] = df["HITPARKEDCAR"].astype('int64')
print("\nHit Parked Car:\n", df['HITPARKEDCAR'].value_counts())

print("\nPerson Count:\n", df['PERSONCOUNT'].value_counts())

print("\nVehicle Count:\n", df['VEHCOUNT'].value_counts())

df["SEVERITYCODE"] = df["SEVERITYCODE"].astype('category')
df["SEVERITYCODE_CAT"] = df["SEVERITYCODE"].cat.codes

df['SDOT_COLCODE'] = df['SDOT_COLCODE'].fillna('0')
df["SDOT_COLCODE"] = df["SDOT_COLCODE"].astype('int64')
print("\nSDOT Collision Code:\n", df['SDOT_COLCODE'].value_counts())





print("\nSeverity Code:\n", df['SEVERITYCODE'].value_counts())

#df.shape
df.dtypes
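
The categorical steps in the cell above repeat the same three operations per column. A possible way to factor them out is sketched here; `encode_category` is a hypothetical helper and the small frame is made-up demonstration data, not the collision data set.

```python
import pandas as pd


def encode_category(df, column, fill='Unknown'):
    """Fill blanks, cast to category, add a <column>_CAT code column, return counts."""
    df[column] = df[column].fillna(fill).astype('category')
    df[column + '_CAT'] = df[column].cat.codes
    return df[column].value_counts()


demo = pd.DataFrame({'WEATHER': ['Clear', None, 'Raining', 'Clear']})
counts = encode_category(demo, 'WEATHER')
print(demo['WEATHER_CAT'].tolist())  # [0, 2, 1, 0]
```

Category codes follow the sorted category labels, so the codes change whenever categories are later merged or removed; recomputing the `_CAT` column after such a step keeps the two in sync.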

### Feature Selection

After data cleaning, there were 221,389 samples with 26 features. There are outliers in the data.

The _WEATHER_ feature has several outliers. I changed the _WEATHER_ categories of _Snowing_, _Fog/Smog/Smoke_, _Sleet/Hail/Freezing Rain_, _Blowing Sand/Dirt_, _Severe Crosswind_, _Partly Cloudy_, and _Blowing Snow_ to _Other_. These categories are not major factors in the data set and can be safely combined. Timestamps are not available. If timestamps were available, the _Unknown_ values could be set to reflect the appropriate weather conditions.

The frequency tables are then regenerated to determine if _Other_ and _Unknown_ are significant.

In [None]:
# Assign the result back instead of calling replace(inplace=True) on a column selection
df['WEATHER'] = df['WEATHER'].replace({
    'Snowing': 'Other', 'Fog/Smog/Smoke': 'Other', 'Sleet/Hail/Freezing Rain': 'Other',
    'Blowing Sand/Dirt': 'Other', 'Severe Crosswind': 'Other', 'Partly Cloudy': 'Other',
    'Blowing Snow': 'Other'})
print("\nWeather:\n", df['WEATHER'].value_counts())

The _LIGHTCOND_ feature has several outliers. I changed the _LIGHTCOND_ categories of _Dark - Street Lights On_, _Dark - No Street Lights_, _Dark - Street Lights Off_, and _Dark - Unknown Lighting_ to _Dark_. These categories are not major factors in the data set and can be safely combined. Timestamps are not available. If timestamps were available, the _Unknown_ values could be set to reflect the appropriate light conditions.

The frequency table was then regenerated.

In [None]:
# Assign the result back instead of calling replace(inplace=True) on a column selection
df['LIGHTCOND'] = df['LIGHTCOND'].replace({
    'Dark - Street Lights On': 'Dark', 'Dark - No Street Lights': 'Dark',
    'Dark - Street Lights Off': 'Dark', 'Dark - Unknown Lighting': 'Dark'})
print("\nLight Conditions:\n", df['LIGHTCOND'].value_counts())

The _ROADCOND_ feature has several outliers. I changed the _ROADCOND_ categories of _Ice_, _Snow/Slush_, _Standing Water_, _Sand/Mud/Dirt_, and _Oil_ to _Other_. These categories are not major factors in the data set and can be safely combined.

The frequency table was then regenerated to determine if _Other_ is now significant.

In [None]:
# Assign the result back instead of calling replace(inplace=True) on a column selection
df['ROADCOND'] = df['ROADCOND'].replace({
    'Ice': 'Other', 'Snow/Slush': 'Other', 'Standing Water': 'Other',
    'Sand/Mud/Dirt': 'Other', 'Oil': 'Other'})
print("\nRoad Conditions:\n", df['ROADCOND'].value_counts())
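
The three cells above merge categories that were picked by name. A frequency-based variant is sketched below as an alternative, not as the method used in this notebook: `collapse_rare` and its `min_share` threshold are hypothetical, and the sample values are made up.

```python
import pandas as pd


def collapse_rare(series, min_share=0.01, other='Other'):
    """Replace categories whose relative frequency is below min_share with `other`."""
    values = series.astype(object)
    share = values.value_counts(normalize=True)
    rare = share[share < min_share].index
    # Keep frequent values, overwrite the rare ones
    return values.where(~values.isin(rare), other)


road = pd.Series(['Dry'] * 6 + ['Wet'] * 3 + ['Oil'])
print(collapse_rare(road, min_share=0.2).value_counts().to_dict())
# {'Dry': 6, 'Wet': 3, 'Other': 1}
```

A threshold makes the merge reproducible when the weekly data refresh changes the category counts, at the cost of occasionally merging a category that matters for severity.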

A heatmap is generated to examine the correlation of the independent variables.

In [None]:
plt.figure(figsize=(20, 20))
# Correlate numeric columns only; text and category columns are excluded explicitly
cor = df.select_dtypes(include='number').corr()
sns.heatmap(cor, annot=True, fmt=".2f", cmap='coolwarm', linewidths=3, linecolor='black')
plt.show()

# Correlation with the output variable
cor_target = abs(cor["SEVERITYCODE_CAT"])
# Select the highly correlated features
relevant_features = cor_target[cor_target > 0.25]
relevant_features

In [None]:
# LassoCV is used below but was not imported with the other libraries
from sklearn.linear_model import LassoCV

# Backward elimination: repeatedly drop the feature with the largest p-value above 0.05
X = df.drop(columns=['SEVERITYCODE_CAT', 'SEVERITYCODE', 'ADDRTYPE', 'COLLISIONTYPE', 'LOCATION', 'WEATHER', 'LIGHTCOND', 'ROADCOND', 'JUNCTIONTYPE'])  # feature matrix
y = df['SEVERITYCODE_CAT']
cols = list(X.columns)

while len(cols) > 0:
    X_1 = sm.add_constant(X[cols])
    model = sm.OLS(y, X_1).fit()
    # Skip the first p-value, which belongs to the added constant
    p = pd.Series(model.pvalues.values[1:], index=cols)
    pmax = p.max()
    feature_with_p_max = p.idxmax()
    if pmax > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print(selected_features_BE)

# Embedded selection: Lasso shrinks the coefficients of uninformative features to zero
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" % reg.score(X, y))
coef = pd.Series(reg.coef_, index=X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")

In [None]:
print("Rows before cleaning: ", df.shape)
df = df[~df['WEATHER'].isin(['Unknown'])]
df = df[~df['WEATHER'].isin(['Other'])]
df = df[~df['LIGHTCOND'].isin(['Unknown'])]
df = df[~df['LIGHTCOND'].isin(['Other'])]
df = df[~df['ROADCOND'].isin(['Unknown'])]
df = df[~df['ROADCOND'].isin(['Other'])]
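# SEVERITYDESC was dropped during cleaning, so filter on SEVERITYCODE instead
# (assumption: code 0 marks an unknown severity)
df = df[~df['SEVERITYCODE'].astype(str).isin(['0'])]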
print ("Rows after cleaning: ", df.shape)

### Convert Categorical Features to Numeric Values

The feature set for the model consists of the following:
* PERSONCOUNT
* VEHCOUNT
* WEATHER
* ROADCOND
* LIGHTCOND

One Hot Encoding will be used for the latter three features.

In [None]:
Feature = df[['PERSONCOUNT','VEHCOUNT']]
# do One Hot Encoding
Feature = pd.concat([Feature,pd.get_dummies(df['WEATHER'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['ROADCOND'])], axis=1)
Feature = pd.concat([Feature,pd.get_dummies(df['LIGHTCOND'])], axis=1)
Feature.head()
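
One detail worth checking in the encoded frame is the column names. Labels such as _Unknown_ or _Other_ can occur in more than one of the three features, and `pd.get_dummies` names each column after the label alone unless a prefix is given. The sketch below uses made-up rows to show the `prefix` argument; it is an illustration, not a change to the cell above.

```python
import pandas as pd

demo = pd.DataFrame({
    'PERSONCOUNT': [2, 3, 1],
    'ROADCOND': ['Dry', 'Wet', 'Dry'],
})

# prefix= ties every dummy column back to the feature it came from
encoded = pd.concat(
    [demo[['PERSONCOUNT']], pd.get_dummies(demo['ROADCOND'], prefix='ROADCOND')],
    axis=1,
)
print(encoded.columns.tolist())  # ['PERSONCOUNT', 'ROADCOND_Dry', 'ROADCOND_Wet']
```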

The desired prediction value is the collision severity. Because the textual _SEVERITYDESC_ column was dropped during cleaning, the encoded _SEVERITYCODE_CAT_ column is used as the label. The feature dataset will be run through the StandardScaler to normalize all of the data. The dataset will then be split into a training set for the models and a test set to evaluate how good the models are at prediction.

In [None]:
X = Feature
# SEVERITYDESC no longer exists in df; use the integer severity codes as the target
y = df['SEVERITYCODE_CAT'].values

X = preprocessing.StandardScaler().fit(X).transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
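
The cell above fits the scaler on the full feature set before splitting, so the test rows contribute to the mean and variance used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `_demo` names and the random data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Fit on the training rows only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

With a data set of this size the difference in scores is likely small, but the split-first ordering keeps the test set untouched until evaluation.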

## Methodology <a name="methodology"></a>
In this project we will focus on predicting accident severity based on the feature set defined above.

## Analysis <a name="analysis"></a>
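Four models will be developed: K Nearest Neighbor, Decision Tree, Support Vector Machine, and Logistic Regression. The accuracy, Jaccard index, and F1-score are computed for each model on the test set; log loss is also reported for Logistic Regression.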
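
The scores used in this section can be worked out by hand from true-positive (TP), false-positive (FP), and false-negative (FN) counts. The sketch below is a plain-Python illustration on made-up labels, not the scikit-learn implementation: per class, the Jaccard index is TP / (TP + FP + FN) and the F1-score is 2·TP / (2·TP + FP + FN).

```python
def per_class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def jaccard(y_true, y_pred, label):
    tp, fp, fn = per_class_counts(y_true, y_pred, label)
    return tp / (tp + fp + fn)


def f1(y_true, y_pred, label):
    tp, fp, fn = per_class_counts(y_true, y_pred, label)
    return 2 * tp / (2 * tp + fp + fn)


y_true_demo = [1, 1, 2, 2, 1]
y_pred_demo = [1, 2, 2, 2, 1]
print(accuracy(y_true_demo, y_pred_demo))    # 0.8
print(jaccard(y_true_demo, y_pred_demo, 1))  # 2 of 3 relevant samples, about 0.667
print(f1(y_true_demo, y_pred_demo, 1))       # 0.8
```

The per-class functions divide by zero when a label never appears in either list, so a real implementation would need a guard for that case.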

### K Nearest Neighbor (KNN)

In [None]:
Ks = 10
mean_accKNN = np.zeros((Ks-1))
mean_jacKNN = np.zeros((Ks-1))
mean_F1KNN  = np.zeros((Ks-1))
for n in range(1,Ks):
    
    #Train Model and Predict
    print ("Ks = ", n)
    neigh = KNeighborsClassifier(n_neighbors=n, n_jobs=1, weights='distance').fit(X_train,y_train)
    yhat=neigh.predict(X_test)

    mean_accKNN[n-1] = metrics.accuracy_score(y_test, yhat)
    mean_jacKNN[n-1] = jaccard_similarity_score(y_test, yhat)
    mean_F1KNN[n-1]  = f1_score(y_test, yhat, average='weighted')
    
print ("KNN Accuracy table: ", mean_accKNN)
print( "The best accuracy is", mean_accKNN.max(), "with k=", mean_accKNN.argmax()+1) 

print ("KNN Jaccard index table: ", mean_jacKNN)
print( "The best Jaccard index is", mean_jacKNN.max(), "with k=", mean_jacKNN.argmax()+1) 

print ("KNN F1-score table: ", mean_F1KNN)
print( "The best F1-score is", mean_F1KNN.max(), "with k=", mean_F1KNN.argmax()+1) 
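
A note on versions: `jaccard_similarity_score` was deprecated in scikit-learn 0.21 and removed in 0.23, and for multiclass labels it returned the same value as `accuracy_score`. On newer installations, `jaccard_score` is the replacement. The sketch below uses made-up labels; `average='weighted'` is one reasonable choice for multiclass targets and is an assumption here, not a setting taken from this notebook.

```python
from sklearn.metrics import accuracy_score, jaccard_score

y_true_demo = [1, 1, 2, 2, 1]
y_pred_demo = [1, 2, 2, 2, 1]

# Accuracy: share of exact matches
acc = accuracy_score(y_true_demo, y_pred_demo)
# Jaccard index per class, averaged with class-frequency weights
jac = jaccard_score(y_true_demo, y_pred_demo, average='weighted')
print(acc, jac)
```

Because the two functions measure different things, Jaccard values from `jaccard_score` are not directly comparable with values produced by the removed function.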

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 10)
DT_model.fit(X_train,y_train)
DT_model

Ks = 10
mean_accDT = np.zeros((Ks-1))
mean_jacDT = np.zeros((Ks-1))
mean_F1DT  = np.zeros((Ks-1))

for n in range(1,Ks):
    
    # Train a tree limited to depth n and predict on the test set
    print ("max_depth = ", n)
    DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = n)
    DT_model.fit(X_train,y_train)
    yhat = DT_model.predict(X_test)

    mean_accDT[n-1] = metrics.accuracy_score(y_test, yhat)
    mean_jacDT[n-1] = jaccard_similarity_score(y_test, yhat)
    mean_F1DT[n-1]  = f1_score(y_test, yhat, average='weighted')
    
print ("Decision Tree Accuracy table: ", mean_accDT)
print( "The best accuracy is", mean_accDT.max(), "with max_depth=", mean_accDT.argmax()+1) 

print ("Decision Tree Jaccard index table: ", mean_jacDT)
print( "The best Jaccard index is", mean_jacDT.max(), "with max_depth=", mean_jacDT.argmax()+1) 

print ("Decision Tree F1-score table: ", mean_F1DT)
print( "The best F1-score is", mean_F1DT.max(), "with max_depth=", mean_F1DT.argmax()+1) 

### Support Vector Machine
Use the LinearSVC because of the large number of samples.

In [None]:
from sklearn import svm
SVM_model = svm.LinearSVC()
SVM_model.fit(X_train, y_train) 

In [None]:
yhat = SVM_model.predict(X_test)
print ("The SVM model accuracy is: ", metrics.accuracy_score(y_test, yhat))
print("SVM Jaccard index: %.2f" % jaccard_similarity_score(y_test, yhat))
print("SVM F1-score: %.2f" % f1_score(y_test, yhat, average='weighted') )


### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression(C=0.01).fit(X_train,y_train)
LR_model

In [None]:
yhat = LR_model.predict(X_test)
print ("The LR model accuracy is: ", metrics.accuracy_score(y_test, yhat))
yhat_prob = LR_model.predict_proba(X_test)
print("LR Jaccard index: %.2f" % jaccard_similarity_score(y_test, yhat))
print("LR F1-score: %.2f" % f1_score(y_test, yhat, average='weighted') )
print("LR LogLoss: %.2f" % log_loss(y_test, yhat_prob))
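
Log loss is the one score here that uses the predicted probabilities rather than the predicted labels: it is the negative mean of the log-probability assigned to the true class. A plain-Python sketch on made-up probabilities follows; `log_loss_manual` is a hypothetical helper for illustration, and the clipping constant `eps` is an assumption.

```python
import math


def log_loss_manual(y_true, probs, labels, eps=1e-15):
    """Negative mean log-probability of the true class.

    probs[i][j] is the predicted probability that sample i has class labels[j].
    """
    total = 0.0
    for y, row in zip(y_true, probs):
        p = row[labels.index(y)]
        # Clip to avoid log(0) on overconfident wrong predictions
        total += math.log(min(max(p, eps), 1 - eps))
    return -total / len(y_true)


labels_demo = [1, 2]
probs_demo = [[0.9, 0.1], [0.2, 0.8]]
loss = log_loss_manual([1, 2], probs_demo, labels_demo)
print(round(loss, 4))  # 0.1643
```

Lower is better: a model that assigns probability 1 to every true class scores 0, while confident wrong predictions are penalized heavily.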

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>