# Notebook & Github Setup

In [3]:
# this notebook will be mainly used for the Coursera capstone project

import pandas as pd
import numpy as np

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


# Predicting Seattle Car Accident Severity
### by: Christina Gilligan

## Business Problem & Understanding

The objective of this project is to understand:

**IF** there is a way to predict the possibility of getting a car accident and how severe the accident would be based on factors such as road conditions, weather, and visibility.  
**SO THAT WE** we can use this information to warn and inform drivers so that they can drive more carefully or adjust their travel plans if necessary

To do so we will be utilizing the dataset 'Data-Collisions.csv' which details all collisions provided by the Seattle Police Department (SPD) and recorded by Traffic Records from 2004-present. From this dataset we will train and implement different machine learning models to help:
1. Predict how the above driving conditions affect the severity of a collision and
2. Create a 'Collision Rating Scale' to assess the probability of an accident occurring with 1 being the lowest probability of an accident occurring and 5 being the highest
    

## Data Understanding & Preparation

### Data Understanding
This project will be utilizing the shared data set 'Data-Collisions.csv' provided to us in Week 1 of the capstone course. This dataset details all collisions provided by the Seattle Police Department (SPD) and recorded by Traffic Records from 2004-present. The dataset contains 37 different collision attributes which are not all relevant to our modeling purposes. In order to reduce computational costs and prepare the dataset for the models we will drop these attributes that contain irrelevant information. The specific columns/attributes we will be using for this model to predict the accident severity SEVERITYCODE (y/dependent variable) are WEATHER (x1, independent variable), ROADCOND (x2, independent variable), and LIGHTCOND (x3, independent variable) which detail the weather conditions, road conditions (wet, dry, etc.) and light conditions (daylight, nighttime, etc.) that were present during the collision.

**Predictor/Target Variable:** 'SEVERITYCODE'

    1: Property damage
    2: Property damage and injury
    
**Independent Variables:** 'WEATHER', 'ROADCOND', 'LIGHTCOND'

### Data Preparation
The native version of the dataset is crowded and not fit for analysis. There are too many columns that we will not need for our models so we will need to drop those and only keep the ones discussed above. Also, the dataset is unbalanced so we will need to transform the target variable (SEVERITYCODE) to balance the data before we can model.

In [1]:
# Download the dataset
!wget https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

--2020-09-29 15:43:38--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘Data-Collisions.csv’


2020-09-29 15:43:40 (48.1 MB/s) - ‘Data-Collisions.csv’ saved [73917638/73917638]



In [10]:
# Drop irrelevant attributes/columns and set new relevant categories
df=pd.read_csv("Data-Collisions.csv")

dasta = df.drop(columns = ['OBJECTID', 'SEVERITYCODE.1', 'REPORTNO', 'INCKEY', 'COLDETKEY', 
                           'X', 'Y', 'STATUS', 'ADDRTYPE', 
                           'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 
                           'EXCEPTRSNDESC', 'SEVERITYDESC', 'INCDATE', 
                           'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 
                           'SDOT_COLDESC', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 
                           'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 
                           'CROSSWALKKEY', 'HITPARKEDCAR', 'PEDCOUNT', 'PEDCYLCOUNT', 
                           'PERSONCOUNT', 'VEHCOUNT', 'COLLISIONTYPE', 
                           'SPEEDING', 'UNDERINFL', 'INATTENTIONIND'])

dasta["WEATHER"] = dasta["WEATHER"].astype('category')
dasta["ROADCOND"] = dasta["ROADCOND"].astype('category')
dasta["LIGHTCOND"] = dasta["LIGHTCOND"].astype('category')

dasta["WEATHER_CAT"] = dasta["WEATHER"].cat.codes
dasta["ROADCOND_CAT"] = dasta["ROADCOND"].cat.codes
dasta["LIGHTCOND_CAT"] = dasta["LIGHTCOND"].cat.codes

# Preview new dataset
dasta.head(5)

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,WEATHER_CAT,ROADCOND_CAT,LIGHTCOND_CAT
0,2,Overcast,Wet,Daylight,4,8,5
1,1,Raining,Wet,Dark - Street Lights On,6,8,2
2,1,Overcast,Dry,Daylight,4,0,5
3,1,Clear,Dry,Daylight,1,0,5
4,2,Raining,Wet,Daylight,6,8,5


In [14]:
# Transform the target variable (SEVERITYCODE) to balance the data before we model

from sklearn.utils import resample
dasta_more = dasta[dasta.SEVERITYCODE==1]
dasta_less = dasta[dasta.SEVERITYCODE==2]
dasta_more_equal = resample(dasta_more,
                            replace=False,
                            n_samples=58188,
                            random_state=99)
dasta_bal = pd.concat([dasta_more_equal, dasta_less])
dasta_bal.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64

## Data Modeling & Methodology

Now that we have cleaned up and transformed our data it's time to train the models. We will utilize 3 different models to determine which is the best fit for the problem. The models we will be using are logistic regression, K nearest neighbor, and decision tree.

In [20]:
from sklearn import preprocessing

In [21]:
# Data preprocessing (define X and y, normalize the dataset)
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
X = np.asarray(dasta_bal[['WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']])
X[0:5]
y = np.asarray(dasta_bal['SEVERITYCODE'])
y [0:5]



array([1, 1, 1, 1, 1])

In [57]:
# I'm using an 80% train and 20% test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (93100, 3) (93100,)
Test set: (23276, 3) (23276,)


### Logistic Regression Model

The dataset only gives us 2 outcomes for SEVERITYCODE (1,2) which makes it binary which fits wonderfully with Logistic Regression.

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [66]:
logReg = LogisticRegression(C=6, solver='liblinear').fit(X_train,y_train)
logReg

LogisticRegression(C=6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [67]:
logPred = logReg.predict(X_test)
logPred

array([2, 2, 2, ..., 1, 2, 2])

In [68]:
logPredodd = logReg.predict_proba(X_test)
logPredodd

array([[0.47085214, 0.52914786],
       [0.47085214, 0.52914786],
       [0.47085214, 0.52914786],
       ...,
       [0.53547329, 0.46452671],
       [0.47085214, 0.52914786],
       [0.47085214, 0.52914786]])

### K Nearest Neighbor Model

KNN helps predict the SEVERITYCODE of an outcome by finding similar data points within k distance.

In [32]:
from sklearn.neighbors import KNeighborsClassifier

In [33]:
ks = 15
hood = KNeighborsClassifier(n_neighbors = ks).fit(X_train,y_train)
hood
hoodPred = hood.predict(X_test)
hoodPred[0:5]

array([2, 2, 2, 2, 2])

### Decision Tree Model

The Decision Tree will show us all possible outcomes so we can analyze all the consequences of a decision. 

In [34]:
from sklearn.tree import DecisionTreeClassifier

In [58]:
treedat = DecisionTreeClassifier(criterion="entropy", max_depth = 5)
treedat
treedat.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [65]:
DTyhat = treedat.predict(X_test)
print (treePred[0:5])
print (y_test[0:5])

[2 2 2 2 2]
[2 1 1 1 2]


## Model Evaluation & Results
Before proceeding to deployment, we will evaluate the three models we trained above to determine which model most accurately meets our business objective. To do so, we will utilize 3 different evaluation metrics: F1 Score, Jaccard Similarity Index and Log Loss (Logistic Regression Model only).

In [39]:
# Import the measurements from sklearn
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import log_loss

### Logistic Regression Model Evaluation

In [43]:
# F1 Score Evaluation
lrf1 = f1_score(y_test, logPred, average='macro')
print('The F1 Score is', lrf1)

The F1 Score is 0.5149615460549232


In [42]:
# Jaccard Similarity Index Evaluation
lrjac=jaccard_similarity_score(y_test, logPred)
print('The Jaccard Score is', lrjac)

The Jaccard Score is 0.5293864925244888


In [45]:
# Log Loss Evaluation
lrlogloss=log_loss(y_test, logPredodd)
print('The Log Loss Score is', lrlogloss)

The Log Loss Score is 0.6839726419666594


### K Nearest Neighbor Model Evaluation

In [53]:
# F1 Score Evaluation
knnf1=f1_score(y_test, hoodPred, average='macro')
print('The F1 Score is', knnf1)

The F1 Score is 0.5431647126727994


In [52]:
# Jaccard Similarity Index Evaluation
knnjac=jaccard_similarity_score(y_test, hoodPred)
print('The Jaccard Score is', knnjac)

The Jaccard Score is 0.559245574841038


### Decision Tree Model Evaluation

In [55]:
# F1 Score Evaluation
treef1=f1_score(y_test, treePred, average='macro')
print('The F1 Score is', treef1)

The F1 Score is 0.5334047084927266


In [56]:
# Jaccard Similarity Index Evaluation
treejac=jaccard_similarity_score(y_test, treePred)
print('The Jaccard Score is', treejac)

The Jaccard Score is 0.5598470527582059


## Discussion

After training 3 different machine learning model types (Logistic Regression, KNN and Decision Tree) and then performing the appropriate model evaluations for each of the models it can be determined that the Logistic Regression Model performs with the highest accuracy and therefor is the optimal model for the task. This is likely due to the data's binary nature (severity codes 1 and 2).

## Conclusion

Based on the historical collision data collected by the Seattle Police Department that includes attributes such as weather conditions, road conditions, and light conditions we can reasonably conclude that particular combinations of these factors somewhat have an impac on whether or not an individual's travel could results in property damage and or bodily injury. 