# Capstone Project - Predicting Car Accident Severity

### Applied Data Science Capstone by IBM/Coursera

## Table of contents
- [Introduction: Business Problem](#business_problem)
- [Data](#data)
- [Methodology](#methodology)
- [Analysis](#analysis)
- [Results and Discussion](#results)
- [Conclusion](#conclusion)

<a id='business_problem'></a>
## Business Problem 

Seattle, a city on Puget Sound in the Pacific Northwest, is surrounded by water, mountains and evergreen forests, and contains thousands of acres of parkland. Washington State’s largest city, it’s home to a large tech industry, with Microsoft and Amazon headquartered in its metropolitan area. With that comes heavy traffic, and road accidents are common.

Controlling accidents is a challenge for the city government. Records of past accidents are available, and the task is to make better use of them. Many accidents in the city are caused by driver negligence, but others involve factors outside the driver's control, such as light, weather and road conditions. Accidents linked to these factors can be reduced by learning from the historical data. For example, drivers could be sent an alert estimating the chance of an accident based on the factors mentioned above.

The target audience of this project is the Seattle city government, local police and rescue teams, and car financing corporations, all of whom stand to benefit from such a system.

We will use data science techniques to build a working solution.

<a id="data"></a>
## Data
The data set is large. It was collected by the Seattle Police Department and the Accident Traffic Records Department and covers 2004 to the present.
It consists of 37 independent variables and 194,673 rows.

Given the definition of our problem, the factors that will influence our decision are:
- Road condition
- Weather condition
- Light condition

#### Importing all required packages

In [150]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
from sklearn import preprocessing
from matplotlib.ticker import NullFormatter
%matplotlib inline

#### Extracting Data

In [151]:
pre_df = pd.read_csv("https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv")
pre_df.head()
pre_df.columns

#### Removing unnecessary data

In [None]:
pre_df = pre_df[['SEVERITYCODE', 'ROADCOND', 'WEATHER', 'LIGHTCOND']]
pre_df.head()

pre_df = pre_df.dropna()
pre_df.head()

#### Data processing

Before the data can be analysed, it has to be put into a usable form: we first keep only the relevant columns and then drop the rows that contain null values.

After studying the data, and in line with the requirements above, I decided to use three independent variables (light condition, weather condition and road condition) with the severity code as the target variable.
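Before dropping rows, it can help to see how many values are missing in each selected column. The following is a minimal sketch on a made-up frame (the name `toy` and its values are illustrative assumptions, not rows from the collisions data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions table; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry", "Wet"],
    "WEATHER": ["Clear", "Raining", "Clear", np.nan, "Overcast"],
    "LIGHTCOND": ["Daylight", "Dark", "Daylight", "Daylight", np.nan],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# dropna() keeps only the rows with no missing value in any column.
clean = toy.dropna()
rows_dropped = len(toy) - len(clean)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

Comparing the row count before and after `dropna()` shows how much data the cleaning step costs.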

Our target variable, "SEVERITYCODE", contains numbers from 0 to 4 that correspond to different levels of accident severity.

Severity codes are as follows:

0. Little to no Probability (Clear Conditions)
1. Very Low Probability — Chance of Property Damage
2. Low Probability — Chance of Injury
3. Mild Probability — Chance of Serious Injury
4. High Probability — Chance of Fatality

Let us look at the counts of the severity codes in the data set.

In [None]:
pre_df["SEVERITYCODE"].value_counts().plot(kind='bar')

We can see that class 1 has far more rows than class 2. We can downsample class 1 so that both classes are equally represented.

In [None]:
from sklearn.utils import resample
pre_df_major = pre_df[pre_df.SEVERITYCODE==1]
pre_df_minor = pre_df[pre_df.SEVERITYCODE==2]

In [None]:
maj_sample = resample(pre_df_major,replace=False,n_samples=58188,random_state=13)

bal_df = pd.concat([maj_sample,pre_df_minor])

bal_df["SEVERITYCODE"].value_counts().plot(kind='bar')

bal_df.head()

Now we are ready for the next step.
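The sample size in the cell above is hardcoded. As a hedged sketch on a made-up frame (the `toy` data and variable names are assumptions for illustration), the same downsampling can take `n_samples` from the size of the minority class, so the two classes stay equal even if the row counts change after cleaning:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with an imbalanced target: six rows of class 1, two of class 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast",
                "Clear", "Raining", "Clear", "Raining"],
})

majority = toy[toy.SEVERITYCODE == 1]
minority = toy[toy.SEVERITYCODE == 2]

# Sample the majority class without replacement, down to the minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=13)

balanced = pd.concat([majority_down, minority])
counts = balanced["SEVERITYCODE"].value_counts()
print(counts)
```

If the balanced classes are meant to drive the modelling, the balanced frame is the one to pass on to the encoding step.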

<a id='methodology'></a>
## Methodology

As mentioned earlier, <b>WEATHER</b>, <b>ROADCOND</b> and <b>LIGHTCOND</b> are the factors we use as predictors.
The prepared data will be used to predict the target variable <b>SEVERITYCODE</b>.

In [None]:
pre_df["WEATHER"].value_counts().plot(kind='bar')

In [None]:
pre_df["ROADCOND"].value_counts().plot(kind='bar')

In [None]:
pre_df["LIGHTCOND"].value_counts().plot(kind='bar')

In [None]:
Feature = pre_df[['SEVERITYCODE']]
df = pd.concat([Feature, pd.get_dummies(pre_df[['WEATHER','ROADCOND','LIGHTCOND']])],axis=1)
df.drop(['SEVERITYCODE'],axis=1,inplace=True)
Feature.head()

In [None]:
X=df.values
X[0:5]

In [None]:
y=Feature['SEVERITYCODE'].values
y[0:5]

In [None]:
print("Feature shape:", df.shape)
print("X shape:",X.shape)
print ("y shape:", y.shape)

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

The data is now ready for modelling.
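To make the encoding and scaling steps concrete, here is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions). It shows that `pd.get_dummies` creates one indicator column per category and that `StandardScaler` rescales each column to zero mean and unit variance:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy categorical features; the values are made up for illustration.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Dry", "Wet"],
})

# One indicator column per distinct category, named <column>_<value>.
encoded = pd.get_dummies(toy)

# Cast to float so the result does not depend on the indicator dtype
# (bool or uint8, depending on the pandas version).
X_toy = encoded.values.astype(float)

# Standardise every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_toy)

print(list(encoded.columns))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```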

<a id='analysis'></a>
## Analysis

### Splitting data set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('train set:', X_train.shape, y_train.shape)
print('test set:', X_test.shape, y_test.shape)

### Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing

In [None]:
#Modelling phase
Accident_Severity_Model=DecisionTreeClassifier(criterion='entropy', max_depth=5)
Accident_Severity_Model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [None]:
#Predicting phase
yhat_d=Accident_Severity_Model.predict(X_test)
print(yhat_d [0:5])
print(y_test [0:5])

[1 1 1 1 1]
[1 1 1 1 2]


In [None]:
#Accuracy of the model using sklearn
from sklearn import metrics
print("Decision Tree Accuracy:", metrics.accuracy_score(y_test, yhat_d))

Decision Tree Accuracy: 0.6961550649625013
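The first five predictions above are all class 1, so accuracy alone may hide how the minority class is handled. As a hedged sketch on made-up labels (not the notebook's test set), a confusion matrix makes this visible for a model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: seven accidents of class 1 and three of class 2.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2])

# A model that always predicts the majority class.
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, ordered as in `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])

print(acc)
print(cm)
```

Here the accuracy equals the share of class 1 in the labels, while the second row of the matrix shows that no class 2 case is ever identified.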


### KNN Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 25

In [None]:
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train,y_train)
neigh

In [None]:
yhat_k = neigh.predict(X_test)
yhat_k[0:5]
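The value k = 25 is fixed above. One way to check such a choice is to compare a few candidate values. The sketch below does this on synthetic data from `make_classification` (the data, the candidate values and the variable names are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)

# Fit one model per candidate k and record its accuracy on the held-out split.
accuracy_by_k = {}
for k_candidate in (5, 15, 25):
    model = KNeighborsClassifier(n_neighbors=k_candidate).fit(X_tr, y_tr)
    accuracy_by_k[k_candidate] = accuracy_score(y_te, model.predict(X_te))

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

Choosing k on the same split that is later used for the final scores gives an optimistic estimate, so a separate validation split or cross-validation would be preferable in practice.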

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=6,solver='liblinear').fit(X_train,y_train)
LR

In [None]:
yhat_lr = LR.predict(X_test)
yhat_lr[0:5]

In [None]:
yhat_proba = LR.predict_proba(X_test)
yhat_proba
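The columns returned by `predict_proba` follow the order of the fitted model's `classes_` attribute. A minimal sketch on made-up one-dimensional data (the arrays and the name `clf` are illustrative assumptions) shows how the probabilities map back to class labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: small feature values belong to class 1, large ones to class 2.
X_toy = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

clf = LogisticRegression(C=6, solver='liblinear').fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Column j of `proba` holds the probability of class clf.classes_[j];
# picking the most probable column recovers the predicted label.
predicted_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(clf.classes_)
print(proba.round(3))
```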

<a id='results'></a>
## Results and Discussion

In [None]:
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
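Note that `jaccard_similarity_score` was deprecated in scikit-learn 0.21 and removed in 0.23; for binary and multiclass targets it returned the same value as `accuracy_score`. On newer versions, `jaccard_score` computes the Jaccard index itself. The following sketch on made-up labels (not the notebook's predictions) shows the three metrics side by side:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Made-up true and predicted labels for a two-class problem.
y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 2, 2, 2, 1])

# Jaccard index for class 1: |true 1 and predicted 1| / |true 1 or predicted 1|.
jac_class1 = jaccard_score(y_true, y_pred, pos_label=1)

# Unweighted mean of the per-class Jaccard indices.
jac_macro = jaccard_score(y_true, y_pred, average='macro')

# Unweighted mean of the per-class F1 scores.
f1_macro = f1_score(y_true, y_pred, average='macro')

acc = accuracy_score(y_true, y_pred)
print(jac_class1, jac_macro, f1_macro, acc)
```

The Jaccard index and accuracy generally differ, so scores computed with the old and the new function are not directly comparable.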

### Decision Tree

In [None]:
js_d = jaccard_similarity_score(y_test,yhat_d)
fs_d = f1_score(y_test,yhat_d,average='macro')
print(js_d,fs_d)

### K-Nearest Neighbors

In [None]:
js_knn = jaccard_similarity_score(y_test,yhat_k)
fs_knn = f1_score(y_test,yhat_k,average='macro')
print(js_knn,fs_knn)

### Logistic Regression

In [None]:
js_lr = jaccard_similarity_score(y_test,yhat_lr)
fs_lr = f1_score(y_test,yhat_lr,average='macro')
print(js_lr,fs_lr)

In [None]:
list_js = [js_d,js_knn,js_lr]
list_fs = [fs_d,fs_knn,fs_lr]
list_index = ["Decision Tree","KNN","Logistic Regression"]

In [None]:
data = {'Jaccard Score':[js_d,js_knn,js_lr],'F1 Score':[fs_d,fs_knn,fs_lr]}
res = pd.DataFrame(data, index=list_index)
res


### Discussion
At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that can be fed to an algorithm, so one-hot encoding with pd.get_dummies was used to create a numeric indicator column for each category.

After solving that issue we were presented with another - imbalanced data. As mentioned earlier, class 1 was nearly three times larger than class 2. The solution to this was downsampling the majority class with sklearn's resample tool. We downsampled to match the minority class exactly with 58188 values each.

Once we had analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are well suited to this project, logistic regression made the most sense because the target is binary.

<a id='conclusion'></a>
## Conclusion

We confirmed the impact of the factors <b>WEATHER</b>, <b>ROADCOND</b> and <b>LIGHTCOND</b> on accidents and built a model that predicts <b>SEVERITYCODE</b>, which indicates the severity of an accident. This model can be used to estimate how severe an accident is likely to be, and therefore the extent of the damage, under given weather, road and light conditions.