
# Applied Data Science Capstone
## IBM Data Science Professional Certificate 

### Author: Alexandre Rosseto Lemos

## Introduction
Car accidents are a persistent problem in any major city. Because of their heavy traffic, large cities see a great number of car crashes throughout the year.
These accidents destroy public property at significant cost, injure people and, in the worst cases, kill them.

Being able to predict the severity of an accident from information available before anyone reaches the crash site can improve how hospitals and police departments react to the situation.
Factors such as weather, the vehicles involved and light conditions can indicate the probable severity of the crash, so hospitals can be better prepared to receive the injured.
Police and traffic departments can use the same data to reinforce caution in traffic and reduce both the number of accidents and their severity.
This approach can help save money spent on damaged public property and, most importantly, it can help save human lives.

## Business Problem
The objective of this project is to predict the severity of an accident based on information that can be obtained before any police officer or ambulance reaches the crash site.
With this prediction, the people responsible for dealing with the situation can be better prepared to do so.

## Data

The dataset used for this project is the dataset "Collisions—All Years", provided by the Seattle Department of Transportation - Traffic Management Division.
This dataset contains records of all types of collisions that happened since 2004 in the city of Seattle.

This dataset contains 37 attributes and further information about it can be found at https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf

## Data Understanding

### Acquiring the data and selecting the attributes that are going to be used in the analysis

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

# Downloading the dataset
df_init = pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')



In [2]:
# Information of the full dataset
df_init.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  

The dataset contains many attributes that add no real value to the classification. IDs and other attributes used only to identify the accident can be discarded from the analysis.

The attributes that will be used in the analysis are:


*   COLLISIONTYPE - Type of the collision
*   PERSONCOUNT - The total number of people involved in the collision
*   VEHCOUNT - The total number of vehicles involved in the collision
*   JUNCTIONTYPE - Category of junction at which collision took place
*   WEATHER - A description of the weather conditions during the time of the collision
*   ROADCOND - The condition of the road during the collision
*   LIGHTCOND - The light conditions during the collision
*   SPEEDING - Whether or not speeding was a factor in the collision

The label (target) is: 
* SEVERITYCODE - A code that corresponds to the severity of the collision



In [3]:
# Creating a dataframe with the attributes that are going to be used
# (.copy() makes df independent of df_init, which avoids SettingWithCopyWarning on later edits)
df = df_init[['COLLISIONTYPE', 'PERSONCOUNT', 'VEHCOUNT', 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'SPEEDING','SEVERITYCODE']].copy()
df.head()

| | COLLISIONTYPE | PERSONCOUNT | VEHCOUNT | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | SPEEDING | SEVERITYCODE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Angles | 2 | 2 | At Intersection (intersection related) | Overcast | Wet | Daylight | NaN | 2 |
| 1 | Sideswipe | 2 | 2 | Mid-Block (not related to intersection) | Raining | Wet | Dark - Street Lights On | NaN | 1 |
| 2 | Parked Car | 4 | 3 | Mid-Block (not related to intersection) | Overcast | Dry | Daylight | NaN | 1 |
| 3 | Other | 3 | 3 | Mid-Block (not related to intersection) | Clear | Dry | Daylight | NaN | 1 |
| 4 | Angles | 2 | 2 | At Intersection (intersection related) | Raining | Wet | Daylight | NaN | 2 |


In [4]:
# Printing the information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   COLLISIONTYPE  189769 non-null  object
 1   PERSONCOUNT    194673 non-null  int64 
 2   VEHCOUNT       194673 non-null  int64 
 3   JUNCTIONTYPE   188344 non-null  object
 4   WEATHER        189592 non-null  object
 5   ROADCOND       189661 non-null  object
 6   LIGHTCOND      189503 non-null  object
 7   SPEEDING       9333 non-null    object
 8   SEVERITYCODE   194673 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 13.4+ MB


## Data Preparation

In this section, the dataset is transformed to prepare the data for modeling.

In [5]:
# Checking for NaN values in the dataset
df.isnull().sum()

COLLISIONTYPE      4904
PERSONCOUNT           0
VEHCOUNT              0
JUNCTIONTYPE       6329
WEATHER            5081
ROADCOND           5012
LIGHTCOND          5170
SPEEDING         185340
SEVERITYCODE          0
dtype: int64

There are some NaN values in the dataset. Since they represent a small percentage of the total number of rows (194673), these records will be removed from the dataset. The exception is the SPEEDING column, where NaN values will be replaced by 'N' (not speeding).
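
As a rough check of the "small percentage" claim, the sketch below converts the NaN counts printed above into percentages of the 194673 rows. The dictionary simply copies the reported counts, and the variable names are illustrative.

```python
# NaN counts copied from the df.isnull().sum() output above
total_rows = 194673
nan_counts = {
    'COLLISIONTYPE': 4904,
    'JUNCTIONTYPE': 6329,
    'WEATHER': 5081,
    'ROADCOND': 5012,
    'LIGHTCOND': 5170,
    'SPEEDING': 185340,
}

# Share of rows with a missing value, per column, in percent
nan_pct = {col: 100 * n / total_rows for col, n in nan_counts.items()}
for col, pct in nan_pct.items():
    print(f'{col}: {pct:.1f}%')
```

Every column except SPEEDING is missing in roughly 2.5% to 3.3% of the rows, while SPEEDING is missing in about 95% of them, which is why it is filled rather than dropped.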

In [7]:
# Replacing the NaN values of the column SPEEDING with 'N'
df = df.fillna({'SPEEDING': 'N'})
# Dropping the remaining rows that contain NaN values
df = df.dropna()



In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183177 entries, 0 to 194672
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   COLLISIONTYPE  183177 non-null  object
 1   PERSONCOUNT    183177 non-null  int64 
 2   VEHCOUNT       183177 non-null  int64 
 3   JUNCTIONTYPE   183177 non-null  object
 4   WEATHER        183177 non-null  object
 5   ROADCOND       183177 non-null  object
 6   LIGHTCOND      183177 non-null  object
 7   SPEEDING       183177 non-null  object
 8   SEVERITYCODE   183177 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 14.0+ MB


Now we will substitute all the string values with numbers. 
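
The cells in this section give each category an integer that follows its frequency rank (the most frequent value becomes 1). A minimal sketch of that idea as a reusable helper is shown here; `encode_by_frequency` is a hypothetical name, the toy Series is made up for illustration, and the order of categories with equal counts is not guaranteed.

```python
import pandas as pd

def encode_by_frequency(series):
    """Map each category to its frequency rank (1 = most frequent)."""
    # value_counts() sorts categories from most to least frequent
    ranking = series.value_counts().index
    mapping = {category: rank for rank, category in enumerate(ranking, start=1)}
    return series.map(mapping), mapping

# Toy data, for illustration only
toy = pd.Series(['Dry', 'Wet', 'Dry', 'Ice', 'Dry', 'Wet'])
encoded, mapping = encode_by_frequency(toy)
print(mapping)
print(encoded.tolist())
```

Note that these integer codes carry an order that the categories themselves do not have, and distance-based models such as KNN will interpret that order literally.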

COLLISIONTYPE

In [9]:
df['COLLISIONTYPE'].value_counts()

Parked Car    43272
Angles        34464
Rear Ended    33683
Other         22999
Sideswipe     18312
Left Turn     13641
Pedestrian     6515
Cycles         5365
Right Turn     2930
Head On        1996
Name: COLLISIONTYPE, dtype: int64

In [10]:
# Replacing each collision type with an integer code (ordered by frequency)
df['COLLISIONTYPE'] = df['COLLISIONTYPE'].replace({
    'Parked Car': 1,
    'Angles': 2,
    'Rear Ended': 3,
    'Other': 4,
    'Sideswipe': 5,
    'Left Turn': 6,
    'Pedestrian': 7,
    'Cycles': 8,
    'Right Turn': 9,
    'Head On': 10,
})



JUNCTIONTYPE

In [11]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              86852
At Intersection (intersection related)               61226
Mid-Block (but intersection related)                 22353
Driveway Junction                                    10520
At Intersection (but not related to intersection)     2057
Ramp Junction                                          162
Unknown                                                  7
Name: JUNCTIONTYPE, dtype: int64

In [12]:
# Replacing each junction type with an integer code (ordered by frequency)
df['JUNCTIONTYPE'] = df['JUNCTIONTYPE'].replace({
    'Mid-Block (not related to intersection)': 1,
    'At Intersection (intersection related)': 2,
    'Mid-Block (but intersection related)': 3,
    'Driveway Junction': 4,
    'At Intersection (but not related to intersection)': 5,
    'Ramp Junction': 6,
    'Unknown': 7,
})



WEATHER

In [13]:
df['WEATHER'].value_counts()

Clear                       109157
Raining                      32671
Overcast                     27202
Unknown                      11767
Snowing                        882
Other                          749
Fog/Smog/Smoke                 558
Sleet/Hail/Freezing Rain       112
Blowing Sand/Dirt               49
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [14]:
# Replacing each weather condition with an integer code (ordered by frequency)
df['WEATHER'] = df['WEATHER'].replace({
    'Clear': 1,
    'Raining': 2,
    'Overcast': 3,
    'Unknown': 4,
    'Snowing': 5,
    'Other': 6,
    'Fog/Smog/Smoke': 7,
    'Sleet/Hail/Freezing Rain': 8,
    'Blowing Sand/Dirt': 9,
    'Severe Crosswind': 10,
    'Partly Cloudy': 11,
})




ROADCOND

In [15]:
df['ROADCOND'].value_counts()

Dry               122260
Wet                46748
Unknown            11652
Ice                 1178
Snow/Slush           980
Other                123
Standing Water       109
Sand/Mud/Dirt         67
Oil                   60
Name: ROADCOND, dtype: int64

In [16]:
# Replacing each road condition with an integer code (ordered by frequency)
df['ROADCOND'] = df['ROADCOND'].replace({
    'Dry': 1,
    'Wet': 2,
    'Unknown': 3,
    'Ice': 4,
    'Snow/Slush': 5,
    'Other': 6,
    'Standing Water': 7,
    'Sand/Mud/Dirt': 8,
    'Oil': 9,
})



LIGHTCOND

In [17]:
df['LIGHTCOND'].value_counts()

Daylight                    113959
Dark - Street Lights On      47590
Unknown                      10553
Dusk                          5780
Dawn                          2453
Dark - No Street Lights       1462
Dark - Street Lights Off      1158
Other                          211
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [18]:
# Replacing each light condition with an integer code
df['LIGHTCOND'] = df['LIGHTCOND'].replace({
    'Daylight': 1,
    'Dark - Street Lights On': 2,
    'Unknown': 3,
    'Dusk': 4,
    'Dawn': 5,
    'Other': 6,
    'Dark - No Street Lights': 7,
    'Dark - Street Lights Off': 8,
    'Dark - Unknown Lighting': 9,
})



SPEEDING

In [19]:
df['SPEEDING'].value_counts()

N    173968
Y      9209
Name: SPEEDING, dtype: int64

In [20]:
# Replacing each value with an integer code: N -> 1, Y -> 2
df['SPEEDING'] = df['SPEEDING'].replace({'N': 1, 'Y': 2})



After the transformation, the dataset contains 183177 rows, no NaN values and no string columns.

In [21]:
# The final dataset is as follows
df.head()

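| | COLLISIONTYPE | PERSONCOUNT | VEHCOUNT | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | SPEEDING | SEVERITYCODE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 2 | 3 | 2 | 1 | 1 | 2 |
| 1 | 5 | 2 | 2 | 1 | 2 | 2 | 2 | 1 | 1 |
| 2 | 1 | 4 | 3 | 1 | 3 | 1 | 1 | 1 | 1 |
| 3 | 4 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 |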


Separating labels from attributes

In [22]:
# X: attributes
# Y: labels
Y = df['SEVERITYCODE'].values
Y.shape

(183177,)

In [23]:
X = df.drop(columns=['SEVERITYCODE'])
X.shape

(183177, 8)

## Methodology
First, the best model needs to be selected. The following algorithms will be tested with the complete dataset (df):
* K-Nearest Neighbors (KNN)
* Decision Tree
* Support Vector Machine
* Logistic Regression


The accuracy of each model will be used to select the best among them.

In [24]:
# Separating the attributes
X_col = X['COLLISIONTYPE'].values
X_per = X['PERSONCOUNT'].values
X_veh = X['VEHCOUNT'].values
X_jun = X['JUNCTIONTYPE'].values
X_wea = X['WEATHER'].values
X_roa = X['ROADCOND'].values
X_lig = X['LIGHTCOND'].values
X_spe = X['SPEEDING'].values

The dataset is known to be unbalanced: one class has many more samples than the other.

In [26]:
# Printing the information of the label
df['SEVERITYCODE'].value_counts()

1    126521
2     56656
Name: SEVERITYCODE, dtype: int64

To obtain a more robust accuracy estimate on this unbalanced dataset, the K-Fold Cross Validation technique will be used with K = 5. The reported accuracy is the mean of the accuracies calculated for each fold.
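
Plain K-Fold does not rebalance the classes; it only averages the score over several splits. One option that keeps the class proportions the same in every fold is scikit-learn's `StratifiedKFold`. The sketch below is not what this notebook uses, and the toy labels (70 of class 1, 30 of class 2) are made up to mimic the imbalance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels, for illustration only: 70% class 1, 30% class 2
y_toy = np.array([1] * 70 + [2] * 30)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_shares = []
for train_index, test_index in skf.split(X_toy, y_toy):
    # Share of class 1 in the held-out fold
    fold_shares.append(float(np.mean(y_toy[test_index] == 1)))

print(fold_shares)
```

Each held-out fold should contain close to the overall 70% share of class 1, whereas a plain shuffled `KFold` can drift from it.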

## Modeling

In this section, different models are tested using all the attributes, and the best one is selected.

### Finding the best classifier

In [27]:
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Normalizing the data using Data Standardization
# (the result is a NumPy array, so it can be indexed by position in the folds below)
X = preprocessing.StandardScaler().fit(X).transform(X)

# List of scores (K-Fold)
scores_knn = []
scores_tree = []
scores_svm = []
scores_log = []

# Creating the models
knn_clf = KNeighborsClassifier(n_neighbors = 5)
tree_clf = DecisionTreeClassifier() 
svm_clf = svm.SVC(gamma='auto')
log_clf = LogisticRegression(random_state=0,solver='lbfgs')

# Using the K-Fold Cross Validation
cv = KFold(n_splits=5, random_state=1, shuffle=True)
for train_index, test_index in cv.split(X):
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], Y[train_index], Y[test_index]

    # Training the models
    knn_clf.fit(X_train, y_train) 
    tree_clf.fit(X_train, y_train)
    svm_clf.fit(X_train, y_train)
    log_clf.fit(X_train, y_train)

    # Making the predictions and calculating the accuracies
    y_pred_knn = knn_clf.predict(X_test)
    scores_knn.append(accuracy_score(y_test, y_pred_knn))

    y_pred_tree = tree_clf.predict(X_test)
    scores_tree.append(accuracy_score(y_test, y_pred_tree))

    y_pred_svm = svm_clf.predict(X_test)
    scores_svm.append(accuracy_score(y_test, y_pred_svm))

    y_pred_log = log_clf.predict(X_test)
    scores_log.append(accuracy_score(y_test, y_pred_log))
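
The loop above standardizes the whole dataset once, before splitting. A possible variation, sketched here under assumptions and not used in this notebook, wraps the scaler and the classifier in a scikit-learn pipeline so the scaler is fitted on the training part of each fold only, and lets `cross_val_score` run the folds. The synthetic data from `make_classification` stands in for the real attributes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 attributes (illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# The scaler is re-fitted inside every training fold
model = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
print(scores.mean())
```

The same pattern should work with any of the four classifiers in place of `LogisticRegression`.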

In [40]:
# Calculating the mean of the accuracies
import statistics as st
knn_acc = st.mean(scores_knn)
tree_acc = st.mean(scores_tree)
svm_acc = st.mean(scores_svm)
log_acc = st.mean(scores_log)
acc = [knn_acc, tree_acc, svm_acc, log_acc]

print(f'Accuracy for the KNN: {knn_acc}')
print(f'Accuracy for the Decision Tree: {tree_acc}')
print(f'Accuracy for the SVM: {svm_acc}')
print(f'Accuracy for the Logistic Regression: {log_acc}')

report = pd.DataFrame(data = acc, index=['KNN','Decision Tree','SVM','Logistic Regression'])
report.columns = ['Accuracy']
report

Accuracy for the KNN: 0.7196864300501432
Accuracy for the Decision Tree: 0.7413867377914282
Accuracy for the SVM: 0.7477030450009137
Accuracy for the Logistic Regression: 0.7170277917564901


| | Accuracy |
|---|---|
| KNN | 0.719686 |
| Decision Tree | 0.741387 |
| SVM | 0.747703 |
| Logistic Regression | 0.717028 |


From the table above, the best classifier was the SVM.
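
To put these accuracies in context, it can help to compare them with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline from the class counts printed earlier (126521 and 56656); the variable names are illustrative.

```python
# Class counts copied from the SEVERITYCODE value_counts() output
class_counts = {1: 126521, 2: 56656}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
# Accuracy of a model that always predicts the majority class
baseline_acc = class_counts[majority_class] / total

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_acc:.4f}')
```

The baseline is about 0.69, so the reported accuracies of roughly 0.72 to 0.75 sit a few points above always predicting severity 1.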


## Results
The results obtained in this project are shown next.

In [41]:
report = pd.DataFrame(data = acc, index=['KNN','Decision Tree','SVM','Logistic Regression'])
report.columns = ['Accuracy']
report

| | Accuracy |
|---|---|
| KNN | 0.719686 |
| Decision Tree | 0.741387 |
| SVM | 0.747703 |
| Logistic Regression | 0.717028 |


## Conclusion
In this project, a dataset of accidents in the city of Seattle was used to predict the severity of an accident based on some of its characteristics.

* The data was treated to optimize the performance of the models created;
* Four different types of models were tested to see which was better.

Based on the results, the best classifier was the SVM, which achieved the highest accuracy (74.77%).

## Next Steps 
Further analysis can be made to find out whether using the attributes separately or in different combinations can improve the results obtained.
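
As a starting point for that analysis, the sketch below enumerates every non-empty combination of the eight attributes with `itertools.combinations`. It only builds the candidate lists; how each one is evaluated is left open.

```python
from itertools import combinations

attributes = ['COLLISIONTYPE', 'PERSONCOUNT', 'VEHCOUNT', 'JUNCTIONTYPE',
              'WEATHER', 'ROADCOND', 'LIGHTCOND', 'SPEEDING']

# Every non-empty subset of the attributes, from single columns up to all eight
subsets = [
    list(combo)
    for size in range(1, len(attributes) + 1)
    for combo in combinations(attributes, size)
]

print(len(subsets))
print(subsets[0], subsets[-1])
```

Each subset could then be used to select columns before the cross-validation loop. With 255 candidates and a slow model such as the SVM, evaluating all of them may be expensive, so starting with the single attributes is a reasonable first pass.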