<h1 style="text-align: center;"><strong>Severity of Car Accidents Classification</strong></h1>

<h2>Project Overview</h2>

This is the capstone project for the IBM data analysis course on Coursera. In this project, a dataset of car accidents is analysed using Python. The dataset has many attributes that describe the circumstances of each accident. First, the data is explored to understand the attributes and to decide which information is useful for the analysis. Then the most suitable machine learning technique is chosen to build a model that predicts the severity of future accidents. Finally, the model is deployed and tested to check its accuracy.

<h3>Libraries to be used in the analysis</h3>

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sns
from matplotlib import pyplot as plt

<h2>Problem Description</h2>

Car accidents are a serious problem and cause heavy losses in both human lives and money. In recent years, advancements in car technology and safety have steadily reduced the number of fatalities. Many governments have also introduced traffic safety and regulatory measures, which have helped reduce the number of accidents. However, accidents still happen and cannot yet be prevented completely. Hence, it would be a great advantage to be able to predict the most common types of accidents and their respective severity, so that efforts can be concentrated on the most severe types and on minimizing them. Therefore, a dataset is analysed in this notebook, and a model is then created to predict the severity of car accidents from a range of significant attributes.


<h2>Data Understanding</h2>

The target of this project is accident severity, so a labeled dataset with accident severity types is needed, along with various attributes to create a more inclusive model. There are many sources for such a dataset; this notebook uses a sample dataset provided by Coursera, which can be found <a href='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'>here</a>. The dataset covers all types of accidents in the city of Seattle since 2004. It has 38 attributes and 194673 observations, and it contains an attribute for accident severity (our target). The significance of the attributes will be assessed, and the features will be chosen accordingly. Unused attributes will be dropped from the dataset, and missing values will be dealt with during preprocessing. Since the objective is to group accidents by an already labeled severity, a classification algorithm (supervised learning) will be used to create the intended model.

In [9]:
#importing the dataset
df=pd.read_csv('Data-Collisions.csv',low_memory=False)
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [8]:
print(df.columns)
print(df.shape)

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')
(194673, 38)


<h4>Examining Target Variable</h4>

In [4]:
print(df['SEVERITYCODE'].value_counts())
df['SEVERITYDESC'].value_counts(normalize=True)

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64


Property Damage Only Collision    0.701099
Injury Collision                  0.298901
Name: SEVERITYDESC, dtype: float64

<br>
<p>Our target is grouped into two labeled groups:</p>
<ul>
<li>Group 1: Property Damage Only Collision</li>
<li>Group 2: Injury Collision</li>
</ul>
We will try to create a model that classifies accident severity into these two groups.

Furthermore, it is clear that our target variable is imbalanced: about 70% of the observations lie in severity group 1. This could bias our model toward the majority class, so we need an approach to handle the imbalance.
There are several ways to handle this problem, such as:
<ul>
    <li>Up-sampling the minority class (value=2), where random observations of the smaller class are replicated to increase their number and balance it with the larger class</li>
    <li>Down-sampling the majority class (value=1), where random observations of the larger class are removed to decrease their number and balance it with the smaller class</li>
    <li>Using a classification algorithm that works well with imbalanced data, such as a Decision Tree</li>
    <li>Dividing the larger class into two distinct groups, applying the model to each group combined with the smaller class, then evaluating the accuracy of both models and taking the average</li>
</ul>

In this project we will use the last option from above in dealing with the imbalanced target variable.
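
For comparison, the first two options in the list above can be sketched with pandas alone. This is a minimal illustration on a small made-up frame, not the approach used in this notebook; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the collisions data (hypothetical values).
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "WEATHER": ["Clear", "Raining"] * 5,
})
majority = toy[toy["SEVERITYCODE"] == 1]
minority = toy[toy["SEVERITYCODE"] == 2]

# Up-sampling: draw minority rows with replacement until both classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size.
majority_down = majority.sample(n=len(minority), replace=False, random_state=0)
downsampled = pd.concat([majority_down, minority])

print(upsampled["SEVERITYCODE"].value_counts().to_dict())
print(downsampled["SEVERITYCODE"].value_counts().to_dict())
```

Up-sampling keeps every original row but repeats some minority rows, while down-sampling discards majority rows; either way the two classes end up the same size.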

In [5]:
major_class=df[df['SEVERITYCODE']==1]
minor_class=df[df['SEVERITYCODE']==2]

#dividing the majority class
major_class_first,major_class_second=train_test_split(major_class,test_size=0.5,random_state=0)

#Datasets from which we can set our two Y variables (dependent variables) and X variables (independent variables)
#pd.concat is used because DataFrame.append was removed in pandas 2.0
model1=pd.concat([minor_class,major_class_first])
model2=pd.concat([minor_class,major_class_second])
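
The remaining steps of this approach (fit one model per dataset, then average the two accuracies) could look roughly like the sketch below. It runs on synthetic data; the feature names, the choice of a decision tree, its depth, and the 80/20 split are all assumptions for illustration, not results or settings from this project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the collisions data (hypothetical features).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
noise = rng.normal(scale=0.5, size=n)
# Roughly a third of the rows fall in class 2, mimicking the imbalance.
toy["SEVERITYCODE"] = np.where(toy["feature_a"] + noise > 0.5, 2, 1)

major = toy[toy["SEVERITYCODE"] == 1]
minor = toy[toy["SEVERITYCODE"] == 2]
major_first, major_second = train_test_split(major, test_size=0.5, random_state=0)


def fit_and_score(frame):
    """Train a small decision tree on one dataset and return its test accuracy."""
    X = frame[["feature_a", "feature_b"]]
    y = frame["SEVERITYCODE"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


# One model per half of the majority class, each combined with the minority class.
scores = [fit_and_score(pd.concat([minor, half])) for half in (major_first, major_second)]
mean_accuracy = float(np.mean(scores))
print(scores, mean_accuracy)
```

Each dataset holds the full minority class plus half of the majority class, so both models see a more even class ratio than the original data.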

<h4>Examining Attributes</h4>

Choosing the most suitable features for the model can be hard, especially if the dataset has many attributes, because choosing too many attributes might lead to overfitting, which affects the accuracy of our model.
Moreover, since the task is to classify car accident severity, weather, road lighting, road conditions and a general idea of the common accident locations must be taken into consideration. It will also be helpful to check for accidents on different days of the week. Hence, the features to be used in the model are: 
<ul>
    <li>ADDRTYPE</li>
    <li>WEATHER</li>
    <li>LIGHTCOND</li>
    <li>ROADCOND</li>
    <li>INCDATE</li>
</ul>    
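
A possible way to turn these attributes into model inputs is sketched below: the day of the week is derived from the date, and the categorical columns are one-hot encoded. The rows and the ISO date strings are made up (the real INCDATE values may be formatted differently), the column name `DAYOFWEEK` is hypothetical, and one-hot encoding is only one of several encodings that could be used.

```python
import pandas as pd

# Made-up rows using the five chosen attributes (hypothetical values).
toy = pd.DataFrame({
    "ADDRTYPE": ["Intersection", "Block", "Block"],
    "WEATHER": ["Overcast", "Raining", "Clear"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight"],
    "ROADCOND": ["Wet", "Wet", "Dry"],
    "INCDATE": ["2013-03-27", "2006-12-20", "2004-11-18"],
})

# Day of the week as an integer: Monday=0 ... Sunday=6.
toy["DAYOFWEEK"] = pd.to_datetime(toy["INCDATE"]).dt.dayofweek

# One-hot encode every categorical feature, including the derived weekday.
categorical = ["ADDRTYPE", "WEATHER", "LIGHTCOND", "ROADCOND", "DAYOFWEEK"]
features = pd.get_dummies(toy[categorical], columns=categorical)

print(toy["DAYOFWEEK"].tolist())
print(features.columns.tolist())
```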

<h2>Data Preparation and Preprocessing</h2>
<p>Now the data will be selected and prepared for the classification model. Unused columns will be dropped and missing values will be dealt with.</p>
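
As a rough sketch of these two steps on a small made-up frame: keep only the columns of interest, count the missing values, then drop incomplete rows. Dropping rows is just one option assumed here; filling missing values with a placeholder category would be another.

```python
import pandas as pd

# Made-up rows with a few gaps (hypothetical values).
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 2],
    "ADDRTYPE": ["Intersection", "Block", None, "Block"],
    "WEATHER": ["Overcast", None, "Clear", "Raining"],
    "REPORTNO": ["3502005", "2607959", "1482393", "3503937"],
})

# Keep only the target and the chosen features; everything else is dropped.
keep = ["SEVERITYCODE", "ADDRTYPE", "WEATHER"]
subset = toy[keep]

# Count missing values per column before deciding how to handle them.
missing_per_column = subset.isna().sum()

# Simplest handling: drop every row that has at least one missing value.
cleaned = subset.dropna()

print(missing_per_column.to_dict())
print(len(cleaned))
```
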
<h4>Examining data</h4>

<h4>Prepare our target variable</h4>

In [25]:
Y=df.SEVERITYCODE.values
Y[0:5]

array([2, 1, 1, 1, 2], dtype=int64)