# Capstone Project - Car Accident Prediction (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [My Notes](#notes)
* [Introduction: Business Problem](#introduction)
* [Data](#data)

---

## My Notes <a name="notes"></a>

#### Problem:

Say you are driving to another city for work or to visit some friends. It is rainy and windy, and on the way, you come across a terrible traffic jam on the other side of the highway. Long lines of cars barely moving. 

As you keep driving, police cars start appearing in the distance, shutting down the highway. Oh, it is an accident and there's a helicopter transporting the ones involved in the crash to the nearest hospital. They must be in critical condition for all of this to be happening. 

__Now, wouldn't it be great if there is something in place that could warn you, given the weather and the road conditions about the possibility of you getting into a car accident and how severe it would be, so that you would drive more carefully or even change your travel if you are able to.__

#### Data Selection Criteria:

Decide whether you want to use the shared data or find your own dataset. In case you choose to find your own dataset from the resources that are suggested in the Week-1 video, your dataset should meet the following criteria: 

1. __The target or label columns should be accident "severity" in terms of human fatality, traffic delay, property damage, or any other type of accident bad impact.__ 
2. The machine learning model should be able to predict accident "severity"
3. To build a good model, the dataset should be rich and contain many observations (rows) and various attributes (columns)

__*Thoughts about this problem:*__
- It talks about using the output from 'some system' to change driver behaviour prior to travel, such as to drive 'more carefully' or 'change your travel'. This suggests predicting the severity of a crash before it happens.
  - __This would impact the choice of features to use in the model__
  - We couldn't use features that would not be available prior to a crash such as data from police and/or hospital reports.
- The request talks of "possibility of you getting into a car accident" and "how severe it will be". This suggests that two outputs are needed. One for the probability/chance of getting into a car accident and the second for what the severity/consequence could be.
  - Severity (from sample data) is categorical and binary, although the original data set is multi-class
  - Severity (from Vicroads data) is categorical and multi-class

---

## Introduction: Business Problem <a name="introduction"></a>

#### Report Instructions:

_Clearly define a problem or an idea of your choice. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem._

_The initial phase is to understand the project's objective from the business or application perspective. Then, you need to translate this knowledge into a machine learning problem with a preliminary plan to achieve the objectives._

#### Business Problem:

Problem Brief:  
Now, wouldn't it be great if there is something in place that could warn you, given the weather and the road conditions about the possibility of you getting into a car accident and how severe it would be, so that you would drive more carefully or even change your travel if you are able to.

Target Audience:  
Drivers who are planning a trip and are concerned that current driving conditions might increase the risk of a car crash. The driver can use the warnings to boost focus and concentration or even adjust route planning. 

System Output:  
Given specific driving conditions (in Victoria/Australia), predict the possibility/likelihood of you getting into an accident (percentage %) and the severity (categorical output / 4 classes).

Geographical Focus:  
Analysis will be focused on Victoria/Australia using data from the local road authority 'Vicroads'. 

*Assumptions and Thinking:*  
- The brief talks about using the output from 'some system' to change driver behaviour prior to travel, such as to drive 'more carefully' or 'change your travel'. This suggests predicting the severity of a crash before it happens.
  - __This would impact the choice of features to use in the model__
  - We couldn't use features that would not be available prior to a crash such as data from police and/or hospital reports.
- The brief talks of "possibility of you getting into a car accident" and "how severe it will be". This suggests that two outputs are needed. One for the probability/likelihood (%) of getting into a car accident and the second for what the severity/consequence could be. 
  - Severity (from sample data) is categorical and binary
  - Severity (from Vicroads data) is categorical and multi-class (x4)  
- The model will always assign / predict an output (severity). The likelihood of a crash will be the key determinant. __Perhaps the system can first predict the severity and then look at the geospatial information to determine if there is a high / increased likelihood of a crash.__

---

## Data <a name="data"></a>

#### Report Instructions:
_Describe the data that you will be using to solve the problem or execute your idea. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using._

_In this phase, you need to collect or extract the dataset from various sources such as csv file or SQL database. Then, you need to determine the attributes (columns) that you will use to train your machine learning model. Also, you will assess the condition of chosen attributes by looking for trends, certain patterns, skewed information, correlations, and so on._

##### Data Understanding:
The data required for this stage should contain details that can be obtained prior to travel. Historical crash data also contains details that only become available through police investigation and hospital reports. These details can't be used for the model as they won't be available at the time of driving.

##### Possible Detail / Features of Interest:  
- Driver and passenger details (e.g. age, gender, # of occupants)
- Date and time of travel (e.g. time of day, weekend/weekday, public holiday)
- Vehicle details (engine type, age, safety rating)
- Weather conditions
- Lighting conditions
- Road conditions (sealed, unsealed)
- Location / Address

##### Modelling Notes:
- Use a selection of classification models to determine the crash severity and evaluate the best one:
  - KNN
  - Decision Tree
  - SVC
  - Random Forest ?
- Geospatial information should be a key feature for classification. E.g. if many people have had fatal crashes at a particular intersection within a short period of time, then the likely consequence of another crash is a fatality. 
- Use geospatial analysis to determine the likelihood of a crash such as frequency of crashes at a particular location. This could simply be a formula based on the number of crashes within a period of time prior to the travel date.
- Perhaps the geospatial data can be used to highlight accident hotspots in the nearby area.
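
The frequency-based likelihood described in the notes above could be sketched roughly as follows. This is only an illustration on invented records: `crashes_per_year` is a hypothetical helper, using `NODE_ID` as the location key and a 365-day lookback window are assumptions, and the result is a crash rate rather than a calibrated probability.

```python
import pandas as pd

# Toy crash records. NODE_ID and ACCIDENT_DATE mirror VicRoads column names,
# but the values below are invented for illustration only.
crashes = pd.DataFrame({
    "NODE_ID": [101, 101, 101, 202, 202],
    "ACCIDENT_DATE": pd.to_datetime(
        ["2017-01-10", "2017-03-05", "2016-01-01", "2017-02-20", "2015-06-30"]
    ),
})


def crashes_per_year(df, node_id, travel_date, lookback_days=365):
    """Hypothetical helper: count crashes at one location inside the lookback
    window before the travel date, scaled to crashes per year."""
    travel_date = pd.Timestamp(travel_date)
    window_start = travel_date - pd.Timedelta(days=lookback_days)
    in_window = (
        (df["NODE_ID"] == node_id)
        & (df["ACCIDENT_DATE"] >= window_start)
        & (df["ACCIDENT_DATE"] < travel_date)
    )
    return in_window.sum() * 365.0 / lookback_days


rate_101 = crashes_per_year(crashes, 101, "2017-04-01")
rate_202 = crashes_per_year(crashes, 202, "2017-04-01")
print("Crashes per year at node 101:", rate_101)
print("Crashes per year at node 202:", rate_202)
```

Ranking locations by this rate would be one simple way to flag nearby hotspots; turning it into a percentage would need an exposure measure (such as traffic volume) that this dataset may not provide.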

##### Exclusions:  
- This data science course has focused on binary classification. As such this assignment will only focus on binary class output. Excess classes will be excluded/dropped from the data. Although the desired models chosen above can handle multiclass, the dataset will be simplified to binary class accident severity.
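
Once severity is reduced to two classes, the candidate classifiers from the modelling notes could be compared along these lines. This is a minimal sketch on synthetic data from `make_classification`, not the VicRoads features; the 70/30 class weights, the hyperparameters and the choice of F1 as the comparison metric are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the crash features (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Hyperparameters are placeholders, not tuned values.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "SVC": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test_s))

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print("{0}: F1 = {1:.3f}".format(name, score))
```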

---

##### Victorian Government | Department of Transport | Open data

Fatal and injury crashes on Victorian roads during the latest five year reporting period. This data allows users to analyse Victorian fatal and injury crash data based on time, location, conditions, crash type, road user type, object hit etc. Road Safety data is provided by VicRoads for educational and research purposes. This data is in Web Mercator (Auxiliary Sphere) projection.

_Crashes Last Five Years (Vicroads Open Data):_  
https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/crashes-last-five-years

_Metadata Information:_  
http://data.vicroads.vic.gov.au/metadata/Crashes_Last_Five_Years%20-%20Open%20Data.html

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

from scipy import stats
from scipy.stats import norm, skew

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [2]:
# Analysis will be focused on Victoria/Australia using data from the local road authority 'Vicroads'.
crash_data_filename = "Crashes_Last_Five_Years.csv"

In [3]:
# Read VicRoads Data into Dataframe:
crash_data_df = pd.read_csv(crash_data_filename)
print("Dataset Shape:", crash_data_df.shape)

Dataset Shape: (74908, 63)


In [4]:
# test reading dataframe:
crash_data_df.head(5)

Unnamed: 0,OBJECTID,ACCIDENT_NO,ABS_CODE,ACCIDENT_STATUS,ACCIDENT_DATE,ACCIDENT_TIME,ALCOHOLTIME,ACCIDENT_TYPE,DAY_OF_WEEK,DCA_CODE,...,DEG_URBAN_ALL,LGA_NAME_ALL,REGION_NAME_ALL,SRNS,SRNS_ALL,RMA,RMA_ALL,DIVIDED,DIVIDED_ALL,STAT_DIV_NAME
0,3401744,T20130013732,ABS to receive accident,Finished,1/7/2013,18.30.00,Yes,Struck Pedestrian,Monday,PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIGHT.,...,MELB_URBAN,MELBOURNE,METROPOLITAN NORTH WEST REGION,,,Local Road,Local Road,Undivided,Undiv,Metro
1,3401745,T20130013736,ABS to receive accident,Finished,2/7/2013,16.40.00,No,Collision with vehicle,Tuesday,PARKED VEHICLES ONLY,...,MELB_URBAN,WHITEHORSE,METROPOLITAN SOUTH EAST REGION,,,Arterial Other,"Arterial Other,Local Road",Divided,"Div,Undiv",Metro
2,3401746,T20130013737,ABS to receive accident,Finished,2/7/2013,13.15.00,No,Collision with a fixed object,Tuesday,RIGHT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE,...,MELB_URBAN,BRIMBANK,METROPOLITAN NORTH WEST REGION,,,Local Road,Local Road,Undivided,Undiv,Metro
3,3401747,T20130013738,ABS to receive accident,Finished,2/7/2013,16.45.00,No,Collision with a fixed object,Tuesday,RIGHT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE,...,RURAL_VICTORIA,MITCHELL,NORTHERN REGION,M,M,Freeway,Freeway,Divided,Div,Country
4,3401748,T20130013739,ABS to receive accident,Finished,2/7/2013,15.48.00,No,Collision with vehicle,Tuesday,U TURN,...,"MELBOURNE_CBD,MELB_URBAN",MELBOURNE,METROPOLITAN NORTH WEST REGION,,,Local Road,Local Road,Undivided,Undiv,Metro


In [5]:
# Checking the count of unique values in the "ACCIDENT_NO" column.
# The values in this column are supposed to be a unique identifier for each accident.
# Expecting this count to equal the number of records in the dataset.

print("Dataset Unique Accident Records:", crash_data_df['ACCIDENT_NO'].unique().shape)

Dataset Unique Accident Records: (74908,)


This dataset is formatted to include one accident per record/line.
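
The same one-accident-per-row check can be made more explicit with `nunique`, `duplicated` and `is_unique`. The sketch below runs on a three-row toy frame (the accident numbers are taken from the preview above) rather than on the full file.

```python
import pandas as pd

# Toy frame; the identifiers are copied from the dataset preview.
toy_df = pd.DataFrame(
    {"ACCIDENT_NO": ["T20130013732", "T20130013736", "T20130013737"]}
)

n_unique = toy_df["ACCIDENT_NO"].nunique()              # number of distinct identifiers
n_duplicates = toy_df["ACCIDENT_NO"].duplicated().sum()  # rows repeating an earlier identifier
is_one_per_row = toy_df["ACCIDENT_NO"].is_unique         # True when no identifier repeats

print("Unique:", n_unique, "| Duplicates:", n_duplicates, "| One per row:", is_one_per_row)
```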

In [6]:
# This assignment includes predicting the accident severity. This can be found in the "SEVERITY" column.
# Checking the unique values and records counts for this column:

# Stats of the dataset before any filtering
print('Dataset Shape:',crash_data_df.shape,'\n')

totalRecords = crash_data_df.shape[0]

print('Severity Column Values (count):')
print('{0}\n'.format(crash_data_df['SEVERITY'].value_counts()))

print('Severity Column Values (percentage):')
print('{0}'.format(crash_data_df['SEVERITY'].value_counts()/totalRecords*100))

Dataset Shape: (74908, 63) 

Severity Column Values (count):
Other injury accident      52032
Serious injury accident    21561
Fatal accident              1314
Non injury accident            1
Name: SEVERITY, dtype: int64

Severity Column Values (percentage):
Other injury accident      69.461206
Serious injury accident    28.783308
Fatal accident              1.754152
Non injury accident         0.001335
Name: SEVERITY, dtype: float64


##### Notes:  
Given the need to keep the output binary (as this data science course has not covered multi-class classification), I will focus the modelling on just two classes:  
- Other injury accident - 52032 records
- Serious injury accident - 21561 records

This will give enough data to attempt an undersampling approach to balance the data set.
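
One way the undersampling could be done is to sample every class down to the size of the smallest one. The sketch below uses a ten-row toy frame; the `SPEED_ZONE` values and the `random_state` are invented for illustration, and random undersampling is only one of several possible balancing strategies.

```python
import pandas as pd

# Toy frame with a 7:3 imbalance, loosely mirroring the real class ratio.
toy_df = pd.DataFrame({
    "SEVERITY": ["Other injury accident"] * 7 + ["Serious injury accident"] * 3,
    "SPEED_ZONE": [60, 50, 80, 100, 60, 40, 70, 100, 80, 60],
})

# Size of the smallest class decides how many rows are kept per class.
minority_size = int(toy_df["SEVERITY"].value_counts().min())

# Randomly sample each class down to the minority size and stitch them together.
balanced_df = pd.concat(
    [
        group.sample(n=minority_size, random_state=42)
        for _, group in toy_df.groupby("SEVERITY")
    ]
)

balanced_counts = balanced_df["SEVERITY"].value_counts()
print(balanced_counts)
```

On the real data this would keep 21561 records per class, at the cost of discarding a little over 30000 'Other injury accident' rows.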

In [7]:
# Remove the small classes from the output to create a binary output with 52032 + 21561 records:
crash_data_df = crash_data_df[~crash_data_df['SEVERITY'].isin(['Fatal accident', 'Non injury accident'])]

# Stats of modified dataset
print('Dataset Shape:',crash_data_df.shape,'\n')

totalRecords = crash_data_df.shape[0]

print('Severity Column Values (count):')
print('{0}\n'.format(crash_data_df['SEVERITY'].value_counts()))

print('Severity Column Values (percentage):')
print('{0}'.format(crash_data_df['SEVERITY'].value_counts()/totalRecords*100))

Dataset Shape: (73593, 63) 

Severity Column Values (count):
Other injury accident      52032
Serious injury accident    21561
Name: SEVERITY, dtype: int64

Severity Column Values (percentage):
Other injury accident      70.702377
Serious injury accident    29.297623
Name: SEVERITY, dtype: float64


In [8]:
list(crash_data_df.columns)

['OBJECTID',
 'ACCIDENT_NO',
 'ABS_CODE',
 'ACCIDENT_STATUS',
 'ACCIDENT_DATE',
 'ACCIDENT_TIME',
 'ALCOHOLTIME',
 'ACCIDENT_TYPE',
 'DAY_OF_WEEK',
 'DCA_CODE',
 'HIT_RUN_FLAG',
 'LIGHT_CONDITION',
 'POLICE_ATTEND',
 'ROAD_GEOMETRY',
 'SEVERITY',
 'SPEED_ZONE',
 'RUN_OFFROAD',
 'NODE_ID',
 'LONGITUDE',
 'LATITUDE',
 'NODE_TYPE',
 'LGA_NAME',
 'REGION_NAME',
 'VICGRID_X',
 'VICGRID_Y',
 'TOTAL_PERSONS',
 'INJ_OR_FATAL',
 'FATALITY',
 'SERIOUSINJURY',
 'OTHERINJURY',
 'NONINJURED',
 'MALES',
 'FEMALES',
 'BICYCLIST',
 'PASSENGER',
 'DRIVER',
 'PEDESTRIAN',
 'PILLION',
 'MOTORIST',
 'UNKNOWN',
 'PED_CYCLIST_5_12',
 'PED_CYCLIST_13_18',
 'OLD_PEDESTRIAN',
 'OLD_DRIVER',
 'YOUNG_DRIVER',
 'ALCOHOL_RELATED',
 'UNLICENCSED',
 'NO_OF_VEHICLES',
 'HEAVYVEHICLE',
 'PASSENGERVEHICLE',
 'MOTORCYCLE',
 'PUBLICVEHICLE',
 'DEG_URBAN_NAME',
 'DEG_URBAN_ALL',
 'LGA_NAME_ALL',
 'REGION_NAME_ALL',
 'SRNS',
 'SRNS_ALL',
 'RMA',
 'RMA_ALL',
 'DIVIDED',
 'DIVIDED_ALL',
 'STAT_DIV_NAME']

In [89]:
# Listing unique values for select column pairs:
print(crash_data_df['DEG_URBAN_NAME'].unique(),'\n')
print(crash_data_df['DEG_URBAN_ALL'].unique(),'\n')

print(crash_data_df['LGA_NAME'].unique(),'\n')
print(crash_data_df['LGA_NAME_ALL'].unique(),'\n')

print(crash_data_df['REGION_NAME'].unique(),'\n')
print(crash_data_df['REGION_NAME_ALL'].unique(),'\n')

print(crash_data_df['SRNS'].unique(),'\n')
print(crash_data_df['SRNS_ALL'].unique(),'\n')

print(crash_data_df['RMA'].unique(),'\n')
print(crash_data_df['RMA_ALL'].unique(),'\n')

print(crash_data_df['DIVIDED'].unique(),'\n')
print(crash_data_df['DIVIDED_ALL'].unique())


['MELB_URBAN' 'RURAL_VICTORIA' 'MELBOURNE_CBD' 'LARGE_PROVINCIAL_CITIES'
 'SMALL_CITIES' 'TOWNS' 'SMALL_TOWNS'] 

['MELB_URBAN' 'RURAL_VICTORIA' 'MELBOURNE_CBD,MELB_URBAN'
 'LARGE_PROVINCIAL_CITIES' 'MELBOURNE_CBD' 'SMALL_CITIES' 'TOWNS'
 'RURAL_VICTORIA,MELB_URBAN' 'TOWNS,RURAL_VICTORIA' 'SMALL_TOWNS'
 'SMALL_CITIES,RURAL_VICTORIA' 'RURAL_VICTORIA,LARGE_PROVINCIAL_CITIES'
 'SMALL_TOWNS,RURAL_VICTORIA' 'SMALL_TOWNS,MELB_URBAN'
 'SMALL_TOWNS,RURAL_VICTORIA,SMALL_CITIES' 'SMALL_TOWNS,SMALL_CITIES'] 

['MELBOURNE' 'WHITEHORSE' 'BRIMBANK' 'MITCHELL' 'BAW BAW' 'BAYSIDE'
 'BOROONDARA' 'BANYULE' 'HUME' 'WHITTLESEA' 'GEELONG' 'HOBSONS BAY'
 'NILLUMBIK' 'PORT PHILLIP' 'DAREBIN' 'YARRA' 'LATROBE' 'MOONEE VALLEY'
 'KNOX' 'CASEY' 'BENDIGO' 'FRANKSTON' 'EAST GIPPSLAND' 'KINGSTON'
 'MAROONDAH' 'BALLARAT' 'CAMPASPE' 'SHEPPARTON' 'MILDURA' 'DANDENONG'
 'MONASH' 'GOLDEN PLAINS' 'MORNINGTON PENINSULA' 'WYNDHAM' 'CORANGAMITE'
 'BASS COAST' 'MURRINDINDI' 'STONNINGTON' 'MORELAND' 'MOORABOOL'
 'YARRA RANGES

The metadata for these columns describes two levels of detail for any given crash. The first column contains a single value descriptor. The second column could contain a list of value descriptors (one or more) depending on the crash location. Each descriptor in the list is composed of values from those in the first column.

To keep this assignment model simple, we will use the single value descriptor and ignore the 'xxx\_ALL' columns.
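
A minimal sketch of that simplification, assuming the 'xxx\_ALL' columns can be identified purely by their `_ALL` suffix. The toy rows reuse descriptor values from the output above, but how they are paired within a row is an assumption.

```python
import pandas as pd

# Toy frame; row pairings are illustrative, not copied from the dataset.
toy_df = pd.DataFrame({
    "DEG_URBAN_NAME": ["MELB_URBAN", "MELBOURNE_CBD"],
    "DEG_URBAN_ALL": ["MELB_URBAN", "MELBOURNE_CBD,MELB_URBAN"],
    "RMA": ["Arterial Other", "Local Road"],
    "RMA_ALL": ["Arterial Other,Local Road", "Local Road"],
    "INJ_OR_FATAL": [1, 2],  # ends in 'FATAL', so the suffix test leaves it alone
})

# The list-valued columns hold comma-separated descriptors.
n_descriptors = toy_df["DEG_URBAN_ALL"].str.split(",").str.len()

# Drop every column whose name ends with the '_ALL' suffix.
all_columns = [col for col in toy_df.columns if col.endswith("_ALL")]
simplified_df = toy_df.drop(columns=all_columns)

print("Dropped:", all_columns)
print("Remaining:", list(simplified_df.columns))
```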

---
#### The type of data that would be required for this assignment is mapped out below. 
- Circles represent the logical variables / entities required for this model. 
- Rectangles represent the required attributes.
- Diamonds represent fields from the VicRoads dataset that can provide the data needed.

Only features from the dataset that could be known before the crash will be used. Some of these features, though, mix detail that could be known before the crash with detail that is only known afterwards. E.g. PASSENGER represents the number of passengers involved in the crash. For the purpose of this assignment we will treat these fields as if they represent data known prior to the crash and describe the vehicle being modelled. 
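
As a rough illustration of restricting the model to pre-crash information, the sketch below keeps a hand-picked list of columns and one-hot encodes them. The list in `pre_crash_features` is hypothetical and is not the selection from the feature mapping; the cell values are also invented.

```python
import pandas as pd

# Toy frame using real column names with invented values.
toy_df = pd.DataFrame({
    "LIGHT_CONDITION": ["Day", "Dark"],
    "ROAD_GEOMETRY": ["Cross intersection", "Not at intersection"],
    "DAY_OF_WEEK": ["Monday", "Tuesday"],
    "POLICE_ATTEND": ["Yes", "No"],   # only known after the crash
    "HIT_RUN_FLAG": ["No", "No"],     # only known after the crash
    "SEVERITY": ["Other injury accident", "Serious injury accident"],
})

# Hypothetical selection of columns that could be known before travel.
pre_crash_features = ["LIGHT_CONDITION", "ROAD_GEOMETRY", "DAY_OF_WEEK"]

X = toy_df[pre_crash_features]
y = toy_df["SEVERITY"]

# One-hot encode the categorical features so classifiers can consume them.
X_encoded = pd.get_dummies(X)
print(X_encoded.columns.tolist())
```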

![FEATURE%20MAPPING.png](attachment:FEATURE%20MAPPING.png)

_Find Nulls:_  
np.where(pd.isnull(crash_data_df))  
[crash_data_df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(crash_data_df)))]  
crash_data_df.iloc[0,56] 
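
A per-column summary is often easier to read than the raw positions returned by `np.where`. The sketch below shows the idea on a toy frame; the `SRNS` and `RMA` values are taken from the preview above, where `SRNS` is empty for several rows.

```python
import numpy as np
import pandas as pd

# Toy frame with missing SRNS values, as seen in the dataset preview.
toy_df = pd.DataFrame({
    "SRNS": [np.nan, np.nan, "M"],
    "RMA": ["Local Road", "Arterial Other", "Freeway"],
})

# Count of nulls per column, then keep only the columns that have any.
null_counts = toy_df.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]

# Rows containing at least one null value.
rows_with_nulls = toy_df[toy_df.isnull().any(axis=1)]

print(columns_with_nulls)
print("Rows with at least one null:", len(rows_with_nulls))
```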