# Capstone Project - Car Accident Prediction (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [My Notes](#notes)
* [Introduction: Business Problem](#introduction)
* [Data](#data)

---

## My Notes <a name="notes"></a>

#### Problem:

Say you are driving to another city for work or to visit some friends. It is rainy and windy, and on the way, you come across a terrible traffic jam on the other side of the highway. Long lines of cars barely moving. 

As you keep driving, police cars start appearing in the distance, shutting down the highway. Oh, it is an accident, and there's a helicopter transporting those involved in the crash to the nearest hospital. They must be in critical condition for all of this to be happening. 

__Now, wouldn't it be great if there were something in place that could warn you, given the weather and the road conditions, about the possibility of you getting into a car accident and how severe it would be, so that you would drive more carefully or even change your travel plans if you are able to?__

__*Thoughts about this problem:*__
- The brief talks about using the output from 'some system' to change driver behaviour prior to travel, such as to drive 'more carefully' or 'change your travel'. This suggests predicting the severity of a crash before it happens.
  - __This would impact the choice of features to use in the model__
  - We couldn't use features that would not be available prior to a crash, such as data from police and/or hospital reports.
- The request talks of the "possibility of you getting into a car accident" and "how severe it would be". This suggests that two outputs are needed: one for the probability of getting into a car accident and a second for what the severity/consequence could be.
  - Severity (from sample data) is categorical and binary
  - Severity (from VicRoads data) is categorical and multi-class

#### Data Selection Criteria:

For this week, your main task is to decide whether you want to use the shared data or find your own dataset. If you choose to find your own dataset from the resources suggested in the Week 1 video, it should meet the following criteria: 

1. __The target or label column should be accident "severity" in terms of human fatality, traffic delay, property damage, or any other type of adverse accident impact.__ 
2. The machine learning model should be able to predict accident "severity".
3. To build a good model, the dataset should be rich and contain many observations (rows) and various attributes (columns).

*Logistic Regression:*

Although logistic regression is best suited to binary classification, it can also be applied to multiclass problems (classification tasks with three or more classes). You accomplish this by applying a “one vs. all” strategy.

What’s a “one vs. all” strategy? Say the instances in your dataset can fall into three different classes. You can then treat the task as three separate binary classification problems.

For instance, you would train a classifier on just the examples belonging to Class A vs. all the examples belonging to all other classes. You would then do the same thing for Class B, and finally for Class C.
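
The one vs. all idea described above can be sketched with scikit-learn. This is a minimal illustration, not part of the original notes: it assumes scikit-learn is installed, and the three synthetic clusters produced by `make_blobs` merely stand in for Class A, B and C.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data standing in for classes A, B and C (illustrative only).
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# OneVsRestClassifier fits one binary logistic regression per class:
# "this class" vs. "every other class".
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print("Binary classifiers trained:", len(ovr.estimators_))

# Each row holds one score per class, rescaled so the row sums to 1.
proba = ovr.predict_proba(X[:5])
print(proba.round(3))

# The predicted class is the one whose binary classifier is most confident.
predicted = ovr.predict(X[:5])
print(predicted)
```

Wrapping the estimator explicitly makes the three underlying binary models visible through `ovr.estimators_`, which is handy when inspecting how each class is separated from the rest.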

---

## Introduction: Business Problem <a name="introduction"></a>

#### Report Instructions:

_Clearly define a problem or an idea of your choice. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem._

_The initial phase is to understand the project's objective from the business or application perspective. Then, you need to translate this knowledge into a machine learning problem with a preliminary plan to achieve the objectives._

#### Business Problem:

Problem Brief:  
Now, wouldn't it be great if there were something in place that could warn you, given the weather and the road conditions, about the possibility of you getting into a car accident and how severe it would be, so that you would drive more carefully or even change your travel plans if you are able to?

Target Audience:  
Drivers who are planning a trip and are concerned that current driving conditions might increase the risk of a car crash. A driver can use the warnings to sharpen focus and concentration or to adjust route planning. 

System Output:  
Given specific driving conditions in Victoria, Australia, predict the possibility of the driver getting into an accident and how severe that accident would be.

Geographical Focus:  
Analysis will focus on Victoria, Australia, using data from the local road authority, VicRoads. 

*Assumptions and Thinking:*  
- The brief talks about using the output from 'some system' to change driver behaviour prior to travel, such as to drive 'more carefully' or 'change your travel'. This suggests predicting the severity of a crash before it happens.
  - __This would impact the choice of features to use in the model__
  - We couldn't use features that would not be available prior to a crash, such as data from police and/or hospital reports.
- The request talks of the "possibility of you getting into a car accident" and "how severe it would be". This suggests that two outputs are needed: one for the probability of getting into a car accident and a second for what the severity/consequence could be. 
  - Severity (from sample data) is categorical and binary
  - Severity (from VicRoads data) is categorical and multi-class

---

## Data <a name="data"></a>

#### Report Instructions:
_Describe the data that you will be using to solve the problem or execute your idea. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using._

_In this phase, you need to collect or extract the dataset from various sources such as a CSV file or SQL database. Then, you need to determine the attributes (columns) that you will use to train your machine learning model. Also, you will assess the condition of the chosen attributes by looking for trends, certain patterns, skewed information, correlations, and so on._

##### Data Understanding:
The data required for this stage should contain only details that can be obtained prior to travel. Historical crash data also contains details that become available only through police investigations and hospital reports. Those details can't be used in the model because they won't be available at the time of driving.
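
A minimal sketch of that filtering step is shown here on a toy frame. The column names are taken from the VicRoads export header, but the choice of which columns count as post-crash (`ACCIDENT_TYPE` and `DCA_CODE`) is an assumption made for illustration and would need checking against the dataset's metadata.

```python
import pandas as pd

# Toy frame reusing a few column names from the VicRoads export.
toy_df = pd.DataFrame({
    "ACCIDENT_DATE": ["1/7/2013", "2/7/2013"],
    "ACCIDENT_TIME": ["18.30.00", "16.40.00"],
    "DAY_OF_WEEK": ["Monday", "Tuesday"],
    "ACCIDENT_TYPE": ["Struck Pedestrian", "Collision with vehicle"],
    "DCA_CODE": ["PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIGHT.", "PARKED VEHICLES ONLY"],
})

# ASSUMPTION: these columns describe the crash itself, so they are only known afterwards.
post_crash_columns = ["ACCIDENT_TYPE", "DCA_CODE"]

# errors="ignore" skips any listed column that is absent from the frame.
pre_travel_df = toy_df.drop(columns=post_crash_columns, errors="ignore")
print(pre_travel_df.columns.tolist())
```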

##### Possible Detail / Features of Interest:  
- Driver and passenger details (e.g. age, gender, # of occupants)
- Date and time of travel (e.g. time of day, weekend/weekday, public holiday)
- Vehicle details (engine type, age, safety rating)
- Weather conditions
- Lighting conditions
- Road conditions (sealed, unsealed)
- Location / Address

##### Modelling Notes:
- A probabilistic classification model (such as logistic regression) could achieve the desired output, since it returns a probability for each severity class rather than only a label.
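
To make the modelling note concrete, the sketch below fits a logistic regression on a tiny invented dataset and reads class probabilities for one new trip. Everything in it is hypothetical: the `weather`, `light` and `serious` columns and their values are made up for illustration and do not come from the VicRoads data. It assumes scikit-learn is installed.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# HYPOTHETICAL toy data: column names and values are invented for illustration.
toy = pd.DataFrame({
    "weather": ["rain", "clear", "rain", "clear", "fog", "clear", "rain", "fog"],
    "light":   ["dark", "day", "dark", "day", "dark", "day", "day", "dark"],
    "serious": [1, 0, 1, 0, 1, 0, 0, 1],
})

# One-hot encode the categorical conditions so the model receives numeric inputs.
X = pd.get_dummies(toy[["weather", "light"]])
y = toy["serious"]

model = LogisticRegression().fit(X, y)

# Encode a new trip the same way, then align it to the training columns.
trip = pd.get_dummies(pd.DataFrame({"weather": ["rain"], "light": ["dark"]}))
trip = trip.reindex(columns=X.columns, fill_value=0)

# predict_proba returns one probability per class, in the order of model.classes_.
proba = model.predict_proba(trip)[0]
print(dict(zip(model.classes_, proba.round(3))))
```

Reindexing the new trip to the training columns matters because one-hot encoding a single row only produces the categories present in that row.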

---

##### Victorian Government | Department of Transport | Open data

Fatal and injury crashes on Victorian roads during the latest five year reporting period. This data allows users to analyse Victorian fatal and injury crash data based on time, location, conditions, crash type, road user type, object hit etc. Road Safety data is provided by VicRoads for educational and research purposes. This data is in Web Mercator (Auxiliary Sphere) projection.

_Crashes Last Five Years (Vicroads Open Data):_  
https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/crashes-last-five-years

_Metadata Information:_  
http://data.vicroads.vic.gov.au/metadata/Crashes_Last_Five_Years%20-%20Open%20Data.html

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

from scipy import stats
from scipy.stats import norm, skew

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

In [2]:
# Analysis will be focused on Victoria/Australia using data from the local road authority 'Vicroads'.
crash_data_filename = "Crashes_Last_Five_Years.csv"

In [15]:
crash_data_df = pd.read_csv(crash_data_filename)

print("Dataset Shape:", crash_data_df.shape)

Dataset Shape: (74908, 63)


This dataset is formatted to include one accident per record/line

In [14]:
crash_data_df.head(5)

| Unnamed: 0 | OBJECTID | ACCIDENT_NO | ABS_CODE | ACCIDENT_STATUS | ACCIDENT_DATE | ACCIDENT_TIME | ALCOHOLTIME | ACCIDENT_TYPE | DAY_OF_WEEK | DCA_CODE | ... | DEG_URBAN_ALL | LGA_NAME_ALL | REGION_NAME_ALL | SRNS | SRNS_ALL | RMA | RMA_ALL | DIVIDED | DIVIDED_ALL | STAT_DIV_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3401744 | T20130013732 | ABS to receive accident | Finished | 1/7/2013 | 18.30.00 | Yes | Struck Pedestrian | Monday | PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIGHT. | ... | MELB_URBAN | MELBOURNE | METROPOLITAN NORTH WEST REGION | | | Local Road | Local Road | Undivided | Undiv | Metro |
| 1 | 3401745 | T20130013736 | ABS to receive accident | Finished | 2/7/2013 | 16.40.00 | No | Collision with vehicle | Tuesday | PARKED VEHICLES ONLY | ... | MELB_URBAN | WHITEHORSE | METROPOLITAN SOUTH EAST REGION | | | Arterial Other | Arterial Other,Local Road | Divided | Div,Undiv | Metro |
| 2 | 3401746 | T20130013737 | ABS to receive accident | Finished | 2/7/2013 | 13.15.00 | No | Collision with a fixed object | Tuesday | RIGHT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE | ... | MELB_URBAN | BRIMBANK | METROPOLITAN NORTH WEST REGION | | | Local Road | Local Road | Undivided | Undiv | Metro |
| 3 | 3401747 | T20130013738 | ABS to receive accident | Finished | 2/7/2013 | 16.45.00 | No | Collision with a fixed object | Tuesday | RIGHT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICLE | ... | RURAL_VICTORIA | MITCHELL | NORTHERN REGION | M | M | Freeway | Freeway | Divided | Div | Country |
| 4 | 3401748 | T20130013739 | ABS to receive accident | Finished | 2/7/2013 | 15.48.00 | No | Collision with vehicle | Tuesday | U TURN | ... | MELBOURNE_CBD,MELB_URBAN | MELBOURNE | METROPOLITAN NORTH WEST REGION | | | Local Road | Local Road | Undivided | Undiv | Metro |
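
The date and time of travel are among the features of interest, and the preview above stores them as text. A possible way to turn them into model-ready features is sketched below. It assumes day/month/year dates and `hour.minute.second` times; that reading is consistent with the `DAY_OF_WEEK` values in the preview, but it remains an assumption. The `HOUR`, `IS_WEEKEND` and `DAY_NAME` column names are introduced here for illustration.

```python
import pandas as pd

# First three date/time pairs copied from the preview above.
sample = pd.DataFrame({
    "ACCIDENT_DATE": ["1/7/2013", "2/7/2013", "2/7/2013"],
    "ACCIDENT_TIME": ["18.30.00", "16.40.00", "13.15.00"],
})

# ASSUMPTION: dates are day/month/year and times are hour.minute.second.
timestamp = pd.to_datetime(
    sample["ACCIDENT_DATE"] + " " + sample["ACCIDENT_TIME"],
    format="%d/%m/%Y %H.%M.%S",
)

sample["HOUR"] = timestamp.dt.hour
sample["IS_WEEKEND"] = timestamp.dt.dayofweek >= 5  # Monday=0 ... Sunday=6
sample["DAY_NAME"] = timestamp.dt.day_name()
print(sample)
```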


In [16]:
print("Dataset Unique Accident Records:", crash_data_df['ACCIDENT_NO'].unique().shape)

Dataset Unique Accident Records: (74908,)


The number of unique `ACCIDENT_NO` values (74,908) equals the number of rows, which confirms that the dataset holds one accident per record/line.
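
The uniqueness check above can also be written with `nunique()` and `is_unique`. The sketch below runs on a toy frame in which one accident number is repeated on purpose, so the repeat is invented for illustration and does not reflect the real data.

```python
import pandas as pd

# Toy frame: the third row repeats an accident number on purpose.
toy_df = pd.DataFrame({"ACCIDENT_NO": ["T20130013732", "T20130013736", "T20130013736"]})

# nunique() counts distinct values; is_unique is True only when nothing repeats.
print(toy_df["ACCIDENT_NO"].nunique(), "unique of", len(toy_df), "rows")
print("One accident per row:", toy_df["ACCIDENT_NO"].is_unique)

# duplicated(keep=False) flags every row that shares its key with another row.
repeats = toy_df[toy_df["ACCIDENT_NO"].duplicated(keep=False)]
print(repeats)
```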

In [17]:
# Handy pandas inspection helpers:
# df.shape
# series.value_counts()
# series.unique()
# df.describe(include='all')

crash_data_df['SEVERITY'].value_counts()

Other injury accident      52032
Serious injury accident    21561
Fatal accident              1314
Non injury accident            1
Name: SEVERITY, dtype: int64
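
The counts above are heavily skewed toward "Other injury accident", and one class has a single record. A short sketch for quantifying that imbalance follows. The counts are copied from the output above; the threshold of 10 records for flagging a class as too rare is an arbitrary assumption for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
severity_counts = pd.Series({
    "Other injury accident": 52032,
    "Serious injury accident": 21561,
    "Fatal accident": 1314,
    "Non injury accident": 1,
}, name="SEVERITY")

# Share of each class as a percentage of all records.
severity_share = (severity_counts / severity_counts.sum() * 100).round(2)
print(severity_share)

# ASSUMPTION: a class with fewer than 10 records is too small to model on its own.
min_records = 10
too_rare = severity_counts[severity_counts < min_records].index.tolist()
print("Classes too rare to model:", too_rare)
```

A check like this can inform later choices such as dropping or merging the rarest class and using class weights or stratified splits when training.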