# Capstone Project - Auto Accident Prediction (Week 2)
## Applied Data Science Capstone by IBM/Coursera

This notebook will be used for the Applied Data Science Capstone project.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Say you are driving to another city for work or to visit some friends. It is rainy and windy. On the way to your destination, you come across a terrible traffic jam on the other side of the highway. Long lines of cars are barely moving. As you keep driving, police cars start appearing in the distance, shutting down the highway. There has been an accident, and a helicopter is transporting those involved in the crash to the nearest hospital. The victims must be in critical condition for all of this to be happening.
 
Now, wouldn't it be great if something were in place that could warn you, given the weather and the road conditions, about the possibility of getting into a car accident and how severe it would be? The advance warning could prompt you to drive more carefully or even change your travel plans if you are able to.

## Data <a name="data"></a>

Load the required libraries

In [1]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

### Retrieve The Dataset
The data used to train and evaluate the model is the collision data set from the SDOT Traffic Management Division, Traffic Records Group. It covers collisions from 2004 to the present and is updated weekly. The records are compiled from all collisions provided by the Seattle Police Department and recorded by the Traffic Records Group.


Download the current collision data from <a href="http://data-seattlecitygis.opendata.arcgis.com">Seattle Geo Data</a>.

In [2]:
!wget -O Collisions.csv https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv

--2020-09-02 13:37:21--  https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv
Resolving opendata.arcgis.com (opendata.arcgis.com)... 52.71.112.223, 34.202.76.40, 54.152.131.176, ...
Connecting to opendata.arcgis.com (opendata.arcgis.com)|52.71.112.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘Collisions.csv’

    [     <=>                               ] 84,814,898  84.6MB/s   in 1.0s   

2020-09-02 13:37:23 (84.6 MB/s) - ‘Collisions.csv’ saved [84814898]



### Load Data from CSV file
The data has unlabeled extra columns, which will cause an error if not accounted for. The _OBJECTID_ is used as the index for this dataset.

In [3]:
cols = pd.read_csv('Collisions.csv', nrows=1).columns
df = pd.read_csv('Collisions.csv', usecols=cols, index_col=2)
df.head()

| OBJECTID | X | Y | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | LOCATION | EXCEPTRSNCODE | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -122.288688 | 47.532714 | 29800 | 29800 | 1177964 | Unmatched | Block | | BEACON ER AVE S BETWEEN S PORTLAND ST AND S CH... | | ... | | | | 4315006.0 | | | | 0 | 0 | N |
| 2 | | | 115700 | 115700 | 10097005 | Unmatched | | | | NEI | ... | | | | 10097005.0 | | | | 0 | 0 | N |
| 3 | -122.355556 | 47.727318 | 1358 | 1358 | 3568600 | Matched | Block | | GREENWOOD AVE N BETWEEN N 134TH ST AND N 136TH ST | | ... | Dry | Daylight | | | | 28.0 | From opposite direction - one left turn - one ... | 0 | 0 | N |
| 4 | -122.317563 | 47.618764 | 70700 | 70700 | 2806057 | Matched | Block | | E DENNY WAY BETWEEN 11TH AVE AND 12TH AVE | | ... | Dry | Other | | 7065007.0 | | 32.0 | One parked--one moving | 0 | 0 | N |
| 5 | -122.361015 | 47.538551 | 53600 | 53600 | 2127310 | Matched | Block | | DELRIDGE WAY SW BETWEEN SW MYRTLE ST AND SW OR... | | ... | Dry | Daylight | | 6137017.0 | | 14.0 | From same direction - both going straight - on... | 0 | 0 | N |
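
The positional `index_col=2` works because _OBJECTID_ happens to be the third column in the file. A minimal sketch of a name-based alternative follows. The `sample_csv` string is an illustrative stand-in built from a few of the columns shown above, not the real file, and it does not reproduce the unlabeled extra columns.

```python
import io
import pandas as pd

# Illustrative stand-in for Collisions.csv (assumption: only a few columns).
sample_csv = io.StringIO(
    "X,Y,OBJECTID,INCKEY,STATUS\n"
    "-122.355556,47.727318,3,1358,Matched\n"
    "-122.317563,47.618764,4,70700,Matched\n"
)

# Selecting the index by name does not depend on the column order.
sample = pd.read_csv(sample_csv, index_col='OBJECTID')
print(sample.index.name)
print(list(sample.columns))
```

If the column order in the download ever changes, the name-based form would keep pointing at _OBJECTID_, whereas the positional form would silently pick a different column.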


### Preprocess The Data

Normalize the data and fill in missing values where it makes sense. Display the frequency tables for various features to help determine which features to use.

In [4]:
df['ADDRTYPE'] = df['ADDRTYPE'].fillna('Unknown')
print("\nAddress Type:\n", df['ADDRTYPE'].value_counts())

df['WEATHER'] = df['WEATHER'].fillna('Unknown')
print("\nWeather:\n", df['WEATHER'].value_counts())

df['LIGHTCOND'] = df['LIGHTCOND'].fillna('Unknown')
print("\nLight Conditions:\n", df['LIGHTCOND'].value_counts())

df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')
print("\nRoad Conditions:\n", df['ROADCOND'].value_counts())

df['JUNCTIONTYPE'] = df['JUNCTIONTYPE'].fillna('Unknown')
print("\nJunction Type:\n", df['JUNCTIONTYPE'].value_counts())

# treat a blank record as N
df['INATTENTIONIND'] = df['INATTENTIONIND'].fillna('N')
print("\nInattention Indicator:\n", df['INATTENTIONIND'].value_counts())

# treat a blank record as N, a 0 as N and a 1 as Y
df['UNDERINFL'] = df['UNDERINFL'].fillna('N')
df['UNDERINFL'] = df['UNDERINFL'].replace(['0','1'],['N','Y'])
print("\nUnder Influence:\n", df['UNDERINFL'].value_counts())

# treat a blank record as N, a 0 as N and a 1 as Y
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].fillna('N')
df['PEDROWNOTGRNT'] = df['PEDROWNOTGRNT'].replace(['0','1'],['N','Y'])
print("\nPedestrian Not Granted:\n", df['PEDROWNOTGRNT'].value_counts())

# treat a blank record as N, a 0 as N and a 1 as Y
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df['SPEEDING'] = df['SPEEDING'].replace(['0','1'],['N','Y'])
print("\nSpeeding:\n", df['SPEEDING'].value_counts())

print("\nHit Parked Car:\n", df['HITPARKEDCAR'].value_counts())



Address Type:
 Block           144784
Intersection     71774
Unknown           3712
Alley              874
Name: ADDRTYPE, dtype: int64

Weather:
 Clear                       114342
Unknown                      41724
Raining                      34019
Overcast                     28504
Snowing                        919
Other                          851
Fog/Smog/Smoke                 577
Sleet/Hail/Freezing Rain       116
Blowing Sand/Dirt               56
Severe Crosswind                26
Partly Cloudy                    9
Blowing Snow                     1
Name: WEATHER, dtype: int64

Light Conditions:
 Daylight                    119149
Dark - Street Lights On      50048
Unknown                      40201
Dusk                          6074
Dawn                          2599
Dark - No Street Lights       1573
Dark - Street Lights Off      1236
Other                          244
Dark - Unknown Lighting         20
Name: LIGHTCOND, dtype: int64

Road Conditions:
 Dry               12
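
The fill-and-recode steps for the Y/N indicator columns repeat the same two calls. As a hedged sketch, they could be wrapped in a small helper. The name `normalize_flag` and the `sample` frame are hypothetical and exist only for illustration; the real notebook applies the calls directly to `df`.

```python
import numpy as np
import pandas as pd

def normalize_flag(series: pd.Series) -> pd.Series:
    """Fill missing values with 'N' and map '0'/'1' to 'N'/'Y'."""
    return series.fillna('N').replace({'0': 'N', '1': 'Y'})

# Hypothetical sample mimicking the mixed codes in the indicator columns.
sample = pd.DataFrame({
    'UNDERINFL': ['0', '1', 'N', 'Y', np.nan],
    'SPEEDING': [np.nan, 'Y', np.nan, np.nan, 'Y'],
})

for col in ['UNDERINFL', 'SPEEDING']:
    sample[col] = normalize_flag(sample[col])

print(sample)
```

Treating a blank as `N` is the same assumption the cell above makes; it cannot distinguish "not recorded" from "no".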

### Assess The Features To Use

Remove rows where the value of _ADDRTYPE_, _WEATHER_, _LIGHTCOND_, or _ROADCOND_ is _Unknown_.

In [5]:
print("Rows before cleaning: ", df.shape)
df = df[~df['ADDRTYPE'].isin(['Unknown'])]
df = df[~df['WEATHER'].isin(['Unknown'])]
df = df[~df['LIGHTCOND'].isin(['Unknown'])]
df = df[~df['ROADCOND'].isin(['Unknown'])]
print("Rows after cleaning: ", df.shape)

Rows before cleaning:  (221144, 39)
Rows after cleaning:  (174633, 39)
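
The four successive filters can also be written as one boolean mask. The sketch below assumes a tiny hypothetical `sample` frame in place of the real `df`; the column names match the ones filtered above.

```python
import pandas as pd

# Hypothetical sample; each of rows 1-3 has 'Unknown' in a different column.
sample = pd.DataFrame({
    'ADDRTYPE': ['Block', 'Unknown', 'Alley', 'Block'],
    'WEATHER': ['Clear', 'Raining', 'Unknown', 'Overcast'],
    'LIGHTCOND': ['Daylight', 'Dusk', 'Dawn', 'Daylight'],
    'ROADCOND': ['Dry', 'Wet', 'Dry', 'Unknown'],
})

condition_cols = ['ADDRTYPE', 'WEATHER', 'LIGHTCOND', 'ROADCOND']

# True for any row where at least one of the columns equals 'Unknown'.
has_unknown = sample[condition_cols].eq('Unknown').any(axis=1)
cleaned = sample[~has_unknown]

print("Rows before cleaning: ", sample.shape)
print("Rows after cleaning: ", cleaned.shape)
```

A single mask keeps the list of filtered columns in one place, which makes it easier to add or remove a column later.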


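Remove the columns that will not be used as features, keeping the coordinates, the severity code, the person and vehicle counts, and the weather, road, and light conditions.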

In [6]:
# columns that will not be used as features
drop_cols = [
    'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY',
    'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYDESC',
    'COLLISIONTYPE', 'PEDCOUNT', 'PEDCYLCOUNT', 'INJURIES',
    'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE',
    'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL',
    'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
    'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR',
]
# reassign instead of inplace=True, since df is itself a filtered result
df = df.drop(columns=drop_cols)
df.head()


| OBJECTID | X | Y | SEVERITYCODE | PERSONCOUNT | VEHCOUNT | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|---|---|---|---|
| 3 | -122.355556 | 47.727318 | 1 | 2 | 2 | Overcast | Dry | Daylight |
| 4 | -122.317563 | 47.618764 | 1 | 2 | 2 | Clear | Dry | Other |
| 5 | -122.361015 | 47.538551 | 1 | 2 | 2 | Clear | Dry | Daylight |
| 6 | -122.386772 | 47.56472 | 2 | 4 | 2 | Clear | Dry | Daylight |
| 7 | -122.365869 | 47.525967 | 1 | 2 | 2 | Clear | Dry | Dark - Street Lights On |
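
Since only eight columns remain, an alternative to listing everything to drop is to list what to keep. This is a sketch under the assumption that the retained columns are exactly the ones shown in the output above; `keep_cols` and `sample` are hypothetical names, and the sample holds two rows copied from the displayed data plus two of the dropped columns.

```python
import pandas as pd

# Columns retained in the output above.
keep_cols = ['X', 'Y', 'SEVERITYCODE', 'PERSONCOUNT', 'VEHCOUNT',
             'WEATHER', 'ROADCOND', 'LIGHTCOND']

# Hypothetical sample with two extra columns that should be discarded.
sample = pd.DataFrame(
    {
        'X': [-122.355556, -122.317563],
        'Y': [47.727318, 47.618764],
        'INCKEY': [1358, 70700],
        'STATUS': ['Matched', 'Matched'],
        'SEVERITYCODE': [1, 1],
        'PERSONCOUNT': [2, 2],
        'VEHCOUNT': [2, 2],
        'WEATHER': ['Overcast', 'Clear'],
        'ROADCOND': ['Dry', 'Dry'],
        'LIGHTCOND': ['Daylight', 'Other'],
    },
    index=pd.Index([3, 4], name='OBJECTID'),
)

# Select by name; .copy() gives an independent frame rather than a view.
features = sample[keep_cols].copy()
print(features.shape)
```

The keep-list form fails loudly with a `KeyError` if an expected column is missing, while the drop-list form would silently retain any new column the source adds.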


In [7]:
df.shape

(174633, 8)

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>