### Applied Data Science Capstone

#### Week 1 - Introduction and Business Problem
A description of the problem and a discussion of the background. (15 marks)<br>
A description of the data and how it will be used to solve the problem. (15 marks)

#### The Problem & Background

*Given the weather and road conditions, is it possible to warn you how severe a car accident you could be in? Would it make you drive more carefully, or possibly change your travel timing or route?*

The purpose of this project is to attempt to predict the severity of a car accident given certain conditions. 

There are several potential applications for this prediction model. As noted above, understanding the potential severity of an accident under given weather, road and other conditions, before you travel, could influence your travel timing or route. Insurance companies could incentivize drivers to avoid traveling when conditions are associated with more severe incidents, or make preliminary estimates of their monetary exposure in an accident before physical inspections by insurance adjusters, medical evaluations, etc.

A machine learning model will be used to attempt to solve this problem: predicting the severity of a car accident given certain conditions.

The following process will be followed:

- Data will be collected, then explored and visualized to understand the attributes that will be used to train the machine learning model.

- The data will then be prepared for modeling through labeling, transformation, filling in missing values, etc.

- Various algorithms and methods will be selected and applied to build the model, including supervised machine learning techniques.

- The model will be evaluated to ensure the objective is achieved.

- Finally, a report will be developed describing the process and results.

#### The Data - Overview

This project will utilize Seattle, WA vehicle accident data from January 1, 2004 through May 20, 2020. Each record in this dataset represents one vehicle accident during this time period. 

Each accident has a "severity" rating, which is a code that corresponds to the severity of the collision. Other attributes for each accident are included, such as incident location, weather, light and road conditions at the time of the incident, the number of vehicles involved, whether a pedestrian was involved, etc.

For example, the weather attribute has the following values:
- Clear
- Raining
- Overcast
- Unknown
- Snowing
- Other
- Fog/Smog/Smoke
- Sleet/Hail/Freezing Rain
- Blowing Sand/Dirt
- Severe Crosswind
- Partly Cloudy

And the road condition attribute has the following values:
- Dry
- Wet
- Unknown
- Ice
- Snow/Slush
- Other
- Standing Water
- Sand/Mud/Dirt
- Oil

Complete metadata for this dataset is provided <a href='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf'>here</a>.

The purpose of this exercise is to develop a model which will predict the "severity" of an accident (the target/dependent variable) based upon other characteristics surrounding the collision (the independent variables).

Not all of these incident attributes will be used in the model.  The remainder of this notebook is devoted to data exploration and making initial determinations of what data appears most useful for the model, what data does not appear useful and should be removed, and other transformations which must take place such as handling missing data, adjusting numerical vs categorical data, and feature engineering.
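
As a rough illustration of the kinds of transformations mentioned above, the sketch below fills missing categorical values and one-hot encodes them on a tiny hand-made frame. The sample rows, the `'Unknown'` fill value, and the choice of `pd.get_dummies` are assumptions for illustration, not the steps this notebook ultimately takes.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample using category labels from the lists above
sample_df = pd.DataFrame({
    'WEATHER': ['Clear', 'Raining', np.nan, 'Overcast'],
    'ROADCOND': ['Dry', 'Wet', 'Wet', np.nan],
})

# Handle missing data: treat a missing category as 'Unknown' (an assumption)
filled_df = sample_df.fillna('Unknown')

# Categorical to numerical: one indicator column per category value
encoded_df = pd.get_dummies(filled_df, columns=['WEATHER', 'ROADCOND'])

print(encoded_df.columns.tolist())
```

One-hot encoding keeps the categories unordered, which suits attributes like weather where no natural ranking exists.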


#### Data exploration

In [1]:
# Libraries
import pandas as pd
import numpy as np

In [18]:
# load the data into a dataframe which had been downloaded locally from:
# https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

collision_df = pd.read_csv('Data-Collisions.csv')
collision_df.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [11]:
# examine column types
collision_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  
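
The non-null counts above already hint at which columns are sparsely populated (for example, `INTKEY` has 65070 non-null values out of 194673 rows). A minimal sketch of computing the missing share per column follows; the five-row frame is a stand-in built from the `head()` output shown earlier, and `missing_share` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Stand-in frame: the ADDRTYPE and INTKEY values of the first five rows shown above
sample_df = pd.DataFrame({
    'ADDRTYPE': ['Intersection', 'Block', 'Block', 'Block', 'Intersection'],
    'INTKEY': [37475.0, np.nan, np.nan, np.nan, 34387.0],
})

# Fraction of missing values per column, largest first
missing_share = sample_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# The same arithmetic applied to the full-data counts reported by info()
intkey_missing_share = 1 - 65070 / 194673
print(round(intkey_missing_share, 3))
```

On the full data, the same `isna().mean()` call would be applied to `collision_df` directly.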

In [19]:
# Examine some columns which, per the metadata, do not appear useful for prediction:
# * OBJECTID
# * INCKEY
# * COLDETKEY
# * REPORTNO
# * STATUS
# * EXCEPTRSNCODE
# * EXCEPTRSNDESC
# * SEVERITYCODE.1
# * SDOTCOLNUM
# * ST_COLCODE
# * ST_COLDESC
# * SEGLANEKEY

collision_df[['OBJECTID','INCKEY','COLDETKEY','REPORTNO','STATUS','EXCEPTRSNCODE','EXCEPTRSNDESC','SEVERITYCODE.1','SDOTCOLNUM','ST_COLCODE','ST_COLDESC','SEGLANEKEY']].head(10)


Unnamed: 0,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE.1,SDOTCOLNUM,ST_COLCODE,ST_COLDESC,SEGLANEKEY
0,1,1307,1307,3502005,Matched,,,2,,10,Entering at angle,0
1,2,52200,52200,2607959,Matched,,,1,6354039.0,11,From same direction - both going straight - bo...,0
2,3,26700,26700,1482393,Matched,,,1,4323031.0,32,One parked--one moving,0
3,4,1144,1144,3503937,Matched,,,1,,23,From same direction - all others,0
4,5,17700,17700,1807429,Matched,,,2,4028032.0,10,Entering at angle,0
5,6,320840,322340,E919477,Matched,,,1,,10,Entering at angle,0
6,7,83300,83300,3282542,Matched,,,1,8344002.0,10,Entering at angle,0
7,9,330897,332397,EA30304,Matched,,,2,,5,Vehicle Strikes Pedalcyclist,6855
8,10,63400,63400,2071243,Matched,,,1,6166014.0,32,One parked--one moving,0
9,12,58600,58600,2072105,Matched,,,2,6079001.0,10,Entering at angle,0


In [20]:
# drop columns that do not appear useful for prediction
collision_df.drop(['OBJECTID','INCKEY','COLDETKEY','REPORTNO','STATUS','EXCEPTRSNCODE','EXCEPTRSNDESC','SEVERITYCODE.1','SDOTCOLNUM','ST_COLCODE','ST_COLDESC','SEGLANEKEY'], axis = 1, inplace = True)
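
Before dropping `SEVERITYCODE.1`, it may be worth confirming that it really duplicates `SEVERITYCODE`. The sketch below does this on a stand-in frame holding the first five values of both columns as displayed above; on the full data the same comparison would run against `collision_df` before the drop.

```python
import pandas as pd

# Stand-in frame: first five rows of the two severity columns shown above
sample_df = pd.DataFrame({
    'SEVERITYCODE': [2, 1, 1, 1, 2],
    'SEVERITYCODE.1': [2, 1, 1, 1, 2],
})

# True only if every row agrees, in which case one column is redundant
is_duplicate = bool((sample_df['SEVERITYCODE'] == sample_df['SEVERITYCODE.1']).all())
print(is_duplicate)

# drop() without inplace returns a new frame and leaves the original untouched
deduped_df = sample_df.drop(columns=['SEVERITYCODE.1'])
```

Returning a new frame instead of using `inplace=True` makes it easier to re-run a cell without hitting a "column not found" error.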


In [21]:
collision_df.head()

Unnamed: 0,SEVERITYCODE,X,Y,ADDRTYPE,INTKEY,LOCATION,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,...,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,Intersection,37475.0,5TH AVE NE AND NE 103RD ST,Injury Collision,Angles,2,0,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,N,Overcast,Wet,Daylight,,,0,N
1,1,-122.347294,47.647172,Block,,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Property Damage Only Collision,Sideswipe,2,0,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE ...",,0,Raining,Wet,Dark - Street Lights On,,,0,N
2,1,-122.33454,47.607871,Block,,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Property Damage Only Collision,Parked Car,4,0,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",,0,Overcast,Dry,Daylight,,,0,N
3,1,-122.334803,47.604803,Block,,2ND AVE BETWEEN MARION ST AND MADISON ST,Property Damage Only Collision,Other,3,0,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,N,Clear,Dry,Daylight,,,0,N
4,2,-122.306426,47.545739,Intersection,34387.0,SWIFT AVE S AND SWIFT AV OFF RP,Injury Collision,Angles,2,0,...,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,0,Raining,Wet,Daylight,,,0,N
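
In the rows above, `UNDERINFL` appears as both `N` and `0`, which suggests the same answer is encoded in two ways. A minimal sketch of normalizing such a flag to 0/1 follows; the `Y` and `1` entries in the mapping are assumptions, since only the negative values are visible here.

```python
import pandas as pd

# UNDERINFL values of the five rows shown above
underinfl = pd.Series(['N', '0', '0', 'N', '0'], name='UNDERINFL')

# Map both encodings onto a single 0/1 flag; 'Y' and '1' are assumed positives.
# astype(str) guards against the raw column holding integers as well as strings.
flag_map = {'N': 0, '0': 0, 'Y': 1, '1': 1}
underinfl_flag = underinfl.astype(str).map(flag_map)

print(underinfl_flag.tolist())
```

Values not covered by the mapping, including missing entries, come out as `NaN` and can then be handled with the other missing data.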


In [15]:
# SEVERITY
# A code that corresponds to the severity of the collision:
#   3  - fatality
#   2b - serious injury
#   2  - injury
#   1  - prop damage
#   0  - unknown
collision_df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
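
Only codes 1 and 2 occur in this dataset, and they are imbalanced. The arithmetic below uses the counts printed above to compute each class share and the accuracy of a trivial baseline that always predicts the majority class; any model built later should be compared against that number. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
severity_counts = pd.Series({1: 136485, 2: 58188})

# Share of each class; the same result comes from value_counts(normalize=True)
severity_share = severity_counts / severity_counts.sum()
print(severity_share.round(3))

# Accuracy of always predicting the most frequent class
baseline_accuracy = severity_share.max()
print(round(baseline_accuracy, 3))
```

Because roughly seven in ten records are property-damage-only collisions, plain accuracy can look good even for a model that never predicts an injury.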

In [17]:
collision_df['ST_COLCODE'].value_counts()

32    27612
10    23427
14    16883
32    16809
10    11247
      ...  
54        1
43        1
43        1
87        1
60        1
Name: ST_COLCODE, Length: 115, dtype: int64
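
In the output above, codes such as 32 and 10 each appear twice, which usually means the column holds a mix of types (for example integers and numeric strings) that `value_counts()` treats as distinct keys. That reading is an assumption; the sketch below reproduces the effect on a small made-up series and shows one way to normalize it with `pd.to_numeric`.

```python
import pandas as pd

# Made-up series mixing integer and string encodings of the same codes
mixed_codes = pd.Series([32, '32', 10, '10', 14], dtype=object)

# Mixed types are counted separately: 32 and '32' are different keys
mixed_counts = mixed_codes.value_counts()
print(mixed_counts)

# Coerce everything to numbers; unparseable entries would become NaN
clean_codes = pd.to_numeric(mixed_codes, errors='coerce')
clean_counts = clean_codes.value_counts()
print(clean_counts)
```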

In [18]:
collision_df['COLLISIONTYPE'].value_counts()

Parked Car    47987
Angles        34674
Rear Ended    34090
Other         23703
Sideswipe     18609
Left Turn     13703
Pedestrian     6608
Cycles         5415
Right Turn     2956
Head On        2024
Name: COLLISIONTYPE, dtype: int64

In [21]:
print("Earliest incident date: ", collision_df['INCDATE'].min())
print("Latest incident date: ", collision_df['INCDATE'].max())

Earliest incident date:  2004/01/01 00:00:00+00
Latest incident date:  2020/05/20 00:00:00+00
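
The min/max above works on the raw strings only because the `YYYY/MM/DD` layout happens to sort chronologically. A minimal sketch of parsing the date portion explicitly, and deriving calendar features that might later feed the model, is shown below; the two-row series and the derived column names are illustrative.

```python
import pandas as pd

# The two INCDATE strings printed above
incdate = pd.Series(['2004/01/01 00:00:00+00', '2020/05/20 00:00:00+00'])

# Keep the first 10 characters (the date) and parse them with an explicit format
parsed = pd.to_datetime(incdate.str.slice(0, 10), format='%Y/%m/%d')

# Calendar features (hypothetical names)
features = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
    'dayofweek': parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(parsed.min(), parsed.max())
print(features)
```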


In [22]:
# Weather
collision_df['WEATHER'].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [23]:
# Road Conditions
collision_df['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [24]:
# Light Conditions
collision_df['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64
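
Several categories in the three condition columns above are very rare, and `Unknown` carries no information about the actual conditions. One option, sketched here as an assumption rather than the approach this notebook settles on, is to fold categories below a frequency threshold into `Other`. The 1% threshold and the names `rare_labels` and `collapse_rare` are hypothetical.

```python
import pandas as pd

# WEATHER counts copied from the value_counts() output above
weather_counts = pd.Series({
    'Clear': 111135, 'Raining': 33145, 'Overcast': 27714, 'Unknown': 15091,
    'Snowing': 907, 'Other': 832, 'Fog/Smog/Smoke': 569,
    'Sleet/Hail/Freezing Rain': 113, 'Blowing Sand/Dirt': 56,
    'Severe Crosswind': 25, 'Partly Cloudy': 5,
})

# Labels whose share of the counted records falls below the (assumed) 1% threshold
share = weather_counts / weather_counts.sum()
rare_labels = set(share[share < 0.01].index)

def collapse_rare(values: pd.Series) -> pd.Series:
    """Replace rare labels with 'Other', leaving everything else unchanged."""
    return values.where(~values.isin(rare_labels), 'Other')

sample = pd.Series(['Clear', 'Snowing', 'Raining', 'Partly Cloudy'])
print(collapse_rare(sample).tolist())
```

Collapsing rare labels keeps the number of indicator columns small after encoding, at the cost of losing distinctions such as snow versus fog.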

In [4]:
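# Correlation matrix
# Restrict to numeric columns explicitly; recent pandas versions no longer
# drop text columns silently in corr()

collision_df.select_dtypes(include='number').corr()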

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
SEVERITYCODE,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
X,0.010309,1.0,-0.160262,0.009956,0.010309,0.0103,0.120754,0.010309,0.012887,0.011304,-0.001752,-0.012168,0.010904,-0.001016,-0.001618,0.013586
Y,0.017737,-0.160262,1.0,-0.023848,-0.027396,-0.027415,-0.114935,0.017737,-0.01385,0.010178,0.026304,0.017058,-0.019694,-0.006958,0.004618,0.009508
OBJECTID,0.020131,0.009956,-0.023848,1.0,0.946383,0.945837,0.046929,0.020131,-0.062333,0.024604,0.034432,-0.09428,-0.037094,0.969276,0.028076,0.056046
INCKEY,0.022065,0.010309,-0.027396,0.946383,1.0,0.999996,0.048524,0.022065,-0.0615,0.024918,0.031342,-0.107528,-0.027617,0.990571,0.019701,0.048179
COLDETKEY,0.022079,0.0103,-0.027415,0.945837,0.999996,1.0,0.048499,0.022079,-0.061403,0.024914,0.031296,-0.107598,-0.027461,0.990571,0.019586,0.048063
INTKEY,0.006553,0.120754,-0.114935,0.046929,0.048524,0.048499,1.0,0.006553,0.001886,-0.004784,0.000531,-0.012929,0.007114,0.032604,-0.01051,0.01842
SEVERITYCODE.1,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
PERSONCOUNT,0.130949,0.012887,-0.01385,-0.062333,-0.0615,-0.061403,0.001886,0.130949,1.0,-0.023464,-0.038809,0.380523,-0.12896,0.011784,-0.021383,-0.032258
PEDCOUNT,0.246338,0.011304,0.010178,0.024604,0.024918,0.024914,-0.004784,0.246338,-0.023464,1.0,-0.01692,-0.261285,0.260393,0.021461,0.00181,0.565326
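
To read the matrix above more easily, the correlations with the target can be ranked by absolute value. The sketch below copies the `SEVERITYCODE` row from the output (leaving out the target itself and its duplicate `SEVERITYCODE.1`); with the full matrix the same ranking would come from the `SEVERITYCODE` column of the `corr()` result. Identifier columns such as `INCKEY` are included only because they appear in the output, not because they are meaningful predictors.

```python
import pandas as pd

# SEVERITYCODE row of the correlation matrix shown above
target_corr = pd.Series({
    'X': 0.010309, 'Y': 0.017737, 'OBJECTID': 0.020131, 'INCKEY': 0.022065,
    'COLDETKEY': 0.022079, 'INTKEY': 0.006553, 'PERSONCOUNT': 0.130949,
    'PEDCOUNT': 0.246338, 'PEDCYLCOUNT': 0.214218, 'VEHCOUNT': -0.054686,
    'SDOT_COLCODE': 0.188905, 'SDOTCOLNUM': 0.004226, 'SEGLANEKEY': 0.104276,
    'CROSSWALKKEY': 0.175093,
})

# Rank features by the strength (absolute value) of their linear correlation
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked.head(5))
```

Even the strongest value here is about 0.25, so no single numeric column is strongly linearly related to severity; the categorical columns are not covered by this matrix at all.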


In [3]:
# Visualization of the correlation matrix

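import seaborn as sns
import matplotlib.pyplot as plt

# Restrict to numeric columns explicitly; recent pandas versions no longer
# drop text columns silently in corr()
sns.heatmap(collision_df.select_dtypes(include='number').corr(), annot=True)
plt.show()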

ImportError: dlopen(/Users/chrisriper/anaconda3/lib/python3.6/site-packages/scipy/spatial/qhull.cpython-36m-darwin.so, 2): Library not loaded: @rpath/libopenblas.dylib
  Referenced from: /Users/chrisriper/anaconda3/lib/python3.6/site-packages/scipy/spatial/qhull.cpython-36m-darwin.so
  Reason: image not found

In [7]:
corr = collision_df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
SEVERITYCODE,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
X,0.010309,1.0,-0.160262,0.009956,0.010309,0.0103,0.120754,0.010309,0.012887,0.011304,-0.001752,-0.012168,0.010904,-0.001016,-0.001618,0.013586
Y,0.017737,-0.160262,1.0,-0.023848,-0.027396,-0.027415,-0.114935,0.017737,-0.01385,0.010178,0.026304,0.017058,-0.019694,-0.006958,0.004618,0.009508
OBJECTID,0.020131,0.009956,-0.023848,1.0,0.946383,0.945837,0.046929,0.020131,-0.062333,0.024604,0.034432,-0.09428,-0.037094,0.969276,0.028076,0.056046
INCKEY,0.022065,0.010309,-0.027396,0.946383,1.0,0.999996,0.048524,0.022065,-0.0615,0.024918,0.031342,-0.107528,-0.027617,0.990571,0.019701,0.048179
COLDETKEY,0.022079,0.0103,-0.027415,0.945837,0.999996,1.0,0.048499,0.022079,-0.061403,0.024914,0.031296,-0.107598,-0.027461,0.990571,0.019586,0.048063
INTKEY,0.006553,0.120754,-0.114935,0.046929,0.048524,0.048499,1.0,0.006553,0.001886,-0.004784,0.000531,-0.012929,0.007114,0.032604,-0.01051,0.01842
SEVERITYCODE.1,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
PERSONCOUNT,0.130949,0.012887,-0.01385,-0.062333,-0.0615,-0.061403,0.001886,0.130949,1.0,-0.023464,-0.038809,0.380523,-0.12896,0.011784,-0.021383,-0.032258
PEDCOUNT,0.246338,0.011304,0.010178,0.024604,0.024918,0.024914,-0.004784,0.246338,-0.023464,1.0,-0.01692,-0.261285,0.260393,0.021461,0.00181,0.565326
