# Data Science Capstone

This notebook will be used for the IBM Data Science Certification Capstone.

In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
print("Hello Capstone Project Course")

Hello Capstone Project Course


## Introduction: Business Understanding

The data for this project was obtained from the City of Seattle's Open Data Portal. The portal makes data generated by the City openly available to the public in order to improve residents' quality of life, increase transparency, accountability and comparability, promote economic development and research, and improve internal performance management.

The data is updated weekly and can be found at the [Seattle Open GeoData Portal](https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv).
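
A minimal sketch of reading the CSV straight from the portal rather than from a local download. The helper name `load_collisions` is hypothetical, and the sketch assumes the URL above still serves the CSV export. `low_memory=False` asks pandas to infer each column's type from the whole file rather than chunk by chunk, which avoids mixed-type columns. The demonstration runs offline on a two-row stand-in for the real file.

```python
import io

import pandas as pd

# URL copied from the link above; assumed to still serve the CSV export.
COLLISIONS_URL = (
    "https://opendata.arcgis.com/datasets/"
    "5b5c745e0f1f48e7a53acec63a0022ab_0.csv"
)


def load_collisions(source=COLLISIONS_URL):
    """Read the collisions CSV from a URL, a file path, or a file-like object."""
    # low_memory=False: parse the file in one pass so dtype inference is consistent
    return pd.read_csv(source, low_memory=False)


# Offline demonstration with made-up rows (not real records).
sample = io.StringIO("OBJECTID,SEVERITYCODE\n1,2b\n2,1\n")
demo = load_collisions(sample)
print(demo.shape)
```

On the real data, `load_collisions()` with no argument would fetch the file from the portal, so the result may differ from week to week as the data is updated.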

The objective of this project is to explore the data, identify the features that matter most for predicting accident severity, build models that predict severity, and select the best-performing model using several classification metrics. The hope is that the insights gathered can support planning and policy making at the City of Seattle's Department of Transportation and help reduce the number of accidents, especially severe ones.

## Data Understanding

At this stage the data is downloaded from the Seattle Open GeoData Portal in CSV format. Features are explored and selected for training the machine learning models, and the data is preprocessed so that problems such as class imbalance, missing values, and incorrectly entered values are handled and the data is in a form suited to good predictive performance. The aim is to obtain features that reliably predict the severity of an accident.
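
As a rough illustration of the checks mentioned above, the sketch below summarizes the class balance of the target and the share of missing values per column. The helper name `summarize_quality` and the toy frame are hypothetical; on the real data the same calls would be applied to the loaded collisions DataFrame.

```python
import pandas as pd


def summarize_quality(frame, target):
    """Return (class shares of the target, fraction of missing values per column)."""
    # dropna=False keeps missing targets visible as their own class
    class_share = frame[target].value_counts(normalize=True, dropna=False)
    # isna().mean() is the fraction of missing entries in each column
    missing_share = frame.isna().mean().sort_values(ascending=False)
    return class_share, missing_share


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "SEVERITYCODE": ["1", "1", "1", "2", None],
    "WEATHER": ["Clear", None, "Rain", None, "Clear"],
})
class_share, missing_share = summarize_quality(toy, "SEVERITYCODE")
print(class_share)
print(missing_share)
```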

The data attributes are listed below for convenience:

|Attribute|Data Type, Length|Description|
|:-|:-|:-|
|OBJECTID|ObjectID|ESRI unique identifier|
|SHAPE|Geometry|ESRI geometry field|
|INCKEY|Long|A unique key for the incident|
|COLDETKEY|Long|Secondary key for the incident|
|ADDRTYPE|Text, 12|Collision address type: Alley, Block, Intersection|
|INTKEY|Double|Key that corresponds to the intersection associated with a collision|
|LOCATION|Text, 255|Description of the general location of the collision|
|EXCEPTRSNCODE|Text, 10|Not specified|
|EXCEPTRSNDESC|Text, 300|Not specified|
|SEVERITYCODE|Text, 100|A code that corresponds to the severity of the collision: 3—fatality, 2b—serious injury, 2—injury, 1—prop damage, 0—unknown|
|SEVERITYDESC|Text|A detailed description of the severity of the collision|
|COLLISIONTYPE|Text, 300|Collision type|
|PERSONCOUNT|Double|The total number of people involved in the collision|
|PEDCOUNT|Double|The number of pedestrians involved in the collision. This is entered by the state.|
|PEDCYLCOUNT|Double|The number of bicycles involved in the collision. This is entered by the state.|
|VEHCOUNT|Double|The number of vehicles involved in the collision. This is entered by the state.|
|INJURIES|Double|The number of total injuries in the collision. This is entered by the state.|
|SERIOUSINJURIES|Double|The number of serious injuries in the collision. This is entered by the state.|
|FATALITIES|Double|The number of fatalities in the collision. This is entered by the state.|
|INCDATE|Date|The date of the incident.|
|INCDTTM|Text, 30|The date and time of the incident.|
|JUNCTIONTYPE|Text, 300|Category of junction at which collision took place|
|SDOT_COLCODE|Text, 10|A code given to the collision by SDOT.|
|SDOT_COLDESC|Text, 300|A description of the collision corresponding to the collision code.|
|INATTENTIONIND|Text, 1|Whether or not collision was due to inattention. (Y/N)|
|UNDERINFL|Text, 10|Whether or not a driver involved was under the influence of drugs or alcohol.|
|WEATHER|Text, 300|A description of the weather conditions during the time of the collision.|
|ROADCOND|Text, 300|The condition of the road during the collision.|
|LIGHTCOND|Text, 300|The light conditions during the collision.|
|PEDROWNOTGRNT|Text, 1|Whether or not the pedestrian right of way was not granted. (Y/N)|
|SDOTCOLNUM|Text, 10|A number given to the collision by SDOT.|
|SPEEDING|Text, 1|Whether or not speeding was a factor in the collision. (Y/N)|
|ST_COLCODE|Text, 10|A code provided by the state that describes the collision. For more information about these codes, please see the State Collision Code Dictionary.|
|ST_COLDESC|Text, 300|A description that corresponds to the state’s coding designation.|
|SEGLANEKEY|Long|A key for the lane segment in which the collision occurred.|
|CROSSWALKKEY|Long|A key for the crosswalk at which the collision occurred.|
|HITPARKEDCAR|Text, 1|Whether or not the collision involved hitting a parked car. (Y/N)|
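
The severity codes in the table can be turned into readable labels with a small lookup. This is a sketch only: the dictionary is transcribed from the SEVERITYCODE row above, and the names `SEVERITY_LABELS` and `label_severity` are hypothetical. Codes are kept as strings because `2b` is not numeric.

```python
import pandas as pd

# Transcribed from the SEVERITYCODE row of the attribute table.
SEVERITY_LABELS = {
    "3": "fatality",
    "2b": "serious injury",
    "2": "injury",
    "1": "prop damage",
    "0": "unknown",
}


def label_severity(codes):
    """Map SEVERITYCODE strings to labels; missing or unlisted codes become NaN."""
    return codes.map(SEVERITY_LABELS)


# Made-up codes, including a missing value and a code that is not in the table.
labels = label_severity(pd.Series(["1", "2b", "3", None, "9"]))
print(labels)
```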

In [3]:
df = pd.read_csv('C:/Users/david/Downloads/Collisions.csv')

In [4]:
df.shape

(221389, 40)

Initial descriptive statistics show that the data set has 221389 observations and 40 features, including the
dependent variable SEVERITYCODE.  
The preliminary cleanup steps are to:
  1. remove observations with missing values for SEVERITYCODE
  2. remove features ending with DESC, which only provide text descriptions for the variables ending with CODE
  3. remove features that have a large number or percentage of missing values
  4. remove categorical features with high cardinality, since using them for model building may introduce high dimensionality in the data set
  5. remove highly correlated features, since they provide redundant information and may not improve predictive performance
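
The five steps above can be sketched as one helper. This is an illustration under stated assumptions, not the procedure used in the rest of the notebook: the function name `preliminary_cleanup` is hypothetical, and the thresholds (`max_missing`, `max_cardinality`, `max_corr`) are arbitrary placeholders that would need tuning on the real data.

```python
import numpy as np
import pandas as pd


def preliminary_cleanup(frame, target, max_missing=0.4, max_cardinality=20, max_corr=0.9):
    """Apply the five cleanup steps; all thresholds are illustrative assumptions."""
    # Step 1: drop rows where the target is missing
    out = frame.dropna(subset=[target]).reset_index(drop=True)
    # Step 2: drop free-text description columns
    out = out.drop(columns=[c for c in out.columns if c.endswith("DESC")])
    # Step 3: drop columns whose fraction of missing values exceeds max_missing
    missing = out.isna().mean()
    out = out.drop(columns=[c for c in out.columns
                            if c != target and missing[c] > max_missing])
    # Step 4: drop non-numeric columns with more than max_cardinality distinct values
    categorical = [c for c in out.select_dtypes(exclude="number").columns if c != target]
    out = out.drop(columns=[c for c in categorical if out[c].nunique() > max_cardinality])
    # Step 5: for each highly correlated numeric pair, drop the later column
    numeric = [c for c in out.select_dtypes(include="number").columns if c != target]
    corr = out[numeric].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return out.drop(columns=redundant)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "SEVERITYCODE": ["1", "2", None, "1", "2", "1"],
    "SEVERITYDESC": ["a", "b", "c", "a", "b", "a"],
    "SPEEDING": [None, None, None, "Y", None, None],
    "LOCATION": ["A St", "B St", "C St", "D St", "E St", "F St"],
    "WEATHER": ["Clear", "Rain", "Clear", "Rain", "Clear", "Clear"],
    "VEHCOUNT": [1, 2, 3, 4, 5, 6],
    "PERSONCOUNT": [2, 4, 6, 8, 10, 12],
    "PEDCOUNT": [0, 1, 0, 0, 1, 0],
})
cleaned = preliminary_cleanup(toy, "SEVERITYCODE", max_cardinality=3)
print(cleaned.columns.tolist())
```

Step 5 keeps the first column of each correlated pair and drops the later one; which member of a pair to keep is a modelling choice, and this ordering rule is only one simple option.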
  

In [5]:
df.isna().sum()

X                    7471
Y                    7471
OBJECTID                0
INCKEY                  0
COLDETKEY               0
REPORTNO                0
STATUS                  0
ADDRTYPE             3712
INTKEY             149505
LOCATION             4588
EXCEPTRSNCODE      120403
EXCEPTRSNDESC      209610
SEVERITYCODE            1
SEVERITYDESC            0
COLLISIONTYPE       26230
PERSONCOUNT             0
PEDCOUNT                0
PEDCYLCOUNT             0
VEHCOUNT                0
INJURIES                0
SERIOUSINJURIES         0
FATALITIES              0
INCDATE                 0
INCDTTM                 0
JUNCTIONTYPE        11972
SDOT_COLCODE            1
SDOT_COLDESC            1
INATTENTIONIND     191201
UNDERINFL           26210
WEATHER             26420
ROADCOND            26339
LIGHTCOND           26509
PEDROWNOTGRNT      216197
SDOTCOLNUM          94184
SPEEDING           211461
ST_COLCODE           9413
ST_COLDESC          26230
SEGLANEKEY              0
CROSSWALKKEY

In [6]:
# drop every row with NaN in the "SEVERITYCODE" column
df.dropna(subset=["SEVERITYCODE"], axis=0, inplace=True)

# reset the index, because the row(s) with a missing SEVERITYCODE were dropped
df.reset_index(drop=True, inplace=True)

In [7]:
df1 = df.copy()

In [8]:
# dropping features ending with DESC
df1.drop(['EXCEPTRSNDESC','SEVERITYDESC','SDOT_COLDESC','ST_COLDESC'], axis=1, inplace=True)

In [9]:
profile = ProfileReport(df1, title='Pandas Profiling Report')

In [10]:
profile.to_widgets()

In [11]:
df1.columns.values

array(['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS',
       'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'SEVERITYCODE',
       'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT',
       'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'INATTENTIONIND',
       'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT',
       'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'SEGLANEKEY',
       'CROSSWALKKEY', 'HITPARKEDCAR'], dtype=object)

In [12]:
df_selected = df1[['X', 'Y', 'ADDRTYPE', 'SEVERITYCODE']]

In [13]:
df1['ST_COLCODE'].describe()

count     211976
unique        63
top           32
freq       44922
Name: ST_COLCODE, dtype: object

In [14]:
df1['ADDRTYPE'].isna().sum()

3712

In [15]:
data_clean = df1[['X', 'Y', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
                   'SPEEDING', 'SEVERITYCODE', 'UNDERINFL',
                   'PERSONCOUNT', 'VEHCOUNT']]
data_clean.info()

In [None]:
# Convert INCDTTM to datetime; values that cannot be parsed become NaT

df1['INCDTTM'] = pd.to_datetime(df1['INCDTTM'], errors='coerce')

# Extract month, weekday (Monday=0, Sunday=6) and hour information

df1['Month'] = df1['INCDTTM'].dt.month
df1['Weekday'] = df1['INCDTTM'].dt.weekday
df1['Hour'] = df1['INCDTTM'].dt.hour
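
To make the behaviour of the conversion above concrete, the sketch below runs the same calls on three made-up timestamps (hypothetical values, not taken from the data set). With `errors='coerce'`, a string that cannot be parsed becomes `NaT`, so the derived columns are missing for that row. `dt.weekday` counts Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Made-up values: a Monday, a Sunday, and a string that is not a date.
raw = pd.Series(["2020-03-02 17:45:00", "2020-03-08 08:05:00", "not a date"])
parsed = pd.to_datetime(raw, errors="coerce")

parts = pd.DataFrame({
    "Month": parsed.dt.month,
    "Weekday": parsed.dt.weekday,
    "Hour": parsed.dt.hour,
})
print(parts)
```

Because the unparseable row yields missing values, the derived columns come back as floats rather than integers; counting `parsed.isna().sum()` after the conversion is a quick way to see how many rows were affected.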

In [None]:
df1.info()

In [None]:
df1['COLLISIONTYPE'].value_counts()