### Predicting the Severity of Automobile Accidents in Seattle, Washington ###

In this first week, you will define your project objectives, find the dataset that you will use for this capstone project, and publish that dataset on GitHub.

In the second week, you will build your machine learning solution.

In the third week, you will finalize your model and be ready to submit your work.

To complete the capstone, you will work on a case study: predicting the severity of an accident.
Wouldn't it be great if something could warn you, given the weather and the road conditions,
about the possibility of getting into a car accident and how severe it would be,
so that you could drive more carefully or even change your travel plans?
Let's use our shared data for Seattle, Washington as an example of how to work with accident data.

In [1]:
# Import packages.
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [2]:
# NOTE: >>> help(pd.options.display. <TAB>
# pd.options.display.chop_threshold      pd.options.display.float_format        pd.options.display.max_info_columns    pd.options.display.notebook_repr_html
# pd.options.display.colheader_justify   pd.options.display.html                pd.options.display.max_info_rows       pd.options.display.pprint_nest_depth
# pd.options.display.column_space        pd.options.display.large_repr          pd.options.display.max_rows            pd.options.display.precision
# pd.options.display.date_dayfirst       pd.options.display.latex               pd.options.display.max_seq_items       pd.options.display.show_dimensions
# pd.options.display.date_yearfirst      pd.options.display.max_categories      pd.options.display.memory_usage        pd.options.display.unicode
# pd.options.display.encoding            pd.options.display.max_columns         pd.options.display.min_rows            pd.options.display.width
# pd.options.display.expand_frame_repr   pd.options.display.max_colwidth        pd.options.display.multi_sparse        

# Create a list of display options.
list_of_display_options_fully_qualified_names = str(\
"pd.options.display.chop_threshold, pd.options.display.float_format, pd.options.display.max_info_columns, pd.options.display.notebook_repr_html, \
pd.options.display.colheader_justify, pd.options.display.html, pd.options.display.max_info_rows, pd.options.display.pprint_nest_depth, \
pd.options.display.column_space, pd.options.display.large_repr, pd.options.display.max_rows, pd.options.display.precision, \
pd.options.display.date_dayfirst, pd.options.display.latex, pd.options.display.max_seq_items, pd.options.display.show_dimensions, \
pd.options.display.date_yearfirst, pd.options.display.max_categories, pd.options.display.memory_usage, pd.options.display.unicode, \
pd.options.display.encoding, pd.options.display.max_columns, pd.options.display.min_rows, pd.options.display.width, \
pd.options.display.expand_frame_repr, pd.options.display.max_colwidth, pd.options.display.multi_sparse").split(sep=', ')

# Build the list of short names for the display options:
# the short name is the part after the last '.' in the fully qualified name.
list_of_display_options_short_names = [
    fully_qualified_option_name.split(sep='.')[-1]
    for fully_qualified_option_name in list_of_display_options_fully_qualified_names
]

# Define dictionary of display option settings, keyed by short option name.
dict_of_display_option_settings_short_names = {
    "max_info_columns": 100,
    "max_info_rows": 200,
    "max_columns": 100,
    "max_rows": 200,
    "precision": 9,
    "max_seq_items": None,
    "show_dimensions": True,
    "max_categories": 1000000,
    "max_colwidth": 300,
    "float_format": lambda x: '%.9f' % x,
}

# Set pandas display options using the dictionary of short names,
# and display the option/value pairs.
print("Setting display options...")
for key, value in dict_of_display_option_settings_short_names.items():
    # Use the fully qualified name: a short name such as "max_rows" can be
    # ambiguous in newer pandas versions that also define styler options.
    full_key = "display." + key
    # Set display option.
    pd.set_option(full_key, value)
    # Print display option name and value.
    print(key, ": ", pd.get_option(full_key), sep='')

Setting display options...
max_info_columns: 100
max_info_rows: 200
max_columns: 100
max_rows: 200
precision: 9
max_seq_items: None
show_dimensions: True
max_categories: 1000000
max_colwidth: 300
float_format: <function <lambda> at 0x7f59c973f4c0>


In [3]:
# Attribute Information URL: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
# Read the Collisions Data CSV file and store it as a DataFrame.
url="https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv"
df=pd.read_csv(url, low_memory=False)

In [4]:
# View the first few rows of the collisions DataFrame.
df.head()

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,-122.320757054,47.609407946,1,328476,329976,EA08706,Matched,Block,,BROADWAY BETWEEN E COLUMBIA ST AND BOYLSTON AVE,,,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/22 00:00:00+00,1/22/2020 3:21:00 PM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,N,Raining,Wet,Dark - Street Lights On,,,,11.0,From same direction - both going straight - both moving - sideswipe,0,0,N
1,-122.319560827,47.662220664,2,328142,329642,EA06882,Matched,Block,,8TH AVE NE BETWEEN NE 45TH E ST AND NE 47TH ST,,,1,Property Damage Only Collision,Parked Car,2,0,0,2,0,0,0,2020/01/07 00:00:00+00,1/7/2020 8:00:00 AM,Mid-Block (not related to intersection),15.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE",,N,Clear,Dry,Daylight,,,,32.0,One parked--one moving,0,0,Y
2,-122.327524508,47.604393273,3,20700,20700,1181833,Unmatched,Block,,JAMES ST BETWEEN 6TH AVE AND 7TH AVE,,,0,Unknown,,0,0,0,0,0,0,0,2004/01/30 00:00:00+00,1/30/2004,Mid-Block (but intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,4030032.0,,,,0,0,N
3,-122.327524934,47.708621579,4,332126,333626,M16001640,Unmatched,Block,,NE NORTHGATE WAY BETWEEN 1ST AVE NE AND NE NORTHGATE DR,,,0,Unknown,,0,0,0,0,0,0,0,2016/01/23 00:00:00+00,1/23/2016,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,,,,,0,0,N
4,-122.292120049,47.55900908,5,328238,329738,3857118,Unmatched,Block,,M L KING JR ER WAY S BETWEEN S ANGELINE ST AND S EDMUNDS ST,,,0,Unknown,,0,0,0,0,0,0,0,2020/01/26 00:00:00+00,1/26/2020,Mid-Block (not related to intersection),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,,,,,,,,,,0,0,N


<h2 id="data_wrangling">Data Wrangling</h2>

Steps for working with missing data:
<ol>
    <li>Identify missing data.</li>
    <li>Deal with missing data.</li>
    <li>Correct data format.</li>
</ol>

<h3 id="identifying_missing_data">Identifying Missing Data</h3>

When the CSV file is read, pandas represents missing values as NaN. We use pandas methods such as `isna()` and the `hasnans` attribute to identify these missing values.

In [5]:
# Test whether the collisions DataFrame has NaN values.
if df.isna().any(axis=None):
    print("DataFrame has NaN.")
else:
    print("DataFrame has no NaN.")

DataFrame has NaN.


In [6]:
# Initialize a list to store the labels for the columns with missing data.
list_of_columns_with_missing_data = list()

# For each column in the collisions DataFrame,
# if the column contains at least one NaN, 
# then add the column's label to the list.
for column in list(df.columns):
    if df[column].hasnans:
        list_of_columns_with_missing_data.append(column)

# Print the number of columns
print("Number of columns: %d" % len(df.columns))
print("List of columns labels:")
print(list(df.columns))
print()
print("Number of columns missing data: %d" % len(list_of_columns_with_missing_data))
print("List of columns missing data:")
print(list_of_columns_with_missing_data)

Number of columns: 40
List of columns labels:
['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']

Number of columns missing data: 22
List of columns missing data:
['X', 'Y', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC']
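The same check can also be written with vectorized pandas calls instead of an explicit loop. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the collisions data): `isna().any()` flags each column that holds at least one NaN, and `isna().sum()` counts the missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the collisions dataset.
toy = pd.DataFrame({
    "X": [-122.3, np.nan, -122.4],
    "STATUS": ["Matched", "Unmatched", "Matched"],
    "WEATHER": ["Clear", None, None],
})

# Boolean Series indexed by column label: True where the column has any NaN.
has_nan = toy.isna().any()

# Labels of the columns with missing data, in the original column order.
columns_with_missing_data = toy.columns[has_nan].tolist()

# Number of missing entries in each column.
missing_counts = toy.isna().sum()

print(columns_with_missing_data)
print(missing_counts)
```

The per-column counts are often more useful than a yes/no flag, because they show how much data each column is actually missing.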


<h3 id="deal_with_missing_data">Deal with Missing Data</h3>

<ol>
    <li>Drop the Data
        <ol>
            <li>Drop entire row.</li>
            <li>Drop entire column.</li>
        </ol>
    </li>
    <li>Replace the Data
        <ol>
            <li>Replace data by mean.</li>
            <li>Replace data by frequency.</li>
            <li>Replace data based on other functions.</li>
        </ol>
    </li>
        
</ol>

Whole columns should be dropped only if a large share of the entries in the column is empty; this notebook uses a threshold of 15%.
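The drop and replace strategies listed above can be sketched as follows. This is only an illustration on a hypothetical frame (`toy`, its column values, and the choice of mean and mode are assumptions for the example); the notebook itself goes on to drop columns and rows rather than replace values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the collisions dataset.
toy = pd.DataFrame({
    "PERSONCOUNT": [2.0, np.nan, 4.0, 3.0],
    "WEATHER": ["Clear", "Raining", None, "Clear"],
})

# Replace missing numeric values with the column mean
# (Series.mean skips NaN by default).
mean_value = toy["PERSONCOUNT"].mean()
filled_by_mean = toy["PERSONCOUNT"].fillna(mean_value)

# Replace missing categorical values with the most frequent value (the mode).
most_frequent = toy["WEATHER"].mode().iloc[0]
filled_by_frequency = toy["WEATHER"].fillna(most_frequent)

# Alternatively, drop every row or every column that contains a NaN.
rows_dropped = toy.dropna(axis="index", how="any")
columns_dropped = toy.dropna(axis="columns", how="any")

print(filled_by_mean.tolist())
print(filled_by_frequency.tolist())
print(rows_dropped.shape, columns_dropped.shape)
```

Replacing by mean only makes sense for numeric data, while replacing by frequency suits categorical data such as weather or road condition.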

In [7]:
### DELETE THIS CELL BEFORE PRODUCTION ###

# Print a list of all the column labels for the collisions DataFrame.
print(list(df.columns))

['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']


In [8]:
### DELETE THIS CELL BEFORE PRODUCTION ###

# NOTE: astype(self: ~FrameOrSeries, dtype, copy: bool = True, errors: str = 'raise') -> ~FrameOrSeries 
#    dtype : data type, or dict of column name -> data type
#        Use a numpy.dtype or Python type to cast entire pandas object to
#        the same type. Alternatively, use {col: dtype, ...}, where col is a
#        column label and dtype is a numpy.dtype or Python type to cast one
#        or more of the DataFrame's columns to column-specific types.

# For each column in the collisions DataFrame:
# (1) print the column's original data type;
# (2) cast the column to categorical type and print a statistical description and
#     the relative frequencies of the categorical data in the column.
for column in list(df.columns):
    print(column, ": original data type: ", df[column].dtype, sep='')
    print("Statistics as type category:")
    print(df[[column]].astype(dtype="category").describe(include="all"))
    print("Relative frequencies as type category:")
    print(df[column].astype(dtype="category").value_counts(normalize=True, dropna=False))
    print()
    print()

X: original data type: float64
Statistics as type category:
                      X
count  213918.000000000
unique  24972.000000000
top      -122.332653349
freq      296.000000000

[4 rows x 1 columns]
Relative frequencies as type category:
nan              0.033746031
-122.332653349   0.001337013
-122.344896079   0.001273776
-122.328078578   0.001246674
-122.344996835   0.001219573
                     ...    
-122.340733024   0.000004517
-122.340691311   0.000004517
-122.277259603   0.000004517
-122.340686208   0.000004517
-122.419091132   0.000004517
Name: X, Length: 24973, dtype: float64


Y: original data type: float64
Statistics as type category:
                      Y
count  213918.000000000
unique  24972.000000000
top        47.708654503
freq      296.000000000

[4 rows x 1 columns]
Relative frequencies as type category:
nan            0.033746031
47.708654503   0.001337013
47.717173101   0.001273776
47.604161235   0.001246674
47.725035552   0.001219573
                   ... 

In [9]:
# Drop any column from the collisions DataFrame if it satisfies at least one of the following conditions:
# 1) more than 15% of the column's data is NaN;
# 2) the column only contains unique identification keys;
# 3) it is unclear how the column's data should be interpreted.

list_of_columns_to_drop = [
    "OBJECTID",
    "INCKEY",
    "COLDETKEY",
    "REPORTNO",
    "INTKEY",
    "EXCEPTRSNCODE",
    "EXCEPTRSNDESC",
    "INATTENTIONIND",
    "PEDROWNOTGRNT",
    "SDOTCOLNUM",
    "SPEEDING",
    "SEGLANEKEY",
    "CROSSWALKKEY",
]

In [10]:
#NOTE: drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
# Drop the selected columns from the collisions DataFrame
# and store the result in a new DataFrame.
df_after_drop_columns = df.drop(columns=list_of_columns_to_drop, inplace=False)
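The first drop criterion (more than 15% of a column's data is NaN) can be checked programmatically. Below is a minimal sketch on a hypothetical frame; `toy`, its values, and the variable names are assumptions for illustration. The other two criteria (identification keys and unclear meaning) still require reading the attribute documentation.

```python
import pandas as pd

# Hypothetical toy data with 10 rows, not taken from the collisions dataset.
toy = pd.DataFrame({
    "SPEEDING": ["Y"] + [None] * 9,                   # 90% missing
    "WEATHER": ["Clear"] * 10,                        # 0% missing
    "ROADCOND": ["Dry"] * 4 + [None] + ["Wet"] * 5,   # 10% missing
})

# Fraction of NaN per column: the mean of the boolean mask.
missing_fraction = toy.isna().mean()

# Labels of columns whose missing fraction exceeds the threshold.
threshold = 0.15
columns_over_threshold = missing_fraction[missing_fraction > threshold].index.tolist()

print(missing_fraction)
print(columns_over_threshold)
```

The resulting list could then be passed to `drop(columns=...)` in the same way as the hand-written list.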

In [11]:
# Print the first few rows of the DataFrame after dropping columns.
df_after_drop_columns.head()

Unnamed: 0,X,Y,STATUS,ADDRTYPE,LOCATION,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,ST_COLCODE,ST_COLDESC,HITPARKEDCAR
0,-122.320757054,47.609407946,Matched,Block,BROADWAY BETWEEN E COLUMBIA ST AND BOYLSTON AVE,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/22 00:00:00+00,1/22/2020 3:21:00 PM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",N,Raining,Wet,Dark - Street Lights On,11.0,From same direction - both going straight - both moving - sideswipe,N
1,-122.319560827,47.662220664,Matched,Block,8TH AVE NE BETWEEN NE 45TH E ST AND NE 47TH ST,1,Property Damage Only Collision,Parked Car,2,0,0,2,0,0,0,2020/01/07 00:00:00+00,1/7/2020 8:00:00 AM,Mid-Block (not related to intersection),15.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE",N,Clear,Dry,Daylight,32.0,One parked--one moving,Y
2,-122.327524508,47.604393273,Unmatched,Block,JAMES ST BETWEEN 6TH AVE AND 7TH AVE,0,Unknown,,0,0,0,0,0,0,0,2004/01/30 00:00:00+00,1/30/2004,Mid-Block (but intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,N
3,-122.327524934,47.708621579,Unmatched,Block,NE NORTHGATE WAY BETWEEN 1ST AVE NE AND NE NORTHGATE DR,0,Unknown,,0,0,0,0,0,0,0,2016/01/23 00:00:00+00,1/23/2016,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,N
4,-122.292120049,47.55900908,Unmatched,Block,M L KING JR ER WAY S BETWEEN S ANGELINE ST AND S EDMUNDS ST,0,Unknown,,0,0,0,0,0,0,0,2020/01/26 00:00:00+00,1/26/2020,Mid-Block (not related to intersection),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,,,,,,N


In [12]:
# Test whether DataFrame has NaN after dropping columns.
if df_after_drop_columns.isna().any(axis=None):
    print("DataFrame has NaN.")
else:
    print("DataFrame has no NaN.")

DataFrame has NaN.


In [13]:
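# Print a concise, technical summary of the collisions DataFrame.
# NOTE: pandas 1.2 deprecated null_counts in favor of show_counts,
# and pandas 2.0 removed null_counts; use show_counts=True there.
df_after_drop_columns.info(null_counts=True)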

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   X                213918 non-null  float64
 1   Y                213918 non-null  float64
 2   STATUS           221389 non-null  object 
 3   ADDRTYPE         217677 non-null  object 
 4   LOCATION         216801 non-null  object 
 5   SEVERITYCODE     221388 non-null  object 
 6   SEVERITYDESC     221389 non-null  object 
 7   COLLISIONTYPE    195159 non-null  object 
 8   PERSONCOUNT      221389 non-null  int64  
 9   PEDCOUNT         221389 non-null  int64  
 10  PEDCYLCOUNT      221389 non-null  int64  
 11  VEHCOUNT         221389 non-null  int64  
 12  INJURIES         221389 non-null  int64  
 13  SERIOUSINJURIES  221389 non-null  int64  
 14  FATALITIES       221389 non-null  int64  
 15  INCDATE          221389 non-null  object 
 16  INCDTTM          221389 non-null  obje

In [14]:
# For each column in the DataFrame after dropping columns,
# print the relative frequencies of the column's values,
# counting NaN as a value.
for column in df_after_drop_columns.columns:
    print("Relative frequency:")
    print(df_after_drop_columns[column].value_counts(normalize=True, dropna=False))
    print()

Relative frequency:
nan              0.033746031
-122.332653349   0.001337013
-122.344896079   0.001273776
-122.328078578   0.001246674
-122.344996835   0.001219573
                     ...    
-122.372757223   0.000004517
-122.305825420   0.000004517
-122.385337171   0.000004517
-122.397974101   0.000004517
-122.358295798   0.000004517
Name: X, Length: 24973, dtype: float64

Relative frequency:
nan            0.033746031
47.708654503   0.001337013
47.717173101   0.001273776
47.604161235   0.001246674
47.725035552   0.001219573
                   ...    
47.669143854   0.000004517
47.592493078   0.000004517
47.560592450   0.000004517
47.658522767   0.000004517
47.541978750   0.000004517
Name: Y, Length: 24973, dtype: float64

Relative frequency:
Matched     0.881850498
Unmatched   0.118149502
Name: STATUS, Length: 2, dtype: float64

Relative frequency:
Block          0.654580851
Intersection   0.324695446
NaN            0.016766867
Alley          0.003956836
Name: ADDRTYPE, Length: 4, 

In [15]:
# NOTE: dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

# Drop any row that contains at least one NaN.
print("Number of columns: %d" % len(list(df_after_drop_columns.columns)))
df_after_drop_columns_and_rows = df_after_drop_columns.dropna(axis="index", how="any", thresh=None, subset=None, inplace=False)

Number of columns: 27


In [16]:
# Verify that DataFrame has no NaN after dropping columns and rows.
if df_after_drop_columns_and_rows.isna().any(axis=None):
    print("DataFrame has NaN.")
else:
    print("DataFrame has no NaN.")

DataFrame has no NaN.
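Dropping every row that contains a NaN can remove a large number of rows, so it may help to compare it with a narrower rule. The sketch below uses a hypothetical frame (`toy` and its values are assumptions for illustration) to contrast `how="any"` over all columns with `subset`, which only looks at the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the collisions dataset.
toy = pd.DataFrame({
    "X": [-122.3, np.nan, -122.4, -122.5],
    "WEATHER": ["Clear", "Raining", None, "Clear"],
    "SEVERITYCODE": ["1", "2", "1", None],
})

# Drop rows that have a NaN in any column.
dropped_any = toy.dropna(axis="index", how="any")

# Drop rows only when the NaN is in one of the listed columns.
dropped_subset = toy.dropna(axis="index", subset=["SEVERITYCODE"])

# Report how many rows each rule removed.
print("Rows removed (any column):", len(toy) - len(dropped_any))
print("Rows removed (subset):", len(toy) - len(dropped_subset))

# dropna keeps the original row labels; reset_index renumbers them from 0.
renumbered = dropped_subset.reset_index(drop=True)
```

Because `dropna` keeps the original row labels, the index of the result has gaps unless it is reset.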


In [17]:
# NOTE: info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None) -> None

df_after_drop_columns_and_rows.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185317 entries, 0 to 221388
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   X                185317 non-null  float64
 1   Y                185317 non-null  float64
 2   STATUS           185317 non-null  object 
 3   ADDRTYPE         185317 non-null  object 
 4   LOCATION         185317 non-null  object 
 5   SEVERITYCODE     185317 non-null  object 
 6   SEVERITYDESC     185317 non-null  object 
 7   COLLISIONTYPE    185317 non-null  object 
 8   PERSONCOUNT      185317 non-null  int64  
 9   PEDCOUNT         185317 non-null  int64  
 10  PEDCYLCOUNT      185317 non-null  int64  
 11  VEHCOUNT         185317 non-null  int64  
 12  INJURIES         185317 non-null  int64  
 13  SERIOUSINJURIES  185317 non-null  int64  
 14  FATALITIES       185317 non-null  int64  
 15  INCDATE          185317 non-null  object 
 16  INCDTTM          185317 non-null  obje

<h3 id="correct_data_format">Correct Data Format</h3>

Ensure that each data type is appropriate for the corresponding feature.
Convert integer data to categorical type if the "real-world" measurement is not
"naturally ordered" as on the number line.
If data represents date, time, or date/time information, then convert the data to the appropriate datetime representation.

In [None]:
# Create new DataFrame to store converted data types.
df_converted = pd.DataFrame()

for column in list(df_after_drop_columns_and_rows.columns):
    # Cast column "ST_COLCODE" to type category.
    if column in ["ST_COLCODE"]:
        df_converted["ST_COLCODE"] = df_after_drop_columns_and_rows["ST_COLCODE"].astype('category')
    # Cast columns "INCDATE" and "INCDTTM" to type datetime.
    elif column in ["INCDATE", "INCDTTM"]:
        df_converted[column] = pd.to_datetime(df_after_drop_columns_and_rows[column], infer_datetime_format=True)
    # Cast columns of type object to type category.
    elif (df_after_drop_columns_and_rows[column].dtype in [np.dtype('object')]):
        df_converted[column] = df_after_drop_columns_and_rows[column].astype('category')
    # Copy all other columns to new DataFrame without changing their types.
    else:
        df_converted[column] = df_after_drop_columns_and_rows[column]       

# Display info about new DataFrame after casting objects to category or date
df_converted.info()
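The conversion rules described in this section can be tried on a small hypothetical frame before running them on the full data. In this sketch, `toy`, its values, and the explicit `format` string are assumptions for illustration. The real `INCDTTM` column appears to mix entries with and without a time of day, so a single explicit format may not fit every row there.

```python
import pandas as pd

# Hypothetical toy data, not taken from the collisions dataset.
toy = pd.DataFrame({
    "ST_COLCODE": [11.0, 32.0, 11.0],
    "INCDTTM": ["1/22/2020 3:21:00 PM", "1/7/2020 8:00:00 AM", "1/30/2004 12:00:00 AM"],
    "WEATHER": ["Raining", "Clear", "Clear"],
    "PERSONCOUNT": [2, 2, 0],
})

converted = pd.DataFrame({
    # Numeric codes with no natural ordering become categories.
    "ST_COLCODE": toy["ST_COLCODE"].astype("category"),
    # Date/time strings become datetime64 values.
    "INCDTTM": pd.to_datetime(toy["INCDTTM"], format="%m/%d/%Y %I:%M:%S %p"),
    # Text labels become categories.
    "WEATHER": toy["WEATHER"].astype("category"),
    # Counts are naturally ordered numbers and stay as integers.
    "PERSONCOUNT": toy["PERSONCOUNT"],
})

print(converted.dtypes)
```

Once a column is a datetime, the `.dt` accessor exposes parts such as the hour or weekday, which can serve as model features.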