### Predicting the Severity of Automobile Accidents in Seattle, Washington ###

In this first week, you will define your project objectives, find the dataset that you will use for this capstone project, and publish your dataset on GitHub.

In the second week, you will build your machine learning solution.

In the third week, you will finalize your model and be ready to submit your work.

To complete the capstone, you will work on a case study: predicting the severity of an accident.
Wouldn't it be great if something could warn you, given the weather and the road conditions,
about the possibility of getting into a car accident and how severe it would be,
so that you could drive more carefully or even change your travel plans?
Let's use the shared data for Seattle, Washington as an example of how to work with accident data.

In [46]:
# Import packages.
import os
import sys
import pandas as pd
import numpy as np
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [15]:
# NOTE: >>> help(pd.options.display. <TAB>
# pd.options.display.chop_threshold      pd.options.display.float_format        pd.options.display.max_info_columns    pd.options.display.notebook_repr_html
# pd.options.display.colheader_justify   pd.options.display.html                pd.options.display.max_info_rows       pd.options.display.pprint_nest_depth
# pd.options.display.column_space        pd.options.display.large_repr          pd.options.display.max_rows            pd.options.display.precision
# pd.options.display.date_dayfirst       pd.options.display.latex               pd.options.display.max_seq_items       pd.options.display.show_dimensions
# pd.options.display.date_yearfirst      pd.options.display.max_categories      pd.options.display.memory_usage        pd.options.display.unicode
# pd.options.display.encoding            pd.options.display.max_columns         pd.options.display.min_rows            pd.options.display.width
# pd.options.display.expand_frame_repr   pd.options.display.max_colwidth        pd.options.display.multi_sparse        

# Create a list of display options.
list_of_display_options_fully_qualified_names = str(\
"pd.options.display.chop_threshold, pd.options.display.float_format, pd.options.display.max_info_columns, pd.options.display.notebook_repr_html, \
pd.options.display.colheader_justify, pd.options.display.html, pd.options.display.max_info_rows, pd.options.display.pprint_nest_depth, \
pd.options.display.column_space, pd.options.display.large_repr, pd.options.display.max_rows, pd.options.display.precision, \
pd.options.display.date_dayfirst, pd.options.display.latex, pd.options.display.max_seq_items, pd.options.display.show_dimensions, \
pd.options.display.date_yearfirst, pd.options.display.max_categories, pd.options.display.memory_usage, pd.options.display.unicode, \
pd.options.display.encoding, pd.options.display.max_columns, pd.options.display.min_rows, pd.options.display.width, \
pd.options.display.expand_frame_repr, pd.options.display.max_colwidth, pd.options.display.multi_sparse").split(sep=', ')

# Initialize an empty list to store all the short names for display options.
list_of_display_options_short_names = list()
# For each fully qualified option name,
# get the option's short name and add it to the list of short names.
for fully_qualified_option_name in list_of_display_options_fully_qualified_names:
    # Get short option name.
    short_option_name = fully_qualified_option_name.split(sep='.')[-1]
    
    # Add short option name to list of display option short names.
    list_of_display_options_short_names.append(short_option_name)

# Define dictionary of display option settings.
dict_of_display_option_settings_short_names=\
{"max_info_columns": 500,\
"max_info_rows": 1000,\
"max_columns": 500,\
"max_rows": 1000,\
"precision": 9,\
"max_seq_items": None,\
"show_dimensions": True,\
"max_categories": 1000000,\
"max_colwidth": 500,\
"float_format": lambda x: '%.9f' % x}

# Set pandas display options using the dictionary of short names,
# and display the option/value pairs.
# The "display." prefix keeps each name unambiguous, because pandas
# matches option names by pattern and rejects names that match more than one option.
print("Setting display options...")
for key, value in dict_of_display_option_settings_short_names.items():
    # Set display option.
    pd.set_option("display." + key, value)
    # Print display option name and value.
    print(key, ": ", pd.get_option("display." + key), sep='')

Setting display options...
max_info_columns: 500
max_info_rows: 1000
max_columns: 500
max_rows: 1000
precision: 9
max_seq_items: None
show_dimensions: True
max_categories: 1000000
max_colwidth: 500
float_format: <function <lambda> at 0x7f541dfe9820>


In [3]:
# Attribute Information URL: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
# Read the Collisions Data CSV file and store it as a DataFrame.
# url="https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv" # HTTPError at 202009151050, using local copy of .csv instead.
local_path_to_csv = "../Collisions.csv"
df=pd.read_csv(local_path_to_csv, low_memory=False)

<h2 id="data_wrangling">Data Wrangling</h2>

Steps for working with missing data:
<ol>
    <li>Identify missing data.</li>
    <li>Deal with missing data.</li>
    <li>Correct data format.</li>
</ol>

<h3 id="identifying_missing_data">Identifying Missing Data</h3>

When the CSV file is read, pandas represents missing values with its default missing-value marker, NaN. The pandas methods <code>isnull()</code> and <code>notnull()</code> can be used to identify these missing values.
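
As a minimal sketch of this identification step, the cell below counts missing entries per column. The small `toy` DataFrame is hypothetical and only stands in for the collisions data; the column names are borrowed from the dataset for readability.

```python
import pandas as pd

# Hypothetical toy frame standing in for the collisions data.
toy = pd.DataFrame({
    "WEATHER": ["Clear", None, "Raining", "Clear"],
    "SPEEDING": [None, None, "Y", None],
    "VEHCOUNT": [2, 1, 2, 3],
})

# Number of missing entries in each column.
missing_counts = toy.isnull().sum()

# Fraction of missing entries in each column.
missing_fraction = toy.isnull().mean()

print(missing_counts)
print(missing_fraction)
```

The per-column fraction is a convenient quantity to compare against a cutoff when deciding which columns to drop.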

<h3 id="deal_with_missing_data">Deal with Missing Data</h3>

<ol>
    <li>Drop the Data
        <ol>
            <li>Drop entire row.</li>
            <li>Drop entire column.</li>
        </ol>
    </li>
    <li>Replace the Data
        <ol>
            <li>Replace data by mean.</li>
            <li>Replace data by frequency.</li>
            <li>Replace data based on other functions.</li>
        </ol>
    </li>
        
</ol>

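Whole columns should be dropped only when a large share of their entries is missing. The cell below uses a cutoff of 15% NaN and also drops columns that hold only identification keys or whose meaning is unclear.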
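
The options listed above can be sketched on a small example. This is an illustration under assumptions, not the procedure this notebook follows: the `toy` DataFrame and the variable names are hypothetical, and the 15% cutoff is applied here purely by missing fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: SPEEDING is 80% missing, the other columns 10%.
toy = pd.DataFrame({
    "SPEEDING": [np.nan] * 8 + ["Y", "Y"],
    "WEATHER": ["Clear", np.nan, "Raining", "Clear", "Clear",
                "Overcast", "Clear", "Raining", "Clear", "Clear"],
    "PERSONCOUNT": [2.0, 2.0, np.nan, 4.0, 3.0, 2.0, 2.0, 5.0, 2.0, 5.0],
})

# Drop entire columns whose fraction of NaN exceeds the cutoff.
max_missing_fraction = 0.15
missing_fraction = toy.isnull().mean()
columns_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index.tolist()
reduced = toy.drop(columns=columns_to_drop)

# Option 1: drop entire rows that still contain a NaN.
dropped_rows = reduced.dropna(axis="index", how="any")

# Option 2: replace instead of dropping.
filled = reduced.copy()
# Replace by mean (numeric column).
filled["PERSONCOUNT"] = filled["PERSONCOUNT"].fillna(filled["PERSONCOUNT"].mean())
# Replace by frequency (most common value of a categorical column).
filled["WEATHER"] = filled["WEATHER"].fillna(filled["WEATHER"].mode()[0])

print(columns_to_drop)
print(len(dropped_rows), "rows remain after dropping rows with NaN")
print(filled)
```

Dropping rows keeps only observed values at the cost of fewer samples, while replacing keeps every row but introduces estimated values.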

In [4]:
# Drop any column from the collisions DataFrame if it satisfies at least one of the following conditions:
# 1) more than 15% of the column's data is NaN;
# 2) the column only contains unique identification keys;
# 3) it is unclear how the column's data should be interpreted.

list_of_columns_to_drop = [\
                           "OBJECTID",\
                           "INCKEY",\
                           "COLDETKEY",\
                           "REPORTNO",\
                           "INTKEY",\
                           "EXCEPTRSNCODE",\
                           "EXCEPTRSNDESC",\
                           "INATTENTIONIND",\
                           "PEDROWNOTGRNT",\
                           "SDOTCOLNUM",\
                           "SPEEDING",\
                           "SEGLANEKEY",\
                           "CROSSWALKKEY"]

In [5]:
#NOTE: drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
# Drop the selected columns from the collisions DataFrame
# and store the result in a new DataFrame.
df_after_drop_columns = df.drop(columns=list_of_columns_to_drop, inplace=False)

In [6]:
# NOTE: dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

# Drop any row that contains at least one NaN.
#print("Number of columns: %d" % len(list(df_after_drop_columns.columns)))
df_after_drop_columns_and_rows = df_after_drop_columns.dropna(axis="index", how="any", thresh=None, subset=None, inplace=False)

<h3 id="correct_data_format">Correct Data Format</h3>

Ensure that each data type is appropriate for the corresponding feature.
Convert integer data to categorical type if the "real-world" measurement is not
"naturally ordered" as on the number line.
If data represents date, time, or date/time information, then convert the data to the appropriate datetime representation.
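
A small sketch of why these conversions matter, assuming a hypothetical `toy` DataFrame with made-up collision codes and timestamps: a categorical column lists its distinct codes without implying arithmetic on them, and a datetime column exposes components such as the hour.

```python
import pandas as pd

# Hypothetical toy frame; the codes and timestamps are illustrative only.
toy = pd.DataFrame({
    "SDOT_COLCODE": [11, 14, 11, 28],
    "INCDTTM": ["2020-01-22 15:21:00", "2020-01-07 08:00:00",
                "2020-06-11 17:07:00", "2020-02-03 09:49:00"],
})

# Integer codes that are labels, not quantities, become categories.
toy["SDOT_COLCODE"] = toy["SDOT_COLCODE"].astype("category")

# Strings holding date/time information become datetime values.
toy["INCDTTM"] = pd.to_datetime(toy["INCDTTM"])

# Datetime columns allow access to components such as the hour of day.
hours = toy["INCDTTM"].dt.hour

print(toy.dtypes)
print(toy["SDOT_COLCODE"].cat.categories.tolist())
print(hours.tolist())
```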

In [7]:
# Create new DataFrame to store converted data types.
df_converted = pd.DataFrame()

for column in list(df_after_drop_columns_and_rows.columns):
    # Cast column "SDOT_COLCODE" to type category.
    if column in ["SDOT_COLCODE"]:
        df_converted["SDOT_COLCODE"] = df_after_drop_columns_and_rows["SDOT_COLCODE"].astype('category')
    # Cast columns "INCDATE" and "INCDTTM" to type datetime.
    elif column in ["INCDATE", "INCDTTM"]:
        df_converted[column] = pd.to_datetime(df_after_drop_columns_and_rows[column], infer_datetime_format=True)
    # Cast columns of type object to type category.
    elif (df_after_drop_columns_and_rows[column].dtype in [np.dtype('object')]):
        df_converted[column] = df_after_drop_columns_and_rows[column].astype('category')
    # Copy all other columns to new DataFrame without changing their types.
    else:
        df_converted[column] = df_after_drop_columns_and_rows[column]

In [8]:
# Display info about new DataFrame after casting objects to category or date
df_converted.info(verbose=True, show_counts=True)  # pandas < 1.2 names this parameter null_counts

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185317 entries, 0 to 221388
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   X                185317 non-null  float64       
 1   Y                185317 non-null  float64       
 2   STATUS           185317 non-null  category      
 3   ADDRTYPE         185317 non-null  category      
 4   LOCATION         185317 non-null  category      
 5   SEVERITYCODE     185317 non-null  category      
 6   SEVERITYDESC     185317 non-null  category      
 7   COLLISIONTYPE    185317 non-null  category      
 8   PERSONCOUNT      185317 non-null  int64         
 9   PEDCOUNT         185317 non-null  int64         
 10  PEDCYLCOUNT      185317 non-null  int64         
 11  VEHCOUNT         185317 non-null  int64         
 12  INJURIES         185317 non-null  int64         
 13  SERIOUSINJURIES  185317 non-null  int64         
 14  FATALITIES       185

In [17]:
# Print the first several rows of columns INCDATE and INCDTTM.
df_converted[["INCDATE", "INCDTTM"]].head()

| | INCDATE | INCDTTM |
|---|---|---|
| 0 | 2020-01-22 | 2020-01-22 15:21:00 |
| 1 | 2020-01-07 | 2020-01-07 08:00:00 |
| 5 | 2020-06-11 | 2020-06-11 17:07:00 |
| 6 | 2020-02-03 | 2020-02-03 09:49:00 |
| 8 | 2020-01-30 | 2020-01-30 08:32:00 |


In [18]:
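# Look for correlations between variables of type int64 or float64.
# Selecting the numeric columns explicitly avoids errors on category and datetime columns.
df_converted.select_dtypes(include="number").corr()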

| | X | Y | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | INJURIES | SERIOUSINJURIES | FATALITIES |
|---|---|---|---|---|---|---|---|---|---|
| X | 1.0 | -0.16052196 | 0.012071865 | 0.010809063 | -0.003834269 | -0.015810624 | 0.011208065 | -0.006274326 | -9.1933e-05 |
| Y | -0.16052196 | 1.0 | -0.013397357 | 0.012962509 | 0.027700106 | 0.018114148 | 0.010610046 | -0.000691186 | -0.004291236 |
| PERSONCOUNT | 0.012071865 | -0.013397357 | 1.0 | -0.023841139 | -0.043475206 | 0.3957875 | 0.27367117 | 0.105816709 | 0.047709077 |
| PEDCOUNT | 0.010809063 | 0.012962509 | -0.023841139 | 1.0 | -0.021584911 | -0.329870583 | 0.15764071 | 0.129484025 | 0.073938403 |
| PEDCYLCOUNT | -0.003834269 | 0.027700106 | -0.043475206 | -0.021584911 | 1.0 | -0.313565246 | 0.113270524 | 0.059977965 | 0.010630242 |
| VEHCOUNT | -0.015810624 | 0.018114148 | 0.3957875 | -0.329870583 | -0.313565246 | 1.0 | 0.023013754 | -0.046766713 | -0.028869204 |
| INJURIES | 0.011208065 | 0.010610046 | 0.27367117 | 0.15764071 | 0.113270524 | 0.023013754 | 1.0 | 0.280833962 | 0.068237339 |
| SERIOUSINJURIES | -0.006274326 | -0.000691186 | 0.105816709 | 0.129484025 | 0.059977965 | -0.046766713 | 0.280833962 | 1.0 | 0.177365053 |
| FATALITIES | -9.1933e-05 | -0.004291236 | 0.047709077 | 0.073938403 | 0.010630242 | -0.028869204 | 0.068237339 | 0.177365053 | 1.0 |
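
To make the correlation step concrete, here is a minimal sketch on a hypothetical `toy` DataFrame (the values are made up). It restricts the frame to numeric columns before calling `corr()`, which by default computes the Pearson correlation coefficient for each pair of columns.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "PERSONCOUNT": [2, 3, 4, 5],
    "VEHCOUNT": [1, 2, 2, 3],
    "WEATHER": pd.Categorical(["Clear", "Raining", "Clear", "Overcast"]),
})

# Keep only numeric columns; the categorical column is excluded.
numeric = toy.select_dtypes(include="number")

# Pairwise Pearson correlation between the numeric columns.
corr = numeric.corr()

print(corr)
```

For these toy values the off-diagonal entry equals 3 / sqrt(10), which is about 0.949, and it can be checked by hand from the deviations of each column about its mean.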


In [31]:
df_converted.columns

Index(['X', 'Y', 'STATUS', 'ADDRTYPE', 'LOCATION', 'SEVERITYCODE',
       'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT',
       'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES',
       'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'ST_COLCODE',
       'ST_COLDESC', 'HITPARKEDCAR'],
      dtype='object')

In [40]:
# Select the severity code and weather columns for inspection.
test_df = df_converted[["SEVERITYCODE", "WEATHER"]]

In [49]:
# Proportion of each severity level within each road condition.
df_converted.groupby(["ROADCOND"])["SEVERITYDESC"].value_counts(normalize=True)

ROADCOND        SEVERITYDESC                  
Dry             Property Damage Only Collision   0.660888207
                Injury Collision                 0.319552283
                Serious Injury Collision         0.017479907
                Fatality Collision               0.002079603
Ice             Property Damage Only Collision   0.759932375
                Injury Collision                 0.224852071
                Serious Injury Collision         0.014370245
                Fatality Collision               0.000845309
Oil             Property Damage Only Collision   0.591836735
                Injury Collision                 0.408163265
Other           Property Damage Only Collision   0.647058824
                Injury Collision                 0.327731092
                Serious Injury Collision         0.025210084
Sand/Mud/Dirt   Property Damage Only Collision   0.656250000
                Injury Collision                 0.343750000
Snow/Slush      Property Damage Only C