### Predicting the Severity of Automobile Accidents in Seattle, Washington ###

In the first week, you will define your project objectives, find the dataset that you will use for this capstone project, and publish that dataset on GitHub.

In the second week, you will build your machine learning solution.

In the third week, you will finalize your model and be ready to submit your work.

To complete the capstone, you will work on a case study: predicting the severity of an accident.
Wouldn't it be great if something could warn you, given the weather and the road conditions,
about the possibility of getting into a car accident and how severe it would be,
so that you could drive more carefully or even change your travel plans?
Let's use the shared data for Seattle, Washington as an example of how to work with accident data.

In [1]:
# Import common packages for Data Science applications.
import io
import itertools
import matplotlib as mpl
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import os
import pandas as pd
import pylab as pl
import scipy
import scipy.optimize as opt
import seaborn as sns
import sklearn
import sklearn.linear_model
import sys
from matplotlib.ticker import NullFormatter
from scipy import optimize
from scipy.optimize import curve_fit
from sklearn import linear_model
from sklearn import metrics
from sklearn import pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score 
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline

In [2]:
# Create a list of display options.
list_of_display_options_fully_qualified_names = str(\
"pd.options.display.chop_threshold, pd.options.display.float_format, pd.options.display.max_info_columns, pd.options.display.notebook_repr_html, \
pd.options.display.colheader_justify, pd.options.display.html, pd.options.display.max_info_rows, pd.options.display.pprint_nest_depth, \
pd.options.display.column_space, pd.options.display.large_repr, pd.options.display.max_rows, pd.options.display.precision, \
pd.options.display.date_dayfirst, pd.options.display.latex, pd.options.display.max_seq_items, pd.options.display.show_dimensions, \
pd.options.display.date_yearfirst, pd.options.display.max_categories, pd.options.display.memory_usage, pd.options.display.unicode, \
pd.options.display.encoding, pd.options.display.max_columns, pd.options.display.min_rows, pd.options.display.width, \
pd.options.display.expand_frame_repr, pd.options.display.max_colwidth, pd.options.display.multi_sparse").split(sep=', ')

# Initialize an empty list to store all the short names for display options.
list_of_display_options_short_names = list()
# For each fully qualified option name,
# get the option's short name and add it to the list of short names.
for fully_qualified_option_name in list_of_display_options_fully_qualified_names:
    # Get short option name.
    short_option_name = fully_qualified_option_name.split(sep='.')[-1]
    
    # Add short option name to list of display option short names.
    list_of_display_options_short_names.append(short_option_name)

# Define dictionary of display option settings.
dict_of_display_option_settings_short_names=\
{"max_info_columns": 500,\
"colheader_justify": "right",\
"max_info_rows": 1000,\
"column_space": 500,\
"max_rows": 1000,\
"precision": 9,\
"max_seq_items": 1000000000,\
"show_dimensions": True,\
"max_categories": 100,\
"memory_usage": True,\
"max_columns": 500,\
"max_colwidth": 500,\
"float_format": lambda x: '%.9f' % x}

# Set pandas display options using dictionary of short names,
# and display the option/value pairs.
# Each short name is prefixed with "display." so that it matches exactly one pandas option.
print("Setting display options...")
for key, value in dict_of_display_option_settings_short_names.items():
    # Set display option.
    pd.set_option("display." + key, value)
    # Print display option name and value.
    print(key, ": ", pd.get_option("display." + key), sep='')

Setting display options...
max_info_columns: 500
colheader_justify: right
max_info_rows: 1000
column_space: 500
max_rows: 1000
precision: 9
max_seq_items: 1000000000
show_dimensions: True
max_categories: 100
memory_usage: True
max_columns: 500
max_colwidth: 500
float_format: <function <lambda> at 0x7f768422e040>


In [3]:
# Attribute Information URL: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
# Read the Collisions Data CSV file and store it as a DataFrame.
# url="https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv" # HTTPError at 202009151050, using local copy of .csv instead.
# print(os.listdir("..")) # Print list of contents of current working directory.
local_path_to_csv = "../Collisions.csv"
df=pd.read_csv(local_path_to_csv, low_memory=False)

In [93]:
# View the first few rows of the collisions DataFrame.
df.head(10)

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,-122.320757054,47.609407946,1,328476,329976,EA08706,Matched,Block,,BROADWAY BETWEEN E COLUMBIA ST AND BOYLSTON AVE,,,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/22 00:00:00+00,1/22/2020 3:21:00 PM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,N,Raining,Wet,Dark - Street Lights On,,,,11.0,From same direction - both going straight - both moving - sideswipe,0,0,N
1,-122.319560827,47.662220664,2,328142,329642,EA06882,Matched,Block,,8TH AVE NE BETWEEN NE 45TH E ST AND NE 47TH ST,,,1,Property Damage Only Collision,Parked Car,2,0,0,2,0,0,0,2020/01/07 00:00:00+00,1/7/2020 8:00:00 AM,Mid-Block (not related to intersection),15.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE",,N,Clear,Dry,Daylight,,,,32.0,One parked--one moving,0,0,Y
2,-122.327524508,47.604393273,3,20700,20700,1181833,Unmatched,Block,,JAMES ST BETWEEN 6TH AVE AND 7TH AVE,,,0,Unknown,,0,0,0,0,0,0,0,2004/01/30 00:00:00+00,1/30/2004,Mid-Block (but intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,4030032.0,,,,0,0,N
3,-122.327524934,47.708621579,4,332126,333626,M16001640,Unmatched,Block,,NE NORTHGATE WAY BETWEEN 1ST AVE NE AND NE NORTHGATE DR,,,0,Unknown,,0,0,0,0,0,0,0,2016/01/23 00:00:00+00,1/23/2016,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,,,,,0,0,N
4,-122.292120049,47.55900908,5,328238,329738,3857118,Unmatched,Block,,M L KING JR ER WAY S BETWEEN S ANGELINE ST AND S EDMUNDS ST,,,0,Unknown,,0,0,0,0,0,0,0,2020/01/26 00:00:00+00,1/26/2020,Mid-Block (not related to intersection),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,,,,,,,,,,0,0,N
5,-122.374193726,47.5640756,6,332024,333524,3838312,Matched,Block,,SW AVALON WAY BETWEEN SW GENESEE ST AND 35TH AVE SW,,,1,Property Damage Only Collision,Rear Ended,2,0,0,2,0,0,0,2020/06/11 00:00:00+00,6/11/2020 5:07:00 PM,Mid-Block (not related to intersection),14.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",,N,Clear,Dry,Daylight,,,,14.0,From same direction - both going straight - one stopped - rear-end,0,0,N
6,-122.290734129,47.709276309,7,328431,329931,3854579,Matched,Block,,35TH AVE NE BETWEEN NE 110TH ST AND NE 113TH ST,,,1,Property Damage Only Collision,Other,2,0,0,1,0,0,0,2020/02/03 00:00:00+00,2/3/2020 9:49:00 AM,Mid-Block (but intersection related),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,N,Clear,Wet,Daylight,,,Y,50.0,Fixed object,0,0,N
7,-122.345865266,47.688388912,8,1243,1243,3615301,Unmatched,Block,,N 82ND ST BETWEEN LINDEN AVE N AND AURORA AVE N,,,0,Unknown,,0,0,0,0,0,0,0,2013/03/28 00:00:00+00,3/28/2013,Mid-Block (not related to intersection),0.0,NOT ENOUGH INFORMATION / NOT APPLICABLE,,,,,,,,,,,0,0,N
8,-122.336564829,47.59039783,9,328781,330281,EA12104,Matched,Intersection,30386.0,COLORADO AVE S AND S ATLANTIC ST,,,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/30 00:00:00+00,1/30/2020 8:32:00 AM,At Intersection (intersection related),14.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",,N,Overcast,Dry,Daylight,,,,81.0,Same direction -- both turning left -- both moving -- sideswipe,0,0,N
9,-122.329048658,47.593341161,10,328879,330379,E985438,Matched,Block,,4TH AVE S BETWEEN I90 WB 4TH AV OFF RP AND S ROYAL BROUGHAM WAY,,,1,Property Damage Only Collision,Sideswipe,3,0,0,2,0,0,0,2019/11/23 00:00:00+00,11/23/2019 11:00:00 AM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,N,Clear,Dry,Daylight,,,,11.0,From same direction - both going straight - both moving - sideswipe,0,0,N


<h2 id="data_wrangling">Data Wrangling</h2>

Steps for working with missing data:
<ol>
    <li>Identify missing data.</li>
    <li>Deal with missing data.</li>
    <li>Correct data format.</li>
</ol>

<h3 id="identifying_missing_data">Identifying Missing Data</h3>

Any missing or "unknown" values indicated by the metadata document must be converted into NaN.
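
As a minimal sketch of this conversion, the toy DataFrame below stands in for a few collision columns. Its values and the `sentinels` mapping are made up for illustration; they are not taken from the metadata document. The sketch counts missing values before and after the sentinels are replaced.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few collision columns (values are made up for illustration).
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Raining", None],
    "ROADCOND": ["Dry", "Wet", "Unknown", "Wet"],
    "SDOT_COLCODE": [11.0, 0.0, 14.0, 28.0],
})

# Hypothetical mapping: column label -> sentinel that stands for "unknown" in that column.
sentinels = {"WEATHER": "Unknown", "ROADCOND": "Unknown", "SDOT_COLCODE": 0}

# Replace each sentinel with NaN, only in the column it belongs to.
toy_clean = toy.replace(to_replace=sentinels, value=np.nan)

# Count missing values per column before and after the conversion.
missing_before = toy.isna().sum()
missing_after = toy_clean.isna().sum()
print(pd.DataFrame({"before": missing_before, "after": missing_after}))
```

Until the sentinels are converted, `isna()` undercounts the missing data, because "Unknown" and 0 look like ordinary values.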

In [90]:
# NOTE: DataFrame.replace(to_replace=..., value=...) accepts a dict that maps
# each column label to the value that should be replaced in that column.

# If any row of the collisions DataFrame contains a sentinel value representing "unknown",
# then replace it with NaN:
# SEVERITYCODE == "0",
# JUNCTIONTYPE == "Unknown",
# SDOT_COLCODE == 0,
# WEATHER == "Unknown",
# ROADCOND == "Unknown",
# LIGHTCOND == "Unknown",
# ST_COLCODE == ' '
df_drop_unknowns = df.replace(
    to_replace={"SEVERITYCODE": "0", "JUNCTIONTYPE": "Unknown", "SDOT_COLCODE": 0, "WEATHER": "Unknown", "ROADCOND": "Unknown", "LIGHTCOND": "Unknown", "ST_COLCODE": " "},
    value=np.nan)

                    X            Y  OBJECTID  INCKEY  COLDETKEY   REPORTNO  \
2      -122.327524508 47.604393273         3   20700      20700    1181833   
3      -122.327524934 47.708621579         4  332126     333626  M16001640   
4      -122.292120049 47.559009080         5  328238     329738    3857118   
7      -122.345865266 47.688388912         8    1243       1243    3615301   
19     -122.351470036 47.626733437        20  328896     330396    EA13640   
...               ...          ...       ...     ...        ...        ...   
221364 -122.295817531 47.543534317    221365  331101     332601    3803375   
221367 -122.284865031 47.718778408    221368  330051     331551    EA20304   
221377 -122.327954999 47.642611502    221378  331134     332634    3852339   
221381 -122.291970539 47.570390620    221382  330753     332253    3856848   
221382 -122.314011536 47.726661128    221383  330031     331531    C823617   

           STATUS      ADDRTYPE          INTKEY  \
2       Unma

In [5]:
# Test if the collisions DataFrame has NaN values.
if df.isna().any(axis=None):
    print("DataFrame has NaN.")
else:
    print("DataFrame has no NaN.")

DataFrame has NaN.


In [6]:
# Initialize a list to store the labels for the columns with missing data.
list_of_columns_with_missing_data = list()

# For each column in the collisions DataFrame,
# if the column contains at least one NaN, 
# then add the column's label to the list.
for column in list(df.columns):
    if df[column].hasnans:
        list_of_columns_with_missing_data.append(column)

# Print the number of columns
print("Number of columns: %d" % len(df.columns))
print("List of labels for columns:")
print(list(df.columns))
print()
print("Number of columns that are missing data: %d" % len(list_of_columns_with_missing_data))
print("List of labels for columns that are missing data:")
print(list_of_columns_with_missing_data)

Number of columns: 40
List of labels for columns:
['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']

Number of columns that are missing data: 22
List of labels for columns that are missing data:
['X', 'Y', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC']


<h3 id="deal_with_missing_data">Deal with Missing Data</h3>

<ol>
    <li>Drop the Data
        <ol>
            <li>Drop entire row.</li>
            <li>Drop entire column.</li>
        </ol>
    </li>
    <li>Replace the Data
        <ol>
            <li>Replace data by mean.</li>
            <li>Replace data by frequency.</li>
            <li>Replace data based on other functions.</li>
        </ol>
    </li>
        
</ol>
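
The two most common replacement strategies listed above can be sketched on a toy DataFrame. The column names are borrowed from the collisions data, but the values are invented for illustration, and this notebook itself drops data rather than imputing it.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up for illustration).
toy = pd.DataFrame({
    "PERSONCOUNT": [2.0, np.nan, 4.0, 3.0],
    "WEATHER": ["Clear", "Raining", None, "Clear"],
})
imputed = toy.copy()

# Replace by mean: a common choice for numeric columns.
mean_personcount = toy["PERSONCOUNT"].mean()  # NaN is skipped: (2 + 4 + 3) / 3
imputed["PERSONCOUNT"] = imputed["PERSONCOUNT"].fillna(mean_personcount)

# Replace by frequency: fill with the most common value (the mode),
# a common choice for categorical columns.
most_common_weather = toy["WEATHER"].mode().iloc[0]
imputed["WEATHER"] = imputed["WEATHER"].fillna(most_common_weather)

print(imputed)
```

Imputation keeps every row but injects values that were never observed, so it is usually worth comparing against simply dropping the affected rows.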

Whole columns should be dropped only if most entries in the column are empty.
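
One way to apply this rule is to compute the fraction of missing entries per column and compare it against a cutoff. The sketch below uses a toy DataFrame with invented values; the `threshold` of 0.5 is an assumption standing in for "most entries".

```python
import pandas as pd

# Toy data: SPEEDING is mostly empty, WEATHER has one gap, VEHCOUNT is complete.
toy = pd.DataFrame({
    "SPEEDING": ["Y", None, None, None, None],
    "WEATHER": ["Clear", "Raining", None, "Clear", "Overcast"],
    "VEHCOUNT": [2, 2, 1, 3, 2],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_fraction = toy.isna().mean()

# Hypothetical cutoff: drop a column when more than half of its entries are missing.
threshold = 0.5
columns_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

toy_reduced = toy.drop(columns=columns_to_drop)
print(missing_fraction)
print("Dropped:", columns_to_drop)
```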

In [7]:
# For each column in DataFrame,
# print the relative frequencies of the column's values.
for column in list(df.columns):
    print(column, "Relative Frequencies:")
    print(df[column].value_counts(normalize=True, dropna=False))
    print()

X Relative Frequencies:
nan              0.033746031
-122.332653349   0.001337013
-122.344896079   0.001273776
-122.328078578   0.001246674
-122.344996835   0.001219573
                     ...    
-122.372757223   0.000004517
-122.305825420   0.000004517
-122.385337171   0.000004517
-122.397974101   0.000004517
-122.358295798   0.000004517
Name: X, Length: 24973, dtype: float64

Y Relative Frequencies:
nan            0.033746031
47.708654503   0.001337013
47.717173101   0.001273776
47.604161235   0.001246674
47.725035552   0.001219573
                   ...    
47.669143854   0.000004517
47.592493078   0.000004517
47.560592450   0.000004517
47.658522767   0.000004517
47.541978750   0.000004517
Name: Y, Length: 24973, dtype: float64

OBJECTID Relative Frequencies:
2047     0.000004517
39494    0.000004517
8785     0.000004517
10832    0.000004517
53839    0.000004517
             ...    
21920    0.000004517
109983   0.000004517
107934   0.000004517
114077   0.000004517
2049     0.0000

In [8]:
print(list(df.columns))

['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']


In [9]:
# Drop any column from the collisions DataFrame if it satisfies at least one of the following conditions:
# 1) more than 15% of the column's data is NaN;
# 2) the column only contains unique identification keys;
# 3) the column's data is naturally categorical but does not fit into a small (< 50) number of categories;
# 4) information in one column is redundant because it is already represented by another column;
# 5) it is not clear how to interpret the column's data.

list_of_columns_to_drop = [\
                           "STATUS",\
                           "OBJECTID",\
                           "INCKEY",\
                           "COLDETKEY",\
                           "REPORTNO",\
                           "INTKEY",\
                           "LOCATION",\
                           "EXCEPTRSNCODE",\
                           "EXCEPTRSNDESC",\
                           "SEVERITYDESC",\
                           "INCDATE",\
                           "SDOT_COLDESC",\
                           "INATTENTIONIND",\
                           "UNDERINFL",\
                           "PEDROWNOTGRNT",\
                           "SDOTCOLNUM",\
                           "SPEEDING",\
                           "ST_COLDESC",\
                           "SEGLANEKEY",\
                           "CROSSWALKKEY"]

In [10]:
# Drop the selected columns from the collisions DataFrame
# and store the result in a new DataFrame.
df_after_drop_columns = df.drop(columns=list_of_columns_to_drop, inplace=False)

In [11]:
# Test if DataFrame has NaN after dropping columns.
if df_after_drop_columns.isna().any(axis=None):
    print("DataFrame has NaN.")
else:
    print("DataFrame has no NaN.")

DataFrame has NaN.


In [12]:
# Drop any row that contains at least one NaN.
# Only "how" is passed, because recent pandas versions reject setting both "how" and "thresh".
df_after_drop_columns_and_rows = df_after_drop_columns.dropna(axis="index", how="any")

In [13]:
# Test if DataFrame has NaN values after dropping columns and rows.
if df_after_drop_columns_and_rows.isna().any(axis=None):
    print("DataFrame has NaN.")
else:
    print("DataFrame has no NaN.")

DataFrame has no NaN.


In [14]:
# For each column in DataFrame after dropping columns and rows,
# print the relative frequencies of the column's values.
for column in list(df_after_drop_columns_and_rows.columns):
    print(column, "Relative Frequencies:")
    print(df_after_drop_columns_and_rows[column].value_counts(normalize=True, dropna=False))
    print()

X Relative Frequencies:
-122.332653349   0.001413786
-122.344896079   0.001370617
-122.328078578   0.001354429
-122.344996835   0.001295071
-122.299159660   0.001251902
                     ...    
-122.375460050   0.000005396
-122.382797469   0.000005396
-122.322845970   0.000005396
-122.392965453   0.000005396
-122.291492727   0.000005396
Name: X, Length: 23603, dtype: float64

Y Relative Frequencies:
47.708654503   0.001413786
47.717173101   0.001370617
47.604161235   0.001354429
47.725035552   0.001295071
47.579673463   0.001251902
                   ...    
47.544284392   0.000005396
47.522827109   0.000005396
47.637249987   0.000005396
47.682557322   0.000005396
47.690588615   0.000005396
Name: Y, Length: 23603, dtype: float64

ADDRTYPE Relative Frequencies:
Block          0.647157858
Intersection   0.352842142
Name: ADDRTYPE, Length: 2, dtype: float64

SEVERITYCODE Relative Frequencies:
1    0.677327621
2    0.304800397
2b   0.016080467
3    0.001780723
0    0.000010792
Name: SE

In [84]:
# Display the rows for which at least one of the following conditions is met:
# SEVERITYCODE == "0",
# JUNCTIONTYPE == "Unknown",
# SDOT_COLCODE == 0,
# WEATHER == "Unknown",
# ROADCOND == "Unknown",
# LIGHTCOND == "Unknown",
# ST_COLCODE == ' '
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["SEVERITYCODE"] == "0"])
print()
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["JUNCTIONTYPE"] == "Unknown"])
print()
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["SDOT_COLCODE"] == 0])
print()
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["WEATHER"] == "Unknown"])
print()
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["ROADCOND"] == "Unknown"])
print()
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["LIGHTCOND"] == "Unknown"])
print()
print(df_after_drop_columns_and_rows[df_after_drop_columns_and_rows["ST_COLCODE"] == ' '])
print()

                    X            Y      ADDRTYPE SEVERITYCODE COLLISIONTYPE  \
118643 -122.314165334 47.606191964  Intersection            0        Angles   
214840 -122.385357422 47.676831637         Block            0    Pedestrian   

        PERSONCOUNT  PEDCOUNT  PEDCYLCOUNT  VEHCOUNT  INJURIES  \
118643            0         0            0         0         0   
214840            3         2            0         1         0   

        SERIOUSINJURIES  FATALITIES     INCDTTM  \
118643                0           0  12/20/2012   
214840                0           0    5/6/2020   

                                   JUNCTIONTYPE  SDOT_COLCODE  WEATHER  \
118643   At Intersection (intersection related)  11.000000000  Raining   
214840  Mid-Block (not related to intersection)  11.000000000    Clear   

       ROADCOND LIGHTCOND ST_COLCODE HITPARKEDCAR  
118643      Wet  Daylight         10            N  
214840      Wet  Daylight          1            Y  

[2 rows x 20 columns]

      

In [15]:
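# For each column in the DataFrame after dropping all NaN,
# drop any row corresponding to "unknown" or a value equivalent to NaN,
# e.g. SEVERITYCODE == "0".

# Drop any row if at least one of the following conditions is met:
# SEVERITYCODE == "0",
# JUNCTIONTYPE == "Unknown",
# SDOT_COLCODE == 0,
# WEATHER == "Unknown",
# ROADCOND == "Unknown",
# LIGHTCOND == "Unknown",
# ST_COLCODE == ' '

# Replace the values specified above by NaN.
df_replace_unknowns = df_after_drop_columns_and_rows.replace(
    to_replace={"SEVERITYCODE": "0", "JUNCTIONTYPE": "Unknown", "SDOT_COLCODE": 0, "WEATHER": "Unknown", "ROADCOND": "Unknown", "LIGHTCOND": "Unknown", "ST_COLCODE": " "},
    value=np.nan)

# Drop any row containing NaN.
df_replace_unknowns = df_replace_unknowns.dropna(axis="index", how="any")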

<h3 id="correct_data_format">Correct Data Format</h3>

Ensure that each data type is appropriate for the corresponding feature.
Convert integer data to "ordered" categorical types, e.g. SEVERITYCODE,
especially if the "integer ordering" of the original data is inappropriate.

If data represents date, time, or date/time information, then convert the data to the appropriate datetime representation.
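
Both conversions can be sketched on a toy DataFrame. The severity ordering `1 < 2 < 2b < 3` is an assumption made for illustration, and the explicit `format` string only covers timestamps that include a time of day.

```python
import pandas as pd

# Toy rows that mimic the string form of two collision columns.
toy = pd.DataFrame({
    "SEVERITYCODE": ["1", "2b", "3", "2"],
    "INCDTTM": ["1/22/2020 3:21:00 PM", "1/7/2020 8:00:00 AM",
                "6/11/2020 5:07:00 PM", "11/23/2019 11:00:00 AM"],
})

# Assumed ordering of severity codes, from least to most severe.
severity_order = ["1", "2", "2b", "3"]
severity_dtype = pd.CategoricalDtype(categories=severity_order, ordered=True)
toy["SEVERITYCODE"] = toy["SEVERITYCODE"].astype(severity_dtype)

# Parse the timestamp strings into datetime64 values.
toy["INCDTTM"] = pd.to_datetime(toy["INCDTTM"], format="%m/%d/%Y %I:%M:%S %p")

# An ordered categorical supports comparisons that follow the declared order,
# which plain strings would not ("2b" vs "3" would compare alphabetically).
more_severe_than_2 = toy[toy["SEVERITYCODE"] > "2"]
print(toy.dtypes)
print(more_severe_than_2)
```

Declaring the order explicitly matters because a code such as "2b" cannot be placed correctly by integer or alphabetical sorting alone.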

In [16]:
# Create new DataFrame to store converted data types.
df_converted = pd.DataFrame()

for column in list(df_after_drop_columns_and_rows.columns):
    # Cast column "SDOT_COLCODE" to type category.
    if column in ["SDOT_COLCODE"]:
        df_converted["SDOT_COLCODE"] = df_after_drop_columns_and_rows["SDOT_COLCODE"].astype('category')
    # Cast columns "INCDTTM" to type datetime.
    elif column in ["INCDTTM"]:
        df_converted[column] = pd.to_datetime(df_after_drop_columns_and_rows[column], infer_datetime_format=True)
    # Cast columns of type object to type category.
    elif (df_after_drop_columns_and_rows[column].dtype in [np.dtype('object')]):
        df_converted[column] = df_after_drop_columns_and_rows[column].astype('category')
    # Copy all other columns to new DataFrame without changing their types.
    else:
        df_converted[column] = df_after_drop_columns_and_rows[column]

In [17]:
# Display info about new DataFrame after casting objects to category or date.
# show_counts replaces the deprecated null_counts argument (requires pandas >= 1.2).
df_converted.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185318 entries, 0 to 221388
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   X                185318 non-null  float64       
 1   Y                185318 non-null  float64       
 2   ADDRTYPE         185318 non-null  category      
 3   SEVERITYCODE     185318 non-null  category      
 4   COLLISIONTYPE    185318 non-null  category      
 5   PERSONCOUNT      185318 non-null  int64         
 6   PEDCOUNT         185318 non-null  int64         
 7   PEDCYLCOUNT      185318 non-null  int64         
 8   VEHCOUNT         185318 non-null  int64         
 9   INJURIES         185318 non-null  int64         
 10  SERIOUSINJURIES  185318 non-null  int64         
 11  FATALITIES       185318 non-null  int64         
 12  INCDTTM          185318 non-null  datetime64[ns]
 13  JUNCTIONTYPE     185318 non-null  category      
 14  SDOT_COLCODE     185

In [18]:
# Create DataFrame of categorical columns.
df_categorical = df_converted.select_dtypes(include="category")

In [19]:
# show_counts replaces the deprecated null_counts argument (requires pandas >= 1.2).
df_categorical.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185318 entries, 0 to 221388
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   ADDRTYPE       185318 non-null  category
 1   SEVERITYCODE   185318 non-null  category
 2   COLLISIONTYPE  185318 non-null  category
 3   JUNCTIONTYPE   185318 non-null  category
 4   SDOT_COLCODE   185318 non-null  category
 5   WEATHER        185318 non-null  category
 6   ROADCOND       185318 non-null  category
 7   LIGHTCOND      185318 non-null  category
 8   ST_COLCODE     185318 non-null  category
 9   HITPARKEDCAR   185318 non-null  category
dtypes: category(10)
memory usage: 3.2 MB


In [21]:
# For each categorical column in DataFrame df_categorical, print value_counts.
for column in list(df_categorical.columns):
    print(df_categorical[column].value_counts(normalize=True, dropna=False))
    print(df_categorical[column].value_counts(normalize=False, dropna=False))
    print()

Block          0.647157858
Intersection   0.352842142
Name: ADDRTYPE, Length: 2, dtype: float64
Block           119930
Intersection     65388
Name: ADDRTYPE, Length: 2, dtype: int64

1    0.677327621
2    0.304800397
2b   0.016080467
3    0.001780723
0    0.000010792
Name: SEVERITYCODE, Length: 5, dtype: float64
1     125521
2      56485
2b      2980
3        330
0          2
Name: SEVERITYCODE, Length: 5, dtype: int64

Parked Car   0.234272979
Angles       0.190111052
Rear Ended   0.180014893
Other        0.122691805
Sideswipe    0.097875004
Left Turn    0.075491857
Pedestrian   0.040589689
Cycles       0.031497210
Right Turn   0.015956356
Head On      0.011499153
Name: COLLISIONTYPE, Length: 10, dtype: float64
Parked Car    43415
Angles        35231
Rear Ended    33360
Other         22737
Sideswipe     18138
Left Turn     13990
Pedestrian     7522
Cycles         5837
Right Turn     2957
Head On        2131
Name: COLLISIONTYPE, Length: 10, dtype: int64

Mid-Block (not related to inter

#### Feature before One Hot Encoding