### Predicting the Severity of Automobile Accidents in Seattle, Washington ###

In the first week, you will define your project objectives, find the dataset you will use for this capstone project, and publish it on GitHub.

In the second week, you will build your machine learning solution.

In the third week, you will finalize your model and prepare to submit your work.

To complete the capstone, you will work on a case study: predicting the severity of an accident.
Wouldn't it be great if something could warn you, given the weather and road conditions, about the possibility of getting into a car accident and how severe it would be, so that you would drive more carefully or even change your travel plans?
Let's use our shared data for Seattle, Washington as an example of how to work with accident data.

In [1]:
# Import packages.
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [2]:
# NOTE: >>> help(pd.options.display. <TAB>
# pd.options.display.chop_threshold      pd.options.display.float_format        pd.options.display.max_info_columns    pd.options.display.notebook_repr_html
# pd.options.display.colheader_justify   pd.options.display.html                pd.options.display.max_info_rows       pd.options.display.pprint_nest_depth
# pd.options.display.column_space        pd.options.display.large_repr          pd.options.display.max_rows            pd.options.display.precision
# pd.options.display.date_dayfirst       pd.options.display.latex               pd.options.display.max_seq_items       pd.options.display.show_dimensions
# pd.options.display.date_yearfirst      pd.options.display.max_categories      pd.options.display.memory_usage        pd.options.display.unicode
# pd.options.display.encoding            pd.options.display.max_columns         pd.options.display.min_rows            pd.options.display.width
# pd.options.display.expand_frame_repr   pd.options.display.max_colwidth        pd.options.display.multi_sparse        

# Create a list of display options.
list_of_display_options_fully_qualified_names = str(\
"pd.options.display.chop_threshold, pd.options.display.float_format, pd.options.display.max_info_columns, pd.options.display.notebook_repr_html, \
pd.options.display.colheader_justify, pd.options.display.html, pd.options.display.max_info_rows, pd.options.display.pprint_nest_depth, \
pd.options.display.column_space, pd.options.display.large_repr, pd.options.display.max_rows, pd.options.display.precision, \
pd.options.display.date_dayfirst, pd.options.display.latex, pd.options.display.max_seq_items, pd.options.display.show_dimensions, \
pd.options.display.date_yearfirst, pd.options.display.max_categories, pd.options.display.memory_usage, pd.options.display.unicode, \
pd.options.display.encoding, pd.options.display.max_columns, pd.options.display.min_rows, pd.options.display.width, \
pd.options.display.expand_frame_repr, pd.options.display.max_colwidth, pd.options.display.multi_sparse").split(sep=', ')

# Print the number of display options in the list.
#print("Number of display options:", list_of_display_options_fully_qualified_names.__len__())
#print()
#print()

# Initialize an empty list to store all the short names for display options.
list_of_display_options_short_names = list()
# For each display option,
# print the fully qualified option name, 
# the short option name, 
# and a description of the option.
for fully_qualified_option_name in list_of_display_options_fully_qualified_names:
    # Print fully qualified option name.
    #print("Fully qualified option name:", fully_qualified_option_name)
    
    # Set short option name.
    short_option_name = fully_qualified_option_name.split(sep='.')[-1]
    
    # Add short option name to list of display option short names.
    list_of_display_options_short_names.append(short_option_name)
    
    # Print short option name.
    #print("Short option name:", short_option_name)
    
    # Print a description of option.
    #pd.describe_option(short_option_name)

print(list_of_display_options_short_names)
print()

# Define dictionary of display option settings.
dict_of_display_option_settings_short_names=\
{"max_info_columns": 100,\
"max_info_rows": 200,\
"max_columns": 100,\
"max_rows": 200,\
"precision": 9,\
"max_seq_items": None,\
"show_dimensions": True,\
"max_categories": 1000,\
"max_colwidth": 100,\
"float_format": lambda x: '%.3f' % x}

print(dict_of_display_option_settings_short_names)
print()
# Set and print selected pandas display options.
for key in list(dict_of_display_option_settings_short_names.keys()):
    # Set display option.
    pd.set_option(key, dict_of_display_option_settings_short_names[key])
    # Print display option name and value.
    print(key, ": ", pd.get_option(key), sep='')
    print()

['chop_threshold', 'float_format', 'max_info_columns', 'notebook_repr_html', 'colheader_justify', 'html', 'max_info_rows', 'pprint_nest_depth', 'column_space', 'large_repr', 'max_rows', 'precision', 'date_dayfirst', 'latex', 'max_seq_items', 'show_dimensions', 'date_yearfirst', 'max_categories', 'memory_usage', 'unicode', 'encoding', 'max_columns', 'min_rows', 'width', 'expand_frame_repr', 'max_colwidth', 'multi_sparse']

{'max_info_columns': 100, 'max_info_rows': 200, 'max_columns': 100, 'max_rows': 200, 'precision': 9, 'max_seq_items': None, 'show_dimensions': True, 'max_categories': 1000, 'max_colwidth': 100, 'float_format': <function <lambda> at 0x7faf37cd5700>}

max_info_columns: 100

max_info_rows: 200

max_columns: 100

max_rows: 200

precision: 9

max_seq_items: None

show_dimensions: True

max_categories: 1000

max_colwidth: 100

float_format: <function <lambda> at 0x7faf37cd5700>
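
The settings above change pandas display options for the whole session. As a hedged alternative sketch, `pd.option_context` applies an option only inside a `with` block; the `values` Series below is hypothetical toy data, not part of the collisions dataset.

```python
import pandas as pd

# Hypothetical toy values used only to show the formatting effect.
values = pd.Series([3.14159265, 2.71828183])

# Inside the with-block, floats are rendered with three decimal places.
with pd.option_context("display.float_format", lambda x: "%.3f" % x):
    formatted = values.to_string()

# Outside the with-block, the previous display setting is restored.
print(formatted)
```

This can be convenient when only one cell needs a different format and the rest of the notebook should keep its current settings.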



In [3]:
# Attribute Information URL: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
# Read the Collisions Data CSV file and store it as a DataFrame.
url="http://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv?outSR={%22latestWkid%22:2926,%22wkid%22:2926}"
df=pd.read_csv(url, low_memory=False)

In [4]:
# View the first few rows of the collisions DataFrame.
df.head()

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,1273535.055,225839.134,1,328476,329976,EA08706,Matched,Block,,BROADWAY BETWEEN E COLUMBIA ST AND BOYLSTON AVE,,,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/22 00:00:00+00,1/22/2020 3:21:00 PM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,N,Raining,Wet,Dark - Street Lights On,,,,11.0,From same direction - both going straight - both moving - sideswipe,0,0,N
1,1274202.093,245094.095,2,328142,329642,EA06882,Matched,Block,,8TH AVE NE BETWEEN NE 45TH E ST AND NE 47TH ST,,,1,Property Damage Only Collision,Parked Car,2,0,0,2,0,0,0,2020/01/07 00:00:00+00,1/7/2020 8:00:00 AM,Mid-Block (not related to intersection),15.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE",,N,Clear,Dry,Daylight,,,,32.0,One parked--one moving,0,0,Y
2,1271830.52,224042.637,3,20700,20700,1181833,Unmatched,Block,,JAMES ST BETWEEN 6TH AVE AND 7TH AVE,,,0,Unknown,,0,0,0,0,0,0,0,2004/01/30 00:00:00+00,1/30/2004,Mid-Block (but intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,4030032.0,,,,0,0,N
3,1272568.544,262054.386,4,332126,333626,M16001640,Unmatched,Block,,NE NORTHGATE WAY BETWEEN 1ST AVE NE AND NE NORTHGATE DR,,,0,Unknown,,0,0,0,0,0,0,0,2016/01/23 00:00:00+00,1/23/2016,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,,,,,0,0,N
4,1280249.222,207323.483,5,328238,329738,3857118,Unmatched,Block,,M L KING JR ER WAY S BETWEEN S ANGELINE ST AND S EDMUNDS ST,,,0,Unknown,,0,0,0,0,0,0,0,2020/01/26 00:00:00+00,1/26/2020,Mid-Block (not related to intersection),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,,,,,,,,,,0,0,N


In [5]:
# Print a concise, technical summary of the collisions DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 40 columns):
 #   Column           Dtype  
---  ------           -----  
 0   X                float64
 1   Y                float64
 2   OBJECTID         int64  
 3   INCKEY           int64  
 4   COLDETKEY        int64  
 5   REPORTNO         object 
 6   STATUS           object 
 7   ADDRTYPE         object 
 8   INTKEY           float64
 9   LOCATION         object 
 10  EXCEPTRSNCODE    object 
 11  EXCEPTRSNDESC    object 
 12  SEVERITYCODE     object 
 13  SEVERITYDESC     object 
 14  COLLISIONTYPE    object 
 15  PERSONCOUNT      int64  
 16  PEDCOUNT         int64  
 17  PEDCYLCOUNT      int64  
 18  VEHCOUNT         int64  
 19  INJURIES         int64  
 20  SERIOUSINJURIES  int64  
 21  FATALITIES       int64  
 22  INCDATE          object 
 23  INCDTTM          object 
 24  JUNCTIONTYPE     object 
 25  SDOT_COLCODE     float64
 26  SDOT_COLDESC     object 
 27  INATTENTIONIND

<h2 id="data_wrangling">Data Wrangling</h2>

Steps for working with missing data:
<ol>
    <li>Identify missing data.</li>
    <li>Deal with missing data.</li>
    <li>Correct data format.</li>
</ol>

<h3 id="identifying_missing_data">Identifying Missing Data</h3>

When pandas reads the CSV file, missing values are converted to NaN, its default missing-value marker. We use pandas's built-in methods to identify these missing values. There are two methods to detect missing data:
<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
Each method returns an object of boolean values with the same shape as its input: <b>.isnull()</b> marks every missing entry with True, and <b>.notnull()</b> marks every present entry with True.
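
As a minimal sketch of how the two methods relate, the toy DataFrame below (hypothetical values; only the column names are borrowed from the collisions data) shows that `.notnull()` is the element-wise complement of `.isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the collisions dataset.
toy = pd.DataFrame({
    "WEATHER": ["Raining", np.nan, "Clear"],
    "SPEEDING": [np.nan, np.nan, "Y"],
})

# isnull() marks missing entries with True.
missing_mask = toy.isnull()
# notnull() marks present entries with True.
present_mask = toy.notnull()

print(missing_mask)
# Check that the two masks are exact complements of each other.
print((missing_mask == ~present_mask).all().all())
```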

In [6]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,True,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,True,False,True,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,False,True,True,True,False,False,False
3,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,True,True,False,True,False,False,False
4,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,True,True,False,True,False,False,False


"True" indicates a missing value, while "False" indicates that a value is present.

<h4>Count the Missing Values in each Column</h4>
<p>
We use a for loop to count the number of missing ("True") values in each column of the collisions DataFrame.
</p>
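
A more compact way to get the same counts, sketched here on a hypothetical toy DataFrame, is to sum the boolean mask: True counts as 1, so `.isnull().sum()` gives the number of missing values per column and `.isnull().mean()` gives the proportion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the collisions dataset.
toy = pd.DataFrame({
    "X": [1.0, np.nan, 3.0, 4.0],
    "ADDRTYPE": ["Block", "Alley", None, None],
    "STATUS": ["Matched", "Matched", "Unmatched", "Matched"],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()
# Proportion of missing values in each column.
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

The explicit loop remains useful when you also want to see the count of non-missing values for each column.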

In [7]:
# Print the number of missing ("True") values in each column of the collisions DataFrame
for column in list(missing_data.columns):
    print(column, "(True -> NaN)")
    print(missing_data[column].value_counts())
    print()    

X (True -> NaN)
False    213918
True       7471
Name: X, Length: 2, dtype: int64

Y (True -> NaN)
False    213918
True       7471
Name: Y, Length: 2, dtype: int64

OBJECTID (True -> NaN)
False    221389
Name: OBJECTID, Length: 1, dtype: int64

INCKEY (True -> NaN)
False    221389
Name: INCKEY, Length: 1, dtype: int64

COLDETKEY (True -> NaN)
False    221389
Name: COLDETKEY, Length: 1, dtype: int64

REPORTNO (True -> NaN)
False    221389
Name: REPORTNO, Length: 1, dtype: int64

STATUS (True -> NaN)
False    221389
Name: STATUS, Length: 1, dtype: int64

ADDRTYPE (True -> NaN)
False    217677
True       3712
Name: ADDRTYPE, Length: 2, dtype: int64

INTKEY (True -> NaN)
True     149505
False     71884
Name: INTKEY, Length: 2, dtype: int64

LOCATION (True -> NaN)
False    216801
True       4588
Name: LOCATION, Length: 2, dtype: int64

EXCEPTRSNCODE (True -> NaN)
True     120403
False    100986
Name: EXCEPTRSNCODE, Length: 2, dtype: int64

EXCEPTRSNDESC (True -> NaN)
True     209610
False   

In [8]:
# Initialize a list to hold the names of all the columns that are missing data.
list_of_columns_with_missing_data = list()

# For each column in the collisions DataFrame,
# if the Series contains at least one NaN, 
# then add the column name to the list of column names that are missing data.
for column in df.columns.values.tolist():
    if df[column].hasnans:
        list_of_columns_with_missing_data.append(column)

print("Total number of columns: %d" % df.columns.size)
print()
print("Names of all columns:")
print(df.columns)
print()
print("Number of columns that are missing data: %d" % len(list_of_columns_with_missing_data))
print()
print("Names of columns missing data:")
print(list_of_columns_with_missing_data)

Total number of columns: 40

Names of all columns:
Index(['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS',
       'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC',
       'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT',
       'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES',
       'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE',
       'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND',
       'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE',
       'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

Number of columns that are missing data: 22

Names of columns missing data:
['X', 'Y', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM'

In [9]:
# Note: value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
# For each column with missing data,
# print normalized value counts.
starting_index = 0
for column in list_of_columns_with_missing_data:#[starting_index:starting_index + 1]:
    print(column, ": ", df[column].dtype, sep='')
    print()
    print("Relative Frequencies:")
    print(df[column].value_counts(normalize=True, dropna=False))
    print()

X: float64

Relative Frequencies:
nan           0.034
1271306.397   0.001
1268353.834   0.001
1271692.215   0.001
1268385.368   0.001
               ... 
1278324.402   0.000
1269153.198   0.000
1264316.899   0.000
1282793.205   0.000
1274789.992   0.000
Name: X, Length: 24973, dtype: float64

Y: float64

Relative Frequencies:
nan          0.034
262090.949   0.001
265256.610   0.001
223960.667   0.001
268124.504   0.001
              ... 
243483.471   0.000
250917.522   0.000
249624.026   0.000
261384.258   0.000
199572.449   0.000
Name: Y, Length: 24971, dtype: float64

ADDRTYPE: object

Relative Frequencies:
Block          0.655
Intersection   0.325
NaN            0.017
Alley          0.004
Name: ADDRTYPE, Length: 4, dtype: float64

INTKEY: float64

Relative Frequencies:
nan         0.675
29973.000   0.001
29933.000   0.001
29913.000   0.001
29549.000   0.001
             ... 
31672.000   0.000
37254.000   0.000
31674.000   0.000
35224.000   0.000
27795.000   0.000
Name: INTKEY, Lengt

<h3 id="deal_with_missing_data">Deal with Missing Data</h3>

<ol>
    <li>Drop the Data
        <ol>
            <li>Drop entire row.</li>
            <li>Drop entire column.</li>
        </ol>
    </li>
    <li>Replace the Data
        <ol>
            <li>Replace data by mean.</li>
            <li>Replace data by frequency.</li>
            <li>Replace data based on other functions.</li>
        </ol>
    </li>
        
</ol>

Whole columns should be dropped only if most entries in the column are empty.
If the target variable to be predicted, "SEVERITYCODE", is missing from a row,
then that entire row must be dropped from the DataFrame, because a row without a label cannot be used to train or evaluate the model.
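
The options listed above can be sketched on a small example. The toy DataFrame and its values are hypothetical, and filling "PERSONCOUNT" and "WEATHER" is shown only to illustrate replacement by mean and by frequency; it is not necessarily the treatment chosen for the real collisions data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the collisions dataset.
toy = pd.DataFrame({
    "SEVERITYCODE": ["1", "2", np.nan, "1"],
    "PERSONCOUNT": [2.0, np.nan, 3.0, 4.0],
    "WEATHER": ["Clear", "Clear", "Raining", np.nan],
})

# Drop every row in which the target, SEVERITYCODE, is missing.
cleaned = toy.dropna(subset=["SEVERITYCODE"]).copy()

# Replace by mean: fill numeric gaps with the column mean.
person_mean = cleaned["PERSONCOUNT"].mean()
cleaned["PERSONCOUNT"] = cleaned["PERSONCOUNT"].fillna(person_mean)

# Replace by frequency: fill categorical gaps with the most common value.
most_frequent_weather = cleaned["WEATHER"].mode()[0]
cleaned["WEATHER"] = cleaned["WEATHER"].fillna(most_frequent_weather)

print(cleaned)
```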

In [26]:
type(np.nan)

float

In [10]:
# Print all the column labels for the collisions DataFrame.
df.columns

Index(['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS',
       'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC',
       'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT',
       'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES',
       'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE',
       'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND',
       'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE',
       'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

In [43]:
# NOTE: astype(self: ~FrameOrSeries, dtype, copy: bool = True, errors: str = 'raise') -> ~FrameOrSeries 
#    dtype : data type, or dict of column name -> data type
#        Use a numpy.dtype or Python type to cast entire pandas object to
#        the same type. Alternatively, use {col: dtype, ...}, where col is a
#        column label and dtype is a numpy.dtype or Python type to cast one
#        or more of the DataFrame's columns to column-specific types.

# Print a description of the selected column and the first few values of the column
# from the collisions DataFrame,
# where the dtype is presented first in its original form from the collisions DataFrame,
# then again after being cast to categorical type.
column = list(df.columns.values)[9]
#print(column, " dtype: ", df[column].dtype, sep='')
#print(df[[column]].info())
print(df[[column]].describe(include="all"))
print(column, "Relative Frequency")
print(df[column].value_counts(normalize=True, dropna=False))
#print(df[[column]].head())
print()
print()
print(column,"AFTER CONVERSION TO CATEGORICAL TYPE:")
print()
#print(df[[column]].astype(dtype="category").info())
print(df[[column]].astype(dtype="category").describe(include="all"))
print(column, "Relative Frequency")
print(df[column].astype(dtype="category").value_counts(normalize=True, dropna=False))
#print(df[[column]].astype(dtype="category").head())

                                                              LOCATION
count                                                           216801
unique                                                           25198
top     BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N
freq                                                               298

[4 rows x 1 columns]
LOCATION Relative Frequency
NaN                                                              0.021
BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N   0.001
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N         0.001
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB   0.001
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                   0.001
                                                                  ... 
SW GENESEE ST BETWEEN 44TH AVE SW AND 45TH AVE SW                0.000
16TH AVE NE BETWEEN NE 85TH ST AND NE 86TH ST                    0.000
37TH AVE NE BETWEEN NE 80TH

In [36]:
# Drop any column from the collisions DataFrame if it satisfies at least
# one of the following conditions:

# 1) the column contains only identification keys or codes with no predictive value;
# 2) the column's data does not fit into a small number of categories, such as an address or location description;
# 3) a significant proportion (>15%) of the data is NaN; or
# 4) it is not clear how to interpret the data.
list_of_columns_to_drop = ["X", "Y", "OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO", "INTKEY", "LOCATION" ,"EXCEPTRSNCODE", "EXCEPTRSNDESC", "INATTENTIONIND", "PEDROWNOTGRNT", "SDOTCOLNUM", "SPEEDING"]
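
Condition 3 can also be checked programmatically. The sketch below uses a hypothetical toy DataFrame and a hypothetical variable name, `nan_threshold`, to list the columns whose share of NaN values exceeds 15%. The other conditions still require manual judgment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with ten rows; the column names mirror the collisions dataset.
toy = pd.DataFrame({
    "INTKEY": [np.nan] * 7 + [29973.0, 29933.0, 29913.0],
    "ADDRTYPE": ["Block"] * 9 + [np.nan],
    "STATUS": ["Matched"] * 10,
})

# Assumed cutoff taken from condition 3 above.
nan_threshold = 0.15

# Proportion of NaN values in each column.
nan_share = toy.isnull().mean()

# Names of the columns whose NaN proportion exceeds the cutoff.
columns_over_threshold = nan_share[nan_share > nan_threshold].index.tolist()

print(columns_over_threshold)
```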

In [37]:
#NOTE: drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
# Drop the selected columns from the collisions DataFrame
# and store the result in a new DataFrame.
df_after_drop_columns = df.drop(columns=list_of_columns_to_drop, inplace=False)

In [38]:
# Print the first few rows of the DataFrame after dropping columns.
df_after_drop_columns.head()

Unnamed: 0,STATUS,ADDRTYPE,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,Matched,Block,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/22 00:00:00+00,1/22/2020 3:21:00 PM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",N,Raining,Wet,Dark - Street Lights On,11.0,From same direction - both going straight - both moving - sideswipe,0,0,N
1,Matched,Block,1,Property Damage Only Collision,Parked Car,2,0,0,2,0,0,0,2020/01/07 00:00:00+00,1/7/2020 8:00:00 AM,Mid-Block (not related to intersection),15.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE",N,Clear,Dry,Daylight,32.0,One parked--one moving,0,0,Y
2,Unmatched,Block,0,Unknown,,0,0,0,0,0,0,0,2004/01/30 00:00:00+00,1/30/2004,Mid-Block (but intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,0,0,N
3,Unmatched,Block,0,Unknown,,0,0,0,0,0,0,0,2016/01/23 00:00:00+00,1/23/2016,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,,,,,,0,0,N
4,Unmatched,Block,0,Unknown,,0,0,0,0,0,0,0,2020/01/26 00:00:00+00,1/26/2020,Mid-Block (not related to intersection),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,,,,,,0,0,N


In [39]:
# Print a concise, technical summary of the collisions DataFrame.
df_after_drop_columns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 26 columns):
 #   Column           Dtype  
---  ------           -----  
 0   STATUS           object 
 1   ADDRTYPE         object 
 2   SEVERITYCODE     object 
 3   SEVERITYDESC     object 
 4   COLLISIONTYPE    object 
 5   PERSONCOUNT      int64  
 6   PEDCOUNT         int64  
 7   PEDCYLCOUNT      int64  
 8   VEHCOUNT         int64  
 9   INJURIES         int64  
 10  SERIOUSINJURIES  int64  
 11  FATALITIES       int64  
 12  INCDATE          object 
 13  INCDTTM          object 
 14  JUNCTIONTYPE     object 
 15  SDOT_COLCODE     float64
 16  SDOT_COLDESC     object 
 17  UNDERINFL        object 
 18  WEATHER          object 
 19  ROADCOND         object 
 20  LIGHTCOND        object 
 21  ST_COLCODE       object 
 22  ST_COLDESC       object 
 23  SEGLANEKEY       int64  
 24  CROSSWALKKEY     int64  
 25  HITPARKEDCAR     object 
dtypes: float64(1), int64(9), object(16)
memory u

In [41]:
# Print the proportion of missing ("True") values in each column.
for column in df_after_drop_columns.columns:
    print(column, "(True -> NaN)")
    print("Relative frequency:")
    print(df_after_drop_columns[column].isnull().value_counts(normalize=True, dropna=False))
    print(df_after_drop_columns[column].describe(include="all"))
    print()
    print()

STATUS (True -> NaN)
Relative frequency:
False   1.000
Name: STATUS, Length: 1, dtype: float64
count      221389
unique          2
top       Matched
freq       195232
Name: STATUS, Length: 4, dtype: object


ADDRTYPE (True -> NaN)
Relative frequency:
False   0.983
True    0.017
Name: ADDRTYPE, Length: 2, dtype: float64
count     217677
unique         3
top        Block
freq      144917
Name: ADDRTYPE, Length: 4, dtype: object


SEVERITYCODE (True -> NaN)
Relative frequency:
False   1.000
True    0.000
Name: SEVERITYCODE, Length: 2, dtype: float64
count     221388
unique         5
top            1
freq      137596
Name: SEVERITYCODE, Length: 4, dtype: object


SEVERITYDESC (True -> NaN)
Relative frequency:
False   1.000
Name: SEVERITYDESC, Length: 1, dtype: float64
count                             221389
unique                                 5
top       Property Damage Only Collision
freq                              137596
Name: SEVERITYDESC, Length: 4, dtype: object


COLLISIONTYPE 

In [None]:
# dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
# Drop any row that contains at least one NaN.
# Note: `how` and `thresh` are alternatives, so only `how` is passed here.
print("Number of columns: %d" % len(df_after_drop_columns.columns))
df_after_drop_columns_and_rows = df_after_drop_columns.dropna(axis="index", how="any")
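
The difference between the `how` and `thresh` parameters of `dropna` can be sketched on a hypothetical toy DataFrame: `how="any"` keeps only rows with no NaN at all, while `thresh=k` keeps rows that have at least `k` non-missing values. Recent pandas versions reject calls that set both parameters at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the collisions dataset.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, np.nan],
    "ROADCOND": ["Dry", "Wet", np.nan],
    "LIGHTCOND": ["Daylight", "Daylight", np.nan],
})

# Keep only rows without any NaN.
complete_rows = toy.dropna(axis="index", how="any")

# Keep rows that have at least two non-missing values.
mostly_complete_rows = toy.dropna(axis="index", thresh=2)

print(complete_rows)
print(mostly_complete_rows)
```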

<h4>Count the Missing Values in each Column</h4>
<p>
We use a for loop to count the number of missing ("True") values in each column of the collisions DataFrame.
</p>