### Predicting the Severity of Automobile Accidents in Seattle, Washington ###

In the first week, you will define your project objectives, find the dataset you will use for this capstone project, and publish that dataset on GitHub.

In the second week, you will build your machine learning solution.

In the third week, you will finalize your model and prepare to submit your work.

To complete the capstone, you will work on a case study: predicting the severity of an accident.
Wouldn't it be great if something could warn you, given the weather and road conditions, about the possibility of getting into a car accident and how severe it would be, so that you could drive more carefully or even change your travel plans?
Let's use our shared data for Seattle, Washington as an example of how to work with accident data.

In [1]:
# Import packages.
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [2]:
# Set maximum number of columns and rows to display.
pd.set_option('display.max_columns', None) # Display all columns.
pd.set_option('display.max_rows', 100) # Display at most 100 rows.

In [3]:
# Attribute Information URL: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
# Read the Collisions Data CSV file and store it as a DataFrame.
url="http://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv?outSR={%22latestWkid%22:2926,%22wkid%22:2926}"
df=pd.read_csv(url, low_memory=False)

In [4]:
# View the first few rows of the collisions DataFrame.
df.head()

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,1273535.0,225839.133531,1,328476,329976,EA08706,Matched,Block,,BROADWAY BETWEEN E COLUMBIA ST AND BOYLSTON AVE,,,1,Property Damage Only Collision,Sideswipe,2,0,0,2,0,0,0,2020/01/22 00:00:00+00,1/22/2020 3:21:00 PM,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,N,Raining,Wet,Dark - Street Lights On,,,,11.0,From same direction - both going straight - bo...,0,0,N
1,1274202.0,245094.094895,2,328142,329642,EA06882,Matched,Block,,8TH AVE NE BETWEEN NE 45TH E ST AND NE 47TH ST,,,1,Property Damage Only Collision,Parked Car,2,0,0,2,0,0,0,2020/01/07 00:00:00+00,1/7/2020 8:00:00 AM,Mid-Block (not related to intersection),15.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE...",,N,Clear,Dry,Daylight,,,,32.0,One parked--one moving,0,0,Y
2,1271831.0,224042.636505,3,20700,20700,1181833,Unmatched,Block,,JAMES ST BETWEEN 6TH AVE AND 7TH AVE,,,0,Unknown,,0,0,0,0,0,0,0,2004/01/30 00:00:00+00,1/30/2004,Mid-Block (but intersection related),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,,,,,,4030032.0,,,,0,0,N
3,1272569.0,262054.386176,4,332126,333626,M16001640,Unmatched,Block,,NE NORTHGATE WAY BETWEEN 1ST AVE NE AND NE NOR...,,,0,Unknown,,0,0,0,0,0,0,0,2016/01/23 00:00:00+00,1/23/2016,Mid-Block (not related to intersection),11.0,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END ...",,,,,,,,,,,0,0,N
4,1280249.0,207323.48276,5,328238,329738,3857118,Unmatched,Block,,M L KING JR ER WAY S BETWEEN S ANGELINE ST AND...,,,0,Unknown,,0,0,0,0,0,0,0,2020/01/26 00:00:00+00,1/26/2020,Mid-Block (not related to intersection),28.0,MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT,,,,,,,,,,,0,0,N


In [5]:
# Print a concise, technical summary of the collisions DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 40 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   X                213918 non-null  float64
 1   Y                213918 non-null  float64
 2   OBJECTID         221389 non-null  int64  
 3   INCKEY           221389 non-null  int64  
 4   COLDETKEY        221389 non-null  int64  
 5   REPORTNO         221389 non-null  object 
 6   STATUS           221389 non-null  object 
 7   ADDRTYPE         217677 non-null  object 
 8   INTKEY           71884 non-null   float64
 9   LOCATION         216801 non-null  object 
 10  EXCEPTRSNCODE    100986 non-null  object 
 11  EXCEPTRSNDESC    11779 non-null   object 
 12  SEVERITYCODE     221388 non-null  object 
 13  SEVERITYDESC     221389 non-null  object 
 14  COLLISIONTYPE    195159 non-null  object 
 15  PERSONCOUNT      221389 non-null  int64  
 16  PEDCOUNT         221389 non-null  int6

<h2 id="data_wrangling">Data Wrangling</h2>

Steps for working with missing data:
<ol>
    <li>Identify missing data.</li>
    <li>Deal with missing data.</li>
    <li>Correct data format.</li>
</ol>

<h3 id="identifying_missing_data">Identifying Missing Data</h3>

When the CSV file is read, pandas represents missing values with its default marker, NaN. pandas provides two methods to detect these missing values:
<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
Each method returns an object of the same shape as the one it is called on, filled with boolean values: <b>.isnull()</b> marks missing entries as True, and <b>.notnull()</b> marks present entries as True.
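
As a small illustration of the two methods, the sketch below applies them to a toy DataFrame. The data values are hypothetical and only borrow two column names from the collisions dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the collisions dataset).
toy = pd.DataFrame({
    "WEATHER": ["Raining", np.nan, "Clear"],
    "VEHCOUNT": [2, 0, np.nan],
})

is_missing = toy.isnull()    # True where a value is missing
is_present = toy.notnull()   # True where a value is present

print(is_missing)
print(is_present)

# The two boolean masks are exact complements of each other.
assert (is_missing == ~is_present).all().all()
```

Either mask can be used for the same purpose; which one reads better depends on whether the next step counts missing values or present values.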

In [6]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INJURIES,SERIOUSINJURIES,FATALITIES,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,True,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,True,False,True,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,False,True,True,True,False,False,False
3,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,True,True,False,True,False,False,False
4,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,True,True,False,True,False,False,False


"True" indicates that a value is missing, while "False" indicates that a value is present.

<h4>Count the Missing Values in each Column</h4>
<p>
We use a for loop to count the number of missing ("True") values in each column of the collisions DataFrame.
</p>
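
Besides an explicit loop, the per-column counts can be obtained by summing the boolean mask, since True counts as 1. The sketch below is one possible alternative, shown on a toy DataFrame with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names are borrowed from the dataset.
toy = pd.DataFrame({
    "ADDRTYPE": ["Block", None, "Intersection", "Block"],
    "INTKEY": [np.nan, np.nan, 29973.0, np.nan],
    "PERSONCOUNT": [2, 2, 0, 3],
})

# Number of missing values per column (True is summed as 1).
missing_counts = toy.isnull().sum()

# Fraction of missing values per column (the mean of a boolean column).
missing_fractions = toy.isnull().mean()

# Names of the columns that contain at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(missing_fractions)
print(columns_with_missing)
```

The loop over `value_counts()` shows both the True and False counts for each column, whereas this form yields a single Series that is easier to sort or filter.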

In [7]:
# Count the number of missing ("True") values in each column.
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print()    

X
False    213918
True       7471
Name: X, dtype: int64

Y
False    213918
True       7471
Name: Y, dtype: int64

OBJECTID
False    221389
Name: OBJECTID, dtype: int64

INCKEY
False    221389
Name: INCKEY, dtype: int64

COLDETKEY
False    221389
Name: COLDETKEY, dtype: int64

REPORTNO
False    221389
Name: REPORTNO, dtype: int64

STATUS
False    221389
Name: STATUS, dtype: int64

ADDRTYPE
False    217677
True       3712
Name: ADDRTYPE, dtype: int64

INTKEY
True     149505
False     71884
Name: INTKEY, dtype: int64

LOCATION
False    216801
True       4588
Name: LOCATION, dtype: int64

EXCEPTRSNCODE
True     120403
False    100986
Name: EXCEPTRSNCODE, dtype: int64

EXCEPTRSNDESC
True     209610
False     11779
Name: EXCEPTRSNDESC, dtype: int64

SEVERITYCODE
False    221388
True          1
Name: SEVERITYCODE, dtype: int64

SEVERITYDESC
False    221389
Name: SEVERITYDESC, dtype: int64

COLLISIONTYPE
False    195159
True      26230
Name: COLLISIONTYPE, dtype: int64

PERSONCOUNT
False    22

In [8]:
# Initialize a list to hold the names of all the columns that are missing data.
list_of_columns_with_missing_data = list()

# For each column in the collisions DataFrame,
# if the Series contains at least one NaN, 
# then add the column name to the list of column names that are missing data.
for column in df.columns.values.tolist():
    if df[column].hasnans:
        list_of_columns_with_missing_data.append(column)

print("Total number of columns: %d" % df.columns.size)
print()
print("Number of columns missing data: %d" % len(list_of_columns_with_missing_data))
print()
print("Names of columns missing data:")
print(list_of_columns_with_missing_data)

Total number of columns: 40

Number of columns missing data: 22

Names of columns missing data:
['X', 'Y', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC']


In [9]:
# Note: value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
# For each column with missing data,
# print the relative frequency of each value, counting NaN as a value of its own.
for column in list_of_columns_with_missing_data:
    print(column, ": ", df[column].dtype, sep='')
    print("Relative Frequencies:")
    print(df[column].value_counts(normalize=True, dropna=False))
    print()

X: float64
Relative Frequencies:
NaN             0.033746
1.271306e+06    0.001337
1.268354e+06    0.001274
1.271692e+06    0.001247
1.268385e+06    0.001220
                  ...   
1.278324e+06    0.000005
1.269153e+06    0.000005
1.264317e+06    0.000005
1.282793e+06    0.000005
1.274790e+06    0.000005
Name: X, Length: 24973, dtype: float64

Y: float64
Relative Frequencies:
NaN              0.033746
262090.949056    0.001337
265256.609668    0.001274
223960.667289    0.001247
268124.504263    0.001220
                   ...   
243483.470509    0.000005
250917.522037    0.000005
249624.025908    0.000005
261384.257600    0.000005
199572.449408    0.000005
Name: Y, Length: 24971, dtype: float64

ADDRTYPE: object
Relative Frequencies:
Block           0.654581
Intersection    0.324695
NaN             0.016767
Alley           0.003957
Name: ADDRTYPE, dtype: float64

INTKEY: float64
Relative Frequencies:
NaN        0.675305
29973.0    0.001247
29933.0    0.000781
29913.0    0.000655
2954

<h3 id="deal_with_missing_data">Deal with Missing Data</h3>

<ol>
    <li>Drop the Data
        <ol>
            <li>Drop entire row.</li>
            <li>Drop entire column.</li>
        </ol>
    </li>
    <li>Replace the Data
        <ol>
            <li>Replace data by mean.</li>
            <li>Replace data by frequency.</li>
            <li>Replace data based on other functions.</li>
        </ol>
    </li>
        
</ol>
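
The options listed above can be sketched on a toy DataFrame as follows. The data values are hypothetical, and the choice of mean for a numeric column and most frequent value for a categorical column is an assumption for illustration, not necessarily the treatment applied to the collisions data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names are borrowed from the dataset.
toy = pd.DataFrame({
    "VEHCOUNT": [2.0, np.nan, 4.0, 2.0],
    "WEATHER": ["Clear", "Raining", None, "Clear"],
})

# 1a. Drop every row that has at least one missing value.
rows_dropped = toy.dropna(axis=0)

# 1b. Drop every column that has at least one missing value.
columns_dropped = toy.dropna(axis=1)

# 2a. Replace missing numeric values with the column mean.
mean_filled = toy["VEHCOUNT"].fillna(toy["VEHCOUNT"].mean())

# 2b. Replace missing categorical values with the most frequent value.
most_frequent = toy["WEATHER"].mode()[0]
mode_filled = toy["WEATHER"].fillna(most_frequent)

print(rows_dropped)
print(columns_dropped.shape)
print(mean_filled)
print(mode_filled)
```

Dropping keeps only values that were actually observed but discards data, while replacing keeps every row at the cost of introducing estimated values.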

Whole columns should be dropped only if most entries in the column are empty.
If the target variable, "SEVERITYCODE", is missing from a row,
then that entire row must be dropped from the DataFrame, because a row without a label cannot be used to train or evaluate the model.
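
A minimal sketch of dropping rows that lack the target, assuming `dropna` with the `subset` parameter is an acceptable way to do it. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names are borrowed from the dataset.
toy = pd.DataFrame({
    "SEVERITYCODE": ["1", "2", np.nan, "1"],
    "WEATHER": ["Clear", np.nan, "Raining", "Overcast"],
})

# Drop only the rows where the target is missing; rows that are missing
# other values (such as WEATHER) are kept for later treatment.
cleaned = toy.dropna(subset=["SEVERITYCODE"]).reset_index(drop=True)

print(cleaned)
```

Resetting the index afterwards keeps the row labels contiguous, which avoids surprises when rows are later accessed by position.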

In [10]:
# List of columns to drop because:
# 1) the data is an identification key or code offering no predictive value; or
# 2) the data does not fit into a few categories, such as an address or location description; or
# 3) a significant proportion (>15%) of the data is NaN; or
# 4) it is not clear how to interpret the data.
list_of_columns_to_drop = ['X', 'Y', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'INATTENTIONIND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING']

In [11]:
# drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
df_after_drop_columns = df.drop(columns=list_of_columns_to_drop, inplace=False)
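
Criterion 3 in the comment list (more than 15% NaN) can also be checked programmatically rather than by inspection. The sketch below shows one way to do this on a toy DataFrame with hypothetical values. The constant `MAX_MISSING_FRACTION` is an illustrative name, and the other criteria still require manual judgment.

```python
import numpy as np
import pandas as pd

# Assumed threshold mirroring criterion 3 (drop if more than 15% is NaN).
MAX_MISSING_FRACTION = 0.15

# Toy data (hypothetical values); column names are borrowed from the dataset.
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", "Clear"],
    "INTKEY": [np.nan, 29973.0, np.nan, np.nan, 29933.0],
})

# Fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold.
mostly_missing = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()

# Drop those columns; the original DataFrame is left unchanged.
reduced = toy.drop(columns=mostly_missing)

print(mostly_missing)
print(reduced.columns.tolist())
```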

In [15]:
# Print the fraction of present (True) and missing (False) values in each remaining column.
for column in df_after_drop_columns.columns.values.tolist():
    print(column, "(False -> NaN)")
    print(df_after_drop_columns[column].notnull().value_counts(normalize=True, dropna=False))
    print()   

OBJECTID (False -> NaN)
True    1.0
Name: OBJECTID, dtype: float64

INCKEY (False -> NaN)
True    1.0
Name: INCKEY, dtype: float64

COLDETKEY (False -> NaN)
True    1.0
Name: COLDETKEY, dtype: float64

REPORTNO (False -> NaN)
True    1.0
Name: REPORTNO, dtype: float64

STATUS (False -> NaN)
True    1.0
Name: STATUS, dtype: float64

ADDRTYPE (False -> NaN)
True     0.983233
False    0.016767
Name: ADDRTYPE, dtype: float64

SEVERITYCODE (False -> NaN)
True     0.999995
False    0.000005
Name: SEVERITYCODE, dtype: float64

SEVERITYDESC (False -> NaN)
True    1.0
Name: SEVERITYDESC, dtype: float64

COLLISIONTYPE (False -> NaN)
True     0.881521
False    0.118479
Name: COLLISIONTYPE, dtype: float64

PERSONCOUNT (False -> NaN)
True    1.0
Name: PERSONCOUNT, dtype: float64

PEDCOUNT (False -> NaN)
True    1.0
Name: PEDCOUNT, dtype: float64

PEDCYLCOUNT (False -> NaN)
True    1.0
Name: PEDCYLCOUNT, dtype: float64

VEHCOUNT (False -> NaN)
True    1.0
Name: VEHCOUNT, dtype: float64

INJURIES (Fa

<h4>Count the Missing Values in each Column</h4>
<p>
We use a for loop to count the number of missing ("True") values in each column of the collisions DataFrame.
</p>