### Predicting the Severity of Automobile Accidents in Seattle, Washington ###

In the first week, you will define your
project objectives, find the dataset that you will use for this capstone project, and publish
it on GitHub.

In the second week, you will build your machine
learning solution.

In the third week,
you will finalize your model and be ready
to submit your work.

To complete the capstone,
you will work on a case study: predicting the severity
of an accident.
Wouldn't it be great if something could warn you,
given the weather and the road conditions,
about the possibility of getting into a car accident and how severe it would be,
so that you could drive more carefully or even change your travel plans?
Let's use our shared data for Seattle, Washington as an example of how to work with accident data.

In [61]:
# Import common packages for Data Science applications.
import io
import itertools
import matplotlib as mpl
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import os
import pandas as pd
import pylab as pl
import scipy
import scipy.optimize as opt
import seaborn as sns
import sklearn
import sklearn.linear_model
import sys
from matplotlib.ticker import NullFormatter
from scipy import optimize
from scipy.optimize import curve_fit
from sklearn import linear_model
from sklearn import metrics
from sklearn import pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score
from sklearn.metrics import log_loss
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline

In [2]:
notebook_start_time = os.times()[4]

In [3]:
# Create a list of display options.
list_of_display_options_fully_qualified_names = str(\
"pd.options.display.chop_threshold, pd.options.display.float_format, pd.options.display.max_info_columns, pd.options.display.notebook_repr_html, \
pd.options.display.colheader_justify, pd.options.display.html, pd.options.display.max_info_rows, pd.options.display.pprint_nest_depth, \
pd.options.display.column_space, pd.options.display.large_repr, pd.options.display.max_rows, pd.options.display.precision, \
pd.options.display.date_dayfirst, pd.options.display.latex, pd.options.display.max_seq_items, pd.options.display.show_dimensions, \
pd.options.display.date_yearfirst, pd.options.display.max_categories, pd.options.display.memory_usage, pd.options.display.unicode, \
pd.options.display.encoding, pd.options.display.max_columns, pd.options.display.min_rows, pd.options.display.width, \
pd.options.display.expand_frame_repr, pd.options.display.max_colwidth, pd.options.display.multi_sparse").split(sep=', ')

# Initialize an empty list to store all the short names for display options.
list_of_display_options_short_names = list()
# For each fully qualified option name,
# get the option's short name and add it to the list of short names.
for fully_qualified_option_name in list_of_display_options_fully_qualified_names:
    # Get short option name.
    short_option_name = fully_qualified_option_name.split(sep='.')[-1]
    
    # Add short option name to list of display option short names.
    list_of_display_options_short_names.append(short_option_name)

# Define dictionary of display option settings.
dict_of_display_option_settings_short_names=\
{"max_info_columns": 1000,\
"colheader_justify": "right",\
"max_info_rows": 1000000,\
"column_space": 1000,\
"max_rows": 1000000,\
"precision": 9,\
"max_seq_items": 1000000000000,\
"show_dimensions": True,\
"max_categories": 1000,\
"memory_usage": True,\
"max_columns": 1000,\
"max_colwidth": 1000,\
"float_format": lambda x: '%.9f' % x}

# Set pandas display options using dictionary of short names,
# and display the options/value pairs.
print("Setting display options...")
for key in list(dict_of_display_option_settings_short_names.keys()):
    # Set display option.
    pd.set_option(key, dict_of_display_option_settings_short_names[key])
    # Print display option name and value.
    print(key, ": ", pd.get_option(key), sep='')

Setting display options...
max_info_columns: 1000
colheader_justify: right
max_info_rows: 1000000
column_space: 1000
max_rows: 1000000
precision: 9
max_seq_items: 1000000000000
show_dimensions: True
max_categories: 1000
memory_usage: True
max_columns: 1000
max_colwidth: 1000
float_format: <function <lambda> at 0x7efcabcff670>


In [4]:
# Attribute Information URL: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
# Read the Collisions Data CSV file and store it as a DataFrame.
# url="https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv" # HTTPError at 202009151050, using local copy of .csv instead.
# print(os.listdir("..")) # Print list of contents of the parent directory.
local_path_to_csv = "../Collisions.csv"
df=pd.read_csv(local_path_to_csv, low_memory=False)

<h2 id="data_wrangling">Data Wrangling</h2>

Steps for working with missing data:
<ol>
    <li>Identify missing data.</li>
    <li>Deal with missing data.</li>
    <li>Correct data format.</li>
</ol>

<h3 id="identifying_missing_data">Identifying Missing Data</h3>

The metadata document that accompanied the data set indicates that certain columns have "sentinel" values
that indicate an unknown or missing value. Each of these missing values will first be converted into NaN.
Subsequently, the NaN values will be dropped from the DataFrame.
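
The two steps described above can be sketched on a toy DataFrame. The rows below are made up for illustration; only the column names and the "Unknown" sentinel come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows (assumption: not taken from Collisions.csv).
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Raining"],
    "ROADCOND": ["Dry", "Wet", "Unknown"],
})

# Map each column to its sentinel and replace matches with NaN.
sentinels = {"WEATHER": "Unknown", "ROADCOND": "Unknown"}
toy_nan = toy.replace(to_replace=sentinels, value=np.nan)

# Drop every row that still contains at least one NaN.
toy_clean = toy_nan.dropna(axis="index", how="any")
print(toy_clean)
```

Only the first row survives, because each of the other two rows carries a sentinel in one column.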

In [5]:
# If any row of the collisions DataFrame contains a sentinel value representing "unknown",
# then replace it with NaN. 
# Sentinels for "unknown" are listed in the metadata form that accompanied the dataset.
df_unknowns_converted_to_nan = df.replace(to_replace=\
{"EXCEPTRSNCODE": " ",\
 "EXCEPTRSNDESC": "Not Enough Information, or Insufficient Location Information",\
 "SEVERITYCODE": "0",\
 "SEVERITYDESC": "Unknown",\
 "JUNCTIONTYPE": "Unknown",\
 "WEATHER": "Unknown",\
 "ROADCOND": "Unknown",\
 "LIGHTCOND": "Unknown",\
 "SDOT_COLCODE": float(0),\
 "SDOT_COLDESC": "NOT ENOUGH INFORMATION / NOT APPLICABLE",\
 "ST_COLCODE": " ",\
 "ST_COLDESC": "Not stated"},\
value=np.nan, inplace=False, limit=None, regex=False, method='pad')

df_unknowns_converted_to_nan.replace(to_replace={"ST_COLCODE": "0", }, value=np.nan, inplace=True, limit=None, regex=False, method='pad')

<h3 id="deal_with_missing_data">Deal with Missing Data</h3>

<ol>
    <li>Drop the Data
        <ol>
            <li>Drop entire row.</li>
            <li>Drop entire column.</li>
        </ol>
    </li>
    <li>Replace the Data
        <ol>
            <li>Replace data by mean.</li>
            <li>Replace data by frequency.</li>
            <li>Replace data based on other functions.</li>
        </ol>
    </li>
        
</ol>

Whole columns should be dropped only if most entries in the column are empty.
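
One way to check how empty each column is, sketched on made-up data: compute the fraction of NaN entries per column and compare it with a cutoff. The `threshold` value below is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows (assumption: not taken from Collisions.csv).
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Overcast"],
    "VEHCOUNT": [2, 1, 2, 3],
})

# Fraction of missing entries in each column.
missing_fraction = toy.isna().mean()

# Hypothetical cutoff for "most entries are empty".
threshold = 0.5
mostly_empty = missing_fraction[missing_fraction > threshold].index.tolist()

print(missing_fraction)
print(mostly_empty)
```

Here only `SPEEDING` exceeds the cutoff, since three of its four entries are missing.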

In [6]:
# Initialize a list to store the labels for the columns with missing data.
list_of_columns_with_missing_data = list()

# For each column in the collisions DataFrame,
# if the column contains at least one NaN, 
# then add the column's label to the list.
for column in list(df_unknowns_converted_to_nan.columns):
    if df_unknowns_converted_to_nan[column].hasnans:
        list_of_columns_with_missing_data.append(column)

In [7]:
print(list(df.columns))

['X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']


In [8]:
# Drop any column from the collisions DataFrame if it satisfies at least one of the following conditions:
# 1) more than 15% of the column's data is NaN;
# 2) the column only contains unique identification keys;
# 3) the column's data is naturally categorical but does not fit into a small (< 50) number of categories;
# 4) information in one column is redundant because it is already represented by another column;
# 5) it is not clear how to interpret the column's data.

list_of_columns_to_drop = [\
                           "ADDRTYPE",\
                           "STATUS",\
                           "OBJECTID",\
                           "INCKEY",\
                           "COLDETKEY",\
                           "REPORTNO",\
                           "INTKEY",\
                           "LOCATION",\
                           "EXCEPTRSNCODE",\
                           "EXCEPTRSNDESC",\
                           "SEVERITYDESC",\
                           "INCDATE",\
                           "SDOT_COLCODE",\
                           "SDOT_COLDESC",\
                           "INATTENTIONIND",\
                           "UNDERINFL",\
                           "PEDROWNOTGRNT",\
                           "SDOTCOLNUM",\
                           "SPEEDING",\
                           "ST_COLCODE",\
                           "ST_COLDESC",\
                           "SEGLANEKEY",\
                           "CROSSWALKKEY"]

In [9]:
# Drop the selected columns from the DataFrame after converting unknowns to NaN,
# and store the result in a new DataFrame.
df_drop_columns = df_unknowns_converted_to_nan.drop(columns=list_of_columns_to_drop, inplace=False)

In [10]:
# Drop any row that contains at least one NaN.
df_drop_columns_and_rows = df_drop_columns.dropna(axis="index", how="any", thresh=None, subset=None, inplace=False)

<h3 id="correct_data_format">Correct Data Format</h3>

Ensure that each data type is appropriate for the corresponding feature.
Convert integer data to "ordered" categorical types, e.g. SEVERITYCODE,
especially if the "integer ordering" of the original data is inappropriate.

If data represents date, time, or date/time information, then convert the data to the appropriate datetime representation.
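
A minimal sketch of both conversions on made-up rows. The severity ordering `"1" < "2" < "2b" < "3"` is an assumption about how the codes rank from least to most severe, and the timestamp strings are invented for illustration.

```python
import pandas as pd

# Toy frame with made-up rows (assumption: not taken from Collisions.csv).
toy = pd.DataFrame({
    "SEVERITYCODE": ["2", "1", "3", "2b"],
    "INCDTTM": ["2020-01-15 08:30:00", "2020-02-01 17:05:00",
                "2020-03-10 23:45:00", "2020-04-22 12:00:00"],
})

# Assumed ordering from least to most severe.
severity_order = ["1", "2", "2b", "3"]
toy["SEVERITYCODE"] = pd.Categorical(
    toy["SEVERITYCODE"], categories=severity_order, ordered=True
)

# Parse the date/time strings into datetime64 values.
toy["INCDTTM"] = pd.to_datetime(toy["INCDTTM"])

print(toy.dtypes)
print(toy["SEVERITYCODE"].max())
```

With an ordered categorical, comparisons and `max()` follow the declared order rather than plain string sorting.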

In [11]:
# Create new DataFrame to store converted data types.
df_converted = pd.DataFrame()

for column in list(df_drop_columns_and_rows.columns):
    # Cast column "INCDTTM" to type datetime.
    if column in ["INCDTTM"]:
        df_converted[column] = pd.to_datetime(df_drop_columns_and_rows[column], infer_datetime_format=True)
    # Cast columns of type object to type category.
    elif (df_drop_columns_and_rows[column].dtype in [np.dtype('object')]):
        df_converted[column] = df_drop_columns_and_rows[column].astype('category')
    # Copy all other columns to new DataFrame without changing their types.
    else:
        df_converted[column] = df_drop_columns_and_rows[column]

In [12]:
# Create DataFrame of categorical columns.
df_categorical = df_converted.select_dtypes(include="category")

#### Features before One Hot Encoding

In [13]:
list(df_categorical.columns)

['SEVERITYCODE',
 'COLLISIONTYPE',
 'JUNCTIONTYPE',
 'WEATHER',
 'ROADCOND',
 'LIGHTCOND',
 'HITPARKEDCAR']

In [14]:
df_categorical.head(10)

| | SEVERITYCODE | COLLISIONTYPE | JUNCTIONTYPE | WEATHER | ROADCOND | LIGHTCOND | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Sideswipe | Mid-Block (not related to intersection) | Raining | Wet | Dark - Street Lights On | N |
| 1 | 1 | Parked Car | Mid-Block (not related to intersection) | Clear | Dry | Daylight | Y |
| 5 | 1 | Rear Ended | Mid-Block (not related to intersection) | Clear | Dry | Daylight | N |
| 6 | 1 | Other | Mid-Block (but intersection related) | Clear | Wet | Daylight | N |
| 8 | 1 | Sideswipe | At Intersection (intersection related) | Overcast | Dry | Daylight | N |
| 9 | 1 | Sideswipe | Mid-Block (not related to intersection) | Clear | Dry | Daylight | N |
| 10 | 1 | Rear Ended | Mid-Block (not related to intersection) | Overcast | Dry | Daylight | N |
| 11 | 1 | Angles | Mid-Block (but intersection related) | Overcast | Dry | Daylight | N |
| 12 | 1 | Parked Car | Mid-Block (not related to intersection) | Clear | Wet | Dark - Street Lights On | N |
| 13 | 2 | Parked Car | Mid-Block (not related to intersection) | Overcast | Dry | Dark - Street Lights On | N |


In [15]:
#features = df_categorical[["COLLISIONTYPE", "WEATHER", "ROADCOND", "LIGHTCOND"]]
#features = df_categorical[["WEATHER", "ROADCOND", "LIGHTCOND"]]
features = df_categorical[["WEATHER"]]

In [16]:
list_of_features = list(features.columns)

In [17]:
print("SEVERITYCODE relative frequencies:")
print(df_categorical["SEVERITYCODE"].value_counts(normalize=True, dropna=False))
#print("SEVERITYCODE value counts:")
#print(df_categorical["SEVERITYCODE"].value_counts(normalize=False, dropna=False))

SEVERITYCODE relative frequencies:
1    0.655966184
2    0.324930284
2b   0.017183785
3    0.001919746
Name: SEVERITYCODE, Length: 4, dtype: float64


In [18]:
for feature in list_of_features:
    print(df_categorical.groupby(feature)["SEVERITYCODE"].value_counts(normalize=True, dropna=False))
    #print(df_categorical.groupby("SEVERITYCODE")[feature].value_counts(normalize=True, dropna=False))
    print()

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1              0.697674419
                          2              0.302325581
Clear                     1              0.655914077
                          2              0.324196986
                          2b             0.017903702
                          3              0.001985234
Fog/Smog/Smoke            1              0.655616943
                          2              0.333333333
                          2b             0.005524862
                          3              0.005524862
Other                     1              0.661354582
                          2              0.306772908
                          2b             0.019920319
                          3              0.011952191
Overcast                  1              0.664109691
                          2              0.317843246
                          2b             0.016120067
                          3              0.001926996
Partly 

In [19]:
for feature in list_of_features:
    print(df_categorical.groupby("SEVERITYCODE")[feature].value_counts(normalize=True, dropna=False))
    #print(df_categorical.groupby("SEVERITYCODE")[feature].value_counts(normalize=False, dropna=False))
    print()

SEVERITYCODE  WEATHER                 
1             Clear                      0.641666816
              Raining                    0.186245905
              Overcast                   0.160389854
              Snowing                    0.005817388
              Fog/Smog/Smoke             0.003186139
              Other                      0.001485671
              Sleet/Hail/Freezing Rain   0.000742836
              Blowing Sand/Dirt          0.000268495
              Severe Crosswind           0.000152147
              Partly Cloudy              0.000044749
2             Clear                      0.640269572
              Raining                    0.196216597
              Overcast                   0.154967749
              Fog/Smog/Smoke             0.003270277
              Snowing                    0.002945056
              Other                      0.001391223
              Sleet/Hail/Freezing Rain   0.000505899
              Blowing Sand/Dirt          0.000234882
       

In [20]:
features.head(10)

| | WEATHER |
|---|---|
| 0 | Raining |
| 1 | Clear |
| 5 | Clear |
| 6 | Clear |
| 8 | Overcast |
| 9 | Clear |
| 10 | Overcast |
| 11 | Overcast |
| 12 | Clear |
| 13 | Overcast |


#### Use one hot encoding technique to convert categorical variables to binary variables and append them to the features DataFrame

In [21]:
# For each feature of the features DataFrame,
# get dummy encoding for the feature,
# prefix the category column labels with the feature label and a '_' separator,
# and concatenate the one-hot encoded columns to the features DataFrame.
for feature in list(features.columns):
    features = pd.concat([features, pd.get_dummies(features[feature], prefix=feature, prefix_sep='_', dummy_na=False, columns=feature, sparse=False, drop_first=False)], axis=1)
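
The encoding step can be sketched on a made-up `WEATHER` column. One assumption worth flagging: pandas 2.0 and later return boolean indicator columns by default, and `select_dtypes(include="number")` does not pick up booleans, so the sketch passes `dtype=np.uint8` explicitly to keep the columns numeric.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows (assumption: not taken from Collisions.csv).
toy = pd.DataFrame({"WEATHER": ["Raining", "Clear", "Overcast", "Clear"]})

# One indicator column per category, labeled "WEATHER_<category>".
# dtype is set explicitly so the result is integer-valued on any pandas version.
dummies = pd.get_dummies(
    toy["WEATHER"], prefix="WEATHER", prefix_sep="_", dtype=np.uint8
)
encoded = pd.concat([toy, dummies], axis=1)

# Keep only the numeric indicator columns, dropping the original text column.
numeric_only = encoded.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Each row has exactly one indicator set to 1, which is the defining property of a one-hot encoding.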

### Feature selection

Let's define a features set represented by the numerical DataFrame X_not_normalized:

In [22]:
X_not_normalized = features.select_dtypes(include="number")

In [23]:
X_not_normalized.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 170335 entries, 0 to 221388
Data columns (total 10 columns):
 #   Column                            Non-Null Count   Dtype
---  ------                            --------------   -----
 0   WEATHER_Blowing Sand/Dirt         170335 non-null  uint8
 1   WEATHER_Clear                     170335 non-null  uint8
 2   WEATHER_Fog/Smog/Smoke            170335 non-null  uint8
 3   WEATHER_Other                     170335 non-null  uint8
 4   WEATHER_Overcast                  170335 non-null  uint8
 5   WEATHER_Partly Cloudy             170335 non-null  uint8
 6   WEATHER_Raining                   170335 non-null  uint8
 7   WEATHER_Severe Crosswind          170335 non-null  uint8
 8   WEATHER_Sleet/Hail/Freezing Rain  170335 non-null  uint8
 9   WEATHER_Snowing                   170335 non-null  uint8
dtypes: uint8(10)
memory usage: 2.9 MB


We also define the labels for the target variable, SEVERITYCODE:

In [24]:
y = df_categorical["SEVERITYCODE"].to_numpy()

In [25]:
y.shape

(170335,)

## Normalize Data 

We standardize the data so that each feature has zero mean and unit variance.
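
As a small worked example on made-up values, the sketch below scales one indicator column and compares the result with the formula z = (x - mean) / std. For a 0/1 column with a fraction p of ones, this maps 1 to sqrt((1 - p) / p) and 0 to -sqrt(p / (1 - p)), so a scaled one-hot column takes only two distinct values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up indicator column: 1 in three of four rows, so p = 0.75.
column = np.array([[1.0], [1.0], [1.0], [0.0]])

# Same call pattern as the notebook: fit on the data, then transform it.
scaled = StandardScaler().fit(column).transform(column)

# Manual z-score using the population standard deviation.
manual = (column - column.mean()) / column.std()

p = 0.75
value_for_one = np.sqrt((1 - p) / p)
value_for_zero = -np.sqrt(p / (1 - p))

print(scaled.ravel())
print(value_for_one, value_for_zero)
```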

In [26]:
# X is a normalized numpy ndarray.
X = preprocessing.StandardScaler().fit(X_not_normalized).transform(X_not_normalized)

In [27]:
X.shape

(170335, 10)

# Classification 

We split the normalized data and target labels into a training set and a test set.
We use the training set to build an accurate model.
Afterwards, we use the test set to report the accuracy of the model.
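
A minimal sketch of the split on made-up arrays, using the same `test_size=0.2` and `random_state=4` as this notebook. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and string class labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["1"] * 6 + ["2"] * 4)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4
)

print(X_tr.shape, X_te.shape)
```

When classes are as imbalanced as SEVERITYCODE, passing `stratify=y` to `train_test_split` is one option for keeping the class proportions equal in both sets; the notebook does not do this.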

We apply the following algorithms to produce various kinds of models.
- K Nearest Neighbor (KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression

# K Nearest Neighbor (KNN)
First, we look for the value of k that yields the most accurate model.
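
The search over k is not shown in this notebook, so the following is only a sketch of one common approach, run on synthetic data: fit a classifier for each candidate k and keep the one with the highest accuracy on held-out data. The candidate range and the synthetic dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (assumption: stands in for the collisions features).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4
)

# Hypothetical candidate values of k.
candidate_ks = range(1, 11)
accuracies = []
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns the mean accuracy on the held-out data.
    accuracies.append(model.score(X_val, y_val))

# Pick the k with the highest held-out accuracy (first one on ties).
best_k = candidate_ks[int(np.argmax(accuracies))]
print(best_k, max(accuracies))
```

On the full collisions data each fit and prediction is slow, so a search like this would likely need a subsample or a coarser grid of k values.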

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [29]:
y_train.shape[0]

136268

In [30]:
X_train[0:10,:]

array([[-0.01589048,  0.7472064 , -0.05655113, -0.03841538, -0.43387274,
        -0.00766233, -0.48313333, -0.01211574, -0.02576507, -0.06967864],
       [-0.01589048, -1.33831831, -0.05655113, -0.03841538,  2.30482331,
        -0.00766233, -0.48313333, -0.01211574, -0.02576507, -0.06967864],
       [-0.01589048,  0.7472064 , -0.05655113, -0.03841538, -0.43387274,
        -0.00766233, -0.48313333, -0.01211574, -0.02576507, -0.06967864],
       [-0.01589048,  0.7472064 , -0.05655113, -0.03841538, -0.43387274,
        -0.00766233, -0.48313333, -0.01211574, -0.02576507, -0.06967864],
       [-0.01589048,  0.7472064 , -0.05655113, -0.03841538, -0.43387274,
        -0.00766233, -0.48313333, -0.01211574, -0.02576507, -0.06967864],
       [-0.01589048, -1.33831831, -0.05655113, -0.03841538,  2.30482331,
        -0.00766233, -0.48313333, -0.01211574, -0.02576507, -0.06967864],
       [-0.01589048,  0.7472064 , -0.05655113, -0.03841538, -0.43387274,
        -0.00766233, -0.48313333, -0.01211574

In [31]:
# Define the best KNN model.
print("Building KNeighborsClassifier for number of neighbors k = 10 ...")
start_time = os.times()[4]
neigh_best = KNeighborsClassifier(n_neighbors = 10).fit(X_train, y_train)
end_time = os.times()[4]
total_elapsed_time = end_time - start_time
print("Completed in", total_elapsed_time, "seconds.")

Building KNeighborsClassifier for number of neighbors k = 10 ...
Completed in 290.589999999851 seconds.


# Build a Decision Tree Model

In [32]:
# Build a decision tree model from the training data previously generated.
start_time = os.times()[4]
decision_tree = DecisionTreeClassifier(criterion="entropy")
decision_tree.fit(X_train,y_train)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Built Decision Tree Model in", elapsed_time, "seconds.")
print()

Built Decision Tree Model in 0.7899999991059303 seconds.



# Build a Support Vector Machine Model

In [33]:
# Build a support vector machine model from the training data previously generated.
start_time = os.times()[4]
clf = svm.SVC(kernel='rbf', gamma='auto')
clf.fit(X_train, y_train)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Built Support Vector Machine Model in", elapsed_time, "seconds.")
print()

Built Support Vector Machine Model in 1562.800000000745 seconds.



# Build a Logistic Regression Model

In [34]:
# Build a logistic regression model from the training data previously generated.
start_time = os.times()[4]
lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Built Logistic Regression Model in", elapsed_time, "seconds.")
print()

Built Logistic Regression Model in 3.3900000005960464 seconds.



# Evaluate the Various Models

In [35]:
pd.Series(y_test).value_counts(normalize=True, dropna=False)

1    0.655502392
2    0.326063346
2b   0.016291426
3    0.002142836
Length: 4, dtype: float64

In [36]:
start_time = os.times()[4]
# Apply KNN to the test set, generate predictions for KNN.
print("Running command: y_knn_predictions=neigh_best.predict(X_test)")
y_knn_predictions=neigh_best.predict(X_test)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Completed in", elapsed_time, "seconds.")
print()

Running command: y_knn_predictions=neigh_best.predict(X_test)
Completed in 441.3499999977648 seconds.



In [37]:
y_knn_predictions.shape

(34067,)

In [38]:
pd.Series(y_knn_predictions).value_counts(normalize=True, dropna=False)

1   1.000000000
Length: 1, dtype: float64

In [39]:
# Apply Decision Tree to the test set, generate predictions for Decision Tree.
print("Running command: y_tree_predictions = decision_tree.predict(X_test)")
start_time = os.times()[4]
y_tree_predictions = decision_tree.predict(X_test)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Completed in", elapsed_time, "seconds")
print()

Running command: y_tree_predictions = decision_tree.predict(X_test)
Completed in 0.019999999552965164 seconds



In [40]:
pd.Series(y_tree_predictions).value_counts(normalize=True, dropna=False)

1   1.000000000
Length: 1, dtype: float64

In [41]:
# Apply SVM to the test set, generate predictions for SVM.
print("Running command: y_svm_predictions = clf.predict(X_test)")
start_time = os.times()[4]
y_svm_predictions = clf.predict(X_test)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Completed in", elapsed_time, "seconds.")
print()

Running command: y_svm_predictions = clf.predict(X_test)
Completed in 323.16999999806285 seconds.



In [42]:
pd.Series(y_svm_predictions).value_counts(normalize=True, dropna=False)

1   1.000000000
Length: 1, dtype: float64

In [43]:
# Apply Logistic Regression to the test set, generate predictions and probabilities for Logistic Regression.
print("Running command: y_lr_predictions = lr.predict(X_test)")
start_time = os.times()[4]
y_lr_predictions = lr.predict(X_test)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Completed in", elapsed_time, "seconds.")
print()

print("Running command: y_lr_probabilities = lr.predict_proba(X_test)")
start_time = os.times()[4]
y_lr_probabilities = lr.predict_proba(X_test)
end_time = os.times()[4]
elapsed_time = end_time - start_time
print("Completed in", elapsed_time, "seconds.")
print()

Running command: y_lr_predictions = lr.predict(X_test)
Completed in 0.05000000074505806 seconds.

Running command: y_lr_probabilities = lr.predict_proba(X_test)
Completed in 0.019999999552965164 seconds.



In [44]:
pd.Series(y_lr_predictions).value_counts(normalize=True, dropna=False)

1   1.000000000
Length: 1, dtype: float64

In [67]:
# Define numpy arrays to store the test results of the various algorithms.
# Each row holds one model; each column holds one SEVERITYCODE class ("1", "2", "2b", "3").
# row 0 => KNN
# row 1 => Decision Tree
# row 2 => SVM
# row 3 => Logistic Regression
jaccard = np.zeros((4,4))
f1 = np.zeros((4,4))
logloss = np.zeros((4,4))
logloss[0] = np.nan
logloss[1] = np.nan
logloss[2] = np.nan

In [69]:
# For KNN model, compute Jaccard score.
jaccard[0] = jaccard_score(y_test, y_knn_predictions, labels=["1", "2", "2b", "3"], average=None)
print("KNN Jaccard score is", jaccard[0,:])
# For KNN model, compute F1-score.
f1[0] = f1_score(y_test, y_knn_predictions, average=None)
print("KNN F1-score is", f1[0])
print()

KNN Jaccard score is [0.65550239 0.         0.         0.        ]
KNN F1-score is [0.79190751 0.         0.         0.        ]



In [70]:
# For Decision Tree model, compute Jaccard score.
jaccard[1] = jaccard_score(y_test, y_tree_predictions, labels=["1", "2", "2b", "3"], average=None)
print("Decision Tree Jaccard score is: ", jaccard[1])
# For Decision Tree model, compute F1-score.
f1[1] = f1_score(y_test, y_tree_predictions, average=None)
print("Decision Tree F1-score is: ", f1[1])
print()

Decision Tree Jaccard score is:  [0.65550239 0.         0.         0.        ]
Decision Tree F1-score is:  [0.79190751 0.         0.         0.        ]



In [71]:
# For SVM algorithm, compute Jaccard score.
jaccard[2] = jaccard_score(y_test, y_svm_predictions, labels=["1", "2", "2b", "3"], average=None)
print("SVM Jaccard score is", jaccard[2])
# For SVM algorithm, compute F1-score.
f1[2] = f1_score(y_test, y_svm_predictions, average=None)
print("SVM F1-score is", f1[2])
print()

SVM Jaccard score is [0.65550239 0.         0.         0.        ]
SVM F1-score is [0.79190751 0.         0.         0.        ]



In [72]:
# For logistic regression algorithm, compute Jaccard score.
jaccard[3] = jaccard_score(y_test, y_lr_predictions, labels=["1", "2", "2b", "3"], average=None)
print("Logistic Regression Jaccard similarity score is", jaccard[3])
# For logistic regression algorithm, compute F1-score.
f1[3] = f1_score(y_test, y_lr_predictions, average=None)
print("Logistic Regression F1-score is", f1[3])
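# For logistic regression algorithm, compute log loss.
# log_loss returns a single scalar for the whole model; assigning it to row 3
# repeats that value across all four columns.
logloss[3] = log_loss(y_test, y_lr_probabilities)
print("Logistic Regression log loss is", logloss[3])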

Logistic Regression Jaccard similarity score is [0.65550239 0.         0.         0.        ]
Logistic Regression F1-score is [0.79190751 0.         0.         0.        ]
Logistic Regression log loss is [0.72416609 0.72416609 0.72416609 0.72416609]
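
The identical scores across all four models follow from the prediction counts above: every model predicts class 1 for every test row. For such a constant predictor, if p is the fraction of test rows whose true class is 1, the class-1 Jaccard score equals p and the class-1 F1-score equals 2p / (1 + p); with p = 0.655502392 these give 0.65550239 and 0.79190751, matching the reported values. The sketch below checks the two relations on made-up label counts.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

labels = ["1", "2", "2b", "3"]

# Made-up label counts (assumption: 65% of rows are class "1").
y_true = np.array(["1"] * 65 + ["2"] * 32 + ["2b"] * 2 + ["3"] * 1)

# Constant predictor: always predict the majority class "1".
y_pred = np.full(y_true.shape, "1", dtype=y_true.dtype)

jac = jaccard_score(y_true, y_pred, labels=labels, average=None)
# zero_division=0 scores classes that are never predicted as 0 without a warning.
f1 = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)

p = np.mean(y_true == "1")
print("Jaccard:", jac, "expected for class 1:", p)
print("F1:", f1, "expected for class 1:", 2 * p / (1 + p))
```

Scores of zero for classes 2, 2b, and 3 therefore indicate that the models never predict those classes, not that the four models are equally good.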


In [45]:
notebook_end_time = os.times()[4]
notebook_total_elapsed_time = notebook_end_time - notebook_start_time
print("Notebook total elapsed time:", notebook_total_elapsed_time, "seconds.")

Notebook total elapsed time: 2728.390000000596 seconds.
