In [None]:
# Introduction

To reduce the frequency of car collisions in a community, we need an algorithm that predicts the severity of an accident from the current weather, road and visibility conditions. When conditions are bad, the model will alert drivers and remind them to be more careful.

# Data Understanding

Our target variable will be 'SEVERITYCODE' because it is used to measure the severity of an accident from 0 to 5 within the dataset. The attributes used to weigh the severity of an accident are 'WEATHER', 'ROADCOND' and 'LIGHTCOND'.

At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that we could feed to an algorithm, so label encoding was used to create new classes of type int8, a numerical data type.
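As a minimal sketch of this encoding step, the block below turns an 'object' column into int8 codes through pandas' categorical dtype. The toy frame and its values are illustrative assumptions, not rows from the collision dataset, and the code cell further down uses sklearn's `LabelEncoder` rather than this approach.

```python
import pandas as pd

# Toy frame standing in for the collision data (values are illustrative only)
toy = pd.DataFrame({"WEATHER": ["Clear", "Raining", "Clear", "Overcast"]})

# Convert the object column to the categorical dtype, then take its integer codes.
# With fewer than 128 categories pandas stores the codes as int8.
toy["WEATHER"] = toy["WEATHER"].astype("category")
toy["WEATHER_CAT"] = toy["WEATHER"].cat.codes

print(toy.dtypes)
```

Categories are sorted alphabetically before codes are assigned, so the mapping is reproducible for a fixed set of labels.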

After solving that issue we were presented with another: imbalanced data. As mentioned earlier, class 1 was nearly three times larger than class 2. The solution was to downsample the majority class with sklearn's resample tool. We downsampled it to match the minority class exactly, with 58188 values each.
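The downsampling step can be sketched as follows, assuming `sklearn.utils.resample` is applied to the majority-class rows without replacement. The toy frame, its 9-to-3 class ratio and the variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: class 1 is three times larger than class 2 (illustrative)
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 9 + [2] * 3,
    "WEATHER_CAT": list(range(12)),
})

majority = toy[toy["SEVERITYCODE"] == 1]
minority = toy[toy["SEVERITYCODE"] == 2]

# Draw as many majority rows as there are minority rows, without replacement
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=0
)

# Recombine into a balanced frame
balanced = pd.concat([majority_down, minority])
counts = balanced["SEVERITYCODE"].value_counts()
print(counts)
```

Fixing `random_state` keeps the sampled subset the same from run to run.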

Once we analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are ideal for this project, logistic regression made the most sense because of its binary nature.

The evaluation metrics used to test the accuracy of our models were the Jaccard index, F1 score and, for logistic regression, log loss. Trying different values of k, max depth and the hyperparameter C helped to improve accuracy as far as possible.
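One way such a search over k might look is sketched below. It runs on synthetic data from `make_classification`, not on the collision dataset, and the range of k values and the macro F1 selection criterion are assumptions. The same loop shape would apply to `max_depth` for a decision tree or `C` for logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-feature problem standing in for the encoded conditions
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit one model per candidate k and record its macro F1 on the validation split
scores = {}
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = f1_score(y_val, model.predict(X_val), average="macro")

# Keep the k with the highest validation score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```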

# Conclusion

Based on historical data in which weather conditions point to certain classes, we can conclude that particular weather conditions have some impact on whether travel results in property damage (class 1) or injury (class 2).

In [81]:
import pandas as pd
import numpy as np
df = pd.read_csv('https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv')


In [86]:
# Working columns
from sklearn.preprocessing import LabelEncoder

df_new = df[['SEVERITYCODE', 'WEATHER', 'LIGHTCOND', 'ROADCOND']]
df_new.dropna(inplace = True)


df_new['SEVERITYCODE'] = LabelEncoder().fit_transform(df_new['SEVERITYCODE'])
df_new["LIGHTCOND_CAT"] = LabelEncoder().fit_transform(df_new['LIGHTCOND'])
df_new["WEATHER_CAT"] = LabelEncoder().fit_transform(df_new['WEATHER'])
df_new["ROADCOND_CAT"] = LabelEncoder().fit_transform(df_new['ROADCOND'])

# Define X , y

X = df_new[['WEATHER_CAT', 'LIGHTCOND_CAT', 'ROADCOND_CAT']].values
y = df_new['SEVERITYCODE'].values

# Standardization

from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

# Train-test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
X_train.shape




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

(155759, 3)
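The SettingWithCopyWarning above is raised because `df_new` is a slice of `df` and is then modified. A minimal sketch of one way to avoid it, assuming an explicit `.copy()` is acceptable here, is shown on a toy frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the downloaded collisions frame
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", None, "Raining", "Clear"],
})

# dropna() returns a new frame and .copy() makes it explicitly independent,
# so adding columns afterwards no longer touches a slice of the original
df_new = toy[["SEVERITYCODE", "WEATHER"]].dropna().copy()
df_new["WEATHER_CAT"] = LabelEncoder().fit_transform(df_new["WEATHER"])

print(df_new)
```

The original frame is left untouched, which also makes it safe to re-run the cell.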

In [87]:
# Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)



from sklearn.metrics import classification_report, jaccard_similarity_score, f1_score, log_loss
print(jaccard_similarity_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average = 'macro'))
print(classification_report(y_test, y_pred))


0.6881612737544941
0.20403347885305698
              precision    recall  f1-score   support

           1       0.69      1.00      0.82     26803
           2       0.29      0.00      0.00     11470
           3       0.00      0.00      0.00       592
           4       0.00      0.00      0.00        75

   micro avg       0.69      0.69      0.69     38940
   macro avg       0.25      0.25      0.20     38940
weighted avg       0.56      0.69      0.56     38940
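The first printed score (0.688) matches the accuracy in the report because, for multiclass labels, `jaccard_similarity_score` computed plain accuracy. That function was deprecated and later removed from scikit-learn. The sketch below shows how the same quantities might be computed with functions available in current releases. The label arrays are toy assumptions that mimic a classifier predicting only the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Toy labels: the model predicts class 1 for every sample
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 1, 1, 1, 1])

# Fraction of exact matches (what the old multiclass score reported)
acc = accuracy_score(y_true, y_pred)

# Macro averages weight every class equally, so never-predicted classes
# pull the score down; zero_division=0 silences the undefined-precision warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
macro_jaccard = jaccard_score(y_true, y_pred, average="macro")

print(acc, macro_f1, macro_jaccard)
```

High accuracy alongside a low macro F1, as in the report above, is the pattern to expect when minority classes are almost never predicted.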





In [88]:
# Logistic regression
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=6, solver = 'liblinear').fit(X_train, y_train)
LR_pred = LR.predict(X_test)
LR_proba = LR.predict_proba(X_test)


print(jaccard_similarity_score(y_test, LR_pred))
print(f1_score(y_test, LR_pred, average = 'macro'))
print(log_loss(y_test, LR_proba))



0.6883153569594248
0.2038467973776676




ValueError: y_true and y_pred contain different number of classes 4, 5. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [1 2 3 4]
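The error says that the test labels contain four classes while `predict_proba` returned five columns, one per class seen during training. A minimal sketch of the fix suggested by the message is to pass the full class list through the `labels` argument of `log_loss`. The arrays below are hypothetical stand-ins. In the notebook, `LR.classes_` would presumably supply the label list.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical stand-ins: the test labels miss class 0, but the model was
# trained on five classes, so its probability matrix has five columns
y_true = np.array([1, 2, 3, 4])
proba = np.full((4, 5), 0.2)   # uniform probabilities, each row sums to 1
all_classes = [0, 1, 2, 3, 4]  # assumption: the classes seen in training

# Naming every class lets log_loss line up the columns with the labels
loss = log_loss(y_true, proba, labels=all_classes)
print(loss)
```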