# IBM Capstone (Week 3)

## Introduction/Business Problem

#### From a business or application perspective, many parties would benefit from knowing how weather, lighting, and road conditions affect car accident severity. Tow companies, police departments, and drivers themselves should be aware of when a crash is likely to be more severe. I treat this as a machine learning problem and use the car crash dataset, which records weather, lighting, and road conditions for each collision. From these factors, I aim to develop a model that predicts the severity of a car crash. Drivers can then be more vigilant and cautious when conditions suggest a crash may be more severe, and tow companies and police departments can respond more effectively.

## Data Understanding

### Here we explore and understand the dataset to figure out how to approach the problem.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("Data-Collisions.csv")
df.head()


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


#### There are many unimportant columns and many rows with missing data. I will select the columns to be used in the model and deal with missing data before training a model.

#### The SEVERITYCODE column rates a collision 1 if it caused property damage only and 2 if it caused injury.

#### I aim to predict whether or not an accident will cause injury based on factors like weather, lighting, and road conditions.

#### First, let's explore the data further.

In [3]:
df.shape

(194673, 38)

#### There are almost 200,000 rows and 38 columns.

In [4]:
print(df['SEVERITYCODE'].value_counts())

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64


#### There are two values for crash severity and most accidents are Category 1 (not severe).
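The imbalance is easier to judge as a proportion than as raw counts. The following is a minimal sketch, assuming a stand-in Series rebuilt from the counts printed above rather than the real `df`; the name `severity` is hypothetical.

```python
import pandas as pd

# Stand-in for df['SEVERITYCODE'], rebuilt from the counts printed above.
severity = pd.Series([1] * 136485 + [2] * 58188, name="SEVERITYCODE")

# normalize=True returns each class's share of the rows instead of raw counts.
shares = severity.value_counts(normalize=True)
print(shares.round(3))
```

On the real data the same call would be `df['SEVERITYCODE'].value_counts(normalize=True)`. Roughly 70% of the collisions fall in class 1.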

In [5]:
df_new=df[['SEVERITYCODE','WEATHER','ROADCOND','LIGHTCOND']]
df_new.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,Overcast,Wet,Daylight
1,1,Raining,Wet,Dark - Street Lights On
2,1,Overcast,Dry,Daylight
3,1,Clear,Dry,Daylight
4,2,Raining,Wet,Daylight


In [6]:
df_new.shape

(194673, 4)

In [7]:
df_new = df_new.dropna()
df_new.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
0,2,Overcast,Wet,Daylight
1,1,Raining,Wet,Dark - Street Lights On
2,1,Overcast,Dry,Daylight
3,1,Clear,Dry,Daylight
4,2,Raining,Wet,Daylight
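Before dropping rows, it can help to see how many values are missing in each column and how many rows `dropna()` removes. This is a small sketch on a made-up miniature of `df_new`; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_new; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", np.nan, "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Wet"],
    "LIGHTCOND": ["Daylight", "Dusk", "Daylight", np.nan],
})

missing_per_column = toy.isna().sum()   # NaN count for each column
toy_clean = toy.dropna()                # drops any row with at least one NaN
rows_dropped = len(toy) - len(toy_clean)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

In this notebook the row count falls from 194,673 to 189,337 after `dropna()`, so 5,336 rows are removed.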


### Here, I check the unique values of each factor.

In [8]:
df_new.SEVERITYCODE.unique()

array([2, 1], dtype=int64)

In [9]:
df_new.WEATHER.unique()

array(['Overcast', 'Raining', 'Clear', 'Unknown', 'Other', 'Snowing',
       'Fog/Smog/Smoke', 'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt',
       'Severe Crosswind', 'Partly Cloudy'], dtype=object)

In [10]:
df_new.ROADCOND.unique()

array(['Wet', 'Dry', 'Unknown', 'Snow/Slush', 'Ice', 'Other',
       'Sand/Mud/Dirt', 'Standing Water', 'Oil'], dtype=object)

In [11]:
df_new.LIGHTCOND.unique()

array(['Daylight', 'Dark - Street Lights On', 'Dark - No Street Lights',
       'Unknown', 'Dusk', 'Dawn', 'Dark - Street Lights Off', 'Other',
       'Dark - Unknown Lighting'], dtype=object)

In [12]:
dftest=df_new.copy()

## Methodology

### We will use one-hot encoding to convert these categorical features into numerical ones.

In [13]:
data=pd.get_dummies(dftest, columns=['WEATHER','ROADCOND','LIGHTCOND'], prefix=['wtr','rdc','ltc'])

In [14]:
data.head()

Unnamed: 0,SEVERITYCODE,wtr_Blowing Sand/Dirt,wtr_Clear,wtr_Fog/Smog/Smoke,wtr_Other,wtr_Overcast,wtr_Partly Cloudy,wtr_Raining,wtr_Severe Crosswind,wtr_Sleet/Hail/Freezing Rain,...,rdc_Wet,ltc_Dark - No Street Lights,ltc_Dark - Street Lights Off,ltc_Dark - Street Lights On,ltc_Dark - Unknown Lighting,ltc_Dawn,ltc_Daylight,ltc_Dusk,ltc_Other,ltc_Unknown
0,2,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,1,0,0,0,0,0,0,1,0,0,...,1,0,0,1,0,0,0,0,0,0
2,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,2,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
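To make the encoding concrete, here is a minimal sketch on a three-row toy frame (hypothetical data, two of the three factors). Each category becomes its own indicator column, and every row has exactly one indicator set per original factor.

```python
import pandas as pd

# Hypothetical three-row sample with two of the categorical factors.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1],
    "WEATHER": ["Overcast", "Raining", "Clear"],
    "ROADCOND": ["Wet", "Wet", "Dry"],
})

encoded = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND"], prefix=["wtr", "rdc"])
print(encoded.columns.tolist())

# Every row has exactly one active indicator among the wtr_ columns.
one_hot_ok = encoded.filter(like="wtr_").sum(axis=1).eq(1).all()
print(one_hot_ok)
```

On the full data the 11 weather, 9 road, and 9 lighting categories give 29 indicator columns, which together with SEVERITYCODE accounts for the 30 columns of `data`.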


In [15]:
data.dtypes

SEVERITYCODE                    int64
wtr_Blowing Sand/Dirt           uint8
wtr_Clear                       uint8
wtr_Fog/Smog/Smoke              uint8
wtr_Other                       uint8
wtr_Overcast                    uint8
wtr_Partly Cloudy               uint8
wtr_Raining                     uint8
wtr_Severe Crosswind            uint8
wtr_Sleet/Hail/Freezing Rain    uint8
wtr_Snowing                     uint8
wtr_Unknown                     uint8
rdc_Dry                         uint8
rdc_Ice                         uint8
rdc_Oil                         uint8
rdc_Other                       uint8
rdc_Sand/Mud/Dirt               uint8
rdc_Snow/Slush                  uint8
rdc_Standing Water              uint8
rdc_Unknown                     uint8
rdc_Wet                         uint8
ltc_Dark - No Street Lights     uint8
ltc_Dark - Street Lights Off    uint8
ltc_Dark - Street Lights On     uint8
ltc_Dark - Unknown Lighting     uint8
ltc_Dawn                        uint8
ltc_Daylight

In [16]:
import matplotlib as mpl
import matplotlib.pyplot as plt

In [17]:
print(df_new['WEATHER'].value_counts())
print(df_new['ROADCOND'].value_counts())
print(df_new['LIGHTCOND'].value_counts())
print(df_new['SEVERITYCODE'].value_counts())

Clear                       111008
Raining                      33117
Overcast                     27681
Unknown                      15039
Snowing                        901
Other                          824
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               55
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64
Dry               124300
Wet                47417
Unknown            15031
Ice                 1206
Snow/Slush           999
Other                131
Standing Water       115
Sand/Mud/Dirt         74
Oil                   64
Name: ROADCOND, dtype: int64
Daylight                    116077
Dark - Street Lights On      48440
Unknown                      13456
Dusk                          5889
Dawn                          2502
Dark - No Street Lights       1535
Dark - Street Lights Off      1192
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, d

In [18]:
df_sev=df_new.groupby('SEVERITYCODE')
print(df_sev.get_group(1)['WEATHER'].value_counts())
print(df_sev.get_group(2)['WEATHER'].value_counts())


Clear                       75200
Raining                     21949
Overcast                    18942
Unknown                     14227
Snowing                       732
Other                         708
Fog/Smog/Smoke                382
Sleet/Hail/Freezing Rain       85
Blowing Sand/Dirt              40
Severe Crosswind               18
Partly Cloudy                   2
Name: WEATHER, dtype: int64
Clear                       35808
Raining                     11168
Overcast                     8739
Unknown                       812
Fog/Smog/Smoke                187
Snowing                       169
Other                         116
Sleet/Hail/Freezing Rain       28
Blowing Sand/Dirt              15
Severe Crosswind                7
Partly Cloudy                   3
Name: WEATHER, dtype: int64


In [19]:
data.shape

(189337, 30)

### Since the dataset is unbalanced, we downsample the majority type to balance it.

In [20]:
df_maj=data[data.SEVERITYCODE==1]
df_min=data[data.SEVERITYCODE==2]

In [21]:
df_maj.shape

(132285, 30)

In [22]:
df_min.shape

(57052, 30)

In [23]:
from sklearn.utils import resample
# Downsample the majority class to the size of the minority class (57052 rows)
df_majds=resample(df_maj, replace=False, n_samples=len(df_min), random_state=4)

In [24]:
df_dspl = pd.concat([df_majds, df_min])
df_dspl.SEVERITYCODE.value_counts()

2    57052
1    57052
Name: SEVERITYCODE, dtype: int64
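The downsampling step can be sketched on a toy frame as follows. The data and the names `majority`, `minority`, and `balanced` are hypothetical; the target size is read from the minority class instead of being typed in by hand.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy frame: six rows of class 1 and two rows of class 2.
toy = pd.DataFrame({"SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
                    "feature": range(8)})

majority = toy[toy.SEVERITYCODE == 1]
minority = toy[toy.SEVERITYCODE == 2]

# Sample without replacement down to the minority size, read from the data.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=4)

balanced = pd.concat([majority_down, minority])
print(balanced.SEVERITYCODE.value_counts())
```

Downsampling discards majority rows; it keeps the example simple, at the cost of training on less data.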

### Now that the dataset is balanced, I split the set into X and y arrays.

In [25]:
df_dspl.iloc[:,1:30]
df_dspl.iloc[:,0]

94585     1
115949    1
77361     1
144047    1
156792    1
         ..
194663    2
194666    2
194668    2
194670    2
194671    2
Name: SEVERITYCODE, Length: 114104, dtype: int64

In [26]:
X=np.asarray(df_dspl.iloc[:,1:30])
y=np.asarray(df_dspl.iloc[:,0])

In [27]:
X[0:5]

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0]], dtype=uint8)

In [28]:
y[0:5]

array([1, 1, 1, 1, 1], dtype=int64)

### Now I preprocess the dataset and split it into training and testing sets.

In [29]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.62168796e-02, -1.22297116e+00, -5.63366815e-02,
        -6.03446868e-02,  2.39652596e+00, -5.12762082e-03,
        -4.68900694e-01, -9.81899056e-03, -2.38742632e-02,
        -6.54700725e-02, -2.52974398e-01, -1.42780978e+00,
        -7.64489881e-02, -1.96408218e-02, -2.58167187e-02,
        -1.96408218e-02, -6.70709821e-02, -2.31276018e-02,
        -2.52090786e-01,  1.69994726e+00, -8.60662966e-02,
        -7.80562359e-02, -5.85986243e-01, -8.37355320e-03,
        -1.17733790e-01, -1.30776313e+00,  5.52847526e+00,
        -3.32487067e-02, -2.37571597e-01],
       [-1.62168796e-02, -1.22297116e+00, -5.63366815e-02,
        -6.03446868e-02, -4.17270673e-01, -5.12762082e-03,
         2.13264773e+00, -9.81899056e-03, -2.38742632e-02,
        -6.54700725e-02, -2.52974398e-01, -1.42780978e+00,
        -7.64489881e-02, -1.96408218e-02, -2.58167187e-02,
        -1.96408218e-02, -6.70709821e-02, -2.31276018e-02,
        -2.52090786e-01,  1.69994726e+00, -8.60662966e-02,
        -7.80

In [30]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (79872, 29) (79872,)
Test set: (34232, 29) (34232,)
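In the cells above, the scaler is fitted on all rows before the split. An alternative, shown here only as a sketch on random toy data, is to split first and wrap the scaler and the classifier in a scikit-learn `Pipeline`, so the scaling statistics come from the training rows alone. The arrays `X_toy` and `y_toy` are hypothetical, and this is not the approach used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 rows of 0/1 indicator features, labels 1 or 2.
rng = np.random.default_rng(4)
X_toy = rng.integers(0, 2, size=(200, 5)).astype(float)
y_toy = rng.integers(1, 3, size=200)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=4)

# The pipeline fits StandardScaler on X_tr only, then reuses it at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, solver="liblinear"))
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)
print(X_tr.shape, X_te.shape, proba.shape)
```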


## Results

### Now I train the model with Logistic Regression since there are only two outcomes.

#### This is the result with solver set to 'liblinear'

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.01, solver='liblinear')

In [32]:
yhat = LR.predict(X_test)
yhat
yhat_prob = LR.predict_proba(X_test)
yhat_prob

array([[0.46180034, 0.53819966],
       [0.46180034, 0.53819966],
       [0.50445951, 0.49554049],
       ...,
       [0.45821639, 0.54178361],
       [0.47745978, 0.52254022],
       [0.45618702, 0.54381298]])

In [33]:
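from sklearn.metrics import log_loss
# Note: yhat (the model's own predictions) is passed as the reference here, not y_test.
# Log loss against the true labels would be log_loss(y_test, yhat_prob).
log_loss(yhat, yhat_prob)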

0.5918495675963676
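`log_loss` expects the true labels as its first argument. The sketch below uses made-up labels and probabilities to show how the value differs when the model's own hard predictions are passed instead. All names and numbers in it are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities for classes 1 and 2.
y_true = np.array([1, 2, 2, 1])
prob = np.array([[0.60, 0.40],
                 [0.55, 0.45],
                 [0.30, 0.70],
                 [0.45, 0.55]])

# Hard predictions: the class with the larger probability in each row.
y_pred = np.array([1, 2])[prob.argmax(axis=1)]

loss_vs_truth = log_loss(y_true, prob)            # penalizes wrong, confident rows
loss_vs_own_predictions = log_loss(y_pred, prob)  # only measures confidence

print(loss_vs_truth, loss_vs_own_predictions)
```

Scoring against the predictions can never give a higher loss than scoring against the truth, because the predicted class is by construction the one with the larger probability. On the notebook's data the comparable call would be `log_loss(y_test, yhat_prob)`.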

In [34]:
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted') 

0.5309261554403156
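A single F1 number hides which class the model gets wrong. A confusion matrix and a per-class report break that down. This is a minimal sketch on made-up labels; `y_true` and `y_pred` here are hypothetical stand-ins for `y_test` and `yhat`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and yhat.
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 2, 1, 2, 2, 1]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
print(cm)

# Precision, recall, and F1 for each class separately.
print(classification_report(y_true, y_pred, labels=[1, 2]))
```

With the notebook's variables the calls would be `confusion_matrix(y_test, yhat, labels=[1, 2])` and `classification_report(y_test, yhat)`.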

#### This is with solver set to 'saga'

In [35]:
LR = LogisticRegression(C=0.01, solver='saga').fit(X_train,y_train)
LR
yhat = LR.predict(X_test)
yhat
yhat_prob = LR.predict_proba(X_test)
yhat_prob
# Note: yhat (the model's own predictions) is passed as the reference here, not y_test.
log_loss(yhat, yhat_prob)

0.5919055166160684

In [36]:
f1_score(y_test, yhat, average='weighted') 

0.5309261554403156

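#### With solver set to 'saga', the log loss is marginally higher (0.59191 vs. 0.59185 with 'liblinear') and the F1 score is identical, so the choice of solver makes no practical difference here.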

## Discussion

#### The dataset contained many missing values and many columns irrelevant to this analysis. I selected the factors of interest, removed rows with missing data, and balanced the dataset by downsampling the majority class. The columns of interest were categorical, so I converted them to numerical data using one-hot encoding. After standardizing the features, I split the data into training and testing sets and fitted a logistic regression model. Switching the solver from 'liblinear' to 'saga' left the results essentially unchanged. Model quality is reported through log loss and weighted F1 score. A weighted F1 score of about 0.53 on the balanced test set suggests that these three factors alone carry limited predictive power.

## Conclusion

#### Using the model shown here, accident severity can be estimated from weather, road, and lighting conditions, although the modest scores reported in the Results mean the predictions should be treated as a rough guide. Tow companies, police departments, and drivers themselves can be better informed before even entering their vehicles.