## Car Accident Severity 


### Data Understanding
#### Exploratory Data Analysis

Importing the necessary libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
import pandas as pd

Import the required dataset

In [2]:
filename = "https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv"

In [7]:
df = pd.read_csv(filename, low_memory=False)
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  

In [9]:
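# describe is a method; the output below is the repr printed when it is referenced without parentheses
df.describe()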

<bound method NDFrame.describe of         SEVERITYCODE           X          Y  OBJECTID  INCKEY  COLDETKEY  \
0                  2 -122.323148  47.703140         1    1307       1307   
1                  1 -122.347294  47.647172         2   52200      52200   
2                  1 -122.334540  47.607871         3   26700      26700   
3                  1 -122.334803  47.604803         4    1144       1144   
4                  2 -122.306426  47.545739         5   17700      17700   
...              ...         ...        ...       ...     ...        ...   
194668             2 -122.290826  47.565408    219543  309534     310814   
194669             1 -122.344526  47.690924    219544  309085     310365   
194670             2 -122.306689  47.683047    219545  311280     312640   
194671             2 -122.355317  47.678734    219546  309514     310794   
194672             1 -122.289360  47.611017    219547  308220     309500   

       REPORTNO   STATUS      ADDRTYPE   INTKEY  ... 

In [11]:
df.shape

(194673, 38)

In [12]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

The goal is to predict the severity code from the other variables, which makes this a classification problem. We assume that severity depends on the following data:

- Accident location: latitude ("Y" column - float), longitude ("X" column - float)
- Road conditions: "ROADCOND" column - text
- Weather conditions: "WEATHER" column - text
- Junction type: "JUNCTIONTYPE" column - text
- Speeding: "SPEEDING" column - flag ("Y" or missing)
- Number of people involved: "PERSONCOUNT" column - integer
- Light conditions: "LIGHTCOND" column - text
- Number of vehicles involved: "VEHCOUNT" column - integer
- Date and time of the accident: "INCDATE", "INCDTTM" columns - text

### Data Preparation

In [13]:
df['SEVERITYCODE'].value_counts(normalize=True)

1    0.701099
2    0.298901
Name: SEVERITYCODE, dtype: float64

The dataset contains only two severity codes: "1" (property damage) and "2" (injury). This limits the prediction, because a classifier cannot learn labels that never appear in the dataset, such as "3" (fatality), "2b" (serious injury), and "0" (unknown).
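The class proportions above also give a reference point for later accuracy figures. As a rough sketch on hypothetical toy labels in about the same 70/30 ratio, a model that always predicts the majority class already reaches roughly 0.70 accuracy:

```python
from collections import Counter

# Hypothetical toy labels in roughly the 70/30 proportion reported above.
labels = [1] * 7 + [2] * 3

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)
```

Any accuracy reported on the imbalanced data should be read against this baseline.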

We can use the Folium library to see the collision distribution on the map.

In [16]:
import folium
seattle_map = folium.Map(location=[47.608013, -122.335167], zoom_start=12)
accidents = folium.map.FeatureGroup()

# Drop rows missing either coordinate first so latitude and longitude stay paired
coords = df[['Y', 'X']].dropna().head(1000)
for lat, lng in zip(coords['Y'], coords['X']):
    accidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=3,
            color='orange',
            fill=True,
            fill_color='black',
            fill_opacity=0.6
        )
    )
seattle_map.add_child(accidents)
seattle_map

In [18]:
#Drop rows with missing longitude or latitude before proceeding
df.dropna(subset=['X', 'Y'], inplace=True)
df[['X', 'Y', 'ROADCOND', 'WEATHER', 'JUNCTIONTYPE', 'SPEEDING', 'PERSONCOUNT', 'LIGHTCOND', 'VEHCOUNT', 'SEVERITYCODE']].isna().sum()

X                    0
Y                    0
ROADCOND          4858
WEATHER           4925
JUNCTIONTYPE      4193
SPEEDING        180619
PERSONCOUNT          0
LIGHTCOND         5012
VEHCOUNT             0
SEVERITYCODE         0
dtype: int64

The "SPEEDING" column is "Y" when speeding was a factor in the accident and missing otherwise. We therefore encode "Y" as 1 and fill the missing values with 0.

In [19]:
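df['SPEEDING'].value_counts()
# Encode SPEEDING as 0/1: 'Y' becomes 1 and missing values become 0
# (np.NaN was removed in NumPy 2.0, and in-place replace on a column selection is deprecated in pandas)
df['SPEEDING'] = df['SPEEDING'].map({'Y': 1}).fillna(0).astype(int)
df['SPEEDING'].value_counts()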

df.dropna(subset=['X', 'Y', 'ROADCOND', 'WEATHER', 'JUNCTIONTYPE', 'SPEEDING', 'PERSONCOUNT', 'LIGHTCOND', 'VEHCOUNT', 'SEVERITYCODE'], inplace=True)
len(df)

180086
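As a small illustration of the 0/1 encoding of the speeding flag, the sketch below uses a hypothetical five-row stand-in for the SPEEDING column. It assumes, as the counts above suggest, that the column holds only 'Y' or a missing value, and it checks that assumption before encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the SPEEDING column: 'Y' or missing.
speeding = pd.Series(['Y', np.nan, np.nan, 'Y', np.nan], name='SPEEDING')

# Confirm the assumption that 'Y' is the only non-missing value.
observed = set(speeding.dropna().unique())
assert observed == {'Y'}

# 'Y' becomes 1; missing values compare unequal to 'Y' and become 0.
speeding_flag = speeding.eq('Y').astype(int)
print(speeding_flag.tolist())
```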

In [20]:
#Class proportions of SEVERITYCODE after dropping rows with missing data
df['SEVERITYCODE'].value_counts(normalize=True)

1    0.690026
2    0.309974
Name: SEVERITYCODE, dtype: float64

In [21]:
#Value count of Road condition
df['ROADCOND'].value_counts()

Dry               120635
Wet                45607
Unknown            11386
Ice                 1162
Snow/Slush           971
Other                115
Standing Water        99
Sand/Mud/Dirt         62
Oil                   49
Name: ROADCOND, dtype: int64

In [22]:
#encoding categorical data
encoding_road_cond = {
    'ROADCOND': {
        'Dry': 1,
        'Wet': 2,
        'Unknown': 0,
        'Ice': 3,
        'Snow/Slush': 4,
        'Other': 0,
        'Standing Water': 5,
        'Sand/Mud/Dirt': 6,
        'Oil': 7
    }
}
df.replace(encoding_road_cond, inplace=True)
df['ROADCOND'].value_counts()

1    120635
2     45607
0     11501
3      1162
4       971
5        99
6        62
7        49
Name: ROADCOND, dtype: int64
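Integer codes such as these impose an arbitrary order on the categories (for example, "Oil" = 7 is numerically far from "Dry" = 1), which distance-based and linear models will treat as meaningful. One alternative, not used in this notebook, is one-hot encoding. The sketch below applies `pd.get_dummies` to a hypothetical five-row sample of the ROADCOND column:

```python
import pandas as pd

# Hypothetical five-row sample of the ROADCOND column.
road = pd.DataFrame({'ROADCOND': ['Dry', 'Wet', 'Ice', 'Dry', 'Unknown']})

# One indicator column per category; each row has exactly one 1.
one_hot = pd.get_dummies(road['ROADCOND'], prefix='ROADCOND', dtype=int)
print(one_hot.columns.tolist())
print(one_hot.sum(axis=1).tolist())
```

The trade-off is more columns: one per category instead of a single integer column.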

In [26]:
#weather encoded
df['WEATHER'].value_counts()

1    107698
2     31726
3     26815
0     12233
4       875
5       549
6       112
7        49
8        24
9         5
Name: WEATHER, dtype: int64

In [27]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              84517
At Intersection (intersection related)               60930
Mid-Block (but intersection related)                 22035
Driveway Junction                                    10430
At Intersection (but not related to intersection)     2030
Ramp Junction                                          139
Unknown                                                  5
Name: JUNCTIONTYPE, dtype: int64

In [28]:
encoding_junction = {
    'JUNCTIONTYPE': {
        'Mid-Block (not related to intersection)': 1,
        'At Intersection (intersection related)': 2,
        'Mid-Block (but intersection related)': 3,
        'Driveway Junction': 4,
        'At Intersection (but not related to intersection)': 5,
        'Ramp Junction': 6,
        'Unknown': 0
    }
}
df.replace(encoding_junction, inplace=True)
df['JUNCTIONTYPE'].value_counts()

1    84517
2    60930
3    22035
4    10430
5     2030
6      139
0        5
Name: JUNCTIONTYPE, dtype: int64

In [29]:
df['LIGHTCOND'].value_counts()

Daylight                    112229
Dark - Street Lights On      46686
Unknown                      10340
Dusk                          5709
Dawn                          2390
Dark - No Street Lights       1419
Dark - Street Lights Off      1130
Other                          172
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [30]:
encoding_light_cond = {
    'LIGHTCOND': {
        'Daylight': 1,
        'Dark - Street Lights On': 2,
        'Unknown': 0,
        'Dusk': 3,
        'Dawn': 4,
        'Dark - No Street Lights': 5,
        'Dark - Street Lights Off': 6,
        'Other': 0,
        'Dark - Unknown Lighting': 7
    }
}
df.replace(encoding_light_cond, inplace=True)
df['LIGHTCOND'].value_counts()

1    112229
2     46686
0     10512
3      5709
4      2390
5      1419
6      1130
7        11
Name: LIGHTCOND, dtype: int64

In [32]:
df[['INCDATE', 'INCDTTM']].head()

Unnamed: 0,INCDATE,INCDTTM
0,2013/03/27 00:00:00+00,3/27/2013 2:54:00 PM
1,2006/12/20 00:00:00+00,12/20/2006 6:55:00 PM
2,2004/11/18 00:00:00+00,11/18/2004 10:20:00 AM
3,2013/03/29 00:00:00+00,3/29/2013 9:26:00 AM
4,2004/01/28 00:00:00+00,1/28/2004 8:04:00 AM


In [33]:
#setting date features
df['INCDTTM'] = pd.to_datetime(df['INCDTTM'])
df['dayofweek'] = df['INCDTTM'].dt.dayofweek
df['hourofday'] = df['INCDTTM'].dt.hour
df[['dayofweek', 'hourofday']].head()

Unnamed: 0,dayofweek,hourofday
0,2,14
1,2,18
2,3,10
3,4,9
4,2,8
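To make the date features easier to interpret, the sketch below parses the first two INCDTTM values shown above with an explicit format string. The format string is an assumption based on those sample rows; rows in the full dataset that deviate from it would need separate handling. Note that `dt.dayofweek` counts from Monday = 0.

```python
import pandas as pd

# The first two INCDTTM values from the output above.
stamps = pd.Series(['3/27/2013 2:54:00 PM', '12/20/2006 6:55:00 PM'])

# Assumed format: month/day/year, 12-hour clock with an AM/PM marker.
parsed = pd.to_datetime(stamps, format='%m/%d/%Y %I:%M:%S %p')

day_index = parsed.dt.dayofweek.tolist()   # Monday = 0 ... Sunday = 6
day_name = parsed.dt.day_name().tolist()
hour = parsed.dt.hour.tolist()             # 24-hour clock
print(day_index, day_name, hour)
```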


In [34]:
#Separate the dependent variable and independent variables
X = df[['X', 'Y', 'ROADCOND', 'WEATHER', 'JUNCTIONTYPE', 'SPEEDING', 'LIGHTCOND', 'VEHCOUNT', 'PERSONCOUNT', 'dayofweek', 'hourofday']].copy()
Y = df['SEVERITYCODE'].copy()
Y.value_counts(normalize=True)

1    0.690026
2    0.309974
Name: SEVERITYCODE, dtype: float64

Balance the imbalanced dataset

In [45]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
# fit_sample was removed in imbalanced-learn 0.8; fit_resample is the current name
X_rus, y_rus = rus.fit_resample(X, Y)
print("Remove number of rows: ", len(X) - len(X_rus))
y_rus.value_counts()

Remove number of rows:  68442


2    55822
1    55822
Name: SEVERITYCODE, dtype: int64
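To illustrate what random undersampling does, here is a pandas-only sketch on a hypothetical ten-row frame. It is not the imbalanced-learn implementation; it simply draws as many rows from each class as the minority class contains.

```python
import pandas as pd

# Hypothetical toy frame with a 7/3 class split.
toy = pd.DataFrame({
    'feature': range(10),
    'SEVERITYCODE': [1] * 7 + [2] * 3,
})

# Size of the smallest class.
minority_size = int(toy['SEVERITYCODE'].value_counts().min())

# Draw minority_size rows from every class; random_state makes the draw repeatable.
balanced = toy.groupby('SEVERITYCODE').sample(n=minority_size, random_state=0)
balanced_counts = balanced['SEVERITYCODE'].value_counts().to_dict()
print(balanced_counts)
```

Fixing `random_state` keeps the sample, and therefore every later score, reproducible between runs.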

### Model Building

In [46]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_rus, y_rus, test_size=0.2, random_state=4)

In [47]:
#K Nearest Neighbor (KNN)
from sklearn.neighbors import KNeighborsClassifier
Ks = 15
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train a model with n neighbors and predict on the test set
    kNN_model = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = kNN_model.predict(X_test)

    # Test accuracy and its standard error for this n
    mean_acc[n - 1] = np.mean(yhat == y_test)
    std_acc[n - 1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])
mean_acc

array([0.60244525, 0.59106991, 0.61386538, 0.6126114 , 0.62434502,
       0.61735859, 0.62819652, 0.62434502, 0.63446639, 0.63083882,
       0.63778046, 0.632675  , 0.63563079, 0.63222715])
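The best-scoring number of neighbors can be read off this array programmatically. The sketch below copies the accuracies printed above and uses `np.argmax`; since the array starts at k = 1, the index is shifted by one.

```python
import numpy as np

# Test accuracies for k = 1..14, copied from the output above.
mean_acc = np.array([
    0.60244525, 0.59106991, 0.61386538, 0.6126114, 0.62434502,
    0.61735859, 0.62819652, 0.62434502, 0.63446639, 0.63083882,
    0.63778046, 0.632675, 0.63563079, 0.63222715,
])

# Index 0 corresponds to k = 1, so add 1 to the argmax.
best_k = int(np.argmax(mean_acc)) + 1
best_acc = float(mean_acc[best_k - 1])
print(best_k, best_acc)
```

Because k is chosen by test-set accuracy, the score for the winning k is an optimistic estimate; a separate validation split would avoid that.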

In [48]:
k = 13  # mean_acc above peaks at k = 11 (0.6378); k = 13 scores 0.6356
#Train Model and Predict  
kNN_model = KNeighborsClassifier(n_neighbors=k).fit(X_train,y_train)
kNN_model

KNeighborsClassifier(n_neighbors=13)

In [50]:
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 6)
DT_model.fit(X_train,y_train)
DT_model

DecisionTreeClassifier(criterion='entropy', max_depth=6)

In [51]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression(C=0.01).fit(X_train,y_train)
LR_model

LogisticRegression(C=0.01)

### Evaluation

In [52]:
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score

In [53]:
# For KNN

knn_y = kNN_model.predict(X_test)
print("KNN F1-score: %.2f" % f1_score(y_test, knn_y, average='weighted'))
print("KNN jaccard-score: %.2f" %jaccard_score(y_test, knn_y))

KNN F1-score: 0.64
KNN jaccard-score: 0.47


In [55]:
#For Decision Tree
DT_y = DT_model.predict(X_test)
print("Decision Tree F1-score: %.2f" % f1_score(y_test, DT_y, average='weighted'))
print("Decision Tree jaccard-score: %.2f" %jaccard_score(y_test, DT_y))

Decision Tree F1-score: 0.66
Decision Tree jaccard-score: 0.52


In [56]:
#For Logistic Regression
LR_y = LR_model.predict(X_test)
LR_y_pr = LR_model.predict_proba(X_test)
print("LR F1-score: %.2f" % f1_score(y_test, LR_y, average='weighted'))
print("LR jaccard-score: %.2f" %jaccard_score(y_test, LR_y))

LR F1-score: 0.62
LR jaccard-score: 0.47
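To clarify what the two metrics measure, the sketch below recomputes them by hand on hypothetical toy labels. The Jaccard score for one class is TP / (TP + FP + FN); the weighted F1 averages the per-class F1 scores, weighted by class support. With binary labels, `jaccard_score` uses `pos_label=1` by default, so the Jaccard values above refer to class 1 (property damage). The helper names here are illustrative and not part of scikit-learn.

```python
def class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def jaccard(y_true, y_pred, label):
    tp, fp, fn = class_counts(y_true, y_pred, label)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        tp, fp, fn = class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += f1 * y_true.count(label) / total
    return score


# Hypothetical toy labels, not taken from the notebook's test set.
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 1]

toy_jaccard = jaccard(y_true, y_pred, label=1)
toy_f1 = weighted_f1(y_true, y_pred)
print(toy_jaccard, toy_f1)
```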


### Result and Conclusion

- From the above evaluation, the F1-scores of the three models do not vary much (0.62 to 0.66). The Decision Tree scores highest (0.66), followed by KNN (0.64) and Logistic Regression (0.62).

- The Jaccard score agrees on the best model: the Decision Tree leads with 0.52, while KNN and Logistic Regression tie at 0.47.

- All three models were trained and evaluated on the same balanced train/test split. On both metrics the Decision Tree is the strongest choice for this project, although the margins are small.