# Understanding
The Seattle government aims to prevent avoidable car accidents by alerting drivers, the health system, and the police so that they take extra care in critical situations.
In most cases, inattention while driving, drug or alcohol abuse, and speeding are the main causes of accidents, and these can be curbed by enacting stricter regulations. Beyond these causes, weather, visibility, and road conditions are major uncontrollable factors; their impact can be reduced by uncovering hidden patterns in the data and issuing warnings to the local government, police, and drivers on the affected roads.
The target audience of the project is the local Seattle government, police, rescue groups, and, last but not least, car insurance companies. The model and its results are intended to help this audience make informed decisions that reduce the number of accidents and injuries in the city.
# Data
The data was collected by the Seattle Police Department and the Accident Traffic Records Department from 2004 to the present.
The data consists of 37 independent variables and 194,673 rows. The dependent variable, “SEVERITYCODE”, contains numbers from 0 to 4 that correspond to different levels of accident severity.
Severity codes are as follows:
0: Little to no Probability (Clear Conditions)
1: Very Low Probability (Chance of Property Damage)
2: Low Probability (Chance of Injury)
3: Mild Probability (Chance of Serious Injury)
4: High Probability (Chance of Fatality)
Furthermore, because some records contain null values, the data needs to be cleaned before any further processing.
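A minimal sketch of how null values might be inspected and handled with pandas. The tiny frame below is made-up illustration data, not the Seattle dataset; only the column names come from the text, and dropping incomplete rows is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; only the column names follow the text
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1],
    "WEATHER": ["Clear", np.nan, "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
})

# Count missing values per column
null_counts = toy.isna().sum()
print(null_counts)

# One simple option: keep only the fully populated rows
clean = toy.dropna().reset_index(drop=True)
print(len(clean))
```

Imputing a value (for example the most frequent category) is an alternative when dropping rows would discard too much data.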
# Data Preprocessing
The dataset in its original form is not ready for analysis. To prepare it, we first need to drop the irrelevant columns. In addition, most of the features are of the object data type and need to be converted into numerical data types.
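The two preparation steps described above could look roughly like this. It is a sketch on invented data: `REPORTNO` is a hypothetical stand-in for an irrelevant column, and integer category codes are only one of several ways to turn object columns into numbers.

```python
import pandas as pd

# Hypothetical toy records; "REPORTNO" stands in for an irrelevant column
toy = pd.DataFrame({
    "REPORTNO": ["a1", "b2", "c3"],
    "WEATHER": ["Clear", "Raining", "Clear"],
    "LIGHTCOND": ["Daylight", "Dark", "Daylight"],
})

# Step 1: drop columns that are not relevant to the analysis
toy = toy.drop(columns=["REPORTNO"])

# Step 2: convert the text columns to integer category codes
# (categories are ordered alphabetically, so "Clear" -> 0, "Raining" -> 1)
for col in ["WEATHER", "LIGHTCOND"]:
    toy[col] = toy[col].astype("category").cat.codes

print(toy)
```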
After analyzing the dataset, I decided to focus on four columns: severity, weather conditions, road conditions, and light conditions.
To get a good understanding of the dataset, I checked the distinct values of each feature. The results show that the target feature is imbalanced, so we use a simple statistical technique to balance it.
The number of rows in class 1 is almost three times larger than the number of rows in class 2. The issue can be addressed by downsampling class 1.
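Downsampling the majority class can be sketched as follows. The frame is a made-up toy example (six rows of class 1, two rows of class 2), and the `feature` column and `random_state` value are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical imbalanced toy target: six rows of class 1, two rows of class 2
toy = pd.DataFrame({"SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
                    "feature": range(8)})

majority = toy[toy["SEVERITYCODE"] == 1]
minority = toy[toy["SEVERITYCODE"] == 2]

# Randomly keep as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=0)

balanced = pd.concat([majority_down, minority])
counts = balanced["SEVERITYCODE"].value_counts()
print(counts)
```

Downsampling discards data from the larger class; that is usually acceptable when the majority class is large, as it is here.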
# Methodology
To implement the solution, I used GitHub as the repository and Jupyter Notebook to preprocess the data and build the machine learning models. The code is written in Python with its popular packages such as Pandas, NumPy, and scikit-learn.
Once the data was loaded into a Pandas DataFrame, I used the ‘dtypes’ attribute to check the feature names and their data types. Then I selected the most important features for predicting the severity of accidents in Seattle. Among all the features, the following have the most influence on the accuracy of the predictions:
“WEATHER”,
“ROADCOND”,
“LIGHTCOND”
Also, as mentioned earlier, “SEVERITYCODE” is the target variable.
I ran a value count on road conditions (‘ROADCOND’) and weather conditions (‘WEATHER’) to get an idea of the different road and weather conditions. I also ran a value count on light conditions (‘LIGHTCOND’) to see the breakdown of accidents across the different light conditions.
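As a rough illustration of the inspection steps just described, the snippet below checks `dtypes` and runs a value count on a toy column. The values are invented; only the column name `ROADCOND` comes from the text.

```python
import pandas as pd

# Hypothetical toy column; the real data has many more rows and categories
toy = pd.DataFrame({"ROADCOND": ["Dry", "Wet", "Dry", "Ice", "Dry"]})

# Column names and their data types
print(toy.dtypes)

# Number of records per road condition, most frequent first
road_counts = toy["ROADCOND"].value_counts()
print(road_counts)
```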

In [86]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  

df = pd.read_csv('Accident_Information.csv', sep=',')



In [87]:
encoding = {
"Carriageway_Hazards": {"None": 0, "Other object on road": 1, "Any animal in carriageway (except ridden horse)": 1,  "Pedestrian in carriageway - not injured": 1, "Previous accident": 1, "Vehicle load on road": 1,  "Data missing or out of range": 0  }
}
df.replace(encoding, inplace=True)
print(df['Carriageway_Hazards'].value_counts())

0    1882972
1      34302
Name: Carriageway_Hazards, dtype: int64


In [88]:
print(df['Light_Conditions'].value_counts())
encoding_light = {"Light_Conditions": {"Daylight": 0, "Darkness - lights lit": 1, "Darkness - no lighting": 1, "Darkness - lighting unknown": 1, "Darkness - lights unlit": 1, "Data missing or out of range": 0}}
df.replace(encoding_light, inplace=True)
print(df['Light_Conditions'].value_counts())

Daylight                        1403443
Darkness - lights lit            377297
Darkness - no lighting           105966
Darkness - lighting unknown       21513
Darkness - lights unlit            9042
Data missing or out of range         13
Name: Light_Conditions, dtype: int64
0    1403456
1     513818
Name: Light_Conditions, dtype: int64


In [89]:
print(df['Day_of_Week'].value_counts())
encoding_day_of_week = {"Day_of_Week": {"Saturday": 1, "Sunday": 1, "Monday": 0, "Tuesday": 0, "Wednesday": 0, "Thursday": 0, "Friday": 0}}
df.replace(encoding_day_of_week, inplace=True)
print(df['Day_of_Week'].value_counts())

Friday       313938
Wednesday    289261
Thursday     288443
Tuesday      286810
Monday       272546
Saturday     255926
Sunday       210350
Name: Day_of_Week, dtype: int64
0    1450998
1     466276
Name: Day_of_Week, dtype: int64


In [90]:
print(df['Special_Conditions_at_Site'].value_counts())
encoding_Special_Conditions_at_Site = {"Special_Conditions_at_Site": {"None": 0, "Roadworks": 1, "Oil or diesel": 1, "Mud": 1, "Road surface defective": 1, "Auto traffic signal - out": 1, "Road sign or marking defective or obscured": 1, "Auto signal part defective": 1, "Data missing or out of range": 0}}
df.replace(encoding_Special_Conditions_at_Site, inplace=True)
print(df['Special_Conditions_at_Site'].value_counts())

None                                          1870097
Roadworks                                       22173
Oil or diesel                                    6527
Mud                                              5988
Road surface defective                           4593
Auto traffic signal - out                        3547
Road sign or marking defective or obscured       2771
Auto signal part defective                        949
Data missing or out of range                      629
Name: Special_Conditions_at_Site, dtype: int64
0    1870726
1      46548
Name: Special_Conditions_at_Site, dtype: int64


In [91]:
encoding_1st_road_class = {"1st_Road_Class": {"A": 1, "A(M)": 1, "B": 2, "C": 3, "Motorway": 4, "Unclassified": 1}}
df.replace(encoding_1st_road_class, inplace=True)
df['1st_Road_Class'].value_counts()

1    1433546
2     243115
3     166972
4      73641
Name: 1st_Road_Class, dtype: int64

In [92]:
# Replace 'Data missing or out of range' with the most frequent value, 'Give way or uncontrolled'
df['Junction_Control'] = df['Junction_Control'].replace(['Data missing or out of range'], 'Give way or uncontrolled')

In [93]:
df['Junction_Control'].value_counts()

Give way or uncontrolled               1628156
Auto traffic signal                     197074
Not at junction or within 20 metres      77304
Stop sign                                11525
Authorised person                         3215
Name: Junction_Control, dtype: int64

In [94]:
# Named after the column it encodes (the original reused the Junction_Detail variable name)
encoding_junction_control = {"Junction_Control":
                            {"Give way or uncontrolled": 1,
                             "Auto traffic signal": 2,
                             "Not at junction or within 20 metres": 3,
                             "Stop sign": 4,
                             "Authorised person": 5}}
df.replace(encoding_junction_control, inplace=True)
df['Junction_Control'].value_counts()

1    1628156
2     197074
3      77304
4      11525
5       3215
Name: Junction_Control, dtype: int64

In [95]:
encoding_junction_detail = {"Junction_Detail": 
                            {"Not at junction or within 20 metres": 1,
                             "T or staggered junction": 2,
                             "Crossroads": 3, 
                             "Roundabout": 4,
                             "Private drive or entrance": 5,
                             "Other junction": 6,
                             "Slip road": 7,
                             "More than 4 arms (not roundabout)": 8,
                             "Mini-roundabout": 9,
                             "Data missing or out of range": 1 }}
df.replace(encoding_junction_detail, inplace=True)
df['Junction_Detail'].value_counts()

1    773070
2    596447
3    183931
4    166463
5     69487
6     54982
7     28004
8     23984
9     20906
Name: Junction_Detail, dtype: int64

In [96]:
encoding_road_surface_cond = {"Road_Surface_Conditions": 
                            {"Dry": 1,
                             "Wet or damp": 2,
                             "Frost or ice": 3, 
                             "Snow": 4,
                             "Flood over 3cm. deep": 5,
                             "Data missing or out of range": 1 }}
df.replace(encoding_road_surface_cond, inplace=True)
df['Road_Surface_Conditions'].value_counts()

1    1328795
2     535999
3      38002
4      11739
5       2739
Name: Road_Surface_Conditions, dtype: int64

In [97]:
encoding_road_type = {"Road_Type": 
                            {"Single carriageway": 1,
                             "Dual carriageway": 2,
                             "Roundabout": 3, 
                             "One way street": 4,
                             "Slip road": 5,
                             "Unknown": 0,
                             "Data missing or out of range": 1 }}
df.replace(encoding_road_type, inplace=True)
df['Road_Type'].value_counts()

1    1434072
2     283067
3     128337
4      39872
5      20082
0      11844
Name: Road_Type, dtype: int64

In [98]:
encoding_urban_rural = {"Urban_or_Rural_Area": 
                            {"Urban": 1,
                             "Rural": 2,
                             "Unallocated": 1 }}
df.replace(encoding_urban_rural, inplace=True)
df['Urban_or_Rural_Area'].value_counts()

1    1235039
2     682235
Name: Urban_or_Rural_Area, dtype: int64

In [99]:
encoding_weather = {"Weather_Conditions": 
                            {"Fine no high winds": 1,
                             "Raining no high winds": 2,
                             "Raining + high winds": 3,
                             "Fine + high winds": 4,
                             "Snowing no high winds": 5,
                             "Fog or mist": 6,
                             "Snowing + high winds": 7,
                             "Unknown": 1,
                             "Other": 1,
                             "Data missing or out of range": 1 }}
df.replace(encoding_weather, inplace=True)
df['Weather_Conditions'].value_counts()

1    1614899
2     224981
3      27241
4      24575
5      12746
6      10444
7       2388
Name: Weather_Conditions, dtype: int64

In [100]:
np.where(np.isnan(df['Speed_limit']))

(array([1801605, 1843133, 1843396, 1857338, 1857382, 1857458, 1857466,
        1857525, 1857526, 1857527, 1857531, 1857539, 1857561, 1857564,
        1857583, 1857610, 1857613, 1857618, 1857622, 1857627, 1857635,
        1857681, 1857704, 1857720, 1857736, 1857737, 1857772, 1898106,
        1898251, 1898467, 1898663, 1898938, 1899072, 1899103, 1899306,
        1899388, 1912877], dtype=int64),)

In [101]:
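# Fill missing speed limits with the column mean (assignment avoids chained in-place modification)
df['Speed_limit'] = df['Speed_limit'].fillna(df['Speed_limit'].mean())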

In [102]:
# Missing times are filled with 0, which the period() function below treats as night
df['Time'] = df['Time'].fillna(0)

In [103]:
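def period(row):
    """Return 1 for daytime (hours 9 to 19) and 2 otherwise."""
    if isinstance(row, float):
        # Float-encoded times: the hour is the part before the dot
        parts = str(row).split(".")
    else:
        # "HH:MM" strings (or the integer 0 used to fill missing times)
        parts = str(row).split(":")

    hour = int(parts[0])
    # The lower bound is exclusive, so 08:00-08:59 counts as night
    if 8 < hour < 20:
        return 1
    return 2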

In [105]:
df['Time'] = df['Time'].apply(period)

In [106]:
df_train1 = df[['1st_Road_Class','Carriageway_Hazards','Junction_Control','Day_of_Week','Junction_Detail','Light_Conditions','Road_Surface_Conditions','Road_Type','Special_Conditions_at_Site','Speed_limit','Time','Urban_or_Rural_Area','Weather_Conditions','Accident_Severity']]

In [107]:
df_slight = df_train1[df_train1['Accident_Severity']=='Slight']

In [108]:
df_serious = df_train1[df_train1['Accident_Severity']=='Serious']

In [109]:
df_fatal = df_train1[df_train1['Accident_Severity']=='Fatal']

In [110]:
df_serious['Accident_Severity'].value_counts()

Serious    263805
Name: Accident_Severity, dtype: int64

In [111]:
random_subset = df_slight.sample(n=3)
random_subset.head()

Unnamed: 0,1st_Road_Class,Carriageway_Hazards,Junction_Control,Day_of_Week,Junction_Detail,Light_Conditions,Road_Surface_Conditions,Road_Type,Special_Conditions_at_Site,Speed_limit,Time,Urban_or_Rural_Area,Weather_Conditions,Accident_Severity
1149489,1,0,2,0,2,1,2,4,0,30.0,2,1,2,Slight
1591132,1,0,1,0,1,0,1,1,0,40.0,2,2,1,Slight
1330828,1,0,1,0,6,0,1,1,0,40.0,2,2,4,Slight


In [112]:
df_fatal['Accident_Severity'].value_counts()

Fatal    24693
Name: Accident_Severity, dtype: int64

In [113]:
# Downsample 'Slight' to 45,000 rows, roughly the combined Fatal + Serious count (Fatal is merged into Serious below)
df_slight_sampling = df_slight.sample(n=45000)

In [114]:
# Downsample 'Serious' to 24,693 rows, the size of the rarest class (Fatal)
df_serious_sampling = df_serious.sample(n=24693)

In [115]:
df_final_sampling = pd.concat([df_serious_sampling,df_slight_sampling,df_fatal])

In [116]:
df_final_sampling.head()

Unnamed: 0,1st_Road_Class,Carriageway_Hazards,Junction_Control,Day_of_Week,Junction_Detail,Light_Conditions,Road_Surface_Conditions,Road_Type,Special_Conditions_at_Site,Speed_limit,Time,Urban_or_Rural_Area,Weather_Conditions,Accident_Severity
253486,1,0,1,1,1,1,1,2,0,70.0,2,2,1,Serious
518525,2,0,1,0,1,1,2,1,1,50.0,2,2,2,Serious
790409,1,0,1,0,1,0,2,1,0,60.0,2,2,3,Serious
113054,2,0,1,0,1,0,1,2,0,60.0,2,2,1,Serious
1332269,2,0,1,0,2,1,1,1,0,60.0,2,2,1,Serious


In [117]:
df_test = df_final_sampling[['Accident_Severity']].copy()  # explicit copy avoids SettingWithCopyWarning

In [118]:
# Merge 'Fatal' into 'Serious' so the target becomes binary (Serious vs. Slight)
df_test['Accident_Severity'] = df_test['Accident_Severity'].replace(['Fatal'], 'Serious')


In [119]:
df_train = df_final_sampling[['1st_Road_Class','Carriageway_Hazards','Junction_Control','Day_of_Week','Junction_Detail','Light_Conditions','Road_Surface_Conditions','Road_Type','Special_Conditions_at_Site','Speed_limit','Time','Urban_or_Rural_Area','Weather_Conditions']]

In [120]:
df_test['Accident_Severity'].value_counts()

Serious    49386
Slight     45000
Name: Accident_Severity, dtype: int64

In [121]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train, df_test, test_size=0.2)
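
The split above is unseeded and unstratified, so each run produces a different test set. A hedged sketch of a repeatable, stratified variant on invented toy data follows; the `random_state` value and the toy columns are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 'Serious' and 10 'Slight' rows
X = pd.DataFrame({"Speed_limit": [30, 40, 50, 60] * 5})
y = pd.Series(["Serious", "Slight"] * 10, name="Accident_Severity")

# stratify keeps the class ratio equal in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_te.value_counts())
```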

In [75]:
from sklearn.ensemble import RandomForestClassifier

# Baseline random forest; y_train is a one-column DataFrame, so flatten it to 1-D
rdf = RandomForestClassifier(n_estimators=300, random_state=35)
rdf.fit(X_train, y_train.values.ravel())

y_pred = rdf.predict(X_test)

# Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Per-class breakdown (metrics, confusion_matrix, and classification_report were imported in the first cell)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.6143129568810255
[[5718 4094]
 [3187 5879]]
              precision    recall  f1-score   support

     Serious       0.64      0.58      0.61      9812
      Slight       0.59      0.65      0.62      9066

   micro avg       0.61      0.61      0.61     18878
   macro avg       0.62      0.62      0.61     18878
weighted avg       0.62      0.61      0.61     18878
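
The figures in this report can be reproduced by hand from the confusion matrix printed above. The short sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order Serious, Slight.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are
# predictions, in the order [Serious, Slight]
cm = np.array([[5718, 4094],
               [3187, 5879]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["Serious", "Slight"], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```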



In [39]:
from sklearn.ensemble import RandomForestClassifier

# Regularized random forest. Only non-default settings are listed; arguments such as
# max_features='auto' and min_impurity_split have been removed from recent scikit-learn releases.
rdf = RandomForestClassifier(n_estimators=300,
            class_weight="balanced_subsample",
            max_depth=8,
            min_samples_leaf=4, min_samples_split=10,
            oob_score=True,
            random_state=35)



In [40]:
rdf.fit(X_train, y_train.values.ravel())  # flatten the one-column label frame to 1-D

y_pred = rdf.predict(X_test)


In [41]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6259667337641699


In [42]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred)) 

[[5376 4436]
 [2625 6441]]
             precision    recall  f1-score   support

    Serious       0.67      0.55      0.60      9812
     Slight       0.59      0.71      0.65      9066

avg / total       0.63      0.63      0.62     18878



In [47]:
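from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

# class_weight is not an XGBoost parameter, so it is dropped here.
# scale_pos_weight=7 up-weights the positive class heavily even though the sample is
# roughly balanced, which likely explains the skew toward 'Slight' in the results below.
model = XGBClassifier(learning_rate=0.07, n_estimators=300,
                      max_depth=8, min_child_weight=1,
                      scale_pos_weight=7,
                      random_state=27, subsample=0.8, colsample_bytree=0.8)

# Recent xgboost releases expect integer class labels, so encode the strings first
le = LabelEncoder()
model.fit(X_train, le.fit_transform(y_train.values.ravel()))

y_pred = le.inverse_transform(model.predict(X_test))

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))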



Accuracy: 0.48612141116643715




In [48]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred)) 

[[ 192 9620]
 [  81 8985]]
             precision    recall  f1-score   support

    Serious       0.70      0.02      0.04      9812
     Slight       0.48      0.99      0.65      9066

avg / total       0.60      0.49      0.33     18878
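
Nearly every test row is predicted as Slight here. The XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value for `scale_pos_weight`. A rough worked calculation with the class counts printed earlier in this notebook is sketched below, assuming 'Slight' is treated as the positive class (it sorts after 'Serious' when labels are encoded alphabetically).

```python
# Class counts of the balanced sample, as printed earlier in the notebook
n_serious = 49386   # assumed negative class
n_slight = 45000    # assumed positive class

# Typical starting value: number of negatives divided by number of positives
suggested_weight = n_serious / n_slight
print(f"suggested scale_pos_weight: {suggested_weight:.2f}")
```

A value close to 1 would leave the two classes about equally weighted, in contrast to the value of 7 used in the cell above.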



In [50]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model with 3 neighbors and distance-based weighting
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')

# fit the model on the training data (labels flattened to 1-D)
knn.fit(X_train, y_train.values.ravel())

y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred)) 

  


[[5713 4099]
 [3956 5110]]
             precision    recall  f1-score   support

    Serious       0.59      0.58      0.59      9812
     Slight       0.55      0.56      0.56      9066

avg / total       0.57      0.57      0.57     18878



In [51]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train.values.ravel())  # flatten the one-column label frame to 1-D
y_pred = logisticRegr.predict(X_test)
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred)) 



[[6226 3586]
 [3605 5461]]
             precision    recall  f1-score   support

    Serious       0.63      0.63      0.63      9812
     Slight       0.60      0.60      0.60      9066

avg / total       0.62      0.62      0.62     18878



In [54]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train.values.ravel()).predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))



0.6171734293887065
             precision    recall  f1-score   support

    Serious       0.63      0.62      0.63      9812
     Slight       0.60      0.61      0.61      9066

avg / total       0.62      0.62      0.62     18878



In [124]:
from sklearn.ensemble import GradientBoostingClassifier

# The original call spelled out the library defaults; arguments such as presort,
# min_impurity_split, and loss="deviance" are not accepted by recent scikit-learn releases.
gbc = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, max_depth=3)

y_pred = gbc.fit(X_train, y_train.values.ravel()).predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

     Serious       0.65      0.62      0.63      9831
      Slight       0.60      0.63      0.62      9047

   micro avg       0.62      0.62      0.62     18878
   macro avg       0.62      0.63      0.62     18878
weighted avg       0.63      0.62      0.63     18878

0.6249072995020659
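
To compare the models side by side, the accuracies can be collected in one place. The sketch below recomputes accuracy from the confusion matrices printed above and uses the directly reported value where no matrix was shown. The gradient boosting run has different class supports (9831/9047 rather than 9812/9066), so it was evaluated on a different random split and the ranking is indicative only.

```python
import numpy as np

total = 18878  # test-set size reported in every classification report

# Confusion matrices printed above (rows: true class, columns: predicted class)
from_matrices = {
    "Random forest (baseline)": [[5718, 4094], [3187, 5879]],
    "Random forest (tuned)": [[5376, 4436], [2625, 6441]],
    "XGBoost": [[192, 9620], [81, 8985]],
    "k-nearest neighbors": [[5713, 4099], [3956, 5110]],
    "Logistic regression": [[6226, 3586], [3605, 5461]],
}
accuracy = {name: np.trace(np.array(cm)) / total
            for name, cm in from_matrices.items()}

# Accuracies printed directly by the notebook (no confusion matrix shown)
accuracy["Gaussian naive Bayes"] = 0.6171734293887065
accuracy["Gradient boosting"] = 0.6249072995020659

# Sort from best to worst
ranking = sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:26s} {acc:.4f}")
```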
