# Early Models

This notebook explores a collection of simple classification models using the combined collisions/traffic flow dataset.

Import all the basics that might be used for data manipulation and plotting:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

##### 0. Importing and cleaning the data

Import the data saved after using the `geomatch` notebook in the `Dataframe_creation` folder:

In [3]:
## Read in the data with traffic counter and
## bike feature information matched
data_raw = pd.read_csv('../../Dataframe_creation/Dataframes/combined_collisions_v3.csv')

In [62]:
## Features to ignore when dropping NaN values
## because only the traffic counter data should
## be required to match accident data
bike_features = ['aadf_FEATURE_ID', 'aadf_SVDATE', 'aadf_CLT_CARR', 
                     'aadf_CLT_SEGREG', 'aadf_CLT_STEPP', 'aadf_CLT_PARSEG',
                     'aadf_CLT_SHARED', 'aadf_CLT_MANDAT', 'aadf_CLT_ADVIS',
                     'aadf_CLT_PRIORI', 'aadf_CLT_CONTRA', 'aadf_CLT_BIDIRE',
                     'aadf_CLT_CBYPAS', 'aadf_CLT_BBYPAS', 'aadf_CLT_PARKR',
                     'aadf_CLT_WATERR', 'aadf_CLT_PTIME', 'aadf_CLT_ACCESS',
                     'aadf_CLT_COLOUR', 'aadf_BOROUGH']

In [63]:
## Dropping NaN values and dummy columns
## and restricting the data to Greater London
data = data_raw.dropna(subset=[column for column in data_raw.columns if column not in bike_features]).copy()
data.drop(columns='Unnamed: 0',inplace=True)
data = data.loc[data.in_london == True]
## This should be redundant, but running
## to make sure anyway
data = data.loc[data.match == True]

## Some extra cleaning of additional columns:
## keep only rows whose Time entry is a string
def is_str(x):
    ## Boolean mask that is True where the entry is a str
    return np.array([isinstance(value, str) for value in x], dtype=bool)

data = data.loc[is_str(data.Time.values),:]
data.aadf_Count_point_id = data.aadf_Count_point_id.values.astype(int)

For now, we ignore the additional time columns and neighborhood work from the `BuildRoadDataframe_v3` notebook: that notebook sums rows and takes the mode, which leaves those columns virtually meaningless. For the same reason we cannot reuse that notebook in full, since summing rows loses the `Accident_Severity` column. We do, however, borrow some of its cleaning and manipulation steps here.

In [64]:
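## Inolane starts as all True and stays True only for rows where
## every bike feature equals False; NaN and string entries do not
## equal False, so they also clear the flag
Inolane = np.ones(data.shape[0]).astype(bool)
for feature in bike_features:
    Inolane &= (data[feature].values == False)

data['bikelane'] = ~Inolane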

data.sample(5)

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,aadf_CLT_BBYPAS,aadf_CLT_PARKR,aadf_CLT_WATERR,aadf_CLT_PTIME,aadf_CLT_ACCESS,aadf_CLT_COLOUR,aadf_BOROUGH,distance_to_cp,match,bikelane
183563,201501HT20532,538260.0,179740.0,-0.009521,51.499696,1,3,2,1,2015-06-18,...,False,True,False,False,,[['NONE']],[['Tower Hamlets']],0.545429,True,True
87670,201001YR90206,533120.0,190740.0,-0.079345,51.599726,1,3,2,1,2010-03-23,...,False,False,False,False,,"[['GREEN', 'GREEN', 'GREEN', 'GREEN']]","[['Haringey', 'Haringey', 'Haringey', 'Haringe...",0.085724,True,True
101851,201101CW11582,528200.0,181050.0,-0.153873,51.513784,1,2,2,1,2011-08-22,...,False,False,False,True,,"[['NONE', 'NONE']]","[['Westminster', 'Westminster']]",0.032099,True,True
185279,201501TW60269,518820.0,176900.0,-0.290397,51.478588,1,3,3,1,2015-08-08,...,False,False,False,True,,"[['NONE', 'NONE', 'NONE']]","[['Richmond upon Thames', 'Richmond upon Thame...",0.571906,True,True
160516,201401CP00335,531600.0,181340.0,-0.104795,51.515609,48,3,1,1,2014-08-14,...,False,False,False,False,,"[['NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NON...","[['City of London', 'City of London', 'City of...",0.041053,True,True


In [65]:
data.bikelane.value_counts()

True    37044
Name: bikelane, dtype: int64

In [66]:
data[bike_features].any()

aadf_FEATURE_ID     True
aadf_SVDATE         True
aadf_CLT_CARR       True
aadf_CLT_SEGREG     True
aadf_CLT_STEPP      True
aadf_CLT_PARSEG     True
aadf_CLT_SHARED     True
aadf_CLT_MANDAT     True
aadf_CLT_ADVIS      True
aadf_CLT_PRIORI     True
aadf_CLT_CONTRA     True
aadf_CLT_BIDIRE     True
aadf_CLT_CBYPAS     True
aadf_CLT_BBYPAS     True
aadf_CLT_PARKR      True
aadf_CLT_WATERR     True
aadf_CLT_PTIME      True
aadf_CLT_ACCESS    False
aadf_CLT_COLOUR     True
aadf_BOROUGH        True
dtype: bool
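
The two outputs above suggest why `bikelane` comes out `True` for every row: a comparison like `values == False` is never true for a missing entry or for a string such as `[['NONE']]`, so columns like `aadf_CLT_ACCESS` and `aadf_CLT_COLOUR` appear to clear the no-lane flag everywhere. A minimal sketch of one alternative on a toy frame follows. The toy values and the choice of `flag_columns` are assumptions for illustration, not the project's actual definition of a bike lane.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; all values are made up
toy = pd.DataFrame({
    "aadf_CLT_SEGREG": [True, False, False],
    "aadf_CLT_MANDAT": [False, False, True],
    "aadf_CLT_ACCESS": [np.nan, np.nan, np.nan],              # missing everywhere
    "aadf_CLT_COLOUR": ["[['NONE']]", "[[]]", "[['GREEN']]"],  # string, not a flag
})

# Original-style test: NaN and strings never equal False,
# so no row survives as "no lane"
no_lane_all = np.ones(len(toy), dtype=bool)
for col in toy.columns:
    no_lane_all &= (toy[col].values == False)

# Hypothetical choice: only these columns are treated as true/false flags
flag_columns = ["aadf_CLT_SEGREG", "aadf_CLT_MANDAT", "aadf_CLT_ACCESS"]

# A row has a bike lane if any flag is exactly True; NaN counts as "not True"
toy["bikelane"] = toy[flag_columns].eq(True).any(axis=1)
```

With this definition the second toy row, which has no `True` flag, is labelled as having no bike lane, whereas the original-style mask marks all three rows as having one.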

Scaling matters here because the features mix many binary variables with continuous ones on much larger scales; without scaling, the continuous variables naturally outweigh the binary ones.

In [67]:
## import StandardScaler
from sklearn.preprocessing import StandardScaler

In [68]:
## Make a scaler object
scaler = StandardScaler()

## fit the scaler and store the scaled values in new columns
scale_cols = ['aadf_Pedal_cycles', 'aadf_All_motor_vehicles', 'distance_to_cp']
data[[col + '_scaled' for col in scale_cols]] = scaler.fit_transform(data[scale_cols])
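
The cell above fits the scaler on the full dataset before any train/test split. A common alternative, sketched below on synthetic data, is to fit on the training rows only and reuse those statistics on the test rows, so that no information from the test set enters the preprocessing. The array `X_demo` and the split sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a large-valued continuous column (not real traffic counts)
X_demo = rng.normal(loc=20000, scale=8000, size=(200, 1))

X_demo_train, X_demo_test = train_test_split(X_demo, test_size=0.1, random_state=0)

demo_scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_demo_train_scaled = demo_scaler.fit_transform(X_demo_train)
# Apply the same training statistics to the held-out rows
X_demo_test_scaled = demo_scaler.transform(X_demo_test)
```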

In [69]:
## Identify candidate features for classification
class_features = ['Longitude', 'Latitude', 'Day_of_Week',
                    'Time', 'Road_Type', 'Speed_limit',
                    'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions',
                    'Special_Conditions_at_Site', 'Carriageway_Hazards', 'Urban_or_Rural_Area',
                    'aadf_Year', 'aadf_Pedal_cycles', 'aadf_All_motor_vehicles',
                    'aadf_CLT_CARR', 'aadf_CLT_SEGREG', 'aadf_CLT_STEPP', 
                    'aadf_CLT_PARSEG', 'aadf_CLT_SHARED', 'aadf_CLT_MANDAT',
                    'aadf_CLT_ADVIS', 'aadf_CLT_PRIORI', 'aadf_CLT_CONTRA', 
                    'aadf_CLT_BIDIRE', 'aadf_CLT_CBYPAS', 'aadf_CLT_BBYPAS', 
                    'aadf_CLT_PARKR', 'aadf_CLT_WATERR', 'aadf_CLT_PTIME', 
                    'aadf_CLT_ACCESS', 'distance_to_cp']

Finally, make a train/test split for the preliminary classification models to work on:

In [70]:
## import train_test_split
from sklearn.model_selection import train_test_split

In [71]:
data

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,aadf_CLT_PTIME,aadf_CLT_ACCESS,aadf_CLT_COLOUR,aadf_BOROUGH,distance_to_cp,match,bikelane,aadf_Pedal_cycles_scaled,aadf_All_motor_vehicles_scaled,distance_to_cp_scaled
2,200501BS70004,524870.0,181880.0,-0.201543,51.521988,1,3,2,1,2005-03-02,...,False,,[[]],[[]],0.154730,True,True,-0.325591,-0.513288,-0.188243
3,200501BS70009,525840.0,177020.0,-0.189301,51.478096,1,3,2,1,2005-03-02,...,False,,"[['NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NON...","[[None, None, 'Hammersmith & Fulham', 'Hammers...",0.122721,True,True,0.947518,0.673221,-0.335638
5,200501BS70020,527020.0,179020.0,-0.171599,51.495806,1,3,2,1,2005-10-02,...,True,,[['NONE']],[['Kensington & Chelsea']],0.085581,True,True,-0.432457,1.693743,-0.506661
6,200501BS70027,525190.0,180500.0,-0.197423,51.509515,1,3,2,1,2005-02-17,...,False,,[[]],[[]],0.353096,True,True,-0.772971,-1.270212,0.725209
7,200501BS70030,526360.0,177420.0,-0.181674,51.481575,1,3,2,1,2005-02-16,...,False,,"[['NONE', 'NONE', 'NONE', 'NONE', 'NONE']]","[['Kensington & Chelsea', 'Kensington & Chelse...",0.061657,True,True,-0.400597,-0.405460,-0.616829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248443,2018480820486,531384.0,181528.0,-0.107830,51.517404,48,3,1,1,2018-09-10,...,True,,"[['NONE', 'NONE', 'NONE', 'NONE', 'NONE']]","[['City of London', 'City of London', 'City of...",0.158715,True,True,-0.202793,-0.810446,-0.169890
248446,2018480820687,532724.0,181597.0,-0.088503,51.517711,48,3,2,1,2018-02-11,...,False,,"[['GREEN', 'NONE', 'NONE']]","[['City of London', 'Islington', 'City of Lond...",0.047618,True,True,0.607004,-0.649489,-0.681476
248449,2018480820747,531882.0,181369.0,-0.100717,51.515859,48,3,2,1,2018-10-31,...,False,,"[['NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NON...","[['City of London', 'City of London', 'City of...",0.021948,True,True,2.021495,0.223280,-0.799684
248452,2018480820779,531630.0,181180.0,-0.104417,51.514220,48,2,2,1,2018-12-14,...,False,,"[['NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NON...","[['City of London', 'City of London', 'City of...",0.087788,True,True,6.199920,1.389827,-0.496502


In [83]:
## hold out a stratified 10% test set
data_train, data_test = train_test_split(data,
                                             shuffle = True,
                                             random_state = 9483,
                                             test_size = .1,
                                             stratify = data.Accident_Severity)
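
As a quick illustration of what `stratify` does, the sketch below splits a synthetic label vector and compares the class shares in each part. The label counts are invented (only the rough share of class 3 mirrors this dataset), and `class_shares` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: mostly class 3, few of class 2, very few of class 1
y_demo = np.array([3] * 870 + [2] * 120 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          shuffle=True,
                                          random_state=9483,
                                          test_size=.1,
                                          stratify=y_demo)

def class_shares(labels):
    # Fraction of rows belonging to each class
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
```

Both parts keep the same class proportions as the full label vector, which matters when one class is as rare as class 1 is here.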

##### 1. Logistic regression

In [84]:
from sklearn.linear_model import LogisticRegression

In [85]:
features = ['Longitude', 'Latitude', 'Speed_limit', 'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions', 'Special_Conditions_at_Site', 'Urban_or_Rural_Area', 'aadf_All_motor_vehicles']

In [86]:
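## Unpenalized multinomial logistic regression
## (newer scikit-learn releases expect penalty=None rather than the string 'none')
log_reg = LogisticRegression(penalty='none', multi_class='multinomial', max_iter=100000)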

In [87]:
log_reg.fit(data_train[features], data_train.Accident_Severity)
data_train[features].sample(5)

Unnamed: 0,Longitude,Latitude,Speed_limit,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Urban_or_Rural_Area,aadf_All_motor_vehicles
87243,-0.166409,51.442039,30.0,4,1,1,0,1,34820.0
162490,-0.018479,51.581539,30.0,1,1,1,0,1,14637.0
1764,0.016201,51.455137,30.0,1,1,1,0,1,29497.0
121261,-0.080157,51.515388,30.0,1,1,1,0,1,5539.0
217516,-0.111282,51.456237,20.0,4,1,1,0,1,2804.0


In [88]:
log_reg.predict(data_test[features])

array([3, 3, 3, ..., 3, 3, 3])

In [89]:
np.unique(log_reg.predict(data_test[features]))

array([3])

In [90]:
data_test.Accident_Severity.values

array([3, 3, 3, ..., 3, 2, 2])

In [91]:
from sklearn.metrics import accuracy_score

In [92]:
accuracy_score(data_test.Accident_Severity, log_reg.predict(data_test[features]))

0.8682860998650472

One major hurdle for classification here is the unbalanced dataset: about 87% of the test accidents have `Accident_Severity` 3, the least severe category. Accuracy can therefore make even a trivial model look good. The logistic regression above predicts class 3 for every test row, so its accuracy of 0.868 is simply the share of class 3 in the test set.
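
To make this concrete, here is a small sketch comparing plain accuracy with balanced accuracy and a confusion matrix for a model that always predicts class 3. The label vector is invented to roughly mimic the imbalance; it is not the actual test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Invented labels: 87 of class 3, 12 of class 2, 1 of class 1
y_true = np.array([3] * 87 + [2] * 12 + [1] * 1)
# A "model" that always predicts class 3
y_pred = np.full_like(y_true, 3)

acc = accuracy_score(y_true, y_pred)               # share of class 3: 0.87
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean per-class recall: 1/3
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])  # rows: true, columns: predicted
```

Balanced accuracy averages the recall of each class, so a model that never predicts classes 1 or 2 scores only one third even though its plain accuracy is 0.87.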

##### 2. Random forest with GridSearchCV

In [93]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [94]:
max_depths = range(1, 11)
n_trees = [100, 500]

In [95]:
grid_cv = GridSearchCV(RandomForestClassifier(), # model to tune
                          param_grid = {'max_depth':max_depths, # tree depths to try
                                        'n_estimators':n_trees}, # forest sizes to try
                          scoring = "accuracy", # metric used to pick the best grid point
                          cv = 10) # 10-fold cross-validation

In [96]:
grid_cv.fit(data_train[features], data_train.Accident_Severity)

In [97]:
score_df = pd.DataFrame({'feature':features,
                            'importance_score': grid_cv.best_estimator_.feature_importances_})

score_df.sort_values('importance_score',ascending=False)

Unnamed: 0,feature,importance_score
2,Speed_limit,0.27
1,Latitude,0.26
6,Special_Conditions_at_Site,0.12
7,Urban_or_Rural_Area,0.12
0,Longitude,0.11
3,Light_Conditions,0.09
8,aadf_All_motor_vehicles,0.03
4,Weather_Conditions,0.0
5,Road_Surface_Conditions,0.0


In [98]:
grid_cv.best_params_

{'max_depth': 1, 'n_estimators': 100}

In [99]:
grid_cv.best_score_

0.8683223841280349

In [100]:
grid_cv.best_estimator_.predict(data_test[features])

array([3, 3, 3, ..., 3, 3, 3])

In [101]:
np.unique(grid_cv.best_estimator_.predict(data_test[features]))

array([3])
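
Because `scoring = "accuracy"` rewards predicting the majority class, the search appears to settle on the shallowest tree. One option, sketched here on synthetic blobs rather than the collision data, is to pass `scoring="balanced_accuracy"` so that each class's recall counts equally. The data, the grid values, and the names `X_toy`, `y_toy`, and `grid_bal` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Invented class sizes and cluster centers for three imbalanced classes
sizes = {3: 870, 2: 110, 1: 20}
centers = {3: (0.0, 0.0), 2: (3.0, 0.0), 1: (0.0, 3.0)}
X_toy = np.vstack([rng.normal(loc=centers[c], scale=1.0, size=(sizes[c], 2))
                   for c in sizes])
y_toy = np.concatenate([np.full(sizes[c], c) for c in sizes])

grid_bal = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=0),
                        param_grid={'max_depth': [1, 3, 5]},
                        scoring="balanced_accuracy",  # mean per-class recall
                        cv=3)
grid_bal.fit(X_toy, y_toy)

best_depth = grid_bal.best_params_['max_depth']
best_bal_acc = grid_bal.best_score_
```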

The previous two examples are very simple first attempts at classification on our dataset, and both end up predicting class 3 for every test row. Future directions would be to balance the dataset, or at least to account for the imbalance during training and evaluation, in order to get meaningful classification.
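
One possible way to account for the imbalance, sketched below on synthetic data, is the `class_weight="balanced"` option that scikit-learn classifiers such as `LogisticRegression` accept. It reweights each class inversely to its frequency. The blobs, class sizes, and variable names are assumptions for illustration, and results on the collision data may look quite different.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Invented class sizes and cluster centers for three imbalanced classes
sizes = {3: 1740, 2: 220, 1: 40}
centers = {3: (0.0, 0.0), 2: (3.0, 0.0), 1: (0.0, 3.0)}
X_syn = np.vstack([rng.normal(loc=centers[c], scale=1.0, size=(sizes[c], 2))
                   for c in sizes])
y_syn = np.concatenate([np.full(sizes[c], c) for c in sizes])

X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn)

# Weight each class inversely to its frequency in the training labels
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_model.fit(X_syn_train, y_syn_train)

pred_syn = weighted_model.predict(X_syn_test)
predicted_classes = np.unique(pred_syn)
weighted_bal_acc = balanced_accuracy_score(y_syn_test, pred_syn)
```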