## Selecting features using Pearson's chi squared

This notebook is the only one I did in Python. It will select the most correlated variables to fatal crashes out of the more than 100 categorical variables in my dataset. I added helmet data to the set, too.

I also ran this test using only crashes since 2004 to see if that affected the helmet data (helmetless v. helmeted crashes weren't well documented before 2004) but using only post-2004 data did not change the variables selected.

In [11]:
#load libraries to use in the notebook
import os, sys
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import RFE
from sklearn.base import clone

In [12]:
# Dataset location
DATASET = 'Datasets/cycle_flag.csv'
assert os.path.exists(DATASET)

# # Load and shuffle
dataset = pd.read_csv(DATASET, sep=',').sample(frac = 1).reset_index(drop=True)
dataset.drop(['Unnamed: 0', 'CRN', 'FATAL_OR_MAJ_INJ','CRASH_YEAR','COUNTY','MUNICIPALITY','COUNTY_YEAR','MOTORCYCLE_COUNT',
              'FATAL_COUNT','MCYCLE_DEATH_COUNT','DEC_LAT','DEC_LONG','PSP_REPORTED','MC_DVR_HLMT_TYPE','MC_PAS_HLMT_TYPE','MC_PAS_HLMTON_IND'], axis=1, inplace=True)




In [4]:
#eplore variable types. The chi squared test only works on numeric variables
g = dataset.columns.to_series().groupby(dataset.dtypes).groups
g

{dtype('int64'): Index(['INTERSTATE', 'STATE_ROAD', 'LOCAL_ROAD', 'LOCAL_ROAD_ONLY', 'TURNPIKE',
        'WET_ROAD', 'SNOW_SLUSH_ROAD', 'ICY_ROAD', 'SUDDEN_DEER',
        'SHLDR_RELATED', 'REAR_END', 'HO_OPPDIR_SDSWP', 'HIT_FIXED_OBJECT',
        'SV_RUN_OFF_RD', 'WORK_ZONE', 'PROPERTY_DAMAGE_ONLY', 'INJURY_OR_FATAL',
        'INJURY', 'FATAL', 'NON_INTERSECTION', 'INTERSECTION', 'SIGNALIZED_INT',
        'STOP_CONTROLLED_INT', 'UNSIGNALIZED_INT', 'SCHOOL_BUS', 'SCHOOL_ZONE',
        'HIT_DEER', 'HIT_TREE_SHRUB', 'HIT_EMBANKMENT', 'HIT_POLE',
        'HIT_GDRAIL', 'HIT_GDRAIL_END', 'HIT_BARRIER', 'HIT_BRIDGE',
        'OVERTURNED', 'MOTORCYCLE', 'BICYCLE', 'HVY_TRUCK_RELATED',
        'VEHICLE_FAILURE', 'TRAIN_TROLLEY', 'PHANTOM_VEHICLE',
        'ALCOHOL_RELATED', 'DRINKING_DRIVER', 'UNDERAGE_DRNK_DRV', 'UNLICENSED',
        'DISTRACTED', 'CELL_PHONE', 'NO_CLEARANCE', 'RUNNING_RED_LT',
        'TAILGATING', 'CROSS_MEDIAN', 'CURVED_ROAD', 'CURVE_DVR_ERROR',
        'LIMIT_65MPH', 'SPEE

Below I'm one-hot encoding the helmet variable to make it into seperate binary columns. that allows me to work with it like the other binary variables in the dataset.

In [13]:
dataset = pd.get_dummies(dataset, columns=["MC_DVR_HLMTON_IND"])

In [14]:
#now that the helmet variable has been broken into new columns, remove the old variable and some other unnecessary columns

dataset.drop(['MC_PASSNGR_IND', 'MC_DVR_HLMTDOT_IND', 'MC_PAS_HLMTDOT_IND','MINOR_INJURY','MODERATE_INJURY','MAJOR_INJURY'], axis=1, inplace=True)

In [None]:
#look over the data to check that the one hot columns look ok
dataset.describe()

In [15]:
# # View some metadata of the dataset and see if that makes sense
print('dataset.shape', dataset.shape)

#split the dataset into x and y with x being all the data except fatalities and y being my target variable 'FATAL'
X = np.array(dataset.loc[:, dataset.columns != 'FATAL'])
y = np.array(dataset.FATAL)

#print the size and shape of selected data
print('X', X.shape, 'y', y.shape)
print('Label distribution:', {i: np.sum(y==i) for i in np.unique(dataset.FATAL)})

dataset.shape (101773, 89)
X (101773, 88) y (101773,)
Label distribution: {0: 97853, 1: 3920}


In [16]:
#run the pearson's chi squared test. the selected indicies at the bottom are the variables the test has chosen

selector = SelectKBest(chi2, k=5)
selector.fit(X, y)
print('χ² statistic', selector.scores_)
print('Selected indices', selector.get_support(True))

χ² statistic [  2.10288341e+01   1.89747964e+02   2.93082545e+02   3.42505896e+02
   1.29112805e+00   2.15173981e+01   9.83788182e-02   4.48456713e-01
   5.87527139e+00   3.84681829e+00   2.45889653e+01   2.39226507e+02
   6.63186421e+02   1.15401672e+02   9.17742942e-01   1.53870806e+02
   1.81060348e+01   1.78925925e+03   1.89236188e+02   2.25679350e+02
   8.29324793e+01   7.67908054e+01   1.43772327e+02   3.41118885e+00
   6.21018716e+00   2.45432657e+00   8.21483316e+02   2.51875022e+02
   1.07658764e+03   6.75672119e+02   1.46663556e+02   5.11170011e+01
   7.59367732e+01   6.66651954e-01   3.30088203e+02   6.87566959e+02
   2.91888189e+02   2.08619669e+01   3.23136110e+00   8.96633527e+00
   3.41187907e+03   3.65002788e+03   5.80902605e+01   1.73150261e+00
   8.32844890e+00   3.02308137e-02   3.88942387e+00   9.17728327e+00
   8.17365406e-01   2.12978905e+03   7.64374037e+02   4.50554029e+02
   1.65603795e+01   2.46287979e+03   1.82193136e+03   3.73101129e+02
   7.56817717e+00   5

In [17]:
#Get the variable names of the selected indices

X_selected = selector.transform(X)

[dataset.columns[i] for i in selector.get_support(True)]


['PHANTOM_VEHICLE',
 'ALCOHOL_RELATED',
 'LIMIT_65MPH',
 'HAZARDOUS_TRUCK',
 'DRUGGED_DRIVER']

So now I have the five most correlated variables that I can explore further. In the matrix and regression notebook, I will run a regression to test the selected features.