
# Logistic Regression of a... (Phase Three Project)

## Business Problem/Question

Which factors in this dataset are most relevant to determining whether a private passenger vehicle crash in Chicago incurs property damage over $1,500, and can we make good predictions using only those factors?

## EDA

In [1]:
# Importing packages
import numpy as np
import pandas as pd 
import math

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.dummy import DummyClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score, plot_roc_curve
from sklearn.metrics import plot_confusion_matrix, confusion_matrix

### Read in and create new csv

In [124]:
# Read in data

df_vehicles = pd.read_csv("Data/Traffic_Crashes_-_Vehicles_20231109.csv", parse_dates=["CRASH_DATE"], low_memory=False)
df_people = pd.read_csv("Data/Traffic_Crashes_-_People_20231109.csv", parse_dates=["CRASH_DATE"], low_memory=False)
df_crashes = pd.read_csv("Data/Traffic_Crashes_-_Crashes_20231109.csv", parse_dates=["CRASH_DATE"], low_memory=False)

In [125]:
# Filter out any data from before 2021

df_crashes = df_crashes[df_crashes["CRASH_DATE"].dt.year >= 2021]
df_people = df_people[df_people["CRASH_DATE"].dt.year >= 2021]
df_vehicles = df_vehicles[df_vehicles["CRASH_DATE"].dt.year >= 2021]

Using the data dictionaries to understand column names, we are dropping columns from each set that are not relevant to the business problem.

In [126]:
# Drop columns from Vehicles dataframe

df_vehicles.drop(columns = ['TOWED_I', 'FIRE_I', 'TOWED_BY', 'TOWED_TO', 'CMV_ID', 
                        'USDOT_NO', 'CCMC_NO', 'ILCC_NO', 'COMMERCIAL_SRC', 'GVWR', 
                        'CARRIER_NAME', 'CARRIER_STATE', 'CARRIER_CITY',
                        'HAZMAT_PLACARDS_I', 'HAZMAT_NAME', 'UN_NO', 'HAZMAT_PRESENT_I', 
                        'HAZMAT_REPORT_I', 'HAZMAT_REPORT_NO', 'MCS_REPORT_I',
                        'MCS_REPORT_NO', 'HAZMAT_VIO_CAUSE_CRASH_I', 'MCS_VIO_CAUSE_CRASH_I', 
                        'IDOT_PERMIT_NO', 'WIDE_LOAD_I', 'TRAILER1_WIDTH', 'TRAILER2_WIDTH', 
                        'TRAILER1_LENGTH', 'TRAILER2_LENGTH', 'TOTAL_VEHICLE_LENGTH',
                        'AXLE_CNT', 'VEHICLE_CONFIG', 'CARGO_BODY_TYPE', 'LOAD_TYPE',
                        'HAZMAT_OUT_OF_SERVICE_I', 'MCS_OUT_OF_SERVICE_I', 'HAZMAT_CLASS'],
                         inplace=True)

In [127]:
# Drop columns from People dataframe

df_people.drop(columns = ['HOSPITAL', 'EMS_AGENCY', 'EMS_RUN_NO'], inplace=True)

In [128]:
# Drop columns from Crashes dataframe

df_crashes.drop(columns = ['REPORT_TYPE', 'DATE_POLICE_NOTIFIED', 'PHOTOS_TAKEN_I',
                       'STATEMENTS_TAKEN_I', 'DOORING_I', 'INJURIES_TOTAL', 
                       'INJURIES_FATAL', 'INJURIES_INCAPACITATING', 
                       'INJURIES_NON_INCAPACITATING', 'INJURIES_REPORTED_NOT_EVIDENT', 
                       'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN'], inplace=True)

Because we are most interested in vehicle damage, we use the Vehicles dataframe as the base table and merge the other two into it.

In [129]:
# Merge People dataframe with Vehicles dataframe

df = df_vehicles.merge(df_people, how="left", on=["CRASH_RECORD_ID", "CRASH_DATE", "RD_NO", "VEHICLE_ID"])

In [130]:
# Merge Crashes dataframe with merged dataframe

df = pd.merge(df, df_crashes, how = 'inner', on = ['CRASH_RECORD_ID', "CRASH_DATE", "RD_NO"])

In [131]:
# Exporting new dataframe to csv for use throughout rest of notebook

df.to_csv("Data/chicago_traffic_accidents_2021_to_11-09-2023.csv")

### Working with a single merged dataset

In [318]:
# Can load merged dataframe without needing to go through above steps each time
df = pd.read_csv("Data/chicago_traffic_accidents_2021_to_11-09-2023.zip", 
                 parse_dates=["CRASH_DATE"], low_memory=False)

We don't need all the identifying columns as they are not useful in making a model. We will drop all of those now.

In [319]:
df.drop(columns = ["Unnamed: 0", "CRASH_UNIT_ID", "CRASH_RECORD_ID", "RD_NO",
                    "UNIT_NO", "VEHICLE_ID", "PERSON_ID"], inplace=True)

We want each vehicle counted only once, so we need to remove rows that represent passengers, as these duplicate the vehicle's record. We should also remove any other rows that don't represent drivers. We can use the "PERSON_TYPE" column for both.
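
To see why the non-driver rows have to go, here is a toy sketch of the same kind of left merge. The two small frames and their values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature versions of the Vehicles and People tables
vehicles = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a"],
    "VEHICLE_ID": [1, 2],
    "MAKE": ["TOYOTA", "FORD"],
})
people = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "a"],
    "VEHICLE_ID": [1, 1, 2],
    "PERSON_TYPE": ["DRIVER", "PASSENGER", "DRIVER"],
})

# A left merge repeats each vehicle row once per person riding in it
merged = vehicles.merge(people, how="left", on=["CRASH_RECORD_ID", "VEHICLE_ID"])
print(len(vehicles), len(merged))

# Keeping only drivers brings the frame back to one row per vehicle
drivers = merged[merged["PERSON_TYPE"] == "DRIVER"].copy()
print(len(drivers))
```

In this toy case the passenger adds a third row for vehicle 1, and filtering on DRIVER restores one row per vehicle.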

In [320]:
# Check values in Person_type column

df['PERSON_TYPE'].value_counts(normalize=True)

DRIVER                 0.780694
PASSENGER              0.197756
PEDESTRIAN             0.012899
BICYCLE                0.007730
NON-MOTOR VEHICLE      0.000760
NON-CONTACT VEHICLE    0.000162
Name: PERSON_TYPE, dtype: float64

In [321]:
# Remove all types of person except DRIVER; .copy() keeps later in-place edits off a view

df = df[df['PERSON_TYPE'] == 'DRIVER'].copy()

In [322]:
# Sanity check

df['PERSON_TYPE'].value_counts(normalize=True)

DRIVER    1.0
Name: PERSON_TYPE, dtype: float64

### Missingness

Next we look at null values to try to determine which columns might need to be imputed or if the data is too incomplete to be useful. 

In [323]:
# First dropping columns that no longer have any data after removing all but DRIVER entries

df.dropna(axis=1, how="all", inplace=True)

In [324]:
# Looking at the total nulls left in remaining columns

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530319 entries, 0 to 766594
Data columns (total 82 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   CRASH_DATE               530319 non-null  datetime64[ns]
 1   UNIT_TYPE                530310 non-null  object        
 2   NUM_PASSENGERS           85079 non-null   float64       
 3   CMRC_VEH_I               10945 non-null   object        
 4   MAKE                     530098 non-null  object        
 5   MODEL                    530098 non-null  object        
 6   LIC_PLATE_STATE          469533 non-null  object        
 7   VEHICLE_YEAR             434488 non-null  float64       
 8   VEHICLE_DEFECT           530098 non-null  object        
 9   VEHICLE_TYPE             530098 non-null  object        
 10  VEHICLE_USE              530098 non-null  object        
 11  TRAVEL_DIRECTION         530098 non-null  object        
 12  MANEUVER        

There are several columns that seem useful from the data dictionaries, but look almost entirely full of nulls. We do a value_counts for those columns to see what's in them.
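
Before going column by column, it can help to rank every column by its share of nulls. The sketch below does this on a made-up three-column frame; the 50% cutoff is an arbitrary assumption, not a rule used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crash dataframe
toy = pd.DataFrame({
    "AGE": [28.0, np.nan, 41.0, 35.0],
    "LANE_CNT": [np.nan, np.nan, np.nan, 2.0],
    "DAMAGE": ["OVER $1,500", "$500 OR LESS", "OVER $1,500", "OVER $1,500"],
})

# Fraction of missing values per column, largest first
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Columns above the assumed 50% cutoff are candidates for a closer look
mostly_null = null_share[null_share > 0.5].index.tolist()
print(mostly_null)
```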

In [325]:
# NUM_PASSENGERS

df["NUM_PASSENGERS"].value_counts(dropna=False)

NaN     445240
1.0      60610
2.0      15374
3.0       6037
4.0       2034
5.0        560
6.0        234
7.0         84
8.0         28
10.0        23
9.0         19
11.0        18
12.0        10
17.0         6
14.0         5
19.0         4
13.0         4
16.0         4
15.0         3
18.0         2
22.0         2
21.0         2
27.0         2
43.0         2
26.0         2
20.0         1
33.0         1
34.0         1
28.0         1
46.0         1
42.0         1
30.0         1
32.0         1
31.0         1
24.0         1
Name: NUM_PASSENGERS, dtype: int64

There is no 0 value, so the NaNs probably represent 0 passengers. However, OCCUPANT_CNT carries the same information, so we won't need this column. 

In [326]:
# CMRC_VEH_I

df["CMRC_VEH_I"].value_counts(dropna=False)

NaN    519374
Y        6602
N        4343
Name: CMRC_VEH_I, dtype: int64

This is a binary flag, but the overwhelming majority of values are missing. We can drop the Y rows because commercial vehicles fall outside the business problem. We keep the remaining rows, since so small a labeled subset cannot tell us whether its Y/N ratio holds for the whole. 

In [327]:
# Dropping Commercial Vehicles (NaN rows are kept); .copy() avoids chained-assignment warnings later

df = df[df["CMRC_VEH_I"]!="Y"].copy()

In [328]:
# Sanity Check

df["CMRC_VEH_I"].value_counts(dropna=False)

NaN    519374
N        4343
Name: CMRC_VEH_I, dtype: int64

In [329]:
# EXCEED_SPEED_LIMIT_I

df["EXCEED_SPEED_LIMIT_I"].value_counts(dropna=False)

NaN    523708
N           5
Y           4
Name: EXCEED_SPEED_LIMIT_I, dtype: int64

This is a binary flag, but only nine rows have a value; the rest are missing. Nine rows cannot tell us whether the Y/N ratio is representative of the whole, so this does not seem to be a useful column.

In [330]:
# AGE

df["AGE"].value_counts(dropna=False)

NaN      152142
28.0      10738
27.0      10706
29.0      10618
26.0      10583
          ...  
101.0         5
102.0         4
103.0         3
98.0          3
110.0         2
Name: AGE, Length: 106, dtype: int64

Missing values make up a smaller share here (about 29% of rows), so we can impute them with the mean age. 

In [331]:
df["AGE"].mean()

39.92547399582857
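
One way to apply that mean without losing track of which ages were filled in is `SimpleImputer` with `add_indicator=True`. This is a minimal sketch on made-up ages, assuming the mean is learned from training rows only; it is not the imputation step this notebook actually runs.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical AGE values with gaps, already split into train and test
age_train = np.array([[28.0], [np.nan], [40.0], [52.0]])
age_test = np.array([[np.nan], [33.0]])

# add_indicator=True appends a 0/1 column marking which values were imputed
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_out = imputer.fit_transform(age_train)  # mean learned from training rows only
test_out = imputer.transform(age_test)        # same mean reused on the test rows

print(train_out)
print(test_out)
```

The extra indicator column lets a model treat "age unknown" as information in its own right.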

In [332]:
# BAC_RESULT (companion to BAC_RESULT VALUE)

df["BAC_RESULT"].value_counts(dropna=False)

TEST NOT OFFERED                   516361
TEST REFUSED                         5312
TEST PERFORMED, RESULTS UNKNOWN      1180
TEST TAKEN                            864
Name: BAC_RESULT, dtype: int64

There are no nulls. We can create a binary flag feature, BAC_TEST, that marks whether a BAC test was offered. 

In [333]:
# CELL_PHONE_USE

df["CELL_PHONE_USE"].value_counts(dropna=False)

NaN    523715
N           2
Name: CELL_PHONE_USE, dtype: int64

This is a binary flag, but only two rows have a value, both N; the rest are missing. With no variation to learn from, this does not seem to be a useful column.

In [334]:
# LANE_CNT

df["LANE_CNT"].value_counts(dropna=False)

NaN    523647
2.0        25
4.0        20
1.0         7
3.0         7
6.0         4
5.0         4
0.0         2
8.0         1
Name: LANE_CNT, dtype: int64

The missing values represent the overwhelming majority. We cannot make an assumption from such a small subset that the ratio is representative of the whole. This does not seem to be a useful column.

In [335]:
# INTERSECTION_RELATED_I

df["INTERSECTION_RELATED_I"].value_counts(dropna=False)

NaN    383355
Y      134078
N        6284
Name: INTERSECTION_RELATED_I, dtype: int64

This is a binary flag, but the missing values represent the majority. We may be able to impute values to the NaN because the Y/N values are a sizeable fraction of the whole, but we may want to leave this out of our initial model. 

In [336]:
# NOT_RIGHT_OF_WAY_I

df["NOT_RIGHT_OF_WAY_I"].value_counts(dropna=False)

NaN    505547
Y       16342
N        1828
Name: NOT_RIGHT_OF_WAY_I, dtype: int64

This is a binary flag, but the missing values represent the overwhelming majority. We cannot make an assumption from such a small subset that the Y/N ratio is representative of the whole. This does not seem to be a useful column.

In [337]:
# HIT_AND_RUN_I

df["HIT_AND_RUN_I"].value_counts(dropna=False)

NaN    358329
Y      158154
N        7234
Name: HIT_AND_RUN_I, dtype: int64

This is a binary flag, but the missing values represent the majority. We may be able to impute values to the NaN because the Y/N values are a sizeable fraction of the whole, but we may want to leave this out of our initial model. We could assume that N is the default.
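
If we do adopt N as the default, the encoding could look like the sketch below. The five values are made up, and treating a blank as "not marked hit-and-run" is an assumption that the data dictionary would need to confirm.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the HIT_AND_RUN_I column
flag = pd.Series(["Y", np.nan, "N", np.nan, "Y"], name="HIT_AND_RUN_I")

# Assumption: a blank means the crash was not marked as a hit-and-run
hit_and_run = flag.fillna("N").map({"Y": 1, "N": 0})
print(hit_and_run.tolist())
```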

In [338]:
# WORK_ZONE_I 

df["WORK_ZONE_I"].value_counts(dropna=False)

NaN    521390
Y        1737
N         590
Name: WORK_ZONE_I, dtype: int64

This is a binary flag, but the missing values represent the overwhelming majority. We cannot make an assumption from such a small subset that the Y/N ratio is representative of the whole. This does not seem to be a useful column.

In [339]:
# Dropping all columns determined not to be useful

df.drop(columns = ["CMRC_VEH_I", "EXCEED_SPEED_LIMIT_I", "CELL_PHONE_USE", 
                   "LANE_CNT", "NOT_RIGHT_OF_WAY_I", "WORK_ZONE_I"], inplace=True)

In [340]:
#Sanity Check

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 523717 entries, 0 to 766594
Data columns (total 76 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   CRASH_DATE               523717 non-null  datetime64[ns]
 1   UNIT_TYPE                523708 non-null  object        
 2   NUM_PASSENGERS           84450 non-null   float64       
 3   MAKE                     523496 non-null  object        
 4   MODEL                    523496 non-null  object        
 5   LIC_PLATE_STATE          463232 non-null  object        
 6   VEHICLE_YEAR             428682 non-null  float64       
 7   VEHICLE_DEFECT           523496 non-null  object        
 8   VEHICLE_TYPE             523496 non-null  object        
 9   VEHICLE_USE              523496 non-null  object        
 10  TRAVEL_DIRECTION         523496 non-null  object        
 11  MANEUVER                 523496 non-null  object        
 12  OCCUPANT_CNT    

In [355]:
df.describe()

Unnamed: 0,NUM_PASSENGERS,VEHICLE_YEAR,OCCUPANT_CNT,AGE,BAC_RESULT VALUE,POSTED_SPEED_LIMIT,STREET_NO,BEAT_OF_OCCURRENCE,NUM_UNITS,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,BAC_TEST,DAMAGE_OVER_1500
count,84450.0,428682.0,523496.0,371575.0,610.0,523717.0,523717.0,523717.0,523717.0,523717.0,523717.0,523717.0,519247.0,519247.0,523717.0,523717.0
mean,1.456945,2014.883128,1.234516,39.925474,0.175885,28.965458,3746.654508,1239.549677,2.091479,13.360557,4.137664,6.42281,41.852148,-87.67329,0.014046,0.687432
std,0.99552,118.634721,0.66726,15.843725,0.104812,5.312512,2852.738278,701.016896,0.48505,5.512825,1.982557,3.24658,0.368367,0.752426,0.11768,0.46354
min,1.0,1900.0,0.0,0.0,0.0,0.0,0.0,111.0,1.0,0.0,1.0,1.0,0.0,-87.936193,0.0,0.0
25%,1.0,2009.0,1.0,27.0,0.13,30.0,1329.0,715.0,2.0,10.0,2.0,4.0,41.779949,-87.722859,0.0,0.0
50%,1.0,2014.0,1.0,37.0,0.18,30.0,3299.0,1134.0,2.0,14.0,4.0,6.0,41.871981,-87.675525,0.0,1.0
75%,2.0,2018.0,1.0,51.0,0.22,30.0,5611.0,1814.0,2.0,17.0,6.0,9.0,41.924092,-87.633793,0.0,1.0
max,46.0,9999.0,47.0,110.0,1.0,70.0,13799.0,6100.0,18.0,23.0,7.0,12.0,42.02278,0.0,1.0,1.0


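VEHICLE_YEAR needs cleaning: its maximum is 9999 and its standard deviation is about 119 years, so the column contains placeholder or mistyped values.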
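
A possible cleanup is to turn implausible years into NaN so they are handled with the other missing values. The sketch below uses made-up rows, and the plausibility window (1950 through one model year past the crash year) is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 9999 and 1900 mirror the extremes in the summary above
toy = pd.DataFrame({
    "CRASH_DATE": pd.to_datetime(["2022-03-01", "2021-07-15", "2023-01-20", "2022-11-02"]),
    "VEHICLE_YEAR": [2015.0, 9999.0, 1900.0, np.nan],
})

# Assumed plausibility window: 1950 up to one model year past the crash year
crash_year = toy["CRASH_DATE"].dt.year
plausible = (toy["VEHICLE_YEAR"] >= 1950) & (toy["VEHICLE_YEAR"] <= crash_year + 1)

# Implausible years become NaN so they can be imputed with the other gaps
toy["VEHICLE_YEAR"] = toy["VEHICLE_YEAR"].where(plausible)
print(toy["VEHICLE_YEAR"].tolist())
```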

## Feature Engineering

In [341]:
# Create a new column flagging whether a BAC test was offered (1) or not (0)

bac_dict = {'TEST NOT OFFERED':0, 'TEST REFUSED':1, 'TEST PERFORMED, RESULTS UNKNOWN':1, 'TEST TAKEN':1}
df['BAC_TEST'] = df['BAC_RESULT'].map(bac_dict)

In [342]:
# Sanity check

df["BAC_TEST"].value_counts()

0    516361
1      7356
Name: BAC_TEST, dtype: int64

In [343]:
# Check the values in Damage column

df['DAMAGE'].value_counts()

OVER $1,500      360020
$501 - $1,500    120416
$500 OR LESS      43281
Name: DAMAGE, dtype: int64

In [344]:
# Create a new column to identify damage as > $1,500 (1) or <= $1,500 (0)

damage_dict = {'OVER $1,500':1, '$501 - $1,500':0, '$500 OR LESS':0}
df['DAMAGE_OVER_1500'] = df['DAMAGE'].map(damage_dict)

In [345]:
# Sanity check

df['DAMAGE_OVER_1500'].value_counts()

1    360020
0    163697
Name: DAMAGE_OVER_1500, dtype: int64

With a suitable target column now in place, we can look at how other variables connect to it.

In [346]:
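# Restrict to numeric columns explicitly; newer pandas no longer drops text columns silently
df.select_dtypes(include="number").corr()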

Unnamed: 0,NUM_PASSENGERS,VEHICLE_YEAR,OCCUPANT_CNT,AGE,BAC_RESULT VALUE,POSTED_SPEED_LIMIT,STREET_NO,BEAT_OF_OCCURRENCE,NUM_UNITS,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,BAC_TEST,DAMAGE_OVER_1500
NUM_PASSENGERS,1.0,-0.001655,0.995752,-0.016513,-0.028629,0.017434,-0.000359,-0.016809,0.002686,0.01309,-0.006831,0.012681,-0.006715,0.002088,-0.005771,0.020105
VEHICLE_YEAR,-0.001655,1.0,-0.00285,0.000106,-0.036227,-9.5e-05,-0.003429,0.000494,1.5e-05,-6e-06,-0.000827,0.00202,-0.000507,0.000479,-0.002264,-0.004112
OCCUPANT_CNT,0.995752,-0.00285,1.0,-0.061632,0.015512,0.033677,-0.008932,-0.023356,0.024629,0.026685,-0.003875,0.00462,-0.0042,-0.00023,0.006504,0.036412
AGE,-0.016513,0.000106,-0.061632,1.0,0.020686,-0.040398,0.037893,0.006513,-0.019236,-0.013054,0.003114,0.005801,-0.001881,-0.000832,-0.009008,-0.03927
BAC_RESULT VALUE,-0.028629,-0.036227,0.015512,0.020686,1.0,0.003852,0.065534,0.039524,0.059508,0.012405,-0.025293,-0.069595,0.0242,-0.02769,,0.003224
POSTED_SPEED_LIMIT,0.017434,-9.5e-05,0.033677,-0.040398,0.003852,1.0,-0.011535,-0.056266,0.068015,0.010362,0.009942,0.011418,-0.006481,0.007401,0.003891,0.070672
STREET_NO,-0.000359,-0.003429,-0.008932,0.037893,0.065534,-0.011535,1.0,-0.013593,0.006633,-0.007356,-0.003633,-0.008994,-0.065354,-0.011788,-0.00097,0.018644
BEAT_OF_OCCURRENCE,-0.016809,0.000494,-0.023356,0.006513,0.039524,-0.056266,-0.013593,1.0,0.010561,0.007599,0.004718,0.00184,0.140275,-0.03926,0.00054,-0.044948
NUM_UNITS,0.002686,1.5e-05,0.024629,-0.019236,0.059508,0.068015,0.006633,0.010561,1.0,0.009488,0.005662,0.004383,-0.000341,-0.001079,0.014922,0.098145
CRASH_HOUR,0.01309,-6e-06,0.026685,-0.013054,0.012405,0.010362,-0.007356,0.007599,0.009488,1.0,0.058574,-0.004925,0.001265,0.002127,-0.012039,-0.026877


None of the numeric columns is strongly correlated with DAMAGE_OVER_1500; the largest coefficient in the rows shown is about 0.10, for NUM_UNITS. Next we look at row counts for individual columns, starting with the non-numeric MAKE.

In [358]:
df.groupby("MAKE").DAMAGE_OVER_1500.count().sort_values(ascending=False)

MAKE
UNKNOWN                               61453
CHEVROLET                             58486
TOYOTA                                56867
FORD                                  51225
NISSAN                                40652
                                      ...  
KENT MANUFACTURING COMPANY INC.           1
KNOWLES MANUFACTURING COMPANY             1
KOMATSU AMERICAN CORPORATION              1
KROMAG (SUBSIDIARY OF PUCH)               1
INTERCONSULT MANUFACTURING COMPANY        1
Name: DAMAGE_OVER_1500, Length: 628, dtype: int64

In [348]:
df.groupby("AGE").DAMAGE_OVER_1500.count().sort_values(ascending=False)

AGE
28.0     10738
27.0     10706
29.0     10618
26.0     10583
25.0     10465
         ...  
101.0        5
102.0        4
98.0         3
103.0        3
110.0        2
Name: DAMAGE_OVER_1500, Length: 105, dtype: int64

In [349]:
df.groupby("POSTED_SPEED_LIMIT").DAMAGE_OVER_1500.count().sort_values(ascending=False)

POSTED_SPEED_LIMIT
30    395685
35     35929
25     31842
20     20261
15     15025
10     11118
40      6127
45      4472
5       1273
0        834
55       592
50       212
3        122
39        46
60        38
24        30
34        14
26        11
65        11
2         11
32         9
11         8
1          7
9          7
33         6
7          6
8          4
29         3
22         3
38         2
23         2
44         2
70         2
12         1
62         1
14         1
Name: DAMAGE_OVER_1500, dtype: int64

In [350]:
df.groupby("STREET_NO").DAMAGE_OVER_1500.count().sort_values(ascending=False)

STREET_NO
1600     4010
800      3382
100      3062
7900     3057
2400     2984
         ... 
5668        1
10442       1
577         1
10441       1
12117       1
Name: DAMAGE_OVER_1500, Length: 10612, dtype: int64

In [351]:
df.groupby("BEAT_OF_OCCURRENCE").DAMAGE_OVER_1500.count().sort_values(ascending=False)

BEAT_OF_OCCURRENCE
813.0     5777
1834.0    5636
114.0     5582
815.0     5228
833.0     4831
          ... 
1653.0     395
1655.0     257
1652.0     182
1650.0      62
6100.0       3
Name: DAMAGE_OVER_1500, Length: 276, dtype: int64

In [352]:
df.groupby("NUM_UNITS").DAMAGE_OVER_1500.count().sort_values(ascending=False)

NUM_UNITS
2     456712
3      39450
1      17029
4       7775
5       1818
6        543
7        211
8        120
9         28
12        14
18         7
10         6
11         2
14         1
13         1
Name: DAMAGE_OVER_1500, dtype: int64

In [353]:
df.groupby("CRASH_HOUR").DAMAGE_OVER_1500.count().sort_values(ascending=False)

CRASH_HOUR
15    43311
16    42333
17    40080
14    35919
18    32666
13    32084
12    30479
8     27017
11    25979
19    23951
10    22914
9     22580
7     21094
20    19369
21    17464
22    15692
23    13954
0     11159
6      9981
1      9617
2      8040
5      6386
3      6283
4      5365
Name: DAMAGE_OVER_1500, dtype: int64

In [354]:
df.groupby("BAC_TEST").DAMAGE_OVER_1500.count().sort_values(ascending=False)

BAC_TEST
0    516361
1      7356
Name: DAMAGE_OVER_1500, dtype: int64
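
The counts above show how many rows fall in each group, but not how the target behaves within it. A small extension is to aggregate the mean of DAMAGE_OVER_1500 alongside the count, which gives the share of crashes over $1,500 per group. The sketch uses made-up rows; `n` and `share_over_1500` are illustrative names.

```python
import pandas as pd

# Hypothetical rows with the same column names as the notebook
toy = pd.DataFrame({
    "NUM_UNITS": [1, 2, 2, 2, 3, 3],
    "DAMAGE_OVER_1500": [0, 1, 0, 1, 1, 1],
})

# Count gives group size; mean of a 0/1 target gives the positive share
summary = (
    toy.groupby("NUM_UNITS")["DAMAGE_OVER_1500"]
    .agg(n="count", share_over_1500="mean")
    .sort_values("n", ascending=False)
)
print(summary)
```

Groups whose share differs clearly from the overall rate, and that have enough rows to trust, are the ones worth carrying into a model.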

## Dummy Model

### Decide Xs/y

The target is DAMAGE_OVER_1500. Where columns carry overlapping information, we keep only one: NUM_PASSENGERS is dropped in favor of OCCUPANT_CNT, and DAMAGE is dropped because the target is derived from it. 

In [317]:
X = df.drop(["NUM_PASSENGERS", "DAMAGE", "DAMAGE_OVER_1500"], axis=1)
y = df["DAMAGE_OVER_1500"]

In [170]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2024)

In [171]:
dummy_model = DummyClassifier(strategy="most_frequent")

In [172]:
dummy_model.fit(X_train, y_train)

DummyClassifier(strategy='most_frequent')

In [173]:
cv_results_dummy = cross_val_score(dummy_model, X_train, y_train, cv=5)
cv_results_dummy

array([0.68784847, 0.68784847, 0.68785722, 0.6878445 , 0.6878445 ])

# Model 2

## Baseline Numeric Model

For a baseline model, we will use only the columns that are already numeric as Xs to predict DAMAGE_OVER_1500 as y. We will omit all columns that are simply identifiers or keys. We also omit NUM_PASSENGERS as it is directly related to OCCUPANT_CNT.

In [174]:
# Using .describe to see the variables that are already numeric

df.describe()

Unnamed: 0.1,Unnamed: 0,NUM_PASSENGERS,VEHICLE_YEAR,OCCUPANT_CNT,AGE,BAC_RESULT VALUE,POSTED_SPEED_LIMIT,STREET_NO,BEAT_OF_OCCURRENCE,NUM_UNITS,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,DAMAGE_OVER_$1500,BAC_TEST
count,523717.0,84450.0,428682.0,523496.0,371575.0,610.0,523717.0,523717.0,523717.0,523717.0,523717.0,523717.0,523717.0,519247.0,519247.0,523717.0,523717.0
mean,389995.01249,1.456945,2014.883128,1.234516,39.925474,0.175885,28.965458,3746.654508,1239.549677,2.091479,13.360557,4.137664,6.42281,41.852148,-87.67329,0.687432,0.014046
std,222581.925857,0.99552,118.634721,0.66726,15.843725,0.104812,5.312512,2852.738278,701.016896,0.48505,5.512825,1.982557,3.24658,0.368367,0.752426,0.46354,0.11768
min,0.0,1.0,1900.0,0.0,0.0,0.0,0.0,0.0,111.0,1.0,0.0,1.0,1.0,0.0,-87.936193,0.0,0.0
25%,195000.0,1.0,2009.0,1.0,27.0,0.13,30.0,1329.0,715.0,2.0,10.0,2.0,4.0,41.779949,-87.722859,0.0,0.0
50%,396756.0,1.0,2014.0,1.0,37.0,0.18,30.0,3299.0,1134.0,2.0,14.0,4.0,6.0,41.871981,-87.675525,1.0,0.0
75%,582781.0,2.0,2018.0,1.0,51.0,0.22,30.0,5611.0,1814.0,2.0,17.0,6.0,9.0,41.924092,-87.633793,1.0,0.0
max,766594.0,46.0,9999.0,47.0,110.0,1.0,70.0,13799.0,6100.0,18.0,23.0,7.0,12.0,42.02278,0.0,1.0,1.0


VEHICLE_YEAR again shows a maximum of 9999, so it still needs cleaning before it can be trusted as a feature.

In [176]:
# Making a baseline dataframe 

df_bl = df[["VEHICLE_YEAR", "OCCUPANT_CNT", "AGE", "BAC_RESULT VALUE", "POSTED_SPEED_LIMIT", "STREET_NO", "BEAT_OF_OCCURRENCE", "NUM_UNITS", "CRASH_HOUR", "CRASH_DAY_OF_WEEK", "CRASH_MONTH", "LATITUDE", "LONGITUDE", "BAC_TEST", "DAMAGE_OVER_1500"]]

In [177]:
# Assigning Xs & y (the target column is named DAMAGE_OVER_1500, as created above)

X_bl = df_bl.drop("DAMAGE_OVER_1500", axis=1)
y_bl = df_bl["DAMAGE_OVER_1500"]

### Train/Test Split

In [178]:
X_train_bl, X_test_bl, y_train_bl, y_test_bl = train_test_split(X_bl, y_bl, random_state=2024)

### Preprocessing Steps (SS, OHE, SI)

In [179]:
# Mean-impute missing numeric values (SimpleImputer's default strategy)
numeric_imputer = SimpleImputer()
X_train_blimp = numeric_imputer.fit_transform(X_train_bl)
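
Fitting the imputer on the full training set before cross-validation lets each validation fold influence the imputed means. Wrapping the steps in a pipeline avoids that, because every step is refit inside each fold. This is a sketch on synthetic data rather than the notebook's own pipeline; it adds a StandardScaler and uses LogisticRegression's default regularization to stay independent of the scikit-learn version.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features with some gaps, plus a binary target
rng = np.random.default_rng(2024)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

# Imputer and scaler are refit inside every CV fold, so validation rows
# never influence the imputed means or the scaling
pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.mean())
```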

### Modeling (look at Coefficients, P-values)

In [180]:
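# penalty="none" matches the scikit-learn version used here; from 1.2 onward the equivalent is penalty=None
bl_logreg = LogisticRegression(random_state=2024, penalty="none", max_iter=1000)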

In [181]:
bl_logreg.fit(X_train_blimp, y_train_bl)

LogisticRegression(max_iter=1000, penalty='none', random_state=2024)

In [182]:
confusion_matrix(y_train_bl, bl_logreg.predict(X_train_blimp))

array([[   297, 122312],
       [   396, 269782]], dtype=int64)

### Evaluation

In [183]:
cv_results = cross_val_score(bl_logreg, X_train_blimp, y_train_bl, cv=5)
cv_results

array([0.68652461, 0.68769572, 0.68640605, 0.68739896, 0.68660972])

In [187]:
print("Dummy Model CV:          ", cv_results_dummy)
print("Initial Numeric Model CV:", cv_results)

Dummy Model CV:           [0.68784847 0.68784847 0.68785722 0.6878445  0.6878445 ]
Initial Numeric Model CV: [0.68652461 0.68769572 0.68640605 0.68739896 0.68660972]


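So a model using only the already-numeric columns does no better than always picking the most frequent class. Its cross-validated accuracy (about 0.687) sits marginally below the dummy model's (about 0.688), and the training confusion matrix shows it predicts damage over $1,500 for all but 693 of the 392,787 training rows.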
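
Accuracy alone hides how one-sided these predictions are. The sketch below recomputes per-class recall and balanced accuracy directly from the training confusion matrix reported above; the variable names are illustrative.

```python
import numpy as np

# Training confusion matrix reported above (rows = actual, columns = predicted)
cm = np.array([[297, 122312],
               [396, 269782]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
recall_over = tp / (tp + fn)    # share of OVER $1,500 crashes identified
recall_under = tn / (tn + fp)   # share of $1,500-or-less crashes identified
balanced_accuracy = (recall_over + recall_under) / 2

print(f"accuracy          {accuracy:.4f}")
print(f"recall (over)     {recall_over:.4f}")
print(f"recall (under)    {recall_under:.4f}")
print(f"balanced accuracy {balanced_accuracy:.4f}")
```

A balanced accuracy near 0.5 confirms that the baseline is effectively a majority-class predictor, so later models are better compared on per-class recall or ROC AUC than on raw accuracy.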

# Model 3

## Evaluation OF/UF report Test