
# Logistic Regression of a... (Phase Three Project)

## Business Problem/Question

Can we use factors from this dataset that are trackable by an insurance company to determine whether a private passenger vehicle crash in Chicago incurs property damage over $1,500, and can we make good predictions using these factors?

## EDA

In [4]:
# Importing packages
import numpy as np
import pandas as pd 
import math

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.dummy import DummyClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
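# plot_roc_curve and plot_confusion_matrix exist only in scikit-learn versions before 1.2;
# newer versions provide RocCurveDisplay and ConfusionMatrixDisplay instead
from sklearn.metrics import roc_curve, roc_auc_score, plot_roc_curve
from sklearn.metrics import plot_confusion_matrix, confusion_matrix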

### Read in and create new csv

In [2]:
# Read in data
# Forward slashes work on every OS and avoid backslash escape problems in the path strings

df_vehicles = pd.read_csv("Data/Traffic_Crashes_-_Vehicles_20231109.csv", parse_dates=["CRASH_DATE"], low_memory=False)
df_people = pd.read_csv("Data/Traffic_Crashes_-_People_20231109.csv", parse_dates=["CRASH_DATE"], low_memory=False)
df_crashes = pd.read_csv("Data/Traffic_Crashes_-_Crashes_20231109.csv", parse_dates=["CRASH_DATE"], low_memory=False)

In [3]:
# Filter out any data from before 2021

df_crashes = df_crashes[df_crashes["CRASH_DATE"].dt.year >= 2021]
df_people = df_people[df_people["CRASH_DATE"].dt.year >= 2021]
df_vehicles = df_vehicles[df_vehicles["CRASH_DATE"].dt.year >= 2021]

Using the data dictionaries to understand column names, we are dropping columns from each set that are not relevant to the business problem.

In [4]:
# Drop columns from Vehicles dataframe

df_vehicles.drop(columns = ['UNIT_NO', 'LIC_PLATE_STATE', 'TRAVEL_DIRECTION', 'MANEUVER',
                            'TOWED_I', 'FIRE_I', 'TOWED_BY', 'TOWED_TO', 'AREA_00_I', 
                            'AREA_01_I', 'AREA_02_I', 'AREA_03_I', 'AREA_04_I', 'AREA_05_I', 
                            'AREA_06_I', 'AREA_07_I', 'AREA_08_I', 'AREA_09_I', 'AREA_10_I', 
                            'AREA_11_I', 'AREA_12_I', 'AREA_99_I', 'FIRST_CONTACT_POINT', 'CMV_ID', 
                            'USDOT_NO', 'CCMC_NO', 'ILCC_NO', 'COMMERCIAL_SRC', 'GVWR', 
                            'CARRIER_NAME', 'CARRIER_STATE', 'CARRIER_CITY',
                            'HAZMAT_PLACARDS_I', 'HAZMAT_NAME', 'UN_NO', 'HAZMAT_PRESENT_I', 
                            'HAZMAT_REPORT_I', 'HAZMAT_REPORT_NO', 'MCS_REPORT_I',
                            'MCS_REPORT_NO', 'HAZMAT_VIO_CAUSE_CRASH_I', 'MCS_VIO_CAUSE_CRASH_I', 
                            'IDOT_PERMIT_NO', 'WIDE_LOAD_I', 'TRAILER1_WIDTH', 'TRAILER2_WIDTH', 
                            'TRAILER1_LENGTH', 'TRAILER2_LENGTH', 'TOTAL_VEHICLE_LENGTH',
                            'AXLE_CNT', 'VEHICLE_CONFIG', 'CARGO_BODY_TYPE', 'LOAD_TYPE',
                            'HAZMAT_OUT_OF_SERVICE_I', 'MCS_OUT_OF_SERVICE_I', 'HAZMAT_CLASS'],
                            inplace=True)

In [5]:
# Drop columns from People dataframe

df_people.drop(columns = ['PERSON_ID', 'SEAT_NO', 'CITY', 'STATE', 'ZIPCODE', 'SAFETY_EQUIPMENT', 
                          'AIRBAG_DEPLOYED', 'EJECTION', 'INJURY_CLASSIFICATION', 'HOSPITAL', 
                          'EMS_AGENCY', 'EMS_RUN_NO', 'DRIVER_ACTION', 'DRIVER_VISION', 'PHYSICAL_CONDITION',
                          'PEDPEDAL_ACTION', 'PEDPEDAL_VISIBILITY', 'PEDPEDAL_LOCATION'], inplace=True)

In [6]:
# Drop columns from Crashes dataframe

df_crashes.drop(columns = ['FIRST_CRASH_TYPE', 'LANE_CNT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I', 
                           'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DATE_POLICE_NOTIFIED', 'STREET_NO', 
                           'STREET_DIRECTION', 'STREET_NAME', 'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 
                           'WORK_ZONE_I', 'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS', 'MOST_SEVERE_INJURY', 
                           'INJURIES_TOTAL', 'INJURIES_FATAL', 'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING', 
                           'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN', 'CRASH_MONTH'], 
                           inplace=True)

Because we are most interested in vehicle damage, we are using the Vehicles dataframe as the main and merging the others into it.

In [7]:
# Merge People dataframe with Vehicles dataframe

df = df_vehicles.merge(df_people, how="left", on=["CRASH_RECORD_ID", "CRASH_DATE", "RD_NO", "VEHICLE_ID"])

In [8]:
# Merge Crashes dataframe with merged dataframe

df = pd.merge(df, df_crashes, how = 'inner', on = ['CRASH_RECORD_ID', "CRASH_DATE", "RD_NO"])
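The two merges can be sanity-checked on toy frames before running them on the full files. The sketch below is only an illustration: the rows are made up, and the use of pandas' `indicator` and `validate` options is an assumption about how one might check the joins, not something the notebook does.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are invented for illustration)
vehicles = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "b"],
    "VEHICLE_ID": [1, 2, 3],
})
people = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "a", "b"],
    "VEHICLE_ID": [1, 1, 2, 3],
    "PERSON_TYPE": ["DRIVER", "PASSENGER", "DRIVER", "DRIVER"],
})
crashes = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "b"],
    "DAMAGE": ["OVER $1,500", "$500 OR LESS"],
})

# Left merge: one row per person per vehicle, so vehicles with passengers repeat
vp = vehicles.merge(people, how="left",
                    on=["CRASH_RECORD_ID", "VEHICLE_ID"], indicator=True)
unmatched = (vp["_merge"] == "left_only").sum()  # vehicles with no person record

# validate="many_to_one" raises MergeError if a crash ID is duplicated in `crashes`
merged = vp.merge(crashes, how="inner", on="CRASH_RECORD_ID", validate="many_to_one")

print(len(vehicles), len(merged), unmatched)
```

The row count grows from three vehicles to four person-level rows, which is the duplication that the later DRIVER filter is meant to remove.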

In [9]:
# Exporting new dataframe to csv for use throughout rest of notebook

df.to_csv("Data/chicago_traffic_accidents_2021_to_11-09-2023.csv")

### Working with a single merge dataset

In [5]:
# Can load merged dataframe without needing to go through above steps each time
df = pd.read_csv("Data/chicago_traffic_accidents_2021_to_11-09-2023.zip", 
                 parse_dates=["CRASH_DATE"], low_memory=False)

We don't need all the identifying columns as they are not useful in making a model. We will drop all of those now.

In [6]:
df.drop(columns = ["Unnamed: 0", "CRASH_UNIT_ID", "CRASH_RECORD_ID", 
                   "RD_NO", "VEHICLE_ID"], inplace=True)

We want each vehicle's damage counted only once, so we need to remove rows that represent passengers, as these duplicate the vehicle's damage record. We should also remove any other rows that don't represent drivers. We can use the "PERSON_TYPE" column for this.

In [7]:
# Check values in Person_type column

df['PERSON_TYPE'].value_counts(normalize=True)

DRIVER                 0.780694
PASSENGER              0.197756
PEDESTRIAN             0.012899
BICYCLE                0.007730
NON-MOTOR VEHICLE      0.000760
NON-CONTACT VEHICLE    0.000162
Name: PERSON_TYPE, dtype: float64

In [8]:
# Remove all types of person except DRIVER

df = df[df['PERSON_TYPE'] == 'DRIVER']

In [9]:
# Sanity check

df['PERSON_TYPE'].value_counts(normalize=True)

DRIVER    1.0
Name: PERSON_TYPE, dtype: float64

### Missingness

Next we look at null values to try to determine which columns might need to be imputed or if the data is too incomplete to be useful. 

In [10]:
# First dropping columns that no longer have any data after removing all but DRIVER entries

df = df.dropna(axis=1, how="all")

In [11]:
# Looking at the total nulls left in remaining columns

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530319 entries, 0 to 766594
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   CRASH_DATE               530319 non-null  datetime64[ns]
 1   UNIT_TYPE                530310 non-null  object        
 2   NUM_PASSENGERS           85079 non-null   float64       
 3   CMRC_VEH_I               10945 non-null   object        
 4   MAKE                     530098 non-null  object        
 5   MODEL                    530098 non-null  object        
 6   VEHICLE_YEAR             434488 non-null  float64       
 7   VEHICLE_DEFECT           530098 non-null  object        
 8   VEHICLE_TYPE             530098 non-null  object        
 9   VEHICLE_USE              530098 non-null  object        
 10  OCCUPANT_CNT             530098 non-null  float64       
 11  EXCEED_SPEED_LIMIT_I     9 non-null       object        
 12  PERSON_TYPE     

There are several columns that seem useful from the data dictionaries, but look almost entirely full of nulls. We do a value_counts for those columns to see what's in them.

In [12]:
# NUM_PASSENGERS

df["NUM_PASSENGERS"].value_counts(dropna=False)

NaN     445240
1.0      60610
2.0      15374
3.0       6037
4.0       2034
5.0        560
6.0        234
7.0         84
8.0         28
10.0        23
9.0         19
11.0        18
12.0        10
17.0         6
14.0         5
19.0         4
13.0         4
16.0         4
15.0         3
18.0         2
22.0         2
21.0         2
27.0         2
43.0         2
26.0         2
20.0         1
33.0         1
34.0         1
28.0         1
46.0         1
42.0         1
30.0         1
32.0         1
31.0         1
24.0         1
Name: NUM_PASSENGERS, dtype: int64

There is no 0 value, so the NaNs are probably 0. However, OCCUPANT_CNT represents the same information, so we won't need this column.
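One way to test the guess that a missing NUM_PASSENGERS means zero passengers is to compare it against OCCUPANT_CNT. The sketch below assumes OCCUPANT_CNT counts the driver plus passengers, which the data dictionary would need to confirm. The toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows (invented): two vehicles with no recorded passengers, two with some
toy = pd.DataFrame({
    "NUM_PASSENGERS": [np.nan, 1.0, 2.0, np.nan],
    "OCCUPANT_CNT": [1.0, 2.0, 3.0, 1.0],
})

# Treat missing passenger counts as zero, then add one for the driver
implied_occupants = toy["NUM_PASSENGERS"].fillna(0) + 1

# Share of rows where the two columns agree under that assumption
agreement = (implied_occupants == toy["OCCUPANT_CNT"]).mean()
print(agreement)
```

On the real data, an agreement share close to 1 would support dropping NUM_PASSENGERS as redundant.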

In [13]:
# CMRC_VEH_I

df["CMRC_VEH_I"].value_counts(dropna=False)

NaN    519374
Y        6602
N        4343
Name: CMRC_VEH_I, dtype: int64

This is a binary flag, but the missing values represent the overwhelming majority. We can drop the Y rows because they are commercial vehicles and do not fit the business problem, but the rest we will leave as we cannot make an assumption from such a small subset that the Y/N ratio is representative of the whole. 

In [14]:
# Dropping Commercial Vehicles

df = df[df["CMRC_VEH_I"]!="Y"]

In [15]:
# Sanity Check

df["CMRC_VEH_I"].value_counts(dropna=False)

NaN    519374
N        4343
Name: CMRC_VEH_I, dtype: int64

In [16]:
# EXCEED_SPEED_LIMIT_I

df["EXCEED_SPEED_LIMIT_I"].value_counts(dropna=False)

NaN    523708
N           5
Y           4
Name: EXCEED_SPEED_LIMIT_I, dtype: int64

This is a binary flag, but the missing values represent the overwhelming majority. We cannot make an assumption from such a small subset that the Y/N ratio is representative of the whole. This does not seem to be a useful column.

In [17]:
# AGE

df["AGE"].value_counts(dropna=False)

NaN      152142
28.0      10738
27.0      10706
29.0      10618
26.0      10583
          ...  
101.0         5
102.0         4
103.0         3
98.0          3
110.0         2
Name: AGE, Length: 106, dtype: int64

In [18]:
df["AGE"].mean()

39.92547399582857

Missing values make up a smaller share here (152,142 of 523,717 rows, about 29%), and the mean age of roughly 39.9 is plausible, so we can impute the mean later on and keep this column.
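A minimal sketch of mean imputation for a column like AGE, using made-up values. It also passes `add_indicator=True`, an optional extra not used in the notebook, so that a model can still see which ages were originally missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy AGE values (invented); the mean is learned from the training rows only
age_train = np.array([[25.0], [40.0], [np.nan], [55.0]])
age_test = np.array([[np.nan], [30.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
age_train_imp = imputer.fit_transform(age_train)  # columns: imputed AGE, missing flag
age_test_imp = imputer.transform(age_test)        # reuses the training mean

print(age_test_imp)
```

Fitting on the training rows and only transforming the test rows keeps test information out of the imputed value.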

In [19]:
# CELL_PHONE_USE

df["CELL_PHONE_USE"].value_counts(dropna=False)

NaN    523715
N           2
Name: CELL_PHONE_USE, dtype: int64

This is a binary flag, but nearly every value is missing: only two rows are non-null, and both are N. This does not seem to be a useful column.

In [20]:
# Dropping all columns determined not to be useful

df.drop(columns = ["CMRC_VEH_I", "EXCEED_SPEED_LIMIT_I", "CELL_PHONE_USE"], inplace=True)

In [21]:
#Sanity Check

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 523717 entries, 0 to 766594
Data columns (total 36 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   CRASH_DATE               523717 non-null  datetime64[ns]
 1   UNIT_TYPE                523708 non-null  object        
 2   NUM_PASSENGERS           84450 non-null   float64       
 3   MAKE                     523496 non-null  object        
 4   MODEL                    523496 non-null  object        
 5   VEHICLE_YEAR             428682 non-null  float64       
 6   VEHICLE_DEFECT           523496 non-null  object        
 7   VEHICLE_TYPE             523496 non-null  object        
 8   VEHICLE_USE              523496 non-null  object        
 9   OCCUPANT_CNT             523496 non-null  float64       
 10  PERSON_TYPE              523717 non-null  object        
 11  SEX                      523717 non-null  object        
 12  AGE             

### Compare Variables

In [22]:
# Check out the rest of the variables for weirdness

df.describe(include="all")


Unnamed: 0,CRASH_DATE,UNIT_TYPE,NUM_PASSENGERS,MAKE,MODEL,VEHICLE_YEAR,VEHICLE_DEFECT,VEHICLE_TYPE,VEHICLE_USE,OCCUPANT_CNT,...,ROAD_DEFECT,DAMAGE,PRIM_CONTRIBUTORY_CAUSE,SEC_CONTRIBUTORY_CAUSE,BEAT_OF_OCCURRENCE,CRASH_HOUR,CRASH_DAY_OF_WEEK,LATITUDE,LONGITUDE,LOCATION
count,523717,523708,84450.0,523496,523496,428682.0,523496,523496,523496,523496.0,...,523717,523717,523717,523717,523717.0,523717.0,523717.0,519247.0,519247.0,519247
unique,204665,5,,628,1694,,17,21,23,,...,7,3,38,38,,,,,,154726
top,2022-02-17 15:30:00,DRIVER,,UNKNOWN,OTHER (EXPLAIN IN NARRATIVE),,UNKNOWN,PASSENGER,PERSONAL,,...,NO DEFECTS,"OVER $1,500",UNABLE TO DETERMINE,NOT APPLICABLE,,,,,,POINT (-87.905309125103 41.976201139024)
freq,37,523684,,61453,98351,,272814,322655,351807,,...,405349,360020,202712,214936,,,,,,1040
first,2021-01-01 00:00:00,,,,,,,,,,...,,,,,,,,,,
last,2023-11-09 02:40:00,,,,,,,,,,...,,,,,,,,,,
mean,,,1.456945,,,2014.883128,,,,1.234516,...,,,,,1239.549677,13.360557,4.137664,41.852148,-87.67329,
std,,,0.99552,,,118.634721,,,,0.66726,...,,,,,701.016896,5.512825,1.982557,0.368367,0.752426,
min,,,1.0,,,1900.0,,,,0.0,...,,,,,111.0,0.0,1.0,0.0,-87.936193,
25%,,,1.0,,,2009.0,,,,1.0,...,,,,,715.0,10.0,2.0,41.779949,-87.722859,


VEHICLE_YEAR needs attention: its standard deviation is about 118.6 and its minimum is 1900, which points to implausible values.

In [23]:
# First we'll see how many vehicles have vehicle years that are not possible

future_cars = df[df["VEHICLE_YEAR"] > 2024]
future_cars["VEHICLE_YEAR"].count()

308

In [24]:
# See if anything else is apparent about these rows

future_cars.head()

Unnamed: 0,CRASH_DATE,UNIT_TYPE,NUM_PASSENGERS,MAKE,MODEL,VEHICLE_YEAR,VEHICLE_DEFECT,VEHICLE_TYPE,VEHICLE_USE,OCCUPANT_CNT,...,ROAD_DEFECT,DAMAGE,PRIM_CONTRIBUTORY_CAUSE,SEC_CONTRIBUTORY_CAUSE,BEAT_OF_OCCURRENCE,CRASH_HOUR,CRASH_DAY_OF_WEEK,LATITUDE,LONGITUDE,LOCATION
773,2023-04-24 14:13:00,DRIVER,,ACURA,ILX,2032.0,UNKNOWN,PASSENGER,PERSONAL,1.0,...,NO DEFECTS,"OVER $1,500",IMPROPER TURNING/NO SIGNAL,NOT APPLICABLE,123.0,14,2,41.870712,-87.626059,POINT (-87.626059232625 41.870711859759)
1823,2023-08-18 16:00:00,DRIVER,,HONDA,HR-V,2108.0,NONE,SPORT UTILITY VEHICLE (SUV),PERSONAL,1.0,...,NO DEFECTS,"$501 - $1,500",NOT APPLICABLE,NOT APPLICABLE,922.0,16,6,41.80989,-87.700308,POINT (-87.700307527795 41.809889876424)
4078,2023-04-26 21:05:00,DRIVER,,UNKNOWN,OTHER (EXPLAIN IN NARRATIVE),9999.0,UNKNOWN,OTHER,UNKNOWN/NA,1.0,...,NO DEFECTS,$500 OR LESS,UNABLE TO DETERMINE,UNABLE TO DETERMINE,2212.0,21,4,41.697183,-87.681473,POINT (-87.681473376583 41.697183178537)
5209,2023-04-27 10:00:00,DRIVER,,NISSAN,ROGUE,2212.0,NONE,PASSENGER,PERSONAL,1.0,...,NO DEFECTS,"OVER $1,500",UNABLE TO DETERMINE,NOT APPLICABLE,1632.0,10,5,41.938256,-87.79652,POINT (-87.796519503363 41.938255709148)
8326,2023-08-20 02:00:00,DRIVER,,UNKNOWN,MOTORIZED,9999.0,UNKNOWN,UNKNOWN/NA,UNKNOWN/NA,1.0,...,NO DEFECTS,"OVER $1,500",UNABLE TO DETERMINE,NOT APPLICABLE,1811.0,2,1,41.919481,-87.662996,POINT (-87.66299600038 41.919481465693)


In [25]:
# Percent of rows with future VEHICLE_YEARS

len(future_cars)/len(df)*100

0.05881038805308974

In [26]:
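# Nothing seems apparently off about these vehicles, 
# but as they are such a small percentage we will drop them out.
# Note: NaN <= 2024 evaluates to False, so this filter also drops every row
# with a missing VEHICLE_YEAR (523,717 rows before, 428,374 after).

df = df[df["VEHICLE_YEAR"] <= 2024]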
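If rows with a missing vehicle year should be kept for later imputation, the comparison needs an explicit `isna()` branch. The following is a sketch on made-up values and not the filter the notebook applies, so the row counts shown later would change if it were used.

```python
import numpy as np
import pandas as pd

# Toy years (invented): one valid, one missing, one impossible, one at the cutoff
toy = pd.DataFrame({"VEHICLE_YEAR": [2015.0, np.nan, 9999.0, 2024.0]})

# Comparisons with NaN are False, so the missing year is dropped here too
strict = toy[toy["VEHICLE_YEAR"] <= 2024]

# Keeping missing years explicitly leaves them available for imputation
lenient = toy[toy["VEHICLE_YEAR"].isna() | (toy["VEHICLE_YEAR"] <= 2024)]

print(len(strict), len(lenient))
```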

Now we compare variables to the DAMAGE column. We will be making a target based on DAMAGE later below.

In [35]:
# Mean of each numeric column for every MAKE
df.groupby("MAKE").mean(numeric_only=True)

MAKE,NUM_PASSENGERS,VEHICLE_YEAR,OCCUPANT_CNT,AGE,BAC_RESULT VALUE,POSTED_SPEED_LIMIT,BEAT_OF_OCCURRENCE,CRASH_HOUR,CRASH_DAY_OF_WEEK,LATITUDE,LONGITUDE
"(HOMEMADE MOTORCYCLE, MOPED.ETC.)",1.0,2013.227273,1.015152,33.774194,,29.545455,1462.196970,14.378788,4.303030,41.886035,-87.689217
(HOMEMADE TRAILER),,2015.500000,1.000000,42.700000,,30.000000,1262.000000,15.250000,4.250000,41.863117,-87.715259
(RECONSTRUCTED TRAILERS),1.0,2013.470588,1.058824,38.466667,,30.000000,1149.588235,11.647059,4.411765,41.842316,-87.688997
(UNLISTED CONSTRUCTION EQUIPMENT MAKE),,2017.727273,1.000000,51.272727,,28.181818,1490.454545,9.818182,3.818182,41.902160,-87.691162
(UNLISTED MAKE),,2014.466667,1.000000,38.090909,,28.666667,1325.533333,13.200000,4.000000,41.859210,-87.662477
...,...,...,...,...,...,...,...,...,...,...,...
YAMAHA,1.1,2010.972067,1.061453,33.537500,,30.016760,1316.217877,14.664804,4.301676,41.878427,-87.686869
YARBROUGH MANUFACTURING COMPANY - COMET MOTORCYCLE TRAILER,,1998.500000,1.000000,35.000000,,32.500000,1472.500000,10.500000,4.500000,41.747996,-87.653645
"YELLOWSTONE, INC.",4.0,2015.250000,3.000000,57.000000,,30.000000,197.250000,18.250000,4.500000,41.827603,-87.624124
YUKON DELTA,2.0,2008.444444,1.666667,46.250000,,26.111111,1021.000000,12.222222,3.444444,41.824509,-87.650988


## Feature Engineering

We are interested in driving skills, knowledge, or experience as a contributing cause of accidents.

In [36]:
(df['PRIM_CONTRIBUTORY_CAUSE'] == 'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE').value_counts()

False    413932
True      14442
Name: PRIM_CONTRIBUTORY_CAUSE, dtype: int64

In [37]:
(df['SEC_CONTRIBUTORY_CAUSE'] == 'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE').value_counts()

False    414731
True      13643
Name: SEC_CONTRIBUTORY_CAUSE, dtype: int64

In [38]:
((df['PRIM_CONTRIBUTORY_CAUSE'] == 'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE') & (df['SEC_CONTRIBUTORY_CAUSE'] == 'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE')).value_counts()

False    425924
True       2450
dtype: int64

In [39]:
# Create a new column to identify any contributory cause as "driving skills/knowledge/experience"

def get_cause(row):
    if row['PRIM_CONTRIBUTORY_CAUSE'] == 'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE':
        return 1
    if row['SEC_CONTRIBUTORY_CAUSE'] == 'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE':
        return 1
    else:
        return 0
df['DRIVING_SKILLS'] = df.apply(get_cause, axis=1)
   

In [40]:
df['DRIVING_SKILLS'].value_counts()

0    402739
1     25635
Name: DRIVING_SKILLS, dtype: int64
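The counts above agree with inclusion-exclusion: 14,442 + 13,643 - 2,450 = 25,635 rows where either cause is driving skills. The same flag can be built without a row-wise `apply`. Below is a sketch on made-up rows, with `CAUSE` as a helper name introduced here for illustration.

```python
import pandas as pd

CAUSE = "DRIVING SKILLS/KNOWLEDGE/EXPERIENCE"  # helper constant, not in the notebook

# Toy rows (invented) covering primary only, secondary only, both, and neither
toy = pd.DataFrame({
    "PRIM_CONTRIBUTORY_CAUSE": [CAUSE, "UNABLE TO DETERMINE", CAUSE, "NOT APPLICABLE"],
    "SEC_CONTRIBUTORY_CAUSE": ["NOT APPLICABLE", CAUSE, CAUSE, "NOT APPLICABLE"],
})

is_prim = toy["PRIM_CONTRIBUTORY_CAUSE"] == CAUSE
is_sec = toy["SEC_CONTRIBUTORY_CAUSE"] == CAUSE

# Element-wise OR over whole columns replaces the row-by-row function
toy["DRIVING_SKILLS"] = (is_prim | is_sec).astype(int)

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
expected_total = is_prim.sum() + is_sec.sum() - (is_prim & is_sec).sum()
print(toy["DRIVING_SKILLS"].sum(), expected_total)
```

On several hundred thousand rows, the column-wise version is typically much faster than `df.apply(..., axis=1)`.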

We want to identify crashes by the time of day they occurred.

In [41]:
# Create a new column to identify the time of day crashes occurred

def hour(row):
    if (row['CRASH_HOUR'] < 7) or (row['CRASH_HOUR'] >23):
        return 'Overnight'
    if (row['CRASH_HOUR'] in range(7,10)) or (row['CRASH_HOUR'] in range(16,20)):
        return 'Commute'
    if row['CRASH_HOUR'] in range(10,16):
        return 'Daytime'
    if row['CRASH_HOUR'] in range(20,24):
        return 'Evening'
    
df['TIME_OF_DAY'] = df.apply(hour, axis=1)

In [42]:
df['TIME_OF_DAY'].value_counts(dropna=False)

Commute      174318
Daytime      158559
Evening       52719
Overnight     42778
Name: TIME_OF_DAY, dtype: int64
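The same hour buckets can be expressed with `np.select`, which evaluates the conditions on the whole column at once. This sketch keeps the boundaries used above (Overnight before 7, Commute 7-9 and 16-19, Daytime 10-15, Evening 20-23) and runs on a toy series of all 24 hours.

```python
import numpy as np
import pandas as pd

# Every hour of the day once, standing in for the CRASH_HOUR column
hours = pd.Series(range(24), name="CRASH_HOUR")

# Conditions are checked in order; between() is inclusive on both ends
conditions = [
    hours < 7,
    hours.between(7, 9) | hours.between(16, 19),
    hours.between(10, 15),
]
choices = ["Overnight", "Commute", "Daytime"]

# Hours matching none of the conditions (20-23) fall through to "Evening"
time_of_day = pd.Series(np.select(conditions, choices, default="Evening"),
                        index=hours.index, name="TIME_OF_DAY")

print(time_of_day.value_counts())
```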

In [43]:
# Check the values in Damage column

df['DAMAGE'].value_counts()

OVER $1,500      297113
$501 - $1,500     97272
$500 OR LESS      33989
Name: DAMAGE, dtype: int64

In [44]:
# Create a new column to identify damage as > $1500 or <= $1500

damage_dict = {'OVER $1,500':1, '$501 - $1,500':0, '$500 OR LESS':0}
df['DAMAGE_OVER_1500'] = df['DAMAGE'].map(damage_dict)

In [45]:
# Sanity check

df['DAMAGE_OVER_1500'].value_counts()

1    297113
0    131261
Name: DAMAGE_OVER_1500, dtype: int64

## Dummy Model

### Decide Xs/y

The target is DAMAGE_OVER_1500. There are a few columns which represent interrelated variables, so only one will be used. 

In [None]:
X = df.drop(["NUM_PASSENGERS", "DAMAGE", "DAMAGE_OVER_1500"], axis=1)
y = df["DAMAGE_OVER_1500"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2024)

In [None]:
dummy_model = DummyClassifier(strategy="most_frequent")

In [None]:
dummy_model.fit(X_train, y_train)

In [None]:
cv_results_dummy = cross_val_score(dummy_model, X_train, y_train, cv=5)
cv_results_dummy
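A most-frequent dummy classifier scores the share of the majority class, so its cross-validated accuracy can be anticipated from the class counts: 297,113 / (297,113 + 131,261) is about 0.694 on the full data. The sketch below checks that idea on a toy 70/30 target, not on the crash data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Toy target with a 70/30 class split (invented for illustration)
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.zeros((100, 1))  # DummyClassifier ignores the feature values

scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X_toy, y_toy, cv=5)
print(scores.mean())

# Majority-class share implied by the DAMAGE_OVER_1500 counts
majority_share = 297113 / (297113 + 131261)
print(round(majority_share, 3))
```

Any later model needs to beat roughly this accuracy to add value over guessing the majority class.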

# Model 2

## Baseline Numeric Model

For a baseline model, we will use only the already numeric columns as Xs to predict DAMAGE_OVER_1500 as y. We will omit all columns that are simply identifiers or keys. We also omit NUM_PASSENGERS as it is directly related to OCCUPANT_CNT.

In [None]:
# Using .describe to see the variables that are already numeric

df.describe()

VEHICLE_YEAR still looks suspect: the earlier filter only removed years after 2024, so values as low as 1900 remain.

In [None]:
# Making a baseline dataframe 

# STREET_NO, NUM_UNITS and CRASH_MONTH were dropped from the Crashes data earlier, and BAC_TEST
# is not among the numeric columns in the groupby output above, so they are left out here
df_bl = df[["VEHICLE_YEAR", "OCCUPANT_CNT", "AGE", "BAC_RESULT VALUE", "POSTED_SPEED_LIMIT", "BEAT_OF_OCCURRENCE", "CRASH_HOUR", "CRASH_DAY_OF_WEEK", "LATITUDE", "LONGITUDE", "DAMAGE_OVER_1500"]]

In [None]:
# Assigning Xs & y 

X_bl = df_bl.drop("DAMAGE_OVER_1500", axis=1)
y_bl = df_bl["DAMAGE_OVER_1500"]

### Train/Test Split

In [None]:
X_train_bl, X_test_bl, y_train_bl, y_test_bl = train_test_split(X_bl, y_bl, random_state=2024)

### Preprocessing Steps (StandardScaler, OneHotEncoder, SimpleImputer)

In [None]:
numeric_imputer = SimpleImputer()
X_train_blimp = numeric_imputer.fit_transform(X_train_bl)

### Modeling (look at Coefficients, P-values)

In [None]:
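# penalty="none" is the spelling for older scikit-learn; versions 1.2+ accept penalty=None and 1.4+ require it
bl_logreg = LogisticRegression(random_state=2024, penalty="none", max_iter=1000)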

In [None]:
bl_logreg.fit(X_train_blimp, y_train_bl)
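For the coefficient review, a fitted scikit-learn `LogisticRegression` exposes `coef_` but does not report p-values. Exponentiating a coefficient gives the multiplicative change in odds per one-unit increase in that feature. The sketch below uses synthetic data with two column names borrowed from the dataset; the values and the effect size are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features (invented values); only AGE influences the toy target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "AGE": rng.normal(40, 12, 500),
    "POSTED_SPEED_LIMIT": rng.choice([25, 30, 35], 500),
})
log_odds = 0.05 * (X_toy["AGE"] - 40)
y_toy = (rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# exp(coefficient) = odds ratio for a one-unit increase in the feature
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_toy.columns)
print(odds_ratios)
```

If p-values are needed, a statistics-oriented library would have to be used, since scikit-learn's estimator does not compute them.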

In [None]:
confusion_matrix(y_train_bl, bl_logreg.predict(X_train_blimp))
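The four cells of a binary confusion matrix map directly onto precision and recall. The sketch below uses made-up labels to show the unpacking order scikit-learn uses: true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (invented): 1 = damage over $1,500, 0 = otherwise
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# Rows are actual classes, columns are predicted classes, both ordered 0 then 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found

print(tn, fp, fn, tp, precision, recall)
```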

### Evaluation

In [None]:
cv_results = cross_val_score(bl_logreg, X_train_blimp, y_train_bl, cv=5)
cv_results

In [None]:
print("Dummy Model CV:          ", cv_results_dummy)
print("Initial Numeric Model CV:", cv_results)

So we can see that a model using only the already numeric columns performs no better than always predicting the most frequent class.

# Model 3

## Evaluation OF/UF report Test