<h1 align="center">MACHINE LEARNING CLASSIFICATION MODEL FOR TRAFFIC CRASHES</h1>

After exploring and analyzing the data, we'll build a Machine Learning classification model. We need to extract, clean, and preprocess the data, then compare several classifiers to find the best one for the task.

## IMPORTING LIBRARIES

We'll import the necessary libraries for preprocessing, creating and evaluating the Machine Learning classification model:

In [2]:
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from scipy.stats import randint, uniform
from time import time
import joblib

## EXTRACTING THE DATA

Let's read the filtered dataset (traffic_crashes_for_ml.csv) to build our model:

In [3]:
pd.set_option('display.max_columns', None)
project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
data_dir = os.path.join(project_dir, "data", "processed")

path = os.path.join(data_dir, 'traffic_crashes_for_ml.csv')
df = pd.read_csv(path)

df.sample(3)


Unnamed: 0,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,REPORT_TYPE,CRASH_TYPE,HIT_AND_RUN_I,DAMAGE,PRIM_CONTRIBUTORY_CAUSE,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,CRASH_HOUR,CRASH_DAY_OF_WEEK,LATITUDE,LONGITUDE,MONTH_POLICE_NOTIFIED,DAY_POLICE_NOTIFIED
34092,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,REAR END,DIVIDED - W/MEDIAN BARRIER,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,DA,"$501 - $1,500",UNABLE TO DETERMINE,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,8,4,41.722179,-87.584962,10,27
545544,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,ANGLE,NOT DIVIDED,ON SCENE,NO INJURY / DRIVE AWAY,Y,"OVER $1,500",UNABLE TO DETERMINE,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,3.0,21,2,41.875225,-87.725337,11,15
638189,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,SIDESWIPE SAME DIRECTION,NOT DIVIDED,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,DA,"$501 - $1,500",UNABLE TO DETERMINE,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,13,7,41.892593,-87.624334,8,11


In [4]:
df.shape

(862214, 22)

## DATA PREPROCESSING

For our model, we need to encode the categorical (non-numeric) variables as integers:

In [4]:
categorical_cols = df.select_dtypes(exclude=['number']).columns.tolist()
categorical_cols

['TRAFFIC_CONTROL_DEVICE',
 'DEVICE_CONDITION',
 'FIRST_CRASH_TYPE',
 'TRAFFICWAY_TYPE',
 'REPORT_TYPE',
 'CRASH_TYPE',
 'HIT_AND_RUN_I',
 'DAMAGE',
 'PRIM_CONTRIBUTORY_CAUSE',
 'MOST_SEVERE_INJURY']

In [5]:
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

df.head(2)

Unnamed: 0,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,REPORT_TYPE,CRASH_TYPE,HIT_AND_RUN_I,DAMAGE,PRIM_CONTRIBUTORY_CAUSE,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,CRASH_HOUR,CRASH_DAY_OF_WEEK,LATITUDE,LONGITUDE,MONTH_POLICE_NOTIFIED,DAY_POLICE_NOTIFIED
0,16,1,7,2,2,1,2,2,17,2,0.0,0.0,0.0,0.0,0.0,1.0,14,7,41.85412,-87.665902,7,29
1,4,3,8,8,2,0,0,1,17,3,1.0,0.0,0.0,1.0,0.0,1.0,17,6,41.942976,-87.761883,8,18
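
As a small illustration of what the encoding step does, the sketch below fits a LabelEncoder on a toy column and then recovers the original labels with inverse_transform, which is why keeping the fitted encoders in label_encoders is useful. The labels are borrowed from the REPORT_TYPE sample shown earlier; the Series itself is hypothetical. Note that LabelEncoder assigns codes in sorted order of the class labels, so the integers carry no meaningful ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical data; labels taken from the REPORT_TYPE sample)
report_type = pd.Series(["ON SCENE", "NOT ON SCENE (DESK REPORT)", "ON SCENE"])

le = LabelEncoder()
codes = le.fit_transform(report_type)

# Classes are stored in sorted order; each code is a position in that list
print(list(le.classes_))
print(list(codes))

# The fitted encoder can map the integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```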


Now we draw a random 5% sample of the encoded dataset to use for the first round of model training:

In [6]:
df_to_ml = df.sample(round(df.shape[0]*0.05))

In [7]:
df_to_ml.shape

(43111, 22)
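
The sample above is drawn without a seed, so each run selects different rows. A minimal sketch of a reproducible alternative, assuming a fixed random_state is acceptable for this analysis (the toy DataFrame and the seed value 42 are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoded crash DataFrame (hypothetical data)
toy_df = pd.DataFrame({"feature": np.arange(1000), "target": np.arange(1000) % 2})

# frac=0.05 keeps 5% of the rows; random_state makes the draw repeatable
sample_a = toy_df.sample(frac=0.05, random_state=42)
sample_b = toy_df.sample(frac=0.05, random_state=42)

print(sample_a.shape)
print(sample_a.index.equals(sample_b.index))
```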

## GENERATING A CLASSIFICATION MACHINE LEARNING MODEL

We'll create a function that standardizes the features, tunes multiple models with a randomized hyperparameter search, selects the model with the best validation accuracy, and saves it:

In [8]:
def train_and_evaluate_models(df, target_column, models, param_distributions, n_iter):
    # Separate the features from the target
    X = df.drop(target_column, axis=1)
    y = df[target_column]

    # Standardize the features (zero mean, unit variance)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

    # Hold out 25% of the rows for validation
    X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.25, random_state=42)

    best_model_overall = None
    best_model_name = None
    best_accuracy = 0

    for name, model in models.items():
        print(f"\nTraining and Evaluating {name}:")

        start_time = time()

        # Randomized hyperparameter search with 5-fold cross-validation
        random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions[name],
                                           n_iter=n_iter, cv=5, scoring='accuracy', n_jobs=-1)
        random_search.fit(X_train, y_train)

        best_estimator = random_search.best_estimator_
        print(f"Best params for {name}: {random_search.best_params_}")

        # Score the tuned estimator on the held-out validation set
        y_pred = best_estimator.predict(X_val)
        accuracy = accuracy_score(y_val, y_pred)
        print(f'{name}: Accuracy = {accuracy:.4f}')

        end_time = time()
        print(f"Training time for {name}: {end_time - start_time:.2f} seconds")

        # Keep track of the best model seen so far
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_model_overall = best_estimator
            best_model_name = name

    print(f'\n🏆 Best model: {best_model_name} with accuracy {best_accuracy:.4f}')

    # Save the best model
    project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
    model_dir = os.path.join(project_dir, "models")
    os.makedirs(model_dir, exist_ok=True)

    best_model_path = os.path.join(model_dir, "best_model.pkl")
    joblib.dump(best_model_overall, best_model_path)
    print(f"✅ Model saved in: {best_model_path}")

    # Return the best estimator so the caller can assign and reuse it
    return best_model_overall
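
One design point worth noting: the function fits StandardScaler on all rows before splitting, and only the estimator is written to disk, so the scaler is not stored alongside the model. A possible alternative, sketched here under the assumption that a scikit-learn Pipeline suits this workflow, bundles scaling and the classifier so that the scaler is refit inside each cross-validation fold and travels with the model. The synthetic data, the step names scaler and clf, and the small search space are illustrative assumptions, not part of the original notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crash features and target (hypothetical data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Scaling and classification are chained into a single estimator
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search_space = {
    "clf__n_estimators": randint(10, 30),
    "clf__max_depth": randint(2, 6),
}

search = RandomizedSearchCV(pipeline, param_distributions=search_space, n_iter=3,
                            cv=3, scoring="accuracy", random_state=0)
search.fit(X_train, y_train)

# best_estimator_ is the whole pipeline, so raw (unscaled) features can be passed in
val_accuracy = accuracy_score(y_val, search.best_estimator_.predict(X_val))
print(search.best_params_)
print(f"Validation accuracy: {val_accuracy:.4f}")
```

With this layout, saving search.best_estimator_ with joblib would persist the fitted scaler together with the classifier.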


Now, let's define the models and their hyperparameter search spaces:

In [9]:
param_distributions = {
    'RandomForestClassifier': {
        'n_estimators': randint(50, 200),
        'max_depth': randint(10, 50),
        'min_samples_split': randint(2, 10),
        'min_samples_leaf': randint(1, 5)
    },
    'GradientBoostingClassifier': {
        'n_estimators': randint(50, 200),
        'learning_rate': uniform(0.01, 0.3),
        'max_depth': randint(3, 10),
        'min_samples_split': randint(2, 10)
    },
    'AdaBoostClassifier': {
        'n_estimators': randint(50, 200),
        'learning_rate': uniform(0.01, 0.3)
    },
    'LogisticRegression': {
        'C': uniform(0.1, 10),
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga']
    },

    'DecisionTreeClassifier': {
        'max_depth': randint(1, 20),
        'min_samples_split': randint(2, 10),
        'min_samples_leaf': randint(1, 5)
    },
    'KNeighborsClassifier': {
        'n_neighbors': randint(3, 15)
    }

}

models = {
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier(),
    'LogisticRegression': LogisticRegression(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'KNeighborsClassifier': KNeighborsClassifier(),
}
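
When reading the search spaces above, it helps to recall how the SciPy distributions are parameterized: randint(low, high) draws integers from low up to but not including high, and uniform(loc, scale) covers the interval from loc to loc + scale, so uniform(0.01, 0.3) spans 0.01 to 0.31 rather than 0.01 to 0.3. A quick check (the sample size and seed are arbitrary choices):

```python
from scipy.stats import randint, uniform

# Draw many samples to inspect the empirical range of each distribution
n_estimators_draws = randint(50, 200).rvs(size=5000, random_state=0)
learning_rate_draws = uniform(0.01, 0.3).rvs(size=5000, random_state=0)

# randint excludes the upper bound, so the largest possible value is 199
print(n_estimators_draws.min(), n_estimators_draws.max())

# uniform(loc, scale) spans [loc, loc + scale] = [0.01, 0.31]
print(learning_rate_draws.min(), learning_rate_draws.max())
```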

Now we run the function on the 5% sample, with 5 search iterations per model:

In [10]:
n_iter=5

print(f"Sample size: {df_to_ml.shape[0]}")
print(f"Number of iteration: {n_iter}")

best_model = train_and_evaluate_models(df_to_ml, 'CRASH_TYPE', models, param_distributions, n_iter)


Sample size: 43111
Number of iteration: 5

Training and Evaluating RandomForestClassifier:
Best params for RandomForestClassifier: {'max_depth': 23, 'min_samples_leaf': 1, 'min_samples_split': 7, 'n_estimators': 127}
RandomForestClassifier: Accuracy = 0.8981
Training time for RandomForestClassifier: 55.58 seconds

Training and Evaluating GradientBoostingClassifier:
Best params for GradientBoostingClassifier: {'learning_rate': 0.0738720132669246, 'max_depth': 3, 'min_samples_split': 7, 'n_estimators': 153}
GradientBoostingClassifier: Accuracy = 0.9004
Training time for GradientBoostingClassifier: 102.70 seconds

Training and Evaluating AdaBoostClassifier:
Best params for AdaBoostClassifier: {'learning_rate': 0.2700261315772635, 'n_estimators': 134}
AdaBoostClassifier: Accuracy = 0.8954
Training time for AdaBoostClassifier: 32.57 seconds

Training and Evaluating LogisticRegression:
Best params for LogisticRegression: {'C': 7.530807338739853, 'penalty': 'l1', 'solver': 'liblinear'}
Logist

RandomForestClassifier and GradientBoostingClassifier achieve the best accuracy, so we apply the function again to only these two models, this time with a larger sample (50% of the data) and more search iterations (50):

In [11]:
param_distributions_2 = {
    
    'RandomForestClassifier': {
        'n_estimators': randint(50, 200),
        'max_depth': randint(10, 50),
        'min_samples_split': randint(2, 10),
        'min_samples_leaf': randint(1, 5)
                             },
    'GradientBoostingClassifier': {
        'n_estimators': randint(50, 200),
        'learning_rate': uniform(0.01, 0.3),
        'max_depth': randint(3, 10),
        'min_samples_split': randint(2, 10)
                             },
}

# Define models
models_2 = {
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),

}

In [28]:
df_to_ml_2 = df.sample(round(df.shape[0]*0.5))
df_to_ml_2.shape

(431107, 22)

In [30]:
n_iter_2=50

print(f"Sample size: {df_to_ml_2.shape[0]}")
print(f"Number of iteration: {n_iter_2}")

best_model = train_and_evaluate_models(df_to_ml_2, 'CRASH_TYPE', models_2, param_distributions_2, n_iter_2)

Sample size: 431107
Number of iteration: 50

Training and Evaluating RandomForestClassifier:
Best params for RandomForestClassifier: {'max_depth': 47, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 195}
RandomForestClassifier: Accuracy = 0.8991
Training time for RandomForestClassifier: 3097.42 seconds

Training and Evaluating GradientBoostingClassifier:
Best params for GradientBoostingClassifier: {'learning_rate': 0.0883692634160222, 'max_depth': 6, 'min_samples_split': 7, 'n_estimators': 188}
GradientBoostingClassifier: Accuracy = 0.9014
Training time for GradientBoostingClassifier: 8418.56 seconds

🏆 Best model: GradientBoostingClassifier with accuracy 0.9014
✅ Model saved in: c:\Users\ingde\Documents\DAP\Traffic Crashes - Crashes\models\best_model.pkl
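
To reuse the saved model later, it can be loaded back with joblib.load. The sketch below shows the round trip on a small stand-in model written to a temporary directory (the toy data and file location are hypothetical). Because the training function fits on standardized features and saves only the estimator, new data would need the same scaling applied before calling predict.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small stand-in model (hypothetical data, not the crash dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    joblib.dump(model, model_path)

    # Load the persisted estimator back into memory
    loaded_model = joblib.load(model_path)

# The reloaded estimator should predict exactly like the original
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print(same_predictions)
```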
