# TEAM GREEN: LUCY AND IAN

# PROJECT TITLE: PREDICTING THE SEVERITY OF ROAD ACCIDENTS USING MACHINE LEARNING

# INTRODUCTION
This project predicts the severity of road accidents using machine learning. It uses historical accident data to identify the key factors that influence accident outcomes. By developing a predictive model, the project aims to improve understanding of accident patterns and provide data-driven insights that can inform policy decisions and road safety measures. The workflow covers data cleaning, exploratory analysis and machine learning modeling.


# Step 1: Data Loading and Exploration

In [1]:
# Importing necessary libraries 
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

In [2]:
# Loading dataset
df = pd.read_csv(r"C:\Users\admin\OneDrive\Desktop\Lucy Tumaini\Module 10\Capstone Project\RTA Dataset.csv")
df.shape
display(df.head())

Unnamed: 0,Time,Day_of_week,Age_band_of_driver,Sex_of_driver,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,...,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity
0,17:02:00,Monday,18-30,Male,Above high school,Employee,1-2yr,Automobile,Owner,Above 10yr,...,Going straight,na,na,na,na,,,Not a Pedestrian,Moving Backward,Slight Injury
1,17:02:00,Monday,31-50,Male,Junior high school,Employee,Above 10yr,Public (> 45 seats),Owner,5-10yrs,...,Going straight,na,na,na,na,,,Not a Pedestrian,Overtaking,Slight Injury
2,17:02:00,Monday,18-30,Male,Junior high school,Employee,1-2yr,Lorry (41?100Q),Owner,,...,Going straight,Driver or rider,Male,31-50,3,Driver,,Not a Pedestrian,Changing lane to the left,Serious Injury
3,1:06:00,Sunday,18-30,Male,Junior high school,Employee,5-10yr,Public (> 45 seats),Governmental,,...,Going straight,Pedestrian,Female,18-30,3,Driver,Normal,Not a Pedestrian,Changing lane to the right,Slight Injury
4,1:06:00,Sunday,18-30,Male,Junior high school,Employee,2-5yr,,Owner,5-10yrs,...,Going straight,na,na,na,na,,,Not a Pedestrian,Overtaking,Slight Injury


In [3]:
# Checking statistical distribution
display(df.describe())

Unnamed: 0,Number_of_vehicles_involved,Number_of_casualties
count,12316.0,12316.0
mean,2.040679,1.548149
std,0.68879,1.007179
min,1.0,1.0
25%,2.0,1.0
50%,2.0,1.0
75%,2.0,2.0
max,7.0,8.0


In [4]:
# Basic info (df.info() prints its summary directly and returns None, so display() also shows a trailing None)
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12

None

In [5]:
# Checking for duplicates
print("No. of Duplicate Rows:", df.duplicated().sum())

No. of Duplicate Rows: 0


In [6]:
# Check missing values
nulls = df.isnull().sum().reset_index()
nulls.columns = ['Column', 'Missing_Values']

# Add percentage column
nulls['Percentage'] = (nulls['Missing_Values'] / len(df)) * 100

print(nulls)

                         Column  Missing_Values  Percentage
0                          Time               0    0.000000
1                   Day_of_week               0    0.000000
2            Age_band_of_driver               0    0.000000
3                 Sex_of_driver               0    0.000000
4             Educational_level             741    6.016564
5       Vehicle_driver_relation             579    4.701202
6            Driving_experience             829    6.731082
7               Type_of_vehicle             950    7.713543
8              Owner_of_vehicle             482    3.913608
9       Service_year_of_vehicle            3928   31.893472
10            Defect_of_vehicle            4427   35.945112
11        Area_accident_occured             239    1.940565
12             Lanes_or_Medians             385    3.126015
13              Road_allignment             142    1.152972
14            Types_of_Junction             887    7.202014
15            Road_surface_type         

# Step 2: Data Preprocessing

# Handling missing values
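We first separate the numerical and categorical columns, then impute missing values: the mean for numerical columns and the mode (most frequent value) for categorical columns. In this dataset the two numerical columns are already complete (12316 non-missing values each), so in practice only the mode imputation takes effect.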
# Why imputation over dropping?
- Prevents loss of valuable information that would occur if rows were dropped.
- Maintains data consistency by keeping all records intact.
- Reduces bias - avoids skewing the data that can result from deleting missing entries.
- Provides realistic estimates - mean and mode imputation gives reasonable approximations based on existing data.
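As a minimal sketch of this strategy on a toy frame (the column names and values below are made up for illustration and are not taken from the RTA dataset), mode and mean imputation can be checked like this:

```python
import numpy as np
import pandas as pd

# Toy data with gaps; hypothetical columns, not the project's dataset
toy = pd.DataFrame({
    "Road_type": ["Asphalt", None, "Asphalt", "Gravel"],
    "Vehicles": [2.0, 1.0, np.nan, 3.0],
})

# Split columns by kind: numeric vs everything else
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
for col in cat_cols:
    # Most frequent value replaces missing categories
    filled[col] = filled[col].fillna(filled[col].mode()[0])
for col in num_cols:
    # Column mean replaces missing numbers
    filled[col] = filled[col].fillna(filled[col].mean())

print(filled)
```

One caveat worth keeping in mind: statistics computed on the full dataset also include rows that later end up in the test set. Computing the mode and mean on the training split alone is the stricter option.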

In [7]:
# Separating numerical and categorical columns
num_cols = df.select_dtypes(include=['float64','int64']).columns
cat_cols = df.select_dtypes(include=['object']).columns

In [8]:
# Filling the missing values with mean for the numerical columns and mode for the categorical columns
for col in cat_cols:
    if not df[col].mode().empty:
        df[col] = df[col].fillna(df[col].mode()[0])
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum())

Time                           0
Day_of_week                    0
Age_band_of_driver             0
Sex_of_driver                  0
Educational_level              0
Vehicle_driver_relation        0
Driving_experience             0
Type_of_vehicle                0
Owner_of_vehicle               0
Service_year_of_vehicle        0
Defect_of_vehicle              0
Area_accident_occured          0
Lanes_or_Medians               0
Road_allignment                0
Types_of_Junction              0
Road_surface_type              0
Road_surface_conditions        0
Light_conditions               0
Weather_conditions             0
Type_of_collision              0
Number_of_vehicles_involved    0
Number_of_casualties           0
Vehicle_movement               0
Casualty_class                 0
Sex_of_casualty                0
Age_band_of_casualty           0
Casualty_severity              0
Work_of_casuality              0
Fitness_of_casuality           0
Pedestrian_movement            0
Cause_of_a

In [9]:
# Listing the non-numeric columns
cat_cols = df.select_dtypes(include=['object']).columns
print('Non-numeric columns:', cat_cols)

Non-numeric columns: Index(['Time', 'Day_of_week', 'Age_band_of_driver', 'Sex_of_driver',
       'Educational_level', 'Vehicle_driver_relation', 'Driving_experience',
       'Type_of_vehicle', 'Owner_of_vehicle', 'Service_year_of_vehicle',
       'Defect_of_vehicle', 'Area_accident_occured', 'Lanes_or_Medians',
       'Road_allignment', 'Types_of_Junction', 'Road_surface_type',
       'Road_surface_conditions', 'Light_conditions', 'Weather_conditions',
       'Type_of_collision', 'Vehicle_movement', 'Casualty_class',
       'Sex_of_casualty', 'Age_band_of_casualty', 'Casualty_severity',
       'Work_of_casuality', 'Fitness_of_casuality', 'Pedestrian_movement',
       'Cause_of_accident', 'Accident_severity'],
      dtype='object')


In [10]:
# Converting Time to datetime
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce')

# Features extraction
df['Hour'] = df['Time'].dt.hour
df['Minute'] = df['Time'].dt.minute

# Categorizing Hour into time-of-day
def time_of_day(hr):
    if 5 <= hr < 12:
        return 'Morning'
    elif 12 <= hr < 17:
        return 'Afternoon'
    elif 17 <= hr < 21:
        return 'Evening'
    else:
        return 'Night'

df['Time_of_day'] = df['Hour'].apply(time_of_day)

# Dropping original Time column
df = df.drop('Time', axis=1)

In [11]:
df['Time_of_day'].head()

0    Evening
1    Evening
2    Evening
3      Night
4      Night
Name: Time_of_day, dtype: object
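
The hour-to-period mapping can also be written in vectorized form. The sketch below is an alternative, not the notebook's own code; the helper name `time_of_day_vectorized` and the `'Unknown'` label are assumptions introduced here. It makes one edge case explicit: with `errors='coerce'`, an unparseable time becomes `NaT`, its hour is `NaN`, and the `time_of_day` function above would label it `'Night'` because every comparison with `NaN` is false.

```python
import numpy as np
import pandas as pd

def time_of_day_vectorized(hours: pd.Series) -> pd.Series:
    """Map integer hours (0-23) to a period label; missing hours become 'Unknown'."""
    conditions = [
        hours.between(5, 11),    # 05:00-11:59
        hours.between(12, 16),   # 12:00-16:59
        hours.between(17, 20),   # 17:00-20:59
        hours.notna(),           # any remaining valid hour
    ]
    labels = ["Morning", "Afternoon", "Evening", "Night"]
    # np.select picks the label of the first matching condition
    return pd.Series(np.select(conditions, labels, default="Unknown"), index=hours.index)

hours_demo = pd.Series([17, 1, 9, 13, np.nan, 21, 5, 12])
periods_demo = time_of_day_vectorized(hours_demo)
print(periods_demo.tolist())
```

Whether missing hours deserve their own label or should be imputed is a modelling choice; the point is only that they should not silently fall into `'Night'`.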

In [12]:
# Re-listing the non-numeric columns
cat_cols = df.select_dtypes(include=['object']).columns
print('Non-numeric columns:', cat_cols)

Non-numeric columns: Index(['Day_of_week', 'Age_band_of_driver', 'Sex_of_driver',
       'Educational_level', 'Vehicle_driver_relation', 'Driving_experience',
       'Type_of_vehicle', 'Owner_of_vehicle', 'Service_year_of_vehicle',
       'Defect_of_vehicle', 'Area_accident_occured', 'Lanes_or_Medians',
       'Road_allignment', 'Types_of_Junction', 'Road_surface_type',
       'Road_surface_conditions', 'Light_conditions', 'Weather_conditions',
       'Type_of_collision', 'Vehicle_movement', 'Casualty_class',
       'Sex_of_casualty', 'Age_band_of_casualty', 'Casualty_severity',
       'Work_of_casuality', 'Fitness_of_casuality', 'Pedestrian_movement',
       'Cause_of_accident', 'Accident_severity', 'Time_of_day'],
      dtype='object')


In [13]:
# Separating training features from target
X = df.drop("Accident_severity", axis=1)
y = df["Accident_severity"]

In [14]:
# Encoding categorical features excluding target
cat_cols = ['Day_of_week', 'Age_band_of_driver', 'Sex_of_driver',
       'Educational_level', 'Vehicle_driver_relation', 'Driving_experience',
       'Type_of_vehicle', 'Owner_of_vehicle', 'Service_year_of_vehicle',
       'Defect_of_vehicle', 'Area_accident_occured', 'Lanes_or_Medians',
       'Road_allignment', 'Types_of_Junction', 'Road_surface_type',
       'Road_surface_conditions', 'Light_conditions', 'Weather_conditions',
       'Type_of_collision', 'Vehicle_movement', 'Casualty_class',
       'Sex_of_casualty', 'Age_band_of_casualty', 'Casualty_severity',
       'Work_of_casuality', 'Fitness_of_casuality', 'Pedestrian_movement',
       'Cause_of_accident', 'Time_of_day']

# One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)  # Initializing the encoder
X_encoded = ohe.fit_transform(X[cat_cols]) # Fitting and transforming categorical columns
encoded_cols = ohe.get_feature_names_out(cat_cols)
X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_cols, index=X.index) # Converting to a dataframe with proper column names
X = pd.concat([X.drop(columns=cat_cols), X_encoded_df], axis=1) # Dropping the original categorical columns and concatenating the encoded features

In [15]:
# Encoding the target variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(dict(zip(le.classes_, le.transform(le.classes_)))) # Checking mapping

{'Fatal injury': np.int64(0), 'Serious Injury': np.int64(1), 'Slight Injury': np.int64(2)}
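
Since the model will output the integer codes from this mapping, it can be handy to translate predictions back to labels. A minimal sketch, using a separate demo encoder fitted on the three class names shown in the mapping:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Demo encoder; LabelEncoder assigns codes in sorted label order
le_demo = LabelEncoder()
le_demo.fit(["Slight Injury", "Serious Injury", "Fatal injury"])

# Pretend these are model predictions
predicted_codes = np.array([2, 0, 1])
decoded = le_demo.inverse_transform(predicted_codes)
print(list(decoded))
```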


In [16]:
print("Final Feature Shape:", X.shape)
print("Target Shape:", y.shape)

Final Feature Shape: (12316, 183)
Target Shape: (12316,)


In [17]:
# Checking target distribution
print('Counts of Target:', df['Accident_severity'].value_counts())

# Checking percentage distribution 
severity_percent = (df['Accident_severity'].value_counts(normalize=True) * 100).round(2)
print('Percentage of Target:', severity_percent)

Counts of Target: Accident_severity
Slight Injury     10415
Serious Injury     1743
Fatal injury        158
Name: count, dtype: int64
Percentage of Target: Accident_severity
Slight Injury     84.56
Serious Injury    14.15
Fatal injury       1.28
Name: proportion, dtype: float64
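
These counts imply a useful reference point: a model that always predicts the majority class. The short calculation below, using only the counts printed above, gives the accuracy such a trivial model would reach. Any real model should be judged against it.

```python
# Class counts copied from the value_counts() output
counts = {"Slight Injury": 10415, "Serious Injury": 1743, "Fatal injury": 158}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Always predicting '{majority_class}' scores {baseline_accuracy:.4f} accuracy")
```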


# Handling class imbalance
The target variable in this dataset is highly imbalanced:
- Slight Injury: 85%  
- Serious Injury: 14%  
- Fatal Injury: 1%

This imbalance causes machine learning models to become biased toward predicting the majority class (“Slight”), resulting in poor performance on the minority classes, especially Fatal, which is critically important for decision-making and safety interventions.
To address this issue, the following steps are applied:
1. Stratified Train–Test Split  
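A stratified split ensures that the proportion of each target class remains consistent in both the training and testing sets, so the test set reflects the real class distribution and gives a reliable evaluation. Resampling is then applied to the training set only, which keeps synthetic samples out of the test set.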
2. Resampling using SMOTETomek  
- SMOTE generates synthetic examples for the minority classes, increasing their representation.  
- Tomek Links undersampling removes noisy or overlapping samples from the majority class, making the classes more separable. This combined approach helps balance the dataset while reducing noise.
3. Applying Class Weights  
Even after resampling, slight imbalance may remain.  
Using class weights instructs the model to penalize misclassifications of minority classes more heavily, improving recall for Serious and Fatal injuries.

This balancing strategy is intended to improve minority-class recall and overall model stability; whether it succeeds is checked on the untouched test set.
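
As a rough illustration of the class-weight step, the sketch below computes scikit-learn's `'balanced'` weights for a label vector rebuilt from the class counts reported earlier (`y_demo` is a stand-in, not the notebook's `y`). The weight for each class is `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels with the same class counts as the dataset
# 0 = Fatal injury, 1 = Serious Injury, 2 = Slight Injury
y_demo = np.repeat([0, 1, 2], [158, 1743, 10415])

classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weight_map = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight_map)
```

Note that on a training set that resampling has already equalised, these weights all come out to 1.0, so `class_weight='balanced'` only changes the fit when it is computed from the original imbalanced labels.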

In [19]:
# Importing necessary libraries
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
import collections

In [20]:
# Splitting the train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
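
A quick way to confirm that stratification preserved the class proportions is to compare class shares before and after the split. The sketch below does this on stand-in labels with the dataset's class counts; `X_demo` is a placeholder feature matrix, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels and placeholder features
y_demo = np.repeat([0, 1, 2], [158, 1743, 10415])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Share of each class in the full data vs the test split
full_share = np.bincount(y_demo) / len(y_demo)
test_share = np.bincount(y_te) / len(y_te)
print(np.round(full_share, 4), np.round(test_share, 4))
```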

In [21]:
# Resampling
smote_tomek = SMOTETomek(random_state=42)
X_train_res, y_train_res = smote_tomek.fit_resample(X_train, y_train)
print('Class distribution after resampling:', collections.Counter(y_train_res)) # Checking new class distribution

Class distribution after resampling: Counter({np.int64(2): 8331, np.int64(1): 8331, np.int64(0): 8331})
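
For intuition about what the oversampling half of this step does, the sketch below shows SMOTE's core idea in plain NumPy: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not imbalanced-learn's implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbour
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base towards the neighbour
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=5, k=2)
print(synthetic.shape)
```

Because the one-hot encoded features are 0/1 indicators, interpolation of this kind produces fractional values between categories, which is one reason synthetic samples may not resemble real records closely.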


# Step 3: Model Building

In [22]:
# importing Random Forest and metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initializing Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')

# Training on resampled training set
rf.fit(X_train_res, y_train_res)

In [25]:
# Making predictions on the test set
y_pred = rf.predict(X_test)

In [27]:
# Evaluating the model

# Accuracy
print("Accuracy on test set:", accuracy_score(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0))

Accuracy on test set: 0.8457792207792207

Confusion Matrix:
[[   0    0   31]
 [   0    0  349]
 [   0    0 2084]]

Classification Report:
                precision    recall  f1-score   support

  Fatal injury       0.00      0.00      0.00        31
Serious Injury       0.00      0.00      0.00       349
 Slight Injury       0.85      1.00      0.92      2084

      accuracy                           0.85      2464
     macro avg       0.28      0.33      0.31      2464
  weighted avg       0.72      0.85      0.78      2464



# Model summary
The model predicts the majority class (Slight Injury) well but fails to predict either minority class (Serious and Fatal injuries) on the test set: every test sample is assigned to Slight Injury. The high accuracy (about 85%) is therefore misleading, because it simply equals the share of the majority class. SMOTETomek balanced the training set, but the test set remained highly imbalanced.
The next step is to train XGBoost to improve prediction of the minority classes.
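
The per-class figures can be recomputed directly from the Random Forest confusion matrix, which makes the problem easy to see. The sketch below uses only the numbers printed above:

```python
import numpy as np

# Random Forest confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[0, 0, 31],
               [0, 0, 349],
               [0, 0, 2084]])
labels = ["Fatal injury", "Serious Injury", "Slight Injury"]

# Recall = correct predictions per class / true samples of that class
recall = cm.diagonal() / cm.sum(axis=1)
accuracy = cm.diagonal().sum() / cm.sum()

for name, r in zip(labels, recall):
    print(f"{name}: recall = {r:.2f}")
print(f"accuracy = {accuracy:.4f}, macro recall = {recall.mean():.2f}")
```

Macro-averaged recall weights each class equally, so it exposes the failure on the two minority classes that overall accuracy hides.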

In [28]:
# importing XGBoost and metrics
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initializing the model
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6, random_state=42, eval_metric='mlogloss')

# Training on resampled training set
xgb.fit(X_train_res, y_train_res)


In [29]:
# Making predictions on the test set
y_pred_xgb = xgb.predict(X_test)

In [30]:
# Evaluating the model

# Accuracy
print("Accuracy on test set:", accuracy_score(y_test, y_pred_xgb))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_xgb))

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb, target_names=le.classes_, zero_division=0))

Accuracy on test set: 0.8526785714285714

Confusion Matrix:
[[   0    0   31]
 [   0   26  323]
 [   0    9 2075]]

Classification Report:
                precision    recall  f1-score   support

  Fatal injury       0.00      0.00      0.00        31
Serious Injury       0.74      0.07      0.14       349
 Slight Injury       0.85      1.00      0.92      2084

      accuracy                           0.85      2464
     macro avg       0.53      0.36      0.35      2464
  weighted avg       0.83      0.85      0.80      2464



# Model summary
Slight Injury (the majority class) is still predicted very well. Serious Injury predictions improved over Random Forest: 26 of 349 cases are now correctly identified (recall 0.07). Fatal injury is still never predicted, owing to its extreme rarity. Accuracy is similar to Random Forest (about 85%), but macro F1 rises from 0.31 to 0.35, reflecting the gain on a minority class.
Next steps for improvement involve tuning class weights (scale_pos_weight applies to binary targets, so a multi-class model would need per-sample weights instead), adjusting decision thresholds, and trying ensemble methods.
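
One possible form of the threshold adjustment mentioned above is to rescale the predicted class probabilities before taking the arg-max, so that rare classes win more often. The sketch below is only an illustration: the helper `predict_with_class_scaling`, the toy probabilities and the scale factors are all assumptions. In practice the probabilities would come from something like `xgb.predict_proba(X_test)` and the factors would need tuning on a validation set.

```python
import numpy as np

def predict_with_class_scaling(proba, scale):
    """Multiply each class probability by a per-class factor, then take the arg-max."""
    proba = np.asarray(proba, dtype=float)
    scale = np.asarray(scale, dtype=float)
    return np.argmax(proba * scale, axis=1)

# Toy probabilities for classes 0 = Fatal, 1 = Serious, 2 = Slight
proba_demo = np.array([
    [0.10, 0.30, 0.60],
    [0.20, 0.10, 0.70],
    [0.02, 0.08, 0.90],
])

plain = np.argmax(proba_demo, axis=1)
adjusted = predict_with_class_scaling(proba_demo, scale=[4.0, 2.5, 1.0])
print(plain.tolist(), adjusted.tolist())
```

Boosting minority classes this way trades precision for recall, so the factors should be chosen with the cost of a missed Fatal or Serious case in mind.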