# Traffic Safety Research Project

## Analysis of Injury Severity in Traffic Accidents: The Role of Safety Equipment, Age, Gender, and Seating Position

## Introduction

Traffic accidents remain a significant public health issue, leading to considerable morbidity and mortality worldwide. Understanding the factors that influence the severity of injuries can help in designing better safety protocols and vehicular safety features. This report examines a dataset of 10,000 traffic accident victims, analyzing how different variables affect the outcomes of these incidents.

## Motivation

Injuries from traffic crashes cause death and lasting health problems around the globe. This research aims to identify the factors that lead to severe injuries so that they can be reduced. The key reasons are as follows.

Public Health Impact: Crashes injure many people and cost lives. Identifying the contributing factors supports strategies that save lives and cut medical costs.

Safety Equipment Effectiveness: Safety devices such as seat belts and airbags need to be shown to reduce injury severity when cars collide. Manufacturers, lawmakers, and consumers rely on such research when making decisions about safer vehicle design.

Demographic Vulnerabilities: Age and gender affect injury risk. Identifying vulnerable demographic groups allows campaigns and rules to be tailored to protect them, supporting safer roads for all ages and genders.

Vehicle Design and Policy Making: The study examines where passengers sit and the effects on their injuries from crashes. Its findings may lead to improved car structures that better protect occupants. Additionally, it could influence policymakers when creating laws governing passenger vehicles.


## Impacts

Safer Equipment: Demonstrating which motor-vehicle safety equipment designs and standards work best, and for which age groups, could help make that equipment safer.

Targeted Safety Campaigns: Identifying the most vulnerable demographic subgroups and the underuse of available safety devices can guide focused educational campaigns that may reduce both the number and the severity of injuries.

Policy and Regulation: The results could lead to new safety policies and regulations, spurring the design of safer vehicles and infrastructure and prompting reviews of the safety of existing roads, especially in higher-risk regions.

Allocation of Healthcare Resources: Using signals concerning injury severity, healthcare systems can optimize the allocation of major resources into their ambulance and pre-hospital systems, as well as trauma and emergency centers, to provide better emergency and care responses when accidents occur.

Expanding the Scope: Integrating additional data sources, such as road conditions, traffic behavior, and newer technologies like dashcams, may deepen the analysis and give a more complete picture of the factors that lead to accidents.

Cross-Regional Comparisons: Comparing data from different regions or countries can reveal cultural and geographical differences in traffic accidents, as well as in the effectiveness of safety equipment.

Impact Variability among Subpopulations: Socioeconomic status can influence the types of vehicles people drive, their access to safety features, and their overall health, all of which can have an impact on the severity of injuries in accidents. Geographic location can have a significant impact on injury outcomes.

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [5]:
data = pd.read_csv("victims.csv")
data.head()

|   | id | case_id | party_number | victim_role | victim_sex | victim_age | victim_degree_of_injury | victim_seating_position | victim_safety_equipment_1 | victim_safety_equipment_2 | victim_ejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2843841 | 2730277 | 1 | 2 | female | 54.0 | killed | 2.0 | P | Y | 1.0 |
| 1 | 3158878 | 3027346 | 3 | 1 | male | 49.0 | complaint of pain | 1.0 | P | W | 1.0 |
| 2 | 8619911 | 9365010617153010198 | 1 | 1 | male | 50.0 | other visible injury | 1.0 | W | NaN | 2.0 |
| 3 | 5283048 | 91168150 | 1 | 1 | male | 40.0 | 5 | 1.0 | P | W | 0.0 |
| 4 | 4417930 | 8949648 | 1 | 1 | male | 64.0 | other visible injury | 9.0 | N | P | 0.0 |


## Data

### Data Introduction

Introduction: The research uses data from the Statewide Integrated Traffic Records System (SWITRS), which is managed by the California Highway Patrol (CHP) and includes information about every crash reported to the CHP by local and other government agencies. We cross-referenced two sources:

- https://tims.berkeley.edu/summary.php
- https://www.chp.ca.gov/programs-services/services-information/switrs-internet-statewide-integrated-traffic-records-system

Variables: Case ID, Party Number, Victim Gender, Victim Age, Victim Role (whether the victim is a driver, passenger, pedestrian, or other), Victim Seating Position, Victim Degree of Injury (the severity of the injury to the victim), Victim Safety Equipment (the safety equipment used by the victim, such as an airbag), and Victim Ejected.

Purpose: By studying individual cases, we learn how seating position, safety equipment, and victim characteristics such as age and sex relate to injury severity at the victim level.

Analytic Potential: The data supports victim-level analysis, which can also be aggregated to report how different factors predict the severity of injuries sustained in a crash.


### Data Cleaning

#### Check Missing Values

In [6]:
missing_values = data.isnull().sum()
print(missing_values)

id                              0
case_id                         0
party_number                    0
victim_role                     0
victim_sex                    194
victim_age                    171
victim_degree_of_injury         0
victim_seating_position         9
victim_safety_equipment_1     443
victim_safety_equipment_2    2417
victim_ejected                 31
dtype: int64


#### Print Unique Values of Coded Columns to Check Consistency

In [7]:
check_columns = ['party_number', 'victim_seating_position', 'victim_ejected', 'victim_role']
for column in check_columns:
    print(column, ':', data[column].unique())

party_number : [ 1  3  2  5 14  4  6  8]
victim_seating_position : [ 2.  1.  9.  3.  0.  6.  4.  5.  7.  8. nan]
victim_ejected : [ 1.  2.  0.  3. nan]
victim_role : [2 1 3 6 4 5]


#### Handle Missing Values

Handling Missing Values: Each field was checked for missing or invalid entries. Invalid values were set to missing, and rows with any missing value were then dropped so that the modeling dataset is complete.

In [8]:
# Filter 'victim_ejected' to include only 0 or 1, set others to NaN
data['victim_ejected'] = data['victim_ejected'].apply(lambda x: x if x in [0, 1] else np.nan)

# Ensure 'victim_sex' includes only 'female' or 'male', set others to NaN
data['victim_sex'] = data['victim_sex'].apply(lambda x: x if x in ['female', 'male'] else np.nan)

# Standardize and consolidate 'victim_degree_of_injury'
# First, set entries outside the known categories (e.g. the stray value '5') to NaN
valid_injuries = ['killed', 'complaint of pain', 'other visible injury', 'severe injury', 'no injury']
data['victim_degree_of_injury'] = data['victim_degree_of_injury'].apply(lambda x: x if x in valid_injuries else np.nan)

# Consolidate injury descriptions into two classes.
# 'no injury' has no entry in the mapping, so .map() turns it into NaN
# and those rows are removed by dropna() below.
injury_mapping = {
    'killed': 'Killed or Severely Injured',
    'severe injury': 'Killed or Severely Injured',
    'complaint of pain': 'Other Visible Injury',
    'other visible injury': 'Other Visible Injury'
}
data['victim_degree_of_injury'] = data['victim_degree_of_injury'].map(injury_mapping)

# Drop any rows with NaN values to ensure dataset completeness
data.dropna(inplace=True)

# Convert categorical object-type data to numerical type using category codes.
# Codes follow the sorted category labels, so the target becomes
# 0 = 'Killed or Severely Injured' and 1 = 'Other Visible Injury'.
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].astype('category').cat.codes

# Save the cleaned and transformed data
data.to_csv('cleaned_victims.csv', index=False)
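
Because `cat.codes` assigns integers in the sorted order of the category labels, it helps to check which severity class becomes 0 and which becomes 1 before reading any model coefficients. A minimal sketch on a toy Series (the values are illustrative and not drawn from the dataset):

```python
import pandas as pd

# Toy labels using the two consolidated classes from the mapping above
labels = pd.Series([
    'Other Visible Injury',
    'Killed or Severely Injured',
    'Other Visible Injury',
])

# Categories are sorted alphabetically before codes are assigned
as_category = labels.astype('category')
code_lookup = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes.tolist()

print(code_lookup)
print(codes)
```

Under this encoding a target value of 1 corresponds to the less severe class, so a positive coefficient in the later models points toward less severe outcomes.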

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4624 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         4624 non-null   int64  
 1   case_id                    4624 non-null   uint64 
 2   party_number               4624 non-null   int64  
 3   victim_role                4624 non-null   int64  
 4   victim_sex                 4624 non-null   int8   
 5   victim_age                 4624 non-null   float64
 6   victim_degree_of_injury    4624 non-null   int8   
 7   victim_seating_position    4624 non-null   float64
 8   victim_safety_equipment_1  4624 non-null   int8   
 9   victim_safety_equipment_2  4624 non-null   int8   
 10  victim_ejected             4624 non-null   float64
dtypes: float64(3), int64(3), int8(4), uint64(1)
memory usage: 307.1 KB
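
The drop from 10,000 rows to 4,624 reflects listwise deletion: a row is removed if any column is missing, and unmapped categories such as 'no injury' also become missing before the drop. The sketch below, on a small made-up frame, contrasts that with dropping rows only when selected columns are missing. The choice of `required` columns is an assumption for illustration, not the approach used in this project.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern of gaps in the raw data
toy = pd.DataFrame({
    'victim_age': [54.0, 49.0, np.nan, 40.0],
    'victim_safety_equipment_1': ['P', 'P', 'W', 'P'],
    'victim_safety_equipment_2': ['Y', np.nan, np.nan, 'W'],
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Listwise deletion, as in the cleaning cell above
listwise = toy.dropna()

# Hypothetical alternative: require only a chosen subset of columns
required = ['victim_age', 'victim_safety_equipment_1']
partial = toy.dropna(subset=required)

print(missing_share)
print(len(listwise), len(partial))
```

Rows kept by the second approach would still need their remaining gaps handled, for example by treating a missing secondary equipment code as its own category.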


In [10]:
data.head(5)

|   | id | case_id | party_number | victim_role | victim_sex | victim_age | victim_degree_of_injury | victim_seating_position | victim_safety_equipment_1 | victim_safety_equipment_2 | victim_ejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2843841 | 2730277 | 1 | 2 | 0 | 54.0 | 0 | 2.0 | 7 | 18 | 1.0 |
| 1 | 3158878 | 3027346 | 3 | 1 | 1 | 49.0 | 1 | 1.0 | 7 | 16 | 1.0 |
| 4 | 4417930 | 8949648 | 1 | 1 | 1 | 64.0 | 1 | 9.0 | 6 | 11 | 0.0 |
| 6 | 2741821 | 2628480 | 1 | 1 | 1 | 50.0 | 1 | 1.0 | 4 | 5 | 0.0 |
| 8 | 4936094 | 90498594 | 1 | 1 | 0 | 23.0 | 1 | 1.0 | 7 | 16 | 0.0 |


In [11]:
data_matrix = data.to_numpy()
data_matrix

array([[2.8438410e+06, 2.7302770e+06, 1.0000000e+00, ..., 7.0000000e+00,
        1.8000000e+01, 1.0000000e+00],
       [3.1588780e+06, 3.0273460e+06, 3.0000000e+00, ..., 7.0000000e+00,
        1.6000000e+01, 1.0000000e+00],
       [4.4179300e+06, 8.9496480e+06, 1.0000000e+00, ..., 6.0000000e+00,
        1.1000000e+01, 0.0000000e+00],
       ...,
       [4.6443620e+06, 9.0164708e+07, 1.0000000e+00, ..., 7.0000000e+00,
        1.5000000e+01, 1.0000000e+00],
       [1.9398840e+06, 5.5994920e+06, 1.0000000e+00, ..., 7.0000000e+00,
        1.6000000e+01, 1.0000000e+00],
       [7.9728900e+05, 4.4160650e+06, 1.0000000e+00, ..., 7.0000000e+00,
        1.6000000e+01, 1.0000000e+00]])

## Split Data into Train and Test

In [12]:
X = data.drop(['id', 'case_id', 'victim_degree_of_injury'], axis=1)  # exclude the ID columns and target variable
y = data['victim_degree_of_injury']

# Standardize the features (X_scaled is used for the VIF check below)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)  # Convert back to DataFrame to retain column names

# Split the data into training and testing sets
# Note: the split is taken from the unscaled X, so the models below are fit on unscaled features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
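
In the cell above the scaler is fit on the full feature matrix, while the split is taken from the unscaled `X`. If standardized features were wanted in the models, one common pattern is to split first and fit the scaler on the training portion only, so that test-set statistics do not leak into training. A sketch on synthetic data (array shapes and values are placeholders, and the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                   # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training split only, then apply the same transform to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero by construction
```

For OLS with an intercept, scaling does not change predictions, but it can matter for penalized models such as Ridge, where the penalty depends on the scale of each feature.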


## Variance Inflation Factor (VIF)

In [13]:
def calculate_vif(X):
    """Return a DataFrame with the VIF of each column in the DataFrame X."""
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data

In [14]:
# Initial VIF calculation
vif_df = calculate_vif(X_scaled)
print("Initial VIFs:\n", vif_df)

while True:
    # Check the maximum VIF in the dataframe
    max_vif = vif_df['VIF'].max()
    if max_vif < 10:
        break
    # Identify the predictor with the highest VIF and remove it
    feature_to_drop = vif_df[vif_df['VIF'] == max_vif]['feature'].values[0]
    X_scaled.drop(columns=feature_to_drop, inplace=True)
    print(f"Dropped {feature_to_drop} due to high VIF.")

    # Recalculate VIF
    vif_df = calculate_vif(X_scaled)
    print("Updated VIFs:\n", vif_df)

Initial VIFs:
                      feature       VIF
0               party_number  1.031693
1                victim_role  1.334748
2                 victim_sex  1.262277
3                 victim_age  1.010115
4    victim_seating_position  1.216541
5  victim_safety_equipment_1  1.177039
6  victim_safety_equipment_2  1.365981
7             victim_ejected  1.219652
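
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others; values near 1 indicate little shared variation. The sketch below computes it directly with NumPy on synthetic data. `manual_vif` is a hypothetical helper written for illustration. It includes an intercept in each auxiliary regression, so it should agree with the statsmodels call above when the features are centered, as `X_scaled` is.

```python
import numpy as np

def manual_vif(X):
    """Return the VIF of each column of a 2-D array X."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a`

vif_independent = manual_vif(np.column_stack([a, b]))
vif_collinear = manual_vif(np.column_stack([a, b, c]))
print(vif_independent, vif_collinear)
```

In the synthetic example the two independent columns have VIFs close to 1, while the near-duplicate pair `a` and `c` exceeds the threshold of 10 used in the loop above.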


## OLS Regression

In [15]:
# Adding a constant to the model (for the intercept)
X_train_sm = sm.add_constant(X_train)

# Fit the OLS model
model_ols = sm.OLS(y_train, X_train_sm)
results_ols = model_ols.fit()

# Print out the statistics
print(results_ols.summary())

                               OLS Regression Results                              
Dep. Variable:     victim_degree_of_injury   R-squared:                       0.049
Model:                                 OLS   Adj. R-squared:                  0.047
Method:                      Least Squares   F-statistic:                     23.60
Date:                     Tue, 10 Sep 2024   Prob (F-statistic):           1.29e-35
Time:                             04:31:51   Log-Likelihood:                -1802.3
No. Observations:                     3699   AIC:                             3623.
Df Residuals:                         3690   BIC:                             3679.
Df Model:                                8                                         
Covariance Type:                 nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------

In [16]:
# Predict on the testing data
X_test_sm = sm.add_constant(X_test)
y_test_pred_ols = results_ols.predict(X_test_sm)

# Calculate the mean squared error for OLS
mse_ols = mean_squared_error(y_test, y_test_pred_ols)
print(f'Mean Squared Error (OLS): {mse_ols}')

Mean Squared Error (OLS): 0.16519885895665765


## Ridge Regression

In [17]:
# Initialize and fit the Ridge regression model
# Note: Alpha is the regularization strength; larger values specify stronger regularization.
model_ridge = Ridge(alpha=1.0)
model_ridge.fit(X_train, y_train)

# Predict on the training data
y_train_pred_ridge = model_ridge.predict(X_train)

In [18]:
# Predict on the testing set
y_test_pred_ridge = model_ridge.predict(X_test)

# Calculate the mean squared error for Ridge
mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)
print(f'Mean Squared Error (Ridge): {mse_ridge}')

Mean Squared Error (Ridge): 0.16520103092805324


## Logistic Regression

In [19]:
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.479470
         Iterations 7
                              Logit Regression Results                             
Dep. Variable:     victim_degree_of_injury   No. Observations:                 3699
Model:                               Logit   Df Residuals:                     3690
Method:                                MLE   Df Model:                            8
Date:                     Tue, 10 Sep 2024   Pseudo R-squ.:                 0.05526
Time:                             04:31:51   Log-Likelihood:                -1773.6
converged:                            True   LL-Null:                       -1877.3
Covariance Type:                 nonrobust   LLR p-value:                 1.696e-40
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                         3.3454  

In [24]:
# Predict class-1 probabilities on the testing data
y_test_pred_log = result.predict(X_test_sm)

# Convert probabilities to binary predictions
y_test_pred_log_class = (y_test_pred_log >= 0.5).astype(int)
accuracy_logit = accuracy_score(y_test, y_test_pred_log_class)
print(f"Logistic Regression Accuracy: {accuracy_logit}")

Logistic Regression Accuracy: 0.772972972972973
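
Accuracy alone does not show how errors are split between the two classes. A confusion matrix built from the thresholded predictions makes that visible. The sketch below uses made-up labels and probabilities rather than the model output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_demo = np.array([0, 0, 1, 1, 1, 1, 1, 0])
prob_demo = np.array([0.30, 0.70, 0.90, 0.80, 0.40, 0.95, 0.60, 0.55])  # P(class 1)

# Same 0.5 threshold as in the cell above
y_pred_demo = (prob_demo >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
accuracy_demo = np.trace(cm) / cm.sum()
print(cm)
print(accuracy_demo)
```

In this toy case most errors fall on class 0, which a single accuracy figure would not reveal.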


## Decision Tree

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, accuracy_score

# Fit a decision tree with default settings (no depth limit)
model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(X_train, y_train)

# Evaluate on the testing set
y_pred_dt = model_dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
mse_dt = mean_squared_error(y_test, y_pred_dt)  # for 0/1 labels this equals 1 - accuracy

print(f"Decision Tree Accuracy: {accuracy_dt}")

Decision Tree Accuracy: 0.7416216216216216
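
A tree grown with default settings has no depth limit. One way to check whether that hurts generalization is to compare cross-validated accuracy across a few `max_depth` values. The sketch uses `make_classification` to generate stand-in data; the candidate depths are arbitrary choices, not tuned values for this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=3, flip_y=0.2, random_state=42
)

depth_scores = {}
for depth in [2, 4, 8, None]:  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X_toy, y_toy, cv=5, scoring='accuracy')
    depth_scores[depth] = scores.mean()

best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores, best_depth)
```

Applied to `X_train` and `y_train`, the same loop would indicate whether a shallower tree generalizes better than the unrestricted one.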


## Model Performance Summary

In [26]:
print(f'OLS Mean Squared Error: {mse_ols}')
print(f'Ridge Mean Squared Error: {mse_ridge}')
print(f"Logistic Regression Accuracy: {accuracy_logit}")
print(f"Decision Tree Accuracy: {accuracy_dt}")

OLS Mean Squared Error: 0.16519885895665765
Ridge Mean Squared Error: 0.16520103092805324
Logistic Regression Accuracy: 0.772972972972973
Decision Tree Accuracy: 0.7416216216216216
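
To judge whether these figures are informative, it can help to compare them with trivial baselines: always predicting the training mean (for MSE) and always predicting the majority class (for accuracy). A sketch with made-up labels, where the `_demo` arrays are placeholders and not the project data:

```python
import numpy as np

y_train_demo = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])  # 80% class 1
y_test_demo = np.array([1, 0, 1, 1, 1])

# Baseline 1: predict the training mean for every test row
mean_pred = np.full(y_test_demo.shape, y_train_demo.mean())
baseline_mse = np.mean((y_test_demo - mean_pred) ** 2)

# Baseline 2: predict the most frequent training class for every test row
majority_class = np.bincount(y_train_demo).argmax()
baseline_accuracy = np.mean(y_test_demo == majority_class)

print(baseline_mse, baseline_accuracy)
```

Running the same two calculations on `y_train` and `y_test` would show how much the fitted models improve on predictions that ignore the features entirely.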


## Analytics Models

In this study, we developed and compared four models: Ordinary Least Squares (OLS), Ridge Regression, Logistic Regression, and a Decision Tree classifier, with the aim of predicting the severity of injuries resulting from traffic accidents. Using data from the Statewide Integrated Traffic Records System (SWITRS), we cleaned and encoded the features, standardized them for the multicollinearity check, and set the degree of victim injury as the target variable. The models were evaluated with different metrics: Mean Squared Error (MSE) for OLS and Ridge, and accuracy for Logistic Regression and the Decision Tree.

Calculating the VIF before modeling, and removing or transforming variables with high values, reduces multicollinearity and usually makes the OLS estimates more stable and reliable. In our data every VIF was below 1.4, well under the threshold of 10, so no feature was removed.


The evaluation yielded MSE scores of approximately 0.1652 for both OLS and Ridge Regression (Table 1), indicating near-identical performance on the test dataset. Such a close result suggests that Ridge regularization does not markedly improve performance on our dataset, possibly because there is little multicollinearity among the features or because the model overfits only minimally. This is consistent with the low VIF values found earlier.
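
A sketch of why a penalty of α = 1 can leave results nearly unchanged: Ridge solves β = (XᵀX + αI)⁻¹Xᵀy, and when α is small relative to the scale of XᵀX, which grows with the number of rows, the penalty barely moves the coefficients. The example below uses synthetic data and omits the intercept for brevity (scikit-learn's `Ridge` fits an unpenalized intercept by default), so it illustrates the mechanism rather than reproducing the model above. `ridge_closed_form` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 3000, 8
X_syn = rng.normal(size=(n_rows, n_features))
true_beta = rng.normal(size=n_features)
y_syn = X_syn @ true_beta + rng.normal(size=n_rows)

def ridge_closed_form(X, y, alpha):
    """Solve (X'X + alpha * I) beta = X'y; alpha = 0 gives ordinary least squares."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

beta_ols = ridge_closed_form(X_syn, y_syn, alpha=0.0)
beta_ridge = ridge_closed_form(X_syn, y_syn, alpha=1.0)

# Largest absolute change in any coefficient caused by the penalty
max_shift = np.max(np.abs(beta_ols - beta_ridge))
print(max_shift)
```

With a few thousand rows the shift is tiny; a much larger α would be needed before shrinkage becomes visible.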


Logistic Regression and Decision Tree: Logistic Regression, a standard choice for classification, produced an accuracy of approximately 77.3%. Because accuracy and MSE are different metrics, this figure cannot be compared directly with the MSE of the linear models; it shows how the model performs when injury severity is treated as a categorical variable. The Decision Tree reached a lower accuracy of 74.2%. The gap could be due to overfitting, since the tree was fit with default settings and no depth limit. Decision Trees are known for their high variance, which can degrade performance on unseen data.

Looking further at the applicability of the models, we find that OLS and Ridge show nearly identical error in predicting injury severity. Examining the OLS coefficients could provide deeper insight into the influence of individual features on injury severity. The similarity in MSE suggests that the simpler OLS model may suffice in this context, since Ridge regularization does not appear to change predictive ability meaningfully.

In conclusion, while the performance of the OLS and Ridge regression models is acceptable, with MSE scores of 0.1652, and Logistic Regression performs well with an accuracy of 77.3%, there is potential for refinement. Future efforts could include a thorough diagnostic review, optimization of model parameters, and exploration of more sophisticated modeling techniques, such as ensemble methods or neural networks. This iterative approach aims not only for accurate predictions but also seeks to align the models closely with the data's inherent properties and the theoretical framework of the study.


## Conclusions

This comprehensive analysis has explored the influence of safety equipment, age, gender, and seating position on the severity of injuries sustained in traffic accidents. The findings from the study offer critical insights into the factors that contribute to injury severity and provide evidence-based recommendations to enhance road safety.

### Key Conclusions:

- Safety Equipment: The use of effective safety equipment significantly reduces injury severity, underscoring the importance of equipping vehicles with reliable safety features and promoting their use among all passengers.

- Age and Gender: Vulnerable age groups, particularly the very young and the elderly, are at a higher risk of severe injuries in traffic accidents. Additionally, the slight variation in injury severity between genders suggests targeted safety campaigns could be beneficial.

- Seating Position: Drivers and front passengers are at greater risk of severe injuries, likely due to their exposure to frontal impacts. Rear passengers, while generally safer, still require attention to improve safety in various accident scenarios.

### Implications for Policy and Practice:

- Vehicle Design: Manufacturers should focus on enhancing the protective capabilities of vehicles, especially in front-seating areas, and consider more robust safety mechanisms for rear seats.

- Public Safety Campaigns: Education and public awareness campaigns should emphasize the importance of using safety equipment. Special attention should be given to educating vulnerable groups about safety practices.

- Regulatory Measures: Regulators should consider revising and enforcing safety standards that require the inclusion of advanced safety technologies in all new vehicles.