# Road Accident Severity Classification (Template)

**Research question:** Can machine learning models classify countries based on the level of road accident severity using existing road accident and fatality statistics along with demographic and traffic-related indicators?

Use this template to complete the classification task. Replace TODOs with your own work, results, and interpretations based on the assignment instructions.

## 1. Setup

In [None]:
# If running in Colab, install the essentials (uncomment if needed)
# !pip -q install pandas numpy scikit-learn matplotlib seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 42

## 2. Load Data

In [None]:
# Update the path if needed
data_path = '/content/road_accident_dataset.csv'  # Colab example
# data_path = '/Users/aryadevrijal/Downloads/road_accident_dataset.csv'  # local example

df = pd.read_csv(data_path)
df.head()

## 3. Understand the Data

In [None]:
df.info()
df.describe(include='all').T.head(20)

## 4. Define Target and Features
**TODO:** Identify the target column for *severity class* (e.g., Low/Medium/High).
If you need to create the target (e.g., by binning a fatality rate), do it here, justify the thresholds, and drop the source column from the features so the model cannot read the answer directly from it.
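
If the target has to be created, one common option is quantile binning. The sketch below is illustrative only: it runs on toy data, and the column names `Fatalities_per_100k` and `Vehicles_per_1000` are hypothetical stand-ins, not columns confirmed to exist in the dataset. Tercile cut points are just one possible choice and still need to be justified.

```python
import pandas as pd

# Toy data for illustration; column names are hypothetical, not from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Fatalities_per_100k": [2.1, 4.8, 9.5, 15.2, 21.0, 27.3],
    "Vehicles_per_1000": [610, 540, 320, 210, 150, 90],
})

# Tercile binning: each class receives roughly a third of the rows.
toy["Severity_Class"] = pd.qcut(
    toy["Fatalities_per_100k"], q=3, labels=["Low", "Medium", "High"]
)

# Drop the target, the column it was derived from, and the identifier,
# so the features cannot leak the label.
X_toy = toy.drop(columns=["Severity_Class", "Fatalities_per_100k", "Country"])
class_counts = toy["Severity_Class"].value_counts().to_dict()
print(class_counts)
```

Fixed thresholds via `pd.cut` are an alternative when domain-defined cut points exist; quantile bins trade interpretability for balanced classes.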

In [None]:
# Example placeholder
# target_col = 'Severity_Class'
# df[target_col].value_counts()

# TODO: replace with the correct target column name
target_col = 'Severity_Class'

X = df.drop(columns=[target_col])
y = df[target_col]

## 5. Preprocessing
Handle missing values and scale numeric features. If you have categorical features, decide how to encode them.

In [None]:
# 'number' covers every integer and float dtype, not just int64/float64
numeric_cols = X.select_dtypes(include='number').columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Simple one-hot for categorical columns (if any)
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

## 6. Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
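
The `stratify=y` argument keeps class proportions the same in the train and test sets, which matters when one severity class is rare. A minimal sketch on synthetic labels (a 60/30/10 mix chosen purely for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration only (not from the dataset).
toy_y = pd.Series(["Low"] * 60 + ["Medium"] * 30 + ["High"] * 10)
toy_X = pd.DataFrame({"feature": range(len(toy_y))})

_, _, _, y_test_strat = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# The 20-row test set mirrors the 60/30/10 class mix of the full data.
strat_counts = y_test_strat.value_counts().to_dict()
print(strat_counts)
```

Note that stratification raises an error if any class has fewer than two rows, which is worth checking with `y.value_counts()` before splitting.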

## 7. Baseline Model (Logistic Regression)

In [None]:
log_reg = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
])

log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))

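# Pass the class labels explicitly so the heatmap axes show names, not 0/1/2
cm = confusion_matrix(y_test, y_pred, labels=log_reg.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=log_reg.classes_, yticklabels=log_reg.classes_)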
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
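
Accuracy is easier to interpret next to a trivial reference point. The sketch below is an addition, not part of the template: it uses scikit-learn's `DummyClassifier` on synthetic data to show what always predicting the majority class scores, and why macro-averaged F1 is a useful companion metric when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data with a roughly 60/30/10 class mix (illustration only).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Always predicts the most frequent training class, ignoring the features.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_dummy = dummy.predict(X_te)

dummy_accuracy = accuracy_score(y_te, y_dummy)
dummy_macro_f1 = f1_score(y_te, y_dummy, average="macro", zero_division=0)
print(f"Majority-class accuracy: {dummy_accuracy:.2f}")
print(f"Majority-class macro-F1: {dummy_macro_f1:.2f}")
```

A model is only informative if it clearly beats these numbers. Macro-F1 penalizes the dummy because it never predicts the minority classes.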

## 8. Alternative Model (Random Forest)
Use a second model to compare performance. Keep it simple.

In [None]:
rf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE))
])

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred_rf))
print('\nClassification Report:\n', classification_report(y_test, y_pred_rf))

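# Pass the class labels explicitly so the heatmap axes show names, not 0/1/2
cm_rf = confusion_matrix(y_test, y_pred_rf, labels=rf.classes_)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens',
            xticklabels=rf.classes_, yticklabels=rf.classes_)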
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
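
Country-level datasets tend to be small, so a single 80/20 split can give a noisy comparison. One possible complement, sketched here on synthetic data rather than the assignment dataset, is stratified k-fold cross-validation with the same folds for both models. The choice of 5 folds and macro-F1 is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features (illustration only).
X_syn, y_syn = make_classification(
    n_samples=150, n_features=8, n_informative=4, n_classes=3, random_state=42
)

candidates = {
    "logistic_regression": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=42)),
    ]),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Reuse the same folds for both models so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(est, X_syn, y_syn, cv=cv, scoring="f1_macro")
    for name, est in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: macro-F1 {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the real data, the full `log_reg` and `rf` pipelines could be passed to `cross_val_score` in the same way, so preprocessing is refit inside each fold.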

## 9. Feature Importance (Optional, if required)
Use this only if the assignment requires it.

In [None]:
# Example: show top features from Random Forest
# This requires getting feature names after preprocessing
# If you have categorical variables, one-hot expansion increases feature count.

# Uncomment and adapt if required
# get_feature_names_out() returns the names in the column order the model sees,
# which avoids reconstructing the one-hot names by hand.
# feature_names = rf.named_steps['preprocess'].get_feature_names_out()
# importances = rf.named_steps['model'].feature_importances_
# imp_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
# imp_df.sort_values('importance', ascending=False).head(10)

## 10. Conclusion
Write a short conclusion answering the research question and summarizing results.

**TODO:** Add your conclusion here.