# RandomForest Model for Injury Risk Prediction

This notebook trains a RandomForest model to predict injury risk levels (Low, Medium, High) for athletes based on their training and physical attributes. The pipeline includes data loading, preprocessing, feature engineering, model training with hyperparameter tuning, probability calibration, evaluation, and testing.

## Objectives
- Load and preprocess the dataset.
- Perform feature engineering to create meaningful features.
- Train a RandomForest model, tuning hyperparameters with randomized search.
- Calibrate probabilities to ensure reliable confidence scores.
- Evaluate the model using appropriate metrics and visualizations.
- Save the model and encoder for use in the prediction pipeline.
- Test the saved model on the test set or new data.

In [1]:
# Install required packages
!pip install -U scikit-learn==1.6.1 joblib imbalanced-learn



In [2]:
# Import libraries
import os
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.calibration import CalibratedClassifierCV

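# Set seeds for reproducibility (NumPy and Python's random module)
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
# Note: PYTHONHASHSEED only affects interpreters started after this point
# (e.g. subprocesses); it does not change hashing in the already-running kernel
os.environ['PYTHONHASHSEED'] = str(SEED)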

In [3]:
# ---------------------- 1. Load Data ----------------------
print("Loading dataset...")
data_path = "/content/Refined_Sports_Injury_Dataset.csv"

# Check if file exists
if not os.path.exists(data_path):
    raise FileNotFoundError(f"Dataset file not found at {data_path}. Please ensure the file exists.")

df = pd.read_csv(data_path)

# Validate dataset structure
expected_columns = [
    "Age", "Gender", "Sport_Type", "Experience_Level", "Flexibility_Score",
    "Total_Weekly_Training_Hours", "High_Intensity_Training_Hours", "Strength_Training_Frequency",
    "Recovery_Time_Between_Sessions", "Training_Load_Score", "Sprint_Speed", "Endurance_Score",
    "Agility_Score", "Fatigue_Level", "Previous_Injury_Count", "Previous_Injury_Type",
    "Injury_Risk_Level"
]
missing_cols = [col for col in expected_columns if col not in df.columns]
if missing_cols:
    raise ValueError(f"Dataset is missing required columns: {missing_cols}")

# Display dataset info
print("Dataset loaded. Shape:", df.shape)
print("\nDataset Info:")
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"
print("\nClass Distribution:")
print(df["Injury_Risk_Level"].value_counts())

Loading dataset...
Dataset loaded. Shape: (10000, 18)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           10000 non-null  int64  
 1   Gender                        10000 non-null  object 
 2   Sport_Type                    10000 non-null  object 
 3   Experience_Level              10000 non-null  object 
 4   Flexibility_Score             10000 non-null  float64
 5   Total_Weekly_Training_Hours   10000 non-null  float64
 6   High_Intensity_Training_Hours 10000 non-null  float64
 7   Strength_Training_Frequency   10000 non-null  int64  
 8   Recovery_Time_Between_Sessions 10000 non-null  float64
 9   Training_Load_Score           10000 non-null  float64
 10  Sprint_Speed                  10000 non-null  float64
 11  Endurance_Score               10000 non-null  floa

In [4]:
# ---------------------- 2. Data Preprocessing & Encoding ----------------------
print("\n🔠 Encoding categorical columns...")

# Define mappings for categorical variables
gender_mapping = {"Male": 0, "Female": 1}
experience_mapping = {"Beginner": 0, "Intermediate": 1, "Advanced": 2, "Professional": 3}
injury_type_mapping = {"None": 0, "Sprain": 1, "Ligament Tear": 2, "Tendonitis": 3, "Strain": 4, "Fracture": 5}

# Handle NaN values in Previous_Injury_Type by treating them as 'None'
df["Previous_Injury_Type"] = df["Previous_Injury_Type"].fillna("None")

# Validate categorical columns before encoding
for col, mapping in [("Gender", gender_mapping), ("Experience_Level", experience_mapping), ("Previous_Injury_Type", injury_type_mapping)]:
    invalid_values = set(df[col]) - set(mapping.keys())
    if invalid_values:
        raise ValueError(f"Invalid values found in {col}: {invalid_values}. Expected values: {list(mapping.keys())}")

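# Encode Gender (the validation above guarantees every value is in the mapping,
# so no fallback fill is needed)
df["Gender"] = df["Gender"].map(gender_mapping).astype(int)

# Encode Sport_Type dynamically. The codes follow the sorted category order found
# in this dataset, so the same mapping must be reused at prediction time.
df["Sport_Type"] = df["Sport_Type"].astype("category").cat.codes

# Encode Experience_Level (ordinal: Beginner < Intermediate < Advanced < Professional)
df["Experience_Level"] = df["Experience_Level"].map(experience_mapping).astype(int)

# Encode Previous_Injury_Type
df["Previous_Injury_Type"] = df["Previous_Injury_Type"].map(injury_type_mapping).astype(int)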

# Encode target variable (Injury_Risk_Level)
le = LabelEncoder()
df["Injury_Risk_Level"] = le.fit_transform(df["Injury_Risk_Level"].astype(str))

# Verify encoding
print("Encoded class mapping:", dict(zip(le.classes_, range(len(le.classes_)))))
print("Sample of encoded data:\n", df.head())


🔠 Encoding categorical columns...
Encoded class mapping: {'High': 0, 'Low': 1, 'Medium': 2}
Sample of encoded data:
    Age  Gender  Sport_Type  Experience_Level  Flexibility_Score  \
0   34       0           2                 0                7.2   
1   29       1           4                 3                8.5   
2   31       0           2                 2                6.8   
3   27       1           0                 1                7.9   
4   33       0           3                 2                6.5   

   Total_Weekly_Training_Hours  High_Intensity_Training_Hours  \
0                         12.0                            4.0   
1                          8.0                            2.0   
2                         15.0                            6.0   
3                         10.0                            3.0   
4                          9.0                            3.0   

   Strength_Training_Frequency  Recovery_Time_Between_Sessions  \
0                     
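
Because `Sport_Type` is encoded with `cat.codes`, the integer assigned to each sport depends on which categories are present when the cell runs. The following is a minimal sketch of capturing that mapping so the prediction pipeline can reuse it. The sport names and the `sport_mapping` variable are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sport names; the real dataset's categories may differ
demo = pd.DataFrame({"Sport_Type": ["Soccer", "Basketball", "Tennis", "Soccer"]})

# Convert once to a categorical so the category order is explicit
sport_cat = demo["Sport_Type"].astype("category")

# Category -> integer code, in the same order that cat.codes uses
sport_mapping = {name: code for code, name in enumerate(sport_cat.cat.categories)}
demo["Sport_Type"] = sport_cat.cat.codes

print(sport_mapping)

# At prediction time, reuse the stored mapping instead of re-deriving codes
new_rows = pd.Series(["Tennis", "Soccer"])
encoded_new = new_rows.map(sport_mapping)
print(encoded_new.tolist())
```

A dictionary like this could be saved with `joblib.dump` next to the label encoder, so that new data is encoded exactly as the training data was.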

In [None]:
# ---------------------- 3. Feature Engineering ----------------------
print("\n🛠️ Creating derived features...")

# Replace 0 with 0.1 in Total_Weekly_Training_Hours to avoid division by zero
df["Total_Weekly_Training_Hours"] = df["Total_Weekly_Training_Hours"].replace(0, 0.1)

# Create derived features
df["Intensity_Ratio"] = df["High_Intensity_Training_Hours"] / df["Total_Weekly_Training_Hours"]
df["Recovery_Per_Training"] = df["Recovery_Time_Between_Sessions"] / df["Total_Weekly_Training_Hours"]

# Check for NaN or infinite values in derived features
derived = df[["Intensity_Ratio", "Recovery_Per_Training"]]
if not np.isfinite(derived.to_numpy()).all():
    raise ValueError("NaN or infinite values found in derived features. Check data for inconsistencies.")

# Define features
features = [
    "Age", "Gender", "Sport_Type", "Experience_Level", "Flexibility_Score",
    "Total_Weekly_Training_Hours", "High_Intensity_Training_Hours", "Strength_Training_Frequency",
    "Recovery_Time_Between_Sessions", "Training_Load_Score", "Sprint_Speed", "Endurance_Score",
    "Agility_Score", "Fatigue_Level", "Previous_Injury_Count", "Previous_Injury_Type",
    "Intensity_Ratio", "Recovery_Per_Training"
]




🛠️ Creating derived features...
Features created: ['Age', 'Gender', 'Sport_Type', 'Experience_Level', 'Flexibility_Score', 'Total_Weekly_Training_Hours', 'High_Intensity_Training_Hours', 'Strength_Training_Frequency', 'Recovery_Time_Between_Sessions', 'Training_Load_Score', 'Sprint_Speed', 'Endurance_Score', 'Agility_Score', 'Fatigue_Level', 'Previous_Injury_Count', 'Previous_Injury_Type', 'Intensity_Ratio', 'Recovery_Per_Training']
Sample of features:
    Age  Gender  Sport_Type  Experience_Level  Flexibility_Score  \
0   34       0           2                 0                7.2   
1   29       1           4                 3                8.5   
2   31       0           2                 2                6.8   
3   27       1           0                 1                7.9   
4   33       0           3                 2                6.5   

   Total_Weekly_Training_Hours  High_Intensity_Training_Hours  \
0                         12.0                            4.0   
1       

In [None]:
# Prepare features and target
X = df[features]
y = df["Injury_Risk_Level"]

# Verify features
print("Features created:", features)
print("Sample of features:\n", X.head())
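
To make the derived features concrete, here is a small worked example on two made-up rows. It applies the same zero-hours guard as the feature engineering cell; the numbers are illustrative only and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the second athlete logs zero weekly hours
demo = pd.DataFrame({
    "Total_Weekly_Training_Hours": [12.0, 0.0],
    "High_Intensity_Training_Hours": [4.0, 0.0],
    "Recovery_Time_Between_Sessions": [24.0, 48.0],
})

# Same guard as the notebook: 0 hours becomes 0.1 before dividing
hours = demo["Total_Weekly_Training_Hours"].replace(0, 0.1)
demo["Intensity_Ratio"] = demo["High_Intensity_Training_Hours"] / hours
demo["Recovery_Per_Training"] = demo["Recovery_Time_Between_Sessions"] / hours

print(demo[["Intensity_Ratio", "Recovery_Per_Training"]])
```

The first row gives an intensity ratio of 4 / 12 ≈ 0.33 and a recovery ratio of 24 / 12 = 2.0. The zero-hour row avoids a division error, but its recovery ratio becomes 48 / 0.1 = 480, so rows with zero training hours end up with very large `Recovery_Per_Training` values. Whether that is acceptable depends on how many such rows the data contains.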

In [6]:
# ---------------------- 4. Train/Test Split & SMOTE ----------------------
print("\n📊 Splitting data & applying SMOTE...")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)

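# Apply SMOTE to balance the training set only; the test set keeps its original distribution.
# Note: SMOTE interpolates between neighbors, so integer-coded categorical columns
# (Gender, Sport_Type, ...) can receive non-integer synthetic values.
try:
    sm = SMOTE(random_state=SEED)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
except Exception as e:
    raise RuntimeError(f"SMOTE failed: {e}. Check for invalid data in X_train or y_train.") from e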

# Print class distribution after SMOTE
print("Training set class distribution after SMOTE:")
print(pd.Series(y_train_res).value_counts())


📊 Splitting data & applying SMOTE...
Training set class distribution after SMOTE:
0    4811
1    4811
2    4811
Name: count, dtype: int64


In [7]:
# ---------------------- 5. Hyperparameter Tuning with RandomizedSearchCV ----------------------
print("\n🔍 Performing hyperparameter tuning...")

# Set manual class weights to prioritize the 'High' class
# (fixed values chosen by hand, not computed from the data)
class_weights = {0: 2.0, 1: 1.0, 2: 1.0}  # Higher weight for 'High' (encoded as 0)

# Define RandomForest model
rf = RandomForestClassifier(random_state=SEED, class_weight=class_weights)

# Define reduced hyperparameter grid for faster tuning
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [15, 20, 25],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Perform randomized search with cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, n_iter=15, cv=cv,
    scoring='f1_macro', n_jobs=1, random_state=SEED
)
random_search.fit(X_train_res, y_train_res)

# Get best model
best_rf = random_search.best_estimator_
print("Best hyperparameters:", random_search.best_params_)
print("Best cross-validation F1-score:", random_search.best_score_)


🔍 Performing hyperparameter tuning...
Best hyperparameters: {'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 25}
Best cross-validation F1-score: 0.947693499624901
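
To see how much of the search space `n_iter=15` covers, the grid size can be counted with `ParameterGrid`. This sketch repeats the `param_dist` values from the tuning cell; `n_combinations` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as param_dist in the tuning cell
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [15, 20, 25],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# 3 * 3 * 2 * 2 distinct combinations
n_combinations = len(ParameterGrid(param_dist))
n_iter = 15
print(f"{n_iter} of {n_combinations} combinations sampled "
      f"({n_iter / n_combinations:.0%} of the grid)")
```

With 5-fold cross-validation, 15 sampled combinations mean 75 model fits plus one final refit. The best parameters reported above (`n_estimators=300`, `max_depth=25`) sit at the upper edge of their ranges, which may be a reason to try wider ranges.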


In [8]:
# ---------------------- 6. Calibrate Probabilities ----------------------
print("\n📏 Calibrating probabilities...")

# Calibrate the best model using CalibratedClassifierCV with reduced folds
calibrated_rf = CalibratedClassifierCV(best_rf, method='sigmoid', cv=3, ensemble=True)
calibrated_rf.fit(X_train_res, y_train_res)

# Evaluate calibrated model on test set
y_pred = calibrated_rf.predict(X_test)
y_proba = calibrated_rf.predict_proba(X_test)

# Print evaluation metrics
print("\n📈 Model Evaluation on Test Set:")
print("F1 Score (Macro):", f1_score(y_test, y_pred, average="macro"))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=le.classes_))

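# Cross-validation on calibrated model.
# Caveat: the folds are drawn from the SMOTE-resampled training set, so synthetic
# samples interpolated from a fold's training points can land in its validation
# split; expect these scores to be optimistic relative to the untouched test set.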
cv_scores = cross_val_score(calibrated_rf, X_train_res, y_train_res, cv=cv, scoring="f1_macro")
print(f"Cross-Validation F1 Scores: {cv_scores}")
print(f"Mean CV F1 Score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")


📏 Calibrating probabilities...

📈 Model Evaluation on Test Set:
F1 Score (Macro): 0.9141359042280297
Accuracy: 0.929

Classification Report:
               precision    recall  f1-score   support

        High       0.83      0.92      0.87       231
         Low       0.92      0.95      0.93       566
      Medium       0.96      0.92      0.94      1203

    accuracy                           0.93      2000
   macro avg       0.90      0.93      0.91      2000
weighted avg       0.93      0.93      0.93      2000

Cross-Validation F1 Scores: [0.94929183 0.94712258 0.94513455 0.94776378 0.95005877]
Mean CV F1 Score: 0.9479 ± 0.0017
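
Accuracy and F1 describe the predicted labels, not the quality of the probabilities. One possible way to check whether calibration helped is to compare log loss and a one-vs-rest Brier score before and after calibration. The sketch below does this on synthetic data; the dataset, model sizes, and variable names are assumptions, and the result on the real data may differ.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the notebook's train/test split
X_demo, y_demo = make_classification(
    n_samples=900, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

raw_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
cal_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    method="sigmoid", cv=3,
).fit(X_tr, y_tr)

for name, model in [("uncalibrated", raw_rf), ("calibrated", cal_rf)]:
    proba = model.predict_proba(X_te)
    # One-vs-rest Brier score for class 0 (lower is better)
    brier_0 = brier_score_loss((y_te == 0).astype(int), proba[:, 0])
    print(f"{name}: log loss={log_loss(y_te, proba):.3f}, Brier(class 0)={brier_0:.3f}")
```

Lower values indicate probabilities that track observed frequencies more closely. In the notebook, the same comparison could be run with `best_rf` and `calibrated_rf` on `X_test`.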


In [9]:
# ---------------------- 7. Visual Insights ----------------------
print("\n📉 Generating visualizations...")

# Ensure model directory exists
os.makedirs("model", exist_ok=True)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="g", cmap="Blues", xticklabels=le.classes_, yticklabels=le.classes_)
plt.title("Confusion Matrix - RandomForest")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.savefig("model/rf_confusion_matrix.png")
plt.close()

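# Feature Importance (taken from the tuned, uncalibrated forest; the calibrated
# wrapper fits its own clones of it and does not expose feature_importances_)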
feat_imp = pd.DataFrame({"Feature": features, "Importance": best_rf.feature_importances_})
feat_imp.sort_values("Importance", ascending=False, inplace=True)
plt.figure(figsize=(10, 6))
sns.barplot(x="Importance", y="Feature", data=feat_imp)
plt.title("Feature Importances - RandomForest")
plt.tight_layout()
plt.savefig("model/rf_feature_importance.png")
plt.close()

print("Visuals saved to model/ directory.")


📉 Generating visualizations...
Visuals saved to model/ directory.
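
The test set is imbalanced (1203 Medium rows versus 231 High rows), so raw counts in the confusion matrix are dominated by the majority class. A row-normalized matrix is sometimes easier to read. The tiny hand-made labels below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-made example with encoded labels 0=High, 1=Low, 2=Medium
y_true_demo = np.array([0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 1, 1, 1, 0, 2])

# normalize="true" divides each row by its true-class count,
# so the diagonal holds per-class recall
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")
print(np.round(cm_norm, 2))
```

Each row sums to 1. Passing `cm_norm` to `sns.heatmap` with `fmt=".2f"` would give a plot in which the three classes are directly comparable.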


In [10]:
# ---------------------- 8. Save Model and Encoder ----------------------
print("\n💾 Saving model and encoder...")

# Ensure model directory exists
os.makedirs("model", exist_ok=True)

# Save the calibrated model
joblib.dump(calibrated_rf, "model/rf_injury_model.pkl")

# Save the label encoder
joblib.dump(le, "model/rf_target_encoder.pkl")

print("Model and encoder saved to model/ directory.")


💾 Saving model and encoder...
Model and encoder saved to model/ directory.
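
The last objective is to test the saved model on the test set or on new data. The following self-contained sketch shows the save, reload, and predict round trip with `joblib`. It trains a small stand-in model on random data and writes to a temporary directory; in the notebook the saved objects would be `calibrated_rf` and `le`, and the files would live under `model/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Stand-in training data; random features with the notebook's three labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(90, 4))
labels = np.array(["High", "Low", "Medium"] * 30)

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels)
model_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_injury_model.pkl")
    encoder_path = os.path.join(tmp_dir, "rf_target_encoder.pkl")
    joblib.dump(model_demo, model_path)
    joblib.dump(le_demo, encoder_path)

    # Reload as the prediction pipeline would
    loaded_model = joblib.load(model_path)
    loaded_le = joblib.load(encoder_path)

# Predict encoded classes, then map them back to the original labels
pred_codes = loaded_model.predict(X_demo[:5])
pred_labels = loaded_le.inverse_transform(pred_codes)
confidence = loaded_model.predict_proba(X_demo[:5]).max(axis=1)
print(list(zip(pred_labels, confidence.round(2))))
```

New data has to go through the same encoding and feature engineering steps as the training data, with columns in the same order as `features`, before it is passed to the loaded model.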
