# XGBoost Model for Injury Risk Prediction

This notebook trains an XGBoost model to predict injury risk levels (Low, Medium, High) for athletes based on their training and physical attributes. The pipeline includes data loading, preprocessing, feature engineering, model training with hyperparameter tuning, probability calibration, and evaluation.

## Objectives
- Load and preprocess the dataset consistently with the RandomForest model.
- Engineer two ratio features, `Intensity_Ratio` and `Recovery_Per_Training`.
- Train an XGBoost model with hyperparameter tuning.
- Calibrate probabilities to make the predicted confidence scores more reliable.
- Evaluate the model with macro F1, accuracy, a per-class classification report, a confusion matrix, and feature importances.
- Save the model and encoder for use in the prediction pipeline.

In [1]:
# Import libraries
import os
import pandas as pd
import numpy as np
import xgboost as xgb
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.calibration import CalibratedClassifierCV

# Set seed for reproducibility
np.random.seed(42)

In [2]:
# ---------------------- 1. Load Dataset ----------------------
# Load the dataset
df = pd.read_csv("/content/Refined_Sports_Injury_Dataset.csv")

# Display dataset info
print("Dataset loaded. Shape:", df.shape)
print("\nDataset Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nClass Distribution:")
print(df["Injury_Risk_Level"].value_counts())

Dataset loaded. Shape: (10000, 18)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           10000 non-null  int64  
 1   Gender                        10000 non-null  object 
 2   Sport_Type                    10000 non-null  object 
 3   Experience_Level              10000 non-null  object 
 4   Flexibility_Score             10000 non-null  float64
 5   Total_Weekly_Training_Hours   10000 non-null  float64
 6   High_Intensity_Training_Hours 10000 non-null  float64
 7   Strength_Training_Frequency   10000 non-null  int64  
 8   Recovery_Time_Between_Sessions 10000 non-null  float64
 9   Training_Load_Score           10000 non-null  float64
 10  Sprint_Speed                  10000 non-null  float64
 11  Endurance_Score               10000 non-null  float64
 12  Agility_Sc

In [3]:
# ---------------------- 2. Encode Categorical Columns ----------------------
print("\n🔠 Encoding categorical columns...")

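# Define mappings for categorical variables (consistent with RandomForest).
# Values missing from a mapping become NaN and are filled with 0 below, so they
# silently fall into the first category (Male / Beginner / None).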
gender_mapping = {"Male": 0, "Female": 1}
experience_mapping = {"Beginner": 0, "Intermediate": 1, "Advanced": 2, "Professional": 3}
injury_type_mapping = {"None": 0, "Sprain": 1, "Ligament Tear": 2, "Tendonitis": 3, "Strain": 4, "Fracture": 5}

# Encode Gender
df["Gender"] = df["Gender"].map(gender_mapping).fillna(0).astype(int)

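# Encode Sport_Type dynamically: codes follow the sorted order of the categories present
# in this dataset, so the same category list must be reused at prediction time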
df["Sport_Type"] = df["Sport_Type"].astype("category").cat.codes

# Encode Experience_Level
df["Experience_Level"] = df["Experience_Level"].map(experience_mapping).fillna(0).astype(int)

# Encode Previous_Injury_Type
df["Previous_Injury_Type"] = df["Previous_Injury_Type"].map(injury_type_mapping).fillna(0).astype(int)

# Encode Target
le = LabelEncoder()
df["Injury_Risk_Level"] = le.fit_transform(df["Injury_Risk_Level"].astype(str))

# Verify encoding
print("Encoded class mapping:", dict(zip(le.classes_, range(len(le.classes_)))))
print("Sample of encoded data:\n", df.head())


🔠 Encoding categorical columns...
Encoded class mapping: {'High': 0, 'Low': 1, 'Medium': 2}
Sample of encoded data:
    Age  Gender  Sport_Type  Experience_Level  Flexibility_Score  \
0   34       0           2                 0                7.2   
1   29       1           4                 3                8.5   
2   31       0           2                 2                6.8   
3   27       1           0                 1                7.9   
4   33       0           3                 2                6.5   

   Total_Weekly_Training_Hours  High_Intensity_Training_Hours  \
0                         12.0                            4.0   
1                          8.0                            2.0   
2                         15.0                            6.0   
3                         10.0                            3.0   
4                          9.0                            3.0   

   Strength_Training_Frequency  Recovery_Time_Between_Sessions  \
0                     
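
Because `cat.codes` derives the `Sport_Type` codes from whatever categories appear in the data, a prediction pipeline needs the same category list to reproduce them. The following is a minimal sketch of one way to freeze and reuse that list. The sport names and the helper `encode_sport_type` are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def encode_sport_type(values: pd.Series, categories: list) -> pd.Series:
    """Encode sport names against a fixed category list; unseen names become -1."""
    codes = pd.Categorical(values, categories=categories).codes
    return pd.Series(codes, index=values.index)


# Hypothetical training column (the real dataset's sport names may differ).
train_sports = pd.Series(["Soccer", "Basketball", "Tennis", "Soccer"])

# Freeze the category order once, at training time.
sport_categories = sorted(train_sports.unique())
train_codes = encode_sport_type(train_sports, sport_categories)

# At prediction time, reuse the frozen list instead of re-deriving codes.
new_sports = pd.Series(["Tennis", "Rugby"])
new_codes = encode_sport_type(new_sports, sport_categories)

print(train_codes.tolist(), new_codes.tolist())
```

The frozen list could be stored next to the model with `joblib.dump`, and a code of -1 gives the caller a chance to reject or specially handle a sport the model never saw.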

In [4]:
# ---------------------- 3. Feature Engineering ----------------------
print("\n🛠️ Creating derived features...")

# Replace 0 with 0.1 in Total_Weekly_Training_Hours to avoid division by zero
df["Total_Weekly_Training_Hours"] = df["Total_Weekly_Training_Hours"].replace(0, 0.1)

# Create derived features
df["Intensity_Ratio"] = df["High_Intensity_Training_Hours"] / df["Total_Weekly_Training_Hours"]
df["Recovery_Per_Training"] = df["Recovery_Time_Between_Sessions"] / df["Total_Weekly_Training_Hours"]

# Define features (removed Predicted_Injury_Type)
features = [
    "Age", "Gender", "Sport_Type", "Experience_Level", "Flexibility_Score",
    "Total_Weekly_Training_Hours", "High_Intensity_Training_Hours", "Strength_Training_Frequency",
    "Recovery_Time_Between_Sessions", "Training_Load_Score", "Sprint_Speed", "Endurance_Score",
    "Agility_Score", "Fatigue_Level", "Previous_Injury_Count", "Previous_Injury_Type",
    "Intensity_Ratio", "Recovery_Per_Training"
]

# Prepare features and target
X = df[features]
y = df["Injury_Risk_Level"]

# Verify features
print("Features created:", features)
print("Sample of features:\n", X.head())


🛠️ Creating derived features...
Features created: ['Age', 'Gender', 'Sport_Type', 'Experience_Level', 'Flexibility_Score', 'Total_Weekly_Training_Hours', 'High_Intensity_Training_Hours', 'Strength_Training_Frequency', 'Recovery_Time_Between_Sessions', 'Training_Load_Score', 'Sprint_Speed', 'Endurance_Score', 'Agility_Score', 'Fatigue_Level', 'Previous_Injury_Count', 'Previous_Injury_Type', 'Intensity_Ratio', 'Recovery_Per_Training']
Sample of features:
    Age  Gender  Sport_Type  Experience_Level  Flexibility_Score  \
0   34       0           2                 0                7.2   
1   29       1           4                 3                8.5   
2   31       0           2                 2                6.8   
3   27       1           0                 1                7.9   
4   33       0           3                 2                6.5   

   Total_Weekly_Training_Hours  High_Intensity_Training_Hours  \
0                         12.0                            4.0   
1       
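
The prediction pipeline has to apply the same derived features before calling the model. One way to keep training and inference in step is to wrap the logic of this cell in a function. This is a sketch only: the name `add_derived_features` and the sample values are assumptions, while the column names and formulas come from the cell above.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Intensity_Ratio and Recovery_Per_Training added."""
    out = df.copy()
    # Same zero-hours guard as the notebook: replace 0 with 0.1 before dividing.
    hours = out["Total_Weekly_Training_Hours"].replace(0, 0.1)
    out["Total_Weekly_Training_Hours"] = hours
    out["Intensity_Ratio"] = out["High_Intensity_Training_Hours"] / hours
    out["Recovery_Per_Training"] = out["Recovery_Time_Between_Sessions"] / hours
    return out


# Made-up rows for illustration, including one athlete with zero training hours.
sample = pd.DataFrame({
    "Total_Weekly_Training_Hours": [12.0, 0.0],
    "High_Intensity_Training_Hours": [4.0, 0.0],
    "Recovery_Time_Between_Sessions": [24.0, 48.0],
})
engineered = add_derived_features(sample)
print(engineered)
```

Working on a copy leaves the caller's DataFrame untouched, which avoids surprises when the same frame is reused elsewhere.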

In [5]:
# ---------------------- 4. Train/Test Split & SMOTE ----------------------
print("\n📊 Splitting data & applying SMOTE...")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Apply SMOTE to the training split only, so the test set keeps its original class distribution
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Print class distribution after SMOTE
print("Training set class distribution after SMOTE:")
print(pd.Series(y_train_res).value_counts())


📊 Splitting data & applying SMOTE...
Training set class distribution after SMOTE:
0    4811
1    4811
2    4811
Name: count, dtype: int64
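
SMOTE creates each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step with NumPy, using made-up values for `Age`, `Gender`, and `Sport_Type`. It is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class sample and one of its neighbors: Age, Gender, Sport_Type.
x = np.array([30.0, 0.0, 2.0])
neighbor = np.array([34.0, 1.0, 4.0])

# SMOTE's core step: pick a random point on the segment between the two samples.
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbor - x)

print(synthetic)
```

Note that the integer-coded columns (`Gender`, `Sport_Type`) come out fractional here. If that matters for the model, imbalanced-learn's `SMOTENC`, which treats designated columns as categorical, may be worth evaluating.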


In [6]:
# ---------------------- 5. Hyperparameter Tuning with GridSearchCV ----------------------
print("\n🔍 Performing hyperparameter tuning...")

# Define XGBoost model
xgb_model = xgb.XGBClassifier(
    objective="multi:softprob",
    eval_metric="mlogloss",
    num_class=len(le.classes_),
    random_state=42
)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

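# Perform grid search with 5-fold cross-validation, selecting by macro F1.
# Note: X_train_res already contains SMOTE samples, so the validation folds include
# synthetic points and the cross-validation score is likely optimistic.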
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
grid_search.fit(X_train_res, y_train_res)

# Get best model
best_xgb = grid_search.best_estimator_
print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validation F1-score:", grid_search.best_score_)


🔍 Performing hyperparameter tuning...
Best hyperparameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 300, 'subsample': 1.0}
Best cross-validation F1-score: 0.9561784041412857
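
The grid search cross-validates on `X_train_res`, which already contains SMOTE samples, so validation folds can include synthetic points interpolated from rows in the training folds. That may be part of why the cross-validation score (0.956) is higher than the test macro F1 reported later (0.915). A sketch of the alternative, resampling inside each fold so that validation data stays untouched, is shown below. To keep it self-contained it uses scikit-learn only: `random_oversample` is a hypothetical stand-in for SMOTE, a decision tree stands in for XGBoost, and the data is synthetic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def random_oversample(X, y, seed=42):
    """Stand-in resampler: duplicate minority rows until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = [np.arange(len(y))]
    for cls, n in zip(classes, counts):
        if n < target:
            members = np.flatnonzero(y == cls)
            idx.append(rng.choice(members, size=target - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]


def cv_macro_f1(model, X, y, resample, n_splits=5, seed=42):
    """Cross-validate with resampling applied to the training part of each fold only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_res, y_res = resample(X[train_idx], y[train_idx])
        fitted = clone(model).fit(X_res, y_res)
        # The validation fold keeps its original, imbalanced distribution.
        scores.append(f1_score(y[val_idx], fitted.predict(X[val_idx]), average="macro"))
    return float(np.mean(scores))


# Synthetic imbalanced three-class data for illustration.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)
score = cv_macro_f1(DecisionTreeClassifier(random_state=42), X_demo, y_demo, random_oversample)
print("Fold-wise resampled macro F1:", score)
```

In this notebook, `SMOTE(random_state=42).fit_resample` has the same `(X, y) -> (X_res, y_res)` shape and could be passed as `resample`, with `X_train.to_numpy()` and `y_train.to_numpy()` as inputs. imbalanced-learn's own `Pipeline` class achieves the same effect inside `GridSearchCV`.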


In [7]:
# ---------------------- 6. Calibrate Probabilities ----------------------
print("\n📏 Calibrating probabilities...")

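# Calibrate with CalibratedClassifierCV: clones of the tuned model are refit on each of
# 5 folds, and sigmoid (Platt scaling) calibrators are fitted on the held-out part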
calibrated_xgb = CalibratedClassifierCV(best_xgb, method='sigmoid', cv=5)
calibrated_xgb.fit(X_train_res, y_train_res)

# Evaluate calibrated model on test set
y_pred = calibrated_xgb.predict(X_test)
y_proba = calibrated_xgb.predict_proba(X_test)

# Print evaluation metrics
print("\n📈 Model Evaluation on Test Set:")
print("F1 Score (Macro):", f1_score(y_test, y_pred, average="macro"))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=le.classes_))


📏 Calibrating probabilities...

📈 Model Evaluation on Test Set:
F1 Score (Macro): 0.9154651493598866
Accuracy: 0.929

Classification Report:
               precision    recall  f1-score   support

        High       0.82      0.90      0.86       231
         Low       0.93      0.94      0.94       566
      Medium       0.95      0.93      0.94      1203

    accuracy                           0.93      2000
   macro avg       0.90      0.93      0.92      2000
weighted avg       0.93      0.93      0.93      2000
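
F1 and accuracy score the predicted labels, not the quality of the probabilities, so they do not show whether calibration helped. One common check is the multiclass Brier score (lower is better). The sketch below is a plain NumPy version; the function name `multiclass_brier` and the demo values are assumptions.

```python
import numpy as np


def multiclass_brier(y_true, proba):
    """Mean over samples of the squared distance between proba and the one-hot label."""
    y_true = np.asarray(y_true).astype(int)
    proba = np.asarray(proba, dtype=float)
    one_hot = np.zeros_like(proba)
    one_hot[np.arange(len(y_true)), y_true] = 1.0
    return float(np.mean(np.sum((proba - one_hot) ** 2, axis=1)))


# Made-up labels and probabilities for three samples and three classes.
y_demo = [0, 1, 2]
proba_demo = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]
print("Brier score:", multiclass_brier(y_demo, proba_demo))
```

In this notebook the call would be `multiclass_brier(y_test, y_proba)`. Comparing it with the score of the uncalibrated `best_xgb.predict_proba(X_test)` would indicate whether calibration improved the probabilities. Since the calibrator was fitted on SMOTE-balanced data while the test set is imbalanced, that comparison seems worth running.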


In [8]:
# ---------------------- 7. Visual Insights ----------------------
print("\n📉 Generating visualizations...")

# Ensure the output directory exists before saving figures
os.makedirs("model", exist_ok=True)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="g", cmap="Blues", xticklabels=le.classes_, yticklabels=le.classes_)
plt.title("Confusion Matrix - XGBoost")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.savefig("model/xgb_confusion_matrix.png")
plt.close()

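# Feature Importance (taken from the tuned, uncalibrated best_xgb; the saved calibrated
# model wraps separately refit copies of it)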
feat_imp = pd.DataFrame({"Feature": features, "Importance": best_xgb.feature_importances_})
feat_imp.sort_values("Importance", ascending=False, inplace=True)
plt.figure(figsize=(10, 6))
sns.barplot(x="Importance", y="Feature", data=feat_imp)
plt.title("Feature Importances - XGBoost")
plt.tight_layout()
plt.savefig("model/xgb_feature_importance.png")
plt.close()

print("Visuals saved to model/ directory.")


📉 Generating visualizations...
Visuals saved to model/ directory.


In [9]:
# ---------------------- 8. Save Model and Encoder ----------------------
print("\n💾 Saving model and encoder...")

# Create model directory if it doesn't exist
os.makedirs("model", exist_ok=True)

# Save the calibrated model
joblib.dump(calibrated_xgb, "model/xgboost_injury_model.pkl")

# Save the label encoder
joblib.dump(le, "model/xgb_target_encoder.pkl")

print("Model and encoder saved to model/ directory.")


💾 Saving model and encoder...
Model and encoder saved to model/ directory.
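
On the prediction side, the two artifacts are loaded back and the encoder turns predicted class codes into labels. The round trip below is a self-contained sketch: a logistic regression on random data stands in for the calibrated XGBoost model, and a temporary directory stands in for `model/`. Only the file names and the `High`/`Low`/`Medium` classes come from this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Stand-ins for the notebook's `calibrated_xgb` and `le`.
le_demo = LabelEncoder().fit(["Low", "Medium", "High"])
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 4))
y_demo = np.arange(60) % 3
model_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "xgboost_injury_model.pkl")
    encoder_path = os.path.join(tmp, "xgb_target_encoder.pkl")
    joblib.dump(model_demo, model_path)
    joblib.dump(le_demo, encoder_path)

    # What a prediction pipeline would do: load both artifacts.
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# Predict class codes, then map them back to the original label strings.
pred_codes = loaded_model.predict(X_demo[:5])
pred_labels = loaded_encoder.inverse_transform(pred_codes)
print(list(pred_labels))
```

The input passed to `predict` must have the same columns, in the same order, as the `features` list used for training, including the two derived ratio features.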
