# Calibrate Injury Likelihood Mapping

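This notebook calibrates the injury likelihood mapping for the injury risk prediction system. It uses the training dataset (`Refined_Sports_Injury_Dataset.csv`) to map the ensemble's predicted risk-class probabilities to injury probabilities. The `Injury_Occurred` target is not read from the dataset; it is simulated from `Injury_Risk_Level` during preprocessing.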

## Objectives
- Load the dataset and trained models.
- Preprocess the data consistently with the training pipeline.
- Use the models to predict probabilities on a calibration set.
- Fit a logistic regression model to map predicted probabilities to true injury probabilities.
- Save the calibration model for use in `predict.py`.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt

# Set random seed so the simulated Injury_Occurred labels are reproducible
np.random.seed(42)

In [2]:
# ---------------------- 1. Load Data and Models ----------------------
# Load the dataset
data_path = "/content/Refined_Sports_Injury_Dataset.csv"
try:
    df = pd.read_csv(data_path)
except FileNotFoundError:
    raise FileNotFoundError(f"Dataset file not found at {data_path}. Please ensure the file exists.")
print("Dataset loaded. Shape:", df.shape)

# Load the trained models with error handling
model_dir = "/content/model"
try:
    rf_model = joblib.load(f"{model_dir}/rf_injury_model.pkl")
except Exception as e:
    # Chain the original error: the cause may be a missing file or an incompatible pickle
    raise RuntimeError(
        f"Failed to load RandomForest model: {e}. Ensure RandomForest.ipynb has been run "
        f"successfully to generate {model_dir}/rf_injury_model.pkl."
    ) from e

try:
    xgb_model = joblib.load(f"{model_dir}/xgboost_injury_model.pkl")
except Exception as e:
    raise RuntimeError(
        f"Failed to load XGBoost model: {e}. Ensure XGBOOST.ipynb has been run "
        f"successfully to generate {model_dir}/xgboost_injury_model.pkl."
    ) from e

print("Models loaded.")

Dataset loaded. Shape: (10000, 18)
Models loaded.


In [3]:
# ---------------------- 2. Preprocess Data ----------------------
# Encode categorical columns (consistent with training notebooks)
gender_mapping = {"Male": 0, "Female": 1}
experience_mapping = {"Beginner": 0, "Intermediate": 1, "Advanced": 2, "Professional": 3}
injury_type_mapping = {"None": 0, "Sprain": 1, "Ligament Tear": 2, "Tendonitis": 3, "Strain": 4, "Fracture": 5}

# Handle NaN values in Previous_Injury_Type
df["Previous_Injury_Type"] = df["Previous_Injury_Type"].fillna("None")

df["Gender"] = df["Gender"].map(gender_mapping).fillna(0).astype(int)
df["Sport_Type"] = df["Sport_Type"].astype("category").cat.codes
df["Experience_Level"] = df["Experience_Level"].map(experience_mapping).fillna(0).astype(int)
df["Previous_Injury_Type"] = df["Previous_Injury_Type"].map(injury_type_mapping).fillna(0).astype(int)

# Replace 0 with 0.1 in Total_Weekly_Training_Hours to avoid division by zero in the ratio features below
df["Total_Weekly_Training_Hours"] = df["Total_Weekly_Training_Hours"].replace(0, 0.1)

# Create derived features
df["Intensity_Ratio"] = df["High_Intensity_Training_Hours"] / df["Total_Weekly_Training_Hours"]
df["Recovery_Per_Training"] = df["Recovery_Time_Between_Sessions"] / df["Total_Weekly_Training_Hours"]

# Create Injury_Occurred column probabilistically based on Injury_Risk_Level
print("Injury_Risk_Level distribution:\n", df["Injury_Risk_Level"].value_counts())

# Define probabilities of injury occurrence based on risk level
injury_probabilities = {
    "High": 0.95,  # 95% chance of injury
    "Medium": 0.5, # 50% chance of injury
    "Low": 0.05    # 5% chance of injury
}

# Generate Injury_Occurred using random sampling based on Injury_Risk_Level
df["Injury_Occurred"] = df["Injury_Risk_Level"].apply(
    lambda x: np.random.binomial(1, injury_probabilities[x])
)

# Check the distribution of Injury_Occurred
print("Injury_Occurred distribution (full dataset):\n", df["Injury_Occurred"].value_counts())

# Ensure both classes are present
if len(df["Injury_Occurred"].unique()) < 2:
    raise ValueError("Injury_Occurred contains only one class after probabilistic assignment. Adjust probabilities or dataset.")

# Define features
features = [
    "Age", "Gender", "Sport_Type", "Experience_Level", "Flexibility_Score",
    "Total_Weekly_Training_Hours", "High_Intensity_Training_Hours", "Strength_Training_Frequency",
    "Recovery_Time_Between_Sessions", "Training_Load_Score", "Sprint_Speed", "Endurance_Score",
    "Agility_Score", "Fatigue_Level", "Previous_Injury_Count", "Previous_Injury_Type",
    "Intensity_Ratio", "Recovery_Per_Training"
]

# Prepare features and target
X = df[features]
y_outcome = df["Injury_Occurred"]

print("Features prepared:", features)

Injury_Risk_Level distribution:
 Injury_Risk_Level
Medium    6016
Low       2827
High      1157
Name: count, dtype: int64
Injury_Occurred distribution (full dataset):
 Injury_Occurred
0    5801
1    4199
Name: count, dtype: int64
Features prepared: ['Age', 'Gender', 'Sport_Type', 'Experience_Level', 'Flexibility_Score', 'Total_Weekly_Training_Hours', 'High_Intensity_Training_Hours', 'Strength_Training_Frequency', 'Recovery_Time_Between_Sessions', 'Training_Load_Score', 'Sprint_Speed', 'Endurance_Score', 'Agility_Score', 'Fatigue_Level', 'Previous_Injury_Count', 'Previous_Injury_Type', 'Intensity_Ratio', 'Recovery_Per_Training']
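
As a sanity check on the simulated labels, the expected injury count can be worked out from the risk-level counts printed above and compared with the observed 4,199 injuries. The sketch below also shows one vectorized way to draw such labels with a local NumPy generator. The names `risk_counts`, `rng`, `risk_levels`, and `simulated` are illustrative, and this sampler will not reproduce the exact draws of the cell above.

```python
import numpy as np
import pandas as pd

# Risk-level counts and injury probabilities copied from the cell above
risk_counts = {"Medium": 6016, "Low": 2827, "High": 1157}
injury_probabilities = {"High": 0.95, "Medium": 0.5, "Low": 0.05}

# Expected number of injuries = sum over levels of count * P(injury | level)
expected_injuries = sum(n * injury_probabilities[level] for level, n in risk_counts.items())
expected_rate = expected_injuries / sum(risk_counts.values())
print(f"Expected injuries: {expected_injuries:.1f}")
print(f"Expected injury rate: {expected_rate:.5f}")

# Vectorized sampling (illustrative): one Bernoulli draw per row
rng = np.random.default_rng(42)
risk_levels = pd.Series(np.repeat(list(risk_counts), list(risk_counts.values())))
simulated = rng.binomial(1, risk_levels.map(injury_probabilities).to_numpy())
print("Simulated injury rate:", simulated.mean())
```

The expectation is about 4,248.5 injuries. The observed 4,199 is roughly 50 lower, which is within ordinary sampling variation for these probabilities (the standard deviation of the count is about 41).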


In [4]:
# ---------------------- 3. Split Data for Calibration ----------------------
# Hold out 20% of the data as the calibration set (the remaining 80% is not used in this notebook)
X_train, X_calib, y_train_outcome, y_calib_outcome = train_test_split(
    X, y_outcome, test_size=0.2, stratify=y_outcome, random_state=42
)

print("Calibration set size:", X_calib.shape)
print("Calibration Injury_Occurred distribution:\n", y_calib_outcome.value_counts())

# Check if y_calib_outcome has both classes
if len(y_calib_outcome.unique()) < 2:
    raise ValueError("Calibration set contains only one class in Injury_Occurred. Cannot proceed with calibration.")

Calibration set size: (2000, 18)
Calibration Injury_Occurred distribution:
 Injury_Occurred
0    1160
1     840
Name: count, dtype: int64


In [5]:
# ---------------------- 4. Predict Probabilities with Ensemble ----------------------
# Use the ensemble (average of RandomForest and XGBoost) to predict probabilities
rf_probs = rf_model.predict_proba(X_calib)
xgb_probs = xgb_model.predict_proba(X_calib)

# Average the probabilities (same as in predict.py)
avg_probs = (rf_probs + xgb_probs) / 2

# Get the predicted class and confidence
predicted_classes = np.argmax(avg_probs, axis=1)
confidences = np.max(avg_probs, axis=1)

print("Sample of predicted confidences:\n", confidences[:5])
print("Sample of predicted classes:\n", predicted_classes[:5])

# Inspect the relationship between predicted classes and Injury_Occurred
calib_df = pd.DataFrame({
    "Predicted_Class": predicted_classes,
    "Injury_Occurred": y_calib_outcome
})
print("Distribution of Injury_Occurred by Predicted Class:\n", calib_df.groupby("Predicted_Class")["Injury_Occurred"].value_counts())

Sample of predicted confidences:
 [0.98814357 0.9881264  0.97794891 0.97335743 0.97383904]
Sample of predicted classes:
 [2 2 2 1 1]
Distribution of Injury_Occurred by Predicted Class:
 Predicted_Class  Injury_Occurred
0                1                  216
                 0                   14
1                0                  540
                 1                   40
2                0                  606
                 1                  584
Name: count, dtype: int64
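
The counts above give an empirical injury rate for each predicted class, which helps confirm which class index corresponds to which risk level. The sketch below assumes the training notebooks encoded `Injury_Risk_Level` with `LabelEncoder` (alphabetical order). Those notebooks are not shown here, so treat the index-to-label mapping as an assumption to verify against the models' `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Assumption: labels were encoded with LabelEncoder, which sorts them alphabetically
encoder = LabelEncoder().fit(["Low", "Medium", "High"])
class_names = [str(name) for name in encoder.classes_]
print(class_names)

# (injured, not injured) counts per predicted class, copied from the output above
counts = {0: (216, 14), 1: (40, 540), 2: (584, 606)}

# Empirical injury rate = injured / all rows predicted as that class
empirical_rates = {}
for class_index, (injured, not_injured) in counts.items():
    empirical_rates[class_names[class_index]] = injured / (injured + not_injured)

for name, rate in empirical_rates.items():
    print(f"{name}: {rate:.3f}")
```

Under that assumption the rates are roughly 0.939 for High, 0.069 for Low, and 0.491 for Medium. These are close to the simulation probabilities of 0.95, 0.05, and 0.5, which is consistent with the column order High, Low, Medium.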


In [6]:
# ---------------------- 5. Map Probabilities to Injury Probabilities ----------------------
# Use the raw probabilities as features for calibration.
# Column order follows the models' class indices (0 = High, 1 = Low, 2 = Medium here).
calib_data = pd.DataFrame({
    "prob_high": avg_probs[:, 0],  # Probability of High risk
    "prob_low": avg_probs[:, 1],   # Probability of Low risk
    "prob_medium": avg_probs[:, 2] # Probability of Medium risk
})

# Fit logistic regression with regularization
lr_calib = LogisticRegression(max_iter=1000, penalty='l2', C=1.0)
lr_calib.fit(calib_data, y_calib_outcome)

# Predict calibrated probabilities
calibrated_probs = lr_calib.predict_proba(calib_data)[:, 1]

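# Evaluate calibration using Brier score.
# Note: this is an in-sample estimate, because the calibrator is scored on the data it was fitted on.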
brier_score = brier_score_loss(y_calib_outcome, calibrated_probs)
print(f"Brier Score (lower is better): {brier_score:.4f}")

# Inspect the mapping
calib_results = pd.DataFrame({
    "Predicted_Class": predicted_classes,
    "Confidence": confidences,
    "Calibrated_Probability": calibrated_probs,
    "True_Injury_Occurred": y_calib_outcome
})
print("Average Calibrated Probability by Predicted Class:\n", calib_results.groupby("Predicted_Class")["Calibrated_Probability"].mean())

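# Plot original ensemble confidence against the calibrated injury probability.
# This is a diagnostic scatter, not a reliability diagram: the x-axis is the
# top-class confidence, so the dashed y = x line is only a visual reference.
plt.figure(figsize=(8, 6))
plt.scatter(confidences, calibrated_probs, alpha=0.5)
plt.plot([0, 1], [0, 1], 'k--', label="y = x (reference)")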
plt.xlabel("Original Confidence (Ensemble)")
plt.ylabel("Calibrated Injury Probability (P(Injury_Occurred=1))")
plt.title("Calibration Curve: Confidence vs. Injury Probability")
plt.legend()
plt.savefig(f"{model_dir}/calibration_curve.png")
plt.close()

print(f"Calibration curve saved to {model_dir}/calibration_curve.png")

Brier Score (lower is better): 0.1734
Average Calibrated Probability by Predicted Class:
 Predicted_Class
0    0.923188
1    0.075402
2    0.490621
Name: Calibrated_Probability, dtype: float64
Calibration curve saved to /content/model/calibration_curve.png
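
A Brier score is easier to read next to reference points. The sketch below computes two of them from numbers printed earlier in this notebook: the score of always predicting the calibration-set base rate, and an approximate floor obtained by predicting each row's exact simulation probability (using full-dataset risk-level shares, so it is only approximate for the calibration split). It then shows how `sklearn.calibration.calibration_curve` produces the binned points of a reliability diagram, using synthetic predictions since the real ones are not available here. All variable names are illustrative.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Reference 1: always predict the calibration-set base rate (840 injuries in 2000 rows).
# For a constant prediction p equal to the base rate, the Brier score is p * (1 - p).
base_rate = 840 / 2000
baseline_brier = base_rate * (1 - base_rate)

# Reference 2: predict the simulation probability for each risk level.
# Each level contributes share * p * (1 - p), the variance of its Bernoulli labels.
shares = {"High": 1157 / 10000, "Medium": 6016 / 10000, "Low": 2827 / 10000}
sim_probs = {"High": 0.95, "Medium": 0.5, "Low": 0.05}
floor_brier = sum(shares[k] * sim_probs[k] * (1 - sim_probs[k]) for k in shares)

print(f"Base-rate Brier score: {baseline_brier:.4f}")
print(f"Approximate Brier floor: {floor_brier:.4f}")

# Reliability diagram points on synthetic, perfectly calibrated predictions
rng = np.random.default_rng(0)
toy_probs = rng.uniform(0, 1, size=5000)
toy_outcomes = rng.binomial(1, toy_probs)
frac_positive, mean_predicted = calibration_curve(toy_outcomes, toy_probs, n_bins=10)

# For calibrated predictions the two arrays should nearly coincide in every bin
print(np.round(frac_positive - mean_predicted, 3))
```

Against these reference points, the reported 0.1734 sits close to the approximate floor of about 0.1693 and well below the base-rate value of about 0.2436. A held-out split would likely give a slightly more conservative estimate than the in-sample score.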


In [7]:
# ---------------------- 6. Save Calibration Model ----------------------
# Save the logistic regression model for use in predict.py
import os
os.makedirs(model_dir, exist_ok=True)
joblib.dump(lr_calib, f"{model_dir}/likelihood_calibrator.pkl")
print(f"Calibration model saved to {model_dir}/likelihood_calibrator.pkl")
print("Note: You are running this in Google Colab. The file is saved to /content/model/likelihood_calibrator.pkl.")
print("Please download it and move it to C:/Users/amrHa/Desktop/final 3/deployment/model/ for deployment.")
print("Alternatively, if running locally, update model_dir to 'C:/Users/amrHa/Desktop/final 3/deployment/model'.")

Calibration model saved to /content/model/likelihood_calibrator.pkl
Note: You are running this in Google Colab. The file is saved to /content/model/likelihood_calibrator.pkl.
Please download it and move it to C:/Users/amrHa/Desktop/final 3/deployment/model/ for deployment.
Alternatively, if running locally, update model_dir to 'C:/Users/amrHa/Desktop/final 3/deployment/model'.
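
For orientation, here is a minimal sketch of how a script such as `predict.py` might apply the saved calibrator. The contents of `predict.py` are not shown in this notebook, so the function `calibrated_injury_probability` and the constant `CALIB_COLUMNS` are hypothetical. The toy calibrator is fitted on synthetic data purely so the example runs on its own; in deployment it would be replaced by `joblib.load` on `likelihood_calibrator.pkl`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Column names must match those used when the calibrator was fitted
CALIB_COLUMNS = ["prob_high", "prob_low", "prob_medium"]

def calibrated_injury_probability(rf_probs, xgb_probs, calibrator):
    """Average the two models' class probabilities, then map them to P(injury)."""
    avg_probs = (np.asarray(rf_probs) + np.asarray(xgb_probs)) / 2
    calib_input = pd.DataFrame(avg_probs, columns=CALIB_COLUMNS)
    return calibrator.predict_proba(calib_input)[:, 1]

# Toy stand-in for the saved calibrator (synthetic data, illustration only)
rng = np.random.default_rng(0)
toy_probs = rng.dirichlet([1, 1, 1], size=300)
toy_rates = toy_probs @ np.array([0.95, 0.05, 0.5])
toy_outcome = rng.binomial(1, toy_rates)
toy_calibrator = LogisticRegression(max_iter=1000).fit(
    pd.DataFrame(toy_probs, columns=CALIB_COLUMNS), toy_outcome
)

# Two example athletes: one predicted mostly High risk, one mostly Low risk
rf_example = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]])
xgb_example = np.array([[0.80, 0.10, 0.10], [0.10, 0.80, 0.10]])
injury_probs = calibrated_injury_probability(rf_example, xgb_example, toy_calibrator)
print(injury_probs)
```

Passing a DataFrame with the same column names as at fit time avoids scikit-learn's feature-name mismatch warning. The column order also has to stay High, Low, Medium to match the calibrator's training input.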
