Base repo of data: https://github.com/mlsedigital/SPL-open-data

Great xyz data dict: https://www.inpredictable.com/2021/01/nba-player-shooting-motions-data-dump.html


## Ultimate Goals:
* Find optimal shooting form/release point
  * Include the feedback system based on optimal shooting form and release point
  * Use a shooting meter like the one in NBA 2K to show good vs. bad releases and body movements
  * Base the meter on motions that:
    * produce good results (a make, or enough kinetic energy for a near make)
    * stay within boundaries where the player is comfortable AND in correct shooting form
    * stay within player-specific ranges (e.g., this inch to this inch is the best motion area for this player)
    * can be compared against popular shooters, to recommend shooting more like someone such as Klay Thompson or Buddy Hield (check whether the xyz data dictionary includes good shooters to base recommendations on)
* Exhaustion levels and optimal energy max and min
* Shot outcome prediction


1. Finding Optimal Shooting Form/Release Point

    Detailed Analysis of Body Movements: Include analysis of body kinematics (positions and velocities of limbs) at the time of release. Use joint coordinates to extract features like elbow angle, shoulder rotation, and wrist flexion at the release frame.
    Machine Learning for Form Classification: Train a model to classify "good" vs. "bad" shooting forms using successful shot data (makes or near-makes) and compare them against unsuccessful shots.
    Feedback Mechanism: Implement a feedback system that suggests adjustments based on the comparison of current shooting mechanics to historical optimal ones.
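
As a rough illustration of the joint-coordinate features mentioned above, the angle at a joint can be computed from three 3D points with a dot product. This is only a sketch: the function name `joint_angle_deg` and the sample coordinates are hypothetical, not taken from the dataset.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b, formed by the points a-b-c (each an x, y, z position)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against tiny floating-point overshoot outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# Hypothetical release-frame coordinates (made-up values, arbitrary units)
shoulder = (0.0, 0.0, 1.5)
elbow = (0.3, 0.0, 1.5)
wrist = (0.3, 0.0, 1.8)

elbow_angle = joint_angle_deg(shoulder, elbow, wrist)
print(f"Elbow angle at release: {elbow_angle:.1f} degrees")
```

The same helper would work for the knee (hip-knee-ankle) or wrist, given the matching joint triples.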

2. Shooting Meter Simulation (NBA 2K-Style)

    Visual Feedback System: Develop a visualization tool that overlays a “meter” on shot video frames, indicating the quality of the release in real-time. This can be based on a scoring function derived from the shooting form features and ball dynamics.
    Comfort Zone Identification: Use clustering algorithms (e.g., K-means) to identify comfortable ranges of motion based on historical shooting data for individual players.
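
A minimal sketch of the clustering idea, assuming the comfort zone is summarized as the min/max range of each K-means cluster. The data here is synthetic, and the choice of two clusters and two angle features is an assumption for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic (elbow_release_angle, knee_release_angle) pairs forming two motion styles
group_a = rng.normal(loc=[95.0, 150.0], scale=2.0, size=(40, 2))
group_b = rng.normal(loc=[120.0, 165.0], scale=2.0, size=(40, 2))
angles = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(angles)

# Summarize each cluster as a per-feature (min, max) range
comfort_zones = {}
for label in range(kmeans.n_clusters):
    members = angles[kmeans.labels_ == label]
    comfort_zones[label] = (members.min(axis=0), members.max(axis=0))

for label, (low, high) in comfort_zones.items():
    print(f"Cluster {label}: elbow {low[0]:.1f}-{high[0]:.1f}, knee {low[1]:.1f}-{high[1]:.1f}")
```

With real data the number of clusters would need to be chosen per player, for example by inspecting inertia or silhouette scores.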

3. Shot Outcome Prediction

    Feature Engineering for Outcome Modeling: Include additional features such as:
        Kinetic Energy Calculation: $KE = \frac{1}{2} m v^2$, to see if the ball had sufficient energy for a make.
        Entry Angle and Trajectory Analysis: Analyze whether the angle at which the ball approaches the hoop aligns with optimal scoring trajectories.
    Model Development: Build a machine learning model (e.g., logistic regression, XGBoost) that predicts shot outcomes based on ball speed, entry angle, release point, and body dynamics.
    Training with Data Augmentation: Use synthetic data generation to include a wide variety of shot scenarios.
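
The two engineered features above can be sketched as follows. The ball mass is an assumption (roughly that of a regulation men's basketball), SI units are assumed, and the function names are hypothetical.

```python
import math

BALL_MASS_KG = 0.62  # assumption: approximate mass of a regulation men's basketball

def kinetic_energy(mass_kg, speed_m_s):
    """Kinetic energy in joules: 0.5 * m * v^2."""
    return 0.5 * mass_kg * speed_m_s ** 2

def entry_angle_deg(v_horizontal, v_vertical):
    """Descent angle below the horizontal, in degrees.

    Positive when the ball is falling (v_vertical < 0), negative when rising.
    """
    return math.degrees(math.atan2(-v_vertical, abs(v_horizontal)))

release_ke = kinetic_energy(BALL_MASS_KG, 7.0)
angle_at_rim = entry_angle_deg(3.0, -3.0)
print(f"KE at release: {release_ke:.2f} J, entry angle: {angle_at_rim:.1f} degrees")
```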

4. Exhaustion Levels and Energy Management

    Tracking Player Movements: Use the coordinates of major joints (e.g., knees, hips) to estimate a player's exertion level using metrics like the average vertical displacement over time.
    Velocity and Acceleration Patterns: Monitor changes in the velocity and acceleration of the player's body parts throughout a game to detect fatigue.
    Feature Integration: Create features such as average speed and distance covered leading up to the shot to include in the predictive model.
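
One possible way to derive the velocity and acceleration patterns is finite differences over the joint positions. This is a sketch only: the frame rate, the function name, and the sample trajectory are assumptions.

```python
import numpy as np

def speed_and_acceleration(positions, fps):
    """Per-frame speed and rate of change of speed for one joint.

    positions: array of shape (n_frames, 3) holding x, y, z per frame.
    fps: capture rate in frames per second.
    """
    positions = np.asarray(positions, dtype=float)
    dt = 1.0 / fps
    velocity = np.gradient(positions, dt, axis=0)  # central differences per axis
    speed = np.linalg.norm(velocity, axis=1)
    acceleration = np.gradient(speed, dt)
    return speed, acceleration

fps = 30  # hypothetical capture rate
t = np.arange(10) / fps
# Made-up hip trajectory moving at a constant 2 units/s along x
hip = np.column_stack([2.0 * t, np.zeros_like(t), np.ones_like(t)])

speed, accel = speed_and_acceleration(hip, fps)
print("Mean speed:", speed.mean(), "Mean acceleration:", accel.mean())
```

A drop in mean or peak speed across successive shots could then serve as one fatigue indicator.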

5. Shot Simulator

    Simulated Shot to a Nearby Hoop: If these metrics are sufficient, we can simulate the shot toward a nearby hoop. The hoop would sit at a set location (or, when using spatial, a location we pick), and YOLO/OpenCV would virtualize the experience: when the player makes a shooting motion, the tool shows where the ball might go, even without a ball in hand.
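
A very simplified sketch of the simulation idea, treating the ball as a drag-free projectile. The function name and inputs are hypothetical, and spin, air resistance, and rim/backboard contact are ignored.

```python
import math

G = 9.81              # gravitational acceleration, m/s^2
RIM_HEIGHT_M = 3.048  # 10 ft rim height

def horizontal_distance_at_rim(release_height_m, speed_m_s, launch_angle_deg):
    """Horizontal distance covered when the ball descends through rim height.

    Returns None if the ball never reaches rim height.
    """
    theta = math.radians(launch_angle_deg)
    vx = speed_m_s * math.cos(theta)
    vy = speed_m_s * math.sin(theta)
    # Solve release_height + vy*t - 0.5*G*t^2 = RIM_HEIGHT for the later (descending) root.
    discriminant = vy ** 2 - 2.0 * G * (RIM_HEIGHT_M - release_height_m)
    if discriminant < 0:
        return None
    t_down = (vy + math.sqrt(discriminant)) / G
    return vx * t_down

print(horizontal_distance_at_rim(2.3, 7.3, 52.0))
```

Comparing the returned distance with the chosen hoop location would give a first estimate of short, long, or on-target.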

# ML Dataset Pipeline:
    - categorize data
    - multicollinearity check + feature importances = feature selection
    - preprocessing suggestions, datasets, preprocessing with strict guidelines
    - ML model selection
    - ML model testing
    - ML data inverse transform
    - ML model prediction
    - live prediction, re-calculating the dataset formulas when new data arrives
    - get optimal ranges for the angles (knee/wrist/elbow) and input them into the meters
    - overlay the meters on a video to show optimal angles
    - add in ML classification with re-calculation
    - live camera feed with this?
      - streamlit example of how this works with li
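
For the "optimal ranges for the angles" step, one simple option is to take quantiles of an angle over made shots only. The sketch below uses made-up data, and it assumes `result == 1` marks a make; the helper name `optimal_range` and the 10%/90% cut-offs are hypothetical.

```python
import pandas as pd

# Made-up shots; result == 1 is assumed to mean a made shot
shots = pd.DataFrame({
    "elbow_release_angle": [88.0, 92.0, 95.0, 97.0, 101.0, 110.0, 75.0, 120.0],
    "result": [1, 1, 1, 1, 1, 0, 0, 0],
})

def optimal_range(df, column, target="result", lower_q=0.1, upper_q=0.9):
    """Return the (lower, upper) quantile range of `column` among made shots."""
    made = df.loc[df[target] == 1, column]
    return float(made.quantile(lower_q)), float(made.quantile(upper_q))

low, high = optimal_range(shots, "elbow_release_angle")
print(f"Suggested elbow release range: {low:.1f} to {high:.1f} degrees")
```

The resulting bounds could then feed the meter as its "good" zone for that angle.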


In [8]:
import os
import sys
from pathlib import Path

print("Current working directory:", os.getcwd())
print("sys.path:", sys.path)

import ml.feature_selection.multicollinearity_checker
import ml.feature_selection.feature_importance_calculator

from ml.feature_selection.feature_importance_calculator import manage_features
from datapreprocessor import DataPreprocessor
from ml.train import bayes_best_model_train
from ml.train_utils.train_utils import (
    evaluate_model, save_model, load_model, plot_decision_boundary,
    tune_random_forest, tune_xgboost, tune_decision_tree
)

Current working directory: c:\Users\ghadf\vscode_projects\docker_projects\spl_freethrow_biomechanics_analysis_ml_prediction\notebooks\freethrow_predictions
sys.path: ['c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\python310.zip', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\DLLs', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\lib', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions', '', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\lib\\site-packages', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\lib\\site-packages\\win32', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\lib\\site-packages\\win32\\lib', 'c:\\Users\\ghadf\\anaconda3\\envs\\data_science_ft_bio_predictions\\lib\\site-packages\\Pythonwin']


In [9]:
%%writefile ml/feature_selection/multicollinearity_checker.py

import pandas as pd
import numpy as np

def check_multicollinearity(df, threshold=0.8, debug=False):
    """
    Identifies pairs of features with correlation above the specified threshold.
    Args:
        df (DataFrame): DataFrame containing numerical features.
        threshold (float): Correlation coefficient threshold.
        debug (bool): If True, prints debugging information.
    Returns:
        DataFrame: Pairs of features with high correlation.
    """
    # Select only numerical columns
    numeric_df = df.select_dtypes(include=[np.number])
    
    if debug:
        print(f"Computing correlation matrix for {len(numeric_df.columns)} numerical features...")

    corr_matrix = numeric_df.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Identify highly correlated features
    highly_correlated = [
        (column, idx, upper.loc[column, idx])
        for column in upper.columns
        for idx in upper.index
        if (upper.loc[column, idx] > threshold)
    ]

    multicollinearity_df = pd.DataFrame(highly_correlated, columns=['Feature1', 'Feature2', 'Correlation'])

    if debug:
        if not multicollinearity_df.empty:
            print(f"Found {len(multicollinearity_df)} pairs of highly correlated features:")
            print(multicollinearity_df)
        else:
            print("No highly correlated feature pairs found.")

    return multicollinearity_df



if __name__ == "__main__":
    import pickle
    # Load the category bin configuration
    with open('../../data/model/pipeline/category_bin_config.pkl', 'rb') as f:
        loaded_category_bin_config = pickle.load(f)

    file_path = "../../data/processed/final_ml_dataset.csv"
    #import ml dataset from spl_dataset_prep
    final_ml_df = pd.read_csv(file_path)
    
    # Feature selection based on multi collinearity and random forest importance selection
    target_variable = 'result'
    correlation_threshold = 0.8
    debug = True

    # Remove columns to address collinearity
    drop_features = [ 'L_KNEE_min_power', 'L_HIP_max_power']
    
    # Step 1: Check for multicollinearity
    print("\nChecking for Multicollinearity...")
    multicollinearity_df = check_multicollinearity(final_ml_df, threshold=correlation_threshold, debug=debug)

    # Step 2: Handle multicollinearity
    if not multicollinearity_df.empty:
        for index, row in multicollinearity_df.iterrows():
            feature1, feature2, correlation = row['Feature1'], row['Feature2'], row['Correlation']
            print(f"High correlation ({correlation}) between '{feature1}' and '{feature2}'.")
    else:
        print("No multicollinearity issues detected.")


Overwriting ml/feature_selection/multicollinearity_checker.py
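
To see the upper-triangle masking from the cell above in action, here is a small standalone sketch on synthetic data. It uses `stack()` in place of the nested loops, which is a stylistic alternative and not the notebook's implementation; the column names `a`, `b`, `c` are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.01, size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                         # independent noise
})

corr = demo.corr().abs()
# Keep only entries strictly above the diagonal so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .rename_axis(["Feature1", "Feature2"])
    .reset_index(name="Correlation")
)
high = pairs[pairs["Correlation"] > 0.8]
print(high)
```

Only the `a`/`b` pair should exceed the 0.8 threshold here, since `c` is independent of both.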


In [10]:
%%writefile ml/feature_selection/feature_importance_calculator.py

import logging
import pickle
from typing import Any, Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def calculate_feature_importance(df, target_variable, n_estimators=100, random_state=42, debug=False):
    """
    Calculates feature importance using a Random Forest model.
    Args:
        df (DataFrame): Input DataFrame.
        target_variable (str): Target column name.
        n_estimators (int): Number of trees in the forest.
        random_state (int): Random seed.
        debug (bool): If True, prints debugging information.
    Returns:
        DataFrame: Feature importances.
    """
    X = df.drop(columns=[target_variable])
    y = df[target_variable]

    # Encode target variable if necessary
    if y.dtype == 'object' or str(y.dtype) == 'category':
        if debug:
            print(f"Target variable '{target_variable}' is categorical. Encoding labels.")
        le = LabelEncoder()
        y = le.fit_transform(y)

    # Separate categorical and numerical features
    categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
    numeric_cols = X.select_dtypes(exclude=['object', 'category']).columns.tolist()

    if debug:
        print(f"Categorical columns: {categorical_cols}")
        print(f"Numerical columns: {numeric_cols}")

    # Encode categorical features if present
    if categorical_cols:
        ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        X_encoded = ohe.fit_transform(X[categorical_cols])
        X_encoded_df = pd.DataFrame(X_encoded, columns=ohe.get_feature_names_out(categorical_cols), index=X.index)
        X = pd.concat([X[numeric_cols], X_encoded_df], axis=1)

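    # Select model: a float target is treated as regression; anything else
    # (including integer-coded class labels, e.g. a make/miss 'result') as classification.
    # Note: this assumes integer targets are class labels, not counts.
    model = (
        RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
        if pd.api.types.is_float_dtype(y) else
        RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    )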

    if debug:
        print(f"Training Random Forest model with {n_estimators} estimators...")

    model.fit(X, y)

    # Calculate feature importances
    feature_importances = pd.DataFrame({
        'Feature': X.columns,
        'Importance': model.feature_importances_
    }).sort_values(by='Importance', ascending=False).reset_index(drop=True)

    if debug:
        print("Feature Importances:")
        print(feature_importances)

    return feature_importances


def manage_features(
    mode: str,
    features_df: Optional[pd.DataFrame] = None,
    ordinal_categoricals: Optional[List[str]] = None,
    nominal_categoricals: Optional[List[str]] = None,
    numericals: Optional[List[str]] = None,
    y_variable: Optional[str] = None,
    paths: Optional[Dict[str, str]] = None
) -> Optional[Dict[str, Any]]:
    """
    Save or load features and metadata.

    Parameters:
        mode (str): Operation mode - 'save' or 'load'.
        features_df (pd.DataFrame, optional): DataFrame containing features to save (required for 'save').
        ordinal_categoricals (list, optional): List of ordinal categorical features.
        nominal_categoricals (list, optional): List of nominal categorical features.
        numericals (list, optional): List of numerical features.
        y_variable (str, optional): Target variable.
        paths (dict, optional): Dictionary mapping each item to its file path.

    Returns:
        If mode is 'load', returns a dictionary with loaded items.
        If mode is 'save', returns None.
    """

    # Define default paths if not provided
    default_paths = {
        'features': 'final_ml_df_selected_features_columns_test.pkl',
        'ordinal_categoricals': 'ordinal_categoricals.pkl',
        'nominal_categoricals': 'nominal_categoricals.pkl',
        'numericals': 'numericals.pkl',
        'y_variable': 'y_variable.pkl'
    }

    # Update default paths with any provided paths
    if paths:
        default_paths.update(paths)

    try:
        if mode == 'save':
            if features_df is None:
                raise ValueError("features_df must be provided in 'save' mode.")

            # Prepare data to save
            data_to_save = {
                'features': features_df.columns.tolist(),
                'ordinal_categoricals': ordinal_categoricals,
                'nominal_categoricals': nominal_categoricals,
                'numericals': numericals,
                'y_variable': y_variable
            }

            # Iterate and save each item
            for key, path in default_paths.items():
                with open(path, 'wb') as f:
                    pickle.dump(data_to_save[key], f)
                print(f"✅ {key.replace('_', ' ').capitalize()} saved to {path}")

        elif mode == 'load':
            loaded_data = {}

            # Iterate and load each item
            for key, path in default_paths.items():
                with open(path, 'rb') as f:
                    loaded_data[key] = pickle.load(f)
                print(f"✅ {key.replace('_', ' ').capitalize()} loaded from {path}")

            return loaded_data

        else:
            raise ValueError("Mode should be either 'save' or 'load'.")

    except Exception as e:
        print(f"❌ Error during '{mode}' operation: {e}")
        if mode == 'load':
            return {key: None for key in default_paths.keys()}
        

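if __name__ == "__main__":
    # check_multicollinearity is called below, so it must be imported.
    # The path assumes the `ml` package is importable, as it is from the notebook's working directory.
    from ml.feature_selection.multicollinearity_checker import check_multicollinearity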

    final_ml_features_path = '../../data/model/pipeline/final_ml_df_selected_features_columns_test.pkl'
    final_ml_ordinal_categoricals_path = '../../data/model/pipeline/features_info/ordinal_categoricals.pkl'
    final_ml_nominal_categoricals_path = '../../data/model/pipeline/features_info/nominal_categoricals.pkl'
    final_ml_numericals_features_path = '../../data/model/pipeline/features_info/numericals.pkl'
    final_ml_y_variable_pkl_path = '../../data/model/pipeline/features_info/y_variable.pkl'
    final_ml_dataset_path = ''
    
    # Load the category bin configuration
    with open('../../data/model/pipeline/category_bin_config.pkl', 'rb') as f:
        loaded_category_bin_config = pickle.load(f)

    file_path = "../../data/processed/final_ml_dataset.csv"
    #import ml dataset from spl_dataset_prep
    final_ml_df = pd.read_csv(file_path)
    
    # Feature selection based on multi collinearity and random forest importance selection
    target_variable = 'result'
    correlation_threshold = 0.8
    debug = True

    
    # Step 1: Check for multicollinearity
    print("\nChecking for Multicollinearity...")
    multicollinearity_df = check_multicollinearity(final_ml_df, threshold=correlation_threshold, debug=debug)

    # Step 2: Handle multicollinearity
    if not multicollinearity_df.empty:
        for index, row in multicollinearity_df.iterrows():
            feature1, feature2, correlation = row['Feature1'], row['Feature2'], row['Correlation']
            print(f"High correlation ({correlation}) between '{feature1}' and '{feature2}'.")
    else:
        print("No multicollinearity issues detected.")


    # Remove columns to address collinearity
    drop_features = [
        'trial_id', 'player_participant_id', 'landing_y', 'landing_x', 'entry_angle', 'shot_id',
        'L_KNEE_avg_power', 'L_WRIST_energy_std', 'R_WRIST_energy_max', 
        'R_ANKLE_energy_mean', 'R_5THFINGER_energy_std', 'R_KNEE_avg_power', 'L_1STFINGER_max_power', 
        'L_5THFINGER_energy_max', 'L_WRIST_max_power', 'R_HIP_energy_std', 'L_1STFINGER_energy_max', 
        'R_ANKLE_energy_max', 'R_ELBOW_energy_max', 'R_ANKLE_energy_std', 'L_WRIST_energy_max', 
        'player_estimated_hand_length_cm', 'player_estimated_standing_reach_cm', 
        'player_estimated_wingspan_cm', 'player_weight__in_kg', 'L_KNEE_energy_std', 'L_HIP_energy_max', 
        'L_ANKLE_energy_max', 'L_WRIST_std_power', 'L_ELBOW_std_power', 'R_KNEE_max_power', 
        'L_ELBOW_avg_power', 'R_ELBOW_min_power', 'L_WRIST_min_power', 'R_HIP_energy_mean', 
        'L_ELBOW_energy_max', 'L_ELBOW_min_power', 'R_1STFINGER_min_power', 'L_ANKLE_min_power', 
        'L_1STFINGER_avg_power', 'R_ANKLE_std_power', 'R_5THFINGER_avg_power', 'L_1STFINGER_energy_mean', 
        'R_HIP_max_power', 'R_WRIST_avg_power', 'R_ELBOW_energy_mean', 'L_WRIST_avg_power', 
        'L_1STFINGER_std_power', 'L_KNEE_energy_max', 'L_WRIST_energy_mean', 'R_KNEE_energy_std', 
        'L_HIP_energy_std', 'L_KNEE_energy_mean', 'R_WRIST_energy_mean', 'L_ELBOW_max_power', 
        'R_WRIST_energy_std', 'L_ANKLE_std_power', 'L_HIP_energy_mean', 'L_ELBOW_energy_mean', 
        'R_HIP_avg_power', 'L_HIP_std_power', 'R_KNEE_std_power', 'L_ANKLE_energy_std', 
        'release_frame_time', 'R_ANKLE_avg_power', 'L_ANKLE_max_power', 'L_5THFINGER_energy_std', 
        'R_WRIST_min_power', 'R_1STFINGER_energy_mean', 'R_ELBOW_energy_std', 'R_HIP_std_power', 
        'R_KNEE_energy_max', 'R_WRIST_std_power', 'L_1STFINGER_energy_std', 'L_HIP_avg_power', 
        'R_5THFINGER_energy_mean', 'R_ANKLE_max_power', 'L_ANKLE_avg_power', 'R_5THFINGER_max_power', 
        'R_5THFINGER_energy_max', 'L_5THFINGER_min_power', 'L_ELBOW_energy_std', 
        'R_1STFINGER_energy_max', 'R_KNEE_min_power', 'R_1STFINGER_energy_std', 
        'R_5THFINGER_std_power', 'L_1STFINGER_min_power', 'R_ELBOW_max_power', 'L_HIP_min_power', 
        'L_5THFINGER_std_power', 'R_1STFINGER_max_power', 'R_KNEE_energy_mean', 'L_5THFINGER_avg_power', 
        'L_5THFINGER_max_power', 'R_HIP_min_power', 'L_KNEE_max_power', 'R_5THFINGER_min_power', 
        'R_1STFINGER_std_power', 'R_ELBOW_avg_power', 'L_ANKLE_energy_mean', 'R_ELBOW_std_power', 
        'L_5THFINGER_energy_mean', 'R_1STFINGER_avg_power', 'R_HIP_energy_max', 'L_KNEE_std_power',
        'R_ANKLE_min_power', 'L_KNEE_min_power', 'L_HIP_max_power'
    ]
    
    # Step 2b: Drop the features listed above to address collinearity
    if not final_ml_df.empty:
        # errors='ignore' skips any listed column that is not present in the dataset
        final_ml_df = final_ml_df.drop(columns=drop_features, errors='ignore')
        print(f"Requested drop of {len(drop_features)} features: {', '.join(drop_features)}")
    else:
        print("Dataset is empty; nothing to drop.")

    # Step 3: Calculate feature importance
    print("\nCalculating Feature Importance...")
    feature_importances = calculate_feature_importance(
        final_ml_df, target_variable=target_variable, n_estimators=100, random_state=42, debug=debug
    )

    print("\nFinal Feature Importances:")
    print(feature_importances.to_string(index=False))
    
    
    #Final Decisions: 
    # Features recommended for dropping
    features_to_drop = [
        'peak_height_relative'
    ]
    print(f"Dropped features (for redundancy or multicollinearity): {', '.join(features_to_drop)}")
    

    # Define categories and column names
    ordinal_categoricals = []
    nominal_categoricals = [] #'player_estimated_hand_length_cm_category'
    numericals = [        'release_ball_direction_x' ,'release_ball_direction_z', 'release_ball_direction_y',
        'elbow_release_angle', 'elbow_max_angle',
        'wrist_release_angle', 'wrist_max_angle',
        'knee_release_angle', 'knee_max_angle',
        'release_ball_speed',
        'release_ball_velocity_x', 'release_ball_velocity_y','release_ball_velocity_z']
    y_variable = ['result']
    final_keep_list = ordinal_categoricals + nominal_categoricals + numericals + y_variable
    
    # Apply the filter to keep only these columns
    final_ml_df_selected_features = final_ml_df[final_keep_list]
    print(f"Retained {len(final_keep_list)} features: {', '.join(final_keep_list)}")

    # Save feature names to a file
    with open(final_ml_features_path, 'wb') as f:
        pickle.dump(final_ml_df_selected_features.columns.tolist(), f)


    # Define paths (optional, will use defaults if not provided)
    paths = {
        'features': '../../data/model/pipeline/final_ml_df_selected_features_columns_test.pkl',
        'ordinal_categoricals': '../../data/model/pipeline/features_info/ordinal_categoricals.pkl',
        'nominal_categoricals': '../../data/model/pipeline/features_info/nominal_categoricals.pkl',
        'numericals': '../../data/model/pipeline/features_info/numericals.pkl',
        'y_variable': '../../data/model/pipeline/features_info/y_variable.pkl'
    }
    # Save features and metadata
    manage_features(
        mode='save',
        features_df=final_ml_df_selected_features,  # save only the retained columns, not the full dataset
        ordinal_categoricals=ordinal_categoricals,
        nominal_categoricals=nominal_categoricals,
        numericals=numericals,
        y_variable=y_variable,
        paths=paths
    )
    
    # Load features and metadata
    loaded = manage_features(
        mode='load',
        paths=paths
    )
    
    # Access loaded data
    if loaded:
        features = loaded.get('features')
        ordinals = loaded.get('ordinal_categoricals')
        nominals = loaded.get('nominal_categoricals')
        nums = loaded.get('numericals')
        y_var = loaded.get('y_variable')
        
        print("\n📥 Loaded Data:")
        print("Features:", features)
        print("Ordinal Categoricals:", ordinals)
        print("Nominal Categoricals:", nominals)
        print("Numericals:", nums)
        print("Y Variable:", y_var)


Overwriting ml/feature_selection/feature_importance_calculator.py
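
As a quick sanity check of the Random Forest importance approach used in the cell above, the sketch below fits a classifier on synthetic data where only one column drives the label. The column names and data are made up; the real pipeline uses `final_ml_dataset.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "signal": signal,                   # the only column related to the label
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
    "result": (signal > 0).astype(int),
})

X = demo.drop(columns=["result"])
y = demo["result"]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1, so they are best read as relative rankings. Strongly correlated features can split importance between them, which is one reason to run the multicollinearity check first.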


https://chatgpt.com/c/67583ce6-82ac-8012-b1d7-c415f658caa0

In [11]:
%%writefile ml/mlflow/mlflow_logger.py
import mlflow
import mlflow.sklearn
from mlflow.data import from_pandas
import os
import logging

class MLflowLogger:
    def __init__(self, tracking_uri=None, experiment_name="Default Experiment", enable_mlflow=True):
        """
        Initialize MLflowLogger.

        Args:
            tracking_uri (str): URI for MLflow Tracking Server.
            experiment_name (str): Name of the experiment.
            enable_mlflow (bool): Flag to enable or disable MLflow logging.
        """
        self.enable_mlflow = enable_mlflow

        # Set up basic logging
        self.logger = logging.getLogger(__name__)
        if not self.logger.handlers:
            # Prevent adding multiple handlers in interactive environments
            handler = logging.StreamHandler()
            formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
            handler.setFormatter(formatter)
            self.logger.addHandler(handler)
            self.logger.setLevel(logging.INFO)

        if self.enable_mlflow:
            if tracking_uri:
                mlflow.set_tracking_uri(tracking_uri)
                self.logger.info(f"MLflow Tracking URI set to: {tracking_uri}")
            else:
                self.logger.warning("MLflow Tracking URI not provided. Using default.")

            self.experiment_name = experiment_name
            mlflow.set_experiment(self.experiment_name)
            self.logger.info(f"MLflow Experiment set to: {self.experiment_name}")
        else:
            self.logger.info("MLflow logging is disabled. Using basic logging.")

    def log_run(self, run_name, params, metrics, artifacts=None, tags=None, datasets=None, nested=False):
        """
        Log a single run with MLflow or basic logging.

        Args:
            run_name (str): Name of the run.
            params (dict): Parameters to log.
            metrics (dict): Metrics to log.
            artifacts (str or list): Paths to artifacts (files or directories).
            tags (dict): Tags to set for the run.
            datasets (list): List of datasets to log.
            nested (bool): Whether this run is nested under an active run.
        """
        if self.enable_mlflow:
            try:
                with mlflow.start_run(run_name=run_name, nested=nested):
                    self.logger.info(f"Started MLflow run: {run_name}")
                    
                    # Log parameters
                    if params:
                        mlflow.log_params(params)
                        self.logger.debug(f"Logged parameters: {params}")
                    
                    # Log metrics
                    if metrics:
                        for key, value in metrics.items():
                            if isinstance(value, (list, tuple)):  # Handle multiple values (e.g., per epoch)
                                for step, metric_value in enumerate(value):
                                    mlflow.log_metric(key, metric_value, step=step)
                                    self.logger.debug(f"Logged metric '{key}' at step {step}: {metric_value}")
                            else:
                                mlflow.log_metric(key, value)
                                self.logger.debug(f"Logged metric '{key}': {value}")
                    
                    # Log tags
                    if tags:
                        mlflow.set_tags(tags)
                        self.logger.debug(f"Set tags: {tags}")
                    
                    # Log artifacts
                    if artifacts:
                        if isinstance(artifacts, str):  # Single file or directory
                            mlflow.log_artifact(artifacts)
                            self.logger.debug(f"Logged artifact: {artifacts}")
                        elif isinstance(artifacts, list):  # Multiple paths
                            for artifact_path in artifacts:
                                mlflow.log_artifact(artifact_path)
                                self.logger.debug(f"Logged artifact: {artifact_path}")
                    
                    # Log datasets
                    if datasets:
                        for dataset in datasets:
                            self.log_datasets(dataset["dataframe"], dataset["source"], dataset["name"])
                    
            except Exception as e:
                self.logger.error(f"Failed to log MLflow run '{run_name}': {e}")
                raise
        else:
            # Basic Logging
            self.logger.info(f"Run Name: {run_name}")
            if params:
                self.logger.info(f"Parameters: {params}")
            if metrics:
                self.logger.info(f"Metrics: {metrics}")
            if tags:
                self.logger.info(f"Tags: {tags}")
            if artifacts:
                self.logger.info(f"Artifacts: {artifacts}")
            if datasets:
                for dataset in datasets:
                    self.logger.info(f"Dataset Logged: {dataset['name']} from {dataset['source']}")

    def log_model(self, model, model_name, conda_env=None):
        """
        Log a model to MLflow or perform basic logging.

        Args:
            model: Trained model object.
            model_name (str): Name of the model.
            conda_env (str): Path to a Conda environment file.
        """
        if self.enable_mlflow:
            try:
                mlflow.sklearn.log_model(model, artifact_path=model_name, conda_env=conda_env)
                self.logger.info(f"Model '{model_name}' logged to MLflow.")
            except Exception as e:
                self.logger.error(f"Failed to log model '{model_name}' to MLflow: {e}")
                raise
        else:
            # Basic Logging
            self.logger.info(f"Model '{model_name}' training complete. (MLflow logging disabled)")

    def log_datasets(self, dataframe, source, name):
        """
        Create a dataset log entry or perform basic logging.

        Args:
            dataframe (pd.DataFrame): The dataset.
            source (str): Data source (e.g., file path, S3 URI).
            name (str): Dataset name.

        Returns:
            Dataset object logged to MLflow or None.
        """
        if self.enable_mlflow:
            try:
                dataset = from_pandas(dataframe, source=source, name=name)
                mlflow.log_input(dataset)
                self.logger.info(f"Dataset '{name}' logged to MLflow from source '{source}'.")
                return dataset
            except Exception as e:
                self.logger.error(f"Failed to log dataset '{name}' to MLflow: {e}")
                raise
        else:
            # Basic Logging
            self.logger.info(f"Dataset '{name}' from source '{source}' logged. (MLflow logging disabled)")
            return None

    def get_active_run(self):
        """
        Retrieve the current active MLflow run or log a message.

        Returns:
            Active run info object or None.
        """
        if self.enable_mlflow:
            return mlflow.active_run()
        else:
            self.logger.info("No active MLflow run. MLflow logging is disabled.")
            return None

    def get_last_run(self):
        """
        Retrieve the last active MLflow run or log a message.

        Returns:
            Last run info object or None.
        """
        if self.enable_mlflow:
            return mlflow.last_active_run()
        else:
            self.logger.info("No last MLflow run available. MLflow logging is disabled.")
            return None


Overwriting ml/mlflow/mlflow_logger.py


In [12]:
%%writefile ml/train_utils/train_utils.py


import json
import logging
import os
from pathlib import Path

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer, Real
from xgboost import XGBClassifier

# Local imports
from ml.feature_selection.data_loader_post_select_features import load_selected_features_data

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


def evaluate_model(model, X_test, y_test, save_path="classification_report.txt"):
    """
    Evaluate the model and log performance metrics.

    Parameters:
    - model: Trained model to evaluate.
    - X_test: Test features.
    - y_test: True labels for the test data.
    - save_path: Path to save the classification report.

    Returns:
    - metrics: Dictionary containing evaluation metrics.
    """
    logger.info("Evaluating model...")
    y_pred = model.predict(X_test)
    logger.info(f"Predictions: {y_pred}")
    
    # Check if the model supports probability predictions
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
        logger.info(f"Predicted probabilities: {y_proba}")
    else:
        y_proba = None
        logger.info("Model does not support probability predictions.")

    # Calculate metrics
    metrics = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_test, y_pred, average="weighted", zero_division=0),
        "F1 Score": f1_score(y_test, y_pred, average="weighted", zero_division=0),
        "ROC AUC": roc_auc_score(y_test, y_proba) if y_proba is not None else None,
        "Log Loss": log_loss(y_test, y_proba) if y_proba is not None else None,
    }

    # Log metrics
    logger.info(f"Evaluation Metrics: {metrics}")

    # Generate and save classification report
    report = classification_report(y_test, y_pred, output_dict=False)
    logger.info("\n" + report)
    with open(save_path, "w") as f:
        f.write(report)
    logger.info(f"Classification report saved to {save_path}")

    return metrics


def save_model(model, model_name, save_dir="../../data/model"):
    """
    Save the trained model to disk as "<model_name>_model.pkl".

    Parameters:
    - model: Trained model to save.
    - model_name: Name of the model for saving.
    - save_dir: Directory to save the model in.
    """
    os.makedirs(save_dir, exist_ok=True)
    model_path = os.path.join(save_dir, f"{model_name}_model.pkl")

    # Save the model
    joblib.dump(model, model_path)
    logger.info(f"Model saved to {model_path}")


def load_model(model_name, save_dir="../../data/model"):
    """
    Load the trained model from disk.

    Parameters:
    - model_name: Name of the model to load.
    - save_dir: Directory where the model is saved.

    Returns:
    - model: Loaded trained model.
    """
    model_path = os.path.join(save_dir, f"{model_name}_model.pkl")
    model = joblib.load(model_path)
    logger.info(f"Model loaded from {model_path}")
    return model


# Plot decision boundary
def plot_decision_boundary(model, X, y, title, use_pca=True):
    """
    Plot decision boundaries for the model.

    Parameters:
    - model: Trained model to visualize.
    - X: Feature data (test set).
    - y: Target labels.
    - title: Title for the plot.
    - use_pca: If True, applies PCA for dimensionality reduction if X has >2 features.
    """
    logger.info(f"Original X shape: {X.shape}")
    if X.shape[1] > 2 and use_pca:
        logger.info("X has more than 2 features, applying PCA for visualization.")
        pca = PCA(n_components=2)
        X_2d = pca.fit_transform(X)
        explained_variance = pca.explained_variance_ratio_
        logger.info(f"PCA explained variance ratios: {explained_variance}")
    elif X.shape[1] > 2:
        logger.error("Cannot plot decision boundary for more than 2D without PCA.")
        raise ValueError("Cannot plot decision boundary for more than 2D without PCA.")
    else:
        logger.info("X has 2 or fewer features, using original features for plotting.")
        # Convert to an array so the positional slicing below also works for DataFrames
        X_2d = np.asarray(X)

    logger.info(f"Transformed X shape for plotting: {X_2d.shape}")

    # Create mesh grid
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, 0.01),
        np.arange(y_min, y_max, 0.01)
    )
    logger.info(f"Mesh grid created with shape xx: {xx.shape}, yy: {yy.shape}")

    # Flatten the grid arrays and combine into a single array
    grid_points_2d = np.c_[xx.ravel(), yy.ravel()]
    logger.info(f"Grid points in 2D PCA space shape: {grid_points_2d.shape}")

    if X.shape[1] > 2 and use_pca:
        # Inverse transform the grid points back to the original feature space
        logger.info("Inverse transforming grid points back to original feature space for prediction.")
        grid_points_original = pca.inverse_transform(grid_points_2d)
        logger.info(f"Grid points in original feature space shape: {grid_points_original.shape}")
        # Predict on the grid points in original feature space
        try:
            Z = model.predict(grid_points_original)
        except ValueError as e:
            logger.error(f"Error predicting decision boundary: {e}")
            return
    else:
        # For 2D data, use grid points directly for prediction
        grid_points_original = grid_points_2d
        Z = model.predict(grid_points_original)

    Z = Z.reshape(xx.shape)
    logger.info(f"Decision boundary predictions reshaped to: {Z.shape}")

    # Plot the decision boundary
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdYlBu)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolor="k", cmap=plt.cm.RdYlBu)
    plt.title(title)
    plt.xlabel("Principal Component 1" if use_pca and X.shape[1] > 2 else "Feature 1")
    plt.ylabel("Principal Component 2" if use_pca and X.shape[1] > 2 else "Feature 2")
    plt.show()

# Hyperparameter tuning for Random Forest
def tune_random_forest(X_train, y_train, scoring_metric="neg_log_loss"):
    logger.info("Starting hyperparameter tuning for Random Forest...")
    param_space = {
        "n_estimators": Integer(10, 500),
        "max_depth": Integer(2, 50),
        "min_samples_split": Integer(2, 20),
        "min_samples_leaf": Integer(1, 10),
        "max_features": Categorical(["sqrt", "log2", None]),
        "bootstrap": Categorical([True, False]),
        "criterion": Categorical(["gini", "entropy"]),
    }
    logger.info(f"Parameter space: {param_space}")

    search = BayesSearchCV(
        RandomForestClassifier(random_state=42, n_jobs=-1),
        param_space,
        n_iter=60,
        scoring=scoring_metric, #accuracy, neg_log_loss
        cv=cv,
        n_jobs=-1,
        random_state=42
    )
    search.fit(X_train, y_train)
    logger.info(f"Best parameters found: {search.best_params_}")
    logger.info(f"Best cross-validation score: {search.best_score_}")
    return search.best_params_, search.best_score_, search.best_estimator_

# Hyperparameter tuning for XGBoost
def tune_xgboost(X_train, y_train, scoring_metric="neg_log_loss"):
    logger.info("Starting hyperparameter tuning for XGBoost...")
    param_space = {
        'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
        'n_estimators': Integer(100, 500),
        'max_depth': Integer(3, 15),
        'min_child_weight': Integer(1, 10),
        'gamma': Real(0, 5),
        'subsample': Real(0.5, 1.0),
        'colsample_bytree': Real(0.5, 1.0),
        'reg_alpha': Real(1e-8, 1.0, prior='log-uniform'),
        'reg_lambda': Real(1e-8, 1.0, prior='log-uniform'),
    }
    logger.info(f"Parameter space: {param_space}")

    search = BayesSearchCV(
        XGBClassifier(eval_metric="logloss", random_state=42, n_jobs=-1),
        param_space,
        n_iter=60,
        scoring=scoring_metric,
        cv=cv,
        n_jobs=-1,
        random_state=42
    )
    search.fit(X_train, y_train)
    logger.info(f"Best parameters found: {search.best_params_}")
    logger.info(f"Best cross-validation score: {search.best_score_}")
    return search.best_params_, search.best_score_, search.best_estimator_

# Hyperparameter tuning for Decision Tree
def tune_decision_tree(X_train, y_train, scoring_metric="neg_log_loss"):
    logger.info("Starting hyperparameter tuning for Decision Tree...")
    param_space = {
        "max_depth": Integer(2, 50),
        "min_samples_split": Integer(2, 20),
        "min_samples_leaf": Integer(1, 10),
        "criterion": Categorical(["gini", "entropy"]),
        "splitter": Categorical(["best", "random"]),
    }
    logger.info(f"Parameter space: {param_space}")

    search = BayesSearchCV(
        DecisionTreeClassifier(random_state=42),
        param_space,
        n_iter=60,
        scoring=scoring_metric,
        cv=cv,
        n_jobs=-1,
        random_state=42
    )
    search.fit(X_train, y_train)
    logger.info(f"Best parameters found: {search.best_params_}")
    logger.info(f"Best cross-validation score: {search.best_score_}")
    return search.best_params_, search.best_score_, search.best_estimator_

Overwriting ml/train_utils/train_utils.py
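
The tuning functions above default to `scoring_metric="neg_log_loss"`, while `evaluate_model` reports plain `Log Loss`. As a rough illustration of how the two relate, here is a minimal pure-Python sketch of binary log loss. `binary_log_loss` is a hypothetical helper written for this note (not part of scikit-learn or this project), and the probabilities are made up.

```python
import math


def binary_log_loss(y_true, y_proba, eps=1e-15):
    """Mean negative log-likelihood for binary labels (hypothetical helper)."""
    total = 0.0
    for y, p in zip(y_true, y_proba):
        # Clip probabilities so log(0) never occurs
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]  # mostly right and fairly sure
hedging = [0.5, 0.5, 0.5, 0.5]    # always predicts 50/50

loss_confident = binary_log_loss(y_true, confident)
loss_hedging = binary_log_loss(y_true, hedging)

# scikit-learn scorers follow "greater is better", so a search maximizes the negated loss
neg_scores = {"confident": -loss_confident, "hedging": -loss_hedging}
best = max(neg_scores, key=neg_scores.get)
print(best, round(loss_confident, 4), round(loss_hedging, 4))
```

This is why `best_score_` from the search is negative when log loss is the scoring metric, and why a value closer to zero indicates the better configuration.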


In [13]:
%%writefile ml/train.py
import logging
import json
from typing import Any, Dict
from pathlib import Path
import yaml
import pandas as pd
import joblib  # Ensure joblib is imported
import xgboost as xgb
import os

from ml.feature_selection.feature_importance_calculator import manage_features

import matplotlib.pyplot as plt

# Local imports - Adjust the import paths based on your project structure
from datapreprocessor import DataPreprocessor
from ml.train_utils.train_utils import (
    evaluate_model, save_model, load_model, plot_decision_boundary,
    tune_random_forest, tune_xgboost, tune_decision_tree
)


# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

import sys

# Path diagnostics, useful when local imports fail to resolve
logger.debug(f"Current working directory: {os.getcwd()}")
logger.debug(f"sys.path: {sys.path}")


def load_config(config_path: Path) -> Dict[str, Any]:
    """
    Load configuration from a YAML file.

    Args:
        config_path (Path): Path to the configuration file.

    Returns:
        Dict[str, Any]: Configuration dictionary.
    """
    try:
        with config_path.open('r') as f:
            config = yaml.safe_load(f)
        logger.info(f"✅ Configuration loaded from {config_path}.")
        return config
    except Exception as e:
        logger.error(f"❌ Failed to load configuration: {e}")
        raise

def bayes_best_model_train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    selection_metric: str,
    model_save_dir: Path,
    classification_save_path: Path,
    tuning_results_save: Path,
    selected_models: Any,
    use_pca: bool = False
):
    """
    Streamlined function for model tuning, evaluation, and saving the best model.
    """
    logger.info("Starting the Bayesian hyperparameter tuning process...")

    # Scoring metric selection
    scoring_metric = "neg_log_loss" if selection_metric.lower() == "log loss" else "accuracy"

    # Prepare model registry
    model_registry = {
        "XGBoost": tune_xgboost,
        "Random Forest": tune_random_forest,
        "Decision Tree": tune_decision_tree
    }

    # Normalize selected_models input
    if isinstance(selected_models, str):
        selected_models = [selected_models]
    elif not selected_models:
        selected_models = list(model_registry.keys())
        logger.info(f"No models specified. Using all available: {selected_models}")

    tuning_results = {}
    best_model_name = None
    best_model = None
    best_metric_value = None

    # Ensure model_save_dir exists
    model_save_dir.mkdir(parents=True, exist_ok=True)
    logger.debug(f"Ensured that the model save directory '{model_save_dir}' exists.")

    # Define metric key mapping
    metric_key_mapping = {
        "log loss": "Log Loss",
        "accuracy": "Accuracy",
        "precision": "Precision",
        "recall": "Recall",
        "f1 score": "F1 Score",
        "roc auc": "ROC AUC"
    }

    # Loop over requested models
    for model_name in selected_models:
        if model_name not in model_registry:
            logger.warning(f"Unsupported model: {model_name}. Skipping.")
            continue
        try:
            logger.info(f"📌 Tuning hyperparameters for {model_name}...")
            tuner_func = model_registry[model_name]

            best_params, best_score, best_estimator = tuner_func(
                X_train, y_train, scoring_metric=scoring_metric
            )
            logger.info(f"✅ {model_name} tuning done. Best Params: {best_params}, Best CV Score: {best_score}")

            # Evaluate on X_test
            metrics = evaluate_model(best_estimator, X_test, y_test, save_path=classification_save_path)
            metric_key = metric_key_mapping.get(selection_metric.lower(), selection_metric)
            metric_value = metrics.get(metric_key)

            # Debugging
            logger.debug(f"Selection Metric Key: {metric_key}")
            logger.debug(f"Available Metrics: {metrics.keys()}")

            if metric_value is not None:
                logger.debug(f"Metric value for {selection_metric}: {metric_value}")
                if best_metric_value is None:
                    best_metric_value = metric_value
                    best_model_name = model_name
                    best_model = best_estimator
                    logger.debug(f"Best model set to {best_model_name} with {selection_metric}={best_metric_value}")
                else:
                    # For log loss, lower is better
                    if selection_metric.lower() == "log loss" and metric_value < best_metric_value:
                        best_metric_value = metric_value
                        best_model_name = model_name
                        best_model = best_estimator
                        logger.debug(f"Best model updated to {best_model_name} with {selection_metric}={best_metric_value}")
                    # For other metrics (accuracy, f1, etc.), higher is better
                    elif selection_metric.lower() != "log loss" and metric_value > best_metric_value:
                        best_metric_value = metric_value
                        best_model_name = model_name
                        best_model = best_estimator
                        logger.debug(f"Best model updated to {best_model_name} with {selection_metric}={best_metric_value}")
            else:
                logger.debug(f"Metric value for {selection_metric} is None. Best model not updated.")

            # Save partial results
            tuning_results[model_name] = {
                "Best Params": best_params,
                "Best CV Score": best_score,
                "Evaluation Metrics": metrics,
            }

            # Plot boundary (optional for tree-based with PCA)
            try:
                plot_decision_boundary(best_estimator, X_test, y_test, f"{model_name} Decision Boundary", use_pca=use_pca)
            except ValueError as e:
                logger.warning(f"Skipping decision boundary plot for {model_name}: {e}")

            # Add feature importance plots for XGBoost
            if model_name.lower() == "xgboost":
                logger.info("Generating feature importance plots for XGBoost...")
                try:
                    # Plot the XGBoost estimator just tuned; best_model may be a
                    # different model type (or None) at this point in the loop
                    xgb.plot_importance(best_estimator, importance_type="weight")
                    plt.title("Feature Importance by Weight")
                    plt.show()

                    xgb.plot_importance(best_estimator, importance_type="cover")
                    plt.title("Feature Importance by Cover")
                    plt.show()

                    xgb.plot_importance(best_estimator, importance_type="gain")
                    plt.title("Feature Importance by Gain")
                    plt.show()
                except Exception as e:
                    logger.error(f"Error generating feature importance plots: {e}")
        except Exception as e:
            logger.error(f"❌ Error tuning {model_name}: {e}")
            continue

    # Save best model information
    if best_model_name:
        logger.info(f"✅ Best model is {best_model_name} with {selection_metric}={best_metric_value}")
        try:
            save_model(best_model, best_model_name, save_dir=model_save_dir)
            logger.info(f"✅ Model '{best_model_name}' saved successfully in '{model_save_dir}'.")
        except Exception as e:
            logger.error(f"❌ Failed to save best model {best_model_name}: {e}")
            raise  # Ensure the exception is propagated

        # Add Best Model info to tuning_results (path matches the file name used by save_model)
        tuning_results["Best Model"] = {
            "model_name": best_model_name,
            "metric_value": best_metric_value,
            "path": str(Path(model_save_dir) / f"{best_model_name}_model.pkl")
        }
    else:
        logger.warning("⚠️ No best model was selected. Tuning might have failed for all models.")

    # Save tuning results
    try:
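        with tuning_results_save.open("w") as f:
            # Best Params may hold NumPy scalars (e.g. np.int64), which json cannot
            # serialize directly; convert them to plain Python values
            json.dump(tuning_results, f, indent=4, default=lambda o: o.item() if hasattr(o, "item") else str(o))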
        logger.info(f"✅ Tuning results saved to {tuning_results_save}.")
    except Exception as e:
        logger.error(f"❌ Error saving tuning results: {e}")



def main():
    # ----------------------------
    # 1. Load Configuration
    # ----------------------------
    config = load_config(Path('../../data/model/preprocessor_config/preprocessor_config.yaml'))

    # Extract paths from configuration
    paths_config = config.get('paths', {})
    base_data_dir = Path(paths_config.get('data_dir', '../../dataset/test/data')).resolve()
    raw_data_file = base_data_dir / paths_config.get('raw_data', 'final_ml_dataset.csv')

    # Output directories
    log_dir = Path(paths_config.get('log_dir', '../preprocessor/logs')).resolve()
    model_save_base_dir = Path(paths_config.get('model_save_base_dir', '../preprocessor/models')).resolve()
    transformers_save_base_dir = Path(paths_config.get('transformers_save_base_dir', '../preprocessor/transformers')).resolve()
    plots_output_dir = Path(paths_config.get('plots_output_dir', '../preprocessor/plots')).resolve()
    training_output_dir = Path(paths_config.get('training_output_dir', '../preprocessor/training_output')).resolve()

    # Initialize Paths for saving
    MODEL_SAVE_DIR = model_save_base_dir
    MODEL_SAVE_DIR.mkdir(parents=True, exist_ok=True)  # Ensure the directory exists
    CLASSIFICATION_REPORT_PATH = MODEL_SAVE_DIR / "classification_report.txt"
    TUNING_RESULTS_SAVE_PATH = MODEL_SAVE_DIR / "tuning_results.json"


    LOG_FILE = 'training.log'

    # Extract model-related config
    selected_models = config.get('models', {}).get('selected_models', ["XGBoost", "Random Forest", "Decision Tree"])
    selection_metric = config.get('models', {}).get('selection_metric', "Log Loss")

    # Extract Tree Based Classifier options from config
    tree_classifier_options = config.get('models', {}).get('Tree Based Classifier', {})

    # ----------------------------
    # 2. Setup Logging
    # ----------------------------
    logger.info("✅ Starting the training module.")

    # ----------------------------
    # 3. Load Data
    # ----------------------------
    try:
        filtered_df = pd.read_csv(raw_data_file)
        logger.info(f"✅ Loaded dataset from {raw_data_file}. Shape: {filtered_df.shape}")
    except FileNotFoundError:
        logger.error(f"❌ Dataset not found at {raw_data_file}.")
        return
    except Exception as e:
        logger.error(f"❌ Failed to load dataset: {e}")
        return


    # Define paths (optional, will use defaults if not provided)
    paths = {
        'features': '../../data/model/pipeline/final_ml_df_selected_features_columns_test.pkl',
        'ordinal_categoricals': '../../data/model/pipeline/features_info/ordinal_categoricals.pkl',
        'nominal_categoricals': '../../data/model/pipeline/features_info/nominal_categoricals.pkl',
        'numericals': '../../data/model/pipeline/features_info/numericals.pkl',
        'y_variable': '../../data/model/pipeline/features_info/y_variable.pkl'
    }

    # Load features and metadata
    loaded = manage_features(
        mode='load',
        paths=paths
    )
    
    # Access loaded data; stop early if nothing was loaded, because the
    # preprocessor below needs these column lists
    if not loaded:
        logger.error("❌ Failed to load feature metadata. Aborting training.")
        return

    features = loaded.get('features')
    ordinals = loaded.get('ordinal_categoricals')
    nominals = loaded.get('nominal_categoricals')
    nums = loaded.get('numericals')
    y_var = loaded.get('y_variable')

    print("\n📥 Loaded Data:")
    print("Features:", features)
    print("Ordinal Categoricals:", ordinals)
    print("Nominal Categoricals:", nominals)
    print("Numericals:", nums)
    print("Y Variable:", y_var)
    # ----------------------------
    # 4. Initialize DataPreprocessor
    # ----------------------------
    # Assuming a supervised classification use case: "Tree Based Classifier"
    preprocessor = DataPreprocessor(
        model_type="Tree Based Classifier",
        y_variable=y_var,
        ordinal_categoricals=ordinals,
        nominal_categoricals=nominals,
        numericals=nums, 
        mode='train',
        #options=tree_classifier_options,  # The options from config for "Tree Based Classifier"
        debug=config.get('logging', {}).get('debug', False),  # or config-based
        normalize_debug=config.get('execution', {}).get('train', {}).get('normalize_debug', False),
        normalize_graphs_output=config.get('execution', {}).get('train', {}).get('normalize_graphs_output', False),
        graphs_output_dir=plots_output_dir,
        transformers_dir=transformers_save_base_dir
    )

    # ----------------------------
    # 5. Execute Preprocessing
    # ----------------------------
    try:
        # Execute preprocessing by passing the entire filtered_df
        X_train, X_test, y_train, y_test, recommendations, X_test_inverse = preprocessor.final_preprocessing(filtered_df)
        print("types of all variables starting with X_train", type(X_train), "X_test type", type(X_test), "y_train type =", type(y_train), "y_test type =", type(y_test),"X_test_inverse type =", type(X_test_inverse))
        logger.info(f"✅ Preprocessing complete. X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
    except Exception as e:
        logger.error(f"❌ Error during preprocessing: {e}")
        return
 
    # ----------------------------
    # 6. Train & Tune the Model
    # ----------------------------
    try:
        bayes_best_model_train(
            X_train=X_train,
            y_train=y_train,
            X_test=X_test,
            y_test=y_test,
            selection_metric=selection_metric,
            model_save_dir=MODEL_SAVE_DIR,
            classification_save_path=CLASSIFICATION_REPORT_PATH,
            tuning_results_save=TUNING_RESULTS_SAVE_PATH,
            selected_models=selected_models,
            use_pca=True  
        )
    except Exception as e:
        logger.error(f"❌ Model training/tuning failed: {e}")
        return

    # ----------------------------
    # 7. Completion Message
    # ----------------------------
    logger.info("✅ Training workflow completed successfully.")

if __name__ == "__main__":
    main()



Overwriting ml/train.py
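
`bayes_best_model_train` picks the winner by comparing one test-set metric per model, where log loss is minimized and every other metric is maximized. The comparison can be sketched in isolation as below. `is_better` and `pick_best` are hypothetical names that do not exist in the project, and the metric values are made up purely to show the direction of each comparison.

```python
LOWER_IS_BETTER = {"log loss"}  # assumption: every other metric is maximized


def is_better(metric_name, candidate, incumbent):
    """Return True if candidate beats incumbent for the given metric."""
    if incumbent is None:
        return True
    if metric_name.lower() in LOWER_IS_BETTER:
        return candidate < incumbent
    return candidate > incumbent


def pick_best(metric_name, results):
    """results maps model name -> metric value (None means 'not available')."""
    best_name, best_value = None, None
    for name, value in results.items():
        if value is None:
            continue  # e.g. no predict_proba, so no log loss was computed
        if is_better(metric_name, value, best_value):
            best_name, best_value = name, value
    return best_name, best_value


# Made-up numbers, not results from this project
log_losses = {"XGBoost": 0.41, "Random Forest": 0.38, "Decision Tree": None}
accuracies = {"XGBoost": 0.83, "Random Forest": 0.81, "Decision Tree": 0.74}

print(pick_best("Log Loss", log_losses))
print(pick_best("Accuracy", accuracies))
```

Keeping the direction rule in one place is a design choice for the sketch; the training function expresses the same rule inline with two branches.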


# SHAP Functions for Prediction Module

### Prediction Module

In [14]:
%%writefile ml/predict/predict.py

import pandas as pd
import logging
import os
import yaml
import joblib
import json
from pathlib import Path
from typing import Any, Dict

# Local imports - Adjust based on your project structure
from ml.train_utils.train_utils import load_model  # Ensure correct import path
from datapreprocessor import DataPreprocessor  # Adjust the import path as necessary

def load_dataset(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)

def load_config(config_path: Path) -> Dict[str, Any]:
    if not config_path.exists():
        raise FileNotFoundError(f"Configuration file not found at {config_path}")
    with open(config_path, 'r') as file:
        config = yaml.safe_load(file)
    return config

# Set up logger
logger = logging.getLogger("PredictAndAttachLogger")
logger.setLevel(logging.DEBUG)  # Set to DEBUG for detailed logs, adjust as needed

# Console handler
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)  # Adjust this level for console output

# File handler
file_handler = logging.FileHandler("predictions.log")
file_handler.setLevel(logging.DEBUG)  # Log detailed information to a file

# Formatter
formatter = logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
console_handler.setFormatter(formatter)
file_handler.setFormatter(formatter)

# Add handlers to logger
logger.addHandler(console_handler)
logger.addHandler(file_handler)


def predict_and_attach_predict_probs(trained_model, X_preprocessed, X_inversed):
    """
    Predict labels and class probabilities, then attach them to X_inversed.

    Returns:
        (predictions, prediction_probs, X_inversed). prediction_probs is None
        if probabilities could not be computed.

    Raises:
        Any exception raised by trained_model.predict, after logging it.
    """
    try:
        predictions = trained_model.predict(X_preprocessed)
        logger.info("✅ Predictions made successfully.")
        logger.debug(f"Predictions sample: {predictions[:5]}")
    except Exception as e:
        logger.error(f"❌ Prediction failed: {e}")
        raise

    # Probabilities are optional: keep going with None if they are unavailable
    prediction_probs = None
    try:
        prediction_probs = trained_model.predict_proba(X_preprocessed)
        logger.info("✅ Prediction probabilities computed successfully.")
        logger.debug(f"Prediction probabilities sample:\n{prediction_probs[:2]}")
    except Exception as e:
        logger.error(f"❌ Prediction probabilities computation failed: {e}")

    # ----------------------------
    # Attach Predictions to Inverse-Transformed DataFrame
    # ----------------------------
    try:
        if X_inversed is None:
            logger.warning("X_inversed is None. Creating a new DataFrame with predictions.")
            X_inversed = pd.DataFrame({'Prediction': predictions})
        elif 'Prediction' not in X_inversed.columns:
            X_inversed['Prediction'] = predictions
            logger.debug("Predictions attached to inverse-transformed DataFrame.")
        if prediction_probs is not None and 'Prediction_Probabilities' not in X_inversed.columns:
            X_inversed['Prediction_Probabilities'] = prediction_probs.tolist()
            logger.debug("Prediction probabilities attached to inverse-transformed DataFrame.")
        logger.debug(f"Final shape of X_inversed: {X_inversed.shape}")
        logger.debug(f"Final columns in X_inversed: {X_inversed.columns.tolist()}")
    except Exception as e:
        logger.error(f"❌ Failed to attach predictions to inverse-transformed DataFrame: {e}")
        # Fall back to a predictions-only DataFrame so the caller still gets results
        X_inversed = pd.DataFrame({'Prediction': predictions})

    return predictions, prediction_probs, X_inversed


def main():
    # ----------------------------
    # Step 1: Load Configuration
    # ----------------------------
    config_path = Path('../../data/model/preprocessor_config/preprocessor_config.yaml')  # Adjust as needed
    try:
        config = load_config(config_path)
    except Exception as e:
        print(f"❌ Failed to load configuration: {e}")
        return  # Exit if config loading fails

    # ----------------------------
    # Step 2: Extract Paths from Configuration
    # ----------------------------
    paths = config.get('paths', {})
    data_dir = Path(paths.get('data_dir', '../../data/processed')).resolve()
    raw_data_path = data_dir / paths.get('raw_data', 'final_ml_dataset.csv')  # Corrected key
    processed_data_dir = data_dir / paths.get('processed_data_dir', 'preprocessor/processed')
    transformers_dir = Path(paths.get('transformers_save_base_dir', '../preprocessor/transformers')).resolve()  # Corrected key
    predictions_output_dir = Path(paths.get('predictions_output_dir', 'preprocessor/predictions')).resolve()
    log_dir = Path(paths.get('log_dir', '../preprocessor/logs')).resolve()
    model_save_dir = Path(paths.get('model_save_base_dir', '../preprocessor/models')).resolve()  # Corrected key
    log_file = paths.get('log_file', 'prediction.log')  # Ensure this key exists in config

    # ----------------------------
    # Step 3: Setup Logging
    # ----------------------------
    logger.info("✅ Starting prediction module.")

    # ----------------------------
    # Step 4: Extract Feature Assets
    # ----------------------------
    features_config = config.get('features', {})
    column_assets = {
        'y_variable': features_config.get('y_variable', []),
        'ordinal_categoricals': features_config.get('ordinal_categoricals', []),
        'nominal_categoricals': features_config.get('nominal_categoricals', []),
        'numericals': features_config.get('numericals', [])
    }

    # ----------------------------
    # Step 5: Load Tuning Results to Find Best Model
    # ----------------------------
    tuning_results_path = model_save_dir / "tuning_results.json"
    if not tuning_results_path.exists():
        logger.error(f"❌ Tuning results not found at '{tuning_results_path}'. Cannot determine the best model.")
        return

    try:
        with open(tuning_results_path, 'r') as f:
            tuning_results = json.load(f)
        best_model_info = tuning_results.get("Best Model")
        if not best_model_info:
            logger.error("❌ Best model information not found in tuning results.")
            return
        best_model_name = best_model_info.get("model_name")
        if not best_model_name:
            logger.error("❌ Best model name not found in tuning results.")
            return
        logger.info(f"Best model identified: {best_model_name}")
    except Exception as e:
        logger.error(f"❌ Failed to load tuning results: {e}")
        return

    # ----------------------------
    # Step 6: Preprocess the Data
    # ----------------------------
    # Load Prediction Dataset
    if not raw_data_path.exists():
        logger.error(f"❌ Prediction input dataset not found at '{raw_data_path}'.")
        return

    try:
        df_predict = load_dataset(raw_data_path)
        logger.info(f"✅ Prediction input data loaded from '{raw_data_path}'.")
    except Exception as e:
        logger.error(f"❌ Failed to load prediction input data: {e}")
        return

    # Initialize DataPreprocessor
    preprocessor = DataPreprocessor(
        model_type="Tree Based Classifier",  # Or dynamically set based on best_model_name if necessary
        y_variable=column_assets.get('y_variable', []),
        ordinal_categoricals=column_assets.get('ordinal_categoricals', []),
        nominal_categoricals=column_assets.get('nominal_categoricals', []),
        numericals=column_assets.get('numericals', []),
        mode='predict',
        options={},  # Adjust based on config or load from somewhere
        debug=False,  # Can be parameterized
        normalize_debug=False,  # As per hardcoded paths
        normalize_graphs_output=False,  # As per hardcoded paths
        graphs_output_dir=Path(paths.get('plots_output_dir', '../preprocessor/plots')).resolve(),
        transformers_dir=transformers_dir
    )

    # Execute Preprocessing for Prediction
    try:
        X_preprocessed, recommendations, X_inversed = preprocessor.final_preprocessing(df_predict)
        print("X_new_preprocessed type = ", type(X_preprocessed), "X_new_inverse type = ", type(X_inversed))
        logger.info("✅ Preprocessing completed successfully in predict mode.")
    except Exception as e:
        logger.error(f"❌ Preprocessing failed in predict mode: {e}")
        return

    # ----------------------------
    # Step 7: Load the Best Model
    # ----------------------------
    try:
        trained_model = load_model(best_model_name, model_save_dir)
        logger.info(f"✅ Trained model loaded from '{model_save_dir / (best_model_name + '_model.pkl')}'.")
    except Exception as e:
        logger.error(f"❌ Failed to load the best model '{best_model_name}': {e}")
        return

    # ----------------------------
    # Step 8/9: Make Predictions/probs
    # ----------------------------
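    try:
        result = predict_and_attach_predict_probs(trained_model, X_preprocessed, X_inversed)
    except Exception as e:
        logger.error(f"❌ Prediction step failed: {e}")
        return
    if result is None:
        logger.error("❌ Prediction step returned no results; nothing to save.")
        return
    predictions, prediction_probs, X_inversed = result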


    # ----------------------------
    # Step 10: Save Predictions
    # ----------------------------
    try:
        predictions_output_dir.mkdir(parents=True, exist_ok=True)
        predictions_filename = predictions_output_dir / f'predictions_{best_model_name.replace(" ", "_")}.csv'
        X_inversed.to_csv(predictions_filename, index=False)
        logger.info(f"✅ Predictions saved to '{predictions_filename}'.")
    except Exception as e:
        logger.error(f"❌ Failed to save predictions: {e}")
        return

    logger.info(f"✅ All prediction tasks completed successfully for model '{best_model_name}'.")


if __name__ == "__main__":
    main()

Overwriting ml/predict/predict.py


In [15]:
%%writefile ml/shap/shap_utils.py

import logging
import shap
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import pickle
from pathlib import Path

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


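def compute_shap_values(model, X, debug: bool = False, logger: logging.Logger = None):
    """
    Compute SHAP values for a fitted classifier.

    :param model: Trained model exposing `classes_`.
    :param X: Preprocessed feature DataFrame to explain.
    :param debug: If True, enable detailed debug logs.
    :param logger: Logger instance for logging; defaults to the module logger.
    :return: (explainer, shap_values_class), where shap_values_class is a 2D NumPy array
             of shape (n_samples, n_features).
    """
    # Fall back to the module logger: several debug/error calls below are not guarded against None.
    logger = logger or logging.getLogger(__name__)
    logger.info("Initializing SHAP explainer...")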

    try:
        explainer = shap.Explainer(model, X)
        if logger and debug:
            logger.debug(f"SHAP Explainer initialized: {type(explainer)}")
            logger.debug(f"Explainer details: {explainer}")

        shap_values = explainer(X)
        if logger and debug:
            logger.debug(f"SHAP values computed: {type(shap_values)}")
            logger.debug(f"Shape of shap_values: {shap_values.shape}")
            if hasattr(shap_values, 'values'):
                logger.debug(f"Sample SHAP values:\n{shap_values.values[:2]}")
            if hasattr(shap_values, 'feature_names'):
                logger.debug(f"SHAP feature names: {shap_values.feature_names}")

        # Determine number of classes
        n_classes = len(model.classes_)
        logger.debug(f"Number of classes in the model: {n_classes}")

        if shap_values.values.ndim == 3:
            # 3D output has shape (n_samples, n_features, n_classes).
            # Keep class index 1 (the positive class of a binary classifier);
            # fall back to index 0 if the model exposes only one class.
            class_idx = 1 if n_classes > 1 else 0
            shap_values_class = shap_values.values[:, :, class_idx]
            logger.debug(f"Extracted SHAP values for class {class_idx}: Shape {shap_values_class.shape}")
        elif shap_values.values.ndim == 2:
            # 2D output is already (n_samples, n_features); use it as is.
            shap_values_class = shap_values.values
            logger.debug(f"Using 2D SHAP values as returned: Shape {shap_values_class.shape}")
        else:
            logger.error(f"Unexpected SHAP values dimensions: {shap_values.values.ndim}")
            raise ValueError("Unexpected SHAP values dimensions.")

        return explainer, shap_values_class
    except Exception as e:
        if logger:
            logger.error(f"Failed to compute SHAP values: {e}")
        raise


def plot_shap_summary(shap_values, X_original: pd.DataFrame, save_path: str, debug: bool = False, 
                     logger: logging.Logger = None):
    """
    Generate and save a SHAP summary plot.

    :param shap_values: SHAP values computed for the dataset.
    :param X_original: Original (preprocessed) feature DataFrame.
    :param save_path: Full file path to save the plot.
    :param debug: If True, enable detailed debug logs.
    :param logger: Logger instance for logging.
    """
    if logger:
        logger.info("Generating SHAP summary plot...")
        logger.debug(f"Type of shap_values: {type(shap_values)}")
        logger.debug(f"Shape of shap_values: {shap_values.shape}")
        
        # Check if shap_values is a shap.Explanation object
        if isinstance(shap_values, shap.Explanation):
            logger.debug(f"SHAP feature names: {shap_values.feature_names}")
        else:
            logger.debug("shap_values is not a shap.Explanation object. Attempting to extract feature names.")
            if hasattr(shap_values, 'feature_names'):
                logger.debug(f"shap_values feature names: {shap_values.feature_names}")
            else:
                logger.debug("Cannot extract feature names from shap_values.")
        
        logger.debug(f"Type of X_original: {type(X_original)}")
        logger.debug(f"Shape of X_original: {X_original.shape}")
        logger.debug(f"Columns in X_original: {X_original.columns.tolist()}")
        
        # Verify column alignment
        if isinstance(shap_values, shap.Explanation):
            shap_feature_names = shap_values.feature_names
        elif hasattr(shap_values, 'feature_names'):
            shap_feature_names = shap_values.feature_names
        else:
            shap_feature_names = X_original.columns.tolist()  # Fallback
        
        if list(shap_feature_names) != list(X_original.columns):
            logger.error("Column mismatch between SHAP values and X_original.")
            logger.error(f"SHAP feature names ({len(shap_feature_names)}): {shap_feature_names}")
            logger.error(f"X_original columns ({len(X_original.columns)}): {X_original.columns.tolist()}")
            raise ValueError("Column mismatch between SHAP values and X_original.")
        else:
            logger.debug("Column alignment verified between SHAP values and X_original.")
    
    try:
        plt.figure(figsize=(10, 8))
        shap.summary_plot(shap_values, X_original, show=False)
        plt.tight_layout()
        plt.savefig(save_path, bbox_inches='tight')
        plt.close()
        if logger and debug:
            logger.debug(f"SHAP summary plot saved to {save_path}")
        if logger:
            logger.info("SHAP summary plot generated successfully.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to generate SHAP summary plot: {e}")
        raise




def plot_shap_dependence(shap_values, feature: str, X_original: pd.DataFrame, save_path: str, debug: bool = False, 
                         logger: logging.Logger = None):
    """
    Generate and save a SHAP dependence plot for a specific feature.

    :param shap_values: SHAP values computed for the dataset.
    :param feature: Feature name to generate the dependence plot for.
    :param X_original: Original (untransformed) feature DataFrame.
    :param save_path: Full file path to save the plot.
    :param debug: If True, enable detailed debug logs.
    :param logger: Logger instance for logging.
    """
    if logger:
        logger.info(f"Generating SHAP dependence plot for feature '{feature}'...")
    try:
        plt.figure(figsize=(8, 6))
        # shap.dependence_plot(feature, shap_values.values, X_original, show=False)
        shap.dependence_plot(feature, shap_values, X_original, show=False)
        plt.tight_layout()
        plt.savefig(save_path, bbox_inches='tight')
        plt.close()
        if logger and debug:
            logger.debug(f"SHAP dependence plot for '{feature}' saved to {save_path}")
        if logger:
            logger.info(f"SHAP dependence plot for feature '{feature}' generated successfully.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to generate SHAP dependence plot for feature '{feature}': {e}")
        raise


def generate_global_recommendations(shap_values, X_original: pd.DataFrame, top_n: int = 5, debug: bool = False, 
                                    use_mad: bool = False, logger: logging.Logger = None) -> dict:
    """
    Generate global recommendations based on SHAP values and feature distributions.

    :param shap_values: SHAP values computed for the dataset.
    :param X_original: Original (untransformed) feature DataFrame.
    :param top_n: Number of top features to generate recommendations for.
    :param debug: If True, enable detailed debug logs.
    :param use_mad: If True, use Median Absolute Deviation for range definition.
    :param logger: Logger instance for logging.
    :return: recommendations: Dictionary mapping features to recommended value ranges, importance, and direction.
    """
    if logger:
        logger.info("Generating feature importance based on SHAP values...")
    try:
        # shap_df = pd.DataFrame(shap_values.values, columns=X_original.columns)
        shap_df = pd.DataFrame(shap_values, columns=X_original.columns)

        # Calculate mean absolute SHAP values for importance
        feature_importance = pd.DataFrame({
            'feature': X_original.columns,
            'importance': np.abs(shap_df).mean(axis=0),
            'mean_shap': shap_df.mean(axis=0)
        }).sort_values(by='importance', ascending=False)
        
        if logger and debug:
            logger.debug(f"Feature importance (top {top_n}):\n{feature_importance.head(top_n)}")
        
        top_features = feature_importance.head(top_n)['feature'].tolist()
        recommendations = {}
        
        for feature in top_features:
            feature_values = X_original[feature]
            
            if use_mad:
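                # Use the median and Median Absolute Deviation (MAD) for robust statistics.
                # Series.mad() computed the *mean* absolute deviation and was removed in pandas 2.0,
                # so the MAD is computed explicitly here.
                median = feature_values.median()
                mad = (feature_values - median).abs().median()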
                lower_bound = median - 1.5 * mad
                upper_bound = median + 1.5 * mad
                range_str = f"{lower_bound:.1f}–{upper_bound:.1f}"
            else:
                # Default to Interquartile Range (IQR)
                lower_bound = feature_values.quantile(0.25)
                upper_bound = feature_values.quantile(0.75)
                range_str = f"{lower_bound:.1f}–{upper_bound:.1f}"
            
            # Determine direction based on mean SHAP value
            mean_shap = feature_importance.loc[feature_importance['feature'] == feature, 'mean_shap'].values[0]
            direction = 'positive' if mean_shap > 0 else 'negative'
            
            recommendations[feature] = {
                'range': range_str,
                'importance': round(feature_importance.loc[feature_importance['feature'] == feature, 'importance'].values[0], 4),  # Rounded for readability
                'direction': direction
            }
            if logger and debug:
                logger.debug(f"Recommendation for {feature}: Range={range_str}, Importance={feature_importance.loc[feature_importance['feature'] == feature, 'importance'].values[0]}, Direction={direction}")
        
        if logger and debug:
            logger.debug(f"Final Recommendations with Importance and Direction: {recommendations}")
        return recommendations
    except Exception as e:
        if logger:
            logger.error(f"Failed to generate global recommendations: {e}")
        raise


def generate_individual_feedback(trial: pd.Series, shap_values_trial: np.ndarray, feature_metadata: dict = None, 
                                 logger: logging.Logger = None) -> dict:
    """
    Generate specific feedback for a single trial based on its SHAP values and feature metadata.

    Args:
        trial (pd.Series): A single trial's data.
        shap_values_trial (np.ndarray): SHAP values for the trial.
        feature_metadata (dict, optional): Additional metadata for features (e.g., units).
        logger (logging.Logger, optional): Logger instance for logging.

    Returns:
        dict: Feedback messages for each feature.
    """
    feedback = {}
    feature_names = trial.index.tolist()

    for feature, shap_value in zip(feature_names, shap_values_trial):
        if shap_value > 0:
            adjustment = "maintain or increase"
            direction = "positively"
        elif shap_value < 0:
            adjustment = "decrease"
            direction = "positively"
        else:
            feedback[feature] = f"{feature.replace('_', ' ').capitalize()} has no impact on the prediction."
            continue

        # Map SHAP values to meaningful adjustment magnitudes
        # Example: 10% of the current feature value
        current_value = trial[feature]
        adjustment_factor = 0.1
        adjustment_amount = adjustment_factor * abs(current_value)

        # Incorporate feature metadata if available
        if feature_metadata and feature in feature_metadata:
            unit = feature_metadata[feature].get('unit', '')
            adjustment_str = f"{adjustment_amount:.2f} {unit}" if unit else f"{adjustment_amount:.2f}"
        else:
            adjustment_str = f"{adjustment_amount:.2f}"

        # Construct feedback message
        feedback_message = (
            f"Try to {adjustment} '{feature.replace('_', ' ')}' by approximately {adjustment_str} "
            f"to {direction} influence the result."
        )
        feedback[feature] = feedback_message

    return feedback


def compute_individual_shap_values(explainer, X_transformed: pd.DataFrame, trial_index: int, 
                                   logger: logging.Logger = None):
    """
    Compute SHAP values for a single trial.

    :param explainer: SHAP explainer object.
    :param X_transformed: Transformed features used for prediction.
    :param trial_index: Index of the trial.
    :param logger: Logger instance.
    :return: shap_values for the trial.
    """
    if logger:
        logger.info(f"Computing SHAP values for trial at index {trial_index}...")
    try:
        trial = X_transformed.iloc[[trial_index]]
        shap_values = explainer(trial)
        if logger:
            logger.debug(f"SHAP values for trial {trial_index} computed successfully.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to compute SHAP values for trial {trial_index}: {e}")
        raise
    return shap_values


def plot_individual_shap_force(shap_explainer, shap_values, X_original: pd.DataFrame, trial_index: int, 
                               save_path: str, logger: logging.Logger = None):
    """
    Generate and save a SHAP force plot for a specific trial.

    :param shap_explainer: SHAP explainer object.
    :param shap_values: SHAP values for the trial.
    :param X_original: Original feature DataFrame.
    :param trial_index: Index of the trial.
    :param save_path: Full file path to save the force plot.
    :param logger: Logger instance.
    """
    if logger:
        logger.info(f"Generating SHAP force plot for trial {trial_index}...")
    try:
        shap_plot = shap.force_plot(
            shap_explainer.expected_value, 
            shap_values.values[0], 
            X_original.iloc[trial_index],
            matplotlib=False
        )
        shap.save_html(save_path, shap_plot)
        if logger:
            logger.debug(f"SHAP force plot saved to {save_path}")
            logger.info(f"SHAP force plot for trial {trial_index} generated successfully.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to generate SHAP force plot for trial {trial_index}: {e}")
        raise


def extract_force_plot_values(shap_values, trial_index: int, logger: logging.Logger = None) -> dict:
    """
    Extract SHAP values and feature contributions for a specific trial.

    Args:
        shap_values (shap.Explanation): SHAP values object.
        trial_index (int): Index of the trial.
        logger (logging.Logger, optional): Logger instance.

    Returns:
        dict: Dictionary of feature contributions.
    """
    try:
        shap_values_instance = shap_values.values[trial_index]
        features_instance = shap_values.data[trial_index]
        feature_contributions = dict(zip(shap_values.feature_names, shap_values_instance))
        if logger and logger.isEnabledFor(logging.DEBUG):
            logger.debug(f"SHAP values for trial {trial_index}: {feature_contributions}")
        return feature_contributions
    except Exception as e:
        if logger:
            logger.error(f"Error extracting SHAP values for trial {trial_index}: {e}")
        raise


def save_shap_values(shap_values, save_path: str, logger: logging.Logger = None):
    """
    Save SHAP values to a file using pickle.

    :param shap_values: SHAP values object to save.
    :param save_path: File path to save the SHAP values.
    :param logger: Logger instance.
    """
    if logger:
        logger.info(f"Saving SHAP values to {save_path}...")
    try:
        with open(save_path, "wb") as f:
            pickle.dump(shap_values, f)
        if logger:
            logger.info(f"SHAP values saved successfully to {save_path}.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to save SHAP values: {e}")
        raise


def load_shap_values(load_path: str, logger: logging.Logger = None):
    """
    Load SHAP values from a pickle file.

    :param load_path: File path to load the SHAP values from.
    :param logger: Logger instance.
    :return: Loaded SHAP values object.
    """
    if logger:
        logger.info(f"Loading SHAP values from {load_path}...")
    try:
        with open(load_path, "rb") as f:
            shap_values = pickle.load(f)
        if logger:
            logger.info(f"SHAP values loaded successfully from {load_path}.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to load SHAP values: {e}")
        raise
    return shap_values


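def plot_shap_force(shap_explainer, shap_values, X_original, trial_index: int, save_path: Path,
                    logger: logging.Logger, debug: bool):
    """
    Generate and save a static (matplotlib) SHAP force plot for a specific trial.

    :param shap_explainer: SHAP explainer object providing `expected_value`.
    :param shap_values: 2D array of SHAP values, indexed by trial position.
    :param X_original: Feature DataFrame aligned with shap_values.
    :param trial_index: Positional index of the trial.
    :param save_path: Full file path to save the plot.
    :param logger: Logger instance.
    :param debug: If True, enable detailed debug logs.
    """
    try:
        shap.force_plot(
            shap_explainer.expected_value,
            shap_values[trial_index],
            X_original.iloc[trial_index],
            matplotlib=True,
            show=False  # Keep the figure open so it can be saved below
        )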
        plt.tight_layout()
        plt.savefig(save_path, bbox_inches='tight')
        plt.close()
        if debug:
            logger.debug(f"SHAP force plot for trial {trial_index} saved to {save_path}.")
        logger.info(f"✅ SHAP force plot for trial {trial_index} saved.")
    except Exception as e:
        logger.error(f"❌ Failed to generate SHAP force plot for trial {trial_index}: {e}")
        raise




Overwriting ml/shap/shap_utils.py

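The class-selection step in `compute_shap_values` can be tried in isolation, without `shap` installed. The sketch below is a minimal illustration on a synthetic array. The names `select_positive_class` and `class_index` are hypothetical and not part of `shap_utils.py`. It assumes that a 3D result is laid out as (n_samples, n_features, n_classes), which is the layout the function above relies on.

```python
import numpy as np


def select_positive_class(values: np.ndarray, class_index: int = 1) -> np.ndarray:
    """Reduce SHAP output to a 2D (n_samples, n_features) array.

    Hypothetical helper for illustration only; it mirrors the branching in
    compute_shap_values but is not the function used by the pipeline.
    """
    if values.ndim == 3:
        # (n_samples, n_features, n_classes): keep one class, falling back
        # to index 0 when the requested class does not exist.
        idx = class_index if values.shape[2] > class_index else 0
        return values[:, :, idx]
    if values.ndim == 2:
        # Already one value per sample and feature.
        return values
    raise ValueError(f"Unexpected SHAP values dimensions: {values.ndim}")


# Synthetic stand-in for SHAP output: 4 samples, 3 features, 2 classes.
rng = np.random.default_rng(0)
per_class = rng.normal(size=(4, 3, 2))

positive = select_positive_class(per_class)
print(positive.shape)  # (4, 3)
```

Downstream functions such as `generate_global_recommendations` then receive a plain 2D array, which is why they build `pd.DataFrame(shap_values, columns=...)` directly instead of reading `.values`.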

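`generate_global_recommendations` offers two ways to turn a feature's distribution into a recommended range: the interquartile range (default) or median ± 1.5 × MAD when `use_mad=True`. The sketch below compares the two on a small made-up sample with one outlier. The helper names `iqr_range` and `mad_range` and the sample values are assumptions for illustration, not data from the project.

```python
import numpy as np


def iqr_range(values):
    """Return the (25th, 75th) percentile bounds using linear interpolation."""
    lower, upper = np.percentile(values, [25, 75])
    return float(lower), float(upper)


def mad_range(values, k=1.5):
    """Return median +/- k * MAD, where MAD is the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return float(median - k * mad), float(median + k * mad)


# Made-up elbow angles in degrees; 140.0 is a deliberate outlier.
elbow_angles = [88.0, 90.0, 91.0, 92.0, 93.0, 95.0, 140.0]

print(iqr_range(elbow_angles))  # (90.5, 94.0)
print(mad_range(elbow_angles))  # (89.0, 95.0)
```

Both ranges ignore the outlier here. The MAD-based range is centred on the median and symmetric, while the IQR follows the shape of the distribution, so the two can differ noticeably on skewed features.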
In [16]:
# %%writefile ml/predict_with_shap_usage.py

import pandas as pd
import logging
import os
import yaml
import joblib
import json
from pathlib import Path
from typing import Any, Dict

from ml.predict.predict import predict_and_attach_predict_probs
from datapreprocessor import DataPreprocessor
from ml.feature_selection.feature_importance_calculator import manage_features  # New Import

# Local imports for SHAP utilities and model loading
from ml.shap.shap_utils import (
    compute_shap_values,
    plot_shap_summary,
    plot_shap_dependence,
    generate_global_recommendations,
    generate_individual_feedback
)
from ml.train_utils.train_utils import load_model  # Ensure 'train_utils.py' contains load_model

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


def load_dataset(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)


def load_config(config_path: Path) -> Dict[str, Any]:
    if not config_path.exists():
        raise FileNotFoundError(f"Configuration file not found at {config_path}")
    with open(config_path, 'r') as file:
        config = yaml.safe_load(file)
    return config





def main():
    # ----------------------------
    # Step 1: Load Configuration
    # ----------------------------
    config_path = Path('../../data/model/preprocessor_config/preprocessor_config.yaml')  # Adjust as needed
    try:
        config = load_config(config_path)
        print(f"Configuration loaded successfully from {config_path}.")
    except Exception as e:
        print(f"❌ Failed to load configuration: {e}")
        return  # Exit if config loading fails

    # ----------------------------
    # Step 2: Extract Paths from Configuration
    # ----------------------------
    paths = config.get('paths', {})
    feature_paths = {
        'features': Path('../../data/model/pipeline/final_ml_df_selected_features_columns_test.pkl'),
        'ordinal_categoricals': Path('../../data/model/pipeline/features_info/ordinal_categoricals.pkl'),
        'nominal_categoricals': Path('../../data/model/pipeline/features_info/nominal_categoricals.pkl'),
        'numericals': Path('../../data/model/pipeline/features_info/numericals.pkl'),
        'y_variable': Path('../../data/model/pipeline/features_info/y_variable.pkl')
    }

    # Define other necessary paths
    data_dir = Path(paths.get('data_dir', '../../data/processed')).resolve()
    raw_data_path = data_dir / paths.get('raw_data', 'final_ml_dataset.csv')  # Corrected key
    processed_data_dir = data_dir / paths.get('processed_data_dir', 'preprocessor/processed')
    transformers_dir = Path(paths.get('transformers_save_base_dir', '../preprocessor/transformers')).resolve()  # Corrected key
    predictions_output_path = Path(paths.get('predictions_output_dir', 'preprocessor/predictions')).resolve()
    log_dir = Path(paths.get('log_dir', '../preprocessor/logs')).resolve()
    model_save_dir = Path(paths.get('model_save_base_dir', '../preprocessor/models')).resolve()  # Corrected key
    log_file = paths.get('log_file', 'prediction.log')  # Ensure this key exists in config

    # ----------------------------
    # Step 3: Setup Logging
    # ----------------------------
    logger.info("✅ Starting prediction module.")
    logger.debug(f"Configuration paths extracted: {paths}")

    # ----------------------------
    # Step 4: Load Feature Lists Using manage_features
    # ----------------------------
    try:
        feature_lists = manage_features(
            mode='load',
            paths=feature_paths
        )
        y_variable = feature_lists.get('y_variable', [])
        ordinal_categoricals = feature_lists.get('ordinal_categoricals', [])
        nominal_categoricals = feature_lists.get('nominal_categoricals', [])
        numericals = feature_lists.get('numericals', [])

        logger.debug(f"Loaded Feature Lists: y_variable={y_variable}, "
                     f"ordinal_categoricals={ordinal_categoricals}, "
                     f"nominal_categoricals={nominal_categoricals}, "
                     f"numericals={numericals}")
    except Exception as e:
        logger.error(f"❌ Failed to load feature lists: {e}")
        return

    # ----------------------------
    # Step 5: Load Tuning Results to Find Best Model
    # ----------------------------
    tuning_results_path = model_save_dir / "tuning_results.json"
    if not tuning_results_path.exists():
        logger.error(f"❌ Tuning results not found at '{tuning_results_path}'. Cannot determine the best model.")
        return

    try:
        with open(tuning_results_path, 'r') as f:
            tuning_results = json.load(f)
        best_model_info = tuning_results.get("Best Model")
        if not best_model_info:
            logger.error("❌ Best model information not found in tuning results.")
            return
        best_model_name = best_model_info.get("model_name")
        if not best_model_name:
            logger.error("❌ Best model name not found in tuning results.")
            return
        logger.info(f"Best model identified: {best_model_name}")
        logger.debug(f"Best model details: {best_model_info}")
    except Exception as e:
        logger.error(f"❌ Failed to load tuning results: {e}")
        return

    # ----------------------------
    # Step 6: Preprocess the Data
    # ----------------------------
    # Load Prediction Dataset
    if not raw_data_path.exists():
        logger.error(f"❌ Prediction input dataset not found at '{raw_data_path}'.")
        return

    try:
        df_predict = load_dataset(raw_data_path)
        logger.info(f"✅ Prediction input data loaded from '{raw_data_path}'.")
        logger.debug(f"Prediction input data shape: {df_predict.shape}")
        logger.debug(f"Prediction input data columns: {df_predict.columns.tolist()}")
    except Exception as e:
        logger.error(f"❌ Failed to load prediction input data: {e}")
        return

    # Initialize DataPreprocessor with Loaded Feature Lists
    preprocessor = DataPreprocessor(
        model_type="Tree Based Classifier",  # Adjust if model type varies
        y_variable=y_variable,
        ordinal_categoricals=ordinal_categoricals,
        nominal_categoricals=nominal_categoricals,
        numericals=numericals,
        mode='predict',
        options={},  # Populate based on specific requirements or configurations
        debug=True,  # Enable debug mode for detailed logs
        normalize_debug=False,  # As per requirements
        normalize_graphs_output=False,  # As per requirements
        graphs_output_dir=Path(paths.get('plots_output_dir', '../preprocessor/plots')).resolve(),
        transformers_dir=transformers_dir
    )

    # Execute Preprocessing for Prediction
    try:
        X_preprocessed, recommendations, X_inversed = preprocessor.final_preprocessing(df_predict)
        print("X_new_preprocessed type = ", type(X_preprocessed), "X_new_inverse type = ", type(X_inversed))
        logger.info("✅ Preprocessing completed successfully in predict mode.")
        logger.debug(f"Shape of X_preprocessed: {X_preprocessed.shape}")
        logger.debug(f"Columns in X_preprocessed: {X_preprocessed.columns.tolist()}")
        logger.debug(f"Shape of X_inversed: {X_inversed.shape}")
        logger.debug(f"Columns in X_inversed: {X_inversed.columns.tolist()}")
    except Exception as e:
        logger.error(f"❌ Preprocessing failed in predict mode: {e}")
        return

    # ----------------------------
    # Step 7: Load the Best Model
    # ----------------------------
    try:
        trained_model = load_model(best_model_name, model_save_dir)
        model_path = model_save_dir / best_model_name.replace(' ', '_') / 'trained_model.pkl'
        logger.info(f"✅ Trained model loaded from '{model_path}'.")
        logger.debug(f"Trained model type: {type(trained_model)}")
    except Exception as e:
        logger.error(f"❌ Failed to load the best model '{best_model_name}': {e}")
        return

    # ----------------------------
    # Step 8: Make Predictions + Prediction Probabilities and add them to the inverse-transformed DataFrame
    # ----------------------------
    predictions, prediction_probs, X_inversed = predict_and_attach_predict_probs(trained_model, X_preprocessed, X_inversed)


    # ----------------------------
    # Step 12: Compute SHAP Values for Interpretability
    # ----------------------------
    try:
        logger.info("📊 Computing SHAP values for interpretability...")
        explainer, shap_values = compute_shap_values(
            trained_model, 
            X_preprocessed,  # Compute SHAP on preprocessed data
            debug=config.get('logging', {}).get('debug', False), 
            logger=logger
        )
        logger.info("✅ SHAP values computed successfully.")
        
        # Additional Debugging
        logger.debug(f"Type of shap_values: {type(shap_values)}")
        logger.debug(f"Shape of shap_values: {shap_values.shape}")
        if hasattr(shap_values, 'feature_names'):
            logger.debug(f"SHAP feature names: {shap_values.feature_names}")
        else:
            logger.debug("shap_values does not have 'feature_names' attribute.")
        
        logger.debug(f"Type of X_preprocessed: {type(X_preprocessed)}")
        logger.debug(f"Shape of X_preprocessed: {X_preprocessed.shape}")
        logger.debug(f"Columns in X_preprocessed: {X_preprocessed.columns.tolist()}")
        
        # Check feature alignment
        if hasattr(shap_values, 'feature_names'):
            shap_feature_names = shap_values.feature_names
        else:
            shap_feature_names = X_preprocessed.columns.tolist()
        
        if list(shap_feature_names) != list(X_preprocessed.columns):
            logger.error("Column mismatch between SHAP values and X_preprocessed.")
            logger.error(f"SHAP feature names ({len(shap_feature_names)}): {shap_feature_names}")
            logger.error(f"X_preprocessed columns ({len(X_preprocessed.columns)}): {X_preprocessed.columns.tolist()}")
            raise ValueError("Column mismatch between SHAP values and X_preprocessed.")
        else:
            logger.debug("Column alignment verified between SHAP values and X_preprocessed.")
        
        # Ensure all features are numeric
        X_preprocessed = X_preprocessed.apply(pd.to_numeric, errors='coerce')
        logger.debug("Converted all features in X_preprocessed to numeric types.")
        
        # Check for non-numeric columns
        non_numeric_cols = X_preprocessed.select_dtypes(include=['object', 'category']).columns.tolist()
        if non_numeric_cols:
            logger.error(f"Non-numeric columns found in X_preprocessed: {non_numeric_cols}")
            raise ValueError("All features must be numeric for SHAP plotting.")
        else:
            logger.debug("All features in X_preprocessed are numeric.")
        
        # Check for missing values in shap_values.
        # compute_shap_values returns a NumPy array, so fall back to the object itself
        # when there is no '.values' attribute.
        shap_array = getattr(shap_values, 'values', shap_values)
        if pd.isnull(shap_array).any():
            logger.warning("SHAP values contain NaNs.")
        else:
            logger.debug("SHAP values do not contain NaNs.")
        
    except Exception as e:
        logger.error(f"❌ Failed to compute SHAP values: {e}")
        return

    # ----------------------------
    # Step 13: Generate SHAP Summary Plot
    # ----------------------------
    try:
        shap_summary_path = predictions_output_path / 'shap_summary.png'
        logger.debug(f"SHAP summary plot will be saved to: {shap_summary_path}")
        
        # Pass X_preprocessed to SHAP plotting functions
        plot_shap_summary(
            shap_values=shap_values, 
            X_original=X_preprocessed,  # Use preprocessed data that matches SHAP values
            save_path=shap_summary_path, 
            debug=config.get('logging', {}).get('debug', False), 
            logger=logger
        )
        logger.info(f"✅ SHAP summary plot saved to '{shap_summary_path}'.")
    except Exception as e:
        logger.error(f"❌ Failed to generate SHAP summary plot: {e}")
        return

    # ----------------------------
    # Step 14: Generate SHAP Dependence Plots for Top Features
    # ----------------------------
    try:
        top_n = len(X_preprocessed.columns)  # Uses every feature; lower this to keep only the top N
        logger.debug(f"Generating SHAP dependence plots for top {top_n} features.")
        recommendations = generate_global_recommendations(
            shap_values=shap_values, 
            X_original=X_preprocessed,  # Use preprocessed data
            top_n=top_n, 
            debug=config.get('logging', {}).get('debug', False), 
            use_mad=False,  # Set to True if using MAD for range definitions
            logger=logger
        )
        shap_dependence_dir = predictions_output_path / 'shap_dependence_plots'
        os.makedirs(shap_dependence_dir, exist_ok=True)
        logger.debug(f"SHAP dependence plots will be saved to: {shap_dependence_dir}")
        for feature in recommendations.keys():
            shap_dependence_filename = f"shap_dependence_{feature}.png"
            shap_dependence_path = shap_dependence_dir / shap_dependence_filename
            logger.debug(f"Generating SHAP dependence plot for feature '{feature}' at '{shap_dependence_path}'.")
            plot_shap_dependence(
                shap_values=shap_values, 
                feature=feature, 
                X_original=X_preprocessed,  # Use preprocessed data
                save_path=shap_dependence_path, 
                debug=config.get('logging', {}).get('debug', False), 
                logger=logger
            )
        logger.info(f"✅ SHAP dependence plots saved to '{shap_dependence_dir}'.")
    except Exception as e:
        logger.error(f"❌ Failed to generate SHAP dependence plots: {e}")
        return

    # ----------------------------
    # Step 15: Annotate Final Dataset with SHAP Recommendations
    # ----------------------------
    try:
        logger.info("🔍 Annotating final dataset with SHAP recommendations...")
        feature_metadata = {}
        for feature in X_preprocessed.columns:
            if 'angle' in feature.lower():
                unit = 'degrees'
            else:
                unit = 'meters'
            feature_metadata[feature] = {'unit': unit}

        specific_feedback_list = []
        # shap_values rows are positional, so pair them with rows by position, not by index label.
        for pos, idx in enumerate(X_inversed.index):
            trial = X_inversed.iloc[pos]
            shap_values_trial = shap_values[pos]
            feedback = generate_individual_feedback(trial, shap_values_trial, feature_metadata, logger=logger)
            specific_feedback_list.append(feedback)
            if config.get('logging', {}).get('debug', False):
                logger.debug(f"Generated feedback for trial {idx}: {feedback}")
        X_inversed['specific_feedback'] = specific_feedback_list

        logger.info("✅ Specific feedback generated for all trials.")
    except Exception as e:
        logger.error(f"❌ Failed to generate specific feedback: {e}")
        return

    # ----------------------------
    # Step 16: Save the Final Dataset with Predictions and SHAP Annotations
    # ----------------------------
    try:
        os.makedirs(predictions_output_path, exist_ok=True)
        final_output_path = predictions_output_path / 'final_predictions_with_shap.csv'
        X_inversed.to_csv(final_output_path, index=False)
        logger.info(f"✅ Final dataset with predictions and SHAP annotations saved to '{final_output_path}'.")
        logger.debug(f"Final dataset shape: {X_inversed.shape}")
        logger.debug(f"Final dataset columns: {X_inversed.columns.tolist()}")
    except Exception as e:
        logger.error(f"❌ Failed to save the final dataset: {e}")
        return

    # ----------------------------
    # Step 17: Save Global Recommendations as JSON
    # ----------------------------
    try:
        recommendations_filename = "global_shap_recommendations.json"
        recommendations_save_path = predictions_output_path / recommendations_filename
        with open(recommendations_save_path, "w") as f:
            json.dump(recommendations, f, indent=4)
        logger.info(f"✅ Global SHAP recommendations saved to '{recommendations_save_path}'.")
    except Exception as e:
        logger.error(f"❌ Failed to save global SHAP recommendations: {e}")
        return

    # ----------------------------
    # Step 18: Generate Specific Feedback for Each Trial
    # ----------------------------
    # Note: Per-trial feedback is generated in Step 15 above, before the dataset is saved.

    # ----------------------------
    # Step 19: Display Minimal Outputs
    # ----------------------------
    try:
        # Display the first few rows of the final dataset
        print("\nInverse Transformed Prediction DataFrame with Predictions and SHAP Annotations:")
        print(X_inversed.head())

        # Display global recommendations
        print("\nGlobal SHAP Recommendations:")
        for feature, rec in recommendations.items():
            print(f"{feature}: Range={rec['range']}, Importance={rec['importance']}, Direction={rec['direction']}")

        # Display specific feedback for each trial
        print("\nSpecific Feedback for Each Trial:")
        for idx, row in X_inversed.iterrows():
            print(f"Trial {idx + 1}:")
            for feature, feedback in row['specific_feedback'].items():
                print(f"  - {feedback}")
            print()
    except Exception as e:
        logger.error(f"❌ Error during displaying outputs: {e}")
        return

    logger.info("✅ Prediction pipeline with SHAP analysis completed successfully.")


if __name__ == "__main__":
    main()


2025-01-13 12:36:58,154 - INFO - ✅ Starting prediction module.
2025-01-13 12:36:58,173 - INFO - Best model identified: Random Forest
2025-01-13 12:36:58,182 - INFO - ✅ Prediction input data loaded from 'C:\Users\ghadf\vscode_projects\docker_projects\spl_freethrow_biomechanics_analysis_ml_prediction\data\processed\final_ml_dataset.csv'.
2025-01-13 12:36:58,183 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-01-13 12:36:58,183 - INFO - Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-01-13 12:36:58,184 [INFO] Step: filter_columns
2025-01-13 12:36:58,184 - INFO - Step: filter_columns
2025-01-13 12:36:58,187 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (125, 13)
2025-01-13 12:36:58,187 - INFO - ✅ Filtered DataFrame to include only specified features. Shape: (125, 13)
2025-01-13 12:36:58,188 [DEBUG] Selected Features: ['release_ball_direction_x', 'release_ball_direction_z', 'release_ball_direction_y', 'elbow_release_angle', 'elbo

Configuration loaded successfully from ..\..\data\model\preprocessor_config\preprocessor_config.yaml.
✅ Features loaded from ..\..\data\model\pipeline\final_ml_df_selected_features_columns_test.pkl
✅ Ordinal categoricals loaded from ..\..\data\model\pipeline\features_info\ordinal_categoricals.pkl
✅ Nominal categoricals loaded from ..\..\data\model\pipeline\features_info\nominal_categoricals.pkl
✅ Numericals loaded from ..\..\data\model\pipeline\features_info\numericals.pkl
✅ Y variable loaded from ..\..\data\model\pipeline\features_info\y_variable.pkl
X_new_preprocessed type =  <class 'pandas.core.frame.DataFrame'> X_new_inverse type =  <class 'pandas.core.frame.DataFrame'>


2025-01-13 12:36:58,361 - PredictAndAttachLogger - INFO - ✅ Predictions made successfully.
2025-01-13 12:36:58,361 - INFO - ✅ Predictions made successfully.
2025-01-13 12:36:58,363 - DEBUG - Predictions sample: [1 1 0 0 1]
2025-01-13 12:36:58,408 - PredictAndAttachLogger - INFO - ✅ Prediction probabilities computed successfully.
2025-01-13 12:36:58,408 - INFO - ✅ Prediction probabilities computed successfully.
2025-01-13 12:36:58,409 - DEBUG - Prediction probabilities sample:
[[0.47692308 0.52307692]
 [0.         1.        ]]
2025-01-13 12:36:58,411 - DEBUG - Predictions attached to inverse-transformed DataFrame.
2025-01-13 12:36:58,412 - DEBUG - Prediction probabilities attached to inverse-transformed DataFrame.
2025-01-13 12:36:58,412 - DEBUG - Final shape of X_inversed: (125, 15)
2025-01-13 12:36:58,413 - DEBUG - Final columns in X_inversed: ['release_ball_direction_x', 'release_ball_direction_z', 'release_ball_direction_y', 'elbow_release_angle', 'elbow_max_angle', 'wrist_release_a


Inverse Transformed Prediction DataFrame with Predictions and SHAP Annotations:
   release_ball_direction_x  release_ball_direction_z  \
0                  0.377012                  0.926203   
1                  0.417644                  0.901906   
2                  0.383086                  0.922353   
3                  0.292054                  0.956396   
4                  0.389917                  0.918894   

   release_ball_direction_y  elbow_release_angle  elbow_max_angle  \
0                  0.002969            72.325830       106.272118   
1                 -0.110176            58.430068       101.798798   
2                 -0.050096            73.883728       106.989633   
3                  0.003209            71.424835       104.160865   
4                 -0.059987            73.619348       103.832024   

   wrist_release_angle  wrist_max_angle  knee_release_angle  knee_max_angle  \
0            28.102765        35.918406           32.646276       63.541007   
1  

<Figure size 800x600 with 0 Axes>
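
Step 15 pairs each row of `X_inversed` with a row of `shap_values`. SHAP arrays are positional, while a DataFrame index can carry arbitrary labels (for example after filtering or `set_index`). The following is a minimal sketch with made-up numbers, not the project's data; `pair_rows_with_shap` is a hypothetical helper that matches rows by position so the lookup does not depend on the labels being 0..n-1.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three trials whose index labels are not 0..n-1.
X = pd.DataFrame({"elbow_release_angle": [72.3, 58.4, 73.9]}, index=[10, 11, 12])
shap_rows = np.array([[0.05], [-0.12], [0.03]])  # one row per trial, in row order


def pair_rows_with_shap(frame, shap_array):
    """Return {index_label: shap_row}, matching rows by position."""
    if len(frame) != len(shap_array):
        raise ValueError("frame and shap_array must have the same number of rows")
    return {label: shap_array[pos] for pos, label in enumerate(frame.index)}


paired = pair_rows_with_shap(X, shap_rows)
print(paired[11])  # SHAP row for the trial labelled 11 (second row)
```

Indexing `shap_rows[11]` directly would raise an `IndexError` here, which is the failure mode the positional pairing avoids.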


In [17]:
# %%writefile ml/predict_with_shap_usage.py

import pandas as pd
import logging
import os
import yaml
import joblib
import json
from pathlib import Path
from typing import Any, Dict, List, Optional
from datetime import datetime

from ml.predict.predict import predict_and_attach_predict_probs
from datapreprocessor import DataPreprocessor
from ml.feature_selection.feature_importance_calculator import manage_features  # New Import

# Local imports for SHAP and model loading:
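# generate_global_recommendations and plot_shap_force are called below; they are
# assumed to live in ml.shap.shap_utils alongside the other SHAP helpers.
from ml.shap.shap_utils import (
    compute_shap_values,
    plot_shap_summary,
    plot_shap_dependence,
    plot_shap_force,
    generate_global_recommendations,
    generate_individual_feedback
)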
from ml.train_utils.train_utils import load_model  # Ensure train_utils.py contains load_model

from jinja2 import Environment, FileSystemLoader
import logging.config
import shap
import matplotlib.pyplot as plt
import pickle


def load_dataset(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)


def load_config(config_path: Path) -> Dict[str, Any]:
    if not config_path.exists():
        raise FileNotFoundError(f"Configuration file not found at {config_path}")
    with open(config_path, 'r') as file:
        config = yaml.safe_load(file)
    # (Optionally validate required keys here)
    return config


def setup_logging(config: Dict[str, Any], log_file_path: Path) -> logging.Logger:
    log_level = config.get('logging', {}).get('level', 'INFO').upper()
    logging_config = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'standard': {
                'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
            },
        },
        'handlers': {
            'console': {
                'class': 'logging.StreamHandler',
                'level': log_level,
                'formatter': 'standard',
                'stream': 'ext://sys.stdout',
            },
            'file': {
                'class': 'logging.handlers.RotatingFileHandler',
                'level': 'DEBUG',
                'formatter': 'standard',
                'filename': str(log_file_path),
                'maxBytes': 10**6,
                'backupCount': 5,
                'encoding': 'utf8',
            },
        },
        'loggers': {
            '__main__': {
                'handlers': ['console', 'file'],
                'level': log_level,
                'propagate': False
            },
        }
    }
    logging.config.dictConfig(logging_config)
    return logging.getLogger(__name__)


# Thin wrappers for loading the model and preprocessor, and for applying preprocessing.
def load_model_from_path(model_path: Path):
    """
    Given model_path e.g., .../RandomForest/trained_model.pkl,
    extract the model name from the parent folder ("RandomForest")
    and the base directory (the parent of the model folder).
    Then call the original load_model function.
    """
    model_folder = model_path.parent      # e.g., .../RandomForest
    model_name = model_folder.name          # "RandomForest"
    base_dir = model_folder.parent          # base model directory
    return load_model(model_name, base_dir)

def load_preprocessor_from_path(transformer_path: Path):
    # Load the fitted preprocessor that was pickled at training time.
    # pickle can execute arbitrary code on load, so only open files you trust.
    with open(transformer_path, 'rb') as f:
        preprocessor = pickle.load(f)
    return preprocessor

def apply_preprocessing(df: pd.DataFrame, preprocessor: DataPreprocessor):
    # Wrapper over DataPreprocessor.final_preprocessing; the middle return value is unused here.
    X_preprocessed, _, X_inversed = preprocessor.final_preprocessing(df)
    return X_preprocessed, X_inversed

def build_feature_metadata(X: pd.DataFrame) -> Dict[str, Dict[str, Any]]:
    # Build metadata (e.g., units) for each feature.
    metadata = {}
    for col in X.columns:
        metadata[col] = {'unit': 'degrees' if 'angle' in col.lower() else 'meters'}
    return metadata


# New unified function that orchestrates prediction and SHAP.
def predict_and_shap(
    config: Dict[str, Any],
    df_input: pd.DataFrame,
    model_path: Path,
    transformer_path: Path,
    save_dir: Path,
    generate_summary_plot: bool = True,
    generate_dependence_plots: bool = False,
    generate_force_plots_for: Optional[List[int]] = None,
    top_n_features: int = 10,
    use_mad: bool = False,
    feedback_id_column: Optional[str] = None, 
    feedback_specific_ids: Optional[List[Any]] = None,
    logger: Optional[logging.Logger] = None
) -> Dict[str, Any]:
    results = {}
    
    # Step 1: Load Model and Preprocessor
    try:
        trained_model = load_model_from_path(model_path)
        preprocessor = load_preprocessor_from_path(transformer_path)
        if logger:
            logger.info("Model and Transformers loaded successfully.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to load model/transformer: {e}")
        raise

    # Step 2: Preprocess the Data
    try:
        X_preprocessed, X_inversed = apply_preprocessing(df_input, preprocessor)
        if logger:
            logger.info("Data preprocessed successfully.")
    except Exception as e:
        if logger:
            logger.error(f"Preprocessing failed: {e}")
        raise

    # Step 3: Make Predictions
    try:
        predictions, prediction_probs, X_inversed = predict_and_attach_predict_probs(trained_model, X_preprocessed, X_inversed, logger)
        results['predictions'] = predictions
        results['prediction_probs'] = prediction_probs
        if logger:
            logger.info("Predictions generated and added to dataset.")
    except Exception as e:
        if logger:
            logger.error(f"Prediction failed: {e}")
        raise

    # Step 4: Compute SHAP Values
    try:
        explainer, shap_values = compute_shap_values(trained_model, X_preprocessed, debug=config.get('logging', {}).get('debug', False), logger=logger)
        results['shap_values'] = shap_values
        results['explainer'] = explainer
        if logger:
            logger.info("SHAP values computed successfully.")
    except Exception as e:
        if logger:
            logger.error(f"SHAP computation failed: {e}")
        raise

    # Step 5: Optional—Generate a SHAP Summary Plot
    if generate_summary_plot:
        shap_summary_path = save_dir / "shap_summary.png"
        try:
            plot_shap_summary(shap_values, X_preprocessed, shap_summary_path, debug=config.get('logging', {}).get('debug', False), logger=logger)
            results['shap_summary_plot'] = str(shap_summary_path)
            if logger:
                logger.info(f"SHAP summary plot saved at {shap_summary_path}.")
        except Exception as e:
            if logger:
                logger.error(f"Failed to generate SHAP summary plot: {e}")

    # Step 6: Optional—Generate SHAP Dependence Plots for Top Features
    recommendations = {}
    if generate_dependence_plots:
        try:
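            # Keyword arguments mirror the earlier call so use_mad and logger cannot land in the wrong slot.
            recommendations = generate_global_recommendations(
                shap_values=shap_values,
                X_original=X_preprocessed,
                top_n=top_n_features,
                debug=config.get('logging', {}).get('debug', False),
                use_mad=use_mad,
                logger=logger
            )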
            results['recommendations'] = recommendations
            shap_dependence_dir = save_dir / "shap_dependence_plots"
            shap_dependence_dir.mkdir(exist_ok=True, parents=True)
            for feature in recommendations.keys():
                dep_path = shap_dependence_dir / f"shap_dependence_{feature}.png"
                plot_shap_dependence(shap_values, feature, X_preprocessed, dep_path, debug=config.get('logging', {}).get('debug', False), logger=logger)
            if logger:
                logger.info(f"SHAP dependence plots saved at {shap_dependence_dir}.")
        except Exception as e:
            if logger:
                logger.error(f"Failed to generate dependence plots: {e}")

    # Step 7: Optional—Generate SHAP Force Plots for Specific Row Indices
    if generate_force_plots_for:
        try:
            force_plots_dir = save_dir / "shap_force_plots"
            force_plots_dir.mkdir(exist_ok=True, parents=True)
            for trial_idx in generate_force_plots_for:
                force_path = force_plots_dir / f"shap_force_trial_{trial_idx}.png"
                plot_shap_force(explainer, shap_values, X_preprocessed, trial_idx, force_path, logger, config.get('logging', {}).get('debug', False))
            if logger:
                logger.info(f"SHAP force plots saved at {force_plots_dir}.")
        except Exception as e:
            if logger:
                logger.error(f"Failed to generate SHAP force plots: {e}")

    # Step 8: Optional—Generate Individual Feedback
    try:
        # If a feedback_id_column is specified, set the DataFrame index.
        if feedback_id_column and feedback_id_column in X_inversed.columns:
            X_inversed.set_index(feedback_id_column, inplace=True, drop=False)
        # Decide which rows to generate feedback for.
        if feedback_specific_ids is not None:
            target_indices = [i for i in feedback_specific_ids if i in X_inversed.index]
        else:
            target_indices = list(X_inversed.index)

        feature_metadata = build_feature_metadata(X_preprocessed)
        feedback_dict = {}
        for idx in target_indices:
            # shap_values rows are positional, so map the index label to its row position.
            # Assumes index labels are unique; get_loc returns a slice or mask otherwise.
            local_idx = X_inversed.index.get_loc(idx)
            fb = generate_individual_feedback(X_inversed.loc[idx], shap_values[local_idx], feature_metadata, logger=logger)
            feedback_dict[idx] = fb
        # Attach the feedback as an object column; rows without feedback get None.
        X_inversed['specific_feedback'] = [feedback_dict.get(i) for i in X_inversed.index]
        results['individual_feedback'] = feedback_dict
        if logger:
            logger.info("Individual feedback generated.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to generate individual feedback: {e}")

    # Step 9: Save the Final Dataset and Recommendations
    try:
        final_dataset_path = save_dir / "final_predictions_with_shap.csv"
        X_inversed.to_csv(final_dataset_path, index=False)
        results['final_dataset'] = str(final_dataset_path)
        if recommendations:
            recs_path = save_dir / "global_shap_recommendations.json"
            with open(recs_path, "w") as f:
                json.dump(recommendations, f, indent=4)
            results['recommendations_file'] = str(recs_path)
        if logger:
            logger.info("Final dataset and recommendations saved.")
    except Exception as e:
        if logger:
            logger.error(f"Failed to save outputs: {e}")
        raise

    if logger:
        logger.info("Predict+SHAP pipeline completed successfully.")
    return results


def main():
    # ----------------------------
    # Step 1: Load Configuration
    # ----------------------------
    config_path = Path('../../data/model/preprocessor_config/preprocessor_config.yaml')
    try:
        config = load_config(config_path)
        print(f"Configuration loaded successfully from {config_path}.")
    except Exception as e:
        print(f"❌ Failed to load configuration: {e}")
        return

    # ----------------------------
    # Step 2: Define Paths from Config
    # ----------------------------
    paths = config.get('paths', {})
    feature_paths = {
        'features': Path('../../data/model/pipeline/final_ml_df_selected_features_columns_test.pkl'),
        'ordinal_categoricals': Path('../../data/model/pipeline/features_info/ordinal_categoricals.pkl'),
        'nominal_categoricals': Path('../../data/model/pipeline/features_info/nominal_categoricals.pkl'),
        'numericals': Path('../../data/model/pipeline/features_info/numericals.pkl'),
        'y_variable': Path('../../data/model/pipeline/features_info/y_variable.pkl')
    }
    data_dir = Path(paths.get('data_dir', '../../data/processed')).resolve()
    raw_data_path = data_dir / paths.get('raw_data', 'final_ml_dataset.csv')
    transformers_dir = Path(paths.get('transformers_save_base_dir', '../preprocessor/transformers')).resolve()
    predictions_output_path = Path(paths.get('predictions_output_dir', 'preprocessor/predictions')).resolve()
    model_save_dir = Path(paths.get('model_save_base_dir', '../preprocessor/models')).resolve()
    log_dir = Path(paths.get('log_dir', '../preprocessor/logs')).resolve()
    log_file = paths.get('log_file', 'prediction.log')

    # ----------------------------
    # Step 3: Setup Logging
    # ----------------------------
    logger = setup_logging(config, log_dir / log_file)
    logger.info("✅ Starting prediction module (unified predict_and_shap).")
    logger.debug(f"Paths: {paths}")

    # ----------------------------
    # Step 4: Load Input Data
    # ----------------------------
    try:
        df_predict = load_dataset(raw_data_path)
        logger.info(f"✅ Prediction input data loaded from {raw_data_path}.")
    except Exception as e:
        logger.error(f"❌ Failed to load input data: {e}")
        return

    # ----------------------------
    # Step 5: Define Model and Transformer Paths & Save Directory
    # ----------------------------
    best_model_name = "RandomForest"  # For example, or determine from tuning results if needed.
    model_path = model_save_dir / f"{best_model_name.replace(' ', '_')}" / "trained_model.pkl"
    transformer_path = transformers_dir / "transformers.pkl"
    output_dir = predictions_output_path  # Outputs will be saved here.
    
    # ----------------------------
    # Step 6: Call the Unified predict_and_shap Function
    # ----------------------------
    # For example, we request:
    # - A summary plot,
    # - Dependence plots for the top 10 features,
    # - Force plots for the first two rows,
    # - And generate individual feedback using an identifier column "trial_id" (if it exists)
    try:
        results = predict_and_shap(
            config=config,
            df_input=df_predict,
            model_path=model_path,
            transformer_path=transformer_path,
            save_dir=output_dir,
            generate_summary_plot=True,
            generate_dependence_plots=True,
            generate_force_plots_for=[0, 1],  # Generate force plots for trials 0 and 1
            top_n_features=10,
            use_mad=False,
            feedback_id_column="trial_id",        # Optional: index rows by this column if it exists
            feedback_specific_ids=None,           # None means generate for all rows
            logger=logger
        )
        logger.info("Unified predict_and_shap function executed successfully.")
    except Exception as e:
        logger.error(f"Unified predict_and_shap function failed: {e}")
        return

    # ----------------------------
    # Step 7: (Optional) Display Minimal Outputs
    # ----------------------------
    try:
        # For simplicity, we just print the preview and the keys of our results dictionary.
        print("\nFinal Predictions with SHAP annotations (preview):")
        print(pd.read_csv(results['final_dataset']).head())
        print("\nResults keys:")
        print(results.keys())
    except Exception as e:
        logger.error(f"Failed to display outputs: {e}")


if __name__ == "__main__":
    main()


Configuration loaded successfully from ..\..\data\model\preprocessor_config\preprocessor_config.yaml.
2025-01-13 12:37:03,657 - __main__ - INFO - ✅ Starting prediction module (unified predict_and_shap).
2025-01-13 12:37:03,663 - __main__ - INFO - ✅ Prediction input data loaded from C:\Users\ghadf\vscode_projects\docker_projects\spl_freethrow_biomechanics_analysis_ml_prediction\data\processed\final_ml_dataset.csv.
2025-01-13 12:37:03,664 - __main__ - ERROR - Failed to load model/transformer: [Errno 2] No such file or directory: 'C:\\Users\\ghadf\\vscode_projects\\docker_projects\\spl_freethrow_biomechanics_analysis_ml_prediction\\notebooks\\freethrow_predictions\\preprocessor\\models\\RandomForest_model.pkl'
2025-01-13 12:37:03,665 - __main__ - ERROR - Unified predict_and_shap function failed: [Errno 2] No such file or directory: 'C:\\Users\\ghadf\\vscode_projects\\docker_projects\\spl_freethrow_biomechanics_analysis_ml_prediction\\notebooks\\freethrow_predictions\\preprocessor\\models\
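
The error above names a missing `RandomForest_model.pkl` directly under the models directory, while Step 5 of `main()` builds `<model_save_dir>/RandomForest/trained_model.pkl`. Which layout `load_model` really expects is not shown in this notebook, so the following is only a sketch: two hypothetical helpers, `candidate_model_files` and `find_model_file`, that check both layouts and list every path tried before failing.

```python
import tempfile
from pathlib import Path
from typing import List


def candidate_model_files(base_dir: Path, model_name: str) -> List[Path]:
    """Both layouts that appear in this notebook (hypothetical helper)."""
    return [
        base_dir / f"{model_name}_model.pkl",         # path named in the error log
        base_dir / model_name / "trained_model.pkl",  # path built in main()
    ]


def find_model_file(base_dir: Path, model_name: str) -> Path:
    """Return the first candidate that exists, else raise listing all paths tried."""
    candidates = candidate_model_files(base_dir, model_name)
    for path in candidates:
        if path.is_file():
            return path
    tried = ", ".join(str(p) for p in candidates)
    raise FileNotFoundError(f"No model file for '{model_name}'. Tried: {tried}")


# Usage with a throwaway directory that uses the nested layout.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "RandomForest").mkdir()
    (base / "RandomForest" / "trained_model.pkl").write_bytes(b"")
    print(find_model_file(base, "RandomForest").name)
```

Running a check like this before `predict_and_shap` would surface a layout mismatch with a message that names both candidate paths.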

### Bayesian Example

In [18]:
#%%writefile ../../src/freethrow_predictions/ml/bayes_optimization_post_predict/bayes_optim_angles_xgboostpreds_EXAMPLE.py

from skopt import gp_minimize
from skopt.space import Real
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd

def prepare_data():
    """Build a synthetic training set: random joint angles with random make/miss labels."""
    np.random.seed(42)
    features = ['knee_max_angle', 'wrist_max_angle', 'elbow_max_angle']
    X_train = pd.DataFrame({
        'knee_max_angle': np.random.uniform(40, 140, 200),
        'wrist_max_angle': np.random.uniform(0, 90, 200),
        'elbow_max_angle': np.random.uniform(30, 160, 200),
    })[features]
    # Labels are drawn independently of the features, so the model can only fit noise.
    y_train = pd.Series(np.random.choice([0, 1], size=200))

    # Debug: Check the dataset details
    print(f"X_train shape: {X_train.shape}")
    print(f"X_train sample:\n{X_train.head()}")
    print(f"y_train sample:\n{y_train.head()}")
    return X_train, y_train, features


def train_model(X_train, y_train):
    """Train a Decision Tree Classifier."""
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)

    # Debug: Check feature importances
    print(f"Feature importances: {clf.feature_importances_}")
    return clf


def define_search_space(features):
    """Define the search space for optimization."""
    spaces = {
        'knee_max_angle': Real(40, 140, name='knee_angle'),
        # Note: the wrist lower bound (30) is tighter than the 0-90 range sampled in prepare_data().
        'wrist_max_angle': Real(30, 90, name='wrist_angle'),
        'elbow_max_angle': Real(30, 160, name='elbow_angle')
    }
    return [spaces[feature] for feature in features]


def objective_function(clf):
    """Create an objective function for Bayesian Optimization."""
    def objective(params):
        knee, wrist, elbow = params
        input_df = pd.DataFrame([[knee, wrist, elbow]], 
                                columns=['knee_max_angle', 'wrist_max_angle', 'elbow_max_angle'])
        success_prob = clf.predict_proba(input_df)[0, 1]
        # Debug: Log evaluation details
        print(f"Evaluating: knee={knee:.2f}, wrist={wrist:.2f}, elbow={elbow:.2f}, success_prob={success_prob:.2f}")
        return -success_prob  # Negative for minimization
    return objective


def perform_optimization(objective, space):
    """Perform Bayesian Optimization."""
    res = gp_minimize(
        func=objective,
        dimensions=space,
        n_calls=50,
        n_random_starts=10,
        random_state=42
    )
    # Debug: Log optimization result details
    print(f"Optimization result: {res}")
    return res


def calculate_baselines(X_train):
    """Calculate baseline values for each feature."""
    baselines = {col: X_train[col].mean() for col in X_train.columns}
    print(f"Baseline values: {baselines}")
    return baselines


def compare_results(features, baselines, results):
    """Compare optimal values with baselines."""
    print("\nOptimization Results:")
    # results.x follows the order of the search space, which was built from `features`.
    for feature, optimal in zip(features, results.x):
        baseline = baselines[feature]  # Look up by name instead of relying on dict order
        difference = optimal - baseline
        print(f"{feature} - Optimal: {optimal:.2f}, Baseline: {baseline:.2f}, Difference: {difference:.2f}")


# Main workflow
X_train, y_train, features = prepare_data()
clf = train_model(X_train, y_train)
space = define_search_space(features)
objective = objective_function(clf)
res = perform_optimization(objective, space)
baselines = calculate_baselines(X_train)
compare_results(features, baselines, res)


X_train shape: (200, 3)
X_train sample:
   knee_max_angle  wrist_max_angle  elbow_max_angle
0       77.454012        57.782848        43.406103
1      135.071431         7.572597       147.331878
2      113.199394        14.546584        95.682808
3       99.865848        80.869877       137.439471
4       55.601864        54.578615        71.606448
y_train sample:
0    0
1    1
2    1
3    1
4    0
dtype: int32
Feature importances: [0.37085458 0.26986149 0.35928393]
Evaluating: knee=119.65, wrist=41.01, elbow=131.36, success_prob=1.00
Evaluating: knee=99.69, wrist=56.75, elbow=43.00, success_prob=0.00
Evaluating: knee=85.92, wrist=50.02, elbow=48.57, success_prob=0.00
Evaluating: knee=105.09, wrist=33.38, elbow=123.86, success_prob=1.00
Evaluating: knee=133.86, wrist=30.05, elbow=158.99, success_prob=1.00
Evaluating: knee=101.75, wrist=66.70, elbow=30.92, success_prob=1.00
Evaluating: knee=42.31, wrist=61.49, elbow=81.98, success_prob=0.00
Evaluating: knee=44.67, wrist=88.43, elbow=60
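
Every `success_prob` printed above is exactly 0.00 or 1.00. That is expected for an unconstrained `DecisionTreeClassifier`: with default settings it keeps splitting until each leaf is pure, so `predict_proba` returns only 0 or 1 and the objective becomes a step function that gives the Gaussian-process surrogate little gradient to follow. The following is a minimal sketch on synthetic data that mirrors the example (same ranges, but a different random generator, so the numbers will not match the output above). `min_samples_leaf=20` is an illustrative choice, not a tuned value.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic angles and random labels, in the same ranges as the example above.
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(40, 140, 200),   # knee_max_angle
    rng.uniform(0, 90, 200),     # wrist_max_angle
    rng.uniform(30, 160, 200),   # elbow_max_angle
])
y = rng.integers(0, 2, size=200)

# Default tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Leaf-size floor: each leaf keeps at least 20 samples, so leaves stay mixed.
smooth_tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

full_probs = np.unique(full_tree.predict_proba(X)[:, 1])
smooth_probs = np.unique(smooth_tree.predict_proba(X)[:, 1])
print("unconstrained tree:", full_probs)
print("min_samples_leaf=20:", np.round(smooth_probs, 2))
```

Graded probabilities give the optimizer more than two objective levels to distinguish, although with random labels there is still no real signal to find.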

In [19]:
#%%writefile ../../src/freethrow_predictions/ml/bayes_optim_angles_xgboostpreds.py
import os
import pandas as pd
import numpy as np
from skopt import gp_minimize
from skopt.space import Real
from joblib import load
import matplotlib.pyplot as plt
import pickle  # Added import


def log_debug(message, debug):
    """
    Log debug messages if debug mode is enabled.
    """
    if debug:
        print(message)



def preprocess_data(data_processor, debug):
    """
    Preprocess data and return the transformed data and optimization ranges.
    """
    results = data_processor.run(return_optimization_ranges=True)
    # example usage: X_transformed, y_encoded, transformed_data, transformed_data_df, original_data, X, y, optimization_ranges, optimization_transformed_ranges
    (X_transformed, _, _, _, _, _, _, optimization_ranges, optimization_transformed_ranges) = results

    log_debug(f"[Debug] Data preprocessing completed. Shape: {X_transformed.shape}", debug)
    log_debug(f"[Debug] Optimization Ranges: {optimization_ranges}", debug)
    log_debug(f"[Debug] Optimization Transformed Ranges: {optimization_transformed_ranges}", debug)

    # Return all relevant data
    return X_transformed, optimization_transformed_ranges, optimization_ranges


def define_search_space(transformed_ranges, optimization_columns, debug):
    """
    Define the search space for Bayesian optimization based on transformed ranges.
    """
    # Validate that all optimization_columns are present in transformed_ranges
    missing_columns = [col for col in optimization_columns if col not in transformed_ranges]
    if missing_columns:
        error_msg = f"The following optimization columns are missing in transformed_ranges: {missing_columns}"
        log_debug(f"[Error] {error_msg}", debug)
        raise KeyError(error_msg)

    search_space = [
        Real(transformed_ranges[col][0], transformed_ranges[col][1], name=col)
        for col in optimization_columns
    ]
    log_debug(f"[Debug] Search space defined for columns: {optimization_columns}", debug)
    return search_space


def calculate_baselines(data, columns, debug):
    """
    Calculate baseline values for the optimization columns.
    """
    baselines = data.mean(axis=0).loc[columns].to_dict()
    log_debug(f"[Debug] Baseline Values (Transformed Space): {baselines}", debug)
    return baselines


def objective(params, optimization_columns, X_transformed, model, debug):
    """
    Objective function for Bayesian optimization. Calculates the success probability.
    """
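    # Map each candidate value back to its column name
    param_dict = dict(zip(optimization_columns, params))
    # Start from the mean of every feature so non-optimized columns stay at their average
    feature_vector = pd.Series(X_transformed.mean(axis=0), index=X_transformed.columns)

    # Overwrite only the columns being optimized with the candidate values
    for col, value in param_dict.items():
        feature_vector[col] = value

    # The model expects a 2D input; column 1 of predict_proba is taken here as the success class
    feature_vector = pd.DataFrame([feature_vector])
    success_prob = model.predict_proba(feature_vector)[0, 1]

    log_debug(f"[Debug] Success probability: {success_prob:.4f}", debug)
    # gp_minimize minimizes its objective, so negate to maximize the success probability
    return -success_prob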


def perform_optimization(objective_function, search_space, n_calls, debug):
    """
    Perform Bayesian optimization to find the optimal parameters.
    """
    log_debug("[Debug] Starting Bayesian optimization...", debug)
    result = gp_minimize(func=objective_function, dimensions=search_space, n_calls=n_calls, random_state=42)
    log_debug("[Debug] Bayesian optimization completed.", debug)
    log_debug(f"[Debug] Evaluated Parameters: {result.x_iters}", debug)
    log_debug(f"[Debug] Evaluated Success Probabilities: {[-val for val in result.func_vals]}", debug)

    if debug:
        # Visualization of Success Probabilities Over Iterations
        iterations = range(1, len(result.func_vals) + 1)
        success_probabilities = [-val for val in result.func_vals]

        plt.figure(figsize=(10, 6))
        plt.plot(iterations, success_probabilities, marker='o', linestyle='-', color='b')
        plt.title("Bayesian Optimization Progression")
        plt.xlabel("Iteration")
        plt.ylabel("Success Probability")
        plt.grid(True)
        plt.show()

    return result


def inverse_transform_params(preprocessor, params, columns, debug):
    """
    Inverse transform optimization parameters and return them.
    """
    inverse_transformed = preprocessor.inverse_transform_optimization_params(
        params=params, optimization_columns=columns
    )
    log_debug(f"[Debug] Inverse-transformed parameters: {inverse_transformed.head()}", debug)
    return inverse_transformed


def compare_results(
    optimization_columns,
    baselines,
    inverse_baselines,
    final_results,
    res,
    inverse_transformed_params,
    optimization_ranges,
    model,
    X_transformed,
    debug,
):
    """
    Generate a comparison table between baseline, optimized parameters, success probabilities, and original ranges.

    Args:
        optimization_columns (list): List of parameter names.
        baselines (pd.DataFrame or dict): Baseline values in the transformed space (not used in the table).
        inverse_baselines (pd.DataFrame): Baseline values in the original scale.
        final_results (pd.DataFrame): Final results from the optimization process.
        res (OptimizeResult): Optimization result object.
        inverse_transformed_params (pd.DataFrame): Inverse-transformed parameters.
        optimization_ranges (dict): Original optimization ranges.
        model (object): Trained model to calculate probabilities.
        X_transformed (pd.DataFrame): Transformed feature space.
        debug (bool): Enable debug logging.

    Returns:
        pd.DataFrame: Final comparison table.
    """
    try:
        idx_optimal = res.x_iters.index(res.x)
    except ValueError:
        log_debug("[Error] Optimal parameters not found in x_iters.", debug)
        raise

    original_ranges = pd.DataFrame.from_dict(
        optimization_ranges, orient="index", columns=["Original Min Range", "Original Max Range"]
    ).reset_index().rename(columns={"index": "Parameter"})

    # Create feature vectors for baseline and optimized
    baseline_vector = X_transformed.mean(axis=0)
    optimized_vector = baseline_vector.copy()
    for i, col in enumerate(optimization_columns):
        optimized_vector[col] = res.x[i]

    # Calculate success probabilities
    baseline_vector_df = pd.DataFrame([baseline_vector], columns=X_transformed.columns)
    optimized_vector_df = pd.DataFrame([optimized_vector], columns=X_transformed.columns)
    baseline_success_prob = model.predict_proba(baseline_vector_df)[0, 1]
    optimized_success_prob = model.predict_proba(optimized_vector_df)[0, 1]

    comparison = pd.DataFrame({
        "Parameter": optimization_columns,
        "Baseline (Metric)": inverse_baselines.iloc[0].values,
        "Optimized (Metric)": inverse_transformed_params.iloc[idx_optimal].values,
        "Difference": inverse_transformed_params.iloc[idx_optimal].values - inverse_baselines.iloc[0].values,
        "Baseline Success Probability": [baseline_success_prob] * len(optimization_columns),
        "Optimized Success Probability": [optimized_success_prob] * len(optimization_columns),
    })

    # Merge original ranges into the comparison table
    comparison = comparison.merge(original_ranges, on="Parameter", how="left")

    log_debug("\nComparison of Starting and Optimized Parameters with Success Probabilities:", debug)
    log_debug(comparison, debug)
    return comparison


def perform_bayesian_call_optimization(objective_function, search_space, min_calls, max_calls, repeat_threshold, debug):
    """
    Perform Bayesian optimization with early stopping based on repeated max values or lack of improvement.

    Args:
        objective_function (callable): Function to optimize.
        search_space (list): List of parameter ranges to explore.
        min_calls (int): Minimum number of optimization calls before allowing early stopping.
        max_calls (int): Maximum number of optimization calls.
        repeat_threshold (int): Number of repeats of the max value before stopping.
        debug (bool): Enable debugging and visualization if True.

    Returns:
        skopt.OptimizeResult: Result object containing optimization details.
    """
    log_debug("[Debug] Starting Bayesian optimization with max tracking and early stopping...", debug)

    # Initialize tracking variables
    seen_values = {}
    iteration_results = []
    max_value = float('-inf')  # Start with a very low max value
    max_repeats = 0  # Counter for repeated max values
    iterations_since_max = 0  # Counter for iterations since last new max

    def early_stopping_callback(res):
        nonlocal iteration_results, seen_values, max_value, max_repeats, iterations_since_max
        success_prob = -res.func_vals[-1]  # Current success probability
        iteration_results.append(success_prob)

        # Track occurrences of success_prob
        seen_values[success_prob] = seen_values.get(success_prob, 0) + 1

        # Update max value and repetitions
        if success_prob > max_value:
            max_value = success_prob  # Update max value
            max_repeats = 1  # Reset repeats for the new max
            iterations_since_max = 0  # Reset since we've found a new max
            log_debug(f"[Debug] New max success probability found: {max_value:.4f}", debug)
        elif success_prob == max_value:
            max_repeats = seen_values[success_prob]  # Use total occurrences, not consecutive
            iterations_since_max += 1  # A tie is not a new max, so it counts toward the no-improvement limit
            log_debug(f"[Debug] Max success probability {max_value:.4f} repeated {max_repeats} times.", debug)
        else:
            iterations_since_max += 1  # Increment counter for lack of improvement

        # Early stopping conditions
        # Allow early stopping once min_calls evaluations have completed, as the docstring states
        if len(iteration_results) >= min_calls:
            if max_repeats >= repeat_threshold:
                log_debug(f"[Debug] Early stopping triggered. Max value {max_value:.4f} repeated {max_repeats} times.", debug)
                return True  # Trigger early stopping
            if iterations_since_max >= 10:
                log_debug(f"[Debug] Early stopping triggered. No new max found in the last {iterations_since_max} iterations.", debug)
                return True  # Trigger early stopping

        log_debug(f"[Debug] Iteration {len(iteration_results)}, Success Probability: {success_prob:.4f}", debug)
        return False  # Continue optimization

    # Perform Bayesian optimization
    res = gp_minimize(
        func=objective_function,
        dimensions=search_space,
        n_calls=max_calls,
        random_state=42,
        callback=[early_stopping_callback]
    )

    # Visualization of Success Probabilities
    if debug:
        plt.figure(figsize=(10, 6))
        plt.plot(range(1, len(iteration_results) + 1), iteration_results, marker='o', linestyle='-', color='b')
        plt.title("Bayesian Optimization Progression with Early Stopping")
        plt.xlabel("Iteration")
        plt.ylabel("Success Probability")
        plt.grid(True)
        plt.show()

    return res


def save_comparison_table(comparison, save_path, debug):
    """
    Save the comparison table to a CSV file.

    Args:
        comparison (pd.DataFrame): Comparison data to save.
        save_path (str): Path to save the CSV file.
        debug (bool): Enable debug logging.
    """
    try:
        comparison.to_csv(save_path, index=False)
        log_debug(f"[Debug] Comparison table saved to {save_path}", debug)
    except Exception as e:
        # Report the failure even when debug logging is off, so a failed save is never silent
        print(f"[Error] Failed to save comparison table: {e}")


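def bayesian_optimized_metrics_main(
    features_path, dataset_path, assets_path,
    model_save_dir, comparison_save_path, model_name, y_variable,
    min_calls=20, max_calls=30, repeat_threshold=3, debug=False
):
    """
    Run the full workflow: load the selected features, preprocess the data, load the
    trained model, run Bayesian optimization with early stopping, and save a table
    comparing baseline and optimized parameters.

    Args:
        features_path (str): Pickle file holding the selected features (DataFrame or list).
        dataset_path (str): Path to the processed dataset.
        assets_path (str): Path to the saved preprocessing assets.
        model_save_dir (str): Directory the trained model is loaded from.
        comparison_save_path (str): CSV path for the comparison table.
        model_name (str): Name of the model to load; it must support predict_proba.
        y_variable (str): Target column, removed from the optimization columns.
        min_calls (int): Minimum number of calls before early stopping is allowed.
        max_calls (int): Maximum number of optimization calls.
        repeat_threshold (int): Repeats of the max value that trigger early stopping.
        debug (bool): Enable debug logging and plots.
    """
    log_debug("[Debug] Starting main workflow...", debug)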

    # Load selected feature columns from the .pkl file
    try:
        with open(features_path, 'rb') as f:
            selected_features = pickle.load(f)
        if isinstance(selected_features, pd.DataFrame):
            optimization_columns = selected_features.columns.tolist()
        elif isinstance(selected_features, list):
            optimization_columns = selected_features
        else:
            raise ValueError("Unsupported format for selected_features. Expected DataFrame or list.")
        log_debug(f"[Debug] Loaded optimization columns: {optimization_columns}", debug)
        
        # Exclude y_variable from optimization_columns if present
        if y_variable in optimization_columns:
            optimization_columns.remove(y_variable)
            log_debug(f"[Debug] Excluded '{y_variable}' from optimization columns.", debug)
    except Exception as e:
        log_debug(f"[Error] Failed to load optimization columns from {features_path}: {e}", debug)
        raise

    # Initialize data processor and preprocessor
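    dp = DataPreprocessor(
        # NOTE: the run output at the end of this listing shows that
        # DataPreprocessor.__init__() does not accept `features_path`;
        # this call has to be aligned with the constructor's actual signature.
        features_path=features_path,
        dataset_path=dataset_path,
        assets_path=assets_path,
        y_variable=y_variable,  # Pass the y_variable here
        optimization_columns=optimization_columns,
        preprocessor_train_test_split=False,
    )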

    # Preprocess data
    X_transformed, optimization_transformed_ranges, optimization_ranges = preprocess_data(dp, debug)
    clf = load_model(model_name, save_dir=model_save_dir)

    if not hasattr(clf, 'predict_proba'):
        raise AttributeError(f"The model '{model_name}' does not support 'predict_proba'.")

    # Define the search space
    search_space = define_search_space(optimization_transformed_ranges, optimization_columns, debug)

    # Define the wrapped objective function
    def wrapped_objective(params):
        return objective(params, optimization_columns, X_transformed, clf, debug)

    # Perform optimization
    res = perform_bayesian_call_optimization(
        wrapped_objective,
        search_space,
        min_calls=min_calls,
        max_calls=max_calls,
        repeat_threshold=repeat_threshold,
        debug=debug
    )

    # Post-process results
    params_df = pd.DataFrame(res.x_iters, columns=optimization_columns)
    params_df['success_prob'] = -res.func_vals

    inverse_preprocessor = InversePreprocessor(assets_path=assets_path, debug=debug)
    inverse_transformed_params = inverse_transform_params(
        inverse_preprocessor, params_df[optimization_columns], optimization_columns, debug
    )

    baseline_df = pd.DataFrame([calculate_baselines(X_transformed, optimization_columns, debug)])
    inverse_baselines = inverse_transform_params(
        inverse_preprocessor, baseline_df, optimization_columns, debug
    )

    final_results = pd.concat(
        [inverse_transformed_params.reset_index(drop=True),
         params_df[['success_prob']].reset_index(drop=True)],
        axis=1
    )
    comparison = compare_results(
        optimization_columns,
        baseline_df,
        inverse_baselines,
        final_results,
        res,
        inverse_transformed_params,
        optimization_ranges,
        clf,  # Pass the trained model
        X_transformed,  # Pass the transformed feature space
        debug,
    )

    # Save the comparison table
    save_comparison_table(comparison, comparison_save_path, debug)

    if not debug:
        print("Comparison of Starting and Optimized Parameters:")
        print(comparison)


if __name__ == "__main__":
    bayesian_optimized_metrics_main(
        features_path='../../data/model/pipeline/final_ml_df_selected_features_columns.pkl',
        dataset_path='../../data/processed/final_ml_dataset.csv',
        assets_path='../../data/model/pipeline/preprocessing_assets.pkl',
        model_save_dir='../../data/model',
        comparison_save_path='../../data/model/bayesian_optimized_metrics/bayesian_optimized_metrics_comparison_table.csv',
        model_name='XGBoost',
        y_variable='result',  # Target variable; it is removed from the loaded selected-feature columns
        min_calls=20,  # Early stopping (repeated max or no improvement) is only checked after this many calls
        max_calls=100,
        repeat_threshold=3,
        debug=True
    )


[Debug] Starting main workflow...
[Debug] Loaded optimization columns: ['player_estimated_hand_length_cm_category', 'release_ball_direction_x', 'release_ball_direction_z', 'release_ball_direction_y', 'elbow_release_angle', 'elbow_max_angle', 'wrist_release_angle', 'wrist_max_angle', 'knee_release_angle', 'knee_max_angle', 'result', 'release_ball_speed', 'release_ball_velocity_x', 'release_ball_velocity_y', 'release_ball_velocity_z', 'calculated_release_angle']
[Debug] Excluded 'result' from optimization columns.


TypeError: DataPreprocessor.__init__() got an unexpected keyword argument 'features_path'
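
The TypeError above means the `DataPreprocessor` constructor has no `features_path` parameter. One way to diagnose this kind of mismatch is to compare the keyword arguments against the constructor's signature with `inspect.signature` before making the call. The sketch below uses a hypothetical stand-in class, `StandInPreprocessor`, because the real `DataPreprocessor` signature is not shown in these notes; `split_supported_kwargs` is likewise an illustrative helper and not part of the project.

```python
import inspect


class StandInPreprocessor:
    """Hypothetical stand-in; the real DataPreprocessor signature is not shown here."""

    def __init__(self, dataset_path, y_variable, optimization_columns=None):
        self.dataset_path = dataset_path
        self.y_variable = y_variable
        self.optimization_columns = optimization_columns or []


def split_supported_kwargs(cls, kwargs):
    """Split kwargs into those cls.__init__ accepts and those it would reject."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs parameter means the constructor accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    accepted = {k: v for k, v in kwargs.items() if k in params}
    rejected = {k: v for k, v in kwargs.items() if k not in params}
    return accepted, rejected


# Illustrative values only; these are not the project's real paths.
candidate_kwargs = {
    "features_path": "features.pkl",
    "dataset_path": "dataset.csv",
    "y_variable": "result",
    "optimization_columns": ["elbow_release_angle"],
}

accepted, rejected = split_supported_kwargs(StandInPreprocessor, candidate_kwargs)
print("accepted:", sorted(accepted))
print("rejected:", sorted(rejected))

# Only the accepted keywords are passed on, so the constructor call succeeds.
dp = StandInPreprocessor(**accepted)
```

Running the same helper against the real `DataPreprocessor` class would list every keyword the call in `bayesian_optimized_metrics_main` passes that the constructor does not declare. Silently dropping rejected keywords can hide a real configuration problem, so this is better used as a diagnostic than as a permanent wrapper.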